Operator league tables and Simpson's paradox

Last updated 29 Apr 2026

Why Gauge Intelligence always renders both aggregate and corridor-stratified operator rankings, and how it detects Simpson's paradox.

This page is part of the standing methodology for the public archive (see also Methodology) and explains the statistical discipline applied to every operator ranking Gauge Intelligence publishes.

Aggregate and corridor-stratified views

Every operator league table in the public archive includes two views side by side. The first is an aggregate ranking covering all services across all corridors. The second is a corridor-stratified ranking that shows each operator’s performance broken down by the specific routes it operates. Neither view is suppressed or treated as secondary.

Aggregate rankings can reverse the corridor-level ordering when the mix of services is not comparable across operators. One freight operating company may concentrate its traffic on high-density, infrastructure-constrained corridors where average delays are structurally higher.

A second operator, running a lower volume on more reliable routes, will record better aggregate punctuality even if it performs worse than the first operator on every corridor they share. Presenting only the aggregate ranking would misrepresent each operator’s performance relative to its own network conditions.

Corridor-stratified views control for route mix. Within each corridor, every operator is measured against the same infrastructure, the same traffic density, and the same seasonal patterns — a like-for-like comparison. The aggregate figure is also published, because it answers a different legitimate question: how reliably did each operator’s traffic arrive, nationally and in total?
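The reversal can be reproduced with a few lines of arithmetic. The sketch below uses invented counts for two hypothetical operators on two hypothetical corridors; none of the figures come from the archive.

```python
# Hypothetical counts, invented for illustration: (on_time, total) per corridor.
services = {
    "Operator A": {"congested": (720, 900), "quiet": (95, 100)},
    "Operator B": {"congested": (75, 100), "quiet": (837, 900)},
}


def aggregate_rate(cells):
    """Punctuality over all corridors combined."""
    on_time = sum(o for o, _ in cells.values())
    total = sum(t for _, t in cells.values())
    return on_time / total


# Aggregate view: one figure per operator.
aggregate = {op: aggregate_rate(cells) for op, cells in services.items()}

# Corridor-stratified view: one figure per operator per corridor.
by_corridor = {
    corridor: {op: cells[corridor][0] / cells[corridor][1]
               for op, cells in services.items()}
    for corridor in ("congested", "quiet")
}

for op, value in aggregate.items():
    print(f"{op} aggregate: {value:.1%}")
for corridor, rates in by_corridor.items():
    for op, value in rates.items():
        print(f"{op} on {corridor}: {value:.1%}")
```

With these counts Operator B leads in aggregate while Operator A leads on both corridors, because most of Operator A's traffic runs on the congested corridor.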

Corridor residuals and median polish

The residual for each operator-corridor cell (how much that operator’s punctuality deviates from what the overall level, its operator effect, and its corridor effect together predict) is computed using Tukey’s median polish. Median polish decomposes the full operator × corridor table into an overall effect, per-operator row effects, per-corridor column effects, and cell residuals. Because it sweeps out medians rather than means, it is resistant to outlying cells.

It is applied before corridor-stratified rankings are computed, so that a single anomalous cell does not distort the row or column baselines. For a worked numerical example showing how raw rankings and residual rankings can diverge, see Anomaly detection methodology.
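As a rough illustration of the decomposition, the sketch below implements the textbook row-then-column median sweep in plain Python. The function name, the iteration cap, and the sample table are assumptions made for this example and are not taken from the production code.

```python
from statistics import median


def median_polish(table, max_iter=10, tol=1e-9):
    """Split table[i][j] into overall + row[i] + col[j] + residual[i][j]."""
    n_rows, n_cols = len(table), len(table[0])
    residuals = [list(map(float, row)) for row in table]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols

    for _ in range(max_iter):
        # Sweep each row's median out of the residuals into the row effect.
        row_meds = [median(r) for r in residuals]
        for i in range(n_rows):
            row_eff[i] += row_meds[i]
            residuals[i] = [x - row_meds[i] for x in residuals[i]]
        # Re-centre the column effects, moving their median into the overall effect.
        shift = median(col_eff)
        overall += shift
        col_eff = [c - shift for c in col_eff]

        # Sweep each column's median out of the residuals into the column effect.
        col_meds = [median(residuals[i][j] for i in range(n_rows))
                    for j in range(n_cols)]
        for j in range(n_cols):
            col_eff[j] += col_meds[j]
            for i in range(n_rows):
                residuals[i][j] -= col_meds[j]
        # Re-centre the row effects in the same way.
        shift = median(row_eff)
        overall += shift
        row_eff = [r - shift for r in row_eff]

        # Stop once a full sweep no longer changes anything.
        if max(abs(m) for m in row_meds + col_meds) < tol:
            break

    return overall, row_eff, col_eff, residuals


# Hypothetical punctuality table (rows: operators, columns: corridors).
# The first cell is 20 points below what an additive pattern would give.
table = [
    [60, 85, 90],
    [81, 86, 91],
    [82, 87, 92],
]
overall, row_eff, col_eff, residuals = median_polish(table)
print(overall, row_eff, col_eff)
print(residuals)
```

On this table the anomalous cell ends up with a residual of -20 and every other residual is zero, so the outlier is isolated and does not pull the row or column effects.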

Simpson’s paradox

The reversal described above is an instance of Simpson’s paradox: a trend that appears in aggregate data disappears, or reverses, when the data are disaggregated by a third variable.

The classic teaching example is the 1973 Berkeley graduate admissions data, where the university appeared to favour male applicants in aggregate, but women had higher admission rates in the majority of individual departments. The aggregate figure was misleading because women applied in disproportionate numbers to the most competitive departments.

David Spiegelhalter gives a clear treatment of the paradox in the context of public performance tables (Spiegelhalter, The Art of Statistics, Ch 4, pp 110–112). The same reversal appears routinely when hospitals, schools, or transport operators are ranked without controlling for case mix. The lesson is not that aggregate figures are wrong. They answer a different question from stratified figures, and conflating the two produces incorrect causal inference.

Detection method

Gauge Intelligence flags a Simpson’s reversal when an operator ranks higher in aggregate punctuality than a comparator but ranks lower on every corridor where both operators have a material volume of services. “Material volume” is defined as at least ten journey segments in the measurement period on the shared corridor; corridors below this threshold are excluded from the reversal test to avoid comparisons based on sparse counts.

When a reversal is detected, the operator’s aggregate ranking is annotated to indicate that the headline figure is influenced by corridor mix and that the corridor-stratified view is the appropriate basis for like-for-like comparison. No figure is suppressed or adjusted; the annotation is an interpretive flag, not a correction.
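The pairwise test can be sketched as follows. The function name and input shape are hypothetical, and two details are assumptions the text does not settle: the aggregate is taken over all corridors (including sparse ones), and a pair with no materially shared corridor is never flagged.

```python
MATERIAL_VOLUME = 10  # minimum journey segments on a shared corridor, per the text


def is_simpsons_reversal(a, b, min_volume=MATERIAL_VOLUME):
    """Return True when operator a beats b in aggregate but trails b on every
    corridor where both have material volume.

    a and b map corridor name -> (on_time, total).
    """
    def aggregate(cells):
        total = sum(t for _, t in cells.values())
        return sum(o for o, _ in cells.values()) / total

    # Only corridors where both operators meet the volume threshold are tested.
    shared = [
        c for c in a.keys() & b.keys()
        if a[c][1] >= min_volume and b[c][1] >= min_volume
    ]
    if not shared:
        return False  # assumption: nothing to compare, so no flag

    trails_everywhere = all(
        a[c][0] / a[c][1] < b[c][0] / b[c][1] for c in shared
    )
    return aggregate(a) > aggregate(b) and trails_everywhere


# Hypothetical counts; the "sparse" corridor falls below the threshold.
operator_a = {"congested": (720, 900), "quiet": (95, 100), "sparse": (1, 5)}
operator_b = {"congested": (75, 100), "quiet": (837, 900), "sparse": (5, 5)}

print(is_simpsons_reversal(operator_b, operator_a))
print(is_simpsons_reversal(operator_a, operator_b))
```

Here Operator B is flagged against Operator A: it leads in aggregate but trails on both material corridors, and the sparse corridor is ignored by the test.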

Confounding variable

The confounding factor in operator league tables is corridor mix: which routes an operator serves, at what frequency, and under what infrastructure conditions. Corridors differ in congestion, signalling age, possession intensity, and junction complexity. An operator that concentrates on the most constrained parts of the network will accumulate more delay minutes than one that does not, for reasons that are largely outside its operational control.

Corridor mix is not a measure of operator quality. It is a structural feature of the access regime. Controlling for it (by comparing operators only on corridors they share) produces rankings that are fairer to operators and more informative to readers who want to know how well each operator manages within its actual operating environment.

Partial pooling for small samples

Corridor-stratified rankings compare operators on corridors they share, but some operator-corridor cells contain very few services. An operator with fifteen services on a corridor can appear first or last based on a single anomalous week. When the sample is small, observed punctuality is a poor estimate of underlying performance because sampling noise dominates the signal.

Gauge Intelligence addresses this using a partial-pooling (multilevel) estimator. Each operator’s corridor-specific punctuality is shrunk toward the corridor mean by a weight that depends on the within-operator variance, the between-operator variance, and the service count:

shrinkage weight = σ²_within / (σ²_within + n · σ²_between)

where n is the service count for that operator-corridor cell, σ²_within is the within-operator variance, and σ²_between is the between-operator variance on the corridor. The shrunk estimate is (shrinkage weight × corridor mean) + ((1 − shrinkage weight) × observed mean). An operator with two services in the period is shrunk strongly toward the corridor mean; an operator with two hundred services is barely moved.
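A minimal sketch of the weight and the resulting estimate, assuming the weight is applied as a weighted average of the corridor mean and the observed mean (the standard multilevel form). The function names and the variance values are hypothetical and chosen only to show the effect of n.

```python
def shrinkage_weight(n, var_within, var_between):
    """Weight placed on the corridor mean; approaches 1 as n falls."""
    return var_within / (var_within + n * var_between)


def shrunk_estimate(observed_mean, corridor_mean, n, var_within, var_between):
    """Pull the observed mean toward the corridor mean by the shrinkage weight."""
    w = shrinkage_weight(n, var_within, var_between)
    return w * corridor_mean + (1 - w) * observed_mean


# Hypothetical variances, for illustration only.
VAR_WITHIN = 0.16
VAR_BETWEEN = 0.01

for n in (2, 15, 200):
    w = shrinkage_weight(n, VAR_WITHIN, VAR_BETWEEN)
    estimate = shrunk_estimate(0.50, 0.80, n, VAR_WITHIN, VAR_BETWEEN)
    print(f"n={n}: weight={w:.3f}, shrunk estimate={estimate:.3f}")
```

With these variances the weight is roughly 0.89 at n = 2 and roughly 0.07 at n = 200, so an observed 50% against a corridor mean of 80% is pulled most of the way to the corridor mean in the first case and hardly at all in the second.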

The ranked table shows posterior mean estimates rather than observed means. Operators with high estimation uncertainty (small n, large shrinkage) are identified in the table. The effect is conservative: it reduces the apparent gap between operators where that gap is mainly sampling noise, not genuine performance difference.

The live worked example is the shrunk A2F league table on the operators index, which ranks all eleven active freight operators by posterior mean with a 90% credible interval beside each.

The estimator follows Gelman & Hill (2007), Ch 12.

Where this is implemented

Two canonical classes carry the methodology described above. A reader who wants to replicate a published figure can trace it back to the code that produced it through these entry points.

Periodic::PeerComparisonAssessment (at app/models/periodic/peer_comparison_assessment.rb) is the entry point for period-report peer comparison panels. Given an operator and a date range, it returns the shrunk A2F by corridor, the corridor-mix-controlled ranking, and the underlying counts. Every “trailing on N of N shared corridors” claim in a period report resolves through this class.

OperatorLeagueTable::PartialPool (at app/models/operator_league_table/partial_pool.rb) carries the partial-pooling estimator described in the section above. It applies the shrinkage weight to the operator-corridor observation set and returns posterior-mean estimates with 90% credible intervals. The live operators-index league table (/operators/) calls this class directly.

For Wilson interval computation on the per-corridor A2F figures, see app/lib/reports/artefact/partial_pool.rb, which packages the partial-pool output for the period-report artefact section.
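For readers unfamiliar with it, the Wilson score interval for a proportion can be sketched as below. This is the textbook formula, not a copy of the file named above; the function name and the default 95% level are assumptions, since the page does not state the level used for the Wilson intervals.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion successes / n."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Hypothetical cell: 12 of 15 services on time.
low, high = wilson_interval(12, 15)
print(f"{low:.3f} to {high:.3f}")
```

Unlike the simple normal approximation, the interval stays inside the 0 to 1 range and remains usable for the small counts that occur in sparse operator-corridor cells.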

The data window applied to every query is the period boundary documented in data-window methodology. The canonical aggregation key is toc_id from the TRUST feed, not atoc_code, which collapses every freight operator under ZZ.

Sources

Spiegelhalter, D. (2019). The Art of Statistics. Pelican. Ch 4, pp 110–112.

Gelman, A. & Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. Ch 12.