Delay attribution: compositional analysis of cause-code distributions

Last updated 30 Apr 2026

Why DAPR delay cause codes are compositional data, how Gauge Intelligence applies log-ratio transforms before comparing attribution distributions across periods, and what this prevents.

Under DAPR (the Delay Attribution Principles and Rules), each delay event is assigned to a responsible party and cause category. Gauge Intelligence publishes these as proportional distributions: “42% infrastructure, 31% operator, 17% external, 10% unattributed.”

Proportional distributions are compositional data: the components sum to one. Standard statistical methods applied to raw percentages produce artefacts because the constraint creates spurious negative correlations between categories.

The core problem is structural. If infrastructure share rises from 42% to 51%, at least one other category must fall. Whether “operator” fell because operator-attributed delay genuinely shrank, or only because infrastructure delay grew around it, cannot be read from the raw percentages; it takes a transformation that works on the ratios between categories.
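A minimal sketch of that structural effect, using hypothetical delay minutes (the figures and the shares helper are illustrative only, not Gauge Intelligence data or code). Only infrastructure minutes change between the two periods, yet every other share falls, while the ratio between two untouched categories stays the same:

```ruby
# Hypothetical delay minutes by responsible party for two periods.
# Only the infrastructure minutes differ; the other three are identical.
period_a = { infrastructure: 4200.0, operator: 3100.0, external: 1700.0, unattributed: 1000.0 }
period_b = period_a.merge(infrastructure: 6000.0)

# Closure: divide each part by the total so the shares sum to one.
def shares(minutes)
  total = minutes.values.sum
  minutes.transform_values { |m| m / total }
end

shares_a = shares(period_a)
shares_b = shares(period_b)

# Every non-infrastructure share falls although none of those minutes moved.
shares_a.each_key do |party|
  puts format("%-15s %5.1f%% -> %5.1f%%", party, shares_a[party] * 100, shares_b[party] * 100)
end

# The ratio between two untouched categories survives the closure.
ratio_a = shares_a[:operator] / shares_a[:external]
ratio_b = shares_b[:operator] / shares_b[:external]
puts format("operator/external ratio: %.4f -> %.4f", ratio_a, ratio_b)
```

The ratio is the quantity that stays informative under the sum-to-one constraint, which is why the transforms described below are built from log-ratios.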

The sum-to-one constraint

A composition with D parts lives in the (D−1)-dimensional simplex, not in unconstrained Euclidean space. Raw-percentage arithmetic ignores that geometry: “infrastructure rose 9 points” describes movement along one coordinate only, and many different compositional changes are consistent with it.

A 9-point infrastructure rise can reflect different operational realities depending on which other categories fell. A rise driven by “operator” falling 9 points is a different claim from one driven by “external” falling 6 points and “unattributed” falling 3 points. The raw-percentage view conflates them.

Under ordinary addition and scalar multiplication the simplex is not a vector space, so adding or subtracting percentage points is not a valid operation on compositions (Aitchison, The Statistical Analysis of Compositional Data (1986), Ch 4).

Centred log-ratio transform

Gauge Intelligence applies the centred log-ratio (clr) transform before period-on-period comparison:

clr(x)_i = log(x_i) − (1/D) Σ_j log(x_j)

The clr transform maps the composition into Euclidean space, where standard distances and comparisons are valid. The transformed values always sum to zero, so they carry D−1 free dimensions rather than D. After transformation, each category is expressed as the log-ratio of its share to the geometric mean of all shares.
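The formula can be sketched in a few lines of Ruby. This is an illustrative standalone function applied to the example distribution quoted earlier, not the body of DelayAttribution::Composition#clr, whose internals are not shown here:

```ruby
# Centred log-ratio: the log of each share minus the mean of the logs,
# which equals the log of each share relative to the geometric mean.
def clr(shares)
  logs = shares.map { |x| Math.log(x) }
  mean_log = logs.sum / logs.size
  logs.map { |l| l - mean_log }
end

# The example distribution from the text: infrastructure, operator,
# external, unattributed.
example = [0.42, 0.31, 0.17, 0.10]
clr_example = clr(example)

puts clr_example.map { |v| v.round(3) }.inspect
puts format("sum of clr values: %.12f", clr_example.sum)
```

Categories above the geometric mean come out positive and those below it negative. Because only ratios enter the calculation, passing percentages (42, 31, 17, 10) instead of proportions gives the same result.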

Zero shares cannot be log-transformed directly. They require a small-constant replacement before transformation; Gauge Intelligence uses the Bayesian-multiplicative replacement of Martín-Fernández et al. (2003). This preserves the ratios between non-zero parts while admitting the zero-share cases.
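To illustrate the idea of zero replacement, the sketch below uses the plain multiplicative replacement: zeros become a small constant and the non-zero parts are scaled down together so the total stays at one. This is a simpler stand-in for the Bayesian-multiplicative variant named above, and the delta value is a hypothetical choice for the example only:

```ruby
# Plain multiplicative replacement for a composition that sums to one.
# delta is a hypothetical small constant supplied by the caller.
def multiplicative_replacement(shares, delta:)
  zero_count = shares.count(&:zero?)
  # Non-zero parts share a common scale factor, so their ratios are kept.
  scale = 1.0 - zero_count * delta
  shares.map { |x| x.zero? ? delta : x * scale }
end

with_zero = [0.55, 0.30, 0.15, 0.0]
replaced = multiplicative_replacement(with_zero, delta: 0.005)

puts replaced.inspect
puts format("ratio before: %.4f, after: %.4f",
            with_zero[0] / with_zero[1], replaced[0] / replaced[1])
```

Because every non-zero part is multiplied by the same factor, the ratios between them are unchanged, and the replaced composition can then be log-transformed.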

Sources: Aitchison (1986), Ch 5–6; Martín-Fernández, Barceló-Vidal & Pawlowsky-Glahn (2003).

Aitchison distance

The distance between two compositions is the Euclidean distance between their clr vectors — the Aitchison distance. Period-on-period comparisons use Aitchison distance rather than a sum of absolute percentage differences.

A large Aitchison distance between periods indicates a genuine compositional shift: the relative balance of causes has moved. A small distance indicates the distribution of causes is stable, even if individual percentages have moved by several points (Aitchison (1986), Ch 4).
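As a worked illustration, the sketch below computes the Aitchison distance for the two scenarios described earlier, in which infrastructure rises from 42% to 51% either because “operator” fell 9 points or because “external” fell 6 and “unattributed” fell 3. The function names are illustrative and are not taken from the production class:

```ruby
# Centred log-ratio transform of a vector of strictly positive shares.
def clr(shares)
  logs = shares.map { |x| Math.log(x) }
  mean_log = logs.sum / logs.size
  logs.map { |l| l - mean_log }
end

# Aitchison distance: Euclidean distance between the two clr vectors.
def aitchison_distance(x, y)
  Math.sqrt(clr(x).zip(clr(y)).sum { |a, b| (a - b)**2 })
end

# Order: infrastructure, operator, external, unattributed.
base          = [0.42, 0.31, 0.17, 0.10]
operator_fell = [0.51, 0.22, 0.17, 0.10] # operator down 9 points
others_fell   = [0.51, 0.31, 0.11, 0.07] # external down 6, unattributed down 3

d_operator = aitchison_distance(base, operator_fell)
d_others   = aitchison_distance(base, others_fell)

puts format("operator fell: %.3f", d_operator)
puts format("others fell:   %.3f", d_others)
```

Both scenarios produce the same 9-point headline, but the distances differ. The second scenario is the larger compositional shift, because the two smaller categories lose a greater proportion of their own share.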

This matters for editorial framing. A headline “infrastructure share rose 9 points” is interpretable only if the Aitchison distance confirms the shift is compositionally material. Otherwise it is an artefact of the constraint.

Attribution basis: the affected-train view

The responsible-party shares above describe delay as it is coded on each delayed train. A train held by a signal failure carries an infrastructure code; a train held behind it carries a reactionary code. Each row sits in the published distribution under the category its own cause code names.

The Office of Rail and Road builds its national responsibility split differently. Under the Delay Attribution Principles and Rules, it attributes reactionary delay to the principal incident that caused it, not to the train that suffered it. Minutes that cascade from a Network Rail incident onto a freight service return to Network Rail; minutes that cascade from one operator onto another return to the originating operator. The split is constructed incident-first, across the full reactionary chain.

The two bases answer different questions. The affected-train view reports the cause profile of the delay an operator experiences. The incident-level view reports who originated it. They diverge wherever reactionary delay crosses a responsibility boundary.
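The difference between the two bases can be sketched with a toy data set. Everything here is hypothetical: the DelayRow structure, its field names and the minutes are invented for illustration, and the prime_cause_party field stands in for incident-link information that, as noted below, the licensed extract does not carry:

```ruby
# Hypothetical delay rows. coded_party is the category the row's own cause
# code falls under; prime_cause_party is who originated the incident.
DelayRow = Struct.new(:minutes, :coded_party, :prime_cause_party, keyword_init: true)

rows = [
  # Train held directly by a signal failure.
  DelayRow.new(minutes: 30, coded_party: :network_rail, prime_cause_party: :network_rail),
  # Train held behind it: reactionary delay recorded against the operator.
  DelayRow.new(minutes: 20, coded_party: :operator, prime_cause_party: :network_rail),
  DelayRow.new(minutes: 25, coded_party: :operator, prime_cause_party: :operator),
  DelayRow.new(minutes: 25, coded_party: :external, prime_cause_party: :external)
]

# Share of total delay minutes, grouped by the chosen attribution field.
def shares_by(rows, field)
  total = rows.sum(&:minutes).to_f
  rows.group_by { |row| row[field] }
      .transform_values { |group| group.sum(&:minutes) / total }
end

affected_train_view = shares_by(rows, :coded_party)
incident_level_view = shares_by(rows, :prime_cause_party)

puts "affected-train: #{affected_train_view}"
puts "incident-level: #{incident_level_view}"
```

In this toy case the 20 reactionary minutes sit with the operator in the affected-train view and return to Network Rail in the incident-level view, so the two bases report different splits from the same rows.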

The divergence runs one way and is measurable. Reconstructed against the Office of Rail and Road’s published 2024-25 freight statistics, the affected-train view places roughly 35% of freight delay minutes on Network Rail where the incident-level method places roughly 44% — a gap of about nine percentage points, with the balance sitting in the operator categories. The affected-train view understates infrastructure responsibility and overstates operator self-responsibility by that margin.

Gauge Intelligence publishes the affected-train view because it is the defensible limit of the licensed source. The Historic Delay Attribution “Transparency” extract records each delayed train with its own cause code and a single responsible-train pointer; it does not preserve the inter-incident links needed to return delay to its prime cause. Reconstructing the incident-level split would require the full TRUST incident graph, which the extract flattens. The published shares are therefore the cause profile of delay suffered, not an independent reconstruction of the national responsibility split.

What is published

Raw DAPR shares are published as percentages — the intuitive form for most readers. The proportional view remains in the public archive and is not transformed before display.

Period-on-period editorialising uses the clr-based framing: “the distribution of delay causes shifted materially between adjacent periods” rather than “infrastructure share rose 9 points.” Where the Aitchison distance is small, the period-on-period sentence reports stability rather than movement.

Period 13 2025-26 (1–28 March 2026) is the inaugural published period. Period-on-period clr distances cannot be computed until Period 1 2026-27 closes; the Period 13 GEML report says so plainly rather than manufacturing a comparator.

Within a single period, the same discipline applies to delay-reason composition. The Delay-reason composition (clr-transform) section appears on a corridor’s report only once Historic Delay Attribution (HDA) cause-code coverage allows raw shares to be re-expressed as Aitchison-valid compositional data. In the interim, raw-share comparison is ruled out editorially on the grounds set out in Aitchison (1986); the visible omission is itself the disclosure.

Compositional distance statistics, including pairwise Aitchison distances and clr-transformed time series, are available in licensed analytical content.

Version history

Version 1.0 — April 2026. Initial publication. Centred log-ratio transform with Bayesian-multiplicative zero replacement; Aitchison distance for period-on-period comparison. Applies to all DAPR attribution distributions in published reports from Period 1 2026–27 onward.

Where this is implemented

DelayReasonBreakdown (at app/models/delay_reason_breakdown.rb) is the entry point for raw-minute delay-attribution aggregations. Given an operator and date range it returns delay minutes by DAPR category and by responsible party (#by_category, #by_responsible_party). Every gross-minute attribution figure in a published report resolves through this class.

DelayAttribution::Composition (at app/models/delay_attribution/composition.rb) carries the compositional-data transform. Given a DelayReasonBreakdown share-of-total vector it applies the centred log-ratio (#clr) and computes the Aitchison distance (#aitchison_distance) between periods. Every “share shifted from X to Y” claim in a period report resolves through this class rather than through raw-percentage subtraction, the unsafe operation the sections above warn against.

Sources

Aitchison, J. (1986). The Statistical Analysis of Compositional Data. Chapman & Hall.

Martín-Fernández, J. A., Barceló-Vidal, C. & Pawlowsky-Glahn, V. (2003). “Dealing with Zeros and Missing Values in Compositional Data Sets Using Nonparametric Imputation.” Mathematical Geology, 35(3), 253–278.