v1.0 industry review edition. Coverage, methodology and entity pages open for correction through March 2027.

Data ingestion — how Network Rail feeds reach the archive

Last updated 28 May 2026

How the national Network Rail feeds — live TRUST movements, the CIF timetable, BPLAN track geography, and Historic Delay Attribution — are ingested, corrected, and reconciled before any published figure is computed.

Every figure in the public archive begins as a message on a Network Rail feed. Between that message and a published punctuality percentage sits an ingestion pipeline that connects to the feeds, parses and corrects the raw data, reconstructs each freight journey from a stream of fragmentary events, and joins the result to the schedule and geography it should be measured against. This page documents that pipeline in the order data travels through it, so that a reader can trace any published number back to its source feed and judge it against the corrections applied along the way.

It complements the data window reference, which sets the boundary between what the feeds can observe and what is invisible to them. This page covers ingest architecture; the data-window page covers observability limits. Read together, they describe both how the data arrives and what it can and cannot tell you once it has.

The NROD live feed and the STOMP consumer

The primary source is Network Rail’s Open Data (NROD) platform, which broadcasts the national TRUST train-movement feed over the STOMP messaging protocol at roughly fifty events per second. Gauge Intelligence ingests the full national feed — every freight activation, movement, and cancellation on the network — not a single route or a sampled subset.

The consumer that holds this connection runs as its own operating-system process, separate from the web server and from the background-job workers. This separation is deliberate. The STOMP connection must be held open continuously, exchange heartbeats with the upstream broker, decompress gzipped frames, and acknowledge each message individually. Running that work inside the web server would couple the availability of the public site to the health of the feed and risk starving request threads during traffic spikes. As a standalone process the consumer can crash and restart without dropping a single web request, and its connection management is isolated from the request-and-response cycle.

The consumer itself is intentionally thin. Its only responsibilities are to connect, receive, decompress, parse, and hand each message to the job queue. No domain logic — no schedule lookup, no delay calculation, no journey reconstruction — runs in the consumer. All of that happens downstream in idempotent background jobs, so that a message received twice produces the same result as a message received once.
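The thin-consumer idea can be sketched in a few lines. This is a minimal illustration rather than the production consumer: the frame layout (a gzipped JSON array of messages), the `handle_frame` name, the in-memory queue, and the content-hash key are all assumptions made for the example.

```python
import gzip
import hashlib
import json
import queue


def handle_frame(body: bytes, jobs: queue.Queue) -> int:
    """Decompress, parse, and enqueue one frame; return how many messages were queued."""
    # Gzip streams begin with the magic bytes 0x1f 0x8b; anything else passes through.
    raw = gzip.decompress(body) if body[:2] == b"\x1f\x8b" else body
    messages = json.loads(raw)
    if isinstance(messages, dict):  # tolerate a frame holding a single message
        messages = [messages]
    for msg in messages:
        # A hash of the canonical JSON gives downstream jobs a stable key, so a
        # message delivered twice can be recognised and processed only once.
        canonical = json.dumps(msg, sort_keys=True).encode("utf-8")
        jobs.put({"key": hashlib.sha256(canonical).hexdigest(), "payload": msg})
    return len(messages)


jobs = queue.Queue()
# Hypothetical payload; real TRUST messages carry many more fields.
frame = gzip.compress(
    json.dumps([{"train_id": "516M24MQ05", "event": "movement"}]).encode("utf-8")
)
handle_frame(frame, jobs)  # first delivery
handle_frame(frame, jobs)  # redelivery of the same frame
first, second = jobs.get(), jobs.get()
print(first["key"] == second["key"])  # True
```

Nothing in the handler looks up a schedule or computes a delay. The identical keys on the two deliveries are what would let a downstream job stay idempotent.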

At present a single consumer holds the connection. A dual-consumer design is planned: two independent processes on two separate NROD accounts, both receiving the full national feed, with one dispatching to the queue and the other monitoring it. That redundancy guards against the feed’s most dangerous failure mode — described under Where this breaks below — but it is not yet in production, and the single-consumer gap is a known limitation of the current pipeline.

TRUST message processing

The TRUST feed is a stream of discrete events, not a record of complete journeys. The ingestion pipeline reconstructs each freight service from three core message classes:

  • Activation. A schedule is signed in to TRUST and becomes a live working on the network, with an assigned TRUST identifier. The activation links a real-world journey to its planned schedule.
  • Movement. A train is reported at a TIPLOC or STANOX location, with planned, public-timetable, and actual timestamps. Delay is the difference between the actual and the planned timestamp at each measurement point.
  • Cancellation. A live activation is cancelled, with a reason code and a location.
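The delay definition in the movement bullet reduces to one subtraction. The sketch below assumes timestamps held as epoch milliseconds and uses a hypothetical `delay_minutes` helper; a missing timestamp yields no delay figure rather than a guessed one.

```python
from typing import Optional


def delay_minutes(planned_ms: Optional[int], actual_ms: Optional[int]) -> Optional[float]:
    """Delay at one measurement point: actual minus planned, in minutes.

    Positive values are late running, negative values are early running.
    """
    if planned_ms is None or actual_ms is None:
        return None  # no figure can be computed without both timestamps
    return (actual_ms - planned_ms) / 60_000


planned = 1_780_000_000_000          # illustrative epoch milliseconds
late = planned + 7 * 60_000          # seven minutes after the planned time
print(delay_minutes(planned, late))  # 7.0
```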

A TRUST identifier carries structure of its own — it encodes the STANOX area, the headcode, the train’s speed class, a call code, and the day of the month — which the pipeline uses to validate and disambiguate messages as they arrive.
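As an illustration of that structure, the sketch below splits an identifier into the five parts named above. The character positions (two, four, one, one, two) follow the conventional ten-character layout and are an assumption here, as are the `TrustId` and `parse_trust_id` names and the sample identifier.

```python
import re
from typing import NamedTuple


class TrustId(NamedTuple):
    stanox_area: str   # first two digits of the origin STANOX
    headcode: str      # four-character train reporting number
    speed_class: str
    call_code: str
    day_of_month: int


# Assumed layout: AA HHHH S C DD
_TRUST_ID = re.compile(r"^(\d{2})([0-9A-Z]{4})([0-9A-Z])([0-9A-Z])(\d{2})$")


def parse_trust_id(train_id: str) -> TrustId:
    """Split a ten-character TRUST identifier, rejecting malformed values."""
    match = _TRUST_ID.match(train_id)
    if match is None:
        raise ValueError(f"not a well-formed TRUST identifier: {train_id!r}")
    area, headcode, speed, call, day = match.groups()
    if not 1 <= int(day) <= 31:
        raise ValueError(f"day of month out of range in {train_id!r}")
    return TrustId(area, headcode, speed, call, int(day))


print(parse_trust_id("516M24MQ05").headcode)  # 6M24
```

A parser of this kind gives a cheap validity check on arrival: an identifier that fails to parse, or whose day of month disagrees with the message date, can be set aside for inspection.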

Reconstructing a complete journey from these events requires two kinds of identity work, both of which are permanent features of the TRUST data model rather than transitional problems.

The first is change-of-identity chain tracking. A single freight train can run under several TRUST identifiers in succession: re-signalling at an intermediate point, a split or join, or operational re-platforming can each hand the journey from one identifier to the next. The pipeline follows the full chain and treats the underlying working as one continuous service from origin to destination, rather than as a series of disconnected activations.
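One way to picture chain tracking is as a map from each new identifier back to the one it replaced. The sketch below is a simplified illustration under that assumption; the `IdentityChains` class and the sample identifiers are hypothetical, and a real implementation would also need to handle splits and joins, which a single-predecessor map cannot represent.

```python
from typing import Dict, List


class IdentityChains:
    """Tracks which TRUST identifier each later identifier replaced."""

    def __init__(self) -> None:
        self._previous: Dict[str, str] = {}  # new identifier -> identifier it replaced

    def chain(self, train_id: str) -> List[str]:
        """Identifiers from train_id back to the first one in its chain."""
        ids = [train_id]
        while ids[-1] in self._previous:
            ids.append(self._previous[ids[-1]])
        return ids

    def record_change(self, old_id: str, new_id: str) -> None:
        # Refuse a change that would make the chain loop back on itself.
        if new_id in self.chain(old_id):
            raise ValueError("identity change would create a loop")
        self._previous[new_id] = old_id

    def origin_of(self, train_id: str) -> str:
        return self.chain(train_id)[-1]


chains = IdentityChains()
chains.record_change("516M24MQ05", "516M24MR05")
chains.record_change("516M24MR05", "516M24MS05")
print(chains.origin_of("516M24MS05"))  # 516M24MQ05
```

Keying every downstream record on `origin_of(...)` rather than on the identifier in the message is what would let three activations be reported as one service.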

The second is change-of-origin handling and deduced activations. A movement message sometimes arrives for a train that has no matching activation on record — typically because the activation was lost in transit, sent outside the connection window, or dropped during a feed outage. Rather than discard a journey that physically ran, the pipeline infers the missing activation from the movement payload and flags it as deduced, so that licensed users can isolate or exclude these synthesised records. Change-of-origin messages, which revise the starting location of a live working, are applied to the existing journey rather than spawning a new one.
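The fallback can be sketched as follows. This is an illustration only: the field names (`train_id`, `loc_stanox`), the `Activation` record, and the choice to seed the origin from the first location seen are assumptions, not a description of what the pipeline actually infers from the payload.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Activation:
    train_id: str
    origin_stanox: str
    deduced: bool = False  # True when synthesised from a movement


def activation_for(movement: dict, activations: Dict[str, Activation]) -> Activation:
    """Return the activation for a movement, deducing one if none is on record."""
    train_id = movement["train_id"]
    activation = activations.get(train_id)
    if activation is None:
        # Assumption: the first reported location stands in for the origin.
        activation = Activation(train_id, movement["loc_stanox"], deduced=True)
        activations[train_id] = activation
    return activation


def apply_change_of_origin(message: dict, activations: Dict[str, Activation]) -> None:
    """Revise the origin of the existing journey instead of creating a new one."""
    activation = activations.get(message["train_id"])
    if activation is not None:
        activation.origin_stanox = message["loc_stanox"]


activations: Dict[str, Activation] = {}
first = activation_for({"train_id": "516M24MQ05", "loc_stanox": "51234"}, activations)
apply_change_of_origin({"train_id": "516M24MQ05", "loc_stanox": "51300"}, activations)
print(first.deduced, first.origin_stanox, len(activations))  # True 51300 1
```

Because the `deduced` flag travels with the record, a consumer of the data can filter synthesised journeys in or out with a single condition.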

A note on headcodes, since it is commonly assumed that they still need de-obfuscating: Network Rail removed headcode obfuscation from the freight feed in March 2023, so de-obfuscation is no longer a technical challenge. The only standing identity work is the chain tracking and deduced-activation handling described above, both of which are implemented.

The CIF timetable

A movement timestamp on its own says only when a train passed a point. To turn that into a measure of performance, the pipeline needs to know when the train was supposed to pass — and that planned timing comes from the Common Interface File, Network Rail’s schedule feed.

The CIF describes what the railway intends to run: the booked timing points for every working, against which actual performance is measured. Each journey is matched to its schedule through the composite key of train UID, schedule start date, and schedule type; matching on the UID alone would conflate schedules that overlap in time and produce incorrect delay figures. The CIF is refreshed periodically, whenever Network Rail publishes a revised schedule, and the scheduled timings it carries are the baseline that every punctuality and delay figure is computed against.
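A small example shows why the full composite key matters. The schedule store below is a plain dictionary with made-up UIDs, dates, and labels; only the shape of the key (UID, start date, schedule type) comes from the text.

```python
from datetime import date
from typing import Dict, List, Optional, Tuple

ScheduleKey = Tuple[str, date, str]  # (train UID, schedule start date, schedule type)

schedules: Dict[ScheduleKey, str] = {
    ("C12345", date(2026, 5, 17), "P"): "permanent timings",
    ("C12345", date(2026, 5, 25), "O"): "overlay timings",
}


def find_schedule(uid: str, start: date, schedule_type: str) -> Optional[str]:
    """Exact lookup on the full composite key."""
    return schedules.get((uid, start, schedule_type))


def find_by_uid_only(uid: str) -> List[str]:
    """The tempting shortcut: it cannot tell overlapping schedules apart."""
    return [timings for (key_uid, _, _), timings in schedules.items() if key_uid == uid]


print(find_schedule("C12345", date(2026, 5, 25), "O"))  # overlay timings
print(len(find_by_uid_only("C12345")))                  # 2
```

The UID-only lookup returns two candidates with no principled way to choose between them, which is the conflation the composite key avoids.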

BPLAN track geography

Where the CIF says what should run and TRUST says what did, BPLAN says where. It is Network Rail’s track-topology dataset, mapping STANOX location codes to physical routes, signalling areas, and engineer’s line references. The pipeline uses BPLAN to assign each journey segment to an infrastructure corridor, which is what makes corridor-level performance reporting possible at all.

BPLAN reloading is, in practice, a manual operation. Automated delivery through the Rail Data Marketplace is unreliable, so the dataset is downloaded by hand. The geography file is deliberately kept out of the deployed application, so a reload means fetching the current file, importing it locally, and pushing it to the production server explicitly. The dataset ships as three files covering consecutive timetable periods; the correct one is whichever covers the current date, identified from the date range in each file’s header. Because BPLAN is reloaded by hand on a periodic cadence rather than streamed, staleness is a real risk, and the pipeline monitors the age of the last successful import against a freshness threshold.
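The file selection and the freshness check can both be expressed compactly. The sketch below assumes the date range has already been read from each file's header (the header format is not reproduced here), and the file names, dates, and 45-day threshold are invented for illustration.

```python
from datetime import date, datetime, timedelta, timezone
from typing import List, Tuple

BplanFile = Tuple[str, date, date]  # (file name, first valid day, last valid day)


def pick_current(files: List[BplanFile], today: date) -> str:
    """Return the one file whose validity range covers today."""
    matches = [name for name, start, end in files if start <= today <= end]
    if len(matches) != 1:
        raise LookupError(f"expected exactly one file covering {today}, found {len(matches)}")
    return matches[0]


def is_stale(last_import: datetime, now: datetime, max_age: timedelta) -> bool:
    """True when the last successful import is older than the freshness threshold."""
    return now - last_import > max_age


files = [
    ("bplan_a.txt", date(2025, 12, 14), date(2026, 5, 16)),
    ("bplan_b.txt", date(2026, 5, 17), date(2026, 12, 12)),
    ("bplan_c.txt", date(2026, 12, 13), date(2027, 5, 15)),
]
print(pick_current(files, date(2026, 5, 28)))  # bplan_b.txt

last_import = datetime(2026, 3, 1, tzinfo=timezone.utc)
now = datetime(2026, 5, 28, tzinfo=timezone.utc)
print(is_stale(last_import, now, timedelta(days=45)))  # True
```

Raising when zero or several files match is deliberate in this sketch: a manual process benefits from failing loudly rather than importing a plausible but wrong period.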

Historic Delay Attribution

The TRUST feed records that a train was late; it does not record whose fault the lateness was. That attribution comes from Network Rail’s Historic Delay Attribution dataset, which assigns each delay event to a responsible party and a cause category under the industry’s delay-attribution framework.

HDA feeds two things: the delay-reason breakdowns published in the archive, and the direction of Schedule 8 estimates, that is, whether a given delay was caused by Network Rail or by the operator. (The Schedule 8 regime is asymmetric: per minute of delay, operators pay Network Rail roughly twice what Network Rail pays operators, so getting the direction right is material.) Like BPLAN, HDA arrives only by manual download or push delivery, is subject to the same Rail Data Marketplace unreliability, and follows a periodic rather than continuous cadence. The most recent periods are therefore the last to receive full attribution, which is why recent delay attribution should be read as provisional until a later HDA release fills it in.

Corrections applied at ingest

Two corrections are applied to TRUST data at ingest, before any performance calculation runs. Both fix known properties of the raw feed; without them, published figures would be subtly but consistently wrong, and the error would be invisible in the figures themselves.

British Summer Time timestamp correction. During British Summer Time, several TRUST timestamp fields are issued one hour ahead of the correct clock time. The pipeline subtracts exactly one hour — 3,600,000 milliseconds — from each affected field at ingest. The fields corrected are the public-timetable, planned, actual, cancellation, departure, original-departure, and original-location timestamps. Every published timestamp is therefore clock-time correct regardless of the time of year.

Schedule-type O/P swap correction. TRUST activations carry a schedule type whose overlay (O) and permanent (P) values are swapped relative to the schedule database. The pipeline inverts the field before the schedule lookup, so that overlays match overlays and permanents match permanents. This matters because more than one schedule can be in force for the same train on the same day. When that happens, the pipeline resolves to the one that actually ran using the Short Term Plan priority order — a cancellation beats an overlay, which beats a new short-term plan, which beats a permanent schedule. Without the swap correction, the lookup would silently return the wrong record for any day that carried both an overlay and a permanent schedule, and the resulting delay would be measured against a baseline that was never in force.
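The swap and the priority resolution can be illustrated together. In this sketch the O and P codes come from the text, while the C and N codes for cancellation and new short-term plan are assumed; the priority order is copied from the paragraph above, and the function names are hypothetical.

```python
from typing import List

# TRUST reports O where the schedule database holds P, and vice versa.
_SWAP = {"O": "P", "P": "O"}

# Priority order as stated in the paragraph above; earlier entries win.
# C = cancellation, O = overlay, N = new short-term plan, P = permanent.
PRIORITY = ("C", "O", "N", "P")


def to_schedule_db_type(trust_type: str) -> str:
    """Invert O and P; leave every other value untouched."""
    return _SWAP.get(trust_type, trust_type)


def resolve(candidate_types: List[str]) -> str:
    """Pick the schedule type that takes precedence among those in force."""
    return min(candidate_types, key=PRIORITY.index)


print(to_schedule_db_type("O"))  # P
print(resolve(["P", "O"]))       # O
```

Keeping the swap in one small function, applied before every lookup, means the inversion happens in exactly one place and cannot be forgotten by an individual query.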

Where this breaks

No ingestion pipeline is infallible, and a reader auditing a figure should know the conditions under which the data feeding it is suspect.

  • Silent feed cessation. The most dangerous NROD failure mode is not a visible disconnect but a silent one: the TCP connection stays open and heartbeats continue while messages quietly stop arriving. With a single consumer there is no independent reference to compare against, so a silent cessation can open a gap in the record before it is detected. This is the precise risk the planned dual-consumer design exists to close; until it ships, sustained quiet periods on the feed should be treated as suspect rather than as a genuine lull in traffic.
  • Lost activations. When an activation message never arrives, the journey would be invisible were it not reconstructed from its movements. The deduced-activation fallback recovers the journey, but a deduced activation rests on a single, unambiguous inference rather than a primary record. A sustained rise in the share of deduced activations is a signal that the feed is dropping messages, and figures from such a window deserve extra scrutiny.
  • Stale CIF. If the timetable feed lags a genuine schedule change, actual movements are compared against a plan that was superseded. The result is phantom delay — or phantom punctuality — that reflects a stale baseline rather than the railway’s real performance. Delay figures are only as current as the schedule they are measured against.
  • Stale BPLAN. Because track geography is reloaded by hand, a missed or late reload leaves journey segments mapped against an outdated topology. Segments can then be assigned to the wrong corridor, which corrupts corridor-level aggregation and, downstream, corridor Schedule 8 attribution. The freshness monitor exists precisely because this failure is silent in the output.
  • HDA lag. Delay attribution for the most recent periods is incomplete until the next Historic Delay Attribution release lands. Until then, a recent period’s cause breakdown and the direction of its Schedule 8 estimates are provisional, and any conclusion that depends on attribution for a just-closed period should be read as preliminary.
  • The daylight-saving boundary. The clock-change weekends in spring and autumn are the riskiest moments for timestamp correction. A message straddling the transition, or a correction applied to the wrong side of the boundary, would shift a timestamp by a full hour. The correction is applied automatically and consistently, but the transition windows are where any timestamp anomaly would surface first, and are watched accordingly.
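Two of the failure modes above lend themselves to simple monitors: silent cessation and a rising share of deduced activations. The sketch below is illustrative only; the `FeedWatchdog` class, the 30-second silence limit, and the `deduced_share` helper are assumptions, and a single-process watchdog is not a substitute for the independent second consumer described earlier.

```python
from typing import Iterable


class FeedWatchdog:
    """Flags a feed that is still heartbeating but has stopped delivering messages."""

    def __init__(self, max_silence_s: float, started_at: float) -> None:
        self.max_silence_s = max_silence_s
        self.last_message_at = started_at

    def on_message(self, now: float) -> None:
        self.last_message_at = now

    def on_heartbeat(self, now: float) -> None:
        # Deliberately ignored: a heartbeat proves the connection is up,
        # not that data is flowing.
        pass

    def is_silent(self, now: float) -> bool:
        return now - self.last_message_at > self.max_silence_s


def deduced_share(deduced_flags: Iterable[bool]) -> float:
    """Fraction of activations in a window that were deduced rather than received."""
    flags = list(deduced_flags)
    return sum(flags) / len(flags) if flags else 0.0


watchdog = FeedWatchdog(max_silence_s=30.0, started_at=0.0)
watchdog.on_message(10.0)
watchdog.on_heartbeat(50.0)          # the connection still looks healthy
print(watchdog.is_silent(50.0))      # True
print(deduced_share([True, False, False, False]))  # 0.25
```

The key design point is that heartbeats never reset the timer; only payload messages do, which is what distinguishes a silent cessation from a healthy connection.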