Skip to content

Journey planner that estimates trip success probabilities based on historical data using Spark and Hadoop

Notifications You must be signed in to change notification settings

giacomoorsi/Swiss-Spark-Journey-Planner

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Robust Journey Planning on Swiss Railway Data using Spark and Hadoop

In this repository it is contained a journey planner on Swiss railway data, that is able to estimate the probability of success of a trip, analyzing terabytes of historical data on demand, using Spark and Hadoop.

You can find a video presentation here.


Problem Motivation

Imagine you are a regular user of the public transport system, and you are checking the operator's schedule to meet your friends for a class reunion. The choices are:

  1. You could leave in 10mins, and arrive with enough time to spare for gossips before the reunion starts.

  2. You could leave now on a different route and arrive just in time for the reunion.

Undoubtedly, if this is the only information available, most of us will opt for option 1.

If we now tell you that option 1 carries a fifty percent chance of missing a connection and be late for the reunion. Whereas, option 2 is almost guaranteed to take you there on time. Would you still consider option 1?

Probably not. However, most public transport applications will insist on the first option. This is because they are programmed to plan routes that offer the shortest travel times, without considering the risk factors.


Problem Description

Given a desired arrival time, the route planner will compute the fastest route between departure and arrival stops within a provided confidence tolerance expressed as interquartiles. For instance, "what route from A to B is the fastest at least Q% of the time if I want to arrive at B before instant T". Note that confidence is a measure of a route being feasible within the travel time computed by the algorithm.

The output of the algorithm is a list of routes between A and B and their confidence levels. The routes must be sorted from latest (fastest) to earliest (longest) departure time at A, they must all arrive at B before T with a confidence level greater than or equal to Q. Ideally, it should be possible to visualize the routes on a map with straight lines connecting all the stops traversed by the route.

In order to answer this question this project:

  • Models the public transport infrastructure for a route planning algorithm using the data provided by SBB.
  • Builds a predictive model using the historical arrival/departure time data, and optionally other sources of data.
  • Implements a robust route planning algorithm using this predictive model.
  • Test and validates the results.
  • Implement a simple Jupyter-based visualization to demonstrate the method, using Jupyter dashboard with ipywidgets.

We have implemented the following simplifying assumptions:

  • We only consider journeys at reasonable hours of the day, and on a typical business day, and assuming the schedule of May 13-17, 2019.
  • We allow short (total max 500m "As the Crows Flies") walking distances for transfers between two stops, and assume a walking speed of 50m/1min on a straight line, regardless of obstacles, human-built or natural, such as building, highways, rivers, or lakes.
  • We only consider journeys that start and end on known station coordinates (train station, bus stops, etc.), never from a random location. However, walking from the departure stop to a nearby stop is allowed.
  • We only consider departure and arrival stops in a 15km radius of Zürich's train station, Zürich HB (8503000), (lat, lon) = (47.378177, 8.540192).
  • We only consider stops in the 15km radius that are reachable from Zürich HB. If needed stops may be reached via transfers through other stops outside the 15km area.
  • Delays and travels on public transport network are uncorrelated with one another.
  • Once a route is computed, a traveller is expected to follow the planned routes to the end, or until it fails (i.e. miss a connection).

Dataset Description

For this project we will use the data published on the Open Data Platform Mobility Switzerland.

We will use the SBB data limited around the Zurich area, focusing only on stops within 15km of the Zurich main train station.

Actual data

The 2018 to 2021 data is available as a Hive table in partitioned ORC format on our university HDFS system, under /data/sbb/orc/istdaten.

The relevant column descriptions below. The full description of the data is available in the opentransportdata.swiss data istdaten cookbooks.

  • BETRIEBSTAG: date of the trip
  • FAHRT_BEZEICHNER: identifies the trip
  • BETREIBER_ABK, BETREIBER_NAME: operator (name will contain the full name, e.g. Schweizerische Bundesbahnen for SBB)
  • PRODUCT_ID: type of transport, e.g. train, bus
  • LINIEN_ID: for trains, this is the train number
  • LINIEN_TEXT,VERKEHRSMITTEL_TEXT: for trains, the service type (IC, IR, RE, etc.)
  • ZUSATZFAHRT_TF: boolean, true if this is an additional trip (not part of the regular schedule)
  • FAELLT_AUS_TF: boolean, true if this trip failed (cancelled or not completed)
  • HALTESTELLEN_NAME: name of the stop
  • ANKUNFTSZEIT: arrival time at the stop according to schedule
  • AN_PROGNOSE: actual arrival time (see AN_PROGNOSE_STATUS)
  • AN_PROGNOSE_STATUS: method used to measure AN_PROGNOSE, the time of arrival.
  • ABFAHRTSZEIT: departure time at the stop according to schedule
  • AB_PROGNOSE: actual departure time (see AN_PROGNOSE_STATUS)
  • AB_PROGNOSE_STATUS: method used to measure AB_PROGNOSE, the time of departure.
  • DURCHFAHRT_TF: boolean, true if the transport does not stop there

Each line of the file represents a stop and contains arrival and departure times. When the stop is the start or end of a journey, the corresponding columns will be empty (ANKUNFTSZEIT/ABFAHRTSZEIT). In some cases, the actual times were not measured so the AN_PROGNOSE_STATUS/AB_PROGNOSE_STATUS will be empty or set to PROGNOSE and AN_PROGNOSE/AB_PROGNOSE will be empty.

Timetable data

We have copied the timetable to HDFS.

Only GTFS format has been copied on HDFS, the full description of which is available in the opentransportdata.swiss data timetable cookbooks. The more courageous who want to give a try at the Hafas Raw Data Format (HRDF) format must contact us.

We provide a summary description of the files below. The most relevant files are marked by (+):

  • stops.txt(+):

    • STOP_ID: unique identifier (PK) of the stop
    • STOP_NAME: long name of the stop
    • STOP_LAT: stop latitude (WGS84)
    • STOP_LON: stop longitude
    • LOCATION_TYPE:
    • PARENT_STATION: if the stop is one of many collocated at a same location, such as platforms at a train station
  • stop_times.txt(+):

    • TRIP_ID: identifier (FK) of the trip, unique for the day - e.g. 1.TA.1-100-j19-1.1.H
    • ARRIVAL_TIME: scheduled (local) time of arrival at the stop (same as DEPARTURE_TIME if this is the start of the journey)
    • DEPARTURE_TIME: scheduled (local) time of departure at the stop
    • STOP_ID: stop (station) identifier (FK), from stops.txt
    • STOP_SEQUENCE: sequence number of the stop on this trip id, starting at 1.
    • PICKUP_TYPE:
    • DROP_OFF_TYPE:
  • trips.txt:

    • ROUTE_ID: identifier (FK) for the route. A route is a sequence of stops. It is time independent.
    • SERVICE_ID: identifier (FK) of a group of trips in the calendar, and for managing exceptions (e.g. holidays, etc).
    • TRIP_ID: is one instance (PK) of a vehicle journey on a given route - the same route can have many trips at regular intervals; a trip may skip some of the route stops.
    • TRIP_HEADSIGN: displayed to passengers, most of the time this is the (short) name of the last stop.
    • TRIP_SHORT_NAME: internal identifier for the trip_headsign (note TRIP_HEADSIGN and TRIP_SHORT_NAME are only unique for an agency)
    • DIRECTION_ID: if the route is bidirectional, this field indicates the direction of the trip on the route.
  • calendar.txt:

    • SERVICE_ID: identifier (PK) of a group of trips sharing a same calendar and calendar exception pattern.
    • MONDAY..SUNDAY: 0 or 1 for each day of the week, indicating occurence of the service on that day.
    • START_DATE: start date when weekly service id pattern is valid
    • END_DATE: end date after which weekly service id pattern is no longer valid
  • routes.txt:

    • ROUTE_ID: identifier for the route (PK)
    • AGENCY_ID: identifier of the operator (FK)
    • ROUTE_SHORT_NAME: the short name of the route, usually a line number
    • ROUTE_LONG_NAME: (empty)
    • ROUTE_DESC: Bus, Zub, Tram, etc.
    • ROUTE_TYPE:

Note: PK=Primary Key (unique), FK=Foreign Key (refers to a Primary Key in another table)

The other files are:

  • calendar-dates.txt contains exceptions to the weekly patterns expressed in calendar.txt.
  • agency.txt has the details of the operators
  • transfers.txt contains the transfer times between stops or platforms.

Figure 1. better illustrates the above concepts relating stops, routes, trips and stop times on a real example (route 11-3-A-j19-1, direction 0)

journeys

Figure 1. Relation between stops, routes, trips and stop times. The vertical axis represents the stops along the route in the direction of travel. The horizontal axis represents the time of day on a non-linear scale. Solid lines connecting the stops correspond to trips. A trip is one instances of a vehicle journey on the route. Trips on same route do not need to mark all the stops on the route, resulting in trips having different stop lists for the same route.

Stations data

We have a consolidated liste of stop locations in ORC format under /data/sbb/orc/allstops. The schema of this table is the same as for the stops.txt format described earlier.

It has the schema:

  • STATIONID: identifier of the station/stop
  • LONGITUDE: longitude (WGS84)
  • LATITUDE: latitude (WGS84)
  • HEIGHT: altitude (meters) of the stop
  • REMARK: long name of the stop

About

Journey planner that estimates trip success probabilities based on historical data using Spark and Hadoop

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published