Add Primary and Foreign ID fields #278

botanize · 2021-07-14T17:10:55Z

Define Primary ID and Foreign ID types
Re-type existing IDs
Add Primary ID (ID, ...) to file introduction to clarify requirements to uniquely identify rows

https://groups.google.com/g/gtfs-changes/c/YIOx6JYADMk/m/Zd0clk5lAAAJ

scmcca · 2021-07-20T17:50:34Z

@botanize Thanks for putting this together! Here are my suggestions and feedback.

Style
I wonder if it would be enough to simply indicate what the Primary ID is at the top of each file as suggested in this PR in the style of "Primary ID (field)".

Currently in the proposal, the Primary ID is indicated at the top of each file and then again as a field type. This creates confusing instances where we are defining a Primary ID for the file, and then again as a related but separate concept per field. In simple cases these match, such as for routes.txt and route_id. But ambiguities/redundancies arise. For example, a Primary ID at the file level might contain Foreign ID type values as well as non-ID type values such as in stop_times.txt (trip_id, stop_sequence). Moreover, some ID fields do not fit in the "Primary" or "Foreign" concept such as stops.zone_id.

My suggestion is to only indicate what the Primary ID is at the file level, and do away with defining "Primary" or "Foreign" at the individual field levels (i.e., keep them as they were defined originally as "ID" or "ID referencing"). "Foreign" IDs are already clear IMHO with the phrase "ID referencing".

I'm a fan of keeping the definitions of "Primary ID" and "Foreign ID" in the Field Types section of the documentation.

Tech

The definition for "Foreign ID" in the Field Types section should be edited to include ID referencing IDs in the same file (and not just in other files). An example of this would be stops.parent_station, which references stops.stop_id in the same stops.txt file.
fare_rules.txt: No Primary ID was defined for this file. My interpretation is that the collection of all fields defined in a record in fare_rules.txt must be unique.
frequencies.txt: The Primary ID should include start_time and end_time to yield Primary ID (trip_id, start_time, end_time). My understanding is that you can define successive time intervals for the same trip_id to define different frequencies throughout a trip. This would make Primary ID (trip_id) alone false.
transfers.txt: The Primary ID should be defined as Primary ID (from_stop_id, to_stop_id).
pathways.txt: The Primary ID should be defined as Primary ID (pathway_id, from_stop_id, to_stop_id). If different pathways need to be defined for the same from_stop_id and to_stop_id pair to define different pathway_modes, for example, a new pathway_id must be used. Correct me if I'm wrong!
translations.txt: I think the collection of all fields defined in a given record of translations.txt must be unique. Additionally, I think record_id should be of type "ID referencing".
feed_info.txt: Is the implication of feed_start_date and feed_end_date that only 1 record is allowed in the feed_info.txt file? If so, this should be made clear in the file description.
attributions.txt: Seems like the Primary ID here should be Primary ID (attribution_id, organization_name). In which case, attribution_id should have its presence changed to Required. This is perhaps a separate PR.

Let me know what you think and if I overlooked anything! Thanks.

botanize · 2021-07-22T15:14:37Z

@scmcca thank you for the feedback!

> I wonder if it would be enough to simply indicate what the Primary ID is at the top of each file as suggested in this PR in the style of "Primary ID (field)".
>
> Currently in the proposal, the Primary ID is indicated at the top of each file and then again as a field type. This creates confusing instances where we are defining a Primary ID for the file, and then again as a related but separate concept per field. In simple cases these match, such as for routes.txt and route_id. But ambiguities/redundancies arise. For example, a Primary ID at the file level might contain Foreign ID type values as well as non-ID type values such as in stop_times.txt (trip_id, stop_sequence). Moreover, some ID fields do not fit in the "Primary" or "Foreign" concept such as stops.zone_id.
>
> My suggestion is to only indicate what the Primary ID is at the file level, and do away with defining "Primary" or "Foreign" at the individual field levels (i.e., keep them as they were defined originally as "ID" or "ID referencing"). "Foreign" IDs are already clear IMHO with the phrase "ID referencing".

I intended the file-level and field-level Primary ID concepts to be identical. The file-level Primary ID uniquely identifies a row in the file; if only one field is required to do that, that field is a Primary ID. Seeing that trip_id is of type Primary ID tells me that it is sufficient to uniquely identify a record, which is exactly the same information provided at the file level by Primary ID (`trip_id`). When multiple fields (whether ID types or not) are required to identify a row, the file-level Primary ID lists them all, but some of them may not be IDs (frequencies.start_time), others may be plain IDs (shapes.shape_id), and some may be Foreign IDs (frequencies.trip_id). I also think it's helpful to know that shapes.shape_id isn't a Primary ID. That may be fairly obvious, but because the ID is named after the file, one could assume otherwise.

While the existing language for foreign IDs (ID referencing…) is clear when the ID can reference only one field in one file, it fails when an ID can reference a variety of IDs across files. For example, translations.record_id is a Foreign ID, but what it references isn't set in stone the way it is for most foreign IDs (like stop_times.stop_id). So I think it's clearer to mark that field as Foreign ID than as ID or ID referencing *.

Thank you for the technical notes. I was unsure of the primary keys for many of these files, which was the motivation for the issue and pull request!

> fare_rules.txt: No Primary ID was defined for this file. My interpretation is that the collection of all fields defined in a record in fare_rules.txt must be unique.

I think we could use Primary ID (`*`) to indicate that the combination of all provided fields must be unique, and I could document that convention in the Field Type definition for Primary ID. Otherwise, I can spell it out and list all of the fields as part of the primary ID.
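As a rough illustration of what a Primary ID (`*`) convention would imply for a consumer, here is a minimal Python sketch that treats every column of a row as part of the key. The function name `duplicate_rows` and the sample rows are hypothetical and only for illustration; nothing here is taken from an existing validator.

```python
import csv
import io
from collections import Counter


def duplicate_rows(csv_text):
    """Return each data row (as a tuple of all its field values) that occurs more than once."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    counts = Counter(tuple(row) for row in reader)
    return [row for row, n in counts.items() if n > 1]


# Hypothetical fare_rules.txt content: the first and third data rows are identical.
FARE_RULES = """fare_id,route_id,origin_id,destination_id,contains_id
a,1,,,
a,2,,,
a,1,,,
"""

print(duplicate_rows(FARE_RULES))
```

Under this reading, two rows conflict only if every provided field matches, so the second row above is fine even though it shares fare_id with the others.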

> frequencies.txt: The Primary ID should include start_time and end_time to yield Primary ID (trip_id, start_time, end_time). My understanding is that you can define successive time intervals for the same trip_id to define different frequencies throughout a trip. This would make Primary ID (trip_id) alone false.

That was also my understanding.

> pathways.txt: The Primary ID should be defined as Primary ID (pathway_id, from_stop_id, to_stop_id). If different pathways need to be defined for the same from_stop_id and to_stop_id pair to define different pathway_modes, for example, a new pathway_id must be used. Correct me if I'm wrong!

Makes sense to me. Do we know what any consumers of pathways.txt are doing?

> translations.txt: I think the collection of all fields defined in a given record of translations.txt must be unique. Additionally, I think record_id should be of type "ID referencing".

I label record_id as "Foreign ID", which means "ID referencing…". In this case it can reference a lot of different things, and that's explained well in the description, so I don't know what else to say here other than Foreign ID.

> feed_info.txt: Is the implication of feed_start_date and feed_end_date that only 1 record is allowed in the feed_info.txt file? If so, this should be made clear in the file description.

I agree; my understanding is that this is a one-line file. I could add something like Primary ID (none) and replace the start of the file description with "This file contains a single row of…".

> attributions.txt: Seems like the Primary ID here should be Primary ID (attribution_id, organization_name). In which case, attribution_id should have its presence changed to Required. This is perhaps a separate PR.

I don't know how consumers interpret this; OneBusAway, OpenTripPlanner and TransitClock don't seem to consume it. I used the "ID named after the file" heuristic to select the Primary ID, but now I see that it's not even a required field. I don't see anything to gain from requiring attribution_id to be unique only in combination with organization_name, so I think it should become Required and become the Primary ID.

abyrd · 2021-08-20T04:46:41Z

I definitely support clearly stating the primary keys of each table in GTFS. Overall the direction of this proposal seems good, but I share some of the same concerns as @scmcca, who says "This creates confusing instances where we are defining a Primary ID for the file, and then again as a related but separate concept per field."

This confusion arises because this PR is conflating the two distinct concepts of type and key. The primary key is a set of attributes, and every attribute has a type whether or not it is in the primary key. The fact that an attribute is a member of the primary key or not is independent of its type. It so happens that in GTFS we also have (retroactively) defined a type called ID. Primary keys often include fields of type ID, but when they are included in the primary key their type remains ID.

I like that the primary key is shown as a tuple at the top of each table definition. This will be immediately recognizable and understandable to anyone familiar with the relational model of data, including many people implementing GTFS consumer systems. I don't think it's contradictory or problematic to also flag which fields are members of the primary key or are foreign keys, but these should not be treated as types - doing so is likely to create subtle confusion. Either they should not be in the type column, or if they are placed in one of the existing columns for brevity, it should be clear they are just annotations and do not replace the values in that column.

I think the discussion of primary and foreign keys should also be moved out of the types list into a different section.

barbeau · 2021-08-24T20:25:44Z

Note that primary and foreign keys have been defined in the canonical GTFS validator that MobilityData has been working on with input from the community. See the table schemas at:
https://github.com/MobilityData/gtfs-validator/tree/master/main/src/main/java/org/mobilitydata/gtfsvalidator/table

Primary keys are fields annotated with @PrimaryKey, and foreign keys are annotated with @ForeignKey and a definition of the field they are a foreign key to. For example, for trips.txt route_id is annotated with @ForeignKey(table = "routes.txt", field = "route_id").

I'd suggest cross-checking this PR with that tool to see if there are any discrepancies.

e-lo · 2021-08-25T18:14:23Z

I believe representing the specification in a standard schema format (e.g., frictionless) is part of the MobilityData work plan for this quarter. As such, it would be great to:
a. confirm which schema format MobilityData plans to use (cc: @scmcca)
b. align any definitions of foreign and primary keys with that schema format (e.g. for frictionless schemas) either by reference or making sure they are consistent

botanize · 2021-08-26T20:55:16Z

A couple of thoughts after looking through the validator and the frictionless spec.

The validator's PrimaryKey annotation applies to only one field. There's no concept of clustered keys, which matches neither the SQL concept of primary keys nor the frictionless definition of primary keys. So I'd prefer to use the more common definition that allows the primary key to be a tuple.
A few of the clustered keys are defined using FirstKey and SequenceKey (shapes, stop_times); calendar_dates could be defined the same way, with service_id as FirstKey and date as SequenceKey. But again, I think this is a bit of a hack, and for the purposes of communicating with humans (as opposed to machines) we should stick with the primary key tuple concept.
Foreign keys that can reference a field in multiple tables aren't annotated with ForeignKey; presumably the annotation can't account for the multiple linkages. For example, trips.service_id is a foreign key into both calendar.txt and calendar_dates.txt. It looks like frictionless also doesn't support foreign keying to multiple tables. As people have commented above, this problem arises because calendar.txt and calendar_dates.txt combine to form the one true calendar table, but that virtual table isn't part of the spec. I don't think there's any problem with describing trips.service_id as a Foreign ID referencing calendar.service_id or calendar_dates.service_id. The other place this crops up is in translations.txt, where record_id and record_sub_id are used to identify a row that requires one or more keys. Since the table is defined in another field, we can't know ahead of time what the foreign key is going to reference, but both fields are definitely foreign keys, and should be labeled as such (even though neither frictionless nor the validator is capable of describing the situation accurately).
The validator describes pathways.pathway_id as a PrimaryKey, which limits the definition of a pathway to a link between exactly two stop_ids. The only clue in the spec that a pathway can describe only one link is where it's described as a graph representation, which is in my opinion a step too close to jargon for what is mostly a plain-language spec. That description should probably be updated to state that a pathway is a single link between exactly two nodes defined in stops.txt.
Frictionless, like this proposal, annotates primary keys at the dataset level, and avoids the potential confusion between the file-level primary key and a field-level primary key by labeling a field as "unique" if it uniquely identifies a row. I think that's a good solution that I can easily implement.
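
To illustrate the trips.service_id point above, here is a small Python sketch of a foreign-key check where the reference is satisfied by either of two tables. The helper name `dangling_service_ids` and the sample rows are made up for illustration, and the "union of both tables" reading is my interpretation of how calendar.txt and calendar_dates.txt combine.

```python
def dangling_service_ids(trips, calendar, calendar_dates):
    """Return service_ids used by trips that appear in neither calendar table."""
    # The referenced key set is the union of both tables' service_id values.
    known = {row["service_id"] for row in calendar}
    known |= {row["service_id"] for row in calendar_dates}
    return sorted({row["service_id"] for row in trips} - known)


# Hypothetical rows, as csv.DictReader would produce them.
trips = [
    {"trip_id": "t1", "service_id": "weekday"},
    {"trip_id": "t2", "service_id": "holiday"},
    {"trip_id": "t3", "service_id": "typo"},
]
calendar = [{"service_id": "weekday"}]
calendar_dates = [{"service_id": "holiday", "date": "20211225"}]

print(dangling_service_ids(trips, calendar, calendar_dates))
```

A single-target foreign key annotation would flag either "weekday" or "holiday" here, which is why I'd describe the field as referencing calendar.service_id or calendar_dates.service_id.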

botanize · 2021-08-26T21:01:43Z

Based on the comments in the linked issue, the key on frequencies.txt should be trip_id + start_time, not trip_id alone, correct?

scmcca · 2021-09-07T15:22:43Z

> Based on the comments in the linked issue, the key on frequencies.txt should be trip_id + start_time, not trip_id alone, correct?

I can reiterate the suggestion I made earlier to have the primary key for frequencies.txt be (trip_id, start_time, end_time). Having the primary key only at (trip_id, start_time) would theoretically allow for multiple records with the same trip_id and start_time but with confounding end_times. Maybe I'm missing something?

Also, it's a picky detail but to keep consistency should we be using common nouns for "primary key", "unique ID", and "foreign ID"? For example "Primary Key" would be "Primary key" at the top of each file and "unique ID" would be "Unique ID" in the Type column.

Otherwise LGTM.

botanize · 2021-09-07T20:16:08Z

> I can reiterate the suggestion I made earlier to have the primary key for frequencies.txt be (trip_id, start_time, end_time). Having the primary key only at (trip_id, start_time) would theoretically allow for multiple records with the same trip_id and start_time but with confounding end_times. Maybe I'm missing something?

I think it's actually the opposite. A primary key of (trip_id, start_time) means there can be only one row for any combination of trip_id and start_time. If you make the primary key (trip_id, start_time, end_time) you'd allow multiple rows with the same trip_id and start_time as long as they're disambiguated by end_time, which makes no sense to me.
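
A quick Python sketch of the difference, using made-up frequencies.txt rows (the helper `key_violations` is hypothetical, not from any existing tool): the same data is checked against both candidate keys.

```python
import csv
import io
from collections import defaultdict


def key_violations(csv_text, key_fields):
    """Group rows by the given key fields and return only the keys shared by 2+ rows."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[tuple(row[field] for field in key_fields)].append(row)
    return {key: rows for key, rows in groups.items() if len(rows) > 1}


# Hypothetical data: the third row repeats trip_id and start_time with a different end_time.
FREQUENCIES = """trip_id,start_time,end_time,headway_secs
t1,06:00:00,09:00:00,600
t1,09:00:00,15:00:00,900
t1,06:00:00,10:00:00,300
"""

narrow = key_violations(FREQUENCIES, ["trip_id", "start_time"])
wide = key_violations(FREQUENCIES, ["trip_id", "start_time", "end_time"])
print(sorted(narrow))  # keys rejected under (trip_id, start_time)
print(sorted(wide))    # keys rejected under (trip_id, start_time, end_time)
```

The narrower key rejects the conflicting third row, while the wider key accepts all three, so adding end_time to the key loosens the constraint rather than tightening it.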

scmcca · 2021-09-08T00:19:26Z

@botanize That makes sense. Would that mean that the primary key would be equally correct if it were (trip_id, end_time)?

botanize · 2021-09-08T14:56:08Z

Given that headways for the same trip must not overlap (see the description of headway_secs), either start_time or end_time could be used in combination with trip_id to make the primary key. However, it seems far more conventional to me to use the start time of the service.

Bertware · 2021-09-09T13:59:08Z

Regarding transfers.txt:

This PR would define (from_stop_id, to_stop_id) as the unique primary key, which would prevent us (Samtrafiken/Trafiklab) from extending this file with trip-to-trip transfer information as we do today. Right now we publish this data according to the Google trip-to-trip transfer extension, but this would become invalid as it would cause duplicate keys. The same functionality is included in MBTA's transfers.txt extension. How should this be handled? The issue would be the uniqueness of the from_stop_id, to_stop_id combination, which isn't defined as "must be unique" today.

@mbta @paulswartz

paulswartz · 2021-09-09T14:04:28Z

@Bertware internally, we use (from_stop_id, to_stop_id, from_trip_id, to_trip_id) as the unique key.

Bertware · 2021-09-09T14:09:55Z

We use (from_stop_id, to_stop_id, from_trip_id, to_trip_id) as unique key as well, where from_trip_id and to_trip_id can be empty for normal (default) stop-to-stop transfers. I'm not sure if there are producers using the from_route_id and to_route_id fields. Should these fields (from_trip_id, to_trip_id, and possibly from_route_id, to_route_id if there are producers and consumers) be proposed in a separate issue and PR, and possibly included in the official spec, so they can be used as unique key in this PR?

skinkie · 2021-09-09T14:46:04Z

> We use (from_stop_id, to_stop_id, from_trip_id, to_trip_id) as unique key as well, where from_trip_id and to_trip_id can be empty for normal (default) stop-to-stop transfers. I'm not sure if there are producers using the from_route_id and to_route_id fields. Should these fields (from_trip_id, to_trip_id, and possibly from_route_id, to_route_id if there are producers and consumers) be proposed in a separate issue and PR, and possibly included in the official spec, so they can be used as unique key in this PR?

I think this PR should not address it, but the trip-to-trip extension should.

Bertware · 2021-09-09T16:08:42Z

Just checked the repository: there is already an open PR for the trip-to-trip extension (#32) that I was unaware of. Whichever gets merged second should take the other into account, but if the trip-to-trip transfer proposal can be merged first we would prevent backwards compatibility issues (since this PR would declare (from_stop_id, to_stop_id) to be unique, while this would be undone if #32 gets merged after this one).

antrim · 2021-09-15T00:48:52Z

Related: I closed PR #32. I think it makes sense to open a separate, cleaner PR covering transfer rules for routes and trips, and in-seat transfers.

skinkie · 2021-09-15T11:21:20Z

@antrim what was the reason not to resolve that in #32?

botanize · 2021-09-16T18:10:30Z

I think the bigger issue is, what are you supposed to do if you develop a GTFS extension that violates a primary key described in this PR?

If the extension becomes part of the spec the primary key would be updated, but while it's a proprietary extension there would be a conflict between the extended feed and the spec.

We could just add something to the new Dataset Attributes subheading that says unofficial extensions may change these relationships by adding new fields to the end of the table's primary key. That would prevent someone from extending trips.txt by adding service_id to the primary key, since service_id already exists in the spec. It would also allow adding transfers.from_trip_id and transfers.to_trip_id to the end of the primary key resulting in (from_stop_id, to_stop_id, from_trip_id, to_trip_id), which is backwards compatible with (from_stop_id, to_stop_id) in the sense that null values for from_trip_id and to_trip_id would mean that each from_stop_id, to_stop_id pair would need to be unique.
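
Here is a minimal Python sketch of that backwards-compatibility argument, assuming missing or empty extension fields are treated as the empty string. The helper names `key_of` and `has_duplicates` and the sample rows are hypothetical.

```python
BASE_KEY = ("from_stop_id", "to_stop_id")
# The extension appends its fields to the end of the spec's key.
EXTENDED_KEY = BASE_KEY + ("from_trip_id", "to_trip_id")


def key_of(row, key_fields):
    """Build the key tuple for a row, treating absent fields as empty."""
    return tuple(row.get(field, "") for field in key_fields)


def has_duplicates(rows, key_fields):
    """Return True if any two rows share the same key tuple."""
    keys = [key_of(row, key_fields) for row in rows]
    return len(keys) != len(set(keys))


# A feed without the extension: two rows for the same stop pair.
plain = [
    {"from_stop_id": "A", "to_stop_id": "B"},
    {"from_stop_id": "A", "to_stop_id": "B"},
]
# A feed with the extension: a default transfer plus a trip-specific one.
extended = [
    {"from_stop_id": "A", "to_stop_id": "B"},
    {"from_stop_id": "A", "to_stop_id": "B", "from_trip_id": "t1", "to_trip_id": "t2"},
]

print(has_duplicates(plain, EXTENDED_KEY))     # True: empty trip fields leave the pair duplicated
print(has_duplicates(extended, EXTENDED_KEY))  # False: the trip fields disambiguate
print(has_duplicates(extended, BASE_KEY))      # True: the spec key alone would reject this feed
```

With the trip fields empty, the extended key collapses to the base key, so feeds that don't use the extension are held to the same uniqueness rule as before.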

scmcca · 2021-09-16T18:22:08Z

@botanize Do you have examples of other extensions where this problem presents itself? If so it would be useful to note them here.

Otherwise, it seems that there is demand for from_trip_id and to_trip_id as an official part of the spec without causing problems here (if merged before this PR).

botanize · 2021-09-16T18:28:20Z

I don't have any examples, but it seems bound to come up again at some point.

How long do we sit on this PR to try to get #32 or a substitute PR approved?

scmcca · 2021-09-16T19:44:34Z

> but it seems bound to come up again at some point.

Agreed that this is a foreseeable issue. I can open a substitute PR tomorrow, at which point we'll have to let 7 discussion days pass followed by 7 days for voting. So we should be able to resume with #278 by early October.

scmcca · 2021-10-05T13:47:40Z

@botanize #284 for trip-to-trip and route-to-route transfers has been merged. Feel free to move forward with defining the IDs for transfers.txt as discussed above.

botanize · 2021-10-05T14:34:17Z

I ordered the fields in the transfers.txt key to keep the existing keys for producers of the extension compatible: (from_stop_id, to_stop_id, from_trip_id, to_trip_id, from_route_id, to_route_id)

@Bertware @paulswartz does this meet your needs?

Bertware · 2021-10-05T14:39:03Z

@botanize looks good to me!

Bertware · 2021-10-07T10:28:26Z

@botanize if there is no more discussion, can we call for a vote on this proposal?

botanize · 2021-10-07T15:33:43Z

It looks like we're ready for a vote.

The vote is for adding a primary key attribute to each table's description and labeling "unique" and "foreign" IDs where applicable.

Voting ends on 2021-10-14T23:59:59Z.

I look forward to any final feedback and wrapping this up!

skinkie · 2021-10-07T16:38:58Z

+1 OpenGeo

Bertware · 2021-10-08T07:49:20Z

+1 Samtrafiken/Trafiklab

botanize · 2021-10-13T14:58:14Z

@barbeau @scmcca @e-lo @abyrd can I encourage you to vote or comment on this PR?

flocsy · 2021-10-14T08:34:07Z

fare_rules.txt is a problem IMHO, although mostly not because of changes in this PR. Primary key (*) would allow me to have N lines that define (fare_id, route_id, origin_id, destination_id, {contains_id_1, ...., contains_id_N}) (this is what we want to allow), but it would also allow me to add another line: (fare_id, route_id, origin_id, destination_id, null), which makes no sense and I think shouldn't be allowed. So maybe contains_id should be Conditionally Required, with an explanation that if all the other fields are equal between any two rows then contains_id is required.

Side note: the bigger problem, however, is that the current GTFS doesn't allow me to do this:
(fare_id, route_id, origin_id, destination_id, {contains_id_1, contains_id_2, contains_id_3})
(fare_id, route_id, origin_id, destination_id, {contains_id_1, contains_id_2})
Maybe we should start a separate discussion on how to solve this problem.

botanize · 2021-10-14T12:17:02Z

I'm extending voting until 2021-10-21T23:59:59Z so that I can address comments made between now (when I go on vacation) and when voting closes.

scmcca · 2021-10-14T13:31:56Z

@flocsy

> Maybe we should start a separate discussion on how to solve this problem.

Indeed! Can you please start a separate issue so we can elaborate the problem(s) and solutions?

paulswartz · 2021-10-14T19:30:16Z

+1

botanize · 2021-10-22T12:36:09Z

The vote ended on 2021-10-21 at 23:59:59 UTC with 3 votes in favor and 0 opposed. As per the Specification Amendment Process, this vote passes!

Thanks everyone!

Add Primary and Foreign ID fields

97d7fb8

google-cla bot added the cla: yes label Jul 14, 2021

botanize added 2 commits July 22, 2021 10:06

Improve ID field definitions

cd01dc0

Fix incorrect and add missing Primary IDs

9eb8bbf

skinkie reviewed Jul 22, 2021

gtfs/spec/en/reference.md

skinkie reviewed Jul 22, 2021

gtfs/spec/en/reference.md (outdated)

scmcca linked an issue Aug 19, 2021 that may be closed by this pull request

Static spec doesn't differentiate between primary and foreign IDs #266

Closed

botanize added 2 commits August 27, 2021 08:28

Split key from ID

1ddc2e9

Merge branch 'master' into id_types

227a359

scmcca added the GTFS Schedule label Aug 30, 2021

Fix capitalization

11deac6

antrim mentioned this pull request Sep 15, 2021

transfer rules for routes and trips, and in-seat transfers #32

Closed

scmcca mentioned this pull request Sep 17, 2021

Trip-to-trip and route-to-route transfers #284

Merged

botanize added 2 commits October 5, 2021 09:28

Merge branch 'master' of github.com:google/transit into id_types

91855e1

Update key and ids for transfers.txt

190c999

scmcca merged commit b3ea73f into google:master Oct 22, 2021

This was referenced Oct 27, 2021

implement primary keys from spec public-transport/gtfs-via-postgres#15

Open

Too strict service_id foreign key requirement public-transport/gtfs-via-postgres#16

Closed

Bertware mentioned this pull request Nov 20, 2021

Unique constraint failed when trying to import the latest data from BART jarondl/pygtfs#61

Closed

isabelle-dr mentioned this pull request Dec 20, 2021

Review the Q4 modifications to the GTFS specification MobilityData/gtfs-validator#1079

Closed

This was referenced Mar 15, 2022

Update validator after the addition of primary and foreign keys in the specification MobilityData/gtfs-validator#1113

Closed

Update Validator after addition of trip-to-trip and route-to-route transfers MobilityData/gtfs-validator#1114

Closed

e-lo mentioned this pull request Apr 13, 2022

Maintain persistent trip_id across data iterations as a best practice MobilityData/GTFS_Schedule_Best-Practices#49

Closed

molisani mentioned this pull request Oct 10, 2022

Update to pygtfs 0.1.7 home-assistant/core#79975

Merged

isabelle-dr mentioned this pull request Oct 11, 2022

Add a validation rule to check for correct trip to route pair after the addition of trip-to-trip and route-to-route transfers MobilityData/gtfs-validator#1268

Closed
