Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Primary and Foreign ID fields #278

Merged
merged 8 commits into from
Oct 22, 2021
Merged

Add Primary and Foreign ID fields #278

merged 8 commits into from
Oct 22, 2021

Conversation

botanize
Copy link
Contributor

@botanize botanize commented Jul 14, 2021

Re #266

  • Define Primary ID and Foreign ID types
  • Re-type existing IDs
  • Add Primary ID (ID, ...) to file introduction to clarify requirements to uniquely identify rows

https://groups.google.com/g/gtfs-changes/c/YIOx6JYADMk/m/Zd0clk5lAAAJ

@google-cla google-cla bot added the cla: yes label Jul 14, 2021
@scmcca
Copy link
Contributor

scmcca commented Jul 20, 2021

@botanize Thanks for putting this together! Here are my suggestions and feedback.

Style
I wonder if it would be enough to simply indicate what the Primary ID is at the top of each file as suggested in this PR in the style of "Primary ID (field)".

Currently in the proposal, the Primary ID is indicated at the top of each file and then again as a field type. This creates confusing instances where we are defining a Primary ID for the file, and then again as a related but separate concept per field. In simple cases these match, such as for routes.txt and route_id. But ambiguities/redundancies arise. For example, a Primary ID at the file level might contain Foreign ID type values as well as non-ID type values such as in stop_times.txt (trip_id, stop_sequence). Moreover, some ID fields do not fit in the "Primary" or "Foreign" concept such as stops.zone_id.

My suggestion is to only indicate what the Primary ID is at the file level, and do away with defining "Primary" or "Foreign" at the individual field levels (i.e., keep them as they were defined originally as "ID" or "ID referencing"). "Foreign" IDs are already clear IMHO with the phrase "ID referencing".

I'm a fan of keeping the definitions of "Primary ID" and "Foreign ID" in the Field Types section of the documentation.

Tech

  • The definition for "Foreign ID" in the Field Types section should be edited to include ID referencing IDs in the same file (and not just in other files). An example of this would be stops.parent_station, which references stops.stop_id in the same stops.txt file.
  • fare_rules.txt: No Primary ID was defined for this file. My interpretation is that the collection of all fields defined in a record in fare_rules.txt must be unique.
  • frequencies.txt: The Primary ID should include start_time and end_time to yield Primary ID (trip_id, start_time, end_time). My understanding is that you can define successive time intervals for the same trip_id to define different frequencies throughout a trip. This would make Primary ID (trip_id) alone false.
  • transfers.txt: The Primary ID should be defined as Primary ID (from_stop_id, to_stop_id).
  • pathways.txt: The Primary ID should be defined as Primary ID ('pathway_id, 'from_stop_id, to_stop_id). If different pathways need to be defined for the same from_stop_id and to_stop_id pair to define different pathway_modes, for example, a new pathway_id must be used. Correct me if I'm wrong!
  • translations.txt: I think the collection of all fields defined in a given record of translations.txt must be unique. Additionally, I think record_id should be of type "ID referencing".
  • feed_info.txt: Is the implication of feed_start_date and feed_end_date that only 1 record is allowed in the feed_info.txt file? If so, this should be made clear in the file description.
  • attributions.txt: Seems like the Primary ID here should be Primary ID (attribution_id, organization_name). In which case, attribution_id should have its presence changed to Required. This is perhaps a separate PR.

Let me know what you think and if I overlooked anything! Thanks.

@botanize
Copy link
Contributor Author

@scmcca thank you for the feedback!

I wonder if it would be enough to simply indicate what the Primary ID is at the top of each file as suggested in this PR in the style of "Primary ID (field)".

Currently in the proposal, the Primary ID is indicated at the top of each file and then again as a field type. This creates confusing instances where we are defining a Primary ID for the file, and then again as a related but separate concept per field. In simple cases these match, such as for routes.txt and route_id. But ambiguities/redundancies arise. For example, a Primary ID at the file level might contain Foreign ID type values as well as non-ID type values such as in stop_times.txt (trip_id, stop_sequence). Moreover, some ID fields do not fit in the "Primary" or "Foreign" concept such as stops.zone_id.

My suggestion is to only indicate what the Primary ID is at the file level, and do away with defining "Primary" or "Foreign" at the individual field levels (i.e., keep them as they were defined originally as "ID" or "ID referencing"). "Foreign" IDs are already clear IMHO with the phrase "ID referencing".

I intended the file-level and field-level Primary ID concepts to be identical. The file-level Primary ID uniquely identifies a row in the file, if there's only one field required to do that, that field is a Primary ID. Seeing that trip_id is of type Primary ID tells me that it is sufficient to uniquely identify a record, which is exactly the same information provided at the file-level, Primary ID (`trip_id`). When multiple fields (whether ID types or not) are required to identify a row, the file level Primary ID lists them all out, but it's possible that some of them are not IDs (frequencies.start_time) and others are plain IDs (shapes.shape_id), and some may be Foreign IDs (frequencies.trip_id). And I think it's helpful to know that shapes.shape_id isn't a Primary ID, despite it being pretty obvious, but the ID is named after the file, so one could assume otherwise.

While the existing language for foreign IDs (ID referencing…) is clear when the ID can reference only one field in one file, it fails when there's an ID that can reference a variety of IDs across files. For example, translations.record_id is a Foreign ID, but what it references isn't set in stone the way most of the foreign IDs are (like stop_times.stop_id). So I think it's more clear to mark that field as Foreign ID than ID or ID referencing *.

Thank you for the technical notes, I was unsure of the primary keys for many of these files, which was the motivation for the issue and pull request!

  • fare_rules.txt: No Primary ID was defined for this file. My interpretation is that the collection of all fields defined in a record in fare_rules.txt must be unique.

I think we could use Primary ID (`*`) to indicate that the combination of all provided fields must be unique, I could document that convention in the Field Type definition for Primary ID. Otherwise I can spell it out and list all of the fields as part of the primary ID.

  • frequencies.txt: The Primary ID should include start_time and end_time to yield Primary ID (trip_id, start_time, end_time). My understanding is that you can define successive time intervals for the same trip_id to define different frequencies throughout a trip. This would make Primary ID (trip_id) alone false.

That was also my understanding.

  • pathways.txt: The Primary ID should be defined as Primary ID ('pathway_id, 'from_stop_id, to_stop_id). If different pathways need to be defined for the same from_stop_id and to_stop_id pair to define different pathway_modes, for example, a new pathway_id must be used. Correct me if I'm wrong!

Makes sense to me, do we know what any consumers of pathways.txt are doing?

  • translations.txt: I think the collection of all fields defined in a given record of translations.txt must be unique. Additionally, I think record_id should be of type "ID referencing".

I label record_id as "Foreign ID", which means "ID referencing…". In this case it can reference a lot of different things, and that's explained well in the description, so I don't know what else to say here other than Foreign ID.

  • feed_info.txt: Is the implication of feed_start_date and feed_end_date that only 1 record is allowed in the feed_info.txt file? If so, this should be made clear in the file description.

I agree, my understanding is that this is a one-line file. I could add something like Primary ID (none) and replace the start of the file description with "This file contains a single row of…".

  • attributions.txt: Seems like the Primary ID here should be Primary ID (attribution_id, organization_name). In which case, attribution_id should have its presence changed to Required. This is perhaps a separate PR.

I don't know how consumers interpret this, OneBusAway, OpenTripPlanner and TransitClock don't seem to consume it. I used the "ID named after the file" heuristic to select the Primary ID, but now I see that it's not even a required field. I don't see anything to gain from requiring attribution_id to be unique only in combination with organization_name, I think it should become Required and become the Primary ID.

gtfs/spec/en/reference.md Outdated Show resolved Hide resolved
@scmcca scmcca linked an issue Aug 19, 2021 that may be closed by this pull request
@abyrd
Copy link

abyrd commented Aug 20, 2021

I definitely support clearly stating the primary keys of each table in GTFS. Overall the direction of this proposal seems good, but I share some of the same concerns as @scmcca, who says "This creates confusing instances where we are defining a Primary ID for the file, and then again as a related but separate concept per field."

This confusion arises because this PR is conflating the two distinct concepts of type and key. The primary key is a set of attributes, and every attribute has a type whether or not it is in the primary key. The fact that an attribute is a member of the primary key or not is independent of its type. It so happens that in GTFS we also have (retroactively) defined a type called ID. Primary keys often include fields of type ID, but when they are included in the primary key their type remains ID.

I like that the primary key is shown as a tuple at the top of each table definition. This will be immediately recognizable and understandable to anyone familiar with the relational model of data, including many people implementing GTFS consumer systems. I don't think it's contradictory or problematic to also flag which fields are members of the primary key or are foreign keys, but these should not be treated as types - doing so is likely to create subtle confusion. Either they should not be in the type column, or if they are placed in one of the existing columns for brevity, it should be clear they are just annotations and do not replace the values in that column.

I think the discussion of primary and foreign keys should also be moved out of the types list into a different section.

@barbeau
Copy link
Collaborator

barbeau commented Aug 24, 2021

Note that primary and foreign keys have been defined in the canonical GTFS validator that MobilityData has been working on with input from the community. See the table schemas at:
https://github.com/MobilityData/gtfs-validator/tree/master/main/src/main/java/org/mobilitydata/gtfsvalidator/table

Primary keys are fields annotated with @PrimaryKey, and foreign keys are annotated with @ForeignKey and a definition of the field they are a foreign key to. For example, for trips.txt route_id is annotated with @ForeignKey(table = "routes.txt", field = "route_id").

I'd suggest cross-checking this PR with that tool to see if there are any discrepancies.

@e-lo
Copy link

e-lo commented Aug 25, 2021

I believe representing the specification in a standard schema format (i.e. frictionless, etc) is part of the MobilityData work plan for this quarter. As such, it would be great to:
a. confirm which schema format MobilityData plans to use (cc: @scmcca)
b. align any definitions of foreign and primary keys with that schema format (e.g. for frictionless schemas) either by reference or making sure they are consistent

@botanize
Copy link
Contributor Author

A couple of thoughts after looking through the validator and the frictionless spec.

  1. The validator's PrimaryKey annotation applies to only one field. There's no concept of clustered keys which doesn't match the SQL concept of primary keys or the frictionless definition of primary keys. So I'd prefer to use the more common definition that allows for the primary key to be a tuple.
  2. A few of the clustered keys are defined using FirstKey, SequenceKey (shapes, stop_times), calendar_dates could be defined the same way with service_id as FirstKey and date as SequenceKey. But again, I think this is a bit of a hack, and for the purposes of communicating with humans (as opposed to machines) we should stick with the primary key tuple concept.
  3. Foreign keys that can reference a field in multiple tables aren't annotated with ForeignKey, presumably the annotation can't account for the multiple linkages. For example, trips.service_id is a foreign key into both calendar.txt and calendar_dates.txt. It looks like frictionless also doesn't support foreign keying to multiple tables. As people have commented above, this problem arises because calendar.txt and calendar_dates.txt combine to form the one true calendar table, but that virtual table isn't part of the spec. I don't think there's any problem with describing trips.service_id as a Foreign ID referencing calendar.service_id or calendar_dates.service_id. The other place this crops up is in translations.txt, where record_id and record_sub_id are used to identify a row that requires one or more keys. Since the table is defined in another field, we can't know ahead of time what the foreign key is going to reference, but both fields are definitely foreign keys, and should be labeled as such (even though neither frictionless or the validator are capable of describing the situation accurately).
  4. The validator describes pathways.pathway_id as a PrimaryKey, which limits the definition of a pathway to be a link between exactly two stop_ids, the only clue in the spec that a pathway can describe only one link is where it's described as a graph representation, which is in my opinion a step too close to jargon for what is mostly a plain language spec. That description should probably be updated to state that a pathway is a single link between exactly two nodes defined in stops.txt.
  5. Frictionless, like this proposal, annotates primary keys at the dataset level, and solves the potential for confusion between the file level primary key and a field primary key by labeling the fields as "unique" if it uniquely identifies a row. I think that's a good solution that I can easily implement.

@botanize
Copy link
Contributor Author

Based on the comments in the linked issue, the key on frequencies.txt should be trip_id + start_time, not trip_id alone, correct?

@scmcca scmcca added the GTFS Schedule Issues and Pull Requests that focus on GTFS Schedule label Aug 30, 2021
@scmcca
Copy link
Contributor

scmcca commented Sep 7, 2021

Based on the comments in the linked issue, the key on frequencies.txt should be trip_id + start_time, not trip_id alone, correct?

I can reiterate the suggestion I made earlier to have the primary key for frequencies.txt be (trip_id, start_time, end_time). Having the primary key only at (trip_id, start_time) would theoretically allow for multiple records with the same trip_id and start_time but with confounding end_times. Maybe I'm missing something?

Also, it's a picky detail but to keep consistency should we be using common nouns for "primary key", "unique ID", and "foreign ID"? For example "Primary Key" would be "Primary key" at the top of each file and "unique ID" would be "Unique ID" in the Type column.

Otherwise LGTM.

@botanize
Copy link
Contributor Author

botanize commented Sep 7, 2021

I can reiterate the suggestion I made earlier to have the primary key for frequencies.txt be (trip_id, start_time, end_time). Having the primary key only at (trip_id, start_time) would theoretically allow for multiple records with the same trip_id and start_time but with confounding end_times. Maybe I'm missing something?

I think it's actually the opposite. A primary key of (trip_id, start_time) means there can be only one row for any combination of trip_id and start_time. If you make the primary key (trip_id, start_time, end_time) you'd allow multiple rows with the same trip_id and start_time as long as they're disambiguated by end_time, which makes no sense to me.

@scmcca
Copy link
Contributor

scmcca commented Sep 8, 2021

@botanize That makes sense. Would that mean that the primary key would be equally correct if it were (trip_id, end_time)?

@botanize
Copy link
Contributor Author

botanize commented Sep 8, 2021

Given that headways for the same trip must not overlap (see the description of headway_secs), either start_time or end_time could be used in combination with trip_id to make the primary key. However, it seems far more conventional to me to use the start time of the service.

@Bertware
Copy link

Bertware commented Sep 9, 2021

Regarding transfers.txt:

This PR would define (from_stop_id, to_stop_id) as the unique primary key, which would prevent us (Samtrafiken/Trafiklab) from extending this file with trip-to-trip transfer information as we do today. Right now we publish this data according to the Google trip-to-trip transfer extension, but this would become invalid as it would cause duplicate keys. The same functionality is included in MBTA's transfers.txt extension. How should this be handled? The issue would be the uniqueness of the from_stop_id, to_stop_id combination, which isn't defined as "must be unique" today.

@mbta @paulswartz

@paulswartz
Copy link
Contributor

@Bertware internally, we use (from_stop_id, to_stop_id, from_trip_id, to_trip_id) as the unique key.

@Bertware
Copy link

Bertware commented Sep 9, 2021

We use (from_stop_id, to_stop_id, from_trip_id, to_trip_id) as unique key as well, where from_trip_id and to_trip_id can be empty for normal (default) stop-to-stop transfers. I'm not sure if there are producers using the from_route_id and to_route_id fields. Should these fields (from_trip_id, to_trip_id, and possibly from_route_id , to_route_id if there are producers and consumers) be proposed in a separate issue and PR, and possibly included in the official spec, so they can be used as unique key in this PR?

@skinkie
Copy link
Contributor

skinkie commented Sep 9, 2021

We use (from_stop_id, to_stop_id, from_trip_id, to_trip_id) as unique key as well, where from_trip_id and to_trip_id can be empty for normal (default) stop-to-stop transfers. I'm not sure if there are producers using the from_route_id and to_route_id fields. Should these fields (from_trip_id, to_trip_id, and possibly from_route_id , to_route_id if there are producers and consumers) be proposed in a separate issue and PR, and possibly included in the official spec, so they can be used as unique key in this PR?

I think this PR should not address it, but the trip-to-trip extension should.

@Bertware
Copy link

Bertware commented Sep 9, 2021

Just checked the repository and there already is an open PR for the trip-to-trip extension #32 I was unaware of. Whichever gets merged in second should take the other into account, but if the trip-to-trip transfer proposal can be merged first we would prevent backwards compatibility issues (since this PR would declare (from_stop_id, to_stop_id) to be unique, while this would be undone if #32 gets merged after this one)

@antrim
Copy link
Contributor

antrim commented Sep 15, 2021

Related: I closed PR #32. I think it makes sense to open a separate PR that's cleaner covering transfer rules for routes and trips, and in-seat transfers.

@skinkie
Copy link
Contributor

skinkie commented Sep 15, 2021

@antrim what was the reason not to resolve that in #32?

@botanize
Copy link
Contributor Author

I think the bigger issue is, what are you supposed to do if you develop a GTFS extension that violates a primary key described in this PR?

If the extension becomes part of the spec the primary key would be updated, but while it's a proprietary extension there would be a conflict between the extended feed and the spec.

We could just add something to the new Dataset Attributes subheading that says un-official extensions may change these relationships by adding new fields to the end of the table's primary key? That would prevent someone from extending trips.txt by adding service_id to the primary key, since service_id already exists in the spec. It would also allow adding transfers.from_trip_id and transfers.to_trip_id to the end of the primary key resulting in (from_stop_id, to_stop_id, from_trip_id, to_trip_id), which is backwards compatible with (from_stop_id, to_stop_id) in the sense that null values for from_trip_id and to_trip_id would mean that each from_stop_id, to_stop_id pair would need to be unique.

@scmcca
Copy link
Contributor

scmcca commented Sep 16, 2021

@botanize Do you have examples of other extensions where this problem presents itself? If so it would be useful to note them here.

Otherwise, it seems that there is demand for from_trip_id and to_trip_id as an official part of the spec without causing problems here (if merged before this PR).

@botanize
Copy link
Contributor Author

I don't have any examples, but it seems bound to come up again at some point.

How long do we sit on this PR to try to get #32 or a substitute PR approved?

@scmcca
Copy link
Contributor

scmcca commented Sep 16, 2021

but it seems bound to come up again at some point.

Agreed that this is a foreseeable issue. I can open a substitute PR tomorrow, at which point we'll have to let 7 discussion days pass followed by 7 days for voting. So we should be able to resume with #278 by early October.

@scmcca
Copy link
Contributor

scmcca commented Oct 5, 2021

@botanize #284 for trip-to-trip and route-to-route transfers has been merged. Feel free to move forward with defining the IDs for transfers.txt as discussed above.

@botanize
Copy link
Contributor Author

botanize commented Oct 5, 2021

I ordered the fields in the transfers.txt key to keep the existing keys for producers of the extension compatible: (from_stop_id, to_stop_id, from_trip_id, to_trip_id, from_route_id, to_route_id)

@Bertware @paulswartz does this meet your needs?

@Bertware
Copy link

Bertware commented Oct 5, 2021

@botanize looks good to me!

@Bertware
Copy link

Bertware commented Oct 7, 2021

@botanize if there is no more discussion, can we call for a vote on this proposal?

@botanize
Copy link
Contributor Author

botanize commented Oct 7, 2021

It looks like we're ready for a vote.

The vote is for adding a primary key attribute to each table's description and labeling "unique" and "foreign" IDs where applicable.

Voting ends on 2021-10-14T23:59:59Z.

I look forward to any final feedback and wrapping this up!

@skinkie
Copy link
Contributor

skinkie commented Oct 7, 2021

+1 OpenGeo

@Bertware
Copy link

Bertware commented Oct 8, 2021

+1 Samtrafiken/Trafiklab

@botanize
Copy link
Contributor Author

@barbeau @scmcca @e-lo @abyrd can I encourage you to vote or comment on this PR?

@flocsy
Copy link
Contributor

flocsy commented Oct 14, 2021

fare_rules.txt is a problem IMHO. Although mostly not because changes in this PR. However Primary key (*) would allow me to have N lines that define (fare_id, route_id, origin_id, destination_id, {contains_id_1, ...., contains_id_N}) (this is what we want to allow), but it also allows me to add another line: (fare_id, route_id, origin_id, destination_id, null), which makes no sense and I think it shouldn't be allowed. So maybe contains_id should be Conditionally Required with an explanation that is all the other fields equal between any two rows then contains_id is required.

Side note: However the bigger problem is that the current gtfs doesn't allow me to do this:
(fare_id, route_id, origin_id, destination_id, {contains_id_1, contains_id_2, contains_id_3})
(fare_id, route_id, origin_id, destination_id, {contains_id_1, contains_id_2})
Maybe we should start a separate discussion on how to solve this problem.

@botanize
Copy link
Contributor Author

I'm extending voting until 2021-10-21T23:59:59Z so that I can address comments between when I go on vacation (now) until voting closes.

@scmcca
Copy link
Contributor

scmcca commented Oct 14, 2021

@flocsy

Maybe we should start a separate discussion on how to solve this problem.

Indeed! Can you please start a separate issue so we can elaborate the problem(s) and solutions?

@paulswartz
Copy link
Contributor

+1

@botanize
Copy link
Contributor Author

The vote ended on 2021-10-21 at 23:59:59 UTC with 3 votes in favor and 0 opposed. As per the Specification Amendment Process, this vote passes!

Thanks everyone!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
GTFS Schedule Issues and Pull Requests that focus on GTFS Schedule
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Static spec doesn't differentiate between primary and foreign IDs
10 participants