Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve foreign keys API #155

Closed
roll opened this issue Aug 30, 2017 · 6 comments · Fixed by #156
Closed

Improve foreign keys API #155

roll opened this issue Aug 30, 2017 · 6 comments · Fixed by #156
Assignees

Comments

@roll
Copy link
Member

roll commented Aug 30, 2017

Proposed API

The library supports foreign keys described in the Table Schema specification. It means if your data package descriptor use resources[].schema.foreignKeys property for some resources a data integrity will be checked on reading operations.

Consider we have a data package:

DESCRIPTOR = {
  'resources': [
    {
      'name': 'teams',
      'data': [
        ['id', 'name', 'city'],
        ['1', 'Arsenal', 'London'],
        ['2', 'Real', 'Madrid'],
        ['3', 'Bayern', 'Munich'],
      ],
      'schema': {
        'fields': [
          {'name': 'id', 'type': 'integer'},
          {'name': 'name', 'type': 'string'},
          {'name': 'city', 'type': 'string'},
        ],
        'foreignKeys': [
          {
            'fields': 'city',
            'reference': {'resource': 'cities', 'fields': 'name'},
          },
        ],
      },
    }, {
      'name': 'cities',
      'data': [
        ['name', 'country'],
        ['London', 'England'],
        ['Madrid', 'Spain'],
      ],
    },
  ],
}

Let's check relations for a teams resource:

from datapackage import Package

package = Package(DESCRIPTOR)
teams = package.get_resource('teams')
teams.check_relations()
# tableschema.exceptions.RelationError: Foreign key "['city']" violation in row "4"

As we could see there is a foreign key violation. That's because our lookup table cities doesn't have a city of Munich but we have a team from there. We need to fix it in cities resource:

package.descriptor['resources'][1]['data'].append(['Munich', 'Germany'])
package.commit()
teams = package.get_resource('teams')
teams.check_relations()
# True

Fixed! But not only a check operation is available. We could use relations argument for resource.iter/read methods to dereference a resource relations:

teams.read(keyed=True, relations=True)
#[{'id': 1, 'name': 'Arsenal', 'city': {'name': 'London', 'country': 'England}},
# {'id': 2, 'name': 'Real', 'city': {'name': 'Madrid', 'country': 'Spain}},
# {'id': 3, 'name': 'Bayern', 'city': {'name': 'Munich', 'country': 'Germany}}]

Instead of plain city name we've got a dictionary containing a city data. These resource.iter/read methods will fail with the same error as resource.check_relations does if there is an integrity issue. But only if relations=True flag is passed.

@roll
Copy link
Member Author

roll commented Aug 30, 2017

@pwalsh
Please take a look. It uses your design.

@pwalsh
Copy link
Member

pwalsh commented Aug 30, 2017

@roll I like it :). Shall we get some API review comments from @amercader or @vitorbaptista too/first before merging? Are you good with this, as an enhancement on the first iteration, or .... ?

@roll
Copy link
Member Author

roll commented Aug 30, 2017

@pwalsh
It's hard to judge for me because I'm not really familiar with real-world uses cases for FK in data packages. But after update I think it's nicer and as we discussed more efficient by default (FK opt-in).

May be we also could get some comment from @Stephen-Gates in addition to Adria's and Vitor's. It's related to his current work - frictionlessdata/tableschema-js#88 (comment)

@Stephen-Gates
Copy link

Stephen-Gates commented Aug 30, 2017

Here are my thoughts for foreign key use.

From a data publisher perspective:

  1. Too many codes
  • I have a list of codes in my table
  • there are too many to put in an enum Constraint
  • I create a separate look up table and use FK for validation
  1. What does the code mean?
  • I have a list of codes in my table
  • without documentation they are meaningless
  • I create a separate look up table with the code and description
  • use FK for validation
  1. Consistent codes for multiple datasets (novice)
  • I have a set of codes I want everyone to use
  • They're important so I got them published as an open .CSV (I'm a novice and don't know about data packages yet)
  • I ask my peer organisations to use the codes in their data publications
  • Some of my peers know about data packages, others don't
  • I need to easily discover instructions, examples, and tools to help me and others publish and use the shared code list
  • for peer organisations that understand data packages, could they reference the external CSV and create a FK relationship?
  1. Consistent codes for multiple datasets
  • I have a set of codes I want everyone to use
  • They're important so I've got them published online as data package
  • I ask my peer organisations to use the codes in their data publications
  • They use external FK relationship in their data packages to comply
  1. Consistent codes but breaking change
  • I have a set of codes I want everyone to use
  • They're important so I've got them published as a data package
  • My peer organisation are using the codes in their data publications
  • I have to make major changes to the code list. Some codes are no longer valid.
  • How do I tell others what's happening (who is using my code list?)
  • How do I safely make the change (versions, date ranges on codes?)
  • What do I tell users of code-list to do?

Thinking as a Data Package Tool Provider:

  1. Help users understand the meaning of codes
  • The table displayed has a column with codes
  • The codes and their meaning have a FK relationship to this column
  • When I hover my mouse over the code, a tooltip showing its meaning is displayed
  • When a user clicks the code, show the related row in the FK table (where they may discover more than just the meaning of the code)
  1. Help users enter valid codes
  • Data is being entered in a FK column
  • present a drop down list of valid values from the look-up table

As a data consumer:

  1. Make it easily for me to download
  • I'm comforted by the effort you've taken to validate the data and apply consistent codes
  • your efforts give me confidence to explore the data further
  • give it to me in one big dereferenced table so I can explore it using tools I know like Excel.

@roll
Copy link
Member Author

roll commented Aug 31, 2017

@Stephen-Gates
Thanks! That's a great input.

@roll roll closed this as completed in #156 Aug 31, 2017
@roll roll removed the review label Aug 31, 2017
@roll
Copy link
Member Author

roll commented Aug 31, 2017

It seems there are no objections and other opinions. So merging because anyway those changes include important internal improvements. I think we have a few days before v1 release for late call feedback.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants