Are primary and foreign keys on the right level of abstraction? #297

Closed
roll opened this issue Sep 20, 2016 · 18 comments

Comments

@roll
Member

roll commented Sep 20, 2016

Overview

For now, primary and foreign keys are part of the JSON Table Schema (JTS) spec. I suppose there is strong reasoning for that: self-referencing, the ability to create just one CSV file with an FK to "somewhere", etc. Maybe it was just easier to put them there.

But for implementations, honestly, it badly breaks the usual layering of abstractions. For example, to follow this spec perfectly, a JTS lib would need a circular dependency on the DP (Data Package) lib to download and check the referenced datapackage. Or take goodtables checking some atomic tables: I don't know what it should do with an FK definition, because there is no cross-naming mechanism between atomic tables.

Thoughts

To me it looks like a situation where the idea is great but it just won't be supported in the real world. Maybe we'll be able to hack it in Python somehow, but other implementations...

Maybe it's better to reduce the scope of this to make it really workable? Take JSON: a JSON file can't have comments, but it has implementations in 100 languages. I see one great concept here that could guide things:

datapackage integrity

Instead of having PKs and FKs as a side part of the JTS spec, move them to DP as a core part of the DP spec that ensures datapackage integrity, like in SQL. That would be much easier to support across all implementations. It would also give a better separation of concerns:

  • JTS is for types and constraints
  • DP is for data containerization

@akariv
@pwalsh
@sirex
@Stephen-Gates
et al WDYT?

@sirex

sirex commented Sep 20, 2016

I like the current JTS foreign keys specification, and I would expect foreign keys to be part of a resource or table definition. It looks logical.

Regarding implementation, I think libraries that work at the single-resource level should ignore foreign keys, because they reference other resources and that is out of the library's scope.

But higher-level libraries implementing the data package specification could depend on those lower-level, single-resource libraries and add foreign key support.

@roll
Member Author

roll commented Sep 20, 2016

@sirex
But you're saying almost the same thing I said: we have a situation where we ignore it at the table level (Stream->Table->Resource->Datapackage) and use it at the resource level. So why is it an attribute of the table and not of the resource?

@pwalsh
Member

pwalsh commented Sep 20, 2016

I think it is critical to the spec: PK and FK are critical to the context of a table.

@roll
Member Author

roll commented Sep 20, 2016

@pwalsh
Table (e.g. a lone CSV file) or resource (part of a datapackage)? For example, say we have just one CSV file with an FK defined. It's like having one SQL table without a database as its context.

@pwalsh
Member

pwalsh commented Sep 20, 2016

@roll honestly, I'm not sold on there being a practical distinction between Table and Resource, for our needs.

@roll
Member Author

roll commented Sep 20, 2016

@pwalsh
During the last iteration of work on the Python libraries, I found this distinction to be the key to solving some problems that seemed unsolvable without this abstraction:

# jts level
Stream - headers+rows
Table - Stream+schema

# dp level
Resource - Table/Image/Document/etc + metadata in context of data container (datapackage)
Datapackage - container contains Resources + metadata

For example, resources have names, but tables (atomic CSV files) just can't have names (other than filenames, of course =) because there is no namespace for them.

As example:

# datapackage
name: datapackage
resources:
  - name: resource1
  # resource
  - name: resource2
    # table
    schema:
      fields: ...
      foreignKeys: <to resource1>

Extracting resource2 JTS (we want to describe csv file separately):

# JTS of resource 2
fields: ...
foreignKeys: <to resource1>

Now the foreignKey points nowhere, because in this case it just doesn't make sense: there is no namespace.

So the question is simple: why aren't referential entities on the resource level, like this:

# datapackage
name: datapackage
resources:
  - name: resource1
  - name: resource2
    foreignKeys: <to resource1>
    schema:
      fields: ...     

@sirex

sirex commented Sep 20, 2016

@roll I see resource and table as the same thing. My idea is to do the separation in the implementation, not in the specification, by just ignoring foreignKeys in JTS-level libraries.

Since a resource can be anything, not just tabular data, foreignKeys will not make sense for resource types other than tabular data.

Also, it is possible to have the schema outside of the resource:

{
  "resources": [{"schema": "xyz-schema"}],
  "schemas": {
    "xyz-schema": {
      schema goes here ...
    }
  }
}

Foreign keys are directly tied to the schema, because they refer to fields defined in a schema, so I think it is not a good idea to move foreign keys outside of the schema definition.

@roll
Member Author

roll commented Sep 20, 2016

@sirex
You're right that it's tied to the schema. But when lower-level things depend on higher-level things (JTS has a reference to DP), that's also bad. There seems to be no perfect solution here. The SQL spec doesn't have this problem because it has no separate specification for a table that has to make sense by itself.

@sirex

sirex commented Sep 20, 2016

@roll, in that case, maybe foreign keys should point to JTS schemas, not to data packages?

@pwalsh
Member

pwalsh commented Sep 20, 2016

well, JTS lib doesn't need to depend on DP lib.

JTS lib just needs "something" to tell it the table/resource that is being referenced.

If it has a defined API for this, then DP just needs to follow that API, but it does not mean that JTS lib depends on the DP lib.
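
The decoupling described here could be sketched roughly as follows. This is only an illustration, not the API of any existing library: `check_foreign_keys` and the `resolve` callback are hypothetical names, the descriptor shape follows the `foreignKeys` examples in this thread, and only single-field keys are handled. The table-level code asks the callback for the referenced rows and never imports anything from the package level.

```python
from typing import Callable, Dict, Iterable, List, Optional

Row = Dict[str, object]
# Hypothetical resolver: given a resource name, return its rows, or None if unknown.
Resolver = Callable[[str], Optional[Iterable[Row]]]


def check_foreign_keys(rows: List[Row], schema: dict,
                       resolve: Optional[Resolver] = None) -> List[str]:
    """Return error messages; foreign keys are skipped when no resolver is given."""
    errors: List[str] = []
    if resolve is None:
        return errors  # plain table-level use: nothing to resolve against
    for fk in schema.get("foreignKeys", []):
        local = fk["fields"]          # assumption: a single field name (string)
        ref = fk["reference"]
        target = resolve(ref["resource"])
        if target is None:
            errors.append(f"unknown resource: {ref['resource']}")
            continue
        known = {row[ref["fields"]] for row in target}
        for i, row in enumerate(rows):
            if row[local] not in known:
                errors.append(
                    f"row {i}: {local}={row[local]!r} not found in "
                    f"{ref['resource']}.{ref['fields']}"
                )
    return errors


# Usage: a plain dict stands in for whatever the package-level library provides.
states = [{"state_id": "CA"}, {"state_id": "NY"}]
cities = [{"city": "LA", "state": "CA"}, {"city": "Austin", "state": "TX"}]
schema = {"foreignKeys": [
    {"fields": "state", "reference": {"resource": "states", "fields": "state_id"}}
]}
package = {"states": states}

print(check_foreign_keys(cities, schema))               # no resolver: checks skipped
print(check_foreign_keys(cities, schema, package.get))  # resolver supplied by the caller
```

With this shape the dependency points one way only: the package level knows about the table level, and the table level knows only the callback signature.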

@roll
Member Author

roll commented Sep 20, 2016

@sirex
What do you mean? My main point about the difference between table and resource (aside from them being two different specs) is that a resource has a name but a table doesn't. So we can only reference resources in a datapackage context.

@pwalsh
So should this be explained somehow in the specification? Something like: without a datapackage context, this, this and this will be completely ignored? For now it's not clear.

@pwalsh
Member

pwalsh commented Sep 20, 2016

@roll I guess we should talk about this more, in the coming days. For me, it is not a specification issue at all, but rather an implementation issue. However, I agree that the specs need work in this area anyway.

@roll
Member Author

roll commented Sep 20, 2016

@pwalsh
Yeah, great. I just wanted to raise it because, in my experience, there are 2 main weak points in the specs:

  • this break in the level of abstraction (maybe it's OK, as you said, with proper explanation in the spec and implementations)
  • tabular data package instead of a mechanism to define tabular resources

@sirex

sirex commented Sep 20, 2016

In my earlier comment (#297 (comment)) I was saying that, for example, here:

  "foreignKeys": [
    {
      "fields": "state",
      "reference": {
        "datapackage": "http://data.okfn.org/data/mydatapackage/",
        "resource": "the-resource",
        "fields": "state_id"
      }
    }
  ]

the reference points to an external data package, so this creates a circular dependency between the two specs.

To fix that, the reference could point directly to a JTS schema instead of a data package resource.

For example, datapackage and resource could be replaced with something like this:

"schema": "the-resource" - points to the "current" space of schemas. If the JTS is embedded somewhere, a mapping of all schemas should be provided to the library from outside. For example, if a datapackage embeds a JTS schema, it would provide the schemas of all its other resources to the JTS library so that the library could validate foreign keys.

"schema": "dp+http://data.okfn.org/data/mydatapackage/?resource=the-resource" - points to an external schema. This means the JTS library would still have to support the datapackage spec, but only a small subset of it: enough to get schemas from resources.

"schema": "http://data.okfn.org/data/mydatapackage/a-schema.json" - points directly to another schema.

If two specifications depend on each other, they should either be merged into one or have the dependency removed. So I sort of agree with @roll, but as I understand it, the specs can't be changed that freely.
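
To make the three proposed reference forms concrete, here is a minimal sketch of how a library might tell them apart. Everything in it is an assumption for illustration: `classify_schema_ref` is a hypothetical helper, and the `dp+` prefix and `?resource=` query parameter are just the notation suggested in this comment, not part of any published spec.

```python
from urllib.parse import parse_qs, urlsplit


def classify_schema_ref(ref: str) -> dict:
    """Classify a proposed "schema" reference string (hypothetical helper)."""
    if ref.startswith("dp+"):
        # External data package: the resource name travels in the query string.
        parts = urlsplit(ref[len("dp+"):])
        resource = parse_qs(parts.query).get("resource", [None])[0]
        package_url = parts._replace(query="").geturl()
        return {"kind": "datapackage", "datapackage": package_url, "resource": resource}
    if urlsplit(ref).scheme in ("http", "https"):
        # Direct URL of a standalone schema document.
        return {"kind": "schema-url", "url": ref}
    # Anything else is treated as a name in the "current" space of schemas.
    return {"kind": "local", "name": ref}


for ref in (
    "the-resource",
    "dp+http://data.okfn.org/data/mydatapackage/?resource=the-resource",
    "http://data.okfn.org/data/mydatapackage/a-schema.json",
):
    print(classify_schema_ref(ref))
```

Only the first branch needs any knowledge of data packages, which matches the "small subset" mentioned above.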

@roll
Member Author

roll commented Oct 10, 2016

@sirex
This makes sense, but it's kind of complex, and it points to a schema (description) instead of a resource (data unit).

To clarify my point, imagine I were starting the specs in this area from scratch:

  • I use jsontableschema for types and constraints
  • I use datapackage as it is for data packaging
  • I add TabularResource section to the datapackage spec saying:

TabularResource is a resource which:

  • MUST have schema attribute pointing to JTS
  • SHOULD have primaryKey attribute
  • MAY have foreignKeys attribute pointing to other TabularResource in this datapackage

In this case foreign keys are used to provide datapackage integrity, exactly like in SQL. To me it makes sense because data packages are about data containerization. Once we have a container, we can have references.
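
A package-level integrity check along these lines might look like the sketch below. It assumes the hypothetical layout from this proposal (resource-level `primaryKey` and `foreignKeys`, an inline `schema` with named fields, single-field keys); `check_package_integrity` is an illustrative name, not a function of any existing library, and it inspects descriptors only, not the data.

```python
from typing import List


def check_package_integrity(package: dict) -> List[str]:
    """Check that every resource-level foreign key points at a known resource and field."""
    errors: List[str] = []
    by_name = {res["name"]: res for res in package.get("resources", [])}
    for res in by_name.values():
        for fk in res.get("foreignKeys", []):
            ref = fk["reference"]
            target = by_name.get(ref["resource"])
            if target is None:
                errors.append(f"{res['name']}: unknown resource {ref['resource']!r}")
                continue
            # Assumption: the target schema is inline and lists its fields by name.
            field_names = {f["name"] for f in target.get("schema", {}).get("fields", [])}
            if ref["fields"] not in field_names:
                errors.append(
                    f"{res['name']}: {ref['resource']} has no field {ref['fields']!r}"
                )
    return errors


package = {
    "name": "datapackage",
    "resources": [
        {"name": "resource1", "schema": {"fields": [{"name": "id"}]}},
        {
            "name": "resource2",
            "primaryKey": "id",
            "foreignKeys": [
                {"fields": "r1_id", "reference": {"resource": "resource1", "fields": "id"}},
                {"fields": "r3_id", "reference": {"resource": "resource3", "fields": "id"}},
            ],
            "schema": {"fields": [{"name": "id"}, {"name": "r1_id"}, {"name": "r3_id"}]},
        },
    ],
}

print(check_package_integrity(package))
```

Because resource names only exist inside the container, the check needs nothing beyond the package descriptor itself.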


One thing that I think should be considered anyway is removing cross-datapackage referencing for v1. It even solves the circular dependency between JTS and DP:

On the JTS level, say that resource could mean anything on higher levels, or self for self-referencing (or no resource for self-referencing).

foreignKeys
  - fields
    reference
      resource
      fields

On the DP level, add that resource is a datapackage resource. So it's a kind of "higher levels extend lower levels" approach. Other specifications would be able to use JTS foreign keys by giving resource another meaning.
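
As a rough sketch of this layering, the table level can resolve a self-reference on its own and hand every other name to whatever the higher level provides. The names here are hypothetical (`resolve_reference`, `lookup`), and treating both a missing `resource` and the literal `"self"` as a self-reference is an assumption based on the two options mentioned above.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, object]
Lookup = Callable[[str], Optional[List[Row]]]


def resolve_reference(reference: dict, current: List[Row],
                      lookup: Optional[Lookup] = None) -> Optional[List[Row]]:
    """Return the rows a reference points to, or None if it cannot be resolved here."""
    name = reference.get("resource")
    if name in (None, "self"):
        return current   # self-reference: the same table
    if lookup is None:
        return None      # no higher level available to give the name a meaning
    return lookup(name)  # e.g. a datapackage maps the name to one of its resources


employees = [{"id": 1, "manager_id": 1}, {"id": 2, "manager_id": 1}]
self_ref = {"resource": "self", "fields": "id"}
other_ref = {"resource": "departments", "fields": "id"}

print(resolve_reference(self_ref, employees) is employees)  # True
print(resolve_reference(other_ref, employees))              # None
```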

@rufuspollock
Contributor

I would like to move to close this issue.

Valuable discussion, but if I read it correctly, I do not think there is anything outstanding now in terms of a specific proposed change or an immediate bug in the specs. Let me know if there are any objections.

@roll
Member Author

roll commented Oct 19, 2016

@rgrp
I've extracted a specific idea from this discussion - #314

@rufuspollock
Contributor

INVALID / WONTFIX. See previous comment. See #314 for specific suggestion.
