Data Package "profiles" #87

Closed
rufuspollock opened this issue Dec 28, 2013 · 45 comments
@rufuspollock
Contributor

It is clear people will want to extend the base Data Package spec for particular formats or structures. This could be both for types of data (e.g. tabular, as we do with Simple Data Format) and for topical areas (e.g. financial data).

This proposal is about:

  • Having a name for these "extensions" of Data Package. Proposed name is: "profile" - cf concept of JSON profiles
  • Having a way for Data Package spec to explicitly designate what the profile is

Proposal

Introduce a profiles attribute whose value is a hash mapping each profile name to a version of that profile. The structure is the same as for dependencies.

profiles: {
  {profile-name}: "",
  {profile-name}: "0.9"
}
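
For concreteness, here is a minimal sketch of how a tool might read such a profiles hash from a descriptor. The descriptor contents, the profile names and the helper `declared_profiles` are assumptions made for illustration, not part of the proposal. Both `"*"` and `""` are treated as "any version", following the aside later in this thread.

```python
import json

# Hypothetical descriptor; profile names and versions are made up for illustration.
DESCRIPTOR = """
{
  "name": "example-budget",
  "profiles": {
    "tabular-data-package": "*",
    "budget-data-package": "0.9"
  }
}
"""

def declared_profiles(descriptor: dict) -> dict:
    """Return {profile name: version or None}, where None means "any version"."""
    profiles = descriptor.get("profiles", {})
    return {name: (version if version not in ("", "*") else None)
            for name, version in profiles.items()}

profiles = declared_profiles(json.loads(DESCRIPTOR))
print(profiles)
```

A descriptor without a profiles key simply yields an empty mapping, so existing packages would keep working unchanged under this sketch.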

Material from #183 on microschemas and profiles

Idea: allow people to start registering simple little schemas for their data (esp tabular data). This would be in the form of a JSON Table Schema. For example, we have Budget Data Package for public finances, we could have a simple schema for crime reports or restaurant inspections.

Notes:

  • Basically these would be profiles. So maybe we want a standardized name for this (either profile or schema but not both). On the other hand, it may be useful to distinguish e.g. Tabular Data Package as a profile from a microschema, which actually specifies what fields should be present in a given dataset.
  • We could / should have one page for each profile / microschema on dataprotocols.org at dataprotocols.org/microschemas/
  • Suggest a strong focus on Tabular data to start with - i.e. schemas (required columns) for tabular data
  • Proposals could be classified as:
    • Draft: just an idea
    • Alpha: trialling publishing with at least 2 users
    • Beta: in use with at least 2 users and trialling with more
    • Accepted: multiple (5+) users (publishers)
@sballesteros

_Somewhat_ related: [JSON-LD](http://json-ld.org/) (and adding `@context` to a resource)

@trickvi

trickvi commented Jan 22, 2014

+1 on this. This is rather important for the work I'm currently doing, where a lot of the metadata I need to provide doesn't apply to the general data package.

@rufuspollock
Contributor Author

I've updated the proposal to use the plural "profiles", since a given data package can implement multiple profiles.

@stevage
Contributor

stevage commented May 8, 2015

Has there been any further discussion or work on this? It seems pretty necessary in order to be able to have pluggability between data packages and different tools. As things stand, there doesn't seem to be anything to indicate that a data package is a tabular data package, rather than any other kind - other than inspecting all the fields.

I'd suggest that the reference to the schema should be a URI though, and perhaps a dereferenceable schema URL.

@rufuspollock
Contributor Author

@stevage i agree we should progress this - probably as a small separate spec similar to http://dataprotocols.org/data-package-identifier/ - would you be up for drafting something?

Personally I'd like to have a single name rather than a URI as for most users this will be easier but to have a way for that to be dereferenceable e.g. have a convention that profiles.dataprotocols.org or something exists and profile X goes to profiles.dataprotocols.org/X

@stevage
Contributor

stevage commented May 8, 2015

yeah that works. You can actually have the best of both worlds:

profile: "foo" => http://profiles.dataprotocols.org/foo
profile: "http://foo.bar" => http://foo.bar
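
A minimal sketch of that two-way resolution, assuming the base URL from the convention floated above (it is not a confirmed service) and a hypothetical helper name `resolve_profile`:

```python
from urllib.parse import urlparse

# Assumed base URL, taken from the convention suggested in this thread.
DEFAULT_BASE = "http://profiles.dataprotocols.org/"

def resolve_profile(value: str, base: str = DEFAULT_BASE) -> str:
    """Map a profile value to the URL it would dereference to."""
    parsed = urlparse(value)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return value      # already a URL: use it as-is
    return base + value   # bare name: look it up under the default base

print(resolve_profile("foo"))
print(resolve_profile("http://foo.bar"))
```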

Give me a chance to learn the rest of the ecosystem first, and see #183. Might be good to chat if you have time?

@rufuspollock
Contributor Author

@stevage definitely good to chat. Ping on #okfn on freenode or rufuspollock on skype.

@rufuspollock
Contributor Author

OK, I think we are going to introduce this and probably merge "microschemas" - #183 - with it. My sense is this will be a separate spec referenced from Data Package rather than inlined into that spec.

Thoughts anyone?

@rufuspollock
Contributor Author

@pwalsh could you update on the status of the profiles / schemas registry. Is this now live?

@rufuspollock
Contributor Author

Need to clarify inheritance model for profiles:

For Data Packages themselves and the use of the profiles field we go for explicitness. You write out all the profiles you expect to conform to even if some imply the other e.g. you'd have both "tabular" and "budget" listed even if the budget data package is a type of tabular data package.
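
As a rough illustration of what "explicitness" asks of publishers, a checker could flag implied profiles that are not written out. The `IMPLIES` map and the function name are hypothetical; something like a registry would have to supply the real relationships.

```python
# Hypothetical map of which profile implies which others.
IMPLIES = {"budget": ["tabular"]}

def missing_profiles(profiles: dict, implies: dict = IMPLIES) -> list:
    """List implied profiles that the descriptor fails to spell out."""
    missing = set()
    for name in profiles:
        for parent in implies.get(name, []):
            if parent not in profiles:
                missing.add(parent)
    return sorted(missing)

print(missing_profiles({"budget": "*"}))
print(missing_profiles({"budget": "*", "tabular": "*"}))
```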

@rufuspollock
Contributor Author

I propose that this will become a separate protocol that extends Data Packages but is not part of the base specification.

@jpmckinney @pwalsh @paulfitz @stevage wdyt?

@jpmckinney

I'm not sure I understand. Profiles are a mechanism for extending DP. Are you saying that the mechanism itself should be an extension of DP, and not part of the base spec?

@pwalsh
Member

pwalsh commented Jul 10, 2015

@rgrp

https://github.com/dataprotocols/registry

Registry is live in the sense that we have a javascript library that works with it, using rawgit.com to serve the registry file straight from github: https://github.com/okfn/datapackage-registry-js/blob/master/index.js

We are using this in DataPackagist with success.

I'd be happy to formalise it, and make an announcement about it, by:

  • updating the readme a bit to encourage others to use it, and link to datapackagist as a quick way to create package descriptors for any entry in registry
  • if possible, serving it (and DataPackagist) from a proper domain to make it more official, e.g. apis.datapackages.com/registry and datapackagist.io

@stevage
Contributor

stevage commented Jul 12, 2015

Ok, just to check that I understand what this proposal currently is:

  • data packages can conform to one or more profiles
  • profiles should be registered in the data protocol registry
  • a data package should list the profiles that it conforms to in an optional (recommended?) profiles field of the form:
"profiles": {
  "budget-data-package": "1.0",
  "tabular-data-package": "1.1"
}

(I wonder if it would be better to also have direct links to schemas, for simplicity. It would also better support use cases where for some reason whoever is curating the registry doesn't want to accept a proposed profile. That situation would be a real road block with this proposal because there's no other way to link to it...).

  • although profiles can extend other profiles, a data package should list every profile that it conforms to explicitly
  • profiles can add fields, constrain existing fields, and add constraints that aren't expressed in schema (eg, structures of files). Anything else?

Are informal profiles (that is, an agreement to add additional fields to data packages, without actually writing a schema) still allowed/encouraged?

@rufuspollock
Contributor Author

@stevage that's a great summary.

Right now, I don't think we do want to link schemas in the datapackage.json itself as I think that isn't yet quite resolved and is also not necessarily essential for a lot of what people may use this for (e.g. i'm just looking for all tabular data packages - i'm not looking to validate against the schema)

@stevage
Contributor

stevage commented Jul 13, 2015

Sure, fair enough.

@rufuspollock
Contributor Author

I'm also thinking what about pure JSON Table Schemas vs full data package profiles. Should we support listing pure JSON Table Schemas?

@trickvi

trickvi commented Jul 23, 2015

Just a quick comment. I understand why you would want to have profiles as a hash of all profiles the datapackage follows. It makes it simple to parse, but it does make it more tedious to generate where you'd need to know a lot more than just the datapackage profile you're trying to follow.

Made up example: I have a "historical budget data package" which might combine the "budget data package" and "historical data package" (made up) profiles. Budget data package might include "openspending data package", and these are "tabular data packages", and the historical data package might be something like a "function data package" (made up). Now, if I were just reading the "historical budget data package" profile page, I would possibly have to read a lot of background profiles or, which is what I'd probably do, just copy things from one package to another without any thought.

I also do not like versioning, mostly because I don't think that versioning is handled in a good way and it becomes confusing quickly. It would make sense to me if it was only a single profile I'd be inheriting from, but I might be inheriting the budget data package profile version 2.0 which supports multiple versions of the "tabular data package".

I'm also slightly afraid of using names like "budget-data-package" as keys which could also be "Budget data package" or something.

I think we should try to go for simplicity and I think we're going in the wrong direction with this. Why not just a link to the specification or the schema?

Sorry for being a naysayer here, but I'm just very afraid that I might be less inclined to use data package profiles if this gets too complicated.

@rufuspollock
Contributor Author

@tryggvib so:

  • inheritance stuff. I think you are asking for automated cascaded inheritance. I think this is nice but painful for implementors and not essential for many use cases. I'd like to see how we do with simplicity and then refactor if not good (basic point: it may not matter if you just put one profile in there and leave out the others)
  • naming: we need a canonical name in the registry if we have one and you should use that. I get it is a little annoying but i don't see an alternative other than direct schema links
  • direct schema links: I don't like them atm because:
    • one level of indirection is useful (it much reduces breakage in urls etc)
    • having the name in the profile is useful for other reasons than simply dereferencing to validate. If you have a schema link i have to guess from that what the actual profile is

I'm still very open-minded and the proof of this stuff is in the implementation so trying this out is what matters :-)

Aside: "*" or simple "" means that you can have any version of the schema you like.

@trickvi

trickvi commented Jul 27, 2015

@rgrp

If we already have a canonical name in the registry, couldn't we then just have the registry manage the inheritance? That does make the use case of "I want all tabular data packages which should also pick up budget data packages and other derived profiles" slightly more difficult, but at least we won't have to rely on package maintainers to remember to add all the inheritance stuff (which would mean we might miss out on packages in the use case because of the assumption that everyone will add all profiles).

Also I'm sceptical whether the "*" and "" are a good thing. If the profile updates and becomes backwards incompatible, those packages that used "*" or "" will not adhere to the profile and we'll end up with inconsistencies. This puts a lot of restrictions on profile creators about how they can develop their profiles in the future.
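
To make the registry-managed option concrete, here is a rough sketch in which each package names a single profile and the registry knows what that profile extends. The `EXTENDS` data, the short profile names and both function names are invented for this example.

```python
# Hypothetical registry data: each profile name maps to the profiles it extends.
EXTENDS = {
    "budget": ["tabular"],
    "historical-budget": ["budget", "historical"],
    "tabular": [],
    "historical": [],
}

def ancestors(profile, extends=EXTENDS):
    """Return every profile reachable from `profile`, including itself."""
    seen, stack = set(), [profile]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(extends.get(current, []))
    return seen

def packages_conforming_to(target, packages, extends=EXTENDS):
    """Select packages whose declared profile is, or derives from, `target`."""
    return [name for name, profile in packages.items()
            if target in ancestors(profile, extends)]

PACKAGES = {"a": "tabular", "b": "budget", "c": "historical-budget", "d": "historical"}
print(packages_conforming_to("tabular", PACKAGES))
```

The "all tabular data packages" query then costs one walk over the registry graph per package, and publishers never have to list parent profiles themselves.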

@stevage
Contributor

stevage commented Jul 27, 2015

Just thinking of the way other kinds of systems manage this stuff, I imagine a bit of XSL and NPM.

XSL documents tend to start by explicitly listing all the namespaces they refer to, and you just copy this chunk of text from one similar document to the next. I imagine this working out the same - there'll just be a chunk of three lines that you include in every OpenSpending Data Package, for instance.

The "*" / "" thing weirds me out a bit though, as it's just so loose. It does rather imply that no backwardly-incompatible change will ever be made. How about an NPM-style scheme: "^1.0.0", etc.

@rufuspollock
Contributor Author

@stevage i borrowed the "" or "*" from node / npm (at least i thought i was!)

@stevage
Contributor

stevage commented Jul 27, 2015

Heh, maybe that was an older npm mechanism - deprecated now if it's even supported. The current mechanism works very well - "^1.0.0" means "any 1.x.x".
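
For anyone unfamiliar with the caret rule, a simplified sketch of the matching logic follows. It assumes full "major.minor.patch" versions with no pre-release tags, and it is an approximation of the npm behaviour rather than a port of npm's own semver code; the function names are made up.

```python
def parse(version: str) -> tuple:
    """Parse a full "major.minor.patch" string into a tuple of ints."""
    major, minor, patch = (int(part) for part in version.split("."))
    return (major, minor, patch)

def satisfies(version: str, spec: str) -> bool:
    """Check `version` against "", "*", an exact version, or a caret range."""
    if spec in ("", "*"):
        return True  # the loose form discussed above: anything goes
    if not spec.startswith("^"):
        return parse(version) == parse(spec)
    low = parse(spec[1:])
    # The upper bound bumps the left-most non-zero component of the lower bound.
    if low[0] > 0:
        high = (low[0] + 1, 0, 0)
    elif low[1] > 0:
        high = (0, low[1] + 1, 0)
    else:
        high = (0, 0, low[2] + 1)
    return low <= parse(version) < high

print(satisfies("1.4.2", "^1.0.0"))
print(satisfies("2.0.0", "^1.0.0"))
```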

@danfowler
Contributor

@rgrp @pwalsh So with our working implementation of profiles, any given Data Package has only one unversioned profile. And all inheritance from any other profile (e.g. fiscal > tabular) is specified directly in the JSON Schema that defines the profile.

And so, would a simple profile key whose value is a string identifier for the profile in some registry (as suggested earlier by @stevage in #87 (comment)) be enough to support this?

"profile": "tabular"

Should we then also specify alternate registry URLs (as already supported in datapackage-py: https://github.com/frictionlessdata/datapackage-py)?

"profileRegistryURL": "http://xxx" // defaults to http://schemas.datapackages.org/registry.csv

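A sketch of what a lookup against such a registry file could look like. The column names ("id", "title", "schema") and the rows are assumptions for illustration only; the real registry.csv layout is not described in this thread, and an inline string stands in for fetching the file.

```python
import csv
import io

# Inline stand-in for a registry file; columns and rows are assumed.
REGISTRY_CSV = """id,title,schema
base,Data Package,http://example.org/schemas/data-package.json
tabular,Tabular Data Package,http://example.org/schemas/tabular-data-package.json
"""

def load_registry(csv_text: str) -> dict:
    """Index registry rows by profile id."""
    return {row["id"]: row for row in csv.DictReader(io.StringIO(csv_text))}

def schema_url(profile: str, registry: dict):
    """Return the schema URL for a profile id, or None if it is not registered."""
    row = registry.get(profile)
    return row["schema"] if row else None

registry = load_registry(REGISTRY_CSV)
print(schema_url("tabular", registry))
print(schema_url("fiscal", registry))
```

An alternate registry would then just be a different CSV source passed to `load_registry`.
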
@pwalsh
Member

pwalsh commented Mar 29, 2016

@danfowler

I don't follow.

What do you mean by:

> all inheritance from any other profile (e.g. fiscal > tabular) is specified directly in the JSON Schema that defines the profile.

And, what problem are you solving by adding profileRegistryURL to the spec?

@danfowler
Contributor

@pwalsh earlier in the thread @rgrp says this:

> For Data Packages themselves and the use of the profiles field we go for explicitness. You write out all the profiles you expect to conform to even if some imply the other e.g. you'd have both "tabular" and "budget" listed even if the budget data package is a type of tabular data package.

But in describing how a Data Package publisher can specify a profile for their Data Package, we can just keep it simple and suggest adding a single property (e.g. "profile": "fiscal"). It is then up to the maintainer of that profile to ensure that, if she suggests it inherits from tabular, whatever validation mechanism is used (e.g. JSON Schema) implements that. Does that make sense?

I suggested "profileRegistryURL" based on @stevage's suggestion of a mechanism to allow for profiles that don't exist in the core registry:

> It would also better support use cases where for some reason whoever is curating the registry doesn't want to accept a proposed profile. That situation would be a real road block with this proposal because there's no other way to link to it...

@pwalsh
Member

pwalsh commented Mar 29, 2016

Ok, so on point 1, no, the JSON Schema for a profile (which is really just an implementation detail, or, if you like, a representation of the spec) does not implicitly or explicitly declare any inheritance, so I still don't understand what you mean there.

On point 2, well, I get the idea, but there is not actually any dependency between the spec and the core registry (the specs do not say, for example, that fiscal exists in the core registry), so I'm still asking what problem we are solving here, at the spec level.

@danfowler
Contributor

@pwalsh

  1. OK, so in describing how to create a profile, I'll leave out anything about inheritance. A profile for a data package stands alone. Its representation can be a JSON Schema.
  2. OK, but the specs don't say anything about profiles yet. Once we go down the path of providing names for profiles, don't we have to talk about how those names might be resolved? This could all go in a mini-spec as @rgrp suggests above.

danfowler added a commit that referenced this issue Apr 4, 2016
This is a simple document describing profiles and why they're useful.
It could obviously be more detailed, but I also wanted to avoid
describing the current implementation for handling this.

See #199 and #87
@danfowler
Contributor

@pwalsh @rgrp @stevage @jpmckinney drafting a description of profiles here: https://github.com/dataprotocols/dataprotocols/blob/add-profiles-documentation/data-package-profiles/index.md

I think the next section in this doc could be a template for writing such a profile (#196)

@stevage
Contributor

stevage commented Apr 5, 2016

A few comments on what you have there:

  • it refers to writing "a small specification", but FDP is pretty big. Is talking about what's needed "at a minimum" actually helpful?
  • are the three italicised questions meant to be section headings for the profile? If not, are they just "things that your profile should address at some point"? I'd suggest that providing a template might be more useful, especially when dealing with something so abstract.
  • The para about JSON schema is pretty muddy. I think you're just saying "The profile should be accompanied by a JSON schema file which can validate that a given data package complies with it."
  • Is "Assigning a profile" the right language? Maybe make it clearer what the implications of specifying that property are. Also, should it be a list? A DP can comply with several profiles, no?

@danfowler
Contributor

Thanks for the comments, @stevage. To your points:

  1. This is a good point. I suppose FDP is probably at one end of the complexity spectrum for Data Packages, and that most will be far simpler. But I suppose this can be informed by the use cases people actually have for Data Packages.
  2. Also an excellent point. I was meaning to add a template afterwards anyway, but perhaps replacing these general questions with the template would be ideal.
  3. Thanks, that's an improvement.
  4. I was approaching this in the simplest possible way in which there is only one profile per Data Package. But maybe we need more discussion on that 😄

@pwalsh
Member

pwalsh commented Dec 19, 2016

I implemented this in f1ccbee

It applies to Data Package, and also to the new Data Resource, descriptors.

It is not implemented as an object, nor with hierarchy - those suggestions are way too complex and require a publisher to know details about subclassing and inheritance that a publisher should frankly never need to know.

Instead, we have:

  • a profile property
  • the base descriptors have a value of "profile": "default". The absence of a profile property is considered equivalent to "profile": "default"
  • (Sub)Profiles MUST have a profile property. It can have a value of the ID of a profile from the registry, or, a URI to a JSON Schema for the profile.
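
As a hedged illustration of the three rules above (not the code from the referenced commit), a consumer could normalise the property like this. The function names and the "any URI scheme means URI" heuristic are assumptions.

```python
from urllib.parse import urlparse

def effective_profile(descriptor: dict) -> str:
    """A missing `profile` property is treated as "default"."""
    return descriptor.get("profile", "default")

def profile_kind(descriptor: dict) -> str:
    """Classify the profile value as 'default', 'uri', or 'registry-id'."""
    value = effective_profile(descriptor)
    if value == "default":
        return "default"
    if urlparse(value).scheme:
        return "uri"          # e.g. a link to a JSON Schema for the profile
    return "registry-id"      # otherwise assume an ID from the registry

print(profile_kind({}))
print(profile_kind({"profile": "tabular-data-package"}))
print(profile_kind({"profile": "https://example.org/my-profile.json"}))
```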

@rufuspollock
Contributor Author

@pwalsh ok. BTW this is one example where i think we could do the PR separately and distinctly - but ok if not.

@Fak3

Fak3 commented Feb 1, 2017

From the draft commit: f1ccbee:

> Custom profiles `MUST` have a `profile` property, where the value is a unique identifier for that profile.

I am lost here - what is a profile? Datapackage itself or a resource? Does it mean that any datapackage MUST have profile property? Honestly, the whole section wording is so ambiguous that I can't understand how profiles work.

@pwalsh
Member

pwalsh commented Feb 2, 2017

@Fak3 using a single commit is probably not the best way to get the context you need. The profile concept is explained in narrative form in the base data package spec, and is not further explained in the reference information for each profile. I'll push a built site for @rufuspollock to review and we'll take it from there.

@pwalsh
Member

pwalsh commented Feb 5, 2017

We won't merge #337 until we can support all the changes in the core implementations we maintain. So I'm closing this, as it is implemented in #337 and leaving it open creates confusion.

@pwalsh pwalsh closed this as completed Feb 5, 2017
rufuspollock added a commit that referenced this issue May 24, 2017
…parate mini-spec.

* /profiles/ is a mini-spec explaining and defining meaning and syntax of profile property
* [dr]: add profile property
* [dp]: add profile property
roll added a commit that referenced this issue Jun 26, 2024
9 participants