Profiles and distributions #531

Closed
davebrowning opened this issue Nov 4, 2018 · 9 comments
Assignees
Labels
dcat:Distribution dcat feedback Issues stemming from external feedback to the WG profile-guidance

Comments

@davebrowning
Contributor

(Issue created to track comments received on Second PWD of DCAT Recommendation from Clemens Portele - email archived here)

Additional question about the interplay of profile choice and distribution:

Say I have a dataset of buildings and it is made accessible according to two different profiles (e.g. two different XML schemas or two different JSON schemas). The two profiles use different vocabularies and there are differences in the content. However, both representations are sourced from the same data. To me this would be a single dataset. However, this is not that clear in DCAT 1.0 and one could also take the view that these are two different datasets - with separate dataset metadata. At least I know cases where this has been represented as two datasets in catalogs. The new DCAT draft adds language about dataset as "a single conceptual entity" which seems to support the view that there is a single dataset in this case. Could guidance be included in the revision to support more consistent implementations, maybe just an example for such a case?

Assuming this would be considered one dataset: if both profiles were served through the same API (or service) and profile negotiation were used, would this be one distribution (since it is a single API) or two distributions (one per profile, but with the same accessURL)?

Currently you can only specify the media type of a distribution. Considering the work on profiles and profile negotiation in the DXWG wouldn’t it make sense to be able to specify the profile(s) that a distribution supports in DCAT?

@makxdekkers
Contributor

This issue seems to be related to the discussion about 'informational equivalence' between distributions. If I understand correctly, serving up data according to a particular profile will deliver a different set of data -- assuming a profile will always deliver a selection/subset of the available data. If we require informational equivalence, in the sense that, for distributions A and B, a transformation A->B->A delivers the exact same data, such profiled data should be modelled as two datasets. Maybe the reference to the profile should then be at the level of Dataset, not on Distribution?

@rob-metalinkage
Contributor

Each distribution can have multiple values of dct:conformsTo to indicate profiles the distribution conforms to. Different distributions can conform to different sets of profiles.

The range of conformsTo is dct:Standard

The profiles ontology subclasses dct:Standard - so the mechanisms are available for the declarations required.
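As a rough illustration of the declarations described above, the following sketch records two distributions that share one access URL but conform to different sets of profiles. It is only one possible modelling, not a resolution of the question raised in this issue. All URIs are hypothetical placeholders; `dcat:accessURL` and `dct:conformsTo` are the only vocabulary terms taken from the discussion, and plain tuples stand in for an RDF library.

```python
# Hypothetical URIs throughout; only dcat:accessURL and dct:conformsTo
# come from the discussion. Tuples stand in for RDF triples.
DISTRIBUTIONS = {
    "https://example.org/buildings/dist/a": {
        "dcat:accessURL": ["https://example.org/api/buildings"],
        "dct:conformsTo": ["https://example.org/profiles/schema-a"],
    },
    "https://example.org/buildings/dist/b": {
        "dcat:accessURL": ["https://example.org/api/buildings"],
        # A distribution may declare conformance to several profiles.
        "dct:conformsTo": [
            "https://example.org/profiles/schema-b",
            "https://example.org/profiles/schema-b-extended",
        ],
    },
}


def to_triples(distributions):
    """Flatten the mapping into (subject, predicate, object) tuples."""
    triples = []
    for subject, properties in distributions.items():
        triples.append((subject, "rdf:type", "dcat:Distribution"))
        for predicate, values in properties.items():
            for value in values:
                triples.append((subject, predicate, value))
    return triples


def distributions_conforming_to(triples, profile):
    """Return the subjects that declare conformance to the given profile."""
    return sorted(
        s for s, p, o in triples if p == "dct:conformsTo" and o == profile
    )


triples = to_triples(DISTRIBUTIONS)
matches = distributions_conforming_to(
    triples, "https://example.org/profiles/schema-b"
)
```

With this shape, a client that knows which profile it wants can select a distribution by profile even though both distributions point at the same endpoint.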

The issue of enforcing informational equivalence seems fraught, however: it's hard to imagine any distribution provided by a service that doesn't involve subsetting, calculations, additional links or lossy transformations, e.g. flattening objects to CSV with "magic" needed to relate columns together.

e.g. cost="USD 23.44" => "cost" = 23.44, "currency"="USD"
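A minimal sketch of the flattening example above, assuming the source value is always a currency code, a single space, and a decimal amount. The function names are hypothetical. It shows that the round trip only works when the consumer knows the two columns belong together, and fails once one of them is dropped.

```python
from decimal import Decimal


def split_cost(value):
    """Split 'USD 23.44' into separate cost and currency columns."""
    currency, amount = value.split(" ", 1)
    # Decimal keeps the amount exactly as written, unlike float.
    return {"cost": Decimal(amount), "currency": currency}


def join_cost(row):
    """Rebuild the original string; requires both columns to be present."""
    return f"{row['currency']} {row['cost']}"


original = "USD 23.44"
row = split_cost(original)
round_trip = join_cost(row)

# A subset that keeps only the amount can no longer be converted back.
lossy_row = {"cost": row["cost"]}
```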

I haven't seen a cogent argument for requiring informational equivalence in distributions yet...

@makxdekkers
Contributor

@rob-metalinkage I do not understand your statement that you haven't seen a 'cogent argument'. Is it that you haven't seen any arguments, or that you think that arguments made are not cogent?
Let's try this argument: Suppose there is a dataset that says it contains the data for the budget for the year 2019. As a user, I think it would be reasonable to expect that every one of the distributions contains all the data for that year and that they differ only in format. I would not expect to have to browse through the descriptions of the distributions to find out which of them contain all the data rather than some subset of it. Requiring 'informational equivalence' among distributions would provide this guarantee. Without it, you really do not know beforehand what the distributions contain.
My earlier point was (https://www.w3.org/TR/dcat-ucr/#ID34 and https://www.w3.org/TR/dcat-ucr/#RDIDF) that there needs to be a clear definition, so I would love to hear what your opinion is on this issue.
Please note that for distributions we're not talking about data services; distributions essentially give access to static files.

@pwin
Contributor

pwin commented Nov 6, 2018

@makxdekkers would https://w3c.github.io/dxwg/ucr/#ID50 come into play here? ... a flag could indicate whether the distribution was the result of a lossy transformation of the dataset

@makxdekkers
Contributor

@pwin Indeed. The important part in the use case is "these events do not reduce the information content" which to me sounds a lot like "informationally equivalent".

@davebrowning
Contributor Author

This will be addressed with #317

@rob-metalinkage
Contributor

Noticed this is being worked on, and that I missed a question earlier...

I believe the arguments for "information equivalence" have not sufficiently defined this term and the competency questions for the DCAT ontology related to it.

The example about whether a dataset contains data for the year 2019 is more clear-cut than, for example, the rounding off of microseconds in dates in a different encoding. So the interpretation of each and every perspective seems to hang on the precise nature of "informational equivalence". I do not believe we have grounded this term well enough in Use Cases or derived Requirements.
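The microsecond case can be made concrete with a small sketch. It assumes one encoding stores timestamps at second precision; note that Python's `isoformat(timespec="seconds")` truncates rather than rounds, which is enough to show the effect. The function names and the sample timestamp are illustrative only.

```python
from datetime import datetime


def encode_seconds(ts):
    """Encode at second precision, dropping any microseconds."""
    return ts.isoformat(timespec="seconds")


def decode(text):
    """Parse the encoded timestamp back into a datetime."""
    return datetime.fromisoformat(text)


original = datetime(2019, 1, 1, 12, 0, 0, 123456)
encoded = encode_seconds(original)
restored = decode(encoded)

# Strict A -> B -> A equivalence fails, although many users would
# still treat both encodings as carrying the same data.
is_equivalent = restored == original
lost = original - restored
```

Whether a loss of this size matters depends on the use case, which is the difficulty with a single universal definition of equivalence.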

@dr-shorthair
Contributor

@rob-metalinkage the consensus is that anything short of losslessly-convertible would be use-case specific.

And since we are reluctant (unwilling) to go with the former (and we think it would be a hard sell in the market), we will have to just come up with some wording to hedge the issue.

@dr-shorthair
Contributor

Ultimately it is the prerogative of the provider or cataloguer or indexer to make a judgement about how to factor the descriptions between Datasets and Distributions. Different applications and different communities will have different needs and different practices and I do not think we can provide universal guidelines. The big NOTE in https://www.w3.org/TR/vocab-dcat-2/#Class:Distribution partially speaks to this, but maybe could be improved further. I'll have a go.

@davebrowning davebrowning added the due for closing Issue that is going to be closed if there are no objection within 6 days label Jul 24, 2019
@davebrowning davebrowning removed the due for closing Issue that is going to be closed if there are no objection within 6 days label Sep 5, 2019
6 participants