Linking a checksum to DataDownload #66
Use of schema:identifier was mentioned as a potential solution here: https://github.com/schemaorg/schemaorg/issues/1831 |
I fully support adding checksums. We should follow an established format to encode the hash and the algorithm. There are a number of possibilities, but I like the hash URI format the best because of its readability, and formatting as a URI:
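For example, a SHA-256 hash URI has the form hash://sha256/&lt;hex digest&gt;; reusing the digest that appears in the dataset examples below:

```
hash://sha256/39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51
```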
Other possibilities are described more thoroughly at https://hash-archive.org/:
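One of those is the Named Information (ni) URI scheme; roughly, its general form is:

```
ni:///sha-256;<base64url-encoded digest>
```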
The Named Info syntax is described in RFC 6920, but it has the disadvantage of being harder to parse and uses a base64url (rather than hex) representation of the hash value. @cboettig You've compared the pros and cons of different hash syntaxes. Do you have a writeup somewhere associated with your work on https://github.com/cboettig/contentid? |
A write-up is a good idea. So far it's only in the issue thread: cboettig/contentid#1 |
Here's a proposed checksum example using the Hash URI format that builds upon the existing identifier guidance:

{
"@context": {
"@vocab": "https://schema.org/"
},
"@type": "Dataset",
"@id": "https://dataone.org/datasets/doi%3A10.18739%2FA2NK36607",
"sameAs": "https://doi.org/10.18739/A2NK36607",
"name": "Conductivity-Temperature-Depth (CTD) data along DBO5 (Distributed Biological Observatory - Barrow Canyon), from the 2009 Circulation, Cross-shelf Exchange, Sea Ice, and Marine Mammal Habitat on the Alaskan Beaufort Sea Shelf cruise on USCGC Healy (HLY0904)",
"distribution": {
"@type": "DataDownload",
"@id": "https://dataone.org/datasets/urn%3Euuid%3E2646d817-9897-4875-9429-9c196be5c2ae",
"identifier": [
{
"@type": "PropertyValue",
"@id": "urn:uuid:2646d817-9897-4875-9429-9c196be5c2ae",
"propertyID": "https://rfc-editor.org/rfc/rfc4122.txt",
"value": "urn:uuid:2646d817-9897-4875-9429-9c196be5c2ae"
},
{
"@type": "PropertyValue",
"@id": "hash://sha256/39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51",
"propertyID": "http://id.loc.gov/vocabulary/preservation/cryptographicHashFunctions/sha256",
"value": "hash://sha256/39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51"
}
]
}
}

I think this works, but it would be good for our guidance to specify which |
I'm all for embedding the checksum in the identifier, but it is surprising to me that DCAT2 and Schema.org don't have a more native concept to express a checksum just as a checksum.
I'm not sure if it wouldn't be cleaner to use |
@cboettig said:
For the records: the current Working Draft of DCAT 3 does include the possibility of specifying the checksum of a distribution by following the DCAT-AP approach - see https://www.w3.org/TR/vocab-dcat-3/#Class:Checksum (relevant issue: w3c/dxwg#1287) |
@cboettig I did consider using
The benefits would be
So, given the direction of DCAT 3, here's an alternative proposal that would be very clear about the semantics of the checksum, and avoids blank nodes by using the hash URI serialization as the "@id":

{
"@context": {
"@vocab": "https://schema.org/",
"spdx": "http://spdx.org/rdf/terms#"
},
"@type": "Dataset",
"@id": "https://dataone.org/datasets/doi%3A10.18739%2FA2NK36607",
"sameAs": "https://doi.org/10.18739/A2NK36607",
"name": "Conductivity-Temperature-Depth (CTD) data along DBO5 (Distributed Biological Observatory - Barrow Canyon), from the 2009 Circulation, Cross-shelf Exchange, Sea Ice, and Marine Mammal Habitat on the Alaskan Beaufort Sea Shelf cruise on USCGC Healy (HLY0904)",
"distribution": {
"@type": "DataDownload",
"@id": "https://dataone.org/datasets/urn%3Euuid%3E2646d817-9897-4875-9429-9c196be5c2ae",
"identifier": [
{
"@type": "PropertyValue",
"@id": "urn:uuid:2646d817-9897-4875-9429-9c196be5c2ae",
"propertyID": "https://rfc-editor.org/rfc/rfc4122.txt",
"value": "urn:uuid:2646d817-9897-4875-9429-9c196be5c2ae"
}
],
"spdx:Checksum": {
"@id": "hash://sha256/39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51",
"spdx:checksumValue": "39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51",
"spdx:checksumAlgorithm": {
"@id": "spdx:checksumAlgorithm_sha256"
}
}
}
}

The triples that are related to the checksum would then be:
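Roughly, in Turtle:

```turtle
@prefix spdx: <http://spdx.org/rdf/terms#> .

# spdx:Checksum appears as a property here, mirroring the JSON-LD above
<https://dataone.org/datasets/urn%3Auuid%3A2646d817-9897-4875-9429-9c196be5c2ae>
    spdx:Checksum <hash://sha256/39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51> .

<hash://sha256/39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51>
    spdx:checksumValue "39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51" ;
    spdx:checksumAlgorithm spdx:checksumAlgorithm_sha256 .
```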
So, let's call our two options: Option 1, checksum as a PropertyValue in the identifier list; and Option 2, checksum via the spdx vocabulary.
Right now, I think I like Option 2 better. Thoughts? |
To me it makes sense to use |
I agree with @datadavev that there are really two different things being discussed here, which actually makes for 3 non-empty cases:
I think these different roles would go in different blocks. I think something like:
A few notes:
(Also sounds like they are saying there's no need to use
Based on these thoughts, I would have done something like:
Yeah, doing |
@cboettig said:
Yes, |
Thanks, makes sense to include the type, and my prior example did improperly capitalize the property. Here's a revised full example with the type included and the property correctly capitalized:

{
"@context": {
"@vocab": "https://schema.org/",
"spdx": "http://spdx.org/rdf/terms#"
},
"@type": "Dataset",
"@id": "https://dataone.org/datasets/doi%3A10.18739%2FA2NK36607",
"sameAs": "https://doi.org/10.18739/A2NK36607",
"name": "Conductivity-Temperature-Depth (CTD) data along DBO5 (Distributed Biological Observatory - Barrow Canyon), from the 2009 Circulation, Cross-shelf Exchange, Sea Ice, and Marine Mammal Habitat on the Alaskan Beaufort Sea Shelf cruise on USCGC Healy (HLY0904)",
"distribution": {
"@type": "DataDownload",
"@id": "https://dataone.org/datasets/urn%3Euuid%3E2646d817-9897-4875-9429-9c196be5c2ae",
"identifier": "urn:uuid:2646d817-9897-4875-9429-9c196be5c2ae",
"spdx:checksum": {
"@type": "spdx:Checksum",
"@id": "hash://sha256/39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51",
"spdx:checksumValue": "39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51",
"spdx:checksumAlgorithm": {
"@id": "spdx:checksumAlgorithm_sha256"
}
}
}
} |
I am in favour of the spdx:Checksum approach. That seems most consistent to me. I am not sure I understand why we need the type though (I understand it from the modelling perspective, but not as a simple approach for data providers to expose their data, which is why many are looking into SO). Concerning using it as an identifier as well, I agree with @datadavev's perspective. The problem is creating guidance that helps data providers. |
@steingod for me a primary interest in this is to aid potential disambiguation of data being exposed by multiple parties. So I view this as a vital element for data. I'm already using this quite a bit and the latest proposed guidance seems fine on quick inspection. In most cases for the end user the goal is to easily get the data. This is trivial to JSON-LD Frame out and no more complex in SPARQL space than any guidance (which can be taken many ways) ;) |
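For instance, a minimal SPARQL sketch against the spdx-based markup above (assuming the JSON-LD has been expanded into a triplestore) might look like:

```sparql
PREFIX schema: <https://schema.org/>
PREFIX spdx:   <http://spdx.org/rdf/terms#>

# Retrieve each download together with its checksum value and algorithm
SELECT ?download ?checksumValue ?algorithm
WHERE {
  ?dataset  a schema:Dataset ;
            schema:distribution ?download .
  ?download spdx:checksum ?checksum .
  ?checksum spdx:checksumValue ?checksumValue ;
            spdx:checksumAlgorithm ?algorithm .
}
```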
Definitely agree with @fils on this!
That's also why I like the proposal of identifiers based on the checksums and independent of the party exposing the data. Certainly we can always extract that information if it's in the checksum field like in these examples, but it's also easily lost from there -- e.g. like @fils says, there's the temptation to just JSON-LD frame it out and have a pure-schema.org representation, or any of the other representations that don't have a native checksum attribute. So I still support @mbjones' original suggestion that it would be nice to also normalize including this in an identifier entry.

If I had my druthers it would be the 'canonical' identifier, or at least one of the listed identifiers.

A related question is whether to encourage providing more than one checksum. If so, I think it would look something like:
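For instance (a sketch following the identifier-based pattern from Option 1; the MD5 digest here is just a placeholder, and the md5 vocabulary URI follows the pattern of the sha256 one used earlier):

```json
"identifier": [
  {
    "@type": "PropertyValue",
    "@id": "hash://sha256/39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51",
    "propertyID": "http://id.loc.gov/vocabulary/preservation/cryptographicHashFunctions/sha256",
    "value": "hash://sha256/39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51"
  },
  {
    "@type": "PropertyValue",
    "@id": "hash://md5/00000000000000000000000000000000",
    "propertyID": "http://id.loc.gov/vocabulary/preservation/cryptographicHashFunctions/md5",
    "value": "hash://md5/00000000000000000000000000000000"
  }
]
```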
|
@fils I have no problem understanding the wish to avoid ambiguity and establish a consistent approach. My main concern (not related to this specific issue) is that in some communities SO is expected/claimed to be a lightweight alternative to GCMD DIF or ISO19115 and APIs. Which I do not think it is. As it evolves to make it useful for proper filtering in search engines it is becoming equally complex. Sorry for the small detour on the specific issue, it is a different discussion. :-) |
@steingod sorry... I contributed to that detour too. ;) In our defense... the internet thrives on such detours. So let's continue... :) I completely agree with you on this point and it is a major concern of mine. I think the minimum and full examples in the examples directory help with this, and perhaps we should make sure we don't lose focus on making these smaller, more basic examples along the way -- like Google with their required and recommended properties. Breadth and momentum are still vital to this community. It would also be nice to pay attention to those recommendations that could impact search ranking. If we can separate guidance that is mostly for result decoration from those properties that could affect relevancy ranking, that might be useful. A nice SHACL or framing plus Python data analysis use case, perhaps. There are different clients, of course: a search engine result for discovery is different from a query supporting report generation on FAIR data alignment, for example. More difficult are those types and properties that relate to connections (edges), as those can impact a more interesting category of query. For example, semantic page rank values are one thing I'm interested in that might scope into that type. |
@fils thanks, seems we are pretty aligned then :-) My hope is that SO can be useful for many of the small research stations and communities that have important data but lack the ability to properly announce them today. I very much support your statement on guidance for ranking and decoration, I think that is crucial to achieve this goal. |
Sorry for dragging this further off topic, but I completely agree with @fils and @steingod on this. Having an agreed set of minimal Science on Schema attributes for discovery is vital for those "data curation/preservation/management" resource-poor organisations who need to publish their metadata to allow their data to be found. Hand in hand with this is good, simple-to-use tooling to allow that publication to happen and to validate the metadata. The checksum discussion is absolutely important for trust in, and validation of, data file downloads, but it is not going to be where every data provider is at this point in time. |
In the spirit of the tangent, how do folks feel about the other part of @mbjones proposal, which would optionally put the checksum in the more widely recognized identifier field as well?
(Of course this could also use the ...)

@adamml notes that ...
But as @fils noted, I think the content hash is equally important for the use case when the same data file may be found across multiple providers. The existing identifiers do not serve this purpose well, since the providers tend to mint their own identifiers. For example, this famous ice core CO2 data, https://doi.org/10.3334/CDIAC/ATG.009, which I use in my undergraduate teaching, can be found at both https://cn.dataone.org/cn/v2/resolve/ess-dive-0462dff585f94f8-20180716T160600643874 and https://knb.ecoinformatics.org/knb/d1/mn/v2/object/ess-dive-457358fdc81d3a5-20180726T203952542, where it is known by a different identifier at each location. |
On the minimalism front, I hear what folks are saying and agree with some aspects of it, but I think there is room and need for guidance supporting different discovery (and other) use cases. As Carl lays out, there are important discovery use cases for checksums, and so simple guidance for "if you want to provide a checksum, do it like this..." will go a long way towards overcoming the ambiguity and multiplicity of approaches in schema.org. At no point are we saying that people must provide checksums -- we're simply trying to provide implementation guidance for how to do it in a simple, interoperable way for those who want to provide it. We can continue this discussion on minimalism, but I think someone should open a new issue on it -- it's not really the topic here for Checksum, and the minimalism discussion applies to many other fields in the already released SOSO guidance docs. In addition, folks that would like to see the SOSO effort change direction and be more minimalist might consider joining our twice-monthly calls so we can discuss that strategic direction in more detail. To summarize the Checksum discussion thus far, and try to reach agreement on it, we have proposed two options: 1) to include the checksum as an identifier (a PropertyValue in the identifier list), and 2) to express it with spdx:checksum on the DataDownload. |
I added PR #171 that implements the checksum approach we discussed during a previous call. @ashepherd this closely follows the discussed solution, so if the wording and example are clear, it should be ready to merge to develop. I validated the example in JSON-LD playground. |
The proposed text of the Checksum guidance can be read as formatted markdown on the branch here: https://github.com/ESIPFed/science-on-schema.org/blob/feature_66_checksum/guides/Dataset.md#checksum |
Can we use the schema:identifier property?
URN schema to indicate checksum?
Proposal:

- use a URN-style value to carry the checksum (e.g. md5:9e85e71b33f71ac738e4793ff142c464)
- schema:propertyID to specify the type of checksum as text
- schema:additionalType to specify the type of checksum using controlled vocabularies
- schema:value to specify the value of the checksum

Examples:
MD5:
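A sketch following the proposal bullets (the value is the md5 digest from the proposal above; the loc.gov URI is an assumed controlled-vocabulary choice):

```json
{
  "@type": "PropertyValue",
  "propertyID": "MD5",
  "additionalType": "http://id.loc.gov/vocabulary/preservation/cryptographicHashFunctions/md5",
  "value": "9e85e71b33f71ac738e4793ff142c464"
}
```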
SHA256:
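A corresponding sketch for SHA-256 (the digest is reused from the dataset examples earlier in this thread):

```json
{
  "@type": "PropertyValue",
  "propertyID": "SHA-256",
  "additionalType": "http://id.loc.gov/vocabulary/preservation/cryptographicHashFunctions/sha256",
  "value": "39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51"
}
```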