
Linking a checksum to DataDownload #66

Closed
ashepherd opened this issue Dec 19, 2019 · 23 comments
Labels: enhancement (New feature or request), Leading Practice (Recommended practices for repository implementation), Update Documentation (updates to the guidance docs)

@ashepherd
Member

Can we use the schema:identifier property, with a URN scheme to indicate the checksum?

Proposal:

  1. Use schema:PropertyValue
  2. Use schema:identifier to specify the URN of the checksum (e.g. md5:9e85e71b33f71ac738e4793ff142c464)
  3. Use schema:propertyID to specify the type of checksum as text
  4. Use schema:additionalType to specify the type of checksum using controlled vocabularies
  5. Use schema:value to specify the value of the checksum

Examples:

MD5:

{
  "@type": "DataDownload",
  "identifier": [
    ...DOI and other identifiers go here...,
    {
      "@type": "PropertyValue",
      "additionalType": ["http://www.wikidata.org/entity/Q185235", "http://id.loc.gov/vocabulary/preservation/cryptographicHashFunctions/md5"],
      "identifier": "md5:9e85e71b33f71ac738e4793ff142c464",
      "propertyID": "MD5",
      "value": "9e85e71b33f71ac738e4793ff142c464"
    }
  ]
}

SHA256:

{
  "@type": "DataDownload",
  "identifier": [
    ...DOI and other identifiers go here...,
    {
      "@type": "PropertyValue",
      "additionalType": "http://id.loc.gov/vocabulary/preservation/cryptographicHashFunctions/sha256",
      "identifier": "sha256:8808ACDC7FB7DC2F941EBACC7906B32D2676044494A740C21F6E0DC20893A2A6",
      "propertyID": "SHA256",
      "value": "8808ACDC7FB7DC2F941EBACC7906B32D2676044494A740C21F6E0DC20893A2A6"
    }
  ]
}
@ashepherd ashepherd added Leading Practice Recommended practices for repository implementation enhancement New feature or request help wanted Extra attention is needed labels Dec 19, 2019
@ashepherd
Member Author

Use of schema:identifier is mentioned as a potential solution here: https://github.com/schemaorg/schemaorg/issues/1831

@stale

stale bot commented Mar 17, 2020

This issue has been automatically marked as stale because it has not had recent activity.

@stale stale bot added the stale label Mar 17, 2020
@ashepherd ashepherd removed the stale label Apr 2, 2020
@mbjones
Collaborator

mbjones commented Jan 28, 2021

I fully support adding checksums. We should follow an established format to encode the hash and the algorithm. There are a number of possibilities, but I like the hash URI format best because of its readability and its formatting as a URI:

  • Hash URI: hash://sha256/030d8c2d6b7163a482865716958ca03806dfde99a309c927e56aa9962afbb95d

Other possibilities are described more thoroughly at https://hash-archive.org/:

  • Web URL: https://torrents.linuxmint.com/torrents/linuxmint-18-cinnamon-64bit.iso.torrent
  • Named Info: ni:///sha256;Aw2MLWtxY6SChlcWlYygOAbf3pmjCckn5Wqplir7uV0
  • Subresource Integrity: sha256-Aw2MLWtxY6SChlcWlYygOAbf3pmjCckn5Wqplir7uV0=
  • SSB: &Aw2MLWtxY6SChlcWlYygOAbf3pmjCckn5Wqplir7uV0=.sha256
  • MultiHash: QmNYZuyWz3U71Dwv7phEgh4WcWcQwdvpBWd99MkiaBoyBA
  • Magnet URI: magnet:?xt=urn:sha256:030d8c2d6b7163a482865716958ca03806dfde99a309c927e56aa9962afbb95d

The Named Info syntax is described in RFC 6920, but it has the disadvantage of being harder to parse, and it uses a base64url rather than hex representation of the hash value.
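The hash URI, Named Info, and Subresource Integrity forms above all carry the same digest bytes, so conversion between them is mechanical. As an illustration only (not part of any proposal here), a Python sketch deriving the three strings from a hex digest:

```python
import base64
import binascii

def hash_representations(hex_digest: str, algo: str = "sha256") -> dict:
    """Render one hex digest in several of the syntaxes listed above.
    The format strings are mechanical sketches, not normative."""
    raw = binascii.unhexlify(hex_digest)
    b64 = base64.b64encode(raw).decode()                         # standard base64, padded (SRI)
    b64url = base64.urlsafe_b64encode(raw).decode().rstrip("=")  # base64url, unpadded (ni:)
    return {
        "hash_uri": f"hash://{algo}/{hex_digest}",
        "named_info": f"ni:///{algo};{b64url}",
        "sri": f"{algo}-{b64}",
    }

reps = hash_representations(
    "030d8c2d6b7163a482865716958ca03806dfde99a309c927e56aa9962afbb95d"
)
```

Run against the sha256 digest above, this reproduces the ni:/// and sha256-… strings in the list.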

@cboettig You've compared the pros and cons of different hash syntaxes. Do you have a writeup somewhere associated with your work on https://github.com/cboettig/contentid ?

@mbjones mbjones added this to the v1.3 (possibly 2.0) milestone Jan 28, 2021
@mbjones mbjones self-assigned this Jan 28, 2021
@cboettig

A write-up is a good idea. So far it's only in the issue thread: cboettig/contentid#1

@mbjones
Collaborator

mbjones commented May 11, 2021

Here's a proposed checksum example using the Hash URI format that builds on the identifier field as proposed above and in https://github.com/schemaorg/schemaorg/issues/1831

{
    "@context": {
        "@vocab": "https://schema.org/"
    },
    "@type": "Dataset",
    "@id": "https://dataone.org/datasets/doi%3A10.18739%2FA2NK36607",
    "sameAs": "https://doi.org/10.18739/A2NK36607",
    "name": "Conductivity-Temperature-Depth (CTD) data along DBO5 (Distributed Biological Observatory - Barrow Canyon), from the 2009 Circulation, Cross-shelf Exchange, Sea Ice, and Marine Mammal Habitat on the Alaskan Beaufort Sea Shelf cruise on USCGC Healy (HLY0904)",
    "distribution": {
      "@type": "DataDownload",
      "@id": "https://dataone.org/datasets/urn%3Euuid%3E2646d817-9897-4875-9429-9c196be5c2ae",
      "identifier": [
        {
          "@type": "PropertyValue",
          "@id": "urn:uuid:2646d817-9897-4875-9429-9c196be5c2ae",
          "propertyID": "https://rfc-editor.org/rfc/rfc4122.txt",
          "value": "urn:uuid:2646d817-9897-4875-9429-9c196be5c2ae"
        },
        {
          "@type": "PropertyValue",
          "@id": "hash://sha256/39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51",
          "propertyID": "http://id.loc.gov/vocabulary/preservation/cryptographicHashFunctions/sha256",
          "value": "hash://sha256/39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51"
        }
      ]
    }
}

I think this works, but it would be good for our guidance to specify which propertyID URIs represent checksum algorithms, so that consumers could look for those ids for indexing and other uses. We could recommend a list of common checksum algorithm propertyID values. Following @ashepherd 's lead, I've used the Library of Congress vocabulary URIs above. See https://id.loc.gov/vocabulary/preservation/cryptographicHashFunctions.html for a list of named algorithms. The problem with this is that the actual syntax of the value used follows the Hash URI specification, which isn't actually the SHA256 value per se. But I'm not sure what value to use for propertyID there.
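As a sketch of what such an indexing consumer might do; the set of recognized propertyID URIs here is illustrative, not an agreed recommendation:

```python
# Illustrative only: propertyID URIs a consumer might treat as checksum markers
CHECKSUM_PROPERTY_IDS = {
    "http://id.loc.gov/vocabulary/preservation/cryptographicHashFunctions/md5",
    "http://id.loc.gov/vocabulary/preservation/cryptographicHashFunctions/sha256",
}

def checksum_identifiers(data_download: dict) -> list:
    """Return the PropertyValue entries in a DataDownload's identifier list
    whose propertyID marks them as checksums."""
    ids = data_download.get("identifier", [])
    if not isinstance(ids, list):
        ids = [ids]
    return [i for i in ids
            if isinstance(i, dict) and i.get("propertyID") in CHECKSUM_PROPERTY_IDS]
```

A consumer would run this over each DataDownload node after parsing the JSON-LD, indexing whatever entries come back.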

@mbjones mbjones added Update Documentation updates to the guidance docs and removed help wanted Extra attention is needed labels May 11, 2021
@cboettig

I'm all for embedding the checksum in the identifier, but it is surprising to me that DCAT2 and Schema.org don't have a more native concept to express a checksum just as a checksum.

I'm not sure it wouldn't be cleaner to use spdx:ChecksumAlgorithm and spdx:Checksum as the property/value pair for the raw checksum, and separately list the hash URI as an associated identifier. (I see the appeal of using ni:/// and linking to RFC 6920, but as noted above the ni:/// syntax is somewhat cumbersome from a developer perspective: having a base64-encoded string with optional and non-optional rules about which characters should then be percent-encoded is a bit tricky, and means that more than one valid string can be used for the same identifier.) Having an RFC specification for the hash URI spec would be a nice resolution to all of this.

@andrea-perego

@cboettig said:

[...]

I'm all for embedding the checksum in the identifier, but it is surprising to me that DCAT2 and Schema.org don't have a more native concept to express a checksum just as a checksum.

For the record: the current Working Draft of DCAT 3 does include the possibility of specifying the checksum of a distribution by following the DCAT-AP approach - see https://www.w3.org/TR/vocab-dcat-3/#Class:Checksum (relevant issue: w3c/dxwg#1287)

@mbjones
Collaborator

mbjones commented May 11, 2021

@cboettig I did consider using spdx:ChecksumAlgorithm and spdx:Checksum, and spdx:checksumValue, but I thought there were some issues:

  • it separates the checksum algorithm from the value representation, so complicates parsing and introduces blank nodes unless we are careful
  • the spdx class definitions have a number of domain and range entailments in SPDX (like to spdx:File) and are defined specifically for software
  • it puts yet another term in our vocabulary outside of SO. But we'd done that several times already...
  • spdx:algorithm has a defined range that only includes md5, sha1, and sha256; probably an oversight that could be fixed.

The benefits would be

  • easier to recognize it as a checksum because of the dedicated class
    • doesn't conflate identifier and checksum semantics
  • consistency with the direction of DCAT3 as @andrea-perego points out above

So, given the direction of DCAT3, here's an alternative proposal that would be very clear about the semantics of the checksum, and avoids blank nodes by using the hash URI serialization as the "@id":

{
    "@context": {
        "@vocab": "https://schema.org/",
      "spdx": "http://spdx.org/rdf/terms#"
    },
    "@type": "Dataset",
    "@id": "https://dataone.org/datasets/doi%3A10.18739%2FA2NK36607",
    "sameAs": "https://doi.org/10.18739/A2NK36607",
    "name": "Conductivity-Temperature-Depth (CTD) data along DBO5 (Distributed Biological Observatory - Barrow Canyon), from the 2009 Circulation, Cross-shelf Exchange, Sea Ice, and Marine Mammal Habitat on the Alaskan Beaufort Sea Shelf cruise on USCGC Healy (HLY0904)",
    "distribution": {
      "@type": "DataDownload",
      "@id": "https://dataone.org/datasets/urn%3Euuid%3E2646d817-9897-4875-9429-9c196be5c2ae",
      "identifier": [
        {
          "@type": "PropertyValue",
          "@id": "urn:uuid:2646d817-9897-4875-9429-9c196be5c2ae",
          "propertyID": "https://rfc-editor.org/rfc/rfc4122.txt",
          "value": "urn:uuid:2646d817-9897-4875-9429-9c196be5c2ae"
        }
      ],
      "spdx:Checksum": {
        "@id": "hash://sha256/39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51",
        "spdx:checksumValue": "39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51",
        "spdx:checksumAlgorithm": { 
          "@id": "spdx:checksumAlgorithm_sha256" 
        }
      }
    }
}

The triples that are related to the checksum would then be:

<https://dataone.org/datasets/urn%3Euuid%3E2646d817-9897-4875-9429-9c196be5c2ae> <http://spdx.org/rdf/terms#Checksum> <hash://sha256/39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51> .
<hash://sha256/39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51> <http://spdx.org/rdf/terms#checksumAlgorithm> <http://spdx.org/rdf/terms#checksumAlgorithm_sha256> .
<hash://sha256/39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51> <http://spdx.org/rdf/terms#checksumValue> "39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51" .
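Given such triples, a consumer could verify a completed download by recomputing the digest; a minimal sketch (the function name is illustrative):

```python
import hashlib

def checksum_matches(data: bytes, expected_hex: str, algo: str = "sha256") -> bool:
    """Recompute the digest of the downloaded bytes and compare it to
    the spdx:checksumValue literal (case-insensitively)."""
    return hashlib.new(algo, data).hexdigest() == expected_hex.lower()
```

For large files, update the hash object in chunks rather than holding the whole download in memory.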

So, let's call our two options:

  1. Option 1: Place checksum as schema:identifier
  2. Option 2. Place checksum as spdx:Checksum

Right now, I think I like Option 2 better. Thoughts?

@datadavev
Collaborator

To me it makes sense to use spdx:Checksum when asserting the value of the checksum for the resource. identifier with a checksum value can also be associated with the resource, but perhaps only when it is recognized as an identifier by the content creator / producer / publisher.

@cboettig

I agree with @datadavev that there are really two different things being discussed here, which actually makes for 3 non-empty cases:

  1. Reporting one or more checksums, but not using checksum as an identifier
  2. using checksum as an identifier, but not explicitly reporting an spdx:Checksum,
  3. doing both

I think these different roles would go in different blocks, something like:

{
    "@context": {
        "@vocab": "https://schema.org/",
      "spdx": "http://spdx.org/rdf/terms#"
    },
    "@type": "Dataset",
    "@id": "https://dataone.org/datasets/doi%3A10.18739%2FA2NK36607",
    "sameAs": "https://doi.org/10.18739/A2NK36607",
    "name": "Conductivity-Temperature-Depth (CTD) data along DBO5 (Distributed Biological Observatory - Barrow Canyon), from the 2009 Circulation, Cross-shelf Exchange, Sea Ice, and Marine Mammal Habitat on the Alaskan Beaufort Sea Shelf cruise on USCGC Healy (HLY0904)",
    "distribution": {
      "@type": "DataDownload",
      "@id": "https://dataone.org/datasets/urn%3Euuid%3E2646d817-9897-4875-9429-9c196be5c2ae",
      "identifier": [
        {
          "@type": "PropertyValue",
          "@id": "urn:uuid:2646d817-9897-4875-9429-9c196be5c2ae",
          "propertyID": "https://rfc-editor.org/rfc/rfc4122.txt",
          "value": "urn:uuid:2646d817-9897-4875-9429-9c196be5c2ae"
        },
        {
        "@type": "PropertyValue",
        "@id": "hash://sha256/39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51",
        "propertyID": "https://github.com/hash-uri/hash-uri/tree/master/cli",
        "value": "hash://sha256/39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51"
        }
      ],
      "spdx:Checksum": {
        "spdx:checksumValue": "39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51",
        "spdx:checksumAlgorithm": { 
          "@id": "spdx:checksumAlgorithm_sha256" 
        }
      }
    }
}

A few notes:

  • Obviously https://github.com/hash-uri/hash-uri/tree/master/cli is not a particularly satisfying propertyID. Schema.org defines three options for this field, maybe one of the other routes is better?

A commonly used identifier for the characteristic represented by the property, e.g. a manufacturer or a standard code for a property. propertyID can be (1) a prefixed string, mainly meant to be used with standards for product properties; (2) a site-specific, non-prefixed string (e.g. the primary key of the property or the vendor-specific id of the property), or (3) a URL indicating the type of the property, either pointing to an external vocabulary, or a Web resource that describes the property (e.g. a glossary entry). Standards bodies should promote a standard prefix for the identifiers of properties from their standards

  • spdx:Checksum probably ought to have an explicit @type. (It could be done in the context, but I think the validator might complain about the above.) I'm not sure what the @type is in SPDX; SPDX doesn't seem to use the camelCase convention? I might have thought the property was spdx:checksum and the type was spdx:Checksum.

  • In my version, spdx:Checksum is a blank node. That doesn't bother me; this seems like a reasonable use for a blank node. But if it needs a URI, I think the id should reflect that this is the checksum of the DataDownload it is nested in, e.g. something like: https://dataone.org/datasets/urn%3Euuid%3E2646d817-9897-4875-9429-9c196be5c2ae?sha256=39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51.

  • Maybe this is out of scope, but I'm actually rather confused that the @id of the DataDownload is https://dataone.org/datasets/urn%3Euuid%3E2646d817-9897-4875-9429-9c196be5c2ae. I think of @id as the 'canonical' identifier for the object, e.g. one of the two identifiers in the identifier list (i.e. I think of the schema:DataDownload as the equivalent of the dcat2:distribution, which is the actual file: the .csv file, whatever bytes have that hash). It makes sense to refer to such an object with a location-agnostic identifier like either the UUID or sha256sum in the identifier list, but I'm not sure that's really 'the same thing as' https://dataone.org/datasets/urn%3Euuid%3E2646d817-9897-4875-9429-9c196be5c2ae.

  • The schema.org discussion on the recommended use of identifier as a PropertyValue seems to suggest that the PropertyValue should always set a name field as well as a value field? (Also echoed in the PropertyValue def.)

  • Also I'm pretty sure the value ends up being typed as a string in this case and not a URI, e.g. see playground N-Quads. Would it make sense to use a structuredValue (feels totally overkill)? Sounds like this should be identifier instead based on the above-linked schema-org docs:

In this case, a PropertyValue pair ('name', 'identifier') pair can be used when a standard URI form of the identifier is unavailable. We do not currently have a recommended identifier scheme for identifier schemes, but in most cases there is a conventional short name for most identifier schemes (which should be used in lowercase form).

(Also sounds like they are saying there's no need to use PropertyValue when you already have a URI-formatted identifier...)

Based on these thoughts, I would have done something like:


{
    "@context": {
      "@vocab": "https://schema.org/",
      "spdx": "http://spdx.org/rdf/terms#",
      "identifier": {"@id": "identifier", "@type": "@id"}
    },
    "@type": "Dataset",
    "@id": "https://dataone.org/datasets/doi%3A10.18739%2FA2NK36607",
    "sameAs": "https://doi.org/10.18739/A2NK36607",
    "identifier": "https://doi.org/10.18739/A2NK36607",
    "name": "Conductivity-Temperature-Depth (CTD) data along DBO5 (Distributed Biological Observatory - Barrow Canyon), from the 2009 Circulation, Cross-shelf Exchange, Sea Ice, and Marine Mammal Habitat on the Alaskan Beaufort Sea Shelf cruise on USCGC Healy (HLY0904)",
    "distribution": {
      "@type": "DataDownload",
      "identifier": ["urn:uuid:2646d817-9897-4875-9429-9c196be5c2ae",
                         "hash://sha256/39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51"
      ],
      "spdx:Checksum": {
        "spdx:checksumValue": "39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51",
        "spdx:checksumAlgorithm": { 
          "@id": "spdx:checksumAlgorithm_sha256" 
        }
      }
    }
}

Yeah, doing "identifier": {"@id": "identifier", "@type": "@id"} is non-standard and maybe better done in-line; I'm not sure what is the best way to get identifiers to come out as actual URIs (i.e. when we transform to N-Quads / standard RDF). Google doesn't seem to care about that either, since its examples type identifiers as strings, in which case this casting could just be omitted.

@andrea-perego

@cboettig said:

  • spdx:Checksum probably ought to have an explicit @type. (It could be done in the context, but I think the validator might complain about the above.) I'm not sure what the @type is in SPDX; SPDX doesn't seem to use the camelCase convention? I might have thought the property was spdx:checksum and the type was spdx:Checksum.

Yes, spdx:checksum is the property and spdx:Checksum the class. The example should be revised as follows:

...
      "spdx:checksum": {
        "@type":"spdx:Checksum",
        "spdx:checksumValue": "39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51",
        "spdx:checksumAlgorithm": { 
          "@id": "spdx:checksumAlgorithm_sha256" 
        }
      }
...

@mbjones
Collaborator

mbjones commented May 12, 2021

Thanks, makes sense to include the type, and my prior example did improperly capitalize the property. Here's a revised full example with the type included and the property correctly cased:

{
    "@context": {
      "@vocab": "https://schema.org/",
      "spdx": "http://spdx.org/rdf/terms#"
    },
    "@type": "Dataset",
    "@id": "https://dataone.org/datasets/doi%3A10.18739%2FA2NK36607",
    "sameAs": "https://doi.org/10.18739/A2NK36607",
    "name": "Conductivity-Temperature-Depth (CTD) data along DBO5 (Distributed Biological Observatory - Barrow Canyon), from the 2009 Circulation, Cross-shelf Exchange, Sea Ice, and Marine Mammal Habitat on the Alaskan Beaufort Sea Shelf cruise on USCGC Healy (HLY0904)",
    "distribution": {
      "@type": "DataDownload",
      "@id": "https://dataone.org/datasets/urn%3Euuid%3E2646d817-9897-4875-9429-9c196be5c2ae",
      "identifier": "urn:uuid:2646d817-9897-4875-9429-9c196be5c2ae",
      "spdx:checksum": {
        "@type": "spdx:Checksum",
        "@id": "hash://sha256/39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51",
        "spdx:checksumValue": "39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51",
        "spdx:checksumAlgorithm": { 
          "@id": "spdx:checksumAlgorithm_sha256" 
        }
      }
    }
}

@steingod

I am in favour of the spdx:Checksum approach; that seems most consistent to me. I am not sure I understand why we need the type, though (I understand it from the modelling perspective, but not as part of a simple approach for data providers to expose their data, which is why many are looking into SO). Concerning using it as an identifier as well, I agree with @datadavev's perspective. The challenge is creating guidance that helps data providers.

@fils
Collaborator

fils commented May 13, 2021

@steingod for me a primary interest in this is to aid potential disambiguation of data being exposed by multiple parties. So I view this as a vital element for data. I'm already using this quite a bit and the latest proposed guidance seems fine on quick inspection.

In most cases for the end user the goal is to easily get the data. This is trivial to JSON-LD Frame out and no more complex in SPARQL space than any guidance (which can be taken many ways) ;)
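For what it's worth, pulling the checksum out of the compacted JSON doesn't even need RDF tooling; a sketch that assumes the spdx: prefix is spelled out verbatim, as in the examples above:

```python
import json

def find_checksums(node, found=None):
    """Collect (value, algorithm) pairs from spdx:checksum entries by walking
    the parsed JSON-LD; assumes the spdx: prefix appears verbatim."""
    if found is None:
        found = []
    if isinstance(node, dict):
        for key, val in node.items():
            if key == "spdx:checksum":
                for cs in val if isinstance(val, list) else [val]:
                    alg = cs.get("spdx:checksumAlgorithm")
                    if isinstance(alg, dict):       # handle {"@id": ...} or a plain string
                        alg = alg.get("@id")
                    found.append((cs.get("spdx:checksumValue"), alg))
            else:
                find_checksums(val, found)
    elif isinstance(node, list):
        for item in node:
            find_checksums(item, found)
    return found

doc = json.loads("""
{
  "@type": "Dataset",
  "distribution": {
    "@type": "DataDownload",
    "spdx:checksum": {
      "@type": "spdx:Checksum",
      "spdx:checksumValue": "39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51",
      "spdx:checksumAlgorithm": {"@id": "spdx:checksumAlgorithm_sha256"}
    }
  }
}
""")
```

A framing or SPARQL approach would be more robust across serializations, but this shows how little the spdx:checksum proposal asks of a consumer.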

@cboettig

Definitely agree with @fils on this!

for me a primary interest in this is to aid potential disambiguation of data being exposed by multiple parties. So I view this as a vital element for data. I'm already using this quite a bit and the latest proposed guidance seems fine on quick inspection.

That's also why I like the proposal of identifiers based on the checksums and independent of the party exposing the data. Certainly we can always extract that information if it's in the checksum field like in these examples, but it's also easily lost from there; e.g. like @fils says, there's the temptation to just JSON-LD frame it out and have a pure-schema.org representation, or any of the other representations that don't have a native checksum attribute. So I still support @mbjones' original suggestion that it would be nice to also normalize including this in an identifier field, which is (for better or worse) a much more widely implemented field.

If I had my druthers it would be the 'canonical' identifier or @id for any downloadable content object (schema:DataDownload), because "@id": "hash://sha256/39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51" is not subject to link rot and not provider-specific the way that "@id": "https://dataone.org/datasets/urn%3Euuid%3E2646d817-9897-4875-9429-9c196be5c2ae" is, but I also recognize maybe we need to walk before we run. Having checksums in the checksum field is at least a nice start.

A related question is whether to encourage providing more than one checksum. If so, I think it would look something like:

{
"@context": {
      "@vocab": "https://schema.org/",
      "spdx": "http://spdx.org/rdf/terms#",
      "spdx:checksumAlgorithm": {"@id": "spdx:checksumAlgorithm", "@type": "@id"}
    },
  "spdx:checksum": [
        {
        "@type": "spdx:Checksum",
        "spdx:checksumValue": "39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51",
        "spdx:checksumAlgorithm": "spdx:checksumAlgorithm_sha256" 
        },
        {
        "@type": "spdx:Checksum",
        "spdx:checksumValue": "65d3616852dbf7b1a6d4b53b00626032",
        "spdx:checksumAlgorithm": "spdx:checksumAlgorithm_md5" 
        }
        ]
}

@steingod

@fils I have no problem understanding the wish to avoid ambiguity and establish a consistent approach. My main concern (not related to this specific issue) is that in some communities SO is expected/claimed to be a lightweight alternative to GCMD DIF or ISO19115 and APIs, which I do not think it is. As it evolves to support proper filtering in search engines, it is becoming equally complex. Sorry for the small detour on the specific issue; it is a different discussion. :-)

@fils
Collaborator

fils commented May 13, 2021

@steingod sorry... I contributed to that detour too. ;) In our defense, the internet thrives on such detours. So let's continue... :)

I completely agree with you on this point and it is a major concern of mine. I think the examples we have for minimum and full in the examples directory help with this, and perhaps we should make sure we don't lose focus on making these smaller, more basic examples along the way. Like Google with their required and recommended. Breadth and momentum are still vital to this community.

It would also be nice to pay attention to those recommendations that could impact search ranking. If we can separate guidance that is more for result decoration from those properties that could affect relevancy ranking, that might be useful. A nice SHACL or framing plus Python data analysis use case, perhaps. There are different clients of course: a search engine result for discovery is different from a query supporting report generation on FAIR data alignment, for example.

More difficult are those types and properties that relate to connections (edges), as those can impact a more interesting category of query. For example, semantic page rank is one value I'm interested in that might fall into that category.

@steingod

@fils thanks, seems we are pretty aligned then :-) My hope is that SO can be useful for many of the small research stations and communities that have important data but lack the ability to properly announce them today. I very much support your statement on guidance for ranking and decoration, I think that is crucial to achieve this goal.

@adamml

adamml commented May 14, 2021

Sorry for dragging this further off topic, but I completely agree with @fils and @steingod on this. Having an agreed set of minimal Science on Schema attributes for discovery is vital for those "data curation/preservation/management" resource-poor organisations who need to publish their metadata to allow their data to be found. Hand in hand with this is good, simple-to-use tooling to allow that publication to happen and to validate the metadata.

The checksum discussion is absolutely important for trust in, and validation of, data file downloads, but is not going to be where every data provider is at this point in time.

@cboettig

In the spirit of the tangent, how do folks feel about the other part of @mbjones' proposal, which would optionally put the checksum in the more widely recognized field, identifier? E.g. (using the more compact notation for the minimalist-inclined):


{
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "@id": "https://dataone.org/datasets/doi%3A10.18739%2FA2NK36607",
    "sameAs": "https://doi.org/10.18739/A2NK36607",
    "identifier": "https://doi.org/10.18739/A2NK36607",
    "name": "Conductivity-Temperature-Depth (CTD) data along DBO5 (Distributed Biological Observatory - Barrow Canyon), from the 2009 Circulation, Cross-shelf Exchange, Sea Ice, and Marine Mammal Habitat on the Alaskan Beaufort Sea Shelf cruise on USCGC Healy (HLY0904)",
    "distribution": {
      "@type": "DataDownload",
      "identifier": [
        "urn:uuid:2646d817-9897-4875-9429-9c196be5c2ae",
        "hash://sha256/39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51",
        "ni:///sha256;Oa5jnTPOpKKHGYu83KXmhW5mB6fJHcTFQ0gDG-KtTFE"
      ]
    }
}

(Of course this could also use the PropertyValue notation if that were preferred. Either should validate against https://validator.schema.org/.)

@adamml notes that

The checksum discussion is absolutely important for trust in, and validation of, data file downloads

But as @fils noted, I think the content hash is equally important for the use case where the same data file may be found across multiple providers. The existing identifiers do not serve this purpose well, since the providers tend to mint their own identifiers. For example, this famous ice core CO2 data, https://doi.org/10.3334/CDIAC/ATG.009, which I use in my undergraduate teaching, can be found at both https://cn.dataone.org/cn/v2/resolve/ess-dive-0462dff585f94f8-20180716T160600643874 and https://knb.ecoinformatics.org/knb/d1/mn/v2/object/ess-dive-457358fdc81d3a5-20180726T203952542, where it is known by a different ess-dive identifier in each database. Comparing the hashes not only helps me know my download wasn't corrupted; it's the only way for me to know that these different identifiers refer to precisely the same data. Better still, we can "resolve" the hash and find other sources that are identical: https://hash-archive.org/sources/hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37

@mbjones
Collaborator

mbjones commented May 14, 2021

On the minimalism front, I hear what folks are saying and agree with some aspects of it, but I think there is room and need for guidance supporting different discovery (and other) use cases. As Carl lays out, there are important discovery use cases for checksums, and so simple guidance for "if you want to provide a checksum, do it like this..." will go a long way towards overcoming the ambiguity and multiplicity of approaches in schema.org. At no point are we saying that people must provide checksums -- we're simply trying to provide implementation guidance for doing so in a simple, interoperable way for those who want to provide them. We can continue this discussion on minimalism, but I think someone should open a new issue on it; it's not really the topic here for Checksum, and the minimalism discussion applies to many other fields in the already released SOSO guidance docs. In addition, folks who would like to see the SOSO effort change direction and be more minimalist might consider joining our twice-monthly calls so we can discuss that strategic direction in more detail.

To summarize the Checksum discussion thus far, and try to reach agreement on it, we have proposed two options: 1) to include checksum as an identifier, or 2) to include checksum as spdx:checksum. While these approaches are not exclusive, my read of the conversation thus far is that people think it is better to use spdx:checksum because it specifically signals the intent of the field, and doesn't conflate it with using a checksum as an identifier (which can be done as well but for different reasons). I think we've explored the options pretty thoroughly in this thread, and so I propose that we follow the examples of spdx:checksum in previous comments, and that we discuss this to get agreement on our next call. I will write up guidance docs in our proposed decision format for that meeting if that's ok. I've added it to the agenda for the May 27th call.

@mbjones
Collaborator

mbjones commented Jun 15, 2021

I added PR #171 that implements the checksum approach we discussed during a previous call. @ashepherd this closely follows the discussed solution, so if the wording and example are clear, it should be ready to merge to develop. I validated the example in JSON-LD playground.

@mbjones
Collaborator

mbjones commented Jun 15, 2021

The proposed text of the Checksum guidance can be read as formatted markdown on the branch here: https://github.com/ESIPFed/science-on-schema.org/blob/feature_66_checksum/guides/Dataset.md#checksum
