
Linking a checksum to DataDownload #66

Closed
ashepherd opened this issue Dec 19, 2019 · 23 comments
Labels: enhancement (New feature or request), Leading Practice (Recommended practices for repository implementation), Update Documentation (updates to the guidance docs)

@ashepherd
Member

Can we use the schema:identifier property, with a URN scheme to indicate the checksum?

Proposal:

  1. Use schema:PropertyValue
  2. Use schema:identifier to specify the URN of the checksum (e.g. md5:9e85e71b33f71ac738e4793ff142c464)
  3. Use schema:propertyID to specify the type of checksum as text
  4. Use schema:additionalType to specify the type of checksum using controlled vocabularies
  5. Use schema:value to specify the value of the checksum

Examples:

MD5:

{
  "@type": "DataDownload",
  "identifier": [
    ...DOI and other identifiers go here...,
    {
      "@type": "PropertyValue",
      "additionalType": ["http://www.wikidata.org/entity/Q185235", "http://id.loc.gov/vocabulary/preservation/cryptographicHashFunctions/md5"],
      "identifier": "md5:9e85e71b33f71ac738e4793ff142c464",
      "propertyID": "MD5",
      "value": "9e85e71b33f71ac738e4793ff142c464"
    }
  ]
}

SHA256:

{
  "@type": "DataDownload",
  "identifier": [
    ...DOI and other identifiers go here...,
    {
      "@type": "PropertyValue",
      "additionalType": "http://id.loc.gov/vocabulary/preservation/cryptographicHashFunctions/sha256",
      "identifier": "sha256:8808ACDC7FB7DC2F941EBACC7906B32D2676044494A740C21F6E0DC20893A2A6",
      "propertyID": "SHA256",
      "value": "8808ACDC7FB7DC2F941EBACC7906B32D2676044494A740C21F6E0DC20893A2A6"
    }
  ]
}
@ashepherd ashepherd added Leading Practice Recommended practices for repository implementation enhancement New feature or request help wanted Extra attention is needed labels Dec 19, 2019
@ashepherd
Member Author

Use of schema:identifier is mentioned as a potential solution here: https://github.com/schemaorg/schemaorg/issues/1831

@stale

stale bot commented Mar 17, 2020

This issue has been automatically marked as stale because it has not had recent activity.

@stale stale bot added the stale label Mar 17, 2020
@ashepherd ashepherd removed the stale label Apr 2, 2020
@mbjones
Collaborator

mbjones commented Jan 28, 2021

I fully support adding checksums. We should follow an established format to encode the hash and the algorithm. There are a number of possibilities, but I like the hash URI format best because of its readability and its formatting as a URI:

  • Hash URI: hash://sha256/030d8c2d6b7163a482865716958ca03806dfde99a309c927e56aa9962afbb95d

Other possibilities are described more thoroughly at https://hash-archive.org/:

  • Web URL: https://torrents.linuxmint.com/torrents/linuxmint-18-cinnamon-64bit.iso.torrent
  • Named Info: ni:///sha256;Aw2MLWtxY6SChlcWlYygOAbf3pmjCckn5Wqplir7uV0
  • Subresource Integrity: sha256-Aw2MLWtxY6SChlcWlYygOAbf3pmjCckn5Wqplir7uV0=
  • SSB: &Aw2MLWtxY6SChlcWlYygOAbf3pmjCckn5Wqplir7uV0=.sha256
  • MultiHash: QmNYZuyWz3U71Dwv7phEgh4WcWcQwdvpBWd99MkiaBoyBA
  • Magnet URI: magnet:?xt=urn:sha256:030d8c2d6b7163a482865716958ca03806dfde99a309c927e56aa9962afbb95d

The Named Info syntax is described in RFC 6920, but it has the disadvantage of being harder to parse, and it uses a base64url rather than hex representation of the hash value.
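The hash URI, Named Info, and Subresource Integrity forms above all carry the same digest bytes, so conversion between them is mechanical. As an illustration only (not part of any proposal here), a Python sketch deriving the three strings from a hex digest:

```python
import base64
import binascii

def hash_representations(hex_digest: str, algo: str = "sha256") -> dict:
    """Render one hex digest in several of the syntaxes listed above.
    The format strings are mechanical sketches, not normative."""
    raw = binascii.unhexlify(hex_digest)
    b64 = base64.b64encode(raw).decode()                         # standard base64, padded (SRI)
    b64url = base64.urlsafe_b64encode(raw).decode().rstrip("=")  # base64url, unpadded (ni:)
    return {
        "hash_uri": f"hash://{algo}/{hex_digest}",
        "named_info": f"ni:///{algo};{b64url}",
        "sri": f"{algo}-{b64}",
    }

reps = hash_representations(
    "030d8c2d6b7163a482865716958ca03806dfde99a309c927e56aa9962afbb95d"
)
```

Run against the sha256 digest above, this reproduces the ni:/// and sha256-… strings in the list.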

@cboettig You've compared the pros and cons of different hash syntaxes. Do you have a writeup somewhere associated with your work on https://github.com/cboettig/contentid ?

@mbjones mbjones added this to the v1.3 (possibly 2.0) milestone Jan 28, 2021
@mbjones mbjones self-assigned this Jan 28, 2021
@cboettig

A write-up is a good idea. So far it's only in the issue thread: cboettig/contentid#1

@mbjones
Collaborator

mbjones commented May 11, 2021

Here's a proposed checksum example using the Hash URI format that builds on the identifier field as proposed above and in https://github.com/schemaorg/schemaorg/issues/1831

{
    "@context": {
        "@vocab": "https://schema.org/"
    },
    "@type": "Dataset",
    "@id": "https://dataone.org/datasets/doi%3A10.18739%2FA2NK36607",
    "sameAs": "https://doi.org/10.18739/A2NK36607",
    "name": "Conductivity-Temperature-Depth (CTD) data along DBO5 (Distributed Biological Observatory - Barrow Canyon), from the 2009 Circulation, Cross-shelf Exchange, Sea Ice, and Marine Mammal Habitat on the Alaskan Beaufort Sea Shelf cruise on USCGC Healy (HLY0904)",
    "distribution": {
      "@type": "DataDownload",
      "@id": "https://dataone.org/datasets/urn%3Euuid%3E2646d817-9897-4875-9429-9c196be5c2ae",
      "identifier": [
        {
          "@type": "PropertyValue",
          "@id": "urn:uuid:2646d817-9897-4875-9429-9c196be5c2ae",
          "propertyID": "https://rfc-editor.org/rfc/rfc4122.txt",
          "value": "urn:uuid:2646d817-9897-4875-9429-9c196be5c2ae"
        },
        {
          "@type": "PropertyValue",
          "@id": "hash://sha256/39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51",
          "propertyID": "http://id.loc.gov/vocabulary/preservation/cryptographicHashFunctions/sha256",
          "value": "hash://sha256/39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51"
        }
      ]
    }
}

I think this works, but it would be good for our guidance to specify which propertyID URIs represent checksum algorithms, so that consumers could look for those ids for indexing and other uses. We could recommend a list of common checksum algorithm propertyID values. Following @ashepherd 's lead, I've used the Library of Congress vocabulary URIs above. See https://id.loc.gov/vocabulary/preservation/cryptographicHashFunctions.html for a list of named algorithms. The problem with this is that the actual syntax of the value used follows the Hash URI specification, which isn't actually the SHA256 value per se. But I'm not sure what value to use for propertyID there.
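As a sketch of what such an indexing consumer might do; the set of recognized propertyID URIs here is illustrative, not an agreed recommendation:

```python
# Illustrative only: propertyID URIs a consumer might treat as checksum markers
CHECKSUM_PROPERTY_IDS = {
    "http://id.loc.gov/vocabulary/preservation/cryptographicHashFunctions/md5",
    "http://id.loc.gov/vocabulary/preservation/cryptographicHashFunctions/sha256",
}

def checksum_identifiers(data_download: dict) -> list:
    """Return the PropertyValue entries in a DataDownload's identifier list
    whose propertyID marks them as checksums."""
    ids = data_download.get("identifier", [])
    if not isinstance(ids, list):
        ids = [ids]
    return [i for i in ids
            if isinstance(i, dict) and i.get("propertyID") in CHECKSUM_PROPERTY_IDS]
```

A consumer would run this over each DataDownload node after parsing the JSON-LD, indexing whatever entries come back.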

@mbjones mbjones added Update Documentation updates to the guidance docs and removed help wanted Extra attention is needed labels May 11, 2021
@cboettig

I'm all for embedding the checksum in the identifier, but it is surprising to me that DCAT2 and Schema.org don't have a more native concept to express a checksum just as a checksum.

I'm not sure it wouldn't be cleaner to use spdx:ChecksumAlgorithm and spdx:Checksum as the property/value pair for the raw checksum, and separately list the hash URI as an associated identifier. (I see the appeal of using ni:/// and linking to RFC 6920, but as noted above the ni:/// syntax is somewhat cumbersome from a developer perspective: having a base64-encoded string with optional and non-optional rules about which characters should then be percent-encoded is a bit tricky, and means that more than one valid string can be used for the same identifier.) Having an RFC specification for the hash URI spec would be a nice resolution to all of this.

@andrea-perego

@cboettig said:

[...]

I'm all for embedding the checksum in the identifier, but it is surprising to me that DCAT2 and Schema.org don't have a more native concept to express a checksum just as a checksum.

For the record: the current Working Draft of DCAT 3 does include the possibility of specifying the checksum of a distribution by following the DCAT-AP approach - see https://www.w3.org/TR/vocab-dcat-3/#Class:Checksum (relevant issue: w3c/dxwg#1287)

@mbjones
Collaborator

mbjones commented May 11, 2021

@cboettig I did consider using spdx:ChecksumAlgorithm and spdx:Checksum, and spdx:checksumValue, but I thought there were some issues:

  • it separates the checksum algorithm from the value representation, so complicates parsing and introduces blank nodes unless we are careful
  • the spdx class definitions have a number of domain and range entailments in SPDX (like to spdx:File) and are defined specifically for software
  • it puts yet another term in our vocabulary outside of SO. But we'd done that several times already...
  • spdx:algorithm has a defined range that only includes md5, sha1, and sha256; probably an oversight that could be fixed.

The benefits would be

  • easier to recognize it as a checksum because of the dedicated class
    • doesn't conflate identifier and checksum semantics
  • consistency with the direction of DCAT3 as @andrea-perego points out above

So, given the direction of DCAT3, here's an alternative proposal that would be very clear about the semantics of the checksum, and avoids blank nodes by using the hash URI serialization as the "@id":

{
    "@context": {
        "@vocab": "https://schema.org/",
      "spdx": "http://spdx.org/rdf/terms#"
    },
    "@type": "Dataset",
    "@id": "https://dataone.org/datasets/doi%3A10.18739%2FA2NK36607",
    "sameAs": "https://doi.org/10.18739/A2NK36607",
    "name": "Conductivity-Temperature-Depth (CTD) data along DBO5 (Distributed Biological Observatory - Barrow Canyon), from the 2009 Circulation, Cross-shelf Exchange, Sea Ice, and Marine Mammal Habitat on the Alaskan Beaufort Sea Shelf cruise on USCGC Healy (HLY0904)",
    "distribution": {
      "@type": "DataDownload",
      "@id": "https://dataone.org/datasets/urn%3Euuid%3E2646d817-9897-4875-9429-9c196be5c2ae",
      "identifier": [
        {
          "@type": "PropertyValue",
          "@id": "urn:uuid:2646d817-9897-4875-9429-9c196be5c2ae",
          "propertyID": "https://rfc-editor.org/rfc/rfc4122.txt",
          "value": "urn:uuid:2646d817-9897-4875-9429-9c196be5c2ae"
        }
      ],
      "spdx:Checksum": {
        "@id": "hash://sha256/39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51",
        "spdx:checksumValue": "39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51",
        "spdx:checksumAlgorithm": { 
          "@id": "spdx:checksumAlgorithm_sha256" 
        }
      }
    }
}

The triples that are related to the checksum would then be:

<https://dataone.org/datasets/urn%3Euuid%3E2646d817-9897-4875-9429-9c196be5c2ae> <http://spdx.org/rdf/terms#Checksum> <hash://sha256/39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51> .
<hash://sha256/39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51> <http://spdx.org/rdf/terms#checksumAlgorithm> <http://spdx.org/rdf/terms#checksumAlgorithm_sha256> .
<hash://sha256/39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51> <http://spdx.org/rdf/terms#checksumValue> "39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51" .
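Given such triples, a consumer could verify a completed download by recomputing the digest; a minimal sketch (the function name is illustrative):

```python
import hashlib

def checksum_matches(data: bytes, expected_hex: str, algo: str = "sha256") -> bool:
    """Recompute the digest of the downloaded bytes and compare it to
    the spdx:checksumValue literal (case-insensitively)."""
    return hashlib.new(algo, data).hexdigest() == expected_hex.lower()
```

For large files, update the hash object in chunks rather than holding the whole download in memory.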

So, let's call our two options:

  1. Option 1: Place checksum as schema:identifier
  2. Option 2. Place checksum as spdx:Checksum

Right now, I think I like Option 2 better. Thoughts?

@datadavev
Collaborator

To me it makes sense to use spdx:Checksum when asserting the value of the checksum for the resource. identifier with a checksum value can also be associated with the resource, but perhaps only when it is recognized as an identifier by the content creator / producer / publisher.

@cboettig

I agree with @datadavev that there are really two different things being discussed here, which actually makes for 3 non-empty cases:

  1. Reporting one or more checksums, but not using checksum as an identifier
  2. using checksum as an identifier, but not explicitly reporting an spdx:Checksum,
  3. doing both

I think these different roles would go in different blocks, something like:

{
    "@context": {
        "@vocab": "https://schema.org/",
      "spdx": "http://spdx.org/rdf/terms#"
    },
    "@type": "Dataset",
    "@id": "https://dataone.org/datasets/doi%3A10.18739%2FA2NK36607",
    "sameAs": "https://doi.org/10.18739/A2NK36607",
    "name": "Conductivity-Temperature-Depth (CTD) data along DBO5 (Distributed Biological Observatory - Barrow Canyon), from the 2009 Circulation, Cross-shelf Exchange, Sea Ice, and Marine Mammal Habitat on the Alaskan Beaufort Sea Shelf cruise on USCGC Healy (HLY0904)",
    "distribution": {
      "@type": "DataDownload",
      "@id": "https://dataone.org/datasets/urn%3Euuid%3E2646d817-9897-4875-9429-9c196be5c2ae",
      "identifier": [
        {
          "@type": "PropertyValue",
          "@id": "urn:uuid:2646d817-9897-4875-9429-9c196be5c2ae",
          "propertyID": "https://rfc-editor.org/rfc/rfc4122.txt",
          "value": "urn:uuid:2646d817-9897-4875-9429-9c196be5c2ae"
        },
        {
        "@type": "PropertyValue",
        "@id": "hash://sha256/39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51",
        "propertyID": "https://github.com/hash-uri/hash-uri/tree/master/cli",
        "value": "hash://sha256/39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51"
        }
      ],
      "spdx:Checksum": {
        "spdx:checksumValue": "39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51",
        "spdx:checksumAlgorithm": { 
          "@id": "spdx:checksumAlgorithm_sha256" 
        }
      }
    }
}

A few notes:

  • Obviously https://github.com/hash-uri/hash-uri/tree/master/cli is not a particularly satisfying propertyID. Schema.org defines three options for this field, maybe one of the other routes is better?

A commonly used identifier for the characteristic represented by the property, e.g. a manufacturer or a standard code for a property. propertyID can be (1) a prefixed string, mainly meant to be used with standards for product properties; (2) a site-specific, non-prefixed string (e.g. the primary key of the property or the vendor-specific id of the property), or (3) a URL indicating the type of the property, either pointing to an external vocabulary, or a Web resource that describes the property (e.g. a glossary entry). Standards bodies should promote a standard prefix for the identifiers of properties from their standards

  • spdx:Checksum probably ought to have an explicit @type. (It could be done in the context, but I think the validator might complain about the above.) I'm not sure what the @type is in SPDX; SPDX doesn't seem to use the camelCase convention? I might have thought the property was spdx:checksum and the type was spdx:Checksum.

  • In my version, spdx:Checksum is a blank node. That doesn't bother me; this seems like a reasonable use for a blank node. But if it needs a URI, I think the id should reflect that this is the checksum of the DataDownload it is nested in, e.g. something like: https://dataone.org/datasets/urn%3Euuid%3E2646d817-9897-4875-9429-9c196be5c2ae?sha256=39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51.

  • Maybe this is out of scope, but I'm actually rather confused that the @id of the DataDownload is https://dataone.org/datasets/urn%3Euuid%3E2646d817-9897-4875-9429-9c196be5c2ae. I think of @id as the 'canonical' identifier for the object, e.g. one of the two identifiers in the identifier list (i.e. I think of the schema:DataDownload as the equivalent of the dcat2:distribution, which is the actual file: the .csv file, whatever bytes have that hash). It makes sense to refer to such an object with a location-agnostic identifier like either the UUID or sha256sum in the identifier list, but I'm not sure that's really 'the same thing as' https://dataone.org/datasets/urn%3Euuid%3E2646d817-9897-4875-9429-9c196be5c2ae.

  • The schema.org discussion on the recommended use of identifier as a PropertyValue seems to suggest that the PropertyValue should always set a name field as well as a value field? (Also echoed in the PropertyValue def.)

  • Also I'm pretty sure the value ends up being typed as a string in this case and not a URI, e.g. see playground N-Quads. Would it make sense to use a structuredValue (feels totally overkill)? Sounds like this should be identifier instead based on the above-linked schema-org docs:

In this case, a PropertyValue pair ('name', 'identifier') pair can be used when a standard URI form of the identifier is unavailable. We do not currently have a recommended identifier scheme for identifier schemes, but in most cases there is a conventional short name for most identifier schemes (which should be used in lowercase form).

(Also sounds like they are saying there's no need to use PropertyValue when you already have a URI-formatted identifier...)

Based on these thoughts, I would have done something like:


{
    "@context": {
      "@vocab": "https://schema.org/",
      "spdx": "http://spdx.org/rdf/terms#",
      "identifier": {"@id": "identifier", "@type": "@id"}
    },
    "@type": "Dataset",
    "@id": "https://dataone.org/datasets/doi%3A10.18739%2FA2NK36607",
    "sameAs": "https://doi.org/10.18739/A2NK36607",
    "identifier": "https://doi.org/10.18739/A2NK36607",
    "name": "Conductivity-Temperature-Depth (CTD) data along DBO5 (Distributed Biological Observatory - Barrow Canyon), from the 2009 Circulation, Cross-shelf Exchange, Sea Ice, and Marine Mammal Habitat on the Alaskan Beaufort Sea Shelf cruise on USCGC Healy (HLY0904)",
    "distribution": {
      "@type": "DataDownload",
      "identifier": ["urn:uuid:2646d817-9897-4875-9429-9c196be5c2ae",
                         "hash://sha256/39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51"
      ],
      "spdx:Checksum": {
        "spdx:checksumValue": "39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51",
        "spdx:checksumAlgorithm": { 
          "@id": "spdx:checksumAlgorithm_sha256" 
        }
      }
    }
}

Yeah, doing "identifier": {"@id": "identifier", "@type": "@id"} is non-standard and maybe better done in-line; I'm not sure what is the best way to get identifiers to come out as actual URIs (i.e. when we transform to N-Quads / standard RDF). Google doesn't seem to care about that either, since its examples type identifiers as strings, in which case this casting could just be omitted.

@andrea-perego

@cboettig said:

  • spdx:Checksum probably ought to have an explicit @type. (It could be done in the context, but I think the validator might complain about the above.) I'm not sure what the @type is in SPDX; SPDX doesn't seem to use the camelCase convention? I might have thought the property was spdx:checksum and the type was spdx:Checksum.

Yes, spdx:checksum is the property and spdx:Checksum the class. The example should be revised as follows:

...
      "spdx:checksum": {
        "@type":"spdx:Checksum",
        "spdx:checksumValue": "39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51",
        "spdx:checksumAlgorithm": { 
          "@id": "spdx:checksumAlgorithm_sha256" 
        }
      }
...

@mbjones
Collaborator

mbjones commented May 12, 2021

Thanks, makes sense to include the type, and my prior example did improperly capitalize the property. Here's a revised full example with the type included and the property correctly cased:

{
    "@context": {
      "@vocab": "https://schema.org/",
      "spdx": "http://spdx.org/rdf/terms#"
    },
    "@type": "Dataset",
    "@id": "https://dataone.org/datasets/doi%3A10.18739%2FA2NK36607",
    "sameAs": "https://doi.org/10.18739/A2NK36607",
    "name": "Conductivity-Temperature-Depth (CTD) data along DBO5 (Distributed Biological Observatory - Barrow Canyon), from the 2009 Circulation, Cross-shelf Exchange, Sea Ice, and Marine Mammal Habitat on the Alaskan Beaufort Sea Shelf cruise on USCGC Healy (HLY0904)",
    "distribution": {
      "@type": "DataDownload",
      "@id": "https://dataone.org/datasets/urn%3Euuid%3E2646d817-9897-4875-9429-9c196be5c2ae",
      "identifier": "urn:uuid:2646d817-9897-4875-9429-9c196be5c2ae",
      "spdx:checksum": {
        "@type": "spdx:Checksum",
        "@id": "hash://sha256/39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51",
        "spdx:checksumValue": "39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51",
        "spdx:checksumAlgorithm": { 
          "@id": "spdx:checksumAlgorithm_sha256" 
        }
      }
    }
}

@steingod

I am in favour of the spdx:Checksum approach; that seems most consistent to me. I am not sure I understand why we need the type, though (I understand it from the modelling perspective, but not as part of a simple approach for data providers to expose their data, which is why many are looking into SO). Concerning using it as an identifier as well, I agree with @datadavev's perspective. The challenge is creating guidance that helps data providers.

@fils
Collaborator

fils commented May 13, 2021

@steingod for me a primary interest in this is to aid potential disambiguation of data being exposed by multiple parties. So I view this as a vital element for data. I'm already using this quite a bit and the latest proposed guidance seems fine on quick inspection.

In most cases for the end user the goal is to easily get the data. This is trivial to JSON-LD Frame out and no more complex in SPARQL space than any guidance (which can be taken many ways) ;)
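For what it's worth, pulling the checksum out of the compacted JSON doesn't even need RDF tooling; a sketch that assumes the spdx: prefix is spelled out verbatim, as in the examples above:

```python
import json

def find_checksums(node, found=None):
    """Collect (value, algorithm) pairs from spdx:checksum entries by walking
    the parsed JSON-LD; assumes the spdx: prefix appears verbatim."""
    if found is None:
        found = []
    if isinstance(node, dict):
        for key, val in node.items():
            if key == "spdx:checksum":
                for cs in val if isinstance(val, list) else [val]:
                    alg = cs.get("spdx:checksumAlgorithm")
                    if isinstance(alg, dict):       # handle {"@id": ...} or a plain string
                        alg = alg.get("@id")
                    found.append((cs.get("spdx:checksumValue"), alg))
            else:
                find_checksums(val, found)
    elif isinstance(node, list):
        for item in node:
            find_checksums(item, found)
    return found

doc = json.loads("""
{
  "@type": "Dataset",
  "distribution": {
    "@type": "DataDownload",
    "spdx:checksum": {
      "@type": "spdx:Checksum",
      "spdx:checksumValue": "39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51",
      "spdx:checksumAlgorithm": {"@id": "spdx:checksumAlgorithm_sha256"}
    }
  }
}
""")
```

A framing or SPARQL approach would be more robust across serializations, but this shows how little the spdx:checksum proposal asks of a consumer.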

@cboettig

Definitely agree with @fils on this!

for me a primary interest in this is to aid potential disambiguation of data being exposed by multiple parties. So I view this as a vital element for data. I'm already using this quite a bit and the latest proposed guidance seems fine on quick inspection.

That's also why I like the proposal of identifiers based on the checksums and independent of the party exposing the data. Certainly we can always extract that information if it's in the checksum field like in these examples, but it's also easily lost from there; e.g. like @fils says, there's the temptation to just JSON-LD frame it out and have a pure-schema.org representation, or any of the other representations that don't have a native checksum attribute. So I still support @mbjones' original suggestion that it would be nice to also normalize including this in an identifier field, which is (for better or worse) a much more widely implemented field.

If I had my druthers it would be the 'canonical' identifier or @id for any downloadable content object (schema:DataDownload), because "@id": "hash://sha256/39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51" is not subject to link rot and not provider-specific the way that "@id": "https://dataone.org/datasets/urn%3Euuid%3E2646d817-9897-4875-9429-9c196be5c2ae" is, but I also recognize maybe we need to walk before we run. Having checksums in the checksum field is at least a nice start.

A related question is whether to encourage providing more than one checksum. If so, I think it would look something like:

{
"@context": {
      "@vocab": "https://schema.org/",
      "spdx": "http://spdx.org/rdf/terms#",
      "spdx:checksumAlgorithm": {"@id": "spdx:checksumAlgorithm", "@type": "@id"}
    },
  "spdx:checksum": [
        {
        "@type": "spdx:Checksum",
        "spdx:checksumValue": "39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51",
        "spdx:checksumAlgorithm": "spdx:checksumAlgorithm_sha256" 
        },
        {
        "@type": "spdx:Checksum",
        "spdx:checksumValue": "65d3616852dbf7b1a6d4b53b00626032",
        "spdx:checksumAlgorithm": "spdx:checksumAlgorithm_md5" 
        }
        ]
}

@steingod

@fils I have no problem understanding the wish to avoid ambiguity and establish a consistent approach. My main concern (not related to this specific issue) is that in some communities SO is expected/claimed to be a lightweight alternative to GCMD DIF or ISO19115 and APIs, which I do not think it is. As it evolves to support proper filtering in search engines, it is becoming equally complex. Sorry for the small detour on the specific issue; it is a different discussion. :-)

@fils
Collaborator

fils commented May 13, 2021

@steingod sorry... I contributed to that detour too. ;) In our defense, the internet thrives on such detours. So let's continue... :)

I completely agree with you on this point and it is a major concern of mine. I think the examples we have for minimum and full in the examples directory help with this, and perhaps we should make sure we don't lose focus on making these smaller, more basic examples along the way. Like Google with their required and recommended. Breadth and momentum are still vital to this community.

It would also be nice to pay attention to those recommendations that could impact search ranking. If we can separate guidance that is more for result decoration from those properties that could affect relevancy ranking, that might be useful. A nice SHACL or framing plus Python data analysis use case, perhaps. There are different clients of course: a search engine result for discovery is different from a query supporting report generation on FAIR data alignment, for example.

More difficult are those types and properties that relate to connections (edges), as those can impact a more interesting category of query. For example, semantic page rank is one value I'm interested in that might fall into that category.

@steingod

@fils thanks, seems we are pretty aligned then :-) My hope is that SO can be useful for many of the small research stations and communities that have important data but lack the ability to properly announce them today. I very much support your statement on guidance for ranking and decoration, I think that is crucial to achieve this goal.

@adamml

adamml commented May 14, 2021

Sorry for dragging this further off topic, but I completely agree with @fils and @steingod on this. Having an agreed set of minimal Science on Schema attributes for discovery is vital for those "data curation/preservation/management" resource-poor organisations who need to publish their metadata to allow their data to be found. Hand in hand with this is good, simple-to-use tooling to allow that publication to happen and to validate the metadata.

The checksum discussion is absolutely important for trust in, and validation of, data file downloads, but is not going to be where every data provider is at this point in time.

@cboettig

In the spirit of the tangent, how do folks feel about the other part of @mbjones' proposal, which would optionally put the checksum in the more widely recognized field, identifier? E.g. (using the more compact notation for the minimalist-inclined):


{
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "@id": "https://dataone.org/datasets/doi%3A10.18739%2FA2NK36607",
    "sameAs": "https://doi.org/10.18739/A2NK36607",
    "identifier": "https://doi.org/10.18739/A2NK36607",
    "name": "Conductivity-Temperature-Depth (CTD) data along DBO5 (Distributed Biological Observatory - Barrow Canyon), from the 2009 Circulation, Cross-shelf Exchange, Sea Ice, and Marine Mammal Habitat on the Alaskan Beaufort Sea Shelf cruise on USCGC Healy (HLY0904)",
    "distribution": {
      "@type": "DataDownload",
      "identifier": [
        "urn:uuid:2646d817-9897-4875-9429-9c196be5c2ae",
        "hash://sha256/39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51",
        "ni:///sha256;Oa5jnTPOpKKHGYu83KXmhW5mB6fJHcTFQ0gDG-KtTFE"
      ]
    }
}

(Of course this could also use the PropertyValue notation if that were preferred. Either should validate against https://validator.schema.org/.)

@adamml notes that

The checksum discussion is absolutely important for trust in, and validation of, data file downloads

But as @fils noted, I think the content hash is equally important for the use case where the same data file may be found across multiple providers. The existing identifiers do not serve this purpose well, since the providers tend to mint their own identifiers. For example, this famous ice core CO2 data, https://doi.org/10.3334/CDIAC/ATG.009, which I use in my undergraduate teaching, can be found at both https://cn.dataone.org/cn/v2/resolve/ess-dive-0462dff585f94f8-20180716T160600643874 and https://knb.ecoinformatics.org/knb/d1/mn/v2/object/ess-dive-457358fdc81d3a5-20180726T203952542, where it is known by a different ess-dive identifier in each database. Comparing the hashes not only helps me know my download wasn't corrupted; it's the only way for me to know that these different identifiers refer to precisely the same data. Better still, we can "resolve" the hash and find other sources that are identical: https://hash-archive.org/sources/hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37

@mbjones
Collaborator

mbjones commented May 14, 2021

On the minimalism front, I hear what folks are saying and agree with some aspects of it, but I think there is room and need for guidance supporting different discovery (and other) use cases. As Carl lays out, there are important discovery use cases for checksums, and so simple guidance for "if you want to provide a checksum, do it like this..." will go a long way towards overcoming the ambiguity and multiplicity of approaches in schema.org. At no point are we saying that people must provide checksums -- we're simply trying to provide implementation guidance for doing so in a simple, interoperable way for those who want to provide them. We can continue this discussion on minimalism, but I think someone should open a new issue on it; it's not really the topic here for Checksum, and the minimalism discussion applies to many other fields in the already released SOSO guidance docs. In addition, folks who would like to see the SOSO effort change direction and be more minimalist might consider joining our twice-monthly calls so we can discuss that strategic direction in more detail.

To summarize the Checksum discussion thus far, and try to reach agreement on it, we have proposed two options: 1) to include checksum as an identifier, or 2) to include checksum as spdx:checksum. While these approaches are not exclusive, my read of the conversation thus far is that people think it is better to use spdx:checksum because it specifically signals the intent of the field, and doesn't conflate it with using a checksum as an identifier (which can be done as well but for different reasons). I think we've explored the options pretty thoroughly in this thread, and so I propose that we follow the examples of spdx:checksum in previous comments, and that we discuss this to get agreement on our next call. I will write up guidance docs in our proposed decision format for that meeting if that's ok. I've added it to the agenda for the May 27th call.

@mbjones
Collaborator

mbjones commented Jun 15, 2021

I added PR #171 that implements the checksum approach we discussed during a previous call. @ashepherd this closely follows the discussed solution, so if the wording and example are clear, it should be ready to merge to develop. I validated the example in JSON-LD playground.

@mbjones
Collaborator

mbjones commented Jun 15, 2021

The proposed text of the Checksum guidance can be read as formatted markdown on the branch here: https://github.com/ESIPFed/science-on-schema.org/blob/feature_66_checksum/guides/Dataset.md#checksum
