Google has drafted a guide to help publishers. The guide lists only two required fields: name and description.
- name - A descriptive name of a dataset (e.g., “Snow depth in Northern Hemisphere”)
- description - A short summary describing a dataset.
{ "@context": "https://schema.org/", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", "description": "This dataset includes results of laboratory experiments which measured dissolved organic carbon (DOC) usage by natural bacteria in seawater at different pCO2 levels. Included in this dataset are; bacterial abundance, total organic carbon (TOC), what DOC was added to the experiment, target pCO2 level. " }
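As a minimal sketch of how a publisher might check a record against the two required fields before publishing it, the Python below builds the record as a dictionary and reports any field that is missing or empty. The helper name `missing_required_fields` is hypothetical and not part of any library; only the field names come from the guide.

```python
import json

# The two fields the guide lists as required for a Dataset.
REQUIRED_FIELDS = ("name", "description")


def missing_required_fields(record: dict) -> list:
    """Return the required fields that are absent or blank in a record."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            missing.append(field)
    return missing


dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Snow depth in Northern Hemisphere",
    "description": "A short summary describing the dataset.",
}

if not missing_required_fields(dataset):
    # Serialize to the JSON-LD text that would be published.
    print(json.dumps(dataset, indent=2))
```

The check is deliberately narrow: it only confirms that both fields are non-empty strings and says nothing about the quality of the name or description.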
The guide suggests the following recommended fields:
- url - Location of a page describing the dataset.
- sameAs - Other URLs that can be used to access the dataset page. A link to a page that provides more information about the same dataset, usually in a different repository.
- version - The version number or identifier for this dataset (text or numeric).
- isAccessibleForFree - Boolean (true|false) specifying whether the dataset is accessible for free.
- keywords - Keywords summarizing the dataset.
- identifier - An identifier for the dataset, such as a DOI (text, URL, or PropertyValue).
- variableMeasured - What does the dataset measure? (e.g., temperature, pressure)
{ "@context": "https://schema.org/", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", "description": "This dataset includes results of laboratory experiments which measured dissolved organic carbon (DOC) usage by natural bacteria in seawater at different pCO2 levels. Included in this dataset are; bacterial abundance, total organic carbon (TOC), what DOC was added to the experiment, target pCO2 level. ", "url": "https://www.sample-data-repository.org/dataset/472032", "sameAs": "https://search.dataone.org/#view/https://www.sample-data-repository.org/dataset/472032", "version": "2013-11-21", "isAccessibleForFree": true, "keywords": ["ocean acidification", "Dissolved Organic Carbon", "bacterioplankton respiration", "pCO2", "carbon dioxide", "oceans"], "license": [ "http://spdx.org/licenses/CC0-1.0", "https://creativecommons.org/publicdomain/zero/1.0"] }
Adding the schema:keywords field can be done in three ways: a text description, a URL, or schema:DefinedTerm. We recommend using schema:DefinedTerm if a keyword comes from a controlled vocabulary.
For a dataset with the keywords ocean acidification, Dissolved Organic Carbon, bacterioplankton respiration, pCO2, carbon dioxide, and oceans, you can express them as:
{ "@context": "https://schema.org/", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", "description": "This dataset includes results of laboratory experiments which measured dissolved organic carbon (DOC) usage by natural bacteria in seawater at different pCO2 levels. Included in this dataset are; bacterial abundance, total organic carbon (TOC), what DOC was added to the experiment, target pCO2 level. ", "url": "https://www.sample-data-repository.org/dataset/472032", "keywords": ["ocean acidification", "Dissolved Organic Carbon", "bacterioplankton respiration", "pCO2", "carbon dioxide", "oceans"] }
If you know the controlled vocabulary that a keyword comes from, use schema:DefinedTerm to describe that keyword. The relevant properties of a schema:DefinedTerm are:
- name - The name of the keyword. (Required)
- inDefinedTermSet - The controlled vocabulary responsible for this keyword. (Required)
- url - The canonical URL for the keyword. (Optional)
- termCode - A representative code for this keyword in the controlled vocabulary (Optional)
As an example, we demonstrate these fields using the oceans keyword from the NASA GCMD Keyword vocabulary, ice core studies from SnowTerm, and Baked Clay from the EarthRef controlled vocabulary.
{ "@context": "https://schema.org/", "@type": "Dataset", "name": "Dataset shell for example DefinedTerm keywords", "keywords": [ { "@type": "DefinedTerm", "name": "OCEANS", "inDefinedTermSet": "https://gcmd.earthdata.nasa.gov/kms/concepts/concept_scheme/sciencekeywords", "url": "https://gcmd.earthdata.nasa.gov/kms/concept/91697b7d-8f2b-4954-850e-61d5f61c867d", "termCode": "91697b7d-8f2b-4954-850e-61d5f61c867d" }, { "@type": "DefinedTerm", "name": "ice core studies", "inDefinedTermSet": "https://vocabularyserver.com/cnr/ml/snowterm/en/", "url": "https://vocabularyserver.com/cnr/ml/snowterm/en/index.php?tema=29330", "identifier": { "@type": "PropertyValue", "propertyID": "https://registry.identifiers.org/registry/ark", "value": "ark:/99152/t3v4yo3eeqepj0", "url": "https://vocabularyserver.com/cnr/ml/snowterm/en/?ark=ark:/99152/t3v4yo3eeqepj0" } }, { "@type": "DefinedTerm", "name": "Baked Clay", "inDefinedTermSet": "https://www2.earthref.org/vocabularies/controlled" } ] }
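The required and optional DefinedTerm properties listed above can be sketched as a small builder. This is an illustrative helper only: the function name `defined_term` and its argument names are assumptions, while the property names and example URLs are taken from the text.

```python
def defined_term(name, term_set, url=None, term_code=None):
    """Build a schema:DefinedTerm keyword as a dictionary.

    name and term_set (inDefinedTermSet) are required; url and
    term_code (termCode) are optional and omitted when not given.
    """
    if not name or not term_set:
        raise ValueError("name and inDefinedTermSet are required")
    term = {
        "@type": "DefinedTerm",
        "name": name,
        "inDefinedTermSet": term_set,
    }
    if url:
        term["url"] = url
    if term_code:
        term["termCode"] = term_code
    return term


keywords = [
    defined_term(
        "OCEANS",
        "https://gcmd.earthdata.nasa.gov/kms/concepts/concept_scheme/sciencekeywords",
        url="https://gcmd.earthdata.nasa.gov/kms/concept/91697b7d-8f2b-4954-850e-61d5f61c867d",
        term_code="91697b7d-8f2b-4954-850e-61d5f61c867d",
    ),
    # Only the required properties are known for this term.
    defined_term("Baked Clay", "https://www2.earthref.org/vocabularies/controlled"),
]
```

The resulting list can be assigned directly to the "keywords" property of the Dataset dictionary before it is serialized.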
Adding the schema:identifier field can be done in three ways - a text description, a URL, or by using the schema:PropertyValue field.
We highly recommend using schema:PropertyValue.
Q: Why are simple text or URLs not good enough?
A: Identifiers have multiple properties that are useful when trying to find them across the web.
Most identifiers have these properties:
- a value,
- a domain or scheme (in which the value is guaranteed to be unique),
- (optionally) a resolvable URL (where the thing being identified can be found),
- (optionally) a domain prefix (a token string of characters succeeded by a colon ':' that represents the domain or scheme).
For example, the Digital Object Identifier (DOI) for a dataset may be: doi:10.5066/F7VX0DMQ. Breaking it down into its properties, we arrive at:
- value: 10.5066/F7VX0DMQ
- scheme: Digital Object Identifier (DOI)
- url: https://doi.org/10.5066/F7VX0DMQ
- prefix: doi
Q: Can't we just say the scheme is a 'DOI'?
A: Yes, but there's a better way: a URI or URL. Because we are publishing schema.org to express the explicit values of our content, we want to explicitly identify and classify that content so that harvesters can determine when it appears elsewhere on the web. By detecting these shared pieces of content, we form the Web of Data.
Because the scheme Digital Object Identifier (DOI) is described using unstructured text, we need a better way to explicitly state this value. Fortunately, identifiers.org has registered URIs for almost 700 different identifier schemes, which can be browsed at: https://registry.identifiers.org/registry.
We can specify the scheme as being a DOI with this identifiers.org Registry URI:
https://registry.identifiers.org/registry/doi
Looking at the available fields from schema:PropertyValue, we can map our identifier fields as such:
- schema:value is the identifier value: 10.5066/F7VX0DMQ
- schema:propertyID is the registry.identifiers.org URI for the identifier scheme: https://registry.identifiers.org/registry/doi
- schema:url is the resolvable URL for that identifier: https://doi.org/10.5066/F7VX0DMQ
Q: Where should the prefix go?
A: There is no ideal property for the prefix, but we may include it as part of the schema:value.
Q: Why include doi: as part of the value? Doesn't the URL https://doi.org/10.5066/F7VX0DMQ achieve the same result?
A: While the actual value of the DOI is 10.5066/F7VX0DMQ, we felt that this representation helps schema.org publishers specify an identifier value that is familiar to the research community. For example, in most citation styles such as APA, the DOI 10.5066/F7VX0DMQ is cited as doi:10.5066/F7VX0DMQ. Also, there can be many proper URLs for a specific identifier:
- http://doi.org/10.5066/F7VX0DMQ
- https://doi.org/10.5066/F7VX0DMQ
- http://dx.doi.org/10.5066/F7VX0DMQ
- https://dx.doi.org/10.5066/F7VX0DMQ
- https://www.sciencebase.gov/catalog/item/56b3e649e4b0cc79997fb5ec
For these reasons, we recommend that any identifier with a known prefix include that prefix in the value, followed by a colon, to form prefix:value; for this DOI: doi:10.5066/F7VX0DMQ.
Q: How do I know if an Identifier has a known prefix?
A: Each identifier in the identifiers.org Registry that has a known prefix lists it on its identifiers.org registry page, in the section called 'Identifier Schemes', at the field labeled 'Prefix'.
An example of using schema:PropertyValue to describe an Identifier:
{ "@context": "https://schema.org/", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", "description": "This dataset includes results of laboratory experiments which measured dissolved organic carbon (DOC) usage by natural bacteria in seawater at different pCO2 levels. Included in this dataset are; bacterial abundance, total organic carbon (TOC), what DOC was added to the experiment, target pCO2 level. ", "url": "https://www.sample-data-repository.org/dataset/472032", "sameAs": "https://search.dataone.org/#view/https://www.sample-data-repository.org/dataset/472032", "version": "2013-11-21", "keywords": ["ocean acidification", "Dissolved Organic Carbon", "bacterioplankton respiration", "pCO2", "carbon dioxide", "oceans"], "identifier": { "@id": "https://doi.org/10.5066/F7VX0DMQ", "@type": "PropertyValue", "propertyID": "https://registry.identifiers.org/registry/doi", "value": "doi:10.5066/F7VX0DMQ", "url": "https://doi.org/10.5066/F7VX0DMQ" } }
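The mapping of a DOI onto schema:PropertyValue can be sketched in a few lines of Python. The helper name `doi_identifier` is hypothetical; the property names, the identifiers.org registry URI, and the doi: prefix convention come from the text. The sketch also accepts input that already carries the prefix or a doi.org URL, which is an added convenience rather than something the guide specifies.

```python
DOI_SCHEME = "https://registry.identifiers.org/registry/doi"


def doi_identifier(doi: str) -> dict:
    """Build a schema:PropertyValue for a DOI such as '10.5066/F7VX0DMQ'."""
    bare = doi.strip()
    # Strip a leading resolver URL or 'doi:' prefix so only the bare value remains.
    for lead in ("https://doi.org/", "http://doi.org/", "doi:"):
        if bare.lower().startswith(lead):
            bare = bare[len(lead):]
            break
    url = "https://doi.org/" + bare
    return {
        "@id": url,
        "@type": "PropertyValue",
        "propertyID": DOI_SCHEME,   # the scheme, stated as a URI
        "value": "doi:" + bare,     # prefix included in the value
        "url": url,                 # resolvable URL
    }


identifier = doi_identifier("10.5066/F7VX0DMQ")
```

Normalizing every input to one resolver URL keeps the "@id" stable even though several URLs can resolve the same DOI.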
Optionally, the schema:name field can be used to give this specific identifier a label such as "DOI: 10.5066/F7VX0DMQ" or "DOI 10.5066/F7VX0DMQ", but schema:name should never be used to simply say "DOI".
{ "@context": "https://schema.org/", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", ... "identifier": { "@id": "https://doi.org/10.5066/F7VX0DMQ", "@type": "PropertyValue", "name": "DOI: 10.5066/F7VX0DMQ", "propertyID": "https://registry.identifiers.org/registry/doi", "value": "doi:10.5066/F7VX0DMQ", "url": "https://doi.org/10.5066/F7VX0DMQ" } }
For more examples of using schema:PropertyValue for identifiers other than DOIs:
- ARK: https://registry.identifiers.org/registry/ark
- PubMed: https://registry.identifiers.org/registry/pubmed
- PaleoDB: https://registry.identifiers.org/registry/paleodb
- Protein Data Bank: https://registry.identifiers.org/registry/pdb
"identifier": [ { "@id": "https://n2t.net/ark:13030/c7833mx7t", "@type": "PropertyValue", "propertyID": "https://registry.identifiers.org/registry/ark", "name": "ARK: 13030/c7833mx7t", "value": "ark:13030/c7833mx7t", "url": "https://n2t.net/ark:13030/c7833mx7t" }, { "@id": "http://www.ncbi.nlm.nih.gov/pubmed/16333295", "@type": "PropertyValue", "propertyID": "https://registry.identifiers.org/registry/pubmed", "name": "Pubmed ID #16333295", "value": "pubmed:16333295", "url": "http://www.ncbi.nlm.nih.gov/pubmed/16333295" }, { "@id": "https://identifiers.org/paleodb:83088", "@type": "PropertyValue", "propertyID": "https://registry.identifiers.org/registry/paleodb", "name": "Paleo Database ID #83088", "value": "paleodb:83088", "url": "https://identifiers.org/paleodb:83088" }, { "@id": "https://identifiers.org/pdb:2gc4", "@type": "PropertyValue", "propertyID": "https://registry.identifiers.org/registry/pdb", "name": "Protein Data Bank 2gc4", "value": "pdb:2gc4", "url": "https://identifiers.org/pdb:2gc4" } ]
While we strongly recommend using a schema:PropertyValue, in its most basic form, the schema:identifier as text can be published as:
{ "@context": "https://schema.org/", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", "description": "This dataset includes results of laboratory experiments which measured dissolved organic carbon (DOC) usage by natural bacteria in seawater at different pCO2 levels. Included in this dataset are; bacterial abundance, total organic carbon (TOC), what DOC was added to the experiment, target pCO2 level. ", "url": "https://www.sample-data-repository.org/dataset/472032", "sameAs": "https://search.dataone.org/#view/https://www.sample-data-repository.org/dataset/472032", "version": "2013-11-21", "keywords": ["ocean acidification", "Dissolved Organic Carbon", "bacterioplankton respiration", "pCO2", "carbon dioxide", "oceans"], "identifier": "urn:sdro:dataset:472032" }
Or as a URL:
{ "@context": "https://schema.org/", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", ... "identifier": "http://id.sampledatarepository.org/dataset/472032/version/1" }
However, if the identifier is a persistent identifier such as a DOI, ARK, or accession number, then the best way to represent it is by using a schema:PropertyValue. The PropertyValue allows more information about the identifier to be represented, such as the identifier type or scheme, the identifier's value, its URL, and more. Because of this flexibility, we recommend using PropertyValue for all identifier types.
schema:Dataset also defines a field for the schema:citation as either text or a schema:CreativeWork. To provide citation text:
NOTE: If you have a DOI, the citation text can be automatically generated for you by querying a DOI URL with the Accept header of 'text/x-bibliography'.
{ "@context": "https://schema.org/", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", "description": "This dataset includes results of laboratory experiments which measured dissolved organic carbon (DOC) usage by natural bacteria in seawater at different pCO2 levels. Included in this dataset are; bacterial abundance, total organic carbon (TOC), what DOC was added to the experiment, target pCO2 level. ", "url": "https://www.sample-data-repository.org/dataset/472032", "sameAs": "https://search.dataone.org/#view/https://www.sample-data-repository.org/dataset/472032", "version": "2013-11-21", "keywords": ["ocean acidification", "Dissolved Organic Carbon", "bacterioplankton respiration", "pCO2", "carbon dioxide", "oceans"], "identifier": { "@id": "https://doi.org/10.5066/F7VX0DMQ", "@type": "PropertyValue", "name": "DOI: 10.5066/F7VX0DMQ", "propertyID": "https://registry.identifiers.org/registry/doi", "value": "doi:10.5066/F7VX0DMQ", "url": "https://doi.org/10.5066/F7VX0DMQ" }, "citation": "J. Smith, 'How I created an awesome dataset', Journal of Data Science, 1966" }
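The content-negotiation request described in the note can be sketched with the Python standard library. The function names `citation_request` and `fetch_citation` are hypothetical; the Accept header value is the one named in the note. Whether a particular DOI returns citation text this way depends on its registration agency, so treat the network call as an assumption to verify for your own DOIs.

```python
import urllib.request


def citation_request(doi: str) -> urllib.request.Request:
    """Prepare a request asking the DOI resolver for formatted citation text."""
    return urllib.request.Request(
        "https://doi.org/" + doi,
        headers={"Accept": "text/x-bibliography"},
    )


def fetch_citation(doi: str, timeout: float = 10.0) -> str:
    """Send the request and return the citation text (requires network access)."""
    with urllib.request.urlopen(citation_request(doi), timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


# Building the request does not touch the network; fetch_citation() does.
request = citation_request("10.5066/F7VX0DMQ")
```

The string returned by `fetch_citation` could then be stored in the "citation" property of the Dataset record.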
The Short DOI service is a redirect service offered by the International DOI Foundation that provides a shorter version of an original DOI. For example, the original DOI doi:10.5066/F7VX0DMQ has a short DOI of doi.org/csgf. Short DOIs are resolvable using standard DOI URLs such as http://doi.org/fg5v. These short DOIs are treated identically to the original DOI. If you are using the short DOI service, we recommend publishing a short DOI URL using the schema:sameAs property of the schema:Dataset:
{ "@context": "https://schema.org/", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", "description": "This dataset includes results of laboratory experiments which measured dissolved organic carbon (DOC) usage by natural bacteria in seawater at different pCO2 levels. Included in this dataset are; bacterial abundance, total organic carbon (TOC), what DOC was added to the experiment, target pCO2 level. ", "url": "https://www.sample-data-repository.org/dataset/472032", "sameAs": [ "https://search.dataone.org/#view/https://www.sample-data-repository.org/dataset/472032", "http://doi.org/fg5v" ], "version": "2013-11-21", "keywords": ["ocean acidification", "Dissolved Organic Carbon", "bacterioplankton respiration", "pCO2", "carbon dioxide", "oceans"], "identifier": { "@id": "https://doi.org/10.5066/F7VX0DMQ", "@type": "PropertyValue", "propertyID": "https://registry.identifiers.org/registry/doi", "value": "doi:10.5066/F7VX0DMQ", "url": "https://doi.org/10.5066/F7VX0DMQ" } }
schema:sameAs
is used here for the following reasons:
- It doesn't add many more statements that might increase the page weight (a heavier page may cause major search engine crawlers to stop crawling the schema.org markup).
- Crawlers that follow the URL for the short DOI can retrieve structured metadata for the DOI itself:
curl --location --request GET "http://doi.org/fg5v" --header "Accept: application/ld+json"
Adding the schema:variableMeasured field can be done in two ways - a text description of each variable or by using the schema:PropertyValue type to describe the variable in more detail. We highly recommend using the schema:PropertyValue.
In its most basic form, the variable as a schema:PropertyValue can be published as:
{ "@context": "https://schema.org/", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", ... "variableMeasured": [ { "@type": "PropertyValue", "name": "Bottle identifier", "description": "The bottle number for each associated measurement." }, ... ] }
If a URI is available that identifies the variable, it should be included as the propertyID:
{ "@context": [ "https://schema.org/", { "gsn-quantity": "http://www.geoscienceontology.org/geo-lower/quantity#" } ], "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", "variableMeasured": [ { "@type": "PropertyValue", "name": "latitude", "propertyID":"http://www.geoscienceontology.org/geo-lower/quantity#latitude", "url": "https://www.sample-data-repository.org/dataset-parameter/665787", "description": "Latitude where water samples were collected; north is positive.", "unitText": "decimal degrees", "minValue": "15.0", "maxValue": "45.0" }, ... ] }
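A hedged sketch of building a variableMeasured entry in Python follows. The helper name `measured_variable` and its argument names are assumptions made for illustration; the property names are those used in the examples. The range check is an added safeguard, not something schema.org enforces.

```python
def measured_variable(name, description, property_id=None, unit_text=None,
                      min_value=None, max_value=None):
    """Build a schema:PropertyValue describing one measured variable."""
    if min_value is not None and max_value is not None and min_value > max_value:
        raise ValueError("minValue must not exceed maxValue")
    variable = {
        "@type": "PropertyValue",
        "name": name,
        "description": description,
    }
    optional = {
        "propertyID": property_id,
        "unitText": unit_text,
        "minValue": min_value,
        "maxValue": max_value,
    }
    # Only publish the optional properties that were actually supplied.
    variable.update({key: value for key, value in optional.items() if value is not None})
    return variable


latitude = measured_variable(
    "latitude",
    "Latitude where water samples were collected; north is positive.",
    property_id="http://www.geoscienceontology.org/geo-lower/quantity#latitude",
    unit_text="decimal degrees",
    min_value=15.0,
    max_value=45.0,
)
```

Keeping the numeric bounds as numbers until serialization makes the ordering check straightforward.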
For some repositories, defining one or more data collections helps contextualize the datasets. In schema.org, you define these collections using schema:DataCatalog.
The best way to use these DataCatalogs is to define them as an "offering" of your repository and to include an @id property that can be reused in the dataset JSON-LD. For example, the repository JSON-LD defines a schema:DataCatalog with "@id": "https://www.sample-data-repository.org/collection/biological-data".
In the dataset JSON-LD, we reuse that @id to say a dataset belongs in that catalog:
{ "@context": "https://schema.org/", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", ... "includedInDataCatalog": { "@id": "https://www.sample-data-repository.org/collection/biological-data", "@type": "DataCatalog" } }
While this schema.org record represents metadata about a Dataset, many providers will also have other metadata records that may be more complete or that conform to other metadata formats and vocabularies that might be useful. For example, repositories often contain detailed records in ISO TC 211 formats, EML, and other formats. Aggregators and other consumers can make use of this additional metadata if they are linked in a standardized way to the schema.org record. We recommend that the location of the alternative forms of the metadata be provided using the schema:subjectOf and schema:about properties:
- Link metadata documents to a schema:Dataset by using schema:subjectOf.
- Or, if a schema.org snippet describes the metadata as the main resource, link to the Dataset it describes using schema:about.
These two approaches are equivalent, and which is used depends on the subject of the schema.org record.
Once the linkage has been made, further details about the metadata can be provided. We recommend using schema:encodingFormat to indicate the metadata format or vocabulary to which the metadata record conforms. If it conforms to multiple formats, or to both a specific and a more general format, multiple formats can be listed.
We use the schema:DataDownload class for Metadata files so that we can use the schema:MediaObject properties for describing bytesize, encoding, etc.
It can be useful to aggregators and other consumers to indicate when the metadata record was last modified using schema:dateModified, which can be used to optimize harvesting schedules for search indices and other applications.
An example of a metadata reference to an instance of EML-formatted structured metadata, embedded within a schema:Dataset record:
{ "@context": "https://schema.org/", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", "distribution": { "@type": "DataDownload", ... }, "subjectOf": { "@type": "DataDownload", "name": "eml-metadata.xml", "description": "EML metadata describing the dataset", "encodingFormat": ["application/xml", "https://eml.ecoinformatics.org/eml-2.2.0"], "dateModified":"2019-06-12T14:44:15Z" } }
Alternatively, if the schema.org record is meant to describe the metadata record, one could use the inverse property schema:about to indicate the linkage back to the Dataset that it describes. This would be a rarer situation, as typically the schema.org record would be focused on the Dataset itself.
Note that the encodingFormat property contains an array of formats to describe multiple formats to which the document conforms (in this example, the document conforms to both XML and the EML metadata dialect).
Where the schema:url property of the Dataset should point to a landing page, the way to describe how to download the data in a specific format is through the schema:distribution property. The "distribution" property describes where to get the data and in what format by using the schema:DataDownload type. If your dataset is not accessible through a direct download URL, but rather through a service URL that may need input parameters, jump to the next section, Accessing Data through a Service Endpoint.
For data available in multiple formats, there will be multiple values of schema:DataDownload, one per format. A single download is published as:
{ "@context": "https://schema.org/", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", ... "distribution": { "@type": "DataDownload", "contentUrl": "https://www.sample-data-repository.org/dataset/472032.tsv", "encodingFormat": "text/tab-separated-values" } }
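When the same data are offered in several formats, the "distribution" property can hold a list with one schema:DataDownload per format. The sketch below is a hypothetical helper (`distributions` is not a library function), and the ".nc" download URL is an assumed example; only the .tsv URL and the two media types appear in this guide.

```python
def distributions(base_url: str, formats: dict) -> list:
    """Build one schema:DataDownload per (file extension -> media type) pair."""
    return [
        {
            "@type": "DataDownload",
            "contentUrl": f"{base_url}.{extension}",
            "encodingFormat": media_type,
        }
        for extension, media_type in formats.items()
    ]


dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example dataset offered in two formats",
    "distribution": distributions(
        "https://www.sample-data-repository.org/dataset/472032",
        {
            "tsv": "text/tab-separated-values",
            "nc": "application/x-netcdf",  # assumed second format for illustration
        },
    ),
}
```

A JSON-LD consumer reads the resulting array as two separate distribution values of the same Dataset.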
If access to the data requires some input parameters before a download can occur, we can use the schema:potentialAction in this way:
{ "@context": "https://schema.org/", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", ... "potentialAction": { "@type": "SearchAction", "target": { "@type": "EntryPoint", "contentType": ["application/x-netcdf", "text/tab-separated-values"], "urlTemplate": "https://www.sample-data-repository.org/dataset/1234/download?format={format}&startDateTime={start}&endDateTime={end}&bounds={bbox}", "description": "Download dataset 1234 based on the requested format, start/end dates and bounding box", "httpMethod": ["GET", "POST"] }, "query-input": [ { "@type": "PropertyValueSpecification", "valueName": "format", "description": "The desired format requested either 'application/x-netcdf' or 'text/tab-separated-values'", "valueRequired": true, "defaultValue": "application/x-netcdf", "valuePattern": "(application\/x-netcdf|text\/tab-separated-values)" }, { "@type": "PropertyValueSpecification", "valueName": "start", "description": "A UTC ISO DateTime", "valueRequired": false, "valuePattern": "(-?(?:[1-9][0-9]*)?[0-9]{4})-(1[0-2]|0[1-9])-(3[01]|0[1-9]|[12][0-9])T(2[0-3]|[01][0-9]):([0-5][0-9]):([0-5][0-9])(.[0-9]+)?(Z)?" }, { "@type": "PropertyValueSpecification", "valueName": "end", "description": "A UTC ISO DateTime", "valueRequired": false, "valuePattern": "(-?(?:[1-9][0-9]*)?[0-9]{4})-(1[0-2]|0[1-9])-(3[01]|0[1-9]|[12][0-9])T(2[0-3]|[01][0-9]):([0-5][0-9]):([0-5][0-9])(.[0-9]+)?(Z)?" }, { "@type": "PropertyValueSpecification", "valueName": "bbox", "description": "Two points in decimal degrees that create a bounding box formatted as 'lon,lat' of the lower-left corner and 'lon,lat' of the upper-right", "valueRequired": false, "valuePattern": "(-?[0-9]+(.[0-9]+)?),[ ]*(-?[0-9]+(.[0-9]+)?)[ ]*(-?[0-9]+(.[0-9]+)?),[ ]*(-?[0-9]+(.[0-9]+)?)" } ] } }
Here, we use the schema:SearchAction type because it lets you define the query parameters and HTTP methods so that machines can build user interfaces to collect those query parameters and issue a request that provides the user with what they are looking for.
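To illustrate how a client might act on such a SearchAction, the sketch below fills a urlTemplate from its PropertyValueSpecification entries, applying defaultValue, valueRequired, and valuePattern. The function name `expand_template` is hypothetical, the specification list is a reduced version of the example (two parameters, no pattern on start), and leaving an unset optional parameter empty is one possible design choice; a real service may expect such parameters to be dropped instead.

```python
import re
from urllib.parse import quote


def expand_template(url_template: str, query_inputs: list, values: dict) -> str:
    """Fill a urlTemplate using its PropertyValueSpecification entries."""
    url = url_template
    for spec in query_inputs:
        name = spec["valueName"]
        value = values.get(name, spec.get("defaultValue"))
        if value is None:
            if spec.get("valueRequired"):
                raise ValueError(f"missing required parameter: {name}")
            value = ""  # optional parameter left empty
        elif "valuePattern" in spec and not re.fullmatch(spec["valuePattern"], value):
            raise ValueError(f"value for {name} does not match valuePattern")
        # Percent-encode the value so it is safe inside a query string.
        url = url.replace("{" + name + "}", quote(value, safe=""))
    return url


url_template = (
    "https://www.sample-data-repository.org/dataset/1234/download"
    "?format={format}&startDateTime={start}"
)
query_inputs = [
    {
        "valueName": "format",
        "valueRequired": True,
        "defaultValue": "application/x-netcdf",
        "valuePattern": "(application/x-netcdf|text/tab-separated-values)",
    },
    {"valueName": "start", "valueRequired": False},
]

download_url = expand_template(
    url_template, query_inputs, {"format": "text/tab-separated-values"}
)
```

Because the format parameter has a default, calling the function with no values at all still yields a usable URL for the default format.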
Temporal coverage is defined as "the time period during which data was collected or observations were made; or a time period that an activity or collection is linked to intellectually or thematically (for example, 1997 to 1998; the 18th century)" (ARDC RIF-CS). For documentation of Earth Science, Paleobiology or Paleontology datasets, we are interested in the second case-- the time period that data are linked to thematically.
Temporal coverage is a difficult concept to cover across all the possible scenarios. Schema.org uses the ISO 8601 time interval format to describe time intervals and time points, but doesn't provide capabilities for geologic time scales or dynamically generated data up to the present time. We have created our own geologic timescale vocabulary, which is found at https://geoschemas.org/extensions/temporal.html. We ask for your feedback on any temporal coverages you may have that don't currently fit into schema.org. You can follow similar issues at the schema.org GitHub issue queue.
To represent a single date and time:
{ "@context": "https://schema.org/", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", ... "temporalCoverage": "2018-01-22T14:51:12+00:00" }
Or a single date:
{ ... "temporalCoverage": "2018-01-22" }
Or a date range:
{ ... "temporalCoverage": "2012-09-20/2016-01-22" }
Or an open-ended date range (thanks to @lewismc for this example from NASA PO.DAAC) :
{ ... "temporalCoverage": "2012-09-20/.." }
Schema.org also lets you provide date ranges and other temporal coverages through the DateTime data type and URL. For more granular temporal coverages go here: https://schema.org/DateTime.
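The temporalCoverage forms shown above (a single date, a single date-time, a closed range, and an open-ended range) can be read back with a short parser. This is a minimal sketch covering only those four forms, not the whole of ISO 8601; the function names are hypothetical.

```python
from datetime import date, datetime


def parse_point(text: str):
    """Parse an ISO 8601 date or date-time; '..' marks an open end."""
    if text == "..":
        return None
    if "T" in text:
        # Older Python versions do not accept a trailing 'Z', so normalize it.
        return datetime.fromisoformat(text.replace("Z", "+00:00"))
    return date.fromisoformat(text)


def parse_temporal_coverage(value: str):
    """Return (start, end); a single date or date-time gives start == end."""
    if "/" in value:
        start_text, end_text = value.split("/", 1)
        return parse_point(start_text), parse_point(end_text)
    point = parse_point(value)
    return point, point


start, end = parse_temporal_coverage("2012-09-20/..")
# start is date(2012, 9, 20); end is None because the range is open-ended.
```

Returning None for the open end leaves it to the caller to decide whether that means "ongoing" or "unknown".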
Geologic Time
There are many different ways of defining geologic age. See the examples below for a few cases. Descriptions of the vocabulary are at geoschemas.org. More example formats can be found in temporalCoverage.jsonld.
A time interval using the ISO 8601 standard:
{ "@context": [ "https://schema.org/", { "time": "http://www.w3.org/2006/time#", "gstime": "http://schema.geoschemas.org/contexts/temporal#", "ts": "http://resource.geosciml.org/vocabulary/timescale/", "icsc": "http://resource.geosciml.org/classifier/ics/ischart/" } ], "@type": "Dataset", "description": "Eruptive activity at Mt. St. Helens, Washington, March 1980- January 1981; temporal coverage expressed as range of dateTime", "temporalCoverage": "1980-03-27T19:36:00Z/1981-01-03T00:00:00Z", "time:hasTime": { "@type": "time:Interval", "time:hasBeginning": { "@type": "time:Instant", "time:inXSDDateTimeStamp": "1980-03-27T19:36:00Z" }, "time:hasEnd": { "@type": "time:Instant", "time:inXSDDateTimeStamp": "1981-01-03T00:00:00Z" } } }
A geologic age given in millions of years ago (Ma):
{ ... "@type": "Dataset", "description": "Geologic time expressed numerically scaled in millions of years increasing backwards relative to 1950. To specify a Geologic Time Scale, we use an OWL Time Instant. The example below specifies 760,000 years (0.76 Ma) before present", "temporalCoverage": "Eruption of Bishop Tuff, about 760,000 years ago", "time:hasTime": { "@type": "time:Instant", "time:inTimePosition": { "@type": "time:TimePosition", "time:hasTRS": {"@id": "gstime:MillionsOfYears"}, "time:numericPosition": 0.76 } } }
A geologic age with an uncertainty given at two-sigma:
{ ... "@type": "Dataset", "description": "Example of a geologic time with an uncertainty. Very old zircons from the Jack Hills formation, Australia, 4.404 +- 0.008 Ga (2-sigma)", "temporalCoverage": "Age of one of the oldest zircons found on Earth, from the Jack Hills, Australia, 4.404 +- 0.008 Ga (2-sigma)", "time:hasTime": { "@type": "time:Instant", "time:inTimePosition": { "@type": "time:TimePosition", "time:hasTRS": {"@id": "gstime:BillionsOfYears"}, "time:numericPosition": 4.404 }, "gstime:uncertainty": 0.008, "gstime:uncertaintySigma": 2 } }
A geologic interval bounded by two eras:
{ ... "@type": "Dataset", "description": "Temporal position expressed with an interval bounded by named time ordinal eras from the International Chronostratigraphic Chart (https://stratigraphy.org/chart)", "temporalCoverage": "Triassic to Jurassic", "time:hasTime": { "@type": "time:Interval", "time:hasBeginning": { "@type": "time:Instant", "time:inTimePosition": { "@type": "time:TimePosition", "time:hasTRS": {"@id": "ts:gts2020"}, "time:nominalPosition": { "@value": "icsc:Triassic", "@type": "xsd:anyURI" } } }, "time:hasEnd": { "@type": "time:Instant", "time:inTimePosition": { "@type": "time:TimePosition", "time:hasTRS": {"@id": "ts:gts2020"}, "time:nominalPosition": { "@value": "icsc:Jurassic", "@type": "xsd:anyURI" } } } } }
Used to document the location on Earth that is the focus of the dataset content, using schema:Place. Recommended practice is to use the schema:geo property with either a schema:GeoCoordinates object to specify a point location, or a schema:GeoShape object to specify a line or area coverage extent. Coordinates describing these extents are expressed as latitude longitude tuples (in that order) using decimal degrees.
Schema.org documentation does not specify a convention for the coordinate reference system; our recommended practice is to use WGS84 for at least one spatial coverage description if applicable. Spatial coverage locations using other coordinate systems can be included; see the recommendation for specifying coordinate reference systems, below.
A point location is specified by a schema:GeoCoordinates object with schema:latitude and schema:longitude properties. Not recommended: the schema:Place definition allows the latitude and longitude of a point location to be specified directly as properties of the place. Although this is more succinct, it makes parsing the metadata more complex and should be avoided.
Point locations are recommended for data that is associated with specific sample locations, particularly if these are widely spaced such that an enclosing bounding box would be a misleading representation of the spatial location. Be aware that some client applications might only index or display bounding box extents or a single point location.
A schema:Dataset that is about a point location would be documented in this way:
{ "@context": "https://schema.org/", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton ....", ... "spatialCoverage": { "@type": "Place", "geo": { "@type": "GeoCoordinates", "latitude": 39.3280, "longitude": 120.1633 } } }
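A point spatialCoverage can be generated with a small helper that also checks the coordinates are plausible decimal degrees. The function name `point_coverage` is hypothetical, and the accepted ranges (latitude from -90 to 90, longitude from -180 to 180, east positive) are an assumption consistent with WGS84 decimal degrees rather than a rule stated by schema.org.

```python
def point_coverage(latitude: float, longitude: float) -> dict:
    """Build a schema:Place with a GeoCoordinates point in decimal degrees."""
    if not -90.0 <= latitude <= 90.0:
        raise ValueError("latitude must be between -90 and 90")
    if not -180.0 <= longitude <= 180.0:
        raise ValueError("longitude must be between -180 and 180")
    return {
        "@type": "Place",
        "geo": {
            "@type": "GeoCoordinates",
            "latitude": latitude,
            "longitude": longitude,
        },
    }


spatial_coverage = point_coverage(39.3280, 120.1633)
```

A check like this catches the common mistake of swapping latitude and longitude whenever the longitude magnitude exceeds 90.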
A schema:GeoShape can describe spatial coverage as a line (e.g. a ship track), a bounding box, a polygon, or a circle. The geometry is described with a set of latitude/longitude pairs. The spatial definitions were added to schema.org early in its development based on the GeoRSS specification. The documentation for schema:GeoShape states "Either whitespace or commas can be used to separate latitude and longitude; whitespace should be used when writing a list of several such points." At least for bounding boxes (see the discussion below), it appears that the Google Dataset Search parsing of the coordinate strings depends on whether a comma or space is used to delimit the coordinates in an individual tuple.
Be aware that some client applications might only index or display bounding box extents.
- line - a series of two or more points.
- polygon - a series of four or more points where the first and final points are identical.
- box - A rectangular (in lat-long space) extent specified by two points, the first in the lower left (southwest) corner and the second in the upper right (northeast) corner.
- circle - A circular region of a specified radius centered at a specified latitude and longitude, represented as a coordinate pair followed by a radius in meters. Not recommended for use.
Examples
Linear spatial location
A line spatial location is useful for data that were collected along a traverse, ship track, flight line or other linear sampling feature.
"spatialCoverage": { "@type": "Place", "geo": { "@type": "GeoShape", "line": "39.3280 120.1633 40.445 123.7878" } } }
Polygon spatial location
A polygon provides the most precise approach to delineating the spatial extent of the focus area for a dataset, but polygon spatial locations might not be recognized (indexed, displayed) by some client applications.
"polygon": "39.3280 120.1633 40.445 123.7878 41 121 39.77 122.42 39.3280 120.1633"
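Because a polygon must have four or more points with identical first and final points, a producer may want to check a coordinate string before publishing it. This is a small Python sketch under the assumption that coordinates are whitespace-delimited latitude longitude pairs; the function names are hypothetical.

```python
def parse_points(coords: str) -> list[tuple[float, float]]:
    """Split a whitespace-delimited string into (latitude, longitude) tuples."""
    values = [float(v) for v in coords.split()]
    if len(values) % 2 != 0:
        raise ValueError("coordinate string must contain latitude/longitude pairs")
    return list(zip(values[0::2], values[1::2]))

def is_valid_polygon(coords: str) -> bool:
    """Check the polygon rule: four or more points, first and last identical."""
    points = parse_points(coords)
    return len(points) >= 4 and points[0] == points[-1]

polygon = "39.3280 120.1633 40.445 123.7878 41 121 39.77 122.42 39.3280 120.1633"
print(is_valid_polygon(polygon))
```

A two-point line string fails the same check, which makes it easy to catch a line that was mislabeled as a polygon.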
Bounding Boxes
A GeoShape box defines an area on the surface of the earth by the point locations of the southwest corner and northeast corner of the rectangle in latitude-longitude coordinates. Point locations are tuples of {latitude east-longitude} (y x). The schema.org GeoShape documentation states "Either whitespace or commas can be used to separate latitude and longitude; whitespace should be used when writing a list of several such points." Since the box is a list of points, a space should be used to separate the latitude and longitude values. The two corner coordinate points are separated by a space. 'East longitude' means positive longitude values are east of the prime (Greenwich) meridian. A box where the 'lower-left' (southwest) corner is 39.3280/120.1633 and the 'upper-right' (northeast) corner is 40.445/123.7878 would be encoded thus:
"box": "39.3280 120.1633 40.445 123.7878"
NOTE: see the discussion in GitHub issue 101 on what works with Google Dataset Search to display spatial location in their search results.
East longitude values can be reported as 0 <= X <= 360 or -180 <= X <= 180. Some applications will fail under one or the other of these conventions. The recommendation is to use -180 <= X <= 180, consistent with the WKT specification. Following this recommendation, for bounding boxes that cross the antimeridian at ±180° longitude, the West longitude value will be numerically greater than the East longitude value. For example, to describe Fiji the box might be
"box": "-19 176 -15 -178"
NOTES: Some spatial data processors will not correctly interpret the bounding coordinates across the antimeridian even if they follow the recommended southwest, northeast corner convention, resulting in boxes that span the circumference of the Earth, excluding the actual area of interest. For applications operating with data in the vicinity of longitude 180, testing is strongly recommended to determine if it works for bounding boxes crossing the antimeridian (+/- 180); an alternative is to define two bounding boxes, one on each side of 180.
For bounding boxes that include the north or south pole, schema:box will not work. Recommended practice is to use a schema:polygon to describe spatial location extents that include the poles.
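The corner ordering and the antimeridian workaround described above can be sketched in Python as follows. This is an illustrative sketch only: the function names are hypothetical, and it assumes longitudes follow the -180 to 180 convention, so that a west value greater than the east value signals a crossing.

```python
def box_string(south: float, west: float, north: float, east: float) -> str:
    """Format a GeoShape box as 'south west north east' (southwest, then northeast)."""
    if south > north:
        raise ValueError("south latitude must not exceed north latitude")
    return f"{south} {west} {north} {east}"

def crosses_antimeridian(west: float, east: float) -> bool:
    """With -180..180 longitudes, west > east means the box spans the antimeridian."""
    return west > east

def split_at_antimeridian(south: float, west: float, north: float, east: float) -> list[str]:
    """Return one box, or two boxes (one on each side of 180) if the extent crosses it."""
    if not crosses_antimeridian(west, east):
        return [box_string(south, west, north, east)]
    return [
        box_string(south, west, north, 180),
        box_string(south, -180, north, east),
    ]

print(box_string(-19, 176, -15, -178))             # -19 176 -15 -178
print(split_at_antimeridian(-19, 176, -15, -178))  # ['-19 176 -15 180', '-19 -180 -15 -178']
```

Publishing the two split boxes as an array of GeoShape objects is one way to serve clients that mishandle a single crossing box; whether a given consumer indexes multiple geometries still needs to be tested.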
Multiple geometries
If you have multiple geometries, you can publish those by making the schema:geo field an array of GeoShape or GeoCoordinates like so:
{ ... "spatialCoverage": { "@type": "Place", "geo": [ { "@type": "GeoCoordinates", "latitude": -17.65, "longitude": 50 }, { "@type": "GeoCoordinates", "latitude": -19, "longitude": 51 }, ... ] } ... }
Be aware that some client applications might not index or display multiple geometries.
A Spatial Reference System (SRS) or Coordinate Reference System (CRS) is the method for defining the frame of reference for geospatial location representation. Schema.org currently has no defined property for specifying a Spatial Reference System; the assumption is that coordinates are WGS84 decimal degrees.
In the meantime, to represent an SRS in schema.org, we recommend using the schema:additionalProperty property to specify an object of type schema:PropertyValue, with a schema:propertyID of http://dbpedia.org/resource/Spatial_reference_system to identify the property as a spatial reference system, and a schema:value that is a URI (IRI) identifying a specific SRS. Some commonly used values are:
Spatial Reference System | IRI |
---|---|
WGS84 | http://www.w3.org/2003/01/geo/wgs84_pos#lat_long |
CRS84 | http://www.opengis.net/def/crs/OGC/1.3/CRS84 |
EPSG:26911 | https://spatialreference.org/ref/epsg/nad83-utm-zone-11n/ |
EPSG:3413 | https://spatialreference.org/ref/epsg/wgs-84-nsidc-sea-ice-polar-stereographic-north/ |
NOTE: Beware of coordinate order differences. WGS84 in the table above specifies latitude, longitude coordinate order, whereas CRS84 specifies longitude, latitude order (like GeoJSON). WGS84 is the assumed typical value for coordinates, so in general the SRS does not need to be specified.
A spatial reference system can be added in this way:
{ "@context": [ "https://schema.org/", { "dbpedia": "http://dbpedia.org/resource/" } ], "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", ... "spatialCoverage": { "@type": "Place", "geo": { "@type": "GeoShape", "line": "39.3280 120.1633 40.445 123.7878" }, "additionalProperty": { "@type": "PropertyValue", "propertyID":"http://dbpedia.org/resource/Spatial_reference_system", "value": "http://www.w3.org/2003/01/geo/wgs84_pos#lat_long" } } }
People can be linked to datasets using three fields: author, creator, and contributor. Since schema:contributor is defined as a secondary author, and schema:creator is defined as being synonymous with the schema:author field, we recommend using the more expressive fields creator and contributor, but using any of these fields is acceptable.
NOTE: JSON-LD doesn't preserve the order of its collection values (for more, see Getting Started - JSON-LD Lists), but we can preserve the order of people roles by applying the @list JSON-LD keyword. Given the following creator JSON-LD block:
{
...
"creator": [
{
"@type": "Person",
"name": "Creator #1"
},
{
"@type": "Person",
"name": "Creator #2"
}
]
}
The order of these creators can be preserved by using the @list JSON-LD keyword:
{
...
"creator": {
"@list": [
{
"@type": "Person",
"name": "Creator #1"
},
{
"@type": "Person",
"name": "Creator #2"
}
]
}
}
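When the metadata is generated by code, the @list wrapper can be applied in one step. The following Python sketch assumes the creators are already held in an ordered list; the helper name `ordered` and the placeholder dataset name are hypothetical.

```python
import json

def ordered(values: list) -> dict:
    """Wrap a sequence in the JSON-LD @list keyword so consumers keep its order."""
    return {"@list": list(values)}

creators = [
    {"@type": "Person", "name": "Creator #1"},
    {"@type": "Person", "name": "Creator #2"},
]

dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example dataset (placeholder name)",
    "creator": ordered(creators),
}
print(json.dumps(dataset, indent=2))
```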
Because there are more things that can be said about how and when a person contributed to a Dataset, we use the schema:Role. You'll notice that the schema.org documentation does not state that the Role type is an expected data type of author, creator and contributor, but that is addressed in this blog post introducing Role into schema.org. Thanks to Stephen Richard for this contribution.
{ "@context": "https://schema.org/", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", ... "creator": [ { "@id": "https://www.sample-data-repository.org/person-role/472036", "@type": "Role", "roleName": "Principal Investigator", "creator": { "@id": "https://www.sample-data-repository.org/person/51317", "@type": "Person", "name": "Dr Uta Passow", "givenName": "Uta", "familyName": "Passow", "url": "https://www.sample-data-repository.org/person/51317" } }, { "@id": "https://www.sample-data-repository.org/person-role/472038", "@type": "Role", "roleName": "Co-Principal Investigator", "url": "https://www.sample-data-repository.org/person-role/472038", "creator": { "@id": "https://www.sample-data-repository.org/person/50663", "@type": "Person", "identifier": { "@id": "https://orcid.org/0000-0003-3432-2297", "@type": "PropertyValue", "propertyID": "https://registry.identifiers.org/registry/orcid", "url": "https://orcid.org/0000-0003-3432-2297", "value": "orcid:0000-0003-3432-2297" }, "name": "Dr Mark Brzezinski", "url": "https://www.sample-data-repository.org/person/50663" } } ] }
NOTE that the Role inherits the properties creator and contributor from the Dataset when pointing to the schema:Person.
{ "@context": "https://schema.org/", "@type": "Dataset", ... "creator": [ { "@id": "https://www.sample-data-repository.org/person-role/472036", "@type": "Role", "roleName": "Principal Investigator", "url": "https://www.sample-data-repository.org/person-role/472036", "creator": { "@id": "https://www.sample-data-repository.org/person/51317", "@type": "Person", "name": "Dr Uta Passow", "givenName": "Uta", "familyName": "Passow", "url": "https://www.sample-data-repository.org/person/51317" } } ] }
If a single Person plays multiple roles on a Dataset, each role should be explicitly defined in this way:
{ "@context": "https://schema.org/", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", ... "creator": [ { "@id": "https://www.sample-data-repository.org/person-role/472036", "@type": "Role", "roleName": "Principal Investigator", "url": "https://www.sample-data-repository.org/person-role/472036", "creator": { "@id": "https://www.sample-data-repository.org/person/51317", "@type": "Person", "name": "Dr Uta Passow", "givenName": "Uta", "familyName": "Passow", "url": "https://www.sample-data-repository.org/person/51317" } }, { "@id": "https://www.sample-data-repository.org/person-role/472037", "@type": "Role", "roleName": "Contact", "url": "https://www.sample-data-repository.org/person-role/472037", "creator": { "@id": "https://www.sample-data-repository.org/person/51317" } }, { "@id": "https://www.sample-data-repository.org/person-role/472038", "@type": "Role", "roleName": "Co-Principal Investigator", "url": "https://www.sample-data-repository.org/person-role/472038", "creator": { "@id": "https://www.sample-data-repository.org/person/50663", "@type": "Person", "identifier": { "@id": "https://orcid.org/0000-0003-3432-2297", "@type": "PropertyValue", "propertyID": "https://registry.identifiers.org/registry/orcid", "url": "https://orcid.org/0000-0003-3432-2297", "value": "orcid:0000-0003-3432-2297" }, "name": "Dr Mark Brzezinski", "url": "https://www.sample-data-repository.org/person/50663" } } ] }
Notice that since Uta Passow has already been defined in the document with "@id": "https://www.sample-data-repository.org/person/51317" for her role as Principal Investigator, the @id can be used for her role as Contact by defining the Role's creator as "creator": { "@id": "https://www.sample-data-repository.org/person/51317" }.
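The pattern of describing a person fully once and referring to them by @id afterwards can be automated. This is a minimal Python sketch, assuming every person record carries an @id; the function name `role_entries` is hypothetical, and the Role entries omit their own @id and url for brevity.

```python
def role_entries(assignments: list) -> list:
    """Build schema:Role entries from (role name, person) pairs.

    The first time a person appears, the full description is embedded;
    later roles for the same person reference only the person's @id.
    """
    seen_ids = set()
    entries = []
    for role_name, person in assignments:
        person_id = person["@id"]
        if person_id in seen_ids:
            target = {"@id": person_id}   # already described: reference only
        else:
            target = dict(person)         # first occurrence: full description
            seen_ids.add(person_id)
        entries.append({"@type": "Role", "roleName": role_name, "creator": target})
    return entries

uta = {
    "@id": "https://www.sample-data-repository.org/person/51317",
    "@type": "Person",
    "name": "Dr Uta Passow",
}
entries = role_entries([("Principal Investigator", uta), ("Contact", uta)])
print(entries[1]["creator"])
```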
If your repository is the publisher and/or provider of the dataset then you don't have to describe your repository as a schema:Organization if your repository markup includes the @id. For example, if you published repository markup such as:
{ "@context": "https://schema.org/", "@type": ["Service", "Organization"], ... "@id": "https://www.sample-data-repository.org" ... }
then you can reuse that @id here. Harvesters such as Google and Project418 will make the appropriate linkages and your dataset publisher/provider can be published in this way:
{ "@context": "https://schema.org/", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", ... "provider": { "@id": "https://www.sample-data-repository.org" }, "publisher": { "@id": "https://www.sample-data-repository.org" } }
Otherwise, you can define the organization inline in this way:
{ "@context": "https://schema.org/", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", ... "provider": { "@id": "https://www.sample-data-repository.org", "@type": "Organization", "legalName": "Sample Data Repository Office", "name": "SDRO", "sameAs": "http://www.re3data.org/repository/r3dxxxxxxxxx", "url": "https://www.sample-data-repository.org" }, "publisher": { "@id": "https://www.sample-data-repository.org" } }
Data providers should include funding information in their Dataset descriptions to enable discovery and cross-linking. The information that would be useful includes the title, identifier, and url of the grant or award, along with structured information about the funding organization, including its name and identifier. Organizational identifiers are best represented using either a general purpose institutional identifier such as a ROR, GRID, or ISNI identifier, or a more specific Funder ID from the Crossref Funder Registry. The ROR for the National Science Foundation (https://ror.org/021nxhr62), for example, provides linkages to related identifiers as well. The Funder ID has the advantage that it covers not only agency funders like the National Science Foundation (http://dx.doi.org/10.13039/100000001) but also individual funding programs within those agencies, such as the NSF GEO Directorate (https://api.crossref.org/funders/100000085). When possible, providing both a ROR and Funder ID is helpful.
Linking a Dataset to the grants and awards that fund it can be achieved by adding a schema:MonetaryGrant through the schema:funding property.
{ "@context": "https://schema.org/", "@type": "Dataset", "@id": "https://doi.org/10.18739/A22V2CB44", "name": "Stable water isotope data from Arctic Alaska snow pits in 2019", "funding": [ { "@id": "https://www.nsf.gov/awardsearch/showAward?AWD_ID=1604105", "@type": "MonetaryGrant", "identifier": "1604105", "name": "Collaborative Research: Nutritional Landscapes of Arctic Caribou: Observations, Experiments, and Models Provide Process-Level Understanding of Forage Traits and Trajectories", "url": "https://www.nsf.gov/awardsearch/showAward?AWD_ID=1604105", "funder": { "@id": "http://dx.doi.org/10.13039/100000001", "@type": "Organization", "name": "National Science Foundation", "identifier": [ "http://dx.doi.org/10.13039/100000001", "https://ror.org/021nxhr62" ] } }, { "@type": "MonetaryGrant", "@id": "https://akareport.aka.fi/ibi_apps/WFServlet?IBIF_ex=x_hakkuvaus2&HAKNRO1=316349&UILANG=en&TULOSTE=HTML", "identifier": "316349", "name": "Where does water go when snow melts? New spatio-temporal resolution in stable water isotopes measurements to inform cold climate hydrological modelling", "url": "https://akareport.aka.fi/ibi_apps/WFServlet?IBIF_ex=x_hakkuvaus2&HAKNRO1=316349&UILANG=en&TULOSTE=HTML", "funder": { "@id": "http://dx.doi.org/10.13039/501100002341", "@type": "Organization", "name": "Academy of Finland", "identifier": [ "http://dx.doi.org/10.13039/501100002341", "https://ror.org/05k73zm37" ] } } ] }
We recommend providing as much structured information about the grants that fund a Dataset as possible so that aggregators and harvesters can crosslink to the funding agencies and grants that provided resources for the Dataset.
Link a Dataset to its license to document legal constraints by adding a schema:license property. The guide recommends providing a URL that unambiguously identifies a specific version of the license used, but for many licenses it is hard to determine what that URL should be. Thus, we recommend that the license URL be drawn from the SPDX license list, which provides a curated list of licenses and their properties that is well maintained. For each SPDX entry, SPDX provides a canonical URL for the license (e.g., http://spdx.org/licenses/CC0-1.0), a unique licenseId (e.g., CC0-1.0), and other metadata about the license. Here's an example using the SPDX license URI for the Creative Commons CC-0 license:
{ "@context": "https://schema.org/", "@id": "http://www.sample-data-repository.org/dataset/123", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", "license": "http://spdx.org/licenses/CC0-1.0" ... }
SPDX URIs for each license can be found by locating the appropriate license in the SPDX license list and then removing the final .html extension from the filename. For example, in the table one can find the license page for Apache at the URI https://spdx.org/licenses/Apache-2.0.html, which can be converted into the associated linked data URI by removing the .html, leaving us with https://spdx.org/licenses/Apache-2.0. Alternatively, one can find the license file in the structured data listings and copy the URL from the associated file. For example, the URL for the Apache-2.0 license is listed in the file at https://github.com/spdx/license-list-data/blob/master/rdfturtle/Apache-2.0.turtle.
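The conversion from an SPDX license page URL to its linked data URI is a simple suffix removal, sketched here in Python. Both function names are hypothetical, and `spdx_uri` assumes the license identifier is taken verbatim from the SPDX license list.

```python
def spdx_uri_from_page(page_url: str) -> str:
    """Drop a trailing .html from an SPDX license page URL, if present."""
    suffix = ".html"
    if page_url.endswith(suffix):
        return page_url[: -len(suffix)]
    return page_url

def spdx_uri(license_id: str) -> str:
    """Build the SPDX URI from an SPDX license identifier such as 'CC0-1.0'."""
    return f"https://spdx.org/licenses/{license_id}"

print(spdx_uri_from_page("https://spdx.org/licenses/Apache-2.0.html"))
# https://spdx.org/licenses/Apache-2.0
print(spdx_uri("CC0-1.0"))
# https://spdx.org/licenses/CC0-1.0
```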
While many licenses are ambiguous about their license URI, the Creative Commons licenses and a few others are exceptions in that they provide extremely consistent URIs for each license, and these are in widespread use. So, while we recommend using the SPDX URI, we recognize that some sites may want to use the CC license URIs directly, which is helpful in recognizing the license. In this case, we recommend that the SPDX URI still be used as described above, and that the other URI also be provided in a list. Here's an example using the traditional Creative Commons URI along with the SPDX URI.
{ "@context": "https://schema.org/", "@id": "http://www.sample-data-repository.org/dataset/123", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", "license": [ "http://spdx.org/licenses/CC0-1.0", "https://creativecommons.org/publicdomain/zero/1.0"] ... }
The following table contains the SPDX URIs for some of the most common licenses. Others can be looked up at the SPDX site as described above.
License | SPDX URI |
---|---|
Apache-2.0 | https://spdx.org/licenses/Apache-2.0 |
BSD-3-Clause | https://spdx.org/licenses/BSD-3-Clause |
CC-BY-3.0 | https://spdx.org/licenses/CC-BY-3.0 |
CC-BY-4.0 | https://spdx.org/licenses/CC-BY-4.0 |
CC-BY-SA-4.0 | https://spdx.org/licenses/CC-BY-SA-4.0 |
CC0-1.0 | https://spdx.org/licenses/CC0-1.0 |
GPL-3.0-only | https://spdx.org/licenses/GPL-3.0-only |
GPL-3.0-or-later | https://spdx.org/licenses/GPL-3.0-or-later |
MIT | https://spdx.org/licenses/MIT |
MIT-0 | https://spdx.org/licenses/MIT-0 |
A schema:Dataset can be composed of multiple digital objects which are listed in the schema:distribution list. For each schema:DataDownload, it can be useful to provide a cryptographic checksum value (like SHA 256 or MD5) that can be used to characterize the contents of the object. Aggregators and distributors can use these values to verify that they have retrieved exactly the same content as the original provider made available, and that replica copies of an object are identical to the original, among other uses. Because schema.org does not contain a class for representing checksum values, by convention we recommend using the spdx:checksum property, which points at an spdx:Checksum instance that provides both the value of the checksum and the algorithm that was used to calculate the checksum.
Here's an example that provides two different checksum values for a single digital object within a schema:DataDownload description. Note that providers will need to define the spdx prefix in their @context block in order to use the prefix as shown in the example.
{ "@context": [ "https://schema.org/", { "spdx": "http://spdx.org/rdf/terms#" } ], "@type": "Dataset", "@id": "https://dataone.org/datasets/doi%3A10.18739%2FA2NK36607", "sameAs": "https://doi.org/10.18739/A2NK36607", "name": "Conductivity-Temperature-Depth (CTD) data along DBO5 (Distributed Biological Observatory - Barrow Canyon), from the 2009 Circulation, Cross-shelf Exchange, Sea Ice, and Marine Mammal Habitat on the Alaskan Beaufort Sea Shelf cruise on USCGC Healy (HLY0904)", "distribution": { "@type": "DataDownload", "@id": "https://dataone.org/datasets/urn%3Auuid%3A2646d817-9897-4875-9429-9c196be5c2ae", "identifier": "urn:uuid:2646d817-9897-4875-9429-9c196be5c2ae", "spdx:checksum": [ { "@type": "spdx:Checksum", "spdx:checksumValue": "39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51", "spdx:checksumAlgorithm": { "@id": "spdx:checksumAlgorithm_sha256" } }, { "@type": "spdx:Checksum", "spdx:checksumValue": "65d3616852dbf7b1a6d4b53b00626032", "spdx:checksumAlgorithm": { "@id": "spdx:checksumAlgorithm_md5" } } ] } }
The algorithm property is chosen from the controlled SPDX vocabulary of checksum types, making it easy for processors to recalculate checksum values to verify them. Common algorithms that many providers would use include spdx:checksumAlgorithm_sha256 and spdx:checksumAlgorithm_md5. Note specifically that the spdx:checksumAlgorithm_sha256 value is inside of an @id property so that the SPDX namespace from the context definition is used to define the algorithm URI.
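Checksum entries of this shape can be computed directly from the bytes of a digital object. The Python sketch below uses the standard library hashlib module; the function name `spdx_checksums` is hypothetical, and the sketch assumes the object fits in memory (a large file would be hashed in chunks instead).

```python
import hashlib
import json

def spdx_checksums(content: bytes) -> list:
    """Return spdx:Checksum entries (SHA-256 and MD5) for the given bytes."""
    algorithms = {
        "sha256": "spdx:checksumAlgorithm_sha256",
        "md5": "spdx:checksumAlgorithm_md5",
    }
    return [
        {
            "@type": "spdx:Checksum",
            "spdx:checksumValue": hashlib.new(name, content).hexdigest(),
            # The algorithm goes inside @id so the spdx prefix expands to a URI.
            "spdx:checksumAlgorithm": {"@id": algorithm_id},
        }
        for name, algorithm_id in algorithms.items()
    ]

distribution = {
    "@type": "DataDownload",
    "spdx:checksum": spdx_checksums(b"example file contents"),
}
print(json.dumps(distribution, indent=2))
```

A consumer can run the same calculation on the bytes it retrieved and compare the hex digests to confirm the copy is identical.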
High level relationships that link datasets based on their processing workflows and versioning relationships are critical for data consumers and search engines to link different versions of a schema:Dataset, to clarify when a dataset is derived from one or more source Datasets, and to specify linkages to the software and activities that created these derived datasets for reproducibility. Collectively, this is provenance information.
The PROV-O recommendation provides the widely-adopted vocabulary for representing this type of provenance information, and should be used within Dataset descriptions, as most of the necessary provenance properties are currently missing from schema.org. The main exception is schema:isBasedOn, which provides a predicate for indicating that a Dataset was derived from one or more source Datasets. Producers and consumers should interpret schema:isBasedOn to be an equivalent property to prov:wasDerivedFrom (in the owl:equivalentProperty sense). Either is acceptable for representing derivation relationships, but there is utility in expressing the relationship with both predicates for consumers that might only be looking for one or the other. When other PROV predicates are used, it is preferred to use prov:wasDerivedFrom for consistency.
We recommend providing provenance information about data processing workflows, data derivation relationships, and versioning information using PROV-O and schema.org predicates, and describe the structures to do this in the following subsections. Aggregators and search systems should use these properties to cluster and cross-link versions of Datasets, and to provide bi-directional linkages to source and derived data products.
Link a Dataset to a prior version that it replaces by adding a prov:wasRevisionOf property. This indicates that the current schema:Dataset replaces or obsoletes the source Dataset indicated. The value of the prov:wasRevisionOf should be the canonical IRI for the identifier for the original dataset, preferably a persistently resolvable IRI such as a DOI, but other persistent identifiers for the dataset can be used.
{ "@context": [ "https://schema.org/", { "prov": "http://www.w3.org/ns/prov#" } ], "@id": "https://doi.org/10.xxxx/Dataset-2.v2", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", "prov:wasRevisionOf": { "@id": "https://doi.org/10.xxxx/Dataset-2.v1" } }
A derived Dataset is one in which the values in the data are somehow related or created from the values in one or more source datasets. For example, raw voltage values from a sensor might be recorded in a raw data file, which is then processed through calibration functions to produce a derived dataset with values in scientific units. Other examples of derived data include data that has been error corrected, gap-filled, or integrated with other sources.
To indicate that a Dataset has been derived from a source Dataset, use the prov:wasDerivedFrom property. This indicates that the current schema:Dataset was created in whole or in part from content in the source Dataset, and therefore does not represent an independent set of measurements. The value of the prov:wasDerivedFrom should be the canonical IRI for the identifier for the source dataset, preferably a persistently resolvable IRI such as a DOI, but other persistent identifiers for the dataset can be used. In addition, if a persistent identifier for a digital object within a Dataset is available, the prov:wasDerivedFrom may also be used to indicate that that digital object was derived from that particular source object, rather than the overall Dataset. This allows one to be more specific about the exact relationship between the source and derived data objects.
In addition to prov:wasDerivedFrom, schema.org provides the schema:isBasedOn property, which should be considered to be an equivalent property to prov:wasDerivedFrom. For compatibility with schema.org, we recommend that producers use schema:isBasedOn in addition to or instead of prov:wasDerivedFrom to indicate derivation relationships.
{ "@context": [ "https://schema.org/", { "prov": "http://www.w3.org/ns/prov#" } ], "@id": "https://doi.org/10.xxxx/Dataset-2", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", "prov:wasDerivedFrom": { "@id": "https://doi.org/10.xxxx/Dataset-1" }, "isBasedOn": { "@id": "https://doi.org/10.xxxx/Dataset-1" } }
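Since both predicates are treated as equivalent, a producer can emit them together from one list of source identifiers. This Python sketch is illustrative only: the function name `derivation_links` is hypothetical, and the DOIs are the placeholder values used in the example above.

```python
def derivation_links(source_ids: list) -> dict:
    """Express derivation with both prov:wasDerivedFrom and schema:isBasedOn."""
    if not source_ids:
        raise ValueError("at least one source dataset identifier is required")
    refs = [{"@id": source_id} for source_id in source_ids]
    # A single source is written as one object, several sources as an array.
    value = refs[0] if len(refs) == 1 else refs
    return {"prov:wasDerivedFrom": value, "isBasedOn": value}

dataset = {
    "@context": ["https://schema.org/", {"prov": "http://www.w3.org/ns/prov#"}],
    "@id": "https://doi.org/10.xxxx/Dataset-2",
    "@type": "Dataset",
    **derivation_links(["https://doi.org/10.xxxx/Dataset-1"]),
}
print(dataset["isBasedOn"])
```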
Frequently data are processed to create derived Datasets or other products using software programs that use some source data, transform it in various ways, and create the derived products. Understanding these software workflows promotes understanding of the products, and facilitates reproducibility. Describing a software workflow is really just a mechanism to provide more detail about how derived products were created when software was executed. The ProvONE vocabulary extends PROV to define a specific concept for an execution event (provone:Execution) during which a software program (provone:Program) is executed. During this execution, the software can use source data (prov:used) and generate outputs (prov:wasGeneratedBy), which then can be inferred to have been derived from the source data.
Any portion of the software workflow can be described to increase information about derived datasets. For example, use prov:used to link an execution to one or more source datasets, and use prov:wasGeneratedBy to link an execution to one or more derived products. When information about the execution event itself is known, use provone:Execution to describe that event, and link it to the source and derived products, as well as the program. The program is often a software script that is itself dereferenceable, and may be part of the archived Dataset itself if it has an accessible IRI.
{ "@context": [ "https://schema.org/", { "prov": "http://www.w3.org/ns/prov#", "provone": "http://purl.dataone.org/provone/2015/01/15/ontology#" } ], "@id": "https://doi.org/10.xxxx/Dataset-2", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", "prov:wasDerivedFrom": { "@id": "https://doi.org/10.xxxx/Dataset-1" }, "schema:isBasedOn": { "@id": "https://doi.org/10.xxxx/Dataset-1" }, "prov:wasGeneratedBy": { "@id": "https://example.org/executions/execution-42", "@type": "provone:Execution", "prov:hadPlan": "https://somerepository.org/datasets/10.xxxx/Dataset-2.v2/process-script.R", "prov:used": { "@id": "https://doi.org/10.xxxx/Dataset-1" } } }