Home | Dataset
- Describing a Dataset
Google has drafted a guide to help publishers. The guide describes the only required fields as - name and description.
- name - A descriptive name of a dataset (e.g., “Snow depth in Northern Hemisphere”)
- description - A short summary describing a dataset.
{ "@context": "https://schema.org/", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", "description": "This dataset includes results of laboratory experiments which measured dissolved organic carbon (DOC) usage by natural bacteria in seawater at different pCO2 levels. Included in this dataset are; bacterial abundance, total organic carbon (TOC), what DOC was added to the experiment, target pCO2 level. " }
The Google guide also recommends following fields:
- url - Location of a page describing the dataset.
- sameAs - Other URLs that can be used to access the dataset page. A link to a page that provides more information about the same dataset, usually in a different repository.
- version - The version number or identifier for this dataset (text or numeric).
- isAccessibleForFree - Boolean (true|false) specifying if the dataset is accessible for free.
- keywords - Keywords summarizing the dataset.
- identifier - An identifier for the dataset, such as a DOI. (text,URL, or PropertyValue).
- variableMeasured - What does the dataset measure? (e.g., temperature, pressure)
{ "@context": "https://schema.org/", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", "description": "This dataset includes results of laboratory experiments which measured dissolved organic carbon (DOC) usage by natural bacteria in seawater at different pCO2 levels. Included in this dataset are; bacterial abundance, total organic carbon (TOC), what DOC was added to the experiment, target pCO2 level. ", "url": "https://www.sample-data-repository.org/dataset/472032", "sameAs": "https://search.dataone.org/#view/https://www.sample-data-repository.org/dataset/472032", "version": "2013-11-21", "isAccessibleForFree": true, "keywords": ["ocean acidification", "Dissolved Organic Carbon", "bacterioplankton respiration", "pCO2", "carbon dioxide", "oceans"], "license": [ "http://spdx.org/licenses/CC0-1.0", "https://creativecommons.org/publicdomain/zero/1.0"] }
Back to top
Adding the schema:keywords field can be done in three ways - a text description, a URL, or by using schema:DefinedTerm. We recommend using schema:DefinedTerm
if a keyword comes from a controlled vocabulary.
For a dataset with the keywords of: ocean acidification
, Dissolved Organic Carbon
, bacterioplankton respiration
, pCO2
, carbon dioxide
, oceans
, you can express these:
{ "@context": "https://schema.org/", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", "description": "This dataset includes results of laboratory experiments which measured dissolved organic carbon (DOC) usage by natural bacteria in seawater at different pCO2 levels. Included in this dataset are; bacterial abundance, total organic carbon (TOC), what DOC was added to the experiment, target pCO2 level. ", "url": "https://www.sample-data-repository.org/dataset/472032", "keywords": ["ocean acidification", "Dissolved Organic Carbon", "bacterioplankton respiration", "pCO2", "carbon dioxide", "oceans"] }
If you have information about a controlled vocabulary from which keywords come, use schema:DefinedTerm
to describe that keyword. The relevant properties of a schema:DefinedTerm
are:
- name - The name of the keyword. (Required)
- inDefinedTermSet - The controlled vocabulary responsible for this keyword. (Required)
- url - The canonical URL for the keyword. (Optional)
- termCode - A representative code for this keyword in the controlled vocabulary (Optional)
As an example, we demonstrate these fields using the oceans
keyword from the NASA GCMD Keyword vocabulary, ice core studies
from SnowTerm, and Baked Clay
from EarthRef controlled vocabulary.
{ "@context": "https://schema.org/", "@type": "Dataset", "name": "Dataset shell for example DefinedTerm keywords", "keywords": [ { "@type": "DefinedTerm", "name": "OCEANS", "inDefinedTermSet": "https://gcmd.earthdata.nasa.gov/kms/concepts/concept_scheme/sciencekeywords", "url": "https://gcmd.earthdata.nasa.gov/kms/concept/91697b7d-8f2b-4954-850e-61d5f61c867d", "termCode": "91697b7d-8f2b-4954-850e-61d5f61c867d" }, { "@type": "DefinedTerm", "name": "ice core studies", "inDefinedTermSet": "https://vocabularyserver.com/cnr/ml/snowterm/en/", "url": "https://vocabularyserver.com/cnr/ml/snowterm/en/index.php?tema=29330", "identifier": { "@type": "PropertyValue", "propertyID": "https://registry.identifiers.org/registry/ark", "value": "ark:/99152/t3v4yo3eeqepj0", "url": "https://vocabularyserver.com/cnr/ml/snowterm/en/?ark=ark:/99152/t3v4yo3eeqepj0" } }, { "@type": "DefinedTerm", "name": "Baked Clay", "inDefinedTermSet": "https://www2.earthref.org/vocabularies/controlled" } ] }
Adding the schema:identifier field can be done in three ways - a text description, a URL, or by using the schema:PropertyValue field.
We highly recommend using schema:PropertyValue.
Q: Why are simple text or URLs not good enough? A: Identifiers have multiple properties that are useful when trying to find them across the web.
Most identifiers have these properties:
- a value,
- a domain or scheme (in which the value is guaranteed to be unique),
- (optionally) a resolvable URL (where the thing being identified can be found),
- (optionally) a domain prefix (a token string of characters succeeded by a colon ':' that represents the domain or scheme).
For example, the Digital Object Identifier (DOI) for a dataset may be: doi:10.5066/F7VX0DMQ. To break it down into its properties, we arrive at:
- value:
10.5066/F7VX0DMQ
- scheme:
Digital Object Identifier (DOI)
- url:
https://doi.org/10.5066/F7VX0DMQ
- prefix:
doi
Q: Can't we just say the scheme is a 'DOI'? A: Yes, but there's a better way - a URI or URL. Because we are using schema.org to express the explicit values of our content, we want to explicitly identify and classify our content such that harvesters can determine when our content appears elsewhere on the web. By detecting these shared pieces of content, we form the Web of Data.
Because the scheme Digital Object Identifier (DOI)
is described using unstructured text, we need a better way to explicitly state this value. Fortunately, identifiers.org has registered URIs for almost 700 different identifier schemes which can be browsed at: https://registry.identifiers.org/registry.
We can specify the scheme as being a DOI with this identifiers.org Registry URI:
https://registry.identifiers.org/registry/doi
Looking at the available fields from schema:PropertyValue, we can map our identifier fields as follows:
schema:value
as the identifier value10.5066/F7VX0DMQ
schema:propertyID
is the registry.identifiers.org URI for the identifier schemehttps://registry.identifiers.org/registry/doi
,schema:url
is the resolvable url for that identifierhttps://doi.org/10.5066/F7VX0DMQ
.
Q: Where should the prefix go?
A: There is no ideal property for the prefix, but we may include it as part of the schema:value
.
Q: Why include doi:
as part of the value? Doesn't the URL https://doi.org/10.5066/F7VX0DMQ
achieve the same result?
A: While the actual value of the DOI is 10.5066/F7VX0DMQ
, we felt that this representation helps schema.org publishers specify an identifier value that is familiar to the research community. For example, in most citation styles such as APA, the DOI 10.5066/F7VX0DMQ is cited as doi:10.5066/F7VX0DMQ
. Also, there can be many proper URLs for a specific identifier:
- http://doi.org/10.5066/F7VX0DMQ
- https://doi.org/10.5066/F7VX0DMQ
- http://dx.doi.org/10.5066/F7VX0DMQ
- https://dx.doi.org/10.5066/F7VX0DMQ
- https://www.sciencebase.gov/catalog/item/56b3e649e4b0cc79997fb5ec
For these reasons, we recommend that any identifier having a known prefix value should be included in the value succeeded by a colon to form ':', or for this DOI: doi:10.5066/F7VX0DMQ
.
Q: How do I know if an Identifier has a known prefix? A: Each Identifier in the identifiers.org Registry that has a known prefix will be specified on the identifiers.org registry page under the section called 'Identifier Schemes' at the field labeled 'Prefix'.
An example of using schema:PropertyValue to describe an Identifier:
{ "@context": "https://schema.org/", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", "description": "This dataset includes results of laboratory experiments which measured dissolved organic carbon (DOC) usage by natural bacteria in seawater at different pCO2 levels. Included in this dataset are; bacterial abundance, total organic carbon (TOC), what DOC was added to the experiment, target pCO2 level. ", "url": "https://www.sample-data-repository.org/dataset/472032", "sameAs": "https://search.dataone.org/#view/https://www.sample-data-repository.org/dataset/472032", "version": "2013-11-21", "keywords": ["ocean acidification", "Dissolved Organic Carbon", "bacterioplankton respiration", "pCO2", "carbon dioxide", "oceans"], "identifier": { "@id": "https://doi.org/10.5066/F7VX0DMQ", "@type": "PropertyValue", "propertyID": "https://registry.identifiers.org/registry/doi", "value": "doi:10.5066/F7VX0DMQ", "url": "https://doi.org/10.5066/F7VX0DMQ" } }
Optionally, the schema:name
field can be used to give this specific identifier a label such as "DOI: 10.5066/F7VX0DMQ" or "DOI 10.5066/F7VX0DMQ", but schema:name
should never be used to simply say "DOI".
{ "@context": "https://schema.org/", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", ... "identifier": { "@id": "https://doi.org/10.5066/F7VX0DMQ", "@type": "PropertyValue", "name": "DOI: 10.5066/F7VX0DMQ", "propertyID": "https://registry.identifiers.org/registry/doi", "value": "doi:10.5066/F7VX0DMQ", "url": "https://doi.org/10.5066/F7VX0DMQ" } }
For more examples of using schema:PropertyValue
for identifiers other than DOIs:
- ARK: https://registry.identifiers.org/registry/ark
- PubMed: https://registry.identifiers.org/registry/pubmed
- PaleoDB: https://registry.identifiers.org/registry/paleodb
- Protein Data Bank: https://registry.identifiers.org/registry/pdb
"identifier": [ { "@id": "https://n2t.net/ark:13030/c7833mx7t", "@type": "PropertyValue", "propertyID": "https://registry.identifiers.org/registry/ark", "name": "ARK: 13030/c7833mx7t", "value": "ark:13030/c7833mx7t", "url": "https://n2t.net/ark:13030/c7833mx7t" }, { "@id": "http://www.ncbi.nlm.nih.gov/pubmed/16333295", "@type": "PropertyValue", "propertyID": "https://registry.identifiers.org/registry/pubmed", "name": "Pubmed ID #16333295", "value": "pubmed:16333295", "url": "http://www.ncbi.nlm.nih.gov/pubmed/16333295" }, { "@id": "https://identifiers.org/paleodb:83088", "@type": "PropertyValue", "propertyID": "https://registry.identifiers.org/registry/paleodb", "name": "Paleo Database ID #83088", "value": "paleodb:83088", "url": "https://identifiers.org/paleodb:83088" }, { "@id": "https://identifiers.org/pdb:2gc4", "@type": "PropertyValue", "propertyID": "https://registry.identifiers.org/registry/pdb", "name": "Protein Data Bank 2gc4", "value": "pdb:2gc4", "url": "https://identifiers.org/pdb:2gc4" } ]
While we strongly recommend using a schema:PropertyValue, in its most basic form, the schema:identifier
as text can be published as:
{ "@context": "https://schema.org/", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", "description": "This dataset includes results of laboratory experiments which measured dissolved organic carbon (DOC) usage by natural bacteria in seawater at different pCO2 levels. Included in this dataset are; bacterial abundance, total organic carbon (TOC), what DOC was added to the experiment, target pCO2 level. ", "url": "https://www.sample-data-repository.org/dataset/472032", "sameAs": "https://search.dataone.org/#view/https://www.sample-data-repository.org/dataset/472032", "version": "2013-11-21", "keywords": ["ocean acidification", "Dissolved Organic Carbon", "bacterioplankton respiration", "pCO2", "carbon dioxide", "oceans"], "identifier": "urn:sdro:dataset:472032" }
Or as a URL:
{ "@context": "https://schema.org/", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", ... "identifier": "http://id.sampledatarepository.org/dataset/472032/version/1" }
However, if the identifier is a persistent identifier such as a DOI, ARK, or accession number, then the best way to represent these identifiers is by using a schema:PropertyValue. The PropertyValue allows for more information about the identifier to be represented such as the identifier type or scheme, the identifier's value, its URL and more. Because of this flexibility, we recommend using PropertyValue for all identifier types.
schema:Dataset also defines a field for the schema:citation as either text or a schema:CreativeWork. To provide citation text:
{ "@context": "https://schema.org/", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", "description": "This dataset includes results of laboratory experiments which measured dissolved organic carbon (DOC) usage by natural bacteria in seawater at different pCO2 levels. Included in this dataset are; bacterial abundance, total organic carbon (TOC), what DOC was added to the experiment, target pCO2 level. ", "url": "https://www.sample-data-repository.org/dataset/472032", "sameAs": "https://search.dataone.org/#view/https://www.sample-data-repository.org/dataset/472032", "version": "2013-11-21", "keywords": ["ocean acidification", "Dissolved Organic Carbon", "bacterioplankton respiration", "pCO2", "carbon dioxide", "oceans"], "identifier": { "@id": "https://doi.org/10.5066/F7VX0DMQ", "@type": "PropertyValue", "name": "DOI: 10.5066/F7VX0DMQ", "propertyID": "https://registry.identifiers.org/registry/doi", "value": "doi:10.5066/F7VX0DMQ", "url": "https://doi.org/10.5066/F7VX0DMQ" }, "citation": "J.Smith 'How I created an awesome dataset’, Journal of Data Science, 1966" }
NOTE: If you have a DOI, the citation text can be automatically generated for you by querying a DOI URL with the Accept Header of 'text/x-bibliography'.
Short DOI is a redirect service offered by the International DOI Foundation that provides a shorter version of an original DOI. For example, the original DOI doi:10.5066/F7VX0DMQ
has a short DOI of doi.org/csgf
. Short DOIs are resolvable using standard DOI URLS such as http://doi.org/fg5v
. These short DOIs are treated identically to the original DOI. If you are using the short DOI service, we recommend publishing a short DOI URL using the schema:sameAs
property of the schema:Dataset
:
{ "@context": "https://schema.org/", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", "description": "This dataset includes results of laboratory experiments which measured dissolved organic carbon (DOC) usage by natural bacteria in seawater at different pCO2 levels. Included in this dataset are; bacterial abundance, total organic carbon (TOC), what DOC was added to the experiment, target pCO2 level. ", "url": "https://www.sample-data-repository.org/dataset/472032", "sameAs": [ "https://search.dataone.org/#view/https://www.sample-data-repository.org/dataset/472032", "http://doi.org/fg5v" ], "version": "2013-11-21", "keywords": ["ocean acidification", "Dissolved Organic Carbon", "bacterioplankton respiration", "pCO2", "carbon dioxide", "oceans"], "identifier": { "@id": "https://doi.org/10.5066/F7VX0DMQ", "@type": "PropertyValue", "propertyID": "https://registry.identifiers.org/registry/doi", "value": "doi:10.5066/F7VX0DMQ", "url": "https://doi.org/10.5066/F7VX0DMQ" } }
schema:sameAs
is used here for the following reasons:
- It doesn't add too many more statements that might increase the page weight (which may impact major search engine crawlers stopping the crawl of schema.org markup).
- Crawlers that follow the URL for the short DOI can retrieve structured metadata for the DOI itself:
curl --location --request GET "http://doi.org/fg5v" --header "Accept: application/ld+json"
Back to top
A Dataset is a collection of data entities, each of which contains structured and unstructured values for a set of properties about that entity. For example, an hypothetical Dataset might contain three data files: 1) a data table in CSV format containing columns of data that both classify and measure the properties of a set of lakes in a region; 2) an image file containing rasterized geospatial data values for each location for properties like water temperature at multiple depths; and 3) a text file containing responses to a survey assessing perspectives on water rights, with values for questions containing both natural language responses and responses on a Likert scale. In each of these examples, we are recording the value of attributes (aka properties) about an entity of interest (lake). In schema.org, details about these attributes can be recorded using schema:variableMeasured
. So, while schema.org uses the term "variable" and the term "measured", it is usually conceptualized as a listing of any of the properties or attributes of an entity that are recorded, and not strictly a measured variable. Thus, we recommend using schema:variableMeasured
to represent any recordable property of an entity that is found in the dataset. While this includes quantitatively "measured" observations (e.g., rainfall in mm), it also includes classification values that are asserted or qualitatively assigned (e.g., "moderate velocity"), contextual attributes such as spatial locations, times, or sampling information associated with a value, and textual values such as narrative text.
Information about the variables/attributes in a dataset can enhance discovery and support evaluation of the data. This can be done using the schema:variableMeasured field. Schema.org allows the value of variableMeasured to be a simple text string, but it is strongly recommended to use the schema:PropertyValue type to describe the variable in more detail.
This recommendation outlines several tiers of variable description. Tier 1 is the simplest, with other tiers adding recommendations for additional content (Tier 2 and 3). See Experimental Recommendations for proposed recommendations to document variables with with non-numeric or enumerated (controlled vocabulary) values, variables whose values are structured objects (e.g., json objects, arrays, gridded data), or are references to external value representations.
The simplest approach is to provide a schema:name
and a textual description of the variable. The schema:name
should match the label associated with the variable in the dataset serialization (e.g., the column name in a CSV file). If the variable name in the dataset does not clearly convey the variable concept, a more human-intelligible name can be provide using schema:alternateName
. The field schema:description
is used to provide a definition of the variable/property/attribute that allows others to correctly understand and interpret the values.
Example:
{ "@context": "https://schema.org/", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities ...", ... "variableMeasured": [ { "@type": "PropertyValue", "name": "latdd", "alternateName":"latitude, decimal degrees", "description": "Latitude where water samples were collected ...", }, ... ] }
In Tier 2, we recommend using a schema:PropertyValue
object to provide a schema:propertyID that better defines the semantics of the variable than plain text can. This schema:propertyID
should be a URI that resolves to a web page providing a human-friendly description of the variable and, ideally, this identifier should also be resolvable to obtain an RDF representation using a documented vocabulary for machine consumption, for example a sosa:Observation or DDI represented variable. Describing the variables with machine understandable vocabularies is necessary if you want your data to be interoperable with other data, i.e. to be more FAIR. The property can be identified at any level of specificity, depending on what the data provider can determine about the interpretation of the variable. For example, one might use a propertyID for the property 'temperature', or use a more specific property like 'water temperature', 'sea surface water temperature', or 'sea surface water temperature measured with protocol X, daily average, Kelvins, xsd:decimal'. If there are choices, the most specific property identifier should be used.
Example:
{ "@context": { "@vocab": "https://schema.org/" }, "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities ...", ... "variableMeasured": [ { "@type": "PropertyValue", "name": "latdd", "alternateName":"latitude, decimal degrees", "propertyID":"http://purl.obolibrary.org/obo/NCIT_C68642", "description": "Latitude where water samples were collected ...", }, ... ] }
For variables with numeric measured values, other properties of schema:PropertyValue can add additional useful information:
- schema:unitText. A string that identifies a unit of measurement that applies to all values for this variable.
- schema:unitCode. Value is expected to be TEXT or URL. We recommend providing an HTTP URI that identifies a unit of measure from a vocabulary accessible on the web. The QUDT unit vocabulary provides an extensive set of registered units of measure that can be used to populate the schema:unitCode property to specify the units of measure used to report data values when that is appropriate.
- schema:minValue. If the value for the variable is numeric, this is the minimum value that occurs in the dataset. Not useful for other value types.
- schema:maxValue. If the value for the variable is numeric, this is the maximum value that occurs in the dataset. Not useful for other value types.
- schema:measurementTechnique. A text description of the measurement method used to determine values for this variable. If standard measurement protocols are defined and registered, these can be identified via http URIs.
- schema:url Any schema:Thing can have a URL property, but because the value is simply a url the relationship of the linked resource can not be expressed. Usage is optional. The recommendation is that
schema:url
should link to a web page that would be useful for a person to interpret the variable, but is not intended to be machine-actionable.
Example:
{ "@context": { "@vocab": "https://schema.org/" }, "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", "variableMeasured": [ { "@type": "PropertyValue", "name": "latitude", "propertyID":"http://purl.obolibrary.org/obo/NCIT_C68642", "url": "https://www.sample-data-repository.org/dataset-parameter/665787", "description": "Latitude where water samples were collected; north is positive. Latitude is a geographic coordinate which refers to the angle from a point on the Earth's surface to the equatorial plane", "unitText": "decimal degrees", "unitCode":"http://qudt.org/vocab/unit/DEG", "minValue": "45.0", "maxValue": "15.0" }, ... ] }
Back to top
For some repositories, data collections containing multiple data sets are used to help contextualize or organize the data. In schema.org, you define these collections using schema:DataCatalog.
The best way to use these DataCatalogs is to define these catalogs as an "offering" of your repository and including the offering @id
in the dataset JSON-LD. For example, the repository JSON-LD defines a schema:DataCatalog with the
"@id": "https://www.sample-data-repository.org/collection/biological-data"
.
In the dataset JSON-LD, we reuse that @id
to say a dataset belongs in that DataCatalog:
{ "@context": "https://schema.org/", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", ... "includedInDataCatalog": { "@id": "https://www.sample-data-repository.org/collection/biological-data", "@type": "DataCatalog" } }
Back to top
While this schema.org record represents metadata about a Dataset, many providers will also have other metadata records that may be more complete or that conform to other metadata formats and vocabularies that might be useful. For example, repositories often contain detailed records in ISO TC 211 formats, EML, and other formats. Aggregators and other consumers can make use of this additional metadata if they are linked in a standardized way to the schema.org record. We recommend that the location of the alternative forms of the metadata be provided using the schema:subjectOf and schema:about properties:
- Link metadata documents to a schema:Dataset by using schema:subjectOf.
- Or if a schema.org snippet describes the metadata as the main resource, then link to the Dataset it describes using schema:about.
These two approaches are equivalent, and which is used depends on the subject of the schema.org record.
Once the linkage has been made, further details about the metadata can be provided. We recommend using schema:encodingFormat to indicate the metadata format/vocabulary to which the metadata record conforms. If it conforms to multiple formats, or to a specific and general format types, multiple types can be listed. We use the schema:DataDownload class for Metadata files so that we can use the schema:MediaObject properties for describing bytesize, encoding, etc.
It can be useful to aggregators and other consumers to indicate when the metadata record was last modified using schema:dateModified
, which can be used to optimize harvesting schedules for search indices and other applications.
An example of a metadata reference to an instance of EML-formatted structured metadata, embedded within a schema:Dataset
record follows:
{ "@context": "https://schema.org/", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", "distribution": { "@type": "DataDownload", ... }, "subjectOf": { "@type": "DataDownload", "name": "EML metadata for dataset", "description": "EML metadata describing the dataset", "encodingFormat": ["application/xml", "https://eml.ecoinformatics.org/eml-2.2.0"], "contentURL":"https://example.com/metadata/eml-metadata.xml", "dateModified":"2019-06-12T14:44:15Z" } }
Alternatively, if the schema.org record is meant to describe the metadata record, one could use the inverse property schema:about
to indicate the linkage back to the Dataset that it describes. This should be a rare situation, as typically the schema.org record will describe the Dataset itself.
Note that the encodingFormat
property contains an array of formats to describe multiple formats to which the document conforms (in this example, the document is both conformant with XML and the EML metadata dialect).
Back to top
While the schema:url property of the Dataset should point to a landing page, the way to describe how to download the data is through the schema:distribution property. The "distribution" property describes where to get the data and in what format by using the schema:DataDownload type. If your dataset is not accessible through a direct download URL, but rather through a service URL that may need input parameters jump to the next section Accessing Data through a Service Endpoint.
For data available in multiple formats, there will be multiple values of the schema:DataDownload:
{ "@context": "https://schema.org/", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", ... "distribution": { "@type": "DataDownload", "contentUrl": "https://www.sample-data-repository.org/dataset/472032.tsv", "encodingFormat": "text/tab-separated-values" } }
If access to the data requires some input parameters before a download can occur, we can use the schema:potentialAction in this way:
{ "@context": "https://schema.org/", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", ... "potentialAction": { "@type": "SearchAction", "target": { "@type": "EntryPoint", "contentType": ["application/x-netcdf", "text/tab-separated-values"], "urlTemplate": "https://www.sample-data-repository.org/dataset/1234/download?format={format}&startDateTime={start}&endDateTime={end}&bounds={bbox}", "description": "Download dataset 1234 based on the requested format, start/end dates and bounding box", "httpMethod": ["GET", "POST"] }, "query-input": [ { "@type": "PropertyValueSpecification", "valueName": "format", "description": "The desired format requested either 'application/x-netcdf' or 'text/tab-separated-values'", "valueRequired": true, "defaultValue": "application/x-netcdf", "valuePattern": "(application\/x-netcdf|text\/tab-separated-values)" }, { "@type": "PropertyValueSpecification", "valueName": "start", "description": "A UTC ISO DateTime", "valueRequired": false, "valuePattern": "(-?(?:[1-9][0-9]*)?[0-9]{4})-(1[0-2]|0[1-9])-(3[01]|0[1-9]|[12][0-9])T(2[0-3]|[01][0-9]):([0-5][0-9]):([0-5][0-9])(.[0-9]+)?(Z)?" }, { "@type": "PropertyValueSpecification", "valueName": "end", "description": "A UTC ISO DateTime", "valueRequired": false, "valuePattern": "(-?(?:[1-9][0-9]*)?[0-9]{4})-(1[0-2]|0[1-9])-(3[01]|0[1-9]|[12][0-9])T(2[0-3]|[01][0-9]):([0-5][0-9]):([0-5][0-9])(.[0-9]+)?(Z)?" }, { "@type": "PropertyValueSpecification", "valueName": "bbox", "description": "Two points in decimal degrees that create a bounding box fomatted at 'lon,lat' of the lower-left corner and 'lon,lat' of the upper-right", "valueRequired": false, "valuePattern": "(-?[0-9]+(.[0-9]+)?),[ ]*(-?[0-9]+(.[0-9]+)?)[ ]*(-?[0-9]+(.[0-9]+)?),[ ]*(-?[0-9]+(.[0-9]+)?)" } ] } }
Here, we use the schema:SearchAction type because it lets you define the query parameters and HTTP methods so that machines can build user interfaces to collect those query parameters and actuate a request to provide the user what they are looking for.
Back to top
Scientific datasets typically have multiple associated date or time periods. Time periods can be specified for 1) the time at which an entity or phenomenon occurred or was measured, and 2) the time periods when a dataset containing that information was created, changed, published, etc. temporalCoverage
describes the age of the sample and the other dates describe the data created from observations or process. For example, if one took a sample 200 ft down in an ice core, temporalCoverage
would describe the period when that layer of ice was deposited in geologic time, while other date properties (dateCreated, dateModified, datePublished, and expired) would describe the dataset that was created by measuring and analyzing that ice core sample. The temporalCoverage might also be a range of ages. For example if the dataset was from a study of the whole ice core it could have a range of ages from 300 to 6000 years before present (BP).
Schema.org offers various date properties that can be used to encode this information. We recommend use of the following fields for Dates:
-
schema:temporalCoverage
:: use to specify the time period(s) that the content applies to, i.e. the time the entity or phenomenon described in the dataset occurred. See details at temporalCoverage.temporalCoverage
is usually prior to the date of data publication for observational data, and can be afterwards for models, simulations, and forecasts. -
schema:dateCreated
:: use to specify the date the dataset was initially generated (e.g., when a sensor recorded a value, when a model was run, or when data processing was completed). This is typically fixed when the first dataset version is created. -
schema:dateModified
:: use to specify the date the dataset was most recently updated or changed. -
schema:datePublished
:: use to specify the date when a dataset was made available to the public through a publication process. -
schema:expires
:: use to specify the date when the dataset expires and is no longer useful or available. IfdatePublished
is when the dataset is made available, then 'expires' brackets the time the dataset is valid or recommended for use.
Back to top
Temporal coverage is defined as "the time period during which data was collected or observations were made; or a time period that an activity or collection is linked to intellectually or thematically (for example, 1997 to 1998; the 18th century)" (ARDC RIF-CS). For documentation of Earth Science, Paleobiology or Paleontology datasets, we are interested in the second case: the time period that data are linked to thematically.
Temporal coverage is a difficult concept to cover across all the possible scenarios. Schema.org uses ISO 8601 time interval format to describe time intervals and time points, but doesn't provide capabilities for geologic time scales or dynamically generated data up to present time. We ask for your feedback on any temporal coverages you may have that don't currently fit into schema.org. You can follow similar issues on the schema.org GitHub issue queue. We hope that our examples of the use of OWL Time and temporal reference system (TRS) elements will help you describe the dates or ages of the entities in a dataset. We have also included examples that describe dataset's time uncertainties.
To represent a single date and time:
{ "@context": "https://schema.org/", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", ... "temporalCoverage": "2018-01-22T14:51:12+00:00" }
Or a single date:
{ ... "temporalCoverage": "2018-01-22" }
Or a date range:
{ ... "temporalCoverage": "2012-09-20/2016-01-22" }
Or an open-ended date range (thanks to @lewismc for this example from NASA PO.DAAC) :
{ ... "temporalCoverage": "2012-09-20/.." }
Schema.org also lets you provide date ranges and other temporal coverages through the DateTime data type and URL. For more granular temporal coverages go here: https://schema.org/DateTime.
Dates or ages used for describing geological, archeological, and paleontological samples range from the very simple to highly complex. A lava rock age could be simply described as 1.23 million years. Other ages are more descriptive. Some other examples are: a zircon crystal with an age of 456.4 +/- 1.4 billion years (Ga) at a standard error of 2-sigma, a core with rocks from the Triassic to the Jurassic, a carbon date of a bone with non-symmetrical uncertainties of 3242 (+160 -40) B.C. We make use of the OWL time (Cox and Little) descriptive tags (elements), the Queensland Department of Natural Resources, Mines and Energy Temporal Reference Systems (TRS), and geoschemas' properties to describe ages and age ranges in detail. These methods could also be used to describe the temporal coverage for other disciplines as well.
There are two main types of geologic time: Proper Intervals and Instants. They are diagrammed below and used in the examples that follow.
Proper Interval
Instant
These examples can be found in one JSON-LD file at temporalCoverage.jsonld
- The dataset's temporalCoverage is described using ProperInterval, hasBeginning, and hasEnd elements from OWL Time. The human readable description can be found in the description field: "Eruptive activity at Mt. St. Helens, Washington, March 1980 - January 1981".
Example:
{ "@context": { "@vocab": "http://schema.org/", "time": "http://www.w3.org/2006/time#", }, "@type": "Dataset", "description": "Eruptive activity at Mt. St. Helens, Washington, March 1980 - January 1981", "temporalCoverage": [ { "@type": "time:ProperInterval", "time:hasBeginning": { "@type": "time:Instant", "time:inXSDDateTimeStamp": "1980-03-27T19:36:00Z" }, "time:hasEnd": { "@type": "time:Instant", "time:inXSDDateTimeStamp": "1981-01-03T00:00:00Z" } }] }
-
The dataset's temporalCoverage is described using Instant, inTimePosition, hasTRS, and numericPosition elements for a single geological date/age without uncertainties from OWL Time. Use a decimal value with appropriate timescale temporal reference system (TRS) and date/age unit abbreviation. The human readable description can be found in the description field: "Eruption of Bishop Tuff, about 760,000 years ago".
Example:
{ "@context": { "@vocab": "http://schema.org/", "gsqtime": "https://vocabs.gsq.digital/object?uri=http://linked.data.gov.au/def/trs", "time": "http://www.w3.org/2006/time#", "xsd": "https://www.w3.org/TR/2004/REC-xmlschema-2-20041028/datatypes.html" }, "@type": "Dataset", "description": "Eruption of Bishop Tuff, about 760,000 years ago", "temporalCoverage": [ { "@type": "time:Instant", "time:inTimePosition": { "@type": "time:TimePosition", "time:hasTRS": { "@id": "gsqtime:MillionsOfYearsAgo" }, "time:numericPosition": { "@type": "xsd:decimal", "value": 0.76 }, "gstime:geologicTimeUnitAbbreviation": { "@type": "xsd:string", "value": "Ma" } } } ] }
-
The dataset's temporalCoverage is described using the Instant, inTimePosition, TimePosition, numericPosition from OWL Time with a geological date/age with uncertainties. Use a decimal value with appropriate timescale temporal reference system(TRS), date/age unit abbreviation, the uncertainty value and specify at what sigma. The human readable description can be found in the description field: "Very old zircons from the Jack Hills formation Australia 4.404 +- 0.008 Ga (2-sigma)".
Example:
{ "@context": { "@vocab": "http://schema.org/", "gsqtime": "https://vocabs.gsq.digital/object?uri=http://linked.data.gov.au/def/trs", "gstime": "http://schema.geoschemas.org/contexts/temporal#", "time": "http://www.w3.org/2006/time#", "xsd": "https://www.w3.org/TR/200 (from [OWL Time](http://www.w3.org/2006/time))4/REC-xmlschema-2-20041028/datatypes.html" }, "@type": "Dataset", "description": "Very old zircons from the Jack Hills formation Australia 4.404 +- 0.008 Ga (2-sigma)", "temporalCoverage": [ { "@type": "time:Instant", "time:inTimePosition": { "@type": "time:TimePosition", "time:hasTRS": { "@id": "gsqtime:BillionsOfYearsAgo" }, "time:numericPosition": { "@type": "xsd:decimal", "value": 4.404 }, "gstime:geologicTimeUnitAbbreviation": { "@type": "xsd:string", "value": "Ma" }, "gstime:uncertainty": { "@type": "xsd:decimal", "value": 0.008 }, "gstime:uncertaintySigma": { "@type": "xsd:decimal", "value": 2.0 } } } ] }
-
The dataset's temporalCoverage is described using the ProperInterval, hasBeginning, hasEnd, Instant, inTimePosition, TimePosition, and hasTRS elements from OWL Time with a geological date/age range with uncertainties. Use a decimal value with appropriate timescale temporal reference system(TRS), date/age unit abbreviation, uncertainty value and at what sigma. The human readable description can be found in the description field: "Isotopic ages determined at the bottom and top of a stratigraphic section in the Columbia River Basalts".
Example:
{ "@context": { "@vocab": "http://schema.org/", "gsqtime": "https://vocabs.gsq.digital/object?uri=http://linked.data.gov.au/def/trs", "gstime": "http://schema.geoschemas.org/contexts/temporal#", "rdfs": "https://www.w3.org/2001/sw/RDFCore/Schema/200212/", "time": "http://www.w3.org/2006/time#", "xsd": "https://www.w3.org/TR/2004/REC-xmlschema-2-20041028/datatypes.html" }, "@type": "Dataset", "description": "Isotopic ages determined at the bottom and top of a stratigraphic section in the Columbia River Basalts", "temporalCoverage": [ { "@type": "time:ProperInterval", "time:hasBeginning": { "@type": "time:Instant", "time:inTimePosition": { "@type": "time:TimePosition", "rdfs:comment": "beginning is older bound of age envelop", "time:hasTRS": { "@id": "gsqtime:MillionsOfYearsAgo" }, "time:numericPosition": { "@type": "xsd:decimal", "value": 17.1 }, "gstime:geologicTimeUnitAbbreviation": { "@type": "xsd:string", "value": "Ma" } }, "gstime:uncertainty": { "@type": "xsd:decimal", "value": 0.15 }, "gstime:uncertaintySigma": { "@type": "xsd:decimal", "value": 1.0 } }, "time:hasEnd": { "@type": "time:Instant", "time:inTimePosition": { "@type": "time:TimePosition", "rdfs:comment": "ending is younger bound of age envelop", "time:hasTRS": { "@id": "gsqtime:MillionsOfYearsAgo" }, "time:numericPosition": { "@type": "xsd:decimal", "value": 15.7 }, "gstime:geologicTimeUnitAbbreviation": { "@type": "xsd:string", "value": "Ma" } }, "gstime:uncertainty": { "@type": "xsd:decimal", "value": 0.14 }, "gstime:uncertaintySigma": { "@type": "xsd:decimal", "value": 2.0 } } } ] }
-
The dataset's temporalCoverage is described using the Instant, inTimePosition, TimePosition, and hasTRS elements from OWL Time with an archeological date/age range with uncertainties. Use a decimal value with appropriate timescale temporal reference system (TRS), date/age unit abbreviation, the older and younger uncertainty values and at what sigma. The human readable description can be found in the description field: "Age of a piece of charcoal found in a burnt hut at an archeological site in Kenya carbon dated at BP Calibrated of 2640 +130 -80 (one-sigma) using the INTCAL20 carbon dating curve."
Example:
{ "@context": { "@vocab": "http://schema.org/", "gsqtime": "https://vocabs.gsq.digital/object?uri=http://linked.data.gov.au/def/trs", "gstime": "http://schema.geoschemas.org/contexts/temporal#", "time": "http://www.w3.org/2006/time#", "xsd": "https://www.w3.org/TR/2004/REC-xmlschema-2-20041028/datatypes.html" }, "@type": "Dataset", "description": "Age of a piece of charcoal found in a burnt hut at an archeological site in Kenya carbon dated at BP Calibrated of 2640 +130 -80 (one-sigma) using the INTCAL20 carbon dating curve.", "temporalCoverage": [ { "@type": "time:Instant", "time:inTimePosition": { "@type": "time:TimePosition", "time:hasTRS": { "@id": "gsqtime:BeforePresentCalibrated" }, "time:numericPosition": { "@type": "xsd:decimal", "value": 2460.0 }, "gstime:geologicTimeUnitAbbreviation": { "@type": "xsd:string", "value": "BP-CAL" }, "gstime:uncertaintyOlder": { "@type": "xsd:decimal", "value": 130.0 }, "gstime:uncertaintyYounger": { "@type": "xsd:decimal", "value": 80.0 }, "gstime:uncertaintySigma": { "@type": "xsd:decimal", "value": 1.0 } } } ] }
-
The dataset's temporalCoverage is described using the Instant, TimePosition, inTimePosition, NominalPosition, Interval, hasBeginning, hasEnd, and hasTRS elements from OWL Time. With temporal coverage that is a named time interval from a geologic time scale, provide numeric positions of the beginning and end for interoperability. Providing the numeric values is only critical, but still recommended, if the TRS for the nominalPosition is not the International Chronostratigraphic Chart. In this example the temporalCoverage is described in two ways: by the named interval Bartonian and by defining a time interval using numerical beginning and end values.
Example:
{ "@context": { "@vocab": "http://schema.org/", "gsqtime": "https://vocabs.gsq.digital/object?uri=http://linked.data.gov.au/def/trs", "gstime": "http://schema.geoschemas.org/contexts/temporal#", "icsc": "http://resource.geosciml.org/clashttps://vocabs.ardc.edu.au/repository/api/lda/csiro/international-chronostratigraphic-chart/geologic-time-scale-2020/resource?uri=http://resource.geosciml.org/classifier/ics/ischart/Boundariessifier/ics/ischart/", "time": "http://www.w3.org/2006/time#", "ts": "http://resource.geosciml.org/vocabulary/timescale/", "xsd": "https://www.w3.org/TR/2004/REC-xmlschema-2-20041028/datatypes.html" }, "@type": "Dataset", "description": "Temporal position expressed with a named time ordinal era from [International Chronostratigraphic Chart](https://stratigraphy.org/chart):", "temporalCoverage": [ { "@type": "time:Instant", "time:inTimePosition": { "@type": "time:TimePosition", "time:hasTRS": { "@id": "ts:gts2020" }, "time:nominalPosition": { "@type": "xsd:anyURI", "value": "icsc:Bartonian" } } }, { "@type": "time:ProperInterval", "time:hasBeginning": { "@type": "time:Instant", "time:inTimePosition": { "@type": "time:TimePosition", "time:hasTRS": { "@id": "gsqtime:MillionsOfYearAgo" }, "time:numericPosition": { "@type": "xsd:decimal", "value": 41.2 }, "gstime:geologicTimeUnitAbbreviation": { "@type": "xsd:string", "value": "Ma" } } }, "time:hasEnd": { "@type": "time:Instant", "time:inTimePosition": { "@type": "time:TimePosition", "time:hasTRS": { "@id": "gsqtime:MillionsOfYearAgo" }, "time:numericPosition": { "@type": "xsd:decimal", "value": 37.71 }, "gstime:geologicTimeUnitAbbreviation": { "@type": "xsd:string", "value": "Ma" } } } } ] }
-
Temporal intervals with nominal temporal position that have identifiers. When possible, use temporal intervals defined by the International Chronostratigraphic Chart, access via ARDC vocabulary service, or via GeoSciML vocabularies landing page. If temporal intervals with identifiers from other schemes are available, they can be included in a separate time:ProperInterval or time:Instant element. If intervals are not from the ICS chart it is recommended to provide an interval with beginning and end numeric positions for better interoperability.
Example:
{ "@context": { "@vocab": "http://schema.org/", "icsc": "http://resource.geosciml.org/clashttps://vocabs.ardc.edu.au/repository/api/lda/csiro/international-chronostratigraphic-chart/geologic-time-scale-2020/resource?uri=http://resource.geosciml.org/classifier/ics/ischart/Boundariessifier/ics/ischart/", "time": "http://www.w3.org/2006/time#", "ts": "http://resource.geosciml.org/vocabulary/timescale/", "xsd": "https://www.w3.org/TR/2004/REC-xmlschema-2-20041028/datatypes.html" }, "@type": "Dataset", "description": "Temporal position expressed with an interval bounded by named time ordinal eras from [International Chronostratigraphic Chart](https://stratigraphy.org/chart). NumericPositions not included, expect clients can lookup bounds for ISC nominal positions:", "temporalCoverage": [{ "@type": "time:ProperInterval", "time:hasBeginning": { "@type": "time:Instant", "time:inTimePosition": { "@type": "time:TimePosition", "time:hasTRS": {"@id": "ts:gts2020"}, "time:nominalPosition": { "@value": "icsc:Triassic", "@type": "xsd:anyURI" } } }, "time:hasEnd": { "@type": "time:Instant", "time:inTimePosition": { "@type": "time:TimePosition", "time:hasTRS": {"@id": "ts:gts2020"}, "time:nominalPosition": { "@value": "icsc:Jurassic", "@type": "xsd:anyURI" } } } }] }
Back to top
Used to document the location on Earth that is the focus of the dataset content, using schema:Place. Recommended practice is to use the schema:geo property with either a schema:GeoCoordinates object to specify a point location, or a schema:GeoShape object to specify a line or area coverage extent. Coordinates describing these extents are expressed as latitude longitude tuples (in that order) using decimal degrees.
Schema.org documentation does not specify a convention for the coordinate reference system, our recommended practice is to use WGS84 for at least one spatial coverage description if applicable. Spatial coverage location using other coordinate systems can be included, see recommendation for specifying coordinate reference systems, below.
Please indicate a point location by using a schema:GeoCoordinates object with schema:latitude and schema:longitude properties.
Not Recommended The schema:Place definition allows the latitude and longitude of a point location to be specified directly; although this is more succinct, it makes parsing the metadata more complex and should be avoided.
Point locations are recommended for data that is associated with specific sample locations, particularly if these are widely spaced such that an enclosing bounding box would be a misleading representation of the spatial location. Be aware that some client applications might only index or display bounding box extents or a single point location.
A schema:Dataset that is about a point location would documented in this way:
{ "@context": "https://schema.org/", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton ....", ... "spatialCoverage": { "@type": "Place", "geo": { "@type": "GeoCoordinates", "latitude": 39.3280 "longitude": 120.1633 } } }
A schema:GeoShape can describe spatial coverage as a line (e.g., a ship track), a bounding box, a polygon, or a circle. The geometry is described with a set of latitude/longitude pairs. The spatial definitions were added to schema.org early in its development based on the GeoRSS specification. The documentation for schema:GeoShape states "Either whitespace or commas can be used to separate latitude and longitude; whitespace should be used when writing a list of several such points." At least for bounding boxes (see the discussion below), it appears that the Google Dataset Search parsing of the coordinate strings depends on whether a comma or space is used to delimit the coordinates in an individual tuple.
Be aware that some client applications might only index or display bounding box extents.
- line - a series of two or more points.
- polygon - a series of four or more points where the first and final points are identical.
- box - A rectangular (in lat-long space) extent specified by two points, the first in the lower left (southwest) corner and the second in the upper right (northeast) corner.
- circle - A circular region of a specified radius centered at a specified latitude and longitude, represented as a coordinate pair followed by a radius in meters. Not recommended for use.
Examples:
Useful for data that were collected along a traverse, ship track, flight line or other linear sampling feature.
"spatialCoverage": { "@type": "Place", "geo": { "@type": "GeoShape", "line": "39.3280 120.1633 40.445 123.7878" } } }
A polygon provides the most precise approach to delineating the spatial extent of the focus area for a dataset, but polygon spatial locations might not be recognized (indexed, displayed) by some client applications.
"polygon": "39.3280 120.1633 40.445 123.7878 41 121 39.77 122.42 39.3280 120.1633"
A GeoShape box defines an area on the surface of the earth defined by point locations of the southwest corner and northeast corner of the rectangle in latitude-longitude coordinates. Point locations are tuples of {latitude east-longitude} (y x). The schema.org GeoShape documentation states "Either whitespace or commas can be used to separate latitude and longitude; whitespace should be used when writing a list of several such points." Since the box is a list of points, a space should be used to separate the latitude and longitude values. The two corner coordinate points are separated by a space. 'East longitude' means positive longitude values are east of the prime (Greenwich) meridian. A box where 'lower-left' (southwest) corner is 39.3280/120.1633 and 'upper-right' (northeast) corner is 40.445/123.7878 would be encoded thus:
"box": "39.3280 120.1633 40.445 123.7878"
NOTE-- see discussion in GitHub issue 101 on what works with Google Dataset search to display spatial location in their search results.
East longitude values can be reported 0 <= X <= 360 or -180 <= X <= 180. Some applications will fail under one or the other of these conventions. Recommendation is to use -180 <= X <= 180, consistent with the WKT specification. Following this recommendation, bounding boxes that cross the antimeridian at ±180° longitude, the West longitude value will be numerically greater than the East longitude value. For example, to describe Fiji the box might be
"box": "-19 176 -15 -178"
NOTES: Some spatial data processors will not correctly interpret the bounding coordinates across the antimeridian even if they follow the recommended southwest, northeast corner convention, resulting in boxes that span the circumference of the Earth, excluding the actual area of interest. For applications operating with data in the vicinity of longitude 180, testing is strongly recommended to determine if it works for bounding boxes crossing the antimeridian (+/- 180); an alternative is to define two bounding boxes, one on each side of 180.
For bounding boxes that include the north or south pole, schema:box will not work. Recommended practice is to use a schema:polygon to describe spatial location extents that include the poles.
If you have multiple geometries, you can publish those by making the schema:geo field an array of GeoShape or GeoCoordinates like so:
{ ... "spatialCoverage": { "@type": "Place", "geo": [ { "@type": "GeoCoordinates", "latitude": -17.65, "longitude": 50 }, { "@type": "GeoCoordinates", "latitude": -19, "longitude": 51 }, ... ] } ... }
Be aware that some client applications might not index or display multiple geometries.
A Spatial Reference System (SRS) or Coordinate Reference System (CRS) is the method for defining the frame of reference for geospatial location representation. Schema.org currently has no defined property for specifying a Spatial Reference System; the assumption is that coordinates are WGS84 decimal degrees.
In the meantime, to represent an SRS in schema.org, we recommend using the schema:additionalProperty property to specify an object of type schema:PropertyValue, with a schema:propertyID of http://dbpedia.org/resource/Spatial_reference_system to identify the property as a spatial reference system, and the schema:PropertyValue/schema:value is a URI (IRI) that identifies a specific SRS. Some commonly used values are:
Spatial Reference System | IRI |
---|---|
WGS84 | http://www.w3.org/2003/01/geo/wgs84_pos#lat_long |
CRS84 | http://www.opengis.net/def/crs/OGC/1.3/CRS84 |
EPSG:26911 | https://spatialreference.org/ref/epsg/nad83-utm-zone-11n/ |
EPSG:3413 | https://spatialreference.org/ref/epsg/wgs-84-nsidc-sea-ice-polar-stereographic-north/ |
NOTE: Beware of coordinate order differences. WGS84 in the table above specifies latitude, longitude coordinate order, whereas CRS84 specifies longitude, latitude order (like GeoJSON). WGS84 is the assumed typical value for coordinates, so in general the SRS does not need to be specified.
A spatial reference system can be added in this way:
{ "@context": [ "https://schema.org/", { "dbpedia": "http://dbpedia.org/resource/" } ], "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", ... "spatialCoverage": { "@type": "Place", "geo": { "@type": "GeoShape", "line": "39.3280 120.1633 40.445 123.7878" }, "additionalProperty": { "@type": "PropertyValue", "propertyID":"http://dbpedia.org/resource/Spatial_reference_system", "value": "http://www.w3.org/2003/01/geo/wgs84_pos#lat_long" } } }
Back to top
People can be linked to datasets using three fields: author, creator, and contributor. Since schema:contributor is defined as a secondary author, and schema:Creator is defined as being synonymous with the schema:author field, we recommend using the more expressive fields creator and contributor, but using any of these fields is acceptable.
NOTE: JSON-LD doesn't preserve the order of its collection values, so if you need to preserve the order of people's names (e.g., for a citation) you can do so by applying the @list
JSON-LD keyword (for more information about this see Getting Started - JSON-LD Lists).
Given the following creator
JSON-LD block:
{
...
"creator:[
{
"@type": "Person",
"name": "Creator #1"
},
{
"@type": "Person",
"name": "Creator #2"
}
]
}
The order of these creators can be preserved by the using the @list
JSON-LD keyword:
{
...
"creator:{
"@list": [
{
"@type": "Person",
"name": "Creator #1"
},
{
"@type": "Person",
"name": "Creator #2"
}
]
}
}
Because there are more things that can be said about how and when a person contributed to a Dataset, we use the schema:Role. You'll notice that the schema.org documentation does not state that the Role type is an expected data type of author, creator and contributor, but that is addressed in this blog post introducing Role into schema.org. Thanks to Stephen Richard for this contribution
{ "@context": "https://schema.org/", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", ... "creator": [ { "@id": "https://www.sample-data-repository.org/person-role/472036", "@type": "Role", "roleName": "Principal Investigator", "creator": { "@id": "https://www.sample-data-repository.org/person/51317", "@type": "Person", "name": "Dr Uta Passow", "givenName": "Uta", "familyName": "Passow", "url": "https://www.sample-data-repository.org/person/51317" } }, { "@id": "https://www.sample-data-repository.org/person-role/472038", "@type": "Role", "roleName": "Co-Principal Investigator", "url": "https://www.sample-data-repository.org/person-role/472038", "creator": { "@id": "https://www.sample-data-repository.org/person/50663", "@type": "Person", "identifier": { "@id": "https://orcid.org/0000-0003-3432-2297", "@type": "PropertyValue", "propertyID": "https://registry.identifiers.org/registry/orcid", "url": "https://orcid.org/0000-0003-3432-2297", "value": "orcid:0000-0003-3432-2297" }, "name": "Dr Mark Brzezinski", "url": "https://www.sample-data-repository.org/person/50663" } } }
NOTE that the Role inherits the property creator
and contributor
from the Dataset when pointing to the schema:Person.
{ "@context": "https://schema.org/", "@type": "Dataset", ... "creator": [ { "@id": "https://www.sample-data-repository.org/person-role/472036", "@type": "Role", "roleName": "Principal Investigator", "url": "https://www.sample-data-repository.org/person-role/472036", "creator": { "@id": "https://www.sample-data-repository.org/person/51317", "@type": "Person", "name": "Dr Uta Passow", "givenName": "Uta", "familyName": "Passow", "url": "https://www.sample-data-repository.org/person/51317" } } }
If a single Person plays multiple roles on a Dataset, each role should be explicitly defined in this way:
{ "@context": "https://schema.org/", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", ... "creator": [ { "@id": "https://www.sample-data-repository.org/person-role/472036", "@type": "Role", "roleName": "Principal Investigator", "url": "https://www.sample-data-repository.org/person-role/472036", "creator": { "@id": "https://www.sample-data-repository.org/person/51317", "@type": "Person", "name": "Dr Uta Passow", "givenName": "Uta", "familyName": "Passow", "url": "https://www.sample-data-repository.org/person/51317" } }, { "@id": "https://www.sample-data-repository.org/person-role/472037", "@type": "Role", "roleName": "Contact", "url": "https://www.sample-data-repository.org/person-role/472037", "creator": { "@id": "https://www.sample-data-repository.org/person/51317" } }, { "@id": "https://www.sample-data-repository.org/person-role/472038", "@type": "Role", "roleName": "Co-Principal Investigator", "url": "https://www.sample-data-repository.org/person-role/472038", "creator": { "@id": "https://www.sample-data-repository.org/person/50663", "@type": "Person", "identifier": { "@id": "https://orcid.org/0000-0003-3432-2297", "@type": "PropertyValue", "propertyID": "https://registry.identifiers.org/registry/orcid", "url": "https://orcid.org/0000-0003-3432-2297", "value": "orcid:0000-0003-3432-2297" }, "name": "Dr Mark Brzezinski", "url": "https://www.sample-data-repository.org/person/50663" } } }
Notice that since Uta Passow has already been defined in the document with "@id": "https://www.sample-data-repository.org/person/51317"
for her role as Principal Investigator, the @id
can be used for her role as Contact by defining the Role's creator as "creator": { "@id": "https://www.sample-data-repository.org/person/51317" }
.
Back to top
If your repository is the publisher and/or provider of the dataset then you don't have to describe your repository as a schema:Organization if your repository markup includes the @id
. For example, if you published repository markup such as:
{ "@context": "https://schema.org/", "@type": ["Service", "Organization"], ... "@id": "https://www.sample-data-repository.org" ... }
then you can reuse that @id
here. Harvesters such as Google and Project418 will make the appropriate linkages and your dataset publisher/provider can be published in this way:
{ "@context": "https://schema.org/", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", ... "provider": { "@id": "https://www.sample-data-repository.org" }, "publisher": { "@id": "https://www.sample-data-repository.org" } }
Otherwise, you can define the organization inline in this way:
{ "@context": "https://schema.org/", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", ... "provider": { "@id": "https://www.sample-data-repository.org", "@type": "Organization", "legalName": "Sample Data Repository Office", "name": "SDRO", "sameAs": "http://www.re3data.org/repository/r3dxxxxxxxxx", "url": "https://www.sample-data-repository.org" }, "publisher": { "@id": "https://www.sample-data-repository.org" } }
Back to top
Data providers should include funding information in their Dataset descriptions to enable discovery and cross-linking. The information that would be useful includes the title, identifier, and url of the grant or award, along with structured information about the funding organization, including its name and identifier. Organizational identifiers are best represented using either a general purpose institutional identifier such as a ROR, GRID, or ISNI identifier, or a more specific Funder ID from the Crossref Funder Registry. The ROR for the National Science Foundation (https://ror.org/021nxhr62), for example, provides linkages to related identifiers as well. The Funder ID has the advantage that it includes both agency funders like the National Science Foundation (http://dx.doi.org/10.13039/100000001), but also provides identifiers for individual funding programs within those agencies, such as the NSF GEO Directorate (https://api.crossref.org/funders/100000085). When possible, providing both a ROR and Funder ID is helpful. Here's an example of identifiers for the National Science Foundation:
Linking a Dataset to the grants and awards that fund it can be achieved by adding a schema:MonetaryGrant through the schema:funding
property.
{ "@context": "https://schema.org/", "@type": "Dataset", "@id": "https://doi.org/10.18739/A22V2CB44", "name": "Stable water isotope data from Arctic Alaska snow pits in 2019", "funding": [ { "@id": "https://www.nsf.gov/awardsearch/showAward?AWD_ID=1604105", "@type": "MonetaryGrant", "identifier": "1604105", "name": "Collaborative Research: Nutritional Landscapes of Arctic Caribou: Observations, Experiments, and Models Provide Process-Level Understanding of Forage Traits and Trajectories", "url": "https://www.nsf.gov/awardsearch/showAward?AWD_ID=1604105", "funder": { "@id": "http://dx.doi.org/10.13039/100000001", "@type": "Organization", "name": "National Science Foundation", "identifier": [ "http://dx.doi.org/10.13039/100000001", "https://ror.org/021nxhr62" ] } }, { "@type": "MonetaryGrant", "@id": "https://akareport.aka.fi/ibi_apps/WFServlet?IBIF_ex=x_hakkuvaus2&HAKNRO1=316349&UILANG=en&TULOSTE=HTML", "identifier": "316349", "name": "Where does water go when snow melts? New spatio-temporal resolution in stable water isotopes measurements to inform cold climate hydrological modelling", "url": "https://akareport.aka.fi/ibi_apps/WFServlet?IBIF_ex=x_hakkuvaus2&HAKNRO1=316349&UILANG=en&TULOSTE=HTML", "funder": { "@id": "http://dx.doi.org/10.13039/501100002341", "@type": "Organization", "name": "Academy of Finland", "identifier": [ "http://dx.doi.org/10.13039/501100002341", "https://ror.org/05k73zm37" ] } } ] }
We recommend providing as much structured information about the grants that fund a Dataset as possible so that aggregators and harvesters can crosslink to the Funding agencies and grants that provided resources for the Dataset.
Back to top
Link a Dataset to its license to document legal constraints by adding a schema:license property. The guide recommends providing a URL that unambiguously identifies a specific version of the license used, but for many licenses it is hard to determine what that URL should be. Thus, we recommend that the license URL be drawn from the SPDX license list, which provides a curated list of licenses and their properties that is well maintained. For each SPDX entry, SPDX provides a canonical URL for the license (e.g., http://spdx.org/licenses/CC0-1.0
), a unique licenseId
(e.g., CC0-1.0
), and other metadata about the license. Here's an example using the SPDX license URI for the Creative Commons CC-0 license:
{ "@context": "https://schema.org/", "@id": "http://www.sample-data-repository.org/dataset/123", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", "license": "http://spdx.org/licenses/CC0-1.0" ... }
SPDX URIs for each license can be found by finding the appropriate license in the SPDX license list, and then remove the final .html
extension from the filename. For example, in the table one can find the license page for Apache at the URI https://spdx.org/licenses/Apache-2.0.html
, which can be converted into the associated linked data URI by removing the .html
, leaving us with https://spdx.org/licenses/Apache-2.0
. Alternatively, one can find the license file in the structured data listings and copy the URL from the associated file. For example, the URL for the Apache-2.0 license is listed in the file at https://github.com/spdx/license-list-data/blob/master/rdfturtle/Apache-2.0.turtle.
While many licenses are ambiguous about the license URI for the license, the Creative Commons licenses and a few others are exceptions in that they provide extremely consistent URIs for each license, and these are in widespread use. So, while we recommend using the SPDX URI, we recognize that some sites may want to use the CC license URIs directly, which is helpful in recognizing the license. In this case, we recommend that the SPDX URI still be used as described above, and the other URI also be provided as well in a list. Here's an example using the traditional Creative Commons URI along with the SPDX URI.
{ "@context": "https://schema.org/", "@id": "http://www.sample-data-repository.org/dataset/123", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", "license": [ "http://spdx.org/licenses/CC0-1.0", "https://creativecommons.org/publicdomain/zero/1.0"] ... }
The following table contains the SPDX URIs for some of the most common licenses. Others can be looked up at the SPDX site as described above.
License | SPDX URI |
---|---|
Apache-2.0 | https://spdx.org/licenses/Apache-2.0 |
BSD-3-Clause | https://spdx.org/licenses/BSD-3-Clause |
CC-BY-3.0 | https://spdx.org/licenses/CC-BY-3.0 |
CC-BY-4.0 | https://spdx.org/licenses/CC-BY-4.0 |
CC-BY-SA-4.0 | https://spdx.org/licenses/CC-BY-SA-4.0 |
CC0-1.0 | https://spdx.org/licenses/CC0-1.0 |
GPL-3.0-only | https://spdx.org/licenses/GPL-3.0-only |
GPL-3.0-or-later | https://spdx.org/licenses/GPL-3.0-or-later |
MIT | https://spdx.org/licenses/MIT |
MIT-0 | https://spdx.org/licenses/MIT-0 |
Back to top
A schema:Dataset
can be composed of multiple digital objects which are listed in the schema:distribution
list. For each schema:DataDownload
, it can be useful to provide an cryptographic checksum value (like SHA 256 or MD5) that can be used to characterize the contents of the object. Aggregators and distributors can use these values to verify that they have retrieved exactly the same content as the original provider made available, and that replica copies of an object are identical to the original, among other uses. Because schema.org does not contain a class for representing checksum values, by convention we recommend using the spdx:checksum
property, which points at an spdx:Checksum
instance that provides both the value of the checksum and the algorithm that was used to calculate the checksum.
Here's an example that provides two different checksum values for a single digital object within a schema:DataDownload
description. Note that providers will need to define the spdx
prefix in their @context
block in order to use the prefix as shown in the example.
{ "@context": [ "https://schema.org/", { "spdx": "http://spdx.org/rdf/terms#" } ], "@type": "Dataset", "@id": "https://dataone.org/datasets/doi%3A10.18739%2FA2NK36607", "sameAs": "https://doi.org/10.18739/A2NK36607", "name": "Conductivity-Temperature-Depth (CTD) data along DBO5 (Distributed Biological Observatory - Barrow Canyon), from the 2009 Circulation, Cross-shelf Exchange, Sea Ice, and Marine Mammal Habitat on the Alaskan Beaufort Sea Shelf cruise on USCGC Healy (HLY0904)", "distribution": { "@type": "DataDownload", "@id": "https://dataone.org/datasets/urn%3Euuid%3E2646d817-9897-4875-9429-9c196be5c2ae", "identifier": "urn:uuid:2646d817-9897-4875-9429-9c196be5c2ae", "spdx:checksum": [ { "@type": "spdx:Checksum", "spdx:checksumValue": "39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51", "spdx:algorithm": { "@id": "spdx:checksumAlgorithm_sha256" } }, { "@type": "spdx:Checksum", "spdx:checksumValue": "65d3616852dbf7b1a6d4b53b00626032", "spdx:algorithm": { "@id": "spdx:checksumAlgorithm_md5" } } ] } }
The algorithm property is chosen from the controlled SPDX vocabulary of checksum types, making it easy for processors to recalculate checksum values to verify them. Common algorithms that many providers would use include spdx:checksumAlgorithm_sha256
and spdx:checksumAlgorithm_md5
. Note specifically that the spdx:checksumAlgorithm_sha256
value is inside of an @id
property so that the SPDX namespace from the context definition is used to define the algorithm URI.
Back to top
High level relationships that link datasets based on their processing workflows and versioning relationships are critical for data consumers and search engines to link different versions of a schema:Dataset, to clarify when a dataset is derived from one or more source Datasets, and to specify linkages to the software and activities that created these derived datasets for reproducibility. Collectively, this is provenance information.
The PROV-O recommendation provides the widely-adopted vocabulary for representing this type of provenance information, and should be used within Dataset descriptions, as most of the necessary provenance properties are currently missing from schema.org. The main exception is schema:isBasedOn
, which provides a predicate for indicating that a Dataset was derived from one or more source Datasets. Producers and consumers should interpret schema:isBasedOn
to be an equivalent property to prov:wasDerivedFrom
(in the owl:equivalentProperty
sense). Either is acceptable for representing derivation relationships, but there is utility in expressing the relationship with both predicates for consumers that might only be looking for one or the other. When other PROV
predicates are used, it is preferred to use prov:wasDerivedFrom
for consistency.
We recommend providing provenance information about data processing workflows, data derivation relationships, and versioning information using PROV-O and schema.org predicates, and describe the structures to do this in the following subsections. Aggregators and search systems should use these properties to cluster and cross-link versions of Datasets, and to provide bi-directional linkages to source and derived data products.
Link a Dataset to a prior version that it replaces by adding a prov:wasRevisionOf
property. This indicates that the current schema:Dataset
replaces or obsoletes the source Dataset indicated. The value of the prov:wasRevisionOf
should be the canonical IRI for the identifier for the original dataset, preferably to a persistently resolvable IRI such as as a DOI, but other persistent identifiers for the dataset can be used.
{ "@context": [ "https://schema.org/", { "prov": "http://www.w3.org/ns/prov#" } ], "@id": "https://doi.org/10.xxxx/Dataset-2.v2", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", "prov:wasRevisionOf": { "@id": "https://doi.org/10.xxxx/Dataset-2.v1" } }
A derived Dataset is one in which the values in the data are somehow related or created from the values in one or more source datasets. For example, raw voltage values from a sensor might be recorded in a raw data file, which is then processed through calibration functions to produce a derived dataset with values in scientific units. Other examples of derived data include data that has been error corrected, gap-filled, or integrated with other sources.
To indicate that a Dataset has been derived from a source Dataset, use the prov:wasDerivedFrom
property. This indicates that the current schema:Dataset
was created in whole or in part from content in the source Dataset, and therefore does not represent an independent set of measurements. The value of the prov:wasDerivedFrom
should be the canonical IRI for the identifier for the source dataset, preferably to a persistently resolvable IRI such as a DOI, but other persistent identifiers for the dataset can be used. In addition, if a persistent identifier for a digital object within a Dataset is available, the prov:wasDerivedFrom
may also be used to indicate that that digital object was derived from that particular source object, rather than the overall Dataset. This allows one to be more specific about the exact relationship between the source and derived data objects.
In addition to prov:wasDerivedFrom
, schema.org provides the schema:isBasedOn
property, which should be considered to be an equivalent property to prov:wasDerivedFrom
. For compatibility with schema.org, we recommend that producers use schema:isBasedOn
in addition to or instead of prov:wasDerivedFrom
to indicate derivation relationships.
{ "@context": [ "https://schema.org/", { "prov": "http://www.w3.org/ns/prov#" } ], "@id": "https://doi.org/10.xxxx/Dataset-2", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", "prov:wasDerivedFrom": { "@id": "https://doi.org/10.xxxx/Dataset-1" }, "isBasedOn": { "@id": "https://doi.org/10.xxxx/Dataset-1" } }
Frequently data are processed to create derived Datasets or other products using software programs that use some source data, transform it in various ways, and create the derived products. Understanding these software workflows promotes understanding of the products, and facilitates reproducibility. Describing a software workflow is really just a mechanism to provide more detail about how derived products were created when software was executed. The ProvONE vocabulary extends PROV to define a specific concept for an execution event (provone:Execution
) during which a software program (provone:Program
) is executed. During this execution, the software can use source data (prov:used
) and generate outputs (prov:wasGeneratedBy
), which then can be inferred to have been derived from the source data.
Any portion of the software workflow can be described to increase information about derived datasets. For example, use prov:used
to link an execution to one or more source datasets, and use prov:wasGeneratedBy
to link an execution to one or more derived products. When information about the execution event itself is known, use provone:Execution
to describe that event, and link it to the source and derived products, as well as the program. The program is often a software script that is itself dereferenceable, and may be part of the archived Dataset itself if it has an accessible IRI.
{ "@context": [ "https://schema.org/", { "prov": "http://www.w3.org/ns/prov#", "provone": "http://purl.dataone.org/provone/2015/01/15/ontology#" } ], "@id": "https://doi.org/10.xxxx/Dataset-2", "@type": "Dataset", "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016", "prov:wasDerivedFrom": { "@id": "https://doi.org/10.xxxx/Dataset-1" }, "schema:isBasedOn": { "@id": "https://doi.org/10.xxxx/Dataset-1" }, "prov:wasGeneratedBy": { "@id": "https://example.org/executions/execution-42", "@type": "provone:Execution", "prov:hadPlan": "https://somerepository.org/datasets/10.xxxx/Dataset-2.v2/process-script.R", "prov:used": { "@id": "https://doi.org/10.xxxx/Dataset-1" } } }
Back to top