Details of the harvesting of Bioschemas markup from live deployments on the Web.
The initial purpose is to track the harvesting of data for use in Project 29 at the BioHackathon-Europe 2021. The harvesting will be conducted with BMUSE and the data hosted on a server at Heriot-Watt University.
We aim to harvest data from the sites on the Bioschemas live deploy page for which we have a sitemap. We will also include sites where we have a list of URLs. Full details of the datasets to be harvested and their progress can be found on the project board.
We have loaded the harvested data into a GraphDB triplestore:
- SPARQL Endpoint
- grlc REST API: Contains a curated list of queries.
Need to change the endpoint to https://swel.macs.hw.ac.uk/data/repositories/bioschemas
- Alternate grlc REST API: Contains all queries in the queries directory.
- Snorql Extended Interface
- Data directory
- Executable query notebook
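As a minimal sketch of querying the triplestore from a script, the Python below builds a SPARQL 1.1 Protocol GET request and flattens a SPARQL JSON result. It assumes that the repository URL given in the note above serves as the SPARQL endpoint and accepts standard GET requests. The helper names (`build_request`, `rows`, `run`) and the example query are illustrative and are not part of the project.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the repository URL mentioned in this README is the SPARQL endpoint.
ENDPOINT = "https://swel.macs.hw.ac.uk/data/repositories/bioschemas"

# Illustrative query: number of triples in each named graph.
GRAPH_SIZES = """
SELECT ?g (COUNT(*) AS ?triples)
WHERE { GRAPH ?g { ?s ?p ?o } }
GROUP BY ?g
ORDER BY DESC(?triples)
LIMIT 20
"""


def build_request(query, endpoint=ENDPOINT):
    """Return a GET request for `query` following the SPARQL 1.1 Protocol."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def rows(results):
    """Flatten a SPARQL JSON results document into a list of plain dicts."""
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in results["results"]["bindings"]
    ]


def run(query, endpoint=ENDPOINT):
    """Send the query and return the flattened rows (needs network access)."""
    with urllib.request.urlopen(build_request(query, endpoint), timeout=60) as resp:
        return rows(json.load(resp))
```

Calling `run(GRAPH_SIZES)` should list the largest named graphs, provided the endpoint is reachable and configured as assumed.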
- DisProt: 2,044 pages harvested using the dynamic scraper (v0.4.0) on 20 October 2021
- MobiDB: 2,083 pages harvested using the dynamic scraper (v0.4.0) on 27 October 2021
- Paired Omics: 78 pages harvested using the dynamic scraper (v0.5.0) on 28 October 2021
- BridgeDb: 2 pages harvested using the static scraper (v0.5.1) on 2 November 2021
- PCDDB: 1,402 pages harvested using the static scraper (v0.5.1) on 2 November 2021
- MassBank: 76,253 pages harvested using the static scraper (v0.5.0) on 4 November 2021; 10,326 pages could not be harvested because of errors in their JSON-LD. For loading into the triplestore, the N-Quads files were merged using the command
find . -name '*.nq' -exec cat {} \; > massbank.nq
as detailed here.
- Cosmic: 2,424 pages harvested using the static scraper (v0.5.2) on 4 November 2021
- Nanocommons: 3 pages harvested using the static scraper (v0.5.2) on 4 November 2021
- Alliance of Genomes: 12 pages harvested using the scraper (v0.5.2) on 5 November 2021
- BioVersions: 3 pages harvested using the static scraper (v0.5.2) on 5 November 2021
- EGA: 11,834 pages harvested using the scraper (v0.5.2) on 5 November 2021; 745 pages could not be harvested
- IFB: 87 pages harvested using the scraper (v0.5.2) on 5 November 2021
- PDBe: 672 pages harvested using the scraper (v0.5.2) on 5 November 2021
- Prosite: 5,859 pages harvested using the scraper (v0.5.2) on 5 November 2021
- UniProt: 3 pages harvested using the static scraper (v0.5.2) on 5 November 2021
- FAIRsharing: 6,351 pages harvested using the scraper (v0.5.2) on 6 November 2021
- COVID19 Portal: 20 pages harvested using the dynamic scraper (v0.5.2) on 7 November 2021
- GBIF: 68,167 pages harvested using the static scraper (v0.5.2) on 7 November 2021
- TeSS: 13,940 pages harvested using the scraper (v0.5.2) on 7 November 2021
- Scholia:
  - 5,345 pages harvested out of 660k supplied URLs using the dynamic scraper (v0.5.2) on 8 November 2021; 1 page could not be scraped
  - 68,974 pages harvested using the dynamic scraper (v0.5.2) on 10 November 2021; 21 pages could not be scraped
- Protein Ensemble Database (PED): 187 pages harvested using the dynamic scraper (v0.5.2) on 9 November 2021
- Bgee: statically scraped (v0.5.2) on 9-10 November 2021
  - https://bgee.org/sitemap_main.xml: 22 pages
  - https://bgee.org/sitemap_gene1.xml: 49,001 pages
- COVIDmine (no longer maintained): 49,959 pages scraped using the dynamic scraper (v0.5.2) on 8 November 2021
- MetaNetX: statically scraped (v0.5.2) on 11 November 2021
  - https://www.metanetx.org/sitemap_main.xml: 12 pages
  - https://www.metanetx.org/sitemap_chem1.xml: 49,001 pages
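
The merge step described for MassBank could also be sketched in Python, which avoids shell quoting issues and makes it easy to keep the merged file out of its own input. This is an illustrative alternative rather than the command that was actually run; the function name `merge_nquads` and the directory layout are assumptions. Unlike plain `cat`, it also appends a newline to any file that lacks a trailing one, so that statements from adjacent files cannot run together.

```python
from pathlib import Path


def merge_nquads(source_dir, output_file):
    """Concatenate every .nq file under source_dir into output_file.

    Returns the number of files merged.
    """
    source_dir = Path(source_dir)
    output_file = Path(output_file)
    # Collect the inputs before opening the output, and skip the output
    # itself in case it lives inside source_dir.
    inputs = sorted(
        p for p in source_dir.rglob("*.nq")
        if p.is_file() and p.resolve() != output_file.resolve()
    )
    with output_file.open("wb") as out:
        for path in inputs:
            data = path.read_bytes()
            out.write(data)
            # N-Quads is line based, so make sure each file ends with a newline.
            if data and not data.endswith(b"\n"):
                out.write(b"\n")
    return len(inputs)
```

For example, `merge_nquads(".", "massbank.nq")` would merge all N-Quads files below the current directory into `massbank.nq`.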
We have started testing the loading of data dumps made available as experimental Schema.org data feeds. The following table details the feeds that have been loaded. The raw data is available here.
| Data Source | Date Generated | Date Loaded | Named Graph |
|---|---|---|---|
| bio.tools | 2021-11-09 | 2021-12-17 | http://bio.tools/comp-tools-0.6-draft/ |
| chembl-28 | 2022-01-15 | 2022-03-04 | https://www.ebi.ac.uk/chembl-28/ |
The following triples were inserted by hand to track the provenance of the data feeds. Note that the location given by pav:retrievedFrom refers to the domain of the data, and the date given by pav:retrievedOn is the date the data was generated. This is for consistency with the data coming from BMUSE.
# Bio.Tools
INSERT DATA {
<http://bio.tools/comp-tools-0.6-draft/> <http://purl.org/pav/retrievedFrom> <https://bio.tools> .
<http://bio.tools/comp-tools-0.6-draft/> <http://purl.org/pav/retrievedOn> "2021-11-09T09:28:45"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
<http://bio.tools/comp-tools-0.6-draft/> a <https://schema.org/DataFeed> .
}
# ChEMBL 28
INSERT DATA {
<https://www.ebi.ac.uk/chembl-28/> <http://purl.org/pav/retrievedFrom> <https://www.ebi.ac.uk/chembl/> .
<https://www.ebi.ac.uk/chembl-28/> <http://purl.org/pav/retrievedOn> "2022-01-15T09:28:45"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
<https://www.ebi.ac.uk/chembl-28/> a <https://schema.org/DataFeed> .
}
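
Because each feed needs the same three provenance triples, the update statements could be generated rather than typed by hand, which reduces the chance of a mistyped predicate. The following is a minimal sketch under that assumption; the function name `provenance_update` is hypothetical, and the predicates are the pav:retrievedFrom and pav:retrievedOn properties described above.

```python
PAV = "http://purl.org/pav/"
XSD_DATETIME = "http://www.w3.org/2001/XMLSchema#dateTime"


def provenance_update(graph, source, generated):
    """Build a SPARQL INSERT DATA statement describing one data feed.

    graph     -- IRI of the named graph the feed was loaded into
    source    -- IRI of the site the feed comes from
    generated -- xsd:dateTime string for when the feed was generated
    """
    triples = [
        f"<{graph}> <{PAV}retrievedFrom> <{source}> .",
        f'<{graph}> <{PAV}retrievedOn> "{generated}"^^<{XSD_DATETIME}> .',
        f"<{graph}> a <https://schema.org/DataFeed> .",
    ]
    return "INSERT DATA {\n" + "\n".join("  " + t for t in triples) + "\n}"


# Example: regenerate the statement for the bio.tools feed.
print(provenance_update(
    "http://bio.tools/comp-tools-0.6-draft/",
    "https://bio.tools",
    "2021-11-09T09:28:45",
))
```

The generated text can then be submitted through whichever SPARQL update interface the triplestore exposes.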