Journal scraper definitions for the ContentMine framework

ianthe/journal-scrapers

 
 

journal-scrapers

Journal scraper definitions for the ContentMine framework.

This repo is a collection of ScraperJSON definitions targeting academic journals. They can be used to extract and download data from the URLs of journal articles, such as:

  • Title, author list, date
  • Figures and their captions
  • Fulltext PDF, HTML, XML, RDF
  • Supplementary materials
  • Reference lists

ScraperJSON definitions

Scrapers are defined in JSON using a schema called ScraperJSON, which is still evolving.

The current schema is described below.

The root object can have two keys:

  • url - a string-form regular expression specifying which URL(s) this scraper targets
  • elements - a dictionary of elements to scrape
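
As a rough illustration of how the url key might be used, the Python sketch below loads a definition and tests it against candidate URLs. The function name scraper_matches is hypothetical, and the unanchored search is an assumption: the schema text does not say whether the pattern must match the whole URL.

```python
import json
import re

# A minimal definition; the "url" value is the pattern from this README's example.
# The JSON text contains "\\." which decodes to the regex "\." (a literal dot).
definition = json.loads(r'{"url": "plos.*\\.org", "elements": {}}')


def scraper_matches(definition, url):
    """Return True if the definition's url pattern is found anywhere in url.

    Assumption: the pattern is applied as an unanchored search.
    """
    return re.search(definition["url"], url) is not None
```

With this sketch, a URL on a PLOS domain such as http://www.plosone.org/... would be accepted, while a URL that does not contain the pattern would be rejected.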

Elements are defined as key-value pairs, where the key is a description of the element, and the value is a dictionary of specifiers defining the element and its processing. Allowed keys in the specifier dictionary are:

  • selector - an XPath selector targeting the element to be selected.
  • attribute - a string specifying the attribute to extract from the selected element. Optional (omitting this key is equivalent to giving it a value of text). In addition to HTML attributes, two special attributes are allowed:
    • text - extracts any plaintext inside the selected element
    • html - extracts the inner HTML of the selected element
  • download - a boolean flag: true if the element is a URL to a resource that must be downloaded. Optional (omitting this key is equivalent to giving it a value of false).
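
The defaults described above can be sketched in Python as a small normalisation step. The function name normalise_specifier is hypothetical, and treating selector as required is an assumption based on it being the only key not marked optional.

```python
def normalise_specifier(spec):
    """Return a copy of spec with the optional keys filled in.

    Assumption: "selector" is required, since it is the only key
    that the schema description does not mark as optional.
    """
    if "selector" not in spec:
        raise ValueError("a specifier needs a 'selector'")
    return {
        "selector": spec["selector"],
        # Omitting "attribute" is equivalent to "text".
        "attribute": spec.get("attribute", "text"),
        # Omitting "download" is equivalent to false.
        "download": spec.get("download", False),
    }
```

Filling in defaults once, up front, means any later processing code can read all three keys without checking whether they are present.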

Example:

{
  "url": "plos.*\\.org",
  "elements": {
    "fulltext_pdf": {
      "selector": "//meta[@name='citation_pdf_url']",
      "attribute": "content",
      "download": true
    },
    "title": {
      "selector": "//meta[@name='citation_title']",
      "attribute": "content"
    }
  }
}
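
To show how such a definition could be applied, here is a minimal Python sketch using only the standard library. It is not how quickscrape works internally. The page content, the names PAGE, ELEMENTS and extract, and the example.org URL are all made up for illustration. The title element is given "attribute": "content" here because a meta tag carries its value in that attribute rather than in its text. The sketch assumes a well-formed document and handles only the small XPath subset that xml.etree.ElementTree supports.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page standing in for a journal article.
PAGE = """<html><head>
<meta name="citation_title" content="An example article"/>
<meta name="citation_pdf_url" content="http://example.org/article.pdf"/>
</head><body/></html>"""

# The "elements" part of a definition, written as a Python dict.
ELEMENTS = {
    "fulltext_pdf": {
        "selector": "//meta[@name='citation_pdf_url']",
        "attribute": "content",
        "download": True,
    },
    "title": {
        "selector": "//meta[@name='citation_title']",
        "attribute": "content",
    },
}


def extract(root, spec):
    """Return the values selected by one specifier, as a list of strings."""
    selector = spec["selector"]
    # ElementTree only accepts relative paths, so '//x' is rewritten to './/x'.
    if selector.startswith("//"):
        selector = "." + selector
    attribute = spec.get("attribute", "text")
    values = []
    for node in root.findall(selector):
        if attribute == "text":
            # All plain text inside the element.
            values.append("".join(node.itertext()))
        elif attribute == "html":
            # Inner markup: leading text plus each serialised child.
            inner = node.text or ""
            inner += "".join(ET.tostring(c, encoding="unicode") for c in node)
            values.append(inner)
        else:
            values.append(node.get(attribute))
    return values


root = ET.fromstring(PAGE)
results = {name: extract(root, spec) for name, spec in ELEMENTS.items()}

# Values of elements flagged with "download": true are URLs to fetch.
to_download = [
    url
    for name, spec in ELEMENTS.items()
    if spec.get("download", False)
    for url in results[name]
]
```

Real journal pages are rarely well-formed XML, so a practical tool would need an HTML parser with full XPath support; the standard library is used here only to keep the sketch self-contained.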

Usage

Currently these definitions can be used with the quickscrape tool.
