ContentMine · ryscher · May 26, 2014 · May 27, 2014 · May 27, 2014 · May 29, 2014
diff --git a/.coveralls.yml b/.coveralls.yml
@@ -0,0 +1,2 @@
+repo_token: vHHvsw3QKvjK7zPb9DcFgt6ivLVS8r5uP
+service_name: travis
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,2 @@
+results.json
+coverage.json
diff --git a/.travis.yml b/.travis.yml
@@ -0,0 +1,14 @@
+language: ruby
+rvm:
+  - 2.1.0
+before_install:
+  - sudo apt-get update -qq
+  - sudo apt-get install -y software-properties-common python-software-properties
+  - sudo add-apt-repository -y ppa:chris-lea/node.js
+  - sudo apt-get update
+  - sudo apt-get install -y python g++ make nodejs libfontconfig1
+  - curl --insecure https://www.npmjs.org/install.sh | bash
+install:
+  - sudo -H npm install --global quickscrape
+script:
+  - ruby test/test_all.rb
diff --git a/README.md b/README.md
@@ -1,37 +1,64 @@
 journal-scrapers
 ================
 
-Journal scraper definitions for the ContentMine framework
+[travis]: http://travis-ci.org/ContentMine/journal-scrapers
+[license]: https://creativecommons.org/publicdomain/zero/1.0/
+[coverage]: https://coveralls.io/r/ContentMine/journal-scrapers
 
-### Definition
+[![Build Status](http://img.shields.io/travis/ContentMine/journal-scrapers.svg)][travis]
+[![Coverage](http://img.shields.io/coveralls/ContentMine/journal-scrapers.svg)][coverage]
+[![License](http://img.shields.io/badge/license-CC0-blue.svg)][license]
 
-Scrapers are defined in JSON, using a schema that is currently evolving:
+Journal scraper definitions for the ContentMine framework.
 
-There can be two keys in the root object:
+### Table of Contents
 
-- ***url*** - a string-form regular expression specifying which URL(s) this scraper targets
-- ***elements*** - a dictionary of elements to scrape
+- [Summary](#summary)
+- [Scraper collection status](#scraper-collection-status)
+- [ScraperJSON definitions](#scraperjson-definitions)
+- [Contributing scrapers](#contributing-scrapers)
+- [Usage](#usage)
+- [License](#license)
 
-Elements are defined as key-value pairs, where the key is a description of the element, and the value is a dictionary of specifiers defining the element and its processing. Allowed keys in the specifier dictionary are:
+### Summary
 
-- ***selector*** - an XPath or CSS selector targetting the element to be selected
-- ***attribute*** - a string specifying the attribute to extract from the selected element
-- ***download*** - a boolean flag: true if the element is a URL to a resource that must be downloaded
+This repo is a collection of scraperJSON definitions targeting academic journals. They can be used to extract and download data from URLs of journal articles, such as:
 
-Example:
-```json
-{
-  "url": "plosgenetics.org",
-  "elements": {
-    "fulltext_pdf": {
-      "selector": "//meta[@name='citation_pdf_url']",
-      "attribute": "content",
-      "download": true
-    }
-  }
-}
-```
+- Title, author list, date
+- Figures and their captions
+- Fulltext PDF, HTML, XML, RDF
+- Supplementary materials
+- Reference lists
+
+### Scraper collection status
+
+All scrapers in the collection are tested automatically every day and whenever any scraper changes. Expected results are stored for a set of URLs; each test run randomly selects one of those URLs, re-scrapes it, and passes if the results match those expected. If the badge is green and says `build|passing`, all the scrapers are OK. If the badge is red and says `build|failing`, one or more of the scrapers have stopped working. Click the badge to open the test report and see which scrapers are failing and how.
+
+[![Build Status](http://img.shields.io/travis/ContentMine/journal-scrapers.svg)][travis]
+
+Test coverage of the scrapers is also checked. Coverage should be 100%, meaning every element of every scraper is checked at least once during testing. If coverage is below 100%, you can see exactly which parts of which scrapers are not covered by clicking the `coverage` badge below.
+
+[![Coverage](http://img.shields.io/coveralls/ContentMine/journal-scrapers.svg)][coverage]
+
+### ScraperJSON definitions
+
+Scrapers are defined in JSON, using a schema called scraperJSON, which is still evolving. The current schema is described at [the scraperJSON repo](https://github.com/ContentMine/scraperJSON).
+
+### Contributing scrapers
+
+If your favourite publisher or journal is not covered by a scraper in our collection, we'd love you to submit a new scraper.
+
+We ask that all contributions follow some simple rules that help us maintain a high-quality collection.
+
+1. The scraper covers all [the data elements used in the ContentMine](https://github.com/ContentMine/journal-scrapers/wiki/data_collected_for_ContentMine).
+2. It comes with a set of 5-10 test URLs.
+3. It comes with a regression test ([which can be auto-generated](https://github.com/ContentMine/journal-scrapers/wiki/Generating%20tests%20for%20your%20scrapers)).
+4. You agree to release the scraper definition and tests under the [Creative Commons Zero license](https://creativecommons.org/publicdomain/zero/1.0/).
 
 ### Usage
 
-Currently these definitions can be used with the [quickscrape](http://github.com/ContentMine/quickscrape) tool.
+Currently these definitions can be used with the [quickscrape](http://github.com/ContentMine/quickscrape) tool.
+
+### License
+
+All scrapers are released under the [Creative Commons 0 (CC0)](https://creativecommons.org/publicdomain/zero/1.0/) license.
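
Review note: the testing procedure the README describes (stored expected results, one randomly chosen URL re-scraped, results compared) can be sketched as below. This is a minimal Python illustration, not the repository's `test/test_all.rb`. The names `run_regression_test`, `element_coverage`, the `scrape` callable and the shape of the `expected` mapping are all hypothetical, and the coverage definition (an element counts as covered once it appears non-empty in any stored result) is an assumption.

```python
import random
from typing import Callable, Dict, List, Optional

# Hypothetical shape: a scrape result maps element names to captured values.
Result = Dict[str, List[str]]


def run_regression_test(
    expected: Dict[str, Result],
    scrape: Callable[[str], Result],
    rng: Optional[random.Random] = None,
) -> bool:
    """Re-scrape one randomly chosen stored URL and compare with the stored result."""
    rng = rng or random.Random()
    # sorted() keeps the choice reproducible when a seeded rng is passed in.
    url = rng.choice(sorted(expected))
    return scrape(url) == expected[url]


def element_coverage(scraper_elements: List[str], expected: Dict[str, Result]) -> float:
    """Fraction of a scraper's elements that are non-empty in at least one stored result."""
    if not scraper_elements:
        return 1.0
    covered = {
        name
        for result in expected.values()
        for name, values in result.items()
        if values
    }
    hits = sum(1 for name in scraper_elements if name in covered)
    return hits / len(scraper_elements)


# Usage with a stand-in scraper that returns canned data instead of fetching pages.
stored = {
    "http://example.org/article/1": {"title": ["First"], "doi": ["10.0000/1"]},
    "http://example.org/article/2": {"title": ["Second"], "doi": []},
}
fake_scrape = lambda url: stored[url]
passed = run_regression_test(stored, fake_scrape, random.Random(0))
coverage = element_coverage(["title", "doi", "abstract"], stored)
```

In this toy data `abstract` never appears in a stored result, so coverage falls below 100%, which is the situation the coverage badge is meant to surface.
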
diff --git a/scrapers/elife.json b/scrapers/elife.json
@@ -0,0 +1,92 @@
+{
+  "url": "elifesciences\\.org",
+  "elements": {
+    "publisher": {
+      "selector": "//meta[@name='citation_publisher']",
+      "attribute": "content"
+    },
+    "journal": {
+      "selector": "//meta[@name='citation_journal_title']",
+      "attribute": "content"
+    },
+    "title": {
+      "selector": "//meta[@name='citation_title']",
+      "attribute": "content"
+    },
+    "authors": {
+      "selector": "//meta[@name='citation_author']",
+      "attribute": "content"
+    },
+    "date": {
+      "selector": "//meta[@name='citation_date']",
+      "attribute": "content"
+    },
+    "doi": {
+      "selector": "//meta[@name='citation_doi']",
+      "attribute": "content"
+    },
+    "volume": {
+      "selector": "//meta[@name='citation_volume']",
+      "attribute": "content"
+    },
+    "issue": {
+      "selector": "//meta[@name='citation_issue']",
+      "attribute": "content"
+    },
+    "firstpage": {
+      "selector": "//meta[@name='citation_firstpage']",
+      "attribute": "content"
+    },
+    "description": {
+      "selector": "//meta[@name='description']",
+      "attribute": "content"
+    },
+    "abstract": {
+      "selector": "//div[contains(@class, 'abstract')]//p[1]",
+      "attribute": "text"
+    },
+    "fulltext_html": {
+      "selector": "//meta[@name='citation_fulltext_html_url']",
+      "attribute": "content",
+      "download": {
+        "rename": "fulltext.html"
+      }
+    },
+    "fulltext_pdf": {
+      "selector": "//meta[@name='citation_pdf_url']",
+      "attribute": "content",
+      "download": {
+        "rename": "fulltext.pdf"
+      }
+    },
+    "fulltext_xml": {
+      "selector": "//meta[@name='citation_xml_url']",
+      "attribute": "content",
+      "download": {
+        "rename": "fulltext.xml"
+      }
+    },
+    "supplementary_material": {
+      "selector": "//a[contains(concat(' ', normalize-space(@class), ' '), ' article-supporting-download ')]",
+      "attribute": "href",
+      "download": true
+    },
+    "figure": {
+      "selector": "//div[contains(concat(' ', normalize-space(@class), ' '), ' elife-figure-link-download ')]/a",
+      "attribute": "href",
+      "download": true
+    },
+    "figure_caption": {
+      "selector": "//div[contains(@class, 'fig-caption')]",
+      "attribute": "text"
+    },
+    "license": {
+      "selector": "//meta[@name='DC.Rights']",
+      "attribute": "content"
+    },
+    "copyright": {
+      "selector": "//meta[@name='DC.Rights']",
+      "attribute": "content"
+    }
+  }
+}
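
Review note: the selectors in this file match on the `class` attribute in two ways, a plain `contains(...)` substring test and the longer `concat(' ', normalize-space(@class), ' ')` idiom. The sketch below mimics both in pure Python to show how they differ. The helper names are hypothetical, and `str.split()` treats slightly more characters as whitespace than XPath 1.0 does, so this is an approximation rather than an XPath engine. Note also that in XPath 1.0 an unprefixed `class` selects child elements named `class`, not the attribute, so the attribute form is always `@class`.

```python
def xpath_normalize_space(value: str) -> str:
    """Approximate XPath 1.0 normalize-space(): trim and collapse whitespace runs."""
    return " ".join(value.split())


def substring_match(class_attr: str, needle: str) -> bool:
    """Approximate contains(@class, needle): a plain substring test."""
    return needle in class_attr


def token_match(class_attr: str, token: str) -> bool:
    """Approximate contains(concat(' ', normalize-space(@class), ' '), ' token ')."""
    padded = " " + xpath_normalize_space(class_attr) + " "
    return f" {token} " in padded


# A class list where the wanted name is only part of a longer class name.
wrapper_classes = "fig-caption-wrapper  extra"
loose = substring_match(wrapper_classes, "fig-caption")
strict = token_match(wrapper_classes, "fig-caption")

# A class list with untidy whitespace that does contain the exact class.
download_classes = "  article-supporting-download \n other"
exact = token_match(download_classes, "article-supporting-download")
```

The substring form accepts `fig-caption-wrapper` as a match for `fig-caption`, while the token form only accepts the exact class name, which is why the longer idiom is the safer choice when class names share prefixes.
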
diff --git a/generic_open.json b/scrapers/generic_open.json
@@ -4,12 +4,16 @@
     "fulltext_pdf": {
       "selector": "//meta[@name='citation_pdf_url']",
       "attribute": "content",
-      "download": true
+      "download": {
+        "rename": "fulltext.pdf"
+      }
     },
     "fulltext_html": {
       "selector": "//meta[@name='citation_fulltext_html_url']",
       "attribute": "content",
-      "download": true
+      "download": {
+        "rename": "fulltext.html"
+      }
     },
     "title": {
       "selector": "//meta[@name='citation_title']",
@@ -44,4 +48,4 @@
       "attribute": "content"
     }
   }
-}
+}
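
Review note: this change means `download` now appears in two forms across the collection, the boolean `true` and an object with a `rename` key. A consumer of these definitions has to accept both. The sketch below shows one way to normalise them in Python; `download_plan` is a hypothetical helper, and treating an object without `rename` as "download under the original name" is an assumption, not something the scraperJSON schema is quoted as saying here.

```python
from typing import Any, Dict, Optional, Tuple


def download_plan(spec: Dict[str, Any]) -> Tuple[bool, Optional[str]]:
    """Return (should_download, rename_target) for one element specifier."""
    download = spec.get("download", False)
    if download is True:
        # Boolean form: download, keep whatever name the resource already has.
        return True, None
    if isinstance(download, dict):
        # Object form: download, optionally saving under the given name.
        return True, download.get("rename")
    # Missing, false, or an unrecognised value: capture only, no download.
    return False, None


pdf_spec = {
    "selector": "//meta[@name='citation_pdf_url']",
    "attribute": "content",
    "download": {"rename": "fulltext.pdf"},
}
figure_spec = {"selector": "//img", "attribute": "src", "download": True}
title_spec = {"selector": "//meta[@name='citation_title']", "attribute": "content"}

plans = [download_plan(s) for s in (pdf_spec, figure_spec, title_spec)]
```
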
diff --git a/scrapers/mdpi.json b/scrapers/mdpi.json
@@ -0,0 +1,91 @@
+{
+  "url": "mdpi\\.com",
+  "elements": {
+    "publisher": {
+      "selector": "//meta[@name='citation_publisher']",
+      "attribute": "content"
+    },
+    "journal": {
+      "selector": "//meta[@name='citation_journal_title']",
+      "attribute": "content"
+    },
+    "title": {
+      "selector": "//meta[@name='citation_title']",
+      "attribute": "content"
+    },
+    "authors": {
+      "selector": "//meta[@name='citation_author']",
+      "attribute": "content"
+    },
+    "date": {
+      "selector": "//meta[@name='citation_date']",
+      "attribute": "content"
+    },
+    "doi": {
+      "selector": "//meta[@name='citation_doi']",
+      "attribute": "content"
+    },
+    "volume": {
+      "selector": "//meta[@name='citation_volume']",
+      "attribute": "content"
+    },
+    "issue": {
+      "selector": "//meta[@name='citation_issue']",
+      "attribute": "content"
+    },
+    "firstpage": {
+      "selector": "//meta[@name='citation_firstpage']",
+      "attribute": "content"
+    },
+    "description": {
+      "selector": "//meta[@name='description']",
+      "attribute": "content"
+    },
+    "abstract": {
+      "selector": "//meta[@name='description']",
+      "attribute": "content"
+    },
+    "fulltext_html": {
+      "selector": "//meta[@name='citation_fulltext_html_url']",
+      "attribute": "content",
+      "download": {
+        "rename": "fulltext.html"
+      }
+    },
+    "fulltext_pdf": {
+      "selector": "//meta[@name='citation_pdf_url']",
+      "attribute": "content",
+      "download": {
+        "rename": "fulltext.pdf"
+      }
+    },
+    "fulltext_xml": {
+      "selector": "//meta[@name='fulltext_xml']",
+      "attribute": "content",
+      "download": {
+        "rename": "fulltext.xml"
+      }
+    },
+    "supplementary_material": {
+      "selector": "//a[contains(concat(' ', normalize-space(@class), ' '), ' article-supporting-download ')]",
+      "attribute": "href",
+      "download": true
+    },
+    "figure": {
+      "selector": "//div[contains(@id, 'fig')]/div/img",
+      "attribute": "src",
+      "download": true
+    },
+    "figure_caption": {
+      "selector": "//div[contains(@class, 'html-fig_description')]"
+    },
+    "license": {
+      "selector": "//div[contains(concat(' ', normalize-space(@class), ' '), ' license-p ')]",
+      "attribute": "text"
+    },
+    "copyright": {
+      "selector": "//div[contains(@class, 'copyright')]",
+      "attribute": "text"
+    }
+  }
+}
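
Review note: small selector slips are easy to miss in definitions of this size, so a lint pass before merging may be worthwhile. The Python sketch below is one possible shape for such a check, under stated assumptions: `lint_scraper` and its rules are hypothetical, the `url` pattern is compiled with Python's `re` although the definitions are consumed by a JavaScript tool, and the `//meta` rule is a heuristic based on `<meta>` elements carrying their value in the `content` attribute.

```python
import json
import re
from typing import List

# Matches contains(class, ...) where the '@' before 'class' was forgotten.
BARE_CLASS = re.compile(r"contains\(\s*class\s*,")


def lint_scraper(text: str) -> List[str]:
    """Return human-readable problems found in one scraper definition."""
    problems: List[str] = []
    definition = json.loads(text)
    try:
        re.compile(definition["url"])
    except (KeyError, TypeError, re.error) as exc:
        problems.append(f"url: {exc!r}")
    for name, spec in definition.get("elements", {}).items():
        selector = spec.get("selector")
        if not isinstance(selector, str) or not selector:
            problems.append(f"{name}: missing selector")
            continue
        if BARE_CLASS.search(selector):
            problems.append(f"{name}: 'class' is missing '@' in contains()")
        if selector.startswith("//meta") and spec.get("attribute") == "text":
            problems.append(f"{name}: meta elements hold data in 'content', not text")
    return problems


# A made-up definition containing two of the slips the rules look for.
sample = r'''
{
  "url": "example\\.org",
  "elements": {
    "title": {"selector": "//meta[@name='citation_title']", "attribute": "content"},
    "abstract": {"selector": "//div[contains(class, 'abstract')]//p[1]"},
    "license": {"selector": "//meta[@name='DC.Rights']", "attribute": "text"}
  }
}
'''
problems = lint_scraper(sample)
```

A check like this only catches patterns it has been told about; the regression tests against real article URLs remain the authoritative signal that a scraper works.
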