Mozilla hack scrapers #16

Open
wants to merge 74 commits into
base: master
1ecac9e
Update README.md
blahah May 26, 2014
e05e2a3
update example
blahah May 27, 2014
a35fecd
fix README formatting and typo
blahah May 27, 2014
acb9e7c
add html and text special attributes to README
blahah May 29, 2014
a2a6d26
Update README.md
blahah Jun 1, 2014
ebea30a
Create science_direct.json
ianthe Jun 19, 2014
f6380f9
Merge pull request #5 from ianthe/master
blahah Jun 19, 2014
40b80a4
travis setup
Jun 22, 2014
13ef95b
auto test generation script
Jun 22, 2014
073a4da
move scrapers to subdir
Jun 22, 2014
4962bab
test generator script fixes
Jun 22, 2014
cdb542b
self-populating tests and peerj example
Jun 22, 2014
3118404
Merge branch 'master' of github.com:ContentMine/journal-scrapers
Jun 22, 2014
0ee30be
move sciencedirect to scrapers
Jun 22, 2014
c1cafdd
fix tmpdir use
Jun 22, 2014
1cdbd3b
debug test generator tmpdir error
Jun 22, 2014
9c82fa0
test set for peerj scraper
Jun 22, 2014
4e94869
fix test generator - now working
Jun 22, 2014
ad734f3
fix test runner - now working
Jun 22, 2014
86c3735
attempted fix for travis dependency install
Jun 22, 2014
1ce66d6
remove unneeded prints from tests
Jun 22, 2014
bf2c926
tests for plos scraper
Jun 22, 2014
9b1f20d
another attempted travis install fix
Jun 22, 2014
aa0cfc6
delete wayward results file
Jun 22, 2014
998b0de
add .gitignore
Jun 22, 2014
fb463a8
add travis badge and explanation to README
Jun 22, 2014
ab88faf
tidy formatting in README
blahah Jun 22, 2014
8e9d287
add science direct tests
Jun 22, 2014
04902e3
Merge branch 'master' of https://github.com/ContentMine/journal-scrapers
Jun 22, 2014
da2a0f2
add CC0 license
blahah Jun 23, 2014
5ce546e
matching badges
Jun 23, 2014
d1f3ddf
fix badge address
Jun 23, 2014
d5f7b08
coverage reporting for scrapers
Jun 23, 2014
ac46bc2
fix coverage reporting
Jun 23, 2014
29b9cce
another coveralls fix
Jun 23, 2014
ed42173
another coveralls fix
Jun 23, 2014
622191f
mend broken curl command
Jun 23, 2014
1a82584
remove empty file
Jun 23, 2014
396b658
fix coveralls CURL command
Jun 23, 2014
641c260
add coveralls to README
Jun 23, 2014
a443c18
make travis badges consistent
Jun 23, 2014
09c6410
another CURL cmd fix
Jun 23, 2014
04d5300
add contribution instructions
Jun 23, 2014
3d6e3a7
fix typo; finalise self-testing (fixes #4)
Jun 23, 2014
2340fa3
add TOC to README
Jun 23, 2014
222a00a
prettify TOC
Jun 23, 2014
9d189bc
tidy TOC
Jun 23, 2014
9ffe438
coveralls submission recognises travis environment
Jun 24, 2014
34c4606
typo
Jun 24, 2014
e60fe56
peerj scraper now implements all ContentMine fields
Jul 2, 2014
7317635
fix contributing doc links
blahah Jul 2, 2014
905c04c
run tests in debug mode
Jul 3, 2014
4370307
Merge branch 'master' of github.com:ContentMine/journal-scrapers
Jul 3, 2014
7198106
install libfontconfig before running travis tests
Jul 4, 2014
a5746a3
fix broken peerj tests
blahah Jul 13, 2014
2901ab4
link out to scraperJSON
Jul 13, 2014
3746fcc
handle mac MD5hash
Jul 13, 2014
187846b
MDPI full
Jul 13, 2014
4f239fa
Extract abstract from PLOS pages
CristianCantoro Jul 16, 2014
62752cd
Merge pull request #9 from CristianCantoro/master
blahah Jul 17, 2014
f64dd8d
add renaming to all scrapers
Jul 17, 2014
cec7cf3
make test line counting more accurate
Jul 17, 2014
a736045
Merge branch 'master' of github.com:ContentMine/journal-scrapers
Jul 17, 2014
dd8f6e0
update PLOS with fulltext xml and new tests
Jul 17, 2014
f50850e
add fulltext_xml to MDPI
Jul 17, 2014
edb7635
add fulltext xml to compatible scrapers
Jul 17, 2014
0ecc002
generate tests for MDPI
blahah Jul 17, 2014
fac1fdc
elife scraper
Jul 17, 2014
ab4c14b
Merge branch 'master' of github.com:ContentMine/journal-scrapers
Jul 17, 2014
1c3648c
elife tests
blahah Jul 17, 2014
0f25cdc
update test coverage calculation with element names only
Jul 17, 2014
d5f611d
Merge branch 'master' of github.com:ContentMine/journal-scrapers
Jul 17, 2014
e196c18
Initial version of scraper for Molecular Ecology
ryscher Jul 22, 2014
cb8efe1
Changed molecol.json to wiley.json
ryscher Jul 22, 2014
2 changes: 2 additions & 0 deletions .coveralls.yml
@@ -0,0 +1,2 @@
repo_token: vHHvsw3QKvjK7zPb9DcFgt6ivLVS8r5uP
service_name: travis
2 changes: 2 additions & 0 deletions .gitignore
@@ -0,0 +1,2 @@
results.json
coverage.json
14 changes: 14 additions & 0 deletions .travis.yml
@@ -0,0 +1,14 @@
language: ruby
rvm:
- 2.1.0
before_install:
- sudo apt-get update -qq
- sudo apt-get install -y software-properties-common python-software-properties
- sudo add-apt-repository -y ppa:chris-lea/node.js
- sudo apt-get update
- sudo apt-get install -y python g++ make nodejs libfontconfig1
- curl --insecure https://www.npmjs.org/install.sh | bash
install:
- sudo -H npm install --global quickscrape
script:
- ruby test/test_all.rb
75 changes: 51 additions & 24 deletions README.md
@@ -1,37 +1,64 @@
journal-scrapers
================

Journal scraper definitions for the ContentMine framework
[travis]: http://travis-ci.org/ContentMine/journal-scrapers
[license]: https://creativecommons.org/publicdomain/zero/1.0/
[coverage]: https://coveralls.io/r/ContentMine/journal-scrapers

### Definition
[![Build Status](http://img.shields.io/travis/ContentMine/journal-scrapers.svg)][travis]
[![Coverage](http://img.shields.io/coveralls/ContentMine/journal-scrapers.svg)][coverage]
[![License](http://img.shields.io/badge/license-CC0-blue.svg)][license]

Scrapers are defined in JSON, using a schema that is currently evolving:
Journal scraper definitions for the ContentMine framework.

There can be two keys in the root object:
### Table of Contents

- ***url*** - a string-form regular expression specifying which URL(s) this scraper targets
- ***elements*** - a dictionary of elements to scrape
- [Summary](#summary)
- [Scraper collection status](#scraper-collection-status)
- [ScraperJSON definitions](#scraperjson-definitions)
- [Contributing scrapers](#contributing-scrapers)
- [Usage](#usage)
- [License](#license)

Elements are defined as key-value pairs, where the key is a description of the element, and the value is a dictionary of specifiers defining the element and its processing. Allowed keys in the specifier dictionary are:
### Summary

- ***selector*** - an XPath or CSS selector targeting the element to be selected
- ***attribute*** - a string specifying the attribute to extract from the selected element
- ***download*** - a boolean flag: true if the element is a URL to a resource that must be downloaded
This repo is a collection of scraperJSON definitions targeting academic journals. They can be used to extract and download data from URLs of journal articles, such as:

Example:
```json
{
"url": "plosgenetics.org",
"elements": {
"fulltext_pdf": {
"selector": "//meta[@name='citation_pdf_url']",
"attribute": "content",
"download": true
}
}
}
```
- Title, author list, date
- Figures and their captions
- Fulltext PDF, HTML, XML, RDF
- Supplementary materials
- Reference lists

### Scraper collection status

All the scrapers in the collection are tested automatically every day and every time any scraper is changed. Expected results are stored for a set of URLs; each test run randomly selects one of those URLs, re-scrapes it, and passes if the results match the stored ones. If the badge is green and says `build|passing`, all the scrapers are OK. If the badge is red and says `build|failing`, one or more of the scrapers has stopped working. Click the badge to open the test report and see which scrapers are failing and how.

[![Build Status](http://img.shields.io/travis/ContentMine/journal-scrapers.svg)][travis]
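
The test strategy described above can be sketched in a few lines. This is only an illustration in Python, not the repository's Ruby test runner (`test/test_all.rb`); the function name `run_regression_test`, the `example.org` URLs, and the `fake_scrape` stand-in are all hypothetical.

```python
import random


def run_regression_test(expected_by_url, scrape, rng=random):
    """Re-scrape one randomly chosen URL and compare with the stored result."""
    url = rng.choice(sorted(expected_by_url))
    actual = scrape(url)
    return url, actual == expected_by_url[url]


# Hypothetical stored expectations; real ones would come from earlier scrapes.
expected = {
    "https://example.org/articles/1": {"title": ["First article"]},
    "https://example.org/articles/2": {"title": ["Second article"]},
}


def fake_scrape(url):
    # Stand-in for a real scraper run: returns the stored result unchanged.
    return expected[url]


url, passed = run_regression_test(expected, fake_scrape, random.Random(0))
print(url, "passed" if passed else "failed")
```

Picking a single random URL per run keeps each build short while still exercising every stored URL over repeated daily runs.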

Test coverage of the scrapers is also checked. Coverage should be 100%, meaning every element of every scraper is checked at least once during testing. If coverage is below 100%, you can see exactly which parts of which scrapers are not covered by clicking the `coverage` badge below.

[![Coverage](http://img.shields.io/coveralls/ContentMine/journal-scrapers.svg)][coverage]
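
As a rough illustration of what element-level coverage means here, the sketch below counts which (scraper, element) pairs appear in at least one test. It is an assumption about how such a figure could be computed, not the project's actual coverage script; `element_coverage` and the sample data are hypothetical.

```python
def element_coverage(scrapers, tested):
    """Return (fraction covered, sorted list of untested (scraper, element) pairs)."""
    all_pairs = {
        (name, element)
        for name, definition in scrapers.items()
        for element in definition["elements"]
    }
    missing = sorted(all_pairs - tested)
    covered = len(all_pairs) - len(missing)
    fraction = covered / len(all_pairs) if all_pairs else 1.0
    return fraction, missing


# Hypothetical, trimmed-down scraper definitions.
scrapers = {
    "elife": {"elements": {"title": {}, "doi": {}}},
    "mdpi": {"elements": {"title": {}, "figure": {}}},
}
# Pairs that at least one stored test result exercises.
tested = {("elife", "title"), ("elife", "doi"), ("mdpi", "title")}

fraction, missing = element_coverage(scrapers, tested)
print(f"{fraction:.0%} covered; untested: {missing}")
```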

### ScraperJSON definitions

Scrapers are defined in JSON, using a schema called scraperJSON, which is still evolving. The current schema is described at [the scraperJSON repo](https://github.com/ContentMine/scraperJSON).
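
To make the shape of a definition concrete, here is a minimal Python sketch of how one might be applied to a page. It is not how quickscrape works internally. It assumes the `url` key is a regular expression matched against the article URL, that `selector` is an XPath, and that a missing `attribute` means the element's text; the page content, DOI, and `apply_definition` helper are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

definition = {
    "url": "elifesciences\\.org",
    "elements": {
        "title": {"selector": "//meta[@name='citation_title']", "attribute": "content"},
        "doi": {"selector": "//meta[@name='citation_doi']", "attribute": "content"},
    },
}

# Hypothetical, well-formed page standing in for a fetched article.
PAGE = """<html><head>
<meta name="citation_title" content="An example article"/>
<meta name="citation_doi" content="10.0000/example"/>
</head><body/></html>"""


def apply_definition(definition, url, page):
    """Return {element name: [values]} if the URL matches, else None."""
    if not re.search(definition["url"], url):
        return None
    root = ET.fromstring(page)
    results = {}
    for name, spec in definition["elements"].items():
        # ElementTree needs a relative path, so "//x" becomes ".//x".
        nodes = root.findall("." + spec["selector"])
        attribute = spec.get("attribute", "text")  # assumed default
        if attribute == "text":
            results[name] = ["".join(node.itertext()).strip() for node in nodes]
        else:
            results[name] = [node.get(attribute) for node in nodes]
    return results


print(apply_definition(definition, "http://elifesciences.org/content/3/e00001", PAGE))
```

The standard-library parser only handles well-formed markup and a subset of XPath, so real pages would need an HTML-aware parser.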

### Contributing scrapers

If your favourite publisher or journal is not covered by a scraper in our collection, we'd love you to submit a new scraper.

We ask that all contributions follow some simple rules that help us maintain a high-quality collection.

1. The scraper covers all [the data elements used in the ContentMine](https://github.com/ContentMine/journal-scrapers/wiki/data_collected_for_ContentMine).
2. It comes with a set of 5-10 test URLs.
3. It comes with a regression test ([which can be auto-generated](https://github.com/ContentMine/journal-scrapers/wiki/Generating%20tests%20for%20your%20scrapers)).
4. You agree to release the scraper definition and tests under the [Creative Commons Zero license](https://creativecommons.org/publicdomain/zero/1.0/).
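
The first two rules lend themselves to a quick self-check before submitting. The Python sketch below is hypothetical: the required field names are copied from `scrapers/elife.json` in this pull request and are only assumed to match the list on the wiki, and `check_contribution` is not part of the project.

```python
# Assumed field names, copied from scrapers/elife.json in this pull request.
# The authoritative list lives on the project wiki and may differ.
REQUIRED_ELEMENTS = {
    "publisher", "journal", "title", "authors", "date", "doi", "volume",
    "issue", "firstpage", "description", "abstract", "fulltext_html",
    "fulltext_pdf", "fulltext_xml", "supplementary_material", "figure",
    "figure_caption", "license", "copyright",
}


def check_contribution(definition, test_urls):
    """Return a list of human-readable problems; empty means the checks pass."""
    problems = []
    missing = sorted(REQUIRED_ELEMENTS - set(definition.get("elements", {})))
    if missing:
        problems.append("missing elements: " + ", ".join(missing))
    if not 5 <= len(test_urls) <= 10:
        problems.append(f"expected 5-10 test URLs, got {len(test_urls)}")
    return problems


# Hypothetical incomplete submission: two elements and a single test URL.
partial = {"url": "example\\.org", "elements": {"title": {}, "doi": {}}}
for problem in check_contribution(partial, ["http://example.org/1"]):
    print(problem)
```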

### Usage

Currently these definitions can be used with the [quickscrape](http://github.com/ContentMine/quickscrape) tool.

### License

All scrapers are released under the [Creative Commons 0 (CC0)](https://creativecommons.org/publicdomain/zero/1.0/) license.
92 changes: 92 additions & 0 deletions scrapers/elife.json
@@ -0,0 +1,92 @@
{
"url": "elifesciences\\.org",
"elements": {
"publisher": {
"selector": "//meta[@name='citation_publisher']",
"attribute": "content"
},
"journal": {
"selector": "//meta[@name='citation_journal_title']",
"attribute": "content"
},
"title": {
"selector": "//meta[@name='citation_title']",
"attribute": "content"
},
"authors": {
"selector": "//meta[@name='citation_author']",
"attribute": "content"
},
"date": {
"selector": "//meta[@name='citation_date']",
"attribute": "content"
},
"doi": {
"selector": "//meta[@name='citation_doi']",
"attribute": "content"
},
"volume": {
"selector": "//meta[@name='citation_volume']",
"attribute": "content"
},
"issue": {
"selector": "//meta[@name='citation_issue']",
"attribute": "content"
},
"firstpage": {
"selector": "//meta[@name='citation_firstpage']",
"attribute": "content"
},
"description": {
"selector": "//meta[@name='description']",
"attribute": "content"
},
"abstract": {
"selector": "//div[contains(@class, 'abstract')]//p[1]",
"attribute": "text"
},
"fulltext_html": {
"selector": "//meta[@name='citation_fulltext_html_url']",
"attribute": "content",
"download": {
"rename": "fulltext.html"
}
},
"fulltext_pdf": {
"selector": "//meta[@name='citation_pdf_url']",
"attribute": "content",
"download": {
"rename": "fulltext.pdf"
}
},
"fulltext_xml": {
"selector": "//meta[@name='citation_xml_url']",
"attribute": "content",
"download": {
"rename": "fulltext.xml"
}
},
"supplementary_material": {
"selector": "//a[contains(concat(' ', normalize-space(@class), ' '), ' article-supporting-download ')]",
"attribute": "href",
"download": true
},
"figure": {
"selector": "//div[contains(concat(' ', normalize-space(@class), ' '), ' elife-figure-link-download ')]/a",
"attribute": "href",
"download": true
},
"figure_caption": {
"selector": "//div[contains(@class, 'fig-caption')]",
"attribute": "text"
},
"license": {
"selector": "//meta[@name='DC.Rights']",
"attribute": "content"
},
"copyright": {
"selector": "//meta[@name='DC.Rights']",
"attribute": "content"
}
}
}
10 changes: 7 additions & 3 deletions generic_open.json → scrapers/generic_open.json
@@ -4,12 +4,16 @@
"fulltext_pdf": {
"selector": "//meta[@name='citation_pdf_url']",
"attribute": "content",
"download": true
"download": {
"rename": "fulltext.pdf"
}
},
"fulltext_html": {
"selector": "//meta[@name='citation_fulltext_html_url']",
"attribute": "content",
"download": true
"download": {
"rename": "fulltext.html"
}
},
"title": {
"selector": "//meta[@name='citation_title']",
@@ -44,4 +48,4 @@
"attribute": "content"
}
}
}
}
91 changes: 91 additions & 0 deletions scrapers/mdpi.json
@@ -0,0 +1,91 @@
{
"url": "mdpi\\.com",
"elements": {
"publisher": {
"selector": "//meta[@name='citation_publisher']",
"attribute": "content"
},
"journal": {
"selector": "//meta[@name='citation_journal_title']",
"attribute": "content"
},
"title": {
"selector": "//meta[@name='citation_title']",
"attribute": "content"
},
"authors": {
"selector": "//meta[@name='citation_author']",
"attribute": "content"
},
"date": {
"selector": "//meta[@name='citation_date']",
"attribute": "content"
},
"doi": {
"selector": "//meta[@name='citation_doi']",
"attribute": "content"
},
"volume": {
"selector": "//meta[@name='citation_volume']",
"attribute": "content"
},
"issue": {
"selector": "//meta[@name='citation_issue']",
"attribute": "content"
},
"firstpage": {
"selector": "//meta[@name='citation_firstpage']",
"attribute": "content"
},
"description": {
"selector": "//meta[@name='description']",
"attribute": "content"
},
"abstract": {
"selector": "//meta[@name='description']",
"attribute": "content"
},
"fulltext_html": {
"selector": "//meta[@name='citation_fulltext_html_url']",
"attribute": "content",
"download": {
"rename": "fulltext.html"
}
},
"fulltext_pdf": {
"selector": "//meta[@name='citation_pdf_url']",
"attribute": "content",
"download": {
"rename": "fulltext.pdf"
}
},
"fulltext_xml": {
"selector": "//meta[@name='fulltext_xml']",
"attribute": "content",
"download": {
"rename": "fulltext.xml"
}
},
"supplementary_material": {
"selector": "//a[contains(concat(' ', normalize-space(@class), ' '), ' article-supporting-download ')]",
"attribute": "href",
"download": true
},
"figure": {
"selector": "//div[contains(@id, 'fig')]/div/img",
"attribute": "src",
"download": true
},
"figure_caption": {
"selector": "//div[contains(@class, 'html-fig_description')]"
},
"license": {
"selector": "//div[contains(concat(' ', normalize-space(@class), ' '), ' license-p ')]",
"attribute": "text"
},
"copyright": {
"selector": "//div[contains(@class, 'copyright')]",
"attribute": "text"
}
}
}