README: other resources section (#7)
Reference recent Crossref dumps discussed in #5.
Closes #5

Refs CrossRef/rest-api-doc#271
dhimmel authored Feb 7, 2018
1 parent 15ec9f6 commit 1dc4171
Showing 1 changed file with 15 additions and 1 deletion.
16 changes: 15 additions & 1 deletion README.md
@@ -46,7 +46,7 @@ mongoexport \

See [`data/mongo-export`](data/mongo-export) for more information on `crossref-works.json.xz`.
Note that creating this file from the Crossref API takes several weeks.
Users are encouraged to use the cached version available on [figshare](https://doi.org/10.6084/m9.figshare.4816720).
Users are encouraged to use the cached version available on [figshare](https://doi.org/10.6084/m9.figshare.4816720) (see also [Other resources](#other-resources) below).

[`1.works-to-dataframe.ipynb`](1.works-to-dataframe.ipynb) is a Jupyter notebook that extracts tabular datasets of works (TSVs), which are tracked using Git LFS:

@@ -74,6 +74,20 @@ conda env create --file=environment.yml

Then use `source activate crossref` and `source deactivate` to activate or deactivate the environment. On Windows, use `activate crossref` and `deactivate` instead.

## Other resources

Ideally, Crossref would provide a complete database dump, rather than requiring users to go through the inefficient process of API querying all works: see [CrossRef/rest-api-doc#271](https://github.com/CrossRef/rest-api-doc/issues/271).
Until then, users should check out the Crossref data currently hosted by this repository, whose query date is 2017-03-21, and its corresponding [figshare](https://doi.org/10.6084/m9.figshare.4816720.v1) deposit.
For users who need more recent data, Bryan Newbold [used this codebase](https://github.com/greenelab/crossref/issues/5) to create a MongoDB dump dated January 2018 (query date of approximately 2018-01-10), which he uploaded to the [Internet Archive](https://archive.org/download/crossref_doi_dump_201801).
His output file `crossref-works.2018-01-21.json.xz` contains 93,585,242 DOIs and consumes 28.9 GB, compared to 87,542,370 DOIs and 7.0 GB for the `crossref-works.json.xz` dated 2017-03-21.
This increased size is presumably due to the addition of [I4OC](https://i4oc.org/ "Initiative for Open Citations") references to Crossref work records.
This repository is currently seeking contributions to update the convenient TSV outputs based on the January 2018 database dump.
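
Because both dumps are large xz-compressed files, it is usually more practical to stream them than to decompress them to disk. The following is a minimal sketch, assuming the file holds one JSON document per line (the default `mongoexport` output without `--jsonArray`) and that each work stores its identifier under the `DOI` key, as in Crossref REST API records. The function name `count_dois` is hypothetical and not part of this repository.

```python
import json
import lzma


def count_dois(path):
    """Count records that carry a DOI in an xz-compressed JSON-lines file.

    Assumes one JSON document per line; this layout is an assumption
    about the dump, not something verified against the files themselves.
    """
    n_dois = 0
    # lzma.open in text mode decompresses lazily, so the full dump
    # never has to fit in memory or be unpacked on disk.
    with lzma.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            work = json.loads(line)
            if work.get("DOI"):
                n_dois += 1
    return n_dois
```

Calling `count_dois("crossref-works.json.xz")` on a downloaded dump should give a figure comparable to the DOI counts quoted above, provided the assumptions about the file layout hold.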

Daniel Ecer also downloaded the Crossref work metadata in January 2018, using the codebase at [elifesciences/datacapsule-crossref](https://github.com/elifesciences/datacapsule-crossref).
His database dump is available on [figshare](https://doi.org/10.6084/m9.figshare.5845554.v2 "Crossref Works Dump - January 2018").
While the multi-part format of this dump is likely less convenient than the dumps produced by this repository, Daniel Ecer's analysis also exports a DOI-to-DOI table of citations/references [available here](https://doi.org/10.6084/m9.figshare.5849916.v1 "Crossref Citation Links - January 2018").
This citation catalog contains 314,785,303 citations ([summarized here](https://elifesci.org/crossref-data-notebook)) and is thus more comprehensive than the catalog available from [greenelab/opencitations](https://github.com/greenelab/opencitations).

## Acknowledgements

This work is funded in part by the Gordon and Betty Moore Foundation's Data-Driven Discovery Initiative through Grant [GBMF4552](https://www.moore.org/grant-detail?grantId=GBMF4552) to [**@cgreene**](https://github.com/cgreene "Casey Greene on GitHub").
