web crawler for American Chemical Society journals

jproddy/jacscraper

JACScraper is designed to scrape American Chemical Society (ACS) journals and collect the following publicly available information for each paper:

DOI, title, authors, journal, year, volume, issue, page start, page end, article type, publication date, abstract

The datasets, along with sample cleaning and analysis, are available on Kaggle. Scraping takes a fair amount of time because requests are rate-limited to 1 request/s, so the Kaggle datasets are likely a better source for the data than re-scraping unless a different journal is needed.
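For readers adapting the scraper, a limit of 1 request/s can be enforced with a small helper like the one below. This is only a minimal sketch of the general technique, not the code used in scrape.py; the `RateLimiter` class and its `clock` and `sleep` parameters are hypothetical names introduced for illustration.

```python
import time


class RateLimiter:
    """Keep successive calls to wait() at least `min_interval` seconds apart.

    Hypothetical helper for illustration; not taken from scrape.py.
    """

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the limiter can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous request; None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Too soon since the previous request: pause for the shortfall.
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `limiter.wait()` immediately before each HTTP request would keep the request rate at or below one per second, whatever HTTP library is used.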

The scraper can be run on the command line using:

python scrape.py

The first run creates a default journal_data.json file configured for scraping the six journals listed below. To modify the list, rerun the program with the add-journal or remove-journal argument and follow the prompts. Running the script once more creates a jsons directory, in which the generated files will be saved, and begins the scraping process.

This information will be used to trace various trends in chemical research. Other uses may include generating author connectivity graphs. Unfortunately, there are a handful of limitations. Articles published prior to 1996 (presumably when ACS embraced digitization?) are not accompanied by their abstracts. Many of the older papers have miscategorized article types and are plagued by typos. Finally, some of the entries are not true research articles but instead mastheads, book reviews, software reviews, etc. However, the vast majority of these can be removed via fairly trivial data cleaning techniques.
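The kind of cleaning described above might look like the following. This is a rough sketch only: the key names `article_type` and `abstract`, and the exact type labels in `NON_RESEARCH_TYPES`, are assumptions made for illustration and may not match the generated JSON files, so check them against the actual data first.

```python
# Assumed labels for non-research entries; adjust to the values found in the data.
NON_RESEARCH_TYPES = {"masthead", "book review", "software review"}


def clean_records(records):
    """Drop non-research entries and flag whether each record has an abstract.

    `records` is a list of dicts, one per paper. The key names used here
    ("article_type", "abstract") are hypothetical.
    """
    cleaned = []
    for rec in records:
        # Normalize case and whitespace so minor inconsistencies do not matter.
        article_type = (rec.get("article_type") or "").strip().lower()
        if article_type in NON_RESEARCH_TYPES:
            continue
        out = dict(rec)  # copy so the input records are left unchanged
        # Pre-1996 articles are expected to have no abstract.
        out["has_abstract"] = bool((rec.get("abstract") or "").strip())
        cleaned.append(out)
    return cleaned
```

Flagging missing abstracts rather than dropping those records keeps the pre-1996 papers available for analyses that need only the bibliographic fields.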

JACScraper was used to scrape JACS, Inorganic Chemistry, Organometallics, Biochemistry, JOC, and Chemical Reviews as a representative sample of journals relevant to my former research.
