Because Human is perfectible and error-prone, because Science should be open and flow and because cogito ergo sum.
This tool intends to scraping google scholar, scopus, etc. in order to aid the research process for building up a survey of a topic. We are not the first ones to attempt this idea. And one of the problems might be that some policies might be violated. I’m not doing apology of evil miss-usage of the search engines. However, research and reading time is scarce enough to be waisted not making use of our best skills and tools.
This project attempts to reduce the amount of papers to be read for any given topic by scraping the website, collecting meta information of the papers and allowing for a better choice on which papers spent your time on. Take a look at this tools:
- Scholar.py allows for querying google scholar. I use it to directly import the BibTex of a reference directly from my editor.
- gscholar takes a pdf file and finds its entry from google shcolar.
- This tools might come handy to build up graphs of the references based on some criteria. (see the comment of its author)
I think this project was doing what I had in mind, but I didn’t manage to make it work, yet. I’ve some issues with mongo and starting the server.
https://github.com/argaen/citations_crawler https://github.com/McKenzieKohn/scopus_spider https://github.com/giamo/scopus-search-api-client http://dev.elsevier.com/blog.html
This project has been structured using this [rr-init repository] as a template. [rr-init repository]: https://github.com/massich/rr-init
project
|- doc/ # documentation for the study
| |- paper/ # manuscript(s), whether generated or not
| +- source/ # sphinx source
|
|- data # raw and primary data, are not changed once created
| |- raw/ # raw data, will not be altered
| +- clean/ # cleaned data, will not be altered once created
|
|- results # all output from workflows and analyses
| |- figures/ # graphs, likely designated for manuscript figures
| +- pictures/ # diagrams, images, and other non-graph graphics
|
|- scratch/ # temporary files that can be safely deleted or lost
|- src/ # any programmatic code
|
|- third-party/ # GIT submodules pointing to needed libraries. The
| # libaries are placed in this manner so that they can
| # be installed from here if they are not present at
| # PyPI (because they are experimental, they are some
| # snippets of other code etc..)
|
|- requirements.txt # list of the required packages (see virtualenv)
|
|- LICENSE.md
|- README.md # the top level description of content
- move libraries as third-party
- Journal info collection uses BeautifulSoup 4.4 while scholar.py 3.2