PoS spider: use OAI-PMH spider #265

Open
wants to merge 1 commit into
base: master
from


vbalbp
Contributor

@vbalbp vbalbp commented Feb 26, 2019

Signed-off-by: Victor Balbuena vbalbp@gmail.com

@vbalbp vbalbp requested a review from michamos February 26, 2019 15:55
@vbalbp vbalbp force-pushed the adapt_pos_for_oaipmh branch 8 times, most recently from da00961 to ea0ace2 Compare February 27, 2019 16:20
Signed-off-by: Victor Balbuena <vbalbp@gmail.com>
@vbalbp
Contributor Author

vbalbp commented Feb 28, 2019

The spider works fine, in both its normal and single variants. The tests are failing, though, because the new adaptation completely breaks what was there.
Apart from that, the functional tests for CDS and arXiv fail because of the removal of

# Allow duplicate requests
DUPEFILTER_CLASS = "scrapy.dupefilters.BaseDupeFilter"

However, since we harvest the proceedings page as well as the paper, we get the proceedings multiple times in one run: once per record, even when every record belongs to the same proceedings (the usual case when harvesting by sets, since sets are conferences). With that line removed, we get the proceedings record only once.
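
To illustrate the effect, here is a minimal sketch in plain Python. It is not Scrapy's dupefilter code and not the spider from this PR; the function names and URLs are hypothetical. It only models the behaviour described above: every record yields a request for its paper page plus one for the shared proceedings page, and a duplicate filter (as I understand Scrapy's default `RFPDupeFilter` to do, by request fingerprint) drops requests it has already seen, whereas `BaseDupeFilter` lets everything through.

```python
from typing import Iterable, List, Set


def requests_for_set(paper_urls: List[str], proceedings_url: str) -> List[str]:
    """Each record yields its paper page plus the shared proceedings page."""
    requests: List[str] = []
    for paper_url in paper_urls:
        requests.append(paper_url)
        requests.append(proceedings_url)
    return requests


def schedule(urls: Iterable[str], filter_duplicates: bool) -> List[str]:
    """Return the URLs that would actually be fetched, in order.

    filter_duplicates=True models a filtering dupefilter;
    filter_duplicates=False models allowing duplicate requests.
    """
    seen: Set[str] = set()
    fetched: List[str] = []
    for url in urls:
        if filter_duplicates and url in seen:
            continue  # already requested in this run, drop it
        seen.add(url)
        fetched.append(url)
    return fetched


# Hypothetical URLs, for illustration only.
PAPERS = [
    "https://example.org/paper/1",
    "https://example.org/paper/2",
    "https://example.org/paper/3",
]
PROCEEDINGS = "https://example.org/conference/1"

REQUESTS = requests_for_set(PAPERS, PROCEEDINGS)
with_filter = schedule(REQUESTS, filter_duplicates=True)
without_filter = schedule(REQUESTS, filter_duplicates=False)

print(with_filter.count(PROCEEDINGS), without_filter.count(PROCEEDINGS))
```

With three records in the set, the filtered run fetches the proceedings page once and the unfiltered run fetches it three times. If the CDS and arXiv functional tests really do depend on duplicate requests, Scrapy's per-request `dont_filter=True` argument might be one way to opt individual requests out of filtering, but I have not tried that here.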
