This repository has been archived by the owner on Apr 16, 2020. It is now read-only.

BASE #3

Open
davidar opened this issue Aug 25, 2015 · 52 comments

Comments

@davidar
Collaborator

davidar commented Aug 25, 2015

https://base-search.net

@davidar davidar self-assigned this Aug 25, 2015
@pietsch

pietsch commented Sep 10, 2015

The first trial delivery of a data dump has arrived in this directory: http://gateway.ipfs.io/ipfs/QmVUMQttFqwFKqu33AZL6gSkv89RFcPBSnT9kxrCDUNisz
I called it “First Million Records”. There are 76 million more metadata records to follow if the BASE team can be convinced that this is a promising approach.

The deal is that if this community succeeds in extracting all the full text links (mainly PDF, but also PostScript, DjVu and perhaps other file formats), the remaining records will be released. The trouble is that in the current metadata, you will often find pointers to an HTML landing page instead of the full text. The task is then to identify the correct full text, discarding links to unrelated (e.g. policy or license) files. I am willing to help in both teams.

License: CC-BY-NC 4.0

@davidar
Collaborator Author

davidar commented Sep 10, 2015

@pietsch Awesome, checking it out now :)

Sounds like a fair deal, will make a start on that when I get the chance.

I'm also around at #ipfs on freenode.net if you wanted to discuss further :)

CC: @jbenet

@jbenet
Contributor

jbenet commented Sep 10, 2015

@pietsch this sounds great! works for us :)

if anyone else wants to assist in this effort, it would be very useful for everyone involved.

@davidar
Collaborator Author

davidar commented Sep 21, 2015

@pietsch I've put together a rough scraper to pull pdf links out of landing pages referenced by the BASE metadata: https://morph.io/davidar/base-data

@pietsch

pietsch commented Sep 21, 2015

That is a very promising starting point, @davidar! Do let me know where I can help.

@davidar
Collaborator Author

davidar commented Sep 21, 2015

@pietsch It would be great if you could do a quick sanity check of the results so far. I know there are some false positives, but we should be able to filter those out later. Also, if you have any ideas about better ways to identify full-text links, that would be great --- I initially tried to use the meta tags, but unfortunately that only worked in a minority of cases, so I've just fallen back to extracting any links containing the string 'pdf'.
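
For concreteness, that fallback heuristic amounts to roughly the following (the landing-page URL is just a placeholder; assumes curl and grep are available):

# fetch a landing page and list hrefs whose target contains "pdf" (case-insensitive)
$ curl -sL 'http://example-repository.org/handle/123456789/42' \
    | grep -Eio 'href="[^"]*pdf[^"]*"' \
    | sed -e 's/^href="//' -e 's/"$//' \
    | sort -u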

@davidar
Collaborator Author

davidar commented Sep 23, 2015

@pietsch There are about 3.6k records there now. I'll need to contact @openaustralia (cc @mlandauer @henare) about raising the time limit so we can process the whole thing, once you're happy to go ahead :)

@pietsch

pietsch commented Sep 23, 2015

Hi @davidar, let me warn you that there is bad weather in Bielefeld, Germany. Literally. Almost always. So don't be surprised about a dose of negativity:

a) Frankly, I do not quite see what you need morph.io for.

b) The PDF identification strategy will have to become smarter. For instance, the PDF download links in our institutional repository do not contain the string "pdf" at all, for a reason I forget. What should work is doing a HEAD request on all links and evaluating the response MIME type. If that turns out to be unreliable, file type sniffing as in file(1) is still an option. Then you know the file type. But you don't want all PDF files from all landing pages. Some repositories put rather unrelated stuff there apart from the actual document.
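
A rough sketch of both checks, with a placeholder URL and assuming curl and file(1) are installed:

# ask the server for headers only (following redirects) and read the announced type
$ curl -sIL 'http://example-repository.org/bitstream/42/document' \
    | grep -i '^content-type:'

# if the announced type is unreliable, download and sniff the magic bytes instead
$ curl -sL -o candidate.bin 'http://example-repository.org/bitstream/42/document'
$ file --mime-type candidate.bin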

@davidar
Collaborator Author

davidar commented Sep 24, 2015

Hi @davidar, let me warn you that there is bad weather in Bielefeld, Germany. Literally. Almost always.

@pietsch Not to rub salt in the wound, but it's a lovely day here today ;)

a) Frankly, I do not quite see what you need morph.io for.

That's true, just an old habit from scraperwiki I guess. It probably would make more sense to run it on one of the storage nodes.

b) The PDF identification strategy will have to become smarter.

I was afraid you'd say that ;). I'm starting to think it would make sense to build an IPFS crawler that can pull all these pages and their links into IPFS first, which we can then use to experiment with different ways of identifying fulltext links (cf ipfs/infra#92).

What should work is doing a HEAD request on all links and evaluating the response MIME type. If that turns out to be unreliable, file type sniffing as in file (1) is still an option.

The problem with that is it won't work for archives that don't allow bots to download PDFs (and instead redirect either to an error page or back to the original landing page). But I agree it would be helpful to pick up files the simple heuristic misses.
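
One way to spot that failure mode: curl can report both the final content type and the URL it actually ended up at, so a "PDF" link that bounces back to an HTML landing page is easy to flag (placeholder URL below):

# a blocked download typically comes back as text/html with a final URL
# different from the one we requested
$ curl -sL -o /dev/null \
    -w 'status=%{http_code} type=%{content_type} final=%{url_effective}\n' \
    'http://example-repository.org/download/42.pdf'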

Some repositories put rather unrelated stuff there apart from the actual document.

Yeah, I noticed that (wtf?). With any luck those types of links will follow a pattern, so we can filter them out afterwards.

@jbenet
Contributor

jbenet commented Sep 25, 2015

@davidar that's cool, you linked to a webcam. so your "today" claim now means "forever" :D -- beautiful webcam, btw. i miss how cool webcams were in 1995.

@davidar maybe more people can help clean things up if you leverage us? like maybe post here or in another issue what you're running up against, what sort of data, etc. lower barrier for us to take a look and make suggestions / scripts?

@davidar
Collaborator Author

davidar commented Sep 26, 2015

that's cool, you linked to a webcam. so your "today" claim now means "forever" :D -- beautiful webcam, btw.

Every "today" is a beautiful day ;)

i miss how cool webcams were in 1995.

@jbenet shows his age... :p

maybe more people can help clean things up if you leverage us? like maybe post here or in another issue what you're running up against, what sort of data, etc. lower barrier for us to take a look and make suggestions / scripts?

Yes, more help would be fantastic. It's been a number of years since I did much in the way of web scraping, so I'm a little rusty :). I've extracted relevant URLs from the BASE metadata, along with a very basic script for finding PDF links, here: https://github.com/davidar/base-data

@jbenet I think what we need to do now is:

  • download all URLs in urls.txt, as well as every URL directly linked from them, and upload them to IPFS (see the wget sketch after this list). I can do this myself, but it might take a while since I need to be careful with rate limiting, so anyone willing to donate an IP or bandwidth to this would make things go faster
  • experiment with techniques to accurately identify which of the linked URLs correspond to fulltext articles, using a combination of:
    • analysing the downloaded files
    • analysing the links themselves (necessary for archives that disallow robots from downloading fulltext)
  • identify false positives (e.g. PDFs that only contain metadata information but no fulltext), and filter out anything matching a similar pattern
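
For the first step, something along these lines could serve as a starting point; the two-second delay and the depth limit are assumptions to keep the crawl polite, not requirements:

# fetch every landing page in urls.txt plus everything it links to directly,
# pausing between requests so individual repositories aren't hammered
$ wget --input-file=urls.txt \
    --recursive --level=1 --span-hosts \
    --wait=2 --random-wait \
    --tries=2 --timeout=30 \
    --directory-prefix=crawl/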

CC: @ikreymer my new web archiving expert :)

@davidar davidar mentioned this issue Sep 29, 2015
@davidar
Collaborator Author

davidar commented Oct 17, 2015

@pietsch Sorry that this isn't moving very quickly, I've been caught up with TeX.js recently (which will soon be applied to the arxiv corpus). I hope this deal doesn't have a deadline attached to it? :)

I've started mirroring landing pages and their direct links, so we should have some concrete data to work with soon.

@pietsch

pietsch commented Oct 17, 2015

@davidar No worries, you have not missed any deadlines here. I do not think we have any. Btw: TeX.js looks great!

@jbenet
Contributor

jbenet commented Oct 18, 2015

@davidar somehow i missed you made TeX.js. it's awesome, great work! :)

@davidar
Collaborator Author

davidar commented Oct 19, 2015

sigh Apparently I keep hitting ubuntu/wget#1002870, so either I need to wait until that gets fixed, or roll my own crawler...

@pietsch

pietsch commented Oct 19, 2015

Amazed to find such a fat bug in Ubuntu Trusty's version of wget. Do not despair! Building wget from source is not too painful. (Or switch to sweet Debian Jessie.)

I did the compiling thing a few years ago because I needed a version with WARC support. WARC archives store HTTP headers and time-stamps in addition to the usual payload. You might want to use them for archiving in IPFS.
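
For reference, WARC output in wget only needs a couple of extra flags (the file names here are arbitrary):

# write the crawl into compressed WARC files, preserving headers and timestamps
# alongside the payload, which is handy for later ingestion into IPFS
$ wget --input-file=urls.txt --recursive --level=1 \
    --warc-file=base-landing-pages --warc-cdx \
    --wait=2 --random-wait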

@whyrusleeping
Contributor

@davidar
Collaborator Author

davidar commented Oct 20, 2015

@pietsch @whyrusleeping If it were up to me, the storage nodes would be running plain Debian ;)

/me passes the buck to @lgierth

I guess compiling from source is an option (is that before or after I install Gentoo? :p ), though I might try the package provided by Nix first

You might want to use them for archiving in IPFS.

@pietsch See #28

@cryptix

cryptix commented Oct 28, 2015

@davidar have you tried httrack? If not, don't try the cli, it's horrible. Use the web interface.

@harlantwood

Hm, @ikreymer is building WARC records with his tool in #28 and saving to IPFS... Maybe use that?

@cryptix

cryptix commented Oct 28, 2015

oh wow, of course! There is way too much awesome stuff going on here... ;)

@davidar
Collaborator Author

davidar commented Nov 12, 2015

Hrm, the most recent version of wget (from Nix) segfaults too, so it seems like it might be a bug in wget itself :/

@davidar
Collaborator Author

davidar commented Nov 19, 2015

Running the crawl again, hopefully with better tolerance of crashes this time.

@pietsch

pietsch commented Nov 19, 2015

The weekly BASE meeting will discuss a full OAI-DC data release tomorrow.

@pietsch

pietsch commented Nov 23, 2015

The file size is 23 GB. Unless you stop me, I will add this TAR file in one piece while you sleep.

@pietsch

pietsch commented Nov 23, 2015

This is what I did:

$ ipfs add -p -r for_ipfs/
added QmUJvY7e4mBSEBqfvRbz4YhTBf9z2kx4Adz6ke9UqMF9G1 for_ipfs/README.markdown
added QmcH3PYRNt5dKbC7YcYjq2MMt9G7xeoSiaBXmJwxVGgtt6 for_ipfs/oai_dc-dump-2015-07-06.tar
added QmctbbiEcEapEcY2hGZ4puchfu2chtziMcuZHQHVM7zuds for_ipfs

I am sorry the dump dates from July, so it contains less than 80M records – more like 77M.

@jbenet
Contributor

jbenet commented Nov 26, 2015

@pietsch this is fantastic. we'll help replicate it.

Our dump currently consists of gzipped XML files containing 1000 records each, wrapped in one huge TAR file. Is this convenient?

we have 2nd class tar support right now (will be even tighter integration later). sorry to make you add again, but if you add things with ipfs tar then it will add each tar header/data segment as an ipfs object, thus deduping everything nicely within the tar. See this example-- it is the result of adding this directory
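
Roughly, the workflow would be the following (subcommand names as in current go-ipfs; exact flags and output may differ between versions):

# import the tar so each header/data segment becomes its own ipfs object
$ ipfs tar add oai_dc-dump-2015-07-06.tar

# reassemble the original tar from the resulting root hash on any other node
$ ipfs tar cat <root-hash> > oai_dc-dump-2015-07-06.tar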

afaik, ipfs tar doesn't have a progress bar. (should add one).

cc @whyrusleeping take a look at the dedup graphviz graph-- it seems to not be deduping as much as i thought it would?

@davidar
Collaborator Author

davidar commented Nov 26, 2015

@pietsch Awesome! I'm having trouble accessing that hash though :(

@jbenet How much dedup can we actually expect here though?

@pietsch

pietsch commented Nov 26, 2015

@davidar I just deleted the two IPFS nodes I was running here in Bielefeld, making a fresh start.

@pietsch

pietsch commented Nov 26, 2015

Ouch, ipfs tar add seems to require way more RAM than plain old ipfs add:

ipfs tar add oai_dc-dump-2015-07-06.tar
fatal error: runtime: out of memory

runtime stack:
runtime.throw(0xf1b4c0, 0x16)
        /usr/local/go/src/runtime/panic.go:527 +0x90
runtime.sysMap(0xca21500000, 0x200000000, 0x441600, 0x13bfff8)
        /usr/local/go/src/runtime/mem_linux.go:143 +0x9b
runtime.mHeap_SysAlloc(0x139f960, 0x200000000, 0x0)
        /usr/local/go/src/runtime/malloc.go:423 +0x160
runtime.mHeap_Grow(0x139f960, 0x100000, 0x0)
        /usr/local/go/src/runtime/mheap.go:628 +0x63
runtime.mHeap_AllocSpanLocked(0x139f960, 0x100000, 0x7ff100000000)
        /usr/local/go/src/runtime/mheap.go:532 +0x5f1
runtime.mHeap_Alloc_m(0x139f960, 0x100000, 0xffffff0100000000, 0x7ff1d6ffcdd0)
        /usr/local/go/src/runtime/mheap.go:425 +0x1ac
runtime.mHeap_Alloc.func1()
        /usr/local/go/src/runtime/mheap.go:484 +0x41
runtime.systemstack(0x7ff1d6ffcde8)
        /usr/local/go/src/runtime/asm_amd64.s:278 +0xab
runtime.mHeap_Alloc(0x139f960, 0x100000, 0x10100000000, 0xc82012c300)
        /usr/local/go/src/runtime/mheap.go:485 +0x63
runtime.largeAlloc(0x1fffffe00, 0x1, 0x0)
        /usr/local/go/src/runtime/malloc.go:745 +0xb3
runtime.mallocgc.func3()
        /usr/local/go/src/runtime/malloc.go:634 +0x33
runtime.systemstack(0xc82001c000)
        /usr/local/go/src/runtime/asm_amd64.s:262 +0x79
runtime.mstart()
...

This happened with yesterday's ipfs 0.3.10-dev on a Xen VM with 8 GB RAM that had no problem adding this file when I did ipfs add. Any idea?

@davidar
Collaborator Author

davidar commented Nov 26, 2015

@pietsch yeah, the perf issues are a real pita

Paging Dr Sleeping, Dr @whyrusleeping

@pietsch

pietsch commented Nov 26, 2015

I could not get ipfs tar add … working, so I did ipfs add oai_dc-dump-2015-07-06.tar, resulting in:
QmeRuyEnJkLaXZpRGNMBYat6YA3U6QCmiFvLGh5Z9nyDzj

@davidar
Collaborator Author

davidar commented Nov 27, 2015

@pietsch cool, pinning now :)

@pietsch

pietsch commented Nov 27, 2015

Have you managed to get the data? When I first added the entire directory, ipfs cat on another computer on campus worked. Now I get this error message immediately:

$ ipfs cat QmeRuyEnJkLaXZpRGNMBYat6YA3U6QCmiFvLGh5Z9nyDzj > oai_dc-dump-2015-07-06.tar 
Error: Maximum storage limit exceeded. Maybe unpin some files?

@davidar
Collaborator Author

davidar commented Dec 1, 2015

@pietsch Yes, I've mirrored it to one of our storage nodes :)

@wetneb

wetneb commented Dec 1, 2015

So, in OpenJournal/central#8 I mentioned using existing scrapers (zotero/translators) to retrieve metadata from HTML pages. These scrapers often return a link to the full text (without downloading it).

I'm not sure how this would work for URLs extracted from BASE. I suspect BASE covers a lot of small repositories that are not supported by Zotero. Many of them are probably instances of generic repository software such as DSpace, EPrints, Fedora or OJS, but Zotero selects scrapers based on regexes run on the URL, so they are unlikely to trigger on many different domains. I will do a few experiments and report the results here.
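
One domain-independent signal that might help here is the generator meta tag these platforms usually emit (placeholder URL; assumes curl and grep):

# DSpace, EPrints and OJS typically announce themselves in a generator meta tag,
# which gives a way to pick a scraping strategy regardless of the domain
$ curl -sL 'http://example-repository.org/article/view/42' \
    | grep -Eio '<meta[^>]+name="generator"[^>]*>'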

@davidar
Collaborator Author

davidar commented Dec 1, 2015

@wetneb awesome, thanks :)

@pietsch Have crawled 162GB worth of HTML/PDFs so far

@jbenet
Contributor

jbenet commented Dec 1, 2015

@pietsch curious what failed in ipfs tar add? (it appears to lack a progress bar or verbose progress output, so it may just hang for a long time until a root hash materializes. yes, this ux is painful, we'll fix it.) cc @whyrusleeping

@pietsch

pietsch commented Dec 1, 2015

@jbenet Have you noticed that I pasted the error message and the first third of the stack dump above?

@jbenet
Contributor

jbenet commented Dec 2, 2015

@pietsch ahhh thanks hadn't noticed.

@wetneb

wetneb commented Dec 3, 2015

@davidar @pietsch Zotero actually looks quite suitable for extracting PDF URLs, even for small sites. For instance, it successfully scrapes instances of Open Journal Systems hosted on any domain.

@davidar
Collaborator Author

davidar commented Dec 3, 2015

@wetneb cool, that's good to know

Well, I've officially exhausted the remaining disk space on one of the storage nodes by crawling this stuff, so time to start analysing it I guess :)

@pietsch

pietsch commented Feb 17, 2016

Finally, a fresh BASE dump in OAI-DC metadata format is available. It is a directory of 88,104 xz-compressed XML files of 1,000 records each: QmbdLBA51HsQ9PpcED1epXAxLfHgrd2PDZ3ktmjhFTjg94
In total, these files contain 88,103,128 OAI-DC records corresponding to publications indexed by BASE.
The total compressed size is almost 19 GB.
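
A quick sanity check on the receiving side could look like this (the chunk file name and the record element name are assumptions about the layout):

# fetch the directory and confirm the file count
$ ipfs get QmbdLBA51HsQ9PpcED1epXAxLfHgrd2PDZ3ktmjhFTjg94 -o base-dump/
$ ls -1 base-dump/ | wc -l

# stream-decompress one chunk and count its records without unpacking everything
$ xz -dc base-dump/part-00000.xml.xz | grep -Eo '<record[ >]' | wc -l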

@davidar
Collaborator Author

davidar commented Feb 20, 2016

@pietsch Awesome! Mirroring now.

@davidar
Collaborator Author

davidar commented Feb 29, 2016

@pietsch Is your ipfs node still online? I've partially mirrored the dump, but it seems to be stuck partway through.

@pietsch

pietsch commented Feb 29, 2016

@davidar The ipfs daemon is still running. Here are some messages it printed multiple times:

context deadline exceeded swarm_listen.go:129
EOF swarm_listen.go:129
read tcp4 129.70.xx.xx:4001->119.230.xx.xx:4001: read: connection reset by peer swarm_listen.go:129

@davidar
Collaborator Author

davidar commented Feb 29, 2016

@pietsch could you try restarting the daemon?

@pietsch

pietsch commented Feb 29, 2016

@davidar Done.

@davidar
Collaborator Author

davidar commented Feb 29, 2016

@pietsch thanks, seems to be moving again :)

@pietsch

pietsch commented Mar 7, 2016

@davidar Can you confirm that all files have arrived on your side? It should look like this:

ls -1 | wc -l
88105
