Wikipedia #20
In this case, why does the xml -> html have to be done client-side? On the archiver's machine:

```sh
get-dump dump/     # using any of the tools in https://meta.wikimedia.org/wiki/Data_dumps/Download_tools; there is one with rsync
dump2html -r dump/
ipfs add -r dump/  # and ipns it
```

(although yes, it'd be much more convenient to just use pandoc as a universal markup viewer)
That's also a possibility, but more time-consuming and inflexible.
I actually started on this a while ago, but then thought it would be silly for a single person to attempt this and stopped. Now that I see this issue, I think it might not have been such a bad idea: I've been experimenting with a 15 GiB dump of the English Wikipedia (compressed and without images), extracting HTML files using gozim and wget. This gave me a folder full of HTML pages that interlink nicely using relative links.

It took a couple of hours to extract every page reachable from 'Internet' within 2 hops, which amounted to about 1% of the articles in the dump, so it would take at least a week to create HTML pages for the entire dump. And since these HTML files are uncompressed, I'm not sure I have enough disk space available to do the complete dump, but I could repeat my initial trial and make it available in IPFS.

One problem I see with this approach is that the Creative Commons license requires attribution, which is not embedded in the HTML files gozim creates. If it is decided that this way of doing it might not be such a bad idea, it might be possible to alter gozim to embed such license information. Or maybe we can simply put a LICENSE file in the top-most directory.
@DataWraith Just had a look at the gozim demo, looks really cool. In the short term, this does seem like the best option (apologies for my terse reply earlier @rht :). Would it also be possible to do client-side search with something like https://github.com/cebe/js-search ?
If you can give me a script, and an estimate of the storage requirements, I can run this on one of the storage nodes for you :)
Are you sure? I can see:
in the footer of http://scaleway.nobugware.com/zim/A/Wikipedia.html
Definitely. See #25
I'm no JavaScript expert, but I don't see why not. We could pre-compile a search index and store it alongside the static files. However, resource usage on the client may or may not be prohibitively large.
There is no real script. It's literally:
This will crawl everything reachable from 'Internet'. It may be possible to directly crawl the index of pages itself, but I haven't tried that yet. You probably need to wrap
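(The exact commands weren't preserved in this thread, so the following is only a sketch of the gozim + wget workflow described above; the gozimhttpd flag names, port, and ZIM file name are assumptions and may differ from what was actually used.)

```sh
# Serve the ZIM dump locally over HTTP with gozim (flags and paths assumed).
gozimhttpd -path=wikipedia_en_all_nopic.zim -port=8080 &

# Mirror everything reachable within 2 hops of the 'Internet' article,
# rewriting links so the saved pages interlink with relative links.
wget --recursive --level=2 --convert-links --adjust-extension \
     --page-requisites --no-parent \
     http://localhost:8080/zim/A/Internet.html
```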
Hm. Maybe that's because they are using a different dump, or a newer version of gozim (though the latter seems unlikely); the pages I extracted don't have that footer. I'm currently running
For context, this is what @brewsterkahle uses for his IPFS-hosted blog
Yeah, that was my concern too. If so, it might have to wait until #8
Too easy
Ok, we'll have to wait until we get some more storage then.
Thanks. Ping me on http://chat.ipfs.io to help debug.
Short progress update: I'm now feeding files to

I also took another look at gozim. It is relatively easy to extract the HTML files without going through wget first -- should've thought of that before coming up with the quick & dirty dumping program here.
I had no luck getting

There are two related issues describing problems with
@DataWraith Hmm, that's no good 😕. For the moment, could you tar/zip all the files together and add that? CC: @whyrusleeping
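(A minimal sketch of the suggested workaround; the archive name and directory are illustrative.)

```sh
# Bundle the extracted HTML tree into a single archive, then add that one file.
tar -czf wikipedia-html.tar.gz dump/
ipfs add wikipedia-html.tar.gz
```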
Hi. I've decided to delete the trial files obtained using

My initial estimate of space required was off, because the article sample I obtained using

Edit:
@DataWraith Awesome, can't wait to see it :)
@whyrusleeping Please make
For scale (foo/ is 11 MB, 10 files of 1.1 MB each):
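(The numbers that followed weren't captured here; the sketch below only shows how a test set matching that description could be created and timed.)

```sh
# 10 files of ~1.1 MB each, ~11 MB total.
mkdir foo
for i in $(seq 1 10); do
  head -c 1100000 /dev/urandom > foo/file$i
done

time ipfs add -r foo/
```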
It appears that
(git does explicit sync: https://github.com/git/git/blob/master/pack-write.c#L277)
@davidar I get your point, which either means 1. "if someone can put the kernel on the browser, why not pandoc", or 2. "we need to be able to do more than just viewing static simulated pieces of paper" (more of what a "document"/"book" should be). As with the client-side search, it works for small sites, but for huge sites (wikipedia?), transporting the index files to the client seems to be too much.
I wonder if some of the critical operations should be offloaded to an FPGA.
Uh oh, which side of this argument am I on now? #25 @jbenet
The idea is that you'd encode the index as a trie and dump it into IPLD, so the client would only have to download small parts of the index to answer a query.
And this can be repurposed for any 'pre-computed' stuff, not just search indexes? e.g. (content sorted/filtered by paramX, or entire SQL queries https://github.com/ipfs/ipfs/issues/82?)
@rht yes, I would think so, I don't see any reason why it wouldn't be possible to build a SQL database format on top of IPLD (albeit non-trivial)
@rht looks like someone already beat me to it: http://markup.rocks
@davidar by a few months. Very useful to know that it is fast. Also, found this http://git.kernel.org/cgit/git/git.git/tree/Documentation/config.txt#n693:
@whyrusleeping disable fsync by default and add a config flag to enable it? (wanted to close the gap with git, which is still 2 orders of magnitude away).
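(For reference, the git setting pointed at above is presumably core.fsyncObjectFiles, which git left disabled by default at the time; toggling it is one way to see what the explicit sync costs.)

```sh
# git's default: loose object files are not fsync'd.
git config core.fsyncObjectFiles false

# Enable fsync of loose object files to compare timings.
git config core.fsyncObjectFiles true
```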
Yeah, Haskell is high-level enough that it tends to compile to JS reasonably well. The FP Complete IDE is also written in a subset of Haskell.
Something like the ipfs markdown viewer but using pandoc would be cool.
IPFS-hosted version of markup.rocks: https://ipfs.io/ipfs/QmSyfirfxBbgh8sZPzy4yyMQjHgzKX7iQeXG9Zet4VYk9P/
@davidar saw it, neat. i.e. it's a pandoc but without the huge GHC stuff, cabal-install ritual, etc.
But so does python, ruby, ... You mean a sane type system? This has nice things like:
(haven't actually looked at a minimalist typed λ-calculus metacircular evaluator (the one people write (or chant) every day for the untyped ones))
Yeah, I meant of the languages with a strong enough type system to be able to produce optimised code
Also see this simple but awesome wiki editor by @jamescarlyle
@DataWraith Awesome, downloading now :)
@DataWraith And now it's on IPFS 🎈 @whyrusleeping Looking forward to
@davidar Awesome!
@davidar it's very high on my todo list.
This can proceed with ipfs/kubo#1964 + ipfs/kubo#1973 merged (pending @jbenet's CR).
@rht that's awesome :). Are you also testing perf on spinning disks (not just SSDs)? It seems to be the random access latency that really kills perf. Edit: also make sure the test files are created in a random order (not in lexicographical order)
The first reduces the number of operations needed (including disk io), so it will make add on HDD faster. For the second, channel iterators in golang have been reported to be slow (though I'm not sure of their direct impact on disk io), so it should also make add on HDD faster.
on it! (cr)
I'm trying out those pull requests on the Wikipedia dump right now. It has added the articles starting with numbers, and is now working on the articles starting with A, so it'll be a while until the whole dump is processed. |
@DataWraith thanks, good to hear -- btw, dev0.4.0 has many interesting perf upgrades, with flags like
ipfs add is much faster in 0.4; maybe we can revisit this and try to set up a script to constantly update the mirrored version in ipfs
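(A rough sketch of what such an update script might look like; the HTML-extraction step and paths are placeholders for whatever tooling is ultimately chosen, and publishing to IPNS is just one way to keep a stable pointer.)

```sh
#!/bin/sh
set -e

# 1. Fetch the latest dump (skipped if unchanged).
wget -N https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

# 2. Convert it to static HTML with the chosen dumper (placeholder step), e.g.:
#    dump2html -r dump/

# 3. Add the result to IPFS and capture the root hash from the last line of output.
HASH=$(ipfs add -r dump/ | tail -n1 | awk '{print $2}')

# 4. Point an IPNS name at the new root so the mirror keeps a stable address.
ipfs name publish "$HASH"
```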
Instead of working with the massive Wikipedia, I've been playing with the smaller, but still sizable, Wikispecies project. It has 439,460 articles and is about 4.5 GB on disk. I've imported the static HTML dumps from the Kiwix openzim dump files. The dump to disk took less than 10 minutes, and the import into ipfs (with ipfs040 with

It's browsable on my local gateway, but I've not been able to get the site to load on the ipfs public gateways. Can any of you try?
(edit Jan 14th -- after upgrading my nodes to master branch, I stopped running my dev040 node, so this hash is no longer available. Stay tuned for updates)
Same :/
Ok, here is my next iteration on this project: http://v04x.ipfs.io/ipfs/QmV6H1quZ4VwzaaoY1zDxmrZEtXMTN1WLJHpPWY627dYVJ/A/20/8f/Main_Page.html

This is also an IPFS-hosted version of Wikispecies, but with one major change: instead of having every article in one massive folder, each article has been partitioned into sub-folders based on the hash of the filename. For articles, there are two levels of hashing, and for images there is one level of hashing. The goal of this is to reduce the number of links in the

However, there still seem to be some issues. As I browse around the Main_Page.html link (see above), sometimes the page will load quickly and instantly. Other times, images will be missing, the page will load slowly, or maybe even not at all. This is true even for pages that I've visited already (and thus should be in the gateway's cache). I can't really tell what's going on here. Running

Finally, here are the two tools I wrote in the process of working on this:
zim dumping takes a few minutes, wiki_rewriting takes less than an hour, and ipfs add -r probably took a few hours. In all cases, I appear to be disk-io bound.
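(A hypothetical illustration of the sharding scheme described above; the real tools may use a different hash function and prefix lengths, so this only shows the shape of the layout, i.e. paths like A/20/8f/Main_Page.html.)

```sh
# Bucket an article under two directory levels derived from a hash of its filename,
# producing paths of the form A/xx/yy/Article.html.
name="Main_Page.html"
h=$(printf '%s' "$name" | md5sum | cut -c1-4)   # first four hex digits (md5 is an assumption)
d1=$(printf '%s' "$h" | cut -c1-2)
d2=$(printf '%s' "$h" | cut -c3-4)
mkdir -p "A/$d1/$d2"
mv "$name" "A/$d1/$d2/"
```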
@eminence this is great! It also further emphasizes the fact that we need to figure out directory sharding. I'll think on this today and see what I come up with. Keep up the good work :) |
@whyrusleeping note that directory sharding will go on top of IPLD, and that it should work for arbitrary objects (not just unixfs directories). Take a look at the spec; we can use another directive there.
(last updated ~3.5 years ago, but penned ~7 years ago)
@rht yeah I know, but it might still be relevant |
The question is whether we want a static HTML-only version, a dynamic one, or both. In the dynamic case, using a Service Worker, zlib compression with a dictionary, and XML entries stored compressed, one could quickly fetch an article, render it as HTML, and link it in a pre-determined way, with an optional fallback in the Service Worker to the real Wikipedia.

The XML wiki dump compressed with xz in 256k chunks, without a dictionary, equals the size of the bzip2 XML dump, which is 13 GB. Given English text and a pre-made zlib dictionary, I believe one can get to a nice number. As for search, a JS variant for the terms only, with suggestions of top terms, could work well.

Edit: I'm being tempted by zim files. Having each cluster as a (raw) block.

Edit: The only way to get good compression via widespread compression methods seems to be clustering. Compressing per record results in 4-5x the size. Which leads to storage compression.
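(For concreteness, a sketch of the chunked xz compression mentioned above; the file name is a placeholder, and the smaller independent blocks are what allow decompressing a region without unpacking the whole dump.)

```sh
# Compress the XML dump in independent 256 KiB blocks (multi-threaded),
# keeping the original so the result can be compared against the bzip2 dump.
xz -T0 --block-size=256KiB -k enwiki-latest-pages-articles.xml
```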
In terms of being able to view this on the web, I'm tempted to push Pandoc through a Haskell-to-JS compiler like Haste.
CC: @jbenet