Wikipedia #20
In this case, why does the xml -> html have to be done client-side? On the archiver's machine:

```sh
get-dump dump/     # using any of the tools in https://meta.wikimedia.org/wiki/Data_dumps/Download_tools; there is one with rsync
dump2html -r dump/
ipfs add -r dump/  # and ipns it
```

(although yes, it'd be much more convenient to just use pandoc as a universal markup viewer)
That's also a possibility, but more time-consuming and inflexible.
I actually started on this a while ago, but then thought it would be silly for a single person to attempt this and stopped. Now that I see this issue, I think it might not have been such a bad idea: I've been experimenting with a 15 GiB dump of the English Wikipedia (compressed and without images), extracting HTML files using gozim and wget. This gave me a folder full of HTML pages that interlink nicely using relative links.

It took a couple of hours to extract every page reachable from 'Internet' within 2 hops, which amounted to about 1% of the articles in the dump, so it would take at least a week to create HTML pages for the entire dump. And since these HTML files are uncompressed, I'm not sure I have enough disk space available to do the complete dump, but I could repeat my initial trial and make it available in IPFS.

One problem I see with this approach is that the Creative Commons license requires attribution, which is not embedded in the HTML files gozim creates. If it is decided that this way of doing it might not be such a bad idea, it might be possible to alter gozim to embed such license information. Or maybe we can simply put a LICENSE file in the top-most directory.
@DataWraith Just had a look at the gozim demo, looks really cool. In the short term, this does seem like the best option (apologies for my terse reply earlier @rht :). Would it also be possible to do client-side search with something like https://github.com/cebe/js-search ?
If you can give me a script, and an estimate of the storage requirements, I can run this on one of the storage nodes for you :)
Are you sure? I can see:
in the footer of http://scaleway.nobugware.com/zim/A/Wikipedia.html
Definitely. See #25
I'm no JavaScript expert, but I don't see why not. We could pre-compile a search index and store it alongside the static files. However, resource usage on the client may or may not be prohibitively large.
There is no real script. It's literally:
This will crawl everything reachable from 'Internet'. It may be possible to directly crawl the index of pages itself, but I haven't tried that yet. You probably need to wrap
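(The exact commands weren't preserved in this thread, so the following is only a sketch of the gozim + wget workflow described above; the gozimhttpd flag names, port, and ZIM file name are assumptions and may differ from what was actually used.)

```sh
# Serve the ZIM dump locally over HTTP with gozim (flags and paths assumed).
gozimhttpd -path=wikipedia_en_all_nopic.zim -port=8080 &

# Mirror everything reachable within 2 hops of the 'Internet' article,
# rewriting links so the saved pages interlink with relative links.
wget --recursive --level=2 --convert-links --adjust-extension \
     --page-requisites --no-parent \
     http://localhost:8080/zim/A/Internet.html
```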
Hm. Maybe that's because they are using a different dump, or a newer version of gozim (though the latter seems unlikely); the pages I extracted don't have that footer. I'm currently running
For context, this is what @brewsterkahle uses for his IPFS-hosted blog
Yeah, that was my concern too. If so, it might have to wait until #8
Too easy
Ok, we'll have to wait until we get some more storage then.
Thanks. Ping me on http://chat.ipfs.io to help debug.
Short progress update: I'm now feeding files to

I also took another look at gozim. It is relatively easy to extract the HTML files without going through wget first -- should've thought of that before coming up with the quick & dirty dumping program here.
I had no luck getting

There are two related issues describing problems with
@DataWraith Hmm, that's no good 😕. For the moment, could you tar/zip all the files together and add that? CC: @whyrusleeping
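(A minimal sketch of the suggested workaround; the archive name and directory are illustrative.)

```sh
# Bundle the extracted HTML tree into a single archive, then add that one file.
tar -czf wikipedia-html.tar.gz dump/
ipfs add wikipedia-html.tar.gz
```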
Hi. I've decided to delete the trial files obtained using

My initial estimate of space required was off, because the article sample I obtained using

Edit:
@DataWraith Awesome, can't wait to see it :)
@whyrusleeping Please make
For scale (foo/ is 11 MB, 10 files of 1.1 MB each):
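(The numbers that followed weren't captured here; the sketch below only shows how a test set matching that description could be created and timed.)

```sh
# 10 files of ~1.1 MB each, ~11 MB total.
mkdir foo
for i in $(seq 1 10); do
  head -c 1100000 /dev/urandom > foo/file$i
done

time ipfs add -r foo/
```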
It appears that
(git does explicit sync: https://github.com/git/git/blob/master/pack-write.c#L277)
@davidar I get your point, which either means 1. "if someone can put the kernel on the browser, why not pandoc", or 2. "we need to be able to do more than just viewing static simulated pieces of paper" (more of what a "document"/"book" should be). As with the client-side search, it works for small sites, but for huge sites (wikipedia?), transporting the index files to the client seems to be too much.
I wonder if some of the critical operations should be offloaded to an FPGA.
Uh oh, which side of this argument am I on now? #25 @jbenet
The idea is that you'd encode the index as a trie and dump it into IPLD, so the client would only have to download small parts of the index to answer a query.
And this can be repurposed for any 'pre-computed' stuff, not just search indexes? e.g. (content sorted/filtered by paramX, or entire SQL queries https://github.com/ipfs/ipfs/issues/82?)
@rht yes, I would think so, I don't see any reason why it wouldn't be possible to build a SQL database format on top of IPLD (albeit non-trivial)
@rht looks like someone already beat me to it: http://markup.rocks
@davidar by a few months. Very useful to know that it is fast. Also, found this http://git.kernel.org/cgit/git/git.git/tree/Documentation/config.txt#n693:
@whyrusleeping disable fsync by default and add a config flag to enable it? (wanted to close the gap with git, which is still 2 orders of magnitude away).
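(For reference, the git setting pointed at above is presumably core.fsyncObjectFiles, which git left disabled by default at the time; toggling it is one way to see what the explicit sync costs.)

```sh
# git's default: loose object files are not fsync'd.
git config core.fsyncObjectFiles false

# Enable fsync of loose object files to compare timings.
git config core.fsyncObjectFiles true
```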
Yeah, Haskell is high-level enough that it tends to compile to JS reasonably well. The FP Complete IDE is also written in a subset of Haskell.
Something like the ipfs markdown viewer but using pandoc would be cool.
IPFS-hosted version of markup.rocks: https://ipfs.io/ipfs/QmSyfirfxBbgh8sZPzy4yyMQjHgzKX7iQeXG9Zet4VYk9P/
@davidar saw it, neat. i.e. it's a pandoc but without the huge GHC stuff, cabal-install ritual, etc.
But so does python, ruby, ... You mean a sane type system? This has nice things like:
(haven't actually looked at a minimalist typed λ-calculus metacircular evaluator (the one people write (or chant) every day for the untyped ones))
Yeah, I meant of the languages with a strong enough type system to be able to produce optimised code
Also see this simple but awesome wiki editor by @jamescarlyle
@DataWraith Awesome, downloading now :)
@DataWraith And now it's on IPFS 🎈 @whyrusleeping Looking forward to
@davidar Awesome!
@davidar it's very high on my todo list.
This can proceed with ipfs/kubo#1964 + ipfs/kubo#1973 merged (pending @jbenet's CR).
@rht that's awesome :). Are you also testing perf on spinning disks (not just SSDs)? It seems to be the random access latency that really kills perf. Edit: also make sure the test files are created in a random order (not in lexicographical order)
The first reduces the number of operations needed (including disk io), so it will make add on HDD faster. For the second, channel iterators in golang have been reported to be slow (though I'm not sure of their direct impact on disk io), so it should also make add on HDD faster.
on it! (cr)
I'm trying out those pull requests on the Wikipedia dump right now. It has added the articles starting with numbers, and is now working on the articles starting with A, so it'll be a while until the whole dump is processed. |
@DataWraith thanks, good to hear -- btw, dev0.4.0 has many interesting perf upgrades, with flags like
ipfs add is much faster in 0.4; maybe we can revisit this and try to set up a script to constantly update the mirrored version in ipfs
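(A rough sketch of what such an update script might look like; the HTML-extraction step and paths are placeholders for whatever tooling is ultimately chosen, and publishing to IPNS is just one way to keep a stable pointer.)

```sh
#!/bin/sh
set -e

# 1. Fetch the latest dump (skipped if unchanged).
wget -N https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

# 2. Convert it to static HTML with the chosen dumper (placeholder step), e.g.:
#    dump2html -r dump/

# 3. Add the result to IPFS and capture the root hash from the last line of output.
HASH=$(ipfs add -r dump/ | tail -n1 | awk '{print $2}')

# 4. Point an IPNS name at the new root so the mirror keeps a stable address.
ipfs name publish "$HASH"
```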
Instead of working with the massive Wikipedia, I've been playing with the smaller, but still sizable, Wikispecies project. It has 439,460 articles and is about 4.5 GB on disk. I've imported the static HTML dumps from the Kiwix openzim dump files. The dump to disk took less than 10 minutes, and the import into ipfs (with ipfs040 with

It's browsable on my local gateway, but I've not been able to get the site to load on the ipfs public gateways. Can any of you try?
(edit Jan 14th -- after upgrading my nodes to master branch, I stopped running my dev040 node, so this hash is no longer available. Stay tuned for updates)
Same :/
Ok, here is my next iteration on this project: http://v04x.ipfs.io/ipfs/QmV6H1quZ4VwzaaoY1zDxmrZEtXMTN1WLJHpPWY627dYVJ/A/20/8f/Main_Page.html

This is also an IPFS-hosted version of Wikispecies, but with one major change: instead of having every article in one massive folder, each article has been partitioned into sub-folders based on the hash of the filename. For articles, there are two levels of hashing, and for images there is one level of hashing. The goal of this is to reduce the number of links in the

However, there still seem to be some issues. As I browse around the Main_Page.html link (see above), sometimes the page will load quickly and instantly. Other times, images will be missing, the page will load slowly, or maybe even not at all. This is true even for pages that I've visited already (and thus should be in the gateway's cache). I can't really tell what's going on here. Running

Finally, here are the two tools I wrote in the process of working on this:
zim dumping takes a few minutes, wiki_rewriting takes less than an hour, and ipfs add -r probably took a few hours. In all cases, I appear to be disk-io bound.
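(A hypothetical illustration of the sharding scheme described above; the real tools may use a different hash function and prefix lengths, so this only shows the shape of the layout, i.e. paths like A/20/8f/Main_Page.html.)

```sh
# Bucket an article under two directory levels derived from a hash of its filename,
# producing paths of the form A/xx/yy/Article.html.
name="Main_Page.html"
h=$(printf '%s' "$name" | md5sum | cut -c1-4)   # first four hex digits (md5 is an assumption)
d1=$(printf '%s' "$h" | cut -c1-2)
d2=$(printf '%s' "$h" | cut -c3-4)
mkdir -p "A/$d1/$d2"
mv "$name" "A/$d1/$d2/"
```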
@eminence this is great! It also further emphasizes the fact that we need to figure out directory sharding. I'll think on this today and see what I come up with. Keep up the good work :) |
@whyrusleeping note that directory sharding will go on top of IPLD, and that it should work for arbitrary objects (not just unixfs directories). Take a look at the spec; we can use another directive there.
(last updated ~3.5 years ago, but penned ~7 years ago)
@rht yeah I know, but it might still be relevant |
The question is whether we want a static HTML-only version, a dynamic one, or both. In the dynamic case, using a Service Worker, zlib compression with a dictionary, and XML entries stored compressed, one could quickly fetch an article, render it as HTML, and link it in a pre-determined way, with an optional fallback in the Service Worker to the real Wikipedia.

The XML wiki dump compressed with xz in 256k chunks, without a dictionary, equals the size of the bzip2 XML dump, which is 13 GB. Given English text and a pre-made zlib dictionary, I believe one can get to a nice number. As for search, a JS variant for the terms only, with suggestions of top terms, could work well.

Edit: I'm being tempted by zim files. Having each cluster as a (raw) block.

Edit: The only way to get good compression via widespread compression methods seems to be clustering. Compressing per record results in 4-5x the size. Which leads to storage compression.
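(For concreteness, a sketch of the chunked xz compression mentioned above; the file name is a placeholder, and the smaller independent blocks are what allow decompressing a region without unpacking the whole dump.)

```sh
# Compress the XML dump in independent 256 KiB blocks (multi-threaded),
# keeping the original so the result can be compared against the bzip2 dump.
xz -T0 --block-size=256KiB -k enwiki-latest-pages-articles.xml
```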
In terms of being able to view this on the web, I'm tempted to push Pandoc through a Haskell-to-JS compiler like Haste.
CC: @jbenet