
Wikidata Viewer #27

Open
hobofan opened this issue Jun 12, 2016 · 8 comments


hobofan commented Jun 12, 2016

Wikidata contains a vast amount of structured, semantic data that can be useful for other IPFS apps. If someone wants to create a close clone of Wikipedia, this would also be a required project, since the backbone of the current Wikipedia is built on Wikidata (related to #17).

For now I have 2 main goals for this project:

  • Make all Wikidata entities available via IPFS
  • Create a read-only interface to view those entities

Progress so far:

  • Downloaded the latest-all.json.bz2 dump from here (found via this page). This file is ~5GB in size.
  • Uncompressed the file to its original size of ~75GB. This was a lot bigger than I expected (I forgot about the crazy compression ratio for JSON) and forced me to restart the process with a different server.
  • Created a script that splits the dump into the individual entities and tested that on a small part of it.
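The splitting step above can be sketched in Rust. This is a simplified stand-in, not the actual wikidata-split code: it assumes the dump's documented layout (one JSON entity object per line inside one big array), uses a placeholder input path `dump.json`, and writes flat `<ID>.json` files. A real implementation would use a JSON parser rather than a string scan.

```rust
use std::fs::File;
use std::io::{BufRead, BufReader, Write};

/// Extract the entity ID (e.g. "Q42") from one dump line, which looks
/// like `{"type":"item","id":"Q42",...},`. String scan for sketch only.
fn entity_id(line: &str) -> Option<String> {
    let marker = "\"id\":\"";
    let start = line.find(marker)? + marker.len();
    let end = start + line[start..].find('"')?;
    Some(line[start..end].to_string())
}

fn main() -> std::io::Result<()> {
    let path = "dump.json"; // placeholder for the uncompressed dump
    if !std::path::Path::new(path).exists() {
        eprintln!("{} not found; nothing to do", path);
        return Ok(());
    }
    // Stream the ~75GB file line by line instead of loading it whole.
    let reader = BufReader::new(File::open(path)?);
    for line in reader.lines() {
        let line = line?;
        let entity = line.trim().trim_end_matches(',');
        if entity == "[" || entity == "]" || entity.is_empty() {
            continue; // the dump wraps all entities in one JSON array
        }
        if let Some(id) = entity_id(entity) {
            File::create(format!("{}.json", id))?.write_all(entity.as_bytes())?;
        }
    }
    Ok(())
}
```

Streaming line by line keeps memory flat regardless of dump size, which matters at ~75GB.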

Thoughts:

  • The data is much bigger and more unwieldy than expected.
    Processing the weekly dump into the desired format might take up to a day on the instance I currently use (Scaleway C1).
  • I haven't even touched the IPFS part yet. I am very interested to see how well it performs when adding ~18 million files. A root directory containing all those files will probably not work that well ATM, considering that "adding wikipedia" is an edge case listed in Sharding - unixfs support for large directories specs#32.

Kubuxu commented Jun 12, 2016

Currently you won't be able to add 18 million files into one directory, but @magik6k and I are currently adding https://cdnjs.com/ (about 22GB) to IPFS. The difference is that there is a directory tree; as long as you have fewer than about a thousand files or directories in one directory, you should be good. Also, we've found a few performance bugs and are working on resolving them.

Also remember that adding files into IPFS means copying them, so you will need double the storage capacity of the dump.

Adding it to IPFS will take even longer, so I don't think that task is feasible on a C1.


hobofan commented Jun 12, 2016

> Also remember that adding files into IPFS means copying them, so you will need double the storage capacity of the dump.

I planned for that and attached a 150GB volume for IPFS.

> Adding it to IPFS will take even longer, so I don't think that task is feasible on a C1.

Does the time needed improve after an initial add? So would it drop from e.g. 12hr to 1hr when processing it the second week?


Kubuxu commented Jun 12, 2016

> Does the time needed improve after an initial add? So would it drop from e.g. 12hr to 1hr when processing it the second week?

Yes, it should, as it then doesn't have to save the files to disk or publish them to the DHT (for the most part, as they will be the same).

Most importantly, ipfs/kubo#2823 has to be resolved, as you will run out of memory otherwise.
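The reason a repeated add is cheaper is content addressing: identical bytes map to the identical block hash, so an unchanged entity needs no new storage and no new DHT announcement. A toy Rust illustration of that property (IPFS actually uses multihash-based CIDs, not std's hasher):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy stand-in for a content address. IPFS really uses multihash-based
// CIDs, but any deterministic hash shows the deduplication property.
fn content_address(data: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    data.hash(&mut h);
    h.finish()
}

fn main() {
    let week1 = br#"{"id":"Q42","labels":{}}"#; // entity unchanged between dumps
    let week2 = br#"{"id":"Q42","labels":{}}"#;
    // Identical bytes -> identical address, so a repeated add can skip
    // writing the block and re-announcing it on the DHT.
    assert_eq!(content_address(&week1[..]), content_address(&week2[..]));
}
```

Only entities whose JSON actually changed between weekly dumps produce new blocks, which is why the second add should be much faster than the first.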


hobofan commented Jun 12, 2016

> Most importantly, ipfs/kubo#2823 has to be resolved, as you will run out of memory otherwise.

I guess that can be worked around by adding files in smaller batches and then restarting the server? (I assume this is a server bug and not a CLI bug?)

So with that workaround and some layered directory scheme, it should be possible to get it at least somewhat working? I am not in a rush since this is a weekend project, so it would even be okay for me if the initial add takes the whole week 😅
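A layered directory scheme like the one mentioned above could look as follows. The two fan-out levels keyed off the numeric part of the entity ID are hypothetical (not necessarily what wikidata-split ended up using), but they keep every directory under the ~1000-entry guideline even for ~18 million entities:

```rust
/// Hypothetical shard layout: entity "Q42" lands at "Q/42/0/Q42.json".
/// Two fan-out levels of up to 1000 entries each give 1,000,000 leaf
/// directories, so ~18M entities average well under 1000 files per leaf.
fn shard_path(id: &str) -> String {
    // Numeric part of IDs like "Q42" or "P31"; bucket 0 as a fallback.
    let num: u64 = id[1..].parse().unwrap_or(0);
    format!("{}/{}/{}/{}.json", &id[..1], num % 1000, (num / 1000) % 1000, id)
}

fn main() {
    println!("{}", shard_path("Q42"));      // Q/42/0/Q42.json
    println!("{}", shard_path("Q1234567")); // Q/567/234/Q1234567.json
}
```

Using `num % 1000` for the first level spreads consecutive IDs across sibling directories, so a batch of sequential entities doesn't pile into one directory.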

Edit: As for performance, would there be any benefit to using master compared to 0.4.2?


hobofan commented Aug 25, 2016

Had a bit of time to get back to the project. After trying to solve the splitting with a bash script and standard tools, I ended up writing a small Rust program that splits up the large weekly dump and places the entities in sharded directories: https://github.com/hobofan/wikidata-split

I am now starting to add the entities to IPFS and think I am experiencing ipfs/kubo#2828. After adding about 50MB, I have bandwidth totals of TotalIn: 7.3GB and TotalOut: 1.6GB.


jbenet commented Aug 26, 2016

Try with go-ipfs master. Should be down by a large factor


hobofan commented Aug 28, 2016

As a weekend project, I learned some react.js and made a basic viewer for this project: https://github.com/hobofan/ipfs-wikidata-ui

It still has a lot of rough edges (see the open issues), but since I might not get much time over the next few weeks, I wanted to put it out there. Anybody reading this, feel free to join in! 😉

As for progress on adding the dataset, I am at 3.89 GB / 76.56 GB, with the add process dying every ~1.5GB. I might be hitting ipfs/kubo#2823 there (see ipfs/kubo#2823 (comment)). I should also mention that I am no longer on the Scaleway C1 mentioned in the first comment, but have switched to a 32GB quad-core i7 root server (https://www.hetzner.de/ot/hosting/produkte_rootserver/ex40).


hobofan commented Sep 15, 2016

The first complete publish of the dataset is finished! I am now tracking those publishes at ipfs-wikidata/wikidata-split#2. The next step on the dataset side is to do the whole thing again with the current dump to see how long the diff takes to publish, judge whether that's maintainable, and automate as much as possible.
