
Wikidata Viewer #27

Open
hobofan opened this issue Jun 12, 2016 · 8 comments


hobofan commented Jun 12, 2016

Wikidata contains a vast amount of structured, semantic data that can be useful for other IPFS apps. If someone wants to create a close clone of Wikipedia, this would also be a required project, since the backbone of the current Wikipedia is built on Wikidata (related to #17).

For now I have 2 main goals for this project:

  • Make all Wikidata entities available via IPFS
  • Create a read-only interface to view those entities

Progress so far:

  • Downloaded the latest-all.json.bz2 dump from here (found via this page). This file is ~5GB in size.
  • Uncompressed the file to its original size of ~75GB. This was a lot bigger than I expected (I forgot about the crazy compression ratio for JSON) and forced me to restart the process with a different server.
  • Created a script that splits the dump into the individual entities and tested that on a small part of it.
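The splitting step above can be sketched in Rust. This is a simplified stand-in, not the actual wikidata-split code: it assumes the dump's documented layout (one JSON entity object per line inside one big array), uses a placeholder input path `dump.json`, and writes flat `<ID>.json` files. A real implementation would use a JSON parser rather than a string scan.

```rust
use std::fs::File;
use std::io::{BufRead, BufReader, Write};

/// Extract the entity ID (e.g. "Q42") from one dump line, which looks
/// like `{"type":"item","id":"Q42",...},`. String scan for sketch only.
fn entity_id(line: &str) -> Option<String> {
    let marker = "\"id\":\"";
    let start = line.find(marker)? + marker.len();
    let end = start + line[start..].find('"')?;
    Some(line[start..end].to_string())
}

fn main() -> std::io::Result<()> {
    let path = "dump.json"; // placeholder for the uncompressed dump
    if !std::path::Path::new(path).exists() {
        eprintln!("{} not found; nothing to do", path);
        return Ok(());
    }
    // Stream the ~75GB file line by line instead of loading it whole.
    let reader = BufReader::new(File::open(path)?);
    for line in reader.lines() {
        let line = line?;
        let entity = line.trim().trim_end_matches(',');
        if entity == "[" || entity == "]" || entity.is_empty() {
            continue; // the dump wraps all entities in one JSON array
        }
        if let Some(id) = entity_id(entity) {
            File::create(format!("{}.json", id))?.write_all(entity.as_bytes())?;
        }
    }
    Ok(())
}
```

Streaming line by line keeps memory flat regardless of dump size, which matters at ~75GB.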

Thoughts:

  • The data is much bigger and more unwieldy than expected.
    Processing the weekly dump into the desired format might take up to a day on the instance I currently use (Scaleway C1).
  • I haven't even touched the IPFS part yet. I am very interested to see how well it performs when adding ~18 million files. A root directory containing all those files will probably not work that well ATM, considering that "adding wikipedia" is an edge case listed in Sharding - unixfs support for large directories specs#32.

Kubuxu commented Jun 12, 2016

Currently you won't be able to add 18 million files into one directory, but @magik6k and I are currently adding https://cdnjs.com/ (about 22GB) to IPFS. The difference is that there is a directory tree; as long as you have fewer than about a thousand files or directories in one directory, you should be good. Also, we've found a few performance bugs and are working on resolving them.

Also remember that adding files into IPFS means copying them, so you will need double the storage capacity of the dump.

Adding it to IPFS will take even longer, so I don't think that task is feasible on a C1.


hobofan commented Jun 12, 2016

> Also remember that adding files into IPFS means copying them, so you will need double the storage capacity of the dump.

I planned for that and attached a 150GB volume for IPFS.

> Adding it to IPFS will take even longer, so I don't think that task is feasible on a C1.

Does the time needed improve after an initial add? So would it drop from e.g. 12hr to 1hr when processing it the second week?


Kubuxu commented Jun 12, 2016

> Does the time needed improve after an initial add? So would it drop from e.g. 12hr to 1hr when processing it the second week?

Yes, it should, as it then doesn't have to save the files to disk or publish them to the DHT (for the most part, as they will be the same).

Most importantly, ipfs/kubo#2823 has to be resolved, as you will run out of memory otherwise.
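The reason a repeated add is cheaper is content addressing: identical bytes map to the identical block hash, so an unchanged entity needs no new storage and no new DHT announcement. A toy Rust illustration of that property (IPFS actually uses multihash-based CIDs, not std's hasher):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy stand-in for a content address. IPFS really uses multihash-based
// CIDs, but any deterministic hash shows the deduplication property.
fn content_address(data: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    data.hash(&mut h);
    h.finish()
}

fn main() {
    let week1 = br#"{"id":"Q42","labels":{}}"#; // entity unchanged between dumps
    let week2 = br#"{"id":"Q42","labels":{}}"#;
    // Identical bytes -> identical address, so a repeated add can skip
    // writing the block and re-announcing it on the DHT.
    assert_eq!(content_address(&week1[..]), content_address(&week2[..]));
}
```

Only entities whose JSON actually changed between weekly dumps produce new blocks, which is why the second add should be much faster than the first.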


hobofan commented Jun 12, 2016

> Most importantly, ipfs/kubo#2823 has to be resolved, as you will run out of memory otherwise.

I guess that can be worked around by adding files in smaller batches and then restarting the server? (I assume this is a server bug and not a CLI bug?)

So with that workaround and some layered directory scheme, it should be possible to get it at least somewhat working? I am not in a rush since this is a weekend project, so it would even be okay for me if the initial add takes the whole week 😅
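A layered directory scheme like the one mentioned above could look as follows. The two fan-out levels keyed off the numeric part of the entity ID are hypothetical (not necessarily what wikidata-split ended up using), but they keep every directory under the ~1000-entry guideline even for ~18 million entities:

```rust
/// Hypothetical shard layout: entity "Q42" lands at "Q/42/0/Q42.json".
/// Two fan-out levels of up to 1000 entries each give 1,000,000 leaf
/// directories, so ~18M entities average well under 1000 files per leaf.
fn shard_path(id: &str) -> String {
    // Numeric part of IDs like "Q42" or "P31"; bucket 0 as a fallback.
    let num: u64 = id[1..].parse().unwrap_or(0);
    format!("{}/{}/{}/{}.json", &id[..1], num % 1000, (num / 1000) % 1000, id)
}

fn main() {
    println!("{}", shard_path("Q42"));      // Q/42/0/Q42.json
    println!("{}", shard_path("Q1234567")); // Q/567/234/Q1234567.json
}
```

Using `num % 1000` for the first level spreads consecutive IDs across sibling directories, so a batch of sequential entities doesn't pile into one directory.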

Edit: As for performance, would there be any benefit to using master compared to 0.4.2?


hobofan commented Aug 25, 2016

Had a bit of time to get back to the project. After trying to solve the splitting with a bash script and standard tools, I ended up writing a small Rust program that splits up the large weekly dump and places the entities in sharded directories: https://github.com/hobofan/wikidata-split

I am now starting to add the entities to IPFS and think I am experiencing ipfs/kubo#2828. After adding about 50MB, I have bandwidth totals of TotalIn: 7.3GB and TotalOut: 1.6GB.


jbenet commented Aug 26, 2016

Try with go-ipfs master. Should be down by a large factor


hobofan commented Aug 28, 2016

As a weekend project, I learned some react.js and made a basic viewer for this project: https://github.com/hobofan/ipfs-wikidata-ui

It still has a lot of rough edges (see the open issues), but since I might not get much time over the next few weeks, I wanted to put it out there. Anybody reading this, feel free to join in! 😉

As for progress on adding the dataset, I am at 3.89 GB / 76.56 GB, with the add process dying every ~1.5GB. I might be hitting ipfs/kubo#2823 there (see ipfs/kubo#2823 (comment)). I should also mention that I am no longer on the Scaleway C1 mentioned in the first comment, but have switched to a 32GB quad-core i7 root server (https://www.hetzner.de/ot/hosting/produkte_rootserver/ex40).


hobofan commented Sep 15, 2016

The first complete publish of the dataset is finished! I am now tracking those publishes at ipfs-wikidata/wikidata-split#2. The next step on the dataset side is to do the whole thing again with the current dump to see how long the diff takes to publish, judge whether that's maintainable, and automate as much as possible.
