Wikidata Viewer #27
Currently you won't be able to add 18 million files into one directory, but @magik6k and I are currently adding https://cdnjs.com/ (about 22GB) to IPFS. The difference is that there is a directory tree; as long as you have fewer than about a thousand files or directories in any one directory, you should be fine. We've also found a few performance bugs and are working on resolving them. Also remember that adding files to IPFS means copying them, so you will need double the storage capacity of the dump. Adding it to IPFS will take even longer, so I don't think that task is feasible on a C1.
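To put the ~1000-entries-per-directory guideline against the ~18 million entity files mentioned above, a quick back-of-the-envelope check (the numbers are from this thread; the calculation itself is just a sketch) shows two levels of sharding are enough:

```rust
// Find the smallest number of shard levels L such that a tree with
// at most `max_per_dir` entries per directory can hold `total_files`:
// max_per_dir^L leaf directories, each holding up to max_per_dir files.
fn shard_levels(total_files: u64, max_per_dir: u64) -> u32 {
    let mut levels = 0;
    let mut leaf_dirs = 1u64;
    while leaf_dirs.saturating_mul(max_per_dir) < total_files {
        leaf_dirs *= max_per_dir;
        levels += 1;
    }
    levels
}

fn main() {
    // ~18 million entities, ~1000 entries per directory.
    let levels = shard_levels(18_000_000, 1000);
    println!("{} shard levels suffice", levels); // → 2 shard levels suffice
    // With 2 levels there are 1000^2 = 1,000,000 leaf directories,
    // so each holds only ~18 entity files on average.
}
```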
I planned for that and attached a 150GB volume for IPFS.
Does the time needed improve after an initial add? So would it drop from e.g. 12 h to 1 h when processing it the second week?
Yes, it should, as it then doesn't have to save the files to disk nor publish them to the DHT (for the most part, as they will be the same). Most importantly, this has to be resolved: ipfs/kubo#2823, as you will run out of memory otherwise.
I guess that can be worked around by adding files in smaller batches and then restarting the server? (I assume this is a server bug and not a CLI bug?) So with that workaround and some layered directory scheme, it should be possible to get it at least somewhat working? I am not in a rush since this is a weekend project, so it would even be okay for me if the initial add takes the whole week 😅 Edit: As for performance, would there be any benefit to using master compared to
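The batching workaround described above could look roughly like the sketch below: chunk the file list, run one `ipfs add` per batch, and restart the daemon between batches to release memory. The batch size is an arbitrary placeholder, and how the daemon is restarted depends entirely on the setup, so that step is only a comment.

```rust
use std::process::Command;

// Split a slice into fixed-size batches (the last one may be smaller).
fn batches<T: Clone>(items: &[T], size: usize) -> Vec<Vec<T>> {
    items.chunks(size).map(|c| c.to_vec()).collect()
}

// Sketch only: add files batch by batch so a single `ipfs add` run
// stays within memory (the issue ipfs/kubo#2823 mentioned above).
fn add_in_batches(paths: &[String], batch_size: usize) {
    for batch in batches(paths, batch_size) {
        // `ipfs add` accepts multiple paths in one invocation.
        let status = Command::new("ipfs").arg("add").args(&batch).status();
        if status.map(|s| s.success()).unwrap_or(false) {
            // Restart the daemon here to release memory; how to do that
            // (systemd, supervisor, manual kill) is setup-specific and
            // therefore left out of this sketch.
        }
    }
}

fn main() {
    let paths: Vec<String> = (0..5).map(|i| format!("entity_{}.json", i)).collect();
    println!("{} batches", batches(&paths, 2).len()); // → 3 batches
    let _ = add_in_batches; // not invoked here: requires a running ipfs daemon
}
```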
Had a bit of time to get back to the project. After trying to solve the file-splitting part with a bash script and standard tools, I ended up writing a small Rust program that splits up the large weekly dump and places the entities in sharded directories: https://github.com/hobofan/wikidata-split I am now starting to add the entities to IPFS and think I am experiencing ipfs/kubo#2828. After adding about 50MB I have a bandwidth of
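The prefix-sharding idea behind such a splitter can be sketched as follows. The exact layout wikidata-split uses is not stated in this thread; the zero-padding and three-digit levels here are assumptions chosen so that no directory exceeds 1000 entries.

```rust
// Map a Wikidata entity ID like "Q42" to a sharded relative path so
// that no single directory accumulates millions of entries.
// Illustrative scheme only, not necessarily what wikidata-split uses:
// zero-pad the numeric part to 9 digits and use two 3-digit levels,
// giving at most 1000 entries per directory at every level.
fn shard_path(id: &str) -> String {
    let (kind, num) = id.split_at(1);   // "Q42" -> ("Q", "42")
    let padded = format!("{:0>9}", num); // "42" -> "000000042"
    format!("{}/{}/{}/{}.json", kind, &padded[0..3], &padded[3..6], id)
}

fn main() {
    println!("{}", shard_path("Q42")); // → Q/000/000/Q42.json
    println!("{}", shard_path("Q18000000")); // → Q/018/000/Q18000000.json
}
```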
Try with go-ipfs master. Should be down by a large factor.
As a weekend project, I learned some react.js and made a basic viewer for this project: https://github.com/hobofan/ipfs-wikidata-ui It still has a lot of rough edges (see the open issues), but since I might not get much time over the next few weeks, I wanted to put it out there. Anybody reading this, feel free to join in! 😉 As for the progress of adding the dataset, I am at
The first complete publish of the dataset is finished! I am now tracking those at ipfs-wikidata/wikidata-split#2. The next step on the dataset side is to do the whole thing again with the current dump to see how long the diff takes to publish, and to judge whether that's maintainable (and automate as much as possible).
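One way the weekly diff publish could be kept cheap is to re-add only entities whose content actually changed; everything else keeps its existing IPFS hash thanks to content addressing. The sketch below is purely illustrative (the thread doesn't describe the actual update mechanism) and uses the standard library's `DefaultHasher` as a stand-in fingerprint.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Cheap content fingerprint; in practice a proper content hash
// (or the IPFS hash itself) would be used instead.
fn fingerprint(content: &str) -> u64 {
    let mut h = DefaultHasher::new();
    content.hash(&mut h);
    h.finish()
}

// IDs whose content is new or changed between two snapshots; only
// these need to be re-added to IPFS on the next weekly publish.
fn changed_ids(old: &HashMap<String, u64>, new: &HashMap<String, String>) -> Vec<String> {
    let mut ids: Vec<String> = new
        .iter()
        .filter(|(id, content)| old.get(*id) != Some(&fingerprint(content)))
        .map(|(id, _)| id.clone())
        .collect();
    ids.sort();
    ids
}

fn main() {
    let mut old = HashMap::new();
    old.insert("Q1".to_string(), fingerprint("universe"));
    old.insert("Q2".to_string(), fingerprint("earth"));

    let mut new = HashMap::new();
    new.insert("Q1".to_string(), "universe".to_string()); // unchanged
    new.insert("Q2".to_string(), "Earth".to_string()); // edited
    new.insert("Q3".to_string(), "life".to_string()); // new entity

    println!("{:?}", changed_ids(&old, &new)); // → ["Q2", "Q3"]
}
```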
Wikidata contains a vast amount of structured, semantic data that can be useful for other IPFS apps. If someone wants to create a close clone of Wikipedia, this would also be a required project, since the backbone of the current Wikipedia is built on Wikidata (related to #17).
For now I have 2 main goals for this project:
Progress so far:
Downloaded the latest-all.json.bz2 dump from here (found via this page). This file is ~5GB big.
… and forced me to restart the process with a different server.
Thoughts:
Processing the weekly dump into the desired format might take up to a day on the instance I currently use (Scaleway C1).