Improve data onboarding speed: ipfs add and ipfs dag import|export #10383
Comments
I don't know what hashing method is running under the hood, but these were the questions I had when contemplating it. Normally the entire file is passed through in chunks, each chunk is processed in series, and each step incorporates the previous chunk's state, which means that parallelizing this will not work. Alternatively, it should be possible to hash the chunks in parallel, and then take a SHA-256 of the resulting chunk hashes. Additionally, I know there is hardware acceleration and there are dedicated hardware instructions for SHA-256, but I don't know how much you will really enjoy supporting different hardware, or what sorts of applications automatically handle the accelerators and what level of support they have. I seem to remember it taking around 2 days to index ~4TB of models on a 2TB NVMe-cached ZFS 8-drive SAS array.
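As a rough illustration of the second approach, here is a minimal Go sketch assuming fixed-size 1 MiB chunks and plain SHA-256; this is not Kubo's actual chunker or hashing pipeline, and the file name is only the example from this thread: hash each chunk in its own goroutine, then hash the ordered chunk digests into a single root digest, Merkle-tree style.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"os"
	"sync"
)

const chunkSize = 1 << 20 // 1 MiB per chunk; arbitrary for this sketch

func main() {
	f, err := os.Open("mixtral-8x7b-v0.1.Q8_0.gguf") // example file from this thread
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Read the file into fixed-size chunks (sequential I/O; only the hashing is parallel).
	// A real importer would stream instead of holding everything in memory.
	var chunks [][]byte
	for {
		buf := make([]byte, chunkSize)
		n, err := io.ReadFull(f, buf)
		if n > 0 {
			chunks = append(chunks, buf[:n])
		}
		if err != nil { // io.EOF or io.ErrUnexpectedEOF on the final partial chunk
			break
		}
	}

	// Hash every chunk in its own goroutine; order is preserved by index.
	digests := make([][sha256.Size]byte, len(chunks))
	var wg sync.WaitGroup
	for i, c := range chunks {
		wg.Add(1)
		go func(i int, c []byte) {
			defer wg.Done()
			digests[i] = sha256.Sum256(c)
		}(i, c)
	}
	wg.Wait()

	// Combine: one SHA-256 over the ordered chunk digests gives a single root digest.
	root := sha256.New()
	for _, d := range digests {
		root.Write(d[:])
	}
	fmt.Printf("root digest: %x\n", root.Sum(nil))
}
```

Whether a scheme like this can apply to `ipfs add` depends on how the DAG layout and multihash are defined, since changing the hashing strategy changes the resulting CIDs.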
@endomorphosis mind providing some answers to the questions below? These will let us narrow down the areas that require optimization to better serve your use case (AI model data?):
It looks like: mostly large language models, between 7GB and 70GB, from Hugging Face repositories, using badgerds. The IPFS daemon is running online and I use -r to archive the entire folder, e.g. https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GGUF/tree/main https://github.com/endomorphosis/ipfs_transformers/blob/main/ipfs_transformers/ipfs_kit_lib/ipfs.py#L73
```
ipfs@workstation:/storage/cloudkit-models/Mixtral-8x7B-v0.1-GGUF-Q8_0$ time ipfs add mixtral-8x7b-v0.1.Q8_0.gguf
real    1m59.033s
```

(copy from zfs to ssd, then zfs to zfs)

```
barberb@workstation:/storage/cloudkit-models/Mixtral-8x7B-v0.1-GGUF-Q8_0$ time cp mixtral-8x7b-v0.1.Q8_0.gguf ../
real    2m28.763s
```

Hardware: dual Xeon E5 v4 CPUs with 8x WD Gold drives and 1x Samsung 3x8-PCIe-lane enterprise NVMe on ZFS. Also tested on Windows, NVMe -> NVMe (different devices), Intel(R) Xeon(R) CPU E3-1535M v6, using the desktop client.
@endomorphosis Triage questions:
```
devel@workstation:/tmp$ time ipfs add mixtral-8x7b-v0.1.Q8_0.gguf    # running offline
fregg@workstation:/tmp$ time ipfs add mixtral-8x7b-v0.1.Q8_0.gguf    # syncwrite set to false
```
I do want to mention that I am writing a wrapper to import datasets into IPFS as well, where the number of files will be on the order of 7 million legal caselaw documents. Do you have any feedback about what optimizations ought to be made for that case?
Per #9678, I have an old branch with a bunch of optimizations for data import that I could rescue/clean up (edit: not for merging though, as it hardcodes some stuff; just for using). One low-hanging fruit would be to add support for badger4 and pebble backends too; Badger v1 is ages old.
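For illustration only (there is no Pebble datastore in Kubo here; the import is cockroachdb's Pebble library, and the key layout and paths below are made up for the sketch), a Pebble-backed block write path would look roughly like this, with `pebble.NoSync` giving the same durability-for-speed trade-off as the "syncwrite to false" Badger setting mentioned above:

```go
package main

import (
	"fmt"

	"github.com/cockroachdb/pebble"
)

func main() {
	// Open (or create) a Pebble key/value store on disk.
	db, err := pebble.Open("/tmp/pebble-blocks", &pebble.Options{})
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// Store a block under an illustrative key. pebble.NoSync skips the
	// per-write fsync, trading crash durability for bulk-import speed,
	// the same trade-off as Badger's SyncWrites=false.
	key := []byte("/blocks/EXAMPLEKEY")
	block := []byte("raw block bytes")
	if err := db.Set(key, block, pebble.NoSync); err != nil {
		panic(err)
	}

	// Read the block back; the returned closer must be closed when done.
	got, closer, err := db.Get(key)
	if err != nil {
		panic(err)
	}
	defer closer.Close()
	fmt.Printf("read back %d bytes\n", len(got))
}
```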
I will look at that for my repositories. I would like to impress on your org (Protocol Labs) that I am trying to make this ergonomic for machine learning developers, and from what I have seen, other projects (e.g. Bacalhau, Iroh) are migrating away from libp2p / IPFS because the project needs to be performance-optimized. I only have two more weeks I can spend on this Hugging Face bridge, and right now it is more of a life raft in case the government decides to overregulate machine learning than a viable solution that ML devs would turn to; but if something reasonably effective for decentralized low-latency MLOps existed, I would probably stop development and just buy Filecoin.
Checklist
Description
This is a follow-up on user @endomorphosis's comment in the Filecoin community discussions about IPFS hashing being slow.
Per @lidel in an ipfs-steering conversation on 2 April 2024: