Chunker-argument for 'files write' #7532
Just as a tip: unless you cause GC, you can add with `--pin=false` and copy the result into the MFS. I second the request, however. This is just a (less-bad?) work-around.
@namibj but when I cause a GC I'm in a deadlock, since the data I've just added gets deleted with no way of knowing that it happened.
@RubenKelevra I don't see where there would be a deadlock. Can you elaborate?
@namibj wrote:

Well, when you run a GC while the data has only been added but not yet referenced in the MFS, it simply gets deleted.

While this might work, it's still not thread-safe, since after the `ipfs add` a concurrent GC can still remove the blocks before they are referenced in the MFS. I've worked around it in my code like this:
While looking for status updates on the IPFS Arch mirror, I came across this issue again and realized I missed something:
Actually, you'd do this:

```
# Add unpinned, link the CID into the MFS, then verify that all blocks are
# actually present locally; re-add if the GC won the race in between.
cid=$(ipfs add --pin=false --chunker=buzhash --quieter "$file") &&
ipfs files cp "/ipfs/$cid" "/path/to/place/$file" &&
if ! ipfs --enc json files stat --with-local "/path/to/place/$file" |
     jq -e '.CumulativeSize == .SizeLocal' > /dev/null; then
  ipfs add --pin=false --chunker=buzhash --quieter "$file"
fi
```

If "is referenced somewhere in the MFS" prevents GC, I don't see how this could race in this bad way.
I looked at that (well, the master branch), and I'd suggest skipping the precondition checking from the happy path, e.g. in https://github.com/RubenKelevra/rsync2ipfs-cluster/blob/7e2b846f6449ed8c568d0cd2e435d63c618099a4/bin/rsync2cluster.sh#L461-L468: try the operation directly and handle the error case instead.

Further optimizations

Also, …
Do note the warning on …
MFS-spamming considered harmful

As a test with some 8 GiB of research data files on my computer suggested, this gets slow: if it takes a day to process the 100-ish GB of the Arch repo incrementally, these incremental tactics might not be the answer.
Hey @namibj, thanks for taking the time to look into this. I'll try to give you the answers I can, but it feels a bit confusing to me, so let me try to make sense of it.
I don't do random writes. And I don't want to. There's support for this, which means it can be handled by the chunking backend. But this doesn't matter as I don't use random writes.
Why is this better than extending a command which already does what I need with a parameter, making user-customizable what is currently just fixed? Can you elaborate?
That's exactly what I do. But this won't protect it from the GC. The GC is not threadsafe atm and will disrupt any operation by deleting data. That's why I only run the GC after I've completed an operation and am not doing anything on the MFS.
While I don't run the GC automatically, I could do this. But as the GC is not threadsafe, it wouldn't guarantee anything. In this specific case the data from `ipfs add --pin=false` might be deleted right after it has been added.
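To make that ordering concrete, here is a minimal sketch of the "do all MFS work first, GC only afterwards" flow; the file list, loop, and destination path are illustrative placeholders, not the actual script:

```
# Illustrative only: finish all add/MFS operations first...
while IFS= read -r file; do
  cid=$(ipfs add --pin=false --chunker=buzhash --quieter "$file") &&
  ipfs files cp "/ipfs/$cid" "/path/to/place/$file"
done < changed-files.txt

# ...and only run the GC once nothing is touching the MFS anymore.
ipfs repo gc
```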
How does this "optimization" help anything? Correct me if I'm wrong, but that's just a variation, and it would actually send 3 commands if a file is not there, instead of just one. So recovering from a partly written rsync log ends up slower than before.
I mean, I do use it. And yes, I have read the warning before implementing it. That's why I remove the old cluster pin after updating the old one.
This doesn't make any sense: First, there's a good reason I don't run … Second, … Third, the paths are not correct if I do this. I do remap some paths and also do some filtering of files and folders which are not needed. I don't see how this would improve anything.
Filestore is an experimental feature. I will not use experimental features. Apart from that, there are a lot of limitations on how the filestore gets handled. And I already run into limitations without using it. Why should I switch to an even less tested code path that is marked experimental?
I'm not sure why you want me to hash the whole directory if you think that doing it incrementally takes days. I don't think doing more work will improve anything. It actually takes less than a minute, usually.
Hey @RubenKelevra, thanks for the effort you put into popularizing practically useful IPFS applications like the Arch mirror that runs a localhost HTTP gateway and shares recently-fetched packages via P2P on the internet and intranet, with no mirrorlist tricks needed to use an intranet cache if and when available.
Sorry about the partially incoherent writing; I had never intended that much commentary, or, for the record, this much response commentary (I'm responding out of order; this is the last bit I'm writing).
Also, it's a strictly streaming one, instead of letting the gateway read the files directly from the filesystem (at least that's how I understand it).
I assumed the GC was usable concurrently when I wrote that, and the idea would have been to stick the CID(s) into the MFS within the same transaction, so as to not expose them for a brief window as orphans for the GC to try and reap.
Oh, I didn't realize it was that bad. I expected them to have some form of concurrent GC by now that isn't so bad that one has to use it in stop-the-world regardless (and the concurrent-ness only guards against DB corruption).
I (seemingly mistakenly) assumed it'd usually succeed when trying to delete a file. It seems the error message for "file wasn't there to begin with" is easy to match on, probably even easy to accurately predict, at least if filenames aren't exotic. It should probably be a little string construction and then a comparison of the predicted "file not found" error message with the returned error message, to confirm a failed delete command won't need to be repeated (this is more effort though, and I didn't expect it to be in the hot path).
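A minimal sketch of that idea, assuming `ipfs files rm` plus a grep on the error output; the exact wording of the "file does not exist" error is an assumption and should be checked against the go-ipfs version in use:

```
# Attempt the delete directly; only treat errors other than "not found" as real.
# NOTE: the matched error text is an assumption, verify it for your ipfs version.
if ! err=$(ipfs files rm "/path/to/place/$file" 2>&1); then
  if ! printf '%s\n' "$err" | grep -qi 'does not exist'; then
    echo "unexpected error while removing $file: $err" >&2
    exit 1
  fi
fi
```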
I somehow read a version from mid-2020 or thereabouts when I switched from GH to Sourcegraph for some mild IDE quality-of-life. Sorry about the confusion.
I'm not sure to what extent follower nodes apply these commands sequentially vs. concurrently, but delaying the removal by a few hours wouldn't seem like a bad idea. It should be easy to confirm either way by looking at a follower's pin-update performance for non-trivial deltas that give the potential race condition a window of opportunity. Though I'd not be surprised if it could only cause issues when the GC strikes the partially-unpinned DAG before those blocks have been consulted by the running pin update.
Just to be clear: I know this is an experimental feature. I don't know if it's practical/feasible to prevent it from messing with the lower levels of the DAG. But it should do the part of allowing the top-level add-to-MFS step to work. I'm speaking of data inlined into a CID.
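For reference, a small sketch of what "data inlined into a CID" looks like on the command line, assuming the `--inline`/`--inline-limit` options of `ipfs add`; the 64-byte limit and paths are just illustrative values:

```
# Inline very small files directly into their CID via the identity hash,
# so the data lives in the CID itself; the limit value is illustrative.
cid=$(ipfs add --inline --inline-limit=64 --quieter "$small_file")
ipfs files cp "/ipfs/$cid" "/path/in/mfs/$small_file"
```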
Oh, fair. If it's still a performance issue due to IPFS being bad, I'd expect hard links/reflinks of the underlying filesystem with …
That's a fair stance.
It exists to allow …
As mentioned above, I was looking at an old version of the code. That seemed to process the rsync log serially, with typically multiple …
Oh, if so, that's great. I just read the issues in RubenKelevra/pacman.store#62 about runs taking a long time to finish (and quite possibly just failing at the end).
I use buzhash as the chunker on one of my projects. Since the chunker argument is currently not available on `ipfs files write`, I need to add all files as pins first, copy them to the location in the MFS and unpin them afterwards, which is a rather crude workaround.

I'd like to have the `--chunker` flag on `ipfs files write` as well.
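For illustration, a rough sketch of the crude workaround described above (paths are placeholders); the requested `--chunker` flag on `ipfs files write` would make the pin/unpin dance unnecessary:

```
# Current workaround: add with the desired chunker (pinned by default),
# copy the resulting CID into the MFS, then drop the pin again.
cid=$(ipfs add --chunker=buzhash --quieter "$file") &&
ipfs files cp "/ipfs/$cid" "/path/in/mfs/$file" &&
ipfs pin rm "$cid"
```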