Content routing issues with "Reprovider.Strategy" set to "roots" #9416
Comments
This is correct, that is exactly what is happening. With the ConnMgr grace period set to 20s, once the connection is dropped Kubo is not able to find providers, because there is no longer a connected peer that can respond to bitswap requests (and the DHT has no info, because the strategy is set to "roots"). Kubo should be smart enough to retry content routing for the parent node when the content path has any subpaths. Suggestions / designs / PRs welcome.
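A minimal sketch of the "retry content routing for the parent node" idea, assuming a toy provider index in place of the real DHT. The function name `find_providers_with_fallback`, the `ProviderIndex` type, and the CID strings are all hypothetical and only illustrate the lookup order; this is not how Kubo is implemented.

```python
from typing import Dict, List, Optional, Set, Tuple

# Toy stand-in for the DHT: maps a CID to the peers that announced it.
# (Hypothetical; real provider records live in the DHT, not in a dict.)
ProviderIndex = Dict[str, Set[str]]


def find_providers_with_fallback(
    path_cids: List[str], dht: ProviderIndex
) -> Tuple[Optional[str], Set[str]]:
    """Look up providers for the last CID in path_cids; if none are found,
    retry with each ancestor, walking back toward the root.

    path_cids is ordered root-first, e.g. [directory CID, file CID].
    Returns (CID that matched, its providers), or (None, set()) if no CID
    on the path has any provider.
    """
    for cid in reversed(path_cids):
        providers = dht.get(cid, set())
        if providers:
            return cid, set(providers)
    return None, set()


# The publisher uses the "roots" strategy, so only the directory CID is announced.
dht: ProviderIndex = {"dir-cid": {"peer-A"}}

matched, providers = find_providers_with_fallback(["dir-cid", "file-2-cid"], dht)
print(matched, providers)
```

Peers returned for an ancestor are only candidates: they announced the parent, so they would still have to be asked over bitswap whether they actually hold the child block.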
Oof, I moved this to a separate issue: #9418
The simplest thing for now might be just removing this "roots" functionality. I might be missing something though: is there a use case in which it doesn't trigger this buggy behavior? If we do want to keep it, then it's not at all obvious to me how this can be done in a robust way. For example, if a node keeps track of where it got the parent folder's data, then that is not enough: what if the original node went offline, but another node that also has "roots" configured came online in the meantime? The only robust solution that I can think of is to scan all nodes that advertise the directory CID, and check whether each of them has any of its children. This can become quite slow when a lot of nodes have accessed the directory CID, and therefore have it cached, but do not have any children. Ok, here is one more potential solution: including metadata when advertising, to let others on the DHT know if you have children of a particular CID. Not sure how hard that would be to implement?
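To make the cost difference between the two ideas above concrete, here is a rough sketch under stated assumptions. The `Peer` class, the `has_children` flag (standing in for the proposed advertisement metadata), and both scan functions are hypothetical; no such field exists in current provider records as far as this thread establishes.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class Peer:
    """Hypothetical peer that has announced the directory CID."""
    name: str
    blocks: Set[str] = field(default_factory=set)
    # Assumed metadata: "I also hold children of the CIDs I announce".
    has_children: bool = False


def scan_for_child(providers: List[Peer], child_cid: str) -> Tuple[Optional[str], int]:
    """Ask every provider of the directory CID for the child, one by one.
    Returns (name of the first peer holding it or None, number of peers asked)."""
    asked = 0
    for peer in providers:
        asked += 1
        if child_cid in peer.blocks:
            return peer.name, asked
    return None, asked


def scan_with_hint(providers: List[Peer], child_cid: str) -> Tuple[Optional[str], int]:
    """Same scan, but skip peers whose advertisement does not claim children."""
    candidates = [p for p in providers if p.has_children]
    return scan_for_child(candidates, child_cid)


# 50 peers cached only the directory block; one peer holds everything.
caches = [Peer(f"cache-{i}", {"dir-cid"}) for i in range(50)]
origin = Peer("origin", {"dir-cid", "file-2-cid"}, has_children=True)
providers = caches + [origin]

full_scan = scan_for_child(providers, "file-2-cid")
hinted_scan = scan_with_hint(providers, "file-2-cid")
print(full_scan, hinted_scan)
```

In this toy setup the plain scan asks all 51 peers in the worst case, while the hinted scan asks one. The hint only helps if peers set it truthfully and keep it up to date.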
Announcing only pinned roots is useful for use cases where people work with non-filesystem data (e.g., DAG-CBOR databases and/or search indexes, like the one in ipfs-geoip). This allows other peers to pin the entire dataset via the root CID without asking the original publisher to announce every CID (which may not scale if the dataset is so big that it takes longer than 24h to announce all CIDs). I would not remove 'roots', but instead make content resolution smarter: if provider is not found for 👉 In other words, imo, the fix here is not removing the "roots" option, but making content resolution more robust:
(digression) @hsn10 rfm17-provider-record-liveness.md#5-Conclusion suggests that there is absolutely no need for such low reprovider intervals, and we are looking into increasing the defaults (#9326). If you have reproducible data, mind sharing more info about your use case and measurements in an issue at https://github.com/protocol/network-measurements/?
Thanks both for the replies. Perhaps a minimal mitigation of this issue would be to at least clarify this behavior and the intended use cases for "roots" in the documentation?
Wouldn't that case also break if a node (call it node "B") loses a connection halfway through downloading the data (from node "A")? After all, "B" doesn't remember that it got the original root CID from "A". Furthermore, a third node ("C") could initiate a download from "B", but is never able to finish it since "B" doesn't have anything. Per my current understanding, "C" wouldn't know to contact "A" for the rest, since it's not advertising the remaining child CIDs directly. In other words, isn't this issue independent of the type of data (filesystem or not)?
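The A/B/C scenario above can be written down as a small toy model. The node names follow the comment; the `blocks` and `announcements` tables, the CID strings, and the assumption that "B" announces exactly what it holds are illustrative only.

```python
from typing import Dict, Set


def build_provider_index(announcements: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Invert {peer: announced CIDs} into {CID: peers}, like a toy DHT."""
    index: Dict[str, Set[str]] = {}
    for peer, cids in announcements.items():
        for cid in cids:
            index.setdefault(cid, set()).add(peer)
    return index


# What each node actually stores.
blocks = {
    "A": {"root", "child-1", "child-2"},  # original publisher, complete copy
    "B": {"root", "child-1"},             # download from A was interrupted
}

# What each node announces. A uses "roots"; B is assumed to announce what it has.
announcements = {
    "A": {"root"},
    "B": {"root", "child-1"},
}

dht = build_provider_index(announcements)

# C looks up the missing child directly: nobody has announced it.
direct = dht.get("child-2", set())

# Both A and B show up as providers of the root, but only A can serve the child.
root_providers = dht.get("root", set())
can_serve_child = {peer for peer in root_providers if "child-2" in blocks[peer]}

print(direct, root_providers, can_serve_child)
```

Nothing in this model depends on what the blocks contain, which matches the point that the problem looks the same for filesystem and non-filesystem data.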
Indeed, that does seem to be a potential fix. However, it might get expensive in certain edge cases. For example, what if lots of nodes have only partial downloads of
@AnnaArchivist We believe these issues are fixed as of 0.18-rc2. Would you like to try again and let us know?
@mishmosh can you provide any info about the resolution to this issue? I reviewed the changes in 0.18-rc2 and diffed it against other releases, but no clear solution to this issue stood out.
@sevenrats thank you for the ping, which sub-issue do you have in mind?
ps. Once again, thank you for filing this issue. In the future, it's better to file one issue per bug, as that is easier to triage 🙏
@lidel I'm specifically curious what the status of this is:
Is this on the roadmap? As one of your users, this seems broken and a warning in the docs does very little to improve that.
I was searching for this issue earlier but I didn't find it. Here are some more detailed comments: #10249 (comment)
Checklist
Installation method
ipfs-update or dist.ipfs.tech
Version
Config
Description
I'm seeing unexpected behavior when using `Reprovider.Strategy` "roots".
Now, this might be one or multiple bugs in kubo, in the documentation, in the gateway, in specs, or in some interaction between them all. This is kind of an umbrella bug, since I'm not too familiar with the different subsystems that might be involved. I'm writing this from a user's perspective, not from an IPFS developer perspective.
Hopefully someone can help figure out if this needs to be split up in smaller tickets, or filed elsewhere. I'd be happy to work with you all to make that happen!
Background
When (re)providing very large directories with lots of data, it can take a long time to complete. My understanding from the config docs is that you can use `Reprovider.Strategy` "roots" to only provide the directory CID itself, provided that it is pinned. Fantastic, that should make (re)providing much faster!
This blog post then suggests that IPFS nodes should still be able to find the content that is linked to in this directory, e.g. when using `<directory CID>/filename` (emphasis mine):
Therefore, I would expect that I can do this:
- `Reprovider.Strategy` "roots", so that only the "directory CID" is provided.
- `/ipfs/<directory CID>/filename` (NOT `/ipfs/<file CID>`, since that will surely not work, since it has never been provided)
Steps taken
Some of these steps might be unnecessary, but I tried to stick to my production use case as close as possible, while trying to make a minimal example. When debugging this, you might be able to make it even more minimal.
Ok, we have now provided just the directory CID onto the IPFS network. We can verify that it can be found by using this diagnostic tool. Indeed, I can consistently confirm that the directory can be found and accessed on my node. Similarly, we can consistently confirm that the CIDs for `1.txt` and `2.txt` cannot be found, which is in line with our expectations.
First bug: `2.txt` not accessible
Now it becomes a bit more iffy. The next steps I can't always fully reproduce, but most of the time I can.
- `/ipfs/<directory CID>/1.txt`. Usually, this works!
- `/ipfs/<directory CID>/2.txt`. Usually, this times out. No matter how long I wait, I can't get it to work. Even when using a different gateway it often still times out! And sometimes I can't even access the directory itself; `/ipfs/<directory CID>`.
Interestingly, on a different IPFS node, `ipfs resolve -r /ipfs/<directory CID>/2.txt` usually does work, and a subsequent `ipfs dag get` on its output also usually does give me the data from `2.txt`.
Second bug: crash during `ipfs bitswap reprovide`
Related discussions
In my investigations so far, I've found a couple of people who have mentioned potentially related issues:
My hypothesis is that once an IPFS node has cached the directory CID locally, it doesn't remember the IPFS node that it originally connected to when looking up `2.txt`. And since the CID for `2.txt` itself is not cached, or provided on IPFS at all, it simply can't find it.
If I'm right about this, I see two possibilities, depending on whether this is the intended behavior:
- `Reprovider.Strategy` "roots" in the first place? Is it only useful if we know that other nodes will always fully recursively fetch all the "children" of a root node? If so, then that should be more clearly documented. It would also be fairly brittle behavior, because what if a node goes down while doing the recursive fetching (which could potentially be many terabytes!)? It would be hard to resume in that case.
I hope this is helpful! Again, happy to provide more details and think through this with you all.