Switch to sharding based on estimated directory size #87
Comments
Yes, I agree. I'd much prefer to push this logic into
We can also stop enumerating when we reach the limit. This will be especially important when switching from sharded to non-sharded.
Can we enumerate links only until we reach the maximum, to determine whether we should "switch back"? This will have a performance impact, but it shouldn't be terrible (especially if we memoize) and is only incurred when deleting files. This isn't absolutely critical but it would be nice to figure out how viable it is.
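A rough sketch of that bounded enumeration, assuming a callback-style link walker. The `walker` type, `fitsInBasicDirectory`, and the sentinel error below are illustrative stand-ins, not the real go-unixfs/go-mfs API; the point is only that the walk aborts as soon as the running size estimate reaches the threshold, so the "switch back" decision never touches more links than necessary.

```go
// Illustrative only: a bounded enumeration that stops once the size estimate
// reaches the threshold. None of these names come from go-unixfs/go-mfs.
package main

import (
	"errors"
	"fmt"
)

// errThresholdReached is a sentinel used to abort the walk early.
var errThresholdReached = errors.New("size threshold reached")

// walker visits each directory entry (name + CID bytes) until the callback
// returns an error, mimicking a HAMT link-iteration helper.
type walker func(cb func(name string, cidBytes []byte) error) error

// fitsInBasicDirectory reports whether the estimated size of all entries stays
// below threshold, cutting the enumeration short as soon as it cannot.
func fitsInBasicDirectory(walk walker, threshold int) (bool, error) {
	estimate := 0
	err := walk(func(name string, cidBytes []byte) error {
		estimate += len(name) + len(cidBytes)
		if estimate >= threshold {
			return errThresholdReached // stop: the rest of the shard is irrelevant
		}
		return nil
	})
	if errors.Is(err, errThresholdReached) {
		return false, nil
	}
	return err == nil, err
}

func main() {
	// Fake directory with three small entries (34 bytes stands in for a CID).
	entries := map[string][]byte{
		"a.txt": make([]byte, 34),
		"b.txt": make([]byte, 34),
		"c.txt": make([]byte, 34),
	}
	walk := walker(func(cb func(string, []byte) error) error {
		for name, c := range entries {
			if err := cb(name, c); err != nil {
				return err
			}
		}
		return nil
	})

	ok, err := fitsInBasicDirectory(walk, 1024)
	fmt.Println(ok, err) // true <nil>: small enough to switch back
}
```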
The goal here is to replace this flag with something that "just works". I wouldn't try to maintain both in tandem. (@aschmahmann?)
Why is this a huge performance issue? In practice, I expect starting with a non-sharded directory and sharding late will actually have better performance:
I'm fine dropping it, just not in this issue, to maintain as much backward compatibility as possible and make the fewest possible changes in
Maybe it's not, just wanted to flag current behavior, don't really care about performance in this issue.
I'm assuming that keeping it will be more work than dropping it but I'm not entirely sure how you're planning on going about it.
I don't think it's worth keeping the current behavior where it's possible for someone to create a directory block of size >1MiB. In theory we could have some behavior where unless
Yes, I'm not doing that here. Feel free to submit another issue and discuss potential solutions for that after this one lands.
FWIW, the MFS implementation for js-IPFS does transition back to a basic directory if the sharding threshold is crossed, though it's a bit simpler as it just uses an arbitrary directory entry count as the threshold value.
The background and motivation for this is in ipfs/kubo#8106, but this is a self-contained issue.
Add an option similar to `UseHAMTSharding` that switches from basic to HAMT directory based on an approximated directory size. Proposed option's name (just for the sake of this issue description; feel free to suggest any other): `HAMTShardingSize`.

Directory size estimation: aggregate byte length of all of the `BasicDirectory.ProtoNode`'s `Link`s (namely their name and CID). This is only an estimation because we don't marshal/encode the underlying `ProtoNode` to get the exact block size (which is the motivation for the sharding in the first place), but it is close enough given the `BasicDirectory` doesn't use the `ProtoNode`'s data field.
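To make the arithmetic concrete, here is a minimal sketch of that estimation in Go, assuming the entries are available as go-ipld-format `Link`s. The `HAMTShardingSize` value and the `estimatedSize` helper are hypothetical, just to show the calculation, not the option as it would actually be wired in:

```go
package main

import (
	"fmt"

	"github.com/ipfs/go-cid"
	format "github.com/ipfs/go-ipld-format"
	"github.com/multiformats/go-multihash"
)

// Hypothetical threshold from this issue, in bytes (value made up for the demo).
const HAMTShardingSize = 256 * 1024

// estimatedSize approximates the encoded size of a BasicDirectory by summing
// the byte length of every link's name and CID; the ProtoNode is never
// marshaled.
func estimatedSize(links []*format.Link) int {
	total := 0
	for _, l := range links {
		total += len(l.Name) + len(l.Cid.Bytes())
	}
	return total
}

func main() {
	// Build a dummy CIDv0 so the example is self-contained.
	mh, _ := multihash.Sum([]byte("example"), multihash.SHA2_256, -1)
	c := cid.NewCidV0(mh)

	links := []*format.Link{
		{Name: "file-1.txt", Cid: c},
		{Name: "file-2.txt", Cid: c},
	}

	size := estimatedSize(links)
	fmt.Printf("estimated %d bytes, shard: %v\n", size, size > HAMTShardingSize)
}
```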
This option will work in tandem with the global `UseHAMTSharding`; either of the two can trigger the HAMT transition. Any plans for the deprecation of `UseHAMTSharding` are outside of the scope of this issue.

Known drawbacks (inherited from current design) mentioned here just to make sure stakeholders are in sync:
- A `HAMTDirectory` is always a `HAMTDirectory`. There won't be any system of high and low watermarks: once the estimated directory size grows above `HAMTShardingSize` we switch and that is it.
- The `UseHAMTSharding` option is global: it applies to all directories, not just directory D.

The switch from basic to HAMT directory logic lives here in the MFS repo. This should actually live in UnixFS: MFS shouldn't need to know what type of directory it is manipulating, it only needs the `Directory` interface to mount its mutable FS (the sole objective of this layer). This is clearly evidenced by the fact that the `UseHAMTSharding` option itself is a UnixFS option (that `go-ipfs` sets directly). If we can fix this in #86 before proceeding here, we will implement the logic described here in UnixFS instead; otherwise `HAMTShardingSize` will be added to the MFS layer alongside the global option in `addUnixFSChild`.
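If the logic does end up on the MFS side, the check could look roughly like the sketch below. Every type and method here is a hypothetical stand-in (this is not the real `addUnixFSChild` or the go-mfs directory wrapper); it only illustrates keeping a running size estimate and letting either the global flag or the size threshold trigger the switch:

```go
package main

import "fmt"

var (
	UseHAMTSharding  = false      // existing global toggle
	HAMTShardingSize = 256 * 1024 // proposed size threshold, in bytes
)

type entry struct {
	Name string
	Cid  []byte
}

// dir is a stand-in for the MFS directory wrapper around a UnixFS directory.
type dir struct {
	sharded bool
	entries []entry
	estSize int // running estimate: sum of link name + CID byte lengths
}

// addUnixFSChild mimics where the check would live: add the entry, update the
// estimate, and transition to a HAMT directory when either trigger fires.
func (d *dir) addUnixFSChild(e entry) {
	d.entries = append(d.entries, e)
	d.estSize += len(e.Name) + len(e.Cid)

	if !d.sharded && (UseHAMTSharding || d.estSize > HAMTShardingSize) {
		d.switchToSharding()
	}
}

func (d *dir) switchToSharding() {
	// In real code this would re-create the directory as a HAMTDirectory and
	// re-add the existing entries; here we only flip a flag.
	d.sharded = true
}

func main() {
	d := &dir{}
	for i := 0; i < 10000; i++ {
		d.addUnixFSChild(entry{Name: fmt.Sprintf("file-%d", i), Cid: make([]byte, 34)})
	}
	fmt.Println("sharded:", d.sharded, "estimated size:", d.estSize)
}
```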