Pinning is slow when there are many pins #5221
Comments
Also sounds like another use case for an embedded graph database |
Not really. We have MFS, we can just use that. The current blockers are:
Unfortunately, this'll only get worse as we hack in new pin types for cluster. We need some way to specify (in unixfs) how a file/directory should be pinned (where pin policies higher up the directory tree take precedence). |
The current thought here is to introduce an intermediate fix that stores pins in go-ipld-hamt. Blockers:
|
Maybe we could just use a read-only or archive flag for pinned blocks in the underlying file system? |
Unfortunately, it's not quite that simple. Pinning happens at a higher layer and not all of our datastores store one file per block. |
This is also causing an issue with monitoring over at netdata/netdata#3156 as it makes a lot of 'ipfs pin ls' calls. |
Is the pinset object stored and read/written on disk when operations are performed? If so wouldn't it be possible to load the object into memory and read/write to there to get high performance IO with memory access? You could copy the object to disk as a backup, but you wouldn't incur expensive read operations as you are reading from the in-memory object. This would serve as a reasonable intermediate fix until the pinning system at large is reworked. |
Reading is fast, we store the pinset in memory. The slow part is flushing to disk. |
Why would an 'ipfs pin ls' flush to disk? I think there must be something else going on, if the netdata guys are seeing an inordinate load due to an 'ipfs pin ls' being sent once every 5s or so. |
You can list pins you added by running |
That still doesn't answer why the guys over on netdata/netdata#3156 are seeing massive IPFS resource usage when 1) they have a large repo (several thousand objects) and 2) they turn on monitoring (which does an 'ipfs pin ls' every few seconds). Is there instrumentation they could turn on? |
It's listing every single object (block) that has been pinned. It's consuming a ton of RAM because we, unfortunately, create a list of pins in memory before returning them to the client. We should fix this (the second part), but doing so will be a breaking API change, so we'll have to be careful. |
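To illustrate the difference between buffering the whole listing and streaming it, here is a minimal Python sketch. It is not go-ipfs code: the toy DAG, `list_pins_buffered`, and `stream_pins` are hypothetical names made up for this example, and it only assumes that pinned blocks form a graph reachable from the pinned roots.

```python
from typing import Dict, Iterator, List, Set

# Hypothetical toy DAG: each key is a block ID, each value lists its child blocks.
Dag = Dict[str, List[str]]


def stream_pins(dag: Dag, roots: List[str]) -> Iterator[str]:
    """Yield each block reachable from the pinned roots as soon as it is found.

    The `seen` set still grows with the pinset, but the caller can start
    consuming results immediately and no full result list is built.
    """
    seen: Set[str] = set()
    stack = list(reversed(roots))
    while stack:
        block = stack.pop()
        if block in seen:
            continue
        seen.add(block)
        yield block
        # Push children in reverse so they are visited in their listed order.
        stack.extend(reversed(dag.get(block, [])))


def list_pins_buffered(dag: Dag, roots: List[str]) -> List[str]:
    """Collect the complete listing in memory before returning anything."""
    return list(stream_pins(dag, roots))


if __name__ == "__main__":
    dag = {"root": ["a", "b"], "a": ["c"], "b": ["c"]}
    for block in stream_pins(dag, ["root"]):
        print(block)
```

With the buffered variant the client sees nothing until the whole traversal has finished; with the generator it can process (or stop reading) pins one at a time.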
It's also probably garbage collecting a bunch (we're working on some fixes to CIDs that'll make them allocate less but that's still in progress). |
@pjz So on my own nodes, to avoid having to constantly poll IPFS and incur slow performance from examining the pinset, I maintain a database which contains an exact copy of the pins my IPFS nodes currently have. Any updates that would affect the pinset must also update the database. By doing this, I avoid having to contact my IPFS node and perform performance-impacting operations like Yes while this isn't desirable it has been working very well but has a couple of considerations, namely that all operations which affect the pinset must also update the DB. Don't forget, IPFS is still very new, so sometimes you have to make small accommodations until such issues are resolved. |
I should mention that in working with pins, I've also come to the pattern of maintaining my own cache, to avoid delay on large nodes, even when only listing recursive pins. In my specific case, I'm interested both in the listing being more performant and in having some means of notification from the node, such as an event that I can subscribe to, which signals when the pinset has changed. For context, I'm dealing with |
Yes, absolutely, it's made my node perform significantly better. I've begun moving to a model where the only time I need to talk to my IPFS node to list anything is for crucial operations. Otherwise, everything that isn't a write operation should read from my cache/database. |
While those are great workarounds, they're not really feasible for a general monitoring solution. I guess they'll just have to wait until the IPFS server gets it together. I think it's clear that whatever data structure it's using needs to be re-evaluated or supplemented so that this kind of monitoring/usage doesn't cause it to eat itself. |
So, adding pins should be faster. But listing every single object that has been pinned (directly or indirectly by some recursive pin) in your datastore will always be somewhat slower. |
If everyone's solution is to maintain a parallel cache of what pins exist... why not have IPFS do that internally instead? Keep a cache that's invalidated on add/remove of pins, but otherwise is untouched. Then repeated calls to 'ipfs pin ls' would be trivial. Maybe make 'ipfs pin verify' also serve as a way to manually invalidate the cache/force a rebuild of it. |
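A rough Python sketch of that idea, under stated assumptions: `CachedPinLister` and its methods are hypothetical names, not part of IPFS, and `compute_listing` stands in for whatever expensive traversal produces the `ipfs pin ls` output.

```python
from typing import Callable, List, Optional, Set


class CachedPinLister:
    """Cache the pin listing and drop it whenever the pinset changes."""

    def __init__(self, compute_listing: Callable[[Set[str]], List[str]]):
        self._pins: Set[str] = set()
        self._compute = compute_listing       # the expensive listing step
        self._cache: Optional[List[str]] = None
        self.rebuilds = 0                     # how often the listing was recomputed

    def add(self, cid: str) -> None:
        if cid not in self._pins:
            self._pins.add(cid)
            self._cache = None                # pinset changed: invalidate

    def rm(self, cid: str) -> None:
        if cid in self._pins:
            self._pins.remove(cid)
            self._cache = None                # pinset changed: invalidate

    def invalidate(self) -> None:
        """Manual invalidation, e.g. what a verify-style command could trigger."""
        self._cache = None

    def ls(self) -> List[str]:
        if self._cache is None:
            self._cache = self._compute(self._pins)
            self.rebuilds += 1
        return list(self._cache)              # copy so callers cannot mutate the cache


if __name__ == "__main__":
    lister = CachedPinLister(sorted)
    lister.add("QmExampleB")
    lister.add("QmExampleA")
    print(lister.ls(), lister.ls(), lister.rebuilds)
```

Repeated `ls()` calls between pin changes reuse the cached result, so a monitor polling every few seconds would only pay for a rebuild after an add or remove.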
@pjz However, this only seems useful to implement if there's more than 1 event (more than just "pins have changed"). Any opinions on this? |
What you describe sounds somewhat like a way to tap into the logging system. |
@Stebalien Has there been much progress / prioritization on this front? As we continue to scale, this becomes increasingly relevant. |
No progress. |
We're also facing this issue with around 2 million hashes and around 400-500k pins. Can we support you in any way? We're currently working around this with multiple IPFS instances. |
@dirkmc you were looking into this for js-ipfs. Are you still planning on applying that same optimization to go-ipfs? |
@Stebalien I'm currently doing some research to understand where the performance bottlenecks are with adding large numbers of files to go-ipfs, which will likely include performance analysis for pinning. Before making any pinning optimizations, we'll likely want to decide if it makes sense for pins to be stored in the blockstore, which is a bigger conversation. |
I will soon need to pin a million+ pins, I'm hoping this can be improved |
This was actually fixed in go-ipfs 0.8.0, we just never closed the issue (see https://github.com/ipfs/go-ipfs/blob/master/CHANGELOG.md#-faster-local-pinning-and-unpinning). The number of pins you have should no longer matter when adding new pins. |
Awesome, thanks! Love your work. I'll report any issues with scalability later if we run into them. |
We store all pins in a single massive object so adding and removing pins is really slow when we have many pins.
This affects:
ipfs dag add --pin=true
ipfs add
ipfs pin add
ipfs pin rm
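As a rough illustration of why a single pinset object scales badly, the Python sketch below counts how many entries have to be rewritten per added pin. It is a toy model, not the go-ipfs pinner: both store classes are hypothetical, and the flat hash-bucket sharding is only a stand-in for a sharded structure such as a HAMT.

```python
import hashlib
from typing import List, Set


class SingleObjectPinStore:
    """Toy model: every change re-serializes the whole pinset."""

    def __init__(self) -> None:
        self.pins: Set[str] = set()
        self.entries_written = 0

    def add(self, cid: str) -> None:
        self.pins.add(cid)
        # The entire object is written again, so cost grows with the pinset.
        self.entries_written += len(self.pins)


class ShardedPinStore:
    """Toy model: only the shard that holds the pin is rewritten."""

    def __init__(self, shard_count: int = 256) -> None:
        self.shards: List[Set[str]] = [set() for _ in range(shard_count)]
        self.entries_written = 0

    def _shard_for(self, cid: str) -> Set[str]:
        digest = hashlib.sha256(cid.encode()).digest()
        index = int.from_bytes(digest[:4], "big") % len(self.shards)
        return self.shards[index]

    def add(self, cid: str) -> None:
        shard = self._shard_for(cid)
        shard.add(cid)
        # Only one shard is written, so cost grows with the shard size.
        self.entries_written += len(shard)


if __name__ == "__main__":
    single, sharded = SingleObjectPinStore(), ShardedPinStore()
    for i in range(1000):
        single.add(f"cid-{i}")
        sharded.add(f"cid-{i}")
    print(single.entries_written, sharded.entries_written)
```

In this model the single-object store rewrites 1 + 2 + ... + n entries for n pins, which is quadratic overall, while the sharded store's cost per add depends only on the size of one shard.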
Listing pins also appears to be slow but for a different reason:
ipfs pin ls buffers pins in memory before sending them back to the user (see "pin ls should stream the result" #6304).
ipfs pin ls lists all pinned blocks, directly or indirectly, by default. Calling ipfs pin ls --type=recursive is much faster.
Proposed solutions:
I'm filing this issue so we can have a single issue that succinctly describes the entire issue and all variants.