Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pin cmd: stream recursive pins #6493

Merged
merged 6 commits into from
Jul 12, 2019
Merged

Conversation

MichaelMure
Copy link
Contributor

Add a --stream flag to stream the results instead of
accumulating the final result in memory.

This is a rework of #5005

@MichaelMure
Copy link
Contributor Author

Notes:

Add a --stream flag to stream the results instead of
accumulating the final result in memory.

This is a rework of ipfs#5005
Copy link
Member

@Stebalien Stebalien left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will need to correctly handle CID encoding (introduced since #5005).

Also note: we could also just send back a sequence of {cid: type} objects instead using a union (closer to ipfs ls --stream but definitely more expensive in terms of allocations). I don't feel strongly either way.

// Pin ls needs to output two different type depending on if it's streamed or not.
// We use this to bypass the cmds lib refusing to have interface{}
type PinLsOutputWrapper struct {
RefKeyList
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're going to have to use "omitempty" for full backwards compatibility.

@MichaelMure
Copy link
Contributor Author

This will need to correctly handle CID encoding (introduced since #5005).

Any pointer/example about this ? I don't really get what you are talking about.

core/commands/pin.go Show resolved Hide resolved
core/commands/pin.go Show resolved Hide resolved
Copy link
Contributor

@lanzafame lanzafame left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MichaelMure considering that this is a perf-orientated improvement, could this PR get a go benchmark written for it so we can monitor the on going performance of this call, especially since this is a critical path for large deployments of go-ipfs

@MichaelMure
Copy link
Contributor Author

A few improvements:

  • CID are encoded properly
  • better struct naming (they were named after the refs command but are only used by pin/ls now)
  • struct field are properly JSON omitempty-ed for backward compat
  • output struct are documented to make codeclimate happy

@MichaelMure
Copy link
Contributor Author

@lanzafame not sure how I would do a go benchmark for this. This PR doesn't fundamentally change anything perf wise, it just start to output the result as it comes. The total run time will be similar.

@MichaelMure
Copy link
Contributor Author

More improvements:

  • fixed a bug where indirect pins were reported as "all"
  • output the recursive pins first as they don't require walking the graph
  • use EnumerateChildrenAsync instead of EnumerateChildren for faster/parallel walking of the graph
root@ip-10-0-25-166:/tmp# timeout -k 1 59s ./ipfsseq pin ls --stream --type=indirect > indirectseq
12:50:51.021 ERROR       core: failure on stop:  context canceled builder.go:47
Killed
root@ip-10-0-25-166:/tmp# wc -l indirectseq
786 indirectseq
root@ip-10-0-25-166:/tmp# timeout -k 1 59s ./ipfsasync pin ls --stream --type=indirect > indirectasync
12:52:52.668 ERROR       core: failure on stop:  context canceled builder.go:47
Error: canceled
root@ip-10-0-25-166:/tmp# wc -l indirectasync
2645 indirectasync

For indirect pins:
Sequential walking: 13 pins/s
Parallel walking: 44 pins/s

--> ~3.4x faster !

It also looks like this change to use parallel graph walking might be used elsewhere. A quick search show that at least the GC might benefit from that.

@MichaelMure
Copy link
Contributor Author

Test are now failing due to using EnumerateChildrenAsync. If EnumerateChildren is used instead, tests are ok (at least as far as I can tell, I have some unrelated failing on my machine).

@Stebalien
Copy link
Member

is used instead, tests are ok (at least as far as I can tell, I have some unrelated failing on my machine).

We probably need to sort in the tests.

@MichaelMure
Copy link
Contributor Author

FYI, I removed the commit that just change EnumerateChildren to EnumerateChildrenAsync. As you can see, the CI is now okay-ish. If you have insights on the remaining failure and why using EnumerateChildrenAsync could be problematic, I'm interested.

@Stebalien
Copy link
Member

I'm guessing that the traversal order is random, causing the comparison against the expected pin list to fail. We need to sort both before comparing.

@MichaelMure
Copy link
Contributor Author

@Stebalien
Copy link
Member

So... this is a bug. EnumerateChildren enumerates children. EnumerateChildrenAsync also enumerates the root CID. I believe we should make it operate like EnumerateChildren but we're going to have to carefully make sure we don't break something in the process.

@MichaelMure
Copy link
Contributor Author

Hoooo, that explains a lot. One problem I had after adding the recursive pins before the indirect pins to the set and using EnumerateChildrenAsync was that indirect pins would not be listed anymore. It was because the root cid was returned first by EnumerateChildrenAsync, considered as already enumerated and the whole subgraph was pruned.

@Stebalien
Copy link
Member

Fix in #6499. Well, not really a fix, I just renamed these functions so that the names/documentation match up with what they do...

@lanzafame
Copy link
Contributor

@lanzafame not sure how I would do a go benchmark for this. This PR doesn't fundamentally change anything perf wise, it just start to output the result as it comes. The total run time will be similar.

With a benchmark, you can run go test -run none -bencmark . -benchmem and you will get number of allocations etc.

Copy link
Member

@Stebalien Stebalien left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but, as an API change, we'll need a signoff from @alanshaw as well.

@Stebalien Stebalien requested a review from alanshaw July 12, 2019 02:59
@Stebalien
Copy link
Member

@alanshaw this PR adds a --stream (short -s) option to ipfs pin ls that returns of pins as a stream of :

{"Cid": "Qm1", "Type": "direct"}
{"Cid": "Qm2", "Type": "recursive"}
...

Instead of a single:

{
  "Keys": {
    "Qm1": "direct",
    "Qm2": "recursive",
  }
}

This is completely backwards compatible but is likely something js-ipfs will want to implement. Thoughts? Objections?

@alanshaw
Copy link
Member

Yes please. I want this. No objections.

@Stebalien Stebalien merged commit 20e64ae into ipfs:master Jul 12, 2019
@Stebalien
Copy link
Member

🚀!

Thanks @MichaelMure, @lanzafame, and @alanshaw!

@MichaelMure
Copy link
Contributor Author

Ha, I was waiting for #6499 (comment) to get resolved so I could have the neat 3.4x perf improvement with the parallel walking, but ok !

@Stebalien
Copy link
Member

Test case in #6504.

Note: IMO, a benchmark isn't really useful here as this just streams the results, it doesn't reduce overall memory (still need to deduplicate results by storing them in a map).

@Stebalien
Copy link
Member

Ha, I was waiting for #6499 (comment) to get resolved so I could have the neat 3.4x perf improvement with the parallel walking, but ok !

Let's do that in a followup. No reason to block this on a different change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants