Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fast (parallel) Traversal For A DAG That Stops At Arbitrary Points #5487

Open
hannahhoward opened this issue Sep 18, 2018 · 7 comments
Open
Labels
kind/enhancement A net-new feature or improvement to an existing feature status/in-progress In progress

Comments

@hannahhoward
Copy link
Contributor

hannahhoward commented Sep 18, 2018

Summary

Given a root CID, I would like to traverse that node's DAG quickly (making requests for children in parallel), and stop traversal of links on a per node basis using logic I provide.

Use Case

The time to list large sharded directories is prohibitive because link traversal in the DAG is serialized. We need to speed this up. (#4908 )

Requirements / Acceptance Criteria

Requirements:

  • Minimize memory footprint
  • Maximize requests made in parallel, enumerate the DAG as quickly as possible
  • Leave what to do with each node up to caller (VisitNode)

First Steps

  • Use EnumeratChildrenAsync from go-merkeldag in HAMT link enumberation (ForEachLink) and test performance

Additional Optimizations

  • Get nodes in parallel (GetMany)
  • Additional operations to increase speed

Not Included

  • Enumeration in any consistent order
@hannahhoward hannahhoward changed the title "enhancement": Fast (parallel) Traversal For A DAG and That Stops At an Arbitrary Points "enhancement": Fast (parallel) Traversal For A DAG and That Stops At Arbitrary Points Sep 18, 2018
@hannahhoward hannahhoward changed the title "enhancement": Fast (parallel) Traversal For A DAG and That Stops At Arbitrary Points "enhancement": Fast (parallel) Traversal For A DAG That Stops At Arbitrary Points Sep 18, 2018
@eingenito eingenito changed the title "enhancement": Fast (parallel) Traversal For A DAG That Stops At Arbitrary Points Fast (parallel) Traversal For A DAG That Stops At Arbitrary Points Sep 18, 2018
@eingenito eingenito added the kind/enhancement A net-new feature or improvement to an existing feature label Sep 18, 2018
@schomatis
Copy link
Contributor

Given a root CID, I would like to traverse that node's DAG quickly (making requests for children in parallel), and stop traversal of links on a per node basis using logic I provide.

  • Leave what to do with each node up to caller (VisitNode)

Get nodes in parallel (GetMany)

Hey @hannahhoward, I'm currently working on a PR (ipfs/go-ipld-format#39) that has many similarities with the requirements you're describing but it still has a long way to go to meet all of them. It's basically a structure to walk DAGs providing a Visitor function to let the user lay some logic on top of it, it was born out of some code extracted from the UnixFS DAG reader.

If you have the time it would be great if you could take a look at it and tell me what you think it. I'm always interested in finding use cases besides the original motivation for it that was decoupling some of the DAG reader logic (https://github.com/ipfs/go-unixfs/pull/12/files), since I would like to arrive at an API that may be useful in more cases than just that one.

@schomatis
Copy link
Contributor

Also, welcomed to the team! 🎉

@hannahhoward hannahhoward added the status/in-progress In progress label Sep 20, 2018
@hannahhoward
Copy link
Contributor Author

@schomatis thank you for the pointer -- I need to review it more in full and will give more feedback shortly. Have you also seen the code here: https://github.com/ipfs/go-merkledag/blob/master/merkledag.go#L363

This code specifically deals with (abstract? - can't tell) DAG traversal but seems to have good parallelism -- though I think without the level of control over how traversal is done in your code. I'm trying to figure out what needs to be used where.

@schomatis
Copy link
Contributor

Yes, EnumerateChildrenAsyncDepth has the distinct advantage that it doesn't traverse in order, but that could be added to the DAG walker, actually I wanted to add something like this to replace the way gx traverses a dependency graph to download the packages, since Gx dependencies are basically a DAG of packages (that could be abstracted as NavigableNode).

Anyway, there are many parts in the code base where similar DAG traversing logic is taking place and the idea would be to abstract them in a single API (where reasonable) to simplify the code.

@hannahhoward
Copy link
Contributor Author

Yea I need to leave a comment on your PR -- cause I've review the code and it's really awesome. @eingenito + @michaelavila think so too. For the in-order case, and also for general tree walking it's great.

The big challenge here is we're just trying to get nodes as fast as possible in the use case of LS -- which requires making requests for a nodes links as soon as we have them and in parallel as much as possible -- effectively we want to be walking the tree in multiple directions in an effort to discover as many blocks to fetch as we can as quickly as possible. And then if order becomes a requirement, deal with that through sorting later. Anyway, I will put a comment on your PR and I think there is still potential to incorporate it for this use case in the future.

@schomatis
Copy link
Contributor

Yes, agreed, actually @hsanjuan had already proposed something like this in #5257 (comment), I'll open an issue about it.

@Stebalien
Copy link
Member

@hannahhoward this is moot now, right?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/enhancement A net-new feature or improvement to an existing feature status/in-progress In progress
Projects
None yet
Development

No branches or pull requests

4 participants