-
-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] unixfs: decouple the DAG traversal logic from the DAG reader #5257
Conversation
just wanted to say that this is really cool |
Decouple the DAG traversal logic from the `PBDagReader` encapsulating it in the (new) `Traversal` structure. Methods like `Read` and `Seek` now have the remaining non-traversal logic inside the `VisitHandler` (from the `Traversal` structure) where the file DAG logic dictates how to interpret each node visited in the `Traversal`'s `Iterate` and `Search` algorithms. TODO: Expand on the following, * Remove `ErrUnexpectedEOF`. * Move `precalcNextBuf` to the `Traversal` structure encapsulating it in the `fetchChild` method. * Deprecate the `ReadSeekCloser` interface that implicitly chained the node/readers into a DAG path in favor of the more explicit `path` slice in `Traversal` . * The iterative nature of `Traversal` and how it is not crucial to this change (the `Iterate` and `Search` could have been implemented with recursion). * Collapse PB and Buffer DAG readers into one, they now can both be handled through `Traversal`. License: MIT Signed-off-by: Lucas Molas <schomatis@gmail.com>
@Mr0grog Could you take a look at this please? It's not finished, but I would like to get your opinion on how understandable is this design, any kind of feedback will be very useful. You don't need to proof read or make any corrections to the documentation at this point (I will bother you with that later), it's just to get your take on how this can fit the mental model described in the documentation of the balanced builder. From your previous knowledge of the |
@schomatis I don’t know if I’ll be able to get to this before Wednesday or Thursday. Is that ok? |
// | ||
// TODO: Revisit the name. `Traverser`? (is "traverser" the correct noun?), | ||
// `Iterator` (this would put to much emphasis on iterate whereas other | ||
// traversal operations like search are supported), `Topology`? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Walk
and WalkFunc
are the terms chosen in some comparable standard library types, fwiw: https://golang.org/pkg/path/filepath/#Walk
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Thanks @warpfork, I'm not a native English speaker and there's something in the term "traversal" that just doesn't sound right to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, “traverse/traversal/traverser” is a technically correct English term for this that I have seen in many libraries (it even has a Wikipedia page). That said, I think I’ve seen “walk/walking/walker” used just as often and it’s shorter :)
@Mr0grog No rush at all, this is not a priority. |
dagutils/traversal.go
Outdated
ipld "gx/ipfs/QmZtNq8dArGfnpCZfx2pUNY7UcjGhVp5qqwQ4hH6mpTMRQ/go-ipld-format" | ||
) | ||
|
||
// Traversal provides methods to move through a DAG of IPLD nodes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this whole dagutils.Traversal
system might make sense to live in a package inside the go-ipld-format
repo? It's already so nicely separated from any other imports like unixfs...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, rereading #5304 (comment) I see that why's mentioning to actually move these type of packages there instead of go-merkledag
.
License: MIT Signed-off-by: Lucas Molas <schomatis@gmail.com>
After initial review this PR will need to be split in two: |
License: MIT Signed-off-by: Lucas Molas <schomatis@gmail.com>
@Mr0grog Just a reminder ping (but again, no hurry). |
Ah! Sorry, was unclear on the status after the comment about splitting it up. Looking now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! The main separation here (tools to traverse a DAG vs. file-like operations on the DAG) made a lot of sense to me right away. 😍
Reading through the Traversal
code, though, my first thought here was that it seems odd to assign the VisitHandler to the struct. Why not pass it as an argument to Iterate()
?
err := traversal.Iterate(func(node ipld.Node) error {
// visitor logic
})
That feels like the right place for it to me because the logic you are executing is inherently tied to this call to Iterate()
or Search()
. I may be too used to languages where you pass callbacks or continuations, though :P
I also feel like the use of Iterate()
and Search()
could be streamlined or even unified. My first assumption when I saw the two was that one might leverage the other, since I assumed their behavior ought to overlap a lot. What if the direction and progress of the traversal was controlled by the types of errors the visitor returns? (I’ve definitely seen this pattern before, and Go’s filepath.Walk(path, walkFunc)
builtin that @warpfork mentioned does this, too.)
Reading through the DAG would be pretty similar to what you had:
err := traversal.Iterate(func(node ipld.Node) error {
if len(node.Links()) == 0 {
err = dr.saveNodeData(node)
if err != nil {
return err
}
n += dr.readDataBuffer(out[n:])
if n == len(out) {
// Return an error that causes the traversal to pause
return dagutils.PauseIteration
// or maybe
// return traversal.Pause
}
}
return nil
})
…but then seeking would be more streamlined, and doesn’t require a separate Search()
function:
err = traversal.Iterate(func(node ipld.Node) error {
if len(node.Links()) > 0 {
// [Clipped: extract `fsNode` and make sure it has hints]
i := 0
for i; i < len(node.Links()); i++ {
childSize := fsNode.BlockSize(i)
if childSize < uint64(left) {
left -= int64(childSize)
} else {
break
}
}
// Return an error that causes the traversal to skip the next N nodes
return dagutils.SkipNext(i)
} else {
// [Clipped: Buffer the data from the node at the seeked position]
// Return an error that causes the traversal to pause
return dagutils.PauseIteration
// or maybe
// return traversal.Pause
}
})
Ultimately, you’re really moving a lot of the error handling and boiler plate from the user-supplied visitor function into the Iterate()
function itself. I’m not sure whether it makes more sense for the error types/functions to live on the package (in dagutils
) or directly on the struct (so you might call traversal.SkipNext(n)
). I definitely don’t know what’s more common in Go.
// Flag to stop the current iteration, intended to be set by the user | ||
// inside `VisitHandler` to stop the current traversal operation. | ||
Stop bool | ||
// TODO: Use a function? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 on this TODO. Or the ideas I outlined in the review summary.
@@ -19,6 +22,10 @@ var ( | |||
ErrUnkownNodeType = errors.New("unknown node type") | |||
) | |||
|
|||
// TODO: Rename the `DagReader` interface, this doesn't read *any* DAG, just | |||
// DAGs with UnixFS node (and it *belongs* to the `unixfs` package). Some | |||
// alternatives: `FileReader`, `UnixFSFileReader`, `UnixFSReader`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 on this. I like UnixFSReader
. (I dislike FileReader
because I wouldn’t be sure if it was reading bytes of a file from disk or reading through a UnixFS file’s DAG.)
// `up`, `Right`) that modify its `path`. The `down` move method is the | ||
// analogous of the recursive call and the one in charge of visiting | ||
// the node (through the `VisitHandler`) and performing some user-defined | ||
// logic. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right()
being the only one of these that is public feels a bit odd to me, but I get why it’s here. See my summary comment for some alternative thoughts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, agree. Maybe some other public method like SkipNode
should be exported (that would internally call an un-exported right
), but I feel that the graphic "turn right" movement is something that the user could grasp rather easily and it could relate more to that than a Skip
function. Not really sure though, I usually lean towards abstracting but this felt like a natural "take look under the hood, you see? there's not much complex logic down there".
// by the current `level` to avoid including much indexing like | ||
// `dt.path[dt.level]` in the code. Maybe another more strong refactoring | ||
// that would allow to think of the current node in the `path` without | ||
// concerning at what `level` it's in. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, there were definitely some points in the code where all the indexing got a bit muddy.
// this format. | ||
childIndex []uint | ||
// TODO: Revisit name, `childPosition`? Do not use the *link* term, put | ||
// emphasis in the child (the *what*, not the *how* we get it). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this name made plenty of sense for me. The only thing clearer that comes to mind is currentChildIndex
.
// `Iterate()` and `Search`). This value should never be returned to after | ||
// the first `down` movement, moving up from the root should always return | ||
// `ErrUpOnRoot`. | ||
level int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not call this depth
? It seems like that would be much more straightforward, especially since your comments define it in terms of “depth” ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you're right. I avoided the term "depth" at first because I considered it an absolute property, "the depth of the entire DAG", and didn't want to give the sense that this property was being modified here. I could use the name currentDepth
or positionDepth
, or yes, just depth
alone and indicate this is related more to the current position in the DAG than the DAG itself.
// | ||
// Any of the exported methods of this API should be allowed to be called | ||
// from within this method. | ||
VisitHandler func(node ipld.Node) error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might call this a Visitor
. In fact, I think that’s generally the common term I’ve seen for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want to re-use this across the go-ipfs codebase the Visit function should:
- Receive the
level
of the visited node - Receive the parent(s) of the visited node
We probably want to provide a couple default Visit functions to wrap some commonly used logic:
- One implementing full recursive traversal of the dag, pruning already visited branches
- One implementing depth limited traversal, pruning already limited branches (if they are not going to be explored deeper).
var ErrDownNoChild = errors.New("can't go down, no child available") | ||
|
||
// ErrUpOnRoot signals the end of the DAG after returning to the root. | ||
var ErrUpOnRoot = errors.New("can't go up, already on root") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This error is a pretty important public signal that calling functions will see, but I feel like UpOnRoot
doesn’t quite explain what this is really about (that you have traversed through the entire graph). Maybe EndOfDag
or EndOfGraph
or TraversalComplete
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but I feel like UpOnRoot doesn’t quite explain what this is really about (that you have traversed through the entire graph).
That is actually the case only for the Iterate()
context, in future traversal operations that error may be interpreted differently, but you're right that Iterate()
should wrap this error (which seems more like an internal error) and return something more meaningful like EndOfDag
or IterateComplete
.
var ErrRightNoChild = errors.New("can't move right, no more child nodes in this parent") | ||
|
||
// errDownStopIteration signals the stop of the traversal operation. | ||
var errDownStopIteration = errors.New("stop") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if Pause
would be better than Stop
here — in my mind, stopping means you can’t continue from where you left off. With this code, though, you can (and that’s important, because that’s how seeking works).
|
||
// NewDagReaderWithSize constructs a new `dagReader` with the file `size` | ||
// specified. | ||
func NewDagReaderWithSize(ctx context.Context, n ipld.Node, serv ipld.NodeGetter, size uint64) (DagReader, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was a little unclear to me at first that this was the precalculated size of the whole file/DAG — I thought it might be the maximum amount of data to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this was an ugly patch I had to introduce to preserve the Size()
API, I'll try to document it more clearly.
@Mr0grog Another great review as usual, thanks. I'll be bothering you with some follow-up questions in the coming days. |
I’ll bet this will be really useful for simplifying the traversal logic in the new depth-limited |
@Mr0grog Regarding the You can have Even if internally they are implemented with very similar (almost the same code, actually) my main question would be: what is the advantage, from the user's perspective, to unify them and present them only as |
level: -1, | ||
// Starting position, "on top" of the root node, see `level`. | ||
|
||
ctx: ctx, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should avoid storing the context in the struct, it's generally considered a bad practice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This functionality was inherited from the PBDagReader
which stored the context to later be able to call ipld.GetNodes
. How would you advise this be handled?
|
||
// The CID of each child of each node in the `path` (indexed by | ||
// `level` and `childIndex`). | ||
childCIDs [][]*cid.Cid |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's gonna be a lot of memory, as is the promise set below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is also inherited from the PBDagReader
(I'm glad this refactoring exposed some of the already existing characteristics of the current implementation) which has the
// NodePromises for each of 'nodes' child links
promises []*ipld.NodePromise
// the cid of each child of the current node
links []*cid.Cid
slices, each of those is instantiated in every level of the DAG when the recursive algorithm creates a new PBDagReader
.
lot of memory
Have in mind that level
relates to the depth of the DAG (I should change its name as suggested above), so the size of this matrix grows at an O(log n)
rate with the size of the DAG.
How would you advise this be handled?
It's not so much that the logic is complex (it's way shorter than this), but that there are multiple places in the code base making use of very similar logic, with very small differences. It would be good if this could replace one or several of those instances, rather than becoming another way of doing the same. On another note, it seems that this cannot do "async" (or multithreading) DAG fetching, which would be very good to have. Without it, it probably cannot replace the logic used for fetching blocks when pinning. |
@hsanjuan Thanks for the feedback. I'm going to leave the async feature for a second iteration though (I'll open an issue once I get this merged in |
@schomatis yeah, that's the most sensible approach. Let's iterate on it later. |
Two reasons:
I always think of internationalization and visual layout functions when it comes to stuff like this. Lots of people have written code focused on aligning things to the “left” or “right” only to realize later when they internationalized for right-to-left languages that their life would have been much simpler if they’d aligned to the “start” and “end” instead. So I usually try to avoid APIs that focus on how I visualize a model and opt for ones that talk more about the properties of the model itself.
I like that sentiment a lot. The reason I tend towards returning a command for the iterator to execute here is because I think it’s generally safer to keep control in one place. When the iterator function and the visitor are both messing with the Side note: are you planning to shift towards “walk” terminology? I stuck with |
Sorry, I think I didn't express myself correctly before, I'm definitely not saying that iterate/search cover every use case (the did cover the necessity of the DAG reader that was my test case used to iterate this idea), my motivation to have them both is to avoid passing that burden to the user, that is, instead of forcing every user-defined The same logic applies for future operations that may be needed: if many parts of the code need (for example) an iterator that goes up and down twice this structure should provide a different operation, e.g.,
Exactly, but if those function do something conceptually different than iterate I would want to analyze if I need to be providing a different traversal operation here (that was the motivation to add I'm interpreting @hsanjuan comment in this same sense (but I may be missing something),
to provide different operations (like
Yes, I'll switch it once I move this to the other repos. |
I was trying address this in my first point. Here’s seeking with my version of err = traversal.Iterate(func(node ipld.Node) error {
if len(node.Links()) > 0 {
// Figure out which child to skip to
// Return an error that causes the traversal to skip the next N nodes
return dagutils.SkipNext(i)
} else {
// Handle the found node
// Return an error that causes the traversal to pause
return dagutils.PauseIteration()
}
}) …and here it is with err = traversal.Search(func(node ipld.Node) error {
if len(node.Links()) > 0 {
// Figure out which child to skip to
// Return an error that causes the traversal to skip the next N nodes
return dagutils.SkipNext(i)
} else {
// Handle the found node
return nil
}
}) Which just didn’t feel different enough to be worth making a special case for. But you’re right; if we changed the API to // This returns the found node, which is slightly different than what we had before
func (dt *Traversal) Search(visitor Visitor) (ipld.Node, error) {
var foundNode ipld.Node
err := dt.Iterate(func(node ipld.Node) error {
if len(node.Links()) == 0 {
foundNode = node
return dagutils.PauseIteration()
}
return visitor(node)
})
return foundNode, err
}
// Different contract for search simplifies things, sort of (more error handling,
// but might be conceptually clearer)
foundNode, err = traversal.Search(func(node ipld.Node) error {
// Figure out which child to skip to
// Return an error that causes the traversal to skip the next N nodes
return dagutils.SkipNext(i)
})
if err != nil {
// handle error
} else {
// handle the found node
} (Side note: I think I would rename this, though — I would expect
Totally agree!
I was assuming this specific case isn’t that common. Since there’s so little extra to add here, it didn’t seem worthwhile if only using it once or twice. But otherwise: HELL YES ✅ |
Thanks @Mr0grog and everyone for all the feedback!! (I like the I'll be closing this now and start moving it to the corresponding repos, incrementally incorporating these changes and bothering each feedbacker (not sure if that word exists :) to see if the new code correctly reflects your input. |
I don't think Why is going to have time to review his comments (1, 2) so I'm bothering you @Stebalien just to check if you'd like to have any modifications there. |
In the mean time, maybe worthwhile to mark those with |
That's a good idea but I would like to postpone it at the moment, the only place to do some actual optimizations is in the logic around node promises which at the moment I'm not sure if I have implemented them correctly so I don't want to waste anybody's time optimizing something that may be changed in the near future (note to self: bother @Stebalien to take a quick look at 694c9a2 to see if it makes sense). |
I think this is super cool and I agree that replicating dag traversal logic in a million places is awkward and it'd be nice to have a generalized version. My comments generally match the previous comments:
All things being equal though, pending @Stebalien 's comments I think this is good to merge for now. |
Update: This PR (now closed) has been split in:
Add a DAG walker with support for IPLD
Node
s go-ipld-format#39decouple the DAG traversal logic from the DAG reader go-unixfs#12
Do not merge.
TODO:
Traversal
structure (providing the same functionality asprecalcNextBuf
).foofoo.block
has more links than UnixFSBlocksizes
#5312 which is causing the test failures.dagutils/traversal.go
file to go-ipld-format (Package Extraction Strategy #5304).Traversal
structure toWalker
(that includes comments, commit message, filename).Closes #5192.