-
Notifications
You must be signed in to change notification settings - Fork 38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor async loading for simplicity and correctness #356
Conversation
b303f1a
to
0818f67
Compare
// for a request and does NOT cause a request to fail completely | ||
type RemoteMissingBlockErr struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why the rename? it's still for remotes and we have other Remote*
errors like the one you introduced below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
technically it's now just a "missing block" -- missing both locally & remotely -- maybe a rename that breaks compatibility isn't worth it just to be super accurate.
impl/graphsync_test.go
Outdated
@@ -109,6 +109,11 @@ func TestMakeRequestToNetwork(t *testing.T) { | |||
require.True(t, found) | |||
require.Equal(t, td.extensionData, returnedData, "Failed to encode extension") | |||
|
|||
builder := gsmsg.NewBuilder() | |||
builder.AddResponseCode(receivedRequest.ID(), graphsync.RequestRejected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need this now? is it because this test never needed to actually make a response and the previous version had all the blocks locally; but this new one puts the blocks on the remote but then tells the client to get lost?
So we're simply testing the ability to make a network connection and have the most basic interaction here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea this test is from very far back in the early graphsync development days -- when I was just seeing if the protocol was generating the network traffic I expected. So previously, it yes loaded absolutely everything locally but we just wanted to make sure a request got made.
Maybe the right move now is eliminate, since really, at this point, I'm pretty sure we can send network requests :P
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
eliminated
} | ||
|
||
// SetRemoteState records whether or not the request is online | ||
func (rl *ReconciledLoader) SetRemoteState(online bool) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we rename this so it's more natural with a boolean? (or use an enum for the values, but there's really only two). Maybe SetRemoteOnline()
? Reading SetRemoteState(false)
is a bit mysterious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea I agree.
// if we've only loaded locally so far and hit a missing block | ||
// initiate remote request and retry the load operation from remote | ||
if _, ok := result.Err.(graphsync.MissingBlockErr); ok && !requestSent { | ||
requestSent = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be a state that's available on the loader rather than having to figure it out here? It seems a little risky having to make inferences like this from the outside. It seems to me that there's a new loader made for each request in the requestmanager, and then this assumption is implicit in the executor.
It's also a bit weird that we're asking the loader for a block, and then using the result that the loader gives us to then tell the loader what its own state is. It seems that we're doing state management here so we can trigger the RetryLastOfflineLoad()
. But the flow does raise questions about who's responsible for what. The loader clearly isn't just a dumb container, yet it's not responsible for its own state management.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, looking at requestTask()
it looks like could be a reused loader, so we have local state (to the executor) and loader state going on here. So maybe it's already "online" as far as its concerned, from a previous use, but then we go through this local state management to figure out it's online, then we tell it's online (again) and do the RetryLastOfflineLoad()
dance.
Is there risk involved in this dual state management and reuse?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea, I mean, maybe? honestly this is a boundary I thought about a lot and couldn't decide upon.
Honestly, I thought about having the loader just make the request itself cause yea, then state in only one place... I dunno, at that point it's just doing a whole whole lot and it makes this already large PR even bigger.
I'm not sure where to draw the line, and my gut feeling is: let's ship it, prove it works, and then refactor more as needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fair enough, I'll buy that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems reasonable. Docs are probably the only one of the 3 bullets that i'd push to have in this PR rather than a follow up
// RemoteIncorrectResponseError indicates that the remote peer sent a response | ||
// to a traversal that did not correspond with the expected next link | ||
// in the selector traversal based on verification of data up to this point | ||
type RemoteIncorrectResponseError struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the path where this link mismatched occured might be useful context as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
CleanupRequest(p peer.ID, requestID graphsync.RequestID) | ||
// PersistenceOptions is an interface for getting loaders by name | ||
type PersistenceOptions interface { | ||
GetLinkSystem(name string) (ipld.LinkSystem, bool) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need docs on why there are many named link systems here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this interface is unrelated to the current PR -- it just got moved around a bit -- can we defer to a seperate ticket? This got large
requestmanager/executor/executor.go
Outdated
// ReconciledLoader is an interface that can be used to load blocks from a local store or a remote request | ||
type ReconciledLoader interface { | ||
SetRemoteState(online bool) | ||
RetryLastOfflineLoad() types.AsyncLoadResult |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should document thread safety and semantics here - i assume that the intention is this loader can only be used by one client.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like something is off with this interface as well:
should this just be a 'retryLastLoad()`?
why does this reconciled loader need to know that the state is online/offline? when it asks the linksystem to load when offline, the linksystem can immediately return an error - in the same way it would if the loader was online, but there was a network failure?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the thread safety is relatively well documented in reconciledloader.go itself. This is joy of go's whole "put the interface next to where its used" -- you expect docs in one palce that are elsewhere -- maybe I should put a pointer here.
Re: RetryLastOfflineLoad -- it's named as such cause I thought this was really the only useful case -- trying a load again when you go online. But I agree, a global retry would be more useful and clear. In my mind the operations are as follows:
- offline load -> retry online -- this time wait for remote data and try again
- online load -> retry online -- just try loading from local store (if you were online and loaded successfully, it got saved locally, the remote isn't sending any more data on this particular load, so the only possibility is more local data saved from another source)
- online load -> retry offline -- just load local (and previous remote data was saved)
- offline load -> retry offline -- just load local
A future version could theoretically make an actual request to the network to tell the remote to try the block again, but that's not something I think we tackle now at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done!
|
||
// verify it matches the expected next load | ||
if !head.link.Equals(link.(cidlink.Link).Cid) { | ||
return nil, graphsync.RemoteIncorrectResponseError{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have zero tolerance for out-of-order blocks right? so this error is final and breaks the entire transfer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes and yes. Remotes had best not fuck up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should probably remove swearing in a Github comment and yet I am not feeling compelled to :P
// update path tracking | ||
rl.pathTracker.recordRemoteLoadAttempt(lctx.LinkPath, head.action) | ||
|
||
// if block == nil, we have no remote block to load |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this comment could do with more extrapolation, there's some stuff related to states and the duplicate block logic in the metadata ingester
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fair I'll work on it based on what I wrote above.
func (v *Verifier) appendUntilLink() { | ||
if v.tip().link == nil && len(v.tip().children) > 0 { | ||
v.stack = append(v.stack, v.tip().children[0]) | ||
v.appendUntilLink() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why does this recurse rather than iterate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cause I used to be an Elixir/Erlang programmer? Totally fair. I'll fix.
// Verifier allows you to verify series of links loads matches a previous traversal | ||
// order when those loads are successful | ||
// At any point it can reconstruct the current path. | ||
type Verifier struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
most of this seems necessary because of ipld-prime's insistence on pretending that block boundaries don't exist, so it has very poor support for the distinction that links make within a series of nodes; is that fair?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean yea sorta. Technically, we could just re-run the selector against the results, but that seems like A LOT of overkill.
I get why ipld-prime wants to treat data as a giant abstract structured data source in the sky, but yes block boundaries are a thing and it can be tricky to deal with.
Co-authored-by: Rod Vagg <rod@vagg.org>
@willscott + @rvagg I believe I have resolved all PR comments -- hopefully we are good to merge? That said, this merits another re-review cause boy "update the documentation" turned into "wow, the architecture docs are really out of date and need a bunch of updates" |
99257d2
to
de4f016
Compare
@@ -69,130 +66,6 @@ var protocolsForTest = map[string]struct { | |||
"(v1.0 -> v1.0)": {[]protocol.ID{gsnet.ProtocolGraphsync_1_0_0}, []protocol.ID{gsnet.ProtocolGraphsync_1_0_0}}, | |||
} | |||
|
|||
func TestMakeRequestToNetwork(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my guess is that these two are removed because they were, as you were saying recently, early bootstrapping tests that made sense before the whole thing was built, now they are a bit too basic to be bothered with?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep
trace.WithAttributes(attribute.String("cid", link.String()))) | ||
defer span.End() | ||
|
||
log.Debugw("verified block", "request_id", rl.requestID, "total_queued_bytes", buffered) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would these be interesting attributes in the verifyBlock
trace above too?
TODO: these are no longer used as the old responses are consumed upon restart, to minimize | ||
network utilization -- does this make sense? Maybe we should throw out these responses while paused? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, this is interesting .. and I have no idea what the answer to this might be; are you comfortable enough with the new behaviour to commit to it for now?
ReconciledLoader -> LocalStorage : Load Block | ||
LocalStorage --> ReconciledLoader : Block or missing | ||
else | ||
loop until block loaded, missing, or error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gets a bit dense in here; to the point where I wonder about the utility of the detail, but I think this looks right
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you're happy with the state of this, then fine by me, let's give it a go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree that the 'request-execution' state diagram is pretty hard to make sense of in one go. i wonder if there are smaller parts that make sense as their own broken out diagrams. That seems pretty nit-picky though, so happy to see this go in as is.
…ng (#332) * feat(net): initial dag-cbor protocol support also added first roundtrip benchmark * feat(requestid): use uuids for requestids Ref: #278 Closes: #279 Closes: #281 * fix(requestmanager): make collect test requests with uuids sortable * fix(requestid): print requestids as string uuids in logs * fix(requestid): use string as base type for RequestId * chore(requestid): wrap requestid string in a struct * feat(libp2p): add v1.0.0 network compatibility * chore(net): resolve most cbor + uuid merge problems * feat(net): to/from ipld bindnode types, more cbor protoc improvements * feat(net): introduce 2.0.0 protocol for dag-cbor * fix(net): more bindnode dag-cbor protocol fixes Not quite working yet, still need some upstream fixes and no extensions work has been attempted yet. * chore(metadata): convert metadata to bindnode * chore(net,extensions): wire up IPLD extensions, expose as Node instead of []byte * Extensions now working with new dag-cbor network protocol * dag-cbor network protocol still not default, most tests are still exercising the existing v1 protocol * Metadata now using bindnode instead of cbor-gen * []byte for deferred extensions decoding is now replaced with datamodel.Node everywhere. Internal extensions now using some form of go-ipld-prime decode to convert them to local types (metadata using bindnode, others using direct inspection). * V1 protocol also using dag-cbor decode of extensions data and exporting the bytes - this may be a breaking change for exising extensions - need to check whether this should be done differently. Maybe a try-decode and if it fails export a wrapped Bytes Node? * fix(src): fix imports * fix(mod): clean up go.mod * fix(net): refactor message version format code to separate packages * feat(net): activate v2 network as default * fix(src): build error * chore: remove GraphSyncMessage#Loggable Ref: #332 (comment) * chore: remove intermediate v1.1 pb protocol message type v1.1.0 was introduced to start the transition to UUID RequestIDs. That change has since been combined with the switch to DAG-CBOR messaging format for a v2.0.0 protocol. Thus, this interim v1.1.0 format is no longer needed and has not been used at all in a released version of go-graphsync. Fixes: filecoin-project/lightning-planning#14 * fix: clarify comments re dag-cbor extension data As per dission in #338, we are going to be erroring on extension data that is not properly dag-cbor encoded from now on * feat: new LinkMetadata iface, integrate metadata into Response type (#342) * feat(metadata): new LinkMetadata iface, integrate metadata into Response type * LinkMetadata wrapper around existing metadata type to allow for easier backward-compat upgrade path * integrate metadata directly into GraphSyncResponse type, moving it from an optional extension * still deal with metadata as an extension for now—further work for v2 protocol will move it into the core message schema Ref: #335 * feat(metadata): move metadata to core protocol, only use extension in v1 proto * fix(metadata): bindnode expects Go enum strings to be at the type level * fix(metadata): minor fixes, tidy up naming * fix(metadata): make gofmt and staticcheck happy * fix(metadata): docs and minor tweaks after review Co-authored-by: Daniel Martí <mvdan@mvdan.cc> * fix: avoid double-encode for extension size estimation Closes: filecoin-project/lightning-planning#15 * feat(requesttype): introduce RequestType enum to replace cancel&update bools (#352) Closes: #345 * fix(metadata): extend round-trip tests to byte representation (#350) * feat!(messagev2): tweak dag-cbor message schema (#354) * feat!(messagev2): tweak dag-cbor message schema For: 1. Efficiency: compacting the noisy structures into tuples representations and making top-level components of a message optional. 2. Migrations: providing a secondary mechanism to lean on for versioning if we want a gentler upgrade path than libp2p protocol versioning. Closes: #351 * fix(messagev2): adjust schema per feedback * feat(graphsync): unify req & resp Pause, Unpause & Cancel by RequestID (#355) * feat(graphsync): unify req & resp Pause, Unpause & Cancel by RequestID Closes: #349 * fixup! feat(graphsync): unify req & resp Pause, Unpause & Cancel by RequestID * fixup! feat(graphsync): unify req & resp Pause, Unpause & Cancel by RequestID when using error type T, use *T with As, rather than **T * fixup! feat(graphsync): unify req & resp Pause, Unpause & Cancel by RequestID * fixup! feat(graphsync): unify req & resp Pause, Unpause & Cancel by RequestID Co-authored-by: Daniel Martí <mvdan@mvdan.cc> * feat: SendUpdates() API to send only extension data to via existing request * fix(responsemanager): send update while completing If request has finished selector traversal but is still sending blocks, I think it should be possible to send updates. As a side effect, this fixes our race. Logically, this makes sense, cause our external indicator that we're done (completed response listener) has not been called. * fix(requestmanager): revert change to pointer type * Refactor async loading for simplicity and correctness (#356) * feat(reconciledloader): first working version of reconciled loader * feat(traversalrecorder): add better recorder for traversals * feat(reconciledloader): pipe reconciled loader through code style(lint): fix static checks * Update requestmanager/reconciledloader/injest.go Co-authored-by: Rod Vagg <rod@vagg.org> * feat(reconciledloader): respond to PR comments Co-authored-by: Rod Vagg <rod@vagg.org> * fix(requestmanager): update test for rebase Co-authored-by: Daniel Martí <mvdan@mvdan.cc> Co-authored-by: hannahhoward <hannah@hannahhoward.net>
Was missed in #356 This doesn't break anything obvious because the loader isn't used from the traverser's linksystem, and the traverser sets up defaults where the provided linksystem is missing them. But, we don't get certain things, like KnownReifiers, which are kind of important.
Was missed in #356 This doesn't break anything obvious because the loader isn't used from the traverser's linksystem, and the traverser sets up defaults where the provided linksystem is missing them. But, we don't get certain things, like KnownReifiers, which are kind of important.
Goals
fix #339 -- this is essentially an implementation of the algorithm outline there
In the process, it lays the foundation for much more robust handling of different asyncloading scenarios. It also enables "fire request only when needed" behavior
Implementation
The first concept to understand is the TraversalRecord and the Verifier. The TraversalRecord is a compact representation of an ordered history of links loaded in a traversal, along with their paths. It's based on the pathNode implementation in ipld/go-ipld-prime#354 (hattip @willscott). The Verifier is a class for replaying a traversal history against a series of responses from a remote. It also handles missing links the local may have had the remote does not. Note that the Verifier currently has to run without changes being made to the TraversalRecord.
The the second concept is the ReconciledLoader and it's associated remote queue of metadata and blocks.
The loading process is as follows:
Injesting data is now very straightforward -- we add divide up responses+ blocks by request ID and then queue responses into the respective queues for the ReconciledLoader. Coordination is simply handled via mutexes and sync.Cond for signalling.
For Discussion