
Refactor async loading for simplicity and correctness #356

Merged: 5 commits merged into rvagg/uuid-rebasing from feat/refactor-asyncloader on Feb 18, 2022

Conversation

hannahhoward (Collaborator)

Goals

Fix #339 -- this is essentially an implementation of the algorithm outlined there.

In the process, it lays the foundation for much more robust handling of different async loading scenarios. It also enables "fire request only when needed" behavior.

Implementation

The first concept to understand is the TraversalRecord and the Verifier. The TraversalRecord is a compact representation of an ordered history of links loaded in a traversal, along with their paths. It's based on the pathNode implementation in ipld/go-ipld-prime#354 (hat tip @willscott). The Verifier is a class for replaying a traversal history against a series of responses from a remote. It also handles links the local may have that the remote does not. Note that the Verifier currently has to run without changes being made to the TraversalRecord.
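
For intuition, here is a very rough sketch of the path-trie idea behind the TraversalRecord. Everything here is illustrative only -- the names, fields, and use of plain string path segments are not the actual go-graphsync types:

```go
package sketch

import "github.com/ipfs/go-cid"

// recordNode is an illustrative path-trie node: each node remembers the link
// loaded at its path (if any) plus its children in the order they were visited.
type recordNode struct {
	link     *cid.Cid               // link loaded at this path, nil if none
	children []*recordNode          // ordered visit history
	index    map[string]*recordNode // path segment -> child
}

// recordLoad walks (and creates) nodes along path, storing the loaded link at the end.
func (n *recordNode) recordLoad(path []string, c cid.Cid) {
	cur := n
	for _, seg := range path {
		child, ok := cur.index[seg]
		if !ok {
			child = &recordNode{index: map[string]*recordNode{}}
			cur.index[seg] = child
			cur.children = append(cur.children, child)
		}
		cur = child
	}
	cur.link = &c
}
```

A Verifier then walks this record in visit order, comparing each (path, link) pair against the next response from the remote.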

The second concept is the ReconciledLoader and its associated remote queue of metadata and blocks.

The loading process is as follows (a simplified Go sketch follows the list):

  • when we request a link, we wait for new items from the remote
    • if we're offline, we return immediately
    • if we're online, we wait until we have new items to read
      • when we first go online, we initiate a "verification" against previously loaded data -- this comes from the traversal record
      • this means when we first go online, we actually don't have any "new" remote items until the verification finishes. If the verification fails, we pass an error in response to the load request -- this is a bit odd, but keep in mind the traverser that calls BlockReadOpener will simply throw this in the graphsync error channel and close the request. I really didn't want to deal with having the error channel inside the reconciled loader and getting drawn into the request closing process.
  • if we have new remote items, and we're not loading from a path we know the remote is missing, we attempt to load the link from the first remote item, erroring if the link we're loading and the remote's link do not match
  • if we're on a path the remote is missing, we simply go right to loading local
  • if the remote sent the block, it's directly in the remote item queue and we save and return it. This queue itself is the unverified blockstore
  • if the remote said it had the block but it was a duplicate, we load local
  • if the remote said it's missing the block, we record the new missing path, then load local
  • Once we load, we record the load into the traversal record, with one exception: if we're offline, we actually hold till the next load before putting the link in the traversal record. This enables a "retry" to be run when local load fails but then the node goes online
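
A compressed, purely illustrative version of that decision flow is below. The names (remoteItem, loaderSketch, loadLink) and the maps standing in for the local store and path tracking are all hypothetical; the real ReconciledLoader also blocks on the remote queue, runs verification, and records loads into the TraversalRecord:

```go
package sketch

import (
	"errors"
	"fmt"

	"github.com/ipfs/go-cid"
)

// remoteItem is a hypothetical stand-in for one queued piece of remote
// metadata: the link it covers, the block bytes (nil if not sent), and
// whether the remote reported it missing.
type remoteItem struct {
	link    cid.Cid
	block   []byte
	missing bool
}

type loaderSketch struct {
	online       bool
	local        map[cid.Cid][]byte // stands in for the local link system
	remoteQueue  []remoteItem       // the "unverified blockstore"
	missingPaths map[string]bool    // paths the remote told us it is missing
}

var errNotFound = errors.New("block not found locally or remotely")

// loadLink mirrors the bullet list above in simplified form.
func (rl *loaderSketch) loadLink(path string, lnk cid.Cid) ([]byte, error) {
	if rl.online && !rl.missingPaths[path] && len(rl.remoteQueue) > 0 {
		item := rl.remoteQueue[0]
		rl.remoteQueue = rl.remoteQueue[1:]
		if !item.link.Equals(lnk) {
			// remote responses must match the traversal order exactly
			return nil, fmt.Errorf("remote sent %s, expected %s", item.link, lnk)
		}
		switch {
		case item.block != nil: // remote sent the block: save it locally and return it
			rl.local[lnk] = item.block
			return item.block, nil
		case item.missing: // remote is missing this block: remember the path, fall back to local
			rl.missingPaths[path] = true
		default: // duplicate: remote has it but didn't resend; fall back to local
		}
	}
	// offline, on a known-missing path, or a duplicate: load from the local store
	if data, ok := rl.local[lnk]; ok {
		return data, nil
	}
	return nil, errNotFound
}
```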

Ingesting data is now very straightforward -- we divide up responses and blocks by request ID and then queue them into the respective queue for each ReconciledLoader. Coordination is simply handled via mutexes and sync.Cond for signalling.
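
The signalling itself is the classic sync.Cond pattern. A minimal sketch (not the actual go-graphsync queue; remoteQueue/push/pop/close are made-up names, reusing the remoteItem type sketched above) of "queue responses, wake the waiting loader":

```go
package sketch

import "sync"

// remoteQueue is a minimal per-request queue: the ingest side pushes items as
// responses and blocks arrive, and the loader blocks in pop until something is
// available or the queue is closed.
type remoteQueue struct {
	mu     sync.Mutex
	cond   *sync.Cond
	items  []remoteItem
	closed bool
}

func newRemoteQueue() *remoteQueue {
	q := &remoteQueue{}
	q.cond = sync.NewCond(&q.mu)
	return q
}

// push appends newly ingested responses/blocks and wakes the waiting loader.
func (q *remoteQueue) push(items ...remoteItem) {
	q.mu.Lock()
	q.items = append(q.items, items...)
	q.mu.Unlock()
	q.cond.Broadcast()
}

// close wakes the loader when the request terminates or goes offline.
func (q *remoteQueue) close() {
	q.mu.Lock()
	q.closed = true
	q.mu.Unlock()
	q.cond.Broadcast()
}

// pop blocks until an item arrives or the queue is closed.
func (q *remoteQueue) pop() (remoteItem, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	for len(q.items) == 0 && !q.closed {
		q.cond.Wait()
	}
	if len(q.items) == 0 {
		return remoteItem{}, false
	}
	item := q.items[0]
	q.items = q.items[1:]
	return item, true
}
```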

For Discussion

  • TODOs (feel free to rank optional or required in PR review):
    • the Verifier is a bit imperative and finicky especially in terms of not being able to handle more data incoming to the TraversalRecord while verifying -- I took a couple shots at this and then gave up
      • this means for the time being we don't let the local traversal run ahead when online -- I need a verifier you can continually verify against as data gets added -- I'm pretty sure it's doable just... haven't done it yet
    • There are some commented out tests in the RequestManager -- this is because now they're basically not relevant with the new "only request once you need it mode" -- however, there's some question of whether we need to make this mode optional either as a config option or as a per-request option (we don't currently have a framework for that -- we probably need it). Firing requests only when you need it is a big change -- I kind of love the efficiency -- but it leaves open questions about what happens with stuff like go-data-transfer (what if you never got any graphsync request in response to a Push cause the other side already had all your data? eek!).
    • Updating the documentation -- this is probably required -- I'll probably copy a lot of this PR description.

// for a request and does NOT cause a request to fail completely
type RemoteMissingBlockErr struct {
Member

why the rename? it's still for remotes and we have other Remote* errors like the one you introduced below

Collaborator Author (@hannahhoward, Feb 11, 2022)

technically it's now just a "missing block" -- missing both locally & remotely -- maybe a rename that breaks compatibility isn't worth it just to be super accurate.

@@ -109,6 +109,11 @@ func TestMakeRequestToNetwork(t *testing.T) {
require.True(t, found)
require.Equal(t, td.extensionData, returnedData, "Failed to encode extension")

builder := gsmsg.NewBuilder()
builder.AddResponseCode(receivedRequest.ID(), graphsync.RequestRejected)
Member

why do we need this now? is it because this test never needed to actually make a response and the previous version had all the blocks locally; but this new one puts the blocks on the remote but then tells the client to get lost?

So we're simply testing the ability to make a network connection and have the most basic interaction here?

Collaborator Author

yea this test is from very far back in the early graphsync development days -- when I was just seeing if the protocol was generating the network traffic I expected. So previously, yes, it loaded absolutely everything locally, but we just wanted to make sure a request got made.

Maybe the right move now is to eliminate it, since really, at this point, I'm pretty sure we can send network requests :P

Collaborator Author

eliminated

}

// SetRemoteState records whether or not the request is online
func (rl *ReconciledLoader) SetRemoteState(online bool) {
Member

Could we rename this so it's more natural with a boolean? (or use an enum for the values, but there's really only two). Maybe SetRemoteOnline()? Reading SetRemoteState(false) is a bit mysterious.

Collaborator Author

Yea I agree.

// if we've only loaded locally so far and hit a missing block
// initiate remote request and retry the load operation from remote
if _, ok := result.Err.(graphsync.MissingBlockErr); ok && !requestSent {
requestSent = true
Member

Should this be a state that's available on the loader rather than having to figure it out here? It seems a little risky having to make inferences like this from the outside. It seems to me that there's a new loader made for each request in the requestmanager, and then this assumption is implicit in the executor.

It's also a bit weird that we're asking the loader for a block, and then using the result that the loader gives us to then tell the loader what its own state is. It seems that we're doing state management here so we can trigger the RetryLastOfflineLoad(). But the flow does raise questions about who's responsible for what. The loader clearly isn't just a dumb container, yet it's not responsible for its own state management.

Member

Oh, looking at requestTask() it looks like it could be a reused loader, so we have local state (to the executor) and loader state going on here. So maybe it's already "online" as far as it's concerned, from a previous use, but then we go through this local state management to figure out it's online, then we tell it it's online (again) and do the RetryLastOfflineLoad() dance.

Is there risk involved in this dual state management and reuse?

Collaborator Author

yea, I mean, maybe? honestly this is a boundary I thought about a lot and couldn't decide upon.

Honestly, I thought about having the loader just make the request itself cause yea, then state in only one place... I dunno, at that point it's just doing a whole whole lot and it makes this already large PR even bigger.

I'm not sure where to draw the line, and my gut feeling is: let's ship it, prove it works, and then refactor more as needed.

Member

fair enough, I'll buy that

Collaborator (@willscott) left a comment

This seems reasonable. Docs are probably the only one of the 3 bullets that I'd push to have in this PR rather than a follow-up.

// RemoteIncorrectResponseError indicates that the remote peer sent a response
// to a traversal that did not correspond with the expected next link
// in the selector traversal based on verification of data up to this point
type RemoteIncorrectResponseError struct {
Collaborator

the path where this link mismatch occurred might be useful context as well

Collaborator Author

done
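
For reference, a sketch of what the error carries once the path is added. Field names here are illustrative; see graphsync.go in the PR for the authoritative definition:

```go
import (
	"fmt"

	ipld "github.com/ipld/go-ipld-prime"
)

// RemoteIncorrectResponseError indicates the remote sent a response that did
// not correspond with the expected next link in the selector traversal.
// (Sketch only; the actual go-graphsync type may name its fields differently.)
type RemoteIncorrectResponseError struct {
	LocalLink  ipld.Link // the link the local traversal expected next
	RemoteLink ipld.Link // the link the remote actually sent
	Path       ipld.Path // where in the traversal the mismatch occurred
}

func (e RemoteIncorrectResponseError) Error() string {
	return fmt.Sprintf("at %s: expected link %s but remote sent %s",
		e.Path, e.LocalLink, e.RemoteLink)
}
```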

CleanupRequest(p peer.ID, requestID graphsync.RequestID)
// PersistenceOptions is an interface for getting loaders by name
type PersistenceOptions interface {
GetLinkSystem(name string) (ipld.LinkSystem, bool)
Collaborator

need docs on why there are many named link systems here

Collaborator Author

this interface is unrelated to the current PR -- it just got moved around a bit -- can we defer to a separate ticket? This got large

// ReconciledLoader is an interface that can be used to load blocks from a local store or a remote request
type ReconciledLoader interface {
SetRemoteState(online bool)
RetryLastOfflineLoad() types.AsyncLoadResult
Collaborator

we should document thread safety and semantics here -- I assume the intention is that this loader can only be used by one client.

Collaborator

I feel like something is off with this interface as well:

should this just be a `retryLastLoad()`?
why does this reconciled loader need to know whether the state is online/offline? When it asks the linksystem to load while offline, the linksystem can immediately return an error -- in the same way it would if the loader was online but there was a network failure?

Collaborator Author

the thread safety is relatively well documented in reconciledloader.go itself. This is the joy of Go's whole "put the interface next to where it's used" -- you expect docs in one place that are elsewhere -- maybe I should put a pointer here.

Re: RetryLastOfflineLoad -- it's named as such cause I thought this was really the only useful case -- trying a load again when you go online. But I agree, a global retry would be more useful and clear. In my mind the operations are as follows:

  • offline load -> retry online -- this time wait for remote data and try again
  • online load -> retry online -- just try loading from local store (if you were online and loaded successfully, it got saved locally, the remote isn't sending any more data on this particular load, so the only possibility is more local data saved from another source)
  • online load -> retry offline -- just load local (and previous remote data was saved)
  • offline load -> retry offline -- just load local

A future version could theoretically make an actual request to the network to tell the remote to try the block again, but that's not something I think we tackle now at all.
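
In other words, the four cases above collapse into one rule; a trivial illustrative sketch of the decision (name is mine):

```go
// shouldWaitForRemote captures the matrix above: only "offline load, retried
// while online" waits for fresh remote data; every other combination simply
// re-reads the local store (a successful online load was already persisted
// locally, and the remote will not resend it).
func shouldWaitForRemote(lastLoadWasOnline, nowOnline bool) bool {
	return !lastLoadWasOnline && nowOnline
}
```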

Collaborator Author

done!


// verify it matches the expected next load
if !head.link.Equals(link.(cidlink.Link).Cid) {
return nil, graphsync.RemoteIncorrectResponseError{
Member

we have zero tolerance for out-of-order blocks right? so this error is final and breaks the entire transfer?

Collaborator Author

yes and yes. Remotes had best not fuck up.

Collaborator Author

I should probably remove swearing in a Github comment and yet I am not feeling compelled to :P

// update path tracking
rl.pathTracker.recordRemoteLoadAttempt(lctx.LinkPath, head.action)

// if block == nil, we have no remote block to load
Member

this comment could do with more extrapolation, there's some stuff related to states and the duplicate block logic in the metadata ingester

Collaborator Author

fair, I'll work on it based on what I wrote above.
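
For what it's worth, an expansion along the lines of the description at the top of the PR might read something like this (my wording, not the comment that ultimately landed):

```go
// if block == nil, the remote sent metadata for this link but no block data.
// Per the duplicate/missing handling in the metadata ingester, that means one
// of two things:
//   - duplicate: the remote has the block but did not resend it (it was
//     already sent earlier in this response), so it should be in local storage
//   - missing: the remote does not have the block; the path was recorded
//     above so later loads under it skip the remote entirely
// Either way, we fall back to loading from the local store.
```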

func (v *Verifier) appendUntilLink() {
if v.tip().link == nil && len(v.tip().children) > 0 {
v.stack = append(v.stack, v.tip().children[0])
v.appendUntilLink()
Member

why does this recurse rather than iterate?

Collaborator Author

cause I used to be an Elixir/Erlang programmer? Totally fair. I'll fix.
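
For reference, the tail call converts directly into a loop, assuming tip() returns the last element pushed onto v.stack:

```go
func (v *Verifier) appendUntilLink() {
	// keep descending into the first child until we reach a node that
	// actually recorded a link (or has no children)
	for v.tip().link == nil && len(v.tip().children) > 0 {
		v.stack = append(v.stack, v.tip().children[0])
	}
}
```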

// Verifier allows you to verify that a series of link loads matches a previous traversal
// order when those loads are successful.
// At any point it can reconstruct the current path.
type Verifier struct {
Member

most of this seems necessary because of ipld-prime's insistence on pretending that block boundaries don't exist, so it has very poor support for the distinction that links make within a series of nodes; is that fair?

Collaborator Author

I mean yea sorta. Technically, we could just re-run the selector against the results, but that seems like A LOT of overkill.

I get why ipld-prime wants to treat data as a giant abstract structured data source in the sky, but yes block boundaries are a thing and it can be tricky to deal with.

@hannahhoward (Collaborator Author)

@willscott + @rvagg I believe I have resolved all PR comments -- hopefully we are good to merge?

That said, this merits another re-review cause boy "update the documentation" turned into "wow, the architecture docs are really out of date and need a bunch of updates"

@@ -69,130 +66,6 @@ var protocolsForTest = map[string]struct {
"(v1.0 -> v1.0)": {[]protocol.ID{gsnet.ProtocolGraphsync_1_0_0}, []protocol.ID{gsnet.ProtocolGraphsync_1_0_0}},
}

func TestMakeRequestToNetwork(t *testing.T) {
Member

my guess is that these two are removed because they were, as you were saying recently, early bootstrapping tests that made sense before the whole thing was built; now they are a bit too basic to be bothered with?

Collaborator Author

yep

trace.WithAttributes(attribute.String("cid", link.String())))
defer span.End()

log.Debugw("verified block", "request_id", rl.requestID, "total_queued_bytes", buffered)
Member

would these be interesting attributes in the verifyBlock trace above too?

Comment on lines +782 to +783
TODO: these are no longer used as the old responses are consumed upon restart, to minimize
network utilization -- does this make sense? Maybe we should throw out these responses while paused?
Member

oh, this is interesting .. and I have no idea what the answer to this might be; are you comfortable enough with the new behaviour to commit to it for now?

ReconciledLoader -> LocalStorage : Load Block
LocalStorage --> ReconciledLoader : Block or missing
else
loop until block loaded, missing, or error
Member

gets a bit dense in here; to the point where I wonder about the utility of the detail, but I think this looks right

Member (@rvagg) left a comment

if you're happy with the state of this, then fine by me, let's give it a go

Collaborator (@willscott) left a comment

Agree that the 'request-execution' state diagram is pretty hard to make sense of in one go. I wonder if there are smaller parts that make sense as their own broken-out diagrams. That seems pretty nit-picky though, so happy to see this go in as is.

@hannahhoward hannahhoward merged commit f6a08b5 into rvagg/uuid-rebasing Feb 18, 2022
hannahhoward added a commit that referenced this pull request Feb 18, 2022
…ng (#332)

* feat(net): initial dag-cbor protocol support

also added first roundtrip benchmark

* feat(requestid): use uuids for requestids

Ref: #278
Closes: #279
Closes: #281

* fix(requestmanager): make collect test requests with uuids sortable

* fix(requestid): print requestids as string uuids in logs

* fix(requestid): use string as base type for RequestId

* chore(requestid): wrap requestid string in a struct

* feat(libp2p): add v1.0.0 network compatibility

* chore(net): resolve most cbor + uuid merge problems

* feat(net): to/from ipld bindnode types, more cbor protoc improvements

* feat(net): introduce 2.0.0 protocol for dag-cbor

* fix(net): more bindnode dag-cbor protocol fixes

Not quite working yet, still need some upstream fixes and no extensions work
has been attempted yet.

* chore(metadata): convert metadata to bindnode

* chore(net,extensions): wire up IPLD extensions, expose as Node instead of []byte

* Extensions now working with new dag-cbor network protocol
* dag-cbor network protocol still not default, most tests are still exercising
  the existing v1 protocol
* Metadata now using bindnode instead of cbor-gen
* []byte for deferred extensions decoding is now replaced with datamodel.Node
  everywhere. Internal extensions now using some form of go-ipld-prime
	decode to convert them to local types (metadata using bindnode, others using
	direct inspection).
* V1 protocol also using dag-cbor decode of extensions data and exporting the
  bytes - this may be a breaking change for existing extensions - need to check
	whether this should be done differently. Maybe a try-decode and if it fails
	export a wrapped Bytes Node?

* fix(src): fix imports

* fix(mod): clean up go.mod

* fix(net): refactor message version format code to separate packages

* feat(net): activate v2 network as default

* fix(src): build error

* chore: remove GraphSyncMessage#Loggable

Ref: #332 (comment)

* chore: remove intermediate v1.1 pb protocol message type

v1.1.0 was introduced to start the transition to UUID RequestIDs. That
change has since been combined with the switch to DAG-CBOR messaging format
for a v2.0.0 protocol. Thus, this interim v1.1.0 format is no longer needed
and has not been used at all in a released version of go-graphsync.

Fixes: filecoin-project/lightning-planning#14

* fix: clarify comments re dag-cbor extension data

As per discussion in #338, we are going
to be erroring on extension data that is not properly dag-cbor encoded from now
on

* feat: new LinkMetadata iface, integrate metadata into Response type (#342)

* feat(metadata): new LinkMetadata iface, integrate metadata into Response type

* LinkMetadata wrapper around existing metadata type to allow for easier
  backward-compat upgrade path
* integrate metadata directly into GraphSyncResponse type, moving it from an
  optional extension
* still deal with metadata as an extension for now—further work for v2 protocol
  will move it into the core message schema

Ref: #335

* feat(metadata): move metadata to core protocol, only use extension in v1 proto

* fix(metadata): bindnode expects Go enum strings to be at the type level

* fix(metadata): minor fixes, tidy up naming

* fix(metadata): make gofmt and staticcheck happy

* fix(metadata): docs and minor tweaks after review

Co-authored-by: Daniel Martí <mvdan@mvdan.cc>

* fix: avoid double-encode for extension size estimation

Closes: filecoin-project/lightning-planning#15

* feat(requesttype): introduce RequestType enum to replace cancel&update bools (#352)

Closes: #345

* fix(metadata): extend round-trip tests to byte representation (#350)

* feat!(messagev2): tweak dag-cbor message schema (#354)

* feat!(messagev2): tweak dag-cbor message schema

For:

1. Efficiency: compacting the noisy structures into tuples representations and
   making top-level components of a message optional.
2. Migrations: providing a secondary mechanism to lean on for versioning if we
   want a gentler upgrade path than libp2p protocol versioning.

Closes: #351

* fix(messagev2): adjust schema per feedback

* feat(graphsync): unify req & resp Pause, Unpause & Cancel by RequestID (#355)

* feat(graphsync): unify req & resp Pause, Unpause & Cancel by RequestID

Closes: #349

* fixup! feat(graphsync): unify req & resp Pause, Unpause & Cancel by RequestID

* fixup! feat(graphsync): unify req & resp Pause, Unpause & Cancel by RequestID

when using error type T, use *T with As, rather than **T

* fixup! feat(graphsync): unify req & resp Pause, Unpause & Cancel by RequestID

* fixup! feat(graphsync): unify req & resp Pause, Unpause & Cancel by RequestID

Co-authored-by: Daniel Martí <mvdan@mvdan.cc>

* feat: SendUpdates() API to send only extension data to via existing request

* fix(responsemanager): send update while completing

If request has finished selector traversal but is still sending blocks,
I think it should be possible to send updates. As a side effect, this
fixes our race.

Logically, this makes sense, cause our external indicator that we're
done (completed response listener) has not been called.

* fix(requestmanager): revert change to pointer type

* Refactor async loading for simplicity and correctness (#356)

* feat(reconciledloader): first working version of reconciled loader

* feat(traversalrecorder): add better recorder for traversals

* feat(reconciledloader): pipe reconciled loader through code

style(lint): fix static checks

* Update requestmanager/reconciledloader/injest.go

Co-authored-by: Rod Vagg <rod@vagg.org>

* feat(reconciledloader): respond to PR comments

Co-authored-by: Rod Vagg <rod@vagg.org>

* fix(requestmanager): update test for rebase

Co-authored-by: Daniel Martí <mvdan@mvdan.cc>
Co-authored-by: hannahhoward <hannah@hannahhoward.net>
@mvdan mvdan deleted the feat/refactor-asyncloader branch March 7, 2022 11:57
rvagg added a commit that referenced this pull request Feb 27, 2023
Was missed in #356

This doesn't break anything obvious because the loader isn't used from the
traverser's linksystem, and the traverser sets up defaults where the provided
linksystem is missing them. But, we don't get certain things, like
KnownReifiers, which are kind of important.
hannahhoward pushed a commit that referenced this pull request Feb 28, 2023
Was missed in #356

This doesn't break anything obvious because the loader isn't used from the
traverser's linksystem, and the traverser sets up defaults where the provided
linksystem is missing them. But, we don't get certain things, like
KnownReifiers, which are kind of important.