Debug graphsync responder memory retention issues #256

Closed
hannahhoward opened this issue Oct 26, 2021 · 3 comments
Labels
effort/days: Estimated to take multiple days, but less than a week
exp/expert: Having worked on the specific codebase is important
need/triage: Needs initial labeling and prioritization
P1: High - likely tackled by core team if no one steps up

Comments

@hannahhoward
Collaborator

We are seeing ongoing issues with Estuary and memory retention of blocks when responding to go-graphsync requests.
pprof.estuary-shuttle.alloc_objects.alloc_space.inuse_objects.inuse_space.004.pb.gz
pprof.estuary-shuttle.alloc_objects.alloc_space.inuse_objects.inuse_space.003.pb.gz

The hot path goes through queryexecutor.go and runtraversal.go, but it's not clear why the blocks that are loaded are retained.

After the blocks are loaded, they go down two paths (see the sketch after this list):

  1. They get sent back to go-ipld-prime's selector traversal code as the data returned on link load. https://github.com/ipfs/go-graphsync/blob/main/responsemanager/runtraversal/runtraversal.go#L69
  2. They get sent to the ResponseAssembler and then MessageQueue to go over the wire.
    https://github.com/ipfs/go-graphsync/blob/main/responsemanager/runtraversal/runtraversal.go#L77
    https://github.com/ipfs/go-graphsync/blob/main/responsemanager/queryexecutor.go#L97
    (sidebar: RunTraversal is getting to be a somewhat bizarre and unnecessary abstraction, and I wonder if we should just put it back in the query executor)
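
To make the two paths concrete, here is a heavily simplified sketch of how each loaded block fans out; loadLink and sendResponse are hypothetical stand-ins, not the actual go-graphsync functions:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// sendResponse stands in for the ResponseAssembler/MessageQueue path (path 2):
// the block bytes are queued to go over the wire. Hypothetical helper, not the
// real go-graphsync API.
func sendResponse(link string, data []byte) {
	fmt.Printf("queued %d bytes for link %s\n", len(data), link)
}

// loadLink stands in for the traversal's link loader. Each loaded block is
// (1) returned to the selector traversal and (2) handed to the response
// assembler. If either path holds a reference longer than expected, the
// block's memory cannot be garbage collected.
func loadLink(link string, store map[string][]byte) (io.Reader, error) {
	data, ok := store[link]
	if !ok {
		return nil, fmt.Errorf("block %s not found", link)
	}
	sendResponse(link, data)          // path 2: outgoing message queue
	return bytes.NewReader(data), nil // path 1: back to the traversal
}

func main() {
	store := map[string][]byte{"bafy-example": []byte("block data")}
	r, err := loadLink("bafy-example", store)
	if err != nil {
		panic(err)
	}
	data, _ := io.ReadAll(r)
	fmt.Printf("traversal received %d bytes\n", len(data))
}
```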

We have fairly extensive code intended to apply backpressure to memory usage during traversal for the path through the MessageQueue. The Allocator SHOULD block the second code path, which is synchronous, preventing more blocks from being loaded off disk until the messages go over the wire, at which point the block memory SHOULD be freeable in a GC cycle.
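
As an illustration of that intent only, here is a minimal sketch of a blocking allocator; the real Allocator in go-graphsync is per-peer and has a different API, so treat this purely as a picture of the backpressure pattern:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Allocator is a minimal sketch of the backpressure idea: allocating block
// memory blocks until enough previously sent blocks have been released,
// bounding how much block data is held at once.
type Allocator struct {
	mu    sync.Mutex
	cond  *sync.Cond
	total uint64
	inUse uint64
}

func NewAllocator(total uint64) *Allocator {
	a := &Allocator{total: total}
	a.cond = sync.NewCond(&a.mu)
	return a
}

// AllocateBlockMemory blocks until size bytes fit under the budget.
func (a *Allocator) AllocateBlockMemory(size uint64) {
	a.mu.Lock()
	defer a.mu.Unlock()
	for a.inUse+size > a.total {
		a.cond.Wait() // wait for ReleaseBlockMemory to free capacity
	}
	a.inUse += size
}

// ReleaseBlockMemory is called once a block has gone over the wire.
func (a *Allocator) ReleaseBlockMemory(size uint64) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.inUse -= size
	a.cond.Broadcast()
}

func main() {
	alloc := NewAllocator(1 << 20)     // 1 MiB budget
	alloc.AllocateBlockMemory(600_000) // first block fits immediately

	done := make(chan struct{})
	go func() {
		alloc.AllocateBlockMemory(600_000) // blocks: would exceed the budget
		close(done)
	}()

	time.Sleep(100 * time.Millisecond) // let the goroutine reach Wait
	alloc.ReleaseBlockMemory(600_000)  // simulate the first block being sent
	<-done
	fmt.Println("second allocation proceeded after release")
}
```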

So far, most of my efforts have focused on the second code path: verifying that the allocator is blocking the traversal properly, and that block memory is in fact being freed once it has been sent over the wire. As yet, I have been unable to replicate memory issues on this code path similar to those witnessed in Estuary. Graphsync has an extensive testground testplan with lots of parameters, and you can see my experiments in https://github.com/ipfs/go-graphsync/tree/feat/estuary-memory-debugging
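
For anyone reproducing this locally, a rough sanity check that dropped block memory is actually collectable might look like the sketch below (heapInUse is a hypothetical helper; the real experiments rely on the attached pprof heap profiles and the testground plan):

```go
package main

import (
	"fmt"
	"runtime"
)

// heapInUse forces a GC and reports live heap bytes; hypothetical helper for
// a quick local check, not part of go-graphsync or the testground plan.
func heapInUse() uint64 {
	runtime.GC()
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapInuse
}

func main() {
	before := heapInUse()

	// Stand-in for a loaded block; the responder should be able to drop its
	// reference once the block has gone over the wire.
	block := make([]byte, 8<<20)
	during := heapInUse()
	fmt.Printf("first byte: %d\n", block[0]) // keep block live until here
	block = nil

	after := heapInUse()
	fmt.Printf("before=%d during=%d after=%d heap bytes in use\n", before, during, after)
}
```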

I have not explored the first code path because we haven't witnessed issues in it up to this point, but I think it is worth examining.

I am not sure what the best next steps are, but I think debugging this particular issue is the top priority.

hannahhoward added the need/triage, effort/days, exp/expert, and P1 labels on Oct 26, 2021
@rvagg
Member

rvagg commented Oct 26, 2021

(sidebar: RunTraversal is getting to be a somewhat bizarre and unnecessary abstraction, and I wonder if we should just put it back in the query executor)

That's what I've been thinking while looking through this, but merging them back together as they are would make testing harder, so it'll need to be done in a way that doesn't increase the difficulty; I'll include this as an option in what I'm working on now with queryexecutor.

@hannahhoward
Collaborator Author

One thing to add to the testing mix: we understand that NFT.storage and other sites use https://github.com/filecoin-project/go-dagaggregator-unixfs/tree/2b317c005fd561156cac93924ffeb31c58fdb966 to assemble their DAGs. We could add an option to the testplan to assemble a DAG using this code and see what we learn.

@hannahhoward
Collaborator Author

Resolved in #268
