Debug graphsync responder memory retention issues #256
Labels
effort/days - Estimated to take multiple days, but less than a week
exp/expert - Having worked on the specific codebase is important
need/triage - Needs initial labeling and prioritization
P1 - High: Likely tackled by core team if no one steps up
We are seeing ongoing issues in Estuary with memory retention of blocks when responding to go-graphsync requests. Attached are heap profiles from estuary-shuttle:
pprof.estuary-shuttle.alloc_objects.alloc_space.inuse_objects.inuse_space.004.pb.gz
pprof.estuary-shuttle.alloc_objects.alloc_space.inuse_objects.inuse_space.003.pb.gz
The hot path goes through queryexecutor.go and runtraversal.go, but it's not clear why the blocks that are loaded are retained.
After the blocks are loaded, there are two paths they go on:
https://github.com/ipfs/go-graphsync/blob/main/responsemanager/runtraversal/runtraversal.go#L77
https://github.com/ipfs/go-graphsync/blob/main/responsemanager/queryexecutor.go#L97
(sidebar: RunTraversal is getting to be a somewhat bizarre and unnecessary abstraction, and I wonder if we should just put it back in the query executor)
We have fairly extensive code intended to apply back pressure to memory usage during traversal for the path through the MessageQueue. The Allocator SHOULD block the second code path, which is synchronous, preventing more blocks from being loaded off disk until the messages go over the wire, at which point the block memory SHOULD be freeable in a GC cycle.
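To make the intended mechanism concrete, here is a minimal sketch of the back-pressure pattern the Allocator is meant to provide. The names and signatures below are mine for illustration, not go-graphsync's actual API: reservations block the traversal once a memory budget's worth of blocks is in flight, and memory is only released once the corresponding messages have gone over the wire.

```go
package main

import (
	"fmt"
	"sync"
)

// allocator is a hypothetical sketch of a memory back-pressure gate:
// callers reserve bytes before loading a block and release them once the
// block has been sent over the wire. Reservations block while the budget
// is exhausted, which is what should keep the responder from loading
// blocks faster than it can send them.
type allocator struct {
	mu     sync.Mutex
	cond   *sync.Cond
	budget uint64 // bytes still available
}

func newAllocator(total uint64) *allocator {
	a := &allocator{budget: total}
	a.cond = sync.NewCond(&a.mu)
	return a
}

// Reserve blocks until `size` bytes fit within the remaining budget.
func (a *allocator) Reserve(size uint64) {
	a.mu.Lock()
	defer a.mu.Unlock()
	for a.budget < size {
		a.cond.Wait()
	}
	a.budget -= size
}

// Release returns bytes to the budget once the block's message has been
// sent, waking any traversal goroutine waiting in Reserve.
func (a *allocator) Release(size uint64) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.budget += size
	a.cond.Broadcast()
}

func main() {
	a := newAllocator(1 << 20) // 1 MiB budget

	blocks := make(chan []byte, 16)
	var wg sync.WaitGroup

	// Stand-in for the MessageQueue: sends blocks over the wire and
	// releases their memory afterwards.
	wg.Add(1)
	go func() {
		defer wg.Done()
		for blk := range blocks {
			// ... write blk to the network here ...
			a.Release(uint64(len(blk)))
		}
	}()

	// Stand-in for the traversal: loads blocks off disk, gated by the
	// allocator so only ~1 MiB is ever in flight.
	for i := 0; i < 100; i++ {
		size := uint64(256 << 10) // pretend each block is 256 KiB
		a.Reserve(size)
		blocks <- make([]byte, size)
	}
	close(blocks)
	wg.Wait()
	fmt.Println("all blocks sent with bounded in-flight memory")
}
```

If this gate works as intended, in-use memory should plateau at roughly the allocator budget regardless of how large the requested DAG is; the Estuary profiles suggest something is defeating that.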
So far, most of my efforts have focused on the second code path: verifying that the allocator is blocking the traversal properly, and that block memory is in fact being freed once it has been sent over the wire. As of yet, I have been unable to reproduce memory issues on this code path similar to those witnessed in Estuary. Graphsync has an extensive testground testplan with lots of parameters, and you can see my experiments in https://github.com/ipfs/go-graphsync/tree/feat/estuary-memory-debugging
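For context, the check these experiments boil down to is roughly the following (a simplified stand-in, not the actual testplan code): force a GC once the blocks have notionally been sent and confirm that heap in use falls back toward the baseline. If it stays near the peak, something is still holding references to the block memory.

```go
package main

import (
	"fmt"
	"runtime"
)

// heapInUse forces a GC so that freed block memory actually shows up as
// freed, then reports the bytes in in-use heap spans (a rough proxy for
// live objects).
func heapInUse() uint64 {
	runtime.GC()
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapInuse
}

func main() {
	baseline := heapInUse()

	// Stand-in for "respond to a graphsync request": load a bunch of
	// block-sized buffers, pretend to send them, then drop all references.
	blocks := make([][]byte, 0, 64)
	for i := 0; i < 64; i++ {
		blocks = append(blocks, make([]byte, 1<<20)) // 1 MiB "blocks"
	}
	peak := heapInUse()
	blocks = nil // after the messages go over the wire, nothing should hold these

	after := heapInUse()
	fmt.Printf("baseline: %d MiB, peak: %d MiB, after send: %d MiB\n",
		baseline>>20, peak>>20, after>>20)
	// If "after" stays near "peak" instead of returning toward "baseline",
	// something is still retaining the block memory -- which is the kind
	// of behavior the Estuary profiles suggest.
}
```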
I have not explored the first code path because we haven't witnessed issues in it up to this point, but I think it is worth examining.
I am not sure what the best next steps are, but I think debugging this particular issue is a top priority.