fix: recon protocol hang with large diffs #356
Conversation
If two nodes each had 2K+ events that the other node did not have, it was possible for the protocol to deadlock as both nodes tried to write those events without reading the values from the other node. The fix splits the protocol into two loops, a read loop and a write loop, which communicate with each other using message passing. The code now has a clean separation of concerns, and because we concurrently read from and write to the network we never enter a state where we are exclusively writing to the network.
) -> Result<()> {
    // TODO: This logic has two potential failure modes we need to test them
    // 1. We allocate memory of all keys in the range, this can be very large.
    // 2. We spend a lot of time writing out to the stream but not reading from the stream.
Both of these points are addressed with this change.
A few comments / questions but I think this looks great!
recon/src/protocol.rs
Outdated
.await?;
for key in keys {
    if let Some(value) = recon.value_for_key(key.clone()).await? {
If there isn't a value, should we be concerned at all now that we should discover them together, or am I forgetting about something?
Good call, going to convert this into an error.