Failure on test_take_over_seeder #3329

Closed
kostasrim opened this issue Jul 17, 2024 · 4 comments · Fixed by #3344

@kostasrim (Contributor) commented Jul 17, 2024

Run https://github.com/dragonflydb/dragonfly/actions/runs/9967436978/job/27541119341

E           redis.exceptions.ResponseError: Couldn't execute takeover
kostasrim added the bug label on Jul 17, 2024
@dranikpg (Contributor)

@adiholden (Collaborator) commented Jul 18, 2024

I want to share my current findings on this bug, as I believe I will not be able to continue investigating it.

  1. The takeover fails in the test because, after the takeover command is sent, some of the data received on the replica side is corrupted. I can tell this from the error log on the replica, which I verified comes from
    return make_unexpected(make_error_code(errc::bad_message));

    Sometimes I saw a different error message, e.g.
    replica.cc:247] Error stable sync with localhost:45677 generic:34 Numerical result out of range
    which also points to corrupted data in the journal reader.
  2. Once the replica reached that state, it exited stable sync and was not able to reconnect to the master (by this point the takeover had already been executed).
  3. I suspect the data corruption comes from the async writer on the master side: when I ran the pytest with a small code change that removes the condition total_pending > kFlushThreshold from
    if (in_flight_bytes_ == 0 || total_pending > kFlushThreshold) {
    the tests did not fail any more, but I was not able to prove this assumption (a sketch of the suspected interleaving follows this list).
    FYI @romange
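
To make the failure mode suspected in point 3 concrete, here is a minimal, self-contained sketch. It is not Dragonfly code; the length-prefixed framing and the Frame/Parse names are purely illustrative. It shows how two overlapping writes of framed records corrupt the stream once their partial writes interleave, which is the kind of input a journal reader would reject as a bad message:

    // Illustrative only: simulates two framed records whose partial writes
    // interleave on the wire, as could happen if a second AsyncWrite is
    // issued while a previous one is still in flight.
    #include <cstdint>
    #include <cstring>
    #include <iostream>
    #include <optional>
    #include <string>
    #include <vector>

    // Frame a record as [uint32 length][payload].
    std::string Frame(const std::string& payload) {
      uint32_t len = static_cast<uint32_t>(payload.size());
      std::string out(sizeof(len), '\0');
      std::memcpy(out.data(), &len, sizeof(len));
      return out + payload;
    }

    // Parse the stream back into records; fail if the framing is broken.
    std::optional<std::vector<std::string>> Parse(const std::string& stream) {
      std::vector<std::string> records;
      size_t pos = 0;
      while (pos + sizeof(uint32_t) <= stream.size()) {
        uint32_t len;
        std::memcpy(&len, stream.data() + pos, sizeof(len));
        pos += sizeof(len);
        if (pos + len > stream.size())
          return std::nullopt;  // frame claims more bytes than remain
        records.push_back(stream.substr(pos, len));
        pos += len;
      }
      if (pos != stream.size())
        return std::nullopt;  // trailing bytes that are not a full frame
      return records;
    }

    int main() {
      std::string a = Frame("SET key1 value1");
      std::string b = Frame("SET key2 value2");
      std::cout << std::boolalpha;

      // Writes finish one after the other: the reader sees two clean records.
      std::cout << "serialized parse ok:  " << Parse(a + b).has_value() << '\n';

      // Two in-flight writes whose chunks interleave: the second record's
      // length prefix lands inside the first record's payload, the framing
      // is misread and the stream is rejected.
      std::string interleaved =
          a.substr(0, 8) + b.substr(0, 8) + a.substr(8) + b.substr(8);
      std::cout << "interleaved parse ok: " << Parse(interleaved).has_value() << '\n';
    }

If the total_pending > kFlushThreshold branch allows a second write to start while in_flight_bytes_ != 0, that is the window in which such interleaving could occur.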

To reproduce the test failure, build the main branch in release mode and run the test_take_over_seeder test.

@chakaz (Collaborator) commented Jul 18, 2024

@adiholden did you get an error reply other than "Couldn't execute takeover", which Kostas pointed out in the first comment?
I ask because I only saw an error indicating "bad message" once; when I ran it many more times I saw different failures.
When I analyzed the data exchange between the master and the replica, I saw that the communication is slow and the replica takes a long time to reply (the TCP window is full for much of the exchange). As a result, most of the failures I saw were simply timeouts because the replica did not catch up with the master.
I can't explain this though, because it's not that many commands, and increasing the REPLTAKEOVER timeout did not help.

romange assigned himself and unassigned chakaz on Jul 19, 2024
@romange (Collaborator) commented Jul 19, 2024

I managed to reproduce it (opt build only) with the following script:

#!/usr/bin/env bash
set -e

export DRAGONFLY_PATH=/home/roman/projects/dragonfly/build-opt/dragonfly
for i in {1..50}; do
  echo "pass ${i}"
  pytest -xv dragonfly/replication_test.py -k "test_take_over_seeder" > stdout$i.txt 2> stderr$i.txt
done

romange added a commit that referenced this issue Jul 20, 2024
Before, it was possible to issue several concurrent AsyncWrite requests.
But these are not atomic, which leads to replication stream corruption.
Now we wait for the previous request to finish before sending the next one.

ThrottleIfNeeded now takes the pending buffer size into account when throttling.

Fixes #3329

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
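
As a rough illustration of the serialization idea described in the commit message, here is a small, self-contained sketch. It is not the actual Dragonfly change, and the AsyncSink/Send names are made up: it buffers records and waits for the previous asynchronous write to complete before starting the next one, so at most one write is ever in flight and framed records cannot interleave on the wire:

    // Illustrative only: serialize async writes so that a new one never
    // overlaps a previous in-flight write.
    #include <chrono>
    #include <future>
    #include <iostream>
    #include <string>
    #include <thread>

    class AsyncSink {
     public:
      // Buffer the record, wait for any previous in-flight write to finish,
      // then flush everything buffered so far in a single write.
      void Write(const std::string& record) {
        pending_ += record;
        if (in_flight_.valid())
          in_flight_.get();  // the serialization point: wait, do not overlap
        std::string buf;
        buf.swap(pending_);
        in_flight_ = std::async(std::launch::async, [this, buf] { Send(buf); });
      }

      // Wait for the last write to complete.
      void Flush() {
        if (in_flight_.valid()) in_flight_.get();
      }

      const std::string& wire() const { return wire_; }

     private:
      // Simulated slow socket send; never runs concurrently with itself
      // because Write() waits for the previous future before launching it.
      void Send(const std::string& buf) {
        std::this_thread::sleep_for(std::chrono::milliseconds(5));
        wire_ += buf;
      }

      std::string pending_;  // bytes accumulated while a write is in flight
      std::string wire_;     // what the "socket" has received
      std::future<void> in_flight_;
    };

    int main() {
      AsyncSink sink;
      for (int i = 0; i < 5; ++i)
        sink.Write("record-" + std::to_string(i) + ";");
      sink.Flush();
      std::cout << sink.wire() << '\n';  // records appear whole and in order
    }

The pending_ buffer is also a natural place for a throttling check on accumulated bytes, in the spirit of the ThrottleIfNeeded change mentioned above.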