Failure on test_take_over_seeder #3329

Closed
kostasrim opened this issue Jul 17, 2024 · 4 comments · Fixed by #3344

@kostasrim (Contributor) commented Jul 17, 2024

Run https://github.com/dragonflydb/dragonfly/actions/runs/9967436978/job/27541119341

E           redis.exceptions.ResponseError: Couldn't execute takeover
kostasrim added the bug label on Jul 17, 2024
@dranikpg (Contributor)

@adiholden (Collaborator) commented Jul 18, 2024

I want to share my current findings on this bug, as I believe I will not be able to continue investigating it.

  1. The takeover fails in the test because, after the takeover command is sent, some of the data received on the replica side is corrupted. I can tell this from the error log on the replica, which I verified comes from
    return make_unexpected(make_error_code(errc::bad_message));

    Sometimes I saw a different error message, e.g.
    replica.cc:247] Error stable sync with localhost:45677 generic:34 Numerical result out of range
    which also points to corrupted data in the journal reader.
  2. Once the replica reached that state, it exited stable sync and was not able to reconnect to the master (by this point the takeover had already been executed).
  3. I suspect the data corruption comes from the async writer on the master side: when I ran the pytest with a small code change that removes the condition total_pending > kFlushThreshold from
    if (in_flight_bytes_ == 0 || total_pending > kFlushThreshold) {
    the tests did not fail any more, but I was not able to prove this assumption (a sketch of the suspected interleaving follows this list).
    FYI @romange
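
To make the failure mode suspected in point 3 concrete, here is a minimal, self-contained sketch. It is not Dragonfly code; the length-prefixed framing and the Frame/Parse names are purely illustrative. It shows how two overlapping writes of framed records corrupt the stream once their partial writes interleave, which is the kind of input a journal reader would reject as a bad message:

    // Illustrative only: simulates two framed records whose partial writes
    // interleave on the wire, as could happen if a second AsyncWrite is
    // issued while a previous one is still in flight.
    #include <cstdint>
    #include <cstring>
    #include <iostream>
    #include <optional>
    #include <string>
    #include <vector>

    // Frame a record as [uint32 length][payload].
    std::string Frame(const std::string& payload) {
      uint32_t len = static_cast<uint32_t>(payload.size());
      std::string out(sizeof(len), '\0');
      std::memcpy(out.data(), &len, sizeof(len));
      return out + payload;
    }

    // Parse the stream back into records; fail if the framing is broken.
    std::optional<std::vector<std::string>> Parse(const std::string& stream) {
      std::vector<std::string> records;
      size_t pos = 0;
      while (pos + sizeof(uint32_t) <= stream.size()) {
        uint32_t len;
        std::memcpy(&len, stream.data() + pos, sizeof(len));
        pos += sizeof(len);
        if (pos + len > stream.size())
          return std::nullopt;  // frame claims more bytes than remain
        records.push_back(stream.substr(pos, len));
        pos += len;
      }
      if (pos != stream.size())
        return std::nullopt;  // trailing bytes that are not a full frame
      return records;
    }

    int main() {
      std::string a = Frame("SET key1 value1");
      std::string b = Frame("SET key2 value2");
      std::cout << std::boolalpha;

      // Writes finish one after the other: the reader sees two clean records.
      std::cout << "serialized parse ok:  " << Parse(a + b).has_value() << '\n';

      // Two in-flight writes whose chunks interleave: the second record's
      // length prefix lands inside the first record's payload, the framing
      // is misread and the stream is rejected.
      std::string interleaved =
          a.substr(0, 8) + b.substr(0, 8) + a.substr(8) + b.substr(8);
      std::cout << "interleaved parse ok: " << Parse(interleaved).has_value() << '\n';
    }

If the total_pending > kFlushThreshold branch allows a second write to start while in_flight_bytes_ != 0, that is the window in which such interleaving could occur.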

To reproduce the test failure, build the main branch in release mode and run the test_take_over_seeder test.

@chakaz (Collaborator) commented Jul 18, 2024

@adiholden did you get an error reply other than "Couldn't execute takeover", which Kostas pointed out in the first comment?
I ask because I only saw an error indicating "bad message" once; when I ran it many more times I saw different failures.
When I analyzed the data exchange between the master and the replica, I saw that the communication is slow and the replica takes a long time to reply (the TCP window is full for much of the exchange). As a result, most of the failures I saw were simply timeouts because the replica did not catch up with the master.
I can't explain this though, because it's not that many commands, and increasing the REPLTAKEOVER timeout did not help.

romange assigned himself and unassigned chakaz on Jul 19, 2024
@romange (Collaborator) commented Jul 19, 2024

I managed to reproduce it (opt build only) with the following script:

#!/usr/bin/env bash
set -e

export DRAGONFLY_PATH=/home/roman/projects/dragonfly/build-opt/dragonfly
for i in {1..50}; do
  echo "pass ${i}"
  pytest -xv dragonfly/replication_test.py -k "test_take_over_seeder" > stdout$i.txt 2> stderr$i.txt
done

romange added a commit that referenced this issue Jul 20, 2024
Before, it was possible to issue several concurrent AsyncWrite requests.
But these are not atomic, which leads to replication stream corruption.
Now we wait for the previous request to finish before sending the next one.

ThrottleIfNeeded now takes the pending buffer size into account when throttling.

Fixes #3329

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
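
As a rough illustration of the serialization idea described in the commit message, here is a small, self-contained sketch. It is not the actual Dragonfly change, and the AsyncSink/Send names are made up: it buffers records and waits for the previous asynchronous write to complete before starting the next one, so at most one write is ever in flight and framed records cannot interleave on the wire:

    // Illustrative only: serialize async writes so that a new one never
    // overlaps a previous in-flight write.
    #include <chrono>
    #include <future>
    #include <iostream>
    #include <string>
    #include <thread>

    class AsyncSink {
     public:
      // Buffer the record, wait for any previous in-flight write to finish,
      // then flush everything buffered so far in a single write.
      void Write(const std::string& record) {
        pending_ += record;
        if (in_flight_.valid())
          in_flight_.get();  // the serialization point: wait, do not overlap
        std::string buf;
        buf.swap(pending_);
        in_flight_ = std::async(std::launch::async, [this, buf] { Send(buf); });
      }

      // Wait for the last write to complete.
      void Flush() {
        if (in_flight_.valid()) in_flight_.get();
      }

      const std::string& wire() const { return wire_; }

     private:
      // Simulated slow socket send; never runs concurrently with itself
      // because Write() waits for the previous future before launching it.
      void Send(const std::string& buf) {
        std::this_thread::sleep_for(std::chrono::milliseconds(5));
        wire_ += buf;
      }

      std::string pending_;  // bytes accumulated while a write is in flight
      std::string wire_;     // what the "socket" has received
      std::future<void> in_flight_;
    };

    int main() {
      AsyncSink sink;
      for (int i = 0; i < 5; ++i)
        sink.Write("record-" + std::to_string(i) + ";");
      sink.Flush();
      std::cout << sink.wire() << '\n';  // records appear whole and in order
    }

The pending_ buffer is also a natural place for a throttling check on accumulated bytes, in the spirit of the ThrottleIfNeeded change mentioned above.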