Massive unnecessary Raft log rollback causes spike latency. #5881

UndefinedSy · 2024-05-11T11:05:02Z

Please check the FAQ documentation before raising an issue

Describe the bug (required)
I ran into a scenario where latency spikes for several seconds. graphd's log shows that requests to a particular storaged host time out. The log of that storaged shows that many partitions (possibly every partition for which it is a follower) went through a RaftLog rollback.

However, no log entries indicate a leader re-election or leader change, so log inconsistency should not be involved.

Your Environments (required)
A private branch that was forked from the master branch a long time ago, but the related code looks the same as on the master branch.

How To Reproduce(required)
No idea; it happens occasionally.
In my case, it happens when storaged is under heavy load from a write stress test.

Expected behavior
A massive RaftLog rollback should not be triggered, and storaged should not become unresponsive for seconds.

Additional context
I've taken a look at the RaftPart implementation and have some thoughts about the issue.

A large number of follower Raft parts are rolling back, yet the leaders of these parts have not changed, so this is probably not caused by log inconsistency.
From reading the source code, and based on my understanding, the following case may cause the rollback:

1. A previous AppendLog was sent to the Follower with: LogEntries [100, 103], commitId: 99
2. Due to a network problem or something else, the response was lost or did not return to the Leader in time. As a result, the Leader's `last_log_id_sent` is still 99
3. Next time, the Leader sends AppendLog to the Follower with: LogEntries [100, 110], commitId: 99
4. At this point, although [100, 103] in the local WAL file is consistent with the Leader, rollbackToLog 102 will still be triggered.
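The four steps above can be modeled with a small, self-contained sketch. It is not NebulaGraph's code: `LogEntry`, `FollowerWal`, `firstConflict`, `appendLogs`, and `makeRange` are hypothetical names, and the WAL is reduced to an in-memory vector. It only illustrates the general idea of comparing the overlapping entries by term and rolling back from the first real conflict, instead of rolling back whenever a request overlaps the local log.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical, simplified model of a follower's WAL (not the real RaftPart types).
struct LogEntry {
    int64_t id;
    int64_t term;
};

struct FollowerWal {
    std::vector<LogEntry> logs;  // contiguous, ascending ids
    int rollbacks = 0;           // number of rollbacks performed, for illustration

    int64_t lastLogId() const { return logs.empty() ? 0 : logs.back().id; }

    // Drop every entry whose id is greater than keepId.
    void rollbackTo(int64_t keepId) {
        while (!logs.empty() && logs.back().id > keepId) logs.pop_back();
        ++rollbacks;
    }
};

// Returns the id of the first incoming entry whose term differs from the local
// entry with the same id, or nullopt if the overlapping range is identical.
std::optional<int64_t> firstConflict(const FollowerWal& wal,
                                     const std::vector<LogEntry>& incoming) {
    if (wal.logs.empty()) return std::nullopt;
    const int64_t firstLocal = wal.logs.front().id;
    for (const auto& e : incoming) {
        if (e.id < firstLocal) continue;    // older than anything kept locally
        if (e.id > wal.lastLogId()) break;  // past the overlap
        const auto& local = wal.logs[static_cast<std::size_t>(e.id - firstLocal)];
        if (local.term != e.term) return e.id;
    }
    return std::nullopt;
}

// Append incoming entries; roll back only if the overlap really conflicts.
void appendLogs(FollowerWal& wal, const std::vector<LogEntry>& incoming) {
    // Assumption of this sketch: a request never starts before the local WAL.
    assert(incoming.empty() || wal.logs.empty() ||
           incoming.front().id >= wal.logs.front().id);
    if (auto conflict = firstConflict(wal, incoming)) {
        wal.rollbackTo(*conflict - 1);
    }
    for (const auto& e : incoming) {
        if (e.id <= wal.lastLogId()) continue;  // already present and consistent
        assert(wal.logs.empty() || e.id == wal.lastLogId() + 1);  // no gaps
        wal.logs.push_back(e);
    }
}

// Helper that builds entries [first, last] with a single term.
std::vector<LogEntry> makeRange(int64_t first, int64_t last, int64_t term) {
    std::vector<LogEntry> out;
    for (int64_t id = first; id <= last; ++id) out.push_back({id, term});
    return out;
}
```

In this model, a follower that already holds entries up to 103 and then receives `makeRange(100, 110, 1)` with matching terms appends only 104 to 110 and performs no rollback. A rollback happens only when an overlapping entry carries a different term.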

The corresponding code is:

If this is the case, a simple solution might be:

UndefinedSy · 2024-05-11T16:12:03Z

I think this also leads to a parallel rollback issue when a crashed storage instance restarts, which we have observed in our production environment. Here is the sequence of events:

1. The crashed instance has some logs that were written to the WAL but not committed.
2. Upon recovery, the instance receives a new appendLog request that contains some overlapping log entries.
3. It then performs a rollback, which takes a lot of CPU time and eventually causes more problems.

UndefinedSy added the type/bug Type: something is unexpected label May 11, 2024

github-actions bot added affects/none PR/issue: this bug affects none version. severity/none Severity of bug labels May 11, 2024

wey-gu mentioned this issue May 18, 2024

Weekly Report 2024-05-17 vesoft-inc/nebula-community#436

Open

UndefinedSy mentioned this issue Jul 5, 2024

fix. raft follower will rollback itself when it misses a certain log #5905

Open

11 tasks
