Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CQ: Fix shared store scanner missing messages #12392

Merged
merged 1 commit into from
Oct 8, 2024
Merged

Conversation

lhoguin
Copy link
Contributor

@lhoguin lhoguin commented Sep 26, 2024

Fix for #12367

@lhoguin
Copy link
Contributor Author

lhoguin commented Sep 26, 2024

This should be better. If not, I will fix tomorrow.

@lhoguin
Copy link
Contributor Author

lhoguin commented Sep 27, 2024

All good. Just need more testing and if possible some real world testing.

@michaelklishin
Copy link
Member

michaelklishin commented Sep 27, 2024

@lhoguin lhoguin force-pushed the loic-fix-cq-scan branch 6 times, most recently from dd8870c to d863c71 Compare October 1, 2024 07:26
@lhoguin
Copy link
Contributor Author

lhoguin commented Oct 1, 2024

It was still possible, although rare, to have message store
files lose message data, when the following conditions were
met:

 * the message data contains byte values 255
   (255 is used as an OK marker after a message)
 * the message is located after a 0-filled hole in the file
 * the length of the data is at least 4096 bytes and
   if we misread it (as detailed below) we encounter
   a 255 byte where we expect the OK marker

The trick for the code to previously misread the length can
be explained as follow:

A message is stored in the following format:

  <<Len:64, MsgIdAndMsg:Len/unit:8, 255>>

With MsgId always being 16 bytes in length. So Len is always
at least 16, if the message data Msg is empty. But technically
it never is.

Now if we have a zero filled hole just before this message,
we may end up with this:

  <<0, Len:64, MsgIdAndMsg:Len/unit:8, 255>>

When we are scanning we are testing bytes to see if there is
a message there or not. We look for a Len that gives us byte
255 after MsgIdAndMsg.

Len of value 4096 looks like this in binary:

  <<0:48, 16, 0>>

Problem is if we have leading zeroes, Len may look like this:

  <<0, 0:48, 16, 0>>

If we take the first 64 bits we get a potential length of 16.
We look at the byte after the next 16 bytes. If it is 255, we
think this is a message and skip by this amount of bytes, and
mistakenly miss the real message.

Solving this by changing the file format would be simple enough,
but we don't have the luxury to afford that. A different solution
was found, which is to combine file scanning with checking that
the message exists in the message store index (populated from
queues at startup, and kept up to date over the life time of
the store). Then we know for sure that the message above
doesn't exist, because the MsgId won't be found in the index.
If it is, then the file number and offset will not match,
and the check will fail.

There remains a small chance that we get it wrong during dirty
recovery. Only a better file format would improve that.
@lhoguin lhoguin changed the title DO NOT MERGE CQ: Fix shared store scanner missing messages CQ: Fix shared store scanner missing messages Oct 7, 2024
@lhoguin lhoguin marked this pull request as ready for review October 7, 2024 11:27
@lhoguin
Copy link
Contributor Author

lhoguin commented Oct 7, 2024

Rebased and marked ready for review. But I recommend merging this after #12425 just to be sure that everything is green.

@michaelklishin
Copy link
Member

Two positive reports from two users in #12367:

  1. One had an exception before roughly every 18-24 hours ("every night"), not any more after 24 hours
  2. Another no longer sees the exception after several days in a few clusters

@michaelklishin
Copy link
Member

#12425 was merged.

@michaelklishin michaelklishin merged commit 6feca5f into main Oct 8, 2024
399 checks passed
@michaelklishin michaelklishin deleted the loic-fix-cq-scan branch October 8, 2024 10:49
@michaelklishin
Copy link
Member

@Mergifyio backport v4.0.x

Copy link

mergify bot commented Oct 8, 2024

backport v4.0.x

✅ Backports have been created

michaelklishin added a commit that referenced this pull request Oct 8, 2024
CQ: Fix shared store scanner missing messages (backport #12392)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants