Stop checkpoint validation when encountering a valid checkpoint #463

Merged (3 commits) Aug 14, 2024

Conversation

@the-mikedavis (Member) commented Jul 25, 2024

@mkuratczyk noticed that with many QQs on the qq-v4 branch, each with many checkpoints, we spend a fair amount of effort reading the checkpoints during recovery. This is because `ra_snapshot:find_checkpoints/1` uses the `ra_snapshot:validate/1` callback to ensure that each checkpoint is valid, and `validate/1` is somewhat expensive in `ra_log_snapshot` since it fully reads and decodes the checkpoint, discarding the result.

Not all of this validation is necessary: we can stop validating checkpoints once we find the latest checkpoint which is valid. This is likely to be good enough. I've also updated `find_checkpoints/1` to stop its search when it finds a checkpoint with a lower index than the current snapshot, as any checkpoints below the snapshot index won't be used for promotion and should be removed. For many QQs with many checkpoints each, this should save some I/O and memory.
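The "stop validating at the first valid checkpoint" scan can be sketched roughly like this. This is NOT the actual `ra_snapshot` code: the `make_name/2` helper, the `{Idx, Term}` list shape, and the module layout are hypothetical; only the `Module:validate/1` callback comes from the PR description.

```erlang
-module(checkpoint_scan).
-export([find_checkpoints/3]).

%% Scan checkpoints from newest to oldest, calling the (possibly
%% expensive) Module:validate/1 callback only until the first valid
%% checkpoint is found.
find_checkpoints(Module, Dir, IdxTerms) ->
    scan(Module, Dir, IdxTerms, false, []).

scan(_Module, _Dir, [], _FoundValid, Acc) ->
    lists:reverse(Acc);
scan(Module, Dir, [{Idx, Term} | Rest], true, Acc) ->
    %% A newer checkpoint already validated: assume older ones are fine too.
    scan(Module, Dir, Rest, true, [{Idx, Term} | Acc]);
scan(Module, Dir, [{Idx, Term} | Rest], false, Acc) ->
    case Module:validate(filename:join(Dir, make_name(Idx, Term))) of
        ok ->
            scan(Module, Dir, Rest, true, [{Idx, Term} | Acc]);
        {error, _} ->
            %% Corrupt or partially-written checkpoint: skip it, keep looking.
            scan(Module, Dir, Rest, false, Acc)
    end.

%% Hypothetical file-name scheme; ra uses its own encoding.
make_name(Idx, Term) ->
    integer_to_list(Idx) ++ "_" ++ integer_to_list(Term).
```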

@the-mikedavis force-pushed the md/checkpoint-defer-validation branch from 2cb471c to 0b45ffc on July 25, 2024 21:19
@the-mikedavis (Member, Author) commented Jul 29, 2024

I took some rough measurements with tprof from OTP 27. The gist is that the time and memory savings look pretty good: 1.62s down to 0.28s, and 178 million words of memory down to ~20 million, for `ra_snapshot:init/6` on a QQ's checkpoint directory (from the qq-v4 branch) with 5 million messages.

Results...

Queue created with `perf-test -qq -u qq -x 1 -y 0 -C 5000000 -c 3000`

Measured with:

`tprof:profile(fun() -> ra_snapshot:init(<<"uuid">>, ra_log_snapshot, "./snapshots", "./checkpoints", undefined, 3) end, #{type => call_time}).`

and #{type => call_memory} for the memory breakdowns.

This branch:

FUNCTION                                        CALLS  TIME (μs)  PER CALL  [    %]
...
erlang:universaltime_to_localtime/1                 6         69     11.50  [ 0.02]
prim_file:close_nif/1                              17         91      5.35  [ 0.03]
prim_file:list_dir_nif/1                            2         92     46.00  [ 0.03]
prim_file:read_nif/2                               34        155      4.56  [ 0.05]
file:file_name_1/2                               1037        192      0.19  [ 0.07]
filename:join1/4                                 1914        203      0.11  [ 0.07]
prim_file:open_nif/2                               17        291     17.12  [ 0.10]
erlang:crc32/1                                      1      13109  13109.00  [ 4.54]
prim_file:read_file_nif/1                           1      52475  52475.00  [18.18]
ra_log_snapshot:parse_snapshot/1                    1      93138  93138.00  [32.26]
erlang:binary_to_term/1                            19     128266   6750.84  [44.43]
                                                          288697            [100.0]

0.28s

FUNCTION                                CALLS     WORDS    PER CALL  [    %]
...
prim_file:internal_native2name/1           17      1122       66.00  [ 0.01]
file:file_name_1/2                       1037      2040        1.97  [ 0.01]
lists:reverse/2                            42      3726       88.71  [ 0.02]
filename:join1/4                         1914      3758        1.96  [ 0.02]
erlang:crc32/1                              1      7450     7450.00  [ 0.04]
erlang:binary_to_term/1                    19  19871694  1045878.63  [99.89]
                                               19892656              [100.0]

main:

FUNCTION                                        CALLS  TIME (μs)  PER CALL  [    %]
...
erlang:universaltime_to_localtime/1                 6         71     11.83  [ 0.00]
prim_file:list_dir_nif/1                            2         93     46.50  [ 0.01]
prim_file:close_nif/1                              17        165      9.71  [ 0.01]
file:file_name_1/2                               1037        168      0.16  [ 0.01]
prim_file:read_nif/2                               34        190      5.59  [ 0.01]
filename:join1/4                                 2890        249      0.09  [ 0.02]
prim_file:open_nif/2                               17        704     41.41  [ 0.04]
erlang:crc32/1                                     17     141498   8323.41  [ 8.72]
prim_file:read_file_nif/1                          17     154438   9084.59  [ 9.52]
ra_log_snapshot:parse_snapshot/1                   17     520271  30604.18  [32.08]
erlang:binary_to_term/1                            51     803040  15745.88  [49.51]
                                                         1621975            [100.0]

1.62s
FUNCTION                                    CALLS      WORDS    PER CALL  [    %]
...
prim_file:internal_native2name/1               17       1122       66.00  [ 0.00]
file:file_name_1/2                           1037       2040        1.97  [ 0.00]
lists:reverse/2                                57       5552       97.40  [ 0.00]
filename:join1/4                             2890       5678        1.96  [ 0.00]
erlang:crc32/1                                 17      66600     3917.65  [ 0.04]
erlang:binary_to_term/1                        51  177721626  3484737.76  [99.95]
                                                   177806402              [100.0]

@kjnilsson (Contributor) commented:

Other checkpoints we can validate during promotion and discard ones that fail.

For a quorum queue where consumers keep up with ingress, checkpoints are promoted very often. It would be nice not to have to do the validation work every time just because we optimised recovery. My thought was that once we'd found a valid checkpoint during recovery we'd assume all prior checkpoints are also valid. That should be roughly as good as promoting any other checkpoint.

The most likely way a checkpoint would become corrupted is if the server stopped hard during a write or fsync. Sure, there are other ways checkpoints could become corrupted, but at least we guard against the most likely one.

This refactors `ra_snapshot:find_checkpoints/1` to cut down on some
work when there are many checkpoints. We scan through the checkpoint
directories to find the first (latest) valid checkpoint we can use for
recovery. Then we can defer using the `ra_snapshot:validate/1` callback
(which can be somewhat expensive) for any older checkpoints.

We assume that any checkpoints older than the latest valid checkpoint
are valid. We expect that invalid checkpoints would be created when a
machine terminates hard and unexpectedly and may stop an in-progress
write or leave a checkpoint file unsynced. This should only affect some
number of the latest checkpoints though. Once we've found a checkpoint
file that is valid, checkpoints older than that should be fully written
and synchronized too.

We also bail out of the search when we find a checkpoint that has a
lower index than the current snapshot index. Those checkpoints cannot
be promoted and should be deleted. We scan through the checkpoints from
most recent to least recent, so when we find a checkpoint with an older
index than the snapshot, we delete that checkpoint and any older
checkpoints.
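The bail-out described in the last paragraph amounts to splitting the newest-first checkpoint list at the snapshot index. A minimal sketch, with hypothetical function and directory names (the real code's naming and deletion handling differ):

```erlang
%% Checkpoints is a newest-first list of {Idx, Term} pairs and SnapIdx is
%% the current snapshot index. The first checkpoint at or below SnapIdx,
%% and everything older, can never be promoted, so those directories are
%% deleted instead of validated.
prune_stale(SnapIdx, Dir, Checkpoints) ->
    {Live, Stale} =
        lists:splitwith(fun({Idx, _Term}) -> Idx > SnapIdx end, Checkpoints),
    [ok = file:del_dir_r(filename:join(Dir, checkpoint_dir(Idx, Term)))
     || {Idx, Term} <- Stale],
    Live.

%% Hypothetical directory-name scheme.
checkpoint_dir(Idx, Term) ->
    integer_to_list(Idx) ++ "_" ++ integer_to_list(Term).
```

Because the list is sorted newest-first, `lists:splitwith/2` stops scanning at the first stale entry, which is exactly the early exit the commit message describes.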
@the-mikedavis force-pushed the md/checkpoint-defer-validation branch from 0b45ffc to 48ecb89 on August 5, 2024 14:35
@the-mikedavis marked this pull request as ready for review on August 5, 2024 14:36
@kjnilsson (Contributor) left a comment:


a few minor things

Review threads on src/ra_snapshot.erl (resolved)
@kjnilsson changed the title from "Defer some checkpoint validation until promotion" to "Stop checkpoint validation when encountering a valid checkpoint" on Aug 14, 2024
@kjnilsson merged commit 4c5b409 into main on Aug 14, 2024
9 checks passed
@dumbbell added this to the 2.13.6 milestone on Aug 14, 2024
@michaelklishin deleted the md/checkpoint-defer-validation branch on August 14, 2024 15:23