[v23.3.x] storage: fix windowed compaction error handling #16940

andrwng · 2024-03-07T18:18:00Z

Backport of #16928

Also pulls in 5ad9e14

CONFLICT:

some reducer code in this branch has a default argument, but not on dev

We were previously using the l-value-variant of
model::record_batch_reader::consume(), which doesn't have a built-in
call to finally(). For a log_reader, this meant that in some scenarios
(e.g. if the reducer were to fail) we could end up not calling
finally(), and therefore wouldn't clear the underlying segment reader.

This is exactly what happened, as we hit:

std::runtime_error (lz4f_compressend:ERROR_dstMaxSize_tooSmall)

errors as we compressed/decompressed batches, and then ultimately
crashed because the log_reader was not closed:

Assert failure: (.../redpanda/redpanda/src/v/storage/log_reader.h:141) '!_iterator.reader' log reader destroyed with live reader"

Backports Required

Release Notes

Bug Fixes

Fixes a bug in windowed compaction that could cause Redpanda to crash when an error occurs while reading batches.

to be used by the next commit.

The staging file was missing a '.'. Note that these files are transient, so there isn't a compactibility concern here. (cherry picked from commit 62fd4f8)

CONFLICT: - kept default value for segment_last_offset in compaction_reducers An upcoming test will inject an exception to simulate a failure case seen in the wild that is otherwise hard to reproduce. I'd considered using the file_sanitizer-based error injection, but it seems that infra only really shines at making Redpanda hit asserts, so it's not quite what I'm looking for (my upcoming test will assert that we _don't_ crash in a particular error scenario). (cherry picked from commit 13f5537)

This will be useful to test methods that need a feature table (e.g. deduplicate_segments() has a feature table as an argument). (cherry picked from commit 958aaf7)

We were previously using the l-value-variant of model::record_batch_reader::consume(), which doesn't have a built-in call to finally(). For a log_reader, this meant that in some scenarios (e.g. if the reducer were to fail) we could end up not calling finally(), and therefore wouldn't clear the underlying segment reader. This is exactly what happened, as we hit: std::runtime_error (lz4f_compressend:ERROR_dstMaxSize_tooSmall) errors as we compressed/decompressed batches, and then ultimately crashed because the log_reader was not closed: Assert failure: (.../redpanda/redpanda/src/v/storage/log_reader.h:141) '!_iterator.reader' log reader destroyed with live reader" (cherry picked from commit cc06447)

andrwng · 2024-03-08T01:15:52Z

CI failures:

dotnwat · 2024-03-08T03:18:53Z

src/v/storage/fs_utils.cc

-        return with_extension("log.compaction.compaction_index");
+        return with_extension(".log.compaction.compaction_index");


does the affect the ability of storage to clean-up the transient files? i'm not sure what kind of wildcard / pattern it uses to identify them.

It would if we cleaned up leftover files at start up, but I don't think we do currently. We only clean them up at runtime after we compact

Good to merge? If so, please do. It's a thrill :D

bharathv and others added 5 commits March 7, 2024 10:11

compaction: plumb feature table into self_compact_segment

39ff812

to be used by the next commit.

storage: fix compaction index staging file namme

6a2cc33

The staging file was missing a '.'. Note that these files are transient, so there isn't a compactibility concern here. (cherry picked from commit 62fd4f8)

storage: expose disk_log_builder::feature_table

0dad371

This will be useful to test methods that need a feature table (e.g. deduplicate_segments() has a feature table as an argument). (cherry picked from commit 958aaf7)

andrwng requested a review from dotnwat March 7, 2024 18:18

github-actions bot added the area/redpanda label Mar 7, 2024

andrwng requested a review from andijcr March 7, 2024 18:18

andrwng added this to the v23.3.7 milestone Mar 7, 2024

dotnwat approved these changes Mar 8, 2024

View reviewed changes

andrwng merged commit c0f3555 into redpanda-data:v23.3.x Mar 8, 2024
19 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[v23.3.x] storage: fix windowed compaction error handling #16940

[v23.3.x] storage: fix windowed compaction error handling #16940

andrwng commented Mar 7, 2024 •

edited

Loading

andrwng commented Mar 8, 2024

dotnwat Mar 8, 2024

andrwng Mar 8, 2024

piyushredpanda Mar 8, 2024

		return with_extension("log.compaction.compaction_index");
		return with_extension(".log.compaction.compaction_index");

[v23.3.x] storage: fix windowed compaction error handling #16940

[v23.3.x] storage: fix windowed compaction error handling #16940

Conversation

andrwng commented Mar 7, 2024 • edited Loading

Backports Required

Release Notes

Bug Fixes

andrwng commented Mar 8, 2024

dotnwat Mar 8, 2024

Choose a reason for hiding this comment

andrwng Mar 8, 2024

Choose a reason for hiding this comment

piyushredpanda Mar 8, 2024

Choose a reason for hiding this comment

andrwng commented Mar 7, 2024 •

edited

Loading