Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[v23.3.x] storage: fix windowed compaction error handling #16940

Merged

Conversation

andrwng
Copy link
Contributor

@andrwng andrwng commented Mar 7, 2024

Backport of #16928

Fixes #16937

Also pulls in 5ad9e14

CONFLICT:

  • some reducer code in this branch has a default argument, but not on dev

We were previously using the l-value-variant of
model::record_batch_reader::consume(), which doesn't have a built-in
call to finally(). For a log_reader, this meant that in some scenarios
(e.g. if the reducer were to fail) we could end up not calling
finally(), and therefore wouldn't clear the underlying segment reader.

This is exactly what happened, as we hit:

std::runtime_error (lz4f_compressend:ERROR_dstMaxSize_tooSmall)

errors as we compressed/decompressed batches, and then ultimately
crashed because the log_reader was not closed:

Assert failure: (.../redpanda/redpanda/src/v/storage/log_reader.h:141) '!_iterator.reader' log reader destroyed with live reader"

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x

Release Notes

Bug Fixes

  • Fixes a bug in windowed compaction that could cause Redpanda to crash when an error occurs while reading batches.

bharathv and others added 5 commits March 7, 2024 10:11
The staging file was missing a '.'. Note that these files are transient,
so there isn't a compactibility concern here.

(cherry picked from commit 62fd4f8)
CONFLICT:
- kept default value for segment_last_offset in compaction_reducers

An upcoming test will inject an exception to simulate a failure case
seen in the wild that is otherwise hard to reproduce.

I'd considered using the file_sanitizer-based error injection, but it
seems that infra only really shines at making Redpanda hit asserts, so
it's not quite what I'm looking for (my upcoming test will assert that
we _don't_ crash in a particular error scenario).

(cherry picked from commit 13f5537)
This will be useful to test methods that need a feature table (e.g.
deduplicate_segments() has a feature table as an argument).

(cherry picked from commit 958aaf7)
We were previously using the l-value-variant of
model::record_batch_reader::consume(), which doesn't have a built-in
call to finally(). For a log_reader, this meant that in some scenarios
(e.g. if the reducer were to fail) we could end up not calling
finally(), and therefore wouldn't clear the underlying segment reader.

This is exactly what happened, as we hit:

std::runtime_error (lz4f_compressend:ERROR_dstMaxSize_tooSmall)

errors as we compressed/decompressed batches, and then ultimately
crashed because the log_reader was not closed:

Assert failure: (.../redpanda/redpanda/src/v/storage/log_reader.h:141) '!_iterator.reader' log reader destroyed with live reader"

(cherry picked from commit cc06447)
@andrwng andrwng requested a review from dotnwat March 7, 2024 18:18
@andrwng andrwng requested a review from andijcr March 7, 2024 18:18
@andrwng andrwng added this to the v23.3.7 milestone Mar 7, 2024
return with_extension("log.compaction.compaction_index");
return with_extension(".log.compaction.compaction_index");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does the affect the ability of storage to clean-up the transient files? i'm not sure what kind of wildcard / pattern it uses to identify them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would if we cleaned up leftover files at start up, but I don't think we do currently. We only clean them up at runtime after we compact

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to merge? If so, please do. It's a thrill :D

@andrwng andrwng merged commit c0f3555 into redpanda-data:v23.3.x Mar 8, 2024
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants