segment_appender: truncate in addition to allocate #18598
Conversation
@@ -162,7 +162,7 @@ func NewCommand(fs afero.Fs, p *config.Params) *cobra.Command {
 	f.StringVar(&logsSince, "logs-since", "", "Include log entries on or newer than the specified date (journalctl date format, e.g. YYYY-MM-DD")
 	f.StringVar(&logsUntil, "logs-until", "", "Include log entries on or older than the specified date (journalctl date format, e.g. YYYY-MM-DD")
 	f.StringVar(&logsSizeLimit, "logs-size-limit", "100MiB", "Read the logs until the given size is reached (e.g. 3MB, 1GiB)")
-	f.StringVar(&controllerLogsSizeLimit, "controller-logs-size-limit", "20MB", "The size limit of the controller logs that can be stored in the bundle (e.g. 3MB, 1GiB)")
+	f.StringVar(&controllerLogsSizeLimit, "controller-logs-size-limit", "130MB", "The size limit of the controller logs that can be stored in the bundle (e.g. 3MB, 1GiB)")
?
When deciding whether to include the controller log in the debug bundle, rpk also looks at the logical file size. Given that on most systems the initial segment falloc/truncate step is 32MB, the controller log would basically never get included in the debug bundle anymore. Hence the need for the increase.
I think changing it to 130MB such that we at least include a single segment in case the controller log is large (128MiB is the default segment size) is good either way.
I do realize now, though, that the limit is actually in MB and not MiB, so I will need to change it to 132MB.
src/v/storage/segment_appender.cc
    _fallocation_offset += step;
    // ss::file::allocate does not adjust logical file size
    // hence we need to do that explicitly with an extra
    // truncate. This allows for more efficient writes.
    return _out.truncate(_fallocation_offset);
It seemed suspicious to set `_fallocation_offset` before truncating if thinking about the logical change as something that ought to be pushed into `ss::file::allocate()` in the first place.
Looking around, it doesn't look like it's too important, since this appears to only be used when we append (in case we need to fallocate, in which case we `co_await`) or when we truncate (which shouldn't overlap with appends, and therefore should also `co_await` the entire execution of `do_next_adaptive_fallocation()`).
Yeah, the important bit is that there are no concurrent appends.
I have changed it anyway as, like you say, it's the more logical order.
Also tagging Nicolae since he's spent a good amount of time in the area with write-caching.
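For reference, a minimal sketch of what the reordered fallocation step could look like: preallocate, adjust the logical size via truncate, and only then publish the new offset. The `ss::file::allocate`/`truncate` calls and member names come from the quoted snippet; the free-standing function, its signature, and the reference parameters are purely illustrative and not the actual segment_appender code.

```cpp
#include <cstdint>
#include <seastar/core/file.hh>
#include <seastar/core/future.hh>

// Illustrative only: in the real appender these would be members of the
// segment_appender; references are used here just to keep the sketch
// self-contained.
ss::future<> next_adaptive_fallocation_sketch(
  ss::file& out, uint64_t& fallocation_offset, uint64_t step) {
    const uint64_t new_offset = fallocation_offset + step;
    // Preallocates blocks but does not change the logical (stat) size.
    return out.allocate(fallocation_offset, step)
      .then([&out, new_offset] {
          // Extend the logical size explicitly so subsequent writes are not
          // treated as size-changing appends.
          return out.truncate(new_offset);
      })
      .then([&fallocation_offset, new_offset] {
          // Publish the new offset only after both calls have completed.
          // Safe because no appends run concurrently with this step.
          fallocation_offset = new_offset;
      });
}
```

Keeping the offset update last mirrors the ordering agreed on above and relies on the same assumption that nothing appends while the fallocation step is in flight.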
This is in preparation for also `truncate`ing segment files, which will adapt their logical size on disk. Note that this test is a bit different from how prod works: in reality segments get closed and truncated to their real size on roll, so only ever the last segment has unwritten blocks. This test doesn't work like that, hence we need to account for the unwritten blocks in all segments.
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/49598#018fbf16-947c-4a6f-a54e-550f8fcc8ad7
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/49598#018fbf16-9476-4ab4-a13f-8614dabbe27a
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/49630#018fc413-782a-4015-adee-8e25db5bc4e3
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/49630#018fc414-7b42-45c4-a3a5-2f197c56962c
LGTM. Did you notice an increase in test duration? This new compute storage doesn't seem very lightweight and I wonder if it would make more sense to adjust tests to fallocate less / not rely on that precise segment sizing information.
`release_appender_in_background` gave a bit of a headache when reasoning about correctness. Seems good. It holds the gate, which prevents closing and writing the "clean segment" offset before an actual flush+truncate with the proper size.
@@ -33,6 +74,14 @@ def safe_listdir(p: Path) -> list[Path]:
        return []


def compute_size_for_file(file: Path):
    if file.suffix == '.log':
        reader = SegmentReader(io.BytesIO(file.read_bytes()))
Wondering if it makes sense to have a fast path if the last byte is non-zero.
I can give it a try. Not sure whether we still leave zero pages on close/final-truncate.
Doesn't seem very noticeable. Maybe very little.
I am very open to ideas but can't immediately see how we could improve it. The size checking already feels like a simplifying hack (all bits could be garbled). I wouldn't want to change all the tests to use smaller segment sizes etc. as I prefer things to run in default configs where possible.
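For illustration only, a rough sketch of the fast path being discussed: if the last byte of a segment file is non-zero there is no zero padding at the tail, so the logical file size can be taken at face value; a zero last byte falls back to the full batch scan. This is shown in C++ purely as a sketch (the real helper is the Python `compute_size_for_file` in compute_storage.py), and a legitimately zero final data byte would still force the slow path.

```cpp
#include <cstdint>
#include <fstream>
#include <optional>
#include <string>

// Returns the segment's data size if the fast path applies, std::nullopt if a
// full scan of the batches is still required.
std::optional<uint64_t> fast_path_segment_size(const std::string& path) {
    std::ifstream f(path, std::ios::binary | std::ios::ate);
    if (!f) {
        return std::nullopt; // unreadable, let the caller decide
    }
    const std::streamoff size = f.tellg();
    if (size == 0) {
        return uint64_t{0}; // empty file, nothing to scan
    }
    char last = 0;
    f.seekg(size - 1);
    f.read(&last, 1);
    if (last != 0) {
        // No trailing zeros: the file was truncated to its real data size.
        return static_cast<uint64_t>(size);
    }
    // Trailing zero byte: could be falloc/truncate padding (or simply a zero
    // final byte), so fall back to parsing the batches.
    return std::nullopt;
}
```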
.then([this, step] {
    // ss::file::allocate does not adjust logical file size
    // hence we need to do that explicitly with an extra
    // truncate. This allows for more efficient writes.
Suggested change:
    // truncate. This allows for more efficient writes.
    // See https://github.com/redpanda-data/redpanda/pull/18598.
Because you put a very good analysis in the cover letter, it may make sense to link to the PR?
                        required=False)
    parser.add_argument('--segment-size',
                        type=str,
                        help='The redpanda data dir',
Help is wrong.
# NB: SegmentReader is duplicated in si_utils.py for deployment reasons. If
# making changes please adapt both.
"fingers crossed" that everyone obeys this
We used compute_storage.py for validation checks to confirm segments "are the same" after certain operations. It uses the logical file size for that. With the segment_appender now also adjusting the logical file size when we fallocate, this would break.

Make compute_storage actually read the segment files and parse the actual amount of batch data in there. We apply a small optimization where for fully closed/truncated segments we still use the file size and skip reading the full file.

We also switch `data_checksum` to use compute_storage.py and calculate the md5 sum in compute_storage.py. This avoids repeated invocations of python which can slow things down.
In the test we trigger a segment GC and expect all data to be effectively gone. This is still the case, but because the check was looking at the whole RP data dir, it would now pick up the controller and kvstore log files, which also have their logical file size adapted. Only look at the kafka dir.
Otherwise the controller log would now be skipped, because on the average system, where the default falloc step is 32MB, it would already be too big for the default 20MB limit. Note that this doesn't actually bloat the size of the debug bundle as any zero padding basically compresses perfectly anyway. 20MB also feels kinda low for something as important as the controller log.
In the segment_appender we adaptively increase the file size of log segments using `ss::file::allocate`. This however doesn't immediately do what's intended. Internally it calls `fallocate` with the `FALLOC_FL_KEEP_SIZE|FALLOC_FL_ZERO_RANGE` flags. The former leads to the logical file size not being extended. Extending the logical file size however seems to have been the intention of 7d77067. We do already handle zero bytes at the end fine either way as we are writing zero appended 4k chunks anyway.

Besides this side effect there are other reasons why we want to update the logical file size immediately.

First, this means that on every write we will need to update the file size, which makes both the write and the fsync more expensive: updating the file size is a metadata operation which needs to be journaled to the XFS in-memory log ("CIL") and then flushed out to disk when the inevitable fsync occurs immediately after. So adjusting the size costs CPU on the write and the fsync, and potentially also causes some additional IO on the fsync (shows up as write amplification), though this latter effect wasn't noticeable in testing.

Second, seastar considers XFS an append-challenged file system. This means that seastar will avoid having size-changing and non-size-changing operations outstanding at the same time. They will be queued internally in the ioqueue. Because of how `allocate` works above, all our writes will always be considered as "appending" as we never updated the logical file size. To optimize the queued operations seastar employs certain "optimizations" in `append_challenged_posix_file_impl::optimize_queue`, including issuing ftruncate calls when it observes the application issuing writes that will extend the logical file size. Because nearly all our writes are appending this causes a continuous stream of implicit ftruncate syscalls of about 100/s per shard.

```
root:/tmp# perf trace -t 11990 -s -e ftruncate -- sleep 5

 Summary of events:

 redpanda (11990), 1122 events, 100.0%

   syscall         calls  errors  total    min     avg     max     stddev
                                  (msec)   (msec)  (msec)  (msec)  (%)
   --------------- ------ ------- -------- ------- ------- ------- ------
   ftruncate          561       0   32.056   0.013   0.057   0.265  1.85%
```

To avoid the two aforementioned issues this patch adds an explicit `truncate` after our `allocate` call. Note that we still need both, as `ftruncate` alone doesn't preallocate blocks so it's a lot less performant on its own.

In a medium IOPS OMB workload we see a drop in p99 producer latency from ~14ms to ~10ms. Threaded fallbacks and hence steal time are down by about 5 and 10% respectively.

Further we can also see the `fsync` times distribution slightly improved:

Before:

```
root:/tmp# xfsdist-bpfcc 5 1
Tracing XFS operation latency... Hit Ctrl-C to end.

14:07:37:

operation = b'fsync'
     usecs           : count    distribution
         0 -> 1      : 0        |                                        |
         2 -> 3      : 55       |                                        |
         4 -> 7      : 11872    |*****                                   |
         8 -> 15     : 38206    |*****************                       |
        16 -> 31     : 18405    |********                                |
        32 -> 63     : 11474    |*****                                   |
        64 -> 127    : 29074    |*************                           |
       128 -> 255    : 88607    |****************************************|
       256 -> 511    : 27403    |************                            |
       512 -> 1023   : 557      |                                        |
      1024 -> 2047   : 28       |                                        |
      2048 -> 4095   : 23       |                                        |
```

After:

```
root:/tmp# xfsdist-bpfcc 5 1
Tracing XFS operation latency... Hit Ctrl-C to end.

13:57:45:

operation = b'fsync'
     usecs           : count    distribution
         0 -> 1      : 0        |                                        |
         2 -> 3      : 56       |                                        |
         4 -> 7      : 15472    |*******                                 |
         8 -> 15     : 45440    |***********************                 |
        16 -> 31     : 22825    |***********                             |
        32 -> 63     : 13378    |******                                  |
        64 -> 127    : 31979    |****************                        |
       128 -> 255    : 77529    |****************************************|
       256 -> 511    : 18116    |*********                               |
       512 -> 1023   : 250      |                                        |
      1024 -> 2047   : 1        |                                        |
```

Further we can also observe the logical file size adapting during a run:

Before:

```
root:/tmp# for i in {0..5} ; do ll /var/lib/redpanda/data/kafka/test-topic-zUv2_Hs-0000/0_32/ | grep log ; sleep 1 ; done
-rw-r--r-- 1 redpanda redpanda 37502976 May 21 10:10 0-1-v1.log
-rw-r--r-- 1 redpanda redpanda 39337984 May 21 10:10 0-1-v1.log
-rw-r--r-- 1 redpanda redpanda 41267200 May 21 10:10 0-1-v1.log
-rw-r--r-- 1 redpanda redpanda 43479040 May 21 10:10 0-1-v1.log
-rw-r--r-- 1 redpanda redpanda 45416448 May 21 10:10 0-1-v1.log
-rw-r--r-- 1 redpanda redpanda 47005696 May 21 10:10 0-1-v1.log
```

After:

```
root:/tmp# for i in {0..5} ; do ll /var/lib/redpanda/data/kafka/test-topic-Yk-n3mY-0000/0_35/ | grep log ; sleep 1 ; done
-rw-r--r-- 1 redpanda redpanda 67108864 May 21 10:22 0-1-v1.log
-rw-r--r-- 1 redpanda redpanda 67108864 May 21 10:22 0-1-v1.log
-rw-r--r-- 1 redpanda redpanda 67108864 May 21 10:22 0-1-v1.log
-rw-r--r-- 1 redpanda redpanda 67108864 May 21 10:22 0-1-v1.log
-rw-r--r-- 1 redpanda redpanda 67108864 May 21 10:22 0-1-v1.log
-rw-r--r-- 1 redpanda redpanda 67108864 May 21 10:22 0-1-v1.log
```

Note that long term we will want to improve/change seastar to allow doing a "full" fallocate, but just adding the truncate is the easier change for now.

Fixes redpanda-data/core-internal#1293
I have pushed some changes now that speed up the size computation.
@piyushredpanda I wanted to hold off on merging this until the CPD etc. is in place.
Oh dang, sorry, @StephanDollberg. Could I request having a DNM on the PR then? Should we revert?
Ah yeah sure good point, will add next time.
Nah, not worth the hassle. I think we should have a few data points at least now since we made the throughput changes.
In the segment_appender we adaptively increase the file size of log segments using `ss::file::allocate`. This however doesn't immediately do what's intended. Internally it calls `fallocate` with the `FALLOC_FL_KEEP_SIZE|FALLOC_FL_ZERO_RANGE` flags. The former leads to the logical file size not being extended.

Extending the logical file size however seems to have been the intention of 7d77067. We do already handle zero bytes at the end fine either way as we are writing zero appended 4k chunks anyway.
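As a standalone illustration of the flag behavior described above (plain Linux syscalls rather than the seastar wrapper; the file name and the 32MiB step are arbitrary), a small program like the following would show `st_size` staying at zero after the `fallocate` while `st_blocks` already grows, and only the follow-up `ftruncate` extending the logical size:

```cpp
// Sketch only; compile with g++ on Linux (fallocate is a Linux-specific call).
#include <fcntl.h>
#include <linux/falloc.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>

int main() {
    int fd = ::open("demo.log", O_CREAT | O_RDWR | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const off_t step = 32 << 20; // e.g. a 32MiB fallocation step

    // Preallocate and zero the range, but keep the logical size unchanged.
    if (::fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_ZERO_RANGE, 0, step) != 0) {
        perror("fallocate");
        return 1;
    }
    struct stat st{};
    ::fstat(fd, &st);
    std::printf("after fallocate: st_size=%jd st_blocks=%jd\n",
                (intmax_t)st.st_size, (intmax_t)st.st_blocks);

    // The extra truncate is what actually extends the logical file size.
    if (::ftruncate(fd, step) != 0) { perror("ftruncate"); return 1; }
    ::fstat(fd, &st);
    std::printf("after ftruncate: st_size=%jd st_blocks=%jd\n",
                (intmax_t)st.st_size, (intmax_t)st.st_blocks);

    ::close(fd);
    return 0;
}
```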
Besides this side effect there are other reasons why we want to update the logical file size immediately.

First, this means that on every write we will need to update the file size, which makes both the write and the fsync more expensive: updating the file size is a metadata operation which needs to be journaled to the XFS in-memory log ("CIL") and then flushed out to disk when the inevitable fsync occurs immediately after. So adjusting the size costs CPU on the write and the fsync, and potentially also causes some additional IO on the fsync (shows up as write amplification), though this latter effect wasn't noticeable in testing.

Second, seastar considers XFS an append-challenged file system. This means that seastar will avoid having size-changing and non-size-changing operations outstanding at the same time. They will be queued internally in the ioqueue. Because of how `allocate` works above, all our writes will always be considered as "appending" as we never updated the logical file size. To optimize the queued operations seastar employs certain "optimizations" in `append_challenged_posix_file_impl::optimize_queue`, including issuing `ftruncate` calls when it observes the application issuing writes that will extend the logical file size. Because nearly all our writes are appending this causes a continuous stream of implicit `ftruncate` syscalls of about 100/s per shard.

To avoid the two aforementioned issues this patch adds an explicit `truncate` after our `allocate` call. Note that we still need both, as `ftruncate` alone doesn't preallocate blocks so it's a lot less performant on its own.

In a medium IOPS OMB workload we see a drop in p99 producer latency from ~14ms to ~10ms. Threaded fallbacks and hence steal time are down by about 5 and 10% respectively.

Further we can also see the `fsync` times distribution slightly improved:

Before:

After:

Further we can also observe the logical file size adapting during a run:

Before:

After:

Note that long term we will want to improve/change seastar to allow doing a "full" fallocate, but just adding the truncate is the easier change for now.
Fixes https://github.com/redpanda-data/core-internal/issues/1293
Backports Required
Release Notes