
segment_appender: truncate in addition to allocate #18598

Merged: 5 commits into dev, Jun 7, 2024

Conversation

@StephanDollberg (Member) commented May 21, 2024:

In the segment_appender we adaptively increase the file size of log
segments using `ss::file::allocate`.

This however doesn't immediately do what's intended. Internally it calls
`fallocate` with the `FALLOC_FL_KEEP_SIZE|FALLOC_FL_ZERO_RANGE` flags.
The former leads to the logical file size not being extended.
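
As a standalone illustration with plain Linux syscalls (a sketch, not Redpanda or seastar code; the file name and size are made up), this is the behaviour in question: `fallocate` with `FALLOC_FL_KEEP_SIZE` reserves zeroed blocks but leaves `st_size` untouched, and only an explicit `ftruncate` extends the logical size:

```cpp
#include <fcntl.h>
#include <linux/falloc.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main() {
    // Error handling omitted for brevity.
    int fd = ::open("demo.log", O_CREAT | O_RDWR | O_TRUNC, 0644);
    struct stat st{};

    // Preallocate 32 MiB of zeroed blocks; with FALLOC_FL_KEEP_SIZE the
    // logical file size (st_size) stays at 0.
    ::fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_ZERO_RANGE, 0, 32 << 20);
    ::fstat(fd, &st);
    std::printf("after fallocate: st_size=%lld\n", (long long)st.st_size); // 0

    // Only an explicit ftruncate extends the logical size; the blocks stay
    // preallocated.
    ::ftruncate(fd, 32 << 20);
    ::fstat(fd, &st);
    std::printf("after ftruncate: st_size=%lld\n", (long long)st.st_size); // 33554432

    ::close(fd);
    return 0;
}
```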

Extending the logical file size, however, seems to have been the intention
of 7d77067. We already handle trailing zero
bytes fine either way, as we are writing zero-padded 4k
chunks anyway.

Besides this side effect, there are other reasons why we want to update
the logical file size immediately.

First, without it every write needs to update the file
size, which makes both the write and the fsync more expensive: updating
the file size is a metadata operation that needs to be journaled to the XFS
in-memory log ("CIL") and then flushed out to disk when the inevitable fsync
occurs immediately after. So adjusting the size costs CPU on both the write
and the fsync, and potentially also causes some additional IO on the fsync
(showing up as write amplification), though this latter effect wasn't noticeable
in testing.

Second, seastar considers XFS an append-challenged file system. This
means that seastar will avoid having size-changing and non-size-changing
operations outstanding at the same time; they will be queued internally
in the ioqueue. Because of how `allocate` works (see above), all our writes
are always considered "appending", as we never update the logical
file size.

To optimize the queued operations seastar employs certain
"optimizations" in `append_challenged_posix_file_impl::optimize_queue`,
including issuing `ftruncate` calls when it observes the application
issuing writes that will extend the logical file size.
Because nearly all our writes are appending, this causes a continuous stream
of implicit `ftruncate` syscalls of about 100/s per shard.

```
root:/tmp# perf trace -t 11990 -s -e ftruncate -- sleep 5

 Summary of events:

 redpanda (11990), 1122 events, 100.0%

   syscall            calls  errors  total       min       avg       max       stddev
                                     (msec)    (msec)    (msec)    (msec)        (%)
   --------------- --------  ------ -------- --------- --------- ---------     ------
   ftruncate            561      0    32.056     0.013     0.057     0.265      1.85%
```

To avoid the two aforementioned issues this patch adds an explicit
`truncate` after our `allocate` call. Note that we still need both, as
`ftruncate` alone doesn't preallocate blocks, so it's a lot less
performant on its own.
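
Concretely, the change pairs the two calls. A minimal sketch against the seastar file API (function and variable names here are illustrative, not the exact Redpanda code):

```cpp
#include <seastar/core/file.hh>
#include <seastar/core/future.hh>
#include <cstdint>

namespace ss = seastar;

// Sketch only: preallocate the next falloc step, then extend the logical
// file size so subsequent appends are no longer size-changing operations
// from the kernel's and seastar's point of view.
ss::future<> adaptive_fallocate_sketch(ss::file out, uint64_t fallocation_offset,
                                       uint64_t step) {
    // Reserves blocks but leaves st_size untouched (seastar issues fallocate
    // with FALLOC_FL_KEEP_SIZE | FALLOC_FL_ZERO_RANGE under the hood).
    return out.allocate(fallocation_offset, step)
      .then([out, fallocation_offset, step]() mutable {
          // The explicit truncate extends the logical size over the newly
          // preallocated range.
          return out.truncate(fallocation_offset + step);
      });
}
```

In the actual patch the same pairing lives in the appender's adaptive fallocation path; the review excerpt further down shows the real diff.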

In a medium IOPS OMB workload we see a drop in p99 producer latency from
~14ms to ~10ms. Threaded fallbacks and hence steal time are down by
about 5 and 10% respectively.

Further, we can also see that the `fsync` time distribution slightly
improved:

Before:

```
root:/tmp# xfsdist-bpfcc 5 1
Tracing XFS operation latency... Hit Ctrl-C to end.

14:07:37:

operation = b'fsync'
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 55       |                                        |
         4 -> 7          : 11872    |*****                                   |
         8 -> 15         : 38206    |*****************                       |
        16 -> 31         : 18405    |********                                |
        32 -> 63         : 11474    |*****                                   |
        64 -> 127        : 29074    |*************                           |
       128 -> 255        : 88607    |****************************************|
       256 -> 511        : 27403    |************                            |
       512 -> 1023       : 557      |                                        |
      1024 -> 2047       : 28       |                                        |
      2048 -> 4095       : 23       |
```

After:

```
root:/tmp# xfsdist-bpfcc 5 1
Tracing XFS operation latency... Hit Ctrl-C to end.

13:57:45:

operation = b'fsync'
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 56       |                                        |
         4 -> 7          : 15472    |*******                                 |
         8 -> 15         : 45440    |***********************                 |
        16 -> 31         : 22825    |***********                             |
        32 -> 63         : 13378    |******                                  |
        64 -> 127        : 31979    |****************                        |
       128 -> 255        : 77529    |****************************************|
       256 -> 511        : 18116    |*********                               |
       512 -> 1023       : 250      |                                        |
      1024 -> 2047       : 1        |
```

Further, we can also observe the logical file size adapting during a run:

Before:

```
root:/tmp# for i in {0..5} ; do ll /var/lib/redpanda/data/kafka/test-topic-zUv2_Hs-0000/0_32/ | grep log ; sleep 1 ; done
-rw-r--r--   1 redpanda redpanda 37502976 May 21 10:10 0-1-v1.log
-rw-r--r--   1 redpanda redpanda 39337984 May 21 10:10 0-1-v1.log
-rw-r--r--   1 redpanda redpanda 41267200 May 21 10:10 0-1-v1.log
-rw-r--r--   1 redpanda redpanda 43479040 May 21 10:10 0-1-v1.log
-rw-r--r--   1 redpanda redpanda 45416448 May 21 10:10 0-1-v1.log
-rw-r--r--   1 redpanda redpanda 47005696 May 21 10:10 0-1-v1.log
```

After:

```
root:/tmp# for i in {0..5} ; do ll /var/lib/redpanda/data/kafka/test-topic-Yk-n3mY-0000/0_35/ | grep log ; sleep 1 ; done
-rw-r--r--   1 redpanda redpanda 67108864 May 21 10:22 0-1-v1.log
-rw-r--r--   1 redpanda redpanda 67108864 May 21 10:22 0-1-v1.log
-rw-r--r--   1 redpanda redpanda 67108864 May 21 10:22 0-1-v1.log
-rw-r--r--   1 redpanda redpanda 67108864 May 21 10:22 0-1-v1.log
-rw-r--r--   1 redpanda redpanda 67108864 May 21 10:22 0-1-v1.log
-rw-r--r--   1 redpanda redpanda 67108864 May 21 10:22 0-1-v1.log
```

Note that long term we will want to improve/change seastar to allow doing
a "full" fallocate, but just adding the truncate is the easier change for
now.
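
(For reference, a "full" fallocate at the plain syscall level would just drop `FALLOC_FL_KEEP_SIZE`, so a single call both preallocates the blocks and extends `st_size`; `fd` and `falloc_step` below are hypothetical names.)

```cpp
// Hypothetical single-call variant: without FALLOC_FL_KEEP_SIZE the zeroed
// range also extends the logical file size, so no separate ftruncate is needed.
::fallocate(fd, FALLOC_FL_ZERO_RANGE, 0, falloc_step);
```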

Fixes https://github.com/redpanda-data/core-internal/issues/1293

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.1.x
  • v23.3.x
  • v23.2.x

Release Notes

  • none

@StephanDollberg marked this pull request as draft May 21, 2024 17:38
@StephanDollberg force-pushed the stephan/truncate-plus-allocate branch from d474332 to 9cf6bdc (May 23, 2024 09:51)
@StephanDollberg force-pushed the stephan/truncate-plus-allocate branch from 9cf6bdc to 8271f51 (May 23, 2024 13:07)
@StephanDollberg force-pushed the stephan/truncate-plus-allocate branch from 4876478 to 0a71009 (May 23, 2024 17:54)
@StephanDollberg force-pushed the stephan/truncate-plus-allocate branch from 0a71009 to 90878d7 (May 24, 2024 08:14)
@StephanDollberg marked this pull request as ready for review May 24, 2024 08:14
@StephanDollberg requested review from andrwng, dotnwat and travisdowns and removed request for twmb, Deflaimun, r-vasquez, gene-redpanda and andrwng May 24, 2024 08:14
@StephanDollberg force-pushed the stephan/truncate-plus-allocate branch from 90878d7 to 498174c (May 24, 2024 09:41)
@@ -162,7 +162,7 @@ func NewCommand(fs afero.Fs, p *config.Params) *cobra.Command {
f.StringVar(&logsSince, "logs-since", "", "Include log entries on or newer than the specified date (journalctl date format, e.g. YYYY-MM-DD")
f.StringVar(&logsUntil, "logs-until", "", "Include log entries on or older than the specified date (journalctl date format, e.g. YYYY-MM-DD")
f.StringVar(&logsSizeLimit, "logs-size-limit", "100MiB", "Read the logs until the given size is reached (e.g. 3MB, 1GiB)")
f.StringVar(&controllerLogsSizeLimit, "controller-logs-size-limit", "20MB", "The size limit of the controller logs that can be stored in the bundle (e.g. 3MB, 1GiB)")
f.StringVar(&controllerLogsSizeLimit, "controller-logs-size-limit", "130MB", "The size limit of the controller logs that can be stored in the bundle (e.g. 3MB, 1GiB)")
Contributor commented:

?

Member Author (@StephanDollberg) replied:

When deciding whether to include the controller log in the debug bundle, rpk also looks at the logical file size. Given that on most systems the initial segment falloc/truncate step is 32MB, the controller log would basically never get included in the debug bundle anymore. Hence the need for the increase.

I think changing it to 130MB such that we at least include a single segment in case the controller log is large (128MiB is the default segment size) is good either way.

I do realize now though that that is actually in MB and not MiB so I will need to change it to 132MB.

@andrwng previously approved these changes May 24, 2024
Comment on lines 362 to 365:

```cpp
_fallocation_offset += step;
// ss::file::allocate does not adjust logical file size
// hence we need to do that explicitly with an extra
// truncate. This allows for more efficient writes.
return _out.truncate(_fallocation_offset);
```
Contributor commented:

It seemed suspicious to set _fallocation_offset before truncating if thinking about the logical change as something that ought to be pushed into ss::file::allocate() in the first place.

Looking around, it doesn't look like it's too important since this appears to only be used when we append (in case we need to fallocate, in which case we co_await) or when we truncate (which shouldn't overlap with appends, and therefore should also co_await the entire execution of do_next_adaptive_fallocation()).

Member Author (@StephanDollberg) replied:

Yeah, the important bit is that there are no concurrent appends.

I have changed it anyway since, as you say, it's the more logical order.

@andrwng requested a review from nvartolomei May 24, 2024 19:23
@andrwng (Contributor) commented May 24, 2024:

Also tagging Nicolae since he's spent a good amount of time in the area with write-caching.

This is in preparation for also `truncate`ing segment files which will
adapt their logical size on disk.

Note that this test is a bit different from how prod works: in reality
segments get closed and truncated to their real size on roll, so only the
last segment ever has unwritten blocks. This test doesn't work like that,
hence we need to account for the unwritten blocks in all segments.
@nvartolomei (Contributor) left a comment:

LGTM. Did you notice an increase in test duration? This new compute storage doesn't seem very lightweight, and I wonder if it would make more sense to adjust tests to fallocate less / not rely on that precise segment sizing information.

`release_appender_in_background` gave a bit of a headache when reasoning about correctness. Seems good. It holds the gate, which prevents closing and writing the "clean segment" offset before an actual flush+truncate with the proper size.

@@ -33,6 +74,14 @@ def safe_listdir(p: Path) -> list[Path]:

```python
        return []


def compute_size_for_file(file: Path):
    if file.suffix == '.log':
        reader = SegmentReader(io.BytesIO(file.read_bytes()))
```
Contributor commented:

Wondering if it makes sense to have a fast path if the last byte is non-zero.

@StephanDollberg (Member Author) replied May 28, 2024:

I can give it a try. Not sure whether we still leave zero pages on close/final-truncate.

@StephanDollberg (Member Author) commented:

> LGTM. Did you notice an increase in test duration? This new compute storage doesn't seem very lightweight, and I wonder if it would make more sense to adjust tests to fallocate less / not rely on that precise segment sizing information.

Doesn't seem very noticeable. Maybe very little.

> more sense to adjust tests to fallocate less/not rely on that precise segment sizing information

I am very open to ideas but can't immediately see how we could improve it. The size checking already feels like a simplifying hack (all bits could be garbled). I wouldn't want to change all the tests to use smaller segment sizes etc. as I prefer things to run in default configs where possible.

@StephanDollberg force-pushed the stephan/truncate-plus-allocate branch 3 times, most recently from 89c48ac to 096004b (May 29, 2024 09:47)
```cpp
.then([this, step] {
// ss::file::allocate does not adjust logical file size
// hence we need to do that explicitly with an extra
// truncate. This allows for more efficient writes.
```
Member commented:

Suggested change:

```diff
-// truncate. This allows for more efficient writes.
+// truncate. This allows for more efficient writes.
+// See https://github.com/redpanda-data/redpanda/pull/18598.
```

Because you put a very good analysis in the cover letter, it may make sense to link to the PR?

```python
                        required=False)
    parser.add_argument('--segment-size',
                        type=str,
                        help='The redpanda data dir',
```
Member commented:

Help is wrong.



```python
# NB: SegmentReader is duplicated in si_utils.py for deployment reasons. If
# making changes please adapt both.
```
Member commented:

"fingers crossed" that everyone obeys this

We used compute_storage.py for validation checks to confirm segments
"are the same" after certain operations. It uses the logical file size
for that.

With the switch to the segment_appender also adjusting the logical file
size when we fallocate, this would break.

Make compute_storage actually read the segment files and parse the
actual amount of batch data in there.

We apply a small optimization where for fully closed/truncated segments
we still use the file size and skip reading the full file.

We also switch `data_checksum` to use compute_storage.py and calculate
the md5 sum in compute_storage.py. This avoids repeated invocations of
python which can slow things down.
In the test we trigger a segment GC and expect all data to be
effectively gone.

This is still the case, but because the check was looking at the whole RP
data dir, it would now pick up the controller and kvstore log files, which
also have their logical file size adapted.

Only look at the kafka dir.
Otherwise the controller log would now be skipped, because on the average
system, where the default falloc step is 32MB, it would already be too
big for the default 20MB limit.

Note that this doesn't actually bloat the size of the debug bundle as
any zero padding basically compresses perfectly anyway.

20 MB also feels kinda low for something as important as the controller
log.
@StephanDollberg force-pushed the stephan/truncate-plus-allocate branch from 096004b to 7d0d283 (May 30, 2024 09:52)
@StephanDollberg (Member Author) commented May 30, 2024:

I have pushed some changes now that speed up the size computation in data_checksum (this makes one of the commits unnecessary). I haven't run all the tests yet, so we'll need to see whether everything still passes.

@piyushredpanda merged commit 4b8135e into dev Jun 7, 2024
26 checks passed
@piyushredpanda deleted the stephan/truncate-plus-allocate branch June 7, 2024 17:13
@StephanDollberg (Member Author) commented:

@piyushredpanda I wanted to wait with merging this until the CPD etc is in place

@piyushredpanda (Contributor) commented:

Oh dang, sorry, @StephanDollberg. Could I request having a DNM in PR then? Should we revert?

@StephanDollberg (Member Author) replied:

> Could I request having a DNM in PR then

Ah yeah sure good point, will add next time.

> Should we revert?

Nah not worth the hassle. I think we should have a few data points at least now since we made the throughput changes.
