
segment_appender: truncate in addition to allocate #18598

Merged: 5 commits into dev, Jun 7, 2024

Conversation

@StephanDollberg (Member) commented May 21, 2024:

In the segment_appender we adaptively increase the file size of log
segments using `ss::file::allocate`.

This however doesn't immediately do what's intended. Internally it calls
`fallocate` with the `FALLOC_FL_KEEP_SIZE|FALLOC_FL_ZERO_RANGE` flags.
The former leads to the logical file size not being extended.
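
As a standalone illustration with plain Linux syscalls (a sketch, not Redpanda or seastar code; the file name and size are made up), this is the behaviour in question: `fallocate` with `FALLOC_FL_KEEP_SIZE` reserves zeroed blocks but leaves `st_size` untouched, and only an explicit `ftruncate` extends the logical size:

```cpp
#include <fcntl.h>
#include <linux/falloc.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main() {
    // Error handling omitted for brevity.
    int fd = ::open("demo.log", O_CREAT | O_RDWR | O_TRUNC, 0644);
    struct stat st{};

    // Preallocate 32 MiB of zeroed blocks; with FALLOC_FL_KEEP_SIZE the
    // logical file size (st_size) stays at 0.
    ::fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_ZERO_RANGE, 0, 32 << 20);
    ::fstat(fd, &st);
    std::printf("after fallocate: st_size=%lld\n", (long long)st.st_size); // 0

    // Only an explicit ftruncate extends the logical size; the blocks stay
    // preallocated.
    ::ftruncate(fd, 32 << 20);
    ::fstat(fd, &st);
    std::printf("after ftruncate: st_size=%lld\n", (long long)st.st_size); // 33554432

    ::close(fd);
    return 0;
}
```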

Extending the logical file size, however, seems to have been the intention
of 7d77067. We already handle trailing zero
bytes fine either way, as we are writing zero-padded 4k
chunks anyway.

Besides this side effect, there are other reasons why we want to update
the logical file size immediately.

First, without it every write needs to update the file
size, which makes both the write and the fsync more expensive: updating
the file size is a metadata operation that needs to be journaled to the XFS
in-memory log ("CIL") and then flushed out to disk when the inevitable fsync
occurs immediately after. So adjusting the size costs CPU on both the write
and the fsync, and potentially also causes some additional IO on the fsync
(showing up as write amplification), though this latter effect wasn't noticeable
in testing.

Second, seastar considers XFS an append-challenged file system. This
means that seastar will avoid having size-changing and non-size-changing
operations outstanding at the same time; they will be queued internally
in the ioqueue. Because of how `allocate` works (see above), all our writes
are always considered "appending", as we never update the logical
file size.

To optimize the queued operations seastar employs certain
"optimizations" in `append_challenged_posix_file_impl::optimize_queue`,
including issuing `ftruncate` calls when it observes the application
issuing writes that will extend the logical file size.
Because nearly all our writes are appending, this causes a continuous stream
of implicit `ftruncate` syscalls of about 100/s per shard.

```
root:/tmp# perf trace -t 11990 -s -e ftruncate -- sleep 5

 Summary of events:

 redpanda (11990), 1122 events, 100.0%

   syscall            calls  errors  total       min       avg       max       stddev
                                     (msec)    (msec)    (msec)    (msec)        (%)
   --------------- --------  ------ -------- --------- --------- ---------     ------
   ftruncate            561      0    32.056     0.013     0.057     0.265      1.85%
```

To avoid the two aforementioned issues this patch adds an explicit
`truncate` after our `allocate` call. Note that we still need both, as
`ftruncate` alone doesn't preallocate blocks, so it's a lot less
performant on its own.
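
Concretely, the change pairs the two calls. A minimal sketch against the seastar file API (function and variable names here are illustrative, not the exact Redpanda code):

```cpp
#include <seastar/core/file.hh>
#include <seastar/core/future.hh>
#include <cstdint>

namespace ss = seastar;

// Sketch only: preallocate the next falloc step, then extend the logical
// file size so subsequent appends are no longer size-changing operations
// from the kernel's and seastar's point of view.
ss::future<> adaptive_fallocate_sketch(ss::file out, uint64_t fallocation_offset,
                                       uint64_t step) {
    // Reserves blocks but leaves st_size untouched (seastar issues fallocate
    // with FALLOC_FL_KEEP_SIZE | FALLOC_FL_ZERO_RANGE under the hood).
    return out.allocate(fallocation_offset, step)
      .then([out, fallocation_offset, step]() mutable {
          // The explicit truncate extends the logical size over the newly
          // preallocated range.
          return out.truncate(fallocation_offset + step);
      });
}
```

In the actual patch the same pairing lives in the appender's adaptive fallocation path; the review excerpt further down shows the real diff.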

In a medium IOPS OMB workload we see a drop in p99 producer latency from
~14ms to ~10ms. Threaded fallbacks and hence steal time are down by
about 5 and 10% respectively.

Further, we can also see that the `fsync` time distribution slightly
improved:

Before:

```
root:/tmp# xfsdist-bpfcc 5 1
Tracing XFS operation latency... Hit Ctrl-C to end.

14:07:37:

operation = b'fsync'
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 55       |                                        |
         4 -> 7          : 11872    |*****                                   |
         8 -> 15         : 38206    |*****************                       |
        16 -> 31         : 18405    |********                                |
        32 -> 63         : 11474    |*****                                   |
        64 -> 127        : 29074    |*************                           |
       128 -> 255        : 88607    |****************************************|
       256 -> 511        : 27403    |************                            |
       512 -> 1023       : 557      |                                        |
      1024 -> 2047       : 28       |                                        |
      2048 -> 4095       : 23       |
```

After:

```
root:/tmp# xfsdist-bpfcc 5 1
Tracing XFS operation latency... Hit Ctrl-C to end.

13:57:45:

operation = b'fsync'
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 56       |                                        |
         4 -> 7          : 15472    |*******                                 |
         8 -> 15         : 45440    |***********************                 |
        16 -> 31         : 22825    |***********                             |
        32 -> 63         : 13378    |******                                  |
        64 -> 127        : 31979    |****************                        |
       128 -> 255        : 77529    |****************************************|
       256 -> 511        : 18116    |*********                               |
       512 -> 1023       : 250      |                                        |
      1024 -> 2047       : 1        |
```

Further, we can also observe the logical file size adapting during a run:

Before:

```
root:/tmp# for i in {0..5} ; do ll /var/lib/redpanda/data/kafka/test-topic-zUv2_Hs-0000/0_32/ | grep log ; sleep 1 ; done
-rw-r--r--   1 redpanda redpanda 37502976 May 21 10:10 0-1-v1.log
-rw-r--r--   1 redpanda redpanda 39337984 May 21 10:10 0-1-v1.log
-rw-r--r--   1 redpanda redpanda 41267200 May 21 10:10 0-1-v1.log
-rw-r--r--   1 redpanda redpanda 43479040 May 21 10:10 0-1-v1.log
-rw-r--r--   1 redpanda redpanda 45416448 May 21 10:10 0-1-v1.log
-rw-r--r--   1 redpanda redpanda 47005696 May 21 10:10 0-1-v1.log
```

After:

```
root:/tmp# for i in {0..5} ; do ll /var/lib/redpanda/data/kafka/test-topic-Yk-n3mY-0000/0_35/ | grep log ; sleep 1 ; done
-rw-r--r--   1 redpanda redpanda 67108864 May 21 10:22 0-1-v1.log
-rw-r--r--   1 redpanda redpanda 67108864 May 21 10:22 0-1-v1.log
-rw-r--r--   1 redpanda redpanda 67108864 May 21 10:22 0-1-v1.log
-rw-r--r--   1 redpanda redpanda 67108864 May 21 10:22 0-1-v1.log
-rw-r--r--   1 redpanda redpanda 67108864 May 21 10:22 0-1-v1.log
-rw-r--r--   1 redpanda redpanda 67108864 May 21 10:22 0-1-v1.log
```

Note that long term we will want to improve/change seastar to allow doing
a "full" fallocate, but just adding the truncate is the easier change for
now.
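
(For reference, a "full" fallocate at the plain syscall level would just drop `FALLOC_FL_KEEP_SIZE`, so a single call both preallocates the blocks and extends `st_size`; `fd` and `falloc_step` below are hypothetical names.)

```cpp
// Hypothetical single-call variant: without FALLOC_FL_KEEP_SIZE the zeroed
// range also extends the logical file size, so no separate ftruncate is needed.
::fallocate(fd, FALLOC_FL_ZERO_RANGE, 0, falloc_step);
```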

Fixes https://github.com/redpanda-data/core-internal/issues/1293

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.1.x
  • v23.3.x
  • v23.2.x

Release Notes

  • none

@StephanDollberg marked this pull request as draft May 21, 2024 17:38
@StephanDollberg force-pushed the stephan/truncate-plus-allocate branch from d474332 to 9cf6bdc (May 23, 2024 09:51)
@StephanDollberg force-pushed the stephan/truncate-plus-allocate branch from 9cf6bdc to 8271f51 (May 23, 2024 13:07)
@StephanDollberg force-pushed the stephan/truncate-plus-allocate branch from 4876478 to 0a71009 (May 23, 2024 17:54)
@StephanDollberg force-pushed the stephan/truncate-plus-allocate branch from 0a71009 to 90878d7 (May 24, 2024 08:14)
@StephanDollberg marked this pull request as ready for review May 24, 2024 08:14
@StephanDollberg requested review from andrwng, dotnwat and travisdowns and removed request for twmb, Deflaimun, r-vasquez, gene-redpanda and andrwng May 24, 2024 08:14
@StephanDollberg force-pushed the stephan/truncate-plus-allocate branch from 90878d7 to 498174c (May 24, 2024 09:41)
@@ -162,7 +162,7 @@ func NewCommand(fs afero.Fs, p *config.Params) *cobra.Command {
f.StringVar(&logsSince, "logs-since", "", "Include log entries on or newer than the specified date (journalctl date format, e.g. YYYY-MM-DD")
f.StringVar(&logsUntil, "logs-until", "", "Include log entries on or older than the specified date (journalctl date format, e.g. YYYY-MM-DD")
f.StringVar(&logsSizeLimit, "logs-size-limit", "100MiB", "Read the logs until the given size is reached (e.g. 3MB, 1GiB)")
f.StringVar(&controllerLogsSizeLimit, "controller-logs-size-limit", "20MB", "The size limit of the controller logs that can be stored in the bundle (e.g. 3MB, 1GiB)")
f.StringVar(&controllerLogsSizeLimit, "controller-logs-size-limit", "130MB", "The size limit of the controller logs that can be stored in the bundle (e.g. 3MB, 1GiB)")
Contributor commented:

?

Member Author (@StephanDollberg) replied:

When deciding whether to include the controller log in the debug bundle, rpk also looks at the logical file size. Given that on most systems the initial segment falloc/truncate step is 32MB, the controller log would basically never get included in the debug bundle anymore. Hence the need for the increase.

I think changing it to 130MB such that we at least include a single segment in case the controller log is large (128MiB is the default segment size) is good either way.

I do realize now though that that is actually in MB and not MiB so I will need to change it to 132MB.

@andrwng previously approved these changes May 24, 2024
Comment on lines 362 to 365:

```cpp
_fallocation_offset += step;
// ss::file::allocate does not adjust logical file size
// hence we need to do that explicitly with an extra
// truncate. This allows for more efficient writes.
return _out.truncate(_fallocation_offset);
```
Contributor commented:

It seemed suspicious to set _fallocation_offset before truncating if thinking about the logical change as something that ought to be pushed into ss::file::allocate() in the first place.

Looking around, it doesn't look like it's too important since this appears to only be used when we append (in case we need to fallocate, in which case we co_await) or when we truncate (which shouldn't overlap with appends, and therefore should also co_await the entire execution of do_next_adaptive_fallocation()).

Member Author (@StephanDollberg) replied:

Yeah, the important bit is that there are no concurrent appends.

I have changed it anyway since, as you say, it's the more logical order.

@andrwng requested a review from nvartolomei May 24, 2024 19:23
@andrwng (Contributor) commented May 24, 2024:

Also tagging Nicolae since he's spent a good amount of time in the area with write-caching.

This is in preparation for also `truncate`ing segment files which will
adapt their logical size on disk.

Note that this test is a bit different from how prod works: in reality
segments get closed and truncated to their real size on roll, so only the
last segment ever has unwritten blocks. This test doesn't work like that,
hence we need to account for the unwritten blocks in all segments.
@nvartolomei (Contributor) left a comment:

LGTM. Did you notice an increase in test duration? This new compute storage doesn't seem very lightweight, and I wonder if it would make more sense to adjust tests to fallocate less / not rely on that precise segment sizing information.

`release_appender_in_background` gave a bit of a headache when reasoning about correctness. Seems good. It holds the gate, which prevents closing and writing the "clean segment" offset before an actual flush+truncate with the proper size.

@@ -33,6 +74,14 @@ def safe_listdir(p: Path) -> list[Path]:

```python
        return []


def compute_size_for_file(file: Path):
    if file.suffix == '.log':
        reader = SegmentReader(io.BytesIO(file.read_bytes()))
```
Contributor commented:

Wondering if it makes sense to have a fast path if the last byte is non-zero.

@StephanDollberg (Member Author) replied May 28, 2024:

I can give it a try. Not sure whether we still leave zero pages on close/final-truncate.

@StephanDollberg (Member Author) commented:

> LGTM. Did you notice an increase in test duration? This new compute storage doesn't seem very lightweight, and I wonder if it would make more sense to adjust tests to fallocate less / not rely on that precise segment sizing information.

Doesn't seem very noticeable. Maybe very little.

> more sense to adjust tests to fallocate less/not rely on that precise segment sizing information

I am very open to ideas but can't immediately see how we could improve it. The size checking already feels like a simplifying hack (all bits could be garbled). I wouldn't want to change all the tests to use smaller segment sizes etc. as I prefer things to run in default configs where possible.

@StephanDollberg force-pushed the stephan/truncate-plus-allocate branch 3 times, most recently from 89c48ac to 096004b (May 29, 2024 09:47)
```cpp
.then([this, step] {
// ss::file::allocate does not adjust logical file size
// hence we need to do that explicitly with an extra
// truncate. This allows for more efficient writes.
```
Member commented:

Suggested change:

```diff
-// truncate. This allows for more efficient writes.
+// truncate. This allows for more efficient writes.
+// See https://github.com/redpanda-data/redpanda/pull/18598.
```

Because you put a very good analysis in the cover letter, it may make sense to link to the PR?

```python
                        required=False)
    parser.add_argument('--segment-size',
                        type=str,
                        help='The redpanda data dir',
```
Member commented:

Help is wrong.



```python
# NB: SegmentReader is duplicated in si_utils.py for deployment reasons. If
# making changes please adapt both.
```
Member commented:

"fingers crossed" that everyone obeys this

We used compute_storage.py for validation checks to confirm segments
"are the same" after certain operations. It uses the logical file size
for that.

With the switch to the segment_appender also adjusting the logical file
size when we fallocate, this would break.

Make compute_storage actually read the segment files and parse the
actual amount of batch data in there.

We apply a small optimization where for fully closed/truncated segments
we still use the file size and skip reading the full file.

We also switch `data_checksum` to use compute_storage.py and calculate
the md5 sum in compute_storage.py. This avoids repeated invocations of
python which can slow things down.
In the test we trigger a segment GC and expect all data to be
effectively gone.

This is still the case, but because the check was looking at the whole RP
data dir, it would now pick up the controller and kvstore log files, which
also have their logical file size adapted.

Only look at the kafka dir.
Otherwise the controller log would now be skipped, because on the average
system, where the default falloc step is 32MB, it would already be too
big for the default 20MB limit.

Note that this doesn't actually bloat the size of the debug bundle as
any zero padding basically compresses perfectly anyway.

20 MB also feels kinda low for something as important as the controller
log.
@StephanDollberg force-pushed the stephan/truncate-plus-allocate branch from 096004b to 7d0d283 (May 30, 2024 09:52)
@StephanDollberg (Member Author) commented May 30, 2024:

I have pushed some changes now that speed up the size computation in data_checksum (this makes one of the commits unnecessary). I haven't run all the tests yet, so we'll need to see whether everything still passes.

@piyushredpanda merged commit 4b8135e into dev Jun 7, 2024
26 checks passed
@piyushredpanda deleted the stephan/truncate-plus-allocate branch June 7, 2024 17:13
@StephanDollberg (Member Author) commented:

@piyushredpanda I wanted to wait with merging this until the CPD etc is in place

@piyushredpanda (Contributor) commented:

Oh dang, sorry, @StephanDollberg. Could I request having a DNM in PR then? Should we revert?

@StephanDollberg (Member Author) replied:

> Could I request having a DNM in PR then

Ah yeah sure good point, will add next time.

> Should we revert?

Nah not worth the hassle. I think we should have a few data points at least now since we made the throughput changes.
