segment_appender: truncate in addition to allocate
In the segment_appender we adaptively increase the file size of log
segments using `ss::file::allocate`.

However, this doesn't quite do what's intended. Internally it calls
`fallocate` with the `FALLOC_FL_KEEP_SIZE|FALLOC_FL_ZERO_RANGE` flags.
The former means that the logical file size is not extended.
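
The effect of `FALLOC_FL_KEEP_SIZE` can be reproduced outside of seastar with a few plain Linux syscalls. The following is a minimal sketch, not code from this patch: the helper names and the temp file path are made up for illustration, and it assumes a Linux system with a writable `/tmp`. The `fallocate` result is deliberately ignored because some file systems reject `FALLOC_FL_ZERO_RANGE`; the logical size stays untouched either way.

```cpp
#include <fcntl.h>         // fallocate (glibc exposes it under _GNU_SOURCE, which g++ defines)
#include <linux/falloc.h>  // FALLOC_FL_KEEP_SIZE, FALLOC_FL_ZERO_RANGE
#include <sys/stat.h>      // fstat
#include <unistd.h>        // close, unlink
#include <cstdlib>         // mkstemp

// Hypothetical helper: logical size (st_size) of an open file, -1 on error.
off_t logical_size(int fd) {
    struct stat st {};
    return ::fstat(fd, &st) == 0 ? st.st_size : -1;
}

// Hypothetical helper: creates an empty temp file, preallocates `len` bytes
// with the flags named in the commit message and returns the logical size
// seen afterwards (-1 if the temp file could not be created).
off_t size_after_keep_size_fallocate(off_t len) {
    char path[] = "/tmp/keep_size_demo_XXXXXX";
    int fd = ::mkstemp(path);
    if (fd < 0) {
        return -1;
    }
    ::unlink(path); // the file lives until the descriptor is closed

    // Best effort: may fail with EOPNOTSUPP on some file systems. On success
    // blocks are reserved, but KEEP_SIZE leaves st_size where it was.
    (void)::fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_ZERO_RANGE, 0, len);

    off_t size = logical_size(fd);
    ::close(fd);
    return size;
}
```

Calling `size_after_keep_size_fallocate(1 << 20)` should report a logical size of 0 even though up to 1 MiB of blocks may have been reserved.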

Extending the logical file size, however, seems to have been the intention
of 7d77067. We already handle trailing zero bytes fine either way, as we
write zero-padded 4k chunks anyway.

Besides this side effect, there are other reasons why we want to update
the logical file size immediately.

First, if we don't, every write needs to update the file size, which
makes fsync more expensive.

Second, seastar considers XFS an append-challenged file system. This
means that seastar avoids having size-changing and non-size-changing
operations outstanding at the same time; they are queued internally
in the ioqueue. Because of how `allocate` works (see above), all our
writes are always considered "appending", as we never update the logical
file size ahead of them.
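
The classification described above can be pictured with a toy model. This is a simplified sketch under the assumption that a write counts as size-changing when it ends past the currently known logical size; it is not seastar's actual implementation, and the function name is hypothetical.

```cpp
#include <cstdint>

// Toy model (assumption, not seastar code): a write is "appending", i.e.
// size-changing, when its end lies beyond the known logical file size.
constexpr bool is_size_changing(
  std::uint64_t pos, std::uint64_t len, std::uint64_t logical_size) {
    return pos + len > logical_size;
}
```

With the logical size merely tracking the bytes written so far, the next sequential 4k write always lands at the end of the file and is size-changing. Once the logical size has been extended ahead of the writes (for example to 64 MiB), the same write no longer is.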

Seastar applies certain "optimizations" to the queued operations in
`append_challenged_posix_file_impl::optimize_queue`. Because all our
operations are appending, this causes a continuous stream of implicit
`ftruncate` syscalls, about 100/s per shard.

```
root:/tmp# perf trace -t 11990 -s -e ftruncate -- sleep 5

 Summary of events:

 redpanda (11990), 1122 events, 100.0%

   syscall            calls  errors  total       min       avg       max       stddev
                                     (msec)    (msec)    (msec)    (msec)        (%)
   --------------- --------  ------ -------- --------- --------- ---------     ------
   ftruncate            561      0    32.056     0.013     0.057     0.265      1.85%
```

To avoid the two aforementioned issues, this patch adds an explicit
`truncate` after our `allocate` call. Note that we still need both, as
`ftruncate` alone doesn't preallocate blocks and so performs a lot worse
on its own.
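
Expressed in raw syscalls rather than seastar futures, the allocate-then-truncate step might look like the sketch below. It is an illustration only: the function names, the temp file path and the step loop are assumptions, the `fallocate` call is best effort because not every file system supports `FALLOC_FL_ZERO_RANGE`, and a writable `/tmp` on Linux is assumed.

```cpp
#include <fcntl.h>         // fallocate (glibc exposes it under _GNU_SOURCE, which g++ defines)
#include <linux/falloc.h>  // FALLOC_FL_KEEP_SIZE, FALLOC_FL_ZERO_RANGE
#include <sys/stat.h>      // fstat
#include <unistd.h>        // ftruncate, close, unlink
#include <cstdlib>         // mkstemp

// Hypothetical helper: reserve [offset, offset + step) and then move the
// logical end of file to offset + step. Returns the new logical size or -1.
off_t allocate_then_truncate(int fd, off_t offset, off_t step) {
    // Step 1: preallocate zeroed blocks without touching st_size.
    (void)::fallocate(
      fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_ZERO_RANGE, offset, step);
    // Step 2: extend the logical size explicitly.
    if (::ftruncate(fd, offset + step) != 0) {
        return -1;
    }
    struct stat st {};
    return ::fstat(fd, &st) == 0 ? st.st_size : -1;
}

// Hypothetical driver: grows a fresh temp file in `rounds` steps of `step`
// bytes and returns the final logical size (-1 on error).
off_t grow_in_steps(off_t step, int rounds) {
    char path[] = "/tmp/alloc_truncate_demo_XXXXXX";
    int fd = ::mkstemp(path);
    if (fd < 0) {
        return -1;
    }
    ::unlink(path); // the file lives until the descriptor is closed

    off_t fallocation_offset = 0;
    off_t size = 0;
    for (int i = 0; i < rounds && size >= 0; ++i) {
        size = allocate_then_truncate(fd, fallocation_offset, step);
        fallocation_offset += step;
    }
    ::close(fd);
    return size;
}
```

After each step the logical size matches the fallocation offset, so subsequent writes below that offset no longer extend the file.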

In a medium IOPS OMB workload we see a drop in p99 producer latency from
~14ms to ~10ms. Threaded fallbacks and hence steal time are down by
about 5% and 10% respectively.

Further, the `fsync` latency distribution is slightly improved:

Before:

```
root:/tmp# xfsdist-bpfcc 5 1
Tracing XFS operation latency... Hit Ctrl-C to end.

14:07:37:

operation = b'fsync'
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 55       |                                        |
         4 -> 7          : 11872    |*****                                   |
         8 -> 15         : 38206    |*****************                       |
        16 -> 31         : 18405    |********                                |
        32 -> 63         : 11474    |*****                                   |
        64 -> 127        : 29074    |*************                           |
       128 -> 255        : 88607    |****************************************|
       256 -> 511        : 27403    |************                            |
       512 -> 1023       : 557      |                                        |
      1024 -> 2047       : 28       |                                        |
      2048 -> 4095       : 23       |                                        |
```

After:

```
root:/tmp# xfsdist-bpfcc 5 1
Tracing XFS operation latency... Hit Ctrl-C to end.

13:57:45:

operation = b'fsync'
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 56       |                                        |
         4 -> 7          : 15472    |*******                                 |
         8 -> 15         : 45440    |***********************                 |
        16 -> 31         : 22825    |***********                             |
        32 -> 63         : 13378    |******                                  |
        64 -> 127        : 31979    |****************                        |
       128 -> 255        : 77529    |****************************************|
       256 -> 511        : 18116    |*********                               |
       512 -> 1023       : 250      |                                        |
      1024 -> 2047       : 1        |                                        |
```

Further, we can also observe the logical file size adapting during a run:

Before:

```
root:/tmp# for i in {0..5} ; do ll /var/lib/redpanda/data/kafka/test-topic-zUv2_Hs-0000/0_32/ | grep log ; sleep 1 ; done
-rw-r--r--   1 redpanda redpanda 37502976 May 21 10:10 0-1-v1.log
-rw-r--r--   1 redpanda redpanda 39337984 May 21 10:10 0-1-v1.log
-rw-r--r--   1 redpanda redpanda 41267200 May 21 10:10 0-1-v1.log
-rw-r--r--   1 redpanda redpanda 43479040 May 21 10:10 0-1-v1.log
-rw-r--r--   1 redpanda redpanda 45416448 May 21 10:10 0-1-v1.log
-rw-r--r--   1 redpanda redpanda 47005696 May 21 10:10 0-1-v1.log
```

After:

```
root:/tmp# for i in {0..5} ; do ll /var/lib/redpanda/data/kafka/test-topic-Yk-n3mY-0000/0_35/ | grep log ; sleep 1 ; done
-rw-r--r--   1 redpanda redpanda 67108864 May 21 10:22 0-1-v1.log
-rw-r--r--   1 redpanda redpanda 67108864 May 21 10:22 0-1-v1.log
-rw-r--r--   1 redpanda redpanda 67108864 May 21 10:22 0-1-v1.log
-rw-r--r--   1 redpanda redpanda 67108864 May 21 10:22 0-1-v1.log
-rw-r--r--   1 redpanda redpanda 67108864 May 21 10:22 0-1-v1.log
-rw-r--r--   1 redpanda redpanda 67108864 May 21 10:22 0-1-v1.log
```
StephanDollberg committed May 21, 2024
1 parent e3f4b94 commit d474332
Showing 1 changed file with 7 additions and 1 deletion.
8 changes: 7 additions & 1 deletion src/v/storage/segment_appender.cc
@@ -358,7 +358,13 @@ ss::future<> segment_appender::do_next_adaptive_fallocation() {
                 _fallocation_offset,
                 _committed_offset);
               return _out.allocate(_fallocation_offset, step)
-                .then([this, step] { _fallocation_offset += step; });
+                .then([this, step] {
+                    _fallocation_offset += step;
+                    // ss::file::allocate does not adjust logical file size
+                    // hence we need to do that explicitly with an extra
+                    // truncate. This allows for more efficient writes.
+                    return _out.truncate(_fallocation_offset);
+                });
           })
           .handle_exception([this](std::exception_ptr e) {
               vassert(