
Switch safekeepers to async. #4119

Merged: arssher merged 2 commits into main from sk-async on Jun 11, 2023
Conversation

@arssher (Contributor) commented Apr 29, 2023

This is a full switch: fs IO operations are also tokio ones, working through its thread pool. Similar to the pageserver, we have multiple runtimes for easier `top` usage and isolation.

Notable points:

  • Now that the guts of safekeeper.rs are full of .await's, we need to be very careful not to drop a task at a random point, leaving the timeline in an unclear state. Currently the only writer is the walreceiver and we don't have top-level cancellation there, so we are good. But to be safe we should probably add a fuse that panics if a task is dropped while an operation on a timeline is in progress (see the sketch right after this list).
  • The timeline lock is a Tokio one now, as we do disk IO under it.
  • Collecting metrics got a crutch: since the prometheus Collector is synchronous, there is now a special task that, once per scrape period, copies data from under the async lock to a sync one where the collector can take it.
  • Anything involving closures becomes significantly more complicated, as async fns are already kind of closures, and async closures are unstable.
  • The main thread now tracks the other main tasks, which got much easier.
  • The only sync place left is initial data loading, as otherwise clippy complains about the timeline map lock being held across await points -- which is not bad here, as it happens only in the single-threaded runtime of the main thread. But having it sync doesn't hurt either.
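A minimal sketch of what such a fuse might look like (names here are illustrative, not actual safekeeper types): arm a guard before mutating timeline state and disarm it once the operation completes, so a task dropped at an await point in between panics loudly instead of silently leaving the timeline half-written.

```rust
/// Panics if dropped while still armed, i.e. if the owning task is dropped
/// in the middle of a timeline operation.
struct WriteInProgressGuard {
    armed: bool,
}

impl WriteInProgressGuard {
    fn arm() -> Self {
        WriteInProgressGuard { armed: true }
    }

    fn disarm(mut self) {
        // Mark the operation as finished; the Drop impl then stays silent.
        self.armed = false;
    }
}

impl Drop for WriteInProgressGuard {
    fn drop(&mut self) {
        if self.armed {
            panic!("task dropped while a timeline operation was in progress");
        }
    }
}

async fn append_wal(/* timeline, records */) {
    let guard = WriteInProgressGuard::arm();
    // ... awaited disk IO on the timeline happens here ...
    guard.disarm();
}
```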

I'm concerned about the performance of thread pool IO offloading, async traits, and the many await points; but we can try and see how it goes.

fixes #3036 fixes #3966
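A rough sketch of the multiple-runtimes setup described above (runtime names and builder options are illustrative, not the actual code): each subsystem gets its own named multi-threaded runtime, so its threads are distinguishable in `top` and a stall in one subsystem does not starve the others.

```rust
use tokio::runtime::{Builder, Runtime};

fn make_runtime(name: &'static str) -> Runtime {
    Builder::new_multi_thread()
        .thread_name(name) // shows up as the thread name in `top -H`
        .enable_all()
        .build()
        .expect("failed to build runtime")
}

fn main() {
    // One runtime per subsystem; entry points are spawned onto each.
    let wal_service = make_runtime("wal_service");
    let http = make_runtime("http");
    let broker = make_runtime("broker");

    wal_service.spawn(async { /* accept and serve WAL connections */ });
    http.spawn(async { /* serve the HTTP API */ });
    broker.spawn(async { /* push/pull broker messages */ });

    // The main thread then tracks the main tasks (see the notable points
    // above) instead of exiting immediately.
    std::thread::park();
}
```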

@arssher arssher requested a review from koivunej April 29, 2023 03:22
@arssher arssher requested a review from a team as a code owner April 29, 2023 03:22
@arssher arssher requested review from petuhovskiy and removed request for a team April 29, 2023 03:22
@arssher (Contributor, Author) commented Apr 29, 2023

As a basic perf test, we can try pgbench -i on staging before and after this patch.

@github-actions (bot) commented Apr 29, 2023

1075 tests run: 1028 passed, 0 failed, 47 skipped (full report)

Flaky tests (1): Postgres 15

The comment gets automatically updated with the latest test results: 17bf244 at 2023-06-08T13:18:19.542Z

@arssher (Contributor, Author) commented Apr 29, 2023

Since metrics can now be delayed by up to a scrape period due to the crutch described above, tests relying on them fail.

@petuhovskiy petuhovskiy added the run-benchmarks label Apr 29, 2023
@petuhovskiy (Member)

Delaying metrics doesn't feel right; we cannot rely on them now because they can be delayed. Can we .await timeline metrics to make it work like it was?

@petuhovskiy (Member) left a review

Generally looks ok, but I wouldn't want to roll it out to production until we test it for at least several days.

It's hard to say if we forgot or missed something; my main concerns are:

  • blocking calls from async tasks, potential thread pool exhaustion
  • forgetting to .await (but AFAIR compiler checks for it?)
  • other bugs or changed behavior

It would be good to have metrics for async runtimes; can we somehow add that? The most useful metrics are:

  • number of unfinished spawned tasks
  • number of active (non-awaiting IO) tasks
  • total number of spawned tasks

@koivunej (Member) left a review

A few notes and questions on not running drops on error and on how the tasks are currently launched.

At a high level I still see GlobalTimelinesState behind an std mutex. Is a global std mutex reasonable there? If we don't have time to do a proper analysis, I would recommend just going with tokio's async mutex. EDIT: it seems that it is required by the current way of doing metrics via the collector API.

Requesting changes after seeing the 6 tokio runtimes: I don't see any justification and one should be enough.

Looks like there is no spawn_blocking usage in the codebase, which removes one deadlock source. Similarly, there is no more block_on or block_in_place, which is awesome 👍.

Resolved review threads on: safekeeper/src/bin/safekeeper.rs (3, one outdated), safekeeper/src/wal_service.rs, safekeeper/src/lib.rs
@koivunej (Member) commented Apr 29, 2023

forgetting to .await (but AFAIR compiler checks for it?)

The Future trait has a #[must_use = "futures do nothing unless you '.await' or poll them"], and similar attributes have been added to tokio for named futures. In general, if a future type does not have a #[must_use], that is a bug. I can't say I've run into this in other codebases; we did have a let _ = future_which_was_never_awaited; in the previous iteration of postgres_backend. It was caught by upgrading clippy.
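A tiny illustration of what those lints catch (flush_wal is a made-up async fn, not a real safekeeper function):

```rust
async fn flush_wal() { /* ... */ }

fn forgot_to_await() {
    // rustc warning from the Future's #[must_use]: the future is never polled.
    flush_wal();

    // This silences #[must_use] but still never runs the future; newer clippy
    // versions flag it with the `let_underscore_future` lint.
    let _ = flush_wal();
}

async fn correct() {
    flush_wal().await; // the future actually runs to completion
}
```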

@petuhovskiy (Member) commented Apr 29, 2023

On a high level I still find GlobalTimelinesState behind an std mutex. Is that reasonable global std mutex? If we don't have time to do proper analysis, would recommend just going with tokio's async mutex.

It's ok behind an std mutex. GlobalTimelinesState is just a collection of Arcs (HashMap<TenantTimelineId, Arc<Timeline>>); we don't do anything but get_all/get/set/delete while holding the lock, so it should never block anyone.
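A simplified sketch of why that pattern is safe with a std mutex (the type definitions here are stand-ins for the real ones): the critical section only touches the map and clones an Arc out, and nothing is awaited while the guard is held.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

struct Timeline; // stand-in for the real timeline type
#[derive(Clone, Hash, PartialEq, Eq)]
struct TenantTimelineId(u128); // stand-in for the real id type

struct GlobalTimelinesState {
    timelines: HashMap<TenantTimelineId, Arc<Timeline>>,
}

fn get(state: &Mutex<GlobalTimelinesState>, id: &TenantTimelineId) -> Option<Arc<Timeline>> {
    // Lock, clone the Arc, and drop the guard at the end of the statement;
    // all async work then happens on the Timeline itself, never under this lock.
    state.lock().unwrap().timelines.get(id).cloned()
}

fn insert(state: &Mutex<GlobalTimelinesState>, id: TenantTimelineId, tli: Arc<Timeline>) {
    state.lock().unwrap().timelines.insert(id, tli);
}
```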

@petuhovskiy (Member)

Currently the only writer is walreceiver and we don't have top level cancellation there, so we are good.

HTTP handlers can and do get canceled often, need to check that we're good there.

@koivunej (Member) commented Apr 29, 2023

Currently the only writer is walreceiver and we don't have top level cancellation there, so we are good.

HTTP handlers can and do get canceled often, need to check that we're good there.

An obvious way would be to spawn on each ... At the same time, I think you'd then need a way to coalesce the requests, otherwise later attempts would get confusing or contradictory responses. An alternative to coalescing would be to notify "busy" or something while the previously detached operation is underway. The implementation cost is the same, though.
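A rough sketch of the "detach and report busy" alternative (the handler shape and return values are made up for illustration): the operation is spawned so handler cancellation cannot drop it halfway, and a per-timeline tokio mutex turns concurrent attempts into a "busy" answer.

```rust
use std::sync::Arc;
use tokio::sync::Mutex;

async fn handle_request(op_lock: Arc<Mutex<()>>) -> &'static str {
    // If a previous operation is still running, answer "busy" instead of
    // queueing behind it and returning something confusing later.
    let Ok(guard) = op_lock.clone().try_lock_owned() else {
        return "busy: previous operation still in progress";
    };

    // Detach the actual work so cancelling the HTTP request cannot drop it
    // at a random await point.
    tokio::spawn(async move {
        // ... perform the timeline operation here ...
        drop(guard); // lock released only when the detached work finishes
    });

    "accepted"
}
```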

I see that you are not yet using utils::http::endpoint::RequestSpan around your handlers. Is this by design? It will at least log if a request is cancelled before completion.

@petuhovskiy (Member)

We are missing logging on the HTTP side; it would be good to add spans and log at least some requests.

@koivunej (Member) commented Apr 29, 2023

We are missing logging on the HTTP side; it would be good to add spans and log at least some requests.

Unless I am missing something, this will be it: ab0ecf5

With the panic tracing hook, any panic will be within the correct context, because the hook fires before unwinding, and things will just work.

I noticed that on panic the cancellation message will be printed, but it's not that big of a deal. I'll PR that fix separately: #4125

@arssher (Contributor, Author) commented Apr 29, 2023

Delaying metrics doesn't feel right; we cannot rely on them now because they can be delayed. Can we .await timeline metrics to make it work like it was?

Came up with another version of the crutch; now the metrics are fresh and the tests are fixed.

@arssher (Contributor, Author) commented Apr 29, 2023

On a high level I still find GlobalTimelinesState behind an std mutex. Is that reasonable global std mutex? If we don't have time to do proper analysis,

We do; as Arthur said, we only hold it for small pieces of CPU work and never await while holding it, so std is fine.

@arssher (Contributor, Author) commented Apr 29, 2023

Requesting changes after seeing the 6 tokio runtimes: I don't see any justification and one should be enough.

Having multiple runtimes provides isolation. For example, it gives an immediate breakdown by subsystem when you look at top. Also, imagine there is a bug in the HTTP processing code which blocks the runtime in some blocking operation; it would probably be good if that doesn't harm WAL servicing. Yes, clearly overall we get more runtime threads than needed. But I don't expect that having 2x or 3x the number of cores in threads is noticeably worse than 1x from a perf POV, though I'm not an expert here. So I'd leave it as is, though I can live with either variant.

I'm not sure how running everything in a single-threaded rt is useful for hunting down deadlocks (it seems it would still be probabilistic, though many threads mask them more often), but if this is desired, it is trivial to add an option for that -- it anyway requires making the rt have only a single worker.

IOW, we can do that after encountering the first deadlock :)

@arssher (Contributor, Author) commented Apr 29, 2023

Unless I am missing something, this will be it: ab0ecf5

Great, included it in this PR.

@arssher (Contributor, Author) commented Apr 29, 2023

Would be good to have metrics for async runtimes, can we somehow add that? The most useful metrics are:

Yeah. So there are unstable TODO metrics in tokio, and tokio-metrics, which also requires tokio unstable. So there is no good solution off the shelf. Also, we can always enable/attach tokio console and check out what's going on.

@koivunej (Member) commented Apr 29, 2023

The easiest way to get metrics today is to add wrappers to futures before spawning and count polls or poll duration etc., like tokio-console does. Doing the waker tracking tokio-console provides is very difficult to jerry-rig, and it has open issues on false positives.
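A bare-bones version of such a wrapper (the counter here is a plain atomic; in practice it would feed a prometheus metric labelled per task kind):

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll};

// A wrapper future that counts how many times the inner future is polled.
struct CountPolls<F> {
    inner: Pin<Box<F>>,
    polls: Arc<AtomicU64>,
}

impl<F: Future> Future for CountPolls<F> {
    type Output = F::Output;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<F::Output> {
        self.polls.fetch_add(1, Ordering::Relaxed);
        self.inner.as_mut().poll(cx)
    }
}

// Wrap before spawning, e.g.:
//   let polls = Arc::new(AtomicU64::new(0));
//   tokio::spawn(CountPolls { inner: Box::pin(some_task()), polls: polls.clone() });
```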

For threads there are additional hooks that can be added on thread start and stop.

Also we can always enable/attach tokio console and check out what's going on

Note that one cannot simply attach tokio console: it requires rebuilding with --cfg tokio_unstable (unless that has changed recently), and it needs at least the tracing feature.

@arssher (Contributor, Author) commented Jun 8, 2023

I rebased and added an option to run everything in a single-threaded runtime (for debugging etc). Will try to run tests and commit today.
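For reference, such an option could look roughly like this (the flag name and wiring are hypothetical, not necessarily the actual implementation): one current-thread runtime for everything when debugging, otherwise the per-subsystem multi-threaded runtimes.

```rust
use tokio::runtime::{Builder, Runtime};

fn build_runtime(name: &'static str, current_thread: bool) -> std::io::Result<Runtime> {
    if current_thread {
        // Everything runs on one thread; handy for debugging and for making
        // ordering problems more reproducible.
        Builder::new_current_thread().enable_all().build()
    } else {
        Builder::new_multi_thread()
            .thread_name(name)
            .enable_all()
            .build()
    }
}
```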

With RequestSpan, successful GETs are not logged, but all other requests are, as are errors and warnings on cancellations.
@arssher arssher merged commit 7e17979 into main Jun 11, 2023
@arssher arssher deleted the sk-async branch June 11, 2023 18:53
@arssher (Contributor, Author) commented Jun 11, 2023

Before merging I wanted to run at least basic benchmarks to get a feel for the overhead of thread-pooled fs IO operations. I did that with pgbench insertion. In short, the overhead is noticeable but can be neglected, at least for now.

The command was like

pgbench -d  'postgres://ars:qB5QsTIaX0Sz@ep-dry-hill-850843.eu-west-1.aws.neon.build/neondb' -i -s 600 -I dtG

On eu-west staging, run on the ec2 machine in the same region:
sync:
server-side generate 71.91 s, disk write stable 261MiB/s.
server-side generate 71.96 s 224MiB/s
server-side generate 72.11 s 176MiB/s.
server-side generate 72.07 s 234MiB/s

async:
server-side generate 71.97 s disk load 198MiB/s.
server-side generate 72.18 s 262MiB/s.
server-side generate 72.09 s 263MiB/s.

async, current thread rt:
server-side generate 72.25 s 196MiB/s.
server-side generate 72.13 s 251MiB/s

Scale 600 is about 9.5GiB of WAL. Disk throughput is taken from grafana (node_exporter, which in turn reads /proc counters). I'm not fully sure why it has such variance, but probably because 1) the test is just too short (~70s, while grafana scrapes every 30s), and 2) WAL offloading / sending to the pageserver takes some of the disk capacity, and this is not stable due to S3 throttling.

As kind of expected, the single-threaded runtime is as good as the multi-threaded one, since network reading and disk IO still technically happen in different threads due to the thread pool offload.

Running the same pgbench but locally, with scale 200 now, having everything (compute, single safekeeper, single pageserver) in tmpfs:
async:
server_side_generate: 22.210 s
server_side_generate: 22.260 s
server_side_generate: 19.130 s
server_side_generate: 18.830 s
server_side_generate: 19.450 s
sync:
server_side_generate: 19.130 s
server_side_generate: 17.140 s
server_side_generate: 17.620 s
server_side_generate: 17.490 s
server_side_generate: 18.080 s
So it seems to be about a ~10% difference.
The command was like

BUILD_TYPE=release CARGO_BUILD_FLAGS="--features=testing" make -s -j12 && rm -rf /tmp/test_output/; TEST_OUTPUT=/tmp/test_output NEON_BIN=./target/release poetry run pytest test_runner/performance -k test_perf_pgbench --preserve-database-files

but I also adjusted the code a bit (left only the init test with -I dtG steps, removed vanilla, changed the scale).

Importantly, both on staging and locally I disabled backpressure

+            "max_replication_apply_lag=-1",
+            "max_replication_flush_lag=-1",
+            "max_replication_write_lag=-1"

otherwise it kicks in quickly and variance increases a lot. Actually, before I realized I needed to do that, I ran some tests, and it seems the sync vs async diff there was greater -- I remember at least one run with async taking 38.4s and sync 32s, but I'm not sure how reliable that is.

While doing that, I found multiple interesting and unresolved links/issues about bad fs IO thread pool performance:
https://users.rust-lang.org/t/why-tokio-fs-is-slower-than-std-fs/39610
https://users.rust-lang.org/t/blocking-permit/36865 (old, but referenced from the fresh link above)
tokio-rs/tokio#3664
They seem to be more about listing operations, though.
Overall, it looks good enough for now, so I merged the PR.

@shinyaaa shinyaaa mentioned this pull request Jul 19, 2024
skyzh pushed a commit that referenced this pull request Jul 19, 2024
There are unused safekeeper runtimes `WAL_REMOVER_RUNTIME` and
`METRICS_SHIFTER_RUNTIME`.

`WAL_REMOVER_RUNTIME` was implemented in #4119 and removed in #7887.
`METRICS_SHIFTER_RUNTIME` was also implemented in #4119 but has never been used.

I removed unused safekeeper runtimes `WAL_REMOVER_RUNTIME` and
`METRICS_SHIFTER_RUNTIME`.
problame pushed a commit that referenced this pull request Jul 22, 2024
Labels
run-benchmarks: Indicates to the CI that benchmarks should be run for a PR marked with this label
4 participants