
Switch safekeepers to async. #4119

Merged: arssher merged 2 commits into main from sk-async on Jun 11, 2023
Conversation

@arssher (Contributor) commented Apr 29, 2023

This is a full switch: fs IO operations are also tokio ones, working through its thread pool. Similar to the pageserver, we have multiple runtimes for easier `top` usage and isolation.

Notable points:

  • Now that the guts of safekeeper.rs are full of .await's, we need to be very careful not to drop a task at a random point, leaving the timeline in an unclear state. Currently the only writer is the walreceiver and we don't have top-level cancellation there, so we are good. But to be safe we should probably add a fuse that panics if a task is dropped while an operation on a timeline is in progress (see the sketch right after this list).
  • The timeline lock is a Tokio one now, as we do disk IO under it.
  • Collecting metrics got a crutch: since the prometheus Collector is synchronous, there is now a special task that, once per scrape period, copies data from under the async lock to a sync one where the collector can take it.
  • Anything involving closures becomes significantly more complicated, as async fns are already kind of closures, and async closures are unstable.
  • The main thread now tracks the other main tasks, which got much easier.
  • The only sync place left is initial data loading, as otherwise clippy complains about the timeline map lock being held across await points -- which is not bad here, as it happens only in the single-threaded runtime of the main thread. But having it sync doesn't hurt either.
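A minimal sketch of what such a fuse might look like (names here are illustrative, not actual safekeeper types): arm a guard before mutating timeline state and disarm it once the operation completes, so a task dropped at an await point in between panics loudly instead of silently leaving the timeline half-written.

```rust
/// Panics if dropped while still armed, i.e. if the owning task is dropped
/// in the middle of a timeline operation.
struct WriteInProgressGuard {
    armed: bool,
}

impl WriteInProgressGuard {
    fn arm() -> Self {
        WriteInProgressGuard { armed: true }
    }

    fn disarm(mut self) {
        // Mark the operation as finished; the Drop impl then stays silent.
        self.armed = false;
    }
}

impl Drop for WriteInProgressGuard {
    fn drop(&mut self) {
        if self.armed {
            panic!("task dropped while a timeline operation was in progress");
        }
    }
}

async fn append_wal(/* timeline, records */) {
    let guard = WriteInProgressGuard::arm();
    // ... awaited disk IO on the timeline happens here ...
    guard.disarm();
}
```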

I'm concerned about the performance of thread pool IO offloading, async traits, and the many await points; but we can try and see how it goes.

fixes #3036 fixes #3966
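A rough sketch of the multiple-runtimes setup described above (runtime names and builder options are illustrative, not the actual code): each subsystem gets its own named multi-threaded runtime, so its threads are distinguishable in `top` and a stall in one subsystem does not starve the others.

```rust
use tokio::runtime::{Builder, Runtime};

fn make_runtime(name: &'static str) -> Runtime {
    Builder::new_multi_thread()
        .thread_name(name) // shows up as the thread name in `top -H`
        .enable_all()
        .build()
        .expect("failed to build runtime")
}

fn main() {
    // One runtime per subsystem; entry points are spawned onto each.
    let wal_service = make_runtime("wal_service");
    let http = make_runtime("http");
    let broker = make_runtime("broker");

    wal_service.spawn(async { /* accept and serve WAL connections */ });
    http.spawn(async { /* serve the HTTP API */ });
    broker.spawn(async { /* push/pull broker messages */ });

    // The main thread then tracks the main tasks (see the notable points
    // above) instead of exiting immediately.
    std::thread::park();
}
```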

@arssher arssher requested a review from koivunej April 29, 2023 03:22
@arssher arssher requested a review from a team as a code owner April 29, 2023 03:22
@arssher arssher requested review from petuhovskiy and removed request for a team April 29, 2023 03:22
@arssher (Contributor, Author) commented Apr 29, 2023

As a basic perf test, we can try pgbench -i on staging before and after this patch.

@github-actions (bot) commented Apr 29, 2023

1075 tests run: 1028 passed, 0 failed, 47 skipped (full report)

Flaky tests (1): Postgres 15

The comment gets automatically updated with the latest test results: 17bf244 at 2023-06-08T13:18:19.542Z

@arssher (Contributor, Author) commented Apr 29, 2023

Since metrics can now be delayed by up to a scrape period due to the crutch described above, tests relying on them fail.

@petuhovskiy petuhovskiy added the run-benchmarks label Apr 29, 2023
@petuhovskiy (Member)

Delaying metrics doesn't feel right; we cannot rely on them now because they can be delayed. Can we .await timeline metrics to make it work like it was?

@petuhovskiy (Member) left a review

Generally looks ok, but I wouldn't want to roll it out to production until we test it for at least several days.

It's hard to say if we forgot or missed something; my main concerns are:

  • blocking calls from async tasks, potential thread pool exhaustion
  • forgetting to .await (but AFAIR compiler checks for it?)
  • other bugs or changed behavior

It would be good to have metrics for async runtimes; can we somehow add that? The most useful metrics are:

  • number of unfinished spawned tasks
  • number of active (non-awaiting IO) tasks
  • total number of spawned tasks

@koivunej (Member) left a review

A few notes and questions on not running drops on error and on how the tasks are currently launched.

At a high level I still see GlobalTimelinesState behind an std mutex. Is a global std mutex reasonable there? If we don't have time to do a proper analysis, I would recommend just going with tokio's async mutex. EDIT: it seems that it is required by the current way of doing metrics via the collector API.

Requesting changes after seeing the 6 tokio runtimes: I don't see any justification and one should be enough.

Looks like there is no spawn_blocking usage in the codebase, which removes one deadlock source. Similarly, there is no more block_on or block_in_place, which is awesome 👍.

Resolved review threads on: safekeeper/src/bin/safekeeper.rs (3, one outdated), safekeeper/src/wal_service.rs, safekeeper/src/lib.rs
@koivunej (Member) commented Apr 29, 2023

forgetting to .await (but AFAIR compiler checks for it?)

The Future trait has a #[must_use = "futures do nothing unless you '.await' or poll them"], and similar attributes have been added to tokio for named futures. In general, if a future type does not have a #[must_use], that is a bug. I can't say I've run into this in other codebases; we did have a let _ = future_which_was_never_awaited; in the previous iteration of postgres_backend. It was caught by upgrading clippy.
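A tiny illustration of what those lints catch (flush_wal is a made-up async fn, not a real safekeeper function):

```rust
async fn flush_wal() { /* ... */ }

fn forgot_to_await() {
    // rustc warning from the Future's #[must_use]: the future is never polled.
    flush_wal();

    // This silences #[must_use] but still never runs the future; newer clippy
    // versions flag it with the `let_underscore_future` lint.
    let _ = flush_wal();
}

async fn correct() {
    flush_wal().await; // the future actually runs to completion
}
```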

@petuhovskiy (Member) commented Apr 29, 2023

On a high level I still find GlobalTimelinesState behind an std mutex. Is that reasonable global std mutex? If we don't have time to do proper analysis, would recommend just going with tokio's async mutex.

It's ok behind an std mutex. GlobalTimelinesState is just a collection of Arcs (HashMap<TenantTimelineId, Arc<Timeline>>); we don't do anything but get_all/get/set/delete while holding the lock, so it should never block anyone.
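A simplified sketch of why that pattern is safe with a std mutex (the type definitions here are stand-ins for the real ones): the critical section only touches the map and clones an Arc out, and nothing is awaited while the guard is held.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

struct Timeline; // stand-in for the real timeline type
#[derive(Clone, Hash, PartialEq, Eq)]
struct TenantTimelineId(u128); // stand-in for the real id type

struct GlobalTimelinesState {
    timelines: HashMap<TenantTimelineId, Arc<Timeline>>,
}

fn get(state: &Mutex<GlobalTimelinesState>, id: &TenantTimelineId) -> Option<Arc<Timeline>> {
    // Lock, clone the Arc, and drop the guard at the end of the statement;
    // all async work then happens on the Timeline itself, never under this lock.
    state.lock().unwrap().timelines.get(id).cloned()
}

fn insert(state: &Mutex<GlobalTimelinesState>, id: TenantTimelineId, tli: Arc<Timeline>) {
    state.lock().unwrap().timelines.insert(id, tli);
}
```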

@petuhovskiy (Member)

Currently the only writer is walreceiver and we don't have top level cancellation there, so we are good.

HTTP handlers can and do get canceled often, need to check that we're good there.

@koivunej (Member) commented Apr 29, 2023

Currently the only writer is walreceiver and we don't have top level cancellation there, so we are good.

HTTP handlers can and do get canceled often, need to check that we're good there.

An obvious way would be to spawn on each ... At the same time, I think you'd then need a way to coalesce the requests, otherwise later attempts would get confusing or contradictory responses. An alternative to coalescing would be to notify "busy" or something while the previously detached operation is underway. The implementation cost is the same, though.
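A rough sketch of the "detach and report busy" alternative (the handler shape and return values are made up for illustration): the operation is spawned so handler cancellation cannot drop it halfway, and a per-timeline tokio mutex turns concurrent attempts into a "busy" answer.

```rust
use std::sync::Arc;
use tokio::sync::Mutex;

async fn handle_request(op_lock: Arc<Mutex<()>>) -> &'static str {
    // If a previous operation is still running, answer "busy" instead of
    // queueing behind it and returning something confusing later.
    let Ok(guard) = op_lock.clone().try_lock_owned() else {
        return "busy: previous operation still in progress";
    };

    // Detach the actual work so cancelling the HTTP request cannot drop it
    // at a random await point.
    tokio::spawn(async move {
        // ... perform the timeline operation here ...
        drop(guard); // lock released only when the detached work finishes
    });

    "accepted"
}
```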

I see that you are not yet using utils::http::endpoint::RequestSpan around your handlers. Is this by design? It will at least log if a request is cancelled before completion.

@petuhovskiy (Member)

We are missing logging on the HTTP side; it would be good to add spans and log at least some requests.

@koivunej (Member) commented Apr 29, 2023

We are missing logging on the HTTP side; it would be good to add spans and log at least some requests.

Unless I am missing something, this will be it: ab0ecf5

With the panic tracing hook, any panic will be within the correct context, because the hook fires before unwinding, and things will just work.

I noticed that on panic the cancellation message will be printed, but it's not that big of a deal. I'll PR that fix separately: #4125

@arssher (Contributor, Author) commented Apr 29, 2023

Delaying metrics doesn't feel right; we cannot rely on them now because they can be delayed. Can we .await timeline metrics to make it work like it was?

Came up with another version of the crutch; now the metrics are fresh and the tests are fixed.

@arssher (Contributor, Author) commented Apr 29, 2023

On a high level I still find GlobalTimelinesState behind an std mutex. Is that reasonable global std mutex? If we don't have time to do proper analysis,

We do; as Arthur said, we only hold it for small pieces of CPU work and never await while holding it, so std is fine.

@arssher (Contributor, Author) commented Apr 29, 2023

Requesting changes after seeing the 6 tokio runtimes: I don't see any justification and one should be enough.

Having multiple runtimes provides isolation. For example, it gives an immediate breakdown by subsystem when you look at top. Also, imagine there is a bug in the HTTP processing code which blocks the runtime in some blocking operation; it would probably be good if that doesn't harm WAL servicing. Yes, clearly overall we get more runtime threads than needed. But I don't expect that having 2x or 3x the number of cores in threads is noticeably worse than 1x from a perf POV, though I'm not an expert here. So I'd leave it as is, though I can live with either variant.

I'm not sure how running everything in a single-threaded rt is useful for hunting down deadlocks (it seems it would still be probabilistic, though many threads mask them more often), but if this is desired, it is trivial to add an option for that -- it anyway requires making the rt have only a single worker.

IOW, we can do that after encountering the first deadlock :)

@arssher (Contributor, Author) commented Apr 29, 2023

Unless I am missing something, this will be it: ab0ecf5

Great, included it in this PR.

@arssher (Contributor, Author) commented Apr 29, 2023

Would be good to have metrics for async runtimes, can we somehow add that? The most useful metrics are:

Yeah. So there are unstable TODO metrics in tokio, and tokio-metrics, which also requires tokio unstable. So there is no good solution off the shelf. Also, we can always enable/attach tokio console and check out what's going on.

@koivunej (Member) commented Apr 29, 2023

The easiest way to get metrics today is to add wrappers to futures before spawning and count polls or poll duration etc., like tokio-console does. Doing the waker tracking tokio-console provides is very difficult to jerry-rig, and it has open issues on false positives.
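A bare-bones version of such a wrapper (the counter here is a plain atomic; in practice it would feed a prometheus metric labelled per task kind):

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll};

// A wrapper future that counts how many times the inner future is polled.
struct CountPolls<F> {
    inner: Pin<Box<F>>,
    polls: Arc<AtomicU64>,
}

impl<F: Future> Future for CountPolls<F> {
    type Output = F::Output;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<F::Output> {
        self.polls.fetch_add(1, Ordering::Relaxed);
        self.inner.as_mut().poll(cx)
    }
}

// Wrap before spawning, e.g.:
//   let polls = Arc::new(AtomicU64::new(0));
//   tokio::spawn(CountPolls { inner: Box::pin(some_task()), polls: polls.clone() });
```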

For threads there are additional hooks that can be added on thread start and stop.

Also we can always enable/attach tokio console and check out what's going on

Note that one cannot simply attach tokio console: it requires rebuilding with --cfg tokio_unstable (unless that has changed recently), and it needs at least the tracing feature.

@arssher (Contributor, Author) commented Jun 8, 2023

I rebased and added an option to run everything in a single-threaded runtime (for debugging etc). Will try to run tests and commit today.
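For reference, such an option could look roughly like this (the flag name and wiring are hypothetical, not necessarily the actual implementation): one current-thread runtime for everything when debugging, otherwise the per-subsystem multi-threaded runtimes.

```rust
use tokio::runtime::{Builder, Runtime};

fn build_runtime(name: &'static str, current_thread: bool) -> std::io::Result<Runtime> {
    if current_thread {
        // Everything runs on one thread; handy for debugging and for making
        // ordering problems more reproducible.
        Builder::new_current_thread().enable_all().build()
    } else {
        Builder::new_multi_thread()
            .thread_name(name)
            .enable_all()
            .build()
    }
}
```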

With RequestSpan, successful GETs are not logged, but all other requests are, as are errors and warnings on cancellations.
@arssher arssher merged commit 7e17979 into main Jun 11, 2023
@arssher arssher deleted the sk-async branch June 11, 2023 18:53
@arssher (Contributor, Author) commented Jun 11, 2023

Before merging I wanted to run at least basic benchmarks to get a feel for the overhead of thread-pooled fs IO operations. I did that with pgbench insertion. In short, the overhead is noticeable but can be neglected, at least for now.

The command was like

pgbench -d  'postgres://ars:qB5QsTIaX0Sz@ep-dry-hill-850843.eu-west-1.aws.neon.build/neondb' -i -s 600 -I dtG

On eu-west staging, run on the ec2 machine in the same region:
sync:
server-side generate 71.91 s, disk write stable 261MiB/s.
server-side generate 71.96 s 224MiB/s
server-side generate 72.11 s 176MiB/s.
server-side generate 72.07 s 234MiB/s

async:
server-side generate 71.97 s disk load 198MiB/s.
server-side generate 72.18 s 262MiB/s.
server-side generate 72.09 s 263MiB/s.

async, current thread rt:
server-side generate 72.25 s 196MiB/s.
server-side generate 72.13 s 251MiB/s

Scale 600 is about 9.5GiB of WAL. Disk throughput is taken from grafana (node_exporter, which in turn reads /proc counters). I'm not fully sure why it has such variance, but probably because 1) the test is just too short (~70s, while grafana scrapes every 30s), and 2) WAL offloading / sending to the pageserver takes some of the disk capacity, and this is not stable due to S3 throttling.

As kind of expected, the single-threaded runtime is as good as the multi-threaded one, since network reading and disk IO still technically happen in different threads due to the thread pool offload.

Running the same pgbench but locally, with scale 200 now, having everything (compute, single safekeeper, single pageserver) in tmpfs:
async:
server_side_generate: 22.210 s
server_side_generate: 22.260 s
server_side_generate: 19.130 s
server_side_generate: 18.830 s
server_side_generate: 19.450 s
sync:
server_side_generate: 19.130 s
server_side_generate: 17.140 s
server_side_generate: 17.620 s
server_side_generate: 17.490 s
server_side_generate: 18.080 s
So it seems to be about a ~10% difference.
The command was like

BUILD_TYPE=release CARGO_BUILD_FLAGS="--features=testing" make -s -j12 && rm -rf /tmp/test_output/; TEST_OUTPUT=/tmp/test_output NEON_BIN=./target/release poetry run pytest test_runner/performance -k test_perf_pgbench --preserve-database-files

but I also adjusted the code a bit (left only the init test with -I dtG steps, removed vanilla, changed the scale).

Importantly, both on staging and locally I disabled backpressure

+            "max_replication_apply_lag=-1",
+            "max_replication_flush_lag=-1",
+            "max_replication_write_lag=-1"

otherwise it kicks in quickly and variance increases a lot. Actually, before I realized I needed to do that, I ran some tests, and it seems the sync vs async diff there was greater -- I remember at least one run with async taking 38.4s and sync 32s, but I'm not sure how reliable that is.

While doing that, I found multiple interesting and unresolved links/issues about bad fs IO thread pool performance:
https://users.rust-lang.org/t/why-tokio-fs-is-slower-than-std-fs/39610
https://users.rust-lang.org/t/blocking-permit/36865 (old, but referenced from the fresh link above)
tokio-rs/tokio#3664
They seem to be more about listing operations, though.
Overall, it looks good enough for now, so I merged the PR.

@shinyaaa shinyaaa mentioned this pull request Jul 19, 2024
skyzh pushed a commit that referenced this pull request Jul 19, 2024
There are unused safekeeper runtimes `WAL_REMOVER_RUNTIME` and
`METRICS_SHIFTER_RUNTIME`.

`WAL_REMOVER_RUNTIME` was implemented in #4119 and removed in #7887.
`METRICS_SHIFTER_RUNTIME` was also implemented in #4119 but has never been used.

I removed unused safekeeper runtimes `WAL_REMOVER_RUNTIME` and
`METRICS_SHIFTER_RUNTIME`.
problame pushed a commit that referenced this pull request Jul 22, 2024
Labels
run-benchmarks: Indicates to the CI that benchmarks should be run for a PR marked with this label
4 participants