Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wasm: fix CPU profiler deadlock #17877

Merged
merged 2 commits into from
Apr 16, 2024
Merged

Conversation

rockwotj
Copy link
Contributor

@rockwotj rockwotj commented Apr 16, 2024

Currently trying to take a backtrace while the CPU profiler is running
results in a deadlock, because taking a backtrace segfaults, as we don't
write any debug symbols during JIT compilation (which causes other
deadlocks in libgcc). To fix this, disable the profile's backtracing
when within Wasm. In the future we should use Wasmtime's profiling APIs
to get a stacktrace within the guest program that is running. For more
information on that API see:
https://docs.wasmtime.dev/api/wasmtime/struct.GuestProfiler.html

FIXES: CORE-486

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x

Release Notes

Bug Fixes

  • Fixes an issue where using the CPU profiler with running Data Transforms could cause the process to deadlock.

Signed-off-by: Tyler Rockwood <rockwood@redpanda.com>
@rockwotj rockwotj requested review from BenPope and a team as code owners April 16, 2024 01:46
@rockwotj rockwotj requested review from jackietung-redpanda and removed request for a team April 16, 2024 01:46
@rockwotj
Copy link
Contributor Author

Not backporting because the new APIs in seastar were not backported.

Requires this vtools change: https://github.com/redpanda-data/vtools/pull/2644

@rockwotj rockwotj self-assigned this Apr 16, 2024
dotnwat
dotnwat previously approved these changes Apr 16, 2024
dotnwat
dotnwat previously approved these changes Apr 16, 2024
// Disable profiling backtraces inside the VM - at the time of writing
// backtraces lead to segfaults causing deadlock in Seastar's signal
// handlers.
auto _ = ss::internal::scoped_disable_profile_temporarily();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the comment mean on this object in seastar tree This is not reentrant?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It means that it doesn't handle the case of nesting these RAII objects - when being destructed it doesn't set the flag to the original state, but unconditionally turns the flag off.

@rockwotj
Copy link
Contributor Author

FYI - I'll create a followup ticket, but if we want profiling results within the Wasmtime VM there are ways to get backtraces: https://bytecodealliance.zulipchat.com/#narrow/stream/217126-wasmtime/topic/Lower.20level.20backtrace.20API those APIs just need to be tied into C API

Currently trying to take a backtrace while the CPU profiler is running
results in a deadlock, because taking a backtrace segfaults, as we don't
write any debug symbols during JIT compilation (which causes other
deadlocks in libgcc). To fix this, disable the profile's backtracing
when within Wasm. In the future we should use Wasmtime's profiling APIs
to get a stacktrace within the guest program that is running. For more
information on that API see:
https://docs.wasmtime.dev/api/wasmtime/struct.GuestProfiler.html

Signed-off-by: Tyler Rockwood <rockwood@redpanda.com>
@rockwotj rockwotj dismissed stale reviews from StephanDollberg and dotnwat via a910869 April 16, 2024 20:05
@vbotbuildovich
Copy link
Collaborator

new failures in https://buildkite.com/redpanda/redpanda/builds/47890#018ee8de-1027-4296-8932-5d9f23cf1287:

"rptest.tests.write_caching_fi_test.WriteCachingFailureInjectionTest.test_unavoidable_data_loss"

@dotnwat
Copy link
Member

dotnwat commented Apr 16, 2024

oversized alloc

@dotnwat dotnwat merged commit 13273f3 into redpanda-data:dev Apr 16, 2024
14 of 17 checks passed
@rockwotj rockwotj deleted the wasm-profiler branch April 16, 2024 23:18
@travisdowns
Copy link
Member

@rockwotj @ballard26 this seems like a bit observability hole now that we are rolling out WASM: if we are using a lot of CPU time in WASM it's going to be invisible to the profiling.

Getting stacks in the VM would be nice, but it seems like a simpler solution which would get a lot of the way there would be to suppress stacks with a "reason", and if a sample would have been taken when suppressed we record that as a synthetic stack of say 2 frames: "supppressed" -> "reason" so it still shows clearly in the profile. This would let us see what % of time is being spent in suppressed stuff and keep the profiling unbiased overall.

@rockwotj
Copy link
Contributor Author

rockwotj commented Apr 17, 2024

invisible to the profiling.

It's better than deadlocking the process! Yes I agree I can add this to the profiler. My main priority was not terribly breaking things. Wasm does run in it's own scheduling group, and there are stats on CPU usage per function, so it's not like we're completely blind.

@dotnwat
Copy link
Member

dotnwat commented Apr 17, 2024

Do we have a ticket for improving the situation here, or this happening imminently?

@rockwotj
Copy link
Contributor Author

Ticket: https://redpandadata.atlassian.net/browse/CORE-2410

I can look at this early next week

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants