bug(debug profile): segfault/EXC_BAD_ACCESS during backtrace capture #6205

xiangjinwu · 2022-11-04T07:52:39Z

Describe the bug

When running playground on macOS using latest main (first bad commit db6691b), the following sql commands leads to a server crash with segfault/EXC_BAD_ACCESS.

It works as expected in cluster mode (./risedev d) rather than playground.

To Reproduce

CREATE TABLE t(a int, b int);
CREATE VIEW v AS SELECT * FROM t;
DROP TABLE t;

Expected behavior

Before that commit we were able to see the expected error:

ERROR:  QueryError: Permission denied: PermissionDenied: Fail to delete table `t` because 1 other relation(s) depend on it

Additional context

console warnings before segfault (also there on last good commit, may unrelated):

2022-11-04T15:37:18.08988+08:00  WARN risingwave_storage::hummock::state_store: sealing invalid epoch    
2022-11-04T15:37:18.090268+08:00  WARN risingwave_storage::hummock::state_store: syncing invalid epoch

backtrace from lldb:

* thread #7, name = 'risingwave-main', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
  * frame #0: 0x000000019f39685c libunwind.dylib`libunwind::CFI_Parser<libunwind::LocalAddressSpace>::parseFDEInstructions(libunwind::LocalAddressSpace&, libunwind::CFI_Parser<libunwind::LocalAddressSpace>::FDE_Info const&, libunwind::CFI_Parser<libunwind::LocalAddressSpace>::CIE_Info const&, unsigned long, int, libunwind::CFI_Parser<libunwind::LocalAddressSpace>::PrologInfo*) + 204
    frame #1: 0x000000019f396710 libunwind.dylib`libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_arm64>::getInfoFromFdeCie(libunwind::CFI_Parser<libunwind::LocalAddressSpace>::FDE_Info const&, libunwind::CFI_Parser<libunwind::LocalAddressSpace>::CIE_Info const&, unsigned long, unsigned long) + 100
    frame #2: 0x000000019f3963e8 libunwind.dylib`libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_arm64>::getInfoFromDwarfSection(unsigned long, libunwind::UnwindInfoSections const&, unsigned int) + 184
    frame #3: 0x000000019f3962a0 libunwind.dylib`libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_arm64>::setInfoBasedOnIPRegister(bool) + 1012
    frame #4: 0x000000019f398788 libunwind.dylib`libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_arm64>::step() + 696
    frame #5: 0x000000019f39b138 libunwind.dylib`_Unwind_Backtrace + 352
    frame #6: 0x0000000109081e40 risingwave`std::backtrace::Backtrace::create::h908375f7f84cb508 [inlined] std::backtrace_rs::backtrace::libunwind::trace::h471a59e08ff9e5dc at mod.rs:66:5 [opt]
    frame #7: 0x0000000109081e30 risingwave`std::backtrace::Backtrace::create::h908375f7f84cb508 [inlined] std::backtrace_rs::backtrace::trace_unsynchronized::h4e694232d85e2708 at mod.rs:66:5 [opt]
    frame #8: 0x0000000109081e24 risingwave`std::backtrace::Backtrace::create::h908375f7f84cb508 at backtrace.rs:333:13 [opt]
    frame #9: 0x00000001060d8454 risingwave`_$LT$risingwave_meta..error..MetaError$u20$as$u20$core..convert..From$LT$risingwave_meta..error..MetaErrorInner$GT$$GT$::from::hb4b62fbc8685e728(inner=<unavailable>) at error.rs:66:33
    frame #10: 0x0000000105e7e1d8 risingwave`_$LT$T$u20$as$u20$core..convert..Into$LT$U$GT$$GT$::into::h7837bc8fb77e8181(self=<unavailable>) at mod.rs:726:9
    frame #11: 0x00000001060d8830 risingwave`risingwave_meta::error::MetaError::permission_denied::h6592a4f64415a283(s=<unavailable>) at error.rs:96:9
    frame #12: 0x00000001064bb43c risingwave`risingwave_meta::manager::catalog::CatalogManager$LT$S$GT$::drop_materialized_source::_$u7b$$u7b$closure$u7d$$u7d$::h36956825a3496ee1((null)=ResumeTy @ 0x0000000170c3e2b0) at mod.rs:1012:36
(... more callers omitted ...)

The text was updated successfully, but these errors were encountered:

skyzh · 2022-11-04T14:19:17Z

maybe related: rust-lang/rust#47551

xxchan · 2022-11-04T15:34:53Z

It seems that this issue happens randomly after some random changes, and will be fixed after some other random changes.

xxchan · 2022-11-10T14:53:25Z

I have no idea how to fix it.. Maybe we can wait until it occurs more often and becomes a huge problem... 😇

xiangjinwu · 2022-11-14T03:34:44Z

Now it is no longer restricted to single binary ... It also happens with ./risedev d (1 compute node, debug profile). See #6340. I have confirmed that ENABLE_RELEASE_PROFILE=true ./risedev d does not have the issue, and under debug profile lldb reports the same stack inside std::backtrace::Backtrace::create

It feels like as long as we attempt to capture the backtrace, we have a chance to segfault. But it is deterministic given a certain commit and input query.

commit: 484b9ab
macOS Monterey 12.5 M1 Pro

skyzh · 2022-11-14T03:43:56Z

Sounds like a bug with LLVM… Also the bug sounds not like randomly happening. Something wrong with the rewind header. I’ll find an MCVE and report the bug to upstream.

skyzh · 2022-11-14T04:08:24Z

Tracked in rust-lang/rust#104388

Don't use `RwError`; Use `ok_or_else` to avoid unnecesary error creation. To some extent mitigate #6205... Approved-By: BugenZhao

xxchan · 2023-04-08T20:20:16Z

It seems using another linker (specifically, a newer lld) doesn't fix the problem, because the problem still occurs. I guess it's just a random effect like random code change.

xxchan · 2023-04-10T16:14:35Z

Summarize recent investigation:

The DWARF region (.eh_frame) increased in nightly-2022-08-13 (commits), I suspect it's due to update to LLVM 15, but can't confirm it in the LLVM's changelog, and did not manually tested it. Still, not sure the performance is reasonable. (Is it because linux uses glibc's libunwind which is faster?) As a comparison:

# Ubuntu
objdump target/debug/risingwave --dwarf=frames | wc -l
7963489

time for capture backtrace: 116.962µs

# mac
objdump target/debug/risingwave --dwarf=frame | wc -l
19623476

time for capture backtrace: 107.664291ms

The SEGFAULT is due to a zero cieInfo (details libunwind parseFDEInstructions segfaults due to a zero cieInfo llvm/llvm-project#62048). Since I'm a noob and know nothing about CIE in DWARF, whether it's LLVM libunwind's fault or rustc/lld producing an invalid CIE is not yet clear.
- Note the problem is not fixed in latest lld/libunwind.

TLDR: I still don't know why and how to fix it! 🤡

xxchan · 2023-04-10T20:01:26Z

Is it because linux uses glibc's libunwind which is faster?

Probably not! I've compiled rustc from source and use llvm-libunwind on linux (see the config below). The performance for backtrace capturing is almost the same as the version using libgcc_s.so

https://github.com/rust-lang/rust/blob/a73288371e3fa0a610fbc11e7e8418017bdfde42/config.example.toml#L726-L736

wangrunji0408 · 2023-04-11T06:53:11Z

Report a similar problem on a70e5b5 M1 Pro macOS.

$ ./risedev p
...
libunwind: malformed DW_CFA_same_value DWARF unwind, reg too big
Segmentation fault: 11  RUST_BACKTRACE=1 RW_NODE=playground cargo run --bin risingwave --profile "${RISINGWAVE_BUILD_PROFILE}"

The backtrace is similar to the context of this issue.

* thread #11, name = 'risingwave-main', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
  * frame #0: 0x000000019a5a073c libunwind.dylib`libunwind::CFI_Parser<libunwind::LocalAddressSpace>::decodeFDE(libunwind::LocalAddressSpace&, unsigned long, libunwind::CFI_Parser<libunwind::LocalAddressSpace>::FDE_Info*, libunwind::CFI_Parser<libunwind::LocalAddressSpace>::CIE_Info*, bool) + 48
    frame #1: 0x000000019a5a2110 libunwind.dylib`libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_arm64>::step() + 208
    frame #2: 0x000000019a5a4f00 libunwind.dylib`_Unwind_Backtrace + 348
    frame #3: 0x0000000109b1c2f8 risingwave`std::backtrace::Backtrace::create::hc66c9fdbb6a749f6 [inlined] std::backtrace_rs::backtrace::libunwind::trace::h3c90f7da2ae71937 at libunwind.rs:93:5 [opt]
    frame #4: 0x0000000109b1c2e8 risingwave`std::backtrace::Backtrace::create::hc66c9fdbb6a749f6 [inlined] std::backtrace_rs::backtrace::trace_unsynchronized::hb2b06b1b0aadcef3 at mod.rs:66:5 [opt]
    frame #5: 0x0000000109b1c2dc risingwave`std::backtrace::Backtrace::create::hc66c9fdbb6a749f6 at backtrace.rs:332:13 [opt]
    frame #6: 0x0000000109b01514 risingwave`anyhow::error::_$LT$impl$u20$anyhow..Error$GT$::msg::h88bf86dcaa2f30c2(message=<unavailable>) at error.rs:83:36
    frame #7: 0x000000010550bb8c risingwave`risingwave_meta::telemetry::TrackingId::from_meta_store::_$u7b$$u7b$closure$u7d$$u7d$::h75cf24e092eef2bc((null)=0x00000001712717b0) at telemetry.rs:48:27

$ cc --version
Apple clang version 14.0.3 (clang-1403.0.22.14.1)
Target: arm64-apple-darwin22.4.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

xxchan · 2023-04-11T07:01:32Z

I cannot reproduce it on my mac so I'm trying to upgrade XCode/macOS.

Some noticeable minor changes:

The libunwind message
The SEGFAULT is in decodeFDE, not parseFDEInstructions

xxchan · 2023-04-11T08:22:02Z

After upgrading to latest XCode, I cannot reproduce on a70e5b5, and 7e0c548, db6691b anymore 🥲.

> cc --version
Apple clang version 14.0.3 (clang-1403.0.22.14.1)

# previously it's
Apple clang version 14.0.0 (clang-1400.0.29.202)

xxchan · 2023-04-11T09:12:19Z

BTW, capturing backtrace on mac release mode takes 1.812083ms

❯ objdump target/debug/risingwave --dwarf=frames | wc -l
 15685458
❯ objdump target/release/risingwave --dwarf=frames | wc -l 
 7679132

xxchan · 2023-04-11T09:28:21Z

split-debuginfo=off/packed/unpacked have comparable performance, i.e., 100+ms on mac and 100+µs on linux

xxchan · 2023-04-11T12:56:13Z

-Cdebuginfo=1/line-tables-only only slightly reduced the time, still 100ms.

xiangjinwu · 2023-06-20T12:28:32Z

On main branch (f0f96a8, likely earlier):

./risedev p

-- taken from e2e_test/batch/./basic/query.slt.part
create table t3 (v1 int, v2 int, v3 int);
insert into t3 values(1, 2, NULL);
select v1/0 from t3;

same segfault in libunwind::CFI_Parser

xxchan · 2023-07-06T09:50:25Z

I meet a new error message on e977dc9 🤔. It segfaults immediately when risedev d (What's also interesting is that it won't occur if I start frontend manually after it crashed. Only for risedev...)


libunwind: malformed DW_CFA_same_value DWARF unwind, reg too big
Thu Jul  6 09:42:43 UTC 2023 [risedev]: Program exited with 139

Oh, the message is mentioned already #6205 (comment)

yezizp2012 · 2023-07-19T11:24:41Z

I also met this issue for risedev in my local setup macOS, but it can't be reproduced if restart frontend manually as well. And the issue is gone when I rebased the main branch. 🥵

xxchan · 2023-08-10T10:27:15Z

2342e8b SEGFAULTs immediately for me.

It's caused by write_exclusive_cluster_id calling object_store.read, which returns Err for non-existing data. And then object_store::ObjectError captures backtrace. It seems quite reasonable. Really funny. 🥵

Really hope to investigate this problem ... if I have enough time

What's worse is that adding some random dummy code doesn't fix it. 🥶

xxchan · 2023-11-06T09:09:45Z

There's a update in the upstream issue rust-lang/rust#104388 (comment). Looks promising

xxchan · 2023-12-14T14:26:25Z

tikv/tikv#9957 Is this the same? 🤔

xiangjinwu added the type/bug Something isn't working label Nov 4, 2022

github-actions bot added this to the release-0.1.14 milestone Nov 4, 2022

xiangjinwu mentioned this issue Nov 4, 2022

performance regression (for debug & single binary) of backtrace capturing after bumping toolchain on macOS #6131

Closed

xiangjinwu changed the title ~~bug: segfault/EXC_BAD_ACCESS during MetaError backtrace capture~~ bug(playground): segfault/EXC_BAD_ACCESS during MetaError backtrace capture Nov 4, 2022

This comment was marked as resolved.

Sign in to view

This comment was marked as outdated.

Sign in to view

This comment was marked as resolved.

Sign in to view

fuyufjh removed this from the release-0.1.14 milestone Nov 11, 2022

xiangjinwu changed the title ~~bug(playground): segfault/EXC_BAD_ACCESS during MetaError backtrace capture~~ bug(debug profile): segfault/EXC_BAD_ACCESS during backtrace capture Nov 14, 2022

xiangjinwu mentioned this issue Nov 14, 2022

compute node dies when running regress boolean test #6340

Closed

xiangjinwu mentioned this issue Nov 14, 2022

toolchain: revert workarounds and fix regressions in bumping #6103

Closed

7 tasks

skyzh mentioned this issue Nov 14, 2022

capturing stack backtrace becomes slower and sometimes segfaults on Apple Silicon rust-lang/rust#104388

Open

This comment was marked as resolved.

Sign in to view

xxchan self-assigned this Jan 25, 2023

xxchan mentioned this issue Feb 14, 2023

All-in-one binary SEGFAULT immediately (MacOS, debug mode) #7927

Closed

This comment was marked as resolved.

Sign in to view

xxchan mentioned this issue Feb 15, 2023

refactor: error handling in system param #7945

Merged

mergify bot pushed a commit that referenced this issue Feb 15, 2023

refactor: error handling in system param (#7945)

df746b7

Don't use `RwError`; Use `ok_or_else` to avoid unnecesary error creation. To some extent mitigate #6205... Approved-By: BugenZhao

BugenZhao mentioned this issue Mar 14, 2023

perf: nexmark q14 #8503

Closed

ahmedriza mentioned this issue Mar 18, 2023

Cannot run risingwave binary on Mac OS M1 (when linked by zld) #8608

Closed

This was referenced Apr 7, 2023

Failed to start risedev after #8770 #9038

Closed

refactor(telemetry): avoid backtrace in TrackingId::from_meta_store error #9042

Closed

xxchan mentioned this issue Apr 7, 2023

chore: bump toolchain to nightly-2023-04-07 #9051

Merged

1 task

This comment was marked as resolved.

Sign in to view

xxchan mentioned this issue Aug 4, 2023

risedev p and d do not share build #11458

Closed

xxchan mentioned this issue Dec 27, 2023

refactor(error): migrate function errors to anyhow #14159

Closed

9 tasks

xxchan closed this as not planned Won't fix, can't repro, duplicate, stale Sep 17, 2024

bug(debug profile): segfault/EXC_BAD_ACCESS during backtrace capture #6205

bug(debug profile): segfault/EXC_BAD_ACCESS during backtrace capture #6205

Comments

xiangjinwu commented Nov 4, 2022 • edited Loading

Describe the bug

To Reproduce

Expected behavior

Additional context

This comment was marked as resolved.

This comment was marked as resolved.

skyzh commented Nov 4, 2022

This comment was marked as outdated.

This comment was marked as resolved.

This comment was marked as resolved.

This comment was marked as resolved.

This comment was marked as resolved.

This comment was marked as resolved.

This comment was marked as resolved.

This comment was marked as resolved.

This comment was marked as resolved.

This comment was marked as resolved.

xxchan commented Nov 4, 2022

This comment was marked as resolved.

This comment was marked as resolved.

xxchan commented Nov 10, 2022

xiangjinwu commented Nov 14, 2022 • edited Loading

skyzh commented Nov 14, 2022

skyzh commented Nov 14, 2022

This comment was marked as resolved.

This comment was marked as resolved.

xxchan commented Apr 8, 2023

This comment was marked as resolved.

This comment was marked as resolved.

xxchan commented Apr 10, 2023 • edited Loading

xxchan commented Apr 10, 2023

wangrunji0408 commented Apr 11, 2023 • edited Loading

xxchan commented Apr 11, 2023

xxchan commented Apr 11, 2023 • edited Loading

xxchan commented Apr 11, 2023 • edited Loading

xxchan commented Apr 11, 2023

xxchan commented Apr 11, 2023

xiangjinwu commented Jun 20, 2023

xxchan commented Jul 6, 2023 • edited Loading

yezizp2012 commented Jul 19, 2023

xxchan commented Aug 10, 2023 • edited Loading

xxchan commented Nov 6, 2023

xxchan commented Dec 14, 2023

xiangjinwu commented Nov 4, 2022 •

edited

Loading

xiangjinwu commented Nov 14, 2022 •

edited

Loading

xxchan commented Apr 10, 2023 •

edited

Loading

wangrunji0408 commented Apr 11, 2023 •

edited

Loading

xxchan commented Apr 11, 2023 •

edited

Loading

xxchan commented Apr 11, 2023 •

edited

Loading

xxchan commented Jul 6, 2023 •

edited

Loading

xxchan commented Aug 10, 2023 •

edited

Loading