Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug(debug profile): segfault/EXC_BAD_ACCESS during backtrace capture #6205

Closed
Tracked by #6103
xiangjinwu opened this issue Nov 4, 2022 · 47 comments · Fixed by #6362
Closed
Tracked by #6103

bug(debug profile): segfault/EXC_BAD_ACCESS during backtrace capture #6205

xiangjinwu opened this issue Nov 4, 2022 · 47 comments · Fixed by #6362
Assignees
Labels
type/bug Something isn't working

Comments

@xiangjinwu
Copy link
Contributor

xiangjinwu commented Nov 4, 2022

Describe the bug

When running playground on macOS using latest main (first bad commit db6691b), the following sql commands leads to a server crash with segfault/EXC_BAD_ACCESS.

It works as expected in cluster mode (./risedev d) rather than playground.

To Reproduce

CREATE TABLE t(a int, b int);
CREATE VIEW v AS SELECT * FROM t;
DROP TABLE t;

Expected behavior

Before that commit we were able to see the expected error:

ERROR:  QueryError: Permission denied: PermissionDenied: Fail to delete table `t` because 1 other relation(s) depend on it

Additional context

console warnings before segfault (also there on last good commit, may unrelated):

2022-11-04T15:37:18.08988+08:00  WARN risingwave_storage::hummock::state_store: sealing invalid epoch    
2022-11-04T15:37:18.090268+08:00  WARN risingwave_storage::hummock::state_store: syncing invalid epoch    

backtrace from lldb:

* thread #7, name = 'risingwave-main', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
  * frame #0: 0x000000019f39685c libunwind.dylib`libunwind::CFI_Parser<libunwind::LocalAddressSpace>::parseFDEInstructions(libunwind::LocalAddressSpace&, libunwind::CFI_Parser<libunwind::LocalAddressSpace>::FDE_Info const&, libunwind::CFI_Parser<libunwind::LocalAddressSpace>::CIE_Info const&, unsigned long, int, libunwind::CFI_Parser<libunwind::LocalAddressSpace>::PrologInfo*) + 204
    frame #1: 0x000000019f396710 libunwind.dylib`libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_arm64>::getInfoFromFdeCie(libunwind::CFI_Parser<libunwind::LocalAddressSpace>::FDE_Info const&, libunwind::CFI_Parser<libunwind::LocalAddressSpace>::CIE_Info const&, unsigned long, unsigned long) + 100
    frame #2: 0x000000019f3963e8 libunwind.dylib`libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_arm64>::getInfoFromDwarfSection(unsigned long, libunwind::UnwindInfoSections const&, unsigned int) + 184
    frame #3: 0x000000019f3962a0 libunwind.dylib`libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_arm64>::setInfoBasedOnIPRegister(bool) + 1012
    frame #4: 0x000000019f398788 libunwind.dylib`libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_arm64>::step() + 696
    frame #5: 0x000000019f39b138 libunwind.dylib`_Unwind_Backtrace + 352
    frame #6: 0x0000000109081e40 risingwave`std::backtrace::Backtrace::create::h908375f7f84cb508 [inlined] std::backtrace_rs::backtrace::libunwind::trace::h471a59e08ff9e5dc at mod.rs:66:5 [opt]
    frame #7: 0x0000000109081e30 risingwave`std::backtrace::Backtrace::create::h908375f7f84cb508 [inlined] std::backtrace_rs::backtrace::trace_unsynchronized::h4e694232d85e2708 at mod.rs:66:5 [opt]
    frame #8: 0x0000000109081e24 risingwave`std::backtrace::Backtrace::create::h908375f7f84cb508 at backtrace.rs:333:13 [opt]
    frame #9: 0x00000001060d8454 risingwave`_$LT$risingwave_meta..error..MetaError$u20$as$u20$core..convert..From$LT$risingwave_meta..error..MetaErrorInner$GT$$GT$::from::hb4b62fbc8685e728(inner=<unavailable>) at error.rs:66:33
    frame #10: 0x0000000105e7e1d8 risingwave`_$LT$T$u20$as$u20$core..convert..Into$LT$U$GT$$GT$::into::h7837bc8fb77e8181(self=<unavailable>) at mod.rs:726:9
    frame #11: 0x00000001060d8830 risingwave`risingwave_meta::error::MetaError::permission_denied::h6592a4f64415a283(s=<unavailable>) at error.rs:96:9
    frame #12: 0x00000001064bb43c risingwave`risingwave_meta::manager::catalog::CatalogManager$LT$S$GT$::drop_materialized_source::_$u7b$$u7b$closure$u7d$$u7d$::h36956825a3496ee1((null)=ResumeTy @ 0x0000000170c3e2b0) at mod.rs:1012:36
(... more callers omitted ...)
@xiangjinwu xiangjinwu added the type/bug Something isn't working label Nov 4, 2022
@github-actions github-actions bot added this to the release-0.1.14 milestone Nov 4, 2022
@xiangjinwu xiangjinwu changed the title bug: segfault/EXC_BAD_ACCESS during MetaError backtrace capture bug(playground): segfault/EXC_BAD_ACCESS during MetaError backtrace capture Nov 4, 2022
@xxchan

This comment was marked as resolved.

@BugenZhao

This comment was marked as resolved.

@skyzh
Copy link
Contributor

skyzh commented Nov 4, 2022

maybe related: rust-lang/rust#47551

@skyzh

This comment was marked as outdated.

@skyzh

This comment was marked as resolved.

@skyzh

This comment was marked as resolved.

@chenzl25

This comment was marked as resolved.

@skyzh

This comment was marked as resolved.

@skyzh

This comment was marked as resolved.

@chenzl25

This comment was marked as resolved.

@skyzh

This comment was marked as resolved.

@skyzh

This comment was marked as resolved.

@skyzh

This comment was marked as resolved.

@xxchan
Copy link
Member

xxchan commented Nov 4, 2022

It seems that this issue happens randomly after some random changes, and will be fixed after some other random changes.

@skyzh

This comment was marked as resolved.

@fuyufjh

This comment was marked as resolved.

@xxchan
Copy link
Member

xxchan commented Nov 10, 2022

I have no idea how to fix it.. Maybe we can wait until it occurs more often and becomes a huge problem... 😇

@fuyufjh fuyufjh removed this from the release-0.1.14 milestone Nov 11, 2022
@xiangjinwu xiangjinwu changed the title bug(playground): segfault/EXC_BAD_ACCESS during MetaError backtrace capture bug(debug profile): segfault/EXC_BAD_ACCESS during backtrace capture Nov 14, 2022
@xiangjinwu
Copy link
Contributor Author

xiangjinwu commented Nov 14, 2022

Now it is no longer restricted to single binary ... It also happens with ./risedev d (1 compute node, debug profile). See #6340. I have confirmed that ENABLE_RELEASE_PROFILE=true ./risedev d does not have the issue, and under debug profile lldb reports the same stack inside std::backtrace::Backtrace::create

It feels like as long as we attempt to capture the backtrace, we have a chance to segfault. But it is deterministic given a certain commit and input query.

commit: 484b9ab
macOS Monterey 12.5 M1 Pro

@skyzh
Copy link
Contributor

skyzh commented Nov 14, 2022

Sounds like a bug with LLVM… Also the bug sounds not like randomly happening. Something wrong with the rewind header. I’ll find an MCVE and report the bug to upstream.

@skyzh
Copy link
Contributor

skyzh commented Nov 14, 2022

Tracked in rust-lang/rust#104388

@skyzh

This comment was marked as resolved.

@xxchan

This comment was marked as resolved.

mergify bot pushed a commit that referenced this issue Feb 15, 2023
Don't use `RwError`;
Use `ok_or_else` to avoid unnecesary error creation.

To some extent mitigate #6205...

Approved-By: BugenZhao
@xxchan
Copy link
Member

xxchan commented Apr 8, 2023

It seems using another linker (specifically, a newer lld) doesn't fix the problem, because the problem still occurs. I guess it's just a random effect like random code change.

@xxchan

This comment was marked as resolved.

@xxchan

This comment was marked as resolved.

@xxchan
Copy link
Member

xxchan commented Apr 10, 2023

Summarize recent investigation:

  • The DWARF region (.eh_frame) increased in nightly-2022-08-13 (commits), I suspect it's due to update to LLVM 15, but can't confirm it in the LLVM's changelog, and did not manually tested it. Still, not sure the performance is reasonable. (Is it because linux uses glibc's libunwind which is faster?) As a comparison:
# Ubuntu
objdump target/debug/risingwave --dwarf=frames | wc -l
7963489

time for capture backtrace: 116.962µs

# mac
objdump target/debug/risingwave --dwarf=frame | wc -l
19623476

time for capture backtrace: 107.664291ms 

TLDR: I still don't know why and how to fix it! 🤡

@xxchan
Copy link
Member

xxchan commented Apr 10, 2023

Is it because linux uses glibc's libunwind which is faster?

Probably not! I've compiled rustc from source and use llvm-libunwind on linux (see the config below). The performance for backtrace capturing is almost the same as the version using libgcc_s.so

https://github.com/rust-lang/rust/blob/a73288371e3fa0a610fbc11e7e8418017bdfde42/config.example.toml#L726-L736

@wangrunji0408
Copy link
Contributor

wangrunji0408 commented Apr 11, 2023

Report a similar problem on a70e5b5 M1 Pro macOS.

$ ./risedev p
...
libunwind: malformed DW_CFA_same_value DWARF unwind, reg too big
Segmentation fault: 11  RUST_BACKTRACE=1 RW_NODE=playground cargo run --bin risingwave --profile "${RISINGWAVE_BUILD_PROFILE}"

The backtrace is similar to the context of this issue.

* thread #11, name = 'risingwave-main', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
  * frame #0: 0x000000019a5a073c libunwind.dylib`libunwind::CFI_Parser<libunwind::LocalAddressSpace>::decodeFDE(libunwind::LocalAddressSpace&, unsigned long, libunwind::CFI_Parser<libunwind::LocalAddressSpace>::FDE_Info*, libunwind::CFI_Parser<libunwind::LocalAddressSpace>::CIE_Info*, bool) + 48
    frame #1: 0x000000019a5a2110 libunwind.dylib`libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_arm64>::step() + 208
    frame #2: 0x000000019a5a4f00 libunwind.dylib`_Unwind_Backtrace + 348
    frame #3: 0x0000000109b1c2f8 risingwave`std::backtrace::Backtrace::create::hc66c9fdbb6a749f6 [inlined] std::backtrace_rs::backtrace::libunwind::trace::h3c90f7da2ae71937 at libunwind.rs:93:5 [opt]
    frame #4: 0x0000000109b1c2e8 risingwave`std::backtrace::Backtrace::create::hc66c9fdbb6a749f6 [inlined] std::backtrace_rs::backtrace::trace_unsynchronized::hb2b06b1b0aadcef3 at mod.rs:66:5 [opt]
    frame #5: 0x0000000109b1c2dc risingwave`std::backtrace::Backtrace::create::hc66c9fdbb6a749f6 at backtrace.rs:332:13 [opt]
    frame #6: 0x0000000109b01514 risingwave`anyhow::error::_$LT$impl$u20$anyhow..Error$GT$::msg::h88bf86dcaa2f30c2(message=<unavailable>) at error.rs:83:36
    frame #7: 0x000000010550bb8c risingwave`risingwave_meta::telemetry::TrackingId::from_meta_store::_$u7b$$u7b$closure$u7d$$u7d$::h75cf24e092eef2bc((null)=0x00000001712717b0) at telemetry.rs:48:27
$ cc --version
Apple clang version 14.0.3 (clang-1403.0.22.14.1)
Target: arm64-apple-darwin22.4.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

@xxchan
Copy link
Member

xxchan commented Apr 11, 2023

I cannot reproduce it on my mac so I'm trying to upgrade XCode/macOS.

Some noticeable minor changes:

  • The libunwind message
  • The SEGFAULT is in decodeFDE, not parseFDEInstructions

@xxchan
Copy link
Member

xxchan commented Apr 11, 2023

After upgrading to latest XCode, I cannot reproduce on a70e5b5, and 7e0c548, db6691b anymore 🥲.

> cc --version
Apple clang version 14.0.3 (clang-1403.0.22.14.1)

# previously it's
Apple clang version 14.0.0 (clang-1400.0.29.202) 

@xxchan
Copy link
Member

xxchan commented Apr 11, 2023

BTW, capturing backtrace on mac release mode takes 1.812083ms

❯ objdump target/debug/risingwave --dwarf=frames | wc -l
 15685458
❯ objdump target/release/risingwave --dwarf=frames | wc -l 
 7679132

@xxchan
Copy link
Member

xxchan commented Apr 11, 2023

split-debuginfo=off/packed/unpacked have comparable performance, i.e., 100+ms on mac and 100+µs on linux

@xxchan
Copy link
Member

xxchan commented Apr 11, 2023

-Cdebuginfo=1/line-tables-only only slightly reduced the time, still 100ms.

@xiangjinwu
Copy link
Contributor Author

On main branch (f0f96a8, likely earlier):

./risedev p

-- taken from e2e_test/batch/./basic/query.slt.part
create table t3 (v1 int, v2 int, v3 int);
insert into t3 values(1, 2, NULL);
select v1/0 from t3;

same segfault in libunwind::CFI_Parser

@xxchan
Copy link
Member

xxchan commented Jul 6, 2023

I meet a new error message on e977dc9 🤔. It segfaults immediately when risedev d (What's also interesting is that it won't occur if I start frontend manually after it crashed. Only for risedev...)


libunwind: malformed DW_CFA_same_value DWARF unwind, reg too big
Thu Jul  6 09:42:43 UTC 2023 [risedev]: Program exited with 139

Oh, the message is mentioned already #6205 (comment)

@yezizp2012
Copy link
Member

I also met this issue for risedev in my local setup macOS, but it can't be reproduced if restart frontend manually as well. And the issue is gone when I rebased the main branch. 🥵

@xxchan
Copy link
Member

xxchan commented Aug 10, 2023

2342e8b SEGFAULTs immediately for me.

It's caused by write_exclusive_cluster_id calling object_store.read, which returns Err for non-existing data. And then object_store::ObjectError captures backtrace. It seems quite reasonable. Really funny. 🥵

Really hope to investigate this problem ... if I have enough time


What's worse is that adding some random dummy code doesn't fix it. 🥶

@xxchan
Copy link
Member

xxchan commented Nov 6, 2023

There's a update in the upstream issue rust-lang/rust#104388 (comment). Looks promising

@xxchan
Copy link
Member

xxchan commented Dec 14, 2023

tikv/tikv#9957 Is this the same? 🤔

@xxchan xxchan closed this as not planned Won't fix, can't repro, duplicate, stale Sep 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants