SIGABRT thrown by libuv during initial block download ("Abort Trap 6") #867

Closed
pinheadmz opened this issue Oct 8, 2019 · 14 comments
Labels
bug (Unexpected or incorrect behavior), stability / efficiency (Denial of service, better resource usage)

@pinheadmz
Member

MacBook Pro (13-inch, 2017, Four Thunderbolt 3 Ports)
OSX version 10.14.6 (18G95)
nodejs version 10.15.3
bcoin version: master at 99638ac, other branches as well
data directory (--prefix) writing to external SSD: https://www.amazon.com/Samsung-T5-Portable-SSD-MU-PA1T0B/dp/B073H552FJ

bcoin crashes consistently with Abort Trap 6 during initial block download, in the height range 280000 to 340000. The crash happens regularly and predictably under various configurations, including --no-wallet and even when bypassing blockstore (forcing LevelDB instead, using this patch).

Stack traces using lldb and llnode are not entirely helpful:

[debug] (chain) Memory: rss=259mb, js-heap=45/90mb native-heap=169mb
[info] (chain) Block 00000000000000010a3ab2f52dab5cc205483d574c4cd2456a4e959f31dff701 (282380) added to chain (size=237125 txs=589 time=77.326367).
[debug] (net) Status: time=2014-01-25T11:05:40Z height=282380 progress=47.02% orphans=0 active=12620 target=419558700 peers=8

Process 68237 stopped
* thread #10, stop reason = signal SIGABRT
    frame #0: 0x00007fff5e4072c6 libsystem_kernel.dylib`__pthread_kill + 10
libsystem_kernel.dylib`__pthread_kill:
->  0x7fff5e4072c6 <+10>: jae    0x7fff5e4072d0            ; <+20>
    0x7fff5e4072c8 <+12>: movq   %rax, %rdi
    0x7fff5e4072cb <+15>: jmp    0x7fff5e401457            ; cerror_nocancel
    0x7fff5e4072d0 <+20>: retq
Target 0: (node) stopped.

(llnode) bt
* thread #10, stop reason = signal SIGABRT
  * frame #0: 0x00007fff5e4072c6 libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007fff5e4c2bf1 libsystem_pthread.dylib`pthread_kill + 284
    frame #2: 0x00007fff5e3716a6 libsystem_c.dylib`abort + 127
    frame #3: 0x00000001009b28d2 node`uv_cond_wait + 20
    frame #4: 0x00000001009a44b1 node`worker + 71
    frame #5: 0x00007fff5e4c02eb libsystem_pthread.dylib`_pthread_body + 126
    frame #6: 0x00007fff5e4c3249 libsystem_pthread.dylib`_pthread_start + 66
    frame #7: 0x00007fff5e4bf40d libsystem_pthread.dylib`thread_start + 13

(llnode) v8 bt
 * thread #10: tid = 0xa5675f, 0x00007fff5e4072c6 libsystem_kernel.dylib`__pthread_kill + 10, stop reason = signal SIGABRT
  * frame #0: 0x00007fff5e4072c6 libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007fff5e4c2bf1 libsystem_pthread.dylib`pthread_kill + 284
    frame #2: 0x00007fff5e3716a6 libsystem_c.dylib`abort + 127
    frame #3: 0x00000001009b28d2 node`uv_cond_wait + 20
    frame #4: 0x00000001009a44b1 node`worker + 71
    frame #5: 0x00007fff5e4c02eb libsystem_pthread.dylib`_pthread_body + 126
    frame #6: 0x00007fff5e4c3249 libsystem_pthread.dylib`_pthread_start + 66
    frame #7: 0x00007fff5e4bf40d libsystem_pthread.dylib`thread_start + 13

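The process seems to be aborted inside libuv's uv_cond_wait, which calls abort() when pthread_cond_wait returns an error (for example, because of a bad mutex): https://github.com/libuv/libuv/blob/bee1bf5dd7de8da316821c32411425f7cf7ab49c/src/unix/thread.c#L780-L783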

A bit more explanation here: https://linux.die.net/man/3/pthread_cond_wait
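
For readers unfamiliar with that abort path, here is a minimal sketch of the pattern, not libuv's actual source and not a diagnosis of this crash. The names cond_wait_or_abort and wait_without_lock are hypothetical. The second function shows one well-defined way to make pthread_cond_wait return an error: waiting on an error-checking mutex that the calling thread has not locked (POSIX specifies EPERM for this case). Whether anything like this happens in bcoin, bdb, or node is an assumption that this thread does not confirm.

```c
#define _XOPEN_SOURCE 700 /* exposes PTHREAD_MUTEX_ERRORCHECK under strict -std modes */
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical wrapper with the same shape as the linked libuv code:
 * any nonzero return from pthread_cond_wait() is treated as fatal. */
void cond_wait_or_abort(pthread_cond_t *cond, pthread_mutex_t *mutex) {
  if (pthread_cond_wait(cond, mutex))
    abort(); /* raises SIGABRT, the signal seen in the traces above */
}

/* Provokes an error return without aborting: waits on an error-checking
 * mutex that the calling thread does NOT hold, then returns the error code. */
int wait_without_lock(void) {
  pthread_mutexattr_t attr;
  pthread_mutex_t mutex;
  pthread_cond_t cond;
  int rc;

  rc = pthread_mutexattr_init(&attr);
  assert(rc == 0);
  rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
  assert(rc == 0);
  rc = pthread_mutex_init(&mutex, &attr);
  assert(rc == 0);
  rc = pthread_cond_init(&cond, NULL);
  assert(rc == 0);

  /* The mutex is unlocked here, so the call fails instead of blocking. */
  rc = pthread_cond_wait(&cond, &mutex);

  pthread_cond_destroy(&cond);
  pthread_mutex_destroy(&mutex);
  pthread_mutexattr_destroy(&attr);
  return rc;
}
```

Calling wait_without_lock() from a small main() and printing the result shows the nonzero error code; passing the same unlocked mutex to cond_wait_or_abort() instead would terminate the process with SIGABRT. Build with something like cc -pthread.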

@braydonf
Contributor

braydonf commented Oct 8, 2019

Several segmentation faults have been fixed upstream for leveldown and are pending to be updated into bdb as well, see bcoin-org/bdb#8. It's possible there is another similar issue?

@braydonf
Contributor

braydonf commented Oct 8, 2019

There is also this potentially relevant line, since it includes uv_cond_wait: https://github.com/bcoin-org/bdb/blob/a42ea1238b8e500d8a9e18c6f36806a17c23bf61/deps/leveldb/port-libuv/port_uv.cc#L27. However, it's unclear whether that code is actually run.

@braydonf braydonf added the bug Unexpected or incorrect behavior label Oct 8, 2019
@braydonf
Contributor

braydonf commented Oct 9, 2019

Working on a branch to update LevelDB from v1.20 to v1.22 (bcoin-org/bdb#10), which could be useful here.

@pinheadmz
Member Author

bdb#10 performed better than previous trials, but sadly, this morning, it succumbed to the fatal signal. Made it to block 398274 before the familiar stack:

 * thread #10: tid = 0xb4db03, 0x00007fff5e4072c6 libsystem_kernel.dylib`__pthread_kill + 10, stop reason = signal SIGABRT
  * frame #0: 0x00007fff5e4072c6 libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007fff5e4c2bf1 libsystem_pthread.dylib`pthread_kill + 284
    frame #2: 0x00007fff5e3716a6 libsystem_c.dylib`abort + 127
    frame #3: 0x00000001009b28d2 node`uv_cond_wait + 20
    frame #4: 0x00000001009a44b1 node`worker + 71
    frame #5: 0x00007fff5e4c02eb libsystem_pthread.dylib`_pthread_body + 126
    frame #6: 0x00007fff5e4c3249 libsystem_pthread.dylib`_pthread_start + 66
    frame #7: 0x00007fff5e4bf40d libsystem_pthread.dylib`thread_start + 13

@braydonf
Contributor

Well, I removed port-libuv from that branch, so it's not related. Also, port-libuv was Windows-specific, and LevelDB v1.22 now supports Windows itself, so port-libuv was no longer necessary.

@braydonf
Contributor

braydonf commented Oct 11, 2019

Also, considering the backtrace is landing in node, it could be useful to compile nodejs with debug symbols.

git clone https://github.com/nodejs/node.git
cd node
git checkout v10.15.3  # or the latest, v12.11.1
git log --show-signature
./configure --debug
make -j 8

Note: Compilation may take a while with the V8 dependency (~50 min).

And then run the debugger build:

lldb ./node_g ../bcoin/bin/node

@braydonf
Contributor

braydonf commented Oct 11, 2019

Also for background, I've tried compiling with clang and have not been able to reproduce.

Here were my steps with clang and lldb:

export CC=clang
export CXX=clang++
export LINK=clang
export LINKXX=clang++
npm install --clang=1 --debug
lldb node ./bin/node

Usually my steps are as follows for gcc and gdb:

npm rebuild --debug
gdb --args node ./bin/node

@pinheadmz
Member Author

Update: I've compiled nodejs from source (same version, 10.15.3) in --debug configuration in hopes of catching a better stack trace for this error. I tried syncing both master and chain-width branches over the last week or so and nothing crashed. Performance of nodejs-debug is noticeably slower, making me think whatever race condition is involved here is not going to occur in this environment.

@pinheadmz pinheadmz added the stability / efficiency Denial of service, better resource usage label Nov 18, 2019
@braydonf
Contributor

braydonf commented Jan 7, 2020

What's the latest status of this issue?

@braydonf braydonf added this to the v2.0.0 milestone Jan 7, 2020
@pinheadmz
Member Author

Unfortunately this is hard to test: the conditions that cause the error are unknown, and it often takes days of IBD to encounter, consuming a lot of resources on my (only) machine in the meantime. I'll look back into this in a few days, maybe get a different computer to test with. Meanwhile I have synced bcoin to the tip on AWS and Google Cloud without issue, and my personal full node runs fine on this laptop even if it needs to catch up a few days of blocks.

@pinheadmz
Member Author

pinheadmz commented Jan 22, 2020

Update: I succeeded in syncing with a fresh install without any segfault issues and only one interruption (#933)!

On the same hardware listed above, I synced from genesis to height 614024 in about 72 hours.

The only differences between this week's run and the previous tests that threw segfault errors were that I am now running nodejs v12.13.0 and that there have been ~62 commits to the bcoin master branch. In all cases I was using braydonf/bdb@f2a9720.

@braydonf
Contributor

So you were running the upgraded LevelDB in addition to the segfault fixes in bcoin-org/bdb#8 and bcoin-org/bdb#7.

@pinheadmz
Member Author

pinheadmz commented Jan 22, 2020

@braydonf yes, or in other words bcoin-org/bdb#10, which I think includes all of that. But I was also running it in the failed test runs back in October: #867 (comment)

@braydonf
Contributor

Okay, I'm going to close this then.
