Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

async-hooks/test-fseventwrap - Flaky on AIX #13577

Closed
mhdawson opened this issue Jun 9, 2017 · 23 comments
Closed

async-hooks/test-fseventwrap - Flaky on AIX #13577

mhdawson opened this issue Jun 9, 2017 · 23 comments
Assignees
Labels
aix Issues and PRs related to the AIX platform. async_hooks Issues and PRs related to the async hooks subsystem. flaky-test Issues and PRs related to the tests with unstable failures on the CI. fs Issues and PRs related to the fs subsystem / file system. test Issues and PRs related to the tests.

Comments

@mhdawson
Copy link
Member

mhdawson commented Jun 9, 2017

  • Version: Master
  • Platform: AIX
  • Subsystem: fs

Looks like test-fseventwrap fails intermitently on AIX:

https://ci.nodejs.org/job/node-test-commit-aix/nodes=aix61-ppc64/6463/console

not ok 13 async-hooks/test-fseventwrap
  ---
  duration_ms: 0.185
  severity: crashed
  stack: |-
    oh no!
    exit code: CRASHED (Signal: 11)
  ...

I believe I've seen it on at least one other PR unrelated to fswatch/asyncwrap.

@gireeshpunathil can you take a look.

@mhdawson mhdawson added the aix Issues and PRs related to the AIX platform. label Jun 9, 2017
@mhdawson
Copy link
Member Author

mhdawson commented Jun 9, 2017

@nodejs/platform-aix

@mscdex mscdex added async_hooks Issues and PRs related to the async hooks subsystem. fs Issues and PRs related to the fs subsystem / file system. test Issues and PRs related to the tests. labels Jun 9, 2017
@refack
Copy link
Contributor

refack commented Jun 9, 2017

@refack
Copy link
Contributor

refack commented Jun 9, 2017

/cc @nodejs/async_hooks

@gireeshpunathil
Copy link
Member

Recreated (intermittent) the crash in the compiler:

#0  0x000000010050e3d4 in v8::internal::AstValueFactory::GetString(unsigned int, bool, v8::internal::Vector<unsigned char const>) ()
(gdb) where
#0  0x000000010050e3d4 in v8::internal::AstValueFactory::GetString(unsigned int, bool, v8::internal::Vector<unsigned char const>) ()
#1  0x000000010050e9e8 in v8::internal::AstValueFactory::GetOneByteStringInternal(v8::internal::Vector<unsigned char const>) ()
#2  0x000000010050e9e8 in v8::internal::AstValueFactory::GetOneByteStringInternal(v8::internal::Vector<unsigned char const>) ()
#3  0x000000010052a7ec in v8::internal::Scanner::IsDuplicateSymbol(v8::internal::DuplicateFinder*, v8::internal::AstValueFactory*) const ()
#4  0x0000000100529b5c in v8::internal::ParserBase<v8::internal::Parser>::ParseAndClassifyIdentifier(bool*)
    ()
#5  0x000000010052895c in v8::internal::ParserBase<v8::internal::Parser>::ParsePrimaryExpression(bool*, bool*) ()
#6  0x0000000100528040 in v8::internal::ParserBase<v8::internal::Parser>::ParseFormalParameter(v8::internal::ParserFormalParameters*, bool*) ()
#7  0x000000010055fc24 in v8::internal::ParserBase<v8::internal::Parser>::ParseFormalParameterList(v8::internal::ParserFormalParameters*, bool*) ()
#8  0x000000010055f2d8 in v8::internal::Parser::ParseFunction(v8::internal::AstRawString const*, int, v8::internal::FunctionKind, v8::internal::FunctionLiteral::FunctionType, v8::internal::DeclarationScope*, int*, int*, bool*, int*, bool*) ()
#9  0x000000010055e234 in v8::internal::Parser::ParseFunctionLiteral(v8::internal::AstRawString const*, v8::internal::Scanner::Location, v8::internal::FunctionNameValidity, v8::internal::FunctionKind, int, v8::internal::FunctionLiteral::FunctionType, v8::internal::LanguageMode, bool*) ()
#10 0x00000001005271e8 in v8::internal::Parser::DoParseFunction(v8::internal::ParseInfo*, v8::internal::AstRawString const*) ()
#11 0x000000010050c4e8 in v8::internal::Parser::ParseFunction(v8::internal::Isolate*, v8::internal::ParseInfo*) ()
#12 0x000000010050b878 in v8::internal::parsing::ParseFunction(v8::internal::ParseInfo*, v8::internal::Isolate*, bool) ()
#13 0x00000001004bf730 in v8::internal::(anonymous namespace)::GetUnoptimizedCode(v8::internal::CompilationInf---Type <return> to continue, or q <return> to quit---
o*, v8::internal::Compiler::ConcurrencyMode) ()
#14 0x00000001004c1ef4 in v8::internal::(anonymous namespace)::GetLazyCode(v8::internal::Handle<v8::internal::JSFunction>) ()
#15 0x00000001001b5108 in v8::internal::Compiler::Compile(v8::internal::Handle<v8::internal::JSFunction>, v8::internal::Compiler::ClearExceptionFlag) ()
#16 0x0000000100c7f920 in v8::internal::Runtime_CompileLazy(int, v8::internal::Object**, v8::internal::Isolate*) ()
#17 0x07000000001845f4 in ?? ()

@Trott
Copy link
Member

Trott commented Jun 12, 2017

Still a thing...

https://ci.nodejs.org/job/node-test-commit-aix/6522/nodes=aix61-ppc64/console

not ok 13 async-hooks/test-fseventwrap
  ---
  duration_ms: 0.173
  severity: crashed
  stack: |-
    oh no!
    exit code: CRASHED (Signal: 11)

@gireeshpunathil
Copy link
Member

there are three mutually exclusive issues:

  1. the test case test-fseventwrap.js employs directory watch, which is not supported in AIX
  2. AIX fails to build in debug mode: throws linker error:
    ld: 0711-317 ERROR: Undefined symbol: vtable for v8::base::PosixTimezoneCache
  3. Crash in v8::internal::AstValueFactory::GetString

Debugging item 3. at the moment.

@trevnorris
Copy link
Contributor

@gireeshpunathil

  1. the test case test-fseventwrap.js employs directory watch, which is not supported in AIX

Why would only this test fail when there are other tests that check directory watch.

@gireeshpunathil
Copy link
Member

@trevnorris - this test did not fail due to (1) above (yet) - the test progress is masked off by the crash

@gibfahn
Copy link
Member

gibfahn commented Jun 13, 2017

Why would only this test fail when there are other tests that check directory watch.

The other test files that check directory watch should already be skipped for AIX.

@trevnorris
Copy link
Contributor

The other test files that check directory watch should already be skipped for AIX.

Ah yup. I see the isAix check in those other tests.

@gireeshpunathil
Copy link
Member

(dbx) (0x10050e5c0)/20i
0x10050e5c0 (GetString+0x118) eb21ffc8          ld   r25,-56(r1)
0x10050e5c4 (GetString+0x11c) eb41ffd0          ld   r26,-48(r1)
0x10050e5c8 (GetString+0x120) eb61ffd8          ld   r27,-40(r1)
0x10050e5cc (GetString+0x124) eb81ffe0          ld   r28,-32(r1)
0x10050e5d0 (GetString+0x128) 7c0803a6        mtlr   r0
0x10050e5d4 (GetString+0x12c) eba1ffe8          ld   r29,-24(r1)
0x10050e5d8 (GetString+0x130) ebc1fff0          ld   r30,-16(r1)
0x10050e5dc (GetString+0x134) ebe1fff8          ld   r31,-8(r1)
0x10050e5e0 (GetString+0x138) 4e800020         blr
0x10050e5e4 (GetString+0x13c) 60000000         ori   r0,r0,0x0
0x10050e5e8 (GetString+0x140) f8410028         std   r2,0x28(r1)
0x10050e5ec (GetString+0x144) 7f63db78          mr   r3,r27
0x10050e5f0 (GetString+0x148) e9370010          ld   r9,0x10(r23)
0x10050e5f4 (GetString+0x14c) e9490000          ld   r10,0x0(r9)  >>>>> crash
0x10050e5f8 (GetString+0x150) e9690010          ld   r11,0x10(r9)

Failing instruction corresponds to string_table_.LookupOrInsert

dbx) x
  $r0:0x000000010050ec08  $stkp:0x0fffffffffffc330   $toc:0x00000001101d6fc0  
  $r3:0x0fffffffffffc3a0    $r4:0x0000000110acf270    $r5:0x0000000000000001  
  $r6:0x0000000110aa7430    $r7:0x0000000200000000    $r8:0x0000000000000040  
  $r9:0x5b5b5b5b00000000   $r10:0x0000000110ac73b0   $r11:0x0000000000000000  
 $r12:0x0000000022222244   $r13:0x00000001102086a0   $r14:0x0fffffffffffcd08  
 $r15:0x0fffffffffffcd30   $r16:0x0fffffffffffcd28   $r17:0x0fffffffffffcae0  
 $r18:0x0fffffffffffcd28   $r19:0x0000000000000001   $r20:0x0000000000000081  
 $r21:0x0000000110ace9f8   $r22:0x0fffffffffffcd30   $r23:0x0000000110ac4c90  
 $r24:0x0000000200000000   $r25:0x0000000110aa7430   $r26:0x0000000000000001  
 $r27:0x0fffffffffffc3a0   $r28:0x000000009ba358f2   $r29:0x0000000000000008  
 $r30:0x00000000000000c0   $r31:0x0fffffffffffc330  
 $iar:0x000000010050e5f4   $msr:0xa00000000000d032    $cr:0x482222a2  
$link:0x000000010050ec08   $ctr:0x0000000000000000   $xer:0x00000001  

r23 is string_table_ (base::CustomMatcherHashMap), and r9 is supposed to be its vtable from where to look for function LookupOrInsert().

I verified that the object was proper after the construction, looks like later its fields were overlaid by this pattern:

(dbx) ($r23)/8X
0x0000000110ac4c90:  00000001 10ac73b0 00000040 0000002c
0x0000000110ac4ca0:  5b5b5b5b 00000000 00000000 00000000

Pattern 0x5b5b5b5b is typically invalid block eye-catcher.
I ran DEBUGMALLOC on allocations and frees, but it did not catch anything in this area.
So it looks like an internal memory corruption through a bad pointer.
Debugging further.

@krydos
Copy link
Contributor

krydos commented Jun 15, 2017

@Trott
Copy link
Member

Trott commented Jun 18, 2017

Again today:

https://ci.nodejs.org/job/node-test-commit-aix/6663/nodes=aix61-ppc64/console

not ok 13 async-hooks/test-fseventwrap
  ---
  duration_ms: 0.183
  severity: crashed
  stack: |-
    oh no!
    exit code: CRASHED (Signal: 11)

Trott added a commit to Trott/io.js that referenced this issue Jun 18, 2017
Watching directories has limited support on AIX. This is documented.
Watch a file in test/async-hooks/test-fseventwrap.js to accommodate AIX.

Refs: nodejs#13577 (comment)
jasnell pushed a commit that referenced this issue Jun 20, 2017
Watching directories has limited support on AIX. This is documented.
Watch a file in test/async-hooks/test-fseventwrap.js to accommodate AIX.

PR-URL: #13766
Ref: #13577 (comment)
Reviewed-By: Gibson Fahnestock <gibfahn@gmail.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Yuta Hiroto <hello@about-hiroppy.com>
Reviewed-By: Michael Dawson <michael_dawson@ca.ibm.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
addaleax pushed a commit that referenced this issue Jun 20, 2017
Watching directories has limited support on AIX. This is documented.
Watch a file in test/async-hooks/test-fseventwrap.js to accommodate AIX.

PR-URL: #13766
Ref: #13577 (comment)
Reviewed-By: Gibson Fahnestock <gibfahn@gmail.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Yuta Hiroto <hello@about-hiroppy.com>
Reviewed-By: Michael Dawson <michael_dawson@ca.ibm.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
addaleax pushed a commit that referenced this issue Jun 21, 2017
Watching directories has limited support on AIX. This is documented.
Watch a file in test/async-hooks/test-fseventwrap.js to accommodate AIX.

PR-URL: #13766
Ref: #13577 (comment)
Reviewed-By: Gibson Fahnestock <gibfahn@gmail.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Yuta Hiroto <hello@about-hiroppy.com>
Reviewed-By: Michael Dawson <michael_dawson@ca.ibm.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
@gireeshpunathil
Copy link
Member

reduced test case that crashes occasionally in AIX:

const fs = require('fs')
const watcher = fs.watch(__dirname, () => {})
watcher.close()
console.log('done')

The watcher.close() seem to be the one which corrupts the memory (string_table of the parser).

Debugging further.

@gireeshpunathil
Copy link
Member

Even with single-thread, memory corruption seem to be a hard thing to crack. At the end of a number of attempts, DEBUGMALLOC finally reported a bad free on a pointer which it never allocated into, at uv__fs_event_close:

(gdb) where
#0  0x09000000007e7eb4 in pthread_kill () from /usr/lib/libpthreads.a(shr_xpg5_64.o)
#1  0x09000000007e770c in _p_raise () from /usr/lib/libpthreads.a(shr_xpg5_64.o)
#2  0x090000000002bab0 in raise () from /usr/lib/libc.a(shr_64.o)
#3  0x090000000007c9b0 in abort () from /usr/lib/libc.a(shr_64.o)
#4  0x09000000000a4cfc in free_y_debugging () from /usr/lib/libc.a(shr_64.o)
#5  0x090000000009e994 in free_common_debugging () from /usr/lib/libc.a(shr_64.o)
#6  0x00000001000318f0 in uv__free (ptr=0x110aacb90) at ../deps/uv/src/uv-common.c:85
#7  0x000000010003067c in uv__fs_event_close (handle=0x16) at ../deps/uv/src/unix/aix.c:877
#8  0x000000010002aeec in uv_close (handle=0x16, close_cb=0x110aacb90) at ../deps/uv/src/unix/core.c:155
#9  0x0000000101127da4 in node::HandleWrap::Close(v8::FunctionCallbackInfo<v8::Value> const&) ()
#10 0x000000010144b62c in node::(anonymous namespace)::FSEventWrap::Close(v8::FunctionCallbackInfo<v8::Value> const&) ()
#11 0x0000000100244e68 in v8::internal::FunctionCallbackArguments::Call(void (*)(v8::FunctionCallbackInfo<v8::Value> const&)) ()
#12 0x000000010023d068 in v8::internal::MaybeHandle<v8::internal::Object> v8::internal::(anonymous namespace)::HandleApiCallHelper<false>(v8::internal::Isolate*, v8::internal::Handle<v8::internal::HeapObject>, v8::internal::Handle<v8::internal::HeapObject>, v8::internal::Handle<v8::internal::FunctionTemplateInfo>, v8::internal::Handle<v8::internal::Object>, v8::internal::BuiltinArguments) ()
#13 0x000000010079afb8 in v8::internal::Builtin_HandleApiCall(int, v8::internal::Object**, v8::internal::Isolate*) ()
#14 0x07000000001845f4 in ?? ()

@gireeshpunathil
Copy link
Member

In uv__path_is_a_directory the variable handle->dir_filename is freed unconditionally while there exists a control flow wherein this is uninitialized - if the watch event is never fired.

@bnoordhuis
Copy link
Member

@gireeshpunathil What does f 10, p *wrap print?

@gireeshpunathil
Copy link
Member

This one resolves the issue consistently. I ran it a 3K times, and saw no failures.

diff --git a/deps/uv/src/unix/aix.c b/deps/uv/src/unix/aix.c
index 388c9cc..6314096 100644
--- a/deps/uv/src/unix/aix.c
+++ b/deps/uv/src/unix/aix.c
@@ -855,6 +855,7 @@ int uv_fs_event_start(uv_fs_event_t* handle,
   uv__io_init(&handle->event_watcher, uv__ahafs_event, fd);
   handle->path = uv__strdup(filename);
   handle->cb = cb;
+  handle->dir_filename = NULL;
 
   uv__io_start(handle->loop, &handle->event_watcher, POLLIN);
 
@@ -874,8 +875,10 @@ int uv_fs_event_stop(uv_fs_event_t* handle) {
   uv__handle_stop(handle);
 
   if (uv__path_is_a_directory(handle->path) == 0) {
-    uv__free(handle->dir_filename);
-    handle->dir_filename = NULL;
+    if (handle->dir_filename != NULL) {
+      uv__free(handle->dir_filename);
+      handle->dir_filename = NULL;
+    }
   }
 
   uv__free(handle->path);

Leaving it for a while for some review and then come up with a PR.

@gireeshpunathil gireeshpunathil self-assigned this Jul 4, 2017
@gireeshpunathil
Copy link
Member

@bnoordhuis - I overwrote the core file. (will get it once again.)

@bnoordhuis
Copy link
Member

@gireeshpunathil uv__free(ptr) is a no-op when ptr == NULL, you shouldn't need a guard. Looks like it's just a case of improper initialization (and a rather obvious one, in retrospect, but they always are.)

@gireeshpunathil
Copy link
Member

@bnoordhuis - Looks like the values are optimized out in the release build. Does the raw offsets reveal / confirm anything?

(gdb) p * wrap
No symbol "wrap" in current context.
(gdb) p *(FSEventWrap *) 0xfffffffffffdcf8
No symbol "FSEventWrap" in current context.
(gdb) p *(node::FSEventWrap *) 0xfffffffffffdcf8
No symbol "node" in current context.
(gdb) x/10x 0xfffffffffffdcf8
0xfffffffffffdcf8:      0x00000004      0x00000000      0x07000000      0x208d3d01
0xfffffffffffdd08:      0x00000000      0x00000016      0x07000000      0x002bbf00
0xfffffffffffdd18:      0x0fffffff      0xffffdd68
(gdb) 

Thanks for the info on the uv__free. So that means the only change is the initializer for the uninitialized field.

gireeshpunathil added a commit to gireeshpunathil/node that referenced this issue Jul 4, 2017
In AIX, fs watch close call was corrupting memory in the compiler.
The handle->dir_filename field can be un-initialized, if the watch
is initiated but not event got fired. But the uv_fs_event_stop was
freeing this pointer as if it was malloc'ed, leading to the crash.

Properly initialize handle-dir_filename to avoid a garbage pointer.
Fixes: nodejs#13577
@refack refack added the flaky-test Issues and PRs related to the tests with unstable failures on the CI. label Jul 4, 2017
@gireeshpunathil
Copy link
Member

item no. 3 is being addressed through libuv PR 1400
Next - item no. 2: AIX fails to build in debug mode: throws linker error:
ld: 0711-317 ERROR: Undefined symbol: vtable for v8::base::PosixTimezoneCache
/cc @jbajwa

@gireeshpunathil
Copy link
Member

-O3 build fine, while -O0 throws up the above error.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
aix Issues and PRs related to the AIX platform. async_hooks Issues and PRs related to the async hooks subsystem. flaky-test Issues and PRs related to the tests with unstable failures on the CI. fs Issues and PRs related to the fs subsystem / file system. test Issues and PRs related to the tests.
Projects
None yet
9 participants