Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate flaky parallel/test-fs-read-stream-concurrent-reads #22339

Closed
Trott opened this issue Aug 15, 2018 · 7 comments · Fixed by #25415
Closed

Investigate flaky parallel/test-fs-read-stream-concurrent-reads #22339

Trott opened this issue Aug 15, 2018 · 7 comments · Fixed by #25415
Labels
aix Issues and PRs related to the AIX platform. flaky-test Issues and PRs related to the tests with unstable failures on the CI. fs Issues and PRs related to the fs subsystem / file system.

Comments

@Trott
Copy link
Member

Trott commented Aug 15, 2018

  • Version: 11.0.0-pre (master)
  • Platform: aix61-ppc64
  • Subsystem: test fs

https://ci.nodejs.org/job/node-test-commit-aix/16983/nodes=aix61-ppc64/console

03:23:13 not ok 649 parallel/test-fs-read-stream-concurrent-reads
03:23:13   ---
03:23:13   duration_ms: 0.861
03:23:13   severity: fail
03:23:13   exitcode: 1
03:23:13   stack: |-
03:23:13     assert.js:338
03:23:13         throw err;
03:23:13         ^
03:23:13     
03:23:13     AssertionError [ERR_ASSERTION]: Retaining 524288 bytes in ABs for 1000 chunks of size 170
03:23:13         at ReadStream.fs.createReadStream.on.on.common.mustCall (/home/iojs/build/workspace/node-test-commit-aix/nodes/aix61-ppc64/test/parallel/test-fs-read-stream-concurrent-reads.js:37:9)
03:23:13         at ReadStream.<anonymous> (/home/iojs/build/workspace/node-test-commit-aix/nodes/aix61-ppc64/test/common/index.js:460:15)
03:23:13         at ReadStream.emit (events.js:187:15)
03:23:13         at endReadableNT (_stream_readable.js:1085:12)
03:23:13         at process._tickCallback (internal/process/next_tick.js:63:19)
03:23:13   ...
@Trott Trott added fs Issues and PRs related to the fs subsystem / file system. aix Issues and PRs related to the AIX platform. flaky-test Issues and PRs related to the tests with unstable failures on the CI. labels Aug 15, 2018
@Trott Trott closed this as completed Oct 18, 2018
@refack
Copy link
Contributor

refack commented Nov 17, 2018

Reopen:

  • Version: master
  • Platform: AIX
  • Subsystem: fs
  • test file - parallel/test-fs-read-stream-concurrent-reads
  • ci job - aix/18926/nodes=aix61-ppc64
  • ci worker - test-osuosl-aix61-ppc64_be-1
  • output:
     assert.js:351
         throw err;
         ^
     
     AssertionError [ERR_ASSERTION]: Retaining 524288 bytes in ABs for 1000 chunks of size 170
         at ReadStream.fs.createReadStream.on.on.common.mustCall (/home/iojs/build/workspace/node-test-commit-aix/nodes/aix61-ppc64/test/parallel/test-fs-read-stream-concurrent-reads.js:37:9)
         at ReadStream.<anonymous> (/home/iojs/build/workspace/node-test-commit-aix/nodes/aix61-ppc64/test/common/index.js:346:15)
         at ReadStream.emit (events.js:187:15)
         at endReadableNT (_stream_readable.js:1098:12)
         at process.internalTickCallback (internal/process/next_tick.js:72:19)  ```

@Trott
Copy link
Member Author

Trott commented Nov 20, 2018

CI stress test: https://ci.nodejs.org/job/node-stress-single-test/2088/

RUN_TESTS: -J --repeat 10 parallel/test-fs-read-stream-concurrent-reads
RUN_TIMES: 100
RUN_LABEL: aix61-ppc64

@Trott
Copy link
Member Author

Trott commented Nov 20, 2018

Stress test above confirms flakiness: 15 failures in 100 runs.

Stress test without parallelism: https://ci.nodejs.org/job/node-stress-single-test/2090/

RUN_TESTS: -j 1 --repeat 10 parallel/test-fs-read-stream-concurrent-reads
RUN_TIMES: 100
RUN_LABEL: aix61-ppc64

@gireeshpunathil
Copy link
Member

easily recreated in AIX.

Looking at the test case especially the key assertion point:

assert(retainedMemory / (N * content.length) <= 3,

and wondering what would be the significance of 3.

I ran the test several iterations with varying iterations counts, buffer lengths with and without the fix (the test was testing) and have these observations:

  • without the fix, the number of buffers (64K) are equal to the number of iterations, deterministically.
  • with the fix, this is significantly reduced
  • with the fix, the number of buffers is no more dependent on the number of iterations, instead the size of chunks, as well as the number of concurrent reads.
  • however, the number of buffers is non-deterministic.
  • the non-determinism is (probably) due to:
    • the variations of read end callback of one sequence w.r.t data read callback of another sequence
  • the variations of data read in one iteration (in 99% cases the whole file content is read in one shot, but due to various environmental circumstances this is split into 2 or 3 or 4 in rare cases)

So probably the number 3 should be relaxed by taking into these fluctuations into account.

/cc @addaleax

@addaleax
Copy link
Member

and wondering what would be the significance of 3.

It means that we allocate at most 3 times as much memory as we need. The fix in #25415, where we increase the number to 8, seems somewhat excessive? I think having that kind of memory overhead in Node.js would be considered a bug…

@gireeshpunathil
Copy link
Member

@addaleax - thanks. If you look at the failure message:

AssertionError [ERR_ASSERTION]: Retaining 524288 bytes in ABs for 1000 chunks of size 170

showed that the actual consumption stepped out of the assumed value of 510000 (170 * 1000 * 3)

the reason I have identified (empirically) is because few file reads took more than one iterations.

as there is no guarantee on which reads can complete atomically (one shot) and which ones in two, I took the worst case of 2, for every reads, and hence arrived at the number 8.

If you think this is excessive, or being over-generous to the extend of loosing the meaning of this test, please suggest a better way, I am happy to modify.

I have confirmed that the current failures (in AIX and freebsd) do not reveal any issues with the original fix - as with and without the fix I see a drastic difference (in terms of few chunks vs. in terms of 1000 chunks) so it is just a matter of fine tuning the expectation, without causing flakes now and then.

@addaleax
Copy link
Member

@gireeshpunathil I think that is a part of the issue, yes – whether a read finishes in one piece does matter, you’re right about that.

I think the reason why this matters is that the test, in its current form, starts a new read for each chunk, not when a file stream ends – so it’s possible that the number of concurrent reads increases over time, depending on what the underlying fs operations do, which makes the test less reliable.

Also, the main issue for the high memory-to-content ratio that we are already seeing seems to be that we only use a handful of ArrayBuffers – 5 or 6, when I run locally (4 initial + 1 or 2 because of the split chunks you are referring to) – of size 65536, which we cannot fill with the actual content (170 * 1000).

So, what I would suggest is:

  • Move the startRead() call from .on('data') to .on('end'), so that we get a constant number of parallel reads, and therefore a constant number of ArrayBuffers.
  • Lightly increase the number of concurrent reads, so that the tests behaviour doesn’t change (e.g. to 6).
  • Increase N so that we can actually fill the ArrayBuffers fully. The formula would be N ~= (pool size) * (concurrent reads) / (file size), e.g. 1500 as a value that is slightly smaller than 65536 * 4 / 170 ~= 1542.

gireeshpunathil added a commit to gireeshpunathil/node that referenced this issue Jan 20, 2019
In parallel/test-fs-read-stream-concurrent-reads.js the number
of data chunks used is being tested when few concurrent reads
are performed. The number of chunks can fluctuate based on the
number of concurrent reads as well as the data that was read in
one shot. Accommodate these variations in the test.

Fixes: nodejs#22339

PR-URL: nodejs#25415
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>
addaleax pushed a commit that referenced this issue Jan 23, 2019
In parallel/test-fs-read-stream-concurrent-reads.js the number
of data chunks used is being tested when few concurrent reads
are performed. The number of chunks can fluctuate based on the
number of concurrent reads as well as the data that was read in
one shot. Accommodate these variations in the test.

Fixes: #22339

PR-URL: #25415
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>
BethGriggs pushed a commit that referenced this issue Apr 29, 2019
In parallel/test-fs-read-stream-concurrent-reads.js the number
of data chunks used is being tested when few concurrent reads
are performed. The number of chunks can fluctuate based on the
number of concurrent reads as well as the data that was read in
one shot. Accommodate these variations in the test.

Fixes: #22339

PR-URL: #25415
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>
BethGriggs pushed a commit that referenced this issue May 10, 2019
In parallel/test-fs-read-stream-concurrent-reads.js the number
of data chunks used is being tested when few concurrent reads
are performed. The number of chunks can fluctuate based on the
number of concurrent reads as well as the data that was read in
one shot. Accommodate these variations in the test.

Fixes: #22339

PR-URL: #25415
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>
MylesBorins pushed a commit that referenced this issue May 16, 2019
In parallel/test-fs-read-stream-concurrent-reads.js the number
of data chunks used is being tested when few concurrent reads
are performed. The number of chunks can fluctuate based on the
number of concurrent reads as well as the data that was read in
one shot. Accommodate these variations in the test.

Fixes: #22339

PR-URL: #25415
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
aix Issues and PRs related to the AIX platform. flaky-test Issues and PRs related to the tests with unstable failures on the CI. fs Issues and PRs related to the fs subsystem / file system.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants