fs: partition readFile to avoid threadpool exhaustion #17054

davisjam · 2017-11-15T18:36:06Z

Problem
Node implements fs.readFile as a call to stat, followed by a C++ -> libuv request
to read the entire file based on the size reported by stat.

Why is this bad?
The effect is to place on the libuv threadpool a potentially-large read request,
occupying the libuv thread until it completes.
While readFile certainly requires buffering the entire file contents,
it can partition the read into smaller buffers (as is done on other read paths)
along the way to avoid threadpool squatting.

If the file is relatively large or stored on a slow medium,
reading the entire file in one shot seems particularly harmful,
and presents a possible DoS vector.

Downsides to partitioning?

Correctness: I don't think partitioning the read like this raises any additional risk of read-write races on the FS. If the application is concurrently readFile'ing and modifying the file, it will already see funny behavior. Though libuv uses preadv where available, this doesn't guarantee read atomicity in the presence of concurrent writes.
Performance implications:
a. Downside: Partitioning means that a single large readFile will be broken into many "out and back" requests to libuv, introducing overhead.
b. Upside: In between each "out and back", other work pending on the threadpool can take a turn. In short, although partitioning will slow down a large request, it will lead to better throughput if the threadpool is handling more than one type of request.

Related
It might be that writeFile has similar behavior. The writeFile path is a bit more complex and I didn't investigate carefully.

Fix approach
Simple -- instead of reading in one shot, partition the read length using kReadFileBufferLength.

Test
I introduced a new test to ensure that fs.readFile works for files smaller and larger than kReadFileBufferLength. It works.

Performance:

Machine details:
$ uname -a
Linux jamie-Lenovo-K450e 4.8.0-56-generic My contribution to this logo throw in. #61~16.04.1-Ubuntu SMP Wed Jun 14 11:58:22 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Excerpts from lscpu:
Architecture: x86_64
CPU(s): 8
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
Model name: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
CPU MHz: 1499.194
benchmark/fs/readfile.js

Summary
Benchmarks using benchmark/fs/readfile.js are unfavorable. I ran three iterations with my change and three with an unmodified version. Performance within a version was similar across the three iterations, so I report only the third iteration for each.

comparable performance on the 1KB file
significant performance degradation on the 16MB file (4-5x decrease)

With partitioned read:

$ for i in `seq 1 3`; do /tmp/node-part/node benchmark/fs/readfile.js; done
...
fs/readfile.js concurrent=1 len=1024 dur=5: 42,836.45194074361
fs/readfile.js concurrent=10 len=1024 dur=5: 94,170.12611909183
fs/readfile.js concurrent=1 len=16777216 dur=5: 71.79583090225451
fs/readfile.js concurrent=10 len=16777216 dur=5: 163.98033223174818

Without change:

$ for i in `seq 1 3`; do /tmp/node-orig/node benchmark/fs/readfile.js; done
...
fs/readfile.js concurrent=1 len=1024 dur=5: 43,815.347866646596
fs/readfile.js concurrent=10 len=1024 dur=5: 93,783.59180605657
fs/readfile.js concurrent=1 len=16777216 dur=5: 339.77196820103387
fs/readfile.js concurrent=10 len=16777216 dur=5: 592.325183524534

benchmark/fs/readfile-clogging.js

As discussed above, the readfile.js benchmark doesn't tell the whole story. The contention of this PR is that the 16MB reads will clog the threadpool, disadvantaging other work contending for the threadpool. I've introduced a new benchmark to characterize this.

Benchmark summary: I copied readfile.js and added a small asynchronous zlib operation to compete for the threadpool. If a non-partitioned readFile is clogging the threadpool, there will be a relatively small number of zips.

Performance summary:

Small file: No difference whether 1 read or 10
Large file: With 1 read, some effect (1 thread is always reading, but 3 threads remain for zip). With 10 reads, huge effect (zips get a fair share of the threadpool when partitoined). 61K zips with partitioned, 700 zips without.

Partitioned:

$ for i in `seq 1 3`; do /tmp/node-part/node benchmark/fs/readfile-clogging.js; done
...
bench ended, reads 96464 zips 154582
fs/readfile-clogging.js concurrent=1 len=1024 dur=5: 19,289.8420223229
fs/readfile-clogging.js concurrent=1 len=1024 dur=5: 30,909.421907455828
bench ended, reads 332932 zips 62896
fs/readfile-clogging.js concurrent=10 len=1024 dur=5: 66,572.28049862666
fs/readfile-clogging.js concurrent=10 len=1024 dur=5: 12,575.639939453387
bench ended, reads 149 zips 149574
fs/readfile-clogging.js concurrent=1 len=16777216 dur=5: 29.793230608569676
fs/readfile-clogging.js concurrent=1 len=16777216 dur=5: 29,905.935378334147
bench ended, reads 623 zips 61745
fs/readfile-clogging.js concurrent=10 len=16777216 dur=5: 124.57446300744513
fs/readfile-clogging.js concurrent=10 len=16777216 dur=5: 12,345.553950958118

Non-partitioned:

$ for i in `seq 1 3`; do /tmp/node-orig/node benchmark/fs/readfile-clogging.js; done
...
bench ended, reads 92559 zips 153226
fs/readfile-clogging.js concurrent=1 len=1024 dur=5: 18,510.65052192176
fs/readfile-clogging.js concurrent=1 len=1024 dur=5: 30,641.12621937156
bench ended, reads 332066 zips 62739
fs/readfile-clogging.js concurrent=10 len=1024 dur=5: 66,396.6979771542
fs/readfile-clogging.js concurrent=10 len=1024 dur=5: 12,543.801322137173
bench ended, reads 1554 zips 98886
fs/readfile-clogging.js concurrent=1 len=16777216 dur=5: 310.708121371412
fs/readfile-clogging.js concurrent=1 len=16777216 dur=5: 19,769.932924561737
bench ended, reads 2759 zips 703
fs/readfile-clogging.js concurrent=10 len=16777216 dur=5: 550.9968714783075
fs/readfile-clogging.js concurrent=10 len=16777216 dur=5: 140.38479443398438

Issue:
This commit addresses #17047.

Checklist

make -j4 test (UNIX), or vcbuild test (Windows) passes
tests and/or benchmarks are included
commit message follows commit guidelines

Affected core subsystem(s)

fs

davisjam · 2017-11-15T18:58:10Z

Working on linter errors.

benjamingr · 2017-11-16T12:49:28Z

Thanks for following up, pinging @nodejs/fs for review.

bnoordhuis · 2017-11-16T13:05:27Z

I can see how this is a concern in a theoretical sense but I don't remember any bug reports where it was an actual issue. Seems like premature (de)optimization.

davisjam · 2017-11-16T14:20:35Z

@bnoordhuis I wouldn't call this a de-optimization. It optimizes the throughput of the threadpool in its entirety by increasing the number of requests that a readFile makes. It's an optimization for throughput, at the cost of the latency of large readFiles.

I think this is in the spirit of Node.js: handle many client requests simultaneously on a small number of threads, and don't do too much work in one shot on any of the threads. This is already the approach taken by a readStream.

The benchmark/fs/readfile-clogging.js demonstrates this:
Reading a 16MB file:

1 thread: partitioning yields 149K zips vs. 98K zips currently
10 threads: partitioning yields 60K zips vs. 700 zips currently

FWIW The latency cost can be largely mitigated with the use of a more reasonably-sized buffer. The 1-thread numbers for readfile.js improve to a 30% degradation and the 10-thread numbers are comparable to the non-partitioned performance.

benchmark/fs/readfile.js

Here's the (rounded) readFile throughout for various read lengths on the 16MB file:

With an 8KB buffer:

fs/readfile.js concurrent=1 len=16777216 dur=5: 71
fs/readfile.js concurrent=10 len=16777216 dur=5: 163

With a 64KB buffer:

fs/readfile.js concurrent=1 len=16777216 dur=5: 222
fs/readfile.js concurrent=10 len=16777216 dur=5: 534

With the full 16MB file in one shot:

fs/readfile.js concurrent=1 len=16777216 dur=5: 339
fs/readfile.js concurrent=10 len=16777216 dur=5: 592

benchmark/fs/readfile-clogging.js

Here's the (rounded) zip throughout for various read sizes:

With an 8KB buffer:

fs/readfile-clogging.js concurrent=1 len=16777216 dur=5: 29,905
fs/readfile-clogging.js concurrent=10 len=16777216 dur=5: 12,345

With a 64KB buffer:

fs/readfile-clogging.js concurrent=1 len=16777216 dur=5: 33,201
fs/readfile-clogging.js concurrent=10 len=16777216 dur=5: 6,995

With the full 16MB file in one shot:

fs/readfile-clogging.js concurrent=1 len=16777216 dur=5: 19,769
fs/readfile-clogging.js concurrent=10 len=16777216 dur=5: 140

Conclusion
If we use a 64KB buffer size for readFile, there will be a 10% increase in readFile latency but a 50x increase in the ability of small concurrent threadpool operations to get a turn.

Admittedly the zip operation I'm using in readfile-clogging.js is tiny, so this is a particularly favorable comparison. I'm happy to try it with a more "reasonable" competing operation if anyone would like to suggest one.

mscdex · 2017-11-16T14:39:44Z

benchmark/fs/readfile-clogging.js

+    console.log(`bench ended, reads ${reads} zips ${zips}`);
+    bench_ended = true;
+    bench.end(reads);
+    bench.end(zips);


Calling this twice does not make sense and will break compare.js which is expecting one bench.end() per benchmark.

@mscdex Thanks for pointing this out. How ought I report throughput for two separate variables like this?

You can't. Perhaps just combine both values for total fulfilled requests per second?

OK. I'll leave in the console.log then so the distinction between request type is clear.

mscdex · 2017-11-16T14:43:56Z

benchmark/fs/readfile-clogging.js

+    bench.end(reads);
+    bench.end(zips);
+    try { fs.unlinkSync(filename); } catch (e) {}
+    process.exit(0);


This isn't really safe since process.send() used by bench.end() is not synchronous. It's better to just return early in afterRead() and afterZip() when bench_ended === true and let the process exit naturally.

OK, this process.exit(0) is just a copy/paste from benchmark/fs/readfile.js. I'll fix this in both places.

mscdex · 2017-11-16T14:46:45Z

benchmark/fs/readfile-clogging.js

+
+  var reads = 0;
+  var zips = 0;
+  var bench_ended = false;


Minor nit, but lower camelCase is typically used in JS portions of node core and underscores are typically used in C++ portions.

bnoordhuis · 2017-11-16T14:47:55Z

I wouldn't call this a de-optimization. It optimizes the throughput of the threadpool in its entirety by increasing the number of requests that a readFile makes. It's an optimization for throughput, at the cost of the latency of large readFiles.

I understand that. My point is that no one complained so far - people aren't filing bug reports. To me that suggests it's mostly a theoretical issue.

Meanwhile, the proposed changes will almost certainly regress some workloads and people are bound to take notice of that.

davisjam · 2017-11-16T14:49:08Z

Meanwhile, the proposed changes will almost certainly regress some workloads and people are bound to take notice of that.

True. But other workloads should be accelerated -- it's the readfile.js vs. readfile-clogging.js tradeoff.

mscdex · 2017-11-16T14:50:51Z

I think I agree with Ben here. Anyone who wants a file read in chunks can just use fs.createReadStream().

YurySolovyov · 2017-11-16T15:00:45Z

maybe add a separate API instead?
.readFile is simpler to use (over streams) because you don't have to manage assembling the final result.

davisjam · 2017-11-16T15:02:17Z

Anyone who wants a file read in chunks can just use fs.createReadStream().

And if they want to make giant reads, they can use fs.read().

But if they've opted for the simplicity offs.readFile(), I think the framework should Do The Right Thing -- namely, optimize threadpool throughput, not single request latency. Presumably some kind of chunking/partitioning is done by crypto and compression as well?

mscdex · 2017-11-16T15:10:03Z

On an unrelated note, please do not @ mention me in commit messages.

davisjam · 2017-11-16T15:12:12Z

Fixed, sorry.

refack · 2017-11-16T15:47:45Z

Hello @davisjam and thank you for the contribution 🎩

Sure looks like you did a lot of research and experimentation, and I really appreciate that.

I think the framework should Do The Right Thing -- namely, optimize threadpool throughput, not single request latency

There are several assumptions and rule-of-thumb optimizations around the uv threadpool [addition: based on empirical experience, and feedback]. One of those is that since the pool serves I/O bound operations, a small pool is enough. As such doing multiple interleaved FS operations is an anti-pattern.
As for "doing the right thing", I would go in a different way that has less concurrency, not more. Check that the uv threadpool is not all consumed doing the same operation.

davisjam · 2017-11-16T16:56:33Z

One of those is that since the pool serves I/O bound operations, a small pool is enough. As such doing multiple interleaved FS operations is an anti-pattern. As for "doing the right thing", I would go in a different way that has less concurrency, not more. Check that the uv threadpool is not all consumed doing the same operation.

@refack Perhaps I misunderstand you, but this PR does not increase concurrency. With my patch, an fs.readFile results in more requests to the threadpool, but each such request is submitted when the previous one completes, hand-over-hand. [addition: Of course, a server might fs.readFile on behalf of different clients concurrently, but there will be one task per ongoing fs.readFile in the queue.]

I agree that a small pool is good for certain activities, but reading large files in one shot is not one of them. A small pool suffices so long as each task doesn't take too long, but a long-running task on a small pool monopolizes its assigned thread, degrading the task throughput of the pool. Then indeed (one thread in) "the uv threadpool is all consumed doing the same operation." [addition: The trouble is that on Linux, a thread still performs the I/O-bound task synchronously, since approaches like KAIO have been rejected (see here). So if the task is long-running, the thread blocks for a long time.]

If the threadpool is used solely for "large" tasks, there's no problem -- each task takes a long time anyway, and partitioning them just adds overhead. But if the threadpool is used for a mix of larger and smaller tasks (e.g. serving different sized files to different clients, running compression and file I/O concurrently, etc.), then the larger tasks will harm the throughput of the smaller tasks. In my benchmark/fs/readfile-clogging.js benchmark, the small task throughput improves by 50x if you partition the large reads.

jasnell · 2017-11-16T18:28:13Z

I definitely appreciate the work here, but I think I'm also falling on the -1 side on this. I think a better approach would be to simply increase efforts to warn dev's away from using fs.readFile() for large files that cannot be read in a single uv roundtrip. Anything beyond that should be deferred to using either fs.read()or fs.createReadStream(). Using fs.readFile() to read anything larger than that is an anti-pattern that I really do not think we should be encouraging.

davisjam · 2017-11-16T19:05:11Z

@jasnell Thanks for your input!

I agree that fs.readFile is not a good idea in the server context, though of course it's fine for scripting purposes. I'm planning to include this discussion as part of this proposed guide, if there's interest from the nodejs.org folks.
However, all the documentation in the world won't stop a new developer from making a mistake. If we agree that "anything larger than [small files] should be deferred to fs.read() or fs.createReadStream()", then surely partitioning fs.readFile() in the style of fs.createReadStream() is an appropriate step. I don't think doing so encourages bad developer behavior -- it's just ensuring that if the developer has made a mistake, they won't pay too much for it. Do The Right Thing and so on.

My 64KB benchmark suggests that scripts that use fs.readFile() shouldn't suffer overmuch from partitioning (they still read 8GB's worth of the same 16MB file in 5 seconds), and that this partitioning stands to benefit some kinds of servers.

davisjam · 2017-11-16T19:43:08Z

warn dev's away from using fs.readFile() for large files that cannot be read in a single uv roundtrip

Right, but the current fs.readFile() behavior is to read any file in one uv roundtrip, regardless of its size. If a dev is reading small files with fs.readFile(), this PR will have no effect on performance. If the dev is reading large files with fs.readFile(), (1) they shouldn't be, but (2) we can still help them out.

davisjam · 2017-11-17T03:13:16Z

Found a few minutes for some deeper benchmarking...I've collected measurements across a range of partition sizes to give a better sense of the tradeoffs between degrading readFile performance and improving threadpool throughput.

I looked at the following partition sizes in KB: 4 8 16 32 64 128 256 512 1024 4096 16384. At each stage I doubled the partition size until I reached 1024 KB (1MB), at which point I quadrupled to 4MB and again to 16MB. The final partition size, 16384KB (16MB), is the size of the file being read, so this last size is the baseline, equivalent to the current behavior of Node.

The numbers I'm reporting represent a single run of the benchmarks on the machine described above, which is otherwise idle. Since it's just one run for each partition size, these numbers are just an estimate.

Excerpting the "1 and 10 concurrent readFile's on a 16MB file" performance from benchmark/fs/readfile.js:

4 KB
fs/readfile.js concurrent=1 len=16777216 dur=5: 39
fs/readfile.js concurrent=10 len=16777216 dur=5: 88
8 KB
fs/readfile.js concurrent=1 len=16777216 dur=5: 67
fs/readfile.js concurrent=10 len=16777216 dur=5: 162
16 KB
fs/readfile.js concurrent=1 len=16777216 dur=5: 111
fs/readfile.js concurrent=10 len=16777216 dur=5: 364
32 KB
fs/readfile.js concurrent=1 len=16777216 dur=5: 146
fs/readfile.js concurrent=10 len=16777216 dur=5: 514
64 KB
fs/readfile.js concurrent=1 len=16777216 dur=5: 214
fs/readfile.js concurrent=10 len=16777216 dur=5: 575
128 KB
fs/readfile.js concurrent=1 len=16777216 dur=5: 212
fs/readfile.js concurrent=10 len=16777216 dur=5: 566
256 KB
fs/readfile.js concurrent=1 len=16777216 dur=5: 292
fs/readfile.js concurrent=10 len=16777216 dur=5: 538
512 KB
fs/readfile.js concurrent=1 len=16777216 dur=5: 205
fs/readfile.js concurrent=10 len=16777216 dur=5: 523
1024 KB
fs/readfile.js concurrent=1 len=16777216 dur=5: 246
fs/readfile.js concurrent=10 len=16777216 dur=5: 492
4096 KB
fs/readfile.js concurrent=1 len=16777216 dur=5: 274
fs/readfile.js concurrent=10 len=16777216 dur=5: 477
16384 KB
fs/readfile.js concurrent=1 len=16777216 dur=5: 356
fs/readfile.js concurrent=10 len=16777216 dur=5: 577

Excerpting the "10 concurrent readFile's on a 16MB file" performance from benchmark/fs/readfile-clogging.js:

for size in 4 8 16 32 64 128 256 512 1024 4096 16384; do echo "$size KB"; echo; PARTITION_SIZE_KB=$size /tmp/node-part-cfg/node benchmark/fs/readfile-clogging.js | tee /tmp/o_clogging_$size; echo; done
4 KB
bench ended: reads 300, zips 62511, total ops 62811
8 KB
bench ended: reads 600, zips 61582, total ops 62182
16 KB
bench ended: reads 1139, zips 59401, total ops 60540
32 KB
bench ended: reads 1779, zips 47918, total ops 49697
64 KB
bench ended: reads 2258, zips 35383, total ops 37641
128 KB
bench ended: reads 2305, zips 19380, total ops 21685
256 KB
bench ended: reads 2743, zips 11553, total ops 14296
512 KB
bench ended: reads 2409, zips 5553, total ops 7962
1024 KB
bench ended: reads 2268, zips 2889, total ops 5157
4096 KB
bench ended: reads 2298, zips 1053, total ops 3351
16384 KB
bench ended: reads 2767, zips 693, total ops 3460

Summarizing these results:

readfile.js:
a. The relationship between partition size and read rate in the 1-reader case is unclear. The best performance is at 16MB (356 reads/second at one read per readFile), but other pretty good points were 256KB (292 reads/second) and 4096KB (274 reads/second).
b. The relationship between partition size and read rate in the 10-reader case is also unclear. The high point was again 16MB (577 reads/second), but 64KB (575 reads/sec) and 128KB (566 reads/sec) were also contenders.
readfile-clogging.js: Unsurprisingly, the number of zips is generally inversely proportional (roughly linearly) with the partition size, or linearly proportional to the number of partitions. The more partitions, the more turns the zip job gets.

Recommendation:

The 1-reader case seems pretty unrealistic, so let's focus on the 10-reader case. It looks to me like if we go with a 64KB partition, for pure readFile we face somewhere between a 10% drop in throughput (reported earlier) and a negligible drop in throughput (in this data). For this we get a 50x improvement in throughput for contending threadpool jobs. For better readFile performance, a larger blocksize could be used while still improving overall threadpool throughput.

Since the patch is a one-liner, nothing fancy, this seems like a pretty good trade to me.

As has been discussed, best practice is certainly not to use fs.readFile for serving files. But for users who are doing so, I think this patch could give them nice performance improvements for free.

Docs:
I agree with @jasnell that urging developers to avoid fs.readFile in server contexts is a good idea. I'm also happy to pursue a docs change and/or a longer guide in this direction.

davisjam · 2017-11-21T23:55:41Z

But if they've opted for the simplicity of fs.readFile(), I think the framework should Do The Right Thing -- namely, optimize threadpool throughput, not single request latency. Presumably some kind of chunking/partitioning is done by crypto and compression as well?

Actually, I just checked the crypto module. It does not chunk/partition large requests.

The following example will not print "Short buf finished" until there are no more long requests in the threadpool queue and one of the workers picks up the short request.

var nBytes = 10 * 1024*1024; /* 50 MB */
var nLongRequets = 20;
const crypto = require('crypto');

for (var i = 0; i < nLongRequets; i++) {
	crypto.randomBytes(nBytes, (buf) => {
		console.log('Long buf finished');
	});
}

crypto.randomBytes(1, (buf) => {
	console.log('Short buf finished');
});

console.log('begin');

Thoughts on a similar PR to partition large crypto requests like this, or a doc-change PR like #17154 with a warning?

For the FS there are alternatives to fs.readFile if you are making a large request. I don't see comparable framework alternatives for large crypto requests. Thoughts on a new API for this?

Trott · 2017-11-22T00:51:06Z

@nodejs/crypto (see #17054 (comment))

bnoordhuis · 2017-11-22T11:13:27Z

Same as #17054 (comment) with the addendum that crypto.randomBytes() doesn't do I/O, it's purely CPU-bound. Users can partition requests themselves if they want.

As well, there is hardly ever a reason to request more than a few hundred bytes of randomness at a time. I don't think large requests are a practical concern.

davisjam · 2018-01-28T02:20:42Z

@addaleax Yes.

BridgeAR · 2018-02-01T10:19:18Z

Landed in 67a4ce1

Problem: Node implements fs.readFile as: - a call to stat, then - a C++ -> libuv request to read the entire file using the stat size Why is this bad? The effect is to place on the libuv threadpool a potentially-large read request, occupying the libuv thread until it completes. While readFile certainly requires buffering the entire file contents, it can partition the read into smaller buffers (as is done on other read paths) along the way to avoid threadpool exhaustion. If the file is relatively large or stored on a slow medium, reading the entire file in one shot seems particularly harmful, and presents a possible DoS vector. Solution: Partition the read into multiple smaller requests. Considerations: 1. Correctness I don't think partitioning the read like this raises any additional risk of read-write races on the FS. If the application is concurrently readFile'ing and modifying the file, it will already see funny behavior. Though libuv uses preadv where available, this doesn't guarantee read atomicity in the presence of concurrent writes. 2. Performance Downside: Partitioning means that a single large readFile will require into many "out and back" requests to libuv, introducing overhead. Upside: In between each "out and back", other work pending on the threadpool can take a turn. In short, although partitioning will slow down a large request, it will lead to better throughput if the threadpool is handling more than one type of request. Fixes: nodejs#17047 PR-URL: nodejs#17054 Reviewed-By: Benjamin Gruenbaum <benjamingr@gmail.com> Reviewed-By: Tiancheng "Timothy" Gu <timothygu99@gmail.com> Reviewed-By: Gireesh Punathil <gpunathi@in.ibm.com> Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Matteo Collina <matteo.collina@gmail.com> Reviewed-By: Sakthipriyan Vairamani <thechargingvolcano@gmail.com> Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>

BridgeAR · 2018-02-01T12:02:19Z

This broke our CI. I did not realize it right away and landed a couple other commits afterwards, otherwise I would have reverted this. A change landed a few hours before this one that changed the tmpDir behavior and broke the test from this PR.

I am submitting a fix.

PR-URL: #17610 Refs: #17054 (comment) Reviewed-By: Anna Henningsen <anna@addaleax.net> Reviewed-By: Evan Lucas <evanlucas@me.com> Reviewed-By: Colin Ihrig <cjihrig@gmail.com> Reviewed-By: Jeremiah Senkpiel <fishrock123@rocketmail.com> Reviewed-By: Jon Moss <me@jonathanmoss.me> Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>

Problem: Node implements fs.readFile as: - a call to stat, then - a C++ -> libuv request to read the entire file using the stat size Why is this bad? The effect is to place on the libuv threadpool a potentially-large read request, occupying the libuv thread until it completes. While readFile certainly requires buffering the entire file contents, it can partition the read into smaller buffers (as is done on other read paths) along the way to avoid threadpool exhaustion. If the file is relatively large or stored on a slow medium, reading the entire file in one shot seems particularly harmful, and presents a possible DoS vector. Solution: Partition the read into multiple smaller requests. Considerations: 1. Correctness I don't think partitioning the read like this raises any additional risk of read-write races on the FS. If the application is concurrently readFile'ing and modifying the file, it will already see funny behavior. Though libuv uses preadv where available, this doesn't guarantee read atomicity in the presence of concurrent writes. 2. Performance Downside: Partitioning means that a single large readFile will require into many "out and back" requests to libuv, introducing overhead. Upside: In between each "out and back", other work pending on the threadpool can take a turn. In short, although partitioning will slow down a large request, it will lead to better throughput if the threadpool is handling more than one type of request. Fixes: nodejs#17047 PR-URL: nodejs#17054 Reviewed-By: Benjamin Gruenbaum <benjamingr@gmail.com> Reviewed-By: Tiancheng "Timothy" Gu <timothygu99@gmail.com> Reviewed-By: Gireesh Punathil <gpunathi@in.ibm.com> Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Matteo Collina <matteo.collina@gmail.com> Reviewed-By: Sakthipriyan Vairamani <thechargingvolcano@gmail.com> Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>

zbjornson · 2019-01-27T03:22:18Z

The comments above noted that this doesn't appear to cause a significant performance regression, but we're seeing a 7.6-13.5x drop in read throughput between 8.x and 10.x in both the readfile benchmark and our real-world benchmarks that heavily exercise fs.readFile. Based on my troubleshooting, I think it's from this change, but it's possible some other change is responsible.

The readfile benchmark (Ubuntu 16):

Test	v8.15.0	v10.15.0	8 ÷ 10
concurrent=1 len=1024	6,661	7,066	1.06x
concurrent=10 len=1024	23,100	21,079	0.91x
concurrent=1 len=16777216	156.6	11.6	13.5x
concurrent=10 len=16777216	584	76.6	7.6x

From what I can extract from the comments in this PR, either no degradation or a 3.6-4.8x degradation was expected for the len=16M cases.

As for why I think it's because of this change, the benchmark below compares fs.readFile against a simple version of how fs.readFile used to work (one-shot read), measuring time to read the same 16 MB file 50 times.

// npm i async

const fs = require("fs");
const async = require("async");

function chunked(filename, cb) {
	fs.readFile(filename, cb);
}

function oneshot(filename, cb) {
	// shoddy implementation -- leaks fd in case of errors
	fs.open(filename, "r", 0o666, (err, fd) => {
		if (err) return cb(err);
		fs.fstat(fd, (err, stats) => {
			if (err) return cb(err);
			const data = Buffer.allocUnsafe(stats.size);
			fs.read(fd, data, 0, stats.size, 0, (err, bytesRead) => {
				if (err) return cb(err);
				fs.close(fd, err => {
					cb(err, data);
				});
			});
		});
	});
}

fs.writeFileSync("./test.dat", Buffer.alloc(16e6, 'x'));

function bm(method, name, cb) {
	const start = Date.now();
	async.timesSeries(50, (n, next) => {
		method("./test.dat", next);
	}, err => {
		if (err) return cb(err);
		const diff = Date.now() - start;
		console.log(name, diff);
		cb();
	});
}

async.series([
	cb => bm(chunked, "fs.readFile()", cb),
	cb => bm(oneshot, "oneshot", cb)
])

Node.js	OS	`fs.readFile()` (ms)	one-shot (ms)
v10.15.0	Ubuntu 16	7320	370
v8.15.0	Ubuntu 16	693	378
v10.15.0	Win64	2972	493

We've switched to fs.fstat() and then fs.read() (similar to above) as a work-around, but I wouldn't be surprised if this has also negatively impacted other apps/tools. As far as the original justification: web servers aside, other sorts of apps like build tools and compilers (for which DoS attacks are irrelevant) often need to read an entire file as fast as possible, and furthermore aren't typically concerned with concurrency.

Is anyone else able to verify that this degradation exists and/or was expected?

mcollina · 2019-01-27T09:00:37Z

I’ve done some empirical tests, and I saw some degradation. I’d recommend you to open up a new issue based on your data as it is likely to get more attention.

davisjam · 2019-01-27T12:06:15Z

@zbjornson

I’d recommend you to open up a new issue based on your data as it is likely to get more attention.

And tag me in it! :-)

Trott · 2019-01-27T19:05:08Z

I’d recommend you to open up a new issue based on your data as it is likely to get more attention.

I used "Reference in new issue" to create #25740 from @zbjornson's comment.

nodejs-github-bot added the fs Issues and PRs related to the fs subsystem / file system. label Nov 15, 2017

davisjam force-pushed the PartitionReadFile branch from 49d3297 to 16195e0 Compare November 15, 2017 18:45

benjamingr changed the title ~~Partition readFile to avoid threadpool exhaustion~~ fs: partition readFile to avoid threadpool exhaustion Nov 16, 2017

mscdex added the performance Issues and PRs related to the performance of Node.js. label Nov 16, 2017

mscdex reviewed Nov 16, 2017

View reviewed changes

davisjam force-pushed the PartitionReadFile branch from 6598119 to b9971e4 Compare November 16, 2017 15:11

davisjam mentioned this pull request Nov 20, 2017

doc: fs.readFile is async but not partitioned #17154

Closed

2 tasks

BridgeAR closed this Feb 1, 2018

BridgeAR mentioned this pull request Feb 1, 2018

test: fix builds #18500

Closed

4 tasks

paranoidjk mentioned this pull request Sep 30, 2018

【译】Don't Block the Event Loop (or the Worker Pool) frontend9/fe9-library#57

Open

Trott mentioned this pull request Jan 27, 2019

fs.readfile() performance degredation between 8.x and 10.x #25740

Closed

zbjornson mentioned this pull request Jan 27, 2019

fs.readFile is slower in v10 #25741

Closed

Linkgoron mentioned this pull request Mar 5, 2021

fs: improve fsPromises writeFile performance #37610

Merged

This comment was marked as off-topic.

Sign in to view

mcollina mentioned this pull request Mar 13, 2024

readFile in promises very slow nodejs/performance#151

Open

fs: partition readFile to avoid threadpool exhaustion #17054

fs: partition readFile to avoid threadpool exhaustion #17054

Conversation

davisjam commented Nov 15, 2017 • edited Loading

Checklist

Affected core subsystem(s)

davisjam commented Nov 15, 2017

benjamingr commented Nov 16, 2017

bnoordhuis commented Nov 16, 2017

davisjam commented Nov 16, 2017 • edited Loading

mscdex Nov 16, 2017

Choose a reason for hiding this comment

davisjam Nov 16, 2017

Choose a reason for hiding this comment

mscdex Nov 16, 2017

Choose a reason for hiding this comment

davisjam Nov 16, 2017

Choose a reason for hiding this comment

mscdex Nov 16, 2017

Choose a reason for hiding this comment

davisjam Nov 16, 2017

Choose a reason for hiding this comment

mscdex Nov 16, 2017

Choose a reason for hiding this comment

bnoordhuis commented Nov 16, 2017

davisjam commented Nov 16, 2017

mscdex commented Nov 16, 2017 • edited Loading

YurySolovyov commented Nov 16, 2017

davisjam commented Nov 16, 2017 • edited Loading

mscdex commented Nov 16, 2017

davisjam commented Nov 16, 2017

refack commented Nov 16, 2017 • edited Loading

davisjam commented Nov 16, 2017 • edited Loading

jasnell commented Nov 16, 2017

davisjam commented Nov 16, 2017

davisjam commented Nov 16, 2017

davisjam commented Nov 17, 2017 • edited Loading

davisjam commented Nov 21, 2017 • edited Loading

Trott commented Nov 22, 2017 • edited Loading

bnoordhuis commented Nov 22, 2017

davisjam commented Jan 28, 2018

BridgeAR commented Feb 1, 2018

BridgeAR commented Feb 1, 2018

zbjornson commented Jan 27, 2019 • edited Loading

mcollina commented Jan 27, 2019

davisjam commented Jan 27, 2019

Trott commented Jan 27, 2019

This comment was marked as off-topic.

davisjam commented Nov 15, 2017 •

edited

Loading

davisjam commented Nov 16, 2017 •

edited

Loading

mscdex commented Nov 16, 2017 •

edited

Loading

davisjam commented Nov 16, 2017 •

edited

Loading

refack commented Nov 16, 2017 •

edited

Loading

davisjam commented Nov 16, 2017 •

edited

Loading

davisjam commented Nov 17, 2017 •

edited

Loading

davisjam commented Nov 21, 2017 •

edited

Loading

Trott commented Nov 22, 2017 •

edited

Loading

zbjornson commented Jan 27, 2019 •

edited

Loading