
readFileSync is 40% faster than readFile for large files #41435

Closed
mmomtchev opened this issue Jan 7, 2022 · 8 comments

Labels: feature request · fs · performance · stale

Comments

@mmomtchev
Contributor

What is the problem this feature will solve?

Make readFile as fast as readFileSync.

What is the feature you are proposing to solve the problem?

Read the whole file in one operation to avoid excessive context switching (a userland sketch of the idea follows below).

What alternatives have you considered?

No response
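
A minimal sketch of the idea in userland terms (the fs.promises calls are standard; readWholeFile itself is an illustrative name, not an existing Node.js API):

// Sketch: stat once, then pull the whole file with a single read()
// instead of 512 KB chunks.
const fsp = require('node:fs/promises');

async function readWholeFile(path) {
  const fh = await fsp.open(path, 'r');
  try {
    const { size } = await fh.stat();
    const buf = Buffer.alloc(size);
    await fh.read(buf, 0, size, 0); // one read for the entire file
    return buf;
  } finally {
    await fh.close();
  }
}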

@mmomtchev mmomtchev added the feature request Issues that request new features to be added to Node.js. label Jan 7, 2022
@Mesteery Mesteery added fs Issues and PRs related to the fs subsystem / file system. performance Issues and PRs related to the performance of Node.js. labels Jan 7, 2022
@mmomtchev
Contributor Author

Since the performance loss can appear huge when reading a file in 512 KB chunks, I spent some time analyzing where it comes from.

This is for reading 32 MB files. The larger the file, the larger the loss, asymptotically approaching 40%.

async: 44.707s

 Performance counter stats for 'node test2':

         45 291,93 msec task-clock                #    1,011 CPUs utilized          
           144 976      context-switches          #    0,003 M/sec                  
             1 913      cpu-migrations            #    0,042 K/sec                  
         8 201 447      page-faults               #    0,181 M/sec                  
    91 527 619 025      cycles                    #    2,021 GHz                      (49,63%)
     8 654 142 716      stalled-cycles-frontend   #    9,46% frontend cycles idle     (49,81%)
    57 156 542 570      stalled-cycles-backend    #   62,45% backend cycles idle      (50,46%)
    35 767 220 649      instructions              #    0,39  insn per cycle         
                                                  #    1,60  stalled cycles per insn  (50,42%)
     6 229 128 421      branches                  #  137,533 M/sec                    (50,22%)
       254 991 439      branch-misses             #    4,09% of all branches          (49,55%)

      44,780408214 seconds time elapsed

       4,371191000 seconds user
      43,098050000 seconds sys

sync: 34.222s

 Performance counter stats for 'node test3':

         34 753,79 msec task-clock                #    1,014 CPUs utilized          
            14 762      context-switches          #    0,425 K/sec                  
             1 717      cpu-migrations            #    0,049 K/sec                  
         8 199 240      page-faults               #    0,236 M/sec                  
    86 338 397 933      cycles                    #    2,484 GHz                      (49,96%)
     5 263 748 859      stalled-cycles-frontend   #    6,10% frontend cycles idle     (50,06%)
    58 468 528 969      stalled-cycles-backend    #   67,72% backend cycles idle      (49,90%)
    33 312 199 600      instructions              #    0,39  insn per cycle         
                                                  #    1,76  stalled cycles per insn  (50,06%)
     5 691 816 428      branches                  #  163,775 M/sec                    (49,95%)
       158 764 983      branch-misses             #    2,79% of all branches          (50,12%)

      34,283279957 seconds time elapsed

       0,579624000 seconds user
      33,803549000 seconds sys

  • The only truly significant difference is the number of context switches: a tenfold increase
  • Most of the total time, and most of the additional time, is spent in kernel mode, setting up the reads
  • The user-mode time also increases almost tenfold (setting up buffers and async tasks), but it is not significant

So why so many context switches? Reading a 32 MB file in 512 KB blocks means there will be 64 reads. Where do all those context switches come from?

Here are the context switches of one reading cycle (17 is the file, 16 is the eventfd, i.e. the libuv watcher.fd, 13 is the epoll fd, i.e. the libuv backend_fd):

* Worker thread comes back from read with 512KB
[pid 42759] read(17, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 524288) = 524288

* Worker thread sends its eventfd message
[pid 42759] write(16, "\1\0\0\0\0\0\0\0", 8 <unfinished ...>

* Main thread wakes up in the epoll
[pid 42751] <... epoll_wait resumed>[{EPOLLIN, {u32=16, u64=16}}], 1024, 7903) = 1

* Worker thread continues after write
[pid 42759] <... write resumed>)        = 8

* Main thread starts consuming the eventfd message
[pid 42751] read(16,  <unfinished ...>

* Worker thread goes to sleep on the condition variable
[pid 42759] futex(0x44f6848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>

* Main thread consumes the eventfd message
[pid 42751] <... read resumed>"\1\0\0\0\0\0\0\0", 1024) = 8

* Main thread sends the buffer to JS, which asks for more, so it sets up another async job
      and wakes up a worker through the condition variable
[pid 42751] futex(0x44f684c, FUTEX_WAKE_PRIVATE, 1) = 1

* Worker wakes up in the condition variable (NOTE IT IS NOT THE SAME WORKER)
[pid 42760] <... futex resumed>)        = 0

* Main thread goes to sleep on epoll
[pid 42751] epoll_wait(13,  <unfinished ...>

* The worker consumes the work item and releases the main work mutex
     (which it acquired after returning from the condition variable)
[pid 42760] futex(0x44f67e0, FUTEX_WAKE_PRIVATE, 1) = 0

* This is the cost of UV_METRICS_IDLE_TIME
     (it happens in parallel with the read so it is not as bad, but still
      it is an extra syscall at every event loop iteration)
[pid 42751] <... epoll_wait resumed>[], 1024, 0) = 0

* The new worker starts its read
[pid 42760] read(17,  <unfinished ...>

* Main thread goes to sleep again, this time for real
[pid 42751] epoll_wait(13,  <unfinished ...>

* Worker thread finishes its read
[pid 42760] <... read resumed>"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 524288) = 524288

* The cycle starts all over
[pid 42760] write(16, "\1\0\0\0\0\0\0\0", 8 <unfinished ...>

Two things to note here:

  • Is the cost of UV_METRICS_IDLE_TIME expected?
  • The thread round-robin is actually free of charge because Linux steps in and makes them all run on the same CPU; see the cpu-migrations in the perf output
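
For context, that idle-time accounting presumably exists to back event-loop utilization measurements on the JS side; a minimal example using the standard perf_hooks API:

// Event-loop utilization, which UV_METRICS_IDLE_TIME feeds.
const { performance } = require('node:perf_hooks');

setTimeout(() => {
  const elu = performance.eventLoopUtilization();
  console.log(elu); // { idle, active, utilization }
}, 1000);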

@mmomtchev
Contributor Author

I wonder, wouldn't it be possible to continue reading without waking up the event loop? After all, the event loop is not needed until the final callback.

I have the same problem in gdal-async: I want to be able to schedule a whole lot of work units and run them one after another, preferably all on the same thread, and then report the work done at the end. I can't schedule them all at once, since the worker threads would pick them up; I want to schedule each unit after the previous one finishes.

The problem is that currently one cannot enqueue work from a worker thread; only the event loop can enqueue work.

This is definitely fixable.

In the case of readFile, one would automatically schedule the next read without waking up the event loop until EOF is reached.

This requires providing an alternative interface where one can pass either a finishing callback, which has to run on the main thread, or an intermediary callback, which will run in the worker thread.
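
A runnable userland approximation of that idea, using worker_threads rather than the internal libuv pool (the file name and chunk size are assumptions): every read happens on one worker thread, and the event loop is woken exactly once, at EOF.

// All chunked reads run on a single worker thread; the main thread (and
// thus the event loop) is only signalled once, when everything is read.
// './file.bin' and the 512 KB chunk size are illustrative.
const { Worker, isMainThread, parentPort, workerData } = require('node:worker_threads');
const fs = require('node:fs');

if (isMainThread) {
  const worker = new Worker(__filename, { workerData: './file.bin' });
  worker.once('message', (total) => console.log(`read ${total} bytes`)); // single wake-up
} else {
  const fd = fs.openSync(workerData, 'r');
  const buf = Buffer.alloc(512 * 1024);
  let total = 0;
  let n;
  while ((n = fs.readSync(fd, buf, 0, buf.length, null)) > 0) total += n; // no event-loop round trips
  fs.closeSync(fd);
  parentPort.postMessage(total);
}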

@worstpractice

@mmomtchev Wonderful research! Any way to get my hands on your testing script(s)? Would be interesting to see if I can replicate your findings. 😃

@mmomtchev
Contributor Author

It's nothing but reading a large file over and over in a loop.
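
Presumably something along these lines (a reconstruction; the original test2/test3 scripts were not posted, and the file name and iteration count are guesses):

// Hypothetical reconstruction of test2/test3: read the same 32 MB file
// over and over. FILE and N are guesses, not the original values.
const fs = require('node:fs');

const FILE = './32mb.bin';
const N = 1000;

if (process.argv[2] === 'sync') {
  for (let i = 0; i < N; i++) fs.readFileSync(FILE); // test3: readFileSync
} else {
  let left = N;
  (function next() { // test2: readFile, one read at a time
    fs.readFile(FILE, (err) => {
      if (err) throw err;
      if (--left > 0) next();
    });
  })();
}

Running each variant under perf stat should reproduce counters of the shape shown above.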

@Alf71

Alf71 commented Sep 26, 2022

Hi, I have one or two questions regarding this:
In Python we use a set to read the file and accelerate the process:

mylist = [line.split()[0] for line in open('file.txt', 'r')]
myset = set(mylist)

This works well as long as we have enough memory; 1 GB of text requires around 2.5 GB of RAM.

Now let's consider the same, but using JavaScript.
Does it simply work with

const fs = require('fs');

const data = fs.readFileSync('./file.txt');

?

No matter how big the file is?
I mean, the bigger the file to be read, the slower the script when processing it in a loop.
But can we change this behaviour?
Would this work with an even bigger file? 10 GB? 100 GB?
And what is the relation with RAM?
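
For reference, fs.readFileSync cannot read a file larger than buffer.constants.MAX_LENGTH into a single Buffer, so at 10-100 GB streaming is required regardless of speed. A streaming equivalent of the Python snippet might look like this (standard readline/stream APIs; the file name is taken from the question):

// Build the set line by line without loading the whole file into memory.
const fs = require('node:fs');
const readline = require('node:readline');

async function firstFields(path) {
  const rl = readline.createInterface({ input: fs.createReadStream(path) });
  const seen = new Set();
  for await (const line of rl) {
    const field = line.split(/\s+/)[0];
    if (field) seen.add(field); // first whitespace-separated field per line
  }
  return seen;
}

firstFields('./file.txt').then((s) => console.log(s.size));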

@bnoordhuis
Member

By some coincidence I was thinking about this earlier this week. The problem with fs.readFile() is that it round-trips through libuv's thread pool several times: open file, stat fd, read data, close fd.

For small files that adds considerable overhead. It's way more efficient to add a C++ ReadFile() method that does it all in one go through uv_queue_work(). Libuv's uv_fs_*() functions can be called synchronously and without an event loop.

(If and when libuv grows io_uring support, that might turn into a deoptimization again, but that's still in the future.)
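
In userland terms, the current shape of the problem looks roughly like this (a sketch, not Node's actual internals; each await is a separate task queued on the thread pool):

// Each await below is a separate round trip through the libuv thread pool:
// open, fstat, one read per chunk, close.
const fsp = require('node:fs/promises');

async function readFileManyTrips(path, chunk = 512 * 1024) {
  const fh = await fsp.open(path, 'r');   // trip 1: open
  try {
    const { size } = await fh.stat();     // trip 2: fstat
    const buf = Buffer.alloc(size);
    let pos = 0;
    while (pos < size) {                  // trips 3..N: one per chunk
      const { bytesRead } = await fh.read(buf, pos, Math.min(chunk, size - pos), pos);
      if (bytesRead === 0) break;
      pos += bytesRead;
    }
    return buf;
  } finally {
    await fh.close();                     // final trip: close
  }
}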

@github-actions
Contributor

There has been no activity on this feature request for 5 months and it is unlikely to be implemented. It will be closed 6 months after the last non-automated comment.

For more information on how the project manages feature requests, please consult the feature request management document.

@github-actions github-actions bot added the stale label Sep 24, 2023
@github-actions
Contributor

There has been no activity on this feature request and it is being closed. If you feel closing this issue is not the right thing to do, please leave a comment.

For more information on how the project manages feature requests, please consult the feature request management document.

@github-actions github-actions bot closed this as not planned Oct 25, 2023