Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow disk IO on MacOS (potentially Windows too) #35907

Open
IanButterworth opened this issue May 15, 2020 · 26 comments
Open

Slow disk IO on MacOS (potentially Windows too) #35907

IanButterworth opened this issue May 15, 2020 · 26 comments
Labels
io Involving the I/O subsystem: libuv, read, write, etc. performance Must go faster system:mac Affects only macOS system:windows Affects only Windows

Comments

@IanButterworth
Copy link
Member

Benchmarks reports from SystemBenchmark.jl that have been community submitted have shown that disk IO performance is impeded on MacOS and potentially Windows too.

Example 1: A same-hardware test on a 2015 Macbook Pro, booted into Arch Linux and MacOS shows mostly much worse disk IO in MacOS
The tests are in milliseconds here (except peakflops), and the factor is test / res, so higher numbers indicate that the test (MacOS) is running slower than the ref (Arch Linux)

│ Row │ cat         │ testname          │ units   │ ref_res                                   │ test_res                                  │ factor    │
│     │ String      │ String            │ String? │ Any                                       │ Any                                       │ Any       │
├─────┼─────────────┼───────────────────┼─────────┼───────────────────────────────────────────┼───────────────────────────────────────────┼───────────┤
│ 1   │ info        │ SysBenchVer       │ missing │ 0.2.0                                     │ 0.2.0                                     │ Equal     │
│ 2   │ info        │ JuliaVer          │ missing │ 1.4.1                                     │ 1.4.1                                     │ Equal     │
│ 3   │ info        │ OS                │ missing │ Linux (x86_64-pc-linux-gnu)               │ macOS (x86_64-apple-darwin18.7.0)         │ Not Equal │
│ 4   │ info        │ CPU               │ missing │ Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz │ Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz │ Equal     │
│ 5   │ info        │ WORD_SIZE         │ missing │ 64                                        │ 64                                        │ Equal     │
│ 6   │ info        │ LIBM              │ missing │ libopenlibm                               │ libopenlibm                               │ Equal     │
│ 7   │ info        │ LLVM              │ missing │ libLLVM-8.0.1 (ORCJIT, haswell)           │ libLLVM-8.0.1 (ORCJIT, haswell)           │ Equal     │
│ 8   │ info        │ GPU               │ missing │ missing                                   │ missing                                   │ Not Equal │
│ 9   │ cpu         │ FloatMul          │ missing │ 1.3759999999999998e-6                     │ 1.3879999999999999e-6                     │ 1.00872   │
│ 10  │ cpu         │ FusedMulAdd       │ missing │ 1.4e-8                                    │ 3.1e-8                                    │ 2.21429   │
│ 11  │ cpu         │ FloatSin          │ missing │ 5.1799999999999995e-6                     │ 5.191e-6                                  │ 1.00212   │
│ 12  │ cpu         │ VecMulBroad       │ missing │ 3.6742943548387095e-5                     │ 5.503947368421053e-5                      │ 1.49796   │
│ 13  │ cpu         │ CPUMatMul         │ missing │ 0.022055                                  │ 0.0228935                                 │ 1.03802   │
│ 14  │ cpu         │ MatMulBroad       │ missing │ 0.003428                                  │ 0.005100375                               │ 1.48786   │
│ 15  │ cpu         │ 3DMulBroad        │ missing │ 0.0013702                                 │ 0.0015412                                 │ 1.1248    │
│ 16  │ cpu         │ peakflops         │ missing │ 1.530257840890542e11                      │ 1.4811308237589566e11                     │ 0.967896  │
│ 17  │ cpu         │ FFMPEGH264Write   │ missing │ 151.23031                                 │ 161.079387                                │ 1.06513   │
│ 18  │ mem         │ DeepCopy          │ missing │ 0.00016894915254237288                    │ 0.00017180699481865284                    │ 1.01692   │

│ 19  │ diskio      │ DiskWrite1KB      │ missing │ 0.0297705                                 │ 0.11039                                   │ 3.70803   │ <<
│ 20  │ diskio      │ DiskWrite1MB      │ missing │ 0.871518                                  │ 0.3283265                                 │ 0.376729  │ <<
│ 21  │ diskio      │ DiskRead1KB       │ missing │ 0.00951925                                │ 0.0330435                                 │ 3.47123   │ <<
│ 22  │ diskio      │ DiskRead1MB       │ missing │ 0.1272875                                 │ 1.0485435                                 │ 8.2376    │ <<

│ 23  │ loading     │ JuliaLoad         │ missing │ 119.903458                                │ 180.141969                                │ 1.50239   │
│ 24  │ compilation │ compilecache      │ missing │ 320.3946125                               │ 344.544143                                │ 1.07537   │
│ 25  │ compilation │ create_expr_cache │ missing │ 1.053377                                  │ 7.869244                                  │ 7.47049   │

Example 2: A same hardware test on a 2018 Macbook Pro with Parallels-hosted Ubuntu 18.04 VM (Ref), vs. the native MacOS (Test)

On a general note, I was surprised how well the Ubuntu VM did here

32×6 DataFrame
│ Row │ cat         │ testname                  │ units   │ ref_res                                  │ test_res                                 │ factor    │
│     │ String      │ String                    │ String? │ Any                                      │ Any                                      │ Any       │
├─────┼─────────────┼───────────────────────────┼─────────┼──────────────────────────────────────────┼──────────────────────────────────────────┼───────────┤
│ 1   │ info        │ SysBenchVer               │ missing │ 0.3.0                                    │ 0.3.0                                    │ Equal     │
│ 2   │ info        │ JuliaVer                  │ missing │ 1.4.1                                    │ 1.4.1                                    │ Equal     │
│ 3   │ info        │ OS                        │ missing │ Linux (x86_64-pc-linux-gnu)              │ macOS (x86_64-apple-darwin18.7.0)        │ Not Equal │
│ 4   │ info        │ CPU                       │ missing │ Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz │ Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz │ Equal     │
│ 5   │ info        │ WORD_SIZE                 │ missing │ 64                                       │ 64                                       │ Equal     │
│ 6   │ info        │ LIBM                      │ missing │ libopenlibm                              │ libopenlibm                              │ Equal     │
│ 7   │ info        │ LLVM                      │ missing │ libLLVM-8.0.1 (ORCJIT, skylake)          │ libLLVM-8.0.1 (ORCJIT, skylake)          │ Equal     │
│ 8   │ info        │ GPU                       │ missing │ missing                                  │ missing                                  │ Not Equal │
│ 9   │ cpu         │ FloatMul                  │ ms      │ 1.7070000000000001e-6                    │ 1.7190000000000002e-6                    │ 1.00703   │
│ 10  │ cpu         │ FusedMulAdd               │ ms      │ 1.709e-6                                 │ 1.716e-6                                 │ 1.0041    │
│ 11  │ cpu         │ FloatSin                  │ ms      │ 5.093e-6                                 │ 4.927e-6                                 │ 0.967406  │
│ 12  │ cpu         │ VecMulBroad               │ ms      │ 3.742237903225806e-5                     │ 4.559270516717325e-5                     │ 1.21833   │
│ 13  │ cpu         │ CPUMatMul                 │ ms      │ 0.021145                                 │ 0.0367385                                │ 1.73746   │
│ 14  │ cpu         │ MatMulBroad               │ ms      │ 0.003607                                 │ 0.0189734375                             │ 5.26017   │
│ 15  │ cpu         │ 3DMulBroad                │ ms      │ 0.0014235                                │ 0.00169535                               │ 1.19097   │
│ 16  │ cpu         │ peakflops                 │ flops   │ 1.4902694486333694e11                    │ 2.6443904974091913e11                    │ 1.77444   │
│ 17  │ cpu         │ FFMPEGH264Write           │ ms      │ 121.60521                                │ 143.683998                               │ 1.18156   │
│ 18  │ mem         │ DeepCopy                  │ ms      │ 0.00019974039938556066                   │ 0.00019046474820143885                   │ 0.953561  │
│ 19  │ mem         │ Bandwidth10kB             │ MiB/s   │ 96002.55773217192                        │ 99035.61046884404                        │ 1.03159   │
│ 20  │ mem         │ Bandwidth100kB            │ MiB/s   │ 41807.6505372956                         │ 43489.184021444205                       │ 1.04022   │
│ 21  │ mem         │ Bandwidth1MB              │ MiB/s   │ 21869.752938891692                       │ 26635.227381825163                       │ 1.2179    │
│ 22  │ mem         │ Bandwidth10MB             │ MiB/s   │ 10084.215207687876                       │ 11624.277395933892                       │ 1.15272   │
│ 23  │ mem         │ Bandwidth100MB            │ MiB/s   │ 10347.835483172948                       │ 8848.556421771302                        │ 0.855112  │

│ 24  │ diskio      │ DiskWrite1KB              │ ms      │ 0.051837                                 │ 0.131                                    │ 2.52715   │ <<
│ 25  │ diskio      │ DiskWrite1MB              │ ms      │ 1.1210205                                │ 0.518321                                 │ 0.462365  │ <<
│ 26  │ diskio      │ DiskRead1KB               │ ms      │ 0.0114043                                │ 0.0964005                                │ 8.453     │ <<
│ 27  │ diskio      │ DiskRead1MB               │ ms      │ 0.166096                                 │ 1.016004                                 │ 6.11697   │ <<

│ 28  │ loading     │ JuliaLoad                 │ ms      │ 126.233121                               │ 216.478817                               │ 1.71491   │
│ 29  │ compilation │ compilecache              │ ms      │ 277.6027165                              │ 357.7933795                              │ 1.28887   │
│ 30  │ compilation │ success_create_expr_cache │ ms      │ 269.5016405                              │ 370.772621                               │ 1.37577   │
│ 31  │ compilation │ create_expr_cache         │ ms      │ 3.351107                                 │ 12.492654                                │ 3.72792   │
│ 32  │ compilation │ output-ji-substart        │ ms      │ 40.68284                                 │ 27.275530500000002                       │ 0.670443  │

Further, plotting the full results so far shows a fast clustering of linux disk io results, while MacOS & windows struggle (bottom left here). There are some Linux outliers which according to users are systems using slower storage devices such as SD cards.

image

Windows needs more testing. So if anyone with a windows machine with a linux VM could run this, it would be helpful:

using SystemBenchmark
res = runbenchmark()
savebenchmark("res_windows.txt", res)
... # repeat on the linux VM
comp = compare(readbenchmark("res_windows.txt),readbenchmark("res_linux.txt))
show(comp, allrows=true, allcols=true)
@JeffBezanson JeffBezanson added the performance Must go faster label May 18, 2020
@JeffBezanson
Copy link
Member

It seems that stat calls are more expensive on the mac's filesystem; #35925 addresses that for some cases. But I believe many of these are filesystem differences not due to julia. We should drill down and find out.

@IanButterworth
Copy link
Member Author

IanButterworth commented May 18, 2020

Curiously I just gave #35925 a go on MacOS and actually think I see a regression from master (note that this is a comparison to the built-in SystemBenchmark ref, linux 1.4.1):
macos_fasterread.txt

│ Row │ cat         │ testname                  │ units   │ ref_res                                  │ test_res                                 │ factor    │
│     │ String      │ String                    │ String? │ Any                                      │ Any                                      │ Any       │
├─────┼─────────────┼───────────────────────────┼─────────┼──────────────────────────────────────────┼──────────────────────────────────────────┼───────────┤
│ 1   │ info        │ SysBenchVer               │         │ 0.3.0                                    │ 0.3.2                                    │ Not Equal │
│ 2   │ info        │ JuliaVer                  │         │ 1.4.1                                    │ 1.6.0-DEV.60                             │ Not Equal │
│ 3   │ info        │ OS                        │         │ Linux (x86_64-pc-linux-gnu)              │ macOS (x86_64-apple-darwin19.4.0)        │ Not Equal │
...
| 24  │ diskio      │ DiskWrite1KB              │ ms      │ 0.031895                                 │ 0.123459                                 │ 3.87079   │
│ 25  │ diskio      │ DiskWrite1MB              │ ms      │ 0.920052                                 │ 0.394346                                 │ 0.428612  │
│ 26  │ diskio      │ DiskRead1KB               │ ms      │ 0.006683666666666667                     │ 0.0687755                                │ 10.2901   │
│ 27  │ diskio      │ DiskRead1MB               │ ms      │ 0.143755                                 │ 2.02386                                  │ 14.0785   │

Current master (nightly binary):
macos_master.txt

│ Row │ cat         │ testname                  │ units   │ ref_res                                  │ test_res                                 │ factor    │
│     │ String      │ String                    │ String? │ Any                                      │ Any                                      │ Any       │
├─────┼─────────────┼───────────────────────────┼─────────┼──────────────────────────────────────────┼──────────────────────────────────────────┼───────────┤
│ 1   │ info        │ SysBenchVer               │         │ 0.3.0                                    │ 0.3.2                                    │ Not Equal │
│ 2   │ info        │ JuliaVer                  │         │ 1.4.1                                    │ 1.6.0-DEV.58                             │ Not Equal │
│ 3   │ info        │ OS                        │         │ Linux (x86_64-pc-linux-gnu)              │ macOS (x86_64-apple-darwin19.4.0)        │ Not Equal │
...
│ 24  │ diskio      │ DiskWrite1KB              │ ms      │ 0.031895                                 │ 0.125512                                 │ 3.93516   │
│ 25  │ diskio      │ DiskWrite1MB              │ ms      │ 0.920052                                 │ 0.490967                                 │ 0.53363   │
│ 26  │ diskio      │ DiskRead1KB               │ ms      │ 0.006683666666666667                     │ 0.0232165                                │ 3.47362   │
│ 27  │ diskio      │ DiskRead1MB               │ ms      │ 0.143755                                 │ 0.97102                                  │ 6.75469   │

Comparison of the two (ref = macos with master, test = macos with #35925). Mostly seems like benchmark noise apart from DiskRead tests:

│ Row │ cat         │ testname                  │ units   │ ref_res                                  │ test_res                                 │ factor    │
│     │ String      │ String                    │ String? │ Any                                      │ Any                                      │ Any       │
├─────┼─────────────┼───────────────────────────┼─────────┼──────────────────────────────────────────┼──────────────────────────────────────────┼───────────┤
│ 1   │ info        │ SysBenchVer               │ missing │ 0.3.2                                    │ 0.3.2                                    │ Equal     │
│ 2   │ info        │ JuliaVer                  │ missing │ 1.6.0-DEV.58                             │ 1.6.0-DEV.60                             │ Not Equal │
│ 3   │ info        │ OS                        │ missing │ macOS (x86_64-apple-darwin19.4.0)        │ macOS (x86_64-apple-darwin19.3.0)        │ Not Equal │
│ 4   │ info        │ CPU                       │ missing │ Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz │ Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz │ Equal     │
│ 5   │ info        │ WORD_SIZE                 │ missing │ 64                                       │ 64                                       │ Equal     │
│ 6   │ info        │ LIBM                      │ missing │ libopenlibm                              │ libopenlibm                              │ Equal     │
│ 7   │ info        │ LLVM                      │ missing │ libLLVM-9.0.1 (ORCJIT, skylake)          │ libLLVM-9.0.1 (ORCJIT, skylake)          │ Equal     │
│ 8   │ info        │ GPU                       │ missing │ missing                                  │ missing                                  │ Not Equal │
│ 9   │ cpu         │ FloatMul                  │ ms      │ 1.71e-6                                  │ 1.716e-6                                 │ 1.00351   │
│ 10  │ cpu         │ FusedMulAdd               │ ms      │ 1.714e-6                                 │ 2.046e-6                                 │ 1.1937    │
│ 11  │ cpu         │ FloatSin                  │ ms      │ 5.371000000000001e-6                     │ 5.21e-6                                  │ 0.970024  │
│ 12  │ cpu         │ VecMulBroad               │ ms      │ 4.007758620689656e-5                     │ 4.53242147922999e-5                      │ 1.13091   │
│ 13  │ cpu         │ CPUMatMul                 │ ms      │ 0.045556                                 │ 0.043592                                 │ 0.956888  │
│ 14  │ cpu         │ MatMulBroad               │ ms      │ 0.0185805625                             │ 0.019206285714285714                     │ 1.03368   │
│ 15  │ cpu         │ 3DMulBroad                │ ms      │ 0.0020743000000000003                    │ 0.0017695                                │ 0.853059  │
│ 16  │ cpu         │ peakflops                 │ flops   │ 2.2755524810313144e11                    │ 2.0147029241838956e11                    │ 0.885369  │
│ 17  │ cpu         │ FFMPEGH264Write           │ ms      │ 142.476753                               │ 142.957275                               │ 1.00337   │
│ 18  │ mem         │ DeepCopy                  │ ms      │ 0.00019632103825136614                   │ 0.00020576475037821485                   │ 1.0481    │
│ 19  │ mem         │ Bandwidth10kB             │ MiB/s   │ 109367.65496137166                       │ 112044.56483827147                       │ 1.02448   │
│ 20  │ mem         │ Bandwidth100kB            │ MiB/s   │ 41919.7501717033                         │ 42626.125973550705                       │ 1.01685   │
│ 21  │ mem         │ Bandwidth1MB              │ MiB/s   │ 23320.07131449444                        │ 22672.522559167202                       │ 0.972232  │
│ 22  │ mem         │ Bandwidth10MB             │ MiB/s   │ 10291.390226598516                       │ 10298.525062970422                       │ 1.00069   │
│ 23  │ mem         │ Bandwidth100MB            │ MiB/s   │ 7989.830485750289                        │ 7947.260803726205                        │ 0.994672  │
│ 24  │ diskio      │ DiskWrite1KB              │ ms      │ 0.125512                                 │ 0.1218915                                │ 0.971154  │
│ 25  │ diskio      │ DiskWrite1MB              │ ms      │ 0.490967                                 │ 0.419084                                 │ 0.853589  │
│ 26  │ diskio      │ DiskRead1KB               │ ms      │ 0.0232165                                │ 0.0680975                                │ 2.93315   │ <<
│ 27  │ diskio      │ DiskRead1MB               │ ms      │ 0.97102                                  │ 2.022493                                 │ 2.08285   │ <<
│ 28  │ loading     │ JuliaLoad                 │ ms      │ 169.502149                               │ 161.599899                               │ 0.95338   │
│ 29  │ compilation │ compilecache              │ ms      │ 316.5481865                              │ 310.810209                               │ 0.981873  │
│ 30  │ compilation │ success_create_expr_cache │ ms      │ 314.877099                               │ 316.0820275                              │ 1.00383   │
│ 31  │ compilation │ create_expr_cache         │ ms      │ 9.0505545                                │ 8.917059                                 │ 0.98525   │
│ 32  │ compilation │ output-ji-substart        │ ms      │ 52.009300499999995                       │ 59.39701100000002                        │ 1.14205   │

It could always be that the benchmark designs aren't capturing the issue accurately?

@IanButterworth
Copy link
Member Author

For a non-julia platform-level test, I ran a 1KB dd test with time sh -c "dd if=/dev/zero of=/tmp/testfile bs=1k count=1k && sync" on MacOS and an Ubuntu VM on the same machine.

The results seem more comparable.
MacOS x1.4 slower, compared to DiskRead1KB's x3.5-10 slower in julia

MacOS:
1048576 bytes transferred in 0.004472 secs (234,474,943 bytes/sec)
1048576 bytes transferred in 0.003677 secs (285,161,545 bytes/sec)
1048576 bytes transferred in 0.004267 secs (245,742,108 bytes/sec)

Ubuntu VM:
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.00316243 s, 332 MB/s
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0033453 s, 313 MB/s
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.00335024 s, 313 MB/s

Same, but 1MB time sh -c "dd if=/dev/zero of=/tmp/testfile bs=1000k count=1k && sync"
MacOS x1.2 slower, compared to DiskRead1MB's x6-14 slower in julia

MacOS:
1048576000 bytes transferred in 0.547662 secs (1,914,640,713 bytes/sec)
1048576000 bytes transferred in 0.579230 secs (1,810,292,756 bytes/sec)
1048576000 bytes transferred in 0.601566 secs (1,743,077,014 bytes/sec)

Ubuntu VM
1048576000 bytes (1.0 GB, 1000 MiB) copied, 0.457688 s, 2.3 GB/s
1048576000 bytes (1.0 GB, 1000 MiB) copied, 0.459572 s, 2.3 GB/s
1048576000 bytes (1.0 GB, 1000 MiB) copied, 0.459511 s, 2.3 GB/s

Also, if it's helpful, I ran #35925 on the reference linux machine, and compared it to its own 1.4.1 results. DiskRead1KB improves, DiskRead1MB looks like it's regressed.

│ Row │ cat         │ testname                  │ units   │ ref_res                                  │ test_res                                 │ factor    │
│     │ String      │ String                    │ String? │ Any                                      │ Any                                      │ Any       │
├─────┼─────────────┼───────────────────────────┼─────────┼──────────────────────────────────────────┼──────────────────────────────────────────┼───────────┤
│ 1   │ info        │ SysBenchVer               │         │ 0.3.0                                    │ 0.3.2                                    │ Not Equal │
│ 2   │ info        │ JuliaVer                  │         │ 1.4.1                                    │ 1.6.0-DEV.60                             │ Not Equal │
│ 3   │ info        │ OS                        │         │ Linux (x86_64-pc-linux-gnu)              │ Linux (x86_64-linux-gnu)                 │ Not Equal │
│ 4   │ info        │ CPU                       │         │ Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz │ Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz │ Equal     │
│ 5   │ info        │ WORD_SIZE                 │         │ 64                                       │ 64                                       │ Equal     │
│ 6   │ info        │ LIBM                      │         │ libopenlibm                              │ libopenlibm                              │ Equal     │
│ 7   │ info        │ LLVM                      │         │ libLLVM-8.0.1 (ORCJIT, skylake)          │ libLLVM-9.0.1 (ORCJIT, skylake)          │ Not Equal │
│ 8   │ info        │ GPU                       │         │ GeForce GTX 1650 with Max-Q Design       │ missing                                  │ Not Equal │
│ 9   │ cpu         │ FloatMul                  │ ms      │ 1.1339999999999999e-6                    │ 1.159e-6                                 │ 1.02205   │
│ 10  │ cpu         │ FusedMulAdd               │ ms      │ 1.1339999999999999e-6                    │ 1.159e-6                                 │ 1.02205   │
│ 11  │ cpu         │ FloatSin                  │ ms      │ 4.051e-6                                 │ 3.907e-6                                 │ 0.964453  │
│ 12  │ cpu         │ VecMulBroad               │ ms      │ 2.9823115577889445e-5                    │ 3.01367e-5                               │ 1.01051   │
│ 13  │ cpu         │ CPUMatMul                 │ ms      │ 0.018683                                 │ 0.0258885                                │ 1.38567   │
│ 14  │ cpu         │ MatMulBroad               │ ms      │ 0.0042223                                │ 0.00440328                               │ 1.04286   │
│ 15  │ cpu         │ 3DMulBroad                │ ms      │ 0.0010529                                │ 0.0011584                                │ 1.1002    │
│ 16  │ cpu         │ peakflops                 │ flops   │ 1.8061545733047305e11                    │ 1.79274e11                               │ 0.99257   │
│ 17  │ cpu         │ FFMPEGH264Write           │ ms      │ 107.751337                               │ 108.919                                  │ 1.01084   │
│ 18  │ mem         │ DeepCopy                  │ ms      │ 0.00018790257558790594                   │ 0.000181501                              │ 0.965929  │
│ 19  │ mem         │ Bandwidth10kB             │ MiB/s   │ 104871.77985698299                       │ 118339.0                                 │ 1.12841   │
│ 20  │ mem         │ Bandwidth100kB            │ MiB/s   │ 58579.50346475738                        │ 55572.2                                  │ 0.948663  │
│ 21  │ mem         │ Bandwidth1MB              │ MiB/s   │ 33439.963407070725                       │ 27033.1                                  │ 0.808408  │
│ 22  │ mem         │ Bandwidth10MB             │ MiB/s   │ 5551.476774539766                        │ 11678.0                                  │ 2.10358   │
│ 23  │ mem         │ Bandwidth100MB            │ MiB/s   │ 6910.9558841033295                       │ 5181.83                                  │ 0.749799  │
│ 24  │ diskio      │ DiskWrite1KB              │ ms      │ 0.031895                                 │ 0.033352                                 │ 1.04568   │ <<
│ 25  │ diskio      │ DiskWrite1MB              │ ms      │ 0.920052                                 │ 0.798928                                 │ 0.868351  │ <<
│ 26  │ diskio      │ DiskRead1KB               │ ms      │ 0.006683666666666667                     │ 0.00457438                               │ 0.684411  │ <<
│ 27  │ diskio      │ DiskRead1MB               │ ms      │ 0.143755                                 │ 0.411939                                 │ 2.86556   │ <<
│ 28  │ loading     │ JuliaLoad                 │ ms      │ 91.08045                                 │ 76.1643                                  │ 0.836231  │
│ 29  │ compilation │ compilecache              │ ms      │ 208.532099                               │ 210.482                                  │ 1.00935   │
│ 30  │ compilation │ success_create_expr_cache │ ms      │ 235.4855485                              │ 245.599                                  │ 1.04295   │
│ 31  │ compilation │ create_expr_cache         │ ms      │ 0.894126                                 │ 1.05864                                  │ 1.18399   │
│ 32  │ compilation │ output-ji-substart        │ ms      │ 32.639706000000004                       │ 46.8342                                  │ 1.43488   │

@JeffBezanson
Copy link
Member

Thanks for checking that. I think I found my mistake.

@JeffBezanson
Copy link
Member

Ok I've updated #35925; should be better now. But I'm a bit out of ideas; we're just doing open, read, and close. I don't know what would be several times slower.

@IanButterworth
Copy link
Member Author

Ok. Tried again on macOS.

With the update to #35925 (relative to 1.4.1 linux reference, table below)

  • DiskRead1KB stayed at 10x slower (3-8x on 1.4.1)
  • DiskRead1MB down from 14x to 6x slower (6-8x on 1.4.1)

This is all puzzling..

The benchmark is just:

function tempwrite(x; delete::Bool=false, path = joinpath(@__DIR__, "testwrite.dat"))
    open(path, "w") do io
        write(io, x)
    end
    delete && rm(path)
    return path
end
function tempread(path)
    x = open(path) do io
        read(io)
    end
    return x
end
t = @benchmark tempwrite(x) setup=(x=rand(UInt8, 1000))
t = @benchmark tempwrite(x) setup=(x=rand(UInt8, 1000000))
t = @benchmark tempread(path) setup=(path = tempwrite(rand(UInt8, 1000), delete=false))
t = @benchmark tempread(path) setup=(path = tempwrite(rand(UInt8, 1000000), delete=false))

macos_fasterread2.txt

│ Row │ cat         │ testname                  │ units   │ ref_res                                  │ test_res                                 │ factor    │
│     │ String      │ String                    │ String? │ Any                                      │ Any                                      │ Any       │
├─────┼─────────────┼───────────────────────────┼─────────┼──────────────────────────────────────────┼──────────────────────────────────────────┼───────────┤
│ 1   │ info        │ SysBenchVer               │         │ 0.3.0                                    │ 0.3.2                                    │ Not Equal │
│ 2   │ info        │ JuliaVer                  │         │ 1.4.1                                    │ 1.6.0-DEV.60                             │ Not Equal │
│ 3   │ info        │ OS                        │         │ Linux (x86_64-pc-linux-gnu)              │ macOS (x86_64-apple-darwin19.3.0)        │ Not Equal │
│ 4   │ info        │ CPU                       │         │ Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz │ Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz │ Not Equal │
│ 5   │ info        │ WORD_SIZE                 │         │ 64                                       │ 64                                       │ Equal     │
│ 6   │ info        │ LIBM                      │         │ libopenlibm                              │ libopenlibm                              │ Equal     │
│ 7   │ info        │ LLVM                      │         │ libLLVM-8.0.1 (ORCJIT, skylake)          │ libLLVM-9.0.1 (ORCJIT, skylake)          │ Not Equal │
│ 8   │ info        │ GPU                       │         │ GeForce GTX 1650 with Max-Q Design       │ missing                                  │ Not Equal │
│ 9   │ cpu         │ FloatMul                  │ ms      │ 1.1339999999999999e-6                    │ 1.713e-6                                 │ 1.51058   │
│ 10  │ cpu         │ FusedMulAdd               │ ms      │ 1.1339999999999999e-6                    │ 1.711e-6                                 │ 1.50882   │
│ 11  │ cpu         │ FloatSin                  │ ms      │ 4.051e-6                                 │ 5.058e-6                                 │ 1.24858   │
│ 12  │ cpu         │ VecMulBroad               │ ms      │ 2.9823115577889445e-5                    │ 4.50622e-5                               │ 1.51098   │
│ 13  │ cpu         │ CPUMatMul                 │ ms      │ 0.018683                                 │ 0.045178                                 │ 2.41813   │
│ 14  │ cpu         │ MatMulBroad               │ ms      │ 0.0042223                                │ 0.018787                                 │ 4.44947   │
│ 15  │ cpu         │ 3DMulBroad                │ ms      │ 0.0010529                                │ 0.0018309                                │ 1.73891   │
│ 16  │ cpu         │ peakflops                 │ flops   │ 1.8061545733047305e11                    │ 2.45663e11                               │ 1.36014   │
│ 17  │ cpu         │ FFMPEGH264Write           │ ms      │ 107.751337                               │ 143.785                                  │ 1.33442   │
│ 18  │ mem         │ DeepCopy                  │ ms      │ 0.00018790257558790594                   │ 0.00019617                               │ 1.044     │
│ 19  │ mem         │ Bandwidth10kB             │ MiB/s   │ 104871.77985698299                       │ 109491.0                                 │ 1.04404   │
│ 20  │ mem         │ Bandwidth100kB            │ MiB/s   │ 58579.50346475738                        │ 42673.8                                  │ 0.728477  │
│ 21  │ mem         │ Bandwidth1MB              │ MiB/s   │ 33439.963407070725                       │ 22815.2                                  │ 0.682273  │
│ 22  │ mem         │ Bandwidth10MB             │ MiB/s   │ 5551.476774539766                        │ 10334.9                                  │ 1.86165   │
│ 23  │ mem         │ Bandwidth100MB            │ MiB/s   │ 6910.9558841033295                       │ 8071.05                                  │ 1.16786   │
│ 24  │ diskio      │ DiskWrite1KB              │ ms      │ 0.031895                                 │ 0.124188                                 │ 3.89367   │ <<
│ 25  │ diskio      │ DiskWrite1MB              │ ms      │ 0.920052                                 │ 0.408841                                 │ 0.444367  │ <<
│ 26  │ diskio      │ DiskRead1KB               │ ms      │ 0.006683666666666667                     │ 0.0686085                                │ 10.2651   │ <<
│ 27  │ diskio      │ DiskRead1MB               │ ms      │ 0.143755                                 │ 0.862652                                 │ 6.00085   │ <<
│ 28  │ loading     │ JuliaLoad                 │ ms      │ 91.08045                                 │ 156.032                                  │ 1.71313   │
│ 29  │ compilation │ compilecache              │ ms      │ 208.532099                               │ 311.432                                  │ 1.49345   │
│ 30  │ compilation │ success_create_expr_cache │ ms      │ 235.4855485                              │ 310.809                                  │ 1.31986   │
│ 31  │ compilation │ create_expr_cache         │ ms      │ 0.894126                                 │ 8.86578                                  │ 9.91558   │
│ 32  │ compilation │ output-ji-substart        │ ms      │ 32.639706000000004                       │ 55.4831                                  │ 1.69987   │

@ViralBShah
Copy link
Member

ViralBShah commented May 23, 2020

I vaguely remember an old issue regarding buffer sizes that we use for read, and that larger buffer sizes helped make things faster. Is that somehow relevant here? I can't find the issue. Of course, that may not help with 1KB read tests.

@ViralBShah ViralBShah added system:mac Affects only macOS system:windows Affects only Windows io Involving the I/O subsystem: libuv, read, write, etc. labels May 23, 2020
@JeffBezanson
Copy link
Member

In the 1MB case, we do two read calls: one to initially fill the 32kb buffer, then we copy that to the output, and do one more read call to read the entire rest of the file into the output at once. So we're not even really using buffering here.

@IanButterworth
Copy link
Member Author

IanButterworth commented May 23, 2020

Two benchmarking details:

  1. I realised I was using read(f) instead of read(f, UInt8). Both are presented below, read(f, UInt8) is faster, as to be expected I guess..
  2. In SystemBenchmark, the setup phase of the benchmark is currently re-writing the test file immediately before each sample test. Perhaps MacOS is slower to release the file after closing, and the time includes waiting? Is that plausible?

Also, I thought more granularity might help. The test below shows MacOS is slower than an ubuntu VM on the same machine, across all file sizes. Though the degree slower is less than SystemBenchmark reported. Note that here, the test file written once per file size, not rewritten before each sample, like SystemBenchmark does.

using  BenchmarkTools
for zeros in 3:6
    for dig in 1:9
        bytes = parse(Int,"$(dig)$(repeat("0",zeros))")
        open("dummy.dat", "w") do io
           write(io, rand(UInt8,bytes))
        end;
        t = @benchmark open(f->read(f, UInt8), "dummy.dat")
        time_s = minimum(t).time / 1e9
        MiB_s = (bytes / time_s)  / (1024 * 1024)
        @info "File size: $(round(Int,bytes/1000)) KB. Read speed: $(round(MiB_s,digits=1)) MiB/s"
    end
end

read(f,UInt8) speed in MiB/s:

| File Size | MacOS    | Ubuntu VM on Mac | x faster |
|-----------|----------|------------------|----------|
| 1KB       |     99.9 |              349 |      3.5 |
| 2KB       |    201.4 |            690.3 |      3.4 |
| 3KB       |    255.9 |             1016 |      4.0 |
| 4KB       |    396.2 |           1306.5 |      3.3 |
| 5KB       |    483.7 |           1598.1 |      3.3 |
| 6KB       |    563.4 |           1876.9 |      3.3 |
| 7KB       |    670.1 |             2112 |      3.2 |
| 8KB       |    775.2 |           2375.2 |      3.1 |
| 9KB       |    855.8 |           2536.7 |      3.0 |
| 10KB      |      943 |           2476.9 |      2.6 |
| 20KB      |     1816 |           4289.9 |      2.4 |
| 30KB      |   2536.4 |           6190.9 |      2.4 |
| 40KB      |   3302.8 |           7591.9 |      2.3 |
| 50KB      |   3981.6 |           8316.7 |      2.1 |
| 60KB      |   4704.5 |           9331.8 |      2.0 |
| 70KB      |   5091.7 |           9511.3 |      1.9 |
| 80KB      |   5723.5 |          10168.5 |      1.8 |
| 90KB      |   6220.1 |          10744.1 |      1.7 |
| 100KB     |   6687.3 |          10977.3 |      1.6 |
| 200KB     |  11889.7 |          17958.3 |      1.5 |
| 300KB     |  17490.1 |          26661.3 |      1.5 |
| 400KB     |  24143.7 |          35190.9 |      1.5 |
| 500KB     |  29709.5 |          37537.4 |      1.3 |
| 600KB     |    35424 |          52864.4 |      1.5 |
| 700KB     |  42585.6 |          61875.2 |      1.5 |
| 800KB     |  48342.4 |          70388.4 |      1.5 |
| 900KB     |  55228.5 |          78801.6 |      1.4 |
| 1000KB    |  61317.7 |            87872 |      1.4 |
| 2000KB    | 118058.2 |         175727.7 |      1.5 |
| 3000KB    |   182254 |           263932 |      1.4 |
| 4000KB    | 235809.9 |         351617.4 |      1.5 |
| 5000KB    | 299370.4 |         441679.5 |      1.5 |
| 6000KB    | 363604.6 |         544231.1 |      1.5 |
| 7000KB    | 425774.6 |           615047 |      1.4 |
| 8000KB    | 483945.1 |         715300.4 |      1.5 |
| 9000KB    | 524477.2 |         765320.5 |      1.5 |

read(f) speed in MiB/s:

| File Size | MacOS  | Ubuntu VM on Mac | x faster |
|-----------|--------|------------------|----------|
| 1KB       |   78.7 |            199.3 |      2.5 |
| 2KB       |  157.3 |            380.7 |      2.4 |
| 3KB       |  230.7 |              538 |      2.3 |
| 4KB       |  303.7 |            703.8 |      2.3 |
| 5KB       |  373.2 |            850.6 |      2.3 |
| 6KB       |  438.9 |            996.1 |      2.3 |
| 7KB       |  509.1 |             1103 |      2.2 |
| 8KB       |  577.4 |           1180.4 |      2.0 |
| 9KB       |  641.1 |           1336.9 |      2.1 |
| 10KB      |  691.8 |           1303.9 |      1.9 |
| 20KB      | 1298.9 |           2137.1 |      1.6 |
| 30KB      | 1796.9 |           2728.4 |      1.5 |
| 40KB      | 1683.1 |           3347.7 |      2.0 |
| 50KB      | 1954.8 |           3910.1 |      2.0 |
| 60KB      |   2136 |           4479.1 |      2.1 |
| 70KB      |   2394 |           4605.9 |      1.9 |
| 80KB      | 2717.7 |           4203.1 |      1.5 |
| 90KB      | 2834.8 |           5056.6 |      1.8 |
| 100KB     | 3035.1 |           5247.8 |      1.7 |
| 200KB     |   4888 |           9630.2 |      2.0 |
| 300KB     | 5834.2 |           9311.1 |      1.6 |
| 400KB     | 6435.8 |          10906.6 |      1.7 |
| 500KB     | 6765.2 |          11350.6 |      1.7 |
| 600KB     | 7275.9 |          11575.1 |      1.6 |
| 700KB     | 7468.4 |          12004.5 |      1.6 |
| 800KB     | 7607.6 |          10325.9 |      1.4 |
| 900KB     | 7643.6 |          10249.3 |      1.3 |
| 1000KB    | 7845.5 |          10212.3 |      1.3 |
| 2000KB    | 7868.7 |           9680.5 |      1.2 |
| 3000KB    | 7596.8 |          10611.7 |      1.4 |
| 4000KB    | 6981.3 |           8572.7 |      1.2 |
| 5000KB    | 6477.1 |           8159.5 |      1.3 |
| 6000KB    | 6155.5 |           7901.6 |      1.3 |
| 7000KB    | 5839.3 |           7129.2 |      1.2 |
| 8000KB    | 5853.2 |           6915.7 |      1.2 |
| 9000KB    | 5716.1 |             6730 |      1.2 |

@ViralBShah
Copy link
Member

I read somewhere that SIP on mac could perhaps slow things down. Maybe try disabling and seeing if it makes a difference for the benchmark?

https://apple.stackexchange.com/questions/208478/how-do-i-disable-system-integrity-protection-sip-aka-rootless-on-macos-os-x

@IanButterworth
Copy link
Member Author

The tests above were actually with SIP disabled. I re-tried with SIP enabled and it is slower on the read(f,UInt8) tests.

read(f,UInt8) on MacOS with SIP enabled:

[ Info: File size: 1 KB. Read speed: 74.2 MiB/s
[ Info: File size: 2 KB. Read speed: 148.7 MiB/s
[ Info: File size: 3 KB. Read speed: 190.5 MiB/s
[ Info: File size: 4 KB. Read speed: 251.6 MiB/s
[ Info: File size: 5 KB. Read speed: 360.8 MiB/s
[ Info: File size: 6 KB. Read speed: 445.4 MiB/s
[ Info: File size: 7 KB. Read speed: 515.1 MiB/s
[ Info: File size: 8 KB. Read speed: 560.5 MiB/s
[ Info: File size: 9 KB. Read speed: 625.7 MiB/s
[ Info: File size: 10 KB. Read speed: 696.7 MiB/s
[ Info: File size: 20 KB. Read speed: 1373.9 MiB/s
[ Info: File size: 30 KB. Read speed: 1958.0 MiB/s
[ Info: File size: 40 KB. Read speed: 2526.3 MiB/s
[ Info: File size: 50 KB. Read speed: 3106.6 MiB/s
[ Info: File size: 60 KB. Read speed: 3648.8 MiB/s
[ Info: File size: 70 KB. Read speed: 4115.5 MiB/s
[ Info: File size: 80 KB. Read speed: 4517.4 MiB/s
[ Info: File size: 90 KB. Read speed: 5033.2 MiB/s
[ Info: File size: 100 KB. Read speed: 5379.2 MiB/s
[ Info: File size: 200 KB. Read speed: 9803.9 MiB/s
[ Info: File size: 300 KB. Read speed: 14824.7 MiB/s
[ Info: File size: 400 KB. Read speed: 19496.6 MiB/s
[ Info: File size: 500 KB. Read speed: 24720.7 MiB/s
[ Info: File size: 600 KB. Read speed: 24793.3 MiB/s
[ Info: File size: 700 KB. Read speed: 34898.4 MiB/s
[ Info: File size: 800 KB. Read speed: 41202.1 MiB/s
[ Info: File size: 900 KB. Read speed: 46207.6 MiB/s
[ Info: File size: 1000 KB. Read speed: 49461.9 MiB/s
[ Info: File size: 2000 KB. Read speed: 101741.5 MiB/s
[ Info: File size: 3000 KB. Read speed: 147894.7 MiB/s
[ Info: File size: 4000 KB. Read speed: 202371.2 MiB/s
[ Info: File size: 5000 KB. Read speed: 249183.3 MiB/s
[ Info: File size: 6000 KB. Read speed: 306335.8 MiB/s
[ Info: File size: 7000 KB. Read speed: 348874.8 MiB/s
[ Info: File size: 8000 KB. Read speed: 395305.4 MiB/s
[ Info: File size: 9000 KB. Read speed: 454202.7 MiB/s

@IanButterworth
Copy link
Member Author

IanButterworth commented Nov 2, 2020

I just checked and this issue remains on master

UInt8 reading

using BenchmarkTools
path, _ = mktemp()
for zeros in 3:8
    bytes = 10^zeros
    open(path, "w") do io
        write(io, rand(UInt8,bytes))
    end
    t = @benchmark open(path) do io
        read(io, UInt8)
    end
    time_s = minimum(t).time / 1e9
    speed = Base.format_bytes(bytes / time_s)
    @info "File size: $(Base.format_bytes(bytes)). \tbtime (min): $(round(time_s * 1e6, digits=2))us. \tRead speed: $(speed)/s"
end

MacOS with SIP enabled

[ Info: File size: 1000 bytes. 	btime (min): 15.27us. 	Read speed: 62.458 MiB/s
[ Info: File size: 9.766 KiB. 	btime (min): 16.06us. 	Read speed: 593.968 MiB/s
[ Info: File size: 97.656 KiB. 	btime (min): 17.05us. 	Read speed: 5.463 GiB/s
[ Info: File size: 976.562 KiB. btime (min): 17.07us. 	Read speed: 54.559 GiB/s
[ Info: File size: 9.537 MiB. 	btime (min): 16.99us. 	Read speed: 548.224 GiB/s
[ Info: File size: 95.367 MiB. 	btime (min): 17.04us. 	Read speed: 5.336 TiB/s

MacOS with SIP disabled

[ Info: File size: 1000 bytes. 	btime (min): 12.91us. 	Read speed: 73.865 MiB/s
[ Info: File size: 9.766 KiB. 	btime (min): 13.51us. 	Read speed: 705.694 MiB/s
[ Info: File size: 97.656 KiB. 	btime (min): 15.49us. 	Read speed: 6.011 GiB/s
[ Info: File size: 976.562 KiB. btime (min): 14.71us. 	Read speed: 63.304 GiB/s
[ Info: File size: 9.537 MiB. 	btime (min): 14.24us. 	Read speed: 653.835 GiB/s
[ Info: File size: 95.367 MiB. 	btime (min): 14.91us. 	Read speed: 6.099 TiB/s

Ubuntu VM on same machine

[ Info: File size: 1000 bytes. 	btime (min): 3.31us. 	Read speed: 288.391 MiB/s
[ Info: File size: 9.766 KiB. 	btime (min): 3.89us. 	Read speed: 2.395 GiB/s
[ Info: File size: 97.656 KiB. 	btime (min): 5.16us. 	Read speed: 18.033 GiB/s
[ Info: File size: 976.562 KiB. btime (min): 5.19us. 	Read speed: 179.359 GiB/s
[ Info: File size: 9.537 MiB. 	btime (min): 5.21us. 	Read speed: 1.744 TiB/s
[ Info: File size: 95.367 MiB. 	btime (min): 5.12us. 	Read speed: 17.781 TiB/s

@IanButterworth
Copy link
Member Author

Seems to be an overhead with file opening/closing
If you just do the file open/close:

t = @benchmark open(path) do io
end

MacOS, no SIP

[ Info: File size: 1000 bytes. 	btime (min): 11.98us.
[ Info: File size: 9.766 KiB. 	btime (min): 12.32us.
[ Info: File size: 97.656 KiB. 	btime (min): 11.95us.
[ Info: File size: 976.562 KiB. btime (min): 11.71us.
[ Info: File size: 9.537 MiB. 	btime (min): 11.51us.
[ Info: File size: 95.367 MiB. 	btime (min): 11.73us.

Ubuntu VM

[ Info: File size: 1000 bytes. 	btime (min): 2.53us.
[ Info: File size: 9.766 KiB. 	btime (min): 2.42us.
[ Info: File size: 97.656 KiB. 	btime (min): 2.41us.
[ Info: File size: 976.562 KiB. btime (min): 2.47us.
[ Info: File size: 9.537 MiB. 	btime (min): 2.47us.
[ Info: File size: 95.367 MiB. 	btime (min): 2.44us.

@IanButterworth
Copy link
Member Author

Breaking it down into open / close

for i in 1:5
    @time io = open(path)
    @time close(io)
    println("--")
end

MacOS, no SIP

  0.001304 seconds (29 allocations: 2.141 KiB, 95.06% compilation time)
  0.000011 seconds
--
  0.000051 seconds (14 allocations: 1008 bytes)
  0.000005 seconds
--
  0.000030 seconds (14 allocations: 1008 bytes)
  0.000005 seconds
--
  0.000030 seconds (14 allocations: 1008 bytes)
  0.000005 seconds
--
  0.000025 seconds (14 allocations: 1008 bytes)
  0.000006 seconds
--

Ubuntu VM

  0.001820 seconds (27 allocations: 1.891 KiB, 98.12% compilation time)
  0.000006 seconds
--
  0.000007 seconds (12 allocations: 752 bytes)
  0.000002 seconds
--
  0.000005 seconds (12 allocations: 752 bytes)
  0.000002 seconds
--
  0.000005 seconds (12 allocations: 752 bytes)
  0.000001 seconds
--
  0.000004 seconds (12 allocations: 752 bytes)
  0.000002 seconds
--

@StefanKarpinski StefanKarpinski added this to the 1.6 features milestone Nov 2, 2020
@StefanKarpinski
Copy link
Member

I'm marking this as a 1.6 release blocker.

@KristofferC
Copy link
Member

Maybe I missed something but I thought this wasn't a regression. Why is it release-blocking suddenly?

@IanButterworth
Copy link
Member Author

(I 👍-ed because it would be good to figure this out, but as far as I know it isn't a regression)

@StefanKarpinski
Copy link
Member

My impression was that this was a regression. If it's not, we can make it non-blocking.

@IanButterworth
Copy link
Member Author

If it's a regression, it may be pre 1.0.

MacOS official binaries:

1.0.4

  0.000027 seconds (4 allocations: 480 bytes)
  0.000005 seconds
--
  0.000020 seconds (4 allocations: 480 bytes)
  0.000005 seconds
--
  0.000028 seconds (4 allocations: 480 bytes)
  0.000006 seconds

1.1.1

  0.000025 seconds (12 allocations: 928 bytes)
  0.000006 seconds
--
  0.000021 seconds (12 allocations: 928 bytes)
  0.000004 seconds
--
  0.000019 seconds (12 allocations: 928 bytes)
  0.000004 seconds

1.2.0

  0.000032 seconds (11 allocations: 896 bytes)
  0.000006 seconds
--
  0.000026 seconds (11 allocations: 896 bytes)
  0.000005 seconds
--
  0.000030 seconds (11 allocations: 896 bytes)
  0.000003 seconds

1.3.1

  0.000025 seconds (16 allocations: 1024 bytes)
  0.000004 seconds
--
  0.000027 seconds (16 allocations: 1024 bytes)
  0.000004 seconds
--
  0.000023 seconds (16 allocations: 1024 bytes)
  0.000003 seconds

1.4.2

  0.000020 seconds (15 allocations: 1008 bytes)
  0.000006 seconds
--
  0.000032 seconds (15 allocations: 1008 bytes)
  0.000005 seconds
--
  0.000026 seconds (15 allocations: 1008 bytes)
  0.000005 seconds

1.5.2

  0.000024 seconds (14 allocations: 1008 bytes)
  0.000004 seconds
--
  0.000023 seconds (14 allocations: 1008 bytes)
  0.000002 seconds
--
  0.000019 seconds (14 allocations: 1008 bytes)
  0.000002 seconds

1.6.0-DEV.1221

  0.000019 seconds (14 allocations: 1008 bytes)
  0.000003 seconds
--
  0.000019 seconds (14 allocations: 1008 bytes)
  0.000003 seconds
--
  0.000029 seconds (14 allocations: 1008 bytes)
  0.000003 seconds

@KristofferC KristofferC removed this from the 1.6 features milestone Nov 2, 2020
@KristofferC
Copy link
Member

I took it off since I don't think it should block the release if it has been the same since 1.0.

@IanButterworth
Copy link
Member Author

IanButterworth commented Nov 2, 2020

To rule out my VM setup doing something weird with filesystem caching, I ran this on a native ubuntu machine and the speed seems real (but the hardware is a little lower spec than my mac)

  0.000008 seconds (12 allocations: 752 bytes)
  0.000002 seconds
--
  0.000008 seconds (12 allocations: 752 bytes)
  0.000002 seconds
--
  0.000007 seconds (12 allocations: 752 bytes)
  0.000002 seconds

@IanButterworth
Copy link
Member Author

@code_native is identical on the two

julia> @code_native open(path)
	.text
; ┌ @ iostream.jl:275 within `open'
	pushq	%rax
	movq	%rsi, (%rsp)
	movq	(%rsi), %rsi
; └
; ┌ @ iostream.jl:282 within `open'
	movabsq	$"#open#577", %rax
	movl	$1, %edi
	callq	*%rax
	popq	%rcx
	retq
; └

and the src handling from there onwards doesn't seem to treat MacOS and Linux differently, they both fall within this else

#else

And lower down, I validated that O_CLOEXEC is defined and supported in MacOS for me

#ifdef O_CLOEXEC

So code-wise, it seems identical. Could it be compiler settings? I tried with both official binaries and a local build, and they perform the same.

Chances of it being just a slower filesystem are increasing..

@IanButterworth
Copy link
Member Author

IanButterworth commented Nov 3, 2020

I made a raw c test that tries to mimic julia's open_cloexec for opening an existent file as read only.

Oddly, both times are proportionally slower than the timing above, but MacOS is still 3-4x slower

#include <stdio.h>
#include <fcntl.h>
#include <time.h>
int main()
{
    struct timespec tstart={0,0}, tend={0,0};
    clock_gettime(CLOCK_MONOTONIC, &tstart);
    int fd = open("/path/to/testfile.txt", O_RDONLY | O_CLOEXEC, S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP | S_IROTH | S_IWOTH);
    clock_gettime(CLOCK_MONOTONIC, &tend);
    printf("%.0f µs\n", (((double)tend.tv_sec + 1.0e-9*tend.tv_nsec) - ((double)tstart.tv_sec + 1.0e-9*tstart.tv_nsec)) * 1.0e6);

   return 0;
}

MacOS, no SIP

51 µs
46 µs
49 µs
30 µs
46 µs

Ubuntu VM

12 µs
12 µs
12 µs
13 µs
13 µs

@IanButterworth
Copy link
Member Author

I posted the above example on SO to collect any thoughts https://stackoverflow.com/questions/64656255/why-is-the-c-function-open-4x-slower-on-macos-vs-an-ubuntu-vm-on-the-same-ma/64656675#64656675

All signs are pointing to MacOS just having a slower filesystem

@StefanKarpinski
Copy link
Member

libc does a lot of locking on I/O so you either need to opt out of that (there's some #define) or use one of the non-locking lower level I/O APIs (the ones that don't start with f and not the *printf family).

@IanButterworth
Copy link
Member Author

IanButterworth commented Feb 17, 2022

The slowness and Linux VMs being faster for disk IO on a Mac than native MacOS appears to be explained by this golang/go#28739 (comment)

Just FWIW, this is a problem with Apple NVMe drives (T2 and M1 Macs alike). Their cache flush performance is abysmal. This affects both macOS and (native) Linux (presumably existing VM solutions fail to forward this as F_FULLSYNC if they run faster). You get about 46 IOPS of F_FULLSYNC on macOS on an M1 machine, and about the same on native Linux with fsync().

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
io Involving the I/O subsystem: libuv, read, write, etc. performance Must go faster system:mac Affects only macOS system:windows Affects only Windows
Projects
None yet
Development

No branches or pull requests

5 participants