Skip to content

Latest commit

Β 

History

History
782 lines (513 loc) Β· 21.4 KB

slides.md

File metadata and controls

782 lines (513 loc) Β· 21.4 KB
marp title author
true
One Billion Rows in Ruby
Rian McGuire
<style> img[alt~="center"] { display: block; margin: 0 auto; } </style> <script src="asciinema-player.js"></script> <script src="asciinema-loader.js"></script>

One Billion Rows in Ruby

Rian McGuire

Melbourne Ruby - March 2024


1BRC

https://github.com/gunnarmorling/1brc

The One Billion Row Challenge (1BRC) is a fun exploration of how far modern Java can be pushed for aggregating one billion rows from a text file.


Aggregate the input:

Hamburg;12.0
Bulawayo;8.9
Palembang;38.8
St. John's;15.2
Cracow;12.6
Bridgetown;26.9
Istanbul;6.2
Roseau;34.4
Conakry;31.2
Istanbul;23.0
...

into min/mean/max, sorted by name:

{Abha=-23.0/18.0/59.2, Abidjan=-16.2/26.0/67.3, ...}

How big are one billion rows?

$ ./create_measurements.sh 1000000000
$ du -h measurements.txt
13G     measurements.txt
  • It can fit in RAM on a single machine (my machine has 32GB)
  • The OS will be able to cache the entire file in RAM
  • We're going to be CPU-bound, not IO-bound

Baseline


Measurement


Baseline


Baseline

center


YJIT

RubyVM::YJIT.enable


YJIT


YJIT

17% improvement!

center


Profiling


Running Stackprof on a program

stackprof run --out stackprof.dump --raw -- ./002_yjit.rb measurements_10M.txt
stackprof --text stackprof.dump
stackprof --d3-flamegraph stackprof.dump > flamegraph.html
<iframe src="flamegraph.html" style="width: 100%; height: 700px; border: 0;"></iframe>

Garbage collection refresher

  • Unused objects aren't cleaned up immediately
  • Periodically, Ruby searches for objects that are no longer needed and releases them
  • Immediate ("copy-by-value") objects don't require GC
    • Integer, Float, Symbol, true, false, nil

Counting allocations

GC.stat(:total_allocated_objects)

f.each_line do |line|
    name, value = line.split(";") # <- what are the allocations on this line?
    stations[name].add(value.to_f)
end

Garbage is expensive

Reducing the number of allocations (and avoiding String#split) improved time by 29%

center


Going parallel

My CPU has 6 cores (12 threads), but we're only using one!

  • Threads
    • Global VM Lock (GVL) means threads won't help on this CPU-bound problem
  • Processes (fork)
    • Uses more memory
    • Communication between processes is more expensive
  • Ractors
    • Theoretically, this is what Ractors are good at
    • Each Ractor can execute Ruby code in parallel
    • Experimental

Going parallel

Divide the work - by line?

  • Simple, but high coordination overhead

bg right fit


Going parallel

Divide the work - by range of lines?

  • The lines are different lengths, so we can't seek to the byte offset of a specific line

bg right fit


Going parallel

Divide the work - by range of bytes?

  • We could end up starting in the middle of a line
  • ...but there's a neat trick

bg right fit


Going parallel - Processes


Going parallel - Processes

  • 6.7x improvement over non-parallel

center


Going parallel - Ractors

  • Only a 1.7x improvement over non-parallel πŸ‘Ž
  • Wild guess: GC is still single-threaded? Everything gets blocked waiting on GC?

center


Can we go further?


strace

  • Understanding how a process is interacting with the system
  • Similar tool on macOS: dtrace
  • strace ./003_gc_golf.rb measurements_1M.txt
...
read(5, "a;17.5\nDili;17.5\nOuagadougou;16."..., 8192) = 8192
read(5, ".4\nRangpur;9.3\nKinshasa;39.8\nTeh"..., 8192) = 8192
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=779449478}) = 0
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=779505049}) = 0
read(5, "Memphis;12.7\nTokyo;18.5\nBaghdad;"..., 8192) = 8192
read(5, "n;7.8\nGaroua;25.1\nDili;34.1\nMeda"..., 8192) = 8192
read(5, "27.3\nMelbourne;10.8\nBlantyre;27."..., 8192) = 8192
read(5, "bouti;24.5\nTripoli;23.1\nKyoto;-5"..., 8192) = 8192
read(5, "10.9\nAmsterdam;26.0\nOslo;8.9\nSar"..., 8192) = 8192
read(5, "i;10.6\nLake Havasu City;24.3\nPor"..., 8192) = 8192
...

Copying is expensive

How does #each_line get the next line from a file?

  1. Ruby makes read(file, &buffer, 8192) syscall
  2. OS copies 8192 bytes from the file (already cached in RAM) into an internal Ruby buffer
  3. Ruby finds the index of the next \n in the buffer
  4. Ruby copies the line bytes from buffer into a new String object

bg right fit


IO::Buffer

bg right fit


IO::Buffer

Slower 😒. We no longer have each_line overhead, but we've traded it for lots of calls to buffer#get_value

center


Cheating? β€” IO::Buffer + C extension

IO::Buffer#get_string_until(offset, SEMI) feels like an API that could reasonably exist in the standard library


center


Cheating? β€” IO::Buffer + C extension

42% faster than File#each_line

center


What's next?

% self  % total  name
 33.04    33.04  get_string_until [c function] - (unknown)
 22.11    85.40  block in process_chunk - ./007_io_buffer_ext.rb:57
 19.35    19.35  to_f [c function] - (unknown)
  7.72     7.72  getbyte [c function] - (unknown)
  6.99    10.83  add - ./007_io_buffer_ext.rb:24
  4.83    98.39  loop - <internal:kernel>:192
  1.55     1.55  + [c function] - (unknown)
  1.34     1.34  < [c function] - (unknown)
  0.96     0.96  > [c function] - (unknown)
  0.40     0.40  value [c function] - (unknown)
  0.30     8.10  load - <internal:marshal>:35
  0.27     0.27  (unknown) [c function] - (unknown)
  0.21    90.89  fork [c function] - (unknown)
  0.18     0.35  dump [c function] - (unknown)
  0.18     0.18  write [c function] - (unknown)
  0.10     0.10  wait [c function] - (unknown)
  0.08     0.08  read [c function] - (unknown)
  0.08     0.08  eof? [c function] - (unknown)
  0.05     0.08  require_relative [c function] - (unknown)
  0.05     0.06  new [c function] - (unknown)

Custom decimal parser

  • String#to_f is 19% of time
  • We're also allocating a String for every value in order to parse it into a Float

Custom decimal parser

Faster...just

center


What's next?

% self  % total  name
 33.78    84.73  block in process_chunk - ./008_custom_parser.rb:79
 25.36    25.36  get_value [c function] - (unknown)
 18.17    18.17  get_string_until [c function] - (unknown)
  8.00     8.00  getbyte [c function] - (unknown)
  5.49     5.49  add - ./008_custom_parser.rb:27
  5.37    98.35  loop - <internal:kernel>:192
  1.83     1.83  * [c function] - (unknown)
  0.76     0.76  (unknown) [c function] - (unknown)
  0.18     0.18  eof? [c function] - (unknown)
  0.16    90.59  fork [c function] - (unknown)
  0.16     0.16  wait [c function] - (unknown)
  0.14     0.14  value [c function] - (unknown)
  0.12     0.18  dump [c function] - (unknown)
  0.11     8.16  load - <internal:marshal>:35
  0.05     0.11  new [c function] - (unknown)
  0.05     0.05  write [c function] - (unknown)
  0.05     0.05  read [c function] - (unknown)

Cheating - Even more C extension

  • Create a new IO::Buffer::Reader class that tracks the current offset in C
  • Implement the decimal-to-int parser in C

Cheating - Even more C extension

Ruby beat a Java implementation!

center


How is the Java implementation still 18x faster?

  • Completely eliminating String allocations in the hot loop, using a custom hash table
  • Using bit pattern tricks to search/parse multiple bytes at once
    private static final long BROADCAST_SEMICOLON = 0x3B3B3B3B3B3B3B3BL;
    private static final long BROADCAST_0x01 = 0x0101010101010101L;
    private static final long BROADCAST_0x80 = 0x8080808080808080L;
    
    private static long semicolonMatchBits(long word) {
        long diff = word ^ BROADCAST_SEMICOLON;
        return (diff - BROADCAST_0x01) & (~diff & BROADCAST_0x80);
    }
    
    private static int nameLen(long separator) {
        return (Long.numberOfTrailingZeros(separator) >>> 3) + 1;
    }
  • There's an interesting blog post by one of the top 10 authors: https://questdb.io/blog/billion-row-challenge-step-by-step/

Cheating β€” polars-df gem


Cheating β€” polars-df gem

center


On your machine vs production

  • Different computers have different performance characteristics
  • Make sure your optimise for the right thing

On your machine vs production

  • AMD Ryzen 5 2600 desktop (mid-range 2018 CPU, 6 cores, 12 threads)
  • M2 MacBook Air (2022, 8 cores: 4 performance, 4 efficiency)

center h:480


Conclusion

  • The optimisation loop
    • Measure the baseline
    • Profile and identify possible areas for improvement
    • Compare your change with the baseline
  • Readability/maintainability vs performance trade-off
    • Breaking abstractions
    • Hard-coding assumptions
  • Life is about the journey
  • Can you make it go faster?

Questions?