marp	title	author
true	One Billion Rows in Ruby	Rian McGuire

One Billion Rows in Ruby

Rian McGuire

Melbourne Ruby - March 2024

1BRC

https://github.com/gunnarmorling/1brc

The One Billion Row Challenge (1BRC) is a fun exploration of how far modern Java can be pushed for aggregating one billion rows from a text file.

Aggregate the input:

Hamburg;12.0
Bulawayo;8.9
Palembang;38.8
St. John's;15.2
Cracow;12.6
Bridgetown;26.9
Istanbul;6.2
Roseau;34.4
Conakry;31.2
Istanbul;23.0
...

into min/mean/max, sorted by name:

{Abha=-23.0/18.0/59.2, Abidjan=-16.2/26.0/67.3, ...}

How big are one billion rows?

$ ./create_measurements.sh 1000000000
$ du -h measurements.txt
13G     measurements.txt

It can fit in RAM on a single machine (my machine has 32GB)
The OS will be able to cache the entire file in RAM
We're going to be CPU-bound, not IO-bound

Baseline

Measurement

Tools
- time - shell built-in
- hyperfine - https://github.com/sharkdp/hyperfine
- require 'benchmark'
Noise and statistical significance
- Did the change actually improve anything?

Baseline

YJIT

RubyVM::YJIT.enable

YJIT

17% improvement!

Profiling

Analyse the performance of a running program and see where it's spending its time
Stackprof - https://github.com/tmm1/stackprof
rbspy - https://rbspy.github.io/

Running Stackprof on a program

stackprof run --out stackprof.dump --raw -- ./002_yjit.rb measurements_10M.txt
stackprof --text stackprof.dump
stackprof --d3-flamegraph stackprof.dump > flamegraph.html

Garbage collection refresher

Unused objects aren't cleaned up immediately
Periodically, Ruby searches for objects that are no longer needed and releases them
Immediate ("copy-by-value") objects don't require GC
- Integer, Float, Symbol, true, false, nil

Counting allocations

GC.stat(:total_allocated_objects)

f.each_line do |line|
    name, value = line.split(";") # <- what are the allocations on this line?
    stations[name].add(value.to_f)
end

Garbage is expensive

Reducing the number of allocations (and avoiding String#split) improved time by 29%

Going parallel

My CPU has 6 cores (12 threads), but we're only using one!

Threads
- Global VM Lock (GVL) means threads won't help on this CPU-bound problem
Processes (fork)
- Uses more memory
- Communication between processes is more expensive
Ractors
- Theoretically, this is what Ractors are good at
- Each Ractor can execute Ruby code in parallel
- Experimental

Going parallel

Divide the work - by line?

Simple, but high coordination overhead

Going parallel

Divide the work - by range of lines?

The lines are different lengths, so we can't seek to the byte offset of a specific line

Going parallel

Divide the work - by range of bytes?

We could end up starting in the middle of a line
...but there's a neat trick

Going parallel - Processes

6.7x improvement over non-parallel

Going parallel - Ractors

Only a 1.7x improvement over non-parallel 👎
Wild guess: GC is still single-threaded? Everything gets blocked waiting on GC?

Can we go further?

strace

Understanding how a process is interacting with the system
Similar tool on macOS: dtrace
strace ./003_gc_golf.rb measurements_1M.txt

...
read(5, "a;17.5\nDili;17.5\nOuagadougou;16."..., 8192) = 8192
read(5, ".4\nRangpur;9.3\nKinshasa;39.8\nTeh"..., 8192) = 8192
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=779449478}) = 0
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=779505049}) = 0
read(5, "Memphis;12.7\nTokyo;18.5\nBaghdad;"..., 8192) = 8192
read(5, "n;7.8\nGaroua;25.1\nDili;34.1\nMeda"..., 8192) = 8192
read(5, "27.3\nMelbourne;10.8\nBlantyre;27."..., 8192) = 8192
read(5, "bouti;24.5\nTripoli;23.1\nKyoto;-5"..., 8192) = 8192
read(5, "10.9\nAmsterdam;26.0\nOslo;8.9\nSar"..., 8192) = 8192
read(5, "i;10.6\nLake Havasu City;24.3\nPor"..., 8192) = 8192
...

Copying is expensive

How does #each_line get the next line from a file?

Ruby makes read(file, &buffer, 8192) syscall
OS copies 8192 bytes from the file (already cached in RAM) into an internal Ruby buffer
Ruby finds the index of the next \n in the buffer
Ruby copies the line bytes from buffer into a new String object

IO::Buffer

(Experimental) low-level memory API
https://docs.ruby-lang.org/en/master/IO/Buffer.html
Use IO::Buffer#map to memory-map (mmap) a file
This tells the OS to use virtual memory to make the file appear within Ruby's memory, without copying

IO::Buffer

Slower 😢. We no longer have each_line overhead, but we've traded it for lots of calls to buffer#get_value

Cheating? — IO::Buffer + C extension

IO::Buffer#get_string_until(offset, SEMI) feels like an API that could reasonably exist in the standard library

Cheating? — IO::Buffer + C extension

42% faster than File#each_line

What's next?

% self  % total  name
 33.04    33.04  get_string_until [c function] - (unknown)
 22.11    85.40  block in process_chunk - ./007_io_buffer_ext.rb:57
 19.35    19.35  to_f [c function] - (unknown)
  7.72     7.72  getbyte [c function] - (unknown)
  6.99    10.83  add - ./007_io_buffer_ext.rb:24
  4.83    98.39  loop - <internal:kernel>:192
  1.55     1.55  + [c function] - (unknown)
  1.34     1.34  < [c function] - (unknown)
  0.96     0.96  > [c function] - (unknown)
  0.40     0.40  value [c function] - (unknown)
  0.30     8.10  load - <internal:marshal>:35
  0.27     0.27  (unknown) [c function] - (unknown)
  0.21    90.89  fork [c function] - (unknown)
  0.18     0.35  dump [c function] - (unknown)
  0.18     0.18  write [c function] - (unknown)
  0.10     0.10  wait [c function] - (unknown)
  0.08     0.08  read [c function] - (unknown)
  0.08     0.08  eof? [c function] - (unknown)
  0.05     0.08  require_relative [c function] - (unknown)
  0.05     0.06  new [c function] - (unknown)

Custom decimal parser

String#to_f is 19% of time
We're also allocating a String for every value in order to parse it into a Float

Custom decimal parser

Faster...just

What's next?

% self  % total  name
 33.78    84.73  block in process_chunk - ./008_custom_parser.rb:79
 25.36    25.36  get_value [c function] - (unknown)
 18.17    18.17  get_string_until [c function] - (unknown)
  8.00     8.00  getbyte [c function] - (unknown)
  5.49     5.49  add - ./008_custom_parser.rb:27
  5.37    98.35  loop - <internal:kernel>:192
  1.83     1.83  * [c function] - (unknown)
  0.76     0.76  (unknown) [c function] - (unknown)
  0.18     0.18  eof? [c function] - (unknown)
  0.16    90.59  fork [c function] - (unknown)
  0.16     0.16  wait [c function] - (unknown)
  0.14     0.14  value [c function] - (unknown)
  0.12     0.18  dump [c function] - (unknown)
  0.11     8.16  load - <internal:marshal>:35
  0.05     0.11  new [c function] - (unknown)
  0.05     0.05  write [c function] - (unknown)
  0.05     0.05  read [c function] - (unknown)

Cheating - Even more C extension

Create a new IO::Buffer::Reader class that tracks the current offset in C
Implement the decimal-to-int parser in C

Cheating - Even more C extension

Ruby beat a Java implementation!

How is the Java implementation still 18x faster?

Completely eliminating String allocations in the hot loop, using a custom hash table

Using bit pattern tricks to search/parse multiple bytes at once

private static final long BROADCAST_SEMICOLON = 0x3B3B3B3B3B3B3B3BL;
private static final long BROADCAST_0x01 = 0x0101010101010101L;
private static final long BROADCAST_0x80 = 0x8080808080808080L;

private static long semicolonMatchBits(long word) {
    long diff = word ^ BROADCAST_SEMICOLON;
    return (diff - BROADCAST_0x01) & (~diff & BROADCAST_0x80);
}

private static int nameLen(long separator) {
    return (Long.numberOfTrailingZeros(separator) >>> 3) + 1;
}

There's an interesting blog post by one of the top 10 authors: https://questdb.io/blog/billion-row-challenge-step-by-step/

Cheating — polars-df gem

https://pola.rs/
Similar to the pandas data analysis library for Python
Ruby isn't an officially supported language, but there's a gem: https://github.com/ankane/polars-ruby

Cheating — polars-df gem

On your machine vs production

Different computers have different performance characteristics
Make sure your optimise for the right thing

On your machine vs production

AMD Ryzen 5 2600 desktop (mid-range 2018 CPU, 6 cores, 12 threads)
M2 MacBook Air (2022, 8 cores: 4 performance, 4 efficiency)

Conclusion

The optimisation loop
- Measure the baseline
- Profile and identify possible areas for improvement
- Compare your change with the baseline

Readability/maintainability vs performance trade-off
- Breaking abstractions
- Hard-coding assumptions
Life is about the journey
Can you make it go faster?
- All the code is available at: https://github.com/rianmcguire/1brc-ruby

Files

slides.md

Latest commit

History

slides.md

File metadata and controls

One Billion Rows in Ruby

Rian McGuire

1BRC

How big are one billion rows?

Baseline

Measurement

Baseline

Baseline

YJIT

YJIT

YJIT

Profiling

Running Stackprof on a program

Garbage collection refresher

Counting allocations

Garbage is expensive

Going parallel

Going parallel

Going parallel

Going parallel

Going parallel - Processes

Going parallel - Processes

Going parallel - Ractors

Can we go further?

strace

Copying is expensive

IO::Buffer

IO::Buffer

Cheating? — IO::Buffer + C extension

Cheating? — IO::Buffer + C extension

What's next?

Custom decimal parser

Custom decimal parser

What's next?

Cheating - Even more C extension

Cheating - Even more C extension

How is the Java implementation still 18x faster?

Cheating — polars-df gem

Cheating — polars-df gem

On your machine vs production

On your machine vs production

Conclusion

Questions?