marp | title | author |
---|---|---|
true |
One Billion Rows in Ruby |
Rian McGuire |
Melbourne Ruby - March 2024
https://github.com/gunnarmorling/1brc
The One Billion Row Challenge (1BRC) is a fun exploration of how far modern Java can be pushed for aggregating one billion rows from a text file.
Aggregate the input:
Hamburg;12.0
Bulawayo;8.9
Palembang;38.8
St. John's;15.2
Cracow;12.6
Bridgetown;26.9
Istanbul;6.2
Roseau;34.4
Conakry;31.2
Istanbul;23.0
...
into min/mean/max
, sorted by name:
{Abha=-23.0/18.0/59.2, Abidjan=-16.2/26.0/67.3, ...}
$ ./create_measurements.sh 1000000000
$ du -h measurements.txt
13G measurements.txt
- It can fit in RAM on a single machine (my machine has 32GB)
- The OS will be able to cache the entire file in RAM
- We're going to be CPU-bound, not IO-bound
- Tools
- time - shell built-in
- hyperfine - https://github.com/sharkdp/hyperfine
require 'benchmark'
- Noise and statistical significance
- Did the change actually improve anything?
RubyVM::YJIT.enable
17% improvement!
- Analyse the performance of a running program and see where it's spending its time
- Stackprof - https://github.com/tmm1/stackprof
- rbspy - https://rbspy.github.io/
stackprof run --out stackprof.dump --raw -- ./002_yjit.rb measurements_10M.txt
stackprof --text stackprof.dump
stackprof --d3-flamegraph stackprof.dump > flamegraph.html
- Unused objects aren't cleaned up immediately
- Periodically, Ruby searches for objects that are no longer needed and releases them
- Immediate ("copy-by-value") objects don't require GC
- Integer, Float, Symbol, true, false, nil
GC.stat(:total_allocated_objects)
f.each_line do |line|
name, value = line.split(";") # <- what are the allocations on this line?
stations[name].add(value.to_f)
end
Reducing the number of allocations (and avoiding String#split
) improved time by 29%
My CPU has 6 cores (12 threads), but we're only using one!
- Threads
- Global VM Lock (GVL) means threads won't help on this CPU-bound problem
- Processes (fork)
- Uses more memory
- Communication between processes is more expensive
- Ractors
- Theoretically, this is what Ractors are good at
- Each Ractor can execute Ruby code in parallel
- Experimental
Divide the work - by line?
- Simple, but high coordination overhead
Divide the work - by range of lines?
- The lines are different lengths, so we can't seek to the byte offset of a specific line
Divide the work - by range of bytes?
- We could end up starting in the middle of a line
- ...but there's a neat trick
- 6.7x improvement over non-parallel
- Only a 1.7x improvement over non-parallel π
- Wild guess: GC is still single-threaded? Everything gets blocked waiting on GC?
- Understanding how a process is interacting with the system
- Similar tool on macOS: dtrace
strace ./003_gc_golf.rb measurements_1M.txt
...
read(5, "a;17.5\nDili;17.5\nOuagadougou;16."..., 8192) = 8192
read(5, ".4\nRangpur;9.3\nKinshasa;39.8\nTeh"..., 8192) = 8192
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=779449478}) = 0
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=779505049}) = 0
read(5, "Memphis;12.7\nTokyo;18.5\nBaghdad;"..., 8192) = 8192
read(5, "n;7.8\nGaroua;25.1\nDili;34.1\nMeda"..., 8192) = 8192
read(5, "27.3\nMelbourne;10.8\nBlantyre;27."..., 8192) = 8192
read(5, "bouti;24.5\nTripoli;23.1\nKyoto;-5"..., 8192) = 8192
read(5, "10.9\nAmsterdam;26.0\nOslo;8.9\nSar"..., 8192) = 8192
read(5, "i;10.6\nLake Havasu City;24.3\nPor"..., 8192) = 8192
...
How does #each_line
get the next line from a file?
- Ruby makes
read(file, &buffer, 8192)
syscall - OS copies 8192 bytes from the file (already cached in RAM) into an internal Ruby buffer
- Ruby finds the index of the next
\n
in the buffer - Ruby copies the line bytes from buffer into a new String object
- (Experimental) low-level memory API
- https://docs.ruby-lang.org/en/master/IO/Buffer.html
- Use
IO::Buffer#map
to memory-map (mmap) a file - This tells the OS to use virtual memory to make the file appear within Ruby's memory, without copying
Slower π’. We no longer have each_line
overhead, but we've traded it for lots of calls to buffer#get_value
IO::Buffer#get_string_until(offset, SEMI)
feels like an API that could reasonably exist in the standard library
42% faster than File#each_line
% self % total name
33.04 33.04 get_string_until [c function] - (unknown)
22.11 85.40 block in process_chunk - ./007_io_buffer_ext.rb:57
19.35 19.35 to_f [c function] - (unknown)
7.72 7.72 getbyte [c function] - (unknown)
6.99 10.83 add - ./007_io_buffer_ext.rb:24
4.83 98.39 loop - <internal:kernel>:192
1.55 1.55 + [c function] - (unknown)
1.34 1.34 < [c function] - (unknown)
0.96 0.96 > [c function] - (unknown)
0.40 0.40 value [c function] - (unknown)
0.30 8.10 load - <internal:marshal>:35
0.27 0.27 (unknown) [c function] - (unknown)
0.21 90.89 fork [c function] - (unknown)
0.18 0.35 dump [c function] - (unknown)
0.18 0.18 write [c function] - (unknown)
0.10 0.10 wait [c function] - (unknown)
0.08 0.08 read [c function] - (unknown)
0.08 0.08 eof? [c function] - (unknown)
0.05 0.08 require_relative [c function] - (unknown)
0.05 0.06 new [c function] - (unknown)
String#to_f
is 19% of time- We're also allocating a String for every value in order to parse it into a Float
Faster...just
% self % total name
33.78 84.73 block in process_chunk - ./008_custom_parser.rb:79
25.36 25.36 get_value [c function] - (unknown)
18.17 18.17 get_string_until [c function] - (unknown)
8.00 8.00 getbyte [c function] - (unknown)
5.49 5.49 add - ./008_custom_parser.rb:27
5.37 98.35 loop - <internal:kernel>:192
1.83 1.83 * [c function] - (unknown)
0.76 0.76 (unknown) [c function] - (unknown)
0.18 0.18 eof? [c function] - (unknown)
0.16 90.59 fork [c function] - (unknown)
0.16 0.16 wait [c function] - (unknown)
0.14 0.14 value [c function] - (unknown)
0.12 0.18 dump [c function] - (unknown)
0.11 8.16 load - <internal:marshal>:35
0.05 0.11 new [c function] - (unknown)
0.05 0.05 write [c function] - (unknown)
0.05 0.05 read [c function] - (unknown)
- Create a new
IO::Buffer::Reader
class that tracks the current offset in C - Implement the decimal-to-int parser in C
Ruby beat a Java implementation!
- Completely eliminating String allocations in the hot loop, using a custom hash table
- Using bit pattern tricks to search/parse multiple bytes at once
private static final long BROADCAST_SEMICOLON = 0x3B3B3B3B3B3B3B3BL; private static final long BROADCAST_0x01 = 0x0101010101010101L; private static final long BROADCAST_0x80 = 0x8080808080808080L; private static long semicolonMatchBits(long word) { long diff = word ^ BROADCAST_SEMICOLON; return (diff - BROADCAST_0x01) & (~diff & BROADCAST_0x80); } private static int nameLen(long separator) { return (Long.numberOfTrailingZeros(separator) >>> 3) + 1; }
- There's an interesting blog post by one of the top 10 authors: https://questdb.io/blog/billion-row-challenge-step-by-step/
- https://pola.rs/
- Similar to the pandas data analysis library for Python
- Ruby isn't an officially supported language, but there's a gem: https://github.com/ankane/polars-ruby
- Different computers have different performance characteristics
- Make sure your optimise for the right thing
- AMD Ryzen 5 2600 desktop (mid-range 2018 CPU, 6 cores, 12 threads)
- M2 MacBook Air (2022, 8 cores: 4 performance, 4 efficiency)
- The optimisation loop
- Measure the baseline
- Profile and identify possible areas for improvement
- Compare your change with the baseline
- Readability/maintainability vs performance trade-off
- Breaking abstractions
- Hard-coding assumptions
- Life is about the journey
- Can you make it go faster?
- All the code is available at: https://github.com/rianmcguire/1brc-ruby