Performance test code and determining theoretical maximum read write speeds
This page follows on from the guide to In-Memory Storage Backends. It explains how the graphs in that guide were generated and outlines some ideas for using fio to estimate theoretical maximum read and write speeds.
It lives here rather than on the docs page because the work is only partially complete and no reliable results have been found so far.
Profiling hardware limits using fio rather than numpy is also possible, but the results vary wildly with the parameters used, in particular the number of jobs (processes). With the commands below, the results were too unreliable to be of any use. The read and write numbers were both around 500 MB/s, but increasing the --numjobs option inflated these numbers drastically, to upwards of 10 GB/s.
Read test command:
fio --filename=./test.dat --size=2G --rw=read --bs=500M --ioengine=libaio --numjobs=1 --iodepth=1000 --name=sequential_reads --direct=0 --group_reporting
and for writing:
fio --filename=./test.dat --size=2G --rw=write --bs=4k --ioengine=libaio --numjobs=1 --iodepth=1000 --name=sequential_writes --direct=0 --group_reporting
For an explanation of the parameters, see the fio docs.
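To see how strongly the reported throughput depends on --numjobs, the read command above can be repeated over several job counts and the bandwidth collected from fio's JSON output. The following is a minimal sketch and was not part of the original tests. The helper names are hypothetical, and the JSON field names (jobs, read, bw_bytes, with bw in KiB/s as a fallback) are assumed from fio 3.x output and should be checked against the installed version.

```python
import json
import subprocess


def build_fio_read_command(numjobs, filename="./test.dat"):
    """Return the read test command from this page as an argument list,
    with --numjobs varied and JSON output requested."""
    return [
        "fio",
        f"--filename={filename}",
        "--size=2G",
        "--rw=read",
        "--bs=500M",
        "--ioengine=libaio",
        f"--numjobs={numjobs}",
        "--iodepth=1000",
        "--name=sequential_reads",
        "--direct=0",
        "--group_reporting",
        "--output-format=json",
    ]


def parse_read_bandwidth_mb_s(fio_json_text):
    """Extract the read bandwidth in MB/s (10^6 bytes per second).

    Assumes --group_reporting, so the first entry of "jobs" is the aggregate.
    """
    report = json.loads(fio_json_text)
    read_stats = report["jobs"][0]["read"]
    if "bw_bytes" in read_stats:
        bytes_per_s = read_stats["bw_bytes"]
    else:
        # Assumed fallback: "bw" is reported in KiB/s
        bytes_per_s = read_stats["bw"] * 1024
    return bytes_per_s / 1e6


def sweep_numjobs(job_counts=(1, 2, 4, 8)):
    """Run fio once per job count and return {numjobs: read speed in MB/s}.

    Requires fio to be installed and on the PATH.
    """
    results = {}
    for numjobs in job_counts:
        completed = subprocess.run(
            build_fio_read_command(numjobs),
            capture_output=True, text=True, check=True,
        )
        results[numjobs] = parse_read_bandwidth_mb_s(completed.stdout)
    return results
```

Calling sweep_numjobs() and printing the returned dictionary gives one aggregate read speed per job count, which makes the inflation described above easy to tabulate.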
Thoughts on using fio more effectively: the block size parameter --bs should match ArcticDB's segment size. For example, for segments of 100,000 rows by 10 columns of 8-byte floats (as is the case here), --bs=8MB is appropriate. The --size option should match the total size of the data being written to or read from the symbol. The --numjobs option should probably be one, since fio clones separate processes that each write to their own file and reports the aggregate write speed, whereas ArcticDB is limited to one storage backend. The --iodepth option should match the number of I/O threads that ArcticDB is configured with:
from arcticdb_ext import set_config_int
set_config_int("VersionStore.NumIOThreads", <number_threads>)
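The sizing suggestions above reduce to simple arithmetic. As a sketch, assuming 8-byte float64 values as in the profiling script, the block size for a 100,000-row by 10-column segment and the matching fio options can be derived as follows. The helper names are hypothetical, and the 200 MB payload and 8 I/O threads in the usage lines are illustrative values only, not measured or recommended settings.

```python
import struct

# Size of a C double (float64) in bytes; 8 on common platforms
FLOAT64_BYTES = struct.calcsize("d")


def segment_bytes(rows_per_segment=100_000, ncols=10, itemsize=FLOAT64_BYTES):
    """Bytes held by one segment of rows_per_segment x ncols values."""
    return rows_per_segment * ncols * itemsize


def suggested_fio_options(total_bytes, io_threads,
                          rows_per_segment=100_000, ncols=10):
    """Build the fio options suggested in the text above.

    fio accepts plain byte counts for --bs and --size.
    """
    return [
        f"--bs={segment_bytes(rows_per_segment, ncols)}",
        f"--size={total_bytes}",
        "--numjobs=1",
        f"--iodepth={io_threads}",
    ]


# Illustrative values: a 200 MB symbol and 8 I/O threads (both hypothetical)
print(segment_bytes())
print(suggested_fio_options(total_bytes=200_000_000, io_threads=8))
```

The first print confirms the 8,000,000 bytes (8 MB) figure quoted for --bs.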
However, even with these options, fio still reports a read speed of around 500 MB/s, which ArcticDB seems to outperform! More work is needed to determine an appropriate hardware limit.
# Script to profile LMDB on disk vs tmpfs, and to compare both
# with the in-memory ArcticDB backend
from arcticdb import Arctic
import pandas as pd
import time
import numpy as np
import shutil, os
num_processes = 50
ncols = 10
nrepeats_per_data_point = 5
# Note that these are not deleted when the script finishes
disk_dir = 'disk.lmdb'
# Need to manually mount ./k as a tmpfs file system
tmpfs_dir = 'k/tmpfs.lmdb'
# Temporary file to gauge hardware speed limits
temp_numpy_file = 'temp.npy'
csv_out_file = 'tmpfs_vs_disk_timings.csv'
timings = {
    'Storage': [],
    'Load (bytes)': [],
    'Speed (megabytes/s)': []
}
for data_B in np.linspace(start=200e6, stop=1000e6, num=9, dtype=int):
    nrows = int(data_B / ncols / np.dtype(float).itemsize)
    array = np.random.randn(nrows, ncols)
    data = pd.DataFrame(array, columns=[f'c{i}' for i in range(ncols)])
    assert data.values.nbytes == data_B

    start = time.time()
    np.save(temp_numpy_file, array)
    elapsed = time.time() - start
    write_speed_MB_s = data_B / 1e6 / elapsed

    start = time.time()
    np.load(temp_numpy_file)
    elapsed = time.time() - start
    read_speed_MB_s = data_B / 1e6 / elapsed
    print(f'For {data_B}, Numpy Read speed {read_speed_MB_s} MB/s, write {write_speed_MB_s} MB/s')

    for _ in range(nrepeats_per_data_point):
        for test_dir in (disk_dir, tmpfs_dir, 'mem'):
            print(f'Timing {test_dir} with load {data_B} B')

            if test_dir == 'mem':
                ac = Arctic(f'mem://')
            else:
                if os.path.exists(test_dir):
                    # Free up space from last test
                    shutil.rmtree(test_dir)
                ac = Arctic(f'lmdb://{test_dir}')

            if 'lib' not in ac.list_libraries():
                ac.create_library('lib')
            lib = ac['lib']

            start = time.time()
            lib.write('symbol', data)
            elapsed = time.time() - start
            write_speed_MB_s = data_B / 1e6 / elapsed
            print('Time to write', elapsed, 's')

            start = time.time()
            lib.read('symbol')
            elapsed = time.time() - start
            read_speed_MB_s = data_B / 1e6 / elapsed
            print('Time to read', elapsed, 's')

            storage_name = {disk_dir: 'disk', tmpfs_dir: 'tmpfs', 'mem': 'mem'}[test_dir]

            # Record the writing speed
            timings['Load (bytes)'].append(data_B)
            timings['Storage'].append(storage_name + ' (write)')
            timings['Speed (megabytes/s)'].append(write_speed_MB_s)

            # Record the reading speed
            timings['Load (bytes)'].append(data_B)
            timings['Storage'].append(storage_name + ' (read)')
            timings['Speed (megabytes/s)'].append(read_speed_MB_s)

pd.DataFrame(timings).to_csv(csv_out_file)
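The CSV written by the script holds one row per measurement, so the repeated runs for each load need to be averaged before plotting. A minimal sketch of that step, using only the standard library, is given here. The function name is hypothetical, the column names are those defined in the timings dictionary, and this is not necessarily how the original graphs were produced.

```python
import csv
from collections import defaultdict
from statistics import mean


def mean_speed_by_storage_and_load(csv_path):
    """Average the repeated measurements in the timings CSV.

    Returns {(storage label, load in bytes): mean speed in MB/s}.
    """
    samples = defaultdict(list)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            key = (row["Storage"], int(row["Load (bytes)"]))
            samples[key].append(float(row["Speed (megabytes/s)"]))
    return {key: mean(values) for key, values in samples.items()}


# Usage, once the profiling script has produced its output file:
# averages = mean_speed_by_storage_and_load('tmpfs_vs_disk_timings.csv')
# for (storage, load_bytes), speed in sorted(averages.items()):
#     print(storage, load_bytes, round(speed, 1))
```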