-
@briangow I implemented an SNR calculation for the benchmarks. SNR is helpful for quantifying signal fidelity, including losses from quantization and from different storage formats (e.g., signed 16-bit integers vs. other representations), and it complements the existing (approximate) equality check. Would you prefer that I push to a new branch and create a pull request, fork the repository and create a pull request, or something else?
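For concreteness, here is a minimal sketch of the kind of calculation I mean (the function name, dB convention, and NaN handling are my choices here, not necessarily what is in the actual implementation):

import numpy as np

def snr_db(reference, test):
    """SNR in dB between a reference signal and its decoded copy, skipping NaN gap samples."""
    mask = ~np.isnan(reference) & ~np.isnan(test)
    signal_power = np.mean(np.square(reference[mask]))
    noise_power = np.mean(np.square(reference[mask] - test[mask]))
    if noise_power == 0:
        return np.inf  # bit-exact round trip
    return 10.0 * np.log10(signal_power / noise_power)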
-
@WilliamDixon has improved the AtriumDB format code as described in #86. I'm attaching the results from running that updated code across the benchmarking waveform suite.
-
I've opened a feature request around separating out the loading of the waveform and the seek operations during our random access read testing: #92.
-
I have a few thoughts on the benchmarking process and how it interacts with AtriumDB.

NaN Gap Conversion
------------------

The largest point, which I've already mentioned in #86, is that AtriumDB doesn't mark gaps as np.nan values, but instead as jumps in time in a companion timestamp array:

WFDB_Implicit_Times = [0, 1, 2, 3, 4, 5, 6, 7]
WFDB_Values = [4, 5, 6, np.nan, np.nan, np.nan, 7, 8]  # metadata: start_time = 0, period = 1
AtriumDB_Values = [4, 5, 6, 7, 8]
AtriumDB_Times = [0, 1, 2, 6, 7]

I've added code to the AtriumDB read method to convert from AtriumDB-formatted time to WFDB:

# AtriumDB output is read_time_data, read_value_data
nan_values = np.empty(num_samples, dtype=np.float64)
nan_values[:] = np.nan
# Convert from nanosecond timestamps to sample numbers
closest_i_array = np.round((read_time_data - start_time_nano) / period_ns).astype(int)
# Ensure sample numbers remain within the bounds of the time request
mask = (closest_i_array >= 0) & (closest_i_array < num_samples)
closest_i_array = closest_i_array[mask]
nan_values[closest_i_array] = read_value_data[mask]

If we allowed the option for output to be in AtriumDB's format (perhaps via a BaseFormat instance variable), and used the above code outside of the benchmarking step to convert between formats, we'd get a fairer comparison.
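As a sanity check, here is that conversion run end to end on the toy example above (the input variable names and unit-period timestamps are stand-ins for what a real reader would return):

import numpy as np

# AtriumDB-style output for the toy example above
read_value_data = np.array([4, 5, 6, 7, 8], dtype=np.float64)
read_time_data = np.array([0, 1, 2, 6, 7], dtype=np.float64)  # nanoseconds in a real reader
start_time_nano = 0
period_ns = 1
num_samples = 8

nan_values = np.empty(num_samples, dtype=np.float64)
nan_values[:] = np.nan
closest_i_array = np.round((read_time_data - start_time_nano) / period_ns).astype(int)
mask = (closest_i_array >= 0) & (closest_i_array < num_samples)
closest_i_array = closest_i_array[mask]
nan_values[closest_i_array] = read_value_data[mask]

print(nan_values)  # [ 4.  5.  6. nan nan nan  7.  8.] -- matches WFDB_Values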
See below for the difference in AtriumDB read performance on the MIMIC-IV p100 wave data:

Original
---------
Read performance (median of N trials):
#seek #read KiB sec [N]
0 -1 62932 11.5839 [3] read 1 x 214981s, all channels
0 -1 27024 0.5715 [11] read 5 x 500s, all channels
0 -1 259192 5.4631 [3] read 50 x 50s, all channels
0 -1 2626928 52.4986 [3] read 500 x 5s, all channels
0 -1 13252 2.9162 [5] read 1 x 214981s, one channel
0 -1 5534 0.1474 [26] read 5 x 500s, one channel
0 -1 53196 1.3576 [3] read 50 x 50s, one channel
0 -1 556400 13.9426 [3] read 500 x 5s, one channel
Without NaN-Gap conversion
---------------------------
Read performance (median of N trials):
#seek #read KiB sec [N]
0 -1 62932 3.5624 [3] read 1 x 214981s, all channels
0 -1 27024 0.2788 [10] read 5 x 500s, all channels
0 -1 259292 2.2346 [3] read 50 x 50s, all channels
0 -1 2627488 23.6534 [3] read 500 x 5s, all channels
0 -1 12732 0.4819 [18] read 1 x 214981s, one channel
0 -1 5532 0.0844 [27] read 5 x 500s, one channel
0 -1 53196 0.7929 [3] read 50 x 50s, one channel
0 -1 556424 8.5768 [3] read 500 x 5s, one channel
This gives us a 2-6x speed increase, depending on the size and number of requests.

Scale of the Benchmarks and a Question on the Performance Counter & Resource Usage
----------------------------------------------------------------------------------

AtriumDB was designed to work best with reads and writes of more than 100,000 values at a time. Upon a write request, data is batched into blocks of block_size values, so a read must decode every block overlapping the requested range. With the default block size, this means all queries smaller than one block_size take roughly the same duration.
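To illustrate the fixed per-block cost (a schematic sketch, not AtriumDB's actual reader; all names here are invented):

import numpy as np

def decode_block(block_bytes):
    # Stand-in for AtriumDB's block decompression step.
    return np.frombuffer(block_bytes, dtype=np.float64)

def read_range(blocks, block_size, start, end):
    """Read samples [start, end); any sub-block request still decodes whole blocks."""
    first, last = start // block_size, (end - 1) // block_size
    decoded = np.concatenate([decode_block(blocks[b]) for b in range(first, last + 1)])
    offset = first * block_size
    return decoded[start - offset : end - offset]

# A 5-sample request out of a 1024-sample block decodes all 1024 samples.
blocks = [np.arange(i * 1024, (i + 1) * 1024, dtype=np.float64).tobytes() for i in range(4)]
print(read_range(blocks, 1024, 100, 105))  # [100. 101. 102. 103. 104.]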
Theoretically, setting the block_size smaller should reduce this fixed cost for small requests:

Small Block Size (1024)
-----------------
Read performance (median of N trials):
#seek #read KiB sec [N]
0 -1 137976 7.6268 [3] read 1 x 214981s, all channels
0 -1 36504 0.3869 [6] read 5 x 500s, all channels
0 -1 386984 3.4375 [3] read 50 x 50s, all channels
0 -1 3843952 32.2930 [3] read 500 x 5s, all channels
0 -1 32108 2.1126 [6] read 1 x 214981s, one channel
0 -1 5948 0.0612 [32] read 5 x 500s, one channel
0 -1 58402 0.4895 [4] read 50 x 50s, one channel
0 -1 637920 5.4479 [3] read 500 x 5s, one channel My initial theory was that reading the sqlite metadata file was causing the high values in the 50 x 50s and 500 x 5s rows, but altering the code so that the sqlite metadata is cached in RAM does speed everything up, but doesn't prevent 500 x 5s taking 2 orders of magnitude more CPU seconds and file reading. Cached Metadata: Small Blocks (1024)
Cached Metadata: Small Blocks (1024)
--------------------------------------------------------
Read performance (median of N trials):
#seek #read KiB sec [N]
0 -1 123544 6.1423 [3] read 1 x 214981s, all channels
0 -1 22468 0.1633 [39] read 5 x 500s, all channels
0 -1 219736 1.7041 [4] read 50 x 50s, all channels
0 -1 2206772 16.0132 [3] read 500 x 5s, all channels
0 -1 24476 0.8343 [11] read 1 x 214981s, one channel
0 -1 3600 0.0324 [195] read 5 x 500s, one channel
0 -1 36940 0.2815 [25] read 50 x 50s, one channel
0 -1 364196 2.8709 [3] read 500 x 5s, one channel
When subfile blocks are requested, AtriumDB uses Python's native file I/O to seek to and read only those blocks.

Implicit Data Order
-------------------

AtriumDB has the capability to store data out of order: it's possible that two separate blocks overlap in the time interval of data they contain. This has advantages in live data streaming, allowing unsorted data to be stored when device output messages arrive out of order or duplicated. Because of this, we have to check the ordering of all blocks pulled, check for overlap between them, and sort if necessary. When data is neatly ordered, as in our benchmark test, we can assume (as other formats do) that the data in the file is already in perfect order. For the purposes of this test, we can then disable those checks and the sorting step in the read path.
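The check being disabled is essentially the following (a sketch assuming each block reports its start/end timestamps; not AtriumDB's actual code):

import numpy as np

def needs_sort(block_starts, block_ends):
    """True if any block starts before the previous block ends (out of order or overlapping)."""
    return bool(np.any(block_starts[1:] < block_ends[:-1]))

def order_values(times, values):
    # Sort-by-time fallback, only paid when needs_sort() is True.
    order = np.argsort(times, kind="stable")
    return times[order], values[order]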
Cached Metadata, No Gap Conversion, No Check/Sort, Big Block Size (131072)
----------------------------------------------------------------------------------------
Read performance (median of N trials):
#seek #read KiB sec [N]
0 -1 61728 3.3177 [3] read 1 x 214981s, all channels
0 -1 22468 0.0913 [59] read 5 x 500s, all channels
0 -1 213504 0.6697 [8] read 50 x 50s, all channels
0 -1 2166848 6.5828 [3] read 500 x 5s, all channels
0 -1 8040 0.2893 [24] read 1 x 214981s, one channel
0 -1 3600 0.0146 [327] read 5 x 500s, one channel
0 -1 36080 0.1241 [39] read 50 x 50s, one channel
0 -1 359720 1.2405 [5] read 500 x 5s, one channel
-
In #11, a working group suggested using WFDB as the basis for a waveform format in CHoRUS: #11 (comment). This effectively resolves that discussion.
Part of the work that went into the group making this selection was the development of code to benchmark the various formats being considered. We would like to do some additional work to further develop the benchmarking code and to assess the formats against it. We can discuss that additional work here.
There are a few things that have already been raised which we'd like to implement: