-
@briangow I implemented an SNR calculation for the benchmarks. SNR is helpful for quantifying signal fidelity, including losses from quantization and from different storage formats (e.g., signed 16-bit integers vs. other representations), and it complements the existing (approximate) equality check. Would you prefer that I push to a new branch and create a pull request, fork the repository and create a pull request, or something else?
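For concreteness, here is a minimal sketch of the kind of calculation I mean (the function name, dB convention, and NaN handling are my choices here, not necessarily what is in the actual implementation):

import numpy as np

def snr_db(reference, test):
    """SNR in dB between a reference signal and its decoded copy, skipping NaN gap samples."""
    mask = ~np.isnan(reference) & ~np.isnan(test)
    signal_power = np.mean(np.square(reference[mask]))
    noise_power = np.mean(np.square(reference[mask] - test[mask]))
    if noise_power == 0:
        return np.inf  # bit-exact round trip
    return 10.0 * np.log10(signal_power / noise_power)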
-
@WilliamDixon has improved the AtriumDB format code as described in #86. I'm attaching the results from running that updated code across the benchmarking waveform suite.
-
I've opened a feature request around separating out the loading of the waveform and the seek operations during our random access read testing: #92.
-
I have a few thoughts on the benchmarking process and how it interacts with AtriumDB.

NaN Gap Conversion
------------------

The largest point, which I've already mentioned in #86, is that AtriumDB doesn't mark gaps as np.nan values, but instead as jumps in time in a companion timestamp array:

WFDB_Implicit_Times = [0, 1, 2, 3, 4, 5, 6, 7]
WFDB_Values = [4, 5, 6, np.nan, np.nan, np.nan, 7, 8]  # metadata: start_time = 0, period = 1
AtriumDB_Values = [4, 5, 6, 7, 8]
AtriumDB_Times = [0, 1, 2, 6, 7]

I've added code to the AtriumDB read method to convert from AtriumDB-formatted time to WFDB:

# AtriumDB output is read_time_data, read_value_data
nan_values = np.empty(num_samples, dtype=np.float64)
nan_values[:] = np.nan
# Convert from nanosecond timestamps to sample numbers
closest_i_array = np.round((read_time_data - start_time_nano) / period_ns).astype(int)
# Ensure sample numbers remain within the bounds of the time request
mask = (closest_i_array >= 0) & (closest_i_array < num_samples)
closest_i_array = closest_i_array[mask]
nan_values[closest_i_array] = read_value_data[mask]

If we allowed the option for output to be in AtriumDB's format (perhaps via a BaseFormat instance variable), and used the above code outside of the benchmarking step to convert between formats, we'd get a fairer comparison.
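As a sanity check, here is that conversion run end to end on the toy example above (the input variable names and unit-period timestamps are stand-ins for what a real reader would return):

import numpy as np

# AtriumDB-style output for the toy example above
read_value_data = np.array([4, 5, 6, 7, 8], dtype=np.float64)
read_time_data = np.array([0, 1, 2, 6, 7], dtype=np.float64)  # nanoseconds in a real reader
start_time_nano = 0
period_ns = 1
num_samples = 8

nan_values = np.empty(num_samples, dtype=np.float64)
nan_values[:] = np.nan
closest_i_array = np.round((read_time_data - start_time_nano) / period_ns).astype(int)
mask = (closest_i_array >= 0) & (closest_i_array < num_samples)
closest_i_array = closest_i_array[mask]
nan_values[closest_i_array] = read_value_data[mask]

print(nan_values)  # [ 4.  5.  6. nan nan nan  7.  8.] -- matches WFDB_Values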
See below for the difference in AtriumDB read performance on the MIMIC-IV p100 wave data:

Original
---------
Read performance (median of N trials):
#seek #read KiB sec [N]
0 -1 62932 11.5839 [3] read 1 x 214981s, all channels
0 -1 27024 0.5715 [11] read 5 x 500s, all channels
0 -1 259192 5.4631 [3] read 50 x 50s, all channels
0 -1 2626928 52.4986 [3] read 500 x 5s, all channels
0 -1 13252 2.9162 [5] read 1 x 214981s, one channel
0 -1 5534 0.1474 [26] read 5 x 500s, one channel
0 -1 53196 1.3576 [3] read 50 x 50s, one channel
0 -1 556400 13.9426 [3] read 500 x 5s, one channel
Without NaN-Gap conversion
---------------------------
Read performance (median of N trials):
#seek #read KiB sec [N]
0 -1 62932 3.5624 [3] read 1 x 214981s, all channels
0 -1 27024 0.2788 [10] read 5 x 500s, all channels
0 -1 259292 2.2346 [3] read 50 x 50s, all channels
0 -1 2627488 23.6534 [3] read 500 x 5s, all channels
0 -1 12732 0.4819 [18] read 1 x 214981s, one channel
0 -1 5532 0.0844 [27] read 5 x 500s, one channel
0 -1 53196 0.7929 [3] read 50 x 50s, one channel
0 -1 556424 8.5768 [3] read 500 x 5s, one channel
This gives us a 2-6x speed increase, depending on the size and number of requests.

Scale of the Benchmarks and a Question on the Performance Counter & Resource Usage
----------------------------------------------------------------------------------

AtriumDB was designed to work best with reads and writes of more than 100,000 values at a time. Upon a write request, data is batched into blocks of block_size values, so a read must decode every block overlapping the requested range. With the default block size, this means all queries smaller than one block_size take roughly the same duration.
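To illustrate the fixed per-block cost (a schematic sketch, not AtriumDB's actual reader; all names here are invented):

import numpy as np

def decode_block(block_bytes):
    # Stand-in for AtriumDB's block decompression step.
    return np.frombuffer(block_bytes, dtype=np.float64)

def read_range(blocks, block_size, start, end):
    """Read samples [start, end); any sub-block request still decodes whole blocks."""
    first, last = start // block_size, (end - 1) // block_size
    decoded = np.concatenate([decode_block(blocks[b]) for b in range(first, last + 1)])
    offset = first * block_size
    return decoded[start - offset : end - offset]

# A 5-sample request out of a 1024-sample block decodes all 1024 samples.
blocks = [np.arange(i * 1024, (i + 1) * 1024, dtype=np.float64).tobytes() for i in range(4)]
print(read_range(blocks, 1024, 100, 105))  # [100. 101. 102. 103. 104.]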
Theoretically, setting the block_size smaller should reduce this fixed cost for small requests:

Small Block Size (1024)
-----------------
Read performance (median of N trials):
#seek #read KiB sec [N]
0 -1 137976 7.6268 [3] read 1 x 214981s, all channels
0 -1 36504 0.3869 [6] read 5 x 500s, all channels
0 -1 386984 3.4375 [3] read 50 x 50s, all channels
0 -1 3843952 32.2930 [3] read 500 x 5s, all channels
0 -1 32108 2.1126 [6] read 1 x 214981s, one channel
0 -1 5948 0.0612 [32] read 5 x 500s, one channel
0 -1 58402 0.4895 [4] read 50 x 50s, one channel
0 -1 637920 5.4479 [3] read 500 x 5s, one channel My initial theory was that reading the sqlite metadata file was causing the high values in the 50 x 50s and 500 x 5s rows, but altering the code so that the sqlite metadata is cached in RAM does speed everything up, but doesn't prevent 500 x 5s taking 2 orders of magnitude more CPU seconds and file reading. Cached Metadata: Small Blocks (1024)
Cached Metadata: Small Blocks (1024)
--------------------------------------------------------
Read performance (median of N trials):
#seek #read KiB sec [N]
0 -1 123544 6.1423 [3] read 1 x 214981s, all channels
0 -1 22468 0.1633 [39] read 5 x 500s, all channels
0 -1 219736 1.7041 [4] read 50 x 50s, all channels
0 -1 2206772 16.0132 [3] read 500 x 5s, all channels
0 -1 24476 0.8343 [11] read 1 x 214981s, one channel
0 -1 3600 0.0324 [195] read 5 x 500s, one channel
0 -1 36940 0.2815 [25] read 50 x 50s, one channel
0 -1 364196 2.8709 [3] read 500 x 5s, one channel
When subfile blocks are requested, AtriumDB uses Python's native file I/O to seek to and read only those blocks.

Implicit Data Order
-------------------

AtriumDB has the capability to store data out of order: it's possible that two separate blocks overlap in the time interval of data they contain. This has advantages in live data streaming, allowing unsorted data to be stored when device output messages arrive out of order or duplicated. Because of this, we have to check the ordering of all blocks pulled, check for overlap between them, and sort if necessary. When data is neatly ordered, as in our benchmark test, we can assume (as other formats do) that the data in the file is already in perfect order. For the purposes of this test, we can then disable those checks and the sorting step in the read path.
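The check being disabled is essentially the following (a sketch assuming each block reports its start/end timestamps; not AtriumDB's actual code):

import numpy as np

def needs_sort(block_starts, block_ends):
    """True if any block starts before the previous block ends (out of order or overlapping)."""
    return bool(np.any(block_starts[1:] < block_ends[:-1]))

def order_values(times, values):
    # Sort-by-time fallback, only paid when needs_sort() is True.
    order = np.argsort(times, kind="stable")
    return times[order], values[order]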
Cached Metadata, No Gap Conversion, No Check/Sort, Big Block Size (131072)
----------------------------------------------------------------------------------------
Read performance (median of N trials):
#seek #read KiB sec [N]
0 -1 61728 3.3177 [3] read 1 x 214981s, all channels
0 -1 22468 0.0913 [59] read 5 x 500s, all channels
0 -1 213504 0.6697 [8] read 50 x 50s, all channels
0 -1 2166848 6.5828 [3] read 500 x 5s, all channels
0 -1 8040 0.2893 [24] read 1 x 214981s, one channel
0 -1 3600 0.0146 [327] read 5 x 500s, one channel
0 -1 36080 0.1241 [39] read 50 x 50s, one channel
0 -1 359720 1.2405 [5] read 500 x 5s, one channel
-
In #11, a working group suggested using WFDB as the basis for a waveform format in CHoRUS: #11 (comment). This effectively resolves that discussion.
Part of the work that went into the group making this selection was the development of code to benchmark the various formats being considered. We would like to do some additional work to further develop the benchmarking code and to assess the formats against it. We can discuss that additional work here.
There are a few things that have already been raised which we'd like to implement: