Benchmarking - don't load waveform for each random access read #92
It has been suggested by @tcpan and others that we could add a benchmarking test (or replace the existing one) which loads the waveform being tested and then does all of the random access reads without having to reload the waveform. This would separate the time to load the waveform from the time to seek a random block of data.

This would require:

- benchmark.py code to time the load and seek operations separately (see the sketch below)
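A minimal sketch of what timing the two phases separately might look like; `load_waveform` and `read_block` are hypothetical stand-ins for format-specific code, not functions from benchmark.py:

```python
import time

def benchmark_random_reads(record_path, read_requests):
    """Time the one-off load separately from the random-access seeks.

    load_waveform and read_block are hypothetical placeholders for
    format-specific code.
    """
    t0 = time.perf_counter()
    waveform = load_waveform(record_path)   # load the record once
    load_time = time.perf_counter() - t0

    t0 = time.perf_counter()
    for start, end in read_requests:        # then seek many times
        read_block(waveform, start, end)
    seek_time = time.perf_counter() - t0

    return load_time, seek_time
```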
Comments

This would likely take a fairly significant effort by a number of people (for the format code) and delay completion of the final benchmarking. I can think of specific scenarios where loading and seeking independently would be beneficial (e.g., analyzing blocks of time around multiple deliveries of a drip medication). However, it isn't clear to me how common this type of analysis is. Please share your thoughts on how often it comes up and whether it is worth the effort to implement in our benchmarking software.
It feels to me that the need to read multiple, non-contiguous segments from a single record would be common enough that it would be desirable to (1) provide a clear API for the task and (2) optimise read speeds. [Edit: I'm not suggesting this needs to be benchmarked, but I think it is a useful feature for us to think about for WFDB.]
Agree that this is a good benchmark/feature. It could be implemented fairly easily for CCDEF/HDF5 by loading the entire dataset into memory, as opposed to the default behavior of only loading the indexed portion of the signal. This approach doesn't make use of chunking, however, so I see it as complementary to the current benchmark setup.
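For illustration, the two HDF5 access patterns might look like this with h5py; the file name and dataset path are made up:

```python
import h5py

with h5py.File("record.h5", "r") as f:    # hypothetical file name
    dset = f["waveforms/ecg"]             # hypothetical dataset path

    # Default behaviour: read only the indexed portion; HDF5 touches just
    # the chunks that overlap the requested slice.
    window = dset[500_000:750_000]

    # Alternative for open-once/read-many: pull the whole dataset into RAM
    # once, then slice the in-memory array for each random read.
    cached = dset[...]
    window = cached[500_000:750_000]
```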
This should be doable for DICOM. With the way the waveform data is organized, I think I will need to maintain a dictionary of file objects and the metadata in each.
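As a rough sketch of that idea with pydicom (the class and its methods are invented for illustration; only `dcmread` and `Dataset.waveform_array` are actual pydicom API):

```python
import pydicom

class DicomWaveformCache:
    """Keep parsed DICOM waveform files (and their metadata) open for reuse."""

    def __init__(self):
        self._open = {}  # path -> parsed dataset

    def get(self, path):
        if path not in self._open:
            self._open[path] = pydicom.dcmread(path)  # parse the file once
        return self._open[path]

    def read(self, path, sequence_index):
        ds = self.get(path)
        # waveform_array decodes one multiplex group to a NumPy array
        return ds.waveform_array(sequence_index)

    def close(self):
        self._open.clear()
```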
We access the signals in a variety of ways, but I think we can boil it down to three main methods, all of which are built into AtriumDB.
AtriumDB has been optimised for reading large amounts of data into RAM using a single "query". This approach allows us to parallelise the decompression of multiple blocks across multiple cores. It also reduces decompression of unneeded portions of the data, since accessing even a few seconds of data requires decompressing an entire block. Another advantage of larger data requests is that it minimises the number of queries on the metadata database. For example, on our development server (with fast SSDs and 40 cores) we can convert approximately 500M values per second of compressed signal in .tsc format into a NumPy array for large queries (i.e., where the number of blocks being decompressed exceeds the number of CPU cores available on the system). This equates to reading approximately 11 days of 500 Hz ECG per second, or 45 days of 125 Hz ABP per second.

Note that both methods 1 and 2 involve selecting a large amount of data into RAM and then performing sub-selects on the cached array. Also note that the example used in method 1, with a 60 s window and a 10 s slide, would involve a lot of redundancy between windows. It is inefficient to perform this kind of data access using a sequence of discrete, granular queries from disk.

In our experience, data analysis that involves large amounts of retrospective data usually requires method 1. However, if we were approaching the benchmarks that have been defined as part of the ChoRUS project, we would probably use method 2. We often use method 2 when preparing windows of a single signal or multimodal data for training a model. Ideally the order of the windows supplied to the model training process would be truly random (i.e., method 3), but the performance hit of this approach is so significant that we usually use combinations of method 2 to produce a deterministic (i.e., repeatable), pseudo-random approach. We also use method 2 when sub-sampling the properties of a larger dataset; we have found that it is often impractical and unnecessary to read/analyse the entire dataset when a very good approximation can be found by careful sub-sampling.

Method 3, in our experience, is the least common way users want to access the data for automated analysis. It is typically used when selecting small windows of data for visualisation, review, or labelling by a user. However, I would argue that query speed is relatively unimportant for single-window, user-oriented tasks.

I feel it is inevitable that, as we evolve towards more efficient systems, we will converge on a design that performs fewer, larger queries from disk and then performs sub-selects on the cached array in RAM. With all of this in mind, I think we should de-emphasise the importance of the "many small queries" and instead focus on the performance of the "single large" query in the benchmarking.
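To make the pattern concrete, a minimal sketch of the "one large query, then sub-select windows in RAM" approach with the 60 s window / 10 s slide from the example above; the signal is assumed to be a 1-D NumPy array returned by a single hypothetical bulk query:

```python
def sliding_windows(signal, fs=500, window_s=60, slide_s=10):
    """Yield overlapping windows from one large in-RAM query result.

    `signal` is assumed to be a 1-D NumPy array; fs, window_s, and slide_s
    mirror the 60 s window / 10 s slide example above.
    """
    window = window_s * fs
    slide = slide_s * fs
    for start in range(0, len(signal) - window + 1, slide):
        # Each window is a zero-copy view into the cached array, so the
        # 50 s of overlap between consecutive windows costs no extra I/O.
        yield signal[start:start + window]
```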
@tcpan @briangow Would the idea of the new feature be to load compressed data into RAM and benchmark the memory usage and decompression speed for random windows within the file? As @meshlab mentioned, AtriumDB has a similar feature as one of our bread-and-butter tools for data analysis and model training, but we load batches of data uncompressed into RAM as we iterate over them by various window sizes/slides, with and without shuffling the window order. We tend to find a healthy middle ground between the efficiency of the batch query and being responsible with memory usage. If we left our batches compressed before each access, it would significantly increase the time for each access, and speed matters a lot when accessing data at this scale and fidelity (multiple years of 500 Hz signal). If there are uses you have in mind where you need to load so much data at a time that uncompressed memory storage is impractical, then I think this addition makes sense. But if you think uncompressed memory storage is good enough, then while I definitely think ChoRUS should aim to have such a tool, benchmarking it between formats would be unnecessary.
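A sketch of the batched, deterministically shuffled iteration described here, assuming each batch is a 1-D NumPy array already decompressed in RAM:

```python
import numpy as np

def iter_shuffled_windows(batch, window, slide, seed=0):
    """Yield windows from one uncompressed in-RAM batch in pseudo-random order.

    A fixed seed keeps the shuffled order deterministic (repeatable) while
    avoiding the cost of truly random access across the whole record.
    """
    starts = np.arange(0, len(batch) - window + 1, slide)
    np.random.default_rng(seed).shuffle(starts)  # shuffle within the batch only
    for start in starts:
        yield batch[start:start + window]        # view into the batch, no copy
```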
@WilliamDixon, thanks for the thoughts! I had been assuming we would be loading the data uncompressed into RAM. I'll defer to @tcpan to clarify though.
I managed to spend a few hours creating a test implementation of this. Three new abstract methods have been added to formats/base.py: open_waveforms, close_waveforms, and read_opened_waveforms, and documentation for these has been added to Benchmark.md. Specifically, open_waveforms returns a user-defined dictionary, which read_opened_waveforms then uses to extract the waveforms. It is up to the implementor to define what that object contains, so it could hold compressed or uncompressed blocks in memory as @WilliamDixon mentioned, or be as simple as a dictionary of file handles. As already stated, there are memory-footprint and speed trade-offs, both of which we are measuring for the benchmark.

I've tested the new API by implementing it for NPY, Parquet, Pickle, and DICOM. I've also added a non-abstract function, open_read_close_waveforms, to formats/base.py; it chains the three new functions together and is used for a second fidelity check. Three sets of read benchmarks have been added that utilize these three functions: read all channels; read blocks from one channel (e.g., 500 reads of 5 seconds, all from one channel); and read blocks from a random channel (e.g., 500 reads of 5 seconds, each from a different channel).

Please review this version (in branch open_once_read_many) for feedback. Benchmark runtime has increased in total, of course. Please note that aside from these four formats, the other formats will throw a "not implemented" error; for all formats, we need the implementers to implement/review/optimize.
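For reference, a sketch of how the new API surface might fit together; the three abstract method names come from the description above, but the signatures are assumptions, not the actual code in formats/base.py:

```python
from abc import ABC, abstractmethod

class BaseFormat(ABC):
    @abstractmethod
    def open_waveforms(self, path, signal_names):
        """Open the record once and return a format-defined dict: it may
        hold file handles, compressed blocks, or decompressed arrays."""

    @abstractmethod
    def read_opened_waveforms(self, opened, start_time, end_time, signal_names):
        """Extract the requested waveforms from the dict that
        open_waveforms returned."""

    @abstractmethod
    def close_waveforms(self, opened):
        """Release whatever resources open_waveforms acquired."""

    def open_read_close_waveforms(self, path, start_time, end_time, signal_names):
        """Chain the three calls; used for the second fidelity check."""
        opened = self.open_waveforms(path, signal_names)
        try:
            return self.read_opened_waveforms(opened, start_time, end_time,
                                              signal_names)
        finally:
            self.close_waveforms(opened)
```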
NPY result on my laptop:

Format: waveform_benchmark.formats.npy.NPY_Uncompressed
Output size: 201211 KiB (32.00 bits/sample)
[Channel summary, fidelity checks (via read_waveforms and via open/read/close), and the "Open Once Read Many" and read performance tables (median of N trials) were lost in extraction.]
Parquet result on my laptop. Not clear why KiB read is 0 for all.

Format: waveform_benchmark.formats.parquet.Parquet_Uncompressed
Output size: 91152 KiB (14.50 bits/sample)
[Channel summary, fidelity checks (via read_waveforms and via open/read/close), and the "Open Once Read Many" and read performance tables (median of N trials) were lost in extraction.]
DICOM results on my laptop:

Format: waveform_benchmark.formats.dicom.DICOM16Bits
Output size: 100609 KiB (16.00 bits/sample)
[Channel summary, fidelity checks (via read_waveforms and via open/read/close), and the "Open Once Read Many" and read performance tables (median of N trials) were lost in extraction.]
Pickle results on my laptop. Note the amount of data read during the file open operation.

`./waveform_benchmark.py --input_record data/waveforms/physionet.org/files/charisdb/1.0.0/charis8 --format_class waveform_benchmark.formats.pickle.Pickle -m`

Format: waveform_benchmark.formats.pickle.Pickle
Output size: 201211 KiB (32.00 bits/sample)
[Channel summary, fidelity checks (via read_waveforms and via open/read/close), and the "Open Once Read Many" and read performance tables (median of N trials) were lost in extraction.]
Fixed by #94 |