
Allow for multithreaded channel extraction #100

Closed · EmilDohne opened this issue Sep 8, 2024 · 3 comments · Fixed by #101

@EmilDohne (Owner) commented:

For historical reasons, channel extraction was done on a single thread, since the rest of the reading/writing pipeline was already fully parallelized.

Some recent changes, however, changed that so we now iterate only over the channels of a specific layer (in most cases around 4). This opens up some additional performance to be gained by utilizing thread counts above 4 more effectively.

The initial motivation for this ticket was the following article by the blosc2 team.

This will not only speed up extraction during read/write, but also when users want to access a channel's image data.
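
To make the idea concrete, here is a minimal sketch of chunk-level extraction, assuming each channel can be split into independently decodable chunks; `CompressedChannel` and `extract_chunk` are hypothetical placeholders, not the library's actual API:

```cpp
// Minimal sketch only: split one channel into fixed-size chunks and extract
// them in parallel. CompressedChannel and extract_chunk are hypothetical
// stand-ins, not the library's real interface.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <execution>
#include <numeric>
#include <vector>

struct CompressedChannel
{
    std::vector<std::uint8_t> data;  // stand-in for the channel's compressed stream

    std::size_t uncompressed_size() const { return data.size(); }

    // Placeholder: decode the bytes for one chunk into dst. A real
    // implementation would call into the codec (e.g. blosc2) here.
    void extract_chunk(std::size_t offset, std::uint8_t* dst, std::size_t size) const
    {
        std::memcpy(dst, data.data() + offset, size);
    }
};

std::vector<std::uint8_t> extract_channel_parallel(const CompressedChannel& channel,
                                                   std::size_t chunk_size)
{
    std::vector<std::uint8_t> out(channel.uncompressed_size());

    // One index per chunk; the last chunk may be smaller than chunk_size.
    const std::size_t num_chunks = (out.size() + chunk_size - 1) / chunk_size;
    std::vector<std::size_t> indices(num_chunks);
    std::iota(indices.begin(), indices.end(), std::size_t{0});

    // Chunks are independent, so the parallel STL can spread them over all
    // available hardware threads regardless of how many channels a layer has.
    std::for_each(std::execution::par, indices.begin(), indices.end(),
        [&](std::size_t i)
        {
            const std::size_t offset = i * chunk_size;
            const std::size_t size = std::min(chunk_size, out.size() - offset);
            channel.extract_chunk(offset, out.data() + offset, size);
        });

    return out;
}
```

With a split like this, the usable parallelism scales with the amount of data in a channel rather than with the number of channels in a layer.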

@EmilDohne EmilDohne added enhancement New feature or request c++ labels Sep 8, 2024
@EmilDohne EmilDohne self-assigned this Sep 8, 2024
@EmilDohne (Owner, Author) commented:

The results of these changes are actually very promising, speeding up extraction by about:

~4x for our 8-bit large data use case (5s -> 1.3s)
~2x for our 16-bit large data use case (3.5s -> 1.95s)
~2x for our 32-bit large data use case (7s -> 3.8s)

As expected, for our test cases with smaller data that does not fill up a chunk fully, we are roughly on par speed-wise with the previous implementation.

These changes also affect the read/write speeds by about 5-10% (in both directions).

Here are the updated averages:

Automotive Data 8-bit:

  • Read: 1.04s -> 1.14s
  • Write: 2.0s -> 1.92s

Automotive Data 8-bit Zip:

  • Read: 1.02s -> 1.09s
  • Write: 2.28s -> 2.13s

Glacious Hyundai 8-bit:

  • Read: 0.54s -> 0.59s
  • Write: 0.97s -> 1.01s

Glacious Hyundai 8-bit Zip:

  • Read: 0.75s -> 0.61s
  • Write: 1.37s -> 1.38s

Deep Nested Layers 8-bit:

  • Read: 0.40s -> 0.39s
  • Write: 0.71s -> 0.67s

Automotive Data 16-bit:

  • Read: 3.79s -> 3.96s
  • Write: 6.23s -> 6.99s

Automotive Data 32-bit:

  • Read: 13.54s -> 13.55s
  • Write: 14.48s -> 13.50s

As we can see, the changes across the board are minimal, except for 32-bit write speeds. However, it appears that 16-bit read/write speeds got slower, so I will have to investigate why that might be.

@EmilDohne (Owner, Author) commented:

Made some more changes which slowed channel extraction down, but brought read/write speeds back in line and actually improved upon them!

~2x for our 8-bit large data use case (5s -> 2.3s)
~2x for our 16-bit large data use case (3.5s -> 1.95s)
~2x for our 32-bit large data use case (7s -> 3.65s)

I'm unsure why 8-bit data is slower in channel extraction compared to 16-bit, but it might just be that blosc2 can compress that data better and more efficiently.
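
For reference on the blosc2 side, its contexts expose their own internal thread count; the following is only a standalone sketch of that knob (buffer contents and sizes are made up, and this is not necessarily how the library wires up blosc2):

```cpp
// Standalone sketch of blosc2's per-context thread count (not the library's
// actual integration); the data and buffer sizes are made up for illustration.
#include <blosc2.h>
#include <cstdint>
#include <vector>

int main()
{
    blosc2_init();

    std::vector<std::uint16_t> pixels(1 << 20, 0);  // dummy 16-bit channel data
    const int32_t src_size =
        static_cast<int32_t>(pixels.size() * sizeof(std::uint16_t));
    std::vector<std::uint8_t> compressed(src_size + BLOSC2_MAX_OVERHEAD);

    // Compression context: blosc2 splits the buffer into blocks internally and
    // compresses them on its own thread pool.
    blosc2_cparams cparams = BLOSC2_CPARAMS_DEFAULTS;
    cparams.typesize = sizeof(std::uint16_t);
    cparams.clevel = 5;
    cparams.nthreads = 8;  // codec-side threads, separate from channel-level threading
    blosc2_context* cctx = blosc2_create_cctx(cparams);

    const int csize = blosc2_compress_ctx(cctx, pixels.data(), src_size,
                                          compressed.data(),
                                          static_cast<int32_t>(compressed.size()));

    // Decompression context with its own, independently tunable thread count.
    blosc2_dparams dparams = BLOSC2_DPARAMS_DEFAULTS;
    dparams.nthreads = 8;
    blosc2_context* dctx = blosc2_create_dctx(dparams);

    std::vector<std::uint16_t> roundtrip(pixels.size());
    blosc2_decompress_ctx(dctx, compressed.data(), csize,
                          roundtrip.data(), src_size);  // error handling omitted

    blosc2_free_ctx(cctx);
    blosc2_free_ctx(dctx);
    blosc2_destroy();
    return 0;
}
```

The `nthreads` field exists on both `blosc2_cparams` and `blosc2_dparams`, so the codec-side threading for compression and decompression can be tuned independently of any channel- or chunk-level parallelism.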

General read/write speeds

Ignore the right column of these benchmark graphs, as it's just a duplicate; they show the new write speeds, which improve by:

8-bit: +10% write speeds
16-bit: +10% write speeds
32-bit: +10% write speeds

(Benchmark graphs attached: 8-bit_graphs, 16-bit_graphs, 32-bit_graphs)

@EmilDohne (Owner, Author) commented:

Since our blocks no longer parallelize well on their own, I parallelized over the channels themselves, giving us:

~5x for our 8-bit large data use case (5s -> 1s)
~2.2x for our 16-bit large data use case (3.5s -> 1.6s)
~2.2x for our 32-bit large data use case (7s -> 3.2s)

This doesn't affect regular read/write speeds, as the ImageLayer extraction is only used for that.
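
A rough sketch of what parallelizing over the channels themselves can look like, assuming the per-channel work is independent; the types, `extract_single_channel`, and the use of `std::async` are illustrative placeholders and not necessarily what #101 implements:

```cpp
// Sketch of channel-level parallelism: one task per channel, assuming each
// channel's extraction is self-contained. Names are placeholders rather than
// PhotoshopAPI's real interface.
#include <cstdint>
#include <future>
#include <unordered_map>
#include <utility>
#include <vector>

using ChannelID = int;

// Placeholder for the per-channel work (decompression, de-interleaving,
// copying into the user-facing buffer, ...). The body is a dummy stub.
std::vector<std::uint8_t> extract_single_channel(ChannelID id)
{
    return std::vector<std::uint8_t>(1024, static_cast<std::uint8_t>(id));
}

std::unordered_map<ChannelID, std::vector<std::uint8_t>>
extract_layer_channels(const std::vector<ChannelID>& channel_ids)
{
    // Fan out one asynchronous task per channel (typically ~4 per layer).
    std::vector<std::pair<ChannelID, std::future<std::vector<std::uint8_t>>>> tasks;
    tasks.reserve(channel_ids.size());
    for (ChannelID id : channel_ids)
    {
        tasks.emplace_back(id, std::async(std::launch::async,
                                          [id] { return extract_single_channel(id); }));
    }

    // Collect the results as each task finishes.
    std::unordered_map<ChannelID, std::vector<std::uint8_t>> result;
    for (auto& [id, fut] : tasks)
    {
        result.emplace(id, fut.get());
    }
    return result;
}
```

`std::async` is just the simplest way to express the fan-out here; a shared thread pool would avoid oversubscribing cores when this runs alongside other parallel work.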
