make buffer size, transfer size and concurrency configurable #25
The read buffer size, transfer chunk size, and transfer concurrency options passed to the Azure Storage client have a large impact on both query duration and the number of transactions made against the Azure Storage account. This PR makes these values configurable so that users can tune them to balance performance against transaction costs.
The following shows the impact on the duration of a query against a 1.9 GiB gzipped JSON Lines blob:
| azure_read_transfer_concurrency | azure_read_transfer_chunk_size | azure_read_buffer_size | Duration (ms) |
| --- | --- | --- | --- |
| 5 | 1 MiB | 1 MiB | 69842.0027 |
| 1 | 1 MiB | 1 MiB | 64520.8366 |
| 4 | 32 MiB | 128 MiB | 46287.7139 |
| 16 | 8 MiB | 128 MiB | 35221.4137 |
| 16 | 16 MiB | 256 MiB | 29436.0231 |
The number of transactions required to do the query will be approximately: BlobSize / azure_read_transfer_chunk_size
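As a rough illustration of that estimate, the sketch below applies the formula to the 1.9 GiB test blob for the chunk sizes used in the benchmark. It assumes the whole blob is read once and that each chunk costs exactly one transaction; the function name `approx_transactions` and the constants are hypothetical and not part of this PR.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def approx_transactions(blob_size_bytes: int, chunk_size_bytes: int) -> int:
    """Estimate read transactions as ceil(BlobSize / chunk size).

    Assumption: one full sequential read, one transaction per chunk.
    """
    if chunk_size_bytes <= 0:
        raise ValueError("chunk size must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return -(-blob_size_bytes // chunk_size_bytes)


# Size of the benchmark blob (1.9 GiB), truncated to whole bytes.
blob_size = int(1.9 * GIB)

for chunk_mib in (1, 8, 16, 32):
    count = approx_transactions(blob_size, chunk_mib * MIB)
    print(f"chunk size {chunk_mib:>2} MiB -> ~{count} transactions")
```

Under these assumptions, going from a 1 MiB to a 16 MiB chunk size cuts the estimated transaction count by roughly a factor of 16, which is the cost side of the trade-off that the duration figures illustrate on the performance side.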
In this PR I chose defaults for these values as follows: