Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make buffer size, transfer size and concurrency configurable #25

Merged

Conversation

jkuhn-cuda
Copy link
Contributor

The read buffer size as well as the transfer chunk size and transfer concurrency options provided to the Azure Storage client have a large impact on both the duration of queries and number of transactions done against the Azure Storage Account. This PR makes these values configurable so that users can tune these settings to balance performance and transaction costs.

The following shows the impact on the duration of a query against a 1.9 GiB gzipped json lines blob:

azure_read_transfer_concurrency = 5 / azure_read_transfer_chunk_size = 1 MiB / azure_read_buffer_size = 1 MiB
69842.0027 ms

azure_read_transfer_concurrency = 1 / azure_read_transfer_chunk_size = 1 MiB / azure_read_buffer_size = 1 MiB
64520.8366 ms

azure_read_transfer_concurrency = 4 / azure_read_transfer_chunk_size = 32 MiB / azure_read_buffer_size = 128 MiB
46287.7139 ms

azure_read_transfer_concurrency = 16 / azure_read_transfer_chunk_size = 8 MiB / azure_read_buffer_size = 128 MiB
35221.4137 ms

azure_read_transfer_concurrency = 16 / azure_read_transfer_chunk_size = 16 MiB / azure_read_buffer_size = 256 MiB
29436.0231 ms

The number of transactions required to do the query will be approximately: BlobSize / azure_read_transfer_chunk_size

In this PR I chose defaults for these values as follows:

  • I chose 5 for azure_read_transfer_concurrency because that seems to be the default in the Azure SDK.
  • I chose 1 MiB for azure_read_transfer_chunk_size and azure_read_buffer_size because the current buffer size is 1 MB. Actually, I just realized that the old buffer size was 1 MB and I changed it to 1 MiB. I can switch that back if desired.

Copy link
Collaborator

@samansmink samansmink left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @jkuhn-cuda thanks a lot for the PR, LGTM! Also, those performance improvements look great!

@jkuhn-cuda
Copy link
Contributor Author

jkuhn-cuda commented Nov 30, 2023

Hey @jkuhn-cuda thanks a lot for the PR, LGTM! Also, those performance improvements look great!

Thanks @samansmink ! I'm glad you are happy with it. Approximately how long should I expect it to take for the PR to be merged and to be available in an official released build? I need to use these new configuration values for a project I am working on and I need to determine if an official release build will be available in time. Otherwise, I will need to set up my project to use a custom build with these changes until an official build is available.

@samansmink samansmink merged commit d2c685a into duckdb:main Dec 1, 2023
20 checks passed
@samansmink
Copy link
Collaborator

@jkuhn-cuda as soon as CI for this merge passes, the binary should be available in our nightly bucket which you can install from duckdb 0.9.2 using: force install azure from 'http://nightly-extensions.duckdb.org'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants