
Retry settings not defined on s3 filesystem #2788

Closed
jmurray-clarify opened this issue Sep 4, 2024 · 2 comments

@jmurray-clarify
Contributor

Describe the bug
When writing a high-cardinality (16k) partitioned table, we frequently encounter throttling errors from S3:

ray.exceptions.RayTaskError(OSError): ray::WriteFile [Stage:16]() (pid=15273, ip=10.1.71.222)
  File "/mnt/tmp/ray/session_2024-09-04_14-28-17_937685_23627/runtime_resources/pip/17241f6ccdf1c37bf076255dab7d5e63265f7e4d/virtualenv/lib64/python3.9/site-packages/daft/runners/ray_runner.py", line 356, in single_partition_pipeline
    return build_partitions(instruction_stack, partial_metadatas, *inputs)
  File "/tmp/ray/session_2024-09-04_14-28-17_937685_23627/runtime_resources/pip/17241f6ccdf1c37bf076255dab7d5e63265f7e4d/virtualenv/lib64/python3.9/site-packages/daft/runners/ray_runner.py", line 334, in build_partitions
    partitions = instruction.run(partitions)
  File "/tmp/ray/session_2024-09-04_14-28-17_937685_23627/runtime_resources/pip/17241f6ccdf1c37bf076255dab7d5e63265f7e4d/virtualenv/lib64/python3.9/site-packages/daft/execution/execution_step.py", line 343, in run
    return self._write_file(inputs)
  File "/tmp/ray/session_2024-09-04_14-28-17_937685_23627/runtime_resources/pip/17241f6ccdf1c37bf076255dab7d5e63265f7e4d/virtualenv/lib64/python3.9/site-packages/daft/execution/execution_step.py", line 347, in _write_file
    partition = self._handle_file_write(
  File "/tmp/ray/session_2024-09-04_14-28-17_937685_23627/runtime_resources/pip/17241f6ccdf1c37bf076255dab7d5e63265f7e4d/virtualenv/lib64/python3.9/site-packages/daft/execution/execution_step.py", line 362, in _handle_file_write
    return table_io.write_tabular(
  File "/tmp/ray/session_2024-09-04_14-28-17_937685_23627/runtime_resources/pip/17241f6ccdf1c37bf076255dab7d5e63265f7e4d/virtualenv/lib64/python3.9/site-packages/daft/table/table_io.py", line 490, in write_tabular
    _write_tabular_arrow_table(
  File "/tmp/ray/session_2024-09-04_14-28-17_937685_23627/runtime_resources/pip/17241f6ccdf1c37bf076255dab7d5e63265f7e4d/virtualenv/lib64/python3.9/site-packages/daft/table/table_io.py", line 886, in _write_tabular_arrow_table
    _retry_with_backoff(
  File "/tmp/ray/session_2024-09-04_14-28-17_937685_23627/runtime_resources/pip/17241f6ccdf1c37bf076255dab7d5e63265f7e4d/virtualenv/lib64/python3.9/site-packages/daft/table/table_io.py", line 819, in _retry_with_backoff
    return func()
  File "/tmp/ray/session_2024-09-04_14-28-17_937685_23627/runtime_resources/pip/17241f6ccdf1c37bf076255dab7d5e63265f7e4d/virtualenv/lib64/python3.9/site-packages/daft/table/table_io.py", line 867, in write_dataset
    pads.write_dataset(
  File "/tmp/ray/session_2024-09-04_14-28-17_937685_23627/runtime_resources/pip/17241f6ccdf1c37bf076255dab7d5e63265f7e4d/virtualenv/lib64/python3.9/site-packages/pyarrow/dataset.py", line 1030, in write_dataset
    _filesystemdataset_write(
  File "pyarrow/_dataset.pyx", line 4010, in pyarrow._dataset._filesystemdataset_write
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: When completing multiple part upload for key '<REDACTED>' in bucket '<REDACTED>': AWS Error SLOW_DOWN during CompleteMultipartUpload operation: Please reduce your request rate.

Looking at the S3Config docs, we have some retry settings: `retry_mode`, `num_tries`, `retry_initial_backoff_ms`. However, these are not used for this call.
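
For context, these settings live on the `S3Config` that is passed through an `IOConfig`. A minimal sketch of how a user would set them today (bucket, partition column, values, and the `write_parquet` call are placeholders/assumptions, not from this issue):

```python
from daft.io import IOConfig, S3Config

# Retry settings on S3Config; per this issue they are not forwarded to the
# pyarrow S3FileSystem that performs the actual partitioned write.
io_config = IOConfig(
    s3=S3Config(
        retry_mode="adaptive",          # illustrative value
        num_tries=10,
        retry_initial_backoff_ms=1000,
    )
)

# df is assumed to be a daft.DataFrame with a "part_key" column.
df.write_parquet("s3://my-bucket/out/", partition_cols=["part_key"], io_config=io_config)
```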

This uses the S3 filesystem created here:

Daft/daft/filesystem.py

Lines 214 to 227 in 60ebf82

if protocol == "s3":
    translated_kwargs = {}
    if io_config is not None and io_config.s3 is not None:
        s3_config = io_config.s3
        _set_if_not_none(translated_kwargs, "endpoint_override", s3_config.endpoint_url)
        _set_if_not_none(translated_kwargs, "access_key", s3_config.key_id)
        _set_if_not_none(translated_kwargs, "secret_key", s3_config.access_key)
        _set_if_not_none(translated_kwargs, "session_token", s3_config.session_token)
        _set_if_not_none(translated_kwargs, "region", s3_config.region_name)
        _set_if_not_none(translated_kwargs, "anonymous", s3_config.anonymous)
    resolved_filesystem = S3FileSystem(**translated_kwargs)
    resolved_path = resolved_filesystem.normalize_path(_unwrap_protocol(path))
    return resolved_path, resolved_filesystem

We should be defining a retry_strategy here.
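
A minimal sketch of what that wiring could look like, extending the snippet above. This assumes `S3Config.num_tries` maps onto pyarrow's `AwsStandardS3RetryStrategy`; it is hypothetical wiring, not the actual patch:

```python
from pyarrow.fs import AwsStandardS3RetryStrategy, S3FileSystem

# ... translated_kwargs built from s3_config as in the snippet above ...
# Hypothetical: forward S3Config.num_tries as pyarrow's max_attempts.
if s3_config.num_tries is not None:
    translated_kwargs["retry_strategy"] = AwsStandardS3RetryStrategy(
        max_attempts=s3_config.num_tries
    )

resolved_filesystem = S3FileSystem(**translated_kwargs)
```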

Expected behavior
Respect retry settings from S3Config.

@samster25
Member

@jmurray-clarify this is a good callout! I don't think we propagate the retry settings to the pyarrow S3FileSystem for writes. Do you want to take a stab at it and make a contribution?

@jmurray-clarify
Contributor Author

@samster25, I created a PR for this. Thanks.

#2800

samster25 pushed a commit that referenced this issue Sep 6, 2024
Addresses: #2788

Propagates the S3Config.num_tries config to the pyarrow S3 filesystem.

Note that the other relevant parameters on S3Config, `retry_mode` and
`retry_initial_backoff_ms`, are ignored because pyarrow's
[S3RetryStrategy](https://github.com/apache/arrow/blob/ab0a40ee34217070f14027776682074c55d0b507/python/pyarrow/_s3fs.pyx#L112)
only has one parameter, `max_attempts`.

Note that this only addresses S3. GCSConfig and AzureConfig do not have
retry settings.
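
For reference, both retry strategies pyarrow exposes accept only `max_attempts`, which is why `num_tries` is the only setting that can be carried over (a small illustrative snippet, not part of the patch):

```python
from pyarrow.fs import AwsDefaultS3RetryStrategy, AwsStandardS3RetryStrategy

# Neither strategy takes a retry mode or an initial backoff; max_attempts is
# the only knob, so S3Config.num_tries is the only setting that maps cleanly.
standard = AwsStandardS3RetryStrategy(max_attempts=10)
default = AwsDefaultS3RetryStrategy(max_attempts=10)
```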