Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

When downloading a Blob, failed requests (HttpResponseError exceptions) are not correctly retried #22784

Closed
Neki opened this issue Jan 26, 2022 · 7 comments
Assignees
Labels
bug This issue requires a change to an existing behavior in the product in order to be resolved. Client This issue points to a problem in the data-plane of the library. customer-reported Issues that are reported by GitHub users external to the Azure organization. issue-addressed Workflow: The Azure SDK team believes it to be addressed and ready to close. Storage Storage Service (Queues, Blobs, Files)

Comments

@Neki
Copy link

Neki commented Jan 26, 2022

  • Package Name: azure-storage-blob
  • Package Version: 12.8.1
  • Operating System: Ubuntu 20.04
  • Python Version: 3.8

Describe the bug
We have an integration test that's using the following code

service_client = BlobServiceClient(
    account_url=f"https://{storage_account_name}.blob.core.windows.net",
    credential=os.environ["AZURE_STORAGE_ACCOUNT_KEY"],
)
service_client.get_blob_client(
    container=container, blob=blob
)
.download_blob()
.readall()

Sometimes, this test fails with this stacktrace (one line edited to remove a private filename, otherwise it's a copy-paste):

../.venv/lib/python3.8/site-packages/urllib3/response.py:438: in _error_catcher
    yield
../.venv/lib/python3.8/site-packages/urllib3/response.py:519: in read
    data = self._fp.read(amt) if not fp_closed else b""
/usr/local/lib/python3.8/http/client.py:458: in read
    n = self.readinto(b)
/usr/local/lib/python3.8/http/client.py:502: in readinto
    n = self.fp.readinto(b)
/usr/local/lib/python3.8/socket.py:669: in readinto
    return self._sock.recv_into(b)
/usr/local/lib/python3.8/ssl.py:1241: in recv_into
    return self.read(nbytes, buffer)
/usr/local/lib/python3.8/ssl.py:1099: in read
    return self._sslobj.read(len, buffer)
E   ConnectionResetError: [Errno 104] Connection reset by peer
During handling of the above exception, another exception occurred:
../.venv/lib/python3.8/site-packages/requests/models.py:758: in generate
    for chunk in self.raw.stream(chunk_size, decode_content=True):
../.venv/lib/python3.8/site-packages/urllib3/response.py:576: in stream
    data = self.read(amt=amt, decode_content=decode_content)
../.venv/lib/python3.8/site-packages/urllib3/response.py:541: in read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
/usr/local/lib/python3.8/contextlib.py:131: in __exit__
    self.gen.throw(type, value, traceback)
../.venv/lib/python3.8/site-packages/urllib3/response.py:455: in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)
E   urllib3.exceptions.ProtocolError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
During handling of the above exception, another exception occurred:
../.venv/lib/python3.8/site-packages/azure/core/pipeline/transport/_requests_basic.py:169: in __next__
    chunk = next(self.iter_content_func)
../.venv/lib/python3.8/site-packages/requests/models.py:761: in generate
    raise ChunkedEncodingError(e)
E   requests.exceptions.ChunkedEncodingError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
During handling of the above exception, another exception occurred:
tests/integration_tests/test_file:261: in _test_function
    service_client.get_blob_client(
../.venv/lib/python3.8/site-packages/azure/storage/blob/_download.py:520: in readall
    self.readinto(stream)
../.venv/lib/python3.8/site-packages/azure/storage/blob/_download.py:617: in readinto
    downloader.process_chunk(chunk)
../.venv/lib/python3.8/site-packages/azure/storage/blob/_download.py:129: in process_chunk
    chunk_data = self._download_chunk(chunk_start, chunk_end - 1)
../.venv/lib/python3.8/site-packages/azure/storage/blob/_download.py:211: in _download_chunk
    chunk_data = process_content(response, offset[0], offset[1], self.encryption_options)
../.venv/lib/python3.8/site-packages/azure/storage/blob/_download.py:52: in process_content
    content = b"".join(list(data))
../.venv/lib/python3.8/site-packages/azure/core/pipeline/transport/_requests_basic.py:188: in __next__
    raise HttpResponseError(err, error=err)
E   azure.core.exceptions.HttpResponseError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))

When looking at the relevant code:

while retry_active:
    try:
        _, response = self.client.download(
            range=range_header,
            range_get_content_md5=range_validation,
            validate_content=self.validate_content,
            data_stream_total=self.total_size,
            download_stream_current=self.progress_total,
            **self.request_options
        )
    except HttpResponseError as error:
        process_storage_error(error)

    try:
        chunk_data = process_content(response, offset[0], offset[1], self.encryption_options)
        retry_active = False
    except (requests.exceptions.ChunkedEncodingError, requests.exceptions.ConnectionError) as error:
        retry_total -= 1
        if retry_total <= 0:
            raise ServiceResponseError(error, error=error)
        time.sleep(1)

the HttpResponseError is raised by the process_content line, but is not caught here. self.client.download doesn't raise an exception, it returns an iterator instead. The exception is raised when process_content iterates on response.

To Reproduce
Steps to reproduce the behavior:

  1. Run many times a code snippet similar to the above, until a "connection reset by peer" network error occurs (I haven't wrote a simulator for that, but this occurred a few times in out integration tests so far).

Expected behavior
The retry code is triggered, and the network error does not surface to the application.

Screenshots
N/A

Additional context
Maybe I'm misinterpreting the stack trace here. The code never configures retry parameters, it uses the default provided by the Azure SDK. To avoid encountering this issue, should I change the retry parameters instead?

@ghost ghost added needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. customer-reported Issues that are reported by GitHub users external to the Azure organization. question The issue doesn't require a change to the product in order to be resolved. Most issues start as that labels Jan 26, 2022
@tjprescott tjprescott added the Storage Storage Service (Queues, Blobs, Files) label Jan 26, 2022
@ghost ghost removed the needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. label Jan 26, 2022
@tjprescott tjprescott added the Client This issue points to a problem in the data-plane of the library. label Jan 26, 2022
@ghost ghost added the needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team label Jan 26, 2022
@tjprescott
Copy link
Member

@tasherif-msft can you take a look?

@tasherif-msft
Copy link
Contributor

@jalauzon-msft fyi

@tasherif-msft
Copy link
Contributor

Hi @Neki we are aware of this issue. These resets are triggered by the service. We are currently in discussions on handling these types of errors the Python SDKs. We will update you once we have more information

@jalauzon-msft jalauzon-msft added bug This issue requires a change to an existing behavior in the product in order to be resolved. and removed question The issue doesn't require a change to the product in order to be resolved. Most issues start as that labels Jan 28, 2022
@jalauzon-msft
Copy link
Member

Hi @Neki Benoît, I took another look at this issue, and I believe this is separate from our other effort around Connection Resets that Tamer mentioned above.

I spent some time trying to reproduce your Traceback where Connection Resets would not be retried during a streaming download, but I was unable to reproduce the issue. It seems that requests made in process_content() go through our default retry pipeline and therefore Connection Resets are retried as expected.

I was trying to reproduce this issue from the main branch of the repo so there is a chance this was fixed in a later version of azure-storage-blob than what you are using (12.8.1) or, more likely, fixed in a later version of azure-core (though I'm not sure which version you are currently using). Would it be possible for you to try upgrading to latest version of blob and core to see if this issue still occurs?

As a note, we are still investigating Connection Reset errors that are occurring even with retries so you may still see those, but I am more interested in if the retries are working properly in the latest versions. Thanks!

@jalauzon-msft jalauzon-msft added the issue-addressed Workflow: The Azure SDK team believes it to be addressed and ready to close. label Apr 1, 2022
@ghost
Copy link

ghost commented Apr 1, 2022

Hi @Neki. Thank you for opening this issue and giving us the opportunity to assist. We believe that this has been addressed. If you feel that further discussion is needed, please add a comment with the text “/unresolve” to remove the “issue-addressed” label and continue the conversation.

@ghost ghost removed the needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team label Apr 1, 2022
@jalauzon-msft
Copy link
Member

Hi @Neki, I've marked this Issue for close as I think retries are properly occurring in the latest versions, as mentioned. We can use #19753 to track any Connection Reset errors that are getting through the retries. There is also an example of how to increase retry count there.

@ghost
Copy link

ghost commented Apr 8, 2022

Hi @Neki, since you haven’t asked that we “/unresolve” the issue, we’ll close this out. If you believe further discussion is needed, please add a comment “/unresolve” to reopen the issue.

@ghost ghost closed this as completed Apr 8, 2022
azure-sdk pushed a commit to azure-sdk/azure-sdk-for-python that referenced this issue Mar 9, 2023
[Hub Generated] Review request for Microsoft.MachineLearningServices to add version preview/2023-02-01-preview (Azure#22436)

* Adds base for updating Microsoft.MachineLearningServices from version preview/2022-12-01-preview to version 2023-02-01-preview

* Updates readme

* Updates API version in new specs and examples

* Add managedResourceGroupTags to registries Feb 2023 Preview swagger (Azure#22617)

* removed feb2023 registries.json

* Revert "removed feb2023 registries.json"

This reverts commit d2d1f7055e00284ba040cf0b792404956fb89c24.

* add managedResourceGroupTags property

---------

Co-authored-by: Komal Yadav <komalyadav@microsoft.com>

* updating to feb 2023 preview (Azure#22456)

* updating to feb 2023 preview

* adding example files

* add to custom words, remove spelling errors

* Update specification/machinelearningservices/resource-manager/Microsoft.MachineLearningServices/preview/2023-02-01-preview/mfe.json

Co-authored-by: Jianye Xi <59603451+jianyexi@users.noreply.github.com>

* Update specification/machinelearningservices/resource-manager/Microsoft.MachineLearningServices/preview/2023-02-01-preview/mfe.json

Co-authored-by: Jianye Xi <59603451+jianyexi@users.noreply.github.com>

* Update specification/machinelearningservices/resource-manager/Microsoft.MachineLearningServices/preview/2023-02-01-preview/mfe.json

Co-authored-by: Jianye Xi <59603451+jianyexi@users.noreply.github.com>

* resolving model validation errors

* try resolving operation async response validation errors

* merge data container/ver, queue settings

* add datacontainer and version example files, add LRO uri

* update example jobs files, update LRO and registry name pattern

* last LRO fixes

* test that batchdeploymentproperties is resolved

* adding maulik's swagger generation PR changes

* update swagger with new deduped batch names

* adding updated computeresource name

* add boolean featuredatatype

---------

Co-authored-by: Jianye Xi <59603451+jianyexi@users.noreply.github.com>

* ManagedNetwork Support and paramter parttern update for WorkspaceRP (Azure#22588)

* add swagger support for managed network

* update parameter pattern

* Add rules

* remove unused file

* fix

* fix format and add pattern

* update format and example

* revert manage network changes

* small change

* revert

* Staging Branch Gate Fixes and Schema Bug Fix for ModelPackage (Azure#22784)

* updating to feb 2023 preview

* adding example files

* add to custom words, remove spelling errors

* Update specification/machinelearningservices/resource-manager/Microsoft.MachineLearningServices/preview/2023-02-01-preview/mfe.json

Co-authored-by: Jianye Xi <59603451+jianyexi@users.noreply.github.com>

* Update specification/machinelearningservices/resource-manager/Microsoft.MachineLearningServices/preview/2023-02-01-preview/mfe.json

Co-authored-by: Jianye Xi <59603451+jianyexi@users.noreply.github.com>

* Update specification/machinelearningservices/resource-manager/Microsoft.MachineLearningServices/preview/2023-02-01-preview/mfe.json

Co-authored-by: Jianye Xi <59603451+jianyexi@users.noreply.github.com>

* resolving model validation errors

* try resolving operation async response validation errors

* merge data container/ver, queue settings

* add datacontainer and version example files, add LRO uri

* update example jobs files, update LRO and registry name pattern

* last LRO fixes

* test that batchdeploymentproperties is resolved

* adding maulik's swagger generation PR changes

* update swagger with new deduped batch names

* adding updated computeresource name

* add boolean featuredatatype

* fixing model validation pattern error

* fix LRO gate

* adding model package updates to swagger

* un requiring property

---------

Co-authored-by: Jianye Xi <59603451+jianyexi@users.noreply.github.com>

---------

Co-authored-by: Komal Yadav <23komal.yadav23@gmail.com>
Co-authored-by: Komal Yadav <komalyadav@microsoft.com>
Co-authored-by: Jianye Xi <59603451+jianyexi@users.noreply.github.com>
Co-authored-by: ZhidaLiu <zhili@microsoft.com>
@github-actions github-actions bot locked and limited conversation to collaborators Apr 11, 2023
This issue was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug This issue requires a change to an existing behavior in the product in order to be resolved. Client This issue points to a problem in the data-plane of the library. customer-reported Issues that are reported by GitHub users external to the Azure organization. issue-addressed Workflow: The Azure SDK team believes it to be addressed and ready to close. Storage Storage Service (Queues, Blobs, Files)
Projects
None yet
Development

No branches or pull requests

5 participants