Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] Enable strict checking of http content-length #20412

Closed

Conversation

jochen-ott-by
Copy link

We have a big data application heavily relying on the azure blob store to store the data. Unfortunately, we keep running into client-side data corruption issues when reading data from the blob store: sometimes, fewer bytes are returned than expected; sometimes the returned data has the correct length, but data is corrupt.

(I reported an issue in a similar vein already as #16723 . This was fixed in the meantime and has already substantially improved the situation for us.)

A potential cause for these issues are prematurely closed tcp connections: If a tcp connection is closed while in the middle of transferring the body of a http response, the azure-core library so far returned truncated data -- rather than raising an exception. This PR fixes this behavior by enabling strict checking of the http content-length. In case of a prematurely closed tcp connection, an exception is raised rather than returning the truncated body.
Note that the aiohttp backend already raised in such case, so no change was required for the aiohttp backend.

The fix boils down to setting urllib3.response.HTTPResponse.enforce_content_length at an appropriate time: when or directly after constructing the HTTPResponse object, but before reading the body.
I chose to do this in a requests hook, as:

  • setting it in a HTTPAdapter.send (e.g. by adding BiggerBlockSizeHTTPAdapter.send) would also be possible in principle, but it would not cover cases in which RequestsTransport cannot set the adapter as it does not own the session.
  • setting it after requests.Session.request returns would only cover the stream=True case: For stream=False, the body of the response was already fully read once requests.Session.request returns, so setting enforce_content_length would have no effect.

@ghost ghost added Azure.Core customer-reported Issues that are reported by GitHub users external to the Azure organization. labels Aug 25, 2021
@ghost
Copy link

ghost commented Aug 25, 2021

Thank you for your contribution jochen-ott-by! We will review the pull request and get back to you soon.

@ghost
Copy link

ghost commented Aug 25, 2021

CLA assistant check
All CLA requirements met.

@xiangyan99
Copy link
Member

Thanks @jochen-ott-by for the feedback and contributing.

Could you give us more details about your scenarios? e.g. I guess the failure only matters in streaming scenario because if we fail to read the body, we will raise exceptions when deserializing it? I ask this because this will help us scope the fix. We need to evaluate if "enforce_content_length" should be an always-on setting by default or we can only enable it in some situations.

I also find https://github.com/streamlink/streamlink/pull/3768/files. Seems like that can also solve the problem, right?

Thanks.

@jochen-ott-by
Copy link
Author

jochen-ott-by commented Sep 13, 2021

Sorry for the late reply, I was on vacation.

Thanks @jochen-ott-by for the feedback and contributing.

Could you give us more details about your scenarios?

The scenario is reading the content of many (tens of thousands) blobs from the azure blob store in a big data application. I believe this always uses stream=True (but I'm actually not completely sure that's true).

e.g. I guess the failure only matters in streaming scenario because if we fail to read the body, we will raise exceptions when deserializing it?

I do not think this is true in general.
This really depends on the serialization format: some formats can reliably detect truncated data, but many formats do not. In particular, for reading "raw blobs", there is (in general) no notion of a "deserialization", so there is no step that would reliably raise an exception. To make it more concrete, think of a small helper tool that copies data (as in: raw blobs) from one blob storage technology to another. For such an application, there is no deserialization that would ever raise.

I ask this because this will help us scope the fix. We need to evaluate if "enforce_content_length" should be an always-on setting by default or we can only enable it in some situations.

I think the sane thing to do here is to always enable it. Note that the aiohttp-based async client already does this, so in this sense, this just brings the requests-based implementation on par with the aiohttp implementation.

I also find https://github.com/streamlink/streamlink/pull/3768/files. Seems like that can also solve the problem, right?

This seems to also address the same issue, yes. However, it works by monkey-patching third-party (urllib3 / requests) classes. Monkeypatching does not compose well: if several libraries to "patch" via such a mechanism, at most one of them will succeed. Also, there might be (subtle) incompatibilities, e.g. if other parts of the application rely on the unpatched behavior.
I would therefore strongly advice against using monkeypatching in a library like azure-core that is typically part of a much larger applications, and azure-core should avoid making too many assumptions about these applications. (Monkeypatching can be acceptable for self-contained applications such as streamlink, as they have different trade-offs than libraries.)

@xiangyan99
Copy link
Member

@jochen-ott-by Thanks for the information. We will have a discussion with our architect how to add the check.

@xiangyan99
Copy link
Member

@jochen-ott-by in terms of "Note that the aiohttp-based async client already does this", could you help point me where aiohttp implement this?

It looks to me aiohttp only check length when we call readexactly.

https://github.com/aio-libs/aiohttp/blob/master/aiohttp/streams.py#L425-L438

Did I miss anything?

@jochen-ott-by
Copy link
Author

@jochen-ott-by in terms of "Note that the aiohttp-based async client already does this", could you help point me where aiohttp implement this?

This PR adds a test that shows the behavior I described:
https://github.com/Azure/azure-sdk-for-python/pull/20412/files#diff-524251c35211f89b7517d30ac4f3eb5375189356582eed96d3e23c50769c6868R47

In aiohttp code, this check is quite deep, but done somewhere here:
https://github.com/aio-libs/aiohttp/blob/09ac1cbb4c07b4b3e1872374808512c9bc52f2ab/aiohttp/http_parser.py#L707

This code path is active in case "length" is set, which is usually the case if the "Content-Length" header was set; see
https://github.com/aio-libs/aiohttp/blob/09ac1cbb4c07b4b3e1872374808512c9bc52f2ab/aiohttp/http_parser.py#L319-L335

It looks to me aiohttp only check length when we call readexactly.

https://github.com/aio-libs/aiohttp/blob/master/aiohttp/streams.py#L425-L438

Did I miss anything?

readexactly and content-length checking are two very different things: the former is an API to get data, the latter is a protocol level check. So they have not much to do with each other.

@jochen-ott-by
Copy link
Author

jochen-ott-by commented Sep 27, 2021

Is there any update here? I can add that we did deploy this change in production, it did solve the issues we observed before. I therefore believe this is a robust fix that really improves real-world usage.

I again want to remind you that without this fix, we effectively have silent data corruption when accessing data from the azure blob store via this library. I therefore think it deserves some priority.

@xiangyan99
Copy link
Member

@jochen-ott-by Thank for the follow up.

Your PR looks great.

Well, we need some small changes based on your PR because we want to unify the error and make it inherit from AzureError to be less breaking.

The new PR can be found at #20888.

We will keep you posted.

Thanks again for your contribution.

@jochen-ott-by
Copy link
Author

Thanks for the update @xiangyan99 !

Integrating this properly with error reporting of course makes sense. This means we can close this PR now in favor of the new one you just created, #20888, so let's continue any discussions / code review comments "over there".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Azure.Core customer-reported Issues that are reported by GitHub users external to the Azure organization.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants