[Core] Enable strict checking of http content-length #20412

jochen-ott-by · 2021-08-25T07:04:15Z

We have a big data application heavily relying on the azure blob store to store the data. Unfortunately, we keep running into client-side data corruption issues when reading data from the blob store: sometimes, fewer bytes are returned than expected; sometimes the returned data has the correct length, but data is corrupt.

(I reported an issue in a similar vein already as #16723 . This was fixed in the meantime and has already substantially improved the situation for us.)

A potential cause for these issues are prematurely closed tcp connections: If a tcp connection is closed while in the middle of transferring the body of a http response, the azure-core library so far returned truncated data -- rather than raising an exception. This PR fixes this behavior by enabling strict checking of the http content-length. In case of a prematurely closed tcp connection, an exception is raised rather than returning the truncated body.
Note that the aiohttp backend already raised in such case, so no change was required for the aiohttp backend.

The fix boils down to setting urllib3.response.HTTPResponse.enforce_content_length at an appropriate time: when or directly after constructing the HTTPResponse object, but before reading the body.
I chose to do this in a requests hook, as:

setting it in a HTTPAdapter.send (e.g. by adding BiggerBlockSizeHTTPAdapter.send) would also be possible in principle, but it would not cover cases in which RequestsTransport cannot set the adapter as it does not own the session.
setting it after requests.Session.request returns would only cover the stream=True case: For stream=False, the body of the response was already fully read once requests.Session.request returns, so setting enforce_content_length would have no effect.

ghost · 2021-08-25T07:04:21Z

Thank you for your contribution jochen-ott-by! We will review the pull request and get back to you soon.

ghost · 2021-08-25T07:04:27Z

All CLA requirements met.

xiangyan99 · 2021-08-25T23:48:54Z

Thanks @jochen-ott-by for the feedback and contributing.

Could you give us more details about your scenarios? e.g. I guess the failure only matters in streaming scenario because if we fail to read the body, we will raise exceptions when deserializing it? I ask this because this will help us scope the fix. We need to evaluate if "enforce_content_length" should be an always-on setting by default or we can only enable it in some situations.

I also find https://github.com/streamlink/streamlink/pull/3768/files. Seems like that can also solve the problem, right?

Thanks.

jochen-ott-by · 2021-09-13T12:34:50Z

Sorry for the late reply, I was on vacation.

Thanks @jochen-ott-by for the feedback and contributing.

Could you give us more details about your scenarios?

The scenario is reading the content of many (tens of thousands) blobs from the azure blob store in a big data application. I believe this always uses stream=True (but I'm actually not completely sure that's true).

e.g. I guess the failure only matters in streaming scenario because if we fail to read the body, we will raise exceptions when deserializing it?

I do not think this is true in general.
This really depends on the serialization format: some formats can reliably detect truncated data, but many formats do not. In particular, for reading "raw blobs", there is (in general) no notion of a "deserialization", so there is no step that would reliably raise an exception. To make it more concrete, think of a small helper tool that copies data (as in: raw blobs) from one blob storage technology to another. For such an application, there is no deserialization that would ever raise.

I ask this because this will help us scope the fix. We need to evaluate if "enforce_content_length" should be an always-on setting by default or we can only enable it in some situations.

I think the sane thing to do here is to always enable it. Note that the aiohttp-based async client already does this, so in this sense, this just brings the requests-based implementation on par with the aiohttp implementation.

I also find https://github.com/streamlink/streamlink/pull/3768/files. Seems like that can also solve the problem, right?

This seems to also address the same issue, yes. However, it works by monkey-patching third-party (urllib3 / requests) classes. Monkeypatching does not compose well: if several libraries to "patch" via such a mechanism, at most one of them will succeed. Also, there might be (subtle) incompatibilities, e.g. if other parts of the application rely on the unpatched behavior.
I would therefore strongly advice against using monkeypatching in a library like azure-core that is typically part of a much larger applications, and azure-core should avoid making too many assumptions about these applications. (Monkeypatching can be acceptable for self-contained applications such as streamlink, as they have different trade-offs than libraries.)

xiangyan99 · 2021-09-15T16:42:28Z

@jochen-ott-by Thanks for the information. We will have a discussion with our architect how to add the check.

xiangyan99 · 2021-09-15T22:45:19Z

@jochen-ott-by in terms of "Note that the aiohttp-based async client already does this", could you help point me where aiohttp implement this?

It looks to me aiohttp only check length when we call readexactly.

https://github.com/aio-libs/aiohttp/blob/master/aiohttp/streams.py#L425-L438

Did I miss anything?

jochen-ott-by · 2021-09-16T05:46:48Z

@jochen-ott-by in terms of "Note that the aiohttp-based async client already does this", could you help point me where aiohttp implement this?

This PR adds a test that shows the behavior I described:
https://github.com/Azure/azure-sdk-for-python/pull/20412/files#diff-524251c35211f89b7517d30ac4f3eb5375189356582eed96d3e23c50769c6868R47

In aiohttp code, this check is quite deep, but done somewhere here:
https://github.com/aio-libs/aiohttp/blob/09ac1cbb4c07b4b3e1872374808512c9bc52f2ab/aiohttp/http_parser.py#L707

This code path is active in case "length" is set, which is usually the case if the "Content-Length" header was set; see
https://github.com/aio-libs/aiohttp/blob/09ac1cbb4c07b4b3e1872374808512c9bc52f2ab/aiohttp/http_parser.py#L319-L335

It looks to me aiohttp only check length when we call readexactly.

https://github.com/aio-libs/aiohttp/blob/master/aiohttp/streams.py#L425-L438

Did I miss anything?

readexactly and content-length checking are two very different things: the former is an API to get data, the latter is a protocol level check. So they have not much to do with each other.

jochen-ott-by · 2021-09-27T07:32:29Z

Is there any update here? I can add that we did deploy this change in production, it did solve the issues we observed before. I therefore believe this is a robust fix that really improves real-world usage.

I again want to remind you that without this fix, we effectively have silent data corruption when accessing data from the azure blob store via this library. I therefore think it deserves some priority.

xiangyan99 · 2021-09-27T20:18:17Z

@jochen-ott-by Thank for the follow up.

Your PR looks great.

Well, we need some small changes based on your PR because we want to unify the error and make it inherit from AzureError to be less breaking.

The new PR can be found at #20888.

We will keep you posted.

Thanks again for your contribution.

jochen-ott-by · 2021-09-28T05:35:25Z

Thanks for the update @xiangyan99 !

Integrating this properly with error reporting of course makes sense. This means we can close this PR now in favor of the new one you just created, #20888, so let's continue any discussions / code review comments "over there".

jochen-ott-by requested review from lmazuel and xiangyan99 as code owners August 25, 2021 07:04

ghost added Azure.Core customer-reported Issues that are reported by GitHub users external to the Azure organization. labels Aug 25, 2021

jochen-ott-by force-pushed the strict-content-length-checking branch from 29bfa61 to 1d9ebd8 Compare August 25, 2021 08:11

[Core] Enable strict checking of http content-length

58088c0

jochen-ott-by force-pushed the strict-content-length-checking branch from 1d9ebd8 to 58088c0 Compare September 14, 2021 05:32

jochen-ott-by closed this Sep 28, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Core] Enable strict checking of http content-length #20412

[Core] Enable strict checking of http content-length #20412

jochen-ott-by commented Aug 25, 2021

ghost commented Aug 25, 2021

ghost commented Aug 25, 2021 •

edited by ghost

Loading

xiangyan99 commented Aug 25, 2021

jochen-ott-by commented Sep 13, 2021 •

edited

Loading

xiangyan99 commented Sep 15, 2021

xiangyan99 commented Sep 15, 2021

jochen-ott-by commented Sep 16, 2021

jochen-ott-by commented Sep 27, 2021 •

edited

Loading

xiangyan99 commented Sep 27, 2021

jochen-ott-by commented Sep 28, 2021

[Core] Enable strict checking of http content-length #20412

[Core] Enable strict checking of http content-length #20412

Conversation

jochen-ott-by commented Aug 25, 2021

ghost commented Aug 25, 2021

ghost commented Aug 25, 2021 • edited by ghost Loading

xiangyan99 commented Aug 25, 2021

jochen-ott-by commented Sep 13, 2021 • edited Loading

xiangyan99 commented Sep 15, 2021

xiangyan99 commented Sep 15, 2021

jochen-ott-by commented Sep 16, 2021

jochen-ott-by commented Sep 27, 2021 • edited Loading

xiangyan99 commented Sep 27, 2021

jochen-ott-by commented Sep 28, 2021

ghost commented Aug 25, 2021 •

edited by ghost

Loading

jochen-ott-by commented Sep 13, 2021 •

edited

Loading

jochen-ott-by commented Sep 27, 2021 •

edited

Loading