[Core] Enable strict checking of http content-length #20412
Thank you for your contribution jochen-ott-by! We will review the pull request and get back to you soon. |
Thanks @jochen-ott-by for the feedback and the contribution. Could you give us more details about your scenarios? For example, I guess the failure only matters in the streaming scenario, because if we fail to read the body, we will raise exceptions when deserializing it? I ask because this will help us scope the fix. We need to evaluate whether "enforce_content_length" should be always on by default or enabled only in some situations. I also found https://github.com/streamlink/streamlink/pull/3768/files. It seems that can also solve the problem, right? Thanks. |
Sorry for the late reply, I was on vacation.
The scenario is reading the content of many (tens of thousands) blobs from the azure blob store in a big data application. I believe this always uses stream=True (but I'm actually not completely sure that's true).
I do not think this is true in general.
I think the sane thing to do here is to always enable it. Note that the aiohttp-based async client already does this, so in this sense, this just brings the requests-based implementation on par with the aiohttp implementation.
This seems to also address the same issue, yes. However, it works by monkey-patching third-party (urllib3 / requests) classes. Monkey-patching does not compose well: if several libraries try to "patch" via such a mechanism, at most one of them will succeed. Also, there might be (subtle) incompatibilities, e.g. if other parts of the application rely on the unpatched behavior. |
@jochen-ott-by Thanks for the information. We will discuss with our architect how to add the check. |
@jochen-ott-by regarding "Note that the aiohttp-based async client already does this", could you point me to where aiohttp implements this? It looks to me like aiohttp only checks the length when we call https://github.com/aio-libs/aiohttp/blob/master/aiohttp/streams.py#L425-L438. Did I miss anything? |
This PR adds a test that shows the behavior I described: In aiohttp code, this check is quite deep, but done somewhere here: This code path is active in case "length" is set, which is usually the case if the "Content-Length" header was set; see
|
Is there any update here? I can add that we deployed this change in production, and it solved the issues we observed before. I therefore believe this is a robust fix that really improves real-world usage. I again want to remind you that without this fix, we effectively have silent data corruption when accessing data from the azure blob store via this library. I therefore think it deserves some priority. |
@jochen-ott-by Thanks for the follow-up. Your PR looks great. We need some small changes on top of your PR, because we want to unify the error and make it inherit from AzureError so that it is less breaking. The new PR can be found at #20888. We will keep you posted. Thanks again for your contribution. |
Thanks for the update @xiangyan99 ! Integrating this properly with error reporting of course makes sense. This means we can close this PR now in favor of the new one you just created, #20888, so let's continue any discussions / code review comments "over there". |
We have a big data application heavily relying on the azure blob store to store the data. Unfortunately, we keep running into client-side data corruption issues when reading data from the blob store: sometimes, fewer bytes are returned than expected; sometimes the returned data has the correct length, but data is corrupt.
(I reported an issue in a similar vein already as #16723 . This was fixed in the meantime and has already substantially improved the situation for us.)
A potential cause of these issues is prematurely closed TCP connections: if a TCP connection is closed in the middle of transferring the body of an HTTP response, the azure-core library so far returned truncated data instead of raising an exception. This PR fixes this behavior by enabling strict checking of the HTTP Content-Length. In case of a prematurely closed TCP connection, an exception is raised rather than returning the truncated body.
Note that the aiohttp backend already raises in such a case, so no change was required for the aiohttp backend.
The fix boils down to setting urllib3.response.HTTPResponse.enforce_content_length at an appropriate time: when or directly after constructing the HTTPResponse object, but before reading the body. I chose to do this in a requests hook, as:
- Setting it in the adapter (BiggerBlockSizeHTTPAdapter.send) would also be possible in principle, but it would not cover cases in which RequestsTransport cannot set the adapter as it does not own the session.
- Setting it after requests.Session.request returns would only cover the stream=True case: for stream=False, the body of the response was already fully read once requests.Session.request returns, so setting enforce_content_length would have no effect.
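
As a rough illustration of the hook-based approach described above, here is a minimal sketch. It is not the code from this PR: the names enforce_content_length_hook and install_hook are hypothetical, and the sketch assumes that the session passed in is a requests.Session (whose hooks attribute maps "response" to a list of callables) and that response.raw is a urllib3 HTTPResponse exposing enforce_content_length. requests dispatches response hooks before it reads the body, which is why the flag can still take effect for stream=False.

```python
def enforce_content_length_hook(response, *args, **kwargs):
    """Response hook: ask urllib3 to raise if the body is shorter than Content-Length.

    Assumption: `response.raw` is a urllib3 HTTPResponse. Anything else
    (e.g. a mocked response without that attribute) is left untouched.
    """
    raw = getattr(response, "raw", None)
    if raw is not None and hasattr(raw, "enforce_content_length"):
        raw.enforce_content_length = True
    return response


def install_hook(session):
    """Register the hook on a session-like object (hypothetical helper).

    Assumption: `session.hooks` is a dict with a "response" list, as on
    requests.Session. The check avoids registering the hook twice.
    """
    hooks = session.hooks.setdefault("response", [])
    if enforce_content_length_hook not in hooks:
        hooks.append(enforce_content_length_hook)
    return session


# Intended usage (requires the third-party `requests` package):
#
#     import requests
#     session = install_hook(requests.Session())
#     response = session.get(url)  # `url` is a placeholder
```

Because the hook lives on the session rather than on a transport adapter, it also applies when the caller supplies its own session, which is the case the first bullet above is concerned with.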