Feature/bugfix: make the HTTP client able to return HTTP chunks when chunked transfer encoding is used #2150
Conversation
I still don't like the feature.
This was already discussed. This is not going to happen, not in this way. I see two possible solutions:
The solution should not affect users who do not care about transfer encoding.
Codecov Report
```
@@            Coverage Diff            @@
##           master    #2150    +/-   ##
========================================
+ Coverage   97.15%    97.2%   +0.04%
========================================
  Files          39       39
  Lines        7913     7945      +32
  Branches     1371     1378       +7
========================================
+ Hits         7688     7723      +35
+ Misses        100       98       -2
+ Partials      125      124       -1
```
Continue to review full report at Codecov.
@asvetlov The "attach" endpoint triggers an HTTP Upgrade, so this is not about chunked encoding.
@fafhrd91 The solution I propose is not the same as the one Anvil implemented earlier; the current version of the C parser uses your second solution.
With your implementation I could write a very simple script that would flood server memory, and the application would not be able to react.
As for impact, everyone would suddenly need to care about transfer encoding.
If the script you suggest consists of sending huge chunk sizes, the problem seems to be exactly the same as users sending huge Content-Length values, isn't it? I guess it should be handled the same way?
With content-length the application has full control over how to handle huge payloads: it can close the connection, it can pause the connection, etc. With this patch, the application cannot make any decision. I think this is the biggest problem.
I like the suggested approach of emitting special tokens at the end of a chunk if an opt-in parameter is passed.
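For context on the exchange above: in chunked transfer encoding every chunk is prefixed by its size in hex, so the announced size is peer-controlled, just like a Content-Length value. Here is a minimal, hypothetical decoder (not aiohttp code) illustrating the framing being discussed:

```python
# Minimal sketch of HTTP chunked-transfer framing. A hypothetical
# helper, not aiohttp code: it splits a complete chunked body into
# its HTTP chunks. A peer announcing a huge chunk-size line forces
# the reader to decide whether to buffer, pause, or abort -- the
# same decision it faces with a huge Content-Length.

def decode_chunked(raw: bytes) -> list:
    """Split a chunked-encoded body into its HTTP chunks."""
    chunks = []
    pos = 0
    while True:
        eol = raw.index(b"\r\n", pos)
        size = int(raw[pos:eol], 16)       # chunk-size line, hex
        if size == 0:                      # last-chunk marker
            break
        start = eol + 2
        chunks.append(raw[start:start + size])
        pos = start + size + 2             # skip the trailing CRLF
    return chunks

body = b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n"
print(decode_chunked(body))  # [b'Wiki', b'pedia']
```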
I don't think we can make it in a backward-compatible way; the payload object is a byte stream. We can make the payload configurable and use something like DataQueue.
Or we can just break
I don't care about backward compatibility of
ok, good. @jlacoline would you like to rewrite your PR?
@fafhrd91 to be clear: the proposal is rewriting
yes, that is right. maybe it should return a tuple: (data, eof_chunk)
btw do we need the parameter at all if we do not care about backward compatibility?
I think no, the method behavior change is OK with me.
ok, good. then +1 for the new proposal
I'm not sure about returning a tuple.
changing the type of the returned data doesn't seem very good, but that is a matter of taste

```python
while 1:
    data, end_of_chunk = yield from payload.readchunk()
    buffer += data
    if end_of_chunk:
        decode_json_and_do_something(buffer)
```

as opposed to:

```python
while 1:
    data = yield from payload.readchunk()
    if data is END_OF_CHUNK:
        decode_json_and_do_something(buffer)
    else:
        buffer += data
```

it seems the documentation would be simpler with a tuple, as we would only need to describe each field of the tuple.
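The tuple protocol sketched above can be simulated end to end without aiohttp. The fake reader below is hypothetical and only mimics the proposed `(data, end_of_chunk)` return value of `StreamReader.readchunk`, reassembling one JSON document per HTTP chunk:

```python
import asyncio
import json

# Toy simulation of the (data, end_of_chunk) protocol discussed above.
# FakeChunkReader replays pre-recorded reads; it is a stand-in for the
# proposed StreamReader.readchunk behaviour, not aiohttp code.

class FakeChunkReader:
    def __init__(self, reads):
        self._reads = list(reads)  # [(bytes, end_of_chunk_flag), ...]

    async def readchunk(self):
        if not self._reads:
            return (b"", True)  # EOF sentinel for this toy
        return self._reads.pop(0)

async def collect_json(reader):
    """Decode one JSON document per completed HTTP chunk."""
    docs, buffer = [], b""
    while True:
        data, end_of_chunk = await reader.readchunk()
        buffer += data
        if end_of_chunk and buffer:
            docs.append(json.loads(buffer))
            buffer = b""
        if not data and end_of_chunk:
            break
    return docs

reader = FakeChunkReader([
    (b'{"id"', False), (b": 1}", True),   # one chunk split over two reads
    (b'{"id": 2}', True),
])
print(asyncio.run(collect_json(reader)))  # [{'id': 1}, {'id': 2}]
```

Note that the consumer never needs a sentinel comparison: the boolean flag rides alongside the data, which is what makes the tuple variant easier to document.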
I am OK with this new logic; it seems to be a good compromise between usability and security.
After careful thinking I agree that
I updated the code to make
Will review in a few hours. |
Please reflect your changes in docs/client_reference.rst.
Use the .. versionchanged:: 2.3 sphinx tag.
Also please add the missing tests.
I highly recommend installing https://docs.codecov.io/v4.3.6/docs/browser-extension -- it shows coverage results right in the GitHub "Files changed" tab.
aiohttp/streams.py (Outdated)
```python
class ChunkTupleAsyncStreamIterator(AsyncStreamIterator):
    @asyncio.coroutine
    def __anext__(self):
        try:
```
Please add tests covering these lines.
done
```diff
     Python-3.5 available for Python 3.5+ only
     """
-    return AsyncStreamIterator(self.readchunk)
+    return ChunkTupleAsyncStreamIterator(self.readchunk)
```
Add a test for the call.
done
aiohttp/streams.py (Outdated)
```python
elif self._http_chunk_splits is not None:
    if self._http_chunk_splits:
        pos = self._http_chunk_splits[0]
        if pos > self._cursor:
```
Add a test for the case where `pos > self._cursor` is false.
FTR this condition was here because I was afraid of what would happen if calls other than readchunk were made before.
So I made a small modification to the condition here and added tests for those situations, covering cases where stream.unread_data and stream.read are used before stream.readchunk.
aiohttp/streams.py (Outdated)
```python
if self._exception is not None:
    raise self._exception

if not self._buffer and not self._eof:
    # end of http chunk without available data
    if self._http_chunk_splits and \
```
Replace backslash with brackets.
done
```diff
@@ -547,18 +548,17 @@ def feed_data(self, chunk, SEP=b'\r\n', CHUNK_EXT=b';'):
     required = self._chunk_size
     chunk_len = len(chunk)

-    if required >= chunk_len:
+    if required > chunk_len:
```
Please explain why you replaced `>=` with `>`.
If `required == chunk_len`, then `self._chunk_size == 0` and the "else" part does what is needed.
So lines 552 and 553 looked like duplicated code, and deduplicating them allowed me to put `self.payload.end_http_chunk_receiving()` in only one place in the code.
But the code will continue after the "else", as opposed to the previous version where `(False, None)` was returned immediately, so the behaviour is not exactly the same even if the same result is eventually returned.
So IMO it looks better, but if you want me to revert to the previous version, tell me.
Also I'm going to remove the useless blank line (554).
ok, fine with the change
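The bookkeeping being discussed can be sketched as a stand-alone function (hypothetical helper, not the actual aiohttp source). `required` is how many body bytes of the current HTTP chunk are still expected; with the strict `>` comparison, the `required == chunk_len` case falls through to the branch that also closes the chunk, which is the deduplication described above:

```python
# Hypothetical sketch of the chunk-size accounting in feed_data.
# Not the aiohttp source -- just the control flow under discussion.

def feed_chunk_body(required, chunk):
    """Return (consumed_bytes, remaining_required, chunk_complete)."""
    chunk_len = len(chunk)
    if required > chunk_len:
        # the chunk body continues in a later network read
        return chunk, required - chunk_len, False
    # the chunk body ends inside this read; this branch also
    # covers required == chunk_len, which `>=` used to split out
    return chunk[:required], 0, True

print(feed_chunk_body(10, b"hello"))  # (b'hello', 5, False)
print(feed_chunk_body(5, b"hello"))   # (b'hello', 0, True)
print(feed_chunk_body(3, b"hello"))   # (b'hel', 0, True)
```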
Thanks for the review. I updated my branch according to your feedback.
Good. Please drop a message when you are ready for the next review.
@asvetlov ready for the next review
@asvetlov pinging you again in case you didn't see the last message
lgtm, thanks
What do these changes do?
These changes make the StreamReader.readchunk method return HTTP chunks when Transfer-Encoding: chunked is used.
The ._buffer attribute of StreamReader now contains the actual HTTP chunks instead of the TCP chunks received by the server (only if chunked encoding is used; otherwise it is the same as before).
Are there changes in behavior for the user?
Previous behaviour: in all cases, StreamReader.readchunk returns the oldest chunk, the chunks being as received by the server (i.e. TCP chunks).
New behaviour: if chunked encoding is used, StreamReader.readchunk returns the oldest HTTP chunk. Otherwise it returns the oldest TCP chunk.
Why?
The current implementation of the HTTP client does not take into account how the payload is split when chunked encoding is used. The actual chunks that are stored and returned to the user correspond to TCP chunks, which are virtually meaningless in HTTP communication.
This leads to problems if someone needs to retrieve the HTTP chunks as sent by the server: for example issue #1899 (note that I work in the same team as @Anvil and experience the same problems with Docker response parsing).
When reading the documentation, it seems that the iter_chunks generator was intended to yield HTTP chunks and not TCP chunks (related issue: #1949).
As @asvetlov argued here, the way the payload is chunked is a transport problem and should be transparent to the user.
However, some HTTP servers consider the HTTP chunks they send as atomic blocks of information that have a meaning by themselves (at least Docker does, but I suppose they are not the only ones to have made that choice for data streaming). More precisely, Docker streamed responses are chunks of JSON objects, and the JSON decoding is done once for each chunk, so in this scenario preserving chunk integrity is mandatory.
From my understanding, allowing retrieval of HTTP chunks is a quite common feature for HTTP clients (Docker clients use the requests package in the Python version and the native client in the Go version).
How?
The C parser fires the signals on_chunk_header and on_chunk_complete at the right moments, but no handler was defined to act on those signals, so I added two handlers in _http_parser.pyx. Those two handlers basically tell the StreamReader payload when to start and stop storing chunks.
I modified the Python parser to reflect this behaviour.
Checklist
- CONTRIBUTORS.txt updated
- Add a new news fragment into the changes folder:
  - name it <issue_id>.<type>, for example 588.bugfix
  - if you don't have an issue_id, change it to the PR id after creating the PR
  - ensure the type is one of the following:
    - .feature: Signifying a new feature.
    - .bugfix: Signifying a bug fix.
    - .doc: Signifying a documentation improvement.
    - .removal: Signifying a deprecation or removal of public API.
    - .misc: A ticket has been closed, but it is not of interest to users.