Fix crash on multipart/form-data post #1743

Merged: 3 commits, Mar 24, 2017
CONTRIBUTORS.txt: 1 addition, 0 deletions
@@ -67,6 +67,7 @@ Georges Dubus
 Greg Holt
 Gregory Haynes
 Günther Jena
+Hu Bo
 Hugo Herter
 Igor Pavlov
 Ingmar Steen
aiohttp/web_request.py: 2 additions, 1 deletion
@@ -409,7 +409,8 @@ def post(self):
                     out.add(field.name, ff)
                 else:
                     value = yield from field.read(decode=True)
-                    if content_type.startswith('text/'):
+                    if content_type is None or \
+                            content_type.startswith('text/'):

Member

I think that when the content type is None, you really cannot be sure whether it is safe to decode. Better to leave the data as is.

Contributor Author

Yes, I have been thinking about this too, but in most use cases post() does not return bytes. A user who calls post() probably expects the same kind of value whether the data is multipart or URL-encoded.

Users who care about the raw data (bytes) can call multipart() directly and process the post data themselves.

Contributor Author @hubo1016, Mar 23, 2017

When you post a form with fields like textboxes in a browser like Firefox, e.g.

<form method="post" enctype="multipart/form-data">
  <input type="hidden" name="p1" value="v1"/>
  <input type="submit"/>
</form>

The browser usually does not set Content-Type for the subparts of the post.
Files are not affected by this commit.
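
To make the point about missing part headers concrete, here is a minimal sketch that parses a hand-written multipart body with Python's standard `email` package (not aiohttp's own multipart reader). The boundary `XYZ` and the raw bytes are made up for illustration and only model the form above; real browsers generate their own boundary strings.

```python
from email import message_from_bytes

# Hypothetical wire data for the form above: one text field, no
# Content-Type header on the part itself.
raw = (
    b'Content-Type: multipart/form-data; boundary="XYZ"\r\n'
    b'\r\n'
    b'--XYZ\r\n'
    b'Content-Disposition: form-data; name="p1"\r\n'
    b'\r\n'
    b'v1\r\n'
    b'--XYZ--\r\n'
)

message = message_from_bytes(raw)
part = message.get_payload()[0]

# The header is simply absent on the part...
explicit_type = part['Content-Type']
# ...while the stdlib parser falls back to its own default media type.
default_type = part.get_content_type()
field_name = part.get_param('name', header='content-disposition')
field_value = part.get_payload(decode=True)  # raw bytes of the field
```

Here `explicit_type` is `None`, which is the value that made `content_type.startswith('text/')` fail before this change.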

Member

I agree with @kxepal: we should not decode data if we do not know the content type.
It would be very hard to reason about an exception if one is raised from this code.

Contributor Author @hubo1016, Mar 24, 2017

It is a hard decision, and I have an open mind about it. There are three ways to handle this situation:

  1. When no Content-Type is provided, assume it is a utf-8 string
  2. When no Content-Type is provided, always keep it as bytes
  3. When no Content-Type is provided, first try to decode it as a utf-8 string (with "strict" error handling) and, if an exception occurs, return the raw bytes
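
The three options can be sketched as plain functions over the raw field bytes. This is only an illustration of the trade-off; the function names are hypothetical and none of them exist in aiohttp.

```python
def decode_assume_utf8(raw: bytes) -> str:
    # Option 1: always treat the field as UTF-8 text.
    # Raises UnicodeDecodeError on input that is not valid UTF-8.
    return raw.decode('utf-8')


def keep_bytes(raw: bytes) -> bytes:
    # Option 2: never decode; the caller always receives bytes.
    return raw


def decode_or_fallback(raw: bytes):
    # Option 3: try strict UTF-8 first and fall back to the raw bytes.
    # The return type then depends on the data, not on the headers.
    try:
        return raw.decode('utf-8', errors='strict')
    except UnicodeDecodeError:
        return raw
```

Option 3 never raises, but callers have to handle both `str` and `bytes` for the same field.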

Each has its own advantages and disadvantages. The code that processes application/x-www-form-urlencoded data looks like this:

            data = yield from self.read()
            if data:
                charset = self.charset or 'utf-8'
                out.extend(
                    parse_qsl(
                        data.rstrip().decode(charset),
                        encoding=charset))

Notice that this code assumes the charset is utf-8 when none is provided through the Content-Type header (note that a %NN-encoded character is really a byte), and it always decodes the data into a string. So I suggest using the same strategy for the multipart/form-data format.
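
As a standalone illustration of that URL-encoded behaviour, the sketch below applies the same default outside of aiohttp. `parse_urlencoded` is a hypothetical helper; only `urllib.parse.parse_qsl` is a real standard-library call.

```python
from urllib.parse import parse_qsl


def parse_urlencoded(body: bytes, charset=None):
    # Fall back to UTF-8 when the request did not declare a charset,
    # mirroring the `self.charset or 'utf-8'` line quoted above.
    charset = charset or 'utf-8'
    # %NN escapes are bytes on the wire; parse_qsl turns them into text
    # using the same charset, so every value comes back as str.
    return parse_qsl(body.rstrip().decode(charset), encoding=charset)


pairs = parse_urlencoded(b'a=b&c=%C3%A9\r\n')
```

Every value in `pairs` is a `str`, which is the consistency argument being made for the multipart path.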

As I have said, a lot of browsers do not send Content-Type header for sub parts of the form data - in most times, they are indeed encoded into utf-8. There is nothing a developer can do about this. If multipart/form-data post data is parsed into bytes, a developer is forced to check the data type of every post() value in order to accept both formats. Deciding not to decode a bytes object is easy, but users may be surprised to see that the return types for multipart/form-data and application/x-www-form-urlencoded are so different. They would also have a hard time when some tools or browsers actually do provide the Content-Type header.

Once we reach a conclusion, maybe we should add it to the documentation.

Member

> When no Content-Type is provided, always keep it as bytes

This will be fine in all cases. Browsers are just another kind of HTTP client with their own specifics.

> As I have said, a lot of browsers do not send Content-Type header for sub parts of the form data - in most times, they are indeed encoded into utf-8.

They actually do this for simple input fields, not file inputs. I'm worried about the "in most times" part of your post, but in any case, there is no reason here to give browsers any preference.

Contributor Author

Well, according to RFC 7578:

4.4. Content-Type Header Field for Each Part

Each part MAY have an (optional) "Content-Type" header field, which
defaults to "text/plain". If the contents of a file are to be sent,
the file data SHOULD be labeled with an appropriate media type, if
known, or "application/octet-stream".

So a part without a header really SHOULD be treated as "text/plain". And if "text/plain" is decoded to unicode with utf-8 as the default encoding, the same should apply to content without a Content-Type header.
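
The RFC default can be sketched with the standard library's header parsing. `effective_content_type` is a hypothetical helper, and the `'utf-8'` fallback charset is an assumption taken from the `default='utf-8'` argument in this PR's code, not from section 4.4 itself.

```python
from email.message import Message


def effective_content_type(raw_value):
    """Return (media_type, charset) for one part's Content-Type value.

    `raw_value` is the header string, or None when the part has no
    Content-Type header at all.
    """
    msg = Message()
    if raw_value is not None:
        msg['Content-Type'] = raw_value
    # Message.get_content_type() already defaults to text/plain when the
    # header is missing, which matches the RFC 7578 section 4.4 default.
    return msg.get_content_type(), msg.get_content_charset('utf-8')


missing = effective_content_type(None)
explicit = effective_content_type('text/plain; charset=latin-1')
binary = effective_content_type('application/octet-stream')
```

With this reading, a part without a header and a part labelled `text/plain` end up on the same decoding path.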

I also tested the simple HTML page with Firefox, Internet Explorer and Edge; they all send the text without a Content-Type header, even when the input field contains non-ASCII characters.

Anyway, if you do not change your mind, I don't mind changing the logic to what you are proposing.

@kxepal

Contributor Author

FYI

https://tools.ietf.org/html/rfc7578#page-5

and also these sections:

5.1.2. Interpreting Forms and Creating multipart/form-data Data

Some applications of this specification will supply a character
encoding to be used for interpretation of the multipart/form-data
body. In particular, HTML 5 [W3C.REC-html5-20141028] uses

o  the content of a "_charset_" field, if there is one;

o  the value of an accept-charset attribute of the <form> element, if
   there is one;

o  the character encoding of the document containing the form, if it
   is US-ASCII compatible;

o  otherwise, UTF-8.
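
The precedence in that list can be sketched as a small function. Both helper names are hypothetical, `accept_charset` is simplified to a single encoding name, and the ASCII-compatibility test is only a rough approximation, so treat this as an illustration of the ordering rather than a conforming implementation.

```python
def is_ascii_compatible(encoding: str) -> bool:
    # Rough check: every ASCII byte must decode to the same ASCII character.
    ascii_bytes = bytes(range(128))
    try:
        return ascii_bytes.decode(encoding) == ascii_bytes.decode('ascii')
    except (LookupError, UnicodeDecodeError):
        return False


def pick_form_charset(charset_field=None, accept_charset=None,
                      document_encoding=None) -> str:
    # 1. An explicit "_charset_" form field wins.
    if charset_field:
        return charset_field
    # 2. Then the form's accept-charset attribute.
    if accept_charset:
        return accept_charset
    # 3. Then the document encoding, if it is US-ASCII compatible.
    if document_encoding and is_ascii_compatible(document_encoding):
        return document_encoding
    # 4. Otherwise, UTF-8.
    return 'utf-8'
```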

5.1.3. Parsing and Interpreting Form Data

While this specification provides guidance for the creation of
multipart/form-data, parsers and interpreters should be aware of the
variety of implementations. File systems differ as to whether and
how they normalize Unicode names, for example. The matching of form
elements to form-data parts may rely on a fuzzier match. In
particular, some multipart/form-data generators might have followed
the previous advice of [RFC2388] and used the "encoded-word" method
of encoding non-ASCII values, as described in [RFC2047]:

  encoded-word = "=?" charset "?" encoding "?" encoded-text "?="

Others have been known to follow [RFC2231], to send unencoded UTF-8,
or even to send strings encoded in the form-charset.

For this reason, interpreting multipart/form-data (even from
conforming generators) may require knowing the charset used in form
encoding in cases where the charset field value or a charset
parameter of a "text/plain" Content-Type header field is not
supplied.

Member

Love RFC references! Thanks for them. I guess RFC 7578, section 4.4, gives pretty clear instructions on what to do in this case, so we can follow it.

@fafhrd91 are you OK with this as well?

Member

I am ok

                         charset = field.get_charset(default='utf-8')
                         value = value.decode(charset)
                     out.add(field.name, value)
tests/test_web_request.py: 22 additions, 0 deletions
@@ -335,6 +335,28 @@ def test_make_too_big_request_adjust_limit(loop):
     assert len(txt) == 1024**2 + 1
 
 
+@asyncio.coroutine
+def test_multipart_formdata(loop):
+    payload = StreamReader(loop=loop)
+    payload.feed_data(b"""-----------------------------326931944431359\r
+Content-Disposition: form-data; name="a"\r
+\r
+b\r
+-----------------------------326931944431359\r
+Content-Disposition: form-data; name="c"\r
+\r
+d\r
+-----------------------------326931944431359--\r\n""")
+    content_type = "multipart/form-data; boundary="\
+        "---------------------------326931944431359"
+    payload.feed_eof()
+    req = make_mocked_request('POST', '/',
+                              headers={'CONTENT-TYPE': content_type},
+                              payload=payload)
+    result = yield from req.post()
+    assert dict(result) == {'a': 'b', 'c': 'd'}
+
+
 @asyncio.coroutine
 def test_make_too_big_request_limit_None(loop):
     payload = StreamReader(loop=loop)