Fix crash on multipart/form-data post #1743

hubo1016 · 2017-03-23T13:22:42Z

What do these changes do?

When multipart/form-data is posted to an aiohttp server and processed by request.post(), any field that has neither a filename nor a "Content-Type" header makes the request crash on the content_type.startswith("text/") check, because content_type is None for such a field. Many browsers and tools generate this kind of post data.

Are there changes in behavior for the user?

No. There may be different opinions on whether to decode the data to a unicode string or leave it as bytes, but either is better than crashing.

Related issue number

Checklist

I think the code is well written
Unit tests for the changes exist
Documentation reflects the changes
If you provide code modification, please add yourself to CONTRIBUTORS.txt
- The format is <Name> <Surname>.
- Please keep alphabetical order, the file is sorted by names.
Add a new entry to CHANGES.rst
- Choose any open position to avoid merge conflicts with other PRs.
- Add a link to the issue you are fixing (if any) using #issue_number format at the end of changelog message. Use Pull Request number if there are no issues for PR or PR covers the issue only partially.

kxepal · 2017-03-23T14:04:45Z

aiohttp/web_request.py

@@ -409,7 +409,8 @@ def post(self):
                    out.add(field.name, ff)
                else:
                    value = yield from field.read(decode=True)
-                    if content_type.startswith('text/'):
+                    if content_type is None or \
+                            content_type.startswith('text/'):
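
A minimal sketch of the failure and of the patched condition, written outside aiohttp. The helper names `old_check` and `patched_check` are hypothetical and only mirror the two versions of the condition in the hunk above; they are not aiohttp functions.

```python
def old_check(content_type):
    # Pre-patch condition: assumes content_type is always a str.
    return content_type.startswith("text/")


def patched_check(content_type):
    # Condition from the hunk above: a missing Content-Type is treated as text.
    return content_type is None or content_type.startswith("text/")


# A field without a Content-Type header yields content_type = None.
try:
    old_check(None)
    crashed = False
except AttributeError:
    # 'NoneType' object has no attribute 'startswith'
    crashed = True

print(crashed)
print(patched_check(None))
print(patched_check("application/octet-stream"))
```

With the patched condition, a `None` content type is routed into the text branch instead of raising.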


I think in the case of None, you really cannot be sure whether it is safe to decode or not. Better to leave the data as is.

Yes, I was also thinking about this, but most post() use cases do not return bytes. When a user calls post(), he probably wants the same type returned for both multipart and url-encoded data.

If the user cares about raw data (bytes), he may call multipart() directly and process the post data himself.

When you post a form with fields like text boxes in a browser such as Firefox, e.g.

<form method="post" enctype="multipart/form-data">
  <input type="hidden" name="p1" value="v1"/>
  <input type="submit"/>
</form>

the browser usually does not set Content-Type for that subpart of the post.
Files are not affected by this commit.

I agree with @kxepal: we should not decode the data if we do not know the content type.
It would be very hard to reason about an exception if one is raised from this code.

It is a hard decision, and I have an open mind about this. There are three ways to handle this situation:

1. When no Content-Type is provided, assume it is a utf-8 string
2. When no Content-Type is provided, always keep it as bytes
3. When no Content-Type is provided, first try to parse it as a utf-8 string (with "strict"), and when an exception occurs, return the raw bytes
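
The three options above can be sketched as one function. This is only an illustration: `decode_field` and its `strategy` argument are hypothetical names, and utf-8 is hard-coded as the charset for simplicity, which is an assumption rather than what aiohttp does.

```python
def decode_field(raw, content_type, strategy="assume_utf8"):
    """Return the field value as str or bytes (hypothetical helper)."""
    if content_type is not None:
        # A declared type is handled as before: only text/* is decoded.
        return raw.decode("utf-8") if content_type.startswith("text/") else raw
    if strategy == "assume_utf8":
        # Option 1: always decode; invalid bytes raise UnicodeDecodeError.
        return raw.decode("utf-8")
    if strategy == "keep_bytes":
        # Option 2: never decode when the type is unknown.
        return raw
    if strategy == "try_utf8":
        # Option 3: decode strictly, fall back to the raw bytes on failure.
        try:
            return raw.decode("utf-8", errors="strict")
        except UnicodeDecodeError:
            return raw
    raise ValueError("unknown strategy: %r" % (strategy,))


print(decode_field(b"v1", None, "assume_utf8"))
print(decode_field(b"v1", None, "keep_bytes"))
print(decode_field(b"\xff", None, "try_utf8"))
```

Note that option 3 makes the return type depend on the payload, so the caller still has to handle both str and bytes.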

Each has its own advantages and disadvantages. I'm looking at the code that processes application/x-www-form-urlencoded data, and it is:

    data = yield from self.read()
    if data:
        charset = self.charset or 'utf-8'
        out.extend(
            parse_qsl(
                data.rstrip().decode(charset),
                encoding=charset))

Notice that this piece of code assumes the charset to be utf-8 when no charset is provided through the Content-Type header (notice that a %NN encoded character is really a byte). It always decodes the data into a string. So I suggest using the same strategy for the multipart/form-data format.
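
A standalone sketch of that url-encoded path, using only `urllib.parse.parse_qsl` from the standard library. The sample body is made up for illustration, and `request_charset` stands in for `self.charset`.

```python
from urllib.parse import parse_qsl

# Hypothetical url-encoded body; the %NN escapes are the utf-8 bytes of a
# two-character non-ASCII value.
data = b"p1=v1&name=%E4%BD%A0%E5%A5%BD\r\n"

request_charset = None  # no charset given in the Content-Type header
charset = request_charset or "utf-8"

# Same steps as the quoted code: strip, decode, then parse with the charset.
pairs = parse_qsl(data.rstrip().decode(charset), encoding=charset)

print(pairs)
print(all(isinstance(value, str) for _, value in pairs))
```

Every value comes back as str, which is the behavior the comment proposes to match for multipart fields.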

As I have said, a lot of browsers do not send Content-Type header for sub parts of the form data - in most times, they are indeed encoded into utf-8. There is nothing a developer can do about this. If multipart/form-data post data is parsed into bytes, a developer is forced to check the data type of every post() value if he wants to accept both formats. Deciding not to decode a bytes object is easy, but the user may be surprised to see that the return types for multipart/form-data and application/x-www-form-urlencoded are so different. And he would also have a hard time when some tools or browsers actually do provide the Content-Type header.

Once we reach a conclusion, maybe we should add it to the documentation.

> When no Content-Type is provided, always keep it as bytes

This will be fine in all cases. Browsers are just another kind of HTTP client with their own specifics.

> As I have said, a lot of browsers do not send Content-Type header for sub parts of the form data - in most times, they are indeed encoded into utf-8.

They actually do this for simple input fields, not file inputs. I'm worried about the "in most times" part of your post, but in any case, there is no reason to give browsers preferential treatment here.

Well, according to RFC 7578

4.4. Content-Type Header Field for Each Part

Each part MAY have an (optional) "Content-Type" header field, which
defaults to "text/plain". If the contents of a file are to be sent,
the file data SHOULD be labeled with an appropriate media type, if
known, or "application/octet-stream".

It really SHOULD be considered as "text/plain"... And if "text/plain" is decoded to unicode with utf-8 as the default encoding, it should be the same for content without a Content-Type header.
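
As an illustration of that default, here is a minimal sketch using Python's standard `email` package rather than aiohttp's multipart reader. The boundary `xyz` and the request body are made up; the part has a Content-Disposition header but no Content-Type header, like the hidden field in the form earlier in this thread.

```python
from email import message_from_bytes

# Hypothetical multipart/form-data body with a single field "p1".
raw = (
    b"Content-Type: multipart/form-data; boundary=xyz\r\n"
    b"\r\n"
    b"--xyz\r\n"
    b'Content-Disposition: form-data; name="p1"\r\n'
    b"\r\n"
    b"v1\r\n"
    b"--xyz--\r\n"
)

msg = message_from_bytes(raw)
part = msg.get_payload()[0]

header = part.get("Content-Type")       # the header itself is absent
effective = part.get_content_type()     # the parser falls back to a default
name = part.get_param("name", header="content-disposition")
value = part.get_payload(decode=True)   # raw bytes of the field

print(header, effective, name, value)
```

The standard library parser reports the part as text/plain even though no header was sent, which is the same default the RFC section describes.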

I also tested the simple HTML page with Firefox, Internet Explorer and Edge; they all send the text without a Content-Type header, even when the input field contains non-ASCII characters.

Anyway, if you do not change your mind, I don't mind changing the logic to what you are proposing.

@kxepal

FYI

https://tools.ietf.org/html/rfc7578#page-5

and also these chapters

5.1.2. Interpreting Forms and Creating multipart/form-data Data

Some applications of this specification will supply a character
encoding to be used for interpretation of the multipart/form-data
body. In particular, HTML 5 [W3C.REC-html5-20141028] uses

o the content of a "_charset_" field, if there is one;

o the value of an accept-charset attribute of the <form> element, if
there is one;

o the character encoding of the document containing the form, if it
is US-ASCII compatible;

o otherwise, UTF-8.

5.1.3. Parsing and Interpreting Form Data

While this specification provides guidance for the creation of
multipart/form-data, parsers and interpreters should be aware of the
variety of implementations. File systems differ as to whether and
how they normalize Unicode names, for example. The matching of form
elements to form-data parts may rely on a fuzzier match. In
particular, some multipart/form-data generators might have followed
the previous advice of [RFC2388] and used the "encoded-word" method
of encoding non-ASCII values, as described in [RFC2047]:

encoded-word = "=?" charset "?" encoding "?" encoded-text "?="

Others have been known to follow [RFC2231], to send unencoded UTF-8,
or even to send strings encoded in the form-charset.

For this reason, interpreting multipart/form-data (even from
conforming generators) may require knowing the charset used in form
encoding in cases where the _charset_ field value or a charset
parameter of a "text/plain" Content-Type header field is not
supplied.
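
A rough sketch of the fallback this passage implies for a parser: use a charset parameter on the part if present, then the special `_charset_` form field that RFC 7578 describes, then UTF-8. The function `pick_charset` is hypothetical, and the precedence between the first two sources is an assumption made for this example, not something the RFC text quoted here prescribes.

```python
def pick_charset(part_charset, form_fields, default="utf-8"):
    """Choose the charset used to decode a text field (hypothetical helper).

    part_charset: charset parameter of the part's Content-Type, or None.
    form_fields: mapping of already-parsed field names to str values.
    """
    if part_charset:
        # An explicit charset on the part itself is the most specific hint.
        return part_charset
    declared = form_fields.get("_charset_")
    if declared:
        # Form-wide hint sent by the client as a regular field.
        return declared
    # Nothing was supplied, so fall back to the default.
    return default


print(pick_charset(None, {}))
print(pick_charset(None, {"_charset_": "iso-8859-1"}))
print(pick_charset("latin-1", {"_charset_": "utf-8"}))
```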

Love RFC references! Thanks for them. I guess RFC 7578 section 4.4 is pretty clear about what to do in this case, so we can follow it.

@fafhrd91 are you OK with this as well?

fafhrd91 · 2017-03-24T15:39:56Z

@hubo1016 please add yourself to contributors list

hubo1016 · 2017-03-24T16:16:36Z

@fafhrd91 Done. Where and what should I add to CHANGES.rst?

fafhrd91 · 2017-03-24T16:17:23Z

add it to the 2.0 branch, I am planning to release 2.0.3 today

thanks!

fafhrd91 · 2017-03-24T16:30:59Z

@hubo1016 do not worry about the changelog, I will add the entry

lock · 2019-10-29T03:03:33Z

This thread has been automatically locked since there has not been
any recent activity after it was closed. Please open a new issue for
related bugs.

If you feel like there are important points made in this discussion,
please include those excerpts in that new issue.

assume multipart/form-data field as text

1438fc3

hubo1016 force-pushed the 2.0 branch 3 times, most recently from 1e4c3ba to f48499c Compare March 23, 2017 13:59

kxepal reviewed Mar 23, 2017

hubo1016 force-pushed the 2.0 branch 8 times, most recently from 3c618da to a6dae6d Compare March 23, 2017 15:00

Add unit test

e65091c

hubo1016 force-pushed the 2.0 branch from a6dae6d to e65091c Compare March 23, 2017 15:10

kxepal mentioned this pull request Mar 24, 2017

Add None type check for content_type #1748

Closed

5 tasks

Add "Hu Bo" to contributors

932b231

fafhrd91 merged commit a666d7f into aio-libs:2.0 Mar 24, 2017

lock bot added the outdated label Oct 29, 2019

lock bot locked as resolved and limited conversation to collaborators Oct 29, 2019
