-
-
Notifications
You must be signed in to change notification settings - Fork 30.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[2.7] bpo-31677: Backport regex used to match encoded-word strings #7856
Conversation
Hello, and thanks for your contribution! I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA). Unfortunately we couldn't find an account corresponding to your GitHub username on bugs.python.org (b.p.o) to verify you have signed the CLA (this might be simply due to a missing "GitHub Name" entry in your b.p.o account settings). This is necessary for legal reasons before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue. When your account is ready, please add a comment in this pull request Thanks again for your contribution, we look forward to reviewing it! |
Hi @hloeung and thank you for your contribution! Do you mind modifying the Pull Request title to include the issue in the bug tracker that this PR is adressing? You can find more information here: https://devguide.python.org/pullrequest/#submitting If there is no issue created you can open a new one al add that number to the PR title. Also, notice that the CI is failing right now with this error:
Notice that to avoid waiting for the CI to fail in the PR you can run your tests locally. For example:
|
As this issue has been waiting for additional information for a long time, I am going to close it. Feel free to re-open it if @pablogsal 's concerns are addressed. |
Sorry, I somehow missed the notification for more/additional information about this. |
@pablogsal @csabella Okay, addressed @pablogsal 's concerns. Any chance we could re-open this? |
@hloeung That's OK. It looks like you're manually backporting a fix from master to 2.7. I'll see if the original reviewers can take a look at this for you. |
Lib/email/header.py
Outdated
@@ -35,12 +35,11 @@ | |||
=\? # literal =? | |||
(?P<charset>[^?]*?) # non-greedy up to the next ? is the charset | |||
\? # literal ? | |||
(?P<encoding>[qb]) # either a "q" or a "b", case insensitive |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm OK for this part, but...
\? # literal ? | ||
(?P<encoded>.*?) # non-greedy up to the next ?= is the encoded string | ||
\?= # literal ?= | ||
(?=[ \t]|$) # whitespace or the end of the string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know this part and changes for tests are OK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah sorry, this PR was a while back. This is really the issue here but was hidden when I copied the regex used in Python 3.x. Instead, I've reverted the noise so it's clear that it is this.
This pull request seems reverting this commit. The commit fixed this issue. |
Apologies, it's been a while since I originally created this PR. Basically, the original PR contained a regex that was copied across from Python 3.x. It's adding to the additional noise and hiding the real fix. |
Now this PR is totaly unrelating to bpo-31677. |
The current regex used in 2.7 doesn't really handle encoded strings with multiple parts. This backports the regex in Python 3.7 which does a better job.
Example string:
With the curret regex:
Python 2.7.14 (default, Sep 23 2017, 22:06:14)
>>> import re
>>> line = "=?gb2312?B?UmU6ILTwuLQ6IFtTdGVsbGEtVFBFXSC08Li0OiBbSFAtU3RlbGxhXSBTa3ly?==?gb2312?Q?ay_related_issue?="
>>> # Match encoded-word strings in the form =?charset?q?Hello_World?=
... ecre = re.compile(r'''
... =\? # literal =?
... (?P<charset>[^?]*?) # non-greedy up to the next ? is the charset
... \? # literal ?
... (?P<encoding>[qb]) # either a "q" or a "b", case insensitive
... \? # literal ?
... (?P<encoded>.*?) # non-greedy up to the next ?= is the encoded string
... \?= # literal ?=
... (?=[ \t]|$) # whitespace or the end of the string
... ''', re.VERBOSE | re.IGNORECASE | re.MULTILINE)
>>> print(ecre.split(line))
['', 'gb2312', 'B', 'UmU6ILTwuLQ6IFtTdGVsbGEtVFBFXSC08Li0OiBbSFAtU3RlbGxhXSBTa3ly?==?gb2312?Q?ay_related_issue', '']
decode_header() then fails because it's unable to email.base64mime.decode() 'UmU6ILTwuLQ6IFtTdGVsbGEtVFBFXSC08Li0OiBbSFAtU3RlbGxhXSBTa3ly?==?gb2312?Q?ay_related_issue'
With the regex from Python 3.7:
Python 2.7.14 (default, Sep 23 2017, 22:06:14)
>>> import re
>>> line = "=?gb2312?B?UmU6ILTwuLQ6IFtTdGVsbGEtVFBFXSC08Li0OiBbSFAtU3RlbGxhXSBTa3ly?==?gb2312?Q?ay_related_issue?="
>>> # Match encoded-word strings in the form =?charset?q?Hello_World?=
... ecre = re.compile(r'''
... =\? # literal =?
... (?P<charset>[^?]*?) # non-greedy up to the next ? is the charset
... \? # literal ?
... (?P<encoding>[qQbB]) # either a "q" or a "b", case insensitive
... \? # literal ?
... (?P<encoded>.*?) # non-greedy up to the next ?= is the encoded string
... \?= # literal ?=
... ''', re.VERBOSE | re.MULTILINE)
>>> print(ecre.split(line))
['', 'gb2312', 'B', 'UmU6ILTwuLQ6IFtTdGVsbGEtVFBFXSC08Li0OiBbSFAtU3RlbGxhXSBTa3ly', '', 'gb2312', 'Q', 'ay_related_issue', '']
It's correctly broken into a base64 part, email.base64mime.decode(), as well as a quoted-printable part for email.quoprimime().
https://bugs.python.org/issue31677