Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[2.7] bpo-31677: Backport regex used to match encoded-word strings #7856

Closed
wants to merge 5 commits into from
Closed

Conversation

hloeung
Copy link

@hloeung hloeung commented Jun 22, 2018

The current regex used in 2.7 doesn't really handle encoded strings with multiple parts. This backports the regex in Python 3.7 which does a better job.

Example string:

=?gb2312?B?UmU6ILTwuLQ6IFtTdGVsbGEtVFBFXSC08Li0OiBbSFAtU3RlbGxhXSBTa3ly?==?gb2312?Q?ay_related_issue?=

With the curret regex:

Python 2.7.14 (default, Sep 23 2017, 22:06:14)
>>> import re
>>> line = "=?gb2312?B?UmU6ILTwuLQ6IFtTdGVsbGEtVFBFXSC08Li0OiBbSFAtU3RlbGxhXSBTa3ly?==?gb2312?Q?ay_related_issue?="
>>> # Match encoded-word strings in the form =?charset?q?Hello_World?=
... ecre = re.compile(r'''
... =\? # literal =?
... (?P<charset>[^?]*?) # non-greedy up to the next ? is the charset
... \? # literal ?
... (?P<encoding>[qb]) # either a "q" or a "b", case insensitive
... \? # literal ?
... (?P<encoded>.*?) # non-greedy up to the next ?= is the encoded string
... \?= # literal ?=
... (?=[ \t]|$) # whitespace or the end of the string
... ''', re.VERBOSE | re.IGNORECASE | re.MULTILINE)
>>> print(ecre.split(line))
['', 'gb2312', 'B', 'UmU6ILTwuLQ6IFtTdGVsbGEtVFBFXSC08Li0OiBbSFAtU3RlbGxhXSBTa3ly?==?gb2312?Q?ay_related_issue', '']

decode_header() then fails because it's unable to email.base64mime.decode() 'UmU6ILTwuLQ6IFtTdGVsbGEtVFBFXSC08Li0OiBbSFAtU3RlbGxhXSBTa3ly?==?gb2312?Q?ay_related_issue'

With the regex from Python 3.7:

Python 2.7.14 (default, Sep 23 2017, 22:06:14)
>>> import re
>>> line = "=?gb2312?B?UmU6ILTwuLQ6IFtTdGVsbGEtVFBFXSC08Li0OiBbSFAtU3RlbGxhXSBTa3ly?==?gb2312?Q?ay_related_issue?="
>>> # Match encoded-word strings in the form =?charset?q?Hello_World?=
... ecre = re.compile(r'''
... =\? # literal =?
... (?P<charset>[^?]*?) # non-greedy up to the next ? is the charset
... \? # literal ?
... (?P<encoding>[qQbB]) # either a "q" or a "b", case insensitive
... \? # literal ?
... (?P<encoded>.*?) # non-greedy up to the next ?= is the encoded string
... \?= # literal ?=
... ''', re.VERBOSE | re.MULTILINE)
>>> print(ecre.split(line))
['', 'gb2312', 'B', 'UmU6ILTwuLQ6IFtTdGVsbGEtVFBFXSC08Li0OiBbSFAtU3RlbGxhXSBTa3ly', '', 'gb2312', 'Q', 'ay_related_issue', '']

It's correctly broken into a base64 part, email.base64mime.decode(), as well as a quoted-printable part for email.quoprimime().

https://bugs.python.org/issue31677

@the-knights-who-say-ni
Copy link

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA).

Unfortunately we couldn't find an account corresponding to your GitHub username on bugs.python.org (b.p.o) to verify you have signed the CLA (this might be simply due to a missing "GitHub Name" entry in your b.p.o account settings). This is necessary for legal reasons before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue.

When your account is ready, please add a comment in this pull request
and a Python core developer will remove the CLA not signed label
to make the bot check again.

Thanks again for your contribution, we look forward to reviewing it!

@pablogsal
Copy link
Member

Hi @hloeung and thank you for your contribution!

Do you mind modifying the Pull Request title to include the issue in the bug tracker that this PR is adressing? You can find more information here:

https://devguide.python.org/pullrequest/#submitting

If there is no issue created you can open a new one al add that number to the PR title.

Also, notice that the CI is failing right now with this error:

test test_email_renamed failed -- Traceback (most recent call last):
  File "/home/travis/build/python/cpython/Lib/email/test/test_email_renamed.py", line 1589, in test_rfc2047_missing_whitespace
    self.assertEqual(dh, [(s, None)])
AssertionError: Lists differ: [('Sm', None), ('\xf6', 'iso-8... != [('Sm=?ISO-8859-1?B?9g==?=rg=?...
First differing element 0:
('Sm', None)
('Sm=?ISO-8859-1?B?9g==?=rg=?ISO-8859-1?B?5Q==?=sbord', None)
First list contains 4 additional elements.
First extra element 1:
('\xf6', 'iso-8859-1')
+ [('Sm=?ISO-8859-1?B?9g==?=rg=?ISO-8859-1?B?5Q==?=sbord', None)]
- [('Sm', None),
-  ('\xf6', 'iso-8859-1'),
-  ('rg', None),
-  ('\xe5', 'iso-8859-1'),
-  ('sbord', None)]

Notice that to avoid waiting for the CI to fail in the PR you can run your tests locally. For example:

./configure --with-pydebug && make && make test

@Mariatta Mariatta changed the title Backport regex used to match encoded-word strings [2.7] Backport regex used to match encoded-word strings Jul 13, 2018
@csabella
Copy link
Contributor

As this issue has been waiting for additional information for a long time, I am going to close it. Feel free to re-open it if @pablogsal 's concerns are addressed.

@csabella csabella closed this May 11, 2019
@hloeung
Copy link
Author

hloeung commented May 18, 2019

Sorry, I somehow missed the notification for more/additional information about this.

@hloeung hloeung changed the title [2.7] Backport regex used to match encoded-word strings [2.7] bpo-31677: Backport regex used to match encoded-word strings May 18, 2019
@hloeung
Copy link
Author

hloeung commented May 18, 2019

@pablogsal @csabella Okay, addressed @pablogsal 's concerns. Any chance we could re-open this?

@csabella
Copy link
Contributor

@hloeung That's OK. It looks like you're manually backporting a fix from master to 2.7. I'll see if the original reviewers can take a look at this for you.

@csabella csabella reopened this May 18, 2019
@csabella csabella requested review from methane and bitdancer May 18, 2019 01:59
@@ -35,12 +35,11 @@
=\? # literal =?
(?P<charset>[^?]*?) # non-greedy up to the next ? is the charset
\? # literal ?
(?P<encoding>[qb]) # either a "q" or a "b", case insensitive
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK for this part, but...

\? # literal ?
(?P<encoded>.*?) # non-greedy up to the next ?= is the encoded string
\?= # literal ?=
(?=[ \t]|$) # whitespace or the end of the string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know this part and changes for tests are OK.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry, this PR was a while back. This is really the issue here but was hidden when I copied the regex used in Python 3.x. Instead, I've reverted the noise so it's clear that it is this.

@methane
Copy link
Member

methane commented May 26, 2019

This pull request seems reverting this commit.
dcd24ae

The commit fixed this issue.
Your pull request re-introduce the fixed issue?

@methane methane requested a review from warsaw May 26, 2019 06:56
@hloeung
Copy link
Author

hloeung commented May 29, 2019

Apologies, it's been a while since I originally created this PR. Basically, the original PR contained a regex that was copied across from Python 3.x. It's adding to the additional noise and hiding the real fix.

@methane
Copy link
Member

methane commented May 29, 2019

Now this PR is totaly unrelating to bpo-31677.
Please link to bpo issue which is what you want to fix.

@hloeung
Copy link
Author

hloeung commented May 29, 2019

commit 07ea53c for Issue #1079 is what removes this in Python 3.x (3.3.0 beta 1) but it also has a bunch of other changes.

I'm just going to abandon this. Apologies for taking up some of your time.

@hloeung hloeung closed this May 29, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants