[2.7] bpo-31677: Backport regex used to match encoded-word strings #7856

hloeung · 2018-06-22T04:04:31Z

The current regex used in 2.7 does not handle a header in which several encoded-words are joined with no whitespace between them. This backports the regex from Python 3.7, which splits such headers correctly.

Example string:

=?gb2312?B?UmU6ILTwuLQ6IFtTdGVsbGEtVFBFXSC08Li0OiBbSFAtU3RlbGxhXSBTa3ly?==?gb2312?Q?ay_related_issue?=

With the current regex:

Python 2.7.14 (default, Sep 23 2017, 22:06:14)
>>> import re
>>> line = "=?gb2312?B?UmU6ILTwuLQ6IFtTdGVsbGEtVFBFXSC08Li0OiBbSFAtU3RlbGxhXSBTa3ly?==?gb2312?Q?ay_related_issue?="
>>> # Match encoded-word strings in the form =?charset?q?Hello_World?=
... ecre = re.compile(r'''
... =\? # literal =?
... (?P<charset>[^?]*?) # non-greedy up to the next ? is the charset
... \? # literal ?
... (?P<encoding>[qb]) # either a "q" or a "b", case insensitive
... \? # literal ?
... (?P<encoded>.*?) # non-greedy up to the next ?= is the encoded string
... \?= # literal ?=
... (?=[ \t]|$) # whitespace or the end of the string
... ''', re.VERBOSE | re.IGNORECASE | re.MULTILINE)
>>> print(ecre.split(line))
['', 'gb2312', 'B', 'UmU6ILTwuLQ6IFtTdGVsbGEtVFBFXSC08Li0OiBbSFAtU3RlbGxhXSBTa3ly?==?gb2312?Q?ay_related_issue', '']

decode_header() then fails because email.base64mime.decode() cannot decode 'UmU6ILTwuLQ6IFtTdGVsbGEtVFBFXSC08Li0OiBbSFAtU3RlbGxhXSBTa3ly?==?gb2312?Q?ay_related_issue'
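A rough way to see that the captured text is not a single base64 payload is to run it through strict validation. This is only a sketch for Python 3: `base64.b64decode(..., validate=True)` is stricter than what decode_header() does on 2.7, and the helper name `is_strict_base64` is made up for illustration. The exact error raised on 2.7 is not shown in this thread.

```python
import base64
import binascii

# Text captured as "encoded" by the 2.7 regex (copied from the split output above).
bad_payload = ("UmU6ILTwuLQ6IFtTdGVsbGEtVFBFXSC08Li0OiBbSFAtU3RlbGxhXSBTa3ly"
               "?==?gb2312?Q?ay_related_issue")

# Everything before the first "?=" is the real base64 payload of the first encoded-word.
good_payload = bad_payload.split("?=", 1)[0]


def is_strict_base64(payload):
    """Hypothetical helper: True if payload passes strict base64 validation."""
    try:
        base64.b64decode(payload, validate=True)
    except binascii.Error:
        return False
    return True


print(is_strict_base64(bad_payload))   # False: contains '?' and '_'
print(is_strict_base64(good_payload))  # True
```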

With the regex from Python 3.7:

Python 2.7.14 (default, Sep 23 2017, 22:06:14)
>>> import re
>>> line = "=?gb2312?B?UmU6ILTwuLQ6IFtTdGVsbGEtVFBFXSC08Li0OiBbSFAtU3RlbGxhXSBTa3ly?==?gb2312?Q?ay_related_issue?="
>>> # Match encoded-word strings in the form =?charset?q?Hello_World?=
... ecre = re.compile(r'''
... =\? # literal =?
... (?P<charset>[^?]*?) # non-greedy up to the next ? is the charset
... \? # literal ?
... (?P<encoding>[qQbB]) # either a "q" or a "b", case insensitive
... \? # literal ?
... (?P<encoded>.*?) # non-greedy up to the next ?= is the encoded string
... \?= # literal ?=
... ''', re.VERBOSE | re.MULTILINE)
>>> print(ecre.split(line))
['', 'gb2312', 'B', 'UmU6ILTwuLQ6IFtTdGVsbGEtVFBFXSC08Li0OiBbSFAtU3RlbGxhXSBTa3ly', '', 'gb2312', 'Q', 'ay_related_issue', '']

It's correctly split into a base64 part for email.base64mime.decode() and a quoted-printable part for email.quoprimime.header_decode().
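For illustration, the two parts can be decoded separately once the regex splits them. This is a minimal Python 3 sketch, not the code in Lib/email/header.py: the function name `decode_parts` is hypothetical, and it uses `base64.b64decode` for the "b" part in place of email.base64mime.decode().

```python
import base64
import re
from email import quoprimime

# Regex from the second REPL session above (no trailing lookahead).
ecre = re.compile(r'''
  =\?                   # literal =?
  (?P<charset>[^?]*?)   # non-greedy up to the next ? is the charset
  \?                    # literal ?
  (?P<encoding>[qQbB])  # either a "q" or a "b", case insensitive
  \?                    # literal ?
  (?P<encoded>.*?)      # non-greedy up to the next ?= is the encoded string
  \?=                   # literal ?=
  ''', re.VERBOSE | re.MULTILINE)

line = ("=?gb2312?B?UmU6ILTwuLQ6IFtTdGVsbGEtVFBFXSC08Li0OiBbSFAtU3RlbGxhXSBTa3ly?="
        "=?gb2312?Q?ay_related_issue?=")


def decode_parts(header):
    """Hypothetical helper: return (charset, raw bytes) for each encoded-word."""
    parts = []
    for match in ecre.finditer(header):
        payload = match.group('encoded')
        if match.group('encoding').lower() == 'b':
            raw = base64.b64decode(payload)
        else:
            # header_decode() returns str on Python 3; map it back to bytes.
            raw = quoprimime.header_decode(payload).encode('latin-1')
        parts.append((match.group('charset').lower(), raw))
    return parts


parts = decode_parts(line)
print(len(parts))  # 2
print(parts[1])    # ('gb2312', b'ay related issue')
```

The raw bytes of both parts would still need to be joined and decoded with the stated charset to get the final header text.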

https://bugs.python.org/issue31677

the-knights-who-say-ni · 2018-06-22T04:04:34Z

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA).

Unfortunately we couldn't find an account corresponding to your GitHub username on bugs.python.org (b.p.o) to verify you have signed the CLA (this might be simply due to a missing "GitHub Name" entry in your b.p.o account settings). This is necessary for legal reasons before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue.

When your account is ready, please add a comment in this pull request
and a Python core developer will remove the CLA not signed label
to make the bot check again.

Thanks again for your contribution, we look forward to reviewing it!

pablogsal · 2018-06-22T13:47:18Z

Hi @hloeung and thank you for your contribution!

Do you mind modifying the Pull Request title to include the bug tracker issue that this PR is addressing? You can find more information here:

https://devguide.python.org/pullrequest/#submitting

If no issue has been created yet, you can open a new one and add its number to the PR title.

Also, notice that the CI is failing right now with this error:

test test_email_renamed failed -- Traceback (most recent call last):
  File "/home/travis/build/python/cpython/Lib/email/test/test_email_renamed.py", line 1589, in test_rfc2047_missing_whitespace
    self.assertEqual(dh, [(s, None)])
AssertionError: Lists differ: [('Sm', None), ('\xf6', 'iso-8... != [('Sm=?ISO-8859-1?B?9g==?=rg=?...
First differing element 0:
('Sm', None)
('Sm=?ISO-8859-1?B?9g==?=rg=?ISO-8859-1?B?5Q==?=sbord', None)
First list contains 4 additional elements.
First extra element 1:
('\xf6', 'iso-8859-1')
+ [('Sm=?ISO-8859-1?B?9g==?=rg=?ISO-8859-1?B?5Q==?=sbord', None)]
- [('Sm', None),
-  ('\xf6', 'iso-8859-1'),
-  ('rg', None),
-  ('\xe5', 'iso-8859-1'),
-  ('sbord', None)]
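The failure appears to come from the removed lookahead: the test string has encoded-words embedded in plain text with no surrounding whitespace, and the 2.7 test expects them to be left undecoded. The sketch below compares the two patterns quoted earlier in this thread on that string; the names `ecre_27` and `ecre_37` are labels chosen here for illustration.

```python
import re

# 2.7 pattern: the encoded-word must be followed by whitespace or end of line.
ecre_27 = re.compile(r'''
  =\?                   # literal =?
  (?P<charset>[^?]*?)   # non-greedy up to the next ? is the charset
  \?                    # literal ?
  (?P<encoding>[qb])    # either a "q" or a "b", case insensitive
  \?                    # literal ?
  (?P<encoded>.*?)      # non-greedy up to the next ?= is the encoded string
  \?=                   # literal ?=
  (?=[ \t]|$)           # whitespace or the end of the string
  ''', re.VERBOSE | re.IGNORECASE | re.MULTILINE)

# Backported pattern: same, minus the trailing lookahead.
ecre_37 = re.compile(r'''
  =\?                   # literal =?
  (?P<charset>[^?]*?)   # non-greedy up to the next ? is the charset
  \?                    # literal ?
  (?P<encoding>[qQbB])  # either a "q" or a "b", case insensitive
  \?                    # literal ?
  (?P<encoded>.*?)      # non-greedy up to the next ?= is the encoded string
  \?=                   # literal ?=
  ''', re.VERBOSE | re.MULTILINE)

# Header used by test_rfc2047_missing_whitespace, taken from the traceback above.
s = 'Sm=?ISO-8859-1?B?9g==?=rg=?ISO-8859-1?B?5Q==?=sbord'

old_match = ecre_27.search(s)
new_payloads = [m.group('encoded') for m in ecre_37.finditer(s)]

print(old_match)     # None: no "?=" is followed by whitespace or end of line
print(new_payloads)  # ['9g==', '5Q==']
```

With the lookahead in place the header stays one undecoded chunk, which is what the 2.7 test asserts; without it the two base64 payloads are picked out, which gives the five-element list in the traceback.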

Notice that to avoid waiting for the CI to fail in the PR you can run your tests locally. For example:

./configure --with-pydebug && make && make test

csabella · 2019-05-11T03:12:16Z

As this issue has been waiting for additional information for a long time, I am going to close it. Feel free to re-open it if @pablogsal 's concerns are addressed.

hloeung · 2019-05-18T01:43:21Z

Sorry, I somehow missed the notification for more/additional information about this.

hloeung · 2019-05-18T01:55:07Z

@pablogsal @csabella Okay, addressed @pablogsal 's concerns. Any chance we could re-open this?

csabella · 2019-05-18T01:57:49Z

@hloeung That's OK. It looks like you're manually backporting a fix from master to 2.7. I'll see if the original reviewers can take a look at this for you.

methane · 2019-05-26T06:52:21Z

Lib/email/header.py

@@ -35,12 +35,11 @@
  =\?                   # literal =?
  (?P<charset>[^?]*?)   # non-greedy up to the next ? is the charset
  \?                    # literal ?
-  (?P<encoding>[qb])    # either a "q" or a "b", case insensitive


I'm OK for this part, but...

methane · 2019-05-26T06:52:48Z

Lib/email/header.py

  \?                    # literal ?
  (?P<encoded>.*?)      # non-greedy up to the next ?= is the encoded string
  \?=                   # literal ?=
-  (?=[ \t]|$)           # whitespace or the end of the string


I don't know this part and changes for tests are OK.

Ah sorry, this PR was a while back. This is really the issue here but was hidden when I copied the regex used in Python 3.x. Instead, I've reverted the noise so it's clear that it is this.

methane · 2019-05-26T06:56:24Z

This pull request seems to revert this commit:
dcd24ae

That commit fixed this issue.
Does your pull request re-introduce the fixed issue?


hloeung · 2019-05-29T09:38:28Z

Apologies, it's been a while since I originally created this PR. Basically, the original PR contained a regex that was copied across from Python 3.x. That added noise and hid the real fix.

methane · 2019-05-29T11:41:27Z

Now this PR is totally unrelated to bpo-31677.
Please link to the bpo issue that you want to fix.

hloeung · 2019-05-29T22:33:48Z

commit 07ea53c for Issue #1079 is what removes this in Python 3.x (3.3.0 beta 1) but it also has a bunch of other changes.

I'm just going to abandon this. Apologies for taking up some of your time.

Backport regex used to match encoded-word strings

5db4a1e

the-knights-who-say-ni added the CLA not signed label Jun 22, 2018

bedevere-bot added the awaiting review label Jun 22, 2018

Backport updated unit test as well

383ec0b

Mariatta removed the CLA not signed label Jul 13, 2018

the-knights-who-say-ni added the CLA signed label Jul 13, 2018

Mariatta changed the title ~~Backport regex used to match encoded-word strings~~ [2.7] Backport regex used to match encoded-word strings Jul 13, 2018

csabella closed this May 11, 2019

Fixed backported unit test

9610a99

hloeung changed the title ~~[2.7] Backport regex used to match encoded-word strings~~ [2.7] bpo-31677: Backport regex used to match encoded-word strings May 18, 2019

csabella reopened this May 18, 2019

csabella requested review from methane and bitdancer May 18, 2019 01:59

methane reviewed May 26, 2019

View reviewed changes

methane requested a review from warsaw May 26, 2019 06:56

hloeung added 2 commits May 29, 2019 19:29

Don't try change too much, make it clear which bit of the regex is the issue

8f38b36

Remove noise

0be6210

hloeung closed this May 29, 2019
