PyYAML 5.3 not compatible with Jython #369

pekkaklarck · 2020-01-07T13:20:13Z

PyYAML 5.2 still worked but 5.3 crashes at import. Tested with Jython 2.7.0 and 2.7.2b2 and this is the result:

$ jython -c "import yaml"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/peke/Prog/jython2.7.0/Lib/site-packages/yaml/__init__.py", line 8, in <module>
    from loader import *
  File "/home/peke/Prog/jython2.7.0/Lib/site-packages/yaml/loader.py", line 4, in <module>
    from reader import *
UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 49-55: illegal Unicode character

This seems to be caused by PR #351. Apparently the root cause is that Jython doesn't support lone surrogates at all. According to https://bugs.jython.org/issue2048 that is by design.

perlpunk · 2020-01-07T13:28:52Z

Thanks. It's not yet clear to me where it fails. So importing yaml already fails? And on which line does it fail? The error message only mentions "position 49-55".

pekkaklarck · 2020-01-07T13:42:56Z

It fails at import. Jython considers lone surrogates like u'\uD800' invalid, so a literal containing one has the same effect as a syntax error in the code.

I'm not sure why the error message doesn't contain line information, and I'm not sure the position is correct either. The traceback shows that the problem is in the reader module, though, and the only change there since v5.2 seems to be the one introduced by PR #351. I also already tested that reverting the changed line fixes the error.
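As an illustration of the behavior described above, here is a minimal sketch of a runtime probe for lone-surrogate support. The helper name `supports_lone_surrogates` is hypothetical, and the exception Jython raises inside `eval` is an assumption based on the traceback in this issue; it has not been verified here.

```python
def supports_lone_surrogates():
    """Return True if this interpreter can build a string holding a lone surrogate."""
    try:
        # The raw string keeps the escape as plain backslash text in this file;
        # the surrogate is only decoded when eval() parses the inner literal.
        value = eval(r"u'\ud800'")
    except (SyntaxError, ValueError):
        # Assumption: on Jython the failure surfaces as UnicodeDecodeError
        # (a ValueError subclass) or as SyntaxError.
        return False
    return len(value) == 1


print(supports_lone_surrogates())
```

Because the surrogate never appears in a literal that the interpreter parses at import time, a file containing this probe should itself stay importable on an interpreter that rejects lone surrogates.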

perlpunk · 2020-01-07T13:45:30Z

@anishathalye do you have any idea?

perlpunk · 2020-01-07T13:48:23Z

> Jython considers lone surrogates like u'\uD800' invalid

Hm, that's a bit weird, considering that the usage here is in a regex that actually wants to avoid invalid Unicode.
How would one test input for things like this if it's forbidden to use them in the source code?
I wonder how the json module does this, as I assume it has similar tests.
Note that I'm not very experienced in Python.

pekkaklarck · 2020-01-07T13:56:09Z

The Jython issue I referred to earlier explains their reasoning for not supporting lone surrogates. I think it also has something to do with how the JVM works.

I'm not sure whether it's possible to construct this regexp so that it would also work with Jython. If not, it is possible to have one regexp for Jython and another for the other interpreters. Unfortunately, because Jython rejects lone surrogates entirely, the pattern that contains them cannot appear in the source code directly. One way to handle that is to use eval, as in this example:

import re
import sys

# Assumption: has_ucs4 is not defined in this snippet; a wide-build check is used here.
has_ucs4 = sys.maxunicode > 0xFFFF

if has_ucs4:
    NON_PRINTABLE = u'[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD\U00010000-\U0010ffff]'
elif sys.platform.startswith('java'):
    # Jython doesn't support lone surrogates https://bugs.jython.org/issue2048
    NON_PRINTABLE = u'[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]'
else:
    # Need to use eval here due to the above Jython issue
    NON_PRINTABLE = eval(r"u'[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uFFFD]|(?:^|[^\uD800-\uDBFF])[\uDC00-\uDFFF]|[\uD800-\uDBFF](?:[^\uDC00-\uDFFF]|$)'")
NON_PRINTABLE = re.compile(NON_PRINTABLE)

perlpunk · 2020-01-07T14:05:19Z

@pekkaklarck how about moving the regex to an extra file and import it depending on has_ucs4 and platform?

OTOH, I think there is an existing PR #124 rewriting this check without a regex. Maybe that could work, but it would have to be updated.

pekkaklarck · 2020-01-07T14:11:25Z

Having the regexps in their own modules that are imported as needed would work too. I've used that trick in our project to hide differences between Python 2, Python 3, PyPy, Jython and IronPython when more code has been involved. When the problem has been in just one line, I've typically used eval or exec.

Not needing that NON_PRINTABLE regexp at all would obviously solve the problem nicely.
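The per-module alternative discussed in this part of the thread could look roughly like the sketch below. The module names (`_np_wide`, `_np_jython`, `_np_narrow`) and the helpers are hypothetical, and the files are written to a temporary directory only to keep the example self-contained; in a real package they would be ordinary source files next to `reader.py`. The patterns are the ones from the eval example earlier in the thread.

```python
import importlib
import os
import re
import shutil
import sys
import tempfile

# Hypothetical module bodies. Raw strings keep the escapes as plain backslash
# text here; they are decoded only when the chosen module is imported.
SOURCES = {
    "_np_wide": r"PATTERN = u'[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD\U00010000-\U0010ffff]'",
    "_np_jython": r"PATTERN = u'[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]'",
    "_np_narrow": r"PATTERN = u'[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uFFFD]|(?:^|[^\uD800-\uDBFF])[\uDC00-\uDFFF]|[\uD800-\uDBFF](?:[^\uDC00-\uDFFF]|$)'",
}


def pick_module_name():
    """Choose the pattern module for the running interpreter."""
    if sys.maxunicode > 0xFFFF:
        return "_np_wide"
    if sys.platform.startswith("java"):
        return "_np_jython"
    return "_np_narrow"


def load_pattern():
    """Write the modules to a temp dir and import only the one that applies."""
    directory = tempfile.mkdtemp()
    for name, source in SOURCES.items():
        with open(os.path.join(directory, name + ".py"), "w") as handle:
            handle.write(source + "\n")
    sys.path.insert(0, directory)
    try:
        # The other files are never parsed, so a literal the interpreter
        # would reject is never seen by it.
        return importlib.import_module(pick_module_name()).PATTERN
    finally:
        sys.path.remove(directory)
        shutil.rmtree(directory, ignore_errors=True)


NON_PRINTABLE = re.compile(load_pattern())
```

The trade-off compared with eval is more files for the same one-line difference, which matches the comment above that eval or exec tends to be used when only a single line is affected.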

perlpunk · 2020-01-07T20:09:36Z

I think in this case I would actually be ok with the eval.
@pekkaklarck would you want to do a PR with the code you showed?

pekkaklarck · 2020-01-08T14:01:47Z

I can do that, but I won't have time in the near future. I wouldn't mind you or someone else taking care of this. It would be a good first issue for someone new who is interested in contributing to open source.

anishathalye · 2020-01-08T18:07:29Z

I think NON_PRINTABLE is necessary if you want to point to the specific place where there's an issue; if you don't care about providing as good error messages, you could invert the regex, have a PRINTABLE regex instead (which will avoid the lone surrogate issue), but then you won't be able to point to the specific place in the string where there is a non-printable character.
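To make the point about error locations concrete, here is a small sketch of how a NON_PRINTABLE-style pattern yields the position of the offending character. The function name `first_non_printable` is hypothetical, and the sketch assumes a wide Unicode build (for example any Python 3), so it uses only the UCS-4 pattern from earlier in the thread; it is not the actual PyYAML reader code.

```python
import re

# UCS-4 pattern from the thread; assumes sys.maxunicode > 0xFFFF.
NON_PRINTABLE = re.compile(
    u'[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD\U00010000-\U0010ffff]'
)


def first_non_printable(data):
    """Return (index, character) of the first non-printable character, or None."""
    match = NON_PRINTABLE.search(data)
    if match is None:
        return None
    # match.start() is what lets an error message point at the exact place.
    return match.start(), match.group()


print(first_non_printable(u"key: value\n"))
print(first_non_printable(u"ab\x00c"))
```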

This patch was taken from #369 (comment), authored by Pekka Klärck <peke@iki.fi>.

In short, Jython doesn't support lone surrogates, so importing yaml (and in particular, loading `reader.py`) caused a UnicodeDecodeError. This patch works around this through a clever use of `eval` to defer evaluation of the string containing the lone surrogates, only doing it on non-Jython platforms. This is only done in `lib/yaml/reader.py` and not `lib3/yaml/reader.py` because Jython does not support Python 3.

With this patch, Jython's behavior with respect to Unicode code points over 0xFFFF becomes as it was before 0716ae2. It still does not pass all the unit tests on Jython (passes 1275, fails 3, errors on 1); all the failing tests are related to unicode. Still, this is better than simply crashing upon `import yaml`. With this patch, all tests continue to pass on Python 2 / Python 3.

perlpunk · 2021-01-29T22:35:35Z

Fixed by #378 in 5.4

ageorgou mentioned this issue Jan 20, 2020

Test failures due to PyYAML version oracc/nammu#409

Closed

anishathalye mentioned this issue Jan 22, 2020

Fix compatibility with Jython #378

Merged

perlpunk added the task:bug label Jan 29, 2021

perlpunk closed this as completed Jan 29, 2021
