Test for unicode path #45

stsewd · 2018-03-06T06:44:34Z

Test to expose fix for #27

This test was a little hard to figure out how to do it, since I wasn't able to use the name on the py file, so I had to add an actual file (I take the file from readthedocs/readthedocs.org#3732 (comment)) and I had to made it py2 and py3 compatible. Please let me know if there is a better way.

stsewd · 2018-03-06T06:52:36Z

I forgot to make it py3 compatible

stsewd · 2018-03-06T07:13:11Z

readthedocs_build/config/test_find.py

+        # When python2
+        assert isinstance(e, UnicodeDecodeError)
+
+        unicode_base_path = base_path.decode('utf-8')


So the final question, should we force always to be str on the find_one method or we should document that anyone calling this function must pass a str object and not an unicode?

Also probably would be better to explicitly check the python version on the test, right?

We need to be able to support UTF8 file names, so I think we just need to patch the walk method. We did this somewhere -- either readthedocs.org or sphinx-autoapi (i think)

agjohnson

You can also look to some of the tests that probably already mock out the find_one calls. You shouldn't need to depend on a local repo file for the test, though this isn't a huge problem.

agjohnson · 2018-03-06T07:30:44Z

readthedocs_build/config/test_find.py

+        assert path == ''
+    except Exception as e:
+        # When python2
+        assert isinstance(e, UnicodeDecodeError)


Instead of a try/except here, I'd use six or basic python version_info checking to test for these two cases:

sys.version_info < (3,0,0)

agjohnson · 2018-03-06T07:33:08Z

readthedocs_build/config/test_find.py

@@ -75,3 +75,18 @@ def test_find_multiple_files(tmpdir):
        str(tmpdir.join('first', 'readthedocs.yml')),
        str(tmpdir.join('third', 'readthedocs.yml')),
    ]
+
+
+def test_find_unicode_path(tmpdir):


This test can also be written to fail, using the pytest.xfail decorator. Instead of writing a test that will fail once we have the bug fixed, we can write a passing case. Our tests will technically pass, but there will be an expected failure on the test run.

agjohnson · 2018-03-06T07:36:39Z

readthedocs_build/config/test_find.py

+        # When python2
+        assert isinstance(e, UnicodeDecodeError)
+
+        unicode_base_path = base_path.decode('utf-8')


We need to be able to support UTF8 file names, so I think we just need to patch the walk method. We did this somewhere -- either readthedocs.org or sphinx-autoapi (i think)

- Run only on python2 - Mark as fail

stsewd · 2018-03-06T21:45:25Z

You can also look to some of the tests that probably already mock out the find_one calls. You shouldn't need to depend on a local repo file for the test, though this isn't a huge problem.

I tried to use apply_fs to reproduce the bug, but I wasn't able to do it (even copying the same name of the file from this test). Even manually creating a file with the same name (copying it from the one that fails).

I think probably this is due a bad encoding or a corrupted file from the user? Even on my OS the file isn't show correctly.

agjohnson · 2018-03-08T01:43:19Z

So you probably aren't able to reproduce this easily as your encoding is not ascii. Here is what the servers give us back:

>>> import sys
>>> sys.getdefaultencoding()
'ascii'

That is bad, and in fact, I'm not sure why that's happening, as our locale is set:

$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8

What you're testing might not be the right thing to test. Also, you are having trouble creating a file or path with this codepoint because your encoding is likely 'utf-8' already. With the encoding ascii, I have no problem doing this with py.path and apply_fs. If you try the following on a box with default utf8 encoding, you'll get a file with a codepoint that isn't 0xf0:

In [1]: import os

In [2]: os.mkdir('/tmp/foo/')

In [3]: os.mkdir('/tmp/foo/f\xf0\xf0')

In [4]: ls /tmp/foo
f??/

In [5]: os.listdir('/tmp/foo')
Out[5]: ['f\xf0\xf0']

To make this more confusing, however, even with this I'm not able to reproduce the problem with find_all:

In [18]: path
Out[18]: local('/tmp/foo')

In [19]: path.listdir()
Out[19]: [local('/tmp/foo/f\xf0\xf0'), local('/tmp/foo/bo\xf0')]

In [20]: list(find_all('/tmp/foo', ['readthedocs.yml']))
Out[20]: []

So, I'm pretty confused at this point.

stsewd · 2018-03-08T03:18:21Z

So you probably aren't able to reproduce this easily as your encoding is not ascii. Here is what the servers give us back:

I got the same on my local instance

>>> import sys
>>> sys.getdefaultencoding()
'ascii'

I asked to the user for more information readthedocs/readthedocs.org#3732 (comment), and I think probably that file was generated with some weird encoding that only Windows understand (I have seen some Windows files showing as invalid encode on my machine a couple of times). Also now that I remember, a couple of weeks ago I was helping to a German friend to setup his rtd instance and he was using Windows, and faced some similar problems with encoding (but on other part of the build).

ericholscher · 2018-03-08T08:24:34Z

The fix for python encoding being crazy is "use python3". I've tried to fix the server encodings a million times, and it doesn't work. We just need to move to an all UTF-8 world.

ericholscher · 2018-05-22T16:37:17Z

Looks like a good test to have, so going to merge this.

Add test for unicode path

dcba87e

Make test py2 & py3 compatible

e856a28

stsewd mentioned this pull request Mar 6, 2018

UnicodeError on path walking #27

Open

stsewd commented Mar 6, 2018

View reviewed changes

agjohnson reviewed Mar 6, 2018

View reviewed changes

Better test

2d0b0e5

- Run only on python2 - Mark as fail

stsewd force-pushed the test-unicode-error branch from e9252e9 to 2d0b0e5 Compare March 6, 2018 20:49

stsewd mentioned this pull request Mar 6, 2018

Build fails for new project readthedocs/readthedocs.org#3732

Closed

More explicit test

bc3982b

ericholscher merged commit b992332 into readthedocs:master May 22, 2018

stsewd deleted the test-unicode-error branch May 22, 2018 18:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Test for unicode path #45

Test for unicode path #45

stsewd commented Mar 6, 2018 •

edited

Loading

stsewd commented Mar 6, 2018

stsewd Mar 6, 2018

stsewd Mar 6, 2018

agjohnson Mar 6, 2018

agjohnson left a comment

agjohnson Mar 6, 2018

agjohnson Mar 6, 2018

agjohnson Mar 6, 2018

stsewd commented Mar 6, 2018

agjohnson commented Mar 8, 2018

stsewd commented Mar 8, 2018

ericholscher commented Mar 8, 2018

ericholscher commented May 22, 2018

Test for unicode path #45

Test for unicode path #45

Conversation

stsewd commented Mar 6, 2018 • edited Loading

stsewd commented Mar 6, 2018

stsewd Mar 6, 2018

Choose a reason for hiding this comment

stsewd Mar 6, 2018

Choose a reason for hiding this comment

agjohnson Mar 6, 2018

Choose a reason for hiding this comment

agjohnson left a comment

Choose a reason for hiding this comment

agjohnson Mar 6, 2018

Choose a reason for hiding this comment

agjohnson Mar 6, 2018

Choose a reason for hiding this comment

agjohnson Mar 6, 2018

Choose a reason for hiding this comment

stsewd commented Mar 6, 2018

agjohnson commented Mar 8, 2018

stsewd commented Mar 8, 2018

ericholscher commented Mar 8, 2018

ericholscher commented May 22, 2018

stsewd commented Mar 6, 2018 •

edited

Loading