unicode errors when locale.preferredencoding() is ascii #123
Here's a rough patch for this issue (against the 0.8 release):

--- a/sh.py
+++ b/sh.py
@@ -117,8 +117,14 @@
 if err_delta:
     tstderr += ("... (%d more, please see e.stderr)" % err_delta).encode()
-msg = "\n\n RAN: %r\n\n STDOUT:\n%s\n\n STDERR:\n%s" %\
-    (full_cmd, tstdout.decode(DEFAULT_ENCODING), tstderr.decode(DEFAULT_ENCODING))
+try:
+    msg = "\n\n ran: %r\n\n stdout:\n%s\n\n stderr:\n%s" %\
+        (full_cmd, tstdout.decode(DEFAULT_ENCODING),
+         tstderr.decode(DEFAULT_ENCODING))
+except UnicodeDecodeError:
+    msg = "\n\n ran: %r\n\n stdout:\n%s\n\n stderr:\n%s" %\
+        (full_cmd, tstdout.decode('utf-8'), tstderr.decode('utf-8'))
+
 super(ErrorReturnCode, self).__init__(msg)
@@ -371,8 +377,12 @@
 def __unicode__(self):
     if self.process and self.stdout:
-        return self.stdout.decode(self.call_args["encoding"],
-            self.call_args["decode_errors"])
+        try:
+            return self.stdout.decode(self.call_args["encoding"],
+                self.call_args["decode_errors"])
+        except UnicodeDecodeError:
+            return self.stdout.decode('utf-8',
+                self.call_args["decode_errors"])
     return ""

 def __eq__(self, other):
@@ -561,7 +571,11 @@
 # if the argument is already unicode, or a number or whatever,
 # this first call will fail.
 try: arg = unicode(arg, DEFAULT_ENCODING).encode(DEFAULT_ENCODING)
-except TypeError: arg = unicode(arg).encode(DEFAULT_ENCODING)
+except TypeError:
+    try:
+        arg = unicode(arg).encode(DEFAULT_ENCODING)
+    except UnicodeEncodeError:
+        arg = unicode(arg).encode('utf-8')
 return arg
@@ -633,7 +647,11 @@
 def __str__(self):
     if IS_PY3: return self.__unicode__()
-    else: return unicode(self).encode(DEFAULT_ENCODING)
+    else:
+        try:
+            return unicode(self).encode(DEFAULT_ENCODING)
+        except UnicodeEncodeError:
+            return unicode(self).encode('utf-8')

 def __eq__(self, other):
     try: return str(self) == str(other)
--- a/test.py
+++ b/test.py
@@ -1338,9 +1338,9 @@
 import sys
 sys.stdout.write("te漢字st")
 """)
-fn = partial(python, py.name, _encoding="ascii")
-def s(fn): str(fn())
-self.assertRaises(UnicodeDecodeError, s, fn)
+#fn = partial(python, py.name, _encoding="ascii")
+#def s(fn): str(fn())
+#self.assertRaises(UnicodeDecodeError, s, fn)
 p = python(py.name, _encoding="ascii", _decode_errors="ignore")
 self.assertEqual(p, "test")
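
In essence, the patch applies a "try the locale encoding, fall back to UTF-8" pattern at every point that encodes or decodes. A minimal standalone sketch of that idea (the helper name is mine, not part of sh):

def decode_with_fallback(raw_bytes, encoding, errors="strict"):
    # Try the configured/locale encoding first; if the bytes are not valid in
    # that codec (e.g. ASCII), retry assuming UTF-8 instead of raising.
    try:
        return raw_bytes.decode(encoding, errors)
    except UnicodeDecodeError:
        return raw_bytes.decode("utf-8", errors)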
@kalikaneko could you go ahead and test the code on the dev branch? I had a hard time getting the locale to be respected as ASCII on my machine.
If you're running on any Linux machine, the easiest way to get an ASCII locale is:

$ export LC_ALL=C
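
To confirm that the interpreter really picks the ASCII codec up after that, a quick stdlib-only check (my sketch, not from the thread):

import locale

# Under LC_ALL=C this typically prints 'ANSI_X3.4-1968' (i.e. ASCII) on glibc systems.
print(locale.getpreferredencoding())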
Still fails with master. Trying the dev branch now.
Heh. The dev branch fails its unit tests even with LC_ALL=en_US.utf8. With LC_ALL=C I get two tracebacks and the unit tests hang... there might be more failures after those two.
Tracebacks from the two dev branch runs (LC_ALL=C was first): http://paste.fedoraproject.org/7380/65870359/
I'm wondering if the source of the problem is really line 51:

DEFAULT_ENCODING = getpreferredencoding() or "utf-8"

Does it make sense to use the user's default system encoding for a script they (or someone else) may have written with utf-8? Should we always just assume utf-8?
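
For what it's worth, the or "utf-8" fallback on that line only fires when getpreferredencoding() returns an empty value; under LC_ALL=C it returns a truthy ASCII codec name, so UTF-8 output still can't be decoded. A small stdlib-only illustration (my sketch):

from locale import getpreferredencoding

DEFAULT_ENCODING = getpreferredencoding() or "utf-8"
print(DEFAULT_ENCODING)                         # 'ANSI_X3.4-1968' under LC_ALL=C, not 'utf-8'

utf8_bytes = u"\u6f22\u5b57".encode("utf-8")    # UTF-8 bytes, like a subprocess might emit
try:
    utf8_bytes.decode(DEFAULT_ENCODING)
except UnicodeDecodeError as exc:
    print("the locale codec cannot decode UTF-8 output: %r" % exc)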
It depends on what DEFAULT_ENCODING is being used for. When interpreting arguments, filenames, and things that are going to be passed on to subprocess, it probably makes sense to just use bytes on Python 2 (I'm not sure about Python 3 -- I'll have to do some experimenting to see what gets handled as bytes and what gets handled as str). For then displaying that output to the user, it might make sense to use repr(byte_string) or something else that would display in the user's locale's encoding but not traceback if the user's locale isn't able to translate that byte sequence.

The problem is, of course, that sh deals with a lot of things that come from outside of Python, from the C world. In that world, strings are sequences of bytes, and many of them do not have encoding values associated with them. Because of that, there is often a need to use several different strategies depending on what the code is attempting to achieve at the time.
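
One way to make the "display without tracebacks" part concrete: decode the captured bytes for error messages with errors="replace", falling back to repr() if the codec name itself is unknown. This is just an illustration of the idea, not sh's implementation:

def for_display(raw_bytes, encoding):
    # Never raise while building an error message: undecodable bytes become
    # U+FFFD replacement characters; an unknown codec falls back to repr().
    try:
        return raw_bytes.decode(encoding, "replace")
    except LookupError:
        return repr(raw_bytes)

# e.g. for_display(b"te\xe6\xbc\xa2st", "ascii") -> u"te\ufffd\ufffd\ufffdst"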
by default, the |
When the locale's encoding is set to ASCII (for example, inside a freshly created Debian chroot), the test suite raises uncaught UnicodeDecodeErrors.
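
A minimal way to reproduce this against unpatched sh, mirroring the test that the patch above comments out (my sketch, not part of the report; depending on the sh version the error may surface during the call or when the result is turned into text):

import sh

# coreutils printf expands \xHH escapes, so the command's output contains raw
# UTF-8 bytes; _encoding="ascii" mimics the ASCII locale, as the test above does.
out = sh.printf(r"te\xe6\xbc\xa2st", _encoding="ascii")
str(out)   # raises UnicodeDecodeError when the output is decoded with the ascii codec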