unicode errors when locale.preferredencoding() is ascii #123
Here's a rough patch for this issue (against the 0.8 release):

--- a/sh.py
+++ b/sh.py
@@ -117,8 +117,14 @@
 if err_delta:
     tstderr += ("... (%d more, please see e.stderr)" % err_delta).encode()
-msg = "\n\n RAN: %r\n\n STDOUT:\n%s\n\n STDERR:\n%s" %\
-    (full_cmd, tstdout.decode(DEFAULT_ENCODING), tstderr.decode(DEFAULT_ENCODING))
+try:
+    msg = "\n\n ran: %r\n\n stdout:\n%s\n\n stderr:\n%s" %\
+        (full_cmd, tstdout.decode(DEFAULT_ENCODING),
+         tstderr.decode(DEFAULT_ENCODING))
+except UnicodeDecodeError:
+    msg = "\n\n ran: %r\n\n stdout:\n%s\n\n stderr:\n%s" %\
+        (full_cmd, tstdout.decode('utf-8'), tstderr.decode('utf-8'))
+
 super(ErrorReturnCode, self).__init__(msg)
@@ -371,8 +377,12 @@
 def __unicode__(self):
     if self.process and self.stdout:
-        return self.stdout.decode(self.call_args["encoding"],
-            self.call_args["decode_errors"])
+        try:
+            return self.stdout.decode(self.call_args["encoding"],
+                self.call_args["decode_errors"])
+        except UnicodeDecodeError:
+            return self.stdout.decode('utf-8',
+                self.call_args["decode_errors"])
     return ""

 def __eq__(self, other):
@@ -561,7 +571,11 @@
 # if the argument is already unicode, or a number or whatever,
 # this first call will fail.
 try: arg = unicode(arg, DEFAULT_ENCODING).encode(DEFAULT_ENCODING)
-except TypeError: arg = unicode(arg).encode(DEFAULT_ENCODING)
+except TypeError:
+    try:
+        arg = unicode(arg).encode(DEFAULT_ENCODING)
+    except UnicodeEncodeError:
+        arg = unicode(arg).encode('utf-8')
 return arg
@@ -633,7 +647,11 @@
 def __str__(self):
     if IS_PY3: return self.__unicode__()
-    else: return unicode(self).encode(DEFAULT_ENCODING)
+    else:
+        try:
+            return unicode(self).encode(DEFAULT_ENCODING)
+        except UnicodeEncodeError:
+            return unicode(self).encode('utf-8')

 def __eq__(self, other):
     try: return str(self) == str(other)
--- a/test.py
+++ b/test.py
@@ -1338,9 +1338,9 @@
 import sys
 sys.stdout.write("te漢字st")
 """)
-fn = partial(python, py.name, _encoding="ascii")
-def s(fn): str(fn())
-self.assertRaises(UnicodeDecodeError, s, fn)
+#fn = partial(python, py.name, _encoding="ascii")
+#def s(fn): str(fn())
+#self.assertRaises(UnicodeDecodeError, s, fn)
 p = python(py.name, _encoding="ascii", _decode_errors="ignore")
 self.assertEqual(p, "test")
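
In essence, the patch applies a "try the locale encoding, fall back to UTF-8" pattern at every point that encodes or decodes. A minimal standalone sketch of that idea (the helper name is mine, not part of sh):

def decode_with_fallback(raw_bytes, encoding, errors="strict"):
    # Try the configured/locale encoding first; if the bytes are not valid in
    # that codec (e.g. ASCII), retry assuming UTF-8 instead of raising.
    try:
        return raw_bytes.decode(encoding, errors)
    except UnicodeDecodeError:
        return raw_bytes.decode("utf-8", errors)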
@kalikaneko could you go ahead and test the code on the dev branch? I had a hard time getting the locale to be respected as ASCII on my machine.
If you're running on any Linux machine, the easiest way to get an ASCII locale is:

$ export LC_ALL=C
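
To confirm that the interpreter really picks the ASCII codec up after that, a quick stdlib-only check (my sketch, not from the thread):

import locale

# Under LC_ALL=C this typically prints 'ANSI_X3.4-1968' (i.e. ASCII) on glibc systems.
print(locale.getpreferredencoding())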
Still fails with master. Trying the dev branch now.
Heh. The dev branch fails its unit tests even with LC_ALL=en_US.utf8. With LC_ALL=C I get two tracebacks and the unit tests hang... there might be more failures after those two.
Tracebacks from the two dev branch runs (LC_ALL=C was first): http://paste.fedoraproject.org/7380/65870359/
I'm wondering if the source of the problem is really line 51:

DEFAULT_ENCODING = getpreferredencoding() or "utf-8"

Does it make sense to use the user's default system encoding for a script they (or someone else) may have written with utf-8? Should we always just assume utf-8?
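
For what it's worth, the or "utf-8" fallback on that line only fires when getpreferredencoding() returns an empty value; under LC_ALL=C it returns a truthy ASCII codec name, so UTF-8 output still can't be decoded. A small stdlib-only illustration (my sketch):

from locale import getpreferredencoding

DEFAULT_ENCODING = getpreferredencoding() or "utf-8"
print(DEFAULT_ENCODING)                         # 'ANSI_X3.4-1968' under LC_ALL=C, not 'utf-8'

utf8_bytes = u"\u6f22\u5b57".encode("utf-8")    # UTF-8 bytes, like a subprocess might emit
try:
    utf8_bytes.decode(DEFAULT_ENCODING)
except UnicodeDecodeError as exc:
    print("the locale codec cannot decode UTF-8 output: %r" % exc)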
It depends on what DEFAULT_ENCODING is being used for. When interpreting arguments, filenames, and things that are going to be passed on to subprocess, it probably makes sense to just use bytes on Python 2 (I'm not sure about Python 3 -- I'll have to do some experimenting to see what gets handled as bytes and what gets handled as str). For then displaying that output to the user, it might make sense to use repr(byte_string) or something else that would display in the user's locale's encoding but not traceback if the user's locale isn't able to translate that byte sequence.

The problem is, of course, that sh deals with a lot of things that come from outside of Python, from the C world. In that world, strings are sequences of bytes, and many of them do not have encoding values associated with them. Because of that, there is often a need to use several different strategies depending on what the code is attempting to achieve at the time.
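
One way to make the "display without tracebacks" part concrete: decode the captured bytes for error messages with errors="replace", falling back to repr() if the codec name itself is unknown. This is just an illustration of the idea, not sh's implementation:

def for_display(raw_bytes, encoding):
    # Never raise while building an error message: undecodable bytes become
    # U+FFFD replacement characters; an unknown codec falls back to repr().
    try:
        return raw_bytes.decode(encoding, "replace")
    except LookupError:
        return repr(raw_bytes)

# e.g. for_display(b"te\xe6\xbc\xa2st", "ascii") -> u"te\ufffd\ufffd\ufffdst"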
by default, the |
When the locale's encoding is set to ASCII (for example, inside a freshly created Debian chroot), the test suite raises uncaught UnicodeDecodeErrors.
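
A minimal way to reproduce this against unpatched sh, mirroring the test that the patch above comments out (my sketch, not part of the report; depending on the sh version the error may surface during the call or when the result is turned into text):

import sh

# coreutils printf expands \xHH escapes, so the command's output contains raw
# UTF-8 bytes; _encoding="ascii" mimics the ASCII locale, as the test above does.
out = sh.printf(r"te\xe6\xbc\xa2st", _encoding="ascii")
str(out)   # raises UnicodeDecodeError when the output is decoded with the ascii codec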