Explicitly use utf-8 when decoding bytestrings #768

terminalmage · 2018-02-07T03:12:38Z

While Python 3 defaults to utf-8 in bytes.decode(), Python 2's
equivalent (str.decode()) will use the default encoding as set by
site.py (which is almost always ascii).

From looking at the code, it seems that these decodes have been fixed
piecemeal (likely whenever someone realized that pygit2 was failing to
handle unicode properly), but any decode which runs on Python 2 without
specifying utf-8 as the encoding is a ticking time bomb. I personally
noticed the problem when I encountered a traceback in the
RemoteCallbacks while fetching a new branch which contained utf-8
characters. During the fetch, when pygit2.remote.maybe_string() was
invoked by _update_tips_cb() with a pointer to a bytestring containing
non-ascii characters, the decode failed because the default encoding
is ascii. As it turns out, that particular decode was already fixed in
master, but a number of others still have no explicit encoding.

This commit explicitly uses utf-8 for all remaining bytestring decodes
which do not have an encoding specified, aside from one in PY3-specific
code where doing so would be redundant.
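
For illustration only, here is a minimal sketch of the failure mode and
the fix. The to_text() helper and the branch name are hypothetical and
are not pygit2's actual code; the ascii decode is requested explicitly
to simulate what Python 2 does implicitly when no encoding is given.

```python
# Hypothetical helper for illustration; this is NOT pygit2's implementation.
def to_text(value, encoding="utf-8"):
    """Decode bytes with an explicit encoding; return other values unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# A made-up ref name containing a non-ascii character, encoded as utf-8.
branch = u"refs/heads/caf\u00e9".encode("utf-8")

# Simulate Python 2's implicit default by asking for ascii explicitly.
try:
    branch.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# With an explicit utf-8 decode the result no longer depends on the
# interpreter's default encoding.
decoded = to_text(branch)
```

Passing the encoding at every call site keeps the behaviour identical
on Python 2 and Python 3, which is the approach this commit takes.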

terminalmage force-pushed the decode-utf8 branch from 996ad2e to 2b42eb1 Compare February 7, 2018 03:13

terminalmage force-pushed the decode-utf8 branch from 2b42eb1 to 6e71992 Compare February 7, 2018 03:14

jdavid merged commit 6e71992 into libgit2:master Feb 7, 2018
