Explicitly use utf-8 when decoding bytestrings #768

Merged
merged 1 commit into from
Feb 7, 2018

terminalmage
Contributor

@terminalmage terminalmage commented Feb 7, 2018

While Python 3 defaults to utf-8 in bytes.decode(), Python 2's
equivalent (str.decode()) will use the default encoding as set by
site.py (which is almost always ascii).
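
A minimal sketch of the failure mode described above. The ref name is made up for illustration, and decoding with "ascii" stands in for what Python 2's str.decode() does under the usual default encoding:

```python
# Hypothetical ref name containing a non-ascii character, encoded as utf-8
# (roughly what a C library would hand back as raw bytes).
raw = u"refs/heads/caf\u00e9".encode("utf-8")

# Stand-in for Python 2's str.decode() with the ascii default encoding.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Naming the encoding gives the same result on Python 2 and Python 3.
name = raw.decode("utf-8")
```

The explicit form costs nothing on Python 3 and removes the dependence on the interpreter's default encoding on Python 2.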

From looking at the code, it seems that these decodes have been fixed
piecemeal (likely whenever someone noticed that pygit2 was failing to
handle unicode properly), but any decode which runs on Python 2 without
specifying utf-8 as the encoding is a ticking time bomb. I personally
noticed the problem when I encountered a traceback in the
RemoteCallbacks while fetching a new branch which contained utf-8
characters. During the fetch, when pygit2.remote.maybe_string() was
invoked by _update_tips_cb() with a pointer to a bytestring containing
unicode, the decode failed because the default encoding is ascii. As it
turns out, that particular decode was already fixed in master, but a
number of others still have no explicit encoding.

This commit explicitly uses utf-8 for all remaining bytestring decodes
which do not have an encoding specified, aside from one in PY3-specific
code where doing so would be redundant.
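
The pattern applied here can be sketched roughly as follows. The helper name to_text is hypothetical and is not pygit2's maybe_string(), which per the description above receives a pointer rather than a Python bytes object; this only illustrates passing the encoding explicitly:

```python
def to_text(value, encoding="utf-8"):
    """Hypothetical helper: decode bytes with an explicit encoding.

    Anything that is not a bytes object (None, already-decoded text)
    is returned unchanged.
    """
    if isinstance(value, bytes):
        # Never rely on the interpreter's default encoding here.
        return value.decode(encoding)
    return value


decoded = to_text(b"caf\xc3\xa9")
```

On Python 2, bytes is an alias for str, so the same check covers both interpreter versions.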

@jdavid jdavid merged commit 6e71992 into libgit2:master Feb 7, 2018
2 participants