Encoding problem #332

Open
StyXman opened this issue Jul 30, 2015 · 22 comments

@StyXman
Contributor

StyXman commented Jul 30, 2015

I'm not sure this is the proper way to use TagReferences, but the behaviour is definitely unexpected. This time I'm using GitPython installed from PyPI.

I have this nice tag:

In [8]: tag
Out[8]: <git.TagReference "refs/tags/PROMOTED_1501131729_MKT15_01_12_QU_1">

I can get a lot of info out of it:

In [9]: tag.object.hexsha
Out[9]: u'dca63c5c7e6aab3cd4934e60230ec3419ab87071'

In [12]: tag.name
Out[12]: 'PROMOTED_1501131729_MKT15_01_12_QU_1'

In [13]: tag.object
Out[13]: <git.TagObject "dca63c5c7e6aab3cd4934e60230ec3419ab87071">

In [14]: tag.ref
TypeError: PROMOTED_1501131729_MKT15_01_12_QU_1 is a detached symbolic reference as it points to 'dca63c5c7e6aab3cd4934e60230ec3419ab87071'

But this fails:

In [15]: tag.commit
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-15-2431a6e80cf9> in <module>()
----> 1 tag.commit

/home/mdione/local/lib/python2.7/site-packages/git/refs/tag.pyc in commit(self)
     29         elif obj.type == "tag":
     30             # it is a tag object which carries the commit as an object - we can point to anything
---> 31             return obj.object
     32         else:
     33             raise ValueError("Tag %s points to a Blob or Tree - have never seen that before" % self)

/home/mdione/local/lib/python2.7/site-packages/gitdb/util.pyc in __getattr__(self, attr)
--> 237         self._set_cache_(attr)
    238         # will raise in case the cache was not created
    239         return object.__getattribute__(self, attr)

/home/mdione/local/lib/python2.7/site-packages/git/objects/tag.pyc in _set_cache_(self, attr)
     54         if attr in TagObject.__slots__:
     55             ostream = self.repo.odb.stream(self.binsha)
---> 56             lines = ostream.read().decode(defenc).splitlines()
     57
     58             obj, hexsha = lines[0].split(" ")       # object <hexsha>

/usr/lib/python2.7/encodings/utf_8.pyc in decode(input, errors)
     14
     15 def decode(input, errors='strict'):
---> 16     return codecs.utf_8_decode(input, errors, True)
     17
     18 class IncrementalEncoder(codecs.IncrementalEncoder):

UnicodeDecodeError: 'utf8' codec can't decode byte 0xa8 in position 108: invalid start byte

Unfortunately this is happening with an internal repo, and I don't know how to even try to reproduce it with a public one. Meanwhile I can work around it by using tag.object.hexsha, which is what I wanted.

@Byron
Member

Byron commented Jul 30, 2015

Thanks for posting this issue!

It seems that the TagObject's information can't be decoded because it contains bytes that are not valid UTF-8, which is unexpected.
Maybe it is safer to not attempt to decode anything, and leave that to the client, who could read the bytes of the associated tag-object and parse them with a suitable encoding in mind.

Even though you have already discovered a workaround, the original problem remains. A proper fix would re-evaluate the current code and prefer to work on bytes instead of a decoded string.

@Byron Byron added this to the v1.0.2 - Fixes milestone Jul 30, 2015
@StyXman
Contributor Author

StyXman commented Jul 30, 2015

In fact, no, tag.object.hexsha is not what I'm looking for.

@StyXman
Contributor Author

StyXman commented Jul 30, 2015

More data: technically this is an encoding error in the data itself:

In [11]: stream= pricing.odb.stream(tag.object.binsha)

In [12]: stream.read()
Out[12]: 'object 4b50858c4debda3ad5d6ea5b7a485cd4eb5ecc73\ntype commit\ntag PROMOTED_1501131729_MKT15_01_12_QU_1\ntagger \xa8John Doe <jdoe@megacorp.com> 1421167136 +0100\n\nMerged CIU_MKT1501_28 to remote master\n'

You can see the offending character just before the tagger's name (technically it is part of the name). On the other hand, I don't even know whether git handles this: what happens when different objects are encoded with different encodings? I'm pretty sure git objects do not store this kind of info...

@Byron
Member

Byron commented Jul 30, 2015

Using the tag.object it should be straightforward to obtain the raw-bytes stored in the tag-object, in case this is what you are actually looking for. Those represent a few formatted lines of information, which could be parsed with code similar to the one currently in use.
Parsing can only safely operate on bytes though, as the encoding seems not to be UTF-8 at all times.

Even if parsing is made to work at some point, the tagger name is currently expected to be a str/unicode instance, which cannot be produced if the encoding of the underlying bytes is unknown.
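
A minimal sketch of the bytes-first parsing described above, using the raw tag object posted earlier in this thread. The function name `parse_tag_object` is hypothetical and this is not GitPython's implementation; it only illustrates that the structural parsing never needs to decode the tagger field.

```python
from typing import Dict, Tuple


def parse_tag_object(raw: bytes) -> Tuple[Dict[bytes, bytes], bytes]:
    """Split a raw tag object into header fields and message, all as bytes."""
    # Headers and message are separated by the first blank line.
    header_block, _, message = raw.partition(b"\n\n")
    headers: Dict[bytes, bytes] = {}
    for line in header_block.split(b"\n"):
        # Each header line is "<key> <value>"; split on the first space only.
        key, _, value = line.partition(b" ")
        headers[key] = value
    return headers, message


raw = (
    b"object 4b50858c4debda3ad5d6ea5b7a485cd4eb5ecc73\n"
    b"type commit\n"
    b"tag PROMOTED_1501131729_MKT15_01_12_QU_1\n"
    b"tagger \xa8John Doe <jdoe@megacorp.com> 1421167136 +0100\n"
    b"\n"
    b"Merged CIU_MKT1501_28 to remote master\n"
)

headers, message = parse_tag_object(raw)
# The ASCII-only fields decode safely; the tagger stays as bytes until
# the caller decides which encoding (or error policy) to apply.
target_sha = headers[b"object"].decode("ascii")
tagger_bytes = headers[b"tagger"]
```

With this shape, a client that only needs the target commit never touches the problematic byte at all.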

@StyXman
Contributor Author

StyXman commented Jul 30, 2015

What about using decode(defenc, 'ignore')? I hope it doesn't break anything else. I'll try that locally.
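
For reference, a small sketch of how the standard codec error handlers treat the tagger bytes shown earlier. It assumes Python 3 (the thread itself is on Python 2.7, where `backslashreplace` is not available for decoding) and uses only built-in `bytes.decode` behaviour.

```python
raw_tagger = b"\xa8John Doe <jdoe@megacorp.com>"

# 'strict' (the default) raises, which is the failure reported above.
try:
    raw_tagger.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# 'ignore' silently drops the undecodable byte.
ignored = raw_tagger.decode("utf-8", "ignore")

# 'replace' keeps a visible marker (U+FFFD) where the byte was.
replaced = raw_tagger.decode("utf-8", "replace")

# 'backslashreplace' keeps the byte value readable as an escape sequence.
escaped = raw_tagger.decode("utf-8", "backslashreplace")
```

Of these, only `ignore` leaves no trace that something was lost, which is the trade-off to weigh before adopting it.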

@Byron
Member

Byron commented Jul 30, 2015

Great idea!
Of course, it's questionable whether the program should silently drop information instead of loudly aborting the operation, as it currently does.
It seems that it's generally unwise to make assumptions about the encoding in TagObjects, so the implementation should leave it to the client to deal with that and provide byte-strings only.

@StyXman
Contributor Author

StyXman commented Jul 30, 2015

But that would be against your policy of handling as much as possible as unicode (if I correctly understood #312)...

@StyXman
Contributor Author

StyXman commented Jul 30, 2015

BTW, that fixed my particular problem, but I guess you don't want the PR just yet...

@Byron
Member

Byron commented Jul 30, 2015

But how would you produce proper unicode strings if the encoding is unclear? It's unsafe to try, as this example shows.
The truth is that I am not entirely sure how git itself handles encodings, and it might be that GitPython actually went down a wrong path by trying to just decode textual data as UTF-8. The latter works most of the time, but that's not really good enough.

Maybe a suitable solution would be to allow the client to set the decode-behaviour on a per-repository basis to control whether .decode(defenc, 'ignore') is acceptable.
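
One way such a per-repository setting could look, as a rough sketch only. `DecodePolicy` is a hypothetical name and nothing like it is claimed to exist in GitPython; the point is just that the encoding and error handler become explicit, client-controlled values.

```python
from dataclasses import dataclass


@dataclass
class DecodePolicy:
    """Hypothetical per-repository setting; not part of GitPython."""
    encoding: str = "utf-8"
    errors: str = "strict"  # or "ignore", "replace", ...

    def decode(self, raw: bytes) -> str:
        # Delegate to bytes.decode with the configured handler.
        return raw.decode(self.encoding, self.errors)


strict_policy = DecodePolicy()
lenient_policy = DecodePolicy(errors="ignore")

raw = b"tagger \xa8John Doe"
lenient_text = lenient_policy.decode(raw)
```

The strict default preserves today's behaviour, so only clients that opt in would accept lossy decoding.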

Doing this sounds like quite some work - and as it stands, the unicode handling in GitPython seems flawed by design :(.

@StyXman
Contributor Author

StyXman commented Jul 30, 2015

I think git just doesn't handle encoding at all. In any case, any free-form byte sequences (strings) are strings for user consumption: tag names, log comments, etc. Even filenames are, I'm sure, not converted in any way. In fact, most (Unix/Linux) filesystems know nothing about encoding: it's possible to handle filenames written in one encoding on a system that uses another, simply because filenames are treated as byte sequences with no specific meaning or encoding.
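
As an illustration of treating names as opaque bytes while still working with `str`, here is a short Python 3 sketch using the `surrogateescape` error handler (the same handler `os.fsdecode` and `os.fsencode` rely on for POSIX filenames). The sample filename is made up.

```python
raw_name = b"caf\xe9.txt"  # Latin-1 bytes, not valid UTF-8

# surrogateescape maps each undecodable byte to a lone surrogate code point...
as_text = raw_name.decode("utf-8", "surrogateescape")

# ...so encoding with the same handler restores the original bytes exactly.
round_tripped = as_text.encode("utf-8", "surrogateescape")
```

Unlike `ignore` or `replace`, this round trip is lossless, although the resulting string cannot be printed or encoded with a strict handler.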

@CepGamer

CepGamer commented Sep 9, 2015

I have encountered a similar problem: when invoking diff on a file that contains a byte sequence that is invalid UTF-8 in this locale, GitPython fails with a UnicodeDecodeError. The backtrace follows:

File "/usr/lib/python2.7/site-packages/gitupstream/gitupstream.py", line 175, in update
diff = self._repo.git.diff('--full-index', self._mainline, self._rebased)
File "/usr/lib/python2.7/site-packages/git/cmd.py", line 431, in <lambda>
return lambda *args, **kwargs: self._call_process(name, *args, **kwargs)
File "/usr/lib/python2.7/site-packages/git/cmd.py", line 802, in _call_process
return self.execute(make_call(), **_kwargs)
File "/usr/lib/python2.7/site-packages/git/cmd.py", line 610, in execute
stdout_value = stdout_value.decode(defenc)
File "/usr/lib64/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe0 in position 225: invalid continuation byte

Does this issue cover my error, or do I need to open another one? Maybe you could help me with a solution?

@Byron
Member

Byron commented Sep 20, 2015

@StyXman You are totally right. As stated previously, fixing this in GitPython may be a breaking change for some, as bytes would be returned instead of unicode. This makes me somewhat reluctant to attempt such a change, but I should check how much is actually affected.

@CepGamer You can pass the stdout_as_string=False keyword argument when executing .git.diff (i.e. .git.diff(..., stdout_as_string=False)), or use GitPython's own diffing facilities.
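
Once the diff arrives as bytes, the caller still has to choose how to turn it into text. A minimal sketch of one option, assuming Python 3; the helper `decode_with_fallback` and the sample bytes are hypothetical, and the bytes merely stand in for what a call like `.git.diff(..., stdout_as_string=False)` would return.

```python
from typing import Iterable


def decode_with_fallback(raw: bytes, encodings: Iterable[str] = ("utf-8", "latin-1")) -> str:
    """Try each encoding in order; fall back to UTF-8 with replacement markers."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", "replace")


# Stand-in for bytes returned by the diff call; not real diff output.
raw_diff = b"+caf\xe0 au lait\n"
text = decode_with_fallback(raw_diff)
```

Latin-1 accepts every byte value, so placing it last guarantees a result, at the cost of possibly mislabelling text that was written in some other 8-bit encoding.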

@maikelsteneker

I believe I ran into a similar issue. When querying the commit message for a commit, the following exception is thrown:

ERROR:git.objects.commit:Failed to decode message '...' using encoding UTF-8
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/git/objects/commit.py", line 500, in _deserialize
    self.message = self.message.decode(self.encoding)
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf8 in position 126: invalid start byte

Unfortunately, I cannot share the exact commit message or the repository. I did not succeed in reproducing it in a test repository.
Perhaps a solution would be to provide an option to disable decoding?

@Byron
Member

Byron commented Feb 11, 2016

A new release was just published to PyPI 😁 (see #298)!

@Byron Byron modified the milestones: v1.0.2 - Fixes, v1.0.3 - Fixes Feb 13, 2016
@Byron Byron modified the milestones: v2.0.0 - Features and Fixes, v2.0.1 - Bugfixes Apr 24, 2016
@nvie nvie modified the milestones: v2.0.4 - Bugfixes, v2.0.5 May 30, 2016
@Byron Byron modified the milestones: v2.0.9 - Bugfixes, v2.0.10 - Bugfixes, v2.1.0 - proper windows support, v2.1.0 - better windows support, v2.1.1 - Bugfixes Oct 16, 2016
@Byron Byron modified the milestones: v2.1.1 - Bugfixes, v2.1.2 - Bugfixes Dec 8, 2016
@Byron Byron modified the milestones: v2.1.2 - Bugfixes, v2.1.3 - Bugfixes Mar 8, 2017
@sbenthall

I have run into this problem.

This script, which tries to loop through the tags of the nodejs/node repository, exposes this bug:

https://gist.github.com/sbenthall/14c4d14c00876440ba6d0ae62efa432f

Using version 2.1.11

@brizjin

brizjin commented Jul 15, 2019

I have the same issue when reading the branches property. How can I solve it?

@ViCrack

ViCrack commented Oct 15, 2020

I have the same issue when reading the branches property. How can I solve it?

repo = Repo(r'')
print(repo.branches)
I have a similar question.

Traceback (most recent call last):
  File "D:/Python/src/post.py", line 16, in <module>
    print(repo.branches)
  File "D:\Programs\Python37\lib\site-packages\git\repo\base.py", line 289, in heads
    return Head.list_items(self)
  File "D:\Programs\Python37\lib\site-packages\git\util.py", line 922, in list_items
    out_list.extend(cls.iter_items(repo, *args, **kwargs))
  File "D:\Programs\Python37\lib\site-packages\git\refs\symbolic.py", line 616, in _iter_items
    for _sha, rela_path in cls._iter_packed_refs(repo):
  File "D:\Programs\Python37\lib\site-packages\git\refs\symbolic.py", line 91, in _iter_packed_refs
    for line in fp:
UnicodeDecodeError: 'gbk' codec can't decode byte 0xae in position 538: illegal multibyte sequence
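
The traceback shows a text-mode file iteration being decoded with the locale's default codec ('gbk' here). The following sketch is not GitPython's code; it only illustrates, with a made-up packed-refs file, how passing an explicit `encoding` to `open` removes the dependency on the locale.

```python
import os
import tempfile

# A made-up packed-refs file with a non-ASCII branch name (two CJK characters).
content = (
    "# pack-refs with: peeled fully-peeled sorted \n"
    "dca63c5c7e6aab3cd4934e60230ec3419ab87071 refs/heads/\u529f\u80fd\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "packed-refs")
    with open(path, "w", encoding="utf-8") as fp:
        fp.write(content)

    # Passing encoding="utf-8" avoids depending on the locale default
    # (for example 'gbk' on some Windows setups).
    refs = []
    with open(path, "r", encoding="utf-8") as fp:
        for line in fp:
            line = line.strip()
            # Skip the header comment, blank lines, and peeled-tag lines.
            if not line or line.startswith("#") or line.startswith("^"):
                continue
            sha, ref_path = line.split(" ", 1)
            refs.append((sha, ref_path))
```

As a workaround that needs no code changes, enabling Python's UTF-8 mode (for example by setting `PYTHONUTF8=1`, available from Python 3.7) may be worth trying, on the assumption that the ref names in the repository are UTF-8.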

@zeze1004

I have a similar question too🥲

pr_repo = g.get_repo(repo_name)

"/Users/mac/.anyenv/envs/pyenv/versions/3.9.9/lib/python3.9/http/client.py", line 1258, in putheader
    values[i] = one_value.encode('latin-1')
UnicodeEncodeError: 'latin-1' codec can't encode character '\u201c' in position 6: ordinal not in range(256)

Details

File "/Users/mac/project/kerraform/./auto_git_api.py", line 80, in pull_request
pr_repo = g.get_repo(repo_name)
File "/Users/mac/project/kerraform/venv/lib/python3.9/site-packages/github/MainClass.py", line 330, in get_repo
headers, data = self.__requester.requestJsonAndCheck("GET", url)
File "/Users/mac/project/kerraform/venv/lib/python3.9/site-packages/github/Requester.py", line 354, in requestJsonAndCheck
*self.requestJson(
File "/Users/mac/project/kerraform/venv/lib/python3.9/site-packages/github/Requester.py", line 454, in requestJson
return self.__requestEncode(cnx, verb, url, parameters, headers, input, encode)
File "/Users/mac/project/kerraform/venv/lib/python3.9/site-packages/github/Requester.py", line 528, in __requestEncode
status, responseHeaders, output = self.__requestRaw(
File "/Users/mac/project/kerraform/venv/lib/python3.9/site-packages/github/Requester.py", line 555, in __requestRaw
response = cnx.getresponse()
File "/Users/mac/project/kerraform/venv/lib/python3.9/site-packages/github/Requester.py", line 127, in getresponse
r = verb(
File "/Users/mac/project/kerraform/venv/lib/python3.9/site-packages/requests/sessions.py", line 542, in get
return self.request('GET', url, **kwargs)
File "/Users/mac/project/kerraform/venv/lib/python3.9/site-packages/requests/sessions.py", line 529, in request
resp = self.send(prep, **send_kwargs)
File "/Users/mac/project/kerraform/venv/lib/python3.9/site-packages/requests/sessions.py", line 645, in send
r = adapter.send(request, **kwargs)
File "/Users/mac/project/kerraform/venv/lib/python3.9/site-packages/requests/adapters.py", line 440, in send
resp = conn.urlopen(
File "/Users/mac/project/kerraform/venv/lib/python3.9/site-packages/urllib3/connectionpool.py", line 703, in urlopen
httplib_response = self._make_request(
File "/Users/mac/project/kerraform/venv/lib/python3.9/site-packages/urllib3/connectionpool.py", line 398, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/Users/mac/project/kerraform/venv/lib/python3.9/site-packages/urllib3/connection.py", line 239, in request
super(HTTPConnection, self).request(method, url, body=body, headers=headers)
File "/Users/mac/.anyenv/envs/pyenv/versions/3.9.9/lib/python3.9/http/client.py", line 1285, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/Users/mac/.anyenv/envs/pyenv/versions/3.9.9/lib/python3.9/http/client.py", line 1326, in _send_request
self.putheader(hdr, value)
File "/Users/mac/project/kerraform/venv/lib/python3.9/site-packages/urllib3/connection.py", line 224, in putheader
_HTTPConnection.putheader(self, header, *values)
File "/Users/mac/.anyenv/envs/pyenv/versions/3.9.9/lib/python3.9/http/client.py", line 1258, in putheader
values[i] = one_value.encode('latin-1')
UnicodeEncodeError: 'latin-1' codec can't encode character '\u201c' in position 6: ordinal not in range(256)
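
This last traceback comes from `http.client` encoding a request header, not from GitPython. The character U+201C is a left typographic quotation mark, so one possible cause (an assumption, since the header values are not shown) is a token or other header value pasted with curly quotes. A small sketch for locating such characters; `find_non_latin1` and the sample value are hypothetical.

```python
from typing import List, Tuple


def find_non_latin1(value: str) -> List[Tuple[int, str]]:
    """Return (index, character) pairs that cannot be encoded as Latin-1."""
    return [(i, ch) for i, ch in enumerate(value) if ord(ch) > 0xFF]


# Hypothetical header value: a token pasted with typographic quotes.
header_value = "token \u201cabc123\u201d"
problems = find_non_latin1(header_value)
```

Running a check like this over the credentials and header values passed to the client would show whether any of them contain characters outside Latin-1.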
