Encoding problem #332
Comments
Thanks for posting this issue! It seems that the TagObject's information can't be decoded because it contains bytes that are not valid UTF-8, which is unexpected. Even though you have already discovered a workaround, the original problem remains. A proper fix would re-evaluate the current code and prefer to work on bytes instead of a decoded string. |
In fact, no, |
More data: technically this is an encoding error in the data itself:
In [11]: stream = pricing.odb.stream(tag.object.binsha)
In [12]: stream.read()
Out[12]: 'object 4b50858c4debda3ad5d6ea5b7a485cd4eb5ecc73\ntype commit\ntag PROMOTED_1501131729_MKT15_01_12_QU_1\ntagger \xa8John Doe <jdoe@megacorp.com> 1421167136 +0100\n\nMerged CIU_MKT1501_28 to remote master\n'
You can see the offending character just before the tagger's name (technically being part of it). On the other hand, I don't even know if git handles this, but what happens when different objects are encoded with different encodings? I'm pretty sure git objects do not store this kind of info... |
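To make the failure concrete, here is a minimal Python 3 sketch that works on the raw tag bytes from the session above. It is not GitPython code: `parse_tag_headers` is a hypothetical helper written only for this illustration. It shows that the object id can be read without ever decoding the tagger line, and that only a strict UTF-8 decode of the tagger fails.

```python
# Raw tag object as printed in the session above, written as a Python 3 bytes literal.
RAW_TAG = (
    b"object 4b50858c4debda3ad5d6ea5b7a485cd4eb5ecc73\n"
    b"type commit\n"
    b"tag PROMOTED_1501131729_MKT15_01_12_QU_1\n"
    b"tagger \xa8John Doe <jdoe@megacorp.com> 1421167136 +0100\n"
    b"\n"
    b"Merged CIU_MKT1501_28 to remote master\n"
)


def parse_tag_headers(raw):
    """Hypothetical helper: split a raw tag into (headers, message), all as bytes."""
    header_block, _, message = raw.partition(b"\n\n")
    headers = {}
    for line in header_block.split(b"\n"):
        key, _, value = line.partition(b" ")
        headers[key] = value
    return headers, message


headers, message = parse_tag_headers(RAW_TAG)

# The object id is plain ASCII hex, so it decodes safely whatever the tagger contains.
object_sha = headers[b"object"].decode("ascii")

# A strict UTF-8 decode of the tagger line fails on the stray 0xa8 byte.
try:
    headers[b"tagger"].decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# errors="replace" substitutes U+FFFD for the bad byte and keeps the rest readable.
tagger_display = headers[b"tagger"].decode("utf-8", errors="replace")

print(object_sha)
print(strict_ok)
print(tagger_display)
```

Keeping the headers as bytes until a specific field is needed means one malformed field cannot prevent access to the others, which is the direction suggested earlier in the thread.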
Using the Even if parsing is made to work at some point, right now the tagger-name are expected to be |
What about using |
Great idea ! |
But that would be against your policy of handling as much as possible as |
BTW, that fixed my particular problem, but I guess you don't want the PR just yet... |
But how would you want to produce proper unicode strings if the encoding is unclear? It's unsafe to try it, as this example shows. Maybe a suitable solution would be to allow the client to set the decode-behaviour on a per-repository basis to control whether Doing this sounds like quite some work - and as it stands, the unicode handling in GitPython seems flawed by design :(. |
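One way to picture the per-repository decode behaviour mentioned above is the following Python 3 sketch. `DecodePolicy` and its fields are purely hypothetical and are not part of GitPython; the only real API used is `bytes.decode` with its standard error handlers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecodePolicy:
    """Hypothetical per-repository setting; NOT an existing GitPython class."""

    encoding: str = "utf-8"
    errors: str = "strict"      # any codecs error handler, e.g. "replace"
    return_bytes: bool = False  # True: skip decoding and hand back raw bytes

    def apply(self, raw: bytes):
        if self.return_bytes:
            return raw
        return raw.decode(self.encoding, self.errors)


raw_name = b"\xa8John Doe"  # tagger name bytes taken from the report above

strict = DecodePolicy()                    # raises UnicodeDecodeError on bad input
lenient = DecodePolicy(errors="replace")   # bad bytes become U+FFFD
binary = DecodePolicy(return_bytes=True)   # caller gets bytes and decides

print(repr(lenient.apply(raw_name)))
print(repr(binary.apply(raw_name)))
```

A client that knows its history contains mixed encodings could pick the lenient or binary policy, while everyone else keeps strict decoding and the current return types.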
I think git just doesn't handle encoding at all. In any case, any free-form byte sequences (strings) are strings for user consumption: tag names, log comments, etc. Even filenames are, I'm sure, not converted in any way. In fact, most (Unix/Linux) filesystems know nothing about encoding: it's possible to handle filenames written in one encoding on a system using another encoding, simply because filenames are treated as byte sequences with no specific meaning or encoding. |
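The "byte sequences with no specific meaning" point can be illustrated in Python 3 with the standard `surrogateescape` error handler, which is the same mechanism Python uses for undecodable filenames on POSIX systems. The filename below is a made-up example, not taken from any repository in this thread.

```python
# Hypothetical filename stored as Latin-1 bytes; 0xe9 is not valid UTF-8 here.
raw_name = b"caf\xe9.txt"

# surrogateescape maps each undecodable byte to a lone surrogate code point,
# so decoding never fails and no information is lost.
as_text = raw_name.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = as_text.encode("utf-8", errors="surrogateescape")

print(repr(as_text))
print(round_tripped == raw_name)
```

The resulting string is suitable for passing back to byte-oriented APIs, but it cannot be encoded as strict UTF-8, for example when printing or writing JSON, so it is a transport format rather than a display format.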
I have encountered a similar problem: when invoking diff on a file that contains an invalid UTF-8 sequence in this locale, GitPython fails with UnicodeDecodeError. Backtrace follows:
File "/usr/lib/python2.7/site-packages/gitupstream/gitupstream.py", line 175, in update
Will this issue cover my error, or do I need to create another one? Maybe you could help me with the solution? |
@StyXman You are totally right. As stated previously, fixing this in GitPython may be a breaking change to some, as bytes would be returned instead of unicode. This makes me somewhat reluctant to attempt such a change, but I should check how much is actually affected. @CepGamer You can pass the |
I believe I ran into a similar issue. When querying the commit message for a commit, the following exception is thrown:
ERROR:git.objects.commit:Failed to decode message '...' using encoding UTF-8
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/git/objects/commit.py", line 500, in _deserialize
    self.message = self.message.decode(self.encoding)
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf8 in position 126: invalid start byte
Unfortunately, I cannot share the exact commit message or the repository. I did not succeed in reproducing it in a test repository. |
A new release was just made to pypi 😁 (see #298) ! |
I have run into this problem. This script, which tries to loop through the tags of the nodejs/node repository, exposes this bug: https://gist.github.com/sbenthall/14c4d14c00876440ba6d0ae62efa432f Using version 2.1.11 |
I have the same issue when reading the branches property. How can I solve it? |
I have a similar question too🥲
File "/Users/mac/project/kerraform/./auto_git_api.py", line 80, in pull_request |
I'm not sure this is a proper way to use TagReferences, but it's definitely unexpected. This time I'm using GitPython installed from PyPI.
I have this nice tag:
I can get a lot of info out of it:
But this fails:
Unluckily this is happening with an internal repo and I don't know how to even try to reproduce it with a public one. Meanwhile I can work around it by using tag.object.hexsha, which is what I wanted.