Consistent interface to get text and bytes #895

jdavid · 2019-03-31T11:34:56Z

Follow up from #610 #790 and #893

General policy:

Do not implement str()
To get unicode string use .text
In .text use UTF-8 and replace (the rationale for replace is explained in Patch: Add __str__ and __bytes__ for undecoded content. #790 (comment))
To get the byte string use .data or .raw (this is to be decided)
For attributes the name of the attribute returns text, prefix with raw_ to get bytes. For instance Signature.name and Signature.raw_name
Implement the buffer protocol, bytes(..) where appropriate

Open for discussion.

TODO:

Replace TreeEntry._name by .raw_name
Replace DiffLine.content by .text
Inventory all the places where we get bytes, text, or the buffer protocol
Settle on .data or .raw
Replace DiffLine.raw_content by .data or .raw
Replace Object.read_raw() by .data (or .raw), then remove Blob.data (it will inherit from Object)
Settle on str() bytes() and the buffer protocol

The case of Oid, what we've now:

oid.raw returns the byte string (that's good, unless we decide to settle on .data)
str(oid) and oid.hex both return the hex representation, always <str> (bytes in Python 2 and text in Python 3)
Oid is the only place where we implement str(...)
Object.hex and TreeEntry.hex behave the same, they return always <str>. Apparently these are the only places where we always return <str>.

The text was updated successfully, but these errors were encountered:

jdavid · 2019-03-31T11:42:51Z

If I'm not missing anything..

These are the places where we use .data:

Blob.data
Patch.data

These are the places where we use .raw:

Oid.raw

With read_raw():

Object.read_raw()

jnrbsn · 2019-03-31T21:07:53Z

Just curious, are you wanting to get all this done before doing another release? Also, how do you decide when to do a release?

jnrbsn · 2019-03-31T22:29:03Z

I prefer data as an attribute name, because it's a noun, and it straightforwardly tells you what you're getting. raw is an adjective. Raw what? I think having attributes like raw_name makes sense, but the word "raw" by itself makes less sense. Also, since the word "data" is already kind of generic, it already kind of implies that no assumptions are being made about the structure of the data, and so it already implies "raw".

jdavid · 2019-04-01T07:40:08Z

Good point.

No, this doesn't block the release. What I'd like to do before the next release is handling the pycparser issue, #846

The criteria for new releases is: not too big, not too small (but it may be small if it's important). Looking back usually we've a release every 1 to 3 months.

dsully · 2019-05-07T21:22:40Z

Given that Python 2 will be EOL on January 1 2020, do you want to consider making this a Python 3-only module? That would drastically simplify string & bytes handling.

jnrbsn · 2019-11-12T17:01:39Z

I'm an engineer on the Bitbucket team, and I just want to point out that we absolutely need an option to get raw bytes version of pretty much everything in the repo. It's not a question of Python2 vs. 3. pygit2's strategy of assuming UTF-8 and failing otherwise doesn't work for us all the time. We host millions of repos, and almost all of that repo data is created outside of Bitbucket and pushed to us. But we still need to display it correctly on our website. In the past, we've encountered issues in pygit2 where we simply have no way of reading a piece of data using pygit2 because it's not UTF-8 encoded, and pygit2 just raises an exception.

jdavid · 2019-11-16T08:31:30Z

@jnrbsn Okay, noted.
The fastest way to get something done is to make a PR.

jnrbsn · 2019-11-16T19:26:15Z

@jdavid I know. I wasn't necessarily saying it was urgent. I just wanted to point out that dropping support for Python 2 doesn't mean we no longer need this.

jdavid mentioned this issue Mar 31, 2019

Patch: Add __str__ and __bytes__ for undecoded content. #790

Closed

rcoup mentioned this issue Mar 18, 2020

Accept Pathlike objects (eg. pathlib.Path) in Py3.6+ #990

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consistent interface to get text and bytes #895

Consistent interface to get text and bytes #895

jdavid commented Mar 31, 2019 •

edited

Loading

jdavid commented Mar 31, 2019

jnrbsn commented Mar 31, 2019 •

edited

Loading

jnrbsn commented Mar 31, 2019

jdavid commented Apr 1, 2019

dsully commented May 7, 2019

jnrbsn commented Nov 12, 2019

jdavid commented Nov 16, 2019

jnrbsn commented Nov 16, 2019

Consistent interface to get text and bytes #895

Consistent interface to get text and bytes #895

Comments

jdavid commented Mar 31, 2019 • edited Loading

jdavid commented Mar 31, 2019

jnrbsn commented Mar 31, 2019 • edited Loading

jnrbsn commented Mar 31, 2019

jdavid commented Apr 1, 2019

dsully commented May 7, 2019

jnrbsn commented Nov 12, 2019

jdavid commented Nov 16, 2019

jnrbsn commented Nov 16, 2019

jdavid commented Mar 31, 2019 •

edited

Loading

jnrbsn commented Mar 31, 2019 •

edited

Loading