Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consistent interface to get text and bytes #895

Open
1 of 7 tasks
jdavid opened this issue Mar 31, 2019 · 8 comments
Open
1 of 7 tasks

Consistent interface to get text and bytes #895

jdavid opened this issue Mar 31, 2019 · 8 comments

Comments

@jdavid
Copy link
Member

jdavid commented Mar 31, 2019

Follow up from #610 #790 and #893

General policy:

  • Do not implement str()
  • To get unicode string use .text
  • In .text use UTF-8 and replace (the rationale for replace is explained in Patch: Add __str__ and __bytes__ for undecoded content. #790 (comment))
  • To get the byte string use .data or .raw (this is to be decided)
  • For attributes the name of the attribute returns text, prefix with raw_ to get bytes. For instance Signature.name and Signature.raw_name
  • Implement the buffer protocol, bytes(..) where appropriate

Open for discussion.

TODO:

  • Replace TreeEntry._name by .raw_name
  • Replace DiffLine.content by .text
  • Inventory all the places where we get bytes, text, or the buffer protocol
  • Settle on .data or .raw
  • Replace DiffLine.raw_content by .data or .raw
  • Replace Object.read_raw() by .data (or .raw), then remove Blob.data (it will inherit from Object)
  • Settle on str() bytes() and the buffer protocol

The case of Oid, what we've now:

  • oid.raw returns the byte string (that's good, unless we decide to settle on .data)
  • str(oid) and oid.hex both return the hex representation, always <str> (bytes in Python 2 and text in Python 3)
  • Oid is the only place where we implement str(...)
  • Object.hex and TreeEntry.hex behave the same, they return always <str>. Apparently these are the only places where we always return <str>.
@jdavid
Copy link
Member Author

jdavid commented Mar 31, 2019

If I'm not missing anything..

These are the places where we use .data:

  • Blob.data
  • Patch.data

These are the places where we use .raw:

  • Oid.raw

With read_raw():

  • Object.read_raw()

@jnrbsn
Copy link
Contributor

jnrbsn commented Mar 31, 2019

Just curious, are you wanting to get all this done before doing another release? Also, how do you decide when to do a release?

@jnrbsn
Copy link
Contributor

jnrbsn commented Mar 31, 2019

I prefer data as an attribute name, because it's a noun, and it straightforwardly tells you what you're getting. raw is an adjective. Raw what? I think having attributes like raw_name makes sense, but the word "raw" by itself makes less sense. Also, since the word "data" is already kind of generic, it already kind of implies that no assumptions are being made about the structure of the data, and so it already implies "raw".

@jdavid
Copy link
Member Author

jdavid commented Apr 1, 2019

Good point.

No, this doesn't block the release. What I'd like to do before the next release is handling the pycparser issue, #846

The criteria for new releases is: not too big, not too small (but it may be small if it's important). Looking back usually we've a release every 1 to 3 months.

@dsully
Copy link
Contributor

dsully commented May 7, 2019

Given that Python 2 will be EOL on January 1 2020, do you want to consider making this a Python 3-only module? That would drastically simplify string & bytes handling.

@jnrbsn
Copy link
Contributor

jnrbsn commented Nov 12, 2019

I'm an engineer on the Bitbucket team, and I just want to point out that we absolutely need an option to get raw bytes version of pretty much everything in the repo. It's not a question of Python2 vs. 3. pygit2's strategy of assuming UTF-8 and failing otherwise doesn't work for us all the time. We host millions of repos, and almost all of that repo data is created outside of Bitbucket and pushed to us. But we still need to display it correctly on our website. In the past, we've encountered issues in pygit2 where we simply have no way of reading a piece of data using pygit2 because it's not UTF-8 encoded, and pygit2 just raises an exception.

@jdavid
Copy link
Member Author

jdavid commented Nov 16, 2019

@jnrbsn Okay, noted.
The fastest way to get something done is to make a PR.

@jnrbsn
Copy link
Contributor

jnrbsn commented Nov 16, 2019

@jdavid I know. I wasn't necessarily saying it was urgent. I just wanted to point out that dropping support for Python 2 doesn't mean we no longer need this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants