
Generic Modules/Utils #23

Open · 4 of 8 tasks
amstocker opened this issue Nov 20, 2015 · 23 comments

@amstocker commented Nov 20, 2015

Before moving on to the rest of libp2p and py-ipfs, we should look at the required generic utilities and determine which existing implementation we want to use, or if there isn't one, who wants to work on it. Here is what we need (see #21 (comment) for more info):

I know we already have several multihash implementations, and I hacked together a multiaddr implementation that could use some love. If anyone has already worked on any of these or is interested in working on any of them, please comment here. Also, if anyone has any more links to references or specs for the generic modules, post them here.

@jbenet (Contributor) commented Nov 20, 2015

multicodec, too. ipld as well.

@amstocker (Author) commented

@jbenet thanks

@amstocker amstocker changed the title Generic Modules Generic Modules/Utils Nov 20, 2015
@candeira (Contributor) commented

Regarding the multihash/multihashing split, it appears our discussion in #13 is now obsolete. We agreed on using a multihash implementation that followed the Python hashlib API, but the current favourite, JulienPalard's, is more of a multihashing implementation, in that it exposes a high-level Python-hashlib-like interface to create multihash values and hexdigests but doesn't implement the low-level go-multihash/js-multihash interface.

The multihash/multihashing split also seems very useful if we are to share as much of the testing as possible, maybe through sharness or some other platform-independent system.

This is my suggestion:

Multihashing:

  • keep JulienPalard's multihash with the python-hashlib API, but rename it to multihashing.
  • augment it so it has a superset of the python-hashlib API. The rationale for a drop-in substitution only works one way anyway: that is, a multihash should be able to be used anywhere that a hashlib digest is, but not in reverse, because a naked digest will never be a legal multihash. This means multihashing should implement digest() and hexdigest(), but we could also implement utility functions such as base58btc_digest().
  • rework it to use an underlying multihash implementation that only cares about the low-level (encode, decode, check, etc.) interface.

Multihash:

  • of the existing multihash implementations, I'm inclined to use @tehmaze's (a minimal sketch of this interface follows the list). The canonical multihash value is a binary string or struct, not an instance of a Multihash object. The functions encode(), decode(), check(), etc. should be just that, module functions. At most the decoded multihash could be a namedtuple, for attribute access of "hashcode, name, length, digest", but the output of multihashing.hash('sha1', 'hello') should be a binary string.
  • in any case, a multihash should be immutable. @bmcorser's multihash implementation allows one to create a multihash object, encode it into a binary string, and then mutate it by decoding a different binary multihash value.
  • also, we should add to it some of the behaviour in the go-multihash reference implementation that the spec doesn't (yet) reflect. For instance, a sha-256 multihash with a 16-byte length (truncating it to 128 bits, the first half of the digest) is a legal multihash value according to go-multihash, and the Python implementation should be able to encode(), decode() and check() such a value.
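
A minimal sketch of that module-level interface, assuming the single-byte code and length headers of the current spec; the two-entry function table and the exact names (including DecodedMultihash) are illustrative, not a finished design:

from collections import namedtuple

FUNC_NAMES = {0x11: 'sha1', 0x12: 'sha2-256'}
FUNC_LENGTHS = {0x11: 20, 0x12: 32}

DecodedMultihash = namedtuple('DecodedMultihash', 'code name length digest')

def encode(code, length, digest):
    """Pack a raw digest into a binary multihash string."""
    if length > FUNC_LENGTHS.get(code, length):
        raise ValueError('length longer than the hash function produces')
    # Passing a length shorter than the digest truncates, as in go-multihash.
    return bytes([code, length]) + digest[:length]

def decode(mh):
    """Unpack a binary multihash; raises ValueError if malformed."""
    if len(mh) < 3:
        raise ValueError('multihash too short')
    code, length, digest = mh[0], mh[1], mh[2:]
    if len(digest) != length:
        raise ValueError('length byte does not match digest length')
    # Truncated digests (e.g. a 16-byte sha2-256) are considered legal.
    return DecodedMultihash(code, FUNC_NAMES.get(code), length, digest)

def check(mh):
    """Explicit well-formedness check returning a boolean."""
    try:
        decode(mh)
        return True
    except ValueError:
        return False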

If people agree to the above points, I could do this work myself; I'm already doing it in Elixir.

@amstocker: I'll look into your multiaddr starter over the weekend. Maybe we can meet on IRC after the libp2p hangout, which I plan to lurk on.

@JulienPalard (Contributor) commented

> The rationale for a drop-in substitution only works one way anyway: that is, a multihash should be able to be used anywhere that a hashlib digest is, but not in reverse, because a naked digest will never be a legal multihash. This means multihashing should implement digest() and hexdigest(), but we could also implement utility functions base58btc_digest(), etc.

I can't completely agree: if we're diverging from hashlib by adding methods like base58_btc_digest, we're in fact completely dropping the drop-in replacement of hashlib, because another drop-in hashlib replacement will probably name the extended method differently, like base58_digest. They're no longer drop-in-replaceable.

On the other hand, I don't think it's the responsibility of a hashlib to encode (in any base, like base58). I mean, one may want to use hashlib without caring about base58 encoding, and one may use a base58 encoder without needing hashlib.

@jbenet (Contributor) commented Nov 20, 2015

I don't think you're limited to one "drop in replacement" interface. What I often do is write the interface I want, and then write another object that wraps it to provide some other useful interface.

from multihashing.hashlib import multihash

Or something more concise :)
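
For instance, a minimal sketch of that wrapping pattern, assuming a low-level multihash module exposing encode(code, length, digest); all the names here are illustrative:

import hashlib
import multihash  # assumed low-level packer exposing encode(code, length, digest)

class MultihashWrapper:
    """Wrap any hashlib (PEP 247) object so that digest() returns a multihash."""

    def __init__(self, hasher, code):
        self._hasher = hasher
        self._code = code

    def update(self, data):
        self._hasher.update(data)

    def digest(self):
        # Prepend the multihash header to the wrapped object's raw digest.
        raw = self._hasher.digest()
        return multihash.encode(self._code, len(raw), raw)

    def hexdigest(self):
        return self.digest().hex()

mh = MultihashWrapper(hashlib.sha1(), 0x11)
mh.update(b'foo')
print(mh.hexdigest())  # 11140beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33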

@jbenet (Contributor) commented Nov 20, 2015

@candeira that sgtm.

If the two-modules thing sounds cumbersome, another option is to use subpackages in Python. I prefer super modular things (hence breaking it up into two modules), but the Python and Go communities in general don't.

@candeira (Contributor) commented

@JulienPalard: I don't think we're talking about the same thing when we mention encoding. It comes down to whether a separation is needed between the functionality of multihash and that of multihashing.

  • multihash would be the low-level library that doesn't know anything about crypto, only about the multihash data structure. Thus, multihash.encode() is just a packer of a (hashcode, length, digest) into a single binary struct/string, after validation of inputs. It doesn't do any hashing of the content data, because its input is already the binary digest from a hash function.
  • multihashing would be the higher-level library that takes some data and the name of a hash function, and returns a digest packed into a multihash value. This is the one that you want to be compatible with Python's hashlib, a goal I agreed with at first, but now not so much.

Your multihash package is actually an implementation of multihashing that doesn't consider a standalone multihash necessary, and implements the necessary functionality of multihash without exposing the low-level API.

However, after reading about the difference between multihash and multihashing, there are several reasons why I think it's important to expose both:

  • The aforementioned usefulness of reusing sharness tests.
  • Checking whether a hash is well formed in the absence of the content is not a crypto operation, and is the province of multihash, not of multihashing.
  • The application-specific hashes. While multihashing can implement common standard hashes, there is a place in the multihash spec for application implementors to just add their own. They would use multihash to pack the multihash value.
  • The header bytes add no entropy to the multihash, and distort the uniform distribution of the values. Any application wanting to strip the headers would do well to also validate that it's dealing with a well-formed value, so it should use multihash instead of just slicing the value mh[2:].

As to the necessity of a base58btcdigest() encoder in multihashing, it's the standard human-readable string encoder for /ipfs/<multihash> URIs, so it should be at least as important and necessary as hexdigest() was considered to be in hashlib when it was first designed. However, this is not as important as having multihashing.verify(), for instance, to be able to easily verify that a blob retrieved via a multihash key indeed hashes to that value.

In short, having a hashlib-compatible API is useful, but restricting ourselves to only the hashlib-compatible API denies our users needed functionality. After all, hashes will be compared and verified much more often than created.

I think there is a design that can please everyone, and I plan to offer an implementation soon. Here it is:

multihashing
├──__init__.py 
├── multihashing.py   # JulienPalard's high level library, refactored to use the low level API, and extended with verify(), truncation, etc.
├── multihash.py      # tehmaze's low level multihash packing library
└── hashlib.py        # multihashing, restricted to expose a cloned API of stdlib's hashlib

This fulfills everybody's needs and requirements:

# multihashing.multihashing is the high level utility module
# it allows to create default hashes (including base58 encoded ones),
# truncated hashes, verify blobs against hashes, etc.

>>> from multihashing import multihashing
>>> m1 = multihashing.sha1()
>>> m1.update(b'foo')
>>> m1.digest()
b'\x11\x14\x0b\xee\xc7\xb5\xea?\x0f\xdb\xc9]\r\xd4\x7f<[\xc2u\xda\x8a3'
>>> m1.digest(length=10) # we make truncated hashes by passing a shorter length than default digest length
b'\x11\x0a\x0b\xee\xc7\xb5\xea?\x0f\xdb\xc9]'
>>> m1.hexdigest()
'11140beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33'
>>> m1.base58btcdigest()
'5dqx43zNtUUbPj97vJhpHyUUPyrmXG'
>>> multihashing.verify(b'\x11\x14\x0b\xee\xc7\xb5\xea?\x0f\xdb\xc9]\r\xd4\x7f<[\xc2u\xda\x8a3', b'foo')
True
>>> multihashing.verify('5dqx43zNtUUbPj97vJhpHyUUPyrmXG', b'foo', 'base58btc')
True

# multihashing.multihash is the low level pack-and-unpack library

>>> from multihashing import multihash
>>> m2 = multihash.encode('sha1', 20, b'\x0b\xee\xc7\xb5\xea?\x0f\xdb\xc9]\r\xd4\x7f<[\xc2u\xda\x8a3')
>>> m2_truncated = multihash.encode('sha1', 10, b'\x0b\xee\xc7\xb5\xea?\x0f\xdb\xc9]\r\xd4\x7f<[\xc2u\xda\x8a3')   # we make truncated hashes by passing a shorter length than the default digest length; this is how multihashing truncates
>>> multihash.check(m2)      # explicit checking returns a boolean
True
>>> multihash.decode(m2)   # implicit checking throws an exception if not a well-formed multihash
DecodedMultihash(code=17, name='sha1', length=20, digest=b'\x0b\xee\xc7\xb5\xea?\x0f\xdb\xc9]\r\xd4\x7f<[\xc2u\xda\x8a3')

# multihashing.hashlib exposes a hashlib compatible API and nothing else, 
# and only creates hashes with default lengths and either bytestring or hex encoding for digests

>>> from multihashing import hashlib
>>> m3 = hashlib.sha1()
>>> m3.update(b'foo')
>>> print(m3.hexdigest())
11140beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33
>>> print(m3.base58btcdigest())
AttributeError: etc.
>>> print(m3.digest(length=10))
TypeError: digest() takes no arguments (1 given)

# multihashing.hashlib and multihashing.multihashing objects can be compared

>>> m1.digest() == m2
True
>>> m3.digest() == m2
True
>>> m1 == m3
True
>>> m2_truncated == m1.digest(length=10)
True

@jbenet I guess you were using 'modules' in a Go sense. This proposal suggests having one single Python package with three independent Python modules, i.e. three .py files. I don't see any reason to split it further.

@JulienPalard (Contributor) commented

I still don't get the difference between multihash and multihashing, and still think the names are badly chosen. However, I'm aware that my multihash implementation doesn't provide a way to unhash a hash (I mean, parse it). But if we have to implement it, why not call it "unmultihash"?

@JulienPalard (Contributor) commented

It looks like I missed your message #23 (comment), so I'm re-replying.

I still think that multihash and multihashing are very badly chosen names, and people will confuse them every time they need one. And I don't think that we need to choose bad names just because a JavaScript implementation chose bad names first.

On the other hand, I'm aware that sticking to a purely hashlib-compatible API is very restrictive.

But adding a base58encode method to multihash seems wrong to me; there's a problem of responsibility: it's not the hashing library's responsibility to provide human-readable representations of hashes. If you go this way, is it also the responsibility of the hashing library to display them in a graphical user interface as a textbox, and the responsibility of the hashing library to find a blob from its hash in a merkle tree? No.

I still agree that we need a base58 library, but not only for the hash representation, probably for some other representations too.

It all boils down to choosing between:

import multihash
import base58

print(base58.encode(multihash.hash(...)))

and

import multihash

print(multihash.hash(...).base58encode())

As I understand it, multihashing may be a "human API" on top of multihash that adds conveniences for working with it, like providing a base58encode and some other useful methods. A facade, for humans (a minimal sketch of this idea follows this comment). In this sense it's not "that bad", but the names are still very confusing and will cost everybody a lot of time (I mean: import multihash ... multihash. ... .base58encode(), oh it throws, it's multihashing, I did it again...). And our points of view on the separation of responsibilities clearly diverge here.

A long message that proposes almost nothing better; that's bad, real bad, sorry. But before choosing names, we have to synchronize on what we want those two different libs to do (what responsibilities they get). We're not simply copying a JavaScript implementation.
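
A minimal sketch of that facade idea, composing a low-level multihash packer with a separate base58 library so that neither needs to know about the other; the module and function names are illustrative assumptions:

import hashlib
import base58     # assumed standalone encoder exposing b58encode()
import multihash  # assumed low-level packer exposing encode(code, length, digest)

HASH_CODES = {'sha1': 0x11, 'sha256': 0x12}

def hash_b58(name, data):
    """Facade for humans: hash data, pack it as a multihash, encode as base58."""
    digest = hashlib.new(name, data).digest()
    return base58.b58encode(multihash.encode(HASH_CODES[name], len(digest), digest))

print(hash_b58('sha1', b'foo'))  # 5dqx43zNtUUbPj97vJhpHyUUPyrmXG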

@amstocker (Author) commented

The first one makes a lot more sense to me, but I would change base58 to mirror Python's base64 module:

import multihash
from base58 import b58encode, b58decode

print(b58encode(multihash.hash(...)))

In general I think we should have these basic modules:

  • base58 module
  • varint module (a minimal sketch follows at the end of this comment)

and then these modules for multihash:

  • low-level multihash module for encoding, decoding multihash bytearrays
  • hashlib interface to the low-level functions
  • optional higher level api with objects

All the multihash modules can be part of py-multihash, and I don't think the naming matters very much between multihash/multihashing.
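
For reference, the varint module in the first list would implement the unsigned varint (LEB128-style) encoding used across the multiformats; a minimal sketch with illustrative names:

def encode_varint(n):
    """Encode a non-negative integer as an unsigned varint."""
    out = bytearray()
    while True:
        byte = n & 0x7f
        n >>= 7
        if n:
            out.append(byte | 0x80)  # set the continuation bit
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(data):
    """Decode an unsigned varint; returns (value, number of bytes consumed)."""
    value = shift = 0
    for i, byte in enumerate(data):
        value |= (byte & 0x7f) << shift
        if not byte & 0x80:  # continuation bit clear: this is the last byte
            return value, i + 1
        shift += 7
    raise ValueError('truncated varint')

assert encode_varint(300) == b'\xac\x02'
assert decode_varint(b'\xac\x02') == (300, 2)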

@ivilata (Contributor) commented Jan 13, 2016

For hashlib compatibility we only need to provide a single class, and in my opinion it still makes sense to provide some additional multihash-related functionality in the same module.

For instance (and taking much inspiration from the examples above), a single multihash.py module may be created that provides a hashlib-compatible class (hash) and a structured utility class (Multihash) in the same place:

>>> import multihash
>>> hash = multihash.hash('sha1')  # hashlib-compatible object
>>> hash.update(b'foo')
>>> digest = hash.digest()
>>> digest
b'\x11\x14\x0b\xee\xc7\xb5\xea?\x0f\xdb\xc9]\r\xd4\x7f<[\xc2u\xda\x8a3'

>>> import base58
>>> base58.b58encode(digest)  # basic way to encode a digest
'5dqx43zNtUUbPj97vJhpHyUUPyrmXG'

>>> mh = multihash.Multihash.from_digest(digest)  # parse the digest
>>> mh
multihash.Multihash(func=<multihash.Func.sha1: 17>, length=20, digest=b'\x0b...')
>>> mh.encode('base58')  # same result as above
'5dqx43zNtUUbPj97vJhpHyUUPyrmXG'

# Truncating while encoding may be supported
# but it breaks ``Mh.decode(mh.encode(X), X) == mh``,
# thus I prefer creating a different Multihash explicitly.
>>> mh.encode('base58', length=10)
'KegLCXHeZVecdh4Y'
>>> mh_trunc = mh.truncate(10)  # truncate to a new multihash
>>> mh_trunc
multihash.Multihash(func=<multihash.Func.sha1: 17>, length=10, digest=b'\x0b...')
>>> mh_trunc.encode('base58')  # same result as above
'KegLCXHeZVecdh4Y'

>>> mh_dec = multihash.Multihash.decode('KegLCXHeZVecdh4Y', 'base58')  # decoding
>>> mh_dec
multihash.Multihash(func=<multihash.Func.sha1: 17>, length=10, digest=b'\x0b...')
>>> mh_dec == mh_trunc
True
# No ``multihash.check()``, use ``from_digest()`` and catch ``ValueError``.

>>> mh.verify(b'foo')  # verification
True
>>> mh_trunc.verify(b'foo')
True

The Multihash class above could well be implemented as a namedtuple for memory savings. I used an Enum for the hash function as a comfortable and compact way to have the function code and name together, although numbers and strings may be accepted as well:

>>> multihash.Multihash(multihash.Func.sha1, 20, b'...')  # function by enum
multihash.Multihash(func=<multihash.Func.sha1: 17>, length=20, digest=b'...')
>>> multihash.Multihash(0x11, 20, b'...')  # function by code
multihash.Multihash(func=<multihash.Func.sha1: 17>, length=20, digest=b'...')
>>> multihash.Multihash('sha1', 20, b'...')  # function by name
multihash.Multihash(func=<multihash.Func.sha1: 17>, length=20, digest=b'...')
>>> multihash.Multihash(0x04, 16, b'...')  # user-defined hashing function
multihash.Multihash(func=4, length=16, digest=b'...')
>>> multihash.Multihash(0x15, 16, b'...')  # unknown hashing function
ValueError: ...
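
A minimal sketch of that namedtuple-plus-Enum combination; the _coerce helper, the two-entry function table and the app-specific code range handling are illustrative assumptions:

from collections import namedtuple
from enum import Enum

class Func(Enum):
    sha1 = 0x11
    sha2_256 = 0x12

class Multihash(namedtuple('Multihash', 'func length digest')):
    def __new__(cls, func, length, digest):
        return super().__new__(cls, cls._coerce(func), length, digest)

    @staticmethod
    def _coerce(func):
        """Accept a Func member, an integer code, or a name string."""
        if isinstance(func, Func):
            return func
        if isinstance(func, int):
            if 0x00 <= func <= 0x0f:   # app-specific range in the spec
                return func            # user-defined codes pass through as ints
            return Func(func)          # raises ValueError if unknown
        return Func[func.replace('-', '_')]  # lookup by name

print(Multihash('sha1', 20, b'...'))  # Multihash(func=<Func.sha1: 17>, ...)
print(Multihash(0x04, 16, b'...'))    # Multihash(func=4, ...)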

Top-level utility functions may be defined to use encoded multihash strings straight away instead of Multihash (e.g. mh_digest == b'\x11\x14\x0b...' and mh_string == '5dqx...'):

multihash.encode(mh_digest, coding[, length]) -> mh_string
multihash.decode(mh_string, coding) -> mh_digest
multihash.verify(mh_string, coding, data) -> bool

But I don't much see the point of these last functions.

@ivilata (Contributor) commented Jan 15, 2016

After reviewing the previous proposals, I've noticed that it may not make sense to implement multihash as a hashlib (PEP 247) compatible object, since it's not providing any new hashing method of its own. Multihash is just a way to encode the results of other hashes in a binary packaging; even the re-encoding part (with base64 and the like) would not be necessary, only comfortable to have. With that in mind, I've come up with a leaner version of the proposal:

>>> import hashlib
>>> hash = hashlib.sha1()  # use hashlib for proper hashing
>>> hash.update(b'foo')
>>> digest = hash.digest()
>>> digest
b'\x0b\xee\xc7\xb5\xea?\x0f\xdb\xc9]\r\xd4\x7f<[\xc2u\xda\x8a3'

>>> import multihash
>>> mh = multihash.Multihash.from_hash(hash)
>>> mh
multihash.Multihash(func=<multihash.Func.sha1: 17>, length=20, digest=b'\x0b...')
>>> mh.encode()  # binary multihash encoding
b'\x11\x14\x0b\xee\xc7\xb5\xea?\x0f\xdb\xc9]\r\xd4\x7f<[\xc2u\xda\x8a3'
>>> import base58
>>> bytes(base58.b58encode(mh.encode()))  # plain way to ASCII-armor
b'5dqx43zNtUUbPj97vJhpHyUUPyrmXG'
>>> mh.encode('base58')  # just a shortcut to the operation above
b'5dqx43zNtUUbPj97vJhpHyUUPyrmXG'

>>> mh_trunc = mh.truncate(10)  # truncate to a new multihash
>>> mh_trunc
multihash.Multihash(func=<multihash.Func.sha1: 17>, length=10, digest=b'\x0b...')
>>> mh_trunc.encode('base58')
b'KegLCXHeZVecdh4Y'
# Truncating while encoding may be supported
# but ``decode(mh.encode(X, L)) != decode(mh.encode(X))``,
# thus I prefer creating a different Multihash explicitly.
>>> mh.encode('base58', length=10)  # same result as above
b'KegLCXHeZVecdh4Y'

>>> mh_bin = multihash.decode(mh.encode())  # binary encode and decode
>>> mh == mh_bin
True
>>> mh_dec = multihash.decode(b'KegLCXHeZVecdh4Y', 'base58')
>>> mh_dec == mh_trunc
True

>>> mh.verify(b'foo')
True
>>> mh_trunc.verify(b'foo')
True
>>> multihash.hash(hashlib.sha1, b'foo') == mh
True
# no multihash.check(), use multihash.decode() and catch ValueError
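
A minimal sketch of what from_hash(), truncate() and verify() could look like under this leaner design; the name-to-code table and the int-coded func field are illustrative assumptions:

import hashlib
from collections import namedtuple

_CODE_BY_NAME = {'sha1': 0x11, 'sha256': 0x12}
_NAME_BY_CODE = {v: k for k, v in _CODE_BY_NAME.items()}

class Multihash(namedtuple('Multihash', 'func length digest')):
    @classmethod
    def from_hash(cls, hasher):
        """Build a Multihash from a hashlib (PEP 247) object."""
        digest = hasher.digest()
        return cls(_CODE_BY_NAME[hasher.name], len(digest), digest)

    def truncate(self, length):
        """Return a new Multihash with a shortened digest."""
        return Multihash(self.func, length, self.digest[:length])

    def verify(self, data):
        """Re-hash data with the same function and compare, honouring truncation."""
        digest = hashlib.new(_NAME_BY_CODE[self.func], data).digest()
        return digest[:self.length] == self.digest

h = hashlib.sha1(b'foo')
mh = Multihash.from_hash(h)
print(mh.verify(b'foo'))               # True
print(mh.truncate(10).verify(b'foo'))  # True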

@ivilata (Contributor) commented Jan 18, 2016

An ongoing implementation of the proposal just above is at https://github.com/ivilata/pymultihash.

@candeira (Contributor) commented

We're overlapping, ivilata. My ongoing implementation is at https://github.com/candeira/py-multihash.

@ivilata (Contributor) commented Jan 25, 2016

Hi @candeira, my implementation is based on my proposal above, which departs from your multihash/multihashing proposal to focus on something more pythonic (to my taste, of course!). It is currently only lacking some error checks and an introductory module docstring with some comprehensive usage examples beyond function doctests. I hope that the complete doc can offer a better feeling for the module and help us decide which parts to reuse, merge or drop. No worries if the whole thing is merged or dropped altogether. :) Thanks!

@ivilata (Contributor) commented Feb 9, 2016

I've uploaded pymultihash to PyPI. It's feature-complete, with some testing (using doctests) and a pretty complete tutorial in the package docstring (which I'm working to put through Sphinx). You're very welcome to have a look at it and give your opinions, thanks! Edit: docs are available on Read the Docs.

@amstocker (Author) commented

@ivilata this is really awesome!!

@candeira (Contributor) commented

Really good! I'm still struggling to find time to finish mine. More to come.

@ivilata (Contributor) commented Feb 11, 2016

Thanks! In general I'd like to make a case for not trying to mimic the Go or JS implementations, and going for something more Pythonic where possible. :)

@Fak3 commented May 12, 2016

I found another multihash implementation: https://github.com/kalmi/py-multihash

@JulienPalard (Contributor) commented

@Fak3 YAYYYYYYYYYY ! #irony

@amstocker (Author) commented

@Fak3 there are a few out there, but AFAIK the one made by @ivilata is the most fleshed out: https://github.com/ivilata/pymultihash

@fredthomsen (Contributor) commented

There is an additional multiaddr implementation by @sbuss as well: https://github.com/sbuss/py-multiaddr.
