
feat: Massive refactor with a new API #45

Merged: 4 commits into master from import-impls on Feb 25, 2020

Conversation

@vmx (Member) commented Feb 13, 2020

The new API also makes it possible to wrap an existing hash into a Multihash. This
is useful if you want to, e.g., create hashes for testing without spending time
actually hashing data.

You also interact with multicodecs less directly. This should make it easier to use
your own multihash implementations without forking the code base.

BREAKING CHANGE: There is a new API

When using multihashes, you now import an implementation of that hash, which
has a digest() function. That function returns a Multihash.

New way:

use multihash::{MultihashDigest, Sha3_512};
let my_multihash = Sha3_512::digest(b"hello world!");

Old way:

use multihash::{encode, Hash};
let my_multihash = encode(Hash::SHA3512, b"hello world!");

This PR is heavily based on the ideas from #30, though I tried to keep things simple for now, not exposing a large API.

You might wonder why I implemented all those MultihashDigest implementations manually and didn't use a macro. The problem is that we are using different hashing libraries which don't have a unified API, hence there are small differences that are not easily expressible with macros. The current version uses a macro for that purpose, but I don't find it easy to understand either.

Creating macros for specific algorithms still makes sense, though; e.g. there could be one to support all Blake2 variants.

Please have a look and leave comments. This is really meant as a starting point for discussions, not as a "merge sooner rather than later" PR.

src/digests.rs Outdated
///
/// The size of the hash is determined by the size of the input hash. If it should be truncated,
/// the input data must already be the truncated hash.
pub fn wrap(code: u64, data: &[u8]) -> Multihash {
Member

This should probably take a Code, not a u64

Member Author

The reason this doesn't take a Code is that it can be used for your own custom, not-yet-implemented Multihash without having to fork the library. I thought about making it a trait, but that seemed a bit like overkill.
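For readers unfamiliar with the wire format, here is a minimal sketch of roughly what wrap(code, digest) produces according to the multihash spec (unsigned-varint code, unsigned-varint digest length, then the digest bytes). This is an illustration of the format, not this crate's actual implementation:

```rust
// Append `n` as an unsigned LEB128 varint (low 7 bits first,
// high bit set on all but the last byte).
fn write_varint(out: &mut Vec<u8>, mut n: u64) {
    loop {
        let byte = (n & 0x7f) as u8;
        n >>= 7;
        if n == 0 {
            out.push(byte);
            break;
        }
        out.push(byte | 0x80);
    }
}

// Sketch of `wrap`: varint(code) ++ varint(len) ++ digest.
fn wrap(code: u64, digest: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(digest.len() + 2);
    write_varint(&mut out, code);
    write_varint(&mut out, digest.len() as u64);
    out.extend_from_slice(digest);
    out
}
```

This also shows why wrapping a pre-computed digest needs no hashing at all: it is pure byte bookkeeping.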


@vmx I would suggest having a separate fn for that use case and having the primary one take a Code, as that makes it clearer.


Or, better yet, add Custom(u64) to the enum, so that use of non-standard codes is explicit.
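A minimal sketch of that suggestion (the two listed codes follow the multicodec table; the enum itself is hypothetical, not this PR's code): standard algorithms get named variants, while anything else stays representable but visibly non-standard:

```rust
// Hypothetical `Code` enum with an explicit escape hatch for
// non-standard multicodec codes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Code {
    Sha2_256,
    Sha3_512,
    /// Any code not covered by a named variant.
    Custom(u64),
}

impl Code {
    /// Map a variant to its raw multicodec code.
    pub fn to_u64(self) -> u64 {
        match self {
            Code::Sha2_256 => 0x12,
            Code::Sha3_512 => 0x14,
            Code::Custom(code) => code,
        }
    }

    /// Total (never-failing) mapping back from a raw code.
    pub fn from_u64(code: u64) -> Code {
        match code {
            0x12 => Code::Sha2_256,
            0x14 => Code::Sha3_512,
            other => Code::Custom(other),
        }
    }
}
```

With this shape, algorithm() could return a Code for any multihash, standard or not, instead of panicking on unknown codes.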

@vmx (Member Author) commented Feb 13, 2020

I'm CC'ing a few folks that might have interest in reviewing this: @dvc94ch, @koivunej, @Gozala

data: bytes,
});
}
Ok(Multihash {

As far as I understand it, MultihashRef is dropped here, but we know the bytes are valid or we'd have gotten an Err. I do, however, find it confusing/awkward (probably due to my lack of Rust experience); it makes me react with "Parse, don't validate".

I think the parse approach would not work here, as MultihashRef contains &[u8], but maybe it makes sense to have a separate fn that does not construct a MultihashRef?

Maybe this is idiomatic Rust, but a comment for newbies (like myself) might be worth it.

})
}

/// Returns the bytes representation of the multihash.

I apologize if I'm misunderstanding this, but as far as I can tell this would return the bytes without consuming the Multihash, since to_vec creates a copy (doesn't it?). That seems to contradict the expectation set by into_bytes on String.

If my assumption is correct, this method should probably either be removed or changed so that it consumes the multihash, to match String's behavior.

Collaborator

I think this is correct. A method named into_* carries with it the expectation that it consumes self and can thus perform the conversion cheaply.

For naming guidelines, see https://rust-lang.github.io/api-guidelines/naming.html#ad-hoc-conversions-follow-as_-to_-into_-conventions-c-conv
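A toy illustration of that guideline (the Multihash here is a stand-in struct, not the crate's type): to_* borrows and copies, while into_* consumes self and just moves the buffer out:

```rust
struct Multihash {
    bytes: Vec<u8>,
}

impl Multihash {
    /// `to_*`: borrows `self`, pays for a copy, leaves `self` usable.
    fn to_bytes(&self) -> Vec<u8> {
        self.bytes.clone()
    }

    /// `into_*`: consumes `self`, so the buffer can be moved out for free.
    fn into_bytes(self) -> Vec<u8> {
        self.bytes
    }
}
```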

}

/// Returns which hashing algorithm is used in this multihash.
pub fn algorithm(&self) -> Code {

So wrap allows creating a multihash with a non-standard code, but such a multihash will panic if its algorithm() is called? I think it's better to decide whether custom algorithms are OK or not and stick to that decision.

Personally, I would prefer to have custom Code support, so non-standard codes are supported but it is also clear when one is used.

}

/// Returns the hashed data.
pub fn digest(&self) -> &'a [u8] {

I would much rather have no way of creating an invalid Multihash than have to deal with a possibly invalid multihash across all of its methods.

src/digests.rs Outdated
/// Represents a valid multihash.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct Multihash {
bytes: Bytes,
@rklaehn (Collaborator) commented Feb 14, 2020

Just curious: why are we using Bytes here? Bytes is good if you want to share multiple slices of some larger buffer without copying, but a multihash is typically self-contained. Using Bytes here might have the advantage of avoiding a copy, but it carries the risk of dragging around much more memory than we actually need.

E.g. you get a large buffer from some network or disk interface as a Bytes, then take the first 32 bytes of the buffer into a multihash. The multihash will then contain a pointer to the large buffer, even if you keep the multihash around for a long time and are no longer interested in the rest of the buffer.

Now, sharing is obviously important for a Multihash, but I would think that just using an Arc<[u8]> (not Arc<Vec<u8>>!) would be the best compromise.
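A small sketch of the Arc<[u8]> alternative (the helper name is made up for illustration): Arc::from on a slice copies it into a fresh, exactly-sized shared allocation, so the resulting multihash would not keep the whole source buffer alive:

```rust
use std::sync::Arc;

// Copy the relevant prefix of a (possibly huge) buffer into an
// exactly-sized, cheaply-cloneable shared allocation.
fn into_shared(large_buffer: &[u8], len: usize) -> Arc<[u8]> {
    // `Arc::<[u8]>::from` allocates exactly `len` bytes and copies into it.
    Arc::from(&large_buffer[..len])
}
```

Cloning the resulting Arc<[u8]> is just a refcount bump, which gives the sharing Bytes provides without pinning the rest of the source buffer.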


Was also wondering about this but forgot to ask. Thanks @rklaehn for bringing it up.


I also wonder if the dependency on bytes is really needed here. What was the original rationale (bytes vs. Vec<_>)?

For small amounts of data (like hashes) it might always be faster to just allocate a vector, like sled might do in the future, than to do anything involving Arcs. Then again, perhaps my knowledge of the bytes internals is outdated and I don't have time right now to understand them fully; it could be that using bytes for single-owner data is as simple as Vec<_>.

Even if it were, since bytes is part of the async ecosystem, it might be easier to get this crate stable by using mostly std components, so that it only needs updating when the source specs or hash implementations evolve or more are needed.

Regarding @rklaehn's network use case, I wonder if there is any benefit to actually constructing an owned type around the bytes, as the bytes are usable for verification purposes through the borrowed type, whatever it ends up looking like.

Also wondering if there should be a way to hash into a hopefully-large-enough &mut [u8], but perhaps that can be experimented with outside of the crate if necessary. The same goes for updating on blocks. Perhaps this crate sits at a quite difficult intersection of simplicity and various advanced IO use cases. I also wonder if only mapping from the spec codes to digest instances, which the caller can then use in a number of different ways, would be enough.

Collaborator

For the case where you don't want to own the thing, there is MultihashRef. And in general I must say fewer dependencies are always better, especially for crates that have not yet reached 1.0.

Member

Even if it was, as bytes is part of the async ecosystem

Just to react to that: the bytes crate was introduced as a hack around the fact that you couldn't combine asynchronicity and borrowing; you couldn't, for instance, asynchronously write a buffer by reference.

This is no longer true in a post-async-await world, and I totally see bytes disappearing at some point.

Member Author

The background why bytes is used: #42.


I would propose a comment summarizing why Bytes is used; without that context it clearly seems controversial.

Collaborator

The discussion in #42 was comparing Bytes to Vec. I would love to see the tradeoff between Bytes and Arc<[u8]>. There might be situations where Bytes comes out on top, but currently I don't see them.

Member Author

I've updated the Gist (https://gist.github.com/vmx/0dc0d981b522e2834084559faa0dcb56) and added the code from #47 as a comparison. On my machine the new code is not slower than the bytes approach.


Re: sled usage, I'm closing that linked issue because although it is faster in some cases to copy 2-4k than to bump an Arc, it's NOT faster to allocate a vec first. I need to do more measuring to find out whether that third path between inlining and Arc<[u8]> actually exists.

#[derive(Debug, Copy, Clone, PartialEq, Eq, Hash)]
pub struct MultihashRef<'a> {
bytes: &'a [u8],
}


Since this hasn't been brought up before: structures like this make me wonder if Multihash should be

struct Multihash<'a> {
  bytes: std::borrow::Cow<'a, [u8]>,
}

This would remove the need for separate Multihash and MultihashRef<'_> types by supporting both owning and working on borrowed data. It does bring some ergonomic downsides in the longer run, but as long as it's just some bytes this might be doable. Perhaps there was some rationale for this split that isn't in any of the comments?

Collaborator

I think a multihash with a lifetime would be a world of hurt in terms of ergonomics. I can already see myself fighting lifetime issues to do utterly trivial things with multihashes and futures.

Also, I kinda dislike Cow since it has some surprising footguns... E.g. rust-lang/rust#50160

I think an Arc<[u8]> would be best.


I agree, using Cow can bring some interesting fights with the borrow checker. To use it with owned data you could have type OwnedMultihash = Multihash<'static>; and make sure to always deal with that. I understand that the current limitations of async fn push many things towards being owned or basically Arc<_>, but at the same time I am not sure a crate like multihash is the right place to require anything more than the minimum (slices).

The downside of std::borrow::Cow is the additional requirement of only being able to use Vec as the owned companion, so perhaps keeping MultihashRef<'_> allows for more use cases.
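For illustration, a sketch of the single-type design being discussed (hypothetical, not the PR's code), including the OwnedMultihash = Multihash<'static> alias:

```rust
use std::borrow::Cow;

// One type covering both owned and borrowed multihash bytes.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Multihash<'a> {
    bytes: Cow<'a, [u8]>,
}

// The owned case is just the 'static instantiation.
type OwnedMultihash = Multihash<'static>;

impl<'a> Multihash<'a> {
    /// Borrow existing bytes without copying.
    fn borrowed(bytes: &'a [u8]) -> Multihash<'a> {
        Multihash { bytes: Cow::Borrowed(bytes) }
    }

    /// Detach from the borrowed buffer, copying only if still borrowed.
    fn into_owned(self) -> OwnedMultihash {
        Multihash { bytes: Cow::Owned(self.bytes.into_owned()) }
    }
}
```

This shows the tradeoff mentioned above: the owned companion of Cow<'_, [u8]> is pinned to Vec<u8>, whereas a separate MultihashRef leaves the owned representation free to change.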

@vmx (Member Author) commented Feb 14, 2020

I've pushed a new commit addressing the use of Code instead of a raw integer. How do you like that approach, @Gozala and @dignifiedquire?

I wasn't sure how usable it would be, but looking at the test for the custom codec I quite like it (though if there's a better way, let me know).

@austinabell (Contributor) left a comment

Aside from those two comments, I definitely prefer the new interface. One other small nit is that it is somewhat annoying to have to import MultihashDigest along with the hash type I want just to generate the digest bytes.

I don't personally see the need for the MultihashDigest trait when you can just have the digest and code functions as an implementation of these base types and use the trait for any dynamic usages.

Edit: I updated locally switching to this dependency in Forest, and happy to push if you would like to play around with the broken cid prefix interactions I mention :)

Comment on lines -64 to -84
pub fn size(self) -> u8 {
match self {
// TODO vmx 2020-01-27: Identity doesn't have a fixed length. The `size()` API should
// be removed or renamed to `max_size()` as you can also store truncated hashes with a
// different size.
Hash::Identity => 42,
Hash::SHA1 => 20,
Hash::SHA2256 => 32,
Hash::SHA2512 => 64,
Hash::SHA3224 => 28,
Hash::SHA3256 => 32,
Hash::SHA3384 => 48,
Hash::SHA3512 => 64,
Hash::Keccak224 => 28,
Hash::Keccak256 => 32,
Hash::Keccak384 => 48,
Hash::Keccak512 => 64,
Hash::Blake2b256 => 32,
Hash::Blake2b512 => 64,
Hash::Blake2s128 => 16,
Hash::Blake2s256 => 32,
Contributor

I don't personally use it (and am happy to remove it from my version of rust-cid), but there is currently functionality to get the length of a multihash variant for use in encoding and decoding prefixes, and that functionality is removed by this change.

Member Author

The reason for removing size() is that it currently doesn't do what I would expect it to do. size() should return the size of the digest in the multihash, but you can also use a truncated version of the hash. Hence there could e.g. be a Sha2-256 hash truncated to 10 bytes; then size() should obviously be 10 and not 32.

In the link you provided it uses digest.len(). Shouldn't that still work, as that's the size we really want?

Contributor

Yeah, there shouldn't really be a use for this. I was just pointing out that the multihash length was included in the prefix type you had created in the Cid library (though there was no actual need for that specific field), so if you wanted to get the length of a hash type you would actually have to hash some arbitrary data. Again, this isn't a problem, as I don't see why you would need it.

Member Author

@austinabell Thanks for bringing this up. I agree I don't see why it would be a problem in practice.

src/hashes.rs Outdated
Comment on lines 19 to 46
pub enum ImplementedCode {
    /// Identity (raw binary)
    Identity = 0x00,
    /// SHA-1 (20-byte hash size)
    Sha1 = 0x11,
    /// SHA-256 (32-byte hash size)
    Sha2_256 = 0x12,
    /// SHA-512 (64-byte hash size)
    Sha2_512 = 0x13,
    /// SHA3-512 (64-byte hash size)
    Sha3_512 = 0x14,
    /// SHA3-384 (48-byte hash size)
    Sha3_384 = 0x15,
    /// SHA3-256 (32-byte hash size)
    Sha3_256 = 0x16,
    /// SHA3-224 (28-byte hash size)
    Sha3_224 = 0x17,
    /// Keccak-224 (28-byte hash size)
    Keccak224 = 0x1a,
    /// Keccak-256 (32-byte hash size)
    Keccak256 = 0x1b,
    /// Keccak-384 (48-byte hash size)
    Keccak384 = 0x1c,
    /// Keccak-512 (64-byte hash size)
    Keccak512 = 0x1d,
    /// BLAKE2b-256 (32-byte hash size)
    Blake2b256 = 0xb220,
    /// BLAKE2b-512 (64-byte hash size)
    Blake2b512 = 0xb240,
    /// BLAKE2s-128 (16-byte hash size)
    Blake2s128 = 0xb250,
    /// BLAKE2s-256 (32-byte hash size)
    Blake2s256 = 0xb260,
}
Contributor

There is no way to go from these codes to the structs used for generating digests? This also limits functionality in some dynamic cases: if you get the code by calling algorithm(&self) -> Code, you can't generate another hash from it. Just a thought: you could have a function which takes the code as input and returns a generic type bounded on DynMultihashDigest to solve this.
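A sketch of that thought (the trait and hasher type here are stand-ins, not the PR's actual DynMultihashDigest): a function mapping a raw code to a boxed dynamic hasher, so a code recovered from an existing multihash can be used to hash again:

```rust
// Object-safe stand-in for a dynamic digest trait.
trait DynDigest {
    fn digest(&self, data: &[u8]) -> Vec<u8>;
}

// Identity "hash": returns the input unchanged (code 0x00 in multicodec).
struct Identity;
impl DynDigest for Identity {
    fn digest(&self, data: &[u8]) -> Vec<u8> {
        data.to_vec()
    }
}

// Map a code to a hasher instance; unknown codes yield None.
fn hasher_for_code(code: u64) -> Option<Box<dyn DynDigest>> {
    match code {
        0x00 => Some(Box::new(Identity)),
        // Real implementations would add arms here, e.g. 0x12 => Sha2_256.
        _ => None,
    }
}
```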

Member Author

That's a great idea! See the latest commits for an implementation.

src/hashes.rs Outdated
}

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum ImplementedCode {
Contributor

Thoughts on removing this type and just moving the implemented codes into the Code enum? It seems redundant and annoying to use a separate type for this. Something like austinabell@da0d002 perhaps?

Member Author

Good point and good idea. I wanted that, but you can't mix Custom(u64) together with definitions like Sha2_512 = 0x13. What I hadn't thought of was doing it your way. :)

}

/// The `MultihashDigest` trait specifies an interface common for all multihash functions.
pub trait MultihashDigest {
Contributor

What is the functional benefit of tying the builtin functionality to this trait? It seems unnecessary to have to import the MultihashDigest trait just to use a builtin hash. I only see the need for the DynMultihashDigest trait, unless I'm missing the use case for this one?

Member Author

I see this trait as the thing you would actually want. The DynMultihashDigest is a workaround around Rust limitations.

What do others think? Does it make sense to have both traits, or should we only have the hacky DynMultihashDigest?

@austinabell (Contributor) commented Feb 17, 2020

To be clear about what I am suggesting: austinabell@131cc94

Just quickly put together; you could add code as another base (non-associated) impl and it would be functionally the same, but it wouldn't require importing this trait whenever you want to hash in a non-dynamic context.

Definitely more verbose, and it could be cleaned up, but cleaner usage IMO.

Member Author

Having code() as a method isn't a big deal. The problem I'm seeing is that code() and digest() would then be instance methods that take self as the first parameter rather than static methods. To me, the API should be clean and not hide that self isn't actually needed. This is what I meant by DynMultihashDigest being hacky.
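To illustrate the distinction being debated (names are illustrative, not the PR's exact traits): a trait with static items needs no value to be called, while the object-safe variant takes &self; a blanket impl can derive the dynamic form from the static one, so implementors only write the static version:

```rust
// "Clean" static trait: no self needed to hash.
trait MultihashDigest {
    const CODE: u64;
    fn digest(data: &[u8]) -> Vec<u8>;
}

// Object-safe variant for dynamic dispatch; self carries the algorithm choice.
trait DynMultihashDigest {
    fn code(&self) -> u64;
    fn digest(&self, data: &[u8]) -> Vec<u8>;
}

// Every static implementation is automatically usable dynamically.
impl<T: MultihashDigest> DynMultihashDigest for T {
    fn code(&self) -> u64 {
        <T as MultihashDigest>::CODE
    }
    fn digest(&self, data: &[u8]) -> Vec<u8> {
        <T as MultihashDigest>::digest(data)
    }
}

// Identity "hash" as a trivial example implementation.
struct Identity;
impl MultihashDigest for Identity {
    const CODE: u64 = 0x00;
    fn digest(data: &[u8]) -> Vec<u8> {
        data.to_vec()
    }
}
```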

@austinabell (Contributor) commented Feb 17, 2020

Yeah, I was just too lazy to move the associated code const to the base impl. Does this satisfy that criterion: austinabell@fd07805?

(fixed the changes in an edit)

Contributor

Just to be clear, I have no strong opinion or preference; I'm just offering an alternative, since it seems weird to me to require importing a trait that does nothing just to be able to hash using the builtin implementations. It seems like an even tradeoff, since declaring custom implementations could be slightly confusing.

Member Author

That's (from the end-user perspective) quite nice. We could also have a derive macro, as the trait implementation would be the same.

Could someone with more Rust experience please comment on which approach is better? The one proposed by @austinabell (#45 (comment)) feels a bit unidiomatic, but the resulting API is quite nice. @dignifiedquire perhaps?

@vmx (Member Author) commented Feb 17, 2020

This issue already has lots of comments, so here's a summary of the current state (if I missed anything, let me know):

Resolved:

  • Bytes vs. Vec vs. something else: that discussion has moved to "Use inline storage for small hashes" #47
  • Using Code instead of a raw int for wrapping a Multihash
  • Agreement that this new API is better than the old one

Open:

@vmx (Member Author) commented Feb 19, 2020

I like the idea from @austinabell and have applied his commit. The MultihashDigest trait implementations can easily be replaced with a macro, but I think that can be done later (even in a separate PR).

This resolves all open issues.

@dignifiedquire I'm happy with this PR now, it's ready for a final review.

@dignifiedquire (Member) left a comment

looks good from my side

vmx added 3 commits February 21, 2020 15:30
@vmx (Member Author) commented Feb 21, 2020

I rebased it on master (which was more work than expected). It would be cool to get another quick review from someone, to make sure I didn't make any major mistakes during the rebase.

Also make `wrap()` take a code instance and not a reference, that
makes the API nicer.
@vmx (Member Author) commented Feb 21, 2020

I couldn't resist and pushed one more commit to this PR, as passing the code by reference felt awkward.

@vmx vmx merged commit 2b8dfe3 into master Feb 25, 2020
@vmx vmx deleted the import-impls branch February 25, 2020 11:59
8 participants