
feat: Massive refactor with a new API #45

Merged: 4 commits into master from import-impls on Feb 25, 2020

Conversation

@vmx (Member) commented Feb 13, 2020

The new API also makes it possible to wrap an existing hash into a Multihash. This
is useful if you want to, e.g., create hashes for testing without spending time
actually hashing data.

You also interact with multicodecs less directly. This should make it easier to use
your own multihash implementations without forking the code base.

BREAKING CHANGE: There is a new API

When using multihashes, you now import an implementation of that hash, which
has a digest() function. That function returns a Multihash.

New way:

use multihash::{MultihashDigest, Sha3_512};
let my_multihash = Sha3_512::digest(b"hello world!");

Old way:

use multihash::{encode, Hash};
let my_multihash = encode(Hash::SHA3512, b"hello world!");

This PR is heavily based on the ideas from #30, though I tried to keep things simple for now, not exposing a large API.

You might wonder why I implemented all those MultihashDigest implementations manually and didn't use a macro. The problem is that we are using different hashing libraries which don't have a unified API, hence there are small differences that are not easily expressible with macros. The current version uses a macro for that purpose, but I don't find it easy to understand either.

Creating macros for specific algorithms still makes sense, though; e.g. there could be one to support all Blake2 variants.

Please have a look and leave comments. This is really meant as a starting point for discussions, not as a "merge sooner rather than later" PR.

src/digests.rs Outdated
///
/// The size of the hash is determined by the size of the input hash. If it should be truncated,
/// the input data must already be the truncated hash.
pub fn wrap(code: u64, data: &[u8]) -> Multihash {
Member

This should probably take a Code, not a u64

Member Author

The reason this doesn't take a Code is that it can be used for your own custom, not-yet-implemented Multihash without having to fork the library. I thought about making it a trait, but that seemed a bit like overkill.
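For readers unfamiliar with the wire format, here is a minimal sketch of roughly what wrap(code, digest) produces according to the multihash spec (unsigned-varint code, unsigned-varint digest length, then the digest bytes). This is an illustration of the format, not this crate's actual implementation:

```rust
// Append `n` as an unsigned LEB128 varint (low 7 bits first,
// high bit set on all but the last byte).
fn write_varint(out: &mut Vec<u8>, mut n: u64) {
    loop {
        let byte = (n & 0x7f) as u8;
        n >>= 7;
        if n == 0 {
            out.push(byte);
            break;
        }
        out.push(byte | 0x80);
    }
}

// Sketch of `wrap`: varint(code) ++ varint(len) ++ digest.
fn wrap(code: u64, digest: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(digest.len() + 2);
    write_varint(&mut out, code);
    write_varint(&mut out, digest.len() as u64);
    out.extend_from_slice(digest);
    out
}
```

This also shows why wrapping a pre-computed digest needs no hashing at all: it is pure byte bookkeeping.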


@vmx I would suggest having a separate fn for that use case and having the primary one take a Code, as that makes it clearer.


Or, better yet, add Custom(u64) to the enum, so that use of non-standard codes is explicit.
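A minimal sketch of that suggestion (the two listed codes follow the multicodec table; the enum itself is hypothetical, not this PR's code): standard algorithms get named variants, while anything else stays representable but visibly non-standard:

```rust
// Hypothetical `Code` enum with an explicit escape hatch for
// non-standard multicodec codes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Code {
    Sha2_256,
    Sha3_512,
    /// Any code not covered by a named variant.
    Custom(u64),
}

impl Code {
    /// Map a variant to its raw multicodec code.
    pub fn to_u64(self) -> u64 {
        match self {
            Code::Sha2_256 => 0x12,
            Code::Sha3_512 => 0x14,
            Code::Custom(code) => code,
        }
    }

    /// Total (never-failing) mapping back from a raw code.
    pub fn from_u64(code: u64) -> Code {
        match code {
            0x12 => Code::Sha2_256,
            0x14 => Code::Sha3_512,
            other => Code::Custom(other),
        }
    }
}
```

With this shape, algorithm() could return a Code for any multihash, standard or not, instead of panicking on unknown codes.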

@vmx (Member Author) commented Feb 13, 2020

I'm CC'ing a few folks that might have interest in reviewing this: @dvc94ch, @koivunej, @Gozala

data: bytes,
});
}
Ok(Multihash {

As far as I understand it, MultihashRef is dropped here, but we know the bytes are valid or we'd have gotten an Err. I do, however, find it confusing/awkward (probably due to my lack of Rust experience); it makes me react with "Parse, don't validate".

I think the parse approach would not work here, as MultihashRef contains &[u8], but maybe it makes sense to have a separate fn that does not construct a MultihashRef?

Maybe this is idiomatic Rust, but a comment for newbies (like myself) might be worth it.

})
}

/// Returns the bytes representation of the multihash.

I apologize if I'm misunderstanding this, but as far as I can tell this would return the bytes without consuming the Multihash, since to_vec creates a copy (doesn't it?). That seems to contradict the expectation set by into_bytes on String.

If my assumption is correct, this method should probably either be removed or changed so that it consumes the multihash, to match String's behavior.

Collaborator

I think this is correct. A method named into_* carries with it the expectation that it consumes self and can thus perform the conversion cheaply.

For naming guidelines, see https://rust-lang.github.io/api-guidelines/naming.html#ad-hoc-conversions-follow-as_-to_-into_-conventions-c-conv
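A toy illustration of that guideline (the Multihash here is a stand-in struct, not the crate's type): to_* borrows and copies, while into_* consumes self and just moves the buffer out:

```rust
struct Multihash {
    bytes: Vec<u8>,
}

impl Multihash {
    /// `to_*`: borrows `self`, pays for a copy, leaves `self` usable.
    fn to_bytes(&self) -> Vec<u8> {
        self.bytes.clone()
    }

    /// `into_*`: consumes `self`, so the buffer can be moved out for free.
    fn into_bytes(self) -> Vec<u8> {
        self.bytes
    }
}
```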

}

/// Returns which hashing algorithm is used in this multihash.
pub fn algorithm(&self) -> Code {

So wrap allows creating a multihash with a non-standard code, but such a multihash will panic if its algorithm() is called? I think it's better to decide whether custom algorithms are OK or not and stick to that decision.

Personally, I would prefer to have custom Code support, so non-standard codes are supported but it is also clear when one is used.

}

/// Returns the hashed data.
pub fn digest(&self) -> &'a [u8] {

I would much rather have no way of creating an invalid Multihash than have to deal with a possibly invalid multihash across all of its methods.

src/digests.rs Outdated
/// Represents a valid multihash.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct Multihash {
bytes: Bytes,
@rklaehn (Collaborator) commented Feb 14, 2020

Just curious: why are we using Bytes here? Bytes is good if you want to share multiple slices of some larger buffer without copying, but a multihash is typically self-contained. Using Bytes here might have the advantage of avoiding a copy, but it carries the risk of dragging around much more memory than we actually need.

E.g. you get a large buffer from some network or disk interface as a Bytes, then take the first 32 bytes of the buffer into a multihash. The multihash will then contain a pointer to the large buffer, even if you keep the multihash around for a long time and are no longer interested in the rest of the buffer.

Now, sharing is obviously important for a Multihash, but I would think that just using an Arc<[u8]> (not Arc<Vec<u8>>!) would be the best compromise.
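A small sketch of the Arc<[u8]> alternative (the helper name is made up for illustration): Arc::from on a slice copies it into a fresh, exactly-sized shared allocation, so the resulting multihash would not keep the whole source buffer alive:

```rust
use std::sync::Arc;

// Copy the relevant prefix of a (possibly huge) buffer into an
// exactly-sized, cheaply-cloneable shared allocation.
fn into_shared(large_buffer: &[u8], len: usize) -> Arc<[u8]> {
    // `Arc::<[u8]>::from` allocates exactly `len` bytes and copies into it.
    Arc::from(&large_buffer[..len])
}
```

Cloning the resulting Arc<[u8]> is just a refcount bump, which gives the sharing Bytes provides without pinning the rest of the source buffer.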


Was also wondering about this but forgot to ask. Thanks @rklaehn for bringing it up.


I also wonder if the dependency on bytes is really needed here. What was the original rationale (bytes vs. Vec<_>)?

For small amounts of data (like hashes) it might always be faster to just allocate a vector, like sled might do in the future, than to do anything involving Arcs. Then again, perhaps my knowledge of the bytes internals is outdated and I don't have time right now to understand them fully; it could be that using bytes for single-owner data is as simple as Vec<_>.

Even if it were, since bytes is part of the async ecosystem, it might be easier to get this crate stable by using mostly std components, so that it only needs updating when the source specs or hash implementations evolve or more are needed.

Regarding @rklaehn's network use case, I wonder if there is any benefit to actually constructing an owned type around the bytes, as the bytes are usable for verification purposes through the borrowed type, whatever it ends up looking like.

Also wondering if there should be a way to hash into a hopefully-large-enough &mut [u8], but perhaps that can be experimented with outside of the crate if necessary. The same goes for updating on blocks. Perhaps this crate sits at a quite difficult intersection of simplicity and various advanced IO use cases. I also wonder if only mapping from the spec codes to digest instances, which the caller can then use in a number of different ways, would be enough.

Collaborator

For the case where you don't want to own the thing, there is MultihashRef. And in general I must say fewer dependencies are always better, especially for crates that have not yet reached 1.0.

Member

Even if it was, as bytes is part of the async ecosystem

Just to react to that: the bytes crate was introduced as a hack around the fact that you couldn't combine asynchronicity and borrowing; you couldn't, for instance, asynchronously write a buffer by reference.

This is no longer true in a post-async-await world, and I totally see bytes disappearing at some point.

Member Author

The background why bytes is used: #42.


I would propose a comment summarizing why Bytes is used; without that context it clearly seems controversial.

Collaborator

The discussion in #42 was comparing Bytes to Vec. I would love to see the tradeoff between Bytes and Arc<[u8]>. There might be situations where Bytes comes out on top, but currently I don't see them.

Member Author

I've updated the Gist (https://gist.github.com/vmx/0dc0d981b522e2834084559faa0dcb56) and added the code from #47 as a comparison. On my machine the new code is not slower than the bytes approach.


Re: sled usage, I'm closing that linked issue because although it is faster in some cases to copy 2-4k than to bump an Arc, it's NOT faster to allocate a vec first. I need to do more measuring to find out whether that third path between inlining and Arc<[u8]> actually exists.

#[derive(Debug, Copy, Clone, PartialEq, Eq, Hash)]
pub struct MultihashRef<'a> {
bytes: &'a [u8],
}


Since this hasn't been brought up before: structures like this make me wonder if Multihash should be

struct Multihash<'a> {
  bytes: std::borrow::Cow<'a, [u8]>,
}

This would remove the need for separate Multihash and MultihashRef<'_> types by supporting both owning and working on borrowed data. It does bring some ergonomic downsides in the longer run, but as long as it's just some bytes this might be doable. Perhaps there was some rationale for this split that isn't in any of the comments?

Collaborator

I think a multihash with a lifetime would be a world of hurt in terms of ergonomics. I can already see myself fighting lifetime issues to do utterly trivial things with multihashes and futures.

Also, I kinda dislike Cow since it has some surprising footguns... E.g. rust-lang/rust#50160

I think an Arc<[u8]> would be best.


I agree, using Cow can bring some interesting fights with the borrow checker. To use it with owned data you could have type OwnedMultihash = Multihash<'static>; and make sure to always deal with that. I understand that the current limitations of async fn push many things towards being owned or basically Arc<_>, but at the same time I am not sure a crate like multihash is the right place to require anything more than the minimum (slices).

The downside of std::borrow::Cow is the additional requirement of only being able to use Vec as the owned companion, so perhaps keeping MultihashRef<'_> allows for more use cases.
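For illustration, a sketch of the single-type design being discussed (hypothetical, not the PR's code), including the OwnedMultihash = Multihash<'static> alias:

```rust
use std::borrow::Cow;

// One type covering both owned and borrowed multihash bytes.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Multihash<'a> {
    bytes: Cow<'a, [u8]>,
}

// The owned case is just the 'static instantiation.
type OwnedMultihash = Multihash<'static>;

impl<'a> Multihash<'a> {
    /// Borrow existing bytes without copying.
    fn borrowed(bytes: &'a [u8]) -> Multihash<'a> {
        Multihash { bytes: Cow::Borrowed(bytes) }
    }

    /// Detach from the borrowed buffer, copying only if still borrowed.
    fn into_owned(self) -> OwnedMultihash {
        Multihash { bytes: Cow::Owned(self.bytes.into_owned()) }
    }
}
```

This shows the tradeoff mentioned above: the owned companion of Cow<'_, [u8]> is pinned to Vec<u8>, whereas a separate MultihashRef leaves the owned representation free to change.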

@vmx (Member Author) commented Feb 14, 2020

I've pushed a new commit addressing the use of Code instead of a raw integer. How do you like that approach, @Gozala and @dignifiedquire?

I wasn't sure how usable it would be, but looking at the test for the custom codec I quite like it (though if there's a better way, let me know).

@austinabell (Contributor) left a comment

Aside from those two comments, I definitely prefer the new interface. One other small nit is that it is somewhat annoying to have to import MultihashDigest along with the hash type I want just to generate the digest bytes.

I don't personally see the need for the MultihashDigest trait when you can just have the digest and code functions as an implementation of these base types and use the trait for any dynamic usages.

Edit: I updated locally switching to this dependency in Forest, and happy to push if you would like to play around with the broken cid prefix interactions I mention :)

Comment on lines -64 to -84
pub fn size(self) -> u8 {
match self {
// TODO vmx 2020-01-27: Identity doesn't have a fixed length. The `size()` API should
// be removed or renamed to `max_size()` as you can also store truncated hashes with a
// different size.
Hash::Identity => 42,
Hash::SHA1 => 20,
Hash::SHA2256 => 32,
Hash::SHA2512 => 64,
Hash::SHA3224 => 28,
Hash::SHA3256 => 32,
Hash::SHA3384 => 48,
Hash::SHA3512 => 64,
Hash::Keccak224 => 28,
Hash::Keccak256 => 32,
Hash::Keccak384 => 48,
Hash::Keccak512 => 64,
Hash::Blake2b256 => 32,
Hash::Blake2b512 => 64,
Hash::Blake2s128 => 16,
Hash::Blake2s256 => 32,
Contributor

I don't personally use it (and am happy to remove it from my version of rust-cid), but there is currently functionality to get the length of a multihash variant for use in encoding and decoding prefixes, and that functionality is removed by this change.

Member Author

The reason for removing size() is that it currently doesn't do what I would expect it to do. size() should return the size of the digest in the multihash, but you can also use a truncated version of the hash. Hence there could e.g. be a Sha2-256 hash truncated to 10 bytes; then size() should obviously be 10 and not 32.

In the link you provided it uses digest.len(). Shouldn't that still work, as that's the size we really want?

Contributor

Yeah, there shouldn't really be a use for this. I was just pointing out that the multihash length was included in the prefix type you had created in the Cid library (though there was no actual need for that specific field), so if you wanted to get the length of a hash type you would actually have to hash some arbitrary data. Again, this isn't a problem, as I don't see why you would need it.

Member Author

@austinabell Thanks for bringing this up. I agree I don't see why it would be a problem in practice.

src/hashes.rs Outdated
Comment on lines 19 to 46
pub enum ImplementedCode {
    /// Identity (raw binary)
    Identity = 0x00,
    /// SHA-1 (20-byte hash size)
    Sha1 = 0x11,
    /// SHA-256 (32-byte hash size)
    Sha2_256 = 0x12,
    /// SHA-512 (64-byte hash size)
    Sha2_512 = 0x13,
    /// SHA3-512 (64-byte hash size)
    Sha3_512 = 0x14,
    /// SHA3-384 (48-byte hash size)
    Sha3_384 = 0x15,
    /// SHA3-256 (32-byte hash size)
    Sha3_256 = 0x16,
    /// SHA3-224 (28-byte hash size)
    Sha3_224 = 0x17,
    /// Keccak-224 (28-byte hash size)
    Keccak224 = 0x1a,
    /// Keccak-256 (32-byte hash size)
    Keccak256 = 0x1b,
    /// Keccak-384 (48-byte hash size)
    Keccak384 = 0x1c,
    /// Keccak-512 (64-byte hash size)
    Keccak512 = 0x1d,
    /// BLAKE2b-256 (32-byte hash size)
    Blake2b256 = 0xb220,
    /// BLAKE2b-512 (64-byte hash size)
    Blake2b512 = 0xb240,
    /// BLAKE2s-128 (16-byte hash size)
    Blake2s128 = 0xb250,
    /// BLAKE2s-256 (32-byte hash size)
    Blake2s256 = 0xb260,
}
Contributor

There is no way to go from these codes to the structs used for generating digests? This also limits functionality in some dynamic cases: if you get the code by calling algorithm(&self) -> Code, you can't generate another hash from it. Just a thought: you could have a function which takes the code as input and returns a generic type bounded on DynMultihashDigest to solve this.
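A sketch of that thought (the trait and hasher type here are stand-ins, not the PR's actual DynMultihashDigest): a function mapping a raw code to a boxed dynamic hasher, so a code recovered from an existing multihash can be used to hash again:

```rust
// Object-safe stand-in for a dynamic digest trait.
trait DynDigest {
    fn digest(&self, data: &[u8]) -> Vec<u8>;
}

// Identity "hash": returns the input unchanged (code 0x00 in multicodec).
struct Identity;
impl DynDigest for Identity {
    fn digest(&self, data: &[u8]) -> Vec<u8> {
        data.to_vec()
    }
}

// Map a code to a hasher instance; unknown codes yield None.
fn hasher_for_code(code: u64) -> Option<Box<dyn DynDigest>> {
    match code {
        0x00 => Some(Box::new(Identity)),
        // Real implementations would add arms here, e.g. 0x12 => Sha2_256.
        _ => None,
    }
}
```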

Member Author

That's a great idea! See the latest commits for an implementation.

src/hashes.rs Outdated
}

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum ImplementedCode {
Contributor

Thoughts on removing this type and just moving the implemented codes into the Code enum? It seems redundant and annoying to use a separate type for this. Something like austinabell@da0d002 perhaps?

Member Author

Good point and good idea. I wanted that, but you can't mix Custom(u64) together with definitions like Sha2_512 = 0x13. What I hadn't thought of was doing it your way. :)

}

/// The `MultihashDigest` trait specifies an interface common for all multihash functions.
pub trait MultihashDigest {
Contributor

What is the functional benefit of tying the builtin functionality to this trait? It seems unnecessary to have to import the MultihashDigest trait just to use a builtin hash. I only see the need for the DynMultihashDigest trait, unless I'm missing the use case for this one?

Member Author

I see this trait as the thing you would actually want. The DynMultihashDigest is a workaround around Rust limitations.

What do others think? Does it make sense to have both traits, or should we only have the hacky DynMultihashDigest?

@austinabell (Contributor) commented Feb 17, 2020

To be clear about what I am suggesting: austinabell@131cc94

Just quickly put together; you could add code as another base (non-associated) impl and it would be functionally the same, but it wouldn't require importing this trait whenever you want to hash in a non-dynamic context.

Definitely more verbose, and it could be cleaned up, but cleaner usage IMO.

Member Author

Having code() as a method isn't a big deal. The problem I'm seeing is that code() and digest() would then be instance methods that take self as the first parameter rather than static methods. To me, the API should be clean and not hide that self isn't actually needed. This is what I meant by DynMultihashDigest being hacky.
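To illustrate the distinction being debated (names are illustrative, not the PR's exact traits): a trait with static items needs no value to be called, while the object-safe variant takes &self; a blanket impl can derive the dynamic form from the static one, so implementors only write the static version:

```rust
// "Clean" static trait: no self needed to hash.
trait MultihashDigest {
    const CODE: u64;
    fn digest(data: &[u8]) -> Vec<u8>;
}

// Object-safe variant for dynamic dispatch; self carries the algorithm choice.
trait DynMultihashDigest {
    fn code(&self) -> u64;
    fn digest(&self, data: &[u8]) -> Vec<u8>;
}

// Every static implementation is automatically usable dynamically.
impl<T: MultihashDigest> DynMultihashDigest for T {
    fn code(&self) -> u64 {
        <T as MultihashDigest>::CODE
    }
    fn digest(&self, data: &[u8]) -> Vec<u8> {
        <T as MultihashDigest>::digest(data)
    }
}

// Identity "hash" as a trivial example implementation.
struct Identity;
impl MultihashDigest for Identity {
    const CODE: u64 = 0x00;
    fn digest(data: &[u8]) -> Vec<u8> {
        data.to_vec()
    }
}
```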

@austinabell (Contributor) commented Feb 17, 2020

Yeah, I was just too lazy to move the associated code const to the base impl. Does this satisfy that criterion: austinabell@fd07805?

(fixed the changes in an edit)

Contributor

Just to be clear, I have no strong opinion or preference; I'm just offering an alternative, since it seems weird to me to require importing a trait that does nothing just to be able to hash using the builtin implementations. It seems like an even tradeoff, since declaring custom implementations could be slightly confusing.

Member Author

That's (from the end-user perspective) quite nice. We could also have a derive macro, as the trait implementation would be the same.

Could someone with more Rust experience please comment on which approach is better? The one proposed by @austinabell (#45 (comment)) feels a bit unidiomatic, but the resulting API is quite nice. @dignifiedquire perhaps?

@vmx (Member Author) commented Feb 17, 2020

This issue already has lots of comments, so here's a summary of the current state (if I missed anything, let me know):

Resolved:

  • Bytes vs. Vec vs. something else: that discussion has moved to "Use inline storage for small hashes" #47
  • Using Code instead of a raw int for wrapping a Multihash
  • Agreement that this new API is better than the old one

Open:

@vmx (Member Author) commented Feb 19, 2020

I like the idea from @austinabell and have applied his commit. The MultihashDigest trait implementations can easily be replaced with a macro, but I think that can be done later (even in a separate PR).

This resolves all open issues.

@dignifiedquire I'm happy with this PR now, it's ready for a final review.

@dignifiedquire (Member) left a comment

looks good from my side

vmx added 3 commits February 21, 2020 15:30
@vmx (Member Author) commented Feb 21, 2020

I rebased it on master (which was more work than expected). It would be cool to get another quick review from someone, to make sure I didn't make any major mistakes during the rebase.

Also make `wrap()` take a code instance and not a reference, that
makes the API nicer.
@vmx (Member Author) commented Feb 21, 2020

I couldn't resist and pushed one more commit to this PR, as passing the code by reference felt awkward.

@vmx vmx merged commit 2b8dfe3 into master Feb 25, 2020
@vmx vmx deleted the import-impls branch February 25, 2020 11:59
8 participants