feat: Massive refactor with a new API #45
src/digests.rs
Outdated
///
/// The size of the hash is determined by the size of the input hash. If it should be truncated,
/// the input data must already be the truncated hash.
pub fn wrap(code: u64, data: &[u8]) -> Multihash {
This should probably take a `Code`, not a `u64`.
The reason this doesn't take a `Code` is that it can be used for your custom, not-yet-implemented multihash, without the need to fork the library. I thought about making it a trait, but that seemed a bit like overkill.
@vmx I would suggest having a separate `fn` for that use case and having the primary one take a `Code`, as that makes it clearer.
Or better yet might be adding `Custom(u64)` to the enum, so that the use of non-standard codes is explicit.
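As a rough illustration of that suggestion, one possible shape (hypothetical; the variant names and code values below follow the multicodec table, but this is not the crate's current API):

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Code {
    Sha2_256,
    Sha2_512,
    // ... other standard codes ...
    /// Makes the use of a non-standard code explicit.
    Custom(u64),
}

impl Code {
    /// Returns the numeric multicodec value.
    pub fn to_u64(self) -> u64 {
        match self {
            Code::Sha2_256 => 0x12,
            Code::Sha2_512 => 0x13,
            Code::Custom(code) => code,
        }
    }
}
```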
            data: bytes,
        });
    }
    Ok(Multihash {
As far as I understand it, the `MultihashRef` is dropped here, but we know `bytes` are valid or we'd have gotten an `Err`. I do however find it confusing/awkward (probably my lack of Rust experience); it makes me react with "parse, don't validate". I think the parse approach would not work here, as `MultihashRef` contains a `&[u8]`, but maybe it makes sense to have a separate fn that does not construct a `MultihashRef`?
Maybe this is idiomatic in Rust, but a comment for newbies (like myself) would be worth it.
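For reference, a minimal self-contained sketch of the pattern under discussion; all names and the validation check are simplified stand-ins for the real crate code:

```rust
#[derive(Debug)]
pub struct DecodeError;

/// Borrowed view; only holds a reference to the underlying bytes.
pub struct MultihashRef<'a> {
    bytes: &'a [u8],
}

impl<'a> MultihashRef<'a> {
    pub fn from_slice(bytes: &'a [u8]) -> Result<MultihashRef<'a>, DecodeError> {
        // Stand-in check; the real code parses the varint code and digest length.
        if bytes.len() < 2 {
            return Err(DecodeError);
        }
        Ok(MultihashRef { bytes })
    }
}

pub struct Multihash {
    bytes: Vec<u8>,
}

impl Multihash {
    pub fn from_bytes(bytes: Vec<u8>) -> Result<Multihash, DecodeError> {
        // The MultihashRef is built purely for validation and dropped right away;
        // if it parsed successfully, `bytes` is known to hold a valid multihash.
        MultihashRef::from_slice(&bytes)?;
        Ok(Multihash { bytes })
    }
}
```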
    })
}

/// Returns the bytes representation of the multihash.
I apologize if I'm misunderstanding this, but as far as I understand, this would return bytes without consuming the `Multihash`, as `to_vec` creates a copy (doesn't it?). This seems to contradict the expectation set by `into_bytes` on `String`.
If my assumption is correct, this method should probably either be removed or changed so it consumes the multihash, to match the `String` behavior.
I think this is correct. A method named `into_*` carries with it the expectation that it consumes self and can thus perform the conversion in a cheap way.
For naming guidelines, see https://rust-lang.github.io/api-guidelines/naming.html#ad-hoc-conversions-follow-as_-to_-into_-conventions-c-conv
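A small illustration of that convention (assuming a `bytes: Vec<u8>` field; method names are the conventional ones, not necessarily this PR's exact API):

```rust
pub struct Multihash {
    bytes: Vec<u8>,
}

impl Multihash {
    /// `into_*`: consumes self, so the conversion is cheap (no copy).
    pub fn into_bytes(self) -> Vec<u8> {
        self.bytes
    }

    /// `to_*`: borrows self and pays for a fresh allocation.
    pub fn to_vec(&self) -> Vec<u8> {
        self.bytes.clone()
    }
}
```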
}

/// Returns which hashing algorithm is used in this multihash.
pub fn algorithm(&self) -> Code {
So `wrap` allows creating a multihash with a non-standard code, but such a multihash will cause a panic if its `algorithm()` is called? I think it's better to decide whether custom algorithms are OK or not and stick to that decision.
Personally I would prefer to have custom `Code` support, so that non-standard versions are supported but it is also clear when one is used.
}

/// Returns the hashed data.
pub fn digest(&self) -> &'a [u8] {
I would much rather have no way of creating an invalid `Multihash` than have to deal with a possibly invalid multihash across its methods.
src/digests.rs
Outdated
/// Represents a valid multihash.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct Multihash {
    bytes: Bytes,
Just curious: why are we using `Bytes` here? `Bytes` is good if you want to share multiple slices of some larger buffer without copying. But a multihash is typically self-contained. Using `Bytes` here might have the advantage of avoiding a copy, but it carries the risk of dragging around much more memory than we actually need.
E.g. you get a large buffer from some network or disk interface as a `Bytes`, then take the first 32 bytes of the buffer into a multihash. The multihash will then contain a pointer to the large buffer, even if you keep the multihash around for a long time and are no longer interested in the rest of the buffer.
Now, sharing is obviously important for a `Multihash`, but I would think that just using an `Arc<[u8]>` (not `Arc<Vec<u8>>`!) would be the best compromise.
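To make the concern concrete, a small demo of the retention behavior (written against the `bytes` crate's 1.x API; the version used by this PR may differ):

```rust
use bytes::Bytes;

fn main() {
    // Pretend this is a large buffer handed over by the network layer.
    let buf = Bytes::from(vec![0u8; 1_000_000]);

    // Take only the first 32 bytes; this shares the backing allocation.
    let hash = buf.slice(0..32);
    drop(buf);

    // `hash` is 32 bytes long, but it still keeps the whole 1 MB alive.
    assert_eq!(hash.len(), 32);
}
```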
Was also wondering about this but forgot to ask. Thanks @rklaehn for bringing it up.
I also wonder if the dependency on `bytes` is really needed here. What was the original rationale (`bytes` vs. `Vec<_>`)?
For small amounts of data (like hashes) it might always be faster to just allocate a vector, like sled might be doing in the future, than to do anything related to `Arc`s. Then again, perhaps my knowledge of `bytes` internals is outdated and I don't have time right now to understand them fully; it could be that using `bytes` for single-owner data is as simple as `Vec<_>`.
Even if it was, as `bytes` is part of the async ecosystem, it might be easier to get this crate stable through the use of mostly `std` components, and to only need to update this crate when the source specs or hash implementations evolve or more are needed.
Regarding @rklaehn's network use case, I wonder if there is any benefit from actually constructing an owned type around the bytes, as the bytes are usable for verification purposes through the borrowed type, whatever it ends up looking like.
Also wondering if there should be a way to hash into a hopefully large enough `&mut [u8]`, but perhaps that can be experimented with outside of the crate if necessary. The same goes for updating on blocks. Perhaps this crate is at a quite difficult intersection of simplicity and different advanced IO use cases. I also wonder if only mapping from the spec codes to digest instances, which can then be used by the caller in a number of different ways, would be enough.
For the case where you don't want to own the thing, there is `MultihashRef`. And in general I must say fewer dependencies are always better, especially for crates that have not yet reached 1.0.
> Even if it was, as bytes is part of the async ecosystem

Just to react to that: the `bytes` crate was introduced as a hack around the fact that you couldn't do anything combining asynchronous code and borrowing. You couldn't, for instance, asynchronously write a buffer by reference.
This is no longer true in a post-async-await world, and I totally see `bytes` disappearing at some point.
The background why `bytes` is used: #42.
I would propose a comment summarizing why `Bytes` is used; clearly, without good context, it seems controversial.
The discussion in #42 was comparing `Bytes` to `Vec`. I would love to see the tradeoff between `Bytes` and `Arc<[u8]>`. It might be that there are situations where `Bytes` comes out on top, but currently I don't see them.
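For comparison, the `Arc<[u8]>` alternative is std-only and always a single self-contained allocation; a quick sketch:

```rust
use std::sync::Arc;

fn main() {
    let bytes: Vec<u8> = vec![0x12, 0x20]; // just a stand-in prefix
    // `From<Vec<u8>>` copies into one refcounted allocation.
    let hash: Arc<[u8]> = bytes.into();
    // Cloning only bumps the refcount; no data is copied.
    let shared = hash.clone();
    assert_eq!(&*shared, &*hash);
}
```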
I've updated the Gist (https://gist.github.com/vmx/0dc0d981b522e2834084559faa0dcb56) and added the code from #47 as a comparison. On my machine that new code is not slower than the `bytes` approach.
re: sled use, I'm closing that linked issue because although it is faster to copy 2-4k than bump an `Arc` in some cases, it's NOT faster to allocate a `Vec` first. I need to do more measuring to find out whether that third path between inlining and `Arc<[u8]>` actually exists.
#[derive(Debug, Copy, Clone, PartialEq, Eq, Hash)]
pub struct MultihashRef<'a> {
    bytes: &'a [u8],
}
Since this hasn't been brought up before: structures like this make me wonder if `Multihash` should be

struct Multihash<'a> {
    bytes: std::borrow::Cow<'a, [u8]>,
}

This would remove the need for separate `Multihash` and `MultihashRef<'_>` types, being able to support both owning and working on borrowed data. It does bring some less ergonomic points in the longer run, but as long as there are just some `bytes` this might be doable. Perhaps there was some rationale for this split that is not in any of the comments?
I think a multihash with a lifetime would be a world of hurt in terms of ergonomics. I can already see myself fighting lifetime issues to do utterly trivial things with multihashes and futures.
Also, I kinda dislike `Cow` since it has some surprising footguns, e.g. rust-lang/rust#50160.
I think an `Arc<[u8]>` would be best.
I agree, using `Cow` can bring some interesting fights with the borrow checker. To use it with owned data you could have `type OwnedMultihash = Multihash<'static>;` and make sure to always deal with that. I understand that current limitations with `async fn` push many things towards needing to be always owned, or basically `Arc<_>`, but at the same time I am not sure if a crate like `multihash` is a good place to require anything more than the minimum (slices).
The downside is that using `std::borrow::Cow` would bring the additional requirement of being able to use only `Vec` as the owned companion, so perhaps keeping `MultihashRef<'_>` allows for more use cases.
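A sketch of how that `Cow` variant with a `'static` alias could look (hypothetical shape, not a proposal from the PR itself):

```rust
use std::borrow::Cow;

pub struct Multihash<'a> {
    bytes: Cow<'a, [u8]>,
}

/// Owned data would always use the 'static variant.
pub type OwnedMultihash = Multihash<'static>;

fn borrowed(bytes: &[u8]) -> Multihash<'_> {
    Multihash { bytes: Cow::Borrowed(bytes) }
}

fn owned(bytes: Vec<u8>) -> OwnedMultihash {
    Multihash { bytes: Cow::Owned(bytes) }
}
```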
I've pushed a new commit addressing this. I wasn't sure how usable that would be, but looking at the test for the custom codec I quite like it (though if there's a better way, let me know).
Aside from those two comments, I prefer the new interface for sure. One other small nit is that it is somewhat annoying to have to import the `MultihashDigest` trait along with the type of hash I want to use, just to generate the digested bytes.
I don't personally see the need for the `MultihashDigest` trait when you can just have the `digest` and `code` functions as an implementation on these base types and use the trait for any dynamic usages.
Edit: I updated locally, switching to this dependency in Forest, and I'm happy to push if you would like to play around with the broken CID prefix interactions I mentioned :)
pub fn size(self) -> u8 {
    match self {
        // TODO vmx 2020-01-27: Identity doesn't have a fixed length. The `size()` API should
        // be removed or renamed to `max_size()` as you can also store truncated hashes with a
        // different size.
        Hash::Identity => 42,
        Hash::SHA1 => 20,
        Hash::SHA2256 => 32,
        Hash::SHA2512 => 64,
        Hash::SHA3224 => 28,
        Hash::SHA3256 => 32,
        Hash::SHA3384 => 48,
        Hash::SHA3512 => 64,
        Hash::Keccak224 => 28,
        Hash::Keccak256 => 32,
        Hash::Keccak384 => 48,
        Hash::Keccak512 => 64,
        Hash::Blake2b256 => 32,
        Hash::Blake2b512 => 64,
        Hash::Blake2s128 => 16,
        Hash::Blake2s256 => 32,
I don't personally use it, and I'm happy to remove it from my version of rust-cid, but there is currently functionality to get the length of a multihash variant for use in encoding and decoding prefixes, and that functionality is removed with this change.
The reason for removing `size()` is that it currently doesn't do what I would expect it to do. `size()` should return the size of the digest in the multihash. Though you can also use a truncated version of the hash; hence there could e.g. be a Sha2-256 hash truncated to 10 bytes. Then `size()` should obviously be `10` and not `32`.
In the link you provided it is using `digest.len()`. Shouldn't that still work, as that's the size we really want?
Yeah, there shouldn't really be a use for this. I was just indicating that the multihash length was included in the prefix type you had created in the Cid library (but there was no actual need for that specific field), so if you wanted to get the length of the hash type you would actually have to hash some arbitrary data. (Again, this isn't a problem, as I don't see why you would need this.)
@austinabell Thanks for bringing this up. I agree; I don't see why it would be a problem in practice.
src/hashes.rs
Outdated
 pub enum ImplementedCode {
     /// Identity (Raw binary )
-    Identity,
+    Identity = 0x00,
     /// SHA-1 (20-byte hash size)
-    SHA1,
+    Sha1 = 0x11,
     /// SHA-256 (32-byte hash size)
-    SHA2256,
+    Sha2_256 = 0x12,
     /// SHA-512 (64-byte hash size)
-    SHA2512,
-    /// SHA3-512 (64-byte hash size)
-    SHA3512,
-    /// SHA3-384 (48-byte hash size)
-    SHA3384,
-    /// SHA3-256 (32-byte hash size)
-    SHA3256,
+    Sha2_512 = 0x13,
     /// SHA3-224 (28-byte hash size)
-    SHA3224,
+    Sha3_224 = 0x17,
+    /// SHA3-256 (32-byte hash size)
+    Sha3_256 = 0x16,
+    /// SHA3-384 (48-byte hash size)
+    Sha3_384 = 0x15,
+    /// SHA3-512 (64-byte hash size)
+    Sha3_512 = 0x14,
     /// Keccak-224 (28-byte hash size)
-    Keccak224,
+    Keccak224 = 0x1a,
     /// Keccak-256 (32-byte hash size)
-    Keccak256,
+    Keccak256 = 0x1b,
     /// Keccak-384 (48-byte hash size)
-    Keccak384,
+    Keccak384 = 0x1c,
     /// Keccak-512 (64-byte hash size)
-    Keccak512,
+    Keccak512 = 0x1d,
     /// BLAKE2b-256 (32-byte hash size)
-    Blake2b256,
+    Blake2b256 = 0xb220,
     /// BLAKE2b-512 (64-byte hash size)
-    Blake2b512,
+    Blake2b512 = 0xb240,
     /// BLAKE2s-128 (16-byte hash size)
-    Blake2s128,
+    Blake2s128 = 0xb250,
     /// BLAKE2s-256 (32-byte hash size)
-    Blake2s256,
+    Blake2s256 = 0xb260,
 }
-
-impl Hash {
-    /// Get the corresponding hash code.
-    pub fn code(self) -> u16 {
-        match self {
-            Hash::Identity => 0x00,
-            Hash::SHA1 => 0x11,
-            Hash::SHA2256 => 0x12,
-            Hash::SHA2512 => 0x13,
-            Hash::SHA3224 => 0x17,
-            Hash::SHA3256 => 0x16,
-            Hash::SHA3384 => 0x15,
-            Hash::SHA3512 => 0x14,
-            Hash::Keccak224 => 0x1A,
-            Hash::Keccak256 => 0x1B,
-            Hash::Keccak384 => 0x1C,
-            Hash::Keccak512 => 0x1D,
-            Hash::Blake2b256 => 0xB220,
-            Hash::Blake2b512 => 0xB240,
-            Hash::Blake2s128 => 0xB250,
-            Hash::Blake2s256 => 0xB260,
-        }
-    }
-}
There is no way to go from these codes to the structs used for generating digests? This also limits functionality in some dynamic cases: if you get the code from calling `algorithm(&self) -> Code`, you couldn't generate another hash from it. Just a thought: could there be a function which takes the code as input and returns a generic type bounded on `DynMultihashDigest` to solve this problem?
That's a great idea! See the latest commits for an implementation.
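As an illustration of the suggestion (not the actual implementation in those commits; the trait shape and names are assumptions, and the body uses the `sha2` crate as a stand-in):

```rust
use sha2::{Digest, Sha256};

/// Object-safe digest trait, as needed for dynamic lookup by code.
pub trait DynMultihashDigest {
    fn digest(&self, data: &[u8]) -> Vec<u8>;
}

pub struct Sha2_256;

impl DynMultihashDigest for Sha2_256 {
    fn digest(&self, data: &[u8]) -> Vec<u8> {
        // Raw digest only; the real crate would prepend code and length.
        Sha256::digest(data).to_vec()
    }
}

/// Maps a multicodec code back to a hasher instance.
pub fn hasher_for(code: u64) -> Option<Box<dyn DynMultihashDigest>> {
    match code {
        0x12 => Some(Box::new(Sha2_256)),
        // ... other implemented codes ...
        _ => None,
    }
}
```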
src/hashes.rs
Outdated
}

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum ImplementedCode {
Thoughts on removing this type and just moving the implemented codes into the `Code` enum? It seems redundant and annoying to use a separate type for this. Something like austinabell@da0d002 perhaps?
Good point and good idea. I wanted that, but you can't mix `Custom(u64)` together with definitions like `Sha2_512 = 0x13`. But what I hadn't thought of was doing it your way. :)
}

/// The `MultihashDigest` trait specifies an interface common for all multihash functions.
pub trait MultihashDigest {
What is the functional benefit of having the builtin functionality tied to this trait? It seems unnecessary to have to import the `MultihashDigest` trait just to use a builtin hash. I only see the need for the `DynMultihashDigest` trait, unless I'm missing the use case for this one?
I see this trait as the thing you would actually want. The `DynMultihashDigest` is a workaround for Rust limitations.
What do others think? Does it make sense to have both traits, or should we only have the hacky `DynMultihashDigest`?
To be clear about what I am suggesting: austinabell@131cc94.
It's just quickly put together; `code` could be added to the base impl as a non-associated function and it would be functionally the same, but it would not require you to import this trait whenever you want to hash in a non-dynamic context.
Definitely more verbose, and could be cleaned up, but cleaner usage IMO.
Having `code()` as a method isn't a big deal. Though the problem I'm seeing is that now `code()` and `digest()` will be instance methods (right word?) that take `self` as the first parameter, and not static methods. To me, the API should be clean and not hide that `self` isn't actually needed. This is what I meant by `DynMultihashDigest` being hacky.
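To make the tradeoff concrete, a simplified side-by-side of the two shapes being discussed (both are sketches with stand-in bodies, not the PR's exact code):

```rust
// Trait-based: associated functions, no `self`, but every call site
// must `use` the trait to call `Sha2_256::digest(...)`.
pub trait MultihashDigest {
    const CODE: u64;
    fn digest(data: &[u8]) -> Vec<u8>;
}

pub struct Sha2_256;

impl MultihashDigest for Sha2_256 {
    const CODE: u64 = 0x12;
    fn digest(data: &[u8]) -> Vec<u8> {
        data.to_vec() // stand-in; a real impl would hash here
    }
}

// Inherent-impl alternative (along @austinabell's suggestion): the same
// associated items live directly on the type, so no trait import is needed.
pub struct Sha2_512;

impl Sha2_512 {
    pub const CODE: u64 = 0x13;
    pub fn digest(data: &[u8]) -> Vec<u8> {
        data.to_vec() // stand-in
    }
}
```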
Yeah, I was just too lazy to move the associated code const to the base impl. Does this satisfy that criteria: austinabell@fd07805? (Fixed changes in edit.)
Just to be clear, I have no opinion or preference, just offering an alternative, since it seems weird to me to require importing a trait that does nothing just to be able to hash using builtin implementations. It seems to be an even tradeoff, since declaring custom implementations could be slightly confusing.
That's (from the end-user perspective) quite nice. We could also have a derive macro, as the trait implementation would be the same.
Could someone with more Rust experience please comment on which approach is better? The one proposed by @austinabell (#45 (comment)) feels a bit un-idiomatic, but the resulting API is quite nice. @dignifiedquire perhaps?
This issue already has lots of comments, hence here's a summary of the current state (if I missed anything, let me know):

Resolved:

Open:
I like the idea from @austinabell and applied his commit. This resolves all open issues. @dignifiedquire I'm happy with this PR now; it's ready for a final review.
looks good from my side
I rebased it (which was more work than expected) on master. It would be cool to get another quick review by someone to make sure I didn't make any major mistake during this rebase.
Also make `wrap()` take a code instance and not a reference; that makes the API nicer.
I couldn't resist and pushed one more commit into this PR, as passing the code by reference felt awkward.
The new API also makes it possible to wrap an existing hash into a Multihash. This is useful if you e.g. want to create hashes for testing, without spending time actually hashing data.
You also interact with multicodecs less directly. This should make it easier to use your own multihash implementations without forking the code base.
BREAKING CHANGE: There is a new API.

When using multihashes, you now import implementations of that hash, which have a `digest()` function. That function returns a Multihash.

New way:

    use multihash::Sha3_512;

    let my_multihash = Sha3_512::digest(b"hello world!");

Old way:

    use multihash::{encode, Hash};

    let my_multihash = encode(Hash::SHA3512, b"hello world!");
This PR is heavily based on the ideas from #30, though I tried to keep things simple for now, not exposing a large API.
You might wonder why I implemented all those `MultihashDigest` implementations manually and didn't use a macro. The problem is that we are using different hashing libraries which don't have a unified API, hence there are small differences that are not easily expressible with macros. The current version uses a macro for that purpose, but I don't think that's easy to understand. Though creating macros for specific algorithms makes sense; e.g. there could be one to support all Blake2 variants.
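For instance, such a per-family macro could look roughly like this (a hedged sketch using the `blake2b_simd` crate; not the macro from this PR, and the struct/const names are illustrative):

```rust
use blake2b_simd::Params;

macro_rules! derive_blake2b {
    ($name:ident, $code:expr, $size:expr) => {
        pub struct $name;

        impl $name {
            pub const CODE: u64 = $code;

            /// Returns the raw digest with the configured length.
            pub fn digest(data: &[u8]) -> Vec<u8> {
                Params::new().hash_length($size).hash(data).as_bytes().to_vec()
            }
        }
    };
}

// One macro invocation per variant of the family.
derive_blake2b!(Blake2b256, 0xb220, 32);
derive_blake2b!(Blake2b512, 0xb240, 64);
```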
Please have a look and leave comments. This is really meant as a starting point for discussions, not as a "merge sooner rather than later" PR.