Use new ChecksummedBlock in DataCache #572

Closed
wants to merge 3 commits into from

Contributor

@passaro passaro commented Oct 23, 2023

Description of change

Introduce a new `ChecksummedBlock` type, which represents a bytes buffer and its matching checksum. It is a simpler version of `ChecksummedBytes`: the checksum always matches the exposed bytes buffer, rather than potentially covering a larger containing buffer. The new type is better suited to the `DataCache` because it can be serialized and deserialized more efficiently while preserving its checksum.

This change also introduces a new `checksums` module, containing both `ChecksummedBytes` and `ChecksummedBlock`, along with other checksum functions and types.

Relevant issues: #255

Does this change impact existing behavior?

No changes.


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and I agree to the terms of the Developer Certificate of Origin (DCO).

Signed-off-by: Alessandro Passaro <alexpax@amazon.co.uk>
@passaro passaro temporarily deployed to PR integration tests October 23, 2023 16:51 — with GitHub Actions Inactive
@passaro passaro had a problem deploying to PR integration tests October 23, 2023 16:51 — with GitHub Actions Failure
@passaro passaro temporarily deployed to PR integration tests October 24, 2023 06:45 — with GitHub Actions Inactive
Contributor

@dannycjones dannycjones left a comment

LGTM, let's get a second (third?) opinion on the extend fn.

Plus fix typo.

mountpoint-s3/src/checksums.rs (outdated, resolved)
mountpoint-s3/src/checksums/block.rs (resolved)
mountpoint-s3/src/checksums/block.rs (resolved)
Comment on lines +49 to +69
/// Append the given bytes to current `ChecksummedBlock`.
pub fn extend(&mut self, extend: ChecksummedBlock) {
    if self.is_empty() {
        *self = extend;
        return;
    }
    if extend.is_empty() {
        return;
    }

    let total_len = self.bytes.len() + extend.len();
    let mut bytes_mut = BytesMut::with_capacity(total_len);
    bytes_mut.extend_from_slice(&self.bytes);
    bytes_mut.extend_from_slice(&extend.bytes);
    let new_bytes = bytes_mut.freeze();
    let new_checksum = combine_checksums(self.checksum, extend.checksum, extend.len());
    *self = ChecksummedBlock {
        bytes: new_bytes,
        checksum: new_checksum,
    };
}
Contributor

I think this is safe since we are taking two checksummed buffers, combining the two, and calculating the new checksum independently of the new buffer.

IMO the durability risk here is mitigated, but I'd also like a second opinion from the team.

Member

Yeah, that sounds right. We know the expected checksum of each side (unlike in the ChecksummedBytes case where we only know the checksum of some larger slice of each side), and can compute the new expected checksum from those without actually looking at the bytes.

Can you add a comment here capturing that reasoning?
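
A minimal sketch of that reasoning, not the PR's implementation. The thread does not name the algorithm behind `combine_checksums`, so this assumes a toy polynomial checksum as a stand-in; `toy_checksum`, `pow_mod`, and `combine_toy_checksums` are hypothetical names. What it illustrates is that the combined checksum is derived only from the two known checksums and the length of the second part, never from the concatenated bytes:

```rust
// Toy stand-in checksum (an assumption for illustration, not the real algorithm):
// a polynomial hash of the bytes modulo a prime.
const MODULUS: u64 = 1_000_000_007;
const BASE: u64 = 257;

/// Compute the toy checksum by reading every byte.
fn toy_checksum(bytes: &[u8]) -> u64 {
    bytes
        .iter()
        .fold(0, |acc, &b| (acc * BASE + b as u64 + 1) % MODULUS)
}

/// Modular exponentiation by squaring: base^exp mod MODULUS.
fn pow_mod(mut base: u64, mut exp: u64) -> u64 {
    let mut result = 1;
    base %= MODULUS;
    while exp > 0 {
        if exp & 1 == 1 {
            result = result * base % MODULUS;
        }
        base = base * base % MODULUS;
        exp >>= 1;
    }
    result
}

/// Checksum of `first ++ second`, computed from the two checksums and the
/// length of the second part only. No data bytes are read here, so a later
/// validation of the concatenated buffer still detects corrupted bytes.
fn combine_toy_checksums(first: u64, second: u64, second_len: usize) -> u64 {
    (first * pow_mod(BASE, second_len as u64) + second) % MODULUS
}
```

Because the expected checksum never depends on the freshly copied buffer, a bit flip introduced while concatenating would produce a mismatch on the next validation rather than being silently absorbed into a recomputed checksum.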

mountpoint-s3/src/checksums/block.rs (outdated, resolved)
@passaro passaro had a problem deploying to PR integration tests October 24, 2023 13:24 — with GitHub Actions Failure
Signed-off-by: Alessandro Passaro <alexpax@amazon.co.uk>
@passaro passaro temporarily deployed to PR integration tests October 24, 2023 13:42 — with GitHub Actions Inactive
Contributor

@dannycjones dannycjones left a comment

LGTM!

@passaro passaro added this pull request to the merge queue Oct 24, 2023
/// A `ChecksummedBlock` is a bytes buffer that carries its checksum.
/// The implementation guarantees that its integrity will be validated when data is accessed.
#[derive(Debug, Clone)]
pub struct ChecksummedBlock {
Member

I was wondering if it might be nicer to have just one implementation of this stuff, and give ChecksummedBytes a shrink_to_fit-style method to get the guarantee you're looking for. But then I guess that makes extend et al more complicated because you have to handle all the different combinations to decide when you can skip validating the checksums, so probably not worth it?

Contributor Author

On reflection, I think `shrink_to_fit` would be a better approach, and it can also be used to improve `extend`. I will close this PR and open a new one with that change.
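
A rough sketch of what a `shrink_to_fit`-style method could look like, under stated assumptions: the thread does not show how `ChecksummedBytes` is laid out, so `ToyChecksummedBytes` below is a hypothetical model in which the checksum covers a whole buffer and only a sub-range is exposed. The checksum function is a toy stand-in as well.

```rust
use std::ops::Range;

#[derive(Debug, PartialEq)]
struct IntegrityError;

// Toy stand-in checksum (an assumption, not the real algorithm).
fn toy_checksum(bytes: &[u8]) -> u64 {
    bytes
        .iter()
        .fold(0, |acc, &b| (acc * 257 + b as u64 + 1) % 1_000_000_007)
}

/// Hypothetical model: `checksum` covers all of `buffer`,
/// while only `buffer[range]` is exposed to callers.
struct ToyChecksummedBytes {
    buffer: Vec<u8>,
    range: Range<usize>,
    checksum: u64,
}

impl ToyChecksummedBytes {
    /// Re-anchor the checksum so that it covers exactly the exposed bytes.
    fn shrink_to_fit(&mut self) -> Result<(), IntegrityError> {
        if self.range == (0..self.buffer.len()) {
            return Ok(()); // the checksum already matches the exposed bytes
        }
        // Copy the exposed slice and compute its checksum first...
        let exposed = self.buffer[self.range.clone()].to_vec();
        let new_checksum = toy_checksum(&exposed);
        // ...then validate the original buffer, so the old checksum still
        // vouches for the data at the time the new checksum was computed.
        if toy_checksum(&self.buffer) != self.checksum {
            return Err(IntegrityError);
        }
        self.buffer = exposed;
        self.range = 0..self.buffer.len();
        self.checksum = new_checksum;
        Ok(())
    }
}
```

After a successful call, the value gives the same guarantee described for `ChecksummedBlock`: the stored checksum matches the exposed bytes exactly.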

Comment on lines +51 to +57
if self.is_empty() {
    *self = extend;
    return;
}
if extend.is_empty() {
    return;
}
Member

For these cases you probably need to validate the checksum of the empty side (which will be trivial to compute because they're zero-length slices), because the length might have been corrupted.
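
One way this suggestion could look, as a sketch only. `ToyBlock` is a hypothetical stand-in for `ChecksummedBlock`, the checksum is a toy polynomial hash rather than the real algorithm, and returning a `Result` from `extend` is an assumption (the method in the PR returns nothing).

```rust
#[derive(Debug, PartialEq)]
struct IntegrityError;

const MODULUS: u64 = 1_000_000_007;
const BASE: u64 = 257;

// Toy stand-in checksum (an assumption, not the real algorithm).
fn toy_checksum(bytes: &[u8]) -> u64 {
    bytes
        .iter()
        .fold(0, |acc, &b| (acc * BASE + b as u64 + 1) % MODULUS)
}

#[derive(Debug, Clone)]
struct ToyBlock {
    bytes: Vec<u8>,
    checksum: u64,
}

impl ToyBlock {
    fn new(bytes: &[u8]) -> Self {
        Self { bytes: bytes.to_vec(), checksum: toy_checksum(bytes) }
    }

    fn validate(&self) -> Result<(), IntegrityError> {
        if toy_checksum(&self.bytes) == self.checksum {
            Ok(())
        } else {
            Err(IntegrityError)
        }
    }

    fn extend(&mut self, other: ToyBlock) -> Result<(), IntegrityError> {
        if self.bytes.is_empty() {
            // An "empty" block may be one whose length was corrupted to zero.
            // Validating it is cheap: it checksums zero bytes.
            self.validate()?;
            *self = other;
            return Ok(());
        }
        if other.bytes.is_empty() {
            other.validate()?;
            return Ok(());
        }
        // Non-empty case: combine the two known checksums without re-reading data.
        let shift = (0..other.bytes.len()).fold(1, |acc, _| acc * BASE % MODULUS);
        self.checksum = (self.checksum * shift + other.checksum) % MODULUS;
        self.bytes.extend_from_slice(&other.bytes);
        Ok(())
    }
}
```

Without the two `validate` calls, a block whose length had been truncated to zero would be discarded silently, and its stale checksum would never be compared against anything.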

/// Validate data integrity in this `ChecksummedBlock`.
///
/// Return `IntegrityError` on data corruption.
pub fn validate(&self) -> Result<(), IntegrityError> {
Member

Do we ever use this as public API? If not, might be better to make it private, since it kinda invites time-of-check/time-of-use problems.

Contributor Author

I'd leave this public:

  1. it can be useful to fail fast
  2. it does not return the data, so at worst it could be redundant

self.validate().expect("should be valid");
other.validate().expect("should be valid");

true
Member

this should be unreachable? here we know the bytes are equal but the checksums aren't, but they both passed validation?

Comment on lines +115 to +124
if self.bytes != other.bytes {
    return false;
}

if self.checksum == other.checksum {
    return true;
}

self.validate().expect("should be valid");
other.validate().expect("should be valid");
Member

Doesn't really matter since it's just test code, but I think you want to do it this way to be correctly bracketed:

let result = self.bytes == other.bytes;
self.validate().expect("should be valid");
other.validate().expect("should be valid");
result

@jamesbornholt jamesbornholt removed this pull request from the merge queue due to a manual request Oct 24, 2023
@passaro passaro closed this Oct 25, 2023
@passaro passaro deleted the blocks branch November 1, 2023 16:58
3 participants