Use new ChecksummedBlock in DataCache #572

Closed

passaro wants to merge 3 commits into awslabs:main from passaro:blocks

Contributor

passaro commented Oct 23, 2023

Description of change

Introduce a new ChecksummedBlock type, which represents a bytes buffer and its matching checksum. It is a simpler version of ChecksummedBytes in that the checksum always matches the exposed bytes buffer, rather than potentially covering a larger containing buffer. The new type is better suited for use in the DataCache because it can be serialized and deserialized more efficiently while preserving its checksum.

This change also introduces a new checksums module, containing both ChecksummedBytes and ChecksummedBlock, in addition to other checksum functions and types.
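
To make the serialization argument in the description concrete, here is a minimal sketch, not the code from this PR. It assumes a hypothetical `Block` type, a toy polynomial checksum (`toy_checksum`) standing in for the real checksum, and an invented wire layout (4-byte little-endian checksum followed by the payload). Because the checksum covers exactly the exposed bytes, it can be stored next to them and restored without being recomputed.

```rust
// Toy checksum standing in for the real one (an assumption for illustration):
// a polynomial hash reduced by a modulus just below 2^32.
const MODULUS: u64 = 4_294_967_291;
const BASE: u64 = 257;

fn toy_checksum(data: &[u8]) -> u32 {
    let mut h: u64 = 0;
    for &byte in data {
        h = (h * BASE + byte as u64 + 1) % MODULUS;
    }
    h as u32
}

/// Hypothetical stand-in for `ChecksummedBlock`: the checksum always
/// covers exactly `bytes`, never a larger containing buffer.
#[derive(Debug, Clone, PartialEq)]
struct Block {
    bytes: Vec<u8>,
    checksum: u32,
}

impl Block {
    fn new(bytes: Vec<u8>) -> Self {
        let checksum = toy_checksum(&bytes);
        Block { bytes, checksum }
    }

    /// Invented layout: 4-byte little-endian checksum, then the payload.
    fn serialize(&self) -> Vec<u8> {
        let mut out = Vec::with_capacity(4 + self.bytes.len());
        out.extend_from_slice(&self.checksum.to_le_bytes());
        out.extend_from_slice(&self.bytes);
        out
    }

    /// Restores the stored checksum as-is; nothing is recomputed here,
    /// so corruption on disk is still detectable by a later `validate`.
    fn deserialize(raw: &[u8]) -> Option<Self> {
        if raw.len() < 4 {
            return None;
        }
        let mut header = [0u8; 4];
        header.copy_from_slice(&raw[..4]);
        Some(Block {
            bytes: raw[4..].to_vec(),
            checksum: u32::from_le_bytes(header),
        })
    }

    fn validate(&self) -> bool {
        toy_checksum(&self.bytes) == self.checksum
    }
}
```

A type whose checksum covers a larger containing buffer could not be written this way without either storing the whole buffer or recomputing the checksum for the exposed part.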

Relevant issues: #255

Does this change impact existing behavior?

No changes.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and I agree to the terms of the Developer Certificate of Origin (DCO).


          Use new ChecksummedBlock in DataCache

016cd7b

Introduce a new `ChecksummedBlock` type, which represents a bytes buffer and its matching checksum. It is a simpler version of `ChecksummedBytes` in that the checksum always matches the exposed bytes buffer, rather than potentially covering a larger containing buffer. The new type is better suited for use in the `DataCache` because it can be serialized and deserialized more efficiently while preserving its checksum.

This change also introduces a new `checksums` module, containing both `ChecksummedBytes` and `ChecksummedBlock`, in addition to other checksum functions and types.

Signed-off-by: Alessandro Passaro <alexpax@amazon.co.uk>

dannycjones reviewed

Contributor

dannycjones left a comment

LGTM, let's get a second (third?) opinion on the extend fn.

Plus fix typo.

mountpoint-s3/src/checksums/block.rs

Comment on lines +49 to +69

+                  /// Append the given bytes to current `ChecksummedBlock`.
+                  pub fn extend(&mut self, extend: ChecksummedBlock) {
+                      if self.is_empty() {
+                          *self = extend;
+                          return;
+                      }
+                      if extend.is_empty() {
+                          return;
+                      }
+                      let total_len = self.bytes.len() + extend.len();
+                      let mut bytes_mut = BytesMut::with_capacity(total_len);
+                      bytes_mut.extend_from_slice(&self.bytes);
+                      bytes_mut.extend_from_slice(&extend.bytes);
+                      let new_bytes = bytes_mut.freeze();
+                      let new_checksum = combine_checksums(self.checksum, extend.checksum, extend.len());
+                      *self = ChecksummedBlock {
+                          bytes: new_bytes,
+                          checksum: new_checksum,
+                      };
+                  }

Contributor

dannycjones Oct 24, 2023

I think this is safe since we are taking two checksummed buffers, combining the two, and calculating the new checksum independently of the new buffer.

IMO the durability risk here is mitigated, but I'd also like a second opinion from the team.

Member

jamesbornholt Oct 24, 2023

Yeah, that sounds right. We know the expected checksum of each side (unlike in the ChecksummedBytes case where we only know the checksum of some larger slice of each side), and can compute the new expected checksum from those without actually looking at the bytes.

Can you add a comment here capturing that reasoning?
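
The reasoning in this thread can be sketched with a toy checksum. The block below is an illustration only: `toy_checksum` is a simple polynomial hash chosen because its combine step is short, and this `combine_checksums` is a hypothetical counterpart of the function called in the diff, not the real implementation. It shows that the checksum of a concatenation can be derived from the two expected checksums and the length of the second part, without reading any bytes.

```rust
// Toy checksum standing in for the real one (an assumption for illustration):
// a polynomial hash reduced by a modulus just below 2^32.
const MODULUS: u64 = 4_294_967_291;
const BASE: u64 = 257;

/// Checksum of a byte slice; the empty slice hashes to 0.
fn toy_checksum(data: &[u8]) -> u32 {
    let mut h: u64 = 0;
    for &byte in data {
        h = (h * BASE + byte as u64 + 1) % MODULUS;
    }
    h as u32
}

/// Modular exponentiation by squaring: base^exp mod MODULUS.
fn pow_mod(mut base: u64, mut exp: u64) -> u64 {
    let mut acc = 1u64;
    base %= MODULUS;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = acc * base % MODULUS;
        }
        base = base * base % MODULUS;
        exp >>= 1;
    }
    acc
}

/// Expected checksum of `a ++ b`, computed only from the expected
/// checksums of `a` and `b` and the length of `b`. No data bytes are read,
/// so corruption in either buffer cannot leak into the result.
fn combine_checksums(a: u32, b: u32, len_b: usize) -> u32 {
    let shifted = a as u64 * pow_mod(BASE, len_b as u64) % MODULUS;
    ((shifted + b as u64) % MODULUS) as u32
}
```

If the concatenated buffer is later corrupted, or was built from corrupted input, its actual checksum will not match the combined expected checksum, which is why the copy in `extend` does not weaken the integrity guarantee.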


          Simplify ChecksummedBlock tests

4b18b78

Signed-off-by: Alessandro Passaro <alexpax@amazon.co.uk>

passaro force-pushed the blocks branch from 99d590f to 4b18b78 Compare

October 24, 2023 13:33


          Implement From/TryFrom for ChecksummedBlock/Bytes

162aa16

Signed-off-by: Alessandro Passaro <alexpax@amazon.co.uk>

dannycjones approved these changes

Contributor

dannycjones left a comment

LGTM!

passaro added this pull request to the merge queue

jamesbornholt reviewed

mountpoint-s3/src/checksums/block.rs

+              /// A `ChecksummedBlock` is a bytes buffer that carries its checksum.
+              /// The implementation guarantees that its integrity will be validated when data is accessed.
+              #[derive(Debug, Clone)]
+              pub struct ChecksummedBlock {

Member

jamesbornholt Oct 24, 2023

I was wondering if it might be nicer to have just one implementation of this stuff, and give ChecksummedBytes a shrink_to_fit-style method to get the guarantee you're looking for. But then I guess that makes extend et al more complicated because you have to handle all the different combinations to decide when you can skip validating the checksums, so probably not worth it?

Contributor Author

passaro Oct 25, 2023

After all, I think shrink_to_fit would be a better approach and can also be used to improve extend. I will close this PR and open a new one with that change.
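
For readers unfamiliar with the idea, a `shrink_to_fit`-style method might look roughly like the sketch below. Everything here is an assumption made for illustration: `ChecksummedSlice` is a hypothetical stand-in for `ChecksummedBytes`, its fields and the toy checksum are invented, and the follow-up PR may have taken a different shape.

```rust
use std::ops::Range;

// Toy checksum standing in for the real one (an assumption for illustration).
const MODULUS: u64 = 4_294_967_291;
const BASE: u64 = 257;

fn toy_checksum(data: &[u8]) -> u32 {
    let mut h: u64 = 0;
    for &byte in data {
        h = (h * BASE + byte as u64 + 1) % MODULUS;
    }
    h as u32
}

#[derive(Debug, PartialEq)]
struct IntegrityError;

/// Hypothetical stand-in for `ChecksummedBytes`: `checksum` covers the
/// whole `buffer`, while only `range` is exposed to callers.
#[derive(Debug, Clone)]
struct ChecksummedSlice {
    buffer: Vec<u8>,
    range: Range<usize>,
    checksum: u32,
}

impl ChecksummedSlice {
    fn new(buffer: Vec<u8>) -> Self {
        let checksum = toy_checksum(&buffer);
        let range = 0..buffer.len();
        ChecksummedSlice { buffer, range, checksum }
    }

    /// Narrow the exposed range without touching the checksum.
    /// (Cloning the buffer keeps the sketch simple; a real type would share it.)
    fn slice(&self, range: Range<usize>) -> Self {
        assert!(range.start <= range.end && range.end <= self.range.len());
        let start = self.range.start + range.start;
        let end = self.range.start + range.end;
        ChecksummedSlice { buffer: self.buffer.clone(), range: start..end, checksum: self.checksum }
    }

    fn exposed(&self) -> &[u8] {
        &self.buffer[self.range.clone()]
    }

    /// Make the checksum cover exactly the exposed bytes.
    fn shrink_to_fit(&mut self) -> Result<(), IntegrityError> {
        if self.range.len() == self.buffer.len() {
            return Ok(()); // checksum already matches the exposed bytes
        }
        let exposed = self.exposed().to_vec();
        let new_checksum = toy_checksum(&exposed);
        // Validate the original buffer only after the new checksum is
        // computed, so the old checksum vouches for the bytes just read.
        if toy_checksum(&self.buffer) != self.checksum {
            return Err(IntegrityError);
        }
        self.range = 0..exposed.len();
        self.buffer = exposed;
        self.checksum = new_checksum;
        Ok(())
    }
}
```

After a successful `shrink_to_fit`, the value offers the same guarantee that `ChecksummedBlock` provides by construction, at the cost of one validation pass over the original buffer.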

mountpoint-s3/src/checksums/block.rs

Comment on lines +51 to +57

+                      if self.is_empty() {
+                          *self = extend;
+                          return;
+                      }
+                      if extend.is_empty() {
+                          return;
+                      }

Member

jamesbornholt Oct 24, 2023

For these cases you probably need to validate the checksum of the empty side (which will be trivial to compute because they're zero-length slices), because the length might have been corrupted.
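
A sketch of what this suggestion could look like follows. It is not the code from the PR: `Block` is a hypothetical stand-in for `ChecksummedBlock`, the checksum is a toy polynomial hash, and `extend` returns a `Result` here (the version in the diff returns nothing) so the failure is visible. The point is that a side reporting zero length must still carry the checksum of an empty slice; otherwise its length was probably corrupted.

```rust
// Toy checksum standing in for the real one (an assumption for illustration).
const MODULUS: u64 = 4_294_967_291;
const BASE: u64 = 257;

fn toy_checksum(data: &[u8]) -> u32 {
    let mut h: u64 = 0;
    for &byte in data {
        h = (h * BASE + byte as u64 + 1) % MODULUS;
    }
    h as u32
}

fn pow_mod(mut base: u64, mut exp: u64) -> u64 {
    let mut acc = 1u64;
    base %= MODULUS;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = acc * base % MODULUS;
        }
        base = base * base % MODULUS;
        exp >>= 1;
    }
    acc
}

/// Hypothetical combine step for the toy checksum.
fn combine_checksums(a: u32, b: u32, len_b: usize) -> u32 {
    let shifted = a as u64 * pow_mod(BASE, len_b as u64) % MODULUS;
    ((shifted + b as u64) % MODULUS) as u32
}

#[derive(Debug, PartialEq)]
struct IntegrityError;

#[derive(Debug, Clone)]
struct Block {
    bytes: Vec<u8>,
    checksum: u32,
}

impl Block {
    fn new(data: &[u8]) -> Self {
        Block { bytes: data.to_vec(), checksum: toy_checksum(data) }
    }

    fn is_empty(&self) -> bool {
        self.bytes.is_empty()
    }

    fn validate(&self) -> Result<(), IntegrityError> {
        if toy_checksum(&self.bytes) == self.checksum { Ok(()) } else { Err(IntegrityError) }
    }

    fn extend(&mut self, other: Block) -> Result<(), IntegrityError> {
        if self.is_empty() {
            // Cheap: validates a zero-length slice, but catches a lost length.
            self.validate()?;
            *self = other;
            return Ok(());
        }
        if other.is_empty() {
            other.validate()?;
            return Ok(());
        }
        // Both sides non-empty: derive the expected checksum without reading bytes.
        let checksum = combine_checksums(self.checksum, other.checksum, other.bytes.len());
        self.bytes.extend_from_slice(&other.bytes);
        self.checksum = checksum;
        Ok(())
    }
}
```

In the non-empty path no validation is needed, because the combined checksum is derived from the two expected checksums and will expose any corruption when the result is eventually validated.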

mountpoint-s3/src/checksums/block.rs

+                  /// Validate data integrity in this `ChecksummedBlock`.
+                  ///
+                  /// Return `IntegrityError` on data corruption.
+                  pub fn validate(&self) -> Result<(), IntegrityError> {

Member

jamesbornholt Oct 24, 2023

Do we ever use this as public API? If not, might be better to make it private, since it kinda invites time-of-check/time-of-use problems.

Contributor Author

passaro Oct 25, 2023

I'd leave this public:

- it can be useful to fail fast
- it does not return the data, so at worst it could be redundant
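
The trade-off discussed in this thread can be sketched as follows. The names are assumptions made for illustration (`Block` stands in for `ChecksummedBlock`, `into_bytes` is an invented accessor, and the checksum is a toy hash). A standalone `validate` only reports a result, while the accessor that hands out data performs its own check at the point of use, so callers cannot rely on a stale earlier check.

```rust
// Toy checksum standing in for the real one (an assumption for illustration).
const MODULUS: u64 = 4_294_967_291;
const BASE: u64 = 257;

fn toy_checksum(data: &[u8]) -> u32 {
    let mut h: u64 = 0;
    for &byte in data {
        h = (h * BASE + byte as u64 + 1) % MODULUS;
    }
    h as u32
}

#[derive(Debug, PartialEq)]
pub struct IntegrityError;

#[derive(Debug, Clone)]
pub struct Block {
    bytes: Vec<u8>,
    checksum: u32,
}

impl Block {
    pub fn new(data: &[u8]) -> Self {
        Block { bytes: data.to_vec(), checksum: toy_checksum(data) }
    }

    /// Fail-fast check. It returns no data, so a caller cannot use it to
    /// obtain bytes that were valid at check time but corrupted afterwards.
    pub fn validate(&self) -> Result<(), IntegrityError> {
        if toy_checksum(&self.bytes) == self.checksum { Ok(()) } else { Err(IntegrityError) }
    }

    /// Hypothetical accessor: the only way to get the bytes out, and it
    /// validates at the moment the data is handed over.
    pub fn into_bytes(self) -> Result<Vec<u8>, IntegrityError> {
        self.validate()?;
        Ok(self.bytes)
    }
}
```

Under this shape, an early `validate` call is at worst a redundant pass over the data, which matches the argument for keeping it public.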

mountpoint-s3/src/checksums/block.rs

+                      self.validate().expect("should be valid");
+                      other.validate().expect("should be valid");
+                      true

Member

jamesbornholt Oct 24, 2023

this should be unreachable? here we know the bytes are equal but the checksums aren't, but they both passed validation?

mountpoint-s3/src/checksums/block.rs

Comment on lines +115 to +124

+                      if self.bytes != other.bytes {
+                          return false;
+                      }
+                      if self.checksum == other.checksum {
+                          return true;
+                      }
+                      self.validate().expect("should be valid");
+                      other.validate().expect("should be valid");

Member

jamesbornholt Oct 24, 2023

Doesn't really matter since it's just test code, but I think you want to do it this way to be correctly bracketed:

let result = self.bytes == other.bytes;
self.validate().expect("should be valid");
other.validate().expect("should be valid");
result
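
Placed in context, the suggested ordering could look like the sketch below. This is an illustration under assumptions: `Block` is a hypothetical stand-in for `ChecksummedBlock`, the checksum is a toy hash, and in the PR this equality exists only for tests. The comparison is computed first and both sides are validated afterwards, so the result is only returned once the bytes that were compared are known to have been intact.

```rust
// Toy checksum standing in for the real one (an assumption for illustration).
const MODULUS: u64 = 4_294_967_291;
const BASE: u64 = 257;

fn toy_checksum(data: &[u8]) -> u32 {
    let mut h: u64 = 0;
    for &byte in data {
        h = (h * BASE + byte as u64 + 1) % MODULUS;
    }
    h as u32
}

#[derive(Debug)]
struct IntegrityError;

#[derive(Debug, Clone)]
struct Block {
    bytes: Vec<u8>,
    checksum: u32,
}

impl Block {
    fn new(data: &[u8]) -> Self {
        Block { bytes: data.to_vec(), checksum: toy_checksum(data) }
    }

    fn validate(&self) -> Result<(), IntegrityError> {
        if toy_checksum(&self.bytes) == self.checksum { Ok(()) } else { Err(IntegrityError) }
    }
}

impl PartialEq for Block {
    fn eq(&self, other: &Self) -> bool {
        // Use the data first...
        let result = self.bytes == other.bytes;
        // ...then validate, so the checks bracket the comparison.
        // Panicking is acceptable here only because this is test-only code.
        self.validate().expect("should be valid");
        other.validate().expect("should be valid");
        result
    }
}
```

With this ordering there is no branch where equal bytes and differing checksums both pass validation, which removes the unreachable case raised in the previous comment.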

jamesbornholt removed this pull request from the merge queue due to a manual request

passaro closed this

passaro deleted the blocks branch

November 1, 2023 16:58
