Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(puffin): implement CachedPuffinWriter #4203

Merged
merged 7 commits into from
Jun 25, 2024

Conversation

zhongzc
Copy link
Contributor

@zhongzc zhongzc commented Jun 25, 2024

I hereby agree to the terms of the GreptimeDB CLA.

Refer to a related PR or issue link (optional)

#4193

What's changed and what's your intention?

Mainly implement put_blob and put_dir.

Checklist

  • I have written the necessary rustdoc comments.
  • I have added the necessary unit tests and integration tests.
  • This PR requires documentation updates.

Summary by CodeRabbit

  • New Features

    • Introduced support for compression codecs in various operations, enabling enhanced data handling capabilities.
    • Added cache management functionality to efficiently handle puffin files.
  • Bug Fixes

    • Improved error handling for file operations, unsupported compression, and duplicate blob writes.
  • Improvements

    • Standardized byte count handling by changing types from usize to u64, enhancing consistency.
    • Enhanced the Blob struct by introducing a compression_codec field and renaming data to compressed_data.
  • Dependency Updates

    • Added async-compression and async-walkdir as dependencies.
    • Moved uuid from development to regular dependencies.

@zhongzc zhongzc requested review from evenyag, v0y4g3r, waynexia and a team as code owners June 25, 2024 05:41
Copy link
Contributor

coderabbitai bot commented Jun 25, 2024

Walkthrough

The changes involve significant updates across various files in the codebase, such as altering method return types from usize to u64, adding a compression_codec field to the Blob struct, and changing dependency listings in Cargo.toml. Additionally, new modules and error variants are introduced, enhancing functionality related to file handling, compression, and cache management.

Changes

Files Change Summary
src/mito2/src/sst/index.rs Changed return type of the finish method from usize to u64 in the Indexer struct.
src/mito2/src/sst/index/... Updated Blob struct by adding compression_codec field and renaming data to compressed_data. Altered ByteCount type and error handling in SstIndexCreator.
src/mito2/src/sst/parquet/writer.rs Removed type conversion to u64 in assignment of index_file_size.
src/puffin/Cargo.toml Added async-compression and async-walkdir dependencies. Moved uuid from dev-dependencies to regular dependencies.
src/puffin/src/blob_metadata.rs Changed CompressionCodec enum's derive attributes from Clone to Copy.
src/puffin/src/error.rs Added new error variants for file operations and updated ErrorExt implementation to handle them.
src/puffin/src/file_format/writer.rs Added compression_codec to Blob, modified add_blob and finish methods’ return types to u64 for PuffinSyncWriter and PuffinAsyncWriter traits.
src/puffin/src/file_format/writer/file.rs Updated PuffinFileWriter to change return types of add_blob and finish methods to u64 and reference compressed_data.
src/puffin/src/lib.rs Added #![feature(trait_alias)] attribute.
src/puffin/src/puffin_manager.rs Added new module declarations for cache_manager and cached_puffin_manager.
src/puffin/src/puffin_manager/cache_manager.rs Introduced CacheManager and DirWriterProvider traits, and new type aliases for cache operations.
src/puffin/src/puffin_manager/cached_puffin_manager.rs Added DirMetadata, DirFileMetadata structs, and CachedPuffinWriter implementing PuffinWriter.
src/puffin/src/tests.rs Renamed data to compressed_data in Blob struct and added compression_codec field in test functions.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant PuffinFileWriter
    participant Blob
    participant CacheManager

    User->>PuffinFileWriter: add_blob()
    PuffinFileWriter-->>Blob: Write compressed_data
    PuffinFileWriter-->>User: Return u64 (byte count)

    User->>PuffinFileWriter: finish()
    PuffinFileWriter-->>CacheManager: Cache directory
    PuffinFileWriter-->>User: Return u64 (total bytes)
Loading

Poem

In the code where bytes do flow,
Changes came, updates did grow.
Compression codecs now in sight,
File sizes altered to new height.
Dependencies lined up in a row,
For Puffin's features to smoothly blow.
🎉 🐇 Here's to the future, set aglow! 🚀


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share
Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai generate interesting stats about this repository and render them as a table.
    • @coderabbitai show all the console.log statements in this repository.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (invoked as PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Additionally, you can add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Configration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@github-actions github-actions bot added the docs-not-required This change does not impact docs. label Jun 25, 2024
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 840f81e and be43f32.

Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
Files selected for processing (16)
  • src/mito2/src/sst/index.rs (1 hunks)
  • src/mito2/src/sst/index/applier.rs (2 hunks)
  • src/mito2/src/sst/index/creator.rs (3 hunks)
  • src/mito2/src/sst/index/creator/statistics.rs (3 hunks)
  • src/mito2/src/sst/parquet/writer.rs (1 hunks)
  • src/puffin/Cargo.toml (2 hunks)
  • src/puffin/src/blob_metadata.rs (1 hunks)
  • src/puffin/src/error.rs (4 hunks)
  • src/puffin/src/file_format/writer.rs (4 hunks)
  • src/puffin/src/file_format/writer/file.rs (2 hunks)
  • src/puffin/src/lib.rs (1 hunks)
  • src/puffin/src/puffin_manager.rs (1 hunks)
  • src/puffin/src/puffin_manager/cache_manager.rs (1 hunks)
  • src/puffin/src/puffin_manager/cached_puffin_manager.rs (1 hunks)
  • src/puffin/src/puffin_manager/cached_puffin_manager/writer.rs (1 hunks)
  • src/puffin/src/tests.rs (2 hunks)
Files skipped from review due to trivial changes (2)
  • src/mito2/src/sst/parquet/writer.rs
  • src/puffin/src/lib.rs
Additional comments not posted (34)
src/puffin/Cargo.toml (3)

11-11: Added dependency: async-compression.

Adding async-compression version 0.4.11 is consistent with the PR's focus on enhancing compression features. Ensure the chosen version is compatible with other project dependencies to avoid conflicts.


13-13: Added dependency: async-walkdir.

The addition of async-walkdir version 2.0.0 supports asynchronous directory walking, which is likely used in the new caching functionalities. Similar to the previous comment, verify compatibility with existing dependencies.


26-26: Dependency moved: uuid.

Moving uuid from [dev-dependencies] to [dependencies] suggests it's now used in the production code, not just during development. This change should be double-checked to ensure it's intended and that uuid is indeed utilized in the production paths.

Verification successful

Dependency moved: uuid.

The uuid dependency is indeed used in production code paths, such as in src/servers/src/mysql/handler.rs and src/puffin/src/puffin_manager/cached_puffin_manager/writer.rs. Moving uuid to [dependencies] is justified based on its usage in the production code.

  • src/servers/src/mysql/handler.rs
  • src/puffin/src/puffin_manager/cached_puffin_manager/writer.rs
Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify the usage of `uuid` in production code paths.

# Test: Search for `uuid` usage in non-test Rust files. Expect: At least one occurrence.
rg --type rust --glob '!*_test.rs' 'uuid'

Length of output: 8843

src/puffin/src/puffin_manager/cached_puffin_manager.rs (3)

15-15: Module import for writer.

The import of the writer module is necessary for utilizing CachedPuffinWriter within this file. This is a straightforward change that supports the PR's functionality.


18-18: Export of CachedPuffinWriter.

Exporting CachedPuffinWriter makes it available for use elsewhere in the project. This is crucial for integrating the new writer functionality provided by this PR.


20-24: Introduction of metadata structures (DirMetadata and DirFileMetadata).

The new structures for directory and file metadata are well-defined and include serialization capabilities, which are essential for the caching functionality. Ensure that all necessary fields are included and appropriately documented.

Also applies to: 26-38

src/puffin/src/file_format/writer.rs (2)

22-22: Updated Blob structure with compression support.

Adding compression_codec to the Blob structure aligns with the PR's focus on enhanced compression capabilities. This change is well-integrated with the existing structure and should facilitate more efficient data handling.

Also applies to: 34-37


52-52: Updated return types for writer methods.

Changing the return type of the add_blob and finish methods from Result<()> to Result<u64> in both synchronous and asynchronous contexts is a significant improvement. This allows tracking the number of bytes written, which is crucial for managing storage efficiently. Ensure that all usages of these methods are updated to handle the new return type.

Also applies to: 55-55, 68-68, 71-71

Verification successful

Updated return types for writer methods.

The return types for the add_blob and finish methods have been updated to Result<u64>, allowing for tracking the number of bytes written. The provided context indicates that these changes have been reflected across the codebase and proper handling of the new return type has been implemented.

  • Verified Usages:
    • src/puffin/src/file_format/writer.rs
    • src/puffin/src/file_format/writer/file.rs
    • src/puffin/src/puffin_manager/cached_puffin_manager/writer.rs
    • src/puffin/src/tests.rs
    • src/mito2/src/sst/index/applier.rs
    • src/mito2/src/sst/index/creator.rs

The usages in these files correctly handle the new return type by either unwrapping the result or properly propagating errors.

Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify all usages of `add_blob` and `finish` methods handle the new return type.

# Test: Search for usages of these methods in Rust files. Expect: Proper handling of `Result<u64>`.
rg --type rust 'add_blob|finish'

Length of output: 50113



Script:

#!/bin/bash
# Description: Extract lines surrounding the instances of `add_blob` and `finish` methods to verify handling of the new return type.

# Extract surrounding lines for `add_blob`
rg --type rust 'add_blob' -C 3

# Extract surrounding lines for `finish`
rg --type rust 'finish' -C 3

Length of output: 208460

src/puffin/src/puffin_manager.rs (1)

15-16: Introduction of new modules for cache management.

The introduction of cache_manager and cached_puffin_manager modules is crucial for the new caching functionality. This structural change supports the PR's objectives and enhances the modularity of the code.

src/puffin/src/puffin_manager/cache_manager.rs (1)

24-28: Comprehensive introduction of cache management functionalities.

The introduction of DirWriterProvider, CacheManager, and associated types and functions is well-executed. These changes are essential for managing the cache effectively and are aligned with the PR's objectives. Ensure that the implementations of these traits are robust and handle edge cases properly.

Also applies to: 30-79

src/mito2/src/sst/index/creator/statistics.rs (3)

38-38: Type conversion from usize to u64 for byte_count

The change in type from usize to u64 for byte_count is consistent with the need to handle larger data sizes, which could exceed the limits of usize on 32-bit systems.


66-66: Updated return type for byte_count method

The method byte_count now returns u64, aligning with the updated type of the byte_count field. This ensures type consistency across the interface and the data structure.


115-115: Updated parameter type for inc_byte_count method

The parameter type for inc_byte_count has been updated to u64 to match the field's type, ensuring consistency and avoiding potential type conversion issues or data loss on large values.

src/puffin/src/file_format/writer/file.rs (4)

78-87: Updated add_blob method to return u64 and handle compressed data

The add_blob method now correctly returns the size of the written data as u64, which is a necessary change to support large file operations. Additionally, handling of the compressed_data attribute of the Blob struct is properly implemented, ensuring that data compression is accounted for during the write operation.


94-99: Updated finish method to return total written bytes as u64

The finish method now returns the total number of bytes written as u64. This is an appropriate change for consistency with the return type of add_blob and to support large data sizes.


109-120: Asynchronous handling of add_blob with compressed data

The asynchronous version of add_blob has been updated similarly to its synchronous counterpart, handling compressed data and returning the size as u64. This ensures that the API is consistent and capable of handling large data operations asynchronously.


127-133: Asynchronous finish method correctly handles footer and returns total bytes

The asynchronous finish method, like its synchronous version, now handles writing the footer and flushing the writer, returning the total bytes written as u64. This change ensures that the method behaves consistently in both synchronous and asynchronous contexts.

src/puffin/src/error.rs (5)

67-73: New error variant for file opening issues

The addition of the Open variant to handle file opening errors is a necessary improvement for robust error handling, especially given the file-intensive operations of the system.


75-81: New error variant for metadata reading issues

The Metadata error variant addresses potential issues during metadata retrieval, which is crucial for the system's operation where metadata plays a significant role.


83-89: New error variant for directory walking issues

The WalkDirError variant is crucial for robustly handling errors that may occur during directory traversal, which is a common operation in file management systems.


187-199: New error variants for unsupported compression and duplicate blob writes

The addition of UnsupportedCompression and DuplicateBlob error variants is appropriate, given the system's expanded functionality with compression and the need to handle specific blob-related errors.


Line range hint 213-230: Updated ErrorExt implementation to include new error variants

The ErrorExt implementation has been updated to include the new error variants, ensuring that all errors have appropriate status codes. This is crucial for consistent error handling and reporting across the system.

src/puffin/src/puffin_manager/cached_puffin_manager/writer.rs (4)

36-49: Struct definition for CachedPuffinWriter

The CachedPuffinWriter struct is well-defined with clear documentation on its purpose. It includes essential fields for managing file writing, such as puffin_file_name, cache_manager, and puffin_file_writer, which are crucial for its functionality.


57-94: Implementation of put_blob method with compression handling

The put_blob method is implemented with comprehensive error checks and support for different compression codecs. This method's design ensures that blobs are written correctly and efficiently, handling potential duplicate blobs and unsupported compressions smoothly.


96-180: Implementation of put_dir method with directory handling

The put_dir method is robust, handling directory traversal and file writing with appropriate error checks and compression handling. The method effectively manages directory metadata and ensures all files within the directory are processed correctly.


187-190: Finish method for finalizing file writes

The finish method in CachedPuffinWriter correctly finalizes the file writing process, ensuring that all data is written and the file is properly closed. This is essential for maintaining data integrity.

src/mito2/src/sst/index/applier.rs (2)

211-213: Updated Blob struct to handle compressed data

The update to the Blob struct to include compressed_data and compression_codec fields in the test setup is essential for testing the handling of different data types and compression settings. This ensures that the system can effectively manage various data scenarios.


264-266: Handling of invalid blob types in tests

The test for handling invalid blob types is crucial for ensuring the system gracefully handles errors related to unsupported or incorrect blob types. This helps maintain robustness and reliability.

src/puffin/src/tests.rs (2)

192-195: Approved: Updated test cases to reflect changes in Blob struct.

The test cases have been updated to use the new compressed_data and compression_codec fields in the Blob struct, which aligns with the structural changes made across the project.

Also applies to: 202-205


262-265: Approved: Asynchronous test cases updated to reflect structural changes in Blob struct.

The asynchronous test cases have been correctly updated to include the compressed_data and compression_codec fields in the Blob struct, maintaining consistency with the synchronous tests and the broader project updates.

Also applies to: 273-276

src/puffin/src/blob_metadata.rs (1)

72-72: Approved: Optimization by changing derive attributes in CompressionCodec.

Changing the derive attribute from Clone to Copy for the CompressionCodec enum is a good practice for small immutable data types like enums, as it allows more efficient value copying.

src/mito2/src/sst/index.rs (1)

80-80: Approved: Updated return type in finish method to support larger data sizes.

The change from Option<usize> to Option<u64> in the finish method of the Indexer struct supports handling larger data sizes, which is beneficial for scalability and data management in large-scale applications.

src/mito2/src/sst/index/creator.rs (2)

57-57: Type definition updated to accommodate larger values.

Changing ByteCount from usize to u64 is a good practice for systems that handle large datasets or files, ensuring that the data type can accommodate larger values without overflow.


274-276: Enhanced Blob struct with optional compression support.

The addition of the compression_codec field to the Blob struct allows for optional compression, which can significantly reduce storage or transmission size when handling large data sets. Ensure that all parts of the system that instantiate or manipulate Blob objects are updated to handle this new field correctly.

Verification successful

Enhanced Blob struct with optional compression support.

The addition of the compression_codec field to the Blob struct has been correctly handled across the codebase. The field is properly included in all instances where the Blob struct is instantiated, ensuring that the optional compression feature is supported.

  • Files reviewed:
    • src/puffin/src/tests.rs
    • src/puffin/src/puffin_manager/cached_puffin_manager/writer.rs
    • src/mito2/src/sst/index/creator.rs
    • src/mito2/src/sst/index/applier.rs

All relevant usages of the Blob struct have been updated to handle the new compression_codec field appropriately.

Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify that all usages of `Blob` struct have been updated to handle the new `compression_codec` field.

# Test: Search for the Blob struct usage across the codebase. Expect: Proper handling of the new field.
rg --type rust $'Blob {'

Length of output: 101



Script:

#!/bin/bash
# Description: Verify that all usages of `Blob` struct have been updated to handle the new `compression_codec` field.

# Test: Search for the Blob struct usage across the codebase. Expect: Proper handling of the new field.
rg --type rust $'Blob \\{'

Length of output: 1295



Script:

#!/bin/bash
# Description: Extract relevant lines from files where `Blob` struct is instantiated to verify handling of `compression_codec` field.

# Extract lines from src/puffin/src/tests.rs
rg --type rust --context 5 'Blob \{' src/puffin/src/tests.rs

# Extract lines from src/puffin/src/puffin_manager/cached_puffin_manager/writer.rs
rg --type rust --context 5 'Blob \{' src/puffin/src/puffin_manager/cached_puffin_manager/writer.rs

# Extract lines from src/mito2/src/sst/index/creator.rs
rg --type rust --context 5 'Blob \{' src/mito2/src/sst/index/creator.rs

# Extract lines from src/mito2/src/sst/index/applier.rs
rg --type rust --context 5 'Blob \{' src/mito2/src/sst/index/applier.rs

Length of output: 5108

Signed-off-by: Zhenchi <zhongzc_arch@outlook.com>
@zhongzc zhongzc requested a review from fengjiachun June 25, 2024 05:50
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between be43f32 and b0a4f29.

Files selected for processing (3)
  • src/puffin/src/puffin_manager.rs (2 hunks)
  • src/puffin/src/puffin_manager/cache_manager.rs (1 hunks)
  • src/puffin/src/puffin_manager/cached_puffin_manager/writer.rs (1 hunks)
Files skipped from review as they are similar to previous changes (3)
  • src/puffin/src/puffin_manager.rs
  • src/puffin/src/puffin_manager/cache_manager.rs
  • src/puffin/src/puffin_manager/cached_puffin_manager/writer.rs

Copy link

codecov bot commented Jun 25, 2024

Codecov Report

Attention: Patch coverage is 17.90123% with 133 lines in your changes missing coverage. Please review.

Project coverage is 84.70%. Comparing base (cdd4baf) to head (b0a4f29).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4203      +/-   ##
==========================================
- Coverage   85.05%   84.70%   -0.35%     
==========================================
  Files        1031     1033       +2     
  Lines      181276   181442     +166     
==========================================
- Hits       154176   153692     -484     
- Misses      27100    27750     +650     

Copy link
Member

@waynexia waynexia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly LGTM

src/puffin/src/file_format/writer.rs Show resolved Hide resolved
@WenyXu WenyXu added this pull request to the merge queue Jun 25, 2024
Merged via the queue into GreptimeTeam:main with commit 1e815dd Jun 25, 2024
53 checks passed
@zhongzc zhongzc deleted the zhongzc/puffin-dir-2 branch June 25, 2024 08:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docs-not-required This change does not impact docs.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants