
Parquet Modular Encryption support #6637

Draft
rok wants to merge 6 commits into main from the decryption-basics-fork branch

Conversation

rok (Member) commented Oct 28, 2024

Which issue does this PR close?

This PR is based on an existing branch and an internal patch, and aims to provide basic modular encryption support. Closes #3511.

Rationale for this change

See #3511.

What changes are included in this PR?

TBD

Are there any user-facing changes?

TBD

The github-actions bot added the parquet label (Changes to the parquet crate) on Oct 28, 2024
rok (Member, Author) commented Oct 28, 2024

Currently this is a rough rebase of work done by @ggershinsky. As ParquetMetaDataReader is now available, some refactoring will be required.

etseidl (Contributor) commented Oct 28, 2024

As ParquetMetaDataReader is now available some refactoring will be required.

@rok let me know if you want any help shoehorning this into ParquetMetaDataReader.

brainslush commented

Is there any help, input or contribution needed here?

rok (Member, Author) commented Nov 21, 2024

Thanks for the offer @etseidl & @brainslush! I'm making some progress and would definitely appreciate a review! I'll ping once I push.

rok force-pushed the decryption-basics-fork branch from 7faac72 to 6f055f9 on November 23, 2024 at 21:56
rok force-pushed the decryption-basics-fork branch from fe488b3 to d263510 on November 23, 2024 at 23:06
rok (Member, Author) commented Dec 4, 2024

As ParquetMetaDataReader is now available some refactoring will be required.

@rok let me know if you want any help shoehorning this into ParquetMetaDataReader.

@etseidl could you please do a quick pass to check whether this makes sense with respect to ParquetMetaDataReader?
I'll continue with data decryption.

etseidl (Contributor) left a comment

Only looking at the metadata bits for now...looks good to me so far. Just a few minor nits. Thanks @rok!

@@ -52,13 +53,16 @@ pub fn parse_metadata<R: ChunkReader>(chunk_reader: &R) -> Result<ParquetMetaDat
/// Decodes [`ParquetMetaData`] from the provided bytes.
///
/// Typically this is used to decode the metadata from the end of a parquet
/// file. The format of `buf` is the Thift compact binary protocol, as specified
/// file. The format of `buf` is the Thrift compact binary protocol, as specified
A Contributor commented:

❤️

/// by the [Parquet Spec].
///
/// [Parquet Spec]: https://github.com/apache/parquet-format#metadata
#[deprecated(since = "53.1.0", note = "Use ParquetMetaDataReader::decode_metadata")]
pub fn decode_metadata(buf: &[u8]) -> Result<ParquetMetaData> {
ParquetMetaDataReader::decode_metadata(buf)
pub fn decode_metadata(
A Contributor commented:

I'm not sure we should be updating a deprecated function. If encryption is desired I'd say force use of the new API so we don't have to maintain this one. Just pass None to ParquetMetaDataReader::decode_metadata.
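For illustration, a minimal sketch of that suggestion, assuming the reader's decode_metadata gains an optional decryption-properties argument in this PR (the two-argument signature here is hypothetical, not the released API):

use crate::errors::Result;
use crate::file::metadata::{ParquetMetaData, ParquetMetaDataReader};

#[deprecated(since = "53.1.0", note = "Use ParquetMetaDataReader::decode_metadata")]
pub fn decode_metadata(buf: &[u8]) -> Result<ParquetMetaData> {
    // Keep the deprecated free function at its old, encryption-free signature and just
    // delegate with no decryption properties; callers that need decryption would use
    // the reader API directly.
    ParquetMetaDataReader::decode_metadata(buf, None)
}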

&mut fetch,
file_size,
self.get_prefetch_size(),
self.file_decryption_properties.clone(),
A Contributor commented:

Very minor nit: I understand that file_decryption_properties needs to be cloned eventually... just wondering if we could pass references down into decode_metadata and do the clone there, where it's more obviously needed.
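A sketch of that idea, purely illustrative (the helper name, the import path, and the two-argument decode_metadata are assumptions, not the PR's actual code): intermediate layers borrow the properties, and the clone happens once, where an owned copy is actually needed.

use crate::encryption::ciphers::FileDecryptionProperties; // path assumed from this PR
use crate::errors::Result;
use crate::file::metadata::{ParquetMetaData, ParquetMetaDataReader};

fn read_metadata_with_decryption(
    buf: &[u8],
    decryption_properties: Option<&FileDecryptionProperties>,
) -> Result<ParquetMetaData> {
    // Borrow all the way down; clone only here, right before ownership is required.
    ParquetMetaDataReader::decode_metadata(buf, decryption_properties.cloned())
}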

rok force-pushed the decryption-basics-fork branch from f90d8b4 to 29d55eb on December 16, 2024 at 23:51
)?;

// todo: This currently fails, possibly due to wrongly generated AAD
let buf = file_decryptor.decrypt(prot.read_bytes()?.as_slice(), aad.as_ref());
rok (Member, Author) commented:

@ggershinsky I'm stuck on decrypting a page header here. Anything obvious you notice that I'm doing wrong?
(My test case is a uniformly encrypted parquet file from the parquet test data, encrypted with AES_GCM_V1 I believe. Spec for reference.)

A reviewer commented:

Hi Rok, I had a look at this and noticed a couple of problems. One minor issue is that the module type is hard-coded to be a dictionary page header, but it looks like the first page encountered should be a data page header, based on debugging a read of the same file from C++.

Then, in create_module_aad, the page index is written as an i32 but it should be an i16. Kind of related, there are a bunch of checks there like row_group_ordinal > i16::MAX, but row_group_ordinal is already an i16, so the check is a no-op (not sure if there is proper error handling when converting these to i16 further up?).
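For reference, a sketch of the page-header module AAD suffix as the spec describes it, assuming a 1-byte module type followed by little-endian i16 ordinals (which also matches the file_aad.len() + 7 capacity in the diff below); the function name is illustrative:

fn page_header_aad(
    file_aad: &[u8],
    module_type: u8,       // should be DataPageHeader here, not a hard-coded DictionaryPageHeader
    row_group_ordinal: i16,
    column_ordinal: i16,
    page_ordinal: i16,     // narrowed to i16 before serialization
) -> Vec<u8> {
    // file AAD || module type (1 byte) || row group (2) || column (2) || page (2) = 7 suffix bytes
    let mut aad = Vec::with_capacity(file_aad.len() + 7);
    aad.extend_from_slice(file_aad);
    aad.push(module_type);
    aad.extend_from_slice(&row_group_ordinal.to_le_bytes());
    aad.extend_from_slice(&column_ordinal.to_le_bytes());
    aad.extend_from_slice(&page_ordinal.to_le_bytes());
    aad
}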

The main problem seems to be that prot.read_bytes here reads the length of the buffer as a Thrift varint and then passes the remaining buffer through to be decrypted, but for encrypted buffers the length is actually written as a 4-byte little-endian integer, and the decrypt method also expects to receive a ciphertext buffer that includes the 4 length bytes.

With the following changes I can get to todo!("Decrypted page header!"):

diff --git a/parquet/src/encryption/ciphers.rs b/parquet/src/encryption/ciphers.rs
index 89515fe0e..fc0b703ba 100644
--- a/parquet/src/encryption/ciphers.rs
+++ b/parquet/src/encryption/ciphers.rs
@@ -194,7 +194,7 @@ fn create_module_aad(file_aad: &[u8], module_type: ModuleType, row_group_ordinal
     }
     if row_group_ordinal > i16::MAX {
         return Err(general_err!("Encrypted parquet files can't have more than {} row groups: {}",
-            u16::MAX, row_group_ordinal));
+            i16::MAX, row_group_ordinal));
     }
 
     if column_ordinal < 0 {
@@ -202,7 +202,7 @@ fn create_module_aad(file_aad: &[u8], module_type: ModuleType, row_group_ordinal
     }
     if column_ordinal > i16::MAX {
         return Err(general_err!("Encrypted parquet files can't have more than {} columns: {}",
-            u16::MAX, column_ordinal));
+            i16::MAX, column_ordinal));
     }
 
     if module_buf[0] != (ModuleType::DataPageHeader as u8) &&
@@ -218,10 +218,11 @@ fn create_module_aad(file_aad: &[u8], module_type: ModuleType, row_group_ordinal
     if page_ordinal < 0 {
         return Err(general_err!("Wrong page ordinal: {}", page_ordinal));
     }
-    if page_ordinal > i32::MAX {
+    if page_ordinal > (i16::MAX as i32){
         return Err(general_err!("Encrypted parquet files can't have more than {} pages in a chunk: {}",
-            u16::MAX, page_ordinal));
+            i16::MAX, page_ordinal));
     }
+    let page_ordinal = page_ordinal as i16;
 
     let mut aad = Vec::with_capacity(file_aad.len() + 7);
     aad.extend_from_slice(file_aad);
diff --git a/parquet/src/file/serialized_reader.rs b/parquet/src/file/serialized_reader.rs
index adf4aa07a..48796b09f 100644
--- a/parquet/src/file/serialized_reader.rs
+++ b/parquet/src/file/serialized_reader.rs
@@ -342,7 +342,6 @@ impl<R: 'static + ChunkReader> RowGroupReader for SerializedRowGroupReader<'_, R
 
 /// Reads a [`PageHeader`] from the provided [`Read`]
 pub(crate) fn read_page_header<T: Read>(input: &mut T, crypto_context: Option<Arc<CryptoContext>>) -> Result<PageHeader> {
-    let mut prot = TCompactInputProtocol::new(input);
     if let Some(crypto_context) = crypto_context {
         // let mut buf = [0; 16 * 1024];
         // let size = input.read(&mut buf)?;
@@ -354,19 +353,25 @@ pub(crate) fn read_page_header<T: Read>(input: &mut T, crypto_context: Option<Ar
 
         let aad = create_page_aad(
             aad_file_unique.as_slice(),
-            ModuleType::DictionaryPageHeader,
+            ModuleType::DataPageHeader,
             crypto_context.row_group_ordinal,
             crypto_context.column_ordinal,
             0,
         )?;
 
-        // todo: This currently fails, possibly due to wrongly generated AAD
-        let buf = file_decryptor.decrypt(prot.read_bytes()?.as_slice(), aad.as_ref());
+        let mut len_bytes = [0; 4];
+        input.read_exact(&mut len_bytes)?;
+        let ciphertext_len = u32::from_le_bytes(len_bytes) as usize;
+        let mut ciphertext = vec![0; 4 + ciphertext_len];
+        input.read_exact(&mut ciphertext[4..])?;
+        let buf = file_decryptor.decrypt(&ciphertext, aad.as_ref());
         todo!("Decrypted page header!");
         let mut prot = TCompactSliceInputProtocol::new(buf.as_slice());
         let page_header = PageHeader::read_from_in_protocol(&mut prot)?;
         return Ok(page_header)
     }
+
+    let mut prot = TCompactInputProtocol::new(input);
     let page_header = PageHeader::read_from_in_protocol(&mut prot)?;
     Ok(page_header)
 }
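For context, a sketch of the AES_GCM_V1 module layout that the decrypt call above is assumed to receive, per the spec: a 4-byte little-endian length prefix, a 12-byte nonce, the ciphertext, and a 16-byte GCM tag (the helper below is illustrative, not the PR's code):

const SIZE_LEN: usize = 4;   // little-endian length prefix
const NONCE_LEN: usize = 12; // AES-GCM nonce
const TAG_LEN: usize = 16;   // AES-GCM authentication tag

// Split a length-prefixed module buffer into (nonce, ciphertext || tag); the actual
// AEAD verification and decryption are left to the crate's AES-GCM implementation.
fn split_gcm_module(buf: &[u8]) -> Option<(&[u8], &[u8])> {
    if buf.len() < SIZE_LEN + NONCE_LEN + TAG_LEN {
        return None;
    }
    let nonce = &buf[SIZE_LEN..SIZE_LEN + NONCE_LEN];
    let ciphertext_and_tag = &buf[SIZE_LEN + NONCE_LEN..];
    Some((nonce, ciphertext_and_tag))
}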
