Convert some panics that happen on invalid parquet files to error results #6738

jp0317 · 2024-11-16T06:09:04Z

Which issue does this PR close?

This solves some of #6737.

Rationale for this change

Replace some panics with proper errors.

What changes are included in this PR?

Some code paths that panic are changed to return error results.

Are there any user-facing changes?

Behavior change from panics to errors.

tustvold

I'm not sure about this, this seems to add a number of untested additional checks, some to very hot codepaths.

I suggest rather than just looking for things that might panic, instead going from a failing test to a fix. This would also better capture the more problematic cases where the reader gets stuck on malformed input, a panic is a good outcome IMO...

tustvold · 2024-11-16T08:08:35Z

parquet/src/file/metadata/mod.rs

@@ -959,17 +959,18 @@ impl ColumnChunkMetaData {
    }

    /// Returns the offset and length in bytes of the column chunk within the file
-    pub fn byte_range(&self) -> (u64, u64) {
+    pub fn byte_range(&self) -> Result<(u64, u64)> {


This is a breaking API change

Yeah. Is it fine given that it just wraps the return value in a Result? The behavior change is just "panic --> error".

No this is a breaking change
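
To illustrate why wrapping the return value is still source-breaking, here is a minimal sketch. `ColumnChunk` is a simplified stand-in, not the crate's `ColumnChunkMetaData`, and `try_byte_range` is a hypothetical name for a fallible sibling method; neither is taken from this PR. Any caller that writes `let (start, len) = chunk.byte_range();` stops compiling once the signature becomes `Result<(u64, u64)>`, whereas adding a new method leaves existing callers untouched.

```rust
/// Simplified stand-in for column chunk metadata (hypothetical, for illustration only).
struct ColumnChunk {
    /// Start offset as stored in the file metadata; may be negative in a malformed file.
    offset: i64,
    /// Compressed length as stored in the file metadata; may be negative in a malformed file.
    length: i64,
}

impl ColumnChunk {
    /// Hypothetical fallible variant: returns an error instead of panicking.
    fn try_byte_range(&self) -> Result<(u64, u64), String> {
        let start = u64::try_from(self.offset)
            .map_err(|_| format!("negative column start offset: {}", self.offset))?;
        let len = u64::try_from(self.length)
            .map_err(|_| format!("negative column length: {}", self.length))?;
        Ok((start, len))
    }

    /// Existing infallible signature, kept so current callers still compile.
    /// It panics on invalid metadata.
    fn byte_range(&self) -> (u64, u64) {
        self.try_byte_range()
            .expect("invalid column chunk byte range")
    }
}
```

Whether a second method is preferable to a breaking release is a judgment call for the maintainers; the sketch only shows that the two options differ in their effect on existing callers.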

tustvold · 2024-11-16T10:08:00Z

parquet/src/format.rs

@@ -1738,6 +1738,12 @@ impl crate::thrift::TSerializable for IntType {
      bit_width: f_1.expect("auto-generated code should have checked for presence of required fields"),
      is_signed: f_2.expect("auto-generated code should have checked for presence of required fields"),
    };
+    if ret.bit_width != 8 && ret.bit_width != 16 && ret.bit_width != 32 && ret.bit_width != 64 {


This is an auto generated file, in place edits need to be scripted

Didn't notice that, thanks! I moved the check to schema/types.rs.
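
A minimal sketch of what such a check could look like outside the generated file, assuming the bit width arrives as an `i8`. The helper name `validate_int_bit_width` and the `String` error type are hypothetical and not the crate's API; only the set of accepted widths (8, 16, 32, 64) comes from the diff above.

```rust
/// Checks that an integer logical type uses one of the bit widths accepted by the
/// condition in the diff (8, 16, 32, or 64). Hypothetical helper, not the crate's API.
fn validate_int_bit_width(bit_width: i8) -> Result<i8, String> {
    match bit_width {
        8 | 16 | 32 | 64 => Ok(bit_width),
        other => Err(format!("invalid integer bit width: {other}")),
    }
}
```

Keeping the validation in hand-written code means the generated Thrift bindings can be regenerated without losing the check.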

tustvold · 2024-11-16T10:09:06Z

parquet/src/schema/types.rs

@@ -1227,6 +1231,10 @@ fn from_thrift_helper(elements: &[SchemaElement], index: usize) -> Result<(usize
                if !is_root_node {
                    builder = builder.with_repetition(rep);
                }
+            } else if !is_root_node {


Do we need this check?

This is based on the comment at line 1230, which says "All other types must have one", and the assert at line 1066: assert!(tp.get_basic_info().has_repetition())

My concern with this check is that, unless it is necessary for correctness, there is potential that adding it breaks something for someone. Parquet is a very broad ecosystem, and lots of writers have interesting interpretations of the specification.

tustvold · 2024-11-16T10:09:42Z

parquet/src/thrift.rs

@@ -67,7 +67,17 @@ impl<'a> TCompactSliceInputProtocol<'a> {
        let mut shift = 0;
        loop {
            let byte = self.read_byte()?;
-            in_progress |= ((byte & 0x7F) as u64) << shift;
+            let val = (byte & 0x7F) as u64;
+            let val = val.checked_shl(shift).map_or_else(


This is a very performance critical code path, this probably should use wrapping_shl

I'm afraid wrapping might cause correctness issues... Besides, would the checked_shl really make a noticeable performance difference here?

It shouldn't cause correctness issues, and yes it will matter. There are benchmarks that will likely show this
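
The difference between the two shifts under discussion can be sketched on a standalone varint decoder. Both functions below are hypothetical illustrations that read from a byte slice; they are not the crate's `TCompactSliceInputProtocol`. In the standard library, `checked_shl` returns `None` only when the shift amount is at least the bit width, while `wrapping_shl` masks the shift amount to the bit width and never fails. Neither reports bits shifted out of the value.

```rust
/// Decodes an unsigned varint, returning the value and the number of bytes consumed.
/// Rejects input whose shift amount would reach 64 or more.
fn read_varint_checked(bytes: &[u8]) -> Result<(u64, usize), String> {
    let mut value: u64 = 0;
    let mut shift: u32 = 0;
    for (i, &byte) in bytes.iter().enumerate() {
        let chunk = (byte & 0x7F) as u64;
        // `checked_shl` yields None once `shift >= 64`.
        let shifted = chunk
            .checked_shl(shift)
            .ok_or_else(|| format!("varint too long: shift of {shift} bits"))?;
        value |= shifted;
        if byte & 0x80 == 0 {
            return Ok((value, i + 1));
        }
        shift += 7;
    }
    Err("unexpected end of input".to_string())
}

/// Same decoder, but an oversized shift wraps around instead of failing.
/// Returns None only when the input ends before the final byte.
fn read_varint_wrapping(bytes: &[u8]) -> Option<(u64, usize)> {
    let mut value: u64 = 0;
    let mut shift: u32 = 0;
    for (i, &byte) in bytes.iter().enumerate() {
        // `wrapping_shl` masks the shift amount, so a shift of 70 acts like a shift of 6.
        value |= ((byte & 0x7F) as u64).wrapping_shl(shift);
        if byte & 0x80 == 0 {
            return Some((value, i + 1));
        }
        shift += 7;
    }
    None
}
```

On well-formed input such as `[0xAC, 0x02]` both decoders return 300. On an overlong encoding (ten `0x80` bytes followed by `0x01`) the checked version returns an error, while the wrapping version returns 64 without panicking. Whether that garbage value matters depends on whether later validation rejects the file anyway, and any performance difference would need to be confirmed with the benchmarks mentioned above.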

tustvold · 2024-11-16T10:10:34Z

parquet/src/thrift.rs

 impl TInputProtocol for TCompactSliceInputProtocol<'_> {
    fn read_message_begin(&mut self) -> thrift::Result<TMessageIdentifier> {
-        unimplemented!()
+        thrift_unimplemented!()


This should be unreachable, a panic is the correct thing to do here

iiuc it "should be unreachable" unless the input file is malformed? I guess this goes back to the discussion on how to handle invalid inputs:

No it is actually genuinely unreachable, we don't use thrift messages

jp0317 · 2024-11-19T03:28:54Z

Thanks @tustvold for the review!

> this seems to add a number of untested additional checks...

I think the changes are more about converting panics to errors than about changing the actual code logic.

> looking for things that might panic, instead going from a failing test to a fix

These panics were triggered in my own fuzzing tests with invalid parquet files. Nevertheless, I think it's a similar topic to "how to handle invalid inputs" as discussed in #5323. Reading this doc, IMHO errors are better than panics unless it's really something unrecoverable.

github-actions bot added the parquet Changes to the parquet crate label Nov 16, 2024

jp0317 force-pushed the panic branch from 18c494b to 8235bf3 Compare November 16, 2024 06:25

Reduce panics

f481dff

jp0317 force-pushed the panic branch from 8235bf3 to f481dff Compare November 16, 2024 06:29

tustvold requested changes Nov 16, 2024

move integer logical type from format.rs to schema type.rs

a4f8286

jp0317 force-pushed the panic branch from 42f0223 to a4f8286 Compare November 19, 2024 03:30

jp0317 requested a review from tustvold November 19, 2024 03:32
