Convert some panics that happen on invalid parquet files to error results #6738
I'm not sure about this; it seems to add a number of untested additional checks, some of them to very hot codepaths.
Rather than just looking for things that might panic, I suggest going from a failing test to a fix. This would also better capture the more problematic cases where the reader gets stuck on malformed input; a panic is a good outcome IMO...
@@ -959,17 +959,18 @@ impl ColumnChunkMetaData {
 }

 /// Returns the offset and length in bytes of the column chunk within the file
-pub fn byte_range(&self) -> (u64, u64) {
+pub fn byte_range(&self) -> Result<(u64, u64)> {
This is a breaking API change
Yeah... is it fine, given that it just wraps the return value in a Result? The behavior change is just "panic --> error".
No, this is a breaking change.
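To illustrate why wrapping a return value in `Result` counts as breaking even when the data is unchanged, here is a minimal sketch. `ChunkMeta`, `byte_range_old`, `byte_range_new`, and the two `end_offset_*` callers are hypothetical stand-ins made up for this example; they are not the parquet crate's types or the PR's actual code.

```rust
use std::convert::TryFrom;

/// Hypothetical stand-in for column chunk metadata (not the parquet crate's type).
pub struct ChunkMeta {
    pub offset: i64,
    pub length: i64,
}

impl ChunkMeta {
    /// Old shape: infallible signature that panics on negative values.
    pub fn byte_range_old(&self) -> (u64, u64) {
        assert!(self.offset >= 0 && self.length >= 0, "negative offset or length");
        (self.offset as u64, self.length as u64)
    }

    /// New shape: the same data, reported through a Result instead of a panic.
    pub fn byte_range_new(&self) -> Result<(u64, u64), String> {
        let offset = u64::try_from(self.offset)
            .map_err(|_| format!("negative offset {}", self.offset))?;
        let length = u64::try_from(self.length)
            .map_err(|_| format!("negative length {}", self.length))?;
        Ok((offset, length))
    }
}

/// A downstream caller written against the old signature.
pub fn end_offset_old(meta: &ChunkMeta) -> u64 {
    let (start, len) = meta.byte_range_old();
    start + len
}

/// The same caller after the change. The tuple destructuring above does not
/// type-check against a Result, so every call site like it must be edited.
pub fn end_offset_new(meta: &ChunkMeta) -> Result<u64, String> {
    let (start, len) = meta.byte_range_new()?;
    Ok(start + len)
}
```

Code like `end_offset_old` stops compiling as soon as the method it calls returns a `Result`, which is why a change of this kind has to wait for a major release.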
parquet/src/format.rs
Outdated
@@ -1738,6 +1738,12 @@ impl crate::thrift::TSerializable for IntType {
 bit_width: f_1.expect("auto-generated code should have checked for presence of required fields"),
 is_signed: f_2.expect("auto-generated code should have checked for presence of required fields"),
 };
+if ret.bit_width != 8 && ret.bit_width != 16 && ret.bit_width != 32 && ret.bit_width != 64 {
This is an auto-generated file; in-place edits need to be scripted.
Didn't notice that, thanks! I moved the check to schema/types.rs.
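For reference, the bit-width check under discussion can be sketched as a standalone function that lives outside the generated code. The name `validate_int_bit_width`, the `String` error type, and the `i8` parameter type are assumptions for illustration only; this is not the code that was moved in the PR.

```rust
/// Hypothetical helper (not part of the parquet crate): accept only the
/// integer bit widths tested for in the diff above.
pub fn validate_int_bit_width(bit_width: i8) -> Result<i8, String> {
    match bit_width {
        8 | 16 | 32 | 64 => Ok(bit_width),
        other => Err(format!("invalid integer bit width: {}", other)),
    }
}
```

Keeping the validation in hand-written code means regenerating the thrift bindings would not silently drop it.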
@@ -1227,6 +1231,10 @@ fn from_thrift_helper(elements: &[SchemaElement], index: usize) -> Result<(usize
 if !is_root_node {
 builder = builder.with_repetition(rep);
 }
+} else if !is_root_node {
Do we need this check?
This is based on the comment at line 1230, which says "All other types must have one", and the assert at line 1066: assert!(tp.get_basic_info().has_repetition())
My concern is that, unless this check is necessary for correctness, adding it could break something for someone. Parquet is a very broad ecosystem, and lots of writers have interesting interpretations of the specification.
@@ -67,7 +67,17 @@ impl<'a> TCompactSliceInputProtocol<'a> {
 let mut shift = 0;
 loop {
 let byte = self.read_byte()?;
-in_progress |= ((byte & 0x7F) as u64) << shift;
+let val = (byte & 0x7F) as u64;
+let val = val.checked_shl(shift).map_or_else(
This is a very performance-critical code path; it should probably use wrapping_shl.
I'm afraid wrapping might cause correctness issues... Besides, would checked_shl really make a noticeable performance difference here?
It shouldn't cause correctness issues, and yes, it will matter. There are benchmarks that will likely show this.
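The difference between the two shifts can be seen in a simplified varint decoder. This is a sketch only: `read_vlq_checked` and `read_vlq_wrapping` are hypothetical functions over a byte slice, not the `TCompactSliceInputProtocol` code, and the sketch says nothing about their relative performance.

```rust
/// Decode an unsigned LEB128 varint; returns (value, bytes consumed).
/// Rejects input whose shift amount would reach 64 bits or more.
pub fn read_vlq_checked(bytes: &[u8]) -> Result<(u64, usize), String> {
    let mut value: u64 = 0;
    let mut shift: u32 = 0;
    for (i, &byte) in bytes.iter().enumerate() {
        let chunk = (byte & 0x7F) as u64;
        // checked_shl returns None once shift >= 64.
        let shifted = chunk
            .checked_shl(shift)
            .ok_or_else(|| format!("varint too long (shift {})", shift))?;
        value |= shifted;
        if byte & 0x80 == 0 {
            return Ok((value, i + 1));
        }
        shift += 7;
    }
    Err("unexpected end of input".to_string())
}

/// Same decoder, but the shift amount is masked to the low 6 bits instead of
/// being checked, so over-long input never errors or panics on the shift.
pub fn read_vlq_wrapping(bytes: &[u8]) -> Result<(u64, usize), String> {
    let mut value: u64 = 0;
    let mut shift: u32 = 0;
    for (i, &byte) in bytes.iter().enumerate() {
        // wrapping_shl shifts by (shift % 64) for a u64.
        value |= ((byte & 0x7F) as u64).wrapping_shl(shift);
        if byte & 0x80 == 0 {
            return Ok((value, i + 1));
        }
        shift += 7;
    }
    Err("unexpected end of input".to_string())
}
```

On well-formed input such as `[0xAC, 0x02]` both variants decode 300. On an over-long encoding (ten `0x80` bytes followed by `0x01`) the checked variant returns an error, while the wrapping variant returns an arbitrary value without panicking; whether that value is later rejected depends on downstream validation.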
impl TInputProtocol for TCompactSliceInputProtocol<'_> {
 fn read_message_begin(&mut self) -> thrift::Result<TMessageIdentifier> {
-unimplemented!()
+thrift_unimplemented!()
This should be unreachable; a panic is the correct thing to do here.
IIUC, it "should be unreachable" unless the input file is malformed? I guess this goes back to the discussion on how to handle invalid inputs:
No, it is genuinely unreachable; we don't use thrift messages.
Thanks @tustvold for the review!
I think the changes are more about converting panics to errors than about changing actual code logic.
These panics were triggered in my own fuzzing tests with invalid parquet files. Nevertheless, I think it's a similar topic of "how to handle invalid inputs" as discussed in #5323. Reading this doc, IMHO errors are better than panics unless it's really something unrecoverable.
Which issue does this PR close?
This solves some of #6737.
Rationale for this change
Some code changes to replace some panics with proper errors
What changes are included in this PR?
Some code paths that lead to panics are converted to return error results.
Are there any user-facing changes?
Behavior change from panics to errors.