Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert some panics that happen on invalid parquet files to error results #6738

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

jp0317
Copy link
Contributor

@jp0317 jp0317 commented Nov 16, 2024

Which issue does this PR close?

This solves some of #6737.

Rationale for this change

Some code changes to replace some panics with proper errors

What changes are included in this PR?

Some codes that lead to panic are converted to returning error results.

Are there any user-facing changes?

Behavior change from panics to errors.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Nov 16, 2024
Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about this, this seems to add a number of untested additional checks, some to very hot codepaths.

I suggest rather than just looking for things that might panic, instead going from a failing test to a fix. This would also better capture the more problematic cases where the reader gets stuck on malformed input, a panic is a good outcome IMO...

@@ -959,17 +959,18 @@ impl ColumnChunkMetaData {
}

/// Returns the offset and length in bytes of the column chunk within the file
pub fn byte_range(&self) -> (u64, u64) {
pub fn byte_range(&self) -> Result<(u64, u64)> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a breaking API change

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah..is it fine given that it just wraps the return value within Result? The behavior change is just "panic --> error".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No this is a breaking change

@@ -1738,6 +1738,12 @@ impl crate::thrift::TSerializable for IntType {
bit_width: f_1.expect("auto-generated code should have checked for presence of required fields"),
is_signed: f_2.expect("auto-generated code should have checked for presence of required fields"),
};
if ret.bit_width != 8 && ret.bit_width != 16 && ret.bit_width != 32 && ret.bit_width != 64 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an auto generated file, in place edits need to be scripted

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

didn't notice that, thanks! I moved the checking to schema type.rs

@@ -1227,6 +1231,10 @@ fn from_thrift_helper(elements: &[SchemaElement], index: usize) -> Result<(usize
if !is_root_node {
builder = builder.with_repetition(rep);
}
} else if !is_root_node {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is based on the comment at line 1230 which says All other types must have one, and the assert at line 1066: assert!(tp.get_basic_info.()has_repetitio())

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern with this check is unless it is necessary for correctness, there is potential adding it breaks something for someone. Parquet is a very broad ecosystem, lots of writers have interesting interpretations of the specification

@@ -67,7 +67,17 @@ impl<'a> TCompactSliceInputProtocol<'a> {
let mut shift = 0;
loop {
let byte = self.read_byte()?;
in_progress |= ((byte & 0x7F) as u64) << shift;
let val = (byte & 0x7F) as u64;
let val = val.checked_shl(shift).map_or_else(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very performance critical code path, this probably should use wrapping_shl

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm afraid wrapping might cause correctness issues...besides, would the checked_shl really make a noticeable performance difference here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't cause correctness issues, and yes it will matter. There are benchmarks that will likely show this

impl TInputProtocol for TCompactSliceInputProtocol<'_> {
fn read_message_begin(&mut self) -> thrift::Result<TMessageIdentifier> {
unimplemented!()
thrift_unimplemented!()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be unreachable, a panic is the correct thing to do here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iiuc it "should be unreachable" unless the input file is malformed? I guess this goes back to the discussion on how to handle invalid inputs:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No it is actually genuinely unreachable, we don't use thrift messages

@jp0317
Copy link
Contributor Author

jp0317 commented Nov 19, 2024

Thanks @tustvold for the review!

this seems to add a number of untested additional checks...

I think the changes are more about converting panics to errors, rather than actual code logic.

looking for things that might panic, instead going from a failing test to a fix

these panics were triggered in my own fuzzing test with invalid parquet files. Nevertheless, i think it's a similar topic of "how to handle invalid inputs" as discussed in #5323. Reading this doc, imho errors better than panics unless it's really something unrecoverable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants