Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Err on try_from_le_slice #6295

Merged
merged 8 commits into from
Aug 26, 2024
Merged

Conversation

samuelcolvin
Copy link
Contributor

Which issue does this PR close?

fix #3577

Rationale for this change

Our parquet reader is panicing when hitting metadata that is either invalid or unexpected

What changes are included in this PR?

Remove interval from_le_slice method, use T::try_from_le_slice instead.

Are there any user-facing changes?

NA?

@github-actions github-actions bot added the parquet Changes to the parquet crate label Aug 23, 2024
Comment on lines +101 to +103
u32::try_from_le_slice(&bs[0..4]).unwrap(),
u32::try_from_le_slice(&bs[4..8]).unwrap(),
u32::try_from_le_slice(&bs[8..12]).unwrap(),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the one place (outside tests) where we're still unwraping T::try_from_le_slice, I think this is okay since the trait is marked as unsafe?

@@ -438,7 +439,7 @@ impl BitReader {
}

// TODO: better to avoid copying here
Some(from_le_slice(v.as_bytes()))
T::try_from_le_slice(v.as_bytes()).ok()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if returning None here if try_from_le_slice is correct? It appears to roughly match the meaning in the docstring:

/// Returns `None` if there's not enough data available. `Some` otherwise.

(Some(from_le_slice::<T>(min)), Some(from_le_slice::<T>(max)))
(
Some(T::try_from_le_slice(min)?),
Some(T::try_from_le_slice(max)?),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is where we were getting the panic from:

called `Result::unwrap()` on an `Err` value: General("error converting value, expected 4 bytes got 0")
...
             at /Users/samuel/.cargo/git/checkouts/arrow-rs-3b86e19e889d5acc/2795b94/parquet/src/util/bit_util.rs:29:5
   5: parquet::file::page_index::index::NativeIndex<T>::try_new::{{closure}}
             at /Users/samuel/.cargo/git/checkouts/arrow-rs-3b86e19e889d5acc/2795b94/parquet/src/file/page_index/index.rs:210:31

@samuelcolvin
Copy link
Contributor Author

samuelcolvin commented Aug 23, 2024

if we could get this merged, we can point cargo to use head of arrow-rs.

Currently we're getting panics when reading parquet files which (we think) have invalid metadata, meaning we can't even fall back to alternative behaviour.

@samuelcolvin samuelcolvin marked this pull request as ready for review August 23, 2024 19:47
@samuelcolvin
Copy link
Contributor Author

cc @tustvold who created the original issue.

@adriangb
Copy link
Contributor

Right the point is that even if we have invalid data it's hard to even debug because by we have no ability to do error handling due the panics.

@@ -192,7 +191,7 @@ impl<T: ParquetValueType> NativeIndex<T> {
let indexes = index
.min_values
.iter()
.zip(index.max_values.into_iter())
.zip(index.max_values.iter())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change so max is borrowed, to match min

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @samuelcolvin

Is there any way to add a test for this? Ideally a file with invalid statistics but if not at least some sort of unit test?

Also I think we should consider ignoring the error rather than erroring out on statistics

@@ -186,14 +186,18 @@ pub fn from_thrift(
// INT96 statistics may not be correct, because comparison is signed
// byte-wise, not actual timestamps. It is recommended to ignore
// min/max statistics for INT96 columns.
let min = min.map(|data| {
let min = if let Some(data) = min {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we make invalid statistics return an error I think it will mean the reader will not be able to read any data from the file

What do you think about ignoring the error. .ok() style instead here -- that way the statistics would come in as None but the file could still be read

@adriangb
Copy link
Contributor

I have a repro locally but it's a 25MB file that you have to write out then load back again. So not exactly a MRE. I'm happy to share that file if that's helpful. I tried narrowing down to a specific column or value without much luck.

It happens specifically when using MetadataLoader to load metadata dumped by ParquetMetadataWriter.

Our goal by converting this panic to an error is to buy time in figuring out the root cause by falling back to loading the metadata from the file itself instead of the cached version that was written via ParquetMetadataWriter.

@samuelcolvin
Copy link
Contributor Author

@alamb I've added a test, to get it to fail I've included the raw metadata from our problematic file (35KB).

Let me know if you want to keep this or remove or, and if you want me to send you a link to the full 25MB parquet file via email.

If we make invalid statistics return an error I think it will mean the reader will not be able to read any data from the file

I don't think we should change this, if we run into errors like this we can always call again with

loader.load_page_index(false, false).await?

as shown in the test.

@samuelcolvin
Copy link
Contributor Author

Okay, I've now added a a pretty good MRE test. @alamb we're really appreciate it if this could be merged.

I'm happy to remove bad_metadata_err and the raw data file if you like?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me -- thank you @samuelcolvin

cc @etseidl as I think you have worked on this code recently (the index stuff specifically)

I am going to merge this in to help @samuelcolvin -- if anyone has concerns or other comments please let us know and we can address them as follow on PRs

@alamb alamb merged commit ee2f75a into apache:master Aug 26, 2024
16 checks passed
@samuelcolvin samuelcolvin deleted the try_from_le_slice branch August 26, 2024 21:41
@samuelcolvin
Copy link
Contributor Author

Thanks so much @alamb, I'll create an issue with a follow up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Don't Panic on Invalid Parquet Statistics
3 participants