-
Notifications
You must be signed in to change notification settings - Fork 796
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Err on try_from_le_slice
#6295
Err on try_from_le_slice
#6295
Conversation
u32::try_from_le_slice(&bs[0..4]).unwrap(), | ||
u32::try_from_le_slice(&bs[4..8]).unwrap(), | ||
u32::try_from_le_slice(&bs[8..12]).unwrap(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the one place (outside tests) where we're still unwrap
ing T::try_from_le_slice
, I think this is okay since the trait is marked as unsafe
?
@@ -438,7 +439,7 @@ impl BitReader { | |||
} | |||
|
|||
// TODO: better to avoid copying here | |||
Some(from_le_slice(v.as_bytes())) | |||
T::try_from_le_slice(v.as_bytes()).ok() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if returning None
here if try_from_le_slice
is correct? It appears to roughly match the meaning in the docstring:
/// Returns `None` if there's not enough data available. `Some` otherwise.
(Some(from_le_slice::<T>(min)), Some(from_le_slice::<T>(max))) | ||
( | ||
Some(T::try_from_le_slice(min)?), | ||
Some(T::try_from_le_slice(max)?), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is where we were getting the panic from:
called `Result::unwrap()` on an `Err` value: General("error converting value, expected 4 bytes got 0")
...
at /Users/samuel/.cargo/git/checkouts/arrow-rs-3b86e19e889d5acc/2795b94/parquet/src/util/bit_util.rs:29:5
5: parquet::file::page_index::index::NativeIndex<T>::try_new::{{closure}}
at /Users/samuel/.cargo/git/checkouts/arrow-rs-3b86e19e889d5acc/2795b94/parquet/src/file/page_index/index.rs:210:31
if we could get this merged, we can point cargo to use head of Currently we're getting panics when reading parquet files which (we think) have invalid metadata, meaning we can't even fall back to alternative behaviour. |
cc @tustvold who created the original issue. |
Right the point is that even if we have invalid data it's hard to even debug because by we have no ability to do error handling due the panics. |
@@ -192,7 +191,7 @@ impl<T: ParquetValueType> NativeIndex<T> { | |||
let indexes = index | |||
.min_values | |||
.iter() | |||
.zip(index.max_values.into_iter()) | |||
.zip(index.max_values.iter()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change so max
is borrowed, to match min
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @samuelcolvin
Is there any way to add a test for this? Ideally a file with invalid statistics but if not at least some sort of unit test?
Also I think we should consider ignoring the error rather than erroring out on statistics
@@ -186,14 +186,18 @@ pub fn from_thrift( | |||
// INT96 statistics may not be correct, because comparison is signed | |||
// byte-wise, not actual timestamps. It is recommended to ignore | |||
// min/max statistics for INT96 columns. | |||
let min = min.map(|data| { | |||
let min = if let Some(data) = min { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we make invalid statistics return an error I think it will mean the reader will not be able to read any data from the file
What do you think about ignoring the error. .ok()
style instead here -- that way the statistics would come in as None
but the file could still be read
I have a repro locally but it's a 25MB file that you have to write out then load back again. So not exactly a MRE. I'm happy to share that file if that's helpful. I tried narrowing down to a specific column or value without much luck. It happens specifically when using MetadataLoader to load metadata dumped by ParquetMetadataWriter. Our goal by converting this panic to an error is to buy time in figuring out the root cause by falling back to loading the metadata from the file itself instead of the cached version that was written via ParquetMetadataWriter. |
@alamb I've added a test, to get it to fail I've included the raw metadata from our problematic file (35KB). Let me know if you want to keep this or remove or, and if you want me to send you a link to the full 25MB parquet file via email.
I don't think we should change this, if we run into errors like this we can always call again with loader.load_page_index(false, false).await? as shown in the test. |
Okay, I've now added a a pretty good MRE test. @alamb we're really appreciate it if this could be merged. I'm happy to remove |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me -- thank you @samuelcolvin
cc @etseidl as I think you have worked on this code recently (the index stuff specifically)
I am going to merge this in to help @samuelcolvin -- if anyone has concerns or other comments please let us know and we can address them as follow on PRs
Thanks so much @alamb, I'll create an issue with a follow up. |
Which issue does this PR close?
fix #3577
Rationale for this change
Our parquet reader is panicing when hitting metadata that is either invalid or unexpected
What changes are included in this PR?
Remove interval
from_le_slice
method, useT::try_from_le_slice
instead.Are there any user-facing changes?
NA?