Add ParquetMetaDataBuilder
#6466
Changes from all commits: `a2818bf`, `572ce5c`, `5e324f6`, `b6c63a8`, `997c3f9`, `c0432e6`, `21e8dc2`
```diff
@@ -191,11 +191,13 @@ impl<R: 'static + ChunkReader> SerializedFileReader<R> {
     /// Creates file reader from a Parquet file with read options.
     /// Returns error if Parquet file does not exist or is corrupt.
     pub fn new_with_options(chunk_reader: R, options: ReadOptions) -> Result<Self> {
-        let metadata = ParquetMetaDataReader::new().parse_and_finish(&chunk_reader)?;
+        let mut metadata_builder = ParquetMetaDataReader::new()
+            .parse_and_finish(&chunk_reader)?
+            .into_builder();
         let mut predicates = options.predicates;
-        let row_groups = metadata.row_groups().to_vec();
-        let mut filtered_row_groups = Vec::<RowGroupMetaData>::new();
-        for (i, rg_meta) in row_groups.into_iter().enumerate() {
+        // Filter row groups based on the predicates
+        for (i, rg_meta) in metadata_builder.take_row_groups().into_iter().enumerate() {
             let mut keep = true;
             for predicate in &mut predicates {
                 if !predicate(&rg_meta, i) {
```

> I think the cleanup of this code (which is modifying the `ParquetMetaData`) is the best example of why having this API makes sense -- it makes one fewer copy, and I think it is also quite a bit clearer.
```diff
@@ -204,41 +206,30 @@ impl<R: 'static + ChunkReader> SerializedFileReader<R> {
                 }
             }
             if keep {
-                filtered_row_groups.push(rg_meta);
+                metadata_builder = metadata_builder.add_row_group(rg_meta);
             }
         }

         if options.enable_page_index {
             let mut columns_indexes = vec![];
             let mut offset_indexes = vec![];

-            for rg in &mut filtered_row_groups {
+            for rg in metadata_builder.row_groups().iter() {
```
> I think we can build the metadata here (with the filtered row groups), pass it into `ParquetMetaDataReader`, and let it read the page indexes:
>
> ```rust
> if options.enable_page_index {
>     let mut reader = ParquetMetaDataReader::new_with_metadata(metadata_builder.build())
>         .with_page_indexes(options.enable_page_index);
>     reader.read_page_indexes(&chunk_reader)?;
>     metadata_builder = ParquetMetaDataBuilder::new_from_metadata(reader.finish()?);
> }
> ```
>
> I forgot to do this in #6450.

> I don't quite follow why this is needed. What scenario does it help? (I can write a test to cover it.)

> I mean replace
>
> ```rust
> if options.enable_page_index {
>     let mut columns_indexes = vec![];
>     let mut offset_indexes = vec![];
>     for rg in metadata_builder.row_groups().iter() {
>         let column_index = index_reader::read_columns_indexes(&chunk_reader, rg.columns())?;
>         let offset_index = index_reader::read_offset_indexes(&chunk_reader, rg.columns())?;
>         columns_indexes.push(column_index);
>         offset_indexes.push(offset_index);
>     }
>     metadata_builder = metadata_builder
>         .set_column_index(Some(columns_indexes))
>         .set_offset_index(Some(offset_indexes));
> }
> ```
>
> with the above code snippet from my earlier comment. This should be a bit more efficient since …

> I implemented something slightly different: since there is already a …
>
> One thing that might be different is that it looks like the current code may only read the column index/page index for row groups that passed the "predicates", but the `ParquetMetaDataReader` reads the index for all the row groups. That being said, no tests fail, so I am not sure if it is a real problem or not.

> Hmm, this worries me a bit, since the column and offset indexes will have more row groups represented than are in the … We could prune the page indexes at the same time we're pruning the row groups.

> Yeah, that probably mirrors the intent most closely. How about I back out c0432e6 and we can address improving this code as a follow-on PR (along with tests)?
```diff
                 let column_index = index_reader::read_columns_indexes(&chunk_reader, rg.columns())?;
                 let offset_index = index_reader::read_offset_indexes(&chunk_reader, rg.columns())?;
                 columns_indexes.push(column_index);
                 offset_indexes.push(offset_index);
             }

-            Ok(Self {
-                chunk_reader: Arc::new(chunk_reader),
-                metadata: Arc::new(ParquetMetaData::new_with_page_index(
-                    metadata.file_metadata().clone(),
-                    filtered_row_groups,
-                    Some(columns_indexes),
-                    Some(offset_indexes),
-                )),
-                props: Arc::new(options.props),
-            })
-        } else {
-            Ok(Self {
-                chunk_reader: Arc::new(chunk_reader),
-                metadata: Arc::new(ParquetMetaData::new(
-                    metadata.file_metadata().clone(),
-                    filtered_row_groups,
-                )),
-                props: Arc::new(options.props),
-            })
+            metadata_builder = metadata_builder
+                .set_column_index(Some(columns_indexes))
+                .set_offset_index(Some(offset_indexes));
         }

+        Ok(Self {
+            chunk_reader: Arc::new(chunk_reader),
+            metadata: Arc::new(metadata_builder.build()),
+            props: Arc::new(options.props),
+        })
     }
 }
```
> These methods follow the existing convention of `add` and `set` as the other builders in this crate, such as …