-
Notifications
You must be signed in to change notification settings - Fork 796
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deprecate MetadataLoader
#6474
Deprecate MetadataLoader
#6474
Conversation
cc @alamb. I did not like that the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @etseidl -- I love the deprecation, but am not sure about the change to use &mut F
references. I think the owned API makes more sense
} | ||
|
||
// these tests are all replicated in parquet::file::metadata::reader |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -212,16 +212,18 @@ impl ArrowReaderMetadata { | |||
input: &mut T, | |||
options: ArrowReaderOptions, | |||
) -> Result<Self> { | |||
// TODO: this is all rather awkward. It would be nice if AsyncFileReader::get_metadata |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not only is this akward, it is also a common source of confusion / bugs (namely that when someone supplies the ParquetMetaData to the arrow reader options to avoid a second object store request, if often turns out the second fetch happens anyways to read the page index (thus obviating the attempt at optimization)
To avoid this they need to ensure when they read the metadata in the first place, they also read the page index.
This is (in a roundabout way) what is happening to @progval in apache/datafusion#12593
I will try and file a ticket explaining the issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed #6476
parquet/src/file/metadata/reader.rs
Outdated
@@ -299,11 +299,14 @@ impl ParquetMetaDataReader { | |||
/// See [`Self::with_prefetch_hint`] for a discussion of how to reduce the number of fetches | |||
/// performed by this function. | |||
#[cfg(feature = "async")] | |||
pub async fn load_and_finish<F: MetadataFetch>( | |||
pub async fn load_and_finish<F>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since this API is new and has not yet been released, this is not a breaking API change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Specifically ParquetMetaDataReader
has not been released yet)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which is why I was rushing to get this into 53.1.0. 😄 But if we revert this change then this PR can merge after 53.1.0 is released.
let mut loader = MetadataLoader::load(self, file_size, prefetch).await?; | ||
loader | ||
.load_page_index(preload_column_index, preload_offset_index) | ||
let metadata = ParquetMetaDataReader::new() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
parquet/src/file/metadata/reader.rs
Outdated
where | ||
for<'a> &'a mut F: MetadataFetch, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @alamb. I did not like that the load... functions in ParquetMetaDataReader took ownership of fetch.
What don't you like about it? Is the idea that MetadataFetch
actually has mutable state so that calling fetch
would change its state?
I think in general trying to get async functions to be happy with references is often much harder than with owned objects.
So I guess I think we should keep the API as owned to make working with these APIs simpler, unless there is some compelling reason for the change (better performance, etc)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What don't you like about it? Is the idea that
MetadataFetch
actually has mutable state so that callingfetch
would change its state?
Some of it is my lack of Rust experience. I'm having a hard time tracking the ownership here, and am confusing traits with objects. Implementing the MetadataFetch
trait for a mutable reference to something that implements the AsyncFileReader
trait leaves my head spinning a little.
I think in general trying to get async functions to be happy with references is often much harder than with owned objects.
Indeed, it took me days to get this to compile 😅.
So I guess I think we should keep the API as owned to make working with these APIs simpler, unless there is some compelling reason for the change (better performance, etc)
I'll revert now that I understand the mechanics a little better. But one consequence is some things will get a little more verbose. For instance, when not taking ownership, get_metadata
for ParquetObjectReader
is
fn get_metadata(&mut self) -> BoxFuture<'_, Result<Arc<ParquetMetaData>>> {
Box::pin(async move {
let metadata = ParquetMetaDataReader::new()
.with_column_indexes(self.preload_column_index)
.with_offset_indexes(self.preload_offset_index)
.with_prefetch_hint(self.metadata_size_hint)
.load_and_finish(self, self.meta.size)
.await?;
Ok(Arc::new(metadata))
})
}
But if load_and_finish
takes ownership of the mutable reference to self
, we can't then get at self.meta.size
in the same call, but need to save it earlier, IIUC.
fn get_metadata(&mut self) -> BoxFuture<'_, Result<Arc<ParquetMetaData>>> {
Box::pin(async move {
let file_size = self.meta.size;
let metadata = ParquetMetaDataReader::new()
.with_column_indexes(self.preload_column_index)
.with_offset_indexes(self.preload_offset_index)
.with_prefetch_hint(self.metadata_size_hint)
.load_and_finish(self, file_size)
.await?;
Ok(Arc::new(metadata))
})
}
A small price to pay, but one I was hoping to avoid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A small price to pay, but one I was hoping to avoid.
I agree it would be nicer.
While I don't fully understand all the implications here, this is in my mind related to how how the rust compiler generates async continuations and how it interacts with the borrow checker. Sometimes it has a hard time expressing that a borrow is ok for some reason
} | ||
|
||
/// Asynchronously fetch the page index structures when a [`ParquetMetaData`] has already | ||
/// been obtained. See [`Self::new_with_metadata()`]. | ||
#[cfg(feature = "async")] | ||
pub async fn load_page_index<F: MetadataFetch>( | ||
pub async fn load_page_index<F: MetadataFetch>(&mut self, fetch: F) -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change will be breaking if in 52.2.0. I'll revert this if need be. I found having remainder
in the public API confusing.
Thanks for the quick review @alamb |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @etseidl -- this looks great to me. I think the tests can be simplified, but I will make a follow on PR with a proposal.
@@ -836,7 +841,7 @@ mod async_tests { | |||
|
|||
struct MetadataFetchFn<F>(F); | |||
|
|||
impl<F, Fut> MetadataFetch for MetadataFetchFn<F> | |||
impl<'a, F, Fut> MetadataFetch for &'a mut MetadataFetchFn<F> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think now that the API has been changed back, these test changes are also not needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change allows me to wrap the fetch
function once, rather than for each invocation of load
, but that could be done inline or with a From
I suppose. Happy to revert if you'd like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in #6484
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought reverting them was nice to demonstrate the same API can still be used
I did notice that the metadata loader tests actually have a copy/paste MetadataFetchFn
adapter. Maybe if the arrow crate needs it in two places, we should make it easier for actual users to write them 🤔 Not sure
Which issue does this PR close?
Part of #6447
Rationale for this change
What changes are included in this PR?
Deprecate public methods in
MetadataLoader
, as well asmetadata::fetch_parquet_metadata()
.Also changes the load API in
ParquetMetaDataReader
to use references to theMetadataFetch
rather than taking ownership.Are there any user-facing changes?
Yes, but only to the as yet unreleased
ParquetMetaDataReader
.