Simplified TableProvider::Insert API #6339
This sounds like a great approach, thank you for bringing it up. I'm trying to implement the insert operation for Apache Iceberg tables, and there is one point I would like to make: Apache Iceberg supports transactions in which data files (Parquet files) are added to the table. For that, the sink needs a chance to finalize the transaction after all data has been written. Something like:

```rust
/// The DataSink implements writing streams of [`RecordBatch`]es to
/// partitioned destinations
pub trait DataSink: std::fmt::Debug + std::fmt::Display + Send + Sync {
    /// How does this sink want its input distributed?
    fn required_input_distribution(&self) -> Distribution;

    /// Return a future which writes a RecordBatchStream to a particular
    /// partition and returns the number of rows written
    fn write_stream(&self, partition: usize, input: SendableRecordBatchStream) -> BoxFuture<Result<u64>>;

    /// Indicate that all record batches have been written, and allow the sink
    /// to update the metadata of the table. This can be used to end a
    /// transaction on the table.
    fn flush(&mut self) -> BoxFuture<Result<()>>;
}
```
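The proposed lifecycle (one `write_stream` call per partition, then a single `flush` that commits) can be modeled without any DataFusion dependencies. The sketch below is only illustrative: it swaps the async types (`SendableRecordBatchStream`, `BoxFuture`) for plain `Vec<Vec<i64>>` batches and synchronous methods, and `MockSink` is a hypothetical sink, not part of any real crate.

```rust
// A dependency-free model of the proposed DataSink lifecycle: one
// write_stream call per partition, then a single flush that commits.
// Real DataFusion types (SendableRecordBatchStream, BoxFuture, Distribution)
// are replaced here with Vec<Vec<i64>> batches and synchronous methods.

trait DataSink {
    /// Write one partition's batches; returns the number of rows written.
    fn write_stream(&mut self, partition: usize, batches: Vec<Vec<i64>>) -> u64;
    /// Called once after every partition is written; the transaction-commit
    /// point where an Iceberg-style table would update its metadata.
    fn flush(&mut self) -> Result<(), String>;
}

/// Hypothetical sink that stages writes and only "commits" on flush.
struct MockSink {
    staged: Vec<(usize, Vec<Vec<i64>>)>,
    committed_rows: u64,
}

impl DataSink for MockSink {
    fn write_stream(&mut self, partition: usize, batches: Vec<Vec<i64>>) -> u64 {
        let rows: u64 = batches.iter().map(|b| b.len() as u64).sum();
        self.staged.push((partition, batches)); // data written, not yet visible
        rows
    }

    fn flush(&mut self) -> Result<(), String> {
        // One metadata update covering every staged file: all partitions
        // become visible atomically, or none do.
        self.committed_rows = self
            .staged
            .iter()
            .flat_map(|(_, bs)| bs.iter())
            .map(|b| b.len() as u64)
            .sum();
        self.staged.clear();
        Ok(())
    }
}
```

The point of the separation is that nothing is visible to readers until `flush` runs, which is what makes the multi-partition insert atomic.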
The return type is

As the returned type is

I personally would return
Thanks for your explanation. There is just one small detail that would be great for Iceberg/Delta Lake. Apache Iceberg and Delta Lake use MVCC to guarantee atomic transactions on tables: they optimistically write the data of a transaction to some kind of storage, and once the data is written, the metadata of the table is updated if no other process has updated the metadata in the meantime. The current proposal for the

Forget my comment on the asynchronous method, I somehow missed the BoxFuture.
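The optimistic commit described above can be sketched as a compare-and-swap on a version counter. This is a minimal illustration of the idea, not Iceberg's or Delta Lake's actual commit protocol; `TableMeta` and `try_commit` are hypothetical names.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical table metadata guarded by a version counter.
struct TableMeta {
    version: AtomicU64,
}

impl TableMeta {
    /// Optimistic commit: succeeds only if the version is still the one the
    /// transaction was based on (compare-and-swap). On failure it returns
    /// the current version so the writer can rebase and retry.
    fn try_commit(&self, based_on: u64) -> Result<u64, u64> {
        self.version
            .compare_exchange(based_on, based_on + 1, Ordering::SeqCst, Ordering::SeqCst)
            .map(|_| based_on + 1)
    }
}
```

Two writers that both start from version 0 would see the first `try_commit(0)` succeed and the second fail with the new current version, forcing a retry: the data files already written stay on storage, and only the metadata swap is contended.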
@JanKaul I think we could add a
(BTW I plan to use async_trait, so the fact the trait returns
I think you could do something where the last write_stream to complete flushes, but it is a bit odd for sure.

This does lead to one question though: what actually is the meaning of partitioning in this context? The formulation of

What does exposing the partitioning yield over something simple like
I think it would be a common pattern (where you either want to commit all the partitions or none, rather than have some of them possibly complete and some fail). 🤔
It allows the sink to request that data be split according to one of the following options: https://docs.rs/datafusion/latest/datafusion/physical_plan/enum.Partitioning.html and lets the DataFusion optimizer do its thing to optimize the calculation.
Can you please be more specific about what type of partitioning would be required? Are you thinking of "partition by value in a column (like the date)"?
It makes it easier to implement
I'd wager almost 100% of workloads would want atomicity at the
Partitioning or bucketing by value would be the most common use-case, which is distinct from the sort of partitioning currently implemented by DataFusion.
Unless I'm missing something, it's the difference between calling ExecutionPlan::execute and being given the result? Is that really a meaningful complexity? I guess I'm just trying to play devil's advocate for Keep-It-Simple 😄
Iceberg allows partitioning by hash, by truncating a value, or by date. I was thinking of expressing the partitioning as a DISTRIBUTE BY (https://docs.rs/datafusion/24.0.0/datafusion/prelude/enum.Partitioning.html#variant.DistributeBy) and then using RepartitionExec on the input to apply it to the query. But that is just an initial idea; I didn't get to it yet.
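In miniature, what `Partitioning::Hash` / `RepartitionExec` accomplish is routing rows to output partitions by a hash of a key column, so all rows sharing a key land in the same partition. The sketch below is a hand-rolled, single-threaded illustration of that idea; `distribute_by_hash` is a made-up helper, and `DefaultHasher` is not the hash DataFusion actually uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Route each (key, value) row to one of `partitions` output buckets by
/// hashing the key, so all rows sharing a key land in the same partition.
fn distribute_by_hash(rows: &[(&str, i64)], partitions: usize) -> Vec<Vec<(String, i64)>> {
    let mut out = vec![Vec::new(); partitions];
    for (key, val) in rows {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        let bucket = (hasher.finish() as usize) % partitions;
        out[bucket].push((key.to_string(), *val));
    }
    out
}
```

A sink that buckets by column value cares about exactly this guarantee: it can write each output partition as its own file or directory without ever seeing a key split across two of them.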
Yes, that is the construction you would definitely want. I don't believe a corresponding notion exists in the physical layer, which has pretty hard-coded assumptions about partition enumerability. To be clear, this is something we probably should fix, but it is likely a very intrusive change to decouple ExecutionPlan from partitioning.
I think @JanKaul 's idea in #6339 (comment) would work well
I am trying to avoid having to write a new

If the API is like this:

```rust
impl TableProvider {
    async fn insert(&self, ctx: Arc<TaskContext>, plan: Arc<dyn ExecutionPlan>) -> Result<()>;
}
```

I think the DataFusion implementation would be simpler, but now each sink would potentially be more complicated, as it would have to deal with running multiple streams concurrently. But maybe that is ok 🤔

I am also trying to keep execution and datasource separate (so I can eventually break them into different crates) -- maybe I can just make another trait to abstract away the execution plan detail.

I am about to get on a plane -- I'll play around with it
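Under that alternative API, the provider itself has to drive every partition of the plan and only then commit. The following is a minimal, dependency-free sketch of that pattern, with a thread per partition standing in for executing each partition's stream; `insert_all` is a hypothetical helper, not a real DataFusion API.

```rust
use std::sync::mpsc;
use std::thread;

/// Run every partition "stream" concurrently (a thread per partition stands
/// in for executing one partition of the plan), gather per-partition row
/// counts, and only then reach the single commit point. Returns total rows.
fn insert_all(partition_data: Vec<Vec<i64>>) -> u64 {
    let (tx, rx) = mpsc::channel();
    for (partition, data) in partition_data.into_iter().enumerate() {
        let tx = tx.clone();
        thread::spawn(move || {
            // "write" this partition's stream and report rows written
            tx.send((partition, data.len() as u64)).unwrap();
        });
    }
    drop(tx); // close the channel so the receiver iterator terminates

    // All partitions finished: this is where a single flush/commit would go.
    rx.iter().map(|(_, rows)| rows).sum()
}
```

This is the extra complexity each sink would take on: managing concurrency itself, in exchange for a single natural place to commit after all partitions complete.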
So we have the "one-exec-for-each-provider" pattern on the read side, but we have a "single-exec-across-all-providers" pattern on the write side? Am I misunderstanding this? If this is indeed the case, what is the motivation and/or justification behind this asymmetry?
For the read side you can often make use of existing Execs. In my use case for Apache Iceberg I can implement the
The asymmetry is a good question @ozankabak. My thinking is:
I think we could achieve most of the above by keeping the same

However, it seemed like most of the flexibility gained by using an ExecutionPlan would only serve to be more confusing
IOx follows this strategy as well
Aren't these arguments mostly applicable to the read side as well? I am worried that we may prematurely commit to a design without enough clarity in all possible implications and usage patterns. I think this kind of design question should be decided after we have some more concrete cases (i.e. more write execs like we have read execs now) and thus have enough data points to analyze various use cases. Only then can we perform a more serious pros/cons analysis. In this case, maybe the decision will be to refactor the read side as well, or maybe we'll see that doing this is not the right thing to do even for the write side.
Yes, basically, and if I could change it now I would change it not to return an
I think this is a good point (and there is some evidence of us not really knowing what the API should be in the discussions above with @tustvold, like #6339 (comment)). I think we can leave the

Let me spend some more time with #6347 refining the idea
Here is a proposed PR that makes it easier to write insert plans, but does not change the
I think this is done for now. Closing |
Is your feature request related to a problem or challenge?
Recent INSERT work #6049 is a good example of a useful DataFusion feature that has an extensibility story (a new function on a trait). However, it takes non-trivial effort to add such support (it requires a new physical operator).
Describe the solution you'd like
Thus I would like to propose the following API to support writing to sources.

DataSink trait

A new trait that exposes just the information needed for writing. Something like:

Change signature of TableProvider

Then if we change the signature of TableProvider from

to something like

I think almost all of the insert plans can share a common ExecutionPlan
Describe alternatives you've considered
do nothing
Additional context
No response