Replies: 2 comments
---
First, I'll say I'd be in favor of splitting things out. Overall these splits seem sensible.

I understand the goal, but I'm afraid this will often be impossible due to breaking changes in Arrow. Additionally, I wonder if there's a split we can make between DataFusion-powered modules (where DataFusion is used internally but not exposed at the interface) and DataFusion plugins (e.g. …
---
Agreed, splitting up into small crates can make life easier and reduce cognitive load when reasoning about the code. I do think we may want to consider a few more things, especially if we consider the larger ecosystem like delta-sharing etc.

When it comes to the cloud crates I do have some doubts, since essentially only AWS has any special needs, due to locking. So not even all S3 APIs require any special dependencies. Factoring out the locking logic and mirroring object_store's features may be a way to go. Over there the specific cloud features - aws, azure, gcs - are essentially just legacy, since they more or less just reference the "cloud" feature.

Moving towards logical plans, I also have some drafts flying around for deltalake-sql (logical planning / sql parsing) and deltalake-execution, which has physical operators for datafusion.

All in all, I feel we may want to start small, where we are sure, and move to individual crates step by step?
---
Lately I have been thinking about how to improve the way the `deltalake` crate is packaged and delivered to users. I believe it is time for us to create sub-crates and convert `deltalake` to a meta-package.

This would be not too dissimilar to how `arrow` and `datafusion` are packaged and delivered today. In the case of the arrow package, one can pull `arrow` directly and get all the bells and whistles. However, if a user only requires the arrow types, they can pull the `arrow-schema` sub-crate, which is much smaller in code and dependency footprint.
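To make the contrast concrete, here is a minimal sketch of schema-only usage that depends on `arrow-schema` alone and never touches compute kernels, IO, or the rest of the `arrow` dependency tree (the field names are made up for illustration):

```rust
// Cargo.toml would depend on `arrow-schema` alone, not the full `arrow` crate.
use arrow_schema::{DataType, Field, Schema};

fn main() {
    // Describe a table's shape using only the type definitions.
    let schema = Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("value", DataType::Utf8, true),
    ]);
    println!("{schema:?}");
}
```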
Benefits:

- `deltalake-core` will be easier to orient around the kernel of functionality needed to implement the Delta protocol

## New Crates
I am proposing the addition of the following sub-crates, which should be considered as dependencies (optional ❔ and non-optional 🔐 as noted) of the `deltalake` crate:

### deltalake-core 🔐

This crate would have much of the existing traits and implementations needed to do things like log processing, and would provide the key APIs that Python and other users depend on, such as the writer and `DeltaOps` implementations.

The key distinction here is that this crate would not contain the cloud-specific or engine-specific functionality, such as the DataFusion integration and dependency. One of the potential benefits of this refactoring is that we might be able to advocate for the inclusion of `deltalake-core` as a `datafusion` dependency, and move the `TableProvider` for `delta` upstream into `datafusion` so that it can have native Delta support.

#### Dependencies
I believe this crate should take only the minimal dependencies necessary to do its job, which would likely be:

- `arrow-*` sub-crates (I think we can avoid some of our current dependency tree here)
- `object_store`
- `tokio`
### deltalake-aws ❔
This crate would contain the DynamoDB locking code and other special-case storage logic related to AWS/S3. Right now this code is kind of a mess (IMHO) in the Rust crate. There's some refactoring that @roeap has tried here, but we've not merged it. Additionally, there are some changes that need to be made in #1601, which would be a good time to break the AWS code out.
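To sketch the direction, the locking logic might be factored out behind a small trait like the one below. This is hypothetical (the `LockClient` name and method signatures are loosely modeled on the existing dynamodb lock code, not a committed API), and it assumes the `async-trait` and `thiserror` crates:

```rust
use std::time::Duration;

/// A lock item held in some backing store (for AWS, a DynamoDB table).
#[derive(Debug, Clone)]
pub struct Lock {
    /// Opaque record version number identifying the current lock holder.
    pub rvn: String,
    /// Optional payload stored alongside the lock item.
    pub data: Option<String>,
}

#[derive(Debug, thiserror::Error)]
pub enum LockError {
    #[error("timed out waiting for the lock")]
    Timeout,
    #[error("backend error: {0}")]
    Backend(String),
}

/// Hypothetical trait that `deltalake-aws` would implement on DynamoDB,
/// keeping the locking concern out of `deltalake-core`.
#[async_trait::async_trait]
pub trait LockClient: Send + Sync {
    /// Try to acquire the lock, waiting up to `timeout` for it to free up.
    async fn acquire_lock(&self, timeout: Duration) -> Result<Lock, LockError>;

    /// Release a previously acquired lock; returns false if it had already
    /// expired or been taken over by another writer.
    async fn release_lock(&self, lock: &Lock) -> Result<bool, LockError>;
}
```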
### deltalake-azure ❔
Similar motivation to the above with AWS. There's some specific code for Azure and OneLake we have floating around which could/should be moved over. Hopefully having a small and narrow Azure-specific section of the tree would make it easier for contributors to help improve our support.
### deltalake-gcp ❔
☝️ Same motivation as the above.
### deltalake-datafusion ❔

The `deltalake-datafusion` crate would start out with the DataFusion `TableProvider` and some other DataFusion-powered operations, but I would hope to get the `TableProvider` upstream. At that point this crate would become a place to host the DataFusion-powered extensions to core functionality we want to support, as well as other optimizations to make Delta nice in DataFusion land.

❗ This would probably be the hardest dependency to get right, since there's a tight coupling between major releases of arrow, datafusion, and in turn the existing `deltalake` crate. By separating this crate out, I am hopeful that `deltalake-core` would be able to adopt newer Arrow versions at a more rapid pace, rather than being strung along downstream from the `arrow` -> `datafusion` -> `deltalake` release cycles.
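For reference, this is roughly what the integration enables today behind the `datafusion` feature flag: a `DeltaTable` acts as a DataFusion `TableProvider` and can be queried with SQL. The snippet is a sketch (the path and table name are made up, and exact APIs vary by version):

```rust
use std::sync::Arc;
use datafusion::prelude::SessionContext;
use deltalake::open_table;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let ctx = SessionContext::new();

    // DeltaTable implements DataFusion's TableProvider trait, so it can be
    // registered directly with the session.
    let table = open_table("./data/my_delta_table").await?;
    ctx.register_table("my_table", Arc::new(table))?;

    let df = ctx.sql("SELECT COUNT(*) FROM my_table").await?;
    df.show().await?;
    Ok(())
}
```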
### deltalake-catalog-glue ❔

Assuming the `deltalake-core` package has a trait which defines what a "Catalog" should look like and how it's used, this crate would contain the AWS Glue Data Catalog specific code.

I do not believe this would need to take a dependency on `deltalake-aws`, so long as both crates share the same version range for the `aws_config` crate once they have adopted the AWS SDK for Rust, which also distributes service-specific crates.
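As a strawman, the trait in `deltalake-core` might be as small as the following; the name and signature are illustrative, not a committed design, and it assumes the `async-trait` and `thiserror` crates:

```rust
#[derive(Debug, thiserror::Error)]
pub enum CatalogError {
    #[error("table {0}.{1} not found")]
    TableNotFound(String, String),
    #[error("catalog backend error: {0}")]
    Backend(String),
}

/// Resolves logical table names to the storage locations of Delta tables.
#[async_trait::async_trait]
pub trait Catalog: Send + Sync {
    /// Return the storage location for a table, e.g. "s3://bucket/my_table".
    async fn get_table_storage_location(
        &self,
        database_name: &str,
        table_name: &str,
    ) -> Result<String, CatalogError>;
}
```

A `deltalake-catalog-glue` crate would then only need to implement this trait on top of the Glue SDK.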
### deltalake-catalog-unity ❔

Assuming the `Catalog` trait in `deltalake-core`, this crate would contain the Databricks Unity Catalog specific code.

### deltalake-testing ❔
This package would be more for the development of the delta-rs project itself, but it would provide all the test utilities and interfaces we find so helpful for writing integration tests.
## The Meta Package

The `deltalake` crate would continue to be released as it is today, and would maintain its feature flags, which dictate what versions and configuration of the sub-crates it includes. I am hopeful that the `deltalake` crate simply becomes a "shell": a `Cargo.toml` and a `src/lib.rs` which re-export a number of symbols for users' convenience.
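That `src/lib.rs` could be little more than feature-gated re-exports, something like the sketch below (the feature and module names follow the crates proposed above and are illustrative):

```rust
// The meta-crate contains no logic of its own; it only re-exports sub-crates.
pub use deltalake_core::*;

#[cfg(feature = "s3")]
pub use deltalake_aws as aws;

#[cfg(feature = "azure")]
pub use deltalake_azure as azure;

#[cfg(feature = "gcs")]
pub use deltalake_gcp as gcp;

#[cfg(feature = "datafusion")]
pub use deltalake_datafusion as datafusion_ext;

#[cfg(feature = "glue")]
pub use deltalake_catalog_glue as glue;

#[cfg(feature = "unity")]
pub use deltalake_catalog_unity as unity;
```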
## Semantic Versioning

I think all the crates should of course follow semantic versioning, but they should share a major version. The `deltalake-core` crate should not have public API changes within a major range, so that `deltalake-core` 0.20.0, 0.21.0, 0.22.0, etc. can be used with `deltalake-aws` 0.1.0 and so on.

## Conclusion
I am volunteering to take on this work, and would target `0.20` as a good version to set as a milestone marker for such a release. If others are game to try this out, I can start pulling the scope into a milestone for the work.