Change Data Feed in Delta #2095
Labels
binding/python
Issues for the Python package
binding/rust
Issues for the Rust crate
enhancement
New feature or request
take
ion-elgreco added the binding/python and binding/rust labels on Jan 25, 2024
rtyler added a commit to rtyler/delta-rs that referenced this issue on May 7, 2024
This change introduces a `CDCTracker` which helps collect changes during merges and update. This is admittedly rather inefficient, but my hope is that this provides a place to start iterating and improving upon the writer code. There is still additional work which needs to be done to handle table features properly for other code paths (see the middleware discussion we have had in Slack), but this produces CDC files for Update operations. Fixes delta-io#604. Fixes delta-io#2095.
rtyler added four more commits with the same message to rtyler/delta-rs that referenced this issue, on May 12, May 17, May 21, and May 29, 2024
rtyler
modified the milestones:
Rust v0.18,
Correct timestamp handling,
Change Data Capture Support
May 29, 2024
rtyler added three more commits with the same message to rtyler/delta-rs that referenced this issue, on May 29, Jun 1, and Jun 3, 2024
ion-elgreco pushed a commit with the same message to rtyler/delta-rs that referenced this issue on Jun 4, 2024
Description
Finally making an issue about this for discussion. I already have a draft PR for reading CDF in #2048, which still needs work on its reader side, but I mainly want to discuss the writer side here, as it will take much more refactoring. I also want to ask how we want to approach this. The reader can mostly be encapsulated as its own thing, whereas the writer will touch every write operation in Delta Lake. Do we want to roll these into the same PR, or make subsequent PRs? I think subsequent PRs would be better, but that is just my opinion.
For the reader, the first stages are in flight with #2048; I just need to figure out how I want to validate its correctness. @MrPowers maybe you can help me figure that out?
For writers, there is a bit more to do. CDC actions have to be added to the commit log along with the corresponding add/remove actions, and additional change data files have to be generated in the `_change_data` directory of a Delta table. Currently we encapsulate the builders of these operations in such a way that each builder builds and commits all of its actions itself, without giving features any way to influence the actions of the commit before it is written. So, to make CDF work, we would have to update every operation and add CDF-aware functionality to it. I'd argue this would only exacerbate the current issue of builders owning the entire life cycle of the operation, and we should not do this.
I would instead suggest that we refactor the builders to only create and return a list of actions to commit and a snapshot to commit to, and then let a subsequent (maybe global) part of the code do the actual commit. This way you can compose operations in a more maintainable way. I spoke with @r3stl355 about this as well, because some of the work he did for replaceWhere would have been another good candidate to benefit from this type of rethinking. So for writers I am proposing we take this approach.
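To make the proposal concrete, here is a minimal sketch of the shape I have in mind, written in Python for brevity. Every name in it (`PendingCommit`, `plan_update`, `cdf_hook`, `commit`) is hypothetical and not an existing delta-rs API, the file paths and sizes are placeholders, and nothing is actually written to storage. The operation only plans its add/remove actions, and one shared commit step lets features such as CDF append their own actions.

```python
from dataclasses import dataclass, field
from typing import Callable

# All names below are hypothetical; none of them are delta-rs APIs.
Action = dict  # one commit-log action, e.g. {"add": {...}} or {"cdc": {...}}


@dataclass
class PendingCommit:
    """What an operation returns instead of committing by itself."""
    snapshot_version: int  # table version the operation was planned against
    actions: list[Action] = field(default_factory=list)


# A hook inspects the pending commit and returns extra actions to append.
CommitHook = Callable[[PendingCommit], list[Action]]


def plan_update(snapshot_version: int) -> PendingCommit:
    """Stand-in for an Update builder: it only plans add/remove actions."""
    return PendingCommit(
        snapshot_version,
        [
            {"remove": {"path": "part-0001.parquet", "dataChange": True}},
            {"add": {"path": "part-0002.parquet", "size": 100, "dataChange": True}},
        ],
    )


def cdf_hook(pending: PendingCommit) -> list[Action]:
    """Append a cdc action when the commit rewrites data.

    A real writer would also write the change rows to the referenced file;
    the path and size used here are placeholders.
    """
    if not any("add" in a or "remove" in a for a in pending.actions):
        return []
    path = f"_change_data/cdc-{pending.snapshot_version + 1:05d}.parquet"
    return [{"cdc": {"path": path, "partitionValues": {}, "size": 0, "dataChange": False}}]


def commit(pending: PendingCommit, hooks: list[CommitHook]) -> tuple[int, list[Action]]:
    """Single shared commit step: run the hooks, then return the new version."""
    actions = list(pending.actions)
    for hook in hooks:
        actions.extend(hook(pending))
    return pending.snapshot_version + 1, actions


version, actions = commit(plan_update(snapshot_version=4), hooks=[cdf_hook])
```

With this split, `plan_update` knows nothing about CDF: dropping `cdf_hook` from the list yields the plain add/remove commit, which is the property I care about.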
This will benefit us in that CDF's implementation will have no effect on what the other operations do; it will only augment them. In general, our implementations of these features will then be more resilient to mistakes as we add more features down the line. Row tracking comes to mind as a potential issue, since it has a specific clause for readers regarding CDF files. Checkpoints, on the other hand, must go the opposite way and exclude CDF from their checkpointing. I linked both under the related issues, but hopefully that makes sense.
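As a rough illustration of the checkpoint point, here is a minimal Python sketch, assuming actions are represented as single-key dicts keyed by action type. `checkpoint_actions` is a hypothetical helper, not a delta-rs function, and it only shows the CDF-related filtering, not the rest of checkpoint creation.

```python
def checkpoint_actions(replayed_actions):
    """Drop change-data (cdc) actions before they reach a checkpoint.

    Hypothetical helper: each action is assumed to be a dict with a single
    key naming its type, such as {"add": {...}} or {"cdc": {...}}.
    """
    return [action for action in replayed_actions if "cdc" not in action]


log = [
    {"add": {"path": "part-0002.parquet"}},
    {"cdc": {"path": "_change_data/cdc-00005.parquet"}},
]
kept = checkpoint_actions(log)
```

Keeping this filter in the shared commit/checkpoint path, rather than in each operation, is the same separation argued for above.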
Use Case
https://delta.io/blog/2023-07-14-delta-lake-change-data-feed-cdf/
https://docs.delta.io/latest/delta-change-data-feed.html
https://github.com/delta-io/delta/blob/master/PROTOCOL.md#add-cdc-file
Related Issue(s)
https://github.com/delta-io/delta/blob/master/PROTOCOL.md#reader-requirements-for-row-tracking
https://github.com/delta-io/delta/blob/master/PROTOCOL.md#checkpoints-1