Avoid appending redundant files to the delta table #3

Closed
rtyler opened this issue Sep 8, 2023 · 2 comments

Comments

@rtyler
Contributor

rtyler commented Sep 8, 2023

If oxbow is re-triggered with a file that has already been added to a Delta Table, it will happily add another add action to the transaction log for a redundant file.

While not a problem per se, it is unnecessary and inefficient. `append_to_table` should probably look at `table.get_files` before adding the files passed into it.
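
The check suggested above could look roughly like the following. This is only a sketch in plain Python, not oxbow's actual code: `filter_new_files` is a hypothetical helper, `existing` stands in for whatever the table's file listing returns, and `candidates` stands in for the files passed to the append step.

```python
from typing import Iterable, List


def filter_new_files(existing: Iterable[str], candidates: Iterable[str]) -> List[str]:
    """Return the candidate paths that are not already in the table.

    Hypothetical helper: `existing` is the table's current file listing,
    `candidates` are the files about to be appended. Order is preserved and
    repeats within `candidates` are dropped as well.
    """
    seen = set(existing)
    fresh = []
    for path in candidates:
        if path not in seen:
            seen.add(path)  # also guards against repeats inside one batch
            fresh.append(path)
    return fresh
```

Only the paths returned by a filter like this would then get an add action; if the result is empty, the commit could be skipped entirely.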

@rtyler
Contributor Author

rtyler commented Sep 8, 2023

Now that I'm thinking about this problem more, this might be a critical dependency for allowing oxbow to work with tables that have merges or optimizes, where some files are being removed while others are being added 🤔

@rtyler
Contributor Author

rtyler commented Oct 21, 2023

In Lambda this can cause duplicate data to be added to the table when fed by SQS, since its delivery is not guaranteed to be "exactly once" and the same message can arrive more than once, e.g.:

>>> files = dt.files()
>>> len(files)
4279
>>> len(list(set(files)))
2508
>>>
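
The session above shows 4279 entries but only 2508 distinct paths, i.e. 1771 redundant entries. To see which paths are repeated and how often, something like the following could be used. It is a sketch: `duplicate_counts` and `redundant_entries` are hypothetical helper names, and the only assumption about the input is that it is a list of path strings like the one returned by `dt.files()` above.

```python
from collections import Counter
from typing import Dict, Iterable


def duplicate_counts(files: Iterable[str]) -> Dict[str, int]:
    """Map each path that appears more than once to its number of occurrences."""
    counts = Counter(files)
    return {path: n for path, n in counts.items() if n > 1}


def redundant_entries(files: Iterable[str]) -> int:
    """Count the entries beyond the first occurrence of each path."""
    counts = Counter(files)
    return sum(n - 1 for n in counts.values())
```

Calling `duplicate_counts(dt.files())` in the same session would list the affected paths, and `redundant_entries(dt.files())` should match `len(files) - len(set(files))`.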

@rtyler rtyler closed this as completed in ceb38f8 Oct 21, 2023