How to efficiently process extremely large delta table #2711
-
Hi all, I am trying to build a daily process using delta-rs to continuously copy data from a few Delta tables in S3 to Snowflake. On a day-to-day basis I will be grabbing the new/updated rows (via the change data feed / CDF functionality). Some of the tables will have close to 90 million new/updated rows combined per day. But first I have to get a snapshot of the current Delta tables into Snowflake -- the largest table is likely around a billion records. Any idea on how to best process this extremely large table using Python and delta-rs so I can get all the records from S3 into Snowflake? Thanks!!
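For reference, the daily incremental pull described above maps onto `DeltaTable.load_cdf` in the `deltalake` Python package, assuming the change data feed is enabled on the source tables. A minimal sketch, where the table URI, the tracked version, and the Snowflake loader are all placeholders:

```python
from deltalake import DeltaTable

TABLE_URI = "s3://my-bucket/my-table"   # placeholder path
LAST_LOADED_VERSION = 123               # placeholder: last table version already copied to Snowflake

def load_into_snowflake(batch):
    """Hypothetical loader: e.g. stage the batch as Parquet and COPY INTO Snowflake."""
    ...

dt = DeltaTable(TABLE_URI)

# load_cdf streams the change data feed as Arrow record batches; each row carries
# _change_type / _commit_version / _commit_timestamp metadata columns.
reader = dt.load_cdf(starting_version=LAST_LOADED_VERSION + 1)
for batch in reader:
    load_into_snowflake(batch)
```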
Replies: 1 comment
-
If the load is daily, and if possible, partition the data by day and run compaction a few times throughout the day. Then filter on that partition when you read the Delta table. Anything more efficient would require a lot more code.
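A minimal sketch of that approach with the `deltalake` Python package, assuming the table is partitioned on a `date` column and the snapshot is read one partition at a time; the table URI, partition column, and Snowflake loader are placeholders:

```python
from deltalake import DeltaTable

TABLE_URI = "s3://my-bucket/my-table"   # placeholder path
PARTITION_DATE = "2024-06-01"           # the daily partition to copy

def load_into_snowflake(batch):
    """Hypothetical loader: e.g. stage the batch as Parquet and COPY INTO Snowflake."""
    ...

dt = DeltaTable(TABLE_URI)

# Compact the day's small files first (the "run compact a few times a day" step).
dt.optimize.compact(partition_filters=[("date", "=", PARTITION_DATE)])

# Read only that partition, streaming record batches rather than materializing
# hundreds of millions of rows in memory at once.
dataset = dt.to_pyarrow_dataset(partitions=[("date", "=", PARTITION_DATE)])
for batch in dataset.to_batches():
    load_into_snowflake(batch)
```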