How to efficiently process extremely large delta table #2711
-
Hi all, I am trying to build a daily process using delta-rs to continuously copy data from a few Delta tables in S3 to Snowflake. On a day-to-day basis I will be grabbing the new/updated rows (via the change data feed / CDF functionality). Some of the tables will have close to 90 million new/updated rows combined per day. But first I have to get a snapshot of the current Delta tables into Snowflake -- the largest table is likely around a billion records. Any idea on how to best process this extremely large table using Python and delta-rs so I can get all the records from S3 into Snowflake? Thanks!!
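For reference, the daily incremental pull described above maps onto `DeltaTable.load_cdf` in the `deltalake` Python package, assuming the change data feed is enabled on the source tables. A minimal sketch, where the table URI, the tracked version, and the Snowflake loader are all placeholders:

```python
from deltalake import DeltaTable

TABLE_URI = "s3://my-bucket/my-table"   # placeholder path
LAST_LOADED_VERSION = 123               # placeholder: last table version already copied to Snowflake

def load_into_snowflake(batch):
    """Hypothetical loader: e.g. stage the batch as Parquet and COPY INTO Snowflake."""
    ...

dt = DeltaTable(TABLE_URI)

# load_cdf streams the change data feed as Arrow record batches; each row carries
# _change_type / _commit_version / _commit_timestamp metadata columns.
reader = dt.load_cdf(starting_version=LAST_LOADED_VERSION + 1)
for batch in reader:
    load_into_snowflake(batch)
```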
Replies: 1 comment
-
If the load is daily, and if possible, partition the data by day and run compaction a few times throughout the day. Then filter on that partition when you read the Delta table. Anything more efficient would require a lot more code.
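A minimal sketch of that approach with the `deltalake` Python package, assuming the table is partitioned on a `date` column and the snapshot is read one partition at a time; the table URI, partition column, and Snowflake loader are placeholders:

```python
from deltalake import DeltaTable

TABLE_URI = "s3://my-bucket/my-table"   # placeholder path
PARTITION_DATE = "2024-06-01"           # the daily partition to copy

def load_into_snowflake(batch):
    """Hypothetical loader: e.g. stage the batch as Parquet and COPY INTO Snowflake."""
    ...

dt = DeltaTable(TABLE_URI)

# Compact the day's small files first (the "run compact a few times a day" step).
dt.optimize.compact(partition_filters=[("date", "=", PARTITION_DATE)])

# Read only that partition, streaming record batches rather than materializing
# hundreds of millions of rows in memory at once.
dataset = dt.to_pyarrow_dataset(partitions=[("date", "=", PARTITION_DATE)])
for batch in dataset.to_batches():
    load_into_snowflake(batch)
```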