-
Notifications
You must be signed in to change notification settings - Fork 506
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deduplication macro #335
Comments
Here is a macro I use to dedupe using BigQuery, not sure if its relevant or could help.
|
We've had resource issues using
It is more performant (can handle more rows, not necessarily faster). You may want to consider going down this path. Adding the In addition, rather than passing in a dataset and table, it would be better to pass in a source, so that this can be used as a separate CTE. Of course you could call Also, we use this macro to dedupe based on a configurable column AND order by - so that you can select which row to select. |
Hi @blake-enyart , @switzer and @rwatts3 ! I was trying to apply this method but to many columns (4). but I got some errors. I was wondering if you managed to find a solution for deduping based on many columns. |
There is no limit to the number of columns to deduplicate across - just add a comma list in the string. For example, you can call the macro as follows:
|
Great idea! Using Airbyte in incremental mode without normalization, I could see a great benefit in having such a macro dedicated to Airbyte's raw data. Thanks for sharing your snippet @switzer |
👋 Hey team! I'd welcome a PR that implemented this. It will need to be cross-database compatible by the time we merge it in - I don't know offhand whether all four of the supported adapters (Snowflake, BQ, Redshift, Postgres) have array_agg or not. If they don't, falling back to the row_number based approach seems reasonable - also remember that there are integration tests that can validate that everything works correctly across all warehouses so you don't have to have access to them all individually. Happy to help out with any questions along the way 🚀 |
@joellabes I thought I'd take a crack at this. It turns out that:
For these reasons I've ended up implementing the It was also necessary for the macro to take a relation as an argument because most of the databases do not have a |
Fixed by #512 |
Describe the feature
It would be great to have deduplication macro which utilizes something such as the surrogate_key macro to deduplicate on. The basis for this would be something such as: #29
Describe alternatives you've considered
At this point, the only alternative is writing this in full SQL query and this feels like a pretty common repetitive type of query that is prime for integrating into a macro
Additional context
Is this feature database-specific? Which database(s) is/are relevant? Please include any other relevant context here.
I work mainly in Snowflake, but this could be generalized. Perhaps a first draft in Snowflake and expands outward?
Who will this benefit?
It is very common practice to deduplicate data on ingestion in addition to type casting which will greatly expedite new data source ingestion
Are you interested in contributing this feature?
I would need a hand to get started on this. I can leverage this PR as a reference, #29, but would greatly appreciate any support and perspective provided here.
The text was updated successfully, but these errors were encountered: