Merging with null matching causes extreme performance degradation. #2891

Closed
brendan-cook-87 opened this issue Jul 12, 2024 · 4 comments · Fixed by #2892
Labels
bug Something isn't working

Comments

@brendan-cook-87
Contributor

Describe the bug

When we add the join clause OR (target."id" IS NULL AND source."id" IS NULL) to enable matching of nulls in the id columns, it causes some queries to take orders of magnitude longer to complete.

#2872

How to Reproduce

I have a table with approximately 17 million rows, and the data scanned for an insert operation is roughly 350 MB.

The previous behaviour of running:
MERGE INTO "production_mobile"."tracks" target USING "production_mobile"."temp_table_f30f8023428b4dab816840e62ba40699" source ON (target."id" = source."id")

to insert ~300 new rows would execute in around 5s.

Running it with this clause:
MERGE INTO "production_mobile"."tracks" target USING "production_mobile"."temp_table_f30f8023428b4dab816840e62ba40699" source ON (target."id" = source."id" OR (target."id" IS NULL AND source."id" IS NULL))

is taking 12+ minutes across several attempts.

This table has no nulls in the id column. It is processing event logs and merging on UUIDs.

Expected behavior

This behaviour should not be the default, given the degradation in performance it causes on live production systems.
I am unable to update to the latest version of this layer at this time.

Your project

No response

Screenshots

No response

OS

Mac

Python version

3.11

AWS SDK for pandas version

3.9.0

Additional context

No response

@aldder
Contributor

aldder commented Jul 12, 2024

Hello Brendan, sorry to hear that my PR introduced this huge delay in timings.

As an alternative to the current check

f'(target."{x}" = source."{x}" OR (target."{x}" IS NULL AND source."{x}" IS NULL))'

can you please try the following:

f'(target."{x}" IS NOT DISTINCT FROM source."{x}")'

Looking at the documentation https://trino.io/docs/current/functions/comparison.html#is-distinct-from-and-is-not-distinct-from this is the proper way to perform an equality check that also matches NULL values, but we must be aware that this doesn't work if any of the columns involved contains ALL NULL values (due to a bug in Trino, as far as I can tell).
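
For reference, here is a rough sketch (not the actual awswrangler source) of how the full ON clause could be assembled per merge column with IS NOT DISTINCT FROM:

merge_cols = ["id"]
join_condition = " AND ".join(
    f'(target."{col}" IS NOT DISTINCT FROM source."{col}")' for col in merge_cols
)
# join_condition == '(target."id" IS NOT DISTINCT FROM source."id")'
merge_sql = (
    'MERGE INTO "production_mobile"."tracks" target '
    'USING "production_mobile"."temp_table_f30f8023428b4dab816840e62ba40699" source '
    f'ON {join_condition}'
)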

Please let me know.

@brendan-cook-87
Contributor Author

I just ran a test using IS NOT DISTINCT FROM and it still took around 8 minutes to do the insert.
So I'd agree it seems slightly more performant, but still orders of magnitude slower than just merging on =.

@Siddharth-Latthe-07

@brendan-cook-87 The drastic increase in query execution time when adding the OR (target."id" IS NULL AND source."id" IS NULL) condition is due to how SQL databases handle null comparisons and the increased complexity of the query.
Try out these steps and let me know if they work:

  1. Ensure No Nulls:
    If your table has no nulls in the id column, you don't need to add the null-check condition.
    Refer to this query:
MERGE INTO "production_mobile"."tracks" target 
USING "production_mobile"."temp_table_f30f8023428b4dab816840e62ba40699" source 
ON (target."id" = source."id")
  2. Use COALESCE for Null Handling (if applicable):
    If you need to handle potential nulls in a different context, you can use COALESCE to provide a default value for nulls.
  3. Partition the Data: If you need to handle a significant number of rows, consider partitioning your data to improve performance.
  4. Optimize Indexes: Ensure that your indexes are optimized and consider adding composite indexes if applicable.
    For optimization, refer to this query:
MERGE INTO "production_mobile"."tracks" target 
USING "production_mobile"."temp_table_f30f8023428b4dab816840e62ba40699" source 
ON (COALESCE(target."id", 'default_value') = COALESCE(source."id", 'default_value'))

Hope this helps, and let me know about any further issues.
Thanks

@brendan-cook-87
Contributor Author

I agree merging on id = id works fine. That's exactly why I raised the issue: the latest version changes the SQL generated by to_iceberg to the less efficient version...
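
For context, we aren't writing that MERGE by hand; it comes from a call along these lines (a rough sketch with placeholder bucket and data; merge_cols is what drives the generated ON clause in 3.9.0):

import awswrangler as wr
import pandas as pd

# Placeholder data; in production this is a batch of event rows keyed by UUID.
df = pd.DataFrame(
    {"id": ["00000000-0000-0000-0000-000000000001"], "payload": ["..."]}
)

wr.athena.to_iceberg(
    df=df,
    database="production_mobile",
    table="tracks",
    temp_path="s3://example-bucket/athena-temp/",  # placeholder S3 prefix
    merge_cols=["id"],
)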
