Merging with null matching causes extreme performance degradation. #2891

Closed
brendan-cook-87 opened this issue Jul 12, 2024 · 4 comments · Fixed by #2892
Labels
bug Something isn't working

Comments

@brendan-cook-87
Contributor

Describe the bug

When we add the join clause OR (target."id" IS NULL AND source."id" IS NULL) to enable matching of nulls in the id columns, it causes some queries to take orders of magnitude longer to complete.

#2872

How to Reproduce

I have a table with approximately 17 million rows, and the data scanned for an insert operation is roughly 350 MB.

The previous behaviour of running:
MERGE INTO "production_mobile"."tracks" target USING "production_mobile"."temp_table_f30f8023428b4dab816840e62ba40699" source ON (target."id" = source."id")

to insert ~300 new rows would execute in around 5s.

Running it with this clause:
MERGE INTO "production_mobile"."tracks" target USING "production_mobile"."temp_table_f30f8023428b4dab816840e62ba40699" source ON (target."id" = source."id" OR (target."id" IS NULL AND source."id" IS NULL))

is taking 12+ minutes across several attempts.

This table has no nulls in the id column. It is processing event logs and merging on UUIDs.

Expected behavior

This behaviour should not be the default, given the degradation in performance it causes on live production systems.
I am unable to update to the latest version of this layer at this time.

Your project

No response

Screenshots

No response

OS

Mac

Python version

3.11

AWS SDK for pandas version

3.9.0

Additional context

No response

@aldder
Contributor

aldder commented Jul 12, 2024

Hello Brendan, sorry to hear that my PR introduced this huge delay in timings.

As an alternative to the current check

f'(target."{x}" = source."{x}" OR (target."{x}" IS NULL AND source."{x}" IS NULL))'

can you please try the following:

f'(target."{x}" IS NOT DISTINCT FROM source."{x}")'

Looking at the documentation https://trino.io/docs/current/functions/comparison.html#is-distinct-from-and-is-not-distinct-from this is the proper way to perform an equality check that also matches NULL values, but we must be aware that this doesn't work if any of the columns involved contains ALL NULL values (due to a bug in Trino, as far as I can tell).
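
For reference, here is a rough sketch (not the actual awswrangler source) of how the full ON clause could be assembled per merge column with IS NOT DISTINCT FROM:

merge_cols = ["id"]
join_condition = " AND ".join(
    f'(target."{col}" IS NOT DISTINCT FROM source."{col}")' for col in merge_cols
)
# join_condition == '(target."id" IS NOT DISTINCT FROM source."id")'
merge_sql = (
    'MERGE INTO "production_mobile"."tracks" target '
    'USING "production_mobile"."temp_table_f30f8023428b4dab816840e62ba40699" source '
    f'ON {join_condition}'
)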

Please let me know.

@brendan-cook-87
Contributor Author

I just ran a test using IS NOT DISTINCT FROM and it still took around 8 minutes to do the insert.
So I'd agree it seems slightly more performant, but still orders of magnitude slower than just merging on =.

@Siddharth-Latthe-07

@brendan-cook-87 The drastic increase in query execution time when adding the OR (target."id" IS NULL AND source."id" IS NULL) condition is due to how SQL databases handle null comparisons and the increased complexity of the query.
Try out these steps and let me know if they work:

  1. Ensure No Nulls:
    If your table has no nulls in the id column, you don't need to add the null-check condition.
    Refer to this query:
MERGE INTO "production_mobile"."tracks" target 
USING "production_mobile"."temp_table_f30f8023428b4dab816840e62ba40699" source 
ON (target."id" = source."id")
  2. Use COALESCE for Null Handling (if applicable):
    If you need to handle potential nulls in a different context, you can use COALESCE to provide a default value for nulls.
  3. Partition the Data: If you need to handle a significant number of rows, consider partitioning your data to improve performance.
  4. Optimize Indexes: Ensure that your indexes are optimized and consider adding composite indexes if applicable.
    For optimization, refer to this query:
MERGE INTO "production_mobile"."tracks" target 
USING "production_mobile"."temp_table_f30f8023428b4dab816840e62ba40699" source 
ON (COALESCE(target."id", 'default_value') = COALESCE(source."id", 'default_value'))

Hope this helps, and let me know about any further issues.
Thanks

@brendan-cook-87
Contributor Author

I agree merging on id = id works fine. That's exactly why I raised the issue: the latest version changes the SQL generated by to_iceberg to the less efficient version...
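
For context, we aren't writing that MERGE by hand; it comes from a call along these lines (a rough sketch with placeholder bucket and data; merge_cols is what drives the generated ON clause in 3.9.0):

import awswrangler as wr
import pandas as pd

# Placeholder data; in production this is a batch of event rows keyed by UUID.
df = pd.DataFrame(
    {"id": ["00000000-0000-0000-0000-000000000001"], "payload": ["..."]}
)

wr.athena.to_iceberg(
    df=df,
    database="production_mobile",
    table="tracks",
    temp_path="s3://example-bucket/athena-temp/",  # placeholder S3 prefix
    merge_cols=["id"],
)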
