Avoid reading min/max/null statistics for planning iceberg inserts #23757

Merged: 2 commits merged into trinodb:master from the ice-metadata-for-write branch on Oct 13, 2024

Conversation

@raunaqmorarka (Member) commented on Oct 11, 2024:

Description

For large tables, going through statistics for all files can be slow.
The calling code in getStatisticsCollectionMetadataForWrite was not
using all of those statistics, so it is simplified to fetch only NDVs.
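
In rough terms, the simplification means the write-time statistics request can be derived from the table schema and the existing NDV bookkeeping alone, with no pass over per-file min/max/null-count statistics. Below is a minimal sketch of that decision using simplified stand-in types rather than Trino's actual SPI (IcebergColumn, columnsNeedingNdvCollection, and both parameters are illustrative names):

import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative sketch only: simplified stand-ins, not Trino's SPI classes.
public class NdvOnlyStatsPlanning
{
    // Hypothetical minimal view of a column: its name and whether its type supports NDV collection.
    record IcebergColumn(String name, boolean supportsNdv) {}

    // Decide which columns should get NDV (distinct-count) collection during an INSERT.
    // The decision needs only the schema, whether the table is empty, and which columns
    // already have NDVs recorded (e.g. from a previous ANALYZE); no manifest walk and
    // no aggregation of min/max/null counts across data files is required.
    static Set<String> columnsNeedingNdvCollection(
            List<IcebergColumn> columns,
            boolean tableIsEmpty,
            Set<String> columnsWithKnownNdv)
    {
        return columns.stream()
                .filter(IcebergColumn::supportsNdv)
                .map(IcebergColumn::name)
                .filter(column -> tableIsEmpty || columnsWithKnownNdv.contains(column))
                .collect(Collectors.toSet());
    }
}

The trade-off discussed further down the thread is that a wrong decision at this level only affects NDVs, which a later ANALYZE can correct.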

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

## Iceberg
* Improve planning time for Iceberg inserts. ({issue}`23757`)

@findinpath (Contributor) left a comment:

LGTM

My only concern is that we should guard the code against potential future regressions on the write path which could be affected by min/max stats.

@@ -346,6 +347,7 @@
import static org.apache.iceberg.SnapshotSummary.DELETED_RECORDS_PROP;
import static org.apache.iceberg.SnapshotSummary.REMOVED_EQ_DELETES_PROP;
import static org.apache.iceberg.SnapshotSummary.REMOVED_POS_DELETES_PROP;
import static org.apache.iceberg.SnapshotSummary.TOTAL_RECORDS_PROP;
Contributor:

The calling code in getStatisticsCollectionMetadataForWrite was not
using all the statistics and is simplified to only fetch NDVs.

For regression-prevention purposes: can we ensure through a test that the calling code does not depend on min/max stats retrieved from io.trino.plugin.iceberg.TableStatisticsReader#makeTableStatistics?

I remember raising the idea promoted by this PR a while ago, and @findepi was rather cautious about making this change because of potential regressions.

@raunaqmorarka (Member, Author):

The important thing here is avoiding a pass over all manifest files when planning inserts, not the use of min/max/null stats as such. There are tests which assert on manifest file accesses on the filesystem. If this code were getting those stats through some other, cheaper means, then we wouldn't care about it.
At worst, there are a couple of ways of making a mistake in this code:

  1. We generate NDV stats even though we don't know the existing NDVs and end up undercounting NDVs by recording only the NDVs collected on write. The other min/max/null stats would still be correct, so the CBO may give worse plans, but it's not the end of the world. Eventually the stats will become more accurate, or a call to ANALYZE will fix the whole thing.
  2. We fail to detect that the table is empty or that NDVs are known, and skip generating them on write. Again we possibly get worse plans from the CBO while the other non-NDV stats are still intact, and a call to ANALYZE will fix the problem.

Either way, these are much more tolerable problems than the one caused by the current code, where planning can bottleneck INSERT queries for minutes.
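
A rough sketch of the kind of file-operation assertion those tests make follows. The assertFileSystemAccesses call, the FileOperation constants, and the builder style are taken from the test diff below; the query and the expected set shown here are illustrative, not the test's full expected multiset:

// Inside the existing file-operation test (ImmutableMultiset is Guava's
// com.google.common.collect.ImmutableMultiset); the expected set is trimmed for illustration.
assertFileSystemAccesses(
        "INSERT INTO test_insert VALUES (42)",
        ImmutableMultiset.<FileOperation>builder()
                .add(new FileOperation(STATS, "InputFile.newStream"))
                .add(new FileOperation(SNAPSHOT, "OutputFile.create"))
                .add(new FileOperation(MANIFEST, "OutputFile.create"))
                // Notably absent: FileOperation(MANIFEST, "InputFile.newStream"), since
                // planning the INSERT should no longer read the existing manifests.
                .build());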

@@ -234,7 +233,6 @@ public void testInsert()
.add(new FileOperation(STATS, "InputFile.newStream"))
.add(new FileOperation(SNAPSHOT, "OutputFile.create"))
.add(new FileOperation(MANIFEST, "OutputFile.create"))
.add(new FileOperation(MANIFEST, "InputFile.newStream"))
Contributor:

Maybe we should note in the commit message that the manifest list corresponding to the table snapshot is no longer read during writes.

For large tables going through statistics for all files can be slow.
The calling code in getStatisticsCollectionMetadataForWrite was not
using all the statistics and is simplified to only fetch NDVs.
@raunaqmorarka merged commit 0c0dda1 into trinodb:master on Oct 13, 2024.
42 checks passed.
@raunaqmorarka deleted the ice-metadata-for-write branch on Oct 13, 2024 at 10:14.
The github-actions bot added this to the 462 milestone on Oct 13, 2024.
3 participants