[Spark] Support OPTIMIZE tbl FULL for clustered table #3793

dabao521 · 2024-10-22T21:38:20Z

Which Delta project/connector is this regarding?

Spark

Description

Adds new SQL syntax: OPTIMIZE tbl FULL.
OPTIMIZE tbl FULL re-clusters all data in a clustered table.

How was this patch tested?

New unit tests added.

Does this PR introduce any user-facing changes?

Yes
Previously, a clustered table would not re-cluster data that had already been clustered against different clustering keys. With OPTIMIZE tbl FULL, that data is re-clustered against the new keys.
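
To make the behavior change concrete, here is a toy model of file selection. It is only a sketch: DataFile, clusteredBy, and filesToCluster are hypothetical names, not Delta's API, and the selection rule is an assumption drawn from the description above. The PR does not say what happens to files already clustered by the current keys, so the sketch leaves them alone in both modes as a simplification.

```scala
// Toy model only: the types and the selection rule are assumptions for
// illustration, not Delta's implementation.
object OptimizeFullSketch {
  // A data file plus the keys it was last clustered by (None = never clustered).
  final case class DataFile(path: String, clusteredBy: Option[Seq[String]])

  // Files an OPTIMIZE run would pick up for clustering in this simplified model.
  def filesToCluster(
      files: Seq[DataFile],
      currentKeys: Seq[String],
      isFull: Boolean): Seq[DataFile] = {
    files.filter { f =>
      f.clusteredBy match {
        // Never clustered: a candidate with or without FULL.
        case None => true
        // Already clustered by the current keys: left alone (simplification).
        case Some(keys) if keys == currentKeys => false
        // Clustered by different keys: picked up only when FULL is requested.
        case Some(_) => isFull
      }
    }
  }
}
```

In this model, a file clustered by an old key is skipped by a plain OPTIMIZE and selected once isFull is true, which mirrors the user-facing change described above.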

ami7o

Overall LGTM. Some questions.

spark/src/main/scala/org/apache/spark/sql/delta/commands/OptimizeTableStrategy.scala

ami7o · 2024-10-25T00:11:26Z

...t/scala/org/apache/spark/sql/delta/skipping/clustering/IncrementalZCubeClusteringSuite.scala

@@ -153,7 +153,7 @@ class IncrementalZCubeClusteringSuite extends QueryTest
                inputZCubeFiles = ClusteringFileStats(2, SKIP_CHECK_SIZE_VALUE),
                inputOtherFiles = ClusteringFileStats(4, SKIP_CHECK_SIZE_VALUE),
                inputNumZCubes = 1,
-                mergedFiles = ClusteringFileStats(6, SKIP_CHECK_SIZE_VALUE),
+                mergedFiles = ClusteringFileStats(4, SKIP_CHECK_SIZE_VALUE),


Could you explain this change?

This is related to comment https://github.com/delta-io/delta/pull/3793/files#r1815830851
After fixing validateClusteringMetrics, this assertion started failing, and I had to fix it because validateClusteringMetrics is used by the new tests as well.

ami7o · 2024-10-25T00:11:36Z

...t/scala/org/apache/spark/sql/delta/skipping/clustering/IncrementalZCubeClusteringSuite.scala

@@ -73,17 +73,17 @@ class IncrementalZCubeClusteringSuite extends QueryTest
      actualMetrics: ClusteringStats, expectedMetrics: ClusteringStats): Unit = {
    var finalActualMetrics = actualMetrics
    if (expectedMetrics.inputZCubeFiles.size == SKIP_CHECK_SIZE_VALUE) {
-      val stats = expectedMetrics.inputZCubeFiles
+      val stats = finalActualMetrics.inputZCubeFiles


Why is this needed?

This is a test bug left over from the commit that added this test. I have to fix it in this PR because the new tests depend on validateClusteringMetrics to check that the metrics are correct. Without this fix, the validation could pass without showing that the program is correct.
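
The effect of the one-line fix can be sketched in isolation. This is a simplified reconstruction under stated assumptions: the case classes are stand-ins with hypothetical field names (numFiles, size), the sentinel value -1 is assumed, and the copy-and-compare step is inferred from the hunk above, since the rest of validateClusteringMetrics is not shown in this PR excerpt.

```scala
// Simplified stand-ins; not the real Delta test classes.
object ValidateMetricsSketch {
  // Sentinel meaning "do not compare sizes". The value -1 is an assumption.
  val SKIP_CHECK_SIZE_VALUE: Long = -1L

  // Field names are hypothetical.
  final case class ClusteringFileStats(numFiles: Long, size: Long)
  final case class ClusteringStats(inputZCubeFiles: ClusteringFileStats)

  // Before the fix: the EXPECTED stats are copied into the actual metrics,
  // so the actual file count is overwritten and never compared.
  def matchesBuggy(actual: ClusteringStats, expected: ClusteringStats): Boolean = {
    var finalActual = actual
    if (expected.inputZCubeFiles.size == SKIP_CHECK_SIZE_VALUE) {
      val stats = expected.inputZCubeFiles
      finalActual = finalActual.copy(
        inputZCubeFiles = stats.copy(size = SKIP_CHECK_SIZE_VALUE))
    }
    finalActual == expected
  }

  // After the fix: only the size of the ACTUAL stats is replaced by the
  // sentinel, so the file count is still compared.
  def matchesFixed(actual: ClusteringStats, expected: ClusteringStats): Boolean = {
    var finalActual = actual
    if (expected.inputZCubeFiles.size == SKIP_CHECK_SIZE_VALUE) {
      val stats = finalActual.inputZCubeFiles
      finalActual = finalActual.copy(
        inputZCubeFiles = stats.copy(size = SKIP_CHECK_SIZE_VALUE))
    }
    finalActual == expected
  }
}
```

Under these assumptions, an expectation with the wrong file count passes the buggy check and fails the fixed one, which would explain why existing expected values such as mergedFiles had to be corrected once the helper was fixed.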

ami7o · 2024-10-25T00:14:33Z

...t/scala/org/apache/spark/sql/delta/skipping/clustering/IncrementalZCubeClusteringSuite.scala

@@ -281,5 +283,159 @@ class IncrementalZCubeClusteringSuite extends QueryTest
      }
    }
  }
+
+  test("OPTIMIZE FULL") {


Can we add a test case for a different clusteringProvider with OPTIMIZE FULL?

Added a new test: "OPTIMIZE FULL - change clustering provider".

spark/src/main/scala/io/delta/sql/parser/DeltaSqlParser.scala

spark/src/main/scala/org/apache/spark/sql/delta/commands/OptimizeTableStrategy.scala

rahulsmahadev · 2024-10-29T23:12:21Z

...t/scala/org/apache/spark/sql/delta/skipping/clustering/IncrementalZCubeClusteringSuite.scala

@@ -230,7 +230,7 @@ class IncrementalZCubeClusteringSuite extends QueryTest
                inputZCubeFiles = ClusteringFileStats(2, SKIP_CHECK_SIZE_VALUE),
                inputOtherFiles = ClusteringFileStats(4, SKIP_CHECK_SIZE_VALUE),
                inputNumZCubes = 1,
-                mergedFiles = ClusteringFileStats(6, SKIP_CHECK_SIZE_VALUE),
+                mergedFiles = ClusteringFileStats(4, SKIP_CHECK_SIZE_VALUE),


I'm guessing this is also for the same reason as above.

Yes, this fixes the test bug discussed in https://github.com/delta-io/delta/pull/3793/files#r1815830851

rahulsmahadev

LGTM! Thanks for working on this.

dabao521 added 9 commits October 21, 2024 13:06

add sql word

10fa537

fix up

15bca66

add error class to capture wrong FULL use

0ad00a6

fix up

767715b

fix up isFull into optimize command context

ca2330a

Implementation details for OPTIMIZE FULL

a664345

fix test

92ef939

fix test failures

7e9a031

add comment

0264863

ami7o reviewed Oct 25, 2024


rahulsmahadev reviewed Oct 25, 2024


spark/src/main/scala/io/delta/sql/parser/DeltaSqlParser.scala

fix comments

6ae768a

dabao521 requested review from ami7o and rahulsmahadev October 28, 2024 20:00

rahulsmahadev reviewed Oct 29, 2024


spark/src/main/scala/io/delta/sql/parser/DeltaSqlParser.scala

dabao521 requested a review from rahulsmahadev October 29, 2024 20:05

rahulsmahadev reviewed Oct 29, 2024


spark/src/main/scala/org/apache/spark/sql/delta/commands/OptimizeTableStrategy.scala

rahulsmahadev reviewed Oct 29, 2024


rahulsmahadev approved these changes Oct 29, 2024


allisonport-db merged commit 959765a into delta-io:master Oct 30, 2024
16 of 19 checks passed
