[Spark] Allow missing fields with implicit casting during streaming write #3822

Open · wants to merge 4 commits into master

Conversation

@johanl-db (Collaborator) commented Oct 29, 2024

Description

Follow-up to #3443, which introduced implicit casting during streaming writes to Delta tables.

The feature was shipped disabled due to a regression found in testing: writing data with missing struct fields started being rejected. Streaming writes are one of the few insert types that allow missing struct fields.

This change makes the casting behavior used in MERGE, UPDATE, and streaming writes configurable with respect to missing struct fields.
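
For illustration, a minimal Scala sketch of the regression scenario (table and column names are assumptions, not taken from the PR): the sink column 's' has type STRUCT<a: INT, b: INT> while the stream provides only nested field 'a', which streaming writes historically accept, filling 'b' with NULL.

import org.apache.spark.sql.functions.{col, struct}

spark.sql("CREATE TABLE delta_sink (s STRUCT<a: INT, b: INT>) USING delta")

spark.readStream
    .table("delta_source")
    // Nested field 'b' is missing; streaming writes fill it with NULL.
    .select(struct(col("s.a").alias("a")).alias("s"))
    .writeStream
    .format("delta")
    .option("checkpointLocation", "<location>")
    .toTable("delta_sink")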

How was this patch tested?

Extensive tests were added in #3762 in preparation for this change, covering all insert types (SQL, dataframe, append/overwrite, ...):

  • Missing top-level columns and nested struct fields.
  • Extra top-level columns and nested struct fields with schema evolution.
  • Position vs. name based resolution for top-level columns and nested struct fields.

In particular, the goal is to ensure that enabling implicit casting in streaming writes here doesn't cause any other unwanted behavior change.

This PR introduces the following user-facing changes:

From the initial PR: #3443

Previously, writing to a Delta sink using a type that doesn't match the column type in the Delta table failed with DELTA_FAILED_TO_MERGE_FIELDS:

spark.readStream
    .table("delta_source")
    # Column 'a' has type INT in 'delta_sink'.
    .select(col("a").cast("long").alias("a"))
    .writeStream
    .format("delta")
    .option("checkpointLocation", "<location>")
    .toTable("delta_sink")

DeltaAnalysisException: [DELTA_FAILED_TO_MERGE_FIELDS] Failed to merge fields 'a' and 'a'

With this change, writing to the sink now succeeds and data is cast from LONG to INT. If any value overflows, the stream fails (assuming the default storeAssignmentPolicy=ANSI) with:

SparkArithmeticException: [CAST_OVERFLOW_IN_TABLE_INSERT] Fail to assign a value of 'LONG' type to the 'INT' type column or variable 'a' due to an overflow. Use `try_cast` on the input value to tolerate overflow and return NULL instead.
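
As the error message hints, a stream that needs to tolerate overflow can pre-cast the input itself. A sketch against the same hypothetical source and sink, using try_cast (available as a Spark SQL function since 3.2); overflowing values become NULL instead of failing the query:

import org.apache.spark.sql.functions.expr

spark.readStream
    .table("delta_source")
    // Overflowing values become NULL instead of failing the stream.
    .select(expr("try_cast(a AS INT)").alias("a"))
    .writeStream
    .format("delta")
    .option("checkpointLocation", "<location>")
    .toTable("delta_sink")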

@johanl-db changed the title from "[Spark][WIP] Allow missing fields with implicit casting during streaming write" to "[Spark] Allow missing fields with implicit casting during streaming write" on Oct 30, 2024
@tomvanbussel (Collaborator) left a comment

LGTM, but left a few nits.

The casting and missing columns behavior is getting really complex overall, as it depends so strongly on the operation performed. It would be good if we could clean this up, but this will likely require some breaking changes.

case class CastingBehavior(
    allowMissingStructField: Boolean,
    resolveStructsByName: Boolean,
    isMergeOrUpdate: Boolean)
Collaborator

Nit: This is mixing policy flags (isMergeOrUpdate) with mechanism flags (allowMissingStructField, resolveStructsByName).

Collaborator Author

I pulled isMergeOrUpdate out; it's now a trait that gets mixed in instead of a flag.
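
A minimal sketch of what that refactor can look like; the trait name and signatures below are assumptions, not the PR's exact code:

// Marker trait replacing the isMergeOrUpdate flag (name is an assumption).
trait MergeOrUpdateCastingBehavior

case class CastingBehavior(
    allowMissingStructField: Boolean,
    resolveStructsByName: Boolean)

// MERGE/UPDATE call sites mix the trait in...
val mergeBehavior =
  new CastingBehavior(allowMissingStructField = true, resolveStructsByName = true)
    with MergeOrUpdateCastingBehavior

// ...and code that needs the distinction tests for the trait instead of a flag.
def isMergeOrUpdate(behavior: CastingBehavior): Boolean =
  behavior.isInstanceOf[MergeOrUpdateCastingBehavior]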

 * in error messages and to provide backward compatible behavior.
 */
case class CastingBehavior(
    allowMissingStructField: Boolean,
Collaborator

It seems like this config is completely ignored when resolveStructsByName is false. This may lead to some surprising behavior in the future.

Collaborator Author

I'm now using the type system to enforce the constraint that by-position resolution cannot specify allowMissingStructField.
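
A sketch of how the types can encode that constraint (names assumed):

sealed trait CastingBehavior

// Only by-name resolution exposes the allowMissingStructField knob.
case class CastByName(allowMissingStructField: Boolean) extends CastingBehavior

// By-position resolution has no such parameter: the config cannot apply to it.
case object CastByPosition extends CastingBehavior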

includeInserts = inserts -- insertsByName.intersect(insertsDataframe)
// Exclude dataframe inserts by name (except streaming) which don't support implicit cast.
// See negative test below.
includeInserts = inserts -- (insertsByName.intersect(insertsDataframe) - StreamingInsert)
Collaborator

Nit: This is getting a little bit complex. It's hard for me to understand which cases are actually covered here. It's okay to have some duplication in tests.

Collaborator Author

Minor improvement: I introduced a variable for "inserts that don't support implicit casting" to make it a bit easier to reason about (sketched below).
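
Roughly, with the new variable name assumed:

val insertsWithoutImplicitCast = insertsDataframe.intersect(insertsByName) - StreamingInsert
includeInserts = inserts -- insertsWithoutImplicitCast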

includeInserts = inserts -- insertsDataframe.intersect(insertsByName)
// Exclude dataframe inserts by name (except streaming) which don't support implicit cast.
// See negative test below.
includeInserts = inserts -- (insertsByName.intersect(insertsDataframe) - StreamingInsert)
Collaborator

Why not just:

Suggested change
- includeInserts = inserts -- (insertsByName.intersect(insertsDataframe) - StreamingInsert)
+ includeInserts = inserts -- insertsByName.intersect(insertsDataframe) + StreamingInsert

Collaborator Author

inserts is one of:

  • insertsAppend - StreamingInsert
  • insertsOverwrite - SQLInsertOverwritePartitionByPosition
  • Set(StreamingInsert)
  • Set(SQLInsertOverwritePartitionByPosition)

It doesn't always contain StreamingInsert, so we don't want to add it unconditionally; otherwise we'd get duplicate tests (see the toy example below).
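
A self-contained toy illustrating the difference: '--' and '+' have equal precedence and associate left in Scala, so the suggested form always re-adds StreamingInsert.

object SetDemo extends App {
  // Stand-ins for the real insert kinds; element names are illustrative.
  val dataframeInsertsByName = Set("DataframeAppendByName", "StreamingInsert")
  val inserts = Set("SQLInsertOverwritePartitionByPosition") // no StreamingInsert here

  // Current form: only keeps StreamingInsert if it was already in 'inserts'.
  val current = inserts -- (dataframeInsertsByName - "StreamingInsert")
  println(current) // Set(SQLInsertOverwritePartitionByPosition)

  // Suggested form: re-adds StreamingInsert even though 'inserts' never had it,
  // which would produce a duplicate test.
  val suggested = inserts -- dataframeInsertsByName + "StreamingInsert"
  println(suggested) // Set(SQLInsertOverwritePartitionByPosition, StreamingInsert)
}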
