Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regression: Logical optimizer causes invalid query result with case expression #8942

Closed
sergiimk opened this issue Jan 22, 2024 · 5 comments · Fixed by #8960
Closed

Regression: Logical optimizer causes invalid query result with case expression #8942

sergiimk opened this issue Jan 22, 2024 · 5 comments · Fixed by #8960
Labels
bug Something isn't working regression Something that used to work no longer does

Comments

@sergiimk
Copy link
Contributor

Describe the bug

When logical optimization is enabled datafusion v34 started producing incorrect results.

To Reproduce

Here's the minimal repro case I found so far:

let config = SessionConfig::new();
let runtime = Arc::new(RuntimeEnv::default());
let state = SessionState::new_with_config_rt(config, runtime).with_optimizer_rules(vec![]);
let ctx = SessionContext::new_with_state(state);

let schema = Arc::new(Schema::new(vec![Field::new("id", DataType::Int32, false)]));

let batch =
    RecordBatch::try_new(schema, vec![Arc::new(array::Int32Array::from(vec![0, 1]))]).unwrap();

let df = ctx.read_batch(batch).unwrap();
df.clone().show().await.unwrap();

// Add `t` column full of nulls
let df = df
    .with_column("t", cast(Expr::Literal(ScalarValue::Null), DataType::Int32))
    .unwrap();
df.clone().show().await.unwrap();

let df = df
    // (case when id = 1 then 10 else t) as t
    .with_column(
        "t",
        when(col("id").eq(lit(1)), lit(10))
            .otherwise(col("t"))
            .unwrap(),
    )
    .unwrap()
    // (case when id = 1 then 10 else t) as t2
    .with_column(
        "t2",
        when(col("id").eq(lit(1)), lit(10))
            .otherwise(col("t"))
            .unwrap(),
    )
    .unwrap();

df.clone().show().await.unwrap();

Code above will show:

+----+----+----+
| id | t  | t2 |
+----+----+----+
| 0  |    |    |
| 1  | 10 | 10 |
+----+----+----+

which is correct.

Now comment out the with_optimizer_rules(vec![]) and you will get a very different result:

+----+---+----+
| id | t | t2 |
+----+---+----+
| 0  |   |    |
| 1  |   | 10 |
+----+---+----+

Note that despite t and t2 having identical expressions, column t is now different.

Perhaps the fact that t column is being replaced with expression that depends on previous value of t is what triggers the issue.

Expected behavior

Logical optimization does not produce incorrect results.

Additional context

This broke in datafusion 34, version 33 worked fine.

@sergiimk sergiimk added the bug Something isn't working label Jan 22, 2024
@sergiimk
Copy link
Contributor Author

I further isolated the issue to OptimizeProjections optimizer step.

@gruuya
Copy link
Contributor

gruuya commented Jan 22, 2024

It seems like when the entering plan's innermost projection:

Projection: ?table?.id, t, CASE WHEN ?table?.id = Int32(1) THEN Int32(10) ELSE t END AS t2
  Projection: ?table?.id, CASE WHEN ?table?.id = Int32(1) THEN Int32(10) ELSE t END AS t
    Projection: ?table?.id, Int32(NULL) AS t
      TableScan: ?table?

is being rewritten, this evaluation :
https://github.com/apache/arrow-datafusion/blob/2b218be67a6c412629530b812836a6cec76efc32/datafusion/optimizer/src/optimize_projections.rs#L867-L871
concludes that its and its input schema (the bottom most projection) are identical, and so it just discards the projection (proj and its exprs_used) even though it has non-trivial computation on top.

Trying out a naive solution like

@@ -867,7 +867,7 @@ fn rewrite_projection_given_requirements(
     return if let Some(input) =
         optimize_projections(&proj.input, config, &required_indices)?
     {
-        if &projection_schema(&input, &exprs_used)? == input.schema() {
+        if &projection_schema(&input, &exprs_used)? == input.schema() && exprs_used.iter().all(is_expr_trivial) {
             Ok(Some(input))
         } else {
             Projection::try_new(exprs_used, Arc::new(input))

does solve this particular problem but then it fails to eliminate unneeded projections in some other tests cases (notably in test_infinite_source_partition_by which ends up with a bunch of interleaved projections).

@gruuya
Copy link
Contributor

gruuya commented Jan 22, 2024

Opened #8951 as a potential solution, though I'm not sure that's the best approach here.

@sergiimk
Copy link
Contributor Author

sergiimk commented Jan 22, 2024

Thank you so much for investigating, @gruuya!

It seems that the offending statement appeared in PR #8340. It was a big refactoring and I can't tell if this line migrated from somewhere else or was introduced.

@mustafasrepo could you kindly take a look at this issue and the proposed solution?

Based on @gruuya's findings it looks like any computation that replaces an existing column without changing the schema will be eliminated by the optimizer, which seems like a major issue (perhaps necessitating a patch release).

@alamb
Copy link
Contributor

alamb commented Jan 24, 2024

Proposed PR to fix: #8960

@alamb alamb changed the title Logical optimizer causes invalid query result with case expression Regression: Logical optimizer causes invalid query result with case expression Jan 24, 2024
@alamb alamb added the regression Something that used to work no longer does label Jan 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working regression Something that used to work no longer does
Projects
None yet
3 participants