Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql/opt: version of SplitDisjunctionOfJoinTerms rule for left join #94382

Open
michae2 opened this issue Dec 28, 2022 · 1 comment
Open

sql/opt: version of SplitDisjunctionOfJoinTerms rule for left join #94382

michae2 opened this issue Dec 28, 2022 · 1 comment
Labels
A-sql-optimizer SQL logical planning and optimizations. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs P-3 Issues/test failures with no fix SLA T-sql-queries SQL Queries Team

Comments

@michae2
Copy link
Collaborator

michae2 commented Dec 28, 2022

In #74303 two new exploration rules were added, SplitDisjunctionOfJoinTerms and SplitDisjunctionOfAntiJoinTerms, which split inner joins, semi joins, and anti joins with disjunctions in their join predicates into unioned joins. It would be nice to also have a version of this rule for left join (or other outer joins).

Here's an example of a query that would benefit:

CREATE TABLE x (
  x INT PRIMARY KEY
);
CREATE TABLE abc (
  a INT PRIMARY KEY,
  b INT,
  c INT,
  INDEX (b),
  INDEX (c)
);

INSERT INTO x SELECT generate_series(0, 15);
INSERT INTO abc SELECT x + y * 16, x, y FROM x, generate_series(0, 1023) s(y);
ANALYZE x;
ANALYZE abc;

EXPLAIN
SELECT x, sum(a)
FROM x LEFT JOIN abc ON b = x OR c = x
WHERE x > 14
GROUP BY x;

The plan for this query currently uses a cross join over a full table scan of abc:

demo@127.0.0.1:26257/defaultdb> EXPLAIN
                             -> SELECT x, sum(a)
                             -> FROM x LEFT JOIN abc ON b = x OR c = x
                             -> WHERE x > 14
                             -> GROUP BY x;

                                            info
---------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • group (hash)
  │ estimated row count: 1
  │ group by: x
  │
  └── • cross join (right outer)
      │ estimated row count: 1,039
      │ pred: (b = x) OR (c = x)
      │
      ├── • scan
      │     estimated row count: 16,384 (100% of the table; stats collected 12 minutes ago)
      │     table: abc@abc_pkey
      │     spans: FULL SCAN
      │
      └── • scan
            estimated row count: 1 (6.3% of the table; stats collected 12 minutes ago)
            table: x@x_pkey
            spans: [/15 - ]
(20 rows)


Time: 2ms total (execution 2ms / network 0ms)

If we (incorrectly) rewrite the query to use UNION over two left joins, the plan looks much better (though it now generates different results):

demo@127.0.0.1:26257/defaultdb> EXPLAIN
                             -> SELECT x, sum(a)
                             -> FROM (
                             -> SELECT x, a
                             -> FROM x LEFT JOIN abc ON b = x
                             -> WHERE x > 14
                             -> UNION
                             -> SELECT x, a
                             -> FROM x LEFT JOIN abc ON c = x
                             -> WHERE x > 14
                             -> )
                             -> GROUP BY x;

                                              info
------------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • group (hash)
  │ estimated row count: 2
  │ group by: x
  │
  └── • union
      │ estimated row count: 1,050
      │
      ├── • merge join (right outer)
      │   │ estimated row count: 1,034
      │   │ equality: (b) = (x)
      │   │ right cols are key
      │   │
      │   ├── • scan
      │   │     estimated row count: 1,034 (6.3% of the table; stats collected 20 minutes ago)
      │   │     table: abc@abc_b_idx
      │   │     spans: [/15 - ]
      │   │
      │   └── • scan
      │         estimated row count: 1 (6.3% of the table; stats collected 20 minutes ago)
      │         table: x@x_pkey
      │         spans: [/15 - ]
      │
      └── • lookup join (left outer)
          │ estimated row count: 16
          │ table: abc@abc_c_idx
          │ equality: (x) = (c)
          │ pred: c > 14
          │
          └── • scan
                estimated row count: 1 (6.3% of the table; stats collected 20 minutes ago)
                table: x@x_pkey
                spans: [/15 - ]
(35 rows)


Time: 4ms total (execution 4ms / network 0ms)

This exact rewrite won't usually work (in fact I think it's incorrect in this case), so I imagine this rule will be tricky to get right.

Jira issue: CRDB-22866

@michae2 michae2 added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-sql-optimizer SQL logical planning and optimizations. T-sql-queries SQL Queries Team labels Dec 28, 2022
@michae2
Copy link
Collaborator Author

michae2 commented Dec 28, 2022

One very rough idea for the rule could be to make the output tree similar to SplitDisjunctionOfJoinTerms but with a left join on top to add the unmatched rows. In other words, rewrite something like

(left-join x abc
  (or (= b x) (= c x)))

into something like

(left-join
  x
  (distinct-on
    (union-all
      (inner-join x abc (= b x))
      (inner-join x abc (= c x)))
    pkcols)
  (= x x))

@michae2 michae2 added O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs P-3 Issues/test failures with no fix SLA labels Jun 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-sql-optimizer SQL logical planning and optimizations. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs P-3 Issues/test failures with no fix SLA T-sql-queries SQL Queries Team
Projects
Status: Backlog
Development

No branches or pull requests

1 participant