Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better selectivity estimates for ORed predicates #75090

Closed
msirek opened this issue Jan 18, 2022 · 1 comment · Fixed by #89358
Closed

Better selectivity estimates for ORed predicates #75090

msirek opened this issue Jan 18, 2022 · 1 comment · Fixed by #89358
Assignees
Labels
A-sql-optimizer SQL logical planning and optimizations. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-sql-queries SQL Queries Team

Comments

@msirek
Copy link
Contributor

msirek commented Jan 18, 2022

Description
(#74303) Adds support for combining selectivities of ORed equijoin predicates with an industry-standard approach using the disjunction rule:

Given predicates A and B and probability function P:
P(A or B) = P(A) + P(B) - P(A and B)

This issue is opened to also apply the rule to single-table filter predicates on different columns for both Selects and Joins, e.g.,

col1 = 1 OR col2 = 1

Selectivities of predicates on the same column should be additive if the ranges are non-overlapping, e.g.,

col1 < 1 OR col1 > 15

Describe alternatives you've considered
Multicolumn stats could be used in the future to further improve estimates by getting a more accurate result for P(A and B).

Jira issue: CRDB-12458

@msirek msirek added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Jan 18, 2022
@msirek msirek self-assigned this Jan 18, 2022
@msirek msirek added the T-sql-queries SQL Queries Team label Jan 21, 2022
@msirek msirek added the A-sql-optimizer SQL logical planning and optimizations. label Jan 21, 2022
msirek pushed a commit to msirek/cockroach that referenced this issue Oct 5, 2022
Fixes cockroachdb#75090

Currently, the optimizer only does a good job of estimating the
selectivity of ORed predicates when each disjunct is a tight
constraint because we give up and go with the default selectivity of 1/3
in that case. If any of the disjuncts has a selectivity above 1/3, then
the total selectivity should be at least as high as the selectivity of
that disjunct.

This is addressed by using the method of combining ORed selectivities
introduced by cockroachdb#74303:
```
   Given predicates A and B and probability function P:

      P(A or B) = P(A) + P(B) - P(A and B)
```
The above formula is applied n-1 times, given n disjuncts.
For each disjunct which is not a tight constraint, 1/3 is used for that
predicate's selectivity in the formula.

Release note (bug fix): This patch fixes incorrect selectivity
estimation for queries with ORed predicates all referencing a
common single table.
@msirek
Copy link
Contributor Author

msirek commented Oct 5, 2022

Started working on this here: #89358

msirek pushed a commit to msirek/cockroach that referenced this issue Nov 8, 2022
Fixes cockroachdb#75090

Currently, the optimizer only does a good job of estimating the
selectivity of single-table ORed predicates when all disjuncts are tight
constraints. Otherwise we give up and go with the default selectivity of
1/3 for the entire filter. If any of the disjuncts has a selectivity
above 1/3, then the total selectivity should be at least as high as the
selectivity of that disjunct. Any failure to find a tight constraint for
a given disjunct should only cause a selectivity of 1/3 to be used for
that disjunct, not the entire filter.

This is addressed by using the method of combining ORed selectivities
introduced by cockroachdb#74303:
```
   Given predicates A and B and probability function P:

      P(A or B) = P(A) + P(B) - P(A and B)
```
The above formula is applied n-1 times, given n disjuncts.
For each disjunct which is not a tight constraint, 1/3 is used for that
predicate's selectivity in the formula.

Release note (bug fix): This patch fixes incorrect selectivity
estimation for queries with ORed predicates all referencing a
common single table.
msirek pushed a commit to msirek/cockroach that referenced this issue Nov 29, 2022
Fixes cockroachdb#75090

Currently, the optimizer only does a good job of estimating the
selectivity of single-table ORed predicates when all disjuncts are tight
constraints. Otherwise we give up and go with the default selectivity of
1/3 for the entire filter. If any of the disjuncts has a selectivity
above 1/3, then the total selectivity should be at least as high as the
selectivity of that disjunct. Any failure to find a tight constraint for
a given disjunct should only cause a selectivity of 1/3 to be used for
that disjunct, not the entire filter.

This is addressed by using the method of combining ORed selectivities
introduced by cockroachdb#74303:
```
   Given predicates A and B and probability function P:

      P(A or B) = P(A) + P(B) - P(A and B)
```
The above formula is applied n-1 times, given n disjuncts.
For each disjunct which is not a tight constraint, 1/3 is used for that
predicate's selectivity in the formula.

Release note (bug fix): This patch fixes incorrect selectivity
estimation for queries with ORed predicates all referencing a
common single table.
craig bot pushed a commit that referenced this issue Nov 30, 2022
89358: memo: better selectivity estimates on single-table ORed predicates r=rytaft,michae2 a=msirek

Fixes #75090

Currently, the optimizer only does a good job of estimating the       
selectivity of single-table ORed predicates when all disjuncts are tight  
constraints. Otherwise we give up and go with the default selectivity of    
1/3 for the entire filter. If any of the disjuncts has a selectivity 
above 1/3, then the total selectivity should be at least as high as the 
selectivity of that disjunct. Any failure to find a tight constraint for
a given disjunct should only cause a selectivity of 1/3 to be used for
that disjunct, not the entire filter.  

This is addressed by using the method of combining ORed selectivities
introduced by #74303:
```
   Given predicates A and B and probability function P:

      P(A or B) = P(A) + P(B) - P(A and B)
```
The above formula is applied n-1 times, given n disjuncts.
For each disjunct which is not a tight constraint, 1/3 is used for that
predicate's selectivity in the formula.

Release note (bug fix): This patch fixes incorrect selectivity
estimation for queries with ORed predicates all referencing a
common single table.


91040: cloud/amazon: support external ID in AWS assume role r=rhu713 a=rhu713

Support passing in the optional external ID when assuming a role. This is done by extending the values of the comma-separated string value of the ASSUME_ROLE parameter to the format `<role>;external_id=<id>`. Users can still use the previous format of just `<role>` to specify a role without any external ID.

When using role chaining, each role in the chain can be associated with a different external ID. For example:
`ASSUME_ROLE=<roleA>;external_id=<idA>,<roleB>;external_id=<idB>,<roleC>` will use external ID `<idA>` to assume delegate `<roleA>`, then use external ID `<idB>` to assume delegate `<roleB>`, and finally no external ID to assume the final role `<roleC>`.

Additional documentation about external IDs can be found here: https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-user_externalid.html

Release note (enterprise change): Support passing in the optional external ID when assuming a role. This is done by extending the values of the comma-separated string value of the ASSUME_ROLE parameter to the format `<role>;external_id=<id>`. Users can still use the previous format of just `<role>` to specify a role without any external ID.

When using role chaining, each role in the chain can be associated with a different external ID. For example:
`ASSUME_ROLE=<roleA>;external_id=<idA>,<roleB>;external_id=<idB>,<roleC>` will use external ID `<idA>` to assume delegate `<roleA>`, then use external ID `<idB>` to assume delegate `<roleB>`, and finally no external ID to assume the final role `<roleC>`.

Addresses #90239

92341: changefeedccl: fix flakey tests r=[miretskiy] a=HonoreDB

Our test SQL connection doesn't always read its own writes: it can use different connections for different queries. So
```
INSERT INTO foo;
feed(CREATE CHANGEFEED FOR foo)
```
is a race condition. This just doesn't matter for most tests because if the insert happens after the initial scan, it creates the same payload. But tests with initial_scan='only' won't see it at all.

Fixes #92211

Release note: None

92462: scripts: improve pprof-loop.sh r=andrewbaptist a=tbg

Now it also works with the runtime trace endpoint (worked before but
printed concerning-looking errors) and with non-blocking endpoints
(cumulative profiles such as heap or mutex), for which it previously
busy-looped.

Epic: None
Release note: None


92591: ui: add 99.9th and 99.99th SQL latency charts r=maryliag a=maryliag

This commit introduces the charts for:
`Service Latency: SQL Statements, 99.99th percentile` and
`Service Latency: SQL Statements, 99.9th percentile` on Metrics page under SQL view.

Fixes #74247

<img width="806" alt="Screen Shot 2022-11-28 at 12 23 56 PM" src="https://user-images.githubusercontent.com/1017486/204342077-b5509f51-f94e-44be-a2e8-8bd185d12ce6.png">



Release note (ui change): Added charts
`Service Latency: SQL Statements, 99.9th percentile` and `Service Latency: SQL Statements, 99.99th percentile` to Metrics page, under SQL view.

92635: tree: fix internal error comparing tuple type with non-tuple type r=rytaft,mgartner a=msirek

Fixes #91532

This fixes comparison of an empty tuple expression, ROW() or (), or
a non-empty tuple expression with a non-tuple type by returning
an error message instead of an internal error.

Release note (bug fix): This fixes an internal error when comparing
a tuple type with a non-tuple type.

92636: backupccl: add restore corrupt descriptor validation test r=adityamaru a=msbutler

This patch adds a test that ensures that restore automatically detects corrupt restoring descriptors via the sql descriptor collection read/write machinery.

Fixes #84757

Release note: None

92692: jobs: protected timestamps refresh logic is not transactional r=fqazi a=fqazi

Fixes: #92427

Previously, the refresh logic for protected timestamps would fetch jobs with transactions before attempting to update an existing one. Due to a recent change to allow this API to reflect any changes into individual job objects, we no longer do that correctly. This patch fixes the protected timestamp manager to update timestamps in a transactionally, safe manner since multiple updates can be happening concurrently for schema changes.

Release note: None

Co-authored-by: Mark Sirek <sirek@cockroachlabs.com>
Co-authored-by: Rui Hu <rui@cockroachlabs.com>
Co-authored-by: Aaron Zinger <zinger@cockroachlabs.com>
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: maryliag <marylia@cockroachlabs.com>
Co-authored-by: Michael Butler <butler@cockroachlabs.com>
Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com>
@craig craig bot closed this as completed in 5290f1f Nov 30, 2022
mgartner pushed a commit to mgartner/cockroach that referenced this issue Dec 29, 2022
Fixes cockroachdb#75090

Currently, the optimizer only does a good job of estimating the
selectivity of single-table ORed predicates when all disjuncts are tight
constraints. Otherwise we give up and go with the default selectivity of
1/3 for the entire filter. If any of the disjuncts has a selectivity
above 1/3, then the total selectivity should be at least as high as the
selectivity of that disjunct. Any failure to find a tight constraint for
a given disjunct should only cause a selectivity of 1/3 to be used for
that disjunct, not the entire filter.

This is addressed by using the method of combining ORed selectivities
introduced by cockroachdb#74303:
```
   Given predicates A and B and probability function P:

      P(A or B) = P(A) + P(B) - P(A and B)
```
The above formula is applied n-1 times, given n disjuncts.
For each disjunct which is not a tight constraint, 1/3 is used for that
predicate's selectivity in the formula.

Release note (bug fix): This patch fixes incorrect selectivity
estimation for queries with ORed predicates all referencing a
common single table.
mgartner pushed a commit to mgartner/cockroach that referenced this issue Jan 3, 2023
Fixes cockroachdb#75090

Currently, the optimizer only does a good job of estimating the
selectivity of single-table ORed predicates when all disjuncts are tight
constraints. Otherwise we give up and go with the default selectivity of
1/3 for the entire filter. If any of the disjuncts has a selectivity
above 1/3, then the total selectivity should be at least as high as the
selectivity of that disjunct. Any failure to find a tight constraint for
a given disjunct should only cause a selectivity of 1/3 to be used for
that disjunct, not the entire filter.

This is addressed by using the method of combining ORed selectivities
introduced by cockroachdb#74303:
```
   Given predicates A and B and probability function P:

      P(A or B) = P(A) + P(B) - P(A and B)
```
The above formula is applied n-1 times, given n disjuncts.
For each disjunct which is not a tight constraint, 1/3 is used for that
predicate's selectivity in the formula.

Release note (bug fix): This patch fixes incorrect selectivity
estimation for queries with ORed predicates all referencing a
common single table.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-sql-optimizer SQL logical planning and optimizations. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-sql-queries SQL Queries Team
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

1 participant