Better selectivity estimates for ORed predicates #75090

msirek · 2022-01-18T18:16:56Z

Description
(#74303) Adds support for combining selectivities of ORed equijoin predicates with an industry-standard approach using the disjunction rule:

Given predicates A and B and probability function P:
P(A or B) = P(A) + P(B) - P(A and B)

This issue is opened to also apply the rule to single-table filter predicates on different columns for both Selects and Joins, e.g.,

col1 = 1 OR col2 = 1

Selectivities of predicates on the same column should be additive if the ranges are non-overlapping, e.g.,

col1 < 1 OR col1 > 15

Describe alternatives you've considered
Multicolumn stats could be used in the future to further improve estimates by getting a more accurate result for P(A and B).

Jira issue: CRDB-12458

Fixes cockroachdb#75090 Currently, the optimizer only does a good job of estimating the selectivity of ORed predicates when each disjunct is a tight constraint because we give up and go with the default selectivity of 1/3 in that case. If any of the disjuncts has a selectivity above 1/3, then the total selectivity should be at least as high as the selectivity of that disjunct. This is addressed by using the method of combining ORed selectivities introduced by cockroachdb#74303: ``` Given predicates A and B and probability function P: P(A or B) = P(A) + P(B) - P(A and B) ``` The above formula is applied n-1 times, given n disjuncts. For each disjunct which is not a tight constraint, 1/3 is used for that predicate's selectivity in the formula. Release note (bug fix): This patch fixes incorrect selectivity estimation for queries with ORed predicates all referencing a common single table.

msirek · 2022-10-05T01:25:20Z

Started working on this here: #89358

Fixes cockroachdb#75090 Currently, the optimizer only does a good job of estimating the selectivity of single-table ORed predicates when all disjuncts are tight constraints. Otherwise we give up and go with the default selectivity of 1/3 for the entire filter. If any of the disjuncts has a selectivity above 1/3, then the total selectivity should be at least as high as the selectivity of that disjunct. Any failure to find a tight constraint for a given disjunct should only cause a selectivity of 1/3 to be used for that disjunct, not the entire filter. This is addressed by using the method of combining ORed selectivities introduced by cockroachdb#74303: ``` Given predicates A and B and probability function P: P(A or B) = P(A) + P(B) - P(A and B) ``` The above formula is applied n-1 times, given n disjuncts. For each disjunct which is not a tight constraint, 1/3 is used for that predicate's selectivity in the formula. Release note (bug fix): This patch fixes incorrect selectivity estimation for queries with ORed predicates all referencing a common single table.

89358: memo: better selectivity estimates on single-table ORed predicates r=rytaft,michae2 a=msirek Fixes #75090 Currently, the optimizer only does a good job of estimating the selectivity of single-table ORed predicates when all disjuncts are tight constraints. Otherwise we give up and go with the default selectivity of 1/3 for the entire filter. If any of the disjuncts has a selectivity above 1/3, then the total selectivity should be at least as high as the selectivity of that disjunct. Any failure to find a tight constraint for a given disjunct should only cause a selectivity of 1/3 to be used for that disjunct, not the entire filter. This is addressed by using the method of combining ORed selectivities introduced by #74303: ``` Given predicates A and B and probability function P: P(A or B) = P(A) + P(B) - P(A and B) ``` The above formula is applied n-1 times, given n disjuncts. For each disjunct which is not a tight constraint, 1/3 is used for that predicate's selectivity in the formula. Release note (bug fix): This patch fixes incorrect selectivity estimation for queries with ORed predicates all referencing a common single table. 91040: cloud/amazon: support external ID in AWS assume role r=rhu713 a=rhu713 Support passing in the optional external ID when assuming a role. This is done by extending the values of the comma-separated string value of the ASSUME_ROLE parameter to the format `<role>;external_id=<id>`. Users can still use the previous format of just `<role>` to specify a role without any external ID. When using role chaining, each role in the chain can be associated with a different external ID. For example: `ASSUME_ROLE=<roleA>;external_id=<idA>,<roleB>;external_id=<idB>,<roleC>` will use external ID `<idA>` to assume delegate `<roleA>`, then use external ID `<idB>` to assume delegate `<roleB>`, and finally no external ID to assume the final role `<roleC>`. Additional documentation about external IDs can be found here: https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-user_externalid.html Release note (enterprise change): Support passing in the optional external ID when assuming a role. This is done by extending the values of the comma-separated string value of the ASSUME_ROLE parameter to the format `<role>;external_id=<id>`. Users can still use the previous format of just `<role>` to specify a role without any external ID. When using role chaining, each role in the chain can be associated with a different external ID. For example: `ASSUME_ROLE=<roleA>;external_id=<idA>,<roleB>;external_id=<idB>,<roleC>` will use external ID `<idA>` to assume delegate `<roleA>`, then use external ID `<idB>` to assume delegate `<roleB>`, and finally no external ID to assume the final role `<roleC>`. Addresses #90239 92341: changefeedccl: fix flakey tests r=[miretskiy] a=HonoreDB Our test SQL connection doesn't always read its own writes: it can use different connections for different queries. So ``` INSERT INTO foo; feed(CREATE CHANGEFEED FOR foo) ``` is a race condition. This just doesn't matter for most tests because if the insert happens after the initial scan, it creates the same payload. But tests with initial_scan='only' won't see it at all. Fixes #92211 Release note: None 92462: scripts: improve pprof-loop.sh r=andrewbaptist a=tbg Now it also works with the runtime trace endpoint (worked before but printed concerning-looking errors) and with non-blocking endpoints (cumulative profiles such as heap or mutex), for which it previously busy-looped. Epic: None Release note: None 92591: ui: add 99.9th and 99.99th SQL latency charts r=maryliag a=maryliag This commit introduces the charts for: `Service Latency: SQL Statements, 99.99th percentile` and `Service Latency: SQL Statements, 99.9th percentile` on Metrics page under SQL view. Fixes #74247 <img width="806" alt="Screen Shot 2022-11-28 at 12 23 56 PM" src="https://user-images.githubusercontent.com/1017486/204342077-b5509f51-f94e-44be-a2e8-8bd185d12ce6.png"> Release note (ui change): Added charts `Service Latency: SQL Statements, 99.9th percentile` and `Service Latency: SQL Statements, 99.99th percentile` to Metrics page, under SQL view. 92635: tree: fix internal error comparing tuple type with non-tuple type r=rytaft,mgartner a=msirek Fixes #91532 This fixes comparison of an empty tuple expression, ROW() or (), or a non-empty tuple expression with a non-tuple type by returning an error message instead of an internal error. Release note (bug fix): This fixes an internal error when comparing a tuple type with a non-tuple type. 92636: backupccl: add restore corrupt descriptor validation test r=adityamaru a=msbutler This patch adds a test that ensures that restore automatically detects corrupt restoring descriptors via the sql descriptor collection read/write machinery. Fixes #84757 Release note: None 92692: jobs: protected timestamps refresh logic is not transactional r=fqazi a=fqazi Fixes: #92427 Previously, the refresh logic for protected timestamps would fetch jobs with transactions before attempting to update an existing one. Due to a recent change to allow this API to reflect any changes into individual job objects, we no longer do that correctly. This patch fixes the protected timestamp manager to update timestamps in a transactionally, safe manner since multiple updates can be happening concurrently for schema changes. Release note: None Co-authored-by: Mark Sirek <sirek@cockroachlabs.com> Co-authored-by: Rui Hu <rui@cockroachlabs.com> Co-authored-by: Aaron Zinger <zinger@cockroachlabs.com> Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com> Co-authored-by: maryliag <marylia@cockroachlabs.com> Co-authored-by: Michael Butler <butler@cockroachlabs.com> Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com>

Fixes cockroachdb#75090 Currently, the optimizer only does a good job of estimating the selectivity of single-table ORed predicates when all disjuncts are tight constraints. Otherwise we give up and go with the default selectivity of 1/3 for the entire filter. If any of the disjuncts has a selectivity above 1/3, then the total selectivity should be at least as high as the selectivity of that disjunct. Any failure to find a tight constraint for a given disjunct should only cause a selectivity of 1/3 to be used for that disjunct, not the entire filter. This is addressed by using the method of combining ORed selectivities introduced by cockroachdb#74303: ``` Given predicates A and B and probability function P: P(A or B) = P(A) + P(B) - P(A and B) ``` The above formula is applied n-1 times, given n disjuncts. For each disjunct which is not a tight constraint, 1/3 is used for that predicate's selectivity in the formula. Release note (bug fix): This patch fixes incorrect selectivity estimation for queries with ORed predicates all referencing a common single table.

msirek added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Jan 18, 2022

msirek self-assigned this Jan 18, 2022

msirek added the T-sql-queries SQL Queries Team label Jan 21, 2022

msirek added the A-sql-optimizer SQL logical planning and optimizations. label Jan 21, 2022

msirek mentioned this issue Oct 4, 2022

An Issue of Estimated Rows #89161

Closed

msirek mentioned this issue Oct 5, 2022

memo: better selectivity estimates on single-table ORed predicates #89358

Merged

DrewKimball mentioned this issue Nov 17, 2022

Suspicious Estimated Rows by OR #90112

Closed

craig bot closed this as completed in 5290f1f Nov 30, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Better selectivity estimates for ORed predicates #75090

Better selectivity estimates for ORed predicates #75090

msirek commented Jan 18, 2022 •

edited by cockroach-jira-scripts

Loading

msirek commented Oct 5, 2022

Better selectivity estimates for ORed predicates #75090

Better selectivity estimates for ORed predicates #75090

Comments

msirek commented Jan 18, 2022 • edited by cockroach-jira-scripts Loading

msirek commented Oct 5, 2022

msirek commented Jan 18, 2022 •

edited by cockroach-jira-scripts

Loading