
schema: drop column with index using hash cascade DML injection failure #111619

Closed
ecwall opened this issue Oct 2, 2023 · 11 comments · Fixed by #120794
Assignees
Labels
A-schema-changes C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. db-cy-23 O-schema-testing Originated from SQL Schema testing strategies P-1 Issues/test failures with a fix SLA of 1 month T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)

Comments

@ecwall
Contributor

ecwall commented Oct 2, 2023

Setup

CREATE TABLE tbl (
	insert_phase_ordinal TEXT NOT NULL, 
	operation_phase_ordinal TEXT NOT NULL,
	operation TEXT NOT NULL,
	val INT,
	PRIMARY KEY (insert_phase_ordinal, operation_phase_ordinal, operation)
);
ALTER TABLE tbl ADD COLUMN i INT NOT NULL DEFAULT 1;
CREATE INDEX idx ON tbl (i) USING HASH;

Schema change

ALTER TABLE tbl DROP COLUMN i CASCADE;

Schema change DML injection failure

dml_injection_test.go:442: [phaseOrdinal=PostCommitPhase:1] error executing query="DELETE FROM tbl WHERE insert_phase_ordinal = $1 AND operation_phase_ordinal = $2 AND operation = $3" args=["StatementPhase:1" "PostCommitPhase:1" "delete"]: pq: column "crdb_internal_column_5_name_placeholder" does not exist

Same underlying bug as #111608

Jira issue: CRDB-31989

Epic CRDB-37763

@ecwall ecwall added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-schema-changes T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) db-cy-23 O-schema-testing Originated from SQL Schema testing strategies labels Oct 2, 2023
@rafiss rafiss self-assigned this Jan 12, 2024
@rafiss rafiss added the P-1 Issues/test failures with a fix SLA of 1 month label Jan 16, 2024
@rafiss rafiss assigned Xiang-Gu and unassigned rafiss Jan 16, 2024
@Xiang-Gu
Contributor

A simpler repro

create table t (i int primary key);
insert into t values (0);
alter table t add column j int default 1;
create index idx on t (j) using hash;

SET CLUSTER SETTING jobs.debug.pausepoints = 'schemachanger.root.$ cockroach demo.0';
set cluster setting jobs.registry.interval.adopt = "2s";
set cluster setting jobs.registry.interval.cancel = "2s";
SET sql_safe_updates = false;

alter table t drop column j cascade;  -- job stops right before starting 1st stage of PostCommitPhase
delete from t where i = 0;  -- ERROR: column "crdb_internal_column_2_name_placeholder" does not exist

The concurrent DML failed at the very beginning of PostCommitPhase (i.e. before executing the first stage of PostCommitPhase). At that point in time, the relevant schema objects are in the following states:

  1. index_7_placeholder (previous "idx"): mutation, WRITE_ONLY
  2. column_2_placeholder (previous "j"): mutation, WRITE_ONLY
  3. column_3_placeholder (previous "shard"): mutation, WRITE_ONLY

From a schema point of view, I didn't see anything particularly wrong with those transitions, so I decided to go a layer up to the optimizer to see where exactly the error is thrown.

That concurrent DELETE, under this descriptor state, runs into the error with the following stacktrace:

github.com/cockroachdb/cockroach/pkg/sql/catalog/colinfo.NewUndefinedColumnError(column_resolver.go:196)
github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*scope).FindSourceProvidingColumn(scope.go:853)
github.com/cockroachdb/cockroach/pkg/sql/catalog/colinfo.ResolveColumnItem(column_item_resolver.go:115)
github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*scope).VisitPre(scope.go:1032)
github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*scope).VisitPre(scope.go:1029)
github.com/cockroachdb/cockroach/pkg/sql/sem/tree.WalkExpr(walk.go:834)
github.com/cockroachdb/cockroach/pkg/sql/sem/tree.walkExprSlice(walk.go:625)
github.com/cockroachdb/cockroach/pkg/sql/sem/tree.(*FuncExpr).Walk(walk.go:303)
github.com/cockroachdb/cockroach/pkg/sql/sem/tree.WalkExpr(walk.go:837)
github.com/cockroachdb/cockroach/pkg/sql/sem/tree.walkExprSlice(walk.go:625)
github.com/cockroachdb/cockroach/pkg/sql/sem/tree.(*FuncExpr).Walk(walk.go:303)
github.com/cockroachdb/cockroach/pkg/sql/sem/tree.WalkExpr(walk.go:837)
github.com/cockroachdb/cockroach/pkg/sql/sem/tree.walkExprSlice(walk.go:625)
github.com/cockroachdb/cockroach/pkg/sql/sem/tree.(*FuncExpr).Walk(walk.go:303)
github.com/cockroachdb/cockroach/pkg/sql/sem/tree.WalkExpr(walk.go:837)
github.com/cockroachdb/cockroach/pkg/sql/sem/tree.walkExprSlice(walk.go:625)
github.com/cockroachdb/cockroach/pkg/sql/sem/tree.(*FuncExpr).Walk(walk.go:303)
github.com/cockroachdb/cockroach/pkg/sql/sem/tree.WalkExpr(walk.go:837)
github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*scope).walkExprTree(scope.go:433)
github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*scope).resolveAndRequireType(scope.go:491)
github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).addComputedColsForTable(select.go:936)
github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).buildScan(select.go:767)
github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*mutationBuilder).buildInputForDelete(mutation_builder.go:410)
github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).buildDelete(delete.go:93)
github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).buildStmt.func1(builder.go:353)
github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).processWiths(with.go:116)
github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).buildStmt(builder.go:352)
github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).buildStmtAtRootWithScope(builder.go:290)
github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).buildStmtAtRoot(builder.go:271)
github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).Build(builder.go:250)
github.com/cockroachdb/cockroach/pkg/sql.(*optPlanningCtx).buildExecMemo(plan_opt.go:588)
github.com/cockroachdb/cockroach/pkg/sql.(*planner).makeOptimizerPlan(plan_opt.go:245)
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).makeExecPlan(conn_executor_exec.go:2275)
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).dispatchToExecutionEngine(conn_executor_exec.go:1744)
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmtInOpenState(conn_executor_exec.go:1108)
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmt.func1(conn_executor_exec.go:146)
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execWithProfiling(conn_executor_exec.go:3375)
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmt(conn_executor_exec.go:145)
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execCmd.func1(conn_executor.go:2225)
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execCmd(conn_executor.go:2230)
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).run(conn_executor.go:2147)
github.com/cockroachdb/cockroach/pkg/sql.(*Server).ServeConn(conn_executor.go:953)
github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).processCommands(conn.go:247)
github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*Server).serveImpl.func3(server.go:1001)
runtime.goexit(asm_arm64.s:1197)
github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*Server).serveImpl(server.go:998)

My understanding is that addComputedColsForTable is processing the "column_3_placeholder" computed column (whose ComputeExpr is mod(fnv32(md5(crdb_internal.datums_to_bytes(crdb_internal_column_2_name_placeholder))), ...)), and the code goes ahead to resolve crdb_internal_column_2_name_placeholder by name within the *scope. But this scope only contains "normal" (i.e. non-mutation) columns (via tableScope.appendOrdinaryColumnsFromTable(tabMeta, &tabMeta.Alias)). Therefore, we end up with a column-not-found error.
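To make the failure mode concrete, here is a minimal Python model of the scope resolution; the class and method names (Scope, resolve, UndefinedColumnError) are hypothetical and this is not CockroachDB code:

```python
# Minimal Python model of the optbuilder failure; names here are hypothetical,
# not CockroachDB code.
class UndefinedColumnError(Exception):
    pass

class Column:
    def __init__(self, name: str, public: bool):
        self.name = name
        self.public = public

class Scope:
    """Models tableScope: appendOrdinaryColumnsFromTable only adds ordinary
    (non-mutation) columns, so mutation columns are invisible here."""
    def __init__(self, columns):
        self.cols = {c.name: c for c in columns if c.public}

    def resolve(self, name: str) -> Column:
        if name not in self.cols:
            raise UndefinedColumnError(f'column "{name}" does not exist')
        return self.cols[name]

scope = Scope([
    Column("i", public=True),                                         # primary key
    Column("crdb_internal_column_2_name_placeholder", public=False),  # "j", WRITE_ONLY
    Column("crdb_internal_column_3_name_placeholder", public=False),  # shard, WRITE_ONLY
])

# addComputedColsForTable builds the shard column's compute expression, which
# references the dropped column by its placeholder name -- and resolution fails:
try:
    scope.resolve("crdb_internal_column_2_name_placeholder")
except UndefinedColumnError as e:
    print(e)  # column "crdb_internal_column_2_name_placeholder" does not exist
```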

@mgartner Can I ask you to chime in with your thoughts? In particular, I'd like to hear your take on a few questions:

  1. What is the general purpose of addComputedColsForTable, and why does it include columns undergoing mutations?
  2. Do we want to change it here to also include "mutation columns" in the scope before resolving columns by name in a computed expression?

@mgartner
Collaborator

This looks similar to the issue in #79613. See also #79691, #83593, and #97368.

These states are invalid, at least according to the optimizer's interpretation of WRITE_ONLY:

index_7_placeholder (previous "idx"): mutation, WRITE_ONLY
column_2_placeholder (previous "j"): mutation, WRITE_ONLY
column_3_placeholder (previous "shard"): mutation, WRITE_ONLY

The optimizer must plan to delete entries in the WRITE_ONLY index. In order to determine the keys to delete, it needs to read j and compute the value of shard based on j. But those columns are WRITE_ONLY and therefore cannot be read.
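For illustration, here is a sketch of how a hash-sharded index derives its shard key from j. The 16-bucket count and the FNV-1a stand-in for the fnv32 builtin are assumptions for the sketch, not CockroachDB's exact implementation:

```python
import hashlib

# Illustrative sketch of a hash-sharded index's shard computation.
# Assumptions: 16 buckets (a common default for USING HASH) and FNV-1a as a
# stand-in for crdb's fnv32 builtin -- not CockroachDB's exact implementation.
BUCKET_COUNT = 16

def fnv32(data: bytes) -> int:
    """32-bit FNV-1a hash over the input bytes."""
    h = 0x811C9DC5
    for b in data:
        h ^= b
        h = (h * 0x01000193) & 0xFFFFFFFF
    return h

def shard_for(j: int) -> int:
    # Mirrors the shape of mod(fnv32(md5(crdb_internal.datums_to_bytes(j))), N).
    digest = hashlib.md5(str(j).encode()).hexdigest().encode()
    return fnv32(digest) % BUCKET_COUNT

# The point: to delete a row's entry from the sharded index, the shard key
# must be recomputed from j -- i.e. j must be READ, even though the schema
# change has already marked it WRITE_ONLY.
shard = shard_for(1)
assert 0 <= shard < BUCKET_COUNT
```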

I'll note that, based on the same reasoning, a non-sharded index on a column with the following state should also be invalid:

index_7_placeholder (previous "idx"): mutation, WRITE_ONLY
column_2_placeholder (previous "j"): mutation, WRITE_ONLY

AFAICT this won't cause an internal error, but it might be the source of correctness problems.

However, when this issue has come up before (see the issues I linked above), there was skepticism about whether or not reads can be made internally by the system on WRITE_ONLY columns and indexes. The optimizer currently assumes that absolutely no reads should be made (though maybe that's not true for the non-sharded index example I gave). Should this be loosened to: no reads of WRITE_ONLY columns and indexes should be visible to the user? I think we need to come to a confident answer on this. I didn't see an explicit mention of this situation in the online schema change RFC or the F1 paper it is based on.

@mgartner
Collaborator

#59233 is highly relevant and seems to have figured out most of these questions. TL;DR: columns that have been fully backfilled and are safe for reading by internal operations, but that should not be accessible to the user, should be "ordinary" "inaccessible" columns.

@rimadeodhar
Collaborator

The solution outlined in #59233 is not applicable here. That solution addresses situations where certain columns need to be inaccessible to user queries but can be internally considered readable. The most notable point, though, is that such columns don't have a WRITE_ONLY mutation on them.
All columns in the process of being dropped will temporarily have a WRITE_ONLY mutation, followed by DELETE_ONLY, followed by actual deletion. However, during the WRITE_ONLY and DELETE_ONLY phases, we should still continue to support write and delete operations on those columns as per our schema deletion paradigm (outlined in the RFC and F1 paper). This means the original question posed by Marcus still remains unresolved: what do we do about internal reads for columns in the WRITE_ONLY and DELETE_ONLY phases? Within the context of this particular issue, I believe we need to allow this access. See my next comment for more details.

@rimadeodhar
Collaborator

This particular issue seems to have started occurring after the changes in #106782. In that PR, we allow computation of virtual columns that are under mutation (i.e. in the process of being added/dropped) within addComputedColsForTable. While this won't cause issues as long as the referenced column is public, it will cause trouble for columns that are also in the process of being dropped, as outlined in this issue and #111608. I believe this code is incomplete. If we are allowing evaluation of computed columns under mutation, and the referenced column is also being dropped and is at the same mutation level or higher, then we can reference the column safely (i.e. the column will only contain actual data and will not be in a state of backfill) for the purpose of evaluation. See the cases outlined below for a "proof" of this statement:

CASE 1: The referenced column is public. As long as the computed column is DELETE_ONLY, WRITE_ONLY or public, we can read from the referenced column.
The above assumption holds true trivially and this is what is being done today.

CASE 2: The referenced column is WRITE_ONLY. In this case, as long as the computed column is WRITE_ONLY or DELETE_ONLY, we can safely read the referenced column for the purpose of column computation within addComputedColsForTable.
Due to this safeguard, this is only possible when a column, and thereby the computed column, is in the process of being dropped. In such cases, the data contained within the WRITE_ONLY column is complete (i.e. it is not a new column being backfilled) and we can reference the column for computation.

CASE 3: The referenced column is DELETE_ONLY. In this case, as long as the computed column is also DELETE_ONLY, we can read the referenced column. As in Case 2, this will only occur on a column being dropped. This indicates that the data contained within the column is committed and complete and can be read for the purpose of deletion operations, which are the only operations allowed within the DELETE_ONLY phase.
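The three cases above boil down to a small compatibility predicate, sketched here in Python (an illustrative model, not crdb code; the state ordering is by how far along a drop has progressed):

```python
# The cases above as a compatibility predicate (illustrative sketch, not crdb
# code). States are ordered by how far along a drop has progressed.
PUBLIC, WRITE_ONLY, DELETE_ONLY = 0, 1, 2

def can_read_referenced(referenced: int, computed: int) -> bool:
    """A computed column may read its referenced column only when the computed
    column's own drop is at the same mutation level as the referenced
    column's, or further along."""
    return referenced <= computed

# CASE 1: referenced column public -> readable whatever the computed state.
assert can_read_referenced(PUBLIC, PUBLIC)
assert can_read_referenced(PUBLIC, WRITE_ONLY)
assert can_read_referenced(PUBLIC, DELETE_ONLY)
# CASE 2: referenced WRITE_ONLY -> computed must be WRITE_ONLY or DELETE_ONLY.
assert can_read_referenced(WRITE_ONLY, WRITE_ONLY)
assert can_read_referenced(WRITE_ONLY, DELETE_ONLY)
assert not can_read_referenced(WRITE_ONLY, PUBLIC)
# CASE 3: referenced DELETE_ONLY -> computed must also be DELETE_ONLY.
assert can_read_referenced(DELETE_ONLY, DELETE_ONLY)
assert not can_read_referenced(DELETE_ONLY, WRITE_ONLY)
```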

This should also work as expected for rollbacks, as we will go through similar steps and only read from the referenced column under mutation when it contains actual data, by enforcing the above rules. cc @fqazi to confirm my above assumptions.

This involves changing a fairly deep-seated assumption within the queries code. I would like someone from the @cockroachdb/sql-queries team to opine on the above. If we are in agreement, we can look at making this change in a safe manner. We can start by limiting this change to allow reads only for virtual computed columns under mutation. I have intentionally avoided going into more cases where we may want such internal reads to be permitted. I believe there must be more cases, but for now, we can start here.

@fqazi
Collaborator

fqazi commented Mar 12, 2024

@rimadeodhar The assumptions above all make sense. I think as long as we keep both in lockstep in terms of the states, we should be safe.

@mgartner
Collaborator

@rimadeodhar and I discussed this a bit more just now. Some notes on new things we discussed:

  • DROP INDEX and DROP CONSTRAINT seem fundamentally different because we don't perform writes to a constraint.
    • But is this true for unique indexes which are both an index and type of constraint?
  • I think we should be able to implement DROP INDEX with inaccessible columns without violating the invariant that we should not read from WRITE_ONLY columns. I think the optimizer already provides all the tools necessary to do this.
  • DROP CONSTRAINT may need some optimizer changes for UPDATE to avoid reading WRITE_ONLY columns. I need to think about this more.

@rimadeodhar
Collaborator

rimadeodhar commented Mar 13, 2024

I think the optimizer already provides all the tools necessary to do this.

Documenting here that we'll need to update the optimizer to allow inaccessible columns to be read for internal operations for this.

@rimadeodhar
Collaborator

I dug into this further, and it looks like we can resolve this by setting up the schema dependency rules without any optimizer updates.
The schema deprules should be set up such that we drop the constraint and the virtual, computed columns (transition them to ABSENT) before we move the column to WRITE_ONLY. This allows the column to continue to be accessed until all dependencies are fully dropped. We can leverage the PostCommitNonRevertible stage within the declarative schema changer to move the column to WRITE_ONLY as soon as the dependencies are fully dropped. This ensures that we won't have any data written to the column that violates the constraint, even in the presence of rollbacks (as the column is inaccessible to users as soon as the constraint is dropped).

We can leverage the same principle to fix #111608 and #118314.

rimadeodhar added a commit to rimadeodhar/cockroach that referenced this issue Mar 20, 2024
This PR fixes the new schema changer deprules for
dropping virtual computed columns which are also
used for hash and expression indexes.
Currently, the optimizer allows for virtual, computed
columns to be evaluated even when under mutation.
However, this causes concurrent DML issues when the
schemachanger job is running as the column that the
virtual computed column depends on moves into WRITE_ONLY
stage prior to the computed column being dropped.
As a result, the optimizer is unable to access the column
for evaluating the compute expression.
This PR updates the dep rules to ensure the virtual,
computed column is dropped before the dependent column
moves to WRITE_ONLY ensuring that the compute expression
can be enforced correctly for concurrent DML during all
stages of the schema change.

Epic: none
Fixes: cockroachdb#111608
Fixes: cockroachdb#111619
Release note: None
rimadeodhar added a commit to rimadeodhar/cockroach that referenced this issue Apr 26, 2024
craig bot pushed a commit that referenced this issue May 2, 2024
120794: scplan: Fix deprules for dropping computed columns r=rimadeodhar a=rimadeodhar


------------------------------------------------------------------
**Note for reviewers:** This PR is stacked on top of #120792.

Co-authored-by: rimadeodhar <rima@cockroachlabs.com>
@craig craig bot closed this as completed in 47aa2d5 May 2, 2024
blathers-crl bot pushed a commit that referenced this issue May 3, 2024
@rimadeodhar
Collaborator

Re-opening to track backport to 24.1

@rimadeodhar rimadeodhar reopened this May 3, 2024
rimadeodhar added a commit to rimadeodhar/cockroach that referenced this issue May 6, 2024
@rimadeodhar
Collaborator

Closing now that the backport is merged.
