
schema: drop column with index using hash cascade DML injection failure #111619

Closed
ecwall opened this issue Oct 2, 2023 · 11 comments · Fixed by #120794
Assignees
Labels
A-schema-changes C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. db-cy-23 O-schema-testing Originated from SQL Schema testing strategies P-1 Issues/test failures with a fix SLA of 1 month T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)

Comments

@ecwall
Contributor

ecwall commented Oct 2, 2023

Setup

CREATE TABLE tbl (
	insert_phase_ordinal TEXT NOT NULL, 
	operation_phase_ordinal TEXT NOT NULL,
	operation TEXT NOT NULL,
	val INT,
	PRIMARY KEY (insert_phase_ordinal, operation_phase_ordinal, operation)
);
ALTER TABLE tbl ADD COLUMN i INT NOT NULL DEFAULT 1;
CREATE INDEX idx ON tbl (i) USING HASH;

Schema change

ALTER TABLE tbl DROP COLUMN i CASCADE;

Schema change DML injection failure

dml_injection_test.go:442: [phaseOrdinal=PostCommitPhase:1] error executing query="DELETE FROM tbl WHERE insert_phase_ordinal = $1 AND operation_phase_ordinal = $2 AND operation = $3" args=["StatementPhase:1" "PostCommitPhase:1" "delete"]: pq: column "crdb_internal_column_5_name_placeholder" does not exist

Same underlying bug as #111608

Jira issue: CRDB-31989

Epic CRDB-37763

@ecwall ecwall added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-schema-changes T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) db-cy-23 O-schema-testing Originated from SQL Schema testing strategies labels Oct 2, 2023
@rafiss rafiss self-assigned this Jan 12, 2024
@rafiss rafiss added the P-1 Issues/test failures with a fix SLA of 1 month label Jan 16, 2024
@rafiss rafiss assigned Xiang-Gu and unassigned rafiss Jan 16, 2024
@Xiang-Gu
Contributor

A simpler repro

create table t (i int primary key);
insert into t values (0);
alter table t add column j int default 1;
create index idx on t (j) using hash;

SET CLUSTER SETTING jobs.debug.pausepoints = 'schemachanger.root.$ cockroach demo.0';
set cluster setting jobs.registry.interval.adopt = "2s";
set cluster setting jobs.registry.interval.cancel = "2s";
SET sql_safe_updates = false;

alter table t drop column j cascade;  -- job stops right before starting 1st stage of PostCommitPhase
delete from t where i = 0;  -- ERROR: column "crdb_internal_column_2_name_placeholder" does not exist

The concurrent DML failed at the very beginning of PostCommitPhase (i.e. before executing the first stage of PostCommitPhase). At that point in time, the relevant schema objects are in the following states:

  1. index_7_placeholder (previous "idx"): mutation, WRITE_ONLY
  2. column_2_placeholder (previous "j"): mutation, WRITE_ONLY
  3. column_3_placeholder (previous "shard"): mutation, WRITE_ONLY

From a schema point of view, I didn't see anything particularly wrong with those transitions, so I decided to go a layer up to the optimizer to see where exactly the error is thrown.

That concurrent DELETE, under this descriptor state, runs into the error with the following stacktrace:

github.com/cockroachdb/cockroach/pkg/sql/catalog/colinfo.NewUndefinedColumnError(column_resolver.go:196)
github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*scope).FindSourceProvidingColumn(scope.go:853)
github.com/cockroachdb/cockroach/pkg/sql/catalog/colinfo.ResolveColumnItem(column_item_resolver.go:115)
github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*scope).VisitPre(scope.go:1032)
github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*scope).VisitPre(scope.go:1029)
github.com/cockroachdb/cockroach/pkg/sql/sem/tree.WalkExpr(walk.go:834)
github.com/cockroachdb/cockroach/pkg/sql/sem/tree.walkExprSlice(walk.go:625)
github.com/cockroachdb/cockroach/pkg/sql/sem/tree.(*FuncExpr).Walk(walk.go:303)
github.com/cockroachdb/cockroach/pkg/sql/sem/tree.WalkExpr(walk.go:837)
github.com/cockroachdb/cockroach/pkg/sql/sem/tree.walkExprSlice(walk.go:625)
github.com/cockroachdb/cockroach/pkg/sql/sem/tree.(*FuncExpr).Walk(walk.go:303)
github.com/cockroachdb/cockroach/pkg/sql/sem/tree.WalkExpr(walk.go:837)
github.com/cockroachdb/cockroach/pkg/sql/sem/tree.walkExprSlice(walk.go:625)
github.com/cockroachdb/cockroach/pkg/sql/sem/tree.(*FuncExpr).Walk(walk.go:303)
github.com/cockroachdb/cockroach/pkg/sql/sem/tree.WalkExpr(walk.go:837)
github.com/cockroachdb/cockroach/pkg/sql/sem/tree.walkExprSlice(walk.go:625)
github.com/cockroachdb/cockroach/pkg/sql/sem/tree.(*FuncExpr).Walk(walk.go:303)
github.com/cockroachdb/cockroach/pkg/sql/sem/tree.WalkExpr(walk.go:837)
github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*scope).walkExprTree(scope.go:433)
github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*scope).resolveAndRequireType(scope.go:491)
github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).addComputedColsForTable(select.go:936)
github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).buildScan(select.go:767)
github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*mutationBuilder).buildInputForDelete(mutation_builder.go:410)
github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).buildDelete(delete.go:93)
github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).buildStmt.func1(builder.go:353)
github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).processWiths(with.go:116)
github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).buildStmt(builder.go:352)
github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).buildStmtAtRootWithScope(builder.go:290)
github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).buildStmtAtRoot(builder.go:271)
github.com/cockroachdb/cockroach/pkg/sql/opt/optbuilder.(*Builder).Build(builder.go:250)
github.com/cockroachdb/cockroach/pkg/sql.(*optPlanningCtx).buildExecMemo(plan_opt.go:588)
github.com/cockroachdb/cockroach/pkg/sql.(*planner).makeOptimizerPlan(plan_opt.go:245)
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).makeExecPlan(conn_executor_exec.go:2275)
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).dispatchToExecutionEngine(conn_executor_exec.go:1744)
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmtInOpenState(conn_executor_exec.go:1108)
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmt.func1(conn_executor_exec.go:146)
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execWithProfiling(conn_executor_exec.go:3375)
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmt(conn_executor_exec.go:145)
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execCmd.func1(conn_executor.go:2225)
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execCmd(conn_executor.go:2230)
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).run(conn_executor.go:2147)
github.com/cockroachdb/cockroach/pkg/sql.(*Server).ServeConn(conn_executor.go:953)
github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).processCommands(conn.go:247)
github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*Server).serveImpl.func3(server.go:1001)
runtime.goexit(asm_arm64.s:1197)
github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*Server).serveImpl(server.go:998)

My understanding is that addComputedColsForTable is processing the "column_3_placeholder" computed column (whose ComputeExpr is mod(fnv32(md5(crdb_internal.datums_to_bytes(crdb_internal_column_2_name_placeholder))), ...)), and the code goes ahead to resolve crdb_internal_column_2_name_placeholder by name within the *scope. But this scope only contains "normal" (i.e. non-mutation) columns (via tableScope.appendOrdinaryColumnsFromTable(tabMeta, &tabMeta.Alias)). Therefore, we end up with a column-not-found error.
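To make the failure mode concrete, here is a minimal Python model of the scope resolution; the class and method names (Scope, resolve, UndefinedColumnError) are hypothetical and this is not CockroachDB code:

```python
# Minimal Python model of the optbuilder failure; names here are hypothetical,
# not CockroachDB code.
class UndefinedColumnError(Exception):
    pass

class Column:
    def __init__(self, name: str, public: bool):
        self.name = name
        self.public = public

class Scope:
    """Models tableScope: appendOrdinaryColumnsFromTable only adds ordinary
    (non-mutation) columns, so mutation columns are invisible here."""
    def __init__(self, columns):
        self.cols = {c.name: c for c in columns if c.public}

    def resolve(self, name: str) -> Column:
        if name not in self.cols:
            raise UndefinedColumnError(f'column "{name}" does not exist')
        return self.cols[name]

scope = Scope([
    Column("i", public=True),                                         # primary key
    Column("crdb_internal_column_2_name_placeholder", public=False),  # "j", WRITE_ONLY
    Column("crdb_internal_column_3_name_placeholder", public=False),  # shard, WRITE_ONLY
])

# addComputedColsForTable builds the shard column's compute expression, which
# references the dropped column by its placeholder name -- and resolution fails:
try:
    scope.resolve("crdb_internal_column_2_name_placeholder")
except UndefinedColumnError as e:
    print(e)  # column "crdb_internal_column_2_name_placeholder" does not exist
```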

@mgartner Can I ask you to chime in with your thoughts? In particular, I'd like to hear your take on a few questions:

  1. What is the general purpose of addComputedColsForTable, and why does it include columns undergoing mutations?
  2. Do we want to change it here to also include "mutation columns" in the scope before resolving columns by name in a computed expression?

@mgartner
Collaborator

This looks similar to the issue in #79613. See also #79691, #83593, and #97368.

These states are invalid, at least according to the optimizer's interpretation of WRITE_ONLY:

index_7_placeholder (previous "idx"): mutation, WRITE_ONLY
column_2_placeholder (previous "j"): mutation, WRITE_ONLY
column_3_placeholder (previous "shard"): mutation, WRITE_ONLY

The optimizer must plan to delete entries in the WRITE_ONLY index. In order to determine the keys to delete, it needs to read j and compute the value of shard based on j. But those columns are WRITE_ONLY and therefore cannot be read.
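For illustration, here is a sketch of how a hash-sharded index derives its shard key from j. The 16-bucket count and the FNV-1a stand-in for the fnv32 builtin are assumptions for the sketch, not CockroachDB's exact implementation:

```python
import hashlib

# Illustrative sketch of a hash-sharded index's shard computation.
# Assumptions: 16 buckets (a common default for USING HASH) and FNV-1a as a
# stand-in for crdb's fnv32 builtin -- not CockroachDB's exact implementation.
BUCKET_COUNT = 16

def fnv32(data: bytes) -> int:
    """32-bit FNV-1a hash over the input bytes."""
    h = 0x811C9DC5
    for b in data:
        h ^= b
        h = (h * 0x01000193) & 0xFFFFFFFF
    return h

def shard_for(j: int) -> int:
    # Mirrors the shape of mod(fnv32(md5(crdb_internal.datums_to_bytes(j))), N).
    digest = hashlib.md5(str(j).encode()).hexdigest().encode()
    return fnv32(digest) % BUCKET_COUNT

# The point: to delete a row's entry from the sharded index, the shard key
# must be recomputed from j -- i.e. j must be READ, even though the schema
# change has already marked it WRITE_ONLY.
shard = shard_for(1)
assert 0 <= shard < BUCKET_COUNT
```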

I'll note that, based on the same reasoning, a non-sharded index on a column with the following state should also be invalid:

index_7_placeholder (previous "idx"): mutation, WRITE_ONLY
column_2_placeholder (previous "j"): mutation, WRITE_ONLY

AFAICT this won't cause an internal error, but it might be the source of correctness problems.

However, when this issue has come up before (see the issues I linked above), there was skepticism about whether or not reads can be made internally by the system on WRITE_ONLY columns and indexes. The optimizer currently assumes that absolutely no reads should be made (though maybe that's not true for the non-sharded index example I gave). Should this be loosened to: no reads of WRITE_ONLY columns and indexes should be visible to the user? I think we need to come to a confident answer on this. I didn't see an explicit mention of this situation in the online schema change RFC or the F1 paper it is based on.

@mgartner
Collaborator

#59233 is highly relevant and seems to have figured out most of these questions. TL;DR: columns that have been fully backfilled and are safe for reading by internal operations, but that should not be accessible to the user, should be "ordinary" "inaccessible" columns.

@rimadeodhar
Collaborator

The solution outlined in #59233 is not applicable here. That solution addresses situations where certain columns need to be inaccessible to user queries but can be internally considered readable. The most notable point, though, is that such columns don't have a WRITE_ONLY mutation on them.
All columns in the process of being dropped will temporarily have a WRITE_ONLY mutation, followed by DELETE_ONLY, followed by actual deletion. However, during the WRITE_ONLY and DELETE_ONLY phases, we should still continue to support write and delete operations on those columns as per our schema deletion paradigm (outlined in the RFC and F1 paper). This means the original question posed by Marcus still remains unresolved: what do we do about internal reads for columns in the WRITE_ONLY and DELETE_ONLY phases? Within the context of this particular issue, I believe we need to allow this access. See my next comment for more details.

@rimadeodhar
Collaborator

This particular issue seems to have started occurring after the changes in #106782. In that PR, we allow computation of virtual columns that are under mutation (i.e. in the process of being added/dropped) within addComputedColsForTable. While this won't cause issues as long as the referenced column is public, it will cause trouble for columns that are also in the process of being dropped, as outlined in this issue and #111608. I believe this code is incomplete. If we are allowing evaluation of computed columns under mutation, and the referenced column is also being dropped and is at the same mutation level or higher, then we can reference the column safely (i.e. the column will only contain actual data and will not be in a state of backfill) for the purpose of evaluation. See the cases outlined below for a "proof" of this statement:

CASE 1: The referenced column is public. As long as the computed column is DELETE_ONLY, WRITE_ONLY or public, we can read from the referenced column.
The above assumption holds true trivially and this is what is being done today.

CASE 2: The referenced column is WRITE_ONLY. In this case, as long as the computed column is WRITE_ONLY or DELETE_ONLY, we can safely read the referenced column for the purpose of column computation within addComputedColsForTable.
Due to this safeguard, this is only possible when a column, and thereby the computed column, is in the process of being dropped. In such cases, the data contained within the WRITE_ONLY column is complete (i.e. it is not a new column being backfilled) and we can reference the column for computation.

CASE 3: The referenced column is DELETE_ONLY. In this case, as long as the computed column is also DELETE_ONLY, we can read the referenced column. As in Case 2, this will only occur on a column being dropped. This indicates that the data contained within the column is committed and complete and can be read for the purpose of deletion operations, which are the only operations allowed within the DELETE_ONLY phase.
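The three cases above boil down to a small compatibility predicate, sketched here in Python (an illustrative model, not crdb code; the state ordering is by how far along a drop has progressed):

```python
# The cases above as a compatibility predicate (illustrative sketch, not crdb
# code). States are ordered by how far along a drop has progressed.
PUBLIC, WRITE_ONLY, DELETE_ONLY = 0, 1, 2

def can_read_referenced(referenced: int, computed: int) -> bool:
    """A computed column may read its referenced column only when the computed
    column's own drop is at the same mutation level as the referenced
    column's, or further along."""
    return referenced <= computed

# CASE 1: referenced column public -> readable whatever the computed state.
assert can_read_referenced(PUBLIC, PUBLIC)
assert can_read_referenced(PUBLIC, WRITE_ONLY)
assert can_read_referenced(PUBLIC, DELETE_ONLY)
# CASE 2: referenced WRITE_ONLY -> computed must be WRITE_ONLY or DELETE_ONLY.
assert can_read_referenced(WRITE_ONLY, WRITE_ONLY)
assert can_read_referenced(WRITE_ONLY, DELETE_ONLY)
assert not can_read_referenced(WRITE_ONLY, PUBLIC)
# CASE 3: referenced DELETE_ONLY -> computed must also be DELETE_ONLY.
assert can_read_referenced(DELETE_ONLY, DELETE_ONLY)
assert not can_read_referenced(DELETE_ONLY, WRITE_ONLY)
```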

This should also work as expected for rollbacks, as we will go through similar steps and only read from the referenced column under mutation when it contains actual data, by enforcing the above rules. cc @fqazi to confirm my above assumptions.

This involves changing a fairly deep-seated assumption within the queries code. I would like someone from the @cockroachdb/sql-queries team to opine on the above. If we are in agreement, we can look at making this change in a safe manner. We can start by limiting this change to allow reads only for virtual computed columns under mutation. I have intentionally avoided going into more cases where we may want such internal reads to be permitted. I believe there must be more cases, but for now, we can start here.

@fqazi
Collaborator

fqazi commented Mar 12, 2024

@rimadeodhar The assumptions above all make sense. I think as long as we keep both in lockstep in terms of the states, we should be safe.

@mgartner
Collaborator

@rimadeodhar and I discussed this a bit more just now. Some notes on new things we discussed:

  • DROP INDEX and DROP CONSTRAINT seem fundamentally different because we don't perform writes to a constraint.
    • But is this true for unique indexes which are both an index and type of constraint?
  • I think we should be able to implement DROP INDEX with inaccessible columns without violating the invariant that we should not read from WRITE_ONLY columns. I think the optimizer already provides all the tools necessary to do this.
  • DROP CONSTRAINT may need some optimizer changes for UPDATE to avoid reading WRITE_ONLY columns. I need to think about this more.

@rimadeodhar
Collaborator

rimadeodhar commented Mar 13, 2024

I think the optimizer already provides all the tools necessary to do this.

Documenting here that we'll need to update the optimizer to allow inaccessible columns to be read for internal operations for this.

@rimadeodhar
Collaborator

I dug into this further, and it looks like we can resolve this by setting up the schema dependency rules without any optimizer updates.
The schema deprules should be set up such that we drop the constraint and the virtual, computed columns (transition them to ABSENT) before we move the column to WRITE_ONLY. This allows the column to continue to be accessed until all dependencies are fully dropped. We can leverage the PostCommitNonRevertible stage within the declarative schema changer to move the column to WRITE_ONLY as soon as the dependencies are fully dropped. This ensures that we won't have any data written to the column that violates the constraint, even in the presence of rollbacks (as the column is inaccessible to users as soon as the constraint is dropped).

We can leverage the same principle to fix #111608 and #118314.

rimadeodhar added a commit to rimadeodhar/cockroach that referenced this issue Mar 20, 2024
This PR fixes the new schema changer deprules for
dropping virtual computed columns which are also
used for hash and expression indexes.
Currently, the optimizer allows for virtual, computed
columns to be evaluated even when under mutation.
However, this causes concurrent DML issues when the
schemachanger job is running as the column that the
virtual computed column depends on moves into WRITE_ONLY
stage prior to the computed column being dropped.
As a result, the optimizer is unable to access the column
for evaluating the compute expression.
This PR updates the dep rules to ensure the virtual,
computed column is dropped before the dependent column
moves to WRITE_ONLY ensuring that the compute expression
can be enforced correctly for concurrent DML during all
stages of the schema change.

Epic: none
Fixes: cockroachdb#111608
Fixes: cockroachdb#111619
Release note: None
rimadeodhar added a commit to rimadeodhar/cockroach that referenced this issue Apr 26, 2024
craig bot pushed a commit that referenced this issue May 2, 2024
120794: scplan: Fix deprules for dropping computed columns r=rimadeodhar a=rimadeodhar


------------------------------------------------------------------
**Note for reviewers:** This PR is stacked on top of #120792.

Co-authored-by: rimadeodhar <rima@cockroachlabs.com>
@craig craig bot closed this as completed in 47aa2d5 May 2, 2024
blathers-crl bot pushed a commit that referenced this issue May 3, 2024
@rimadeodhar
Collaborator

Re-opening to track backport to 24.1

@rimadeodhar rimadeodhar reopened this May 3, 2024
rimadeodhar added a commit to rimadeodhar/cockroach that referenced this issue May 6, 2024
@rimadeodhar
Collaborator

Closing now that the backport is merged.
