fix(source): fix panic for `ALTER SOURCE` with schema registry #17293

xxchan · 2024-06-18T05:28:48Z

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Note that adding columns at the end is not enough to avoid the panic: in fact, only a few cases of ALTER SOURCE don't panic.

The root cause is quite clear: when the schema registry is refreshed, the newly resolved columns still get column IDs starting from 1. So the updated SourceCatalog contains duplicate column IDs.

Why did the protobuf test work previously? In other words, when does it not panic?

Test: https://github.com/risingwavelabs/risingwave/blob/dec1c4f0d8e9400b98888e923f941e9b54d40c3e/e2e_test/schema_registry/alter_sr.slt

It just happens to work: the test proto file has a struct field, whose nested field occupies a column ID, and it works only because two mistakes are combined.

e.g.,

message Foo {
  int32 a = 1;
  Bar bar = 2;
}

message Bar {
  int32 baz = 1;
}

When CREATE SOURCE

In bind_columns_from_source: we get [a:#1, bar:#3 {bar.baz:#2}] according to the schema

Then in bind_create_source: we use col_id_gen to "compact" the IDs, and also add the additional columns. We get [a:#1, bar:#2 {bar.baz:#2}, _rw_kafka_timestamp:#3, _row_id:#0] (note that bar.baz is unchanged; it doesn't matter, but it is strange :)

risingwave/src/frontend/src/handler/create_source.rs

Lines 1390 to 1392 in fbb597f

    
    for c in &mut columns {
        c.column_desc.column_id = col_id_gen.generate(c.name())
    }

Then, when we add field b = 3 to Foo and run ALTER SOURCE, we get:

In refresh_sr_and_get_columns_diff, we only call bind_columns_from_source, and get [a:#1, bar:#3 {bar.baz:#2}, b:#4].

Note that we don't use col_id_gen here!
We calculate added_columns, get b:#4, and simply append it to the original columns.

Note that we don't compare hidden columns here (_rw_kafka_timestamp)!

So there are many edge cases that make it fail or crash:

If we INCLUDE timestamp to make _rw_kafka_timestamp not hidden, ALTER SOURCE src_user REFRESH SCHEMA will fail with: this altering statement will drop columns, which is not supported yet: (_rw_kafka_timestamp: timestamp with time zone)
If the protobuf test has no struct fields, or the newly added column comes first (according to the protobuf field number), it will panic, just like Avro.
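To make the "two mistakes" concrete, here is a minimal sketch of the old refresh path. It is a toy model, not RisingWave code: `Col`, `added_by_name`, `duplicate_ids` and `refresh_without_reassign` are hypothetical names, only top-level column IDs are modelled, the diff is assumed to compare columns by name, and the column IDs for the struct-less case are extrapolated from the CREATE SOURCE flow described above.

```rust
use std::collections::BTreeSet;

// Toy model: a column is just a name and an ID (hypothetical, not ColumnCatalog).
#[derive(Debug, Clone, PartialEq)]
struct Col {
    name: &'static str,
    id: i32,
}

fn col(name: &'static str, id: i32) -> Col {
    Col { name, id }
}

/// Assumption: the diff compares columns by name only and ignores IDs.
fn added_by_name(resolved: &[Col], original: &[Col]) -> Vec<Col> {
    resolved
        .iter()
        .filter(|r| original.iter().all(|o| o.name != r.name))
        .cloned()
        .collect()
}

/// Returns every ID that appears more than once, in ascending order.
fn duplicate_ids(cols: &[Col]) -> Vec<i32> {
    let mut seen = BTreeSet::new();
    let mut dups = BTreeSet::new();
    for c in cols {
        if !seen.insert(c.id) {
            dups.insert(c.id);
        }
    }
    dups.into_iter().collect()
}

/// Old behaviour: append the added columns, keeping their freshly resolved IDs.
fn refresh_without_reassign(original: &[Col], resolved: &[Col]) -> Vec<Col> {
    let mut out = original.to_vec();
    out.extend(added_by_name(resolved, original));
    out
}

fn main() {
    // Protobuf test: the nested bar.baz takes an ID at resolve time, so b resolves to #4.
    let original = [col("a", 1), col("bar", 2), col("_rw_kafka_timestamp", 3), col("_row_id", 0)];
    let resolved = [col("a", 1), col("bar", 3), col("b", 4)];
    assert!(duplicate_ids(&refresh_without_reassign(&original, &resolved)).is_empty());

    // No struct field: b resolves to #2 and collides with the hidden timestamp column.
    let original = [col("a", 1), col("_rw_kafka_timestamp", 2), col("_row_id", 0)];
    let resolved = [col("a", 1), col("b", 2)];
    assert_eq!(duplicate_ids(&refresh_without_reassign(&original, &resolved)), vec![2]);
}
```

In this model the struct case avoids a collision only because the nested field pushes the new column's resolved ID past the IDs handed out at CREATE SOURCE time.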

xxchan · 2024-06-18T05:29:06Z

refactor: refactor ColumnDesc #17346
fix(source): fix panic for ALTER SOURCE with schema registry #17293 👈
main

xxchan · 2024-06-18T09:21:43Z

avoid manipulating columns manually, which bypasses ColumnIdGenerator and can be problematic if we support ALTER TABLE with connector

#9828

e2e_test/schema_registry/pb.py

xxchan · 2024-06-18T11:37:13Z

src/frontend/src/handler/create_source.rs

+    if cfg!(debug_assertions) {
+        // validate column ids
+        // Note: this just documents how it works currently. It doesn't mean whether it's reasonable.
+        if let Some(ref columns) = columns {
+            let mut i = 1;
+            fn check_col(col: &ColumnDesc, i: &mut usize, columns: &Vec<ColumnCatalog>) {
+                for nested_col in &col.field_descs {
+                    // What's the usage of struct fields' column IDs?
+                    check_col(nested_col, i, columns);
+                }
+                assert!(
+                    col.column_id.get_id() == *i as i32,
+                    "unexpected column id\ncol: {col:?}\ni: {i}\ncolumns: {columns:#?}"
+                );
+                *i += 1;
+            }
+            for col in columns {
+                check_col(&col.column_desc, &mut i, columns);
+            }
+        }
+    }


Perhaps we should assign ColumnId::placeholder() here, because the column IDs are re-assigned later with col_id_gen anyway. Alternatively, we could use col_id_gen here directly, but that would be very intrusive.
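The numbering that the debug assertion documents is a post-order walk: nested fields receive their IDs before the parent column does. A minimal sketch under simplified assumptions follows; `Field`, `assign_ids`, `flatten` and `foo_schema_ids` are hypothetical stand-ins, not the real ColumnDesc API.

```rust
// Toy stand-in for a column description with nested struct fields.
#[derive(Debug)]
struct Field {
    name: &'static str,
    id: i32,
    children: Vec<Field>,
}

/// Post-order assignment: children first, then the field itself.
fn assign_ids(field: &mut Field, next: &mut i32) {
    for child in &mut field.children {
        assign_ids(child, next);
    }
    field.id = *next;
    *next += 1;
}

/// Collects (name, id) pairs, parent before children, for easy inspection.
fn flatten(field: &Field, out: &mut Vec<(&'static str, i32)>) {
    out.push((field.name, field.id));
    for child in &field.children {
        flatten(child, out);
    }
}

/// Numbers the columns of the example message Foo { a, bar { baz } } starting from 1.
fn foo_schema_ids() -> Vec<(&'static str, i32)> {
    let mut cols = vec![
        Field { name: "a", id: 0, children: vec![] },
        Field {
            name: "bar",
            id: 0,
            children: vec![Field { name: "bar.baz", id: 0, children: vec![] }],
        },
    ];
    let mut next = 1;
    for c in &mut cols {
        assign_ids(c, &mut next);
    }
    let mut out = Vec::new();
    for c in &cols {
        flatten(c, &mut out);
    }
    out
}

fn main() {
    // Matches [a:#1, bar:#3 {bar.baz:#2}] from the PR description.
    assert_eq!(foo_schema_ids(), vec![("a", 1), ("bar", 3), ("bar.baz", 2)]);
}
```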

xxchan · 2024-06-18T11:37:53Z

src/frontend/src/handler/create_table.rs

@@ -894,6 +895,7 @@ pub(super) async fn handle_create_table_plan(
    with_version_column: Option<String>,
    include_column_options: IncludeOption,
 ) -> Result<(PlanRef, Option<PbSource>, PbTable, TableJobType)> {
+    let col_id_gen = ColumnIdGenerator::new_initial();


Move arg to eliminate unnecessary param

Interesting. I believe we reused this function for ALTER TABLE prior to some refactoring and that's why we took an argument of column id generator. 🤡

xxchan · 2024-06-19T01:16:08Z

src/frontend/src/handler/alter_source_with_sr.rs

-    let added_columns = columns_minus(&columns_from_resolve_source, &original_source.columns);
+    let mut added_columns = columns_minus(&columns_from_resolve_source, &original_source.columns);
+    // The newly resolved columns' column IDs also starts from 1. They cannot be used directly.
+    let mut next_col_id = max_column_id(&original_source.columns).next();
+    for col in &mut added_columns {
+        col.column_desc.column_id = next_col_id;
+        next_col_id = next_col_id.next();
+    }


This is the real fix for the issue. The other changes are just refactoring/debugging.
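The re-assignment in the hunk above can be sketched in isolation as follows. This is a simplified model, not the actual implementation: `ColumnId` and `Column` are toy types, `reassign_added` is a hypothetical name, and this `max_column_id` is a stand-in written after the fold that alter_source_column.rs used before this PR.

```rust
// Toy ColumnId; the derived Ord lets us take the maximum directly.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct ColumnId(i32);

impl ColumnId {
    fn next(self) -> Self {
        ColumnId(self.0 + 1)
    }
}

#[derive(Debug, Clone, PartialEq)]
struct Column {
    name: &'static str,
    id: ColumnId,
}

/// Stand-in for max_column_id: the largest ID among the existing columns.
/// With no columns it returns i32::MIN, so `.next()` does not overflow.
fn max_column_id(columns: &[Column]) -> ColumnId {
    columns.iter().fold(ColumnId(i32::MIN), |a, b| a.max(b.id))
}

/// Gives each added column a fresh ID above every existing one.
fn reassign_added(original: &[Column], added: &mut [Column]) {
    let mut next = max_column_id(original).next();
    for c in added.iter_mut() {
        c.id = next;
        next = next.next();
    }
}

fn main() {
    let original = vec![
        Column { name: "a", id: ColumnId(1) },
        Column { name: "_rw_kafka_timestamp", id: ColumnId(2) },
        Column { name: "_row_id", id: ColumnId(0) },
    ];
    // Freshly resolved IDs start from 1 again, so b:#2 would collide.
    let mut added = vec![
        Column { name: "b", id: ColumnId(2) },
        Column { name: "c", id: ColumnId(3) },
    ];
    reassign_added(&original, &mut added);
    assert_eq!(added[0].id, ColumnId(3));
    assert_eq!(added[1].id, ColumnId(4));
}
```

Taking the maximum over all existing columns, hidden ones included, is what keeps the new IDs clear of _rw_kafka_timestamp and _row_id in this model.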

xxchan · 2024-06-19T02:26:30Z

src/frontend/src/handler/alter_source_column.rs

            let mut bound_column = bind_sql_columns(&[column_def])?.remove(0);
-            bound_column.column_desc.column_id = columns
-                .iter()
-                .fold(ColumnId::new(i32::MIN), |a, b| a.max(b.column_id()))
-                .next();
+            bound_column.column_desc.column_id = max_column_id(columns).next();


As you can see, ALTER SOURCE ADD COLUMN uses the same solution. bind_sql_columns previously took col_id_gen as a parameter, but that changed in the refactor and it now uses ColumnId::placeholder() :( #10307

tabVersion

LGTM, thanks for the change

BugenZhao

LGTM

e2e_test/schema_registry/alter_sr.slt

src/frontend/src/handler/alter_source_column.rs

src/frontend/src/handler/alter_source_with_sr.rs

BugenZhao · 2024-06-19T03:37:58Z

src/frontend/src/handler/create_table.rs

@@ -894,6 +895,7 @@ pub(super) async fn handle_create_table_plan(
    with_version_column: Option<String>,
    include_column_options: IncludeOption,
 ) -> Result<(PlanRef, Option<PbSource>, PbTable, TableJobType)> {
+    let col_id_gen = ColumnIdGenerator::new_initial();


Interesting. I believe we reused this function for ALTER TABLE prior to some refactoring and that's why we took an argument of column id generator. 🤡

BugenZhao · 2024-06-19T03:39:50Z

src/frontend/src/optimizer/plan_node/logical_source.rs

@@ -70,6 +70,9 @@ impl LogicalSource {
        ctx: OptimizerContextRef,
        as_of: Option<AsOf>,
    ) -> Result<Self> {
+        // XXX: should we reorder the columns?


I think the order does not matter much. The columns field is essentially a map indexed by the column id.

I think so, it's just for what users will see in SELECT *.

But I'm not sure whether we rely on the position of hidden columns like _row_id somewhere... For projected_row_id we do...

BugenZhao · 2024-06-19T03:43:04Z

src/frontend/src/handler/alter_source_column.rs

-                .iter()
-                .fold(ColumnId::new(i32::MIN), |a, b| a.max(b.column_id()))
-                .next();
+            bound_column.column_desc.column_id = max_column_id(columns).next();


Correct me if I'm wrong: we actually can directly go through the path of planning a completely new source catalog without keeping the consistency for column ids between the old and the new one. The current approach is just to be more compatible with ALTER TABLE, in case of future extension.

I think the same, but without confidence.

tabVersion · 2024-06-19T03:50:11Z

#17336 changes the position to unify the logic related to additional columns

e2e_test/schema_registry/alter_sr.slt

xxchan · 2024-06-19T08:02:35Z

I finally found the reason for the CI failure, and why I couldn't reproduce it locally: when using rpk topic produce --schema-id=topic, it looks for Redpanda's schema registry, but the schema is in the Confluent schema registry... 🤡

Unfortunately, rpk topic produce returns 0 even when it hits an error..

unable to build value serializer using TopicNameStrategy for topic "avro_alter_source_test": unable to get schema with name "avro_alter_source_test-value" using TopicName strategy: unable to GET "http://127.0.0.1:8081/subjects/avro_alter_source_test-value/versions/latest": Get "http://127.0.0.1:8081/subjects/avro_alter_source_test-value/versions/latest": dial tcp 127.0.0.1:8081: connect: connection refused

github-actions bot added the Invalid PR Title label Jun 18, 2024

xxchan changed the title ~~add tests~~ fix(source): fix panic for alter source with sr Jun 18, 2024

github-actions bot added type/fix Bug fix and removed Invalid PR Title labels Jun 18, 2024

xxchan changed the title ~~fix(source): fix panic for alter source with sr~~ fix(source): fix panic for alter source with schema registry Jun 18, 2024

xxchan marked this pull request as ready for review June 18, 2024 05:32

xxchan force-pushed the xxchan/sr branch 2 times, most recently from 295121d to 74040b6 Compare June 19, 2024 01:14

xxchan changed the title ~~fix(source): fix panic for alter source with schema registry~~ fix(source): fix panic for ALTER SOURCE with schema registry Jun 19, 2024

xxchan requested review from fuyufjh, tabVersion, xiangjinwu, BugenZhao and st1page June 19, 2024 02:23

tabVersion approved these changes Jun 19, 2024

BugenZhao approved these changes Jun 19, 2024

xxchan added the need-cherry-pick-release-1.10 Open a cherry-pick PR to branch release-1.10 after the current PR is merged label Jun 19, 2024

xxchan mentioned this pull request Jun 19, 2024

refactor: revisit column ID assignment #17340

Open

xxchan mentioned this pull request Jun 19, 2024

refactor: refactor ColumnDesc #17346

Closed

9 tasks

xxchan force-pushed the xxchan/sr branch from 7ba5786 to 7fe9874 Compare June 19, 2024 09:46

xxchan enabled auto-merge June 19, 2024 10:06

add tests

073f0cc

Signed-off-by: xxchan <xxchan22f@gmail.com>

xxchan added 11 commits June 19, 2024 19:47

fix(source): fix panic for alter source with sr

12473cf

refine

ec5cf0e

refine

236c79d

include timestamp

10ee1ac

fix

f7b51b5

fix

c7f5f5f

fix

8afeaf1

debug

c4fec4c

Signed-off-by: xxchan <xxchan22f@gmail.com>

resolve comments

02e6668

Signed-off-by: xxchan <xxchan22f@gmail.com>

fix ci registry failure

6acfac5

Signed-off-by: xxchan <xxchan22f@gmail.com>

fix include column (hacky)

97e0a4d

xxchan force-pushed the xxchan/sr branch from fb616c4 to ef00148 Compare June 19, 2024 11:47

move tests

55a0f98

Signed-off-by: xxchan <xxchan22f@gmail.com>

xxchan force-pushed the xxchan/sr branch from ef00148 to 55a0f98 Compare June 19, 2024 12:10

xxchan added this pull request to the merge queue Jun 19, 2024

Merged via the queue into main with commit 2a413ff Jun 19, 2024
30 of 31 checks passed

xxchan deleted the xxchan/sr branch June 19, 2024 13:02

github-actions bot pushed a commit that referenced this pull request Jun 19, 2024

fix(source): fix panic for ALTER SOURCE with schema registry (#17293)

9aff870

Signed-off-by: xxchan <xxchan22f@gmail.com>

github-actions bot mentioned this pull request Jun 19, 2024

fix(source): fix panic for ALTER SOURCE with schema registry (#17293) #17353

Merged

github-merge-queue bot pushed a commit that referenced this pull request Jun 24, 2024

fix(source): fix panic for ALTER SOURCE with schema registry (#17293)…

4313966

… (#17353) Signed-off-by: xxchan <xxchan22f@gmail.com> Co-authored-by: xxchan <xxchan22f@gmail.com>

xiangjinwu mentioned this pull request Sep 26, 2024

Schema mismatch in source can lead to cluster panic #18715

Closed

xxchan mentioned this pull request Dec 10, 2024

feat(frontend): support alter add column for shared source #19649

Merged

9 tasks
