
Need a better way for changefeed retry on some errors such as CDC:ErrJSONCodecRowTooLarge #3329

Closed
amyangfei opened this issue Nov 8, 2021 · 2 comments · Fixed by #4262
Assignees
Labels
area/ticdc Issues or PRs related to TiCDC. severity/moderate type/bug The issue is confirmed as a bug.

Comments


amyangfei commented Nov 8, 2021

Is your feature request related to a problem?

  1. Set up a TiCDC cluster and create a changefeed with the command cdc cli changefeed create -c test-cf --sink-uri="kafka://172.18.0.2:9092/cdc-test?kafka-version=2.7.0&max-batch-size=1&max-message-bytes=5000"
  2. Execute a SQL statement in the upstream TiDB that changes a wide row whose encoded data size is larger than 5000 bytes

TiCDC then reports the following error:

[2021/11/08 15:36:38.998 +08:00] [WARN] [json.go:433] ["Single message too large"] [max-message-size=5000] [length=10457] [table=test.t1]
[2021/11/08 15:36:39.176 +08:00] [ERROR] [processor.go:313] ["error on running processor"] [capture=127.0.0.1:8300] [changefeed=test-cf] [error="[CDC:ErrJSONCodecRowTooLarge]json codec single row too large"] [errorVerbose="[CDC:ErrJSONCodecRowTooLarge]json codec single row too large\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20210513014640-40f9a1999b3b/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/errors@v0.11.5-0.20210513014640-40f9a1999b3b/normalize.go:159\ngithub.com/pingcap/ticdc/cdc/sink/codec.(*JSONEventBatchEncoder).AppendRowChangedEvent\n\tgithub.com/pingcap/ticdc/cdc/sink/codec/json.go:435\ngithub.com/pingcap/ticdc/cdc/sink.(*mqSink).runWorker\n\tgithub.com/pingcap/ticdc/cdc/sink/mq.go:345\ngithub.com/pingcap/ticdc/cdc/sink.(*mqSink).run.func1\n\tgithub.com/pingcap/ticdc/cdc/sink/mq.go:275\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57\nruntime.goexit\n\truntime/asm_amd64.s:1371"]
[2021/11/08 15:36:39.176 +08:00] [ERROR] [processor.go:150] ["run processor failed"] [changefeed=test-cf] [capture=127.0.0.1:8300] [error="[CDC:ErrJSONCodecRowTooLarge]json codec single row too large"] [errorVerbose="[CDC:ErrJSONCodecRowTooLarge]json codec single row too large\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20210513014640-40f9a1999b3b/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/errors@v0.11.5-0.20210513014640-40f9a1999b3b/normalize.go:159\ngithub.com/pingcap/ticdc/cdc/sink/codec.(*JSONEventBatchEncoder).AppendRowChangedEvent\n\tgithub.com/pingcap/ticdc/cdc/sink/codec/json.go:435\ngithub.com/pingcap/ticdc/cdc/sink.(*mqSink).runWorker\n\tgithub.com/pingcap/ticdc/cdc/sink/mq.go:345\ngithub.com/pingcap/ticdc/cdc/sink.(*mqSink).run.func1\n\tgithub.com/pingcap/ticdc/cdc/sink/mq.go:275\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57\nruntime.goexit\n\truntime/asm_amd64.s:1371"]

The changefeed is resumed every few seconds and fails again (the cdc owner applies a rate limit to retries).

Note that in a production environment, frequently resuming a changefeed can add significant load to the cluster.

Describe the feature you'd like

When TiCDC meets such an error, it should pause the changefeed and not retry resuming it until the changefeed config is updated and the changefeed is resumed manually.

Describe alternatives you've considered

No response

Teachability, Documentation, Adoption, Migration Strategy

No response

@amyangfei amyangfei added the subject/new-feature Denotes an issue or pull request adding a new feature. label Nov 8, 2021

amyangfei commented Nov 12, 2021

Some discussion records and solution candidates:

  1. Add more errors to the fast-fail error list below. The drawback of this approach is that we can't enumerate all fast-fail errors.
    https://github.com/pingcap/ticdc/blob/083d6b0f88b98df7ab3235f14a06d974afcb96f9/pkg/errors/helper.go#L37-L39
  2. Add a better backoff mechanism when the owner tries to restart an errored changefeed, e.g. retry after 1 min, 2 min, 4 min, and so on. This reduces the overhead introduced by changefeed initialization and lets users discover changefeeds in an error state via the changefeed query command.
  3. Add another changefeed state, such as an unrecoverable-error state. A changefeed in this state will not be restarted by the owner, but it will still contribute to the service GC safepoint calculation. We don't recommend this solution, since adding a new changefeed state introduces more complexity.

@amyangfei amyangfei added type/bug The issue is confirmed as a bug. area/ticdc Issues or PRs related to TiCDC. and removed subject/new-feature Denotes an issue or pull request adding a new feature. labels Nov 18, 2021
@amyangfei
Contributor Author

Solution 2 is a short-term candidate.
