TiDB Down Occurs When TiDB Sets tiflash Replica During Upgrade #57863

Closed
apollodafoni opened this issue Dec 2, 2024 · 5 comments · Fixed by #57945
Labels
affects-7.1 (affects 7.1.x LTS) · affects-7.5 (affects 7.5.x LTS) · affects-8.1 (affects 8.1.x LTS) · affects-8.5 (affects 8.5.x LTS) · component/ddl (related to DDL of TiDB) · severity/major · type/bug (confirmed as a bug)

Comments


apollodafoni commented Dec 2, 2024

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

  1. Run an ALTER TABLE ... SET TIFLASH REPLICA statement on a table.
  2. While the ALTER is in progress, use tiup to upgrade the cluster from v8.4.0 to v8.5.0-pre (a sketch of both steps follows below).
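
A hedged sketch of these two steps, for illustration only; the table name and cluster name are placeholders, not taken from the original report:

    -- Step 1: request a TiFlash replica on a table (placeholder name).
    ALTER TABLE test.t SET TIFLASH REPLICA 1;

    -- Step 2: while the replica is being set, upgrade the cluster from another shell, e.g.:
    --   tiup cluster upgrade <cluster-name> v8.5.0-pre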

2. What did you expect to see? (Required)

upgrade success

3. What did you see instead (Required)

[2024/12/02 10:14:20.198 +08:00] [ERROR] [advancer.go:400] ["listen task meet error, would reopen."] [error=EOF]
[2024/12/02 10:14:20.323 +08:00] [ERROR] [tso_client.go:366] ["[tso] update connection contexts failed"] [dc=global] [error="rpc error: code = Canceled desc = context canceled"]
[2024/12/02 10:19:37.115 +08:00] [ERROR] [tso_dispatcher.go:438] ["[tso] getTS error after processing requests"] [dc-location=global] [stream-url=http://pd-1-peer:2379] [error="[PD:client:ErrClientGetTSO]get TSO failed, rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster"]
[2024/12/02 10:19:37.217 +08:00] [ERROR] [controller.go:623] ["[resource group controller] token bucket rpc error"] [error="rpc error: code = Unavailable desc = not leader"]
[2024/12/02 10:19:37.220 +08:00] [ERROR] [pd_service_discovery.go:560] ["[pd] failed to update member"] [urls="[http://pd-1-peer:2379,http://pd-2-peer:2379,http://pd-3-peer:2379]"] [error="[PD:client:ErrClientGetMember]get member failed"]
[2024/12/02 10:19:37.271 +08:00] [ERROR] [tso_dispatcher.go:438] ["[tso] getTS error after processing requests"] [dc-location=global] [stream-url=http://pd-1-peer:2379] [error="[PD:client:ErrClientGetTSO]get TSO failed, rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster"]
[2024/12/02 10:19:37.324 +08:00] [ERROR] [pd_service_discovery.go:560] ["[pd] failed to update member"] [urls="[http://pd-1-peer:2379,http://pd-2-peer:2379,http://pd-3-peer:2379]"] [error="[PD:client:ErrClientGetMember]get member failed"]
[2024/12/02 10:19:37.375 +08:00] [ERROR] [tso_dispatcher.go:438] ["[tso] getTS error after processing requests"] [dc-location=global] [stream-url=http://pd-1-peer:2379] [error="[PD:client:ErrClientGetTSO]get TSO failed, rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster"]
[2024/12/02 10:19:37.427 +08:00] [ERROR] [pd_service_discovery.go:560] ["[pd] failed to update member"] [urls="[http://pd-1-peer:2379,http://pd-2-peer:2379,http://pd-3-peer:2379]"] [error="[PD:client:ErrClientGetMember]get member failed"]
[2024/12/02 10:19:37.531 +08:00] [ERROR] [pd_service_discovery.go:560] ["[pd] failed to update member"] [urls="[http://pd-1-peer:2379,http://pd-2-peer:2379,http://pd-3-peer:2379]"] [error="[PD:client:ErrClientGetMember]get member failed"]
[2024/12/02 10:19:37.573 +08:00] [ERROR] [tso_dispatcher.go:438] ["[tso] getTS error after processing requests"] [dc-location=global] [stream-url=http://pd-1-peer:2379] [error="[PD:client:ErrClientGetTSO]get TSO failed, rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster"]
[2024/12/02 10:19:37.634 +08:00] [ERROR] [pd_service_discovery.go:560] ["[pd] failed to update member"] [urls="[http://pd-1-peer:2379,http://pd-2-peer:2379,http://pd-3-peer:2379]"] [error="[PD:client:ErrClientGetMember]get member failed"]
[2024/12/02 10:19:37.738 +08:00] [ERROR] [pd_service_discovery.go:560] ["[pd] failed to update member"] [urls="[http://pd-1-peer:2379,http://pd-2-peer:2379,http://pd-3-peer:2379]"] [error="[PD:client:ErrClientGetMember]get member failed"]
[2024/12/02 10:19:37.842 +08:00] [ERROR] [pd_service_discovery.go:560] ["[pd] failed to update member"] [urls="[http://pd-1-peer:2379,http://pd-2-peer:2379,http://pd-3-peer:2379]"] [error="[PD:client:ErrClientGetMember]get member failed"]
[2024/12/02 10:19:37.970 +08:00] [ERROR] [tso_dispatcher.go:438] ["[tso] getTS error after processing requests"] [dc-location=global] [stream-url=http://pd-1-peer:2379] [error="[PD:client:ErrClientGetTSO]get TSO failed, rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster"]
[2024/12/02 10:19:38.037 +08:00] [ERROR] [client.go:252] ["[pd] request failed with a non-200 status"] [caller-id=pd-http-client] [name=GetMinResolvedTSByStoresIDs] [uri=/pd/api/v1/min-resolved-ts] [method=GET] [target-url=] [source=tikv-driver] [url=http://pd-1-peer:2379/pd/api/v1/min-resolved-ts] [status="500 Internal Server Error"] [body="[PD:apiutil:ErrRedirectToNotLeader]redirect to not leader"]
[2024/12/02 10:19:38.074 +08:00] [ERROR] [pd_service_discovery.go:560] ["[pd] failed to update member"] [urls="[http://pd-1-peer:2379,http://pd-2-peer:2379,http://pd-3-peer:2379]"] [error="[PD:client:ErrClientGetMember]get member failed"]
[2024/12/02 10:19:38.074 +08:00] [ERROR] [tso_dispatcher.go:438] ["[tso] getTS error after processing requests"] [dc-location=global] [stream-url=http://pd-1-peer:2379] [error="[PD:client:ErrClientGetTSO]get TSO failed, rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster"]
[2024/12/02 10:21:09.936 +08:00] [ERROR] [http_status.go:531] ["start status/rpc server error"] [error="accept tcp [::]:10080: use of closed network connection"]
[2024/12/02 10:21:09.936 +08:00] [ERROR] [http_status.go:526] ["http server error"] [error="http: Server closed"]
[2024/12/02 10:21:09.936 +08:00] [ERROR] [http_status.go:521] ["grpc server error"] [error="mux: server closed"]
[2024/12/02 10:21:09.947 +08:00] [ERROR] [advancer.go:400] ["listen task meet error, would reopen."] [error=EOF]
[2024/12/02 10:21:21.774 +08:00] [ERROR] [session.go:3935] ["set system variable max_allowed_packet error"] [error="[variable:1621]SESSION variable 'max_allowed_packet' is read-only. Use SET GLOBAL to assign the value"]
[2024/12/02 10:21:25.985 +08:00] [ERROR] [ddl_tiflash_api.go:538] ["updating TiFlash replica status err"] [category=ddl] [error="context canceled"] [tableID=131] [isPartition=true]

4. What is your TiDB version? (Required)

v8.4.0

@apollodafoni added the type/bug label on Dec 2, 2024
@apollodafoni (Author)

/severity major

@apollodafoni (Author)

/label affects-8.5

@ti-chi-bot (bot) added the affects-8.5 label and removed the may-affects-8.5 label on Dec 2, 2024

xhebox commented Dec 2, 2024

Stuck after domain.Start.

Found ["background process started"] [source=domain] [process=logBackupAdvancer] but not ["background process started"] [source=domain] [process=globalBindHandleWorkerLoop].

Also found "get owner lock" on the problematic tidb server, which means it is stuck in the snippet below before L3738, possibly in upgrade at L3733.

tidb/pkg/session/session.go

Lines 3717 to 3738 in 0e8ccbc

    if err != nil {
        // Bootstrap fail will cause program exit.
        logutil.BgLogger().Fatal("start domain error", zap.Error(err))
    }
    // For the bootstrap SQLs, the following variables should be compatible with old TiDB versions.
    // TODO we should have a createBootstrapSession to init those special variables.
    s.sessionVars.EnableClusteredIndex = variable.ClusteredIndexDefModeIntOnly
    s.SetValue(sessionctx.Initing, true)
    if startMode == ddl.Bootstrap {
        bootstrap(s)
    } else if startMode == ddl.Upgrade {
        // below sleep is used to mitigate https://github.com/pingcap/tidb/issues/57003,
        // to let the older owner have time to notice that it's already retired.
        time.Sleep(owner.WaitTimeOnForceOwner)
        upgrade(s)
    }
    finishBootstrap(store)
    s.ClearValue(sessionctx.Initing)
    dom.Close()

sh-4.4# grep -r -H 'get owner lock' /tiup/deploy/tidb-4000/log/tidb.log
sh-4.4# exit
command terminated with exit code 1
[xhe@M1PRO-2 ~]$ kubectl exec -it tidb-3-0 -- /bin/sh
sh-4.4#  grep -r -H 'get owner lock' /tiup/deploy/tidb-4000/log/tidb.log
sh-4.4#  grep -r -H 'get owner lock' /tiup/deploy/tidb-4000/log/tidb.log
sh-4.4# exit
command terminated with exit code 1
[xhe@M1PRO-2 ~]$ kubectl exec -it tidb-1-0 -- /bin/sh
sh-4.4# grep -r -H 'get owner lock' /tiup/deploy/tidb-4000/log/tidb.log
/tiup/deploy/tidb-4000/log/tidb.log:[2024/12/02 10:21:11.354 +08:00] [INFO] [session.go:3691] ["[upgrade] get owner lock to upgrade"]

Found that the job was submitted by tidb-1 (the problematic server) before the restart, and then executed by the upgraded tidb-1. Note that the test server has --with-mock-upgrade set.

sh-4.4# grep -r -H 'set tiflash replica' /tiup/deploy/tidb-4000/log/tidb.log
/tiup/deploy/tidb-4000/log/tidb.log:[2024/12/02 10:18:23.687 +08:00] [INFO] [job_worker.go:763] ["run DDL job"] [category=ddl] [jobID=196] [conn=1916796932] [job="ID:196, Type:set tiflash replica, State:queueing, SchemaState:none, SchemaID:189, TableID:113, RowCount:0, ArgLen:0, start time: 2024-12-02 10:18:23.64 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0, Version: v2"]
/tiup/deploy/tidb-4000/log/tidb.log:[2024/12/02 10:18:23.751 +08:00] [INFO] [schema_version.go:381] ["wait latest schema version changed(get the metadata lock if tidb_enable_metadata_lock is true)"] [category=ddl] [ver=73] ["take time"=30.413144ms] [job="ID:196, Type:set tiflash replica, State:done, SchemaState:public, SchemaID:189, TableID:113, RowCount:0, ArgLen:1, start time: 2024-12-02 10:18:23.64 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0, Version: v2"]
/tiup/deploy/tidb-4000/log/tidb.log:[2024/12/02 10:18:23.805 +08:00] [INFO] [job_worker.go:378] ["finish DDL job"] [category=ddl] [jobID=196] [conn=1916796932] [job="ID:196, Type:set tiflash replica, State:synced, SchemaState:public, SchemaID:189, TableID:113, RowCount:0, ArgLen:0, start time: 2024-12-02 10:18:23.64 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0, Version: v2"]
sh-4.4# exit
[xhe@M1PRO-2 ~]$ grep -r -H 'set tiflash replica' tidb.log
tidb.log:[2024/12/02 10:18:23.675 +08:00] [INFO] [session.go:4099] ["CRUCIAL OPERATION"] [conn=1916796932] [schemaVersion=72] [cur_db=] [sql="alter table ntodb.nto_ssmx_bill_detail set tiflash replica 1"] [user=root@%]
tidb.log:[2024/12/02 10:18:23.683 +08:00] [INFO] [executor.go:6515] ["DDL job submitted"] [category=ddl] [job="ID:196, Type:set tiflash replica, State:queueing, SchemaState:none, SchemaID:189, TableID:113, RowCount:0, ArgLen:1, start time: 2024-12-02 10:18:23.64 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0, Version: v2"] [query="alter table ntodb.nto_ssmx_bill_detail set tiflash replica 1"]
tidb.log:[2024/12/02 10:18:23.685 +08:00] [INFO] [job_submitter.go:135] ["add DDL jobs"] [category=ddl] ["batch count"=1] [jobs="ID:196, Type:set tiflash replica, State:queueing, SchemaState:none, SchemaID:189, TableID:113, RowCount:0, ArgLen:1, start time: 2024-12-02 10:18:23.64 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0, Version: v2; "] [table=true] [fast_create=false]

@apollodafoni (Author)

/assign @tangenta

@jebter added the component/ddl label on Dec 3, 2024

tangenta commented Dec 3, 2024

Deadlock goroutines:

 Goroutine 1 - User: /usr/local/go/src/runtime/sema.go:71 sync.runtime_Semacquire (0x2087265) [semacquire 37577929467619446]
         0  0x000000000208582e in runtime.gopark
             at /usr/local/go/src/runtime/proc.go:425
         1  0x0000000002060c05 in runtime.goparkunlock
             at /usr/local/go/src/runtime/proc.go:430
         2  0x0000000002060c05 in runtime.semacquire1
             at /usr/local/go/src/runtime/sema.go:178
         3  0x0000000002087265 in sync.runtime_Semacquire
             at /usr/local/go/src/runtime/sema.go:71
         4  0x00000000020a1c68 in sync.(*WaitGroup).Wait
             at /usr/local/go/src/sync/waitgroup.go:118
         5  0x0000000004a58ca5 in github.com/pingcap/tidb/pkg/ddl.(*ddl).close
             at /workspace/source/tidb/pkg/ddl/ddl.go:1008
         6  0x0000000004a566d7 in github.com/pingcap/tidb/pkg/ddl.(*ddl).Stop
             at /workspace/source/tidb/pkg/ddl/ddl.go:747
         7  0x0000000004ce3c43 in github.com/pingcap/tidb/pkg/domain.(*Domain).Close
             at /workspace/source/tidb/pkg/domain/domain.go:1204
         8  0x00000000058fbcb2 in github.com/pingcap/tidb/pkg/session.runInBootstrapSession
             at /workspace/source/tidb/pkg/session/session.go:3738
         9  0x00000000058fa128 in github.com/pingcap/tidb/pkg/session.bootstrapSessionImpl
             at /workspace/source/tidb/pkg/session/session.go:3458
        10  0x0000000005a1bd56 in github.com/pingcap/tidb/pkg/session.BootstrapSession
             at /workspace/source/tidb/pkg/session/session.go:3404
        11  0x0000000005a1bd56 in main.createStoreDDLOwnerMgrAndDomain
             at /workspace/source/tidb/cmd/tidb-server/main.go:416
  Goroutine 646 - User: /workspace/source/tidb/pkg/ddl/executor.go:6589 github.com/pingcap/tidb/pkg/ddl.(*executor).DoDDLJobWrapper (0x4aa2f49) [chan receive 37577929467619446]
         0  0x000000000208582e in runtime.gopark
             at /usr/local/go/src/runtime/proc.go:425
         1  0x00000000020158fc in runtime.chanrecv
             at /usr/local/go/src/runtime/chan.go:639
         2  0x00000000020154b2 in runtime.chanrecv1
             at /usr/local/go/src/runtime/chan.go:489
         3  0x0000000004aa2f49 in github.com/pingcap/tidb/pkg/ddl.(*executor).DoDDLJobWrapper
             at /workspace/source/tidb/pkg/ddl/executor.go:6589
         4  0x0000000004a8a7a5 in github.com/pingcap/tidb/pkg/ddl.(*executor).doDDLJob2
             at /workspace/source/tidb/pkg/ddl/executor.go:6555
         5  0x0000000004a8a7a5 in github.com/pingcap/tidb/pkg/ddl.(*executor).UpdateTableReplicaInfo
             at /workspace/source/tidb/pkg/ddl/executor.go:3915
         6  0x0000000004a630a5 in github.com/pingcap/tidb/pkg/ddl.(*ddl).refreshTiFlashTicker
             at /workspace/source/tidb/pkg/ddl/ddl_tiflash_api.go:533
         7  0x0000000004a6412a in github.com/pingcap/tidb/pkg/ddl.(*ddl).PollTiFlashRoutine
             at /workspace/source/tidb/pkg/ddl/ddl_tiflash_api.go:588
         8  0x0000000004b63665 in github.com/pingcap/tidb/pkg/ddl.(*ddl).PollTiFlashRoutine-fm
             at <autogenerated>:1
         9  0x00000000032314ac in github.com/pingcap/tidb/pkg/util.(*WaitGroupWrapper).Run.func1
             at /workspace/source/tidb/pkg/util/wait_group_wrapper.go:157
        10  0x000000000208e821 in runtime.goexit
             at /usr/local/go/src/runtime/asm_amd64.s:1700

During the upgrade, TiDB starts DDL twice. The first DDL instance only executes the upgrade-related SQL statements. After it is closed, the second DDL instance becomes available to users (including ports 4000 and 10080).

When DDL starts, it launches a background goroutine (PollTiFlashRoutine) that periodically loads all tables with a TiFlash replica and then submits a DDL job to update the TiFlash replica status. This DDL job and DDL.Close() block each other, causing a deadlock.
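
A minimal, self-contained Go sketch of the blocking pattern implied by the goroutine dump above; the names (ddl, pollTiFlash, jobDone) are illustrative placeholders, not the actual TiDB identifiers. Close() waits on the WaitGroup that tracks the polling goroutine, while the poller waits for a job-completion signal that the stopping DDL worker will never send:

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    // ddl mimics the shape of the problem: a background poller tracked by wg,
    // and a Close() that waits for wg while the poller's job result is never delivered.
    type ddl struct {
        wg      sync.WaitGroup
        jobDone chan struct{} // would be signalled by the DDL worker when the submitted job finishes
    }

    // pollTiFlash stands in for PollTiFlashRoutine: it submits a "set tiflash replica"
    // job and blocks until the worker reports completion.
    func (d *ddl) pollTiFlash() {
        defer d.wg.Done()
        fmt.Println("poller: job submitted, waiting for completion")
        <-d.jobDone // blocks forever: the worker is shutting down and never signals
    }

    // Close stands in for ddl.close(): it waits for all background goroutines,
    // including the poller that is still waiting for its job.
    func (d *ddl) Close() {
        fmt.Println("close: waiting for background goroutines")
        d.wg.Wait()                 // blocks forever: the poller never returns
        fmt.Println("close: done") // never reached
    }

    func main() {
        d := &ddl{jobDone: make(chan struct{})}
        d.wg.Add(1)
        go d.pollTiFlash()

        time.Sleep(100 * time.Millisecond) // let the poller submit its job first

        finished := make(chan struct{})
        go func() {
            d.Close()
            close(finished)
        }()

        select {
        case <-finished:
            fmt.Println("shutdown finished")
        case <-time.After(time.Second):
            fmt.Println("deadlock: Close() and the poller are waiting on each other")
        }
    }

In this sketch the one-second timeout exists only so the program terminates; in the real process the two sides simply hang, which is exactly what the goroutine dump shows.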

@tangenta added the affects-7.1, affects-7.5, and affects-8.1 labels on Dec 3, 2024
@tangenta removed the may-affects-5.4, may-affects-6.1, and may-affects-6.5 labels on Dec 3, 2024
@ti-chi-bot (bot) closed this as completed in 6c39a81 on Dec 4, 2024