
server: handle kill signal during write result to connection #52882

Merged 18 commits on Jun 12, 2024

Conversation

wshwsh12
Contributor

What problem does this PR solve?

Issue Number: close #44009

Problem Summary:

What changed and how does it work?

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 25, 2024

tiprow bot commented Apr 25, 2024

Hi @wshwsh12. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo, meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wshwsh12
Contributor Author

Manual test

  1. Create table `t` with one int column, and insert the values 1-8388608.
  2. Run `set @@global.tidb_enable_tmp_storage_on_oom = off;` and `set @@global.tidb_mem_quota_query = 5 << 30;`.
  3. Use go-mysql-driver to run `select count(1) from t group by a;`, fetch 10000 rows, and then hang forever.
  4. Wait 10 minutes; the SQL still does not finish.
  5. Create another session and try to kill the query: `kill tidb query xxxxxxxxx`.
  6. The memory usage is released. After the client is restored, other SQL statements can still be queried.
tidb> kill tidb query 2631925768;
Query OK, 0 rows affected (0.00 sec)

tidb> show processlist;
+------------+------+-----------------+------+---------+------+---------------------------------+-----------------------------------+
| Id         | User | Host            | db   | Command | Time | State                           | Info                              |
+------------+------+-----------------+------+---------+------+---------------------------------+-----------------------------------+
| 2631925766 | root | 127.0.0.1:35256 | test | Query   |    0 | autocommit                      | show processlist                  |
| 2631925768 | root | 127.0.0.1:37406 | test | Query   |  516 | autocommit; wait query finished | select count(1) from t group by a |
+------------+------+-----------------+------+---------+------+---------------------------------+-----------------------------------+
2 rows in set (0.00 sec)

@wshwsh12
Contributor Author

/ok-to-test

@ti-chi-bot ti-chi-bot bot added the ok-to-test Indicates a PR is ready to be tested. label Apr 25, 2024

codecov bot commented Apr 25, 2024

Codecov Report

Attention: Patch coverage is 76.00000% with 18 lines in your changes missing coverage. Please review.

Project coverage is 56.0530%. Comparing base (b8c0865) to head (43b6052).
Report is 16 commits behind head on master.

Additional details and impacted files
@@                Coverage Diff                @@
##             master     #52882         +/-   ##
=================================================
- Coverage   74.5303%   56.0530%   -18.4773%     
=================================================
  Files          1507       1657        +150     
  Lines        358289     635446     +277157     
=================================================
+ Hits         267034     356187      +89153     
- Misses        71847     255021     +183174     
- Partials      19408      24238       +4830     
Flag          Coverage Δ
integration   22.5263% <56.2500%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Components    Coverage Δ
dumpling      52.9656% <ø> (-2.2339%) ⬇️
parser        ∅ <ø> (∅)
br            41.1319% <ø> (-2.6874%) ⬇️

@wshwsh12
Contributor Author

/retest

@wshwsh12
Contributor Author

/retest

pkg/executor/adapter.go Show resolved Hide resolved
pkg/session/session.go Outdated Show resolved Hide resolved
pkg/server/server.go Show resolved Hide resolved
pkg/executor/adapter.go Show resolved Hide resolved
pkg/util/servermemorylimit/servermemorylimit.go Outdated Show resolved Hide resolved
pkg/parser/mysql/const.go Outdated Show resolved Hide resolved
@wshwsh12 wshwsh12 requested a review from XuHuaiyu June 6, 2024 03:43
@ti-chi-bot ti-chi-bot bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 6, 2024
@purelind
Contributor

purelind commented Jun 7, 2024

/retest

pkg/executor/adapter.go Outdated Show resolved Hide resolved
@guo-shaoge
Collaborator

Other lgtm

@wshwsh12 wshwsh12 requested a review from guo-shaoge June 7, 2024 07:44
@ti-chi-bot ti-chi-bot bot added approved needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Jun 7, 2024
fields []*ast.ResultField
executor exec.Executor
// `Fields` maybe call after `Close`, and executor will clear in `Close` function, so we need to store the schema in recordSet to avoid null pointer exception.

Suggested change
// `Fields` maybe call after `Close`, and executor will clear in `Close` function, so we need to store the schema in recordSet to avoid null pointer exception.
// The `Fields` method may be called after `Close`, and the executor is cleared in the `Close` function.
// Therefore, we need to store the schema in `recordSet` to avoid a null pointer exception when calling `executor.Schema()`.

Comment on lines 116 to 117
// If kill sql fails after 60 seconds, the current SQL may be stuck in the write packet network stack.
// Now, we can reclaim the resource by calling `Finish` and start looking for the next SQL with large memory usage.

Suggested change
// If kill sql fails after 60 seconds, the current SQL may be stuck in the write packet network stack.
// Now, we can reclaim the resource by calling `Finish` and start looking for the next SQL with large memory usage.
// If the SQL cannot be terminated after 60 seconds, it may be stuck in the network stack while writing packets to the client,
// encountering some bugs that cause it to hang, or failing to detect the kill signal.
// In this case, the resources can be reclaimed by calling the `Finish` method, and then we can start looking for the next SQL with the largest memory usage.

if seconds := time.Since(s.killStartTime) / time.Second; seconds >= 60 {
// If kill sql fails after 60 seconds, the current SQL may be stuck in the write packet network stack.
// Now, we can reclaim the resource by calling `Finish` and start looking for the next SQL with large memory usage.
logutil.BgLogger().Warn(fmt.Sprintf("global memory controller failed to kill the top-consumer in %ds, try to close the executors force", seconds))

Suggested change
logutil.BgLogger().Warn(fmt.Sprintf("global memory controller failed to kill the top-consumer in %ds, try to close the executors force", seconds))
logutil.BgLogger().Warn(fmt.Sprintf("global memory controller failed to kill the top consumer in %d seconds. Attempting to force close the executors.", seconds))

@@ -41,11 +41,47 @@ const (
type SQLKiller struct {
Signal killSignal
ConnID uint64
Finish func()
// InWriteResultSet is used to mark whether the query is calling clientConn.writeResultSet().

Suggested change
// InWriteResultSet is used to mark whether the query is calling clientConn.writeResultSet().
// InWriteResultSet is used to indicate whether the query is currently calling clientConn.writeResultSet().
// If the query is in writeResultSet and Finish() can acquire rs.finishLock, we can assume the query is waiting for the client to receive data from the server over network I/O.

if atomic.CompareAndSwapUint32(&killer.Signal, 0, reason) {
status := atomic.LoadUint32(&killer.Signal)
err := killer.getKillError(status)
logutil.BgLogger().Warn("kill query started", zap.Uint64("conn", killer.ConnID), zap.String("reason", err.Error()))

logutil.BgLogger().Warn("kill query initiated", zap.Uint64("connection ID", killer.ConnID), zap.String("reason", err.Error()))

Comment on lines 79 to 80
// FinishResultSet is used to finish the result set.
// If the cancel signal is received and SQL is waiting for network IO, resource released can be performed first.

// FinishResultSet is used to close the result set.
// If a kill signal is sent but the SQL query is stuck in the network stack while writing packets to the client,
// encountering some bugs that cause it to hang, or failing to detect the kill signal, we can call Finish to release resources used during the SQL execution process.

@@ -907,6 +907,12 @@ func (s *Server) Kill(connectionID uint64, query bool, maxExecutionTime bool) {
// Mark the client connection status as WaitShutdown, when clientConn.Run detect
// this, it will end the dispatch loop and exit.
conn.setStatus(connStatusWaitShutdown)
if conn.bufReadConn != nil {
// When we try `kill connection` and tidb is stuck in the write packet network stack, we can quickly exit the network stack and end the SQL by setting WriteDeadline.

Suggested change
// When we try `kill connection` and tidb is stuck in the write packet network stack, we can quickly exit the network stack and end the SQL by setting WriteDeadline.
// When attempting to 'kill connection' and TiDB is stuck in the network stack while writing packets,
// we can quickly exit the network stack and terminate the SQL execution by setting WriteDeadline.

status := a.stmt.Ctx.GetSessionVars().SQLKiller.GetKillSignal()
inWriteResultSet := a.stmt.Ctx.GetSessionVars().SQLKiller.InWriteResultSet.Load()
if status > 0 && inWriteResultSet {
logutil.BgLogger().Warn("kill query, this SQL may be stuck in the network I/O stack.", zap.Uint64("conn", a.stmt.Ctx.GetSessionVars().ConnectionID))

Suggested change
logutil.BgLogger().Warn("kill query, this SQL may be stuck in the network I/O stack.", zap.Uint64("conn", a.stmt.Ctx.GetSessionVars().ConnectionID))
logutil.BgLogger().Warn("kill query, this SQL might be stuck in the network stack while writing packets to the client.", zap.Uint64("connection ID", a.stmt.Ctx.GetSessionVars().ConnectionID))

Comment on lines 101 to 103
// finishLock is a mutex used to synchronize access to the `Next` and `Finish` function of the adapter.
// It ensures that only one goroutine can access to the `Next` and `Finish` function at a time, preventing race conditions.
// When we terminate the current SQL externally(e.g. kill query), we may use an additional goroutine to call the Finish function.

Suggested change
// finishLock is a mutex used to synchronize access to the `Next` and `Finish` function of the adapter.
// It ensures that only one goroutine can access to the `Next` and `Finish` function at a time, preventing race conditions.
// When we terminate the current SQL externally(e.g. kill query), we may use an additional goroutine to call the Finish function.
// finishLock is a mutex used to synchronize access to the `Next` and `Finish` functions of the adapter.
// It ensures that only one goroutine can access the `Next` and `Finish` functions at a time, preventing race conditions.
// When we terminate the current SQL externally (e.g., kill query), an additional goroutine would be used to call the `Finish` function.

status := a.stmt.Ctx.GetSessionVars().SQLKiller.GetKillSignal()
inWriteResultSet := a.stmt.Ctx.GetSessionVars().SQLKiller.InWriteResultSet.Load()
if status > 0 && inWriteResultSet {
logutil.BgLogger().Warn("kill query, this SQL may be stuck in the network I/O stack.", zap.Uint64("conn", a.stmt.Ctx.GetSessionVars().ConnectionID))

Will `kill connection` trigger this log?
If so, how do we distinguish `kill query` from `kill connection` in the log?


ti-chi-bot bot commented Jun 12, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: guo-shaoge, XuHuaiyu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [XuHuaiyu,guo-shaoge]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Jun 12, 2024

ti-chi-bot bot commented Jun 12, 2024

[LGTM Timeline notifier]

Timeline:

  • 2024-06-07 07:49:34.014294992 +0000 UTC m=+105328.067606916: ☑️ agreed by guo-shaoge.
  • 2024-06-12 08:08:23.653012723 +0000 UTC m=+538457.706324647: ☑️ agreed by XuHuaiyu.

Labels
approved lgtm ok-to-test Indicates a PR is ready to be tested. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kill statement can't work and write result to conn takes a lot of time
4 participants