
cluster: Improve cluster restart messaging #2442

Merged
1 commit merged into pingcap:master on Jul 22, 2024

Conversation

@zph (Contributor) commented on Jul 19, 2024

What problem does this PR solve?

The cluster restart messaging indicates to the user that the:

  • Cluster will be unavailable, via the prompt at https://github.com/pingcap/tiup/blob/654b087a1d293128424c0bd04e1ba216b064ffd1/pkg/cluster/manager/basic.go#L235:

    fmt.Sprintf("Will restart the cluster %s with nodes: %s roles: %s.\nCluster will be unavailable\nDo you want to continue? [y/N]:",
    )

This message is confusing when the command only requests a single-node restart, especially for monitoring nodes or nodes in an HA deployment.

What is changed and how it works?

The commit changes the messaging so that users performing a full cluster restart keep the same message as before, while users restarting only selected nodes or roles see a more precisely scoped warning:

Cluster functionality related to nodes: %s and roles: %s will be unavailable
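As a minimal sketch of the resulting branch logic (illustrative only, not the actual tiup code; the restartPrompt helper, its parameters, and the example endpoint values are assumptions):

    package main

    import "fmt"

    // restartPrompt builds the confirmation prompt shown before a restart.
    // A full-cluster restart (no node or role filter) keeps the original
    // wording; a selective restart warns only about the affected scope.
    func restartPrompt(cluster string, nodes, roles []string) string {
        if len(nodes) == 0 && len(roles) == 0 {
            return fmt.Sprintf("Will restart the cluster %s with nodes: %v roles: %v.\nCluster will be unavailable\nDo you want to continue? [y/N]:",
                cluster, nodes, roles)
        }
        return fmt.Sprintf("Will restart the cluster %s with nodes: %v roles: %v.\nCluster functionality related to nodes: %v and roles: %v will be unavailable\nDo you want to continue? [y/N]:",
            cluster, nodes, roles, nodes, roles)
    }

    func main() {
        // Full cluster restart: the message is unchanged.
        fmt.Println(restartPrompt("test-cluster", nil, nil))
        // Selective restart: the scoped warning replaces "Cluster will be unavailable".
        fmt.Println(restartPrompt("test-cluster", []string{"10.0.1.5:9090"}, []string{"prometheus"}))
    }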

Check List

Tests

  • Manual test (add detailed scripts or steps below)
    • I recompiled the code and ran both a full cluster restart request and a single node restart request (example invocations below)
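For reference, the two manual checks correspond to invocations along these lines (the cluster name and node address are illustrative; -N/--node and -R/--role are tiup cluster's node and role filters): `tiup cluster restart test-cluster` for the full cluster restart, and `tiup cluster restart test-cluster -N 10.0.1.5:9090` for the single-node restart.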

Code changes

  • N/a

Side effects

  • N/a

Related changes

  • Need to cherry-pick to the release branch
  • Need to update the documentation

Release notes:

NONE

@ti-chi-bot requested review from kaaaaaaang and nexustar on Jul 19, 2024 01:54
@ti-chi-bot commented on Jul 19, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign bb7133 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@CLAassistant commented on Jul 19, 2024

CLA assistant check
All committers have signed the CLA.

@ti-chi-bot commented on Jul 19, 2024

Welcome @zph! It looks like this is your first PR to pingcap/tiup 🎉

@ti-chi-bot added the size/S label (denotes a PR that changes 10-29 lines, ignoring generated files) on Jul 19, 2024
@ti-chi-bot added the lgtm label on Jul 19, 2024
@ti-chi-bot commented on Jul 19, 2024

[LGTM Timeline notifier]

Timeline:

  • 2024-07-19 02:52:00.393137721 +0000 UTC m=+581542.384079185: ☑️ agreed by xhebox.

@zph (Contributor, Author) commented on Jul 19, 2024

I'm away from my computer tonight, but if there are persistent test failures I can re-run my manual regression tests.

@bb7133 (Member) left a comment


LGTM, PTAL @kaaaaaaang

@zph (Contributor, Author) commented on Jul 19, 2024

Looking at the failing test, I can't see enough detail to tell whether it's caused by my changes or by an unrelated test issue.

The part closest to pd receiving a signal is a connection error:

[2024/07/19 07:09:00.517 +00:00] [WARN] [server.go:530] ["Server.onConn handshake"] [conn=5506538949855674773] [error="read tcp 127.0.0.1:4000->127.0.0.1:55138: read: connection reset by peer"] ["remote addr"=127.0.0.1:55138]
[2024/07/19 07:09:01.300 +00:00] [INFO] [manager.go:263] ["revoke session"] ["owner info"="[log-backup] /tidb/br-stream/owner ownerManager 0491a53d-ac93-459c-b17c-1adcb6c564e1"] [error="rpc error: code = Canceled desc = grpc: the client connection is closing"]
...
check detail log from: /home/runner/work/tiup/tiup/go/src/github.com/pingcap/tiup/tests/tiup-playground/_tmp/home/data/test_play/tidb-0/tidb.log
pd quit: signal: killed

Please let me know if I should look into this test failure further, or how I can run it locally with more logs to debug it if it seems related to my change.

@xhebox merged commit 8d7cb71 into pingcap:master on Jul 22, 2024
19 of 20 checks passed
Labels: lgtm, size/S
5 participants