
Implement terminate failpoint #16788

Open: wants to merge 2 commits into main


serathius (Member)

No description provided.

@jmhbnz (Member) left a comment

LGTM, though I'm curious about how lazyfs is handled for a restarting member; question below.

tests/robustness/failpoint/kill.go (outdated, resolved)
@serathius force-pushed the robustness-terminate branch 2 times, most recently from 74da5f4 to 29fd037 (October 19, 2023 06:46)
@fuweid (Member) left a comment

LGTM

@serathius (Member, Author)

/retest

@serathius (Member, Author)

Oops, looks like the tests don't pass after the rebase.

Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
@serathius (Member, Author)

Very weird. I have re-implemented graceful termination to do exactly what forceful termination does, except that it sends SIGTERM instead of SIGKILL, yet the results are still different. In a 3-node cluster, the terminated member is unable to bootstrap raft: it hangs somewhere between connecting to the other members and publishing itself to the cluster. The issue doesn't happen in a single-node cluster.

Is there something wrong with the e2e testing framework or with etcd? cc @ahrtr

I added a draft commit on top for easier reproduction of the issue.

@serathius (Member, Author)

I think it might be an issue with the peer proxy in the e2e framework: https://github.com/etcd-io/etcd/actions/runs/8425584127/job/23072006524?pr=16788

@k8s-ci-robot

PR needs rebase.


@k8s-ci-robot

@serathius: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                   Commit   Required  Rerun command
pull-etcd-unit-test-amd64   aedb627  true      /test pull-etcd-unit-test-amd64
pull-etcd-unit-test-arm64   aedb627  true      /test pull-etcd-unit-test-arm64
