Member Add Failure #11554
To reproduce this, it helps to run 3 instances of etcd inside a single etcd Docker container. You may want to use 3 separate terminal windows to make this easy to follow. Steps to reproduce:
Window 1
Window 2
Window 3
Repeat the last two steps in Window 3 to see how consistently the member add passes or fails. Then start over, changing the version of the Docker image you run in step 1 of Window 1. Our results:
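The per-window commands were lost from this capture. Purely as a sketch of the kind of setup described (a single container, multiple local members), with every name, port, data path, and image tag here being an assumption rather than the reporter's exact values:

```shell
# Window 1: shell into an etcd image (image and tag are assumptions)
docker run --rm -it --name etcd-repro gcr.io/etcd-development/etcd:v3.3.17 sh

# Window 2 (e.g. via `docker exec -it etcd-repro sh`): start the first two members
etcd --name m0 \
  --listen-client-urls http://127.0.0.1:12379 --advertise-client-urls http://127.0.0.1:12379 \
  --listen-peer-urls http://127.0.0.1:12380 --initial-advertise-peer-urls http://127.0.0.1:12380 \
  --initial-cluster m0=http://127.0.0.1:12380,m1=http://127.0.0.1:22380 &
etcd --name m1 \
  --listen-client-urls http://127.0.0.1:22379 --advertise-client-urls http://127.0.0.1:22379 \
  --listen-peer-urls http://127.0.0.1:22380 --initial-advertise-peer-urls http://127.0.0.1:22380 \
  --initial-cluster m0=http://127.0.0.1:12380,m1=http://127.0.0.1:22380 &

# Window 3: run member add while nothing yet listens on the third member's ports
ETCDCTL_API=3 etcdctl \
  --endpoints=http://127.0.0.1:12379,http://127.0.0.1:22379,http://0.0.0.0:32379 \
  member add m2 --peer-urls=http://127.0.0.1:32380
```

The third endpoint in `--endpoints` matches the not-yet-served client URL discussed later in the thread; that is the detail the reproduction hinges on.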
To summarize, it looks like this was a regression introduced in
@daniel-keeney
We are running this command from the third node that is being added. The process looks like:
Therefore, nothing will be listening on any interface at the time the command is run.
If it helps, I grabbed a goroutine dump of the process.
@jingyih (answering for @daniel-keeney since he's out): http://0.0.0.0:32379 is the endpoint that is going to be served by the 3rd member.
@swalner-pivotal Thanks! Given that http://0.0.0.0:32379 is not yet available at the time of command execution, could you remove it from the `--endpoints` flag and see if that helps?
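The suggestion amounts to pointing `--endpoints` only at members that are already serving clients. A sketch, in which the endpoint list, ports, and member name are assumptions:

```shell
# Only endpoints that already serve clients; the third member's
# not-yet-available http://0.0.0.0:32379 is deliberately left out of --endpoints.
ETCDCTL_API=3 etcdctl \
  --endpoints=http://127.0.0.1:12379,http://127.0.0.1:22379 \
  member add m2 --peer-urls=http://127.0.0.1:32380
```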
Hi @jingyih, we also tried that, but it's important to note that this exact configuration works when run with 3.3.12 (and 3.3.13).
In 3.3.14, the etcd client balancer was rewritten to fix a major bug. This was a breaking change [1], which might explain the different behavior you observed. [1] https://github.com/etcd-io/etcd/blob/master/CHANGELOG-3.3.md#v3314-2019-08-16
@jingyih Thanks for your responsiveness! We took your advice and tried replacing Window 3's step 3, which was originally: Please advise us if there is something else we should try; we appreciate your help.
Could you paste the error message of the following command you tried?
Sorry for leaving out that detail. It was the same as in the initial report:
But
Sorry about that, I was trying to indicate that it was the same "DeadlineExceeded" error as before. Here is the combined shell prompt, command, and output:
If you are not able to reproduce it on the first try, then run:
to reset the cluster back to 2 members and try again.
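The reset can be done with `member list` and `member remove`. A sketch, in which the endpoint, member name, and placeholder ID are assumptions:

```shell
# List members to find the hex ID of the half-added third member
ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:12379 member list

# Remove it by ID, returning the cluster to 2 members, then retry the add
ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:12379 member remove <memberID>
ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:12379 \
  member add m2 --peer-urls=http://127.0.0.1:32380
```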
Thanks @daniel-keeney, I was able to reproduce this. I will take a closer look.
Great, thank you!
{"level":"warn","ts":"2020-07-15T05:45:40.089+0100","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"passthrough:///https://ipmasked:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}

This happens when we add stacked Kubernetes masters based on the instructions at kubernetes.io; it is the second master node. The `etcdctl` member status and list show the cluster correctly, with node 1 as leader and node 2 as not leader. However, when we take down master 1, the entire cluster goes down. Kubernetes version: 1.18.5; etcd version: 3.4.3.
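For a kubeadm stacked-etcd setup like the one described, the cluster state can be inspected from a control-plane node using the certificates kubeadm generates. A sketch assuming kubeadm's default certificate paths:

```shell
# Default kubeadm etcd certificate paths (an assumption; adjust if customized)
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint status --cluster --write-out=table
```

Note also that a 2-member etcd cluster has a quorum of 2, so losing either member is expected to make the whole cluster unavailable; fault tolerance only begins at 3 members.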
We are trying to set up a 3-node cluster of etcd. The first two nodes start up fine, but the third one experiences trouble approximately 80% of the time. The steps that the nodes take to join the cluster are to:
etcdctl member add ...
This is failing at step 1, and so the etcd server ends up not running on the third node.
Version: 3.3.17
Script:
`ETCDCTL_API=3 etcdctl member add <memberName> --peer-urls "https://master-2.internal:2380"`

(`master-2.internal` is the node itself)

Expected Output:
Actual Output:
Output of `etcdctl endpoint health`:
We are able to `dig` the relevant host names from all 3 nodes (master-0|1|2.internal), and we tried adding the `--command-timeout=30s` flag to `etcdctl member add`, which did not help. We are able to manually remove the member and retry, which works about 20% of the time. How can we go about diagnosing this problem further?