etcdserver: request timed out, possibly due to connection lost #1059

Closed
wking opened this issue Jan 12, 2019 · 3 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@wking
Member

wking commented Jan 12, 2019

Version

$ openshift-install version
openshift-install v0.9.1

Platform (aws|libvirt|openstack):

All.

What happened?

In an e2e-aws run mentioned here:

fail [k8s.io/kubernetes/test/e2e/storage/persistent_volumes-local.go:248]: Expected error:
    <*errors.errorString | 0xc4212bc710>: {
        s: "pod Create API error: etcdserver: request timed out, possibly due to connection lost",
    }
    pod Create API error: etcdserver: request timed out, possibly due to connection lost
not to have occurred

What you expected to happen?

No errors due to etcd delays.

How to reproduce it (as minimally and precisely as possible)?

There have been a lot of these in CI recently, although I'm not sure what would have changed. AWS has had a number of performance issues for us today though, including slow resource generation. Maybe our CI disks are just running slower than usual or something?
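
One way to check the slow-disk hypothesis directly is to approximate etcd's WAL fsync pattern on a master. This is only a rough sketch based on the commonly suggested fio disk benchmark for etcd; the directory and sizes here are illustrative, not something our CI runs:

# test-dir must exist and sit on the same device that backs /var/lib/etcd;
# small writes with an fdatasync after each mirror etcd's WAL behavior
$ fio --rw=write --ioengine=sync --fdatasync=1 \
      --directory=test-dir --size=22m --bs=2300 --name=etcd-disk-check

If the reported fdatasync latency percentiles are routinely well above the ~10ms that upstream etcd's hardware guidance targets, slow CI disks could explain the timeouts on their own.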

Anything else we need to know?

Details of a similar issue in the etcd logs:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1054/pull-ci-openshift-installer-master-e2e-aws/2824/artifacts/e2e-aws/pods/kube-system_etcd-member-ip-10-0-13-114.ec2.internal_etcd-member.log.gz | gunzip | grep -B2 -A3 'etcdserver: request timed out' | head -n 9
2019-01-12 01:35:58.954371 I | raft: raft.node: 1b29101e3d7dd22a lost leader bd31e70ef4e40f8b at term 13
2019-01-12 01:35:59.812308 W | etcdserver: timed out waiting for read index response (local node might have slow network)
2019-01-12 01:35:59.812418 W | etcdserver: read-only range request "key:\"/openshift.io/podtemplates\" range_end:\"/openshift.io/podtemplatet\" count_only:true " with result "error:etcdserver: request timed out" took too long (7.336055856s) to execute
2019-01-12 01:35:59.812518 W | etcdserver: read-only range request "key:\"/openshift.io/services/endpoints/kube-system/kube-scheduler\" " with result "error:etcdserver: request timed out" took too long (8.292539027s) to execute
2019-01-12 01:35:59.812576 W | etcdserver: read-only range request "key:\"/openshift.io/pods/openshift-cluster-kube-scheduler-operator/openshift-cluster-kube-scheduler-operator-56f567694-87qpg\" " with result "error:etcdserver: request timed out" took too long (9.082020841s) to execute
2019-01-12 01:35:59.812635 W | etcdserver: read-only range request "key:\"/openshift.io/pods/openshift-cluster-kube-scheduler-operator/openshift-cluster-kube-scheduler-operator-56f567694-87qpg\" " with result "error:etcdserver: request timed out" took too long (9.082939895s) to execute
2019-01-12 01:36:02.056897 I | raft: 1b29101e3d7dd22a [term: 13] ignored a MsgReadIndexResp message with lower term from bd31e70ef4e40f8b [term: 12]
2019-01-12 01:36:02.554105 W | wal: sync duration of 3.599668164s, expected less than 1s
2019-01-12 01:36:03.654282 I | raft: 1b29101e3d7dd22a is starting a new election at term 13
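
The same log gives a rough feel for whether this is mostly a disk symptom (slow WAL fsyncs) or a network/leadership symptom (read-index timeouts). The counts are only a coarse signal, but the one-liners follow the same pipeline as above:

$ log=https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1054/pull-ci-openshift-installer-master-e2e-aws/2824/artifacts/e2e-aws/pods/kube-system_etcd-member-ip-10-0-13-114.ec2.internal_etcd-member.log.gz
$ curl -s "$log" | gunzip | grep -c 'wal: sync duration'                          # slow-disk symptom
$ curl -s "$log" | gunzip | grep -c 'timed out waiting for read index response'   # slow-network / lost-leader symptom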

This seems similar to etcd-io/etcd#9464, which talks about election ticks and pre-voting as potential fixes, and about bumping to 3.4 to get them. Are there plans for bumping the elderly 3.1.14 we use for bootstrap health checks? Or the more respectable 3.3.10 the machine-config operator suggests for the masters? I guess we'd have to bump to 3.4 to get pre-voting, since 3.3.10 already contains etcd-io/etcd@3282d9070 (backported to the 3.3.x series, landing in 3.3.3). Or maybe the problem is something else entirely :p.
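
For reference, and only as a sketch (the flag names come from upstream etcd; the values are illustrative and untested on our clusters), a 3.4 member could enable pre-vote and loosen election timing like this:

# --pre-vote is available as of etcd 3.4; the heartbeat/election values below
# are just examples of slower timing, not recommendations
$ etcd --name etcd-member \
    --pre-vote=true \
    --heartbeat-interval=500 \
    --election-timeout=5000

Pre-voting makes a briefly partitioned member ask its peers whether they would grant a vote before it bumps the term, so a rejoining node doesn't force the disruptive elections the raft lines above suggest.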

As a minor pivot, it seems safe enough for us to move up to 3.3.10 to catch up with openshift/machine-config-operator@59f809676.

/kind bug

@openshift-ci-robot added the kind/bug label Jan 12, 2019
@wking
Member Author

wking commented Jan 12, 2019

Looks like 3.4 with pre-voting is still off in the future? So that leaves "don't flake out the masters to cause so many elections"?

@eparis
Member

eparis commented Feb 20, 2019

I'm going to close this issue, as it seems likely to be an etcd problem rather than an installer problem. If we start running into trouble with this again, let's open a BZ against the etcd component.

@eparis closed this as completed Feb 20, 2019
@khanthecomputerguy

If you are still having this issue, check out this video. I was able to resolve the issue with these instructions.
https://www.youtube.com/watch?v=EjTzIokJPcI

thunderboltsid added a commit to thunderboltsid/installer that referenced this issue Jan 14, 2022
This is a stopgap solution until openshift is able to merge
the API PR openshift#1059 openshift/api#1059.
thunderboltsid added a commit to nutanix-cloud-native/openshift-installer that referenced this issue Jan 27, 2022
This is a stopgap solution until openshift is able to merge
the API PR openshift#1059 openshift/api#1059.
thunderboltsid added a commit to nutanix-cloud-native/openshift-installer that referenced this issue Feb 1, 2022
This is a stopgap solution until openshift is able to merge
the API PR openshift#1059 openshift/api#1059.