
kube-controller-manager and kube-scheduler restarting in a clean installation #88111

Closed · cja000 opened this issue Feb 13, 2020 · 13 comments
Labels
sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. triage/unresolved Indicates an issue that can not or will not be resolved.

Comments

@cja000 commented Feb 13, 2020

What happened:
Just after the installation, the controller-manager and the scheduler begin to restart.

What you expected to happen:
I expect a stable Kubernetes installation, without restarts.

How to reproduce it (as minimally and precisely as possible):
I only see this problem in our "production" environment. Testing the same procedure with VirtualBox, even on a laptop, I cannot reproduce it.

Anything else we need to know?:
etcd logs many "took too long" messages, with values of over 15 seconds (e.g. took too long (18.015145443s) to execute)

...
2020-02-13 09:25:04.688572 I | embed: listening for peers on 172.16.1.1:2380
2020-02-13 09:25:04.688761 I | embed: listening for metrics on http://127.0.0.1:2381
raft2020/02/13 09:25:04 INFO: a0e08efea825151f is starting a new election at term 1
raft2020/02/13 09:25:04 INFO: a0e08efea825151f became candidate at term 2
raft2020/02/13 09:25:04 INFO: a0e08efea825151f received MsgVoteResp from a0e08efea825151f at term 2
raft2020/02/13 09:25:04 INFO: a0e08efea825151f became leader at term 2
raft2020/02/13 09:25:04 INFO: raft.node: a0e08efea825151f elected leader a0e08efea825151f at term 2
2020-02-13 09:25:04.781152 I | etcdserver: setting up the initial cluster version to 3.4
2020-02-13 09:25:04.781527 N | etcdserver/membership: set the initial cluster version to 3.4
2020-02-13 09:25:04.781624 I | etcdserver: published {Name:ml-k8s-master ClientURLs:[https://172.16.1.1:2379]} to cluster 50f4e72624d098cd
2020-02-13 09:25:04.781657 I | embed: ready to serve client requests
2020-02-13 09:25:04.781729 I | etcdserver/api: enabled capabilities for version 3.4
2020-02-13 09:25:04.781814 I | embed: ready to serve client requests
2020-02-13 09:25:04.786124 I | embed: serving client requests on 127.0.0.1:2379
2020-02-13 09:25:04.786821 I | embed: serving client requests on 172.16.1.1:2379
2020-02-13 09:25:17.472409 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "error:context canceled" took too long (2.000024395s) to execute
WARNING: 2020/02/13 09:25:17 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-02-13 09:25:19.047993 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "error:context canceled" took too long (2.000018721s) to execute
WARNING: 2020/02/13 09:25:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-02-13 09:25:19.572311 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "error:context canceled" took too long (1.999980608s) to execute
WARNING: 2020/02/13 09:25:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-02-13 09:25:21.547985 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "error:context canceled" took too long (2.000023755s) to execute
WARNING: 2020/02/13 09:25:21 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-02-13 09:25:21.672315 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "error:context deadline exceeded" took too long (2.00004148s) to execute
WARNING: 2020/02/13 09:25:21 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-02-13 09:25:22.472452 W | etcdserver: timed out waiting for read index response (local node might have slow network)
2020-02-13 09:25:23.772362 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "error:context canceled" took too long (2.000063609s) to execute
WARNING: 2020/02/13 09:25:23 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-02-13 09:25:24.048026 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "error:context canceled" took too long (2.000065812s) to execute
WARNING: 2020/02/13 09:25:24 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-02-13 09:25:25.872341 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "error:context canceled" took too long (2.000038869s) to execute
WARNING: 2020/02/13 09:25:25 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-02-13 09:25:26.547981 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "error:context canceled" took too long (2.000024442s) to execute
WARNING: 2020/02/13 09:25:26 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-02-13 09:25:27.972340 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "error:context canceled" took too long (2.000071169s) to execute
WARNING: 2020/02/13 09:25:27 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-02-13 09:25:29.047995 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "error:context canceled" took too long (2.000056661s) to execute
WARNING: 2020/02/13 09:25:29 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-02-13 09:25:29.287420 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "range_response_count:0 size:4" took too long (18.015145443s) to execute
2020-02-13 09:25:29.287479 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "range_response_count:0 size:4" took too long (19.7395452s) to execute
2020-02-13 09:25:29.287497 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "range_response_count:0 size:4" took too long (20.1151665s) to execute
2020-02-13 09:25:29.287523 W | etcdserver: read-only range request "key:\"/registry/namespaces/default\" " with result "range_response_count:0 size:4" took too long (16.314794334s) to execute
2020-02-13 09:25:29.287568 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "range_response_count:0 size:4" took too long (14.739618835s) to execute
2020-02-13 09:25:29.287615 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "range_response_count:0 size:4" took too long (15.915336418s) to execute
2020-02-13 09:25:29.287747 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "range_response_count:0 size:4" took too long (17.239620012s) to execute
...
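
As a quick cross-check (a minimal sketch, assuming the etcd pod name shown in the kubectl get pods output below), the warnings can be counted directly from the pod log:

podml@ml-k8s-master:~$ kubectl -n kube-system logs etcd-ml-k8s-master | grep -c 'took too long'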

Environment:

  • Kubernetes version (use kubectl version):
podml@ml-k8s-master:~$ kubectl version
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-11T18:14:22Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-11T18:07:13Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
    • Memory
podml@ml-k8s-master:~$ free -h
              total        used        free      shared  buff/cache   available
Mem:          125Gi       1.3Gi       119Gi       2.0Mi       4.4Gi       123Gi
Swap:            0B          0B          0B
  • CPU
podml@ml-k8s-master:~$ lscpu 
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          24
On-line CPU(s) list:             0-23
Thread(s) per core:              1
Core(s) per socket:              12
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           63
Model name:                      Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Stepping:                        2
CPU MHz:                         1830.038
CPU max MHz:                     3300.0000
CPU min MHz:                     1200.0000
BogoMIPS:                        4999.56
Virtualization:                  VT-x
L1d cache:                       768 KiB
L1i cache:                       768 KiB
L2 cache:                        6 MiB
L3 cache:                        60 MiB
NUMA node0 CPU(s):               0,2,4,6,8,10,12,14,16,18,20,22
NUMA node1 CPU(s):               1,3,5,7,9,11,13,15,17,19,21,23
Vulnerability Itlb multihit:     KVM: Mitigation: Split huge pages
Vulnerability L1tf:              Mitigation; PTE Inversion; VMX conditional cache flushes, SMT disabled
Vulnerability Mds:               Mitigation; Clear CPU buffers; SMT disabled
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, RSB filling
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc a
                                 rch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dc
                                 a sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnm
                                 i flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts md_clear flush_l1d
  • OS (e.g: cat /etc/os-release):
podml@ml-k8s-master:~$ cat /etc/os-release 
NAME="Ubuntu"
VERSION="19.10 (Eoan Ermine)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 19.10"
VERSION_ID="19.10"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=eoan
UBUNTU_CODENAME=eoan
  • Kernel (e.g. uname -a):
podml@ml-k8s-master:~$ uname -a
Linux ml-k8s-master 5.3.0-29-generic #31-Ubuntu SMP Fri Jan 17 17:27:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
    Installed with kubeadm
podml@ml-k8s-master:~$ sudo kubeadm init --pod-network-cidr=10.244.0.0/16 --apiserver-advertise-address=172.16.1.1
  • Network plugin and version (if this is a network-related bug):
    There are two interfaces: eno1 (for external access) and enp129s0f0 (Kubernetes network)
podml@ml-k8s-master:~$ route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.41.60.1      0.0.0.0         UG    0      0        0 eno1.1500
10.41.60.0      0.0.0.0         255.255.255.0   U     0      0        0 eno1.1500
172.16.1.0      0.0.0.0         255.255.255.0   U     0      0        0 enp129s0f0
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0
  • Others:
    I have used exactly the same kubeadm init command on the "production" master as on the VirtualBox master, but I get a different IP in the get pods output
podml@ml-k8s-master:~$ kubectl get pods -o wide -A
NAMESPACE     NAME                                    READY   STATUS              RESTARTS   AGE    IP           NODE            NOMINATED NODE   READINESS GATES
kube-system   coredns-6955765f44-2dx9q                0/1     ContainerCreating   0          112m   <none>       ml-k8s-master   <none>           <none>
kube-system   coredns-6955765f44-8fw48                0/1     ContainerCreating   0          112m   <none>       ml-k8s-master   <none>           <none>
kube-system   etcd-ml-k8s-master                      1/1     Running             0          112m   172.16.1.1   ml-k8s-master   <none>           <none>
kube-system   kube-apiserver-ml-k8s-master            1/1     Running             0          112m   172.16.1.1   ml-k8s-master   <none>           <none>
kube-system   kube-controller-manager-ml-k8s-master   0/1     CrashLoopBackOff    17         112m   172.16.1.1   ml-k8s-master   <none>           <none>
kube-system   kube-proxy-lpnzh                        1/1     Running             0          112m   172.16.1.1   ml-k8s-master   <none>           <none>
kube-system   kube-scheduler-ml-k8s-master            0/1     CrashLoopBackOff    17         112m   172.16.1.1   ml-k8s-master   <none>           <none>

"Production" (restarting) IP --> 172.16.1.1 (kubernetes network)
Virtual Box (stable) IP --> 10.0.2.15 (external IP)

Disk performance
I have followed the article "Using Fio to Tell Whether Your Storage is Fast Enough for Etcd"

podml@ml-k8s-master:~$ fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=22m --bs=2300 --name=mytest
mytest: (g=0): rw=write, bs=(R) 2300B-2300B, (W) 2300B-2300B, (T) 2300B-2300B, ioengine=sync, iodepth=1
fio-3.12
Starting 1 process
Jobs: 1 (f=1)
mytest: (groupid=0, jobs=1): err= 0: pid=10421: Thu Feb 13 12:49:39 2020
  write: IOPS=8660, BW=18.0MiB/s (19.9MB/s)(21.0MiB/1158msec); 0 zone resets
    clat (nsec): min=1862, max=2905.0k, avg=31388.68, stdev=40146.89
     lat (nsec): min=1978, max=2905.2k, avg=31539.73, stdev=40153.10
    clat percentiles (nsec):
     |  1.00th=[  1928],  5.00th=[  1976], 10.00th=[  2008], 20.00th=[  2160],
     | 30.00th=[  2544], 40.00th=[  4832], 50.00th=[ 44288], 60.00th=[ 45824],
     | 70.00th=[ 47360], 80.00th=[ 50432], 90.00th=[ 61696], 95.00th=[ 67072],
     | 99.00th=[ 91648], 99.50th=[105984], 99.90th=[240640], 99.95th=[268288],
     | 99.99th=[522240]
   bw (  KiB/s): min=19284, max=19621, per=100.00%, avg=19452.50, stdev=238.29, samples=2
   iops        : min= 8586, max= 8736, avg=8661.00, stdev=106.07, samples=2
  lat (usec)   : 2=7.73%, 4=28.52%, 10=6.85%, 20=0.74%, 50=35.15%
  lat (usec)   : 100=20.33%, 250=0.61%, 500=0.06%, 750=0.01%
  lat (msec)   : 4=0.01%
  fsync/fdatasync/sync_file_range:
    sync (usec): min=44, max=1411, avg=81.69, stdev=24.78
    sync percentiles (usec):
     |  1.00th=[   62],  5.00th=[   66], 10.00th=[   67], 20.00th=[   69],
     | 30.00th=[   72], 40.00th=[   77], 50.00th=[   80], 60.00th=[   82],
     | 70.00th=[   85], 80.00th=[   90], 90.00th=[  100], 95.00th=[  113],
     | 99.00th=[  143], 99.50th=[  159], 99.90th=[  239], 99.95th=[  277],
     | 99.99th=[ 1074]
  cpu          : usr=7.26%, sys=13.31%, ctx=21301, majf=0, minf=14
  IO depths    : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,10029,0,0 short=10029,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=18.0MiB/s (19.9MB/s), 18.0MiB/s-18.0MiB/s (19.9MB/s-19.9MB/s), io=21.0MiB (23.1MB), run=1158-1158msec

Disk stats (read/write):
    dm-0: ios=5578/11648, merge=0/0, ticks=204/984, in_queue=1188, util=92.46%, aggrios=5632/11109, aggrmerge=0/680, aggrticks=244/960, aggrin_queue=0, aggrutil=91.58%
  sda: ios=5632/11109, merge=0/680, ticks=244/960, in_queue=0, util=91.58%
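
For etcd, the key number in this output is the fdatasync latency rather than throughput; the article above suggests the 99th percentile of fdatasync should stay under roughly 10 ms, and here it is about 143 µs. A minimal sketch for pulling out just that block from the same fio run:

podml@ml-k8s-master:~$ fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=22m --bs=2300 --name=mytest | grep -A 7 'fsync/fdatasync'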

I have another Kubernetes cluster running on a similar machine with Ubuntu 16.04.4 and Kubernetes 1.14.2 without problems, but I need to upgrade to Ubuntu 19.10.

@cja000 added the kind/bug label Feb 13, 2020
@k8s-ci-robot added the needs-sig label Feb 13, 2020
@cja000 (Author) commented Feb 13, 2020

/sig node

@k8s-ci-robot added the sig/node label and removed the needs-sig label Feb 13, 2020
@neolit123 (Member) commented:

looks like you logged a separate issue after our discussion here:
#74340 (comment)

this seems like an isolated case related to your setup (or just something rare).

here you mentioned that you have tried a number of different CNIs, but if that is true, why are the CNI pods missing from your kubectl get po output?

2020-02-13 09:25:29.287747 W | etcdserver: read-only range request "key:"/registry/health" " with result "range_response_count:0 size:4" took too long (17.239620012s) to execute

tagging as api-machinery because they own the etcd implementation.

/remove-sig node
/sig api-machinery
/remove-kind bug
/triage unresolved

@k8s-ci-robot added the sig/api-machinery and triage/unresolved labels and removed the sig/node and kind/bug labels Feb 14, 2020
@cja000 (Author) commented Feb 14, 2020

Yes, I created this new issue.
I think it is something related to the environment, but I cannot get any clue from the pod logs.

podml@ml-k8s-master:~$ ./get_logs.sh 
Leader selection lost: 08:59:02
Logs from 08:58 to  08:59
--------------------
POD: coredns-6955765f44-jrrl8
--------------------
--------------------
POD: coredns-6955765f44-q9bkm
--------------------
--------------------
POD: etcd-ml-k8s-master
--------------------
2020-02-14 08:58:58.181154 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "error:context deadline exceeded" took too long (2.000028245s) to execute
WARNING: 2020/02/14 08:58:58 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-02-14 08:59:02.300882 W | etcdserver: read-only range request "key:\"/registry/jobs/\" range_end:\"/registry/jobs0\" limit:500 " with result "error:context canceled" took too long (4.363348979s) to execute
WARNING: 2020/02/14 08:59:02 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-02-14 08:59:03.181298 W | etcdserver: timed out waiting for read index response (local node might have slow network)
2020-02-14 08:59:05.456452 W | etcdserver: failed to revoke 151f703f410e8b55 ("etcdserver: request timed out")
2020-02-14 08:59:07.579251 W | etcdserver: read-only range request "key:\"/registry/events/kube-system/kube-apiserver-ml-k8s-master.15f30131f321ff36\" " with result "error:context canceled" took too long (9.396316674s) to execute
WARNING: 2020/02/14 08:59:07 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-02-14 08:59:08.181033 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "error:context canceled" took too long (2.000042121s) to execute
WARNING: 2020/02/14 08:59:08 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-02-14 08:59:10.181537 W | etcdserver: timed out waiting for read index response (local node might have slow network)
2020-02-14 08:59:10.181647 W | etcdserver: read-only range request "key:\"/registry/poddisruptionbudgets\" range_end:\"/registry/poddisruptionbudgett\" count_only:true " with result "error:etcdserver: request timed out" took too long (7.223919113s) to execute
2020-02-14 08:59:10.181688 W | etcdserver: read-only range request "key:\"/registry/persistentvolumeclaims\" range_end:\"/registry/persistentvolumeclaimt\" count_only:true " with result "error:etcdserver: request timed out" took too long (9.774878581s) to execute
2020-02-14 08:59:10.181735 W | etcdserver: read-only range request "key:\"/registry/services/specs\" range_end:\"/registry/services/spect\" count_only:true " with result "error:etcdserver: request timed out" took too long (11.453872193s) to execute
2020-02-14 08:59:10.181823 W | etcdserver: read-only range request "key:\"/registry/replicasets\" range_end:\"/registry/replicasett\" count_only:true " with result "error:etcdserver: request timed out" took too long (9.248831403s) to execute
2020-02-14 08:59:10.569762 W | etcdserver: read-only range request "key:\"/registry/namespaces/default\" " with result "range_response_count:1 size:173" took too long (17.164415984s) to execute
2020-02-14 08:59:10.569826 W | etcdserver: read-only range request "key:\"/registry/namespaces/kube-system\" " with result "range_response_count:1 size:179" took too long (17.125888475s) to execute
2020-02-14 08:59:10.569858 W | etcdserver: read-only range request "key:\"/registry/ingress\" range_end:\"/registry/ingrest\" count_only:true " with result "range_response_count:0 size:6" took too long (17.087557233s) to execute
2020-02-14 08:59:10.569942 W | etcdserver: read-only range request "key:\"/registry/services/endpoints/kube-system/kube-scheduler\" " with result "range_response_count:1 size:440" took too long (18.31016043s) to execute
2020-02-14 08:59:10.570036 W | etcdserver: request "header:<ID:1522058616051239826 username:\"kube-apiserver-etcd-client\" auth_revision:1 > txn:<compare:<target:MOD key:\"/registry/leases/kube-node-lease/ml-k8s-master\" mod_revision:61114 > success:<request_put:<key:\"/registry/leases/kube-node-lease/ml-k8s-master\" value_size:240 >> failure:<request_range:<key:\"/registry/leases/kube-node-lease/ml-k8s-master\" > >>" with result "size:18" took too long (16.995753273s) to execute
2020-02-14 08:59:10.570194 W | etcdserver: read-only range request "key:\"/registry/services/endpoints/kube-system/kube-controller-manager\" " with result "range_response_count:1 size:458" took too long (18.315828398s) to execute
2020-02-14 08:59:10.570389 W | etcdserver: failed to revoke 151f703f410e8b55 ("lease not found")
2020-02-14 08:59:10.570494 W | etcdserver: read-only range request "key:\"/registry/networkpolicies\" range_end:\"/registry/networkpoliciet\" count_only:true " with result "range_response_count:0 size:6" took too long (5.893007021s) to execute
2020-02-14 08:59:10.570539 W | etcdserver: read-only range request "key:\"/registry/events/kube-system/kube-apiserver-ml-k8s-master.15f30131f321ff36\" " with result "range_response_count:1 size:517" took too long (2.802178524s) to execute
2020-02-14 08:59:10.570643 W | etcdserver: read-only range request "key:\"/registry/minions\" range_end:\"/registry/miniont\" count_only:true " with result "range_response_count:0 size:8" took too long (1.762697427s) to execute
2020-02-14 08:59:10.570687 W | etcdserver: read-only range request "key:\"/registry/mutatingwebhookconfigurations\" range_end:\"/registry/mutatingwebhookconfigurationt\" count_only:true " with result "range_response_count:0 size:6" took too long (1.729756006s) to execute
2020-02-14 08:59:10.570772 W | etcdserver: read-only range request "key:\"/registry/cronjobs\" range_end:\"/registry/cronjobt\" count_only:true " with result "range_response_count:0 size:6" took too long (4.58903663s) to execute
2020-02-14 08:59:10.570811 W | etcdserver: read-only range request "key:\"/registry/validatingwebhookconfigurations\" range_end:\"/registry/validatingwebhookconfigurationt\" count_only:true " with result "range_response_count:0 size:6" took too long (5.835117719s) to execute
2020-02-14 08:59:10.570924 W | etcdserver: read-only range request "key:\"/registry/deployments\" range_end:\"/registry/deploymentt\" count_only:true " with result "range_response_count:0 size:8" took too long (4.027499803s) to execute
2020-02-14 08:59:15.884380 I | mvcc: store.index: compact 60546
2020-02-14 08:59:15.885605 I | mvcc: finished scheduled compaction at 60546 (took 773.326µs)
--------------------
POD: kube-apiserver-ml-k8s-master
--------------------
I0214 08:59:00.574743       1 trace.go:116] Trace[1983671341]: "GuaranteedUpdate etcd3" type:*coordination.Lease (started: 2020-02-14 08:58:53.573418987 +0000 UTC m=+61541.537589897) (total time: 7.001283601s):
E0214 08:59:00.574798       1 status.go:71] apiserver received an error that is not an metav1.Status: rpctypes.EtcdError{code:0xe, desc:"etcdserver: request timed out"}
I0214 08:59:00.575103       1 trace.go:116] Trace[1397886959]: "Update" url:/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/ml-k8s-master,user-agent:kubelet/v1.17.3 (linux/amd64) kubernetes/06ad960,client:172.16.1.1 (started: 2020-02-14 08:58:53.573269824 +0000 UTC m=+61541.537440733) (total time: 7.00179831s):
E0214 08:59:02.253200       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
I0214 08:59:02.253515       1 trace.go:116] Trace[1490658799]: "Get" url:/api/v1/namespaces/kube-system/endpoints/kube-controller-manager,user-agent:kube-controller-manager/v1.17.3 (linux/amd64) kubernetes/06ad960/leader-election,client:172.16.1.1 (started: 2020-02-14 08:58:52.253641269 +0000 UTC m=+61540.217812172) (total time: 9.99983788s):
E0214 08:59:02.259008       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
I0214 08:59:02.259280       1 trace.go:116] Trace[769769017]: "Get" url:/api/v1/namespaces/kube-system/endpoints/kube-scheduler,user-agent:kube-scheduler/v1.17.3 (linux/amd64) kubernetes/06ad960/leader-election,client:172.16.1.1 (started: 2020-02-14 08:58:52.259378942 +0000 UTC m=+61540.223549852) (total time: 9.999865213s):
E0214 08:59:02.272275       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"client disconnected"}
I0214 08:59:02.300452       1 trace.go:116] Trace[427744836]: "List etcd3" key:/jobs,resourceVersion:,limit:500,continue: (started: 2020-02-14 08:58:57.937112414 +0000 UTC m=+61545.901283317) (total time: 4.363288468s):
E0214 08:59:02.300509       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
I0214 08:59:02.300821       1 trace.go:116] Trace[1862767867]: "List" url:/apis/batch/v1/jobs,user-agent:kube-controller-manager/v1.17.3 (linux/amd64) kubernetes/06ad960/system:serviceaccount:kube-system:cronjob-controller,client:172.16.1.1 (started: 2020-02-14 08:58:57.93707678 +0000 UTC m=+61545.901247690) (total time: 4.363709696s):
I0214 08:59:07.577808       1 trace.go:116] Trace[2008566106]: "GuaranteedUpdate etcd3" type:*coordination.Lease (started: 2020-02-14 08:59:00.576455082 +0000 UTC m=+61548.540625988) (total time: 7.001310614s):
E0214 08:59:07.577870       1 status.go:71] apiserver received an error that is not an metav1.Status: rpctypes.EtcdError{code:0xe, desc:"etcdserver: request timed out"}
I0214 08:59:07.578206       1 trace.go:116] Trace[1041096166]: "Update" url:/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/ml-k8s-master,user-agent:kubelet/v1.17.3 (linux/amd64) kubernetes/06ad960,client:172.16.1.1 (started: 2020-02-14 08:59:00.576316612 +0000 UTC m=+61548.540487514) (total time: 7.001825853s):
I0214 08:59:07.579050       1 trace.go:116] Trace[578086289]: "GuaranteedUpdate etcd3" type:*core.Event (started: 2020-02-14 08:58:58.182446016 +0000 UTC m=+61546.146616925) (total time: 9.396569596s):
E0214 08:59:07.579089       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
I0214 08:59:07.579364       1 trace.go:116] Trace[1435648382]: "Patch" url:/api/v1/namespaces/kube-system/events/kube-apiserver-ml-k8s-master.15f30131f321ff36,user-agent:kubelet/v1.17.3 (linux/amd64) kubernetes/06ad960,client:172.16.1.1 (started: 2020-02-14 08:58:58.182343573 +0000 UTC m=+61546.146514479) (total time: 9.39698838s):
I0214 08:59:10.570347       1 trace.go:116] Trace[2043612436]: "Get" url:/api/v1/namespaces/default,user-agent:kube-apiserver/v1.17.3 (linux/amd64) kubernetes/06ad960,client:::1 (started: 2020-02-14 08:58:53.404936116 +0000 UTC m=+61541.369107029) (total time: 17.165367218s):
I0214 08:59:10.570655       1 trace.go:116] Trace[1323927304]: "Get" url:/api/v1/namespaces/kube-system,user-agent:kube-apiserver/v1.17.3 (linux/amd64) kubernetes/06ad960,client:::1 (started: 2020-02-14 08:58:53.44361207 +0000 UTC m=+61541.407782972) (total time: 17.127000316s):
I0214 08:59:10.570909       1 trace.go:116] Trace[908427642]: "GuaranteedUpdate etcd3" type:*coordination.Lease (started: 2020-02-14 08:59:07.593653061 +0000 UTC m=+61555.557823941) (total time: 2.977224368s):
I0214 08:59:10.571013       1 trace.go:116] Trace[1268955959]: "Update" url:/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/ml-k8s-master,user-agent:kubelet/v1.17.3 (linux/amd64) kubernetes/06ad960,client:172.16.1.1 (started: 2020-02-14 08:59:07.593556077 +0000 UTC m=+61555.557726949) (total time: 2.977426224s):
I0214 08:59:10.589288       1 trace.go:116] Trace[1096541851]: "GuaranteedUpdate etcd3" type:*core.Event (started: 2020-02-14 08:59:07.767883507 +0000 UTC m=+61555.732054416) (total time: 2.821369656s):
I0214 08:59:10.589411       1 trace.go:116] Trace[1078016545]: "Patch" url:/api/v1/namespaces/kube-system/events/kube-apiserver-ml-k8s-master.15f30131f321ff36,user-agent:kubelet/v1.17.3 (linux/amd64) kubernetes/06ad960,client:172.16.1.1 (started: 2020-02-14 08:59:07.767786833 +0000 UTC m=+61555.731957739) (total time: 2.821592124s):
--------------------
POD: kube-controller-manager-ml-k8s-master
--------------------
I0214 08:58:03.481241       1 node_lifecycle_controller.go:1443] Initializing eviction metric for zone: 
I0214 08:58:08.481548       1 node_lifecycle_controller.go:1443] Initializing eviction metric for zone: 
I0214 08:58:13.481836       1 node_lifecycle_controller.go:1443] Initializing eviction metric for zone: 
I0214 08:58:18.482133       1 node_lifecycle_controller.go:1443] Initializing eviction metric for zone: 
I0214 08:58:23.482438       1 node_lifecycle_controller.go:1443] Initializing eviction metric for zone: 
I0214 08:58:28.482786       1 node_lifecycle_controller.go:1443] Initializing eviction metric for zone: 
I0214 08:58:33.483098       1 node_lifecycle_controller.go:1443] Initializing eviction metric for zone: 
I0214 08:58:38.483407       1 node_lifecycle_controller.go:1443] Initializing eviction metric for zone: 
I0214 08:58:43.483712       1 node_lifecycle_controller.go:1443] Initializing eviction metric for zone: 
I0214 08:58:48.484934       1 node_lifecycle_controller.go:1443] Initializing eviction metric for zone: 
I0214 08:58:53.485243       1 node_lifecycle_controller.go:1443] Initializing eviction metric for zone: 
I0214 08:58:58.485563       1 node_lifecycle_controller.go:1443] Initializing eviction metric for zone: 
I0214 08:59:02.253057       1 leaderelection.go:288] failed to renew lease kube-system/kube-controller-manager: failed to tryAcquireOrRenew context deadline exceeded
F0214 08:59:02.253145       1 controllermanager.go:279] leaderelection lost
--------------------
POD: kube-flannel-ds-amd64-bh4mj
--------------------
--------------------
POD: kube-proxy-9mxbs
--------------------
--------------------
POD: kube-scheduler-ml-k8s-master
--------------------
I0214 08:59:02.258835       1 leaderelection.go:288] failed to renew lease kube-system/kube-scheduler: failed to tryAcquireOrRenew context deadline exceeded
F0214 08:59:02.258870       1 server.go:257] leaderelection lost

Just in case someone is interested, the script get_logs.sh used is:

#!/bin/bash
# Collect the logs of all pods (the kubectl logs call assumes they live in
# kube-system) into a timestamped file, then, if any component logged
# "leaderelection lost", print the lines from the minute before the event
# up to the event minute.
NOW=`date +"%Y%m%d_%H%M%S"`
LOG_FILE="logs-${NOW}"
CMD='kubectl logs -n kube-system '
PODS=`kubectl get pods -A -o custom-columns=:metadata.name --no-headers=true`
LINE=$(printf "%0.s-" {1..20})
for p in $PODS
do
    echo ${LINE} >> ${LOG_FILE}
    echo "POD: "${p} >> ${LOG_FILE}
    echo ${LINE} >> ${LOG_FILE}
    ${CMD} ${p} >> ${LOG_FILE}
done

# Earliest HH:MM:SS at which a "leaderelection lost" message appears.
LSL=$(awk -F'[ :.]' '/ leaderelection lost/{printf("%s:%s:%s\n"), $2,$3,$4}' ${LOG_FILE} | sort -u | head -1)
if [[ -z ${LSL} ]]; then
	echo "No Leader selection lost"
	exit
fi
echo "Leader selection lost: "$LSL
# Convert to epoch seconds and derive the HH:MM of the event and of one minute before.
LSL_SEC=$(date +%s -d${LSL})
LSL_HM=$(date +"%H:%M" -d "@"${LSL_SEC})
MIN_BEFORE=$(date +"%H:%M" -d "@"$((${LSL_SEC} - 60)))
echo "Logs from "${MIN_BEFORE}" to " ${LSL_HM}
egrep -e '^---|^POD' -e " ${MIN_BEFORE}" -e " ${LSL_HM}" ${LOG_FILE}

@cja000 (Author) commented Feb 14, 2020

Regarding the metrics:
With this script:

curl -s localhost:2381/metrics | awk -F' ' '/fsync.*sum/{t=$2;printf("%s: ", $1); getline; if ( $2 != 0) {print t/$2*1000 "ms"} else {print "0 ms"}}'

I get these values:

etcd_disk_wal_fsync_duration_seconds_sum: 21.0832ms
etcd_snap_db_fsync_duration_seconds_sum: 0ms
etcd_snap_fsync_duration_seconds_sum: 0.644806ms
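
Note that sum/count only gives an average, which can hide a long tail. The same metrics endpoint also exposes the WAL fsync histogram buckets, which show whether individual fsyncs take seconds (a minimal sketch):

podml@ml-k8s-master:~$ curl -s localhost:2381/metrics | grep etcd_disk_wal_fsync_duration_seconds_bucket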

@fedebongio (Contributor) commented:

/assign @jingyih

@k8s-ci-robot (Contributor) commented:

@fedebongio: GitHub didn't allow me to assign the following users: jingyih.

Note that only kubernetes members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @jingyih

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cja000 (Author) commented Feb 21, 2020

Tested with CentOS 8 and got the same problem

@jingyih (Contributor) commented Feb 21, 2020

The disk performance (fio test) looks good to me.

Questions:

  • Which version of etcd are you running?
  • Do you see the "took too long" warnings in the etcd log all the time, or do they appear only for a short period of time?
  • Since you mentioned this only happens in your production environment, is etcd configured differently in this environment?

@cja000 (Author) commented Feb 24, 2020

Hi,

  • I'm installing Kubernetes 1.17.3, so the etcd version is etcdmain: etcd Version: 3.4.3
  • The latest logs cover roughly 66.5 hours (from the first occurrence of the message to the last) and contain more than 28k "took too long" messages
  • I have done the same installation in VirtualBox as in the production environment. The problem appears just after kubeadm init, without configuring anything else. I have also tried to replicate the network in VirtualBox: one external network for access and one internal network for Kubernetes

Thank you

@cja000 (Author) commented Mar 5, 2020

Hi,

Finally I found the problem :-)
There is a problem with the RAID controller. When it is configured for RAID 0 there are resets, and that affects etcd.
Comparing the etcd logs with the iDRAC storage logs, the times of the failures coincide.
After removing RAID 0 it began to work; it also worked with RAID 1.

2 of 3 machines have the failure. Guess on which 2 machines I was doing the tests?
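
For anyone hitting similar symptoms: a minimal sketch for confirming a storage stall while the etcd warnings are happening is to watch per-device latency on the node (assuming the sysstat package is installed) and compare the timestamps with the etcd log:

podml@ml-k8s-master:~$ iostat -x 1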

Regards

@jingyih (Contributor) commented Mar 5, 2020

Thanks for reporting back! Happy to hear that you figured it out :)

@neolit123 (Member) commented:

Removed the RAID 0 it began to work

great! thanks for the update.

i'm closing the issue and will point users to it if needed.
/close

@k8s-ci-robot (Contributor) commented:

@neolit123: Closing this issue.

In response to this:

Removed the RAID 0 it began to work

great! thanks for the update.

i'm closing the issue and will point users to it if needed.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
