Kubelet timeout generating ImagePullBackOff error #3084
Comments
same problem here. |
Rancher/RKE does not use the kubelet config file to configure the kubelet, which sadly means you cannot find it anywhere. But that does not mean you cannot use one, and actually, you are very close to the final solution: you need to set both extra_args and extra_binds under services.kubelet:

services:
  kubelet:
    extra_args:
      config: path-to-the-config-file-in-the-container
    extra_binds:
      - "path-to-file-on-host:path-to-the-config-file-in-the-container"

And of course, you need to create/put such a config file on the control plane node beforehand. I hope this is helpful. |
And a caveat: AFAIK, the kubelet process does not auto-restart when changes are made to the config file, which means you need to restart the kubelet container after changing the "external" config file. |
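For example, on the affected node (assuming the kubelet container keeps RKE's default name, kubelet):

# Restart the kubelet container so it re-reads the file passed via --config.
docker restart kubelet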
Still I am experiencing the timeout issue. kubelet config file kubelet_timeout_config.yaml:

apiVersion: kubelet.config.k8s.io/v1beta1 |
@likku123 can you do the following checks on the container:
If all of the above look right, it means RKE has configured the kubelet properly, and then I would suspect it is an upstream issue or something wrong outside of RKE. |
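A hedged sketch of that kind of verification, assuming RKE's default kubelet container name and the config path used elsewhere in this thread:

# 1. Confirm the host file is bind-mounted into the kubelet container.
docker inspect kubelet --format '{{json .HostConfig.Binds}}'

# 2. Confirm the kubelet process was actually started with --config.
docker top kubelet | grep -- '--config'

# 3. Confirm the config file is readable inside the container.
docker exec kubelet cat /var/lib/kubelet/kubelet-config.yml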
Thanks for your response, Jiaqi Luo. I was trying this afternoon to solve this issue, but unfortunately didn't find how.

The kubelet-config.yml file I used on each server:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
clientCAFile: "/etc/kubernetes/ssl/kube-ca.pem"
runtimeRequestTimeout: 45m0s
tlsCipherSuites: ["TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256", "TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384", "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305", "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256", "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384", "TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305"]
failSwapOn: False
volumePluginDir: "/var/lib/kubelet/volumeplugin"
clusterDomain: "cluster.local"

The RKE config:

  kubelet:
    extra_args:
      config: /var/lib/kubelet/kubelet-config.yml
    extra_binds:
      - >-
        /var/lib/kubelet/kubelet-config.yml:/var/lib/kubelet/kubelet-config.yml

And finally, the process running on the server with --config, as an example:
I found a request which probably shows the current kubelet configuration:
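One such request, as a hedged example, queries the kubelet's configz endpoint through the API server (<node-name> is a placeholder):

# Dump the running kubelet configuration for a node via the API server proxy;
# pipe through jq or python3 -m json.tool to pretty-print if available.
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/configz"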
The content returned:

{
"kubeletconfig": {
"enableServer": true,
"syncFrequency": "1m0s",
"fileCheckFrequency": "20s",
"httpCheckFrequency": "20s",
"address": "0.0.0.0",
"port": 10250,
"tlsCertFile": "/var/lib/kubelet/pki/kubelet.crt",
"tlsPrivateKeyFile": "/var/lib/kubelet/pki/kubelet.key",
"tlsCipherSuites": [
"TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256",
"TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384",
"TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305",
"TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
"TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384",
"TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305"
],
"authentication": {
"x509": {
"clientCAFile": "/etc/kubernetes/ssl/kube-ca.pem"
},
"webhook": {
"enabled": true,
"cacheTTL": "2m0s"
},
"anonymous": {
"enabled": false
}
},
"authorization": {
"mode": "Webhook",
"webhook": {
"cacheAuthorizedTTL": "5m0s",
"cacheUnauthorizedTTL": "30s"
}
},
"registryPullQPS": 5,
"registryBurst": 10,
"eventRecordQPS": 0,
"eventBurst": 10,
"enableDebuggingHandlers": true,
"healthzPort": 10248,
"healthzBindAddress": "127.0.0.1",
"oomScoreAdj": -999,
"clusterDomain": "cluster.local",
"clusterDNS": [
"10.43.0.10"
],
"streamingConnectionIdleTimeout": "30m0s",
"nodeStatusUpdateFrequency": "10s",
"nodeStatusReportFrequency": "5m0s",
"nodeLeaseDurationSeconds": 40,
"imageMinimumGCAge": "2m0s",
"imageGCHighThresholdPercent": 85,
"imageGCLowThresholdPercent": 80,
"volumeStatsAggPeriod": "1m0s",
"cgroupsPerQOS": true,
"cgroupDriver": "cgroupfs",
"cpuManagerPolicy": "none",
"cpuManagerReconcilePeriod": "10s",
"memoryManagerPolicy": "None",
"topologyManagerPolicy": "none",
"topologyManagerScope": "container",
"runtimeRequestTimeout": "40m0s",
"hairpinMode": "promiscuous-bridge",
"maxPods": 110,
"podPidsLimit": -1,
"resolvConf": "/etc/resolv.conf",
"cpuCFSQuota": true,
"cpuCFSQuotaPeriod": "100ms",
"nodeStatusMaxImages": 50,
"maxOpenFiles": 1000000,
"contentType": "application/vnd.kubernetes.protobuf",
"kubeAPIQPS": 5,
"kubeAPIBurst": 10,
"serializeImagePulls": true,
"evictionHard": {
"imagefs.available": "15%",
"memory.available": "100Mi",
"nodefs.available": "10%",
"nodefs.inodesFree": "5%"
},
"evictionPressureTransitionPeriod": "5m0s",
"enableControllerAttachDetach": true,
"makeIPTablesUtilChains": true,
"iptablesMasqueradeBit": 14,
"iptablesDropBit": 15,
"failSwapOn": false,
"memorySwap": {},
"containerLogMaxSize": "10Mi",
"containerLogMaxFiles": 5,
"configMapAndSecretChangeDetectionStrategy": "Watch",
"enforceNodeAllocatable": [
"pods"
],
"volumePluginDir": "/var/lib/kubelet/volumeplugins",
"logging": {
"format": "text",
"flushFrequency": 5000000000,
"verbosity": 1,
"options": {
"json": {
"infoBufferSize": "0"
}
}
},
"enableSystemLogHandler": true,
"shutdownGracePeriod": "0s",
"shutdownGracePeriodCriticalPods": "0s",
"enableProfilingHandler": true,
"enableDebugFlagsHandler": true,
"seccompDefault": false,
"memoryThrottlingFactor": 0.8,
"registerWithTaints": [
{
"key": "node-role.kubernetes.io/controlplane",
"value": "true",
"effect": "NoSchedule"
}
],
"registerNode": true
}
}

I restarted the kubelet and the server, but the ErrImagePull behavior still persists if a pull takes more than 2 minutes... So, searching more about it, I found similar issues:
And a pull request related to this issue: kubernetes/minikube#13600. So this is probably a bug! But I found something interesting that I'll try later, related to changing the container runtime: kubernetes/minikube#14789 (comment) |
This is definitely an issue with the cri-dockerd version that comes along with rke-tools. Any suggestions on how to deploy cri-dockerd 0.2.6 in my present setup? |
No idea how to upgrade it...

bash-5.1# ./cri-dockerd --version
cri-dockerd 0.2.4 (4b57f30) |
@jiaqiluo,
Our cri-dockerd version: cri-dockerd 0.2.4. What are our options from here? Thanks in advance. |
I also have the same issue in RKE 1.24. |
@jiaqiluo, We're using the kubelet config file (we can see it through docker inspect kubelet) but the exact same problem persists. We have a limited internet connection (from Brazil), and images like Airflow, Redis, and RabbitMQ exceed the default timeout of 2 minutes. @vinibodruch or I can send you any kind of log or information. Thanks in advance. |
Looking at similar issues, this is the only thing that I thought could be the solution: #3051 |
I'd be happy if anyone knows a decent workaround (that hopefully doesn't involve SSHing into each node and running docker pull). This is seriously disturbing cluster operations, as there are some cluster images that are impossible to finish pulling within 2 minutes and will endlessly fail. |
The hack I am doing right now is: using an Ansible script, I manually download the required images on the nodes, and I scheduled a cron job to pull the latest changes regularly. |
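A rough sketch of that kind of workaround; the image list, path, and schedule below are placeholders, not the original Ansible script:

#!/bin/bash
# prepull-images.sh - run on each node (e.g. via Ansible or cron) to pre-pull
# large images so the kubelet never has to wait longer than its 2-minute timeout.
set -euo pipefail

IMAGES=(
  "apache/airflow:2.4.1"
  "redis:7"
  "rabbitmq:3.11-management"
)

for image in "${IMAGES[@]}"; do
  echo "Pulling ${image}..."
  docker pull "${image}"
done

# Example cron entry (run hourly):
# 0 * * * * /usr/local/bin/prepull-images.sh >> /var/log/prepull-images.log 2>&1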
I've rolled back to 1.23 for the time being. |
Hi @likku123 @gmanera @iTaybb @vinibodruch |
/backport v1.3.17 |
Issue was reproducible on RKE v1.4.1-rc1. Fixes for the above error were validated with RKE v1.4.1-rc2. Validation steps |
Issue is not active with RKE v1.4.1-rc2. Hence closing the issue. |
TL;DR
Where is the kubelet config file on Rancher 2.6.9 - RKE1, like the one described at https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/?
Can I manage it? Does this file exist?
I didn't find it in /var/lib/kubelet.
Explanation
Recently we upgraded the Kubernetes version to v1.24.4-rancher1-1 and Rancher to 2.6.9. Everything worked fine, but we've noticed a new behavior: if an image is too big or takes more than 2 minutes to download, Kubernetes raises an ErrImagePull.
To bypass this error, I need to log in to the cluster and run docker pull <image> to stop it.
Error: ImagePullBackOff
Searching for the problem, we discovered that the error is caused by a timeout in the kubelet's request (2 minutes, according to the doc https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/), which could be raised with the --runtime-request-timeout duration flag. After changing the cluster.yaml with the parameters below, nothing happens:
The running process, showing that the parameter is reflected in the kubelet configuration:
On the official page, this parameter is marked as deprecated, which explains this behavior; to change it, I need to set a parameter named runtimeRequestTimeout inside a config file.
So I have some doubts:
I read these docs too, but no success:
Configs and current versions
K8s version:
RKE version:
Docker version: (docker version, docker info preferred)
Operating system and kernel: (cat /etc/os-release, uname -r preferred)
Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO) VMware
cluster.yml file:
I would be grateful if this helps me and others solve this annoying issue.