
Infinite loop while watching for Kubernetes events #6503

Closed
exekias opened this issue Mar 7, 2018 · 29 comments · Fixed by #6504

Comments

@exekias
Contributor

exekias commented Mar 7, 2018

Both when using add_kubernetes_metadata or kubernetes autodiscover provider, some users have reported this issue (which causes an infinite loop):

2018-03-05T08:16:36.021Z        ERROR   kubernetes/watcher.go:188       kubernetes: Watching API error EOF
2018-03-05T08:16:36.021Z        INFO    kubernetes/watcher.go:173       kubernetes: Watching API for resource events
2018-03-05T08:16:36.030Z        ERROR   kubernetes/watcher.go:188       kubernetes: Watching API error proto: wrong wireType = 6 for field ServiceAccountName
2018-03-05T08:16:36.030Z        INFO    kubernetes/watcher.go:173       kubernetes: Watching API for resource events
2018-03-05T08:16:36.031Z        ERROR   kubernetes/watcher.go:188       kubernetes: Watching API error proto: wrong wireType = 6 for field ServiceAccountName
2018-03-05T08:16:36.031Z        INFO    kubernetes/watcher.go:173       kubernetes: Watching API for resource events
2018-03-05T08:16:36.033Z        ERROR   kubernetes/watcher.go:188       kubernetes: Watching API error proto: wrong wireType = 6 for field ServiceAccountName

More details can be found here: #6353

@exekias
Contributor Author

exekias commented Mar 7, 2018

I've pushed a branch ignoring unmarshal errors: https://github.com/exekias/beats/tree/ignore-k8s-errors

Also built an image from it: exekias/filebeat:ignore-k8s-errors. @simioa, it would be awesome if you could run another test; your help here is much appreciated.
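
For anyone wondering what "ignoring unmarshal errors" amounts to, here is a minimal, self-contained sketch of the idea. The names and helpers are hypothetical, not the actual beats watcher code: the loop logs and skips any event it cannot decode instead of returning the error and stalling the watch.

```
package main

import (
	"errors"
	"fmt"
	"io"
)

// watchLoop keeps consuming events and skips anything it cannot decode,
// instead of returning the error and tearing the whole watch down.
func watchLoop(next func() ([]byte, error), decode func([]byte) (string, error)) {
	for {
		raw, err := next()
		if err != nil {
			if errors.Is(err, io.EOF) {
				fmt.Println("kubernetes: Watching API error EOF, re-watching")
				return // the caller re-establishes the watch
			}
			fmt.Println("ignoring watch error:", err)
			continue
		}
		name, err := decode(raw)
		if err != nil {
			// e.g. "proto: wrong wireType = 6 for field ServiceAccountName"
			fmt.Println("ignoring unmarshal error:", err)
			continue
		}
		fmt.Println("handling event for", name)
	}
}

func main() {
	// Simulated stream: one event fails to decode, the others succeed.
	events := [][]byte{[]byte("pod-a"), []byte("garbage"), []byte("pod-b")}
	i := 0
	next := func() ([]byte, error) {
		if i >= len(events) {
			return nil, io.EOF
		}
		ev := events[i]
		i++
		return ev, nil
	}
	decode := func(raw []byte) (string, error) {
		if string(raw) == "garbage" {
			return "", errors.New("proto: wrong wireType = 6 for field ServiceAccountName")
		}
		return string(raw), nil
	}
	watchLoop(next, decode)
}
```

Running it prints the unmarshal error once for the bad event and keeps processing the remaining ones, which is the behavior being tested here.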

@simioa
Contributor

simioa commented Mar 9, 2018

@exekias Looks good, no problems so far.

@exekias
Contributor Author

exekias commented Mar 9, 2018

Good to hear, @simioa. I'm wondering: did you see the ERROR happen?

The code doesn't really avoid that message but should recover from it in all cases:
kubernetes: Watching API error proto: wrong wireType = 6 for field ServiceAccountName

@simioa
Contributor

simioa commented Mar 9, 2018

Not that ERROR message, but this:

2018-03-08T15:33:50.114Z        ERROR   kubernetes/watcher.go:198       kubernetes: Watching API error EOF

I just looked into a Filebeat 6.2.2 log and saw that the infinite loop started right after the kubernetes: Watching API error EOF error appeared.

This is not the case anymore; the only thing logged after the EOF error related to the watching API is:

2018-03-08T15:33:50.114Z        INFO    kubernetes/watcher.go:181       kubernetes: Watching API for resource events

@exekias
Contributor Author

exekias commented Mar 9, 2018

Ok, please keep an eye on it and report if it happens 😇

In the meantime, it seems we now have a way to reproduce it, so I will be doing some testing myself.

Thank you for your effort! It's really helping with this issue.

@exekias
Contributor Author

exekias commented Mar 12, 2018

A fix has been merged and should go out in both 6.3.0 and 6.2.3.

ruflin pushed a commit that referenced this issue Mar 12, 2018
…6530)

Cherry-pick of PR #6504 to 6.2 branch. Original message: 

This PR fixes #6503

How to reproduce: Run filebeat pointing to minikube. 

```
minikube ssh
sudo su

ps aux | grep localkube
kill -9 process_id
```

This will force a failure on the API server; when the API server comes back up, it will not be able to serve the last resource version we had requested and responds with this failure:
```
type:"ERROR" object:<raw:"k8s\000\n\014\n\002v1\022\006Status\022C\n\004\n\000\022\000\022\007Failure\032)too old resource version: 310742 (310895)\"\004Gone0\232\003\032\000\"\000" >  typeMeta:<apiVersion:"v1" kind:"Status" > raw:"\n\004\n\000\022\000\022\007Failure\032)too old resource version: 310742 (310895)\"\004Gone0\232\003" contentEncoding:"" contentType:""  <nil>
```

In such scenarios the only mitigation is to move the resource version to the latest; scenarios like this would be handled by `client-go`. The code fails with an error because we pass a `Pod` resource to `watcher.Next()`, but in this scenario the object being parsed is an `Error` resource, so the protobuf unmarshalling fails. This is a limitation of the client we use, as the resource type has to be passed explicitly.

This fix is not the best in the world, as it might miss a few state changes.
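
As a rough illustration of that recovery strategy (hypothetical names, not the beats implementation): on a watch or decode error, re-list to get the most recent resource version and resume from there, accepting that any state changes in between are lost.

```
// Sketch only: reset the resource version after a failed watch instead of
// retrying the same broken watch forever.
package sketch

import "log"

type watcher struct {
	resourceVersion string
	listLatest      func() (string, error) // returns the most recent resourceVersion
	watchOnce       func(rv string) error  // blocks until EOF or a decode error
}

func (w *watcher) run() {
	for {
		err := w.watchOnce(w.resourceVersion)
		if err == nil {
			continue // watch ended cleanly, just re-establish it
		}
		log.Printf("watch error: %v, moving to most recent resource version", err)
		rv, listErr := w.listLatest()
		if listErr != nil {
			log.Printf("list failed: %v, keeping the old resource version", listErr)
			continue
		}
		// Any events between the failed watch and this re-list are skipped,
		// which is the "might miss a few state changes" trade-off noted above.
		w.resourceVersion = rv
	}
}
```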
@primeroz

What's the ETA for the new 6.2.3 container with the fix merged?

@exekias
Contributor Author

exekias commented Mar 13, 2018

It should be out soon. In any case, for convenience, I've pushed a snapshot build with the fix: exekias/filebeat:6.2.3-snapshot

@exekias
Contributor Author

exekias commented Mar 20, 2018

An update on this: new images for 6.2.3 are already available: https://www.docker.elastic.co/

@yangyuw

yangyuw commented Aug 19, 2018

@exekias this error also appears in 6.4.0 with Kubernetes 1.6.4. Is there any workaround for this?

2018-08-19T01:53:44.655Z	ERROR	kubernetes/watcher.go:254	kubernetes: Watching API error EOF
2018-08-19T01:53:44.655Z	INFO	kubernetes/watcher.go:238	kubernetes: Watching API for resource events
2018-08-19T01:53:53.051Z	ERROR	kubernetes/watcher.go:254	kubernetes: Watching API error EOF
2018-08-19T01:53:53.051Z	INFO	kubernetes/watcher.go:238	kubernetes: Watching API for resource events
2018-08-19T01:53:53.293Z	ERROR	kubernetes/watcher.go:254	kubernetes: Watching API error EOF
2018-08-19T01:53:53.294Z	INFO	kubernetes/watcher.go:238	kubernetes: Watching API for resource events
2018-08-19T01:53:54.065Z	ERROR	kubernetes/watcher.go:254	kubernetes: Watching API error EOF
2018-08-19T01:53:54.065Z	INFO	kubernetes/watcher.go:238	kubernetes: Watching API for resource events
2018-08-19T01:54:05.405Z	ERROR	kubernetes/watcher.go:254	kubernetes: Watching API error EOF
2018-08-19T01:54:05.405Z	INFO	kubernetes/watcher.go:238	kubernetes: Watching API for resource events
2018-08-19T01:54:07.414Z	ERROR	kubernetes/watcher.go:254	kubernetes: Watching API error EOF
2018-08-19T01:54:07.414Z	INFO	kubernetes/watcher.go:238	kubernetes: Watching API for resource events

@hobti01

hobti01 commented Aug 26, 2018

Still appears in 6.3.2:

2018-08-25T14:50:05.364Z        INFO    [beat]  instance/beat.go:725    Build info      {"system_info": {"build": {"commit": "45a9a9e1561b6c540e94211ebe03d18abcacae55", "libbeat": "6.3.2", "time": "2018-07-20T04:18:19.000Z", "version": "6.3.2"}}}
...
2018-08-25T18:40:57.180Z        ERROR   kubernetes/watcher.go:195       kubernetes: Watching API error EOF
2018-08-25T18:40:57.180Z        INFO    kubernetes/watcher.go:179       kubernetes: Watching API for resource events

@yangyuw

yangyuw commented Aug 26, 2018

@hobti01 Do you have memory issues with the filebeat process? As the error appears more and more frequently, memory usage may grow to 5GB and beyond. In my cluster, this error appears about 1 hour after filebeat restarts.
I think the root cause may be this line; w.r returns an empty string, which is abnormal.

if _, err := io.ReadFull(w.r, length); err != nil {
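
For context, a tiny standalone example of how a closed watch stream turns into that EOF log line, assuming (as the snippet above suggests) that the watcher reads a length prefix from the response body; the reader here just simulates a connection that was closed with nothing buffered.

```
package main

import (
	"fmt"
	"io"
	"strings"
)

func main() {
	// Simulate the apiserver closing the watch connection: nothing left to read.
	stream := strings.NewReader("")
	length := make([]byte, 4) // length prefix of the next frame
	if _, err := io.ReadFull(stream, length); err != nil {
		// io.ReadFull returns io.EOF if zero bytes were read,
		// io.ErrUnexpectedEOF if the stream was cut mid-prefix.
		fmt.Println("kubernetes: Watching API error", err) // prints "... EOF"
	}
}
```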

@hobti01

hobti01 commented Aug 27, 2018

@yu-yang2 we do not have memory issues. We have noticed that sometimes log shipping stops for an application container when we update filebeat, and it does not restart.

This is the only ERROR in our logs, but it seems to be just a warning about the Kubernetes watch, which is automatically restarted.

@agoloAbbas

Same here with filebeat version 6.3.2:

2018-09-05T04:47:25.540Z	ERROR	kubernetes/watcher.go:195	kubernetes: Watching API error EOF
2018-09-05T04:47:25.541Z	INFO	kubernetes/watcher.go:179	kubernetes: Watching API for resource events

@mslga

mslga commented Nov 9, 2018

Same error in filebeat v6.4.2
Installed from https://github.com/helm/charts/tree/master/stable/filebeat

2018-11-09T15:03:07.979+0600	ERROR	kubernetes/watcher.go:254	kubernetes: Watching API error EOF

@thspinto

thspinto commented Nov 13, 2018

I also got this error in v6.4.3.
The pod with this error stops collecting Kubernetes metadata. When I restart the pod, the metadata collection resumes, but the error appears again after a few hours.

2018-11-13T12:23:44.563Z	ERROR	kubernetes/watcher.go:254	kubernetes: Watching API error EOF
2018-11-13T12:23:44.563Z	INFO	kubernetes/watcher.go:238	kubernetes: Watching API for resource events

@bwunderlich824

I also have the same errors with v6.4.3; my filebeat memory usage grows until it's killed by Kubernetes at about 1GB of usage per pod.

@opsnull

opsnull commented Nov 14, 2018

Same error for v6.4.3:
2018-11-13T21:11:43.472+0800 ERROR kubernetes/watcher.go:254 kubernetes: Watching API error EOF
2018-11-13T21:11:43.472+0800 INFO kubernetes/watcher.go:238 kubernetes: Watching API for resource events
2018-11-13T21:11:43.474+0800 ERROR kubernetes/watcher.go:254 kubernetes: Watching API error proto: illegal wireType 6
2018-11-13T21:11:43.474+0800 INFO kubernetes/watcher.go:258 kubernetes: Ignoring event, moving to most recent resource version
2018-11-13T21:11:43.474+0800 INFO kubernetes/watcher.go:238 kubernetes: Watching API for resource events

@bwunderlich824

Can we have this re-opened, since the issue seems to still be occurring?

@bwunderlich824

@opsnull Do you also see filebeat consuming large amounts of memory?

@bwunderlich824

I have opened a new ticket here
#9078

@WuChenLinux

This error also occurs in 6.6.1.

@Jasstkn

Jasstkn commented Jun 18, 2019

Have this issue in filebeat-oss:6.8.0

2019-06-18T13:04:47.479Z	ERROR	kubernetes/watcher.go:258	kubernetes: Watching API error EOF
2019-06-18T13:04:47.479Z	INFO	kubernetes/watcher.go:242	kubernetes: Watching API for resource events
2019-06-18T13:05:05.263Z	INFO	[monitoring]	log/log.go:144	Non-zero metrics in the last 30s	{"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":470,"time":{"ms":2}},"total":{"ticks":2840,"time":{"ms":18},"value":2840},"user":{"ticks":2370,"time":{"ms":16}}},"handles":{"limit":{"hard":65536,"soft":65536},"open":17},"info":{"ephemeral_id":"b4b54edd-c52c-4b5a-a29b-512f3f10c76e","uptime":{"ms":3360067}},"memstats":{"gc_next":52123232,"memory_alloc":34918008,"memory_total":303615072}},"filebeat":{"harvester":{"open_files":8,"running":8}},"libbeat":{"config":{"module":{"running":0}},"output":{"events":{"acked":900,"active":-573,"batches":1,"total":327},"read":{"bytes":42},"write":{"bytes":33899}},"pipeline":{"clients":2,"events":{"active":4118,"retry":327}}},"registrar":{"states":{"current":35}},"system":{"load":{"1":0.09,"15":0.1,"5":0.12,"norm":{"1":0.045,"15":0.05,"5":0.06}}}}}}

@illia-sh

Have the same issue in 6.8.1

@justinwalz
Contributor

I'm also seeing this on 6.8.1

Repeated output of 2019-09-04T01:17:29.852Z ERROR kubernetes/watcher.go:258 kubernetes: Watching API error EOF coupled with dropped logs.

Some pods on the machine are not missing logs, while others are. We're seeing this issue in development and production clusters.

@ZhaoBaymax

Have the same issue in metricbeat 7.3.1, but this issue happens in Alibaba Cloud Container Service; in AWS it has not happened.

Kubernetes module config:

kubernetes.yml: |-
  - module: kubernetes
    metricsets:
      - event
    period: 10s
    host: ${NODE_NAME}
    # If using Red Hat OpenShift remove the previous hosts entry and
    # uncomment these settings:
    #hosts: ["https://${HOSTNAME}:10250"]
    #bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    #ssl.certificate_authorities:
    #  - /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    #ssl.verification_mode: "none"
    add_metadata: true
    in_cluster: true
    fields_under_root: true
    fields:
      environment: ${ENVIRONMENT}
      REGION: ${REGION}

log:

2019-09-05T06:29:12.239Z	DEBUG	[cfgfile]	cfgfile/reload.go:197	Scan for new config files
2019-09-05T06:29:12.239Z	DEBUG	[cfgfile]	cfgfile/cfgfile.go:175	Load config from file: /usr/share/metricbeat/modules.d/kubernetes.yml
2019-09-05T06:29:12.239Z	DEBUG	[cfgfile]	cfgfile/reload.go:216	Number of module configs found: 1
2019-09-05T06:29:12.239Z	DEBUG	[reload]	cfgfile/list.go:62	Starting reload procedure, current runners: 1
2019-09-05T06:29:12.239Z	DEBUG	[reload]	cfgfile/list.go:80	Start list: 0, Stop list: 0
2019-09-05T06:29:20.456Z	DEBUG	[kubernetes]	kubernetes/watcher.go:261	EOF while watching API
2019-09-05T06:29:20.456Z	INFO	kubernetes/watcher.go:242	kubernetes: Watching API for resource events
2019-09-05T06:29:20.479Z	DEBUG	[kubernetes]	kubernetes/watcher.go:261	EOF while watching API
2019-09-05T06:29:20.479Z	INFO	kubernetes/watcher.go:242	kubernetes: Watching API for resource events
2019-09-05T06:29:20.579Z	DEBUG	[kubernetes]	kubernetes/watcher.go:261	EOF while watching API

@brokencode64

We are also seeing this error again in both our AWS and on-prem environments on filebeat 6.8.1:

2019-09-26T08:58:13.420Z        ERROR   kubernetes/watcher.go:258       kubernetes: Watching API error EOF
2019-09-26T09:51:33.050Z        ERROR   kubernetes/watcher.go:258       kubernetes: Watching API error EOF
2019-09-26T09:51:33.054Z        ERROR   kubernetes/watcher.go:258       kubernetes: Watching API error proto: wrong wireType = 6 for field ServiceAccountName

Many sets of logs are not being shipped either.

@rbailly-talend

This error is reproducible with filebeat 6.8.9:

{"level":"error","timestamp":"2020-06-19T11:37:51.250Z","caller":"kubernetes/watcher.go:258","message":"kubernetes: Watching API error EOF"}
{"level":"error","timestamp":"2020-06-19T11:37:51.251Z","caller":"kubernetes/watcher.go:258","message":"kubernetes: Watching API error proto: wrong wireType = 6 for field ServiceAccountName"}
{"level":"error","timestamp":"2020-06-19T12:24:33.883Z","caller":"kubernetes/watcher.go:258","message":"kubernetes: Watching API error EOF"}

We're missing some container logs when this error occurs.

@thernstig

@exekias This is still seen on later versions.

leweafan pushed a commit to leweafan/beats that referenced this issue Apr 28, 2023
…atch (elastic#6530)
