TestFleet* seems to be flaky (again) #7389
The elastic-agent team has been notified of this failure.
Since the default stack version was updated to 8.12.0.
What is the ECK configuration that reproduces this so that we can try doing it locally?
I reproduce by applying config/recipes/elastic-agent/fleet-kubernetes-integration.yaml with the ECK operator installed with the default settings. Steps to reproduce:
# https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-installing-eck.html
helm repo add elastic https://helm.elastic.co
helm repo update
helm install elastic-operator elastic/eck-operator -n elastic-system --create-namespace
# deploy es+kb+fleet+agent
kubectl apply -f https://raw.githubusercontent.com/elastic/cloud-on-k8s/main/config/recipes/elastic-agent/fleet-kubernetes-integration.yaml
# check the existence of the data streams
wget https://gist.githubusercontent.com/thbkrkr/0da65863799bbe415576afb78b74fde3/raw/a4f5ac0099a9bc61df81467a72cb4b08b9e7b782/eckurl && chmod +x ./eckurl
./eckurl elasticsearch /_data_stream/metrics-elastic_agent.metricbeat-default
{"data_streams":[{"name":"metrics-elastic_agent.metricbeat-default","timestamp_field":{"name":"@timestamp"},"indices":[{"index_name":".ds-metrics-elastic_agent.metricbeat-default-2024.01.23-000001","index_uuid":"fvSO8ppTTAmHWpZ4k-5p6A","prefer_ilm":true,"ilm_policy":"metrics","managed_by":"Index Lifecycle Management"}],"generation":1,"_meta":{"package":{"name":"elastic_agent"},"managed_by":"fleet","managed":true},"status":"GREEN","template":"metrics-elastic_agent.metricbeat","ilm_policy":"metrics","next_generation_managed_by":"Index Lifecycle Management","prefer_ilm":true,"hidden":false,"system":false,"allow_custom_routing":false,"replicated":false,"time_series":{"temporal_ranges":[{"start":"2024-01-23T07:20:56.000Z","end":"2024-01-23T11:33:51.000Z"}]}}]}
🟢
./eckurl elasticsearch /_data_stream/metrics-elastic_agent.filebeat-default
{"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index [metrics-elastic_agent.filebeat-default]","index_uuid":"_na_","resource.type":"index_or_alias","resource.id":"metrics-elastic_agent.filebeat-default","index":"metrics-elastic_agent.filebeat-default"}],"type":"index_not_found_exception","reason":"no such index [metrics-elastic_agent.filebeat-default]","index_uuid":"_na_","resource.type":"index_or_alias","resource.id":"metrics-elastic_agent.filebeat-default","index":"metrics-elastic_agent.filebeat-default"},"status":404}
🔴
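If you don't want to use the eckurl helper, a roughly equivalent check with plain kubectl and curl looks something like this (a sketch: it assumes the Elasticsearch resource in the recipe is named elasticsearch, so ECK creates the elasticsearch-es-elastic-user secret and the elasticsearch-es-http service):
# Port-forward the Elasticsearch HTTP service and query the data stream directly
PASSWORD=$(kubectl get secret elasticsearch-es-elastic-user -o go-template='{{.data.elastic | base64decode}}')
kubectl port-forward service/elasticsearch-es-http 9200 &
curl -sk -u "elastic:$PASSWORD" "https://localhost:9200/_data_stream/metrics-elastic_agent.filebeat-default"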
Thanks, I can reproduce the problem but I still can't tell why it's happening.
If I manually POST to Elasticsearch, the index gets created, which tells me the agent must not be writing to this data stream. I did confirm that I can get these metrics from Filebeat manually by interacting with its unix socket, so that is a bit odd. I used the following to do this:
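Roughly, that kind of check against the Beat's unix socket looks like this (illustrative: the socket path is a placeholder and varies per component; a real socket path appears in the macOS example below):
# Query the Beat's /state endpoint over its unix socket (placeholder path) and
# extract the Elasticsearch cluster UUID it reports
curl -s --unix-socket /usr/share/elastic-agent/state/data/tmp/<component-id>.sock \
  http://localhost/state | jq '.outputs.elasticsearch.cluster_uuid'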
Edit: never mind, I was looking at the wrong output.
This format difference doesn't appear for an 8.12.0 agent on my Mac:
sudo curl --unix-socket /Library/Elastic/Agent/data/tmp/PGwsYWcynGUYZEjD872Gs-npqbv-30jS.sock http:/localhost/state | jq
{
{
"beat": {
"name": "Craigs-Macbook-Pro.local"
},
"host": {
"architecture": "arm64",
"hostname": "Craigs-Macbook-Pro.local",
"id": "48DA13D6-B83B-5C71-A4F3-494E674F9F37",
"os": {
"build": "23D56",
"family": "darwin",
"kernel": "23.3.0",
"name": "macOS",
"platform": "darwin",
"version": "14.3"
}
},
"input": {
"count": 2,
"names": [
"log"
]
},
"management": {
"enabled": true
},
"module": {
"count": 0,
"names": []
},
"output": {
"batch_size": 1600,
"clients": 1,
"name": "elasticsearch"
},
"outputs": {
"elasticsearch": {
"cluster_uuid": "JdqUYKmBQQq4IZpBjOaKNQ"
}
},
"queue": {
"name": "mem"
},
"service": {
"id": "0d2d05af-b3af-41ef-8d8e-a604f8666a6c",
"name": "filebeat",
"version": "8.12.0"
}
}
Looking again, I now see that the format for Filebeat is correct, but it is missing the Elasticsearch cluster UUID, which is a condition for metrics to be collected.
Filebeat:
{
"beat": {
"name": "kind-worker"
},
"host": {
"architecture": "aarch64",
"containerized": false,
"hostname": "kind-worker",
"os": {
"codename": "focal",
"family": "debian",
"kernel": "5.10.76-linuxkit",
"name": "Ubuntu",
"platform": "ubuntu",
"version": "20.04.6 LTS (Focal Fossa)"
}
},
"input": {
"count": 2,
"names": [
"log"
]
},
"management": {
"enabled": true
},
"module": {
"count": 0,
"names": []
},
"output": {
"batch_size": 1600,
"clients": 1,
"name": "elasticsearch"
},
"outputs": {
"elasticsearch": {
"cluster_uuid": ""
}
},
"queue": {
"name": "mem"
},
"service": {
"id": "8463dd7f-4023-49d4-bf69-663da75f6f67",
"name": "filebeat",
"version": "8.12.0"
}
}
Metricbeat:
{
"beat": {
"name": "kind-worker"
},
"host": {
"architecture": "aarch64",
"containerized": false,
"hostname": "kind-worker",
"os": {
"codename": "focal",
"family": "debian",
"kernel": "5.10.76-linuxkit",
"name": "Ubuntu",
"platform": "ubuntu",
"version": "20.04.6 LTS (Focal Fossa)"
}
},
"management": {
"enabled": true
},
"module": {
"count": 11,
"names": [
"system"
]
},
"output": {
"batch_size": 1600,
"clients": 1,
"name": "elasticsearch"
},
"outputs": {
"elasticsearch": {
"cluster_uuid": "hdBpRNasQTGstWKazFHmSw"
}
},
"queue": {
"name": "mem"
},
"service": {
"id": "0f919bbd-8e3d-40dd-b970-98bbfa827012",
"name": "metricbeat",
"version": "8.12.0"
}
}
The most likely reason for this is that Filebeat hasn't sent any data to Elasticsearch. In this case that makes sense, because I took the Kubernetes integration out and there are no system logs on the pod to collect. I can confirm this by creating
Then Filebeat gets a cluster UUID and I see the metrics-elastic_agent.filebeat-default data stream.
So the root cause is likely that Filebeat has no data to send. In this case I removed the Kubernetes integration to simplify debugging, but that integration should be sending container logs by default, so I need to take a look at why this might be the case.
Putting the k8s integration back in, I can confirm that the filestream input used for container logs is also missing the Elasticsearch cluster UUID in its state output.
I can see that we are discovering the containers whose logs we should read, for example in the diagnostics in the streams:
- data_stream:
    dataset: kubernetes.container_logs
  id: kubernetes-container-logs-elastic-agent-agent-q99vh-99ea050f475a9328eaf3a644f1f787168ffcb950864de9d7f2f6520ce99bab06
  parsers:
  - container:
      format: auto
      stream: all
  paths:
  - /var/log/containers/*99ea050f475a9328eaf3a644f1f787168ffcb950864de9d7f2f6520ce99bab06.log
If I exec into the agent pod and look to see whether those container log files are actually there, they are not.
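For example (a sketch; the pod name is a placeholder):
# Check from inside the agent pod whether the discovered container log files are visible
kubectl exec <elastic-agent-pod> -- ls /var/log/containers/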
There should be a host path volume mount for /var/log like:
- name: varlog
  hostPath:
    path: /var/log
I'm not sure where that would have been dropped if this was working before. The only mounts I see in the agent pod are:
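For reference, dumping a pod's volumes and mounts can be done with something like this (placeholder pod name):
# List the volumes declared on the agent pod and the mounts in its containers
kubectl get pod <elastic-agent-pod> -o jsonpath='{.spec.volumes[*].name}{"\n"}'
kubectl describe pod <elastic-agent-pod> | grep -A 10 'Mounts:'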
For the record, there are a couple of other permissions errors in the logs that would affect the k8s integration. The agent can't list a few resources and kube-state-metrics is missing.
{"log.level":"error","@timestamp":"2024-01-25T16:44:04.553Z","message":"W0125 16:44:04.552798 1054 reflector.go:324] k8s.io/client-go@v0.23.4/tools/cache/reflector.go:167: failed to list *v1.DaemonSet: daemonsets.apps is forbidden: User \"system:serviceaccount:default:elastic-agent\" cannot list resource \"daemonsets\" in API group \"apps\" at the cluster scope","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-01-25T16:44:04.553Z","message":"E0125 16:44:04.553010 1054 reflector.go:138] k8s.io/client-go@v0.23.4/tools/cache/reflector.go:167: Failed to watch *v1.DaemonSet: failed to list *v1.DaemonSet: daemonsets.apps is forbidden: User \"system:serviceaccount:default:elastic-agent\" cannot list resource \"daemonsets\" in API group \"apps\" at the cluster scope","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-01-25T16:44:04.647Z","message":"error making http request: Get \"http://kube-state-metrics:8080/metrics\": lookup kube-state-metrics on 10.96.0.10:53: no such host","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"ecs.version":"1.6.0","log.logger":"kubernetes.state_namespace","log.origin":{"file.line":115,"file.name":"kubernetes/state_metricset.go","function":"github.com/elastic/beats/v7/metricbeat/helper/kubernetes.(*MetricSet).Fetch"},"service.name":"metricbeat","id":"kubernetes/metrics-kubernetes.state_namespace-a3b718bd-efec-54f4-b513-7711c744a8ec","ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2024-01-25T16:44:04.751Z","message":"SSL/TLS verifications disabled.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.logger":"tls","log.origin":{"file.line":107,"file.name":"tlscommon/tls_config.go","function":"github.com/elastic/elastic-agent-libs/transport/tlscommon.(*TLSConfig).ToConfig"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-01-25T16:44:04.846Z","message":"W0125 16:44:04.846961 1054 reflector.go:324] k8s.io/client-go@v0.23.4/tools/cache/reflector.go:167: failed to list *v1.StorageClass: storageclasses.storage.k8s.io is forbidden: User \"system:serviceaccount:default:elastic-agent\" cannot list resource \"storageclasses\" in API group \"storage.k8s.io\" at the cluster scope","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-01-25T16:44:04.847Z","message":"E0125 16:44:04.847157 1054 reflector.go:138] k8s.io/client-go@v0.23.4/tools/cache/reflector.go:167: Failed to watch *v1.StorageClass: failed to list *v1.StorageClass: storageclasses.storage.k8s.io is forbidden: User \"system:serviceaccount:default:elastic-agent\" cannot list resource \"storageclasses\" in API group \"storage.k8s.io\" at the cluster scope","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-01-25T16:44:04.847Z","message":"error making http request: Get \"http://kube-state-metrics:8080/metrics\": lookup kube-state-metrics on 10.96.0.10:53: no such host","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.logger":"kubernetes.state_node","log.origin":{"file.line":115,"file.name":"kubernetes/state_metricset.go","function":"github.com/elastic/beats/v7/metricbeat/helper/kubernetes.(*MetricSet).Fetch"},"service.name":"metricbeat","id":"kubernetes/metrics-kubernetes.state_node-a3b718bd-efec-54f4-b513-7711c744a8ec","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-01-25T16:44:04.852Z","message":"W0125 16:44:04.852377 1054 reflector.go:324] k8s.io/client-go@v0.23.4/tools/cache/reflector.go:167: failed to list *v1.PersistentVolume: persistentvolumes is forbidden: User \"system:serviceaccount:default:elastic-agent\" cannot list resource \"persistentvolumes\" in API group \"\" at the cluster scope","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-01-25T16:44:04.852Z","message":"E0125 16:44:04.852415 1054 reflector.go:138] k8s.io/client-go@v0.23.4/tools/cache/reflector.go:167: Failed to watch *v1.PersistentVolume: failed to list *v1.PersistentVolume: persistentvolumes is forbidden: User \"system:serviceaccount:default:elastic-agent\" cannot list resource \"persistentvolumes\" in API group \"\" at the cluster scope","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-01-25T16:44:05.050Z","message":"error making http request: Get \"http://kube-state-metrics:8080/metrics\": lookup kube-state-metrics on 10.96.0.10:53: no such host","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.logger":"kubernetes.state_replicaset","log.origin":{"file.line":115,"file.name":"kubernetes/state_metricset.go","function":"github.com/elastic/beats/v7/metricbeat/helper/kubernetes.(*MetricSet).Fetch"},"service.name":"metricbeat","id":"kubernetes/metrics-kubernetes.state_replicaset-a3b718bd-efec-54f4-b513-7711c744a8ec","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-01-25T16:44:05.146Z","message":"W0125 16:44:05.146766 1054 reflector.go:324] k8s.io/client-go@v0.23.4/tools/cache/reflector.go:167: failed to list *v1.CronJob: cronjobs.batch is forbidden: User \"system:serviceaccount:default:elastic-agent\" cannot list resource \"cronjobs\" in API group \"batch\" at the cluster scope","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-01-25T16:44:05.146Z","message":"E0125 16:44:05.146931 1054 reflector.go:138] k8s.io/client-go@v0.23.4/tools/cache/reflector.go:167: Failed to watch *v1.CronJob: failed to list *v1.CronJob: cronjobs.batch is forbidden: User \"system:serviceaccount:default:elastic-agent\" cannot list resource \"cronjobs\" in API group \"batch\" at the cluster scope","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"ecs.version":"1.6.0"} So |
I'm not sure what would have changed this. It could be that we got Filebeat metrics because we were collecting container logs and that has now stopped, or that we used to collect system logs and that stopped. Either way, the root cause is something in the k8s configuration of the agent, or possibly something like a container base image change that removed a system log source we used to be able to read (the Elastic Agent base image didn't change).
The filebeat indices that do exist:
> GET /_cat/indices | grep filebeat
green open .ds-metrics-elastic_agent.filebeat_input-default-2024.01.25-000001 tkTKKY4VTGiZ98R6QeX-qw 1 1 4085 0 3.3mb 1.6mb 1.6mb
green open .ds-logs-elastic_agent.filebeat-default-2024.01.25-000001 HlIksAZiQieVaTj3sWAuGg 1 1 1462 0 3.3mb 1.3mb 1.3mb
> GET /.ds-logs-elastic_agent.filebeat-default-2024.01.25-000001/_search \
| jq '.hits.hits[]._source.log.file.path' -r
/usr/share/elastic-agent/state/data/logs/elastic-agent-20240125-1.ndjson
...
Also, the filebeat filestream input seems to have started reading the last log lines from an elastic-agent pod:
This is written by the monitoring Filebeat, which has always been excluded from our internal metrics collection. I'll take a look at what happens with 8.11.0 to see if I can spot the difference.
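For example, with the 8.11.0 stack a sample document from that data stream can be pulled with the eckurl helper from earlier (illustrative):
# Grab one document from the filebeat metrics data stream and inspect its fields
./eckurl elasticsearch /metrics-elastic_agent.filebeat-default/_search | jq '.hits.hits[0]._source'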
Looking at the contents of the metrics-elastic_agent.filebeat-default index with this setup in 8.11.0, it contains documents that only have metadata and no actual fields from the Beat. Example:
{
"_index": ".ds-metrics-elastic_agent.filebeat-default-2024.01.25-000001",
"_id": "i0-SB2e42r7pkaXFAAABjUM82Ng",
"_score": 1,
"_source": {
"@timestamp": "2024-01-26T00:47:57.656Z",
"agent": {
"ephemeral_id": "75d58779-cf51-454a-9536-23ac30f05e9f",
"id": "00c20c83-0824-44fc-9267-871eb17c5093",
"name": "kind-worker",
"type": "metricbeat",
"version": "8.11.0"
},
"data_stream": {
"dataset": "elastic_agent.filebeat",
"namespace": "default",
"type": "metrics"
},
"ecs": {
"version": "8.0.0"
},
"elastic_agent": {
"id": "00c20c83-0824-44fc-9267-871eb17c5093",
"process": "filebeat",
"snapshot": false,
"version": "8.11.0"
},
"event": {
"agent_id_status": "verified",
"ingested": "2024-01-26T00:47:58Z"
},
"host": {
"architecture": "aarch64",
"containerized": false,
"hostname": "kind-worker",
"ip": [
"10.244.1.1",
"172.18.0.2",
"fc00:f853:ccd:e793::2",
"fe80::42:acff:fe12:2"
],
"mac": [
"02-42-AC-12-00-02",
"26-B9-06-C4-E9-C9",
"7A-B7-3C-13-E3-57",
"96-1E-B7-3F-9E-03",
"9E-09-96-29-B3-8D",
"CA-76-44-33-D5-25",
"EE-5E-F1-53-37-15"
],
"name": "kind-worker",
"os": {
"codename": "focal",
"family": "debian",
"kernel": "5.10.76-linuxkit",
"name": "Ubuntu",
"platform": "ubuntu",
"version": "20.04.6 LTS (Focal Fossa)"
}
},
"metricset": {
"name": "state"
}
}
}
The cluster UUID of each of the two Filebeats is empty in 8.11.0 as well, so they aren't actually collecting data either. I think the culprit is likely https://github.com/elastic/beats/pull/37419/files#diff-028b7407303d60646b0315786de2d77607b4d4f18d93a782d3b7558daf18344e
Specifically this change:
diff --git a/metricbeat/module/beat/state/data.go b/metricbeat/module/beat/state/data.go
index b555c84bd4..5d82310460 100644
--- a/metricbeat/module/beat/state/data.go
+++ b/metricbeat/module/beat/state/data.go
@@ -77,22 +77,19 @@ func eventMapping(r mb.ReporterV2, info beat.Info, content []byte, isXpack bool)
return fmt.Errorf("failure parsing Beat's State API response: %w", err)
}
- event.MetricSetFields, _ = schema.Apply(data)
-
clusterUUID := getMonitoringClusterUUID(data)
if clusterUUID == "" {
if isOutputES(data) {
clusterUUID = getClusterUUID(data)
- if clusterUUID != "" {
- event.ModuleFields.Put("elasticsearch.cluster.id", clusterUUID)
- if event.MetricSetFields != nil {
- event.MetricSetFields.Put("cluster.uuid", clusterUUID)
- }
+ if clusterUUID == "" {
+ return nil
}
}
}
+ event.ModuleFields.Put("elasticsearch.cluster.id", clusterUUID)
+
event.MetricSetFields, _ = schema.Apply(data)
if event.MetricSetFields != nil {
What that does is skip sending the event entirely when the cluster UUID is empty, which is exactly what we observe happening.
The summary here is:
What I suggest you do here is re-enable this test but remove the check for the existence of the metrics-elastic_agent.filebeat-default data stream. Alternatively, I think if you mount /var/log into the agent pod, the container logs should be collected and that data stream should exist again.
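For example, the relaxed check could assert only on data streams that were present in the reproduction above (a sketch, reusing the eckurl helper):
# These data streams existed even while metrics-elastic_agent.filebeat-default did not
./eckurl elasticsearch /_data_stream/metrics-elastic_agent.metricbeat-default
./eckurl elasticsearch /_data_stream/metrics-elastic_agent.filebeat_input-default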
Great investigation, thank you very much.
Makes sense, I agree, it doesn't look like a real bug.
Already done in #7497 :)
See #6494 and #6331
TestFleet* seems to be flaky again. Seeing this again:
The errors from Fleet agent that seem relevant:
https://buildkite.com/elastic/cloud-on-k8s-operator-nightly/builds/396#018c79bc-99cf-48d8-af3e-17fb84b4176c