TestFleet* fail on 8.6.x #6331
Comments
There's nothing obvious wrong in the logs here. In the fleet-server logs I see:
The `signal "terminated" received` log is coming from https://github.com/elastic/elastic-agent/blob/4dcea16e97f79b1d1b456af0348715f609ea76d3/internal/pkg/agent/cmd/run.go#L242 and means the fleet-server agent is shutting down because it was sent SIGTERM. We'll need to find out where this is coming from. Is there anything in the output of |
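For context, here is a minimal sketch of the Go signal-handling pattern that the linked run.go implements in far more detail: SIGTERM is registered, logged by its string form ("terminated" on Linux), and turned into a shutdown. This is a simplified illustration, not the actual elastic-agent code.

```go
package main

import (
	"context"
	"log"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// Register for SIGTERM/SIGINT; syscall.SIGTERM stringifies as "terminated",
	// which is where a log line like `signal "terminated" received` comes from.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)

	go func() {
		sig := <-sigs
		log.Printf("signal %q received, initiating shutdown", sig.String())
		cancel() // propagate shutdown to the rest of the service
	}()

	<-ctx.Done() // block until a signal (or other cancellation) arrives
	log.Println("shutdown complete")
}
```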
I don't see anything relevant in the k8s events:
I looked for SIGTERM and found something interesting: an agent is always stopped when performing fleet enrollment (code), and the stop is done by sending a SIGTERM, aka "terminated" (code). |
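As a rough illustration of that stop path, sending SIGTERM to an already-running process in Go looks like the hypothetical helper below. It is a sketch of the mechanism only, not the enrollment code linked above.

```go
package main

import (
	"log"
	"os"
	"syscall"
)

// stopByPID sends SIGTERM ("terminated") to a running process so it can shut
// down cleanly before enrollment continues. Hypothetical helper, not the
// elastic-agent code referenced above.
func stopByPID(pid int) error {
	proc, err := os.FindProcess(pid)
	if err != nil {
		return err
	}
	// On Unix, FindProcess always succeeds; Signal reports whether the process
	// actually exists and accepted the signal.
	return proc.Signal(syscall.SIGTERM)
}

func main() {
	// Demo: signal this very process; with no handler installed, the default
	// action for SIGTERM terminates it shortly after the signal is delivered.
	if err := stopByPID(os.Getpid()); err != nil {
		log.Fatal(err)
	}
}
```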
Yes, the agent restarting on re-enroll is expected. When it does this it will first terminate the child processes it started by sending SIGTERM. So fleet-server getting SIGTERM is expected. We can see this happening in the full logs where the logs originating from fleet-server are annotated with the binary name:
What I don't think is expected is the Elastic Agent itself getting a SIGTERM. The log below, I believe, originates from the Elastic Agent and not the fleet-server subprocess. A terminated Elastic Agent won't restart itself, which I think is what we are observing, but I don't know why yet.
The entire process here is called Fleet Server bootstrapping and is described in https://github.com/elastic/elastic-agent/blob/c7852a428f6afa0fd496c210ae296f5a4b062622/docs/fleet-server-bootstrap.asciidoc#L2. I will see if I can reproduce this, but it likely won't be until later tomorrow. |
|
@cmacknz So I'm also investigating this issue, and it appears to be only the filebeat fleet-server integration that's failing here. All the agent instances "seem" healthy:
The Fleet page in Kibana shows things as unhealthy. Index management only shows the metricbeat index from the fleet-server agent. Looking at the filebeat logs, there's a stack trace of some sort, along with it trying to connect to Elasticsearch over localhost?
All of the agent logs show this:
Can you look at the live configuration for the agent, specifically for filebeat? |
I initially thought, after I submitted this, that maybe I'm seeing a different error than @thbkrkr, since I'm not seeing the
If it helps, you can look at this live in this cluster @cmacknz:
All stack components are in namespace |
Thanks for the additional details. I ran out of time to look at this today (got a Sev-1 SDH instead), but I haven't forgotten about it. What I really need is the archive generated by the agent diagnostics. The panic you are observing only happens on shutdown and is captured in elastic/beats#34219 already. It isn't causing the problem here. |
@cmacknz thanks for the update. I absolutely have time to grab those for you. I'll have those attached to this in about 30 minutes |
@cmacknz the troubleshooting guide isn't getting me anywhere in solving this one... |
One of the processes managed by the agent didn't respond, which is unfortunate, especially if it didn't write out the diagnostics for the agent itself. There is at least one bug we know of in 8.6.0 that will cause Filebeat to be unable to write events basically at random: elastic/elastic-agent#2086. You will see the associated error in the Filebeat logs when this happens. That could be in play here, but I can't confirm it without a closer look at the failing cluster. I'll have to look at this on Monday. |
I can almost guarantee you I saw that error in the logs. |
@cmacknz some other testing I was doing in this cluster wiped out this failed setup. I'm working to get you another failure you can analyze now. Will update when done. |
And here's the error:
namespace |
Thanks. I would be very curious to see if the problems are solved with an agent build that includes that fix. |
@cmacknz thanks for the update. Would this fix be included in the SNAPSHOT builds? |
Yes it will be once it is published. |
The below worked for me today, since I saw that the 8.7.0-SNAPSHOT build had completed. (The 8.6.x-SNAPSHOT builds had failed for multiple days.)
|
All agents failed to connect to the fleet server with:
elastic-agent-pod/logs.txt: {"log.level":"info","@timestamp":"2023-01-31T02:00:32.929Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":145},"message":"Fleet gateway started","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2023-01-31T02:00:33.243Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":190},"message":"Possible transient error during checkin with fleet-server, retrying","log":{"source":"elastic-agent"},"error":{"message":"fail to checkin to fleet-server: all hosts failed: 1 error occurred:\n\t* requester 0/1 to host https://fleet-server-jjxg-agent-http.e2e-qrhr2-mercury.svc:8220/ errored: Post \"https://fleet-server-jjxg-agent-http.e2e-qrhr2-mercury.svc:8220/api/fleet/agents/0919c133-7e19-4799-816b-c0fc373a5a0a/checkin?\": dial tcp 10.61.204.178:8220: connect: connection refused\n\n"},"request_duration_ns":2268510,"failed_checkins":1,"retry_after_ns":81354918331,"ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2023-01-31T02:01:54.820Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":190},"message":"Possible transient error during checkin with fleet-server, retrying","log":{"source":"elastic-agent"},"error":{"message":"fail to checkin to fleet-server: all hosts failed: 1 error occurred:\n\t* requester 0/1 to host https://fleet-server-jjxg-agent-http.e2e-qrhr2-mercury.svc:8220/ errored: Post \"https://fleet-server-jjxg-agent-http.e2e-qrhr2-mercury.svc:8220/api/fleet/agents/0919c133-7e19-4799-816b-c0fc373a5a0a/checkin?\": dial tcp 10.61.204.178:8220: connect: connection refused\n\n"},"request_duration_ns":3948092,"failed_checkins":2,"retry_after_ns":187001170945,"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-01-31T02:05:02.045Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":194},"message":"Cannot checkin in with fleet-server, retrying","log":{"source":"elastic-agent"},"error":{"message":"fail to checkin to fleet-server: all hosts failed: 1 error occurred:\n\t* requester 0/1 to host https://fleet-server-jjxg-agent-http.e2e-qrhr2-mercury.svc:8220/ errored: Post \"https://fleet-server-jjxg-agent-http.e2e-qrhr2-mercury.svc:8220/api/fleet/agents/0919c133-7e19-4799-816b-c0fc373a5a0a/checkin?\": dial tcp 10.61.204.178:8220: connect: connection refused\n\n"},"request_duration_ns":5615808,"failed_checkins":3,"retry_after_ns":462027578619,"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-01-31T02:12:44.295Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":194},"message":"Cannot checkin in with fleet-server, retrying","log":{"source":"elastic-agent"},"error":{"message":"fail to checkin to fleet-server: all hosts failed: 1 error occurred:\n\t* requester 0/1 to host https://fleet-server-jjxg-agent-http.e2e-qrhr2-mercury.svc:8220/ errored: Post \"https://fleet-server-jjxg-agent-http.e2e-qrhr2-mercury.svc:8220/api/fleet/agents/0919c133-7e19-4799-816b-c0fc373a5a0a/checkin?\": dial tcp 10.61.204.178:8220: connect: connection refused\n\n"},"request_duration_ns":4790732,"failed_checkins":4,"retry_after_ns":551932525972,"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-01-31T02:21:56.451Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":194},"message":"Cannot checkin in with fleet-server, retrying","log":{"source":"elastic-agent"},"error":{"message":"fail to checkin to fleet-server: all hosts failed: 1 error occurred:\n\t* requester 0/1 to host https://fleet-server-jjxg-agent-http.e2e-qrhr2-mercury.svc:8220/ errored: Post \"https://fleet-server-jjxg-agent-http.e2e-qrhr2-mercury.svc:8220/api/fleet/agents/0919c133-7e19-4799-816b-c0fc373a5a0a/checkin?\": dial tcp 10.61.204.178:8220: connect: connection refused\n\n"},"request_duration_ns":4894946,"failed_checkins":5,"retry_after_ns":468702515596,"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-01-31T02:29:45.374Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":194},"message":"Cannot checkin in with fleet-server, retrying","log":{"source":"elastic-agent"},"error":{"message":"fail to checkin to fleet-server: all hosts failed: 1 error occurred:\n\t* requester 0/1 to host https://fleet-server-jjxg-agent-http.e2e-qrhr2-mercury.svc:8220/ errored: Post \"https://fleet-server-jjxg-agent-http.e2e-qrhr2-mercury.svc:8220/api/fleet/agents/0919c133-7e19-4799-816b-c0fc373a5a0a/checkin?\": dial tcp 10.61.204.178:8220: connect: connection refused\n\n"},"request_duration_ns":4828830,"failed_checkins":6,"retry_after_ns":623778092728,"ecs.version":"1.6.0"} fleet-server seems to listen
It looks like the same issue as in the initial comment of this issue, where after having enrolled agents, fleet-server appears to be stopped. Looking at the last log of fleet-server:
|
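The growing retry_after_ns values in the check-in logs above reflect an exponential-backoff retry loop. The following is a hedged, self-contained sketch of that general pattern; the checkin stub, durations, and attempt limit are assumptions for illustration, not the fleet gateway's actual values.

```go
package main

import (
	"errors"
	"log"
	"math/rand"
	"time"
)

// checkin stands in for the real fleet-server check-in request.
func checkin() error { return errors.New("connect: connection refused") }

func main() {
	backoff := time.Second // demo value; the agent's real waits are much longer
	const maxBackoff = 30 * time.Second

	for attempt := 1; attempt <= 5; attempt++ {
		if err := checkin(); err == nil {
			log.Println("check-in succeeded")
			return
		} else {
			log.Printf("check-in %d failed: %v; retrying in %s", attempt, err, backoff)
		}
		// Wait with a little jitter, then grow the delay up to a cap. This is
		// the kind of loop that produces steadily increasing retry_after values.
		time.Sleep(backoff + time.Duration(rand.Int63n(int64(backoff/4)+1)))
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
	log.Println("giving up after repeated check-in failures")
}
```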
@michel-laterman any ideas on what might be going wrong with fleet-server here? Based on the explicit log from fleet server that it is binding to I did try bringing up the agent in ECK locally and it came up green in the one time I tried it. |
strange, all logs indicate that |
This happened 3 times during last night's e2e tests on other
This fails due to the same reason in
|
On the Fleet-server Agent instance:
Last 10 lines of output:
That agent never dies and just sits in this state, @cmacknz @michel-laterman. |
That is odd; it is almost like it is deadlocked in the signal handler. I think it would probably be useful to get a pprof goroutine dump from /debug/goroutine. You'd have to enable the profiler in the fleet-server configuration if you have access to that:

profiler:
  enabled: true
  bind: localhost:6060
|
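For reference, here is a minimal sketch of how a Go service commonly exposes pprof endpoints using the standard net/http/pprof package. Fleet-server's own profiler wiring may differ, so treat the paths below as the usual net/http/pprof defaults rather than a confirmed fleet-server API.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on DefaultServeMux
)

func main() {
	// With the blank import above, a goroutine dump is typically available at
	// http://localhost:6060/debug/pprof/goroutine?debug=2 (full stacks),
	// which is what you would capture to look for a stuck signal handler.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```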
Bug is tracked in elastic/elastic-agent-autodiscover#41. I believe I know what is happening now. The fix is straightforward but will touch a lot of code, so it might take a few days. |
We believe we've fixed this with elastic/elastic-agent#2352. Note that it isn't available in the latest 8.8.0-SNAPSHOT container yet; it should be available in the next successful build. |
@cmacknz Everything looks good now in the latest 8.8.0-SNAPSHOT. |
Move version checks to the agent builder to skip tests due to the following stack bugs:
- Kibana bug "index conflict on install policy" (elastic/kibana#126611)
- Elastic Agent bug "deadlock on startup" (#6331 (comment))
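As an illustration of what such a version-gated skip can look like in a Go e2e test, here is a hedged sketch. The helper name, the E2E_STACK_VERSION environment variable, and the semver library choice are assumptions for illustration only, not the actual cloud-on-k8s builder API.

```go
package e2e

import (
	"os"
	"testing"

	"github.com/blang/semver/v4"
)

// skipIfKnownBadStack skips a test when the stack version under test is
// affected by a known bug. Hypothetical helper, not the real agent builder.
func skipIfKnownBadStack(t *testing.T) {
	t.Helper()
	raw := os.Getenv("E2E_STACK_VERSION") // assumed env var for illustration
	if raw == "" {
		return
	}
	v, err := semver.Parse(raw)
	if err != nil {
		t.Fatalf("invalid stack version %q: %v", raw, err)
	}
	if v.GTE(semver.MustParse("8.6.0")) && v.LT(semver.MustParse("8.7.0")) {
		t.Skipf("skipping on %s: known fleet-server/agent deadlock on startup (#6331)", v)
	}
}

func TestFleetKubernetesIntegrationRecipe(t *testing.T) {
	skipIfKnownBadStack(t)
	// ... actual recipe test would run here ...
}
```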
When updating the stack version to 8.6.0 (#6327), TestFleetKubernetesIntegrationRecipe fails: https://devops-ci.elastic.co/job/cloud-on-k8s-pr-e2e-tests/98/
The index logs-elastic_agent.fleet_server-default is missing because fleet-server and the agents are in error. I can reproduce the issue just by applying config/recipes/elastic-agent/fleet-kubernetes-integration.yaml after updating the stack version to 8.6.0.
The fleet-server agent pod is running, but the log shows that the process terminated as soon as it was enrolled. Because fleet-server is in error, 2 agent pods are in CrashLoopBackOff state and the third is stuck retrying to reach fleet-server.
Full log: