Elastic Agents Unhealthy: Elasticsearch connection failure #13416
-
Version: 2.4.80
Installation Method: Security Onion ISO image
Description: other (please provide detail below)
Installation Type: Distributed
Location: on-prem with Internet access
Hardware Specs: Exceeds minimum requirements
CPU: 8 for manager, 4 for both search nodes, 8 or more for sensors
RAM: 32 GB for manager, 64 GB for both search nodes, 48 or more for sensors
Storage for /: 300 GB for all nodes
Storage for /nsm: 1.6 TB for manager, 4.5 TB for both search nodes, 700+ GB for sensors
Network Traffic Collection: tap
Network Traffic Speeds: Less than 1Gbps
Status: Yes, all services on all nodes are running OK
Salt Status: No, there are no failures
Logs: Yes, there are additional clues in /opt/so/log/ (please provide detail below)

Detail:
I recently spun up a new distributed deployment, including pushing the Elastic Agent to all 250 Windows endpoints. I monitored EPS, CPU, and RAM on the manager and search nodes as I added the agents in phases, so I could look for the most frequent events and see whether tuning was warranted. I found that PowerShell script block logging was generating more events than anything else, due to heavy use of PowerShell scripts that run on all of our endpoints, so I opted to disable PowerShell log collection in the Windows integration in the endpoints-initial policy.

About a week later, I noticed CPU was high on the manager and search nodes. I reviewed the Windows integration and found that PowerShell logging was on again. I don't recall doing anything that would turn it back on, so I figured an update caused it to revert. I duplicated the endpoints-initial policy, turned PowerShell logging off in the copy, and then tried to re-assign all the agents to the new policy.

Then I noticed my host logs stopped flowing to Onion. I reviewed the fleet and found the agents unhealthy due to the elastic-defend-endpoints integration being degraded. While troubleshooting, I came across #11148, so I tried the following:
That didn't seem to help, so I then tried to re-assign the agents back to the default endpoints-initial policy. When I look at agent activity, it shows some of the reassignments still in progress (even a day later, after a reboot). I looked into the logs on an endpoint and found the following lines:

```json
{"log.level":"error","@timestamp":"2024-07-31T09:23:45.390Z","message":"Failed to connect to backoff(async(tcp://onion-manager:5055)): read tcp 10.13.100.111:58156->10.13.100.236:5055: i/o timeout","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"winlog-default","type":"winlog"},"log":{"source":"winlog-default"},"log.logger":"publisher_pipeline_output","log.origin":{"file.line":148,"file.name":"pipeline/client_worker.go"},"service.name":"filebeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
```

There are several posts about these errors on the Elastic discussion forums that point to either high CPU or the client_inactivity_timeout setting. I added more CPUs to the manager and rebooted, but that didn't help. I'm also not sure how to change client_inactivity_timeout, and not sure that even makes sense in this case. https://discuss.elastic.co/t/filebeat-failed-to-publish-events-caused-by-client-is-not-connected/217603/8

If I restart Logstash with so-logstash-restart, the agents slowly come back to healthy for a bit and then go back to unhealthy. They don't seem to forward logs during this time. Seems similar to this issue: #10696

Here are my outputs, which I haven't touched directly. Short of resetting my Elastic Fleet entirely, I'm out of ideas and hoping someone can help. I'm also wondering if I should set up a standalone Fleet manager going forward, though the sensors (all heavy forwarders, due to limited bandwidth and extra resources at each remote location) don't seem to have any issues; their integrations can connect fine.
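For reference, my understanding is that client_inactivity_timeout is an option on Logstash's beats-style input. Since Security Onion generates its Logstash pipelines via Salt, I assume hand-editing a pipeline file would be overwritten on the next highstate; this is just a sketch of the setting I mean, and the input plugin name and layout are my assumptions:

```conf
# Hypothetical sketch only: Security Onion's Logstash pipelines are Salt-managed,
# so this is illustrative rather than a file to edit directly.
input {
  beats {
    port => 5055
    # Default is 60 (seconds); raising it keeps idle agent connections
    # open longer before Logstash drops them.
    client_inactivity_timeout => 300
  }
}
```

Before touching timeouts, it's probably worth ruling out plain network problems from an endpoint with PowerShell's built-in `Test-NetConnection onion-manager -Port 5055`.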
-
What does the CPU/memory usage look like on the manager in InfluxDB? Perhaps you would benefit from standing up a Fleet node to assist with handling the endpoint logs: https://docs.securityonion.net/en/2.4/architecture.html#elastic-fleet-standalone-node
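For a quick spot check while you pull the InfluxDB graphs, something like this from the manager's shell might help (the container names in the filter are an assumption on my part, based on Security Onion's so-* naming convention):

```sh
# One-shot snapshot of per-container CPU/memory on the manager,
# filtered to the usual heavy hitters
sudo docker stats --no-stream | grep -E 'logstash|elasticsearch|kibana'
```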
-
Also, can you post a sanitized copy of the agent policy?
-
Everything really seemed to go wonky as soon as I duplicated the endpoints-initial policy, changed the PowerShell/AppLocker event settings, and then tried to assign all ~250 agents to the new policy at one time (well, in batches of 50). If I reset the fleet, redeploying the agent isn't a big deal, but will that cause issues with the Onion nodes? I'd like to just start fresh, but without redeploying the sensors.
-
More Logs.
-
I added a receiver node and am now getting all of my logs ingested via that node successfully (hooray!), but I'm still not able to get any logs via the manager's Logstash instance on 5055. I still see the same errors in the logs on the Windows agents.
Is there a way to just reset the Logstash ingestion on the manager? I'd generally prefer logs to go to the new receiver anyway, but I'd like to be able to reboot either node and keep logs flowing as long as the other is online.
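For context, here's roughly how I've been checking and restarting it; so-logstash-restart is the helper I mentioned earlier, and the exact log filename under /opt/so/log/ is a guess on my part:

```sh
# Confirm something is listening on 5055 on the manager
sudo ss -tlnp | grep 5055

# Restart Logstash with the Security Onion helper
sudo so-logstash-restart

# Watch for pipeline/connection errors as agents reconnect
# (exact filename under /opt/so/log/logstash/ is an assumption)
sudo tail -f /opt/so/log/logstash/logstash.log
```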
-
@jstore-embers A few questions:
I upgraded to .100. I'm still not seeing any logs go through the manager. My understanding is that they would go to both the manager and the receiver. It's not a big deal for me, though; I generally would prefer logs going through the receiver anyway, and I'm fine if they queue when the receiver reboots. I'll likely rebuild in Q1 of next year as new hardware becomes available.
The original resource issue seemed to be related to:
a.) Noisy PowerShell logs drastically increasing my overall EPS (we run a ton of PowerShell automation for security monitoring)
b.) ElastAlert frequency with default settings when using lots of Sigma rules
I overcame the resource issues by removing 2 noisy powe…
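For anyone else chasing similar EPS problems, here's a sketch of one way to rank datasets by volume with an Elasticsearch terms aggregation; the URL, credentials, and index pattern below are assumptions and will differ per deployment:

```sh
# Hypothetical: top 10 event datasets by document count over the last 24 hours
curl -sk -u "user:pass" "https://localhost:9200/logs-*/_search" \
  -H 'Content-Type: application/json' -d '
{
  "size": 0,
  "query": { "range": { "@timestamp": { "gte": "now-24h" } } },
  "aggs": {
    "by_dataset": { "terms": { "field": "event.dataset", "size": 10 } }
  }
}'
```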