Elastic Agents Unhealthy: Elasticsearch connection failure #13416
-
Version: 2.4.80
Installation Method: Security Onion ISO image
Description: other (please provide detail below)
Installation Type: Distributed
Location: on-prem with Internet access
Hardware Specs: Exceeds minimum requirements
CPU: 8 for manager, 4 for both search nodes, 8 or more for sensors
RAM: 32 GB for manager, 64 GB for both search nodes, 48 or more for sensors
Storage for /: 300 GB for all nodes
Storage for /nsm: 1.6 TB for manager, 4.5 TB for both search nodes, 700+ GB for sensors
Network Traffic Collection: tap
Network Traffic Speeds: Less than 1Gbps
Status: Yes, all services on all nodes are running OK
Salt Status: No, there are no failures
Logs: Yes, there are additional clues in /opt/so/log/ (please provide detail below)

Detail:
I recently spun up a new distributed deployment, including pushing the Elastic Agent to all 250 Windows endpoints. I monitored EPS, CPU, and RAM on the manager and search nodes as I added the agents in phases, so I could look for the most frequent events and see whether tuning was warranted. I found that PowerShell script block logging was generating more events than anything else, due to heavy use of PowerShell scripts that run on all of our endpoints, so I opted to disable PowerShell log collection in the Windows integration in the endpoints-initial policy.

About a week later, I noticed CPU was high on the manager and search nodes. I reviewed the Windows integration and found that PowerShell logging was on again. I don't recall doing anything that would turn it back on, so I figured an update caused it to revert. I duplicated the endpoints-initial policy, turned PowerShell logging off in the copy, and then tried to re-assign all the agents to the new policy.

Then I noticed my host logs stopped flowing to Onion. I reviewed the fleet and found the agents unhealthy due to the elastic-defend-endpoints integration being degraded. While troubleshooting, I came across #11148, so I tried the following:
That didn't seem to help, so I then tried to re-assign the agents back to the default endpoints-initial policy. When I look at agent activity, it shows some of the reassignments still in progress (even a day later, after a reboot). I looked into the logs on an endpoint and found the following lines:

```json
{"log.level":"error","@timestamp":"2024-07-31T09:23:45.390Z","message":"Failed to connect to backoff(async(tcp://onion-manager:5055)): read tcp 10.13.100.111:58156->10.13.100.236:5055: i/o timeout","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"winlog-default","type":"winlog"},"log":{"source":"winlog-default"},"log.logger":"publisher_pipeline_output","log.origin":{"file.line":148,"file.name":"pipeline/client_worker.go"},"service.name":"filebeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
```

There are several posts about these errors on the Elastic discussion forums that point to either high CPU or the client_inactivity_timeout setting. I added more CPUs to the manager and rebooted, but that didn't help. I'm also not sure how to change client_inactivity_timeout, and not sure that even makes sense in this case. https://discuss.elastic.co/t/filebeat-failed-to-publish-events-caused-by-client-is-not-connected/217603/8

If I restart Logstash with so-logstash-restart, the agents slowly come back to healthy for a bit and then go back to unhealthy. They don't seem to forward logs during this time. Seems similar to this issue: #10696

Here are my outputs, which I haven't touched directly. Short of resetting my Elastic Fleet entirely, I'm out of ideas and hoping someone can help. I'm also wondering if I should set up a standalone Fleet manager going forward, though the sensors (all heavy forwarders, due to limited bandwidth and extra resources at each remote location) don't seem to have any issues; their integrations can connect fine.
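For reference, my understanding is that client_inactivity_timeout is an option on Logstash's beats-style input. Since Security Onion generates its Logstash pipelines via Salt, I assume hand-editing a pipeline file would be overwritten on the next highstate; this is just a sketch of the setting I mean, and the input plugin name and layout are my assumptions:

```conf
# Hypothetical sketch only: Security Onion's Logstash pipelines are Salt-managed,
# so this is illustrative rather than a file to edit directly.
input {
  beats {
    port => 5055
    # Default is 60 (seconds); raising it keeps idle agent connections
    # open longer before Logstash drops them.
    client_inactivity_timeout => 300
  }
}
```

Before touching timeouts, it's probably worth ruling out plain network problems from an endpoint with PowerShell's built-in `Test-NetConnection onion-manager -Port 5055`.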
-
What does the CPU/memory usage look like on the manager in InfluxDB? Perhaps you would benefit from standing up a Fleet node to assist with handling the endpoint logs: https://docs.securityonion.net/en/2.4/architecture.html#elastic-fleet-standalone-node
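For a quick spot check while you pull the InfluxDB graphs, something like this from the manager's shell might help (the container names in the filter are an assumption on my part, based on Security Onion's so-* naming convention):

```sh
# One-shot snapshot of per-container CPU/memory on the manager,
# filtered to the usual heavy hitters
sudo docker stats --no-stream | grep -E 'logstash|elasticsearch|kibana'
```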
-
Also, can you post a sanitized copy of the agent policy?
-
Everything really seemed to go wonky as soon as I duplicated the endpoints-initial policy, changed the PowerShell/AppLocker event settings, and then tried to assign all ~250 agents to the new policy at one time (well, in batches of 50). If I reset the fleet, redeploying the agent isn't a big deal, but will that cause issues with the Onion nodes? I'd like to just start fresh, but without redeploying the sensors.
-
More Logs.
-
I added a receiver node and am now getting all of my logs ingested via that node successfully (hooray!), but I'm still not able to get any logs via the manager's Logstash instance on 5055. I still see the same errors in the logs on the Windows agents.
Is there a way to just reset the Logstash ingestion on the manager? I'd generally prefer logs to go to the new receiver anyway, but I'd like to be able to reboot either node and keep logs flowing as long as the other is online.
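For context, here's roughly how I've been checking and restarting it; so-logstash-restart is the helper I mentioned earlier, and the exact log filename under /opt/so/log/ is a guess on my part:

```sh
# Confirm something is listening on 5055 on the manager
sudo ss -tlnp | grep 5055

# Restart Logstash with the Security Onion helper
sudo so-logstash-restart

# Watch for pipeline/connection errors as agents reconnect
# (exact filename under /opt/so/log/logstash/ is an assumption)
sudo tail -f /opt/so/log/logstash/logstash.log
```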
-
@jstore-embers A few questions:
I upgraded to .100. I'm still not seeing any logs go through the manager. My understanding is that they would go to both the manager and the receiver. It's not a big deal for me, though; I generally would prefer logs going through the receiver anyway, and I'm fine if they queue when the receiver reboots. I'll likely rebuild in Q1 of next year as new hardware becomes available.
The original resource issue seemed to be related to:
a.) Noisy PowerShell logs drastically increasing my overall EPS (we run a ton of PowerShell automation for security monitoring)
b.) ElastAlert frequency with default settings when using lots of Sigma rules
I overcame the resource issues by removing 2 noisy powe…
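For anyone else chasing similar EPS problems, here's a sketch of one way to rank datasets by volume with an Elasticsearch terms aggregation; the URL, credentials, and index pattern below are assumptions and will differ per deployment:

```sh
# Hypothetical: top 10 event datasets by document count over the last 24 hours
curl -sk -u "user:pass" "https://localhost:9200/logs-*/_search" \
  -H 'Content-Type: application/json' -d '
{
  "size": 0,
  "query": { "range": { "@timestamp": { "gte": "now-24h" } } },
  "aggs": {
    "by_dataset": { "terms": { "field": "event.dataset", "size": 10 } }
  }
}'
```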