
TestFleet* fail on 8.6.x #6331

Closed
thbkrkr opened this issue Jan 17, 2023 · 26 comments · Fixed by #6611
thbkrkr commented Jan 17, 2023

When updating the stack version to 8.6.0 (#6327), TestFleetKubernetesIntegrationRecipe fails.

https://devops-ci.elastic.co/job/cloud-on-k8s-pr-e2e-tests/98/

=== RUN   TestFleetKubernetesIntegrationRecipe/ES_data_should_pass_validations
Retries (30m0s timeout): .......
    step.go:44: 
        Error: Received unexpected error:
            elasticsearch client failed for https://elasticsearch-6qxw-es-internal-http.e2e-nru4o-mercury.svc:9200/_data_stream/logs-elastic_agent.fleet_server-default: 404 Not Found: 
            RootCause:[{
                Reason: no such index [logs-elastic_agent.fleet_server-default]
                Type: index_not_found_exception
             }]
    --- FAIL: TestFleetKubernetesIntegrationRecipe/ES_data_should_pass_validations (1800.00s)

The index `logs-elastic_agent.fleet_server-default` is missing because Fleet Server and the agents are in error.

I can reproduce the issue just by applying `config/recipes/elastic-agent/fleet-kubernetes-integration.yaml` after updating the stack version to 8.6.0.

The fleet-server agent pod is running, but its log shows that the process terminated as soon as it was enrolled.

> k logs fleet-server-agent-6b8bb8976f-hcz24 | grep '{' | tail -15 | jq .message -r
fleet-server routes set up
server hard connection limiter installed
Running on policy with Fleet Server integration: eck-fleet-server; missing config fleet.agent.id (expected during bootstrap process)
Unit state changed fleet-server-default-fleet-server (STARTING->DEGRADED): Running on policy with Fleet Server integration: eck-fleet-server; missing config fleet.agent.id (expected during bootstrap process)
Unit state changed fleet-server-default (STARTING->DEGRADED): Running on policy with Fleet Server integration: eck-fleet-server; missing config fleet.agent.id (expected during bootstrap process)
Fleet Server - Running on policy with Fleet Server integration: eck-fleet-server; missing config fleet.agent.id (expected during bootstrap process)
Starting enrollment to URL: https://fleet-server-agent-http.default.svc:8220/
Elastic Agent successfully enrolled
Elastic Agent successfully enrolled
Elastic Agent has been enrolled; start Elastic Agent
signal "terminated" received
Shutting down Elastic Agent and sending last events...
On signal
Stopping
error while closing server

Because fleet-server is in error, two agent pods are in CrashLoopBackOff and the third is stuck retrying to reach fleet-server:

NAME                                  READY   STATUS             RESTARTS        AGE
elastic-agent-agent-8cc82             0/1     CrashLoopBackOff   9 (4m35s ago)   27m
elastic-agent-agent-8mm5w             1/1     Running            1 (27m ago)     27m
elastic-agent-agent-fgrnn             0/1     CrashLoopBackOff   9 (5m ago)      27m
> k logs elastic-agent-agent-8cc82 # (CrashLoopBackOff)
Updating certificates in /etc/ssl/certs...
rehash: warning: skipping ca-certificates.crt,it does not contain exactly one certificate or CRL
1 added, 0 removed; done.
Running hooks in /etc/ca-certificates/update.d...
done.
{"log.level":"info","@timestamp":"2023-01-17T14:15:47.237Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":475},"message":"Starting enrollment to URL: https://fleet-server-agent-http.default.svc:8220/","ecs.version":"1.6.0"}
Error: fail to enroll: fail to execute request to fleet-server: dial tcp 10.111.196.28:8220: connect: connection refused
For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.6/fleet-troubleshooting.html
Error: enrollment failed: exit status 1
For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.6/fleet-troubleshooting.html
> k logs elastic-agent-agent-8mm5w | tail -1 # (Running)
{"log.level":"error","@timestamp":"2023-01-17T14:11:39.924Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":194},
  "message":"Cannot checkin in with fleet-server, retrying",
  "error":{"message":"fail to checkin to fleet-server: all hosts failed: 1 error occurred:\n\t* requester 0/1 to host https://fleet-server-agent-http.default.svc:8220/ errored: Post \"https://fleet-server-agent-http.default.svc:8220/api/fleet/agents/996ab694-ef86-4f09-9ad3-47b7cf2fa1ed/checkin?\": dial tcp 10.111.196.28:8220: connect: connection refused\n\n"},"request_duration_ns":5733281,"failed_checkins":5,"retry_after_ns":739653326754,"ecs.version":"1.6.0"}
Full log:
Updating certificates in /etc/ssl/certs...
rehash: warning: skipping ca-certificates.crt,it does not contain exactly one certificate or CRL
1 added, 0 removed; done.
Running hooks in /etc/ca-certificates/update.d...
done.
Policy selected for enrollment:  eck-fleet-server
info 2023-01-17T13:47:58.184Z origin":{"file.name":"cmd/enroll_cmd.go file.line":594},"message":"Spawning Elastic Agent daemon as a subprocess to complete bootstrap process. ecs.version":"1.6.0"}
info 2023-01-17T13:47:58.963Z origin":{"file.name":"cmd/run.go file.line":188},"message":"APM instrumentation disabled ecs.version":"1.6.0"}
info 2023-01-17T13:47:59.040Z origin":{"file.name":"application/application.go file.line":44},"message":"Gathered system information ecs.version":"1.6.0"}
info 2023-01-17T13:47:59.151Z origin":{"file.name":"application/application.go file.line":50},"message":"Detected available inputs and outputs inputs":["syslog mongodb/metrics cloudfoundry/metrics nats/metrics prometheus/metrics osquery aws-cloudwatch mqtt uwsgi/metrics haproxy/metrics kafka/metrics sql/metrics cloudbeat/cis_k8s postgresql/metrics oracle/metrics elasticsearch/metrics unix windows/metrics syncgateway/metrics azure-eventhub mssql/metrics iis/metrics aws/metrics cel tcp nginx/metrics stan/metrics traefik/metrics aws-s3 synthetics/tcp system/metrics cloudbeat httpjson cloudbeat/cis_eks synthetics/icmp beat/metrics apache/metrics packet container netflow udp azure/metrics jolokia/metrics journald gcp-pubsub redis fleet-server containerd/metrics activemq/metrics audit/auditd awsfargate/metrics vsphere/metrics mysql/metrics audit/system kafka winlog synthetics/http docker/metrics http/metrics gcp/metrics audit/file_integrity log synthetics/browser logstash/metrics rabbitmq/metrics endpoint cloudfoundry http_endpoint o365audit filestream kibana/metrics zookeeper/metrics apm kubernetes/metrics linux/metrics redis/metrics etcd/metrics docker"],"ecs.version":"1.6.0"}
info 2023-01-17T13:47:59.152Z origin":{"file.name":"capabilities/capabilities.go file.line":54},"message":"Capabilities file not found in /usr/share/elastic-agent/capabilities.yml ecs.version":"1.6.0"}
info 2023-01-17T13:47:59.152Z origin":{"file.name":"application/application.go file.line":56},"message":"Determined allowed capabilities ecs.version":"1.6.0"}
info 2023-01-17T13:48:00.056Z origin":{"file.name":"application/application.go file.line":105},"message":"Parsed configuration and determined agent is in Fleet Server bootstrap mode ecs.version":"1.6.0"}
info 2023-01-17T13:48:00.058Z log.logger":"api origin":{"file.name":"api/server.go file.line":68},"message":"Starting stats endpoint ecs.version":"1.6.0"}
info 2023-01-17T13:48:00.058Z log.logger":"api origin":{"file.name":"api/server.go file.line":70},"message":"Metrics endpoint listening on: 127.0.0.1:6791 (configured: http://localhost:6791) ecs.version":"1.6.0"}
info 2023-01-17T13:48:00.143Z log.logger":"composable.providers.kubernetes origin":{"file.name":"kubernetes/kubernetes.go file.line":113},"message":"Kubernetes provider started for resource pod with node scope ecs.version":"1.6.0"}
info 2023-01-17T13:48:00.144Z origin":{"file.name":"upgrade/upgrade.go file.line":102},"message":"Source URI reset from \"https://artifacts.elastic.co/downloads/\" to \"https://artifacts.elastic.co/downloads/\" ecs.version":"1.6.0"}
info 2023-01-17T13:48:00.241Z log.logger":"composable.providers.docker origin":{"file.name":"docker/docker.go file.line":44},"message":"Docker provider skipped, unable to connect: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? ecs.version":"1.6.0"}
info 2023-01-17T13:48:00.244Z origin":{"file.name":"cmd/enroll_cmd.go file.line":770},"message":"Waiting for Elastic Agent to start Fleet Server ecs.version":"1.6.0"}
info 2023-01-17T13:48:00.246Z origin":{"file.name":"coordinator/coordinator.go file.line":729},"message":"Updating running component model ecs.version":"1.6.0"}
info 2023-01-17T13:48:00.343Z log.logger":"composable.providers.kubernetes origin":{"file.name":"kubernetes/util.go file.line":129},"message":"kubernetes: Node gke-thb-dev-default-pool-b2e32065-fzrr discovered by in cluster pod node query ecs.version":"1.6.0"}
info 2023-01-17T13:48:00.950Z log.logger":"composable.providers.kubernetes origin":{"file.name":"kubernetes/kubernetes.go file.line":113},"message":"Kubernetes provider started for resource node with node scope ecs.version":"1.6.0"}
info 2023-01-17T13:48:00.950Z log.logger":"composable.providers.kubernetes origin":{"file.name":"kubernetes/util.go file.line":122},"message":"kubernetes: Using node gke-thb-dev-default-pool-b2e32065-fzrr provided in the config ecs.version":"1.6.0"}
info 2023-01-17T13:48:01.341Z origin":{"file.name":"coordinator/coordinator.go file.line":729},"message":"Updating running component model ecs.version":"1.6.0"}
info 2023-01-17T13:48:01.342Z origin":{"file.name":"coordinator/coordinator.go file.line":825},"message":"Spawned new component fleet-server-default: Starting: spawned pid '945' component":{"id":"fleet-server-default state":"STARTING"},"ecs.version":"1.6.0"}
info 2023-01-17T13:48:01.343Z origin":{"file.name":"coordinator/coordinator.go file.line":825},"message":"Spawned new unit fleet-server-default-fleet-server: Starting: spawned pid '945' component":{"id":"fleet-server-default state":"STARTING"},"unit":{"id":"fleet-server-default-fleet-server type":"input state":"STARTING"},"ecs.version":"1.6.0"}
info 2023-01-17T13:48:01.343Z origin":{"file.name":"coordinator/coordinator.go file.line":825},"message":"Spawned new unit fleet-server-default: Starting: spawned pid '945' component":{"id":"fleet-server-default state":"STARTING"},"unit":{"id":"fleet-server-default type":"output state":"STARTING"},"ecs.version":"1.6.0"}
info 2023-01-17T13:48:01.543Z origin":{"file.name":"coordinator/coordinator.go file.line":729},"message":"Updating running component model ecs.version":"1.6.0"}
info 2023-01-17T13:48:01.747Z message":"No applicable limit for 0 agents, using default. component":{"binary":"fleet-server dataset":"elastic_agent.fleet_server id":"fleet-server-default type":"fleet-server"},"level":"info time":"2023-01-17T13:48:01Z ecs.version":"1.6.0"}
info 2023-01-17T13:48:01.747Z message":"No applicable limit for 0 agents, using default. component":{"binary":"fleet-server dataset":"elastic_agent.fleet_server id":"fleet-server-default type":"fleet-server"},"level":"info time":"2023-01-17T13:48:01Z ecs.version":"1.6.0"}
info 2023-01-17T13:48:01.747Z message":"Boot fleet-server component":{"binary":"fleet-server dataset":"elastic_agent.fleet_server id":"fleet-server-default type":"fleet-server"},"args":["--agent-mode -E logging.level=debug -E logging.to_stderr=true -E path.data=/usr/share/elastic-agent/state/data/run/fleet-server-default"],"ecs.version":"1.6.0 version":"8.6.0 commit":"05088c13 pid":945,"service.name":"fleet-server ppid":931,"exe":"/usr/share/elastic-agent/data/elastic-agent-b79a5d/components/fleet-server ecs.version":"1.6.0"}
info 2023-01-17T13:48:01.841Z message":"starting communication connection back to Elastic Agent component":{"binary":"fleet-server dataset":"elastic_agent.fleet_server id":"fleet-server-default type":"fleet-server"},"ecs.version":"1.6.0 service.name":"fleet-server ecs.version":"1.6.0"}
I0117 13:48:01.845453     931 leaderelection.go:248] attempting to acquire leader lease default/elastic-agent-cluster-leader...
I0117 13:48:01.855543     931 leaderelection.go:258] successfully acquired lease default/elastic-agent-cluster-leader
info 2023-01-17T13:48:01.944Z origin":{"file.name":"coordinator/coordinator.go file.line":831},"message":"Component state changed fleet-server-default (STARTING->HEALTHY): Healthy: communicating with pid '945' component":{"id":"fleet-server-default state":"HEALTHY old_state":"STARTING"},"ecs.version":"1.6.0"}
info 2023-01-17T13:48:02.043Z message":"No applicable limit for 0 agents, using default. component":{"binary":"fleet-server dataset":"elastic_agent.fleet_server id":"fleet-server-default type":"fleet-server"},"service.name":"fleet-server ecs.version":"1.6.0 ecs.version":"1.6.0"}
info 2023-01-17T13:48:02.043Z message":"No applicable limit for 0 agents, using default. component":{"binary":"fleet-server dataset":"elastic_agent.fleet_server id":"fleet-server-default type":"fleet-server"},"ecs.version":"1.6.0 service.name":"fleet-server ecs.version":"1.6.0"}
info 2023-01-17T13:48:02.043Z message":"No applicable limit for 0 agents, using default. component":{"binary":"fleet-server dataset":"elastic_agent.fleet_server id":"fleet-server-default type":"fleet-server"},"ecs.version":"1.6.0 service.name":"fleet-server ecs.version":"1.6.0"}
info 2023-01-17T13:48:02.043Z message":"No applicable limit for 0 agents, using default. component":{"binary":"fleet-server dataset":"elastic_agent.fleet_server id":"fleet-server-default type":"fleet-server"},"ecs.version":"1.6.0 service.name":"fleet-server ecs.version":"1.6.0"}
info 2023-01-17T13:48:02.043Z message":"No applicable limit for 0 agents, using default. component":{"binary":"fleet-server dataset":"elastic_agent.fleet_server id":"fleet-server-default type":"fleet-server"},"ecs.version":"1.6.0 service.name":"fleet-server ecs.version":"1.6.0"}
info 2023-01-17T13:48:02.044Z message":"No applicable limit for 0 agents, using default. component":{"binary":"fleet-server dataset":"elastic_agent.fleet_server id":"fleet-server-default type":"fleet-server"},"ecs.version":"1.6.0 service.name":"fleet-server ecs.version":"1.6.0"}
info 2023-01-17T13:48:02.044Z message":"No applicable limit for 0 agents, using default. component":{"binary":"fleet-server dataset":"elastic_agent.fleet_server id":"fleet-server-default type":"fleet-server"},"service.name":"fleet-server ecs.version":"1.6.0 ecs.version":"1.6.0"}
info 2023-01-17T13:48:02.045Z message":"No applicable limit for 0 agents, using default. component":{"binary":"fleet-server dataset":"elastic_agent.fleet_server id":"fleet-server-default type":"fleet-server"},"ecs.version":"1.6.0 service.name":"fleet-server ecs.version":"1.6.0"}
info 2023-01-17T13:48:02.045Z message":"Setting cache config options component":{"binary":"fleet-server dataset":"elastic_agent.fleet_server id":"fleet-server-default type":"fleet-server"},"service.name":"fleet-server cfg":{"APIKeyJitter":300000000000,"APIKeyTTL":900000000000,"ActionTTL":300000000000,"ArtifactTTL":86400000000000,"EnrollKeyTTL":60000000000,"MaxCost":52428800,"NumCounters":500000},"ecs.version":"1.6.0 ecs.version":"1.6.0"}
info 2023-01-17T13:48:02.141Z message":"Starting component":{"binary":"fleet-server dataset":"elastic_agent.fleet_server id":"fleet-server-default type":"fleet-server"},"state":"STARTING ecs.version":"1.6.0 service.name":"fleet-server ecs.version":"1.6.0"}
info 2023-01-17T13:48:02.141Z message":"No applicable limit for 0 agents, using default. component":{"binary":"fleet-server dataset":"elastic_agent.fleet_server id":"fleet-server-default type":"fleet-server"},"service.name":"fleet-server ecs.version":"1.6.0 ecs.version":"1.6.0"}
info 2023-01-17T13:48:02.143Z message":"initial server configuration component":{"binary":"fleet-server dataset":"elastic_agent.fleet_server id":"fleet-server-default type":"fleet-server"},"ecs.version":"1.6.0 service.name":"fleet-server new":{"Fleet":{"Agent":{"ID":" Logging":{"Level":"info"},"Version":""},"Host":{"ID":" Name":""}},"HTTP":{"Enabled":false,"Host":"localhost Port":5066,"SecurityDescriptor":" User":""},"Inputs":[{"Cache":{"APIKeyJitter":0,"APIKeyTTL":0,"ActionTTL":0,"ArtifactTTL":0,"EnrollKeyTTL":0,"MaxCost":0,"NumCounters":0},"Monitor":{"FetchSize":0,"PollTimeout":0},"Policy":{"ID":""},"Server":{"Bulk":{"FlushInterval":250000000,"FlushMaxPending":8,"FlushThresholdCount":2048,"FlushThresholdSize":1048576},"CompressionLevel":1,"CompressionThresh":1024,"GC":{"CleanupAfterExpiredInterval":"30d ScheduleInterval":3600000000000},"Host":"0.0.0.0 Instrumentation":{"APIKey":" Enabled":false,"Environment":" GlobalLabels":" Hosts":null,"SecretToken":" TLS":{"ServerCA":" ServerCertificate":" SkipVerify":false},"TransactionSampleRate":""},"InternalPort":8221,"Limits":{"AckLimit":{"Burst":100,"Interval":10000000,"Max":50,"MaxBody":2097152},"ArtifactLimit":{"Burst":25,"Interval":5000000,"Max":50,"MaxBody":0},"CheckinLimit":{"Burst":1000,"Interval":1000000,"Max":0,"MaxBody":1048576},"EnrollLimit":{"Burst":100,"Interval":10000000,"Max":50,"MaxBody":524288},"MaxAgents":0,"MaxConnections":52428800,"MaxHeaderByteSize":8192,"PolicyThrottle":500000,"StatusLimit":{"Burst":25,"Interval":5000000,"Max":50,"MaxBody":0}},"Port":8220,"Profiler":{"Bind":"localhost:6060 Enabled":false},"Runtime":{"GCPercent":0},"TLS":{"CASha256":null,"CAs":null,"Certificate":{"Certificate":"/usr/share/fleet-server/config/http-certs/tls.crt Key":"[redacted] Passphrase":""},"CipherSuites":null,"ClientAuth":0,"CurveTypes":null,"Enabled":null,"VerificationMode":"full 
Versions":null},"Timeouts":{"CheckinJitter":30000000000,"CheckinLongPoll":300000000000,"CheckinTimestamp":30000000000,"Idle":30000000000,"Read":60000000000,"ReadHeader":5000000000,"Write":600000000000}},"Type":""}],"Logging":{"Files":null,"Level":"info Pretty":false,"ToFiles":true,"ToStderr":true},"Output":{"Elasticsearch":{"APIKey":" Headers":null,"Hosts":["elasticsearch-es-http.default.svc:9200"],"MaxConnPerHost":128,"MaxContentLength":104857600,"MaxRetries":3,"Path":" Protocol":"https ProxyDisable":false,"ProxyHeaders":{},"ProxyURL":" ServiceToken":"[redacted] TLS":{"CASha256":null,"CATrustedFingerprint":" CAs":["/mnt/elastic-internal/elasticsearch-association/default/elasticsearch/certs/ca.crt"],"Certificate":{"Certificate":" Key":" Passphrase":""},"CipherSuites":null,"CurveTypes":null,"Enabled":null,"Renegotiation":"never VerificationMode":"full Versions":null},"Timeout":90000000000},"Extra":null}},"ecs.version":"1.6.0"}
info 2023-01-17T13:48:02.143Z message":"starting server on configuration change component":{"binary":"fleet-server dataset":"elastic_agent.fleet_server id":"fleet-server-default type":"fleet-server"},"ecs.version":"1.6.0 service.name":"fleet-server ecs.version":"1.6.0"}
info 2023-01-17T13:48:02.416Z message":"elasticsearch cluster info component":{"binary":"fleet-server dataset":"elastic_agent.fleet_server id":"fleet-server-default type":"fleet-server"},"cluster.addr":["elasticsearch-es-http.default.svc:9200"],"cluster.maxConnsPersHost":128,"cluster.name":"elasticsearch cluster.uuid":"Pcd7DOp5QxWA7SABzLqvYg ecs.version":"1.6.0 service.name":"fleet-server cluster.version":"8.6.0 ecs.version":"1.6.0"}
info 2023-01-17T13:48:02.416Z message":"Run bulker with options component":{"binary":"fleet-server dataset":"elastic_agent.fleet_server id":"fleet-server-default type":"fleet-server"},"service.name":"fleet-server opts":{"apikeyMaxParallel":120,"apikeyMaxReqSize":104857600,"blockQueueSz":32,"flushInterval":250,"flushThresholdCnt":2048,"flushThresholdSz":1048576,"maxPending":8},"ecs.version":"1.6.0 ecs.version":"1.6.0"}
info 2023-01-17T13:48:02.419Z message":"Elasticsearch compatibility check successful component":{"binary":"fleet-server dataset":"elastic_agent.fleet_server id":"fleet-server-default type":"fleet-server"},"ecs.version":"1.6.0 service.name":"fleet-server fleet_version":"8.6.0 elasticsearch_version":"8.6.0 ecs.version":"1.6.0"}
info 2023-01-17T13:48:02.480Z message":"elasticsearch cluster info component":{"binary":"fleet-server dataset":"elastic_agent.fleet_server id":"fleet-server-default type":"fleet-server"},"cluster.maxConnsPersHost":128,"cluster.uuid":"Pcd7DOp5QxWA7SABzLqvYg cluster.version":"8.6.0 cluster.addr":["elasticsearch-es-http.default.svc:9200"],"cluster.name":"elasticsearch 2023-01-17T13:48:02.48Z ecs.version":"1.6.0 service.name":"fleet-server ecs.version":"1.6.0"}
info 2023-01-17T13:48:02.481Z message":"starting index monitor component":{"binary":"fleet-server dataset":"elastic_agent.fleet_server id":"fleet-server-default type":"fleet-server"},"ecs.version":"1.6.0 service.name":"fleet-server index":".fleet-policies ctx":"index monitor ecs.version":"1.6.0"}
info 2023-01-17T13:48:02.481Z message":"run policy monitor component":{"binary":"fleet-server dataset":"elastic_agent.fleet_server id":"fleet-server-default type":"fleet-server"},"throttle":0.5,"ecs.version":"1.6.0 service.name":"fleet-server ctx":"policy agent monitor ecs.version":"1.6.0"}
warn 2023-01-17T13:48:02.481Z message":"missing config fleet.agent.id; acceptable until Elastic Agent has enrolled component":{"binary":"fleet-server dataset":"elastic_agent.fleet_server id":"fleet-server-default type":"fleet-server"},"ecs.version":"1.6.0 service.name":"fleet-server ctx":"policy leader manager ecs.version":"1.6.0"}
info 2023-01-17T13:48:02.541Z message":"server listening component":{"binary":"fleet-server dataset":"elastic_agent.fleet_server id":"fleet-server-default type":"fleet-server"},"ecs.version":"1.6.0 service.name":"fleet-server bind":"0.0.0.0:8220 rdTimeout":60000,"wrTimeout":600000,"ecs.version":"1.6.0"}
info 2023-01-17T13:48:02.541Z message":"fleet-server creating new limiter component":{"binary":"fleet-server dataset":"elastic_agent.fleet_server id":"fleet-server-default type":"fleet-server"},"ecs.version":"1.6.0 service.name":"fleet-server addr":"0.0.0.0:8220 limits":{"AckLimit":{"Burst":100,"Interval":10000000,"Max":50,"MaxBody":2097152},"ArtifactLimit":{"Burst":25,"Interval":5000000,"Max":50,"MaxBody":0},"CheckinLimit":{"Burst":1000,"Interval":1000000,"Max":0,"MaxBody":1048576},"EnrollLimit":{"Burst":100,"Interval":10000000,"Max":50,"MaxBody":524288},"MaxAgents":0,"MaxConnections":52428800,"MaxHeaderByteSize":8192,"PolicyThrottle":500000,"StatusLimit":{"Burst":25,"Interval":5000000,"Max":50,"MaxBody":0}},"ecs.version":"1.6.0"}
info 2023-01-17T13:48:02.541Z message":"starting index monitor component":{"binary":"fleet-server dataset":"elastic_agent.fleet_server id":"fleet-server-default type":"fleet-server"},"ecs.version":"1.6.0 service.name":"fleet-server index":".fleet-actions ctx":"index monitor ecs.version":"1.6.0"}
info 2023-01-17T13:48:02.541Z message":"fleet-server route added component":{"binary":"fleet-server dataset":"elastic_agent.fleet_server id":"fleet-server-default type":"fleet-server"},"service.name":"fleet-server addr":"0.0.0.0:8220 method":"GET path":"/api/status ecs.version":"1.6.0 ecs.version":"1.6.0"}
info 2023-01-17T13:48:02.541Z message":"fleet-server route added component":{"binary":"fleet-server dataset":"elastic_agent.fleet_server id":"fleet-server-default type":"fleet-server"},"method":"POST path":"/api/fleet/agents/:id ecs.version":"1.6.0 service.name":"fleet-server addr":"0.0.0.0:8220 ecs.version":"1.6.0"}
info 2023-01-17T13:48:02.541Z message":"fleet-server route added component":{"binary":"fleet-server dataset":"elastic_agent.fleet_server id":"fleet-server-default type":"fleet-server"},"method":"POST path":"/api/fleet/agents/:id/checkin ecs.version":"1.6.0 service.name":"fleet-server addr":"0.0.0.0:8220 ecs.version":"1.6.0"}
info 2023-01-17T13:48:02.541Z message":"fleet-server route added component":{"binary":"fleet-server dataset":"elastic_agent.fleet_server id":"fleet-server-default type":"fleet-server"},"ecs.version":"1.6.0 service.name":"fleet-server addr":"0.0.0.0:8220 method":"POST path":"/api/fleet/agents/:id/acks ecs.version":"1.6.0"}
info 2023-01-17T13:48:02.541Z message":"fleet-server route added component":{"binary":"fleet-server dataset":"elastic_agent.fleet_server id":"fleet-server-default type":"fleet-server"},"ecs.version":"1.6.0 service.name":"fleet-server addr":"0.0.0.0:8220 method":"GET path":"/api/fleet/artifacts/:id/:sha2 ecs.version":"1.6.0"}
info 2023-01-17T13:48:02.541Z message":"fleet-server routes set up component":{"binary":"fleet-server dataset":"elastic_agent.fleet_server id":"fleet-server-default type":"fleet-server"},"ecs.version":"1.6.0 service.name":"fleet-server addr":"0.0.0.0:8220 ecs.version":"1.6.0"}
info 2023-01-17T13:48:02.541Z message":"server hard connection limiter installed component":{"binary":"fleet-server dataset":"elastic_agent.fleet_server id":"fleet-server-default type":"fleet-server"},"ecs.version":"1.6.0 service.name":"fleet-server hardConnLimit":52428800,"ecs.version":"1.6.0"}
info 2023-01-17T13:48:02.542Z message":"server listening component":{"binary":"fleet-server dataset":"elastic_agent.fleet_server id":"fleet-server-default type":"fleet-server"},"wrTimeout":600000,"ecs.version":"1.6.0 service.name":"fleet-server bind":"localhost:8221 rdTimeout":60000,"ecs.version":"1.6.0"}
info 2023-01-17T13:48:02.542Z message":"fleet-server creating new limiter component":{"binary":"fleet-server dataset":"elastic_agent.fleet_server id":"fleet-server-default type":"fleet-server"},"ecs.version":"1.6.0 service.name":"fleet-server addr":"localhost:8221 limits":{"AckLimit":{"Burst":100,"Interval":10000000,"Max":50,"MaxBody":2097152},"ArtifactLimit":{"Burst":25,"Interval":5000000,"Max":50,"MaxBody":0},"CheckinLimit":{"Burst":1000,"Interval":1000000,"Max":0,"MaxBody":1048576},"EnrollLimit":{"Burst":100,"Interval":10000000,"Max":50,"MaxBody":524288},"MaxAgents":0,"MaxConnections":52428800,"MaxHeaderByteSize":8192,"PolicyThrottle":500000,"StatusLimit":{"Burst":25,"Interval":5000000,"Max":50,"MaxBody":0}},"ecs.version":"1.6.0"}
info 2023-01-17T13:48:02.542Z message":"fleet-server route added component":{"binary":"fleet-server dataset":"elastic_agent.fleet_server id":"fleet-server-default type":"fleet-server"},"ecs.version":"1.6.0 service.name":"fleet-server addr":"localhost:8221 method":"GET path":"/api/status ecs.version":"1.6.0"}
info 2023-01-17T13:48:02.542Z message":"fleet-server route added component":{"binary":"fleet-server dataset":"elastic_agent.fleet_server id":"fleet-server-default type":"fleet-server"},"addr":"localhost:8221 method":"POST path":"/api/fleet/agents/:id ecs.version":"1.6.0 service.name":"fleet-server ecs.version":"1.6.0"}
info 2023-01-17T13:48:02.542Z message":"fleet-server route added component":{"binary":"fleet-server dataset":"elastic_agent.fleet_server id":"fleet-server-default type":"fleet-server"},"ecs.version":"1.6.0 service.name":"fleet-server addr":"localhost:8221 method":"POST path":"/api/fleet/agents/:id/checkin ecs.version":"1.6.0"}
info 2023-01-17T13:48:02.542Z message":"fleet-server route added component":{"binary":"fleet-server dataset":"elastic_agent.fleet_server id":"fleet-server-default type":"fleet-server"},"ecs.version":"1.6.0 service.name":"fleet-server addr":"localhost:8221 method":"POST path":"/api/fleet/agents/:id/acks ecs.version":"1.6.0"}
info 2023-01-17T13:48:02.542Z message":"fleet-server route added component":{"binary":"fleet-server dataset":"elastic_agent.fleet_server id":"fleet-server-default type":"fleet-server"},"ecs.version":"1.6.0 service.name":"fleet-server addr":"localhost:8221 method":"GET path":"/api/fleet/artifacts/:id/:sha2 ecs.version":"1.6.0"}
info 2023-01-17T13:48:02.542Z message":"fleet-server routes set up component":{"binary":"fleet-server dataset":"elastic_agent.fleet_server id":"fleet-server-default type":"fleet-server"},"ecs.version":"1.6.0 service.name":"fleet-server addr":"localhost:8221 ecs.version":"1.6.0"}
info 2023-01-17T13:48:02.542Z message":"server hard connection limiter installed component":{"binary":"fleet-server dataset":"elastic_agent.fleet_server id":"fleet-server-default type":"fleet-server"},"ecs.version":"1.6.0 service.name":"fleet-server hardConnLimit":52428800,"ecs.version":"1.6.0"}
info 2023-01-17T13:48:03.125Z message":"Running on policy with Fleet Server integration: eck-fleet-server; missing config fleet.agent.id (expected during bootstrap process) component":{"binary":"fleet-server dataset":"elastic_agent.fleet_server id":"fleet-server-default type":"fleet-server"},"ecs.version":"1.6.0 service.name":"fleet-server state":"DEGRADED ecs.version":"1.6.0"}
warn 2023-01-17T13:48:03.125Z origin":{"file.name":"coordinator/coordinator.go file.line":829},"message":"Unit state changed fleet-server-default-fleet-server (STARTING->DEGRADED): Running on policy with Fleet Server integration: eck-fleet-server; missing config fleet.agent.id (expected during bootstrap process) component":{"id":"fleet-server-default state":"HEALTHY"},"unit":{"id":"fleet-server-default-fleet-server type":"input state":"DEGRADED old_state":"STARTING"},"ecs.version":"1.6.0"}
warn 2023-01-17T13:48:03.125Z origin":{"file.name":"coordinator/coordinator.go file.line":829},"message":"Unit state changed fleet-server-default (STARTING->DEGRADED): Running on policy with Fleet Server integration: eck-fleet-server; missing config fleet.agent.id (expected during bootstrap process) component":{"id":"fleet-server-default state":"HEALTHY"},"unit":{"id":"fleet-server-default type":"output state":"DEGRADED old_state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-17T13:48:04.248Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":784},"message":"Fleet Server - Running on policy with Fleet Server integration: eck-fleet-server; missing config fleet.agent.id (expected during bootstrap process)","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-17T13:48:05.176Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":475},"message":"Starting enrollment to URL: https://fleet-server-agent-http.default.svc:8220/","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-17T13:48:05.248Z","message":"Elastic Agent successfully enrolled","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"http.request.id":"01GPZZS9S11BCSZ5X8HXMFWR7V","http.response.body.bytes":1495,"event.duration":980491236,"ecs.version":"1.6.0","fleet.enroll.apikey.id":"C3_7v4UB1L2vmVtd_qDp","fleet.agent.id":"996ab694-ef86-4f09-9ad3-47b7cf2fa1ed","fleet.policy.id":"eck-agent","fleet.access.apikey.id":"O-X8v4UBc8M4w3BNqMGN","service.name":"fleet-server","mod":"enroll","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-17T13:48:06.039Z","message":"Elastic Agent successfully enrolled","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"ecs.version":"1.6.0","http.request.id":"01GPZZSAY3PXZ55A421HCJ92T1","fleet.agent.id":"d3bd6dd0-3715-41df-ae8f-e05869c82422","fleet.policy.id":"eck-fleet-server","http.response.body.bytes":1012,"event.duration":586055123,"service.name":"fleet-server","mod":"enroll","fleet.enroll.apikey.id":"N-X7v4UBc8M4w3BN_sHx","fleet.access.apikey.id":"PeX8v4UBc8M4w3BNrMHa","ecs.version":"1.6.0"}
Successfully enrolled the Elastic Agent.
{"log.level":"info","@timestamp":"2023-01-17T13:48:06.259Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":277},"message":"Elastic Agent has been enrolled; start Elastic Agent","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-17T13:48:06.259Z","log.origin":{"file.name":"cmd/run.go","file.line":242},"message":"signal \"terminated\" received","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-17T13:48:06.259Z","log.origin":{"file.name":"cmd/run.go","file.line":254},"message":"Shutting down Elastic Agent and sending last events...","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-17T13:48:06.259Z","message":"On signal","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"ecs.version":"1.6.0","service.name":"fleet-server","sig":"terminated","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-01-17T13:48:06.341Z","message":"Stopping","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"ecs.version":"1.6.0","service.name":"fleet-server","state":"STOPPING","@timestamp":"2023-01-17T13:48:06.26Z","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-01-17T13:48:06.341Z","message":"error while closing server","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"@timestamp":"2023-01-17T13:48:06.26Z","ecs.version":"1.6.0","service.name":"fleet-server","error.message":"close tcp 127.0.0.1:8221: use of closed network connection","ecs.version":"1.6.0"}
@thbkrkr added the >bug (Something isn't working) label on Jan 17, 2023
@cmacknz (Member) commented Jan 17, 2023

There's nothing obviously wrong in the logs here. In the fleet-server logs I see:

signal "terminated" received
Shutting down Elastic Agent and sending last events...
On signal
Stopping
error while closing server

The `signal "terminated" received` message is coming from https://github.com/elastic/elastic-agent/blob/4dcea16e97f79b1d1b456af0348715f609ea76d3/internal/pkg/agent/cmd/run.go#L242 and means the fleet-server agent is shutting down because it was sent SIGTERM.

We'll need to find out where this is coming from. Is there anything in the output of `kubectl describe pod` or `kubectl get events` that can give us a hint about why the process is terminating?

@thbkrkr (Contributor, Author) commented Jan 17, 2023

I don't see anything relevant in the k8s events:

> k describe po fleet-server-agent-b878c97b7-tjrcq | grep -A10 Events:
Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  7m23s  default-scheduler  Successfully assigned tmp/fleet-server-agent-b878c97b7-tjrcq to gke-thb-dev-default-pool-b2e32065-fzrr
  Normal  Pulled     7m21s  kubelet            Container image "docker.elastic.co/beats/elastic-agent:8.6.0" already present on machine
  Normal  Created    7m21s  kubelet            Created container agent
  Normal  Started    7m21s  kubelet            Started container agent

> k describe po elastic-agent-agent-xlpgd | grep -A10 Events:
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  7m31s                  default-scheduler  Successfully assigned tmp/elastic-agent-agent-xlpgd to gke-thb-dev-default-pool-b2e32065-fzrr
  Normal   Pulled     5m30s (x5 over 7m29s)  kubelet            Container image "docker.elastic.co/beats/elastic-agent:8.6.0" already present on machine
  Normal   Created    5m30s (x5 over 7m28s)  kubelet            Created container agent
  Normal   Started    5m30s (x5 over 7m28s)  kubelet            Started container agent
  Warning  BackOff    2m23s (x20 over 7m9s)  kubelet            Back-off restarting failed container

I looked for SIGTERM and found something interesting. An agent is always stopped when performing fleet enrollment (code), and the stop is done by sending a SIGTERM, aka "terminated" (code).
That would mean this is normal behavior, but then the fleet-server should restart? It seems stuck.

@cmacknz (Member) commented Jan 17, 2023

Yes, the agent restarting on re-enroll is expected. When it does this it will first terminate the child processes it started by sending SIGTERM. So fleet-server getting SIGTERM is expected. We can see this happening in the full logs where the logs originating from fleet-server are annotated with the binary name:

{"log.level":"info","@timestamp":"2023-01-17T13:48:06.259Z","message":"On signal","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"ecs.version":"1.6.0","service.name":"fleet-server","sig":"terminated","ecs.version":"1.6.0"}

What I don't think is expected is Elastic Agent itself getting a SIGTERM. The Elastic Agent should restart by Exec-ing itself: https://github.com/elastic/elastic-agent/blob/c7852a428f6afa0fd496c210ae296f5a4b062622/internal/pkg/agent/application/reexec/reexec.go#L26

The log below I believe originates from the Elastic Agent and not the fleet-server subprocess. A terminated Elastic Agent won't restart itself which I think is what we are observing, but don't know why yet.

{"log.level":"info","@timestamp":"2023-01-17T13:48:06.259Z","log.origin":{"file.name":"cmd/run.go","file.line":242},"message":"signal \"terminated\" received","ecs.version":"1.6.0"}

The entire process here is called Fleet server bootstrapping and is described in https://github.com/elastic/elastic-agent/blob/c7852a428f6afa0fd496c210ae296f5a4b062622/docs/fleet-server-bootstrap.asciidoc#L2

I will see if I can reproduce this but it likely won't be until later tomorrow.

@thbkrkr (Contributor, Author) commented Jan 18, 2023

TestFleetCustomLogsIntegrationRecipe and TestFleetAPMIntegrationRecipe are also flaky, most likely for the same yet unknown reason.

@naemono (Contributor) commented Jan 19, 2023

@cmacknz So I'm also investigating this issue, and it appears that only the filebeat fleet-server integration is failing here.

All the agent instances "seem" healthy:

❯ kc get agent -n e2e-mercury
NAME                 HEALTH   AVAILABLE   EXPECTED   VERSION   AGE
elastic-agent-fvll   green    2           2          8.6.0     4h19m
fleet-server-fvll    green    1           1          8.6.0     4h19m

The fleet page in Kibana shows things unhealthy:

[screenshot: Kibana Fleet page showing the agents as unhealthy]

Index management only shows the metricbeat index from the fleet server agent:

[screenshot: Kibana index management showing only the metricbeat index]

Looking at the filebeat logs, there's a stack trace of some sort, along with it trying to connect to Elasticsearch over localhost?

cat ~/tmp/fleet-logs-raw-pruned.json | jq 'select(.component.binary == "filebeat") | select(."log.level" != "info")| .message'

"Filebeat is unable to load the ingest pipelines for the configured modules because the Elasticsearch output is not configured/enabled. If you have already loaded the ingest pipelines or are using Logstash pipelines, you can ignore this warning."
"Filebeat is unable to load the ingest pipelines for the configured modules because the Elasticsearch output is not configured/enabled. If you have already loaded the ingest pipelines or are using Logstash pipelines, you can ignore this warning."
"error when check request status for getting IMDSv2 token: http request status 405. No token in the metadata request will be used."
"Error dialing dial tcp 127.0.0.1:9200: connect: connection refused"
"Failed to connect to backoff(elasticsearch(http://localhost:9200)): Get \"http://localhost:9200\": dial tcp 127.0.0.1:9200: connect: connection refused"
"Error dialing dial tcp [::1]:9200: connect: cannot assign requested address"
"Failed to connect to backoff(elasticsearch(http://localhost:9200)): Get \"http://localhost:9200\": dial tcp [::1]:9200: connect: cannot assign requested address"
"Error dialing dial tcp 127.0.0.1:9200: connect: connection refused"
"panic: close of closed channel"
"goroutine 175 [running]:"
"github.com/elastic/beats/v7/libbeat/processors/add_kubernetes_metadata.(*cache).stop(...)"
"github.com/elastic/beats/v7/libbeat/processors/add_kubernetes_metadata/cache.go:97"
"github.com/elastic/beats/v7/libbeat/processors/add_kubernetes_metadata.(*kubernetesAnnotator).Close(0xc00094d3c0?)"
"github.com/elastic/beats/v7/libbeat/processors/add_kubernetes_metadata/kubernetes.go:311 +0x4f"
"github.com/elastic/beats/v7/libbeat/processors.Close(...)"
"github.com/elastic/beats/v7/libbeat/processors/processor.go:58"
"github.com/elastic/beats/v7/libbeat/publisher/processing.(*group).Close(0xc000a24c28?)"
"github.com/elastic/beats/v7/libbeat/publisher/processing/processors.go:95 +0x159"
"github.com/elastic/beats/v7/libbeat/processors.Close(...)"
"github.com/elastic/beats/v7/libbeat/processors/processor.go:58"
"github.com/elastic/beats/v7/libbeat/publisher/processing.(*group).Close(0x0?)"
"github.com/elastic/beats/v7/libbeat/publisher/processing/processors.go:95 +0x159"
"github.com/elastic/beats/v7/libbeat/processors.Close(...)"
"github.com/elastic/beats/v7/libbeat/processors/processor.go:58"
"github.com/elastic/beats/v7/libbeat/publisher/pipeline.(*client).Close.func1()"
"github.com/elastic/beats/v7/libbeat/publisher/pipeline/client.go:167 +0x2df"
"sync.(*Once).doSlow(0x55e0bdcdf090?, 0x0?)"
"sync/once.go:68 +0xc2"
"sync.(*Once).Do(...)"
"sync/once.go:59"
"github.com/elastic/beats/v7/libbeat/publisher/pipeline.(*client).Close(0xc000567ce0?)"
"github.com/elastic/beats/v7/libbeat/publisher/pipeline/client.go:148 +0x59"
"github.com/elastic/beats/v7/libbeat/publisher/pipeline.(*Pipeline).runSignalPropagation(0xc000779fb8?)"
"github.com/elastic/beats/v7/libbeat/publisher/pipeline/pipeline.go:363 +0x226"
"created by github.com/elastic/beats/v7/libbeat/publisher/pipeline.(*Pipeline).registerSignalPropagation.func1"
"github.com/elastic/beats/v7/libbeat/publisher/pipeline/pipeline.go:315 +0x96"
"Filebeat is unable to load the ingest pipelines for the configured modules because the Elasticsearch output is not configured/enabled. If you have already loaded the ingest pipelines or are using Logstash pipelines, you can ignore this warning."
"Filebeat is unable to load the ingest pipelines for the configured modules because the Elasticsearch output is not configured/enabled. If you have already loaded the ingest pipelines or are using Logstash pipelines, you can ignore this warning."

All of the agent logs show this:

  "log.level": "error",
{
  "log.level": "error",
  "@timestamp": "2023-01-19T20:30:22.247Z",
  "message": "failed to publish events: temporary bulk send failure",
  "component": {
    "binary": "filebeat",
    "dataset": "elastic_agent.filebeat",
    "id": "filestream-monitoring",
    "type": "filestream"
  },
  "ecs.version": "1.6.0",
  "log.logger": "publisher_pipeline_output",
  "log.origin": {
    "file.line": 176,
    "file.name": "pipeline/client_worker.go"
  },
  "service.name": "filebeat"
}

Can you look at the live agent configuration, specifically for filebeat?

@naemono (Contributor) commented Jan 19, 2023

I initially thought, after I submitted this, that maybe I'm seeing a different error than @thbkrkr, since I'm not seeing the CrashLoopBackOff error, but I think I'm seeing the issue a bit later in the process, after things have "stabilized". The state I am seeing would exhibit itself with the same e2e failure as the very first comment in this issue.

Reason: no such index [logs-elastic_agent.fleet_server-default]

If it helps, you can look at this live in this cluster @cmacknz:

gcloud container clusters get-credentials mmontgomery-testcluster01 --zone us-east1-c --project elastic-cloud-dev

All stack components are in namespace e2e-mercury.

@cmacknz (Member) commented Jan 19, 2023

Thanks for the additional details. I ran out of time to look at this today (got a Sev-1 SDH instead) but I haven't forgotten about it.

What I really need is the archive generated by `elastic-agent diagnostics collect` on one of the pods that is having issues. @naemono if it's easy for you to grab that by exec-ing into one of your pods, I can look later today. If you don't have time for this, no problem, I can connect in or set this up myself later.

The panic you are observing only happens on shutdown and is captured in elastic/beats#34219 already. It isn't causing the problem here.

@naemono (Contributor) commented Jan 19, 2023

@cmacknz thanks for the update. I absolutely have time to grab those for you. I'll have those attached to this in about 30 minutes

@naemono (Contributor) commented Jan 19, 2023

❯ kc exec -it elastic-agent-fvll-agent-7n6pf -n e2e-mercury /bin/sh
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
# bash
root@gke-mmontgomery-testcluster01-pool-1-9b30cf4a-xal5:/usr/share/elastic-agent# elastic-agent diagnostics collect
Error: failed to fetch component/unit diagnostics: failed to retrieve unit diagnostics: rpc error: code = Unavailable desc = error reading from server: EOF
For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.6/fleet-troubleshooting.html
command terminated with exit code 137

@cmacknz the troubleshooting guide isn't getting me anywhere in solving this one...

@cmacknz (Member) commented Jan 20, 2023

One of the processes managed by agent didn't respond, which is unfortunate especially if it didn't write out the diagnostics for the agent itself.

There is at least one bug we know of in 8.6.0 that will cause Filebeat to be unable to write events basically at random: elastic/elastic-agent#2086. You will see `output unit has no config` in the agent logs if this is the case.

That could be in play here but I can't confirm it without a closer look at failing cluster. I'll have to look at this on Monday.

@naemono (Contributor) commented Jan 20, 2023

I can almost guarantee you I saw `output unit has no config` in the logs today. Let's pick this up on Monday.

@naemono (Contributor) commented Jan 23, 2023

@cmacknz some other testing I was doing in this cluster wiped out this failed setup. I'm working to get you another failure you can analyze now. Will update when done.

@naemono (Contributor) commented Jan 23, 2023

And here's the `output unit has no config` error:

❯ kc logs -n e2e-mercury fleet-server-ghjh-agent-689d9c5cc-fzcb9 | grep "has no config"
{"log.level":"error","@timestamp":"2023-01-23T15:08:49.944Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":833},"message":"Unit state changed filestream-monitoring-filestream-monitoring-agent (STARTING->FAILED): [output unit has no config]","component":{"id":"filestream-monitoring","state":"HEALTHY"},"unit":{"id":"filestream-monitoring-filestream-monitoring-agent","type":"input","state":"FAILED","old_state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-01-23T15:08:49.944Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":833},"message":"Unit state changed filestream-monitoring (STARTING->FAILED): [output unit has no config]","component":{"id":"filestream-monitoring","state":"HEALTHY"},"unit":{"id":"filestream-monitoring","type":"output","state":"FAILED","old_state":"STARTING"},"ecs.version":"1.6.0"}

namespace e2e-mercury, in same cluster as referenced before @cmacknz

@cmacknz (Member) commented Jan 23, 2023

Thanks. The `output unit has no config` issue should be fixed by elastic/elastic-agent#2151 in both main and 8.6; it will be in 8.6.1.

I would be very curious to see if the problems are solved with an agent build that includes that fix.

@naemono (Contributor) commented Jan 23, 2023

@cmacknz thanks for the update. Would this fix be included in the 8.7.0-SNAPSHOT build? If so, I believe we have an easy path to test to see if it's resolved fully for our team.

@cmacknz (Member) commented Jan 23, 2023

Yes, it will be once it is published.

@naemono (Contributor) commented Jan 24, 2023

The below worked for me today, since I saw that the 8.7.0-SNAPSHOT build had completed. (The 8.6.x-SNAPSHOT build had failed for multiple days.)

make e2e-local TESTS_MATCH=TestFleetKubernetesIntegrationRecipe E2E_TEST_ENV_TAGS=agent E2E_STACK_VERSION=8.7.0-SNAPSHOT

@thbkrkr (Contributor, Author) commented Jan 31, 2023

TestFleetKubernetesIntegrationRecipe is also failing in 8.6.1.
Note that `output unit has no config` does not appear in the logs.

All agents failed to connect to the fleet server with:

fail to checkin to fleet-server: all hosts failed: 1 error occurred:
  * requester 0/1 to host https://fleet-server-jjxg-agent-http.e2e-qrhr2-mercury.svc:8220/ errored: 
    Post \"https://fleet-server-jjxg-agent-http.e2e-qrhr2-mercury.svc:8220/api/fleet/agents/c5a8c172-4523-42bf-b2b6-cbdd3d9be5b9/checkin?\": 
    dial tcp 10.61.204.178:8220: connect: connection refused

elastic-agent-pod/logs.txt:

{"log.level":"info","@timestamp":"2023-01-31T02:00:32.929Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":145},"message":"Fleet gateway started","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2023-01-31T02:00:33.243Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":190},"message":"Possible transient error during checkin with fleet-server, retrying","log":{"source":"elastic-agent"},"error":{"message":"fail to checkin to fleet-server: all hosts failed: 1 error occurred:\n\t* requester 0/1 to host https://fleet-server-jjxg-agent-http.e2e-qrhr2-mercury.svc:8220/ errored: Post \"https://fleet-server-jjxg-agent-http.e2e-qrhr2-mercury.svc:8220/api/fleet/agents/0919c133-7e19-4799-816b-c0fc373a5a0a/checkin?\": dial tcp 10.61.204.178:8220: connect: connection refused\n\n"},"request_duration_ns":2268510,"failed_checkins":1,"retry_after_ns":81354918331,"ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2023-01-31T02:01:54.820Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":190},"message":"Possible transient error during checkin with fleet-server, retrying","log":{"source":"elastic-agent"},"error":{"message":"fail to checkin to fleet-server: all hosts failed: 1 error occurred:\n\t* requester 0/1 to host https://fleet-server-jjxg-agent-http.e2e-qrhr2-mercury.svc:8220/ errored: Post \"https://fleet-server-jjxg-agent-http.e2e-qrhr2-mercury.svc:8220/api/fleet/agents/0919c133-7e19-4799-816b-c0fc373a5a0a/checkin?\": dial tcp 10.61.204.178:8220: connect: connection refused\n\n"},"request_duration_ns":3948092,"failed_checkins":2,"retry_after_ns":187001170945,"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-01-31T02:05:02.045Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":194},"message":"Cannot checkin in with fleet-server, retrying","log":{"source":"elastic-agent"},"error":{"message":"fail to checkin to fleet-server: all hosts failed: 1 error occurred:\n\t* requester 0/1 to host https://fleet-server-jjxg-agent-http.e2e-qrhr2-mercury.svc:8220/ errored: Post \"https://fleet-server-jjxg-agent-http.e2e-qrhr2-mercury.svc:8220/api/fleet/agents/0919c133-7e19-4799-816b-c0fc373a5a0a/checkin?\": dial tcp 10.61.204.178:8220: connect: connection refused\n\n"},"request_duration_ns":5615808,"failed_checkins":3,"retry_after_ns":462027578619,"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-01-31T02:12:44.295Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":194},"message":"Cannot checkin in with fleet-server, retrying","log":{"source":"elastic-agent"},"error":{"message":"fail to checkin to fleet-server: all hosts failed: 1 error occurred:\n\t* requester 0/1 to host https://fleet-server-jjxg-agent-http.e2e-qrhr2-mercury.svc:8220/ errored: Post \"https://fleet-server-jjxg-agent-http.e2e-qrhr2-mercury.svc:8220/api/fleet/agents/0919c133-7e19-4799-816b-c0fc373a5a0a/checkin?\": dial tcp 10.61.204.178:8220: connect: connection refused\n\n"},"request_duration_ns":4790732,"failed_checkins":4,"retry_after_ns":551932525972,"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-01-31T02:21:56.451Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":194},"message":"Cannot checkin in with fleet-server, retrying","log":{"source":"elastic-agent"},"error":{"message":"fail to checkin to fleet-server: all hosts failed: 1 error occurred:\n\t* requester 0/1 to host https://fleet-server-jjxg-agent-http.e2e-qrhr2-mercury.svc:8220/ errored: Post \"https://fleet-server-jjxg-agent-http.e2e-qrhr2-mercury.svc:8220/api/fleet/agents/0919c133-7e19-4799-816b-c0fc373a5a0a/checkin?\": dial tcp 10.61.204.178:8220: connect: connection refused\n\n"},"request_duration_ns":4894946,"failed_checkins":5,"retry_after_ns":468702515596,"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-01-31T02:29:45.374Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":194},"message":"Cannot checkin in with fleet-server, retrying","log":{"source":"elastic-agent"},"error":{"message":"fail to checkin to fleet-server: all hosts failed: 1 error occurred:\n\t* requester 0/1 to host https://fleet-server-jjxg-agent-http.e2e-qrhr2-mercury.svc:8220/ errored: Post \"https://fleet-server-jjxg-agent-http.e2e-qrhr2-mercury.svc:8220/api/fleet/agents/0919c133-7e19-4799-816b-c0fc373a5a0a/checkin?\": dial tcp 10.61.204.178:8220: connect: connection refused\n\n"},"request_duration_ns":4828830,"failed_checkins":6,"retry_after_ns":623778092728,"ecs.version":"1.6.0"}

fleet-server seems to be listening on 0.0.0.0:8220:

{
    "log.level": "info",
    "@timestamp": "2023-01-31T02:00:25.738Z",
    "message": "server listening",
    "component": {
        "binary": "fleet-server",
        "dataset": "elastic_agent.fleet_server",
        "id": "fleet-server-default",
        "type": "fleet-server"
    },
    "log": {
        "source": "fleet-server-default"
    },
    "ecs.version": "1.6.0",
    "service.name": "fleet-server",
    "bind": "0.0.0.0:8220",
    "rdTimeout": 60000,
    "wrTimeout": 600000
}

It looks like the same issue as in the initial comment of this issue, where after enrolling the agents, the fleet-server seems stopped.

Looking at the last log lines of fleet-server:

> cat e2e-qrhr2-mercury/pod/fleet-server-jjxg-agent-c4b9887f5-jcprj/logs.txt | grep '^{' | jq .message | tail -11
"Fleet Server - Running on policy with Fleet Server integration: eck-fleet-server; missing config fleet.agent.id (expected during bootstrap process)"
"Starting enrollment to URL: https://fleet-server-jjxg-agent-http.e2e-qrhr2-mercury.svc:8220/"
"Elastic Agent successfully enrolled"
"Elastic Agent successfully enrolled"
"Elastic Agent successfully enrolled"
"Elastic Agent successfully enrolled"
"Elastic Agent has been enrolled; start Elastic Agent"
"signal \"terminated\" received"
"Shutting down Elastic Agent and sending last events..."
"On signal"
"Stopping"

@cmacknz (Member) commented Jan 31, 2023

@michel-laterman any ideas on what might be going wrong with fleet-server here?

Based on the explicit log from fleet-server that it is binding to 0.0.0.0, it is unlikely to be caused by elastic/elastic-agent#2197.

I did try bringing up the agent in ECK locally and it came up green the one time I tried it.

@michel-laterman commented

Strange, all logs indicate that 0.0.0.0 is being used. When the fleet-server is running, can something curl the fleet-server's status endpoint (GET /api/status)?

@thbkrkr (Contributor, Author) commented Feb 1, 2023

This happened 3 times during last night's e2e tests on other ^TestFleet* tests:

gke  1.22        / stack-7.17.8         / TestFleetMode/ES_data_should_pass_validations
kind 1.24.4 ipv6 / stack-8.6.1          / TestFleetMode/ES_data_should_pass_validations
eks  1.20   arm  / stack-8.6.1          / TestFleetAPMIntegrationRecipe/ES_data_should_pass_validations

These fail for the same reason in 8.6.1: fleet-server enrolled the agents, then terminated, and the agents can't connect to it anymore.
However, it is the first time, and very strange, that TestFleetMode also failed in 7.17.8. The logs are different, so I'm not sure it's the same cause.

########
# gke  1.22        / stack-7.17.8         / TestFleetMode/ES_data_should_pass_validations
########

## agent
2023-02-01T02:12:24.914Z	INFO	cmd/enroll_cmd.go:472	Retrying enrollment to URL: https://test-agent-fleet-fs-68ww-agent-http.e2e-ge72n-mercury.svc:8220/
2023-02-01T02:12:25.019Z	WARN	cmd/enroll_cmd.go:465	Remote server is not ready to accept connections, will retry in a moment.

## fleet-server
2023-02-01T01:48:58.789Z	ERROR	fleet/fleet_gateway.go:205	Could not communicate with fleet-server Checking API will retry, error: fail to checkin to fleet-server: Post "https://test-agent-fleet-fs-68ww-agent-http.e2e-ge72n-mercury.svc:8220/api/fleet/agents/41886ec7-4dd1-4653-ad26-c08be936c334/checkin?": dial tcp 10.54.31.39:8220: connect: connection refused
2023-02-01T01:49:43.350Z	ERROR	status/reporter.go:236	Elastic Agent status changed to: 'error'
2023-02-01T01:49:43.350Z	ERROR	log/reporter.go:36	2023-02-01T01:49:43Z - message: Application: fleet-server--7.17.8[41886ec7-4dd1-4653-ad26-c08be936c334]: State changed to FAILED: Missed two check-ins - type: 'ERROR' - sub_type: 'FAILED'

------------------------------------
########
# kind 1.24.4 ipv6 / stack-8.6.1          / TestFleetMode/ES_data_should_pass_validations
# eks  1.20   arm  / stack-8.6.1          / TestFleetAPMIntegrationRecipe/ES_data_should_pass_validations
########

## agent
{"log.level":"error","@timestamp":"2023-02-01T01:52:49.604Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":194},"message":"Cannot checkin in with fleet-server, retrying","log":{"source":"elastic-agent"},"error":{"message":"fail to checkin to fleet-server: all hosts failed: 1 error occurred:\n\t* requester 0/1 to host https://test-agent-fleet-fs-bjbn-agent-http.e2e-6uqnd-mercury.svc:8220/ errored: Post \"https://test-agent-fleet-fs-bjbn-agent-http.e2e-6uqnd-mercury.svc:8220/api/fleet/agents/3a5c1af1-25cc-4b23-9c13-bec97aa89f10/checkin?\": dial tcp 10.96.224.69:8220: connect: connection refused\n\n"},"request_duration_ns":2132372,"failed_checkins":5,"retry_after_ns":533049496714,"ecs.version":"1.6.0"}

## fleet-server
Successfully enrolled the Elastic Agent.
{"log.level":"info","@timestamp":"2023-02-01T01:29:31.442Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":277},"message":"Elastic Agent has been enrolled; start Elastic Agent","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-02-01T01:29:31.442Z","log.origin":{"file.name":"cmd/run.go","file.line":246},"message":"signal \"terminated\" received","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-02-01T01:29:31.442Z","log.origin":{"file.name":"cmd/run.go","file.line":258},"message":"Shutting down Elastic Agent and sending last events...","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-02-01T01:29:31.442Z","message":"On signal","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","sig":"terminated","ecs.version":"1.6.0"}

@naemono (Contributor) commented Feb 1, 2023

On the fleet-server Agent instance:

root@test-agent-fleet-fs-pfk9-agent-566f5c5d47-7sjrs:/usr/share/elastic-agent# netstat -anp | grep LISTEN
tcp        0      0 127.0.0.1:6791          0.0.0.0:*               LISTEN      920/elastic-agent
unix  2      [ ACC ]     STREAM     LISTENING     14713074 920/elastic-agent    /usr/share/elastic-agent/state/data/tmp/elastic-agent-control.sock

root@test-agent-fleet-fs-pfk9-agent-566f5c5d47-7sjrs:/usr/share/elastic-agent# curl localhost:6791/api/status
404 page not found

Last 10 lines of output

❯ kc logs test-agent-fleet-fs-pfk9-agent-566f5c5d47-7sjrs -n e2e-mercury --tail 10
{"log.level":"info","@timestamp":"2023-02-01T21:21:02.347Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":784},"message":"Fleet Server - Running on policy with Fleet Server integration: eck-fleet-server; missing config fleet.agent.id (expected during bootstrap process)","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-02-01T21:21:02.680Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":475},"message":"Starting enrollment to URL: https://test-agent-fleet-fs-pfk9-agent-http.e2e-mercury.svc:8220/","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-02-01T21:21:04.854Z","message":"Elastic Agent successfully enrolled","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"service.name":"fleet-server","mod":"enroll","fleet.access.apikey.id":"cNbaDoYB-HaK6tJWw_tp","event.duration":1916806016,"http.response.body.bytes":1034,"ecs.version":"1.6.0","http.request.id":"01GR7DNGDFVTGE1XBY4C16EAZ9","fleet.enroll.apikey.id":"FIPZDoYBlCDrEHQWs-C7","fleet.agent.id":"5a560502-b69b-47c8-a388-b5eb409b66d2","fleet.policy.id":"eck-fleet-server","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-02-01T21:21:04.854Z","message":"Elastic Agent successfully enrolled","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"service.name":"fleet-server","fleet.enroll.apikey.id":"zPzZDoYBCDPWL9tws3-m","fleet.agent.id":"1d4deaad-d456-49ea-83a5-a09d0e00d7a2","http.response.body.bytes":1004,"event.duration":1729025551,"ecs.version":"1.6.0","http.request.id":"01GR7DNGM1PY1RK0J3EZ4FQKXV","mod":"enroll","fleet.policy.id":"eck-agent","fleet.access.apikey.id":"FoPaDoYBlCDrEHQWw-BO","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-02-01T21:21:04.900Z","message":"Elastic Agent successfully enrolled","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"fleet.policy.id":"eck-agent","fleet.access.apikey.id":"GIPaDoYBlCDrEHQWx-Bp","event.duration":541886320,"service.name":"fleet-server","http.request.id":"01GR7DNHTF8459BSEWCFNSN7HA","mod":"enroll","fleet.enroll.apikey.id":"zPzZDoYBCDPWL9tws3-m","ecs.version":"1.6.0","fleet.agent.id":"929e8530-c3e7-4700-8b3f-236ede5e29fd","http.response.body.bytes":1005,"@timestamp":"2023-02-01T21:21:04.9Z","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-02-01T21:21:05.073Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":277},"message":"Elastic Agent has been enrolled; start Elastic Agent","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-02-01T21:21:05.074Z","log.origin":{"file.name":"cmd/run.go","file.line":246},"message":"signal \"terminated\" received","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
Successfully enrolled the Elastic Agent.
{"log.level":"info","@timestamp":"2023-02-01T21:21:05.074Z","log.origin":{"file.name":"cmd/run.go","file.line":258},"message":"Shutting down Elastic Agent and sending last events...","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-02-01T21:21:05.076Z","message":"On signal","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","sig":"terminated","ecs.version":"1.6.0"}

That agent never dies and just sits in this state, @cmacknz @michel-laterman.

@cmacknz (Member) commented Feb 1, 2023

That is odd, it is almost like it is deadlocked in the signal handler.

https://github.com/elastic/fleet-server/blob/322c2590c3a37ae14349697a9afcbe4ed8bb42b2/internal/pkg/signal/signal.go#L25-L38

I think it would probably be useful to get a pprof goroutine dump from /debug/goroutine. You'd have to enable the profiler in the fleet server configuration if you have access to that:

profiler:
  enabled: true
  bind: localhost:6060

thbkrkr added a commit to thbkrkr/cloud-on-k8s that referenced this issue Feb 2, 2023
@cmacknz (Member) commented Feb 2, 2023

Bug is tracked in elastic/elastic-agent-autodiscover#41

I believe I know what is happening now. The fix is straightforward but will touch a lot of code, so it might take a few days.

@thbkrkr changed the title from "TestFleetKubernetesIntegrationRecipe fails on 8.6.0" to "TestFleet* fail on 8.6.x" on Mar 7, 2023
@cmacknz (Member) commented Mar 21, 2023

We believe we've fixed this with elastic/elastic-agent#2352

Note that it isn't available in the latest 8.8.0-SNAPSHOT container yet; it should be available in the next successful build.

@thbkrkr (Contributor, Author) commented Mar 29, 2023

@cmacknz Everything looks good now in 8.7.0-SNAPSHOT and 8.8.0-SNAPSHOT :)

thbkrkr added a commit that referenced this issue Jun 27, 2023
Move version checks to agent builder to skip tests due to the following stack bugs:
- Kibana bug "index conflict on install policy" (elastic/kibana#126611)
- Elastic agent bug "deadlock on startup" (#6331 (comment))