Metricbeat panics when getting the cluster UUID #34384

Closed
belimawr opened this issue Jan 25, 2023 · 5 comments
Labels: bug, Team:Elastic-Agent-Data-Plane, Team:Infra Monitoring UI - DEPRECATED, v8.6.0

Comments

@belimawr
Contributor

belimawr commented Jan 25, 2023

  • Version: main, 8.6
  • Operating System: Linux

Metricbeat sometimes panics during startup when calling this function:

func (m *MetricSet) getClusterUUID() (string, error) {
	state, err := beat.GetState(m.MetricSet)
	if err != nil {
		return "", errors.Wrap(err, "could not get state information")
	}

	clusterUUID := state.Monitoring.ClusterUUID
	if clusterUUID != "" {
		return clusterUUID, nil
	}

	if state.Output.Name != "elasticsearch" {
		return "", nil
	}

	clusterUUID = state.Outputs.Elasticsearch.ClusterUUID
	if clusterUUID == "" {
		// Output is ES but cluster UUID could not be determined. No point sending monitoring
		// data with empty cluster UUID since it will not be associated with the correct ES
		// production cluster. Log error instead.
		return "", beat.ErrClusterUUID
	}

	return clusterUUID, nil
}

Here is an example of the log generated:

{"log.level":"error","@timestamp":"2023-01-25T09:55:37.880Z","message":"recovered from panic while fetching 'beat/stats' for host 'unix'. Recovering, but please report this.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"beat/metrics-monitoring","type":"beat/metrics"},"stack":"github.com/elastic/elastic-agent-libs/logp.Recover\n\t/go/pkg/mod/github.com/elastic/elastic-agent-libs@v0.2.16/logp/global.go:102\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:844\nruntime.panicmem\n\t/usr/local/go/src/runtime/panic.go:220\nruntime.sigpanic\n\t/usr/local/go/src/runtime/signal_unix.go:818\ngithub.com/elastic/beats/v7/metricbeat/module/beat/stats.(*MetricSet).getClusterUUID\n\t/go/src/github.com/elastic/beats/metricbeat/module/beat/stats/stats.go:85\ngithub.com/elastic/beats/v7/metricbeat/module/beat/stats.(*MetricSet).Fetch\n\t/go/src/github.com/elastic/beats/metricbeat/module/beat/stats/stats.go:71\ngithub.com/elastic/beats/v7/metricbeat/mb/module.(*metricSetWrapper).fetch\n\t/go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:253\ngithub.com/elastic/beats/v7/metricbeat/mb/module.(*metricSetWrapper).startPeriodicFetching\n\t/go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:225\ngithub.com/elastic/beats/v7/metricbeat/mb/module.(*metricSetWrapper).run\n\t/go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:209\ngithub.com/elastic/beats/v7/metricbeat/mb/module.(*Wrapper).Start.func1\n\t/go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:149","ecs.version":"1.6.0","log.origin":{"file.line":220,"file.name":"runtime/panic.go"},"service.name":"metricbeat","error":{"message":"runtime error: invalid memory address or nil pointer dereference"},"ecs.version":"1.6.0"}

Pretty printed stack trace

github.com/elastic/elastic-agent-libs/logp.Recover
        /go/pkg/mod/github.com/elastic/elastic-agent-libs@v0.2.16/logp/global.go:102
runtime.gopanic
        /usr/local/go/src/runtime/panic.go:844
runtime.panicmem
        /usr/local/go/src/runtime/panic.go:220
runtime.sigpanic
        /usr/local/go/src/runtime/signal_unix.go:818
github.com/elastic/beats/v7/metricbeat/module/beat/stats.(*MetricSet).getClusterUUID
        /go/src/github.com/elastic/beats/metricbeat/module/beat/stats/stats.go:85
github.com/elastic/beats/v7/metricbeat/module/beat/stats.(*MetricSet).Fetch
        /go/src/github.com/elastic/beats/metricbeat/module/beat/stats/stats.go:71
github.com/elastic/beats/v7/metricbeat/mb/module.(*metricSetWrapper).fetch
        /go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:253
github.com/elastic/beats/v7/metricbeat/mb/module.(*metricSetWrapper).startPeriodicFetching
        /go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:225
github.com/elastic/beats/v7/metricbeat/mb/module.(*metricSetWrapper).run
        /go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:209
github.com/elastic/beats/v7/metricbeat/mb/module.(*Wrapper).Start.func1
        /go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:149

It seems that state.Monitoring on:

clusterUUID := state.Monitoring.ClusterUUID

is nil, probably due to the order in which things are initialised when running under Elastic Agent.
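To illustrate the kind of guard that would turn this panic into an ordinary error, here is a minimal sketch. The `State` and `Monitoring` types, the `errStateUnavailable` error, and the `clusterUUIDFromState` helper are all hypothetical stand-ins invented for this example; they are not the real types in the beat module, and this is not necessarily how the actual code is or should be written.

```go
package main

import (
	"errors"
	"fmt"
)

// Monitoring and State are hypothetical, trimmed-down stand-ins for the
// structures returned by the state lookup; the real types may differ.
type Monitoring struct {
	ClusterUUID string
}

type State struct {
	Monitoring *Monitoring
}

// errStateUnavailable is a hypothetical sentinel error for this sketch.
var errStateUnavailable = errors.New("beat state not available yet")

// clusterUUIDFromState checks every pointer before dereferencing it, so a
// missing state is reported as an error instead of causing a panic.
func clusterUUIDFromState(state *State) (string, error) {
	if state == nil || state.Monitoring == nil {
		return "", errStateUnavailable
	}
	return state.Monitoring.ClusterUUID, nil
}

func main() {
	// A nil state is the situation suspected in this report.
	_, err := clusterUUIDFromState(nil)
	fmt.Println("nil state:", err)

	uuid, err := clusterUUIDFromState(&State{Monitoring: &Monitoring{ClusterUUID: "abc"}})
	fmt.Println("populated state:", uuid, err)
}
```

With a guard like this, the caller could log the error and retry on the next fetch period rather than relying on the panic recovery shown in the log above.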

Steps to Reproduce

This issue does not happen every time; it seems to be a race condition, so one might have to try a few times before reproducing it.

I have been able to reproduce it consistently on Linux.

  1. Use elastic-package to bring up the stack: elastic-package stack up -v --version=8.7.0-SNAPSHOT -d
  2. Watch the logs for elastic-package-stack-fleet-server-1; most of the time it will panic.
    You can tail the docker logs, once the container is created, with:
    docker logs -f elastic-package-stack-fleet-server-1 2>&1 | grep panic
    
@belimawr added the bug and Team:Elastic-Agent-Data-Plane labels Jan 25, 2023
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@klacabane
Contributor

klacabane commented Feb 2, 2023

Looking at the function flow, it is difficult to see how we could end up with a nil reference here. I discussed this with @belimawr, and one hypothesis is that we pass a pointer to a pointer to the json.Unmarshal call, which is unusual and may trigger an edge case.

I'll try to reproduce it and see if there is a specific state that causes this error.

@klacabane
Contributor

klacabane commented Feb 2, 2023

Playing around with json.Unmarshal, I was able to reach a state where both the returned error and the passed pointer are nil after the operation. This happens when the bytes are the literal null, which is valid JSON (playground link):

state := &State{}
err := json.Unmarshal([]byte("null"), &state)
fmt.Printf("err: %v, state: %v", err, state)

--> err: <nil>, state: <nil>

When passing only a pointer (not a pointer to a pointer), the state remains the initialized struct it was before the operation rather than becoming a nil pointer, so the hypothesis holds. This would mean that the Beats /state API, or the Go code processing the response, could return null, and I'll have to verify that. Note that a similar unmarshal call (for the / API) happens right before the offending one and does not seem to trigger the nil pointer, so it is most likely the API that returns null.
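For anyone who wants to check the two behaviours side by side, here is a self-contained sketch. The `State` struct and its JSON field names are assumptions made up for this example, not the real type decoded from the /state API; the two helper functions are likewise hypothetical and differ only in what they hand to json.Unmarshal.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// State is a hypothetical stand-in for the struct decoded from the
// Beat's /state API; the real struct has more fields.
type State struct {
	Monitoring struct {
		ClusterUUID string `json:"cluster_uuid"`
	} `json:"monitoring"`
}

// decodeIntoPointerToPointer hands the address of a *State to
// json.Unmarshal, which lets a JSON null reset the pointer to nil.
func decodeIntoPointerToPointer(body []byte) (*State, error) {
	state := &State{}
	err := json.Unmarshal(body, &state)
	return state, err
}

// decodeIntoPointer hands the *State itself to json.Unmarshal, so a
// JSON null leaves the already allocated struct untouched.
func decodeIntoPointer(body []byte) (*State, error) {
	state := &State{}
	err := json.Unmarshal(body, state)
	return state, err
}

func main() {
	body := []byte("null")

	s1, err1 := decodeIntoPointerToPointer(body)
	fmt.Printf("pointer to pointer: err=%v, state is nil=%t\n", err1, s1 == nil)

	s2, err2 := decodeIntoPointer(body)
	fmt.Printf("pointer:            err=%v, state is nil=%t\n", err2, s2 == nil)
}
```

Both calls return a nil error for a null body, so the error check alone cannot catch this case; only the first variant ends up with a nil state, which matches the output quoted above.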

@cmacknz
Member

cmacknz commented Feb 6, 2023

One thing to keep in mind here is that it is possible this was always happening and we just never noticed it because the Metricbeat log files prior to 8.6 were kept separate from the main Elastic Agent log files.

In 8.6 we merged all of the log files into one, which makes problems like this much more obvious. We likely would have only noticed this before if someone were specifically reading the agent monitoring Metricbeat logs on a regular basis, which I don't think was the case. This is the same reason nobody noticed, prior to 8.6, the regularly logged errors saying that Metricbeat couldn't obtain a cluster UUID.

@AndersonQ
Member

It has already been fixed by #34480.
As we don't have any other 8.6.x release planned and the fix is backported to 8.7, I believe we can close this issue.

@jlind23 jlind23 closed this as completed Mar 21, 2023