Catch and log panics #6199

Merged (1 commit) on Jan 30, 2018
Conversation

Member

@andrewkroh andrewkroh commented Jan 26, 2018

When a Beat exits due to a panic, the stack trace is written to stderr. For Beats running under service managers where stdout/stderr are discarded, this makes a panic difficult to debug (or even to detect).

This commit modifies the beat runner to catch (i.e. `recover`) a panic on the main goroutine, log it with logp, flush the logger, and then exit. This ensures that the panic is present in the log output.

Under normal circumstances (no error) the final log message from every Beat will be `<beatname> stopped.`. When there is a panic, a `level=fatal` log line is written that contains the panic message and a stack trace. For example:

```
{
  "level": "info",
  "timestamp": "2018-01-26T13:36:59.708-0500",
  "caller": "runtime/asm_amd64.s:510",
  "message": "auditbeat stopped."
}
{
  "level": "fatal",
  "timestamp": "2018-01-26T13:36:59.709-0500",
  "logger": "auditbeat",
  "caller": "instance/beat.go:134",
  "message": "Failed due to panic.",
  "panic": "Oh no, big problem!",
  "stack": "github.com/elastic/beats/libbeat/cmd/instance.Run.func1.1
	/Users/akroh/go/src/github.com/elastic/beats/libbeat/cmd/instance/beat.go:134
runtime.call32
	/Users/akroh/.gvm/versions/go1.9.2.darwin.amd64/src/runtime/asm_amd64.s:509
runtime.gopanic
	/Users/akroh/.gvm/versions/go1.9.2.darwin.amd64/src/runtime/panic.go:491
github.com/elastic/beats/libbeat/cmd/instance.(*Beat).launch
	/Users/akroh/go/src/github.com/elastic/beats/libbeat/cmd/instance/beat.go:315
github.com/elastic/beats/libbeat/cmd/instance.Run.func1
	/Users/akroh/go/src/github.com/elastic/beats/libbeat/cmd/instance/beat.go:142
github.com/elastic/beats/libbeat/cmd/instance.Run
	/Users/akroh/go/src/github.com/elastic/beats/libbeat/cmd/instance/beat.go:143
github.com/elastic/beats/libbeat/cmd.genRunCmd.func1
	/Users/akroh/go/src/github.com/elastic/beats/libbeat/cmd/run.go:19
github.com/elastic/beats/vendor/github.com/spf13/cobra.(*Command).execute
	/Users/akroh/go/src/github.com/elastic/beats/vendor/github.com/spf13/cobra/command.go:704
github.com/elastic/beats/vendor/github.com/spf13/cobra.(*Command).ExecuteC
	/Users/akroh/go/src/github.com/elastic/beats/vendor/github.com/spf13/cobra/command.go:785
github.com/elastic/beats/vendor/github.com/spf13/cobra.(*Command).Execute
	/Users/akroh/go/src/github.com/elastic/beats/vendor/github.com/spf13/cobra/command.go:738
main.main
	/Users/akroh/go/src/github.com/elastic/beats/auditbeat/main.go:14
runtime.main
	/Users/akroh/.gvm/versions/go1.9.2.darwin.amd64/src/runtime/proc.go:195"
}
```
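The recover-and-log flow described above can be sketched roughly like this. Note that the function names (`run`, `beatMain`) and the plain-text log format are illustrative only; the real implementation uses the structured logp logger and lives in `instance/beat.go`:

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// run wraps the Beat's main function, recovers any panic on this goroutine,
// logs the panic value and stack trace (fmt stands in for logp here), and
// reports an error so the caller can exit with status 1 instead of the
// Go runtime's default status 2 for an unrecovered panic.
func run(beatMain func() error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			// Logged rather than left on stderr, so service managers
			// that discard stderr still capture the failure.
			fmt.Printf("level=fatal message=%q panic=%v\nstack:\n%s",
				"Failed due to panic.", r, debug.Stack())
			err = fmt.Errorf("panic: %v", r)
		}
	}()
	return beatMain()
}

func main() {
	err := run(func() error { panic("Oh no, big problem!") })
	fmt.Println("recovered:", err != nil)
	// A real caller would flush the logger and os.Exit(1) when err != nil.
}
```

The deferred closure with a named return value is what converts the panic into an ordinary error at the top of the call stack, which is why the process can choose its own exit code.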


Log that panic object separately
Contributor

@ruflin ruflin left a comment


Will this change the exit code of a beat if the beat panics?

@andrewkroh
Member Author

Yes, it will be 1 instead of 2 to indicate that we caught this panic.
