-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
oomparser: update to use kmsg based parser #1544
Conversation
Nice! This LGTM, but I'd like to get input from a few more people who understand the OOM pathways better than I do. |
@vishh - Could you take a look at this? |
utils/oomparser/oomparser.go
Outdated
}() | ||
// StreamOoms writes to a provided a stream of OomInstance objects representing | ||
// OOM events that are found in the logs. | ||
// It will blocks and should be called from a goroutine. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: s/blocks/block/
utils/oomparser/oomparser.go
Outdated
ioreader: bufio.NewReader(tail), | ||
}, nil | ||
// Should not happen | ||
glog.Warningf("exiting analyzeLines. OOM events will not be reported.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: If this really should not happen it should be error level
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before it was info and I thought it should be error, so I picked the middle as a compromise 😇
"[ 1532] 1020 1532 410347 398810 788 0 0 badsysprogram", | ||
"Out of memory: Kill process 1532 (badsysprogram) score 919 or sacrifice child", | ||
"Killed process 1532 (badsysprogram) total-vm:1641388kB, anon-rss:1595164kB, file-rss:76kB", | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a test case with multiple OOM events (note: I think writeAll needs to run in a goroutine to prevent deadlock)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
utils/oomparser/oomparser.go
Outdated
parser, err = tryLogFile() | ||
if err == nil { | ||
return parser, nil | ||
parser, err := kmsgparser.NewParser() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I like the logger pattern, consider setting the logger to use glog for consistency?
utils/oomparser/oomparser.go
Outdated
// OOM events that are found in the logs. | ||
// It will blocks and should be called from a goroutine. | ||
func (self *OomParser) StreamOoms(outStream chan<- *OomInstance) { | ||
kmsgEntries := self.parser.Parse() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: Should we call self.parser.SeekEnd()
here or not?
I think SeekEnd
more closely matches the journalctl .. -f
behavior of before (start at the end of kmsg), but I think that looking a little into the past makes the kubelet behave more correctly after a restart.
Will OOM events that happen while the kubelet is down, but are detected on kubelet startup actually do the right thing?
Are they idempotent so that power-cycling the kubelet having bunches of duplicate events is okay?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think reporting recent OOM events is desireable, and it's also consistent with the non-journald behavior.
@timstclair addressed your comments and fixed a bug where OOM events with a process name containing a Happy to break that one-line fix out into its own PR if you'd like |
b79ade5
to
5b2e38a
Compare
Rebased and squashed the commit addressing comments from back whenever. |
This fix would also make the packaging issue of cadvisor / journalctl inside a docker container obsolete. #1313 |
Bump on this, did a rote rebase and afaict there aren't open concerns (other than waiting on whether @vishh feels like reviewing) |
@k8s-bot test this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great to me. Most of the essential work is implemented in the dependency https://github.com/euank/go-kmsg-parser.
From my point of view this change should be integrated, although I'm not a official reviewer.
The oomparser logic would end up stuck, unable to detect the end of a given oom trace, for any process with name that didn't match \w+. This includes processes like "python3.4" due to the '.', or 'docker-containerd' due to the '-'. This fix was included in pr google#1544 last year, but since that PR seems dead it seems like a good idea to break this more important fix out. I've updated the test such that it would have caught this issue.
The oomparser logic would end up stuck, unable to detect the end of a given oom trace, for any process with a name that didn't match \w+. This includes processes like 'python3.4' due to the '.', or 'docker-containerd' due to the '-'. This fix was included in pr google#1544 last year, but since that PR seems dead it seems like a good idea to break this more important fix out. I've updated the tests such that they would have caught this issue.
The oomparser logic would end up stuck, unable to detect the end of a given oom trace, for any process with a name that didn't match \w+. This includes processes like 'python3.4' due to the '.', or 'docker-containerd' due to the '-'. This fix was included in pr google#1544 last year, but since that PR seems dead it seems like a good idea to break this more important fix out. I've updated the tests such that they would have caught this issue.
The oomparser logic would end up stuck, unable to detect the end of a given oom trace, for any process with a name that didn't match \w+. This includes processes like 'python3.4' due to the '.', or 'docker-containerd' due to the '-'. This fix was included in pr google#1544 last year, but since that PR seems dead it seems like a good idea to break this more important fix out. I've updated the tests such that they would have caught this issue.
The oomparser logic would end up stuck, unable to detect the end of a given oom trace, for any process with a name that didn't match \w+. This includes processes like 'python3.4' due to the '.', or 'docker-containerd' due to the '-'. This fix was included in pr google#1544 last year, but since that PR seems dead it seems like a good idea to break this more important fix out. I've updated the tests such that they would have caught this issue.
Rebased for probably the 10th or so time. Manually tested via the following: $ sudo ./cadvisor -v 3 -logtostderr
...
# cgroup OOM
$ docker run -m 15M euank/gunpowder-memhog:latest 30M
# cadvisor logs:
I0830 20:24:01.875221 3264 manager.go:1142] Created an OOM event in container "/docker/6da85dc9675db12149e970c6ef6835240bab712e5499963d13e88a446248e82b" at 2017-08-30 20:24:01.583571597 +0000 UTC
# system OOM
$ docker run -d euank/gunpowder-memhog 500M
$ sleep 5
$ echo f | sudo tee /proc/sysrq-trigger
# cadvisor logs
I0830 20:59:27.831699 1611 manager.go:1142] Created an OOM event in container "/" at 2017-08-30 20:59:27.717686391 +0000 UTC @timstclair if this is going to land, can we land it? |
@euank I can help review this. is this ready for a review? |
This provides much more robust support for kernel logs via accessing the `/dev/kmsg` interface to them directly.
@vishh yup, it's been ready for review for 9 months now. |
🎉 |
This is required, because recent versions of containerd try to read from this file. Older LXC versions had a dedicated setting to create this symlink, so we have to rely on systemd here.
My comment for PR kubernetes/node-problem-detector#41 (comment) applies here too.
This would fix kubernetes/kubernetes#34965, probably fix #1484, and work-around/fix #645.
Things that should be considered before merging:
/dev/kmsg
. This should be documented as a breaking change and various examples / docs might need changing.Testing note: I removed the log-file formatted files. The parsing of log-files is now the responsibility of the kmsgparser package which lives elsewhere, so we can trust it to do the right thing. We also get to assume the timestamp etc.
I have not done manual testing of this change, but I have done decent manual testing of the kmsgparser.