Updated the code to handle divide by zero panic #74676
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: hpandeycodeit
If they are not already assigned, you can assign the PR to them by writing
The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Hi @hpandeycodeit. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with
Once the patch is verified, the new status will be reflected by the
I understand the commands that are listed here.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/ok-to-test
/kind bug
/priority important-soon
/assign @derekwaynecarr @feiskyer
/retest
/lgtm
@hpandeycodeit thanks for the fix.
/sig windows
```
@@ -545,6 +545,9 @@ func (p *criStatsProvider) getContainerUsageNanoCores(stats *runtimeapi.Containe
	}

	nanoSeconds := stats.Cpu.Timestamp - cached.Timestamp
	if nanoSeconds == 0 {
```
how would this ever occur? Does this mean two calls to the CRI stats gave metrics with the same timestamp?
^ cc @adelina-t
I've seen this on Windows nodes too, but not sure yet how this could've happened. dockershim adds the timestamp itself for cpu usage by calling time.Now(), so it should not have the same timestamp for two calls.
I agree with @dashpole that this is just papering over the actual bug. Given that, I'd suggest:
- Log an error message including the content of the new stats and the cache stats for further debugging
- Return nil in this case to surface the problem (but not crash the kubelet, of course); a rough sketch of what this could look like follows below.
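A minimal, self-contained sketch of that suggestion, with illustrative names only (this is not the actual criStatsProvider code): guard the interval, log both timestamps when it is unusable, and return nil instead of dividing.

```go
package main

import (
	"fmt"
	"time"
)

// getUsageNanoCoresSketch guards the interval before dividing: if the new
// timestamp is not strictly newer than the cached one, it logs both values
// and returns nil instead of computing a bogus rate or panicking.
// All parameter names are illustrative, not the kubelet's.
func getUsageNanoCoresSketch(newUsage, cachedUsage uint64, newTS, cachedTS int64) *uint64 {
	nanoSeconds := newTS - cachedTS
	if nanoSeconds <= 0 {
		// In the real code this would be an error-level log entry.
		fmt.Printf("unexpected CPU timestamp delta: new=%d cached=%d\n", newTS, cachedTS)
		return nil
	}
	usageNanoCores := (newUsage - cachedUsage) * uint64(time.Second/time.Nanosecond) / uint64(nanoSeconds)
	return &usageNanoCores
}

func main() {
	now := time.Now().UnixNano()
	// Identical timestamps: the guard trips and nil is returned.
	fmt.Println(getUsageNanoCoresSketch(2e9, 1e9, now, now))
	// A one-second interval with 1e9 extra usage-core-nanoseconds is ~1 core.
	fmt.Println(*getUsageNanoCoresSketch(2e9, 1e9, now, now-int64(time.Second)))
}
```

The `<= 0` check in the sketch also covers the negative deltas discussed further down the thread, not just the zero case.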
@ddebroy any ideas on why this could have happened?
+1 for the proposal to log the timestamps. The values are populated with time.Now().UnixNano() at https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/dockershim/docker_stats_windows.go#L89 . That is a core golang API.
The golang docs recommend using Sub() to measure duration (https://golang.org/pkg/time/) because it uses monotonic time. The proposal at https://go.googlesource.com/proposal/+/master/design/12914-monotonic.md describes the shift from wall-clock to monotonic time and specifically mentions the use of Sub(). VMs may be prone to wall-clock drift, so maybe it's safer to use Sub() and then convert the duration to nanoseconds?
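For reference, a small standalone example (not tied to the kubelet code) contrasting a wall-clock difference computed from UnixNano() values with a monotonic difference from Sub():

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	t1 := time.Now()
	time.Sleep(10 * time.Millisecond)
	t2 := time.Now()

	// Wall-clock difference: what you effectively get by storing UnixNano()
	// values and subtracting them later. It can jump, or even go backwards,
	// if the wall clock is stepped (NTP correction, VM clock drift fix-up).
	wallDiff := t2.UnixNano() - t1.UnixNano()

	// Monotonic difference: Sub() uses the monotonic clock reading carried
	// inside each time.Time, so it is immune to wall-clock steps.
	monoDiff := t2.Sub(t1).Nanoseconds()

	fmt.Println("wall-clock diff (ns):", wallDiff)
	fmt.Println("monotonic diff (ns):", monoDiff)
}
```

One caveat: the CRI stats carry plain int64 nanosecond timestamps, so a monotonic reading cannot travel across that boundary; Sub() only helps where both time.Time values are produced in the same process.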
go timers are lower resolution on Windows.
(not to mention i've seen wonky behavior in the past).
if the existing process relies on nano precision, it has to be adapted for Windows i.e. consider introducing small delays if needed.
+1 for the proposal to log the timestamps
+1 here as well.
to illustrate what i mean above try comparing this between Windows and Linux:
```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// load does a small, random amount of work.
func load() {
	for i := 0; i < rand.Intn(10); i++ {
		fmt.Println("i=", i)
	}
}

func main() {
	// Time load() repeatedly; the printed diff reflects the timer
	// resolution of the platform.
	for i := 0; i < 10; i++ {
		t1 := time.Now()
		load()
		t2 := time.Now()
		fmt.Println("-----------> diff", (t2.Sub(t1)).Nanoseconds())
	}
}
```
yeah - I was going down the same path - if this isn't using time.sub() is that the problem?
I tried to track down the accuracy of time.Now()'s monotonic clock on Windows but ended up at a dead end with runtimeNano. If anyone has a pointer to the C or ASM code for this, please share. Windows does have both monotonic and wall time down to microsecond precision (link).
@PatrickLang
https://github.com/golang/go/blob/master/src/runtime/sys_windows_amd64.s#L476-L494
https://github.com/golang/go/blob/master/src/runtime/os_windows.go#L345-L351
but interestingly:
https://github.com/golang/go/blob/master/src/runtime/os_windows.go#L337-L339
// useQPCTime controls whether time.now and nanotime use QueryPerformanceCounter.
// This is only set to 1 when running under Wine.
var useQPCTime uint8
i think it might be worth a shot to force an explicit proc-ed QueryPerformanceCounter and not rely on Now() or Sub(), to see if the behavior changes:
https://github.com/ScaleFT/monotime/blob/master/clock_windows_qpf.go
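A rough, Windows-only sketch of that idea, loading QueryPerformanceCounter and QueryPerformanceFrequency from kernel32 directly, similar in spirit to the package linked above (names are illustrative and error handling is elided):

```go
//go:build windows

package main

import (
	"fmt"
	"syscall"
	"unsafe"
)

var (
	kernel32 = syscall.NewLazyDLL("kernel32.dll")
	procQPC  = kernel32.NewProc("QueryPerformanceCounter")
	procQPF  = kernel32.NewProc("QueryPerformanceFrequency")
)

// qpcTicks reads the raw performance counter value.
func qpcTicks() int64 {
	var ticks int64
	procQPC.Call(uintptr(unsafe.Pointer(&ticks)))
	return ticks
}

// qpcFrequency returns the counter frequency in ticks per second.
func qpcFrequency() int64 {
	var freq int64
	procQPF.Call(uintptr(unsafe.Pointer(&freq)))
	return freq
}

func main() {
	freq := qpcFrequency()
	t1 := qpcTicks()
	// ... the work being measured ...
	t2 := qpcTicks()
	// Convert ticks to nanoseconds (beware overflow for very long intervals).
	fmt.Println("elapsed ns:", (t2-t1)*1_000_000_000/freq)
}
```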
```
@@ -545,6 +545,9 @@ func (p *criStatsProvider) getContainerUsageNanoCores(stats *runtimeapi.Containe
	}

	nanoSeconds := stats.Cpu.Timestamp - cached.Timestamp
```
can we log the timestamps when nanoSeconds is 0 so that we can see what their actual values are?
I meant log both stats.Cpu.Timestamp and cached.Timestamp
@dashpole @yujuhong @neolit123 @michmike @ddebroy I've actually run with logs for stats.Cpu.Timestamp and cached.Timestamp. See the kubelet logs here: https://www.dropbox.com/s/4fu5jyd021vr3xn/clogs.zip?dl=0
The log messages are error level and begin with "###". The node from which I took the logs had only one kubelet panic. Relevant logs:
thanks for the logs @adelina-t
[1] looking at the common
the above two identical timestamps could still be the result of condensed execution, e.g. the reduced precision on Windows not accounting for the actual difference, but given [1] this might not be the case. this might not happen on Linux, and something like the following can be observed:
i wouldn't discard a different type of bug at play here.
As to how this may happen? Take a look in context. (I put some messages inline, but they won't be bold.)
FIRST CALL to getContainerUsageNanoCores is done with an actual value of 1551306491604643800 and a cached value of 1551306478430587800. When it finishes, it should write 1551306491604643800 to the cache (defer).
SECOND CALL to getContainerUsageNanoCores is done with an actual value of:
THIRD CALL to getContainerUsageNanoCores is done with an actual value of:
ok, the negative DIFF confirms there is a bug beyond the limited timer precision on Windows.
I only did a grep through 2 of the last conformance runs on master on GCE (wget on individual kubelet logs is a pain) and have not seen this. But that doesn't mean anything, as a kubelet restart was only noticed in our case because the CNI blocked (a happy coincidence for finding this bug). So it could happen on Linux and nobody would notice a kubelet restart, as it doesn't affect tests. If someone with access to those logs could pull more logs and grep through them, that would help.

Actually, what I think happens is that one of the callers of

I think the whole design of this function is not the best, as it's clear that the underlying assumption, that the timestamp in stats will always be newer than the one in the cache, does not hold, as we have seen. I'll actually give it a bunch of runs on Linux too, with logs, to see what happens.
/hold
i agree.
thanks. waiting on comments from @apelisse @yujuhong for the above too.
@neolit123 Though one thing really bugs me. So what happens is this:
CALLER1 gathers stats and calls getContainerUsageNanoCores with those stats. CALLER2 gathers stats at the exact same time and calls getContainerUsageNanoCores with those stats. One of these calls updates the cache, the other one comes with the exact same value and, well, divide by 0 happens.
However: is it reasonable to think that two callers gathered stats at the exact same time? Or maybe the function gets called from two places with the exact same stats object? (I don't know how stats are collected; I need to investigate that.) Though, upon further thought, if it was the same stats object, we would have seen this more often.
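Purely as an illustration of that scenario (nothing kubelet-specific, all names made up): a toy model of two callers feeding the same timestamp into one shared cache, where whichever call runs second sees a zero (or negative) interval.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// statsCache is a toy stand-in for a shared CPU-usage cache: it only
// remembers the last timestamp that any caller handed in.
type statsCache struct {
	mu   sync.Mutex
	last int64
}

// delta returns newTS minus the previously cached timestamp and then
// updates the cache, mimicking a compute-then-update pattern.
func (c *statsCache) delta(newTS int64) int64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	d := newTS - c.last
	c.last = newTS
	return d
}

func main() {
	cache := &statsCache{last: time.Now().UnixNano() - int64(time.Second)}

	// CALLER1 and CALLER2 happen to gather stats with the same timestamp
	// and feed them into the same cache concurrently.
	ts := time.Now().UnixNano()

	var wg sync.WaitGroup
	for caller := 1; caller <= 2; caller++ {
		wg.Add(1)
		go func(caller int) {
			defer wg.Done()
			d := cache.delta(ts)
			if d <= 0 {
				fmt.Printf("caller %d: zero/negative interval %d; dividing by it would panic or misbehave\n", caller, d)
				return
			}
			fmt.Printf("caller %d: interval %d ns\n", caller, d)
		}(caller)
	}
	wg.Wait()
}
```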
i think it's unlikely that they get the stats at the exact same time even with the reduced precision on Windows, and given the negative DIFF problem, i'd say the same stats object is a possibility too. i'm hoping that we don't hit compiler bugs and such, because that would be a bummer.
@adelina-t I checked the code briefly and I don't think the stats should be the same. If you print the actual cpu usage(s), we can check whether they are the same.
Yes, I read the code again and this is not ideal, and may lead to more confusion. Caching and calculating the CPU usage for a single caller makes sense, since that caller controls the interval. Using a global cache for multiple callers leads to inconsistent periods, or even out-of-order stats (which can be mitigated by some tricks, but those don't tackle the main issue). I think we need a better way to handle this.
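To picture the per-caller idea (hypothetical sketch only; not the actual kubelet code, nor what #74933 does): key the cache by caller as well as container, so each caller's rate is computed over its own sampling interval.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// usageSample is one CPU reading: a timestamp plus cumulative usage.
type usageSample struct {
	timestamp            int64  // nanoseconds
	usageCoreNanoSeconds uint64 // cumulative
}

// perCallerCache keys samples by callerID and containerID so every caller
// computes deltas against its own previous sample. Purely illustrative.
type perCallerCache struct {
	mu    sync.Mutex
	cache map[string]map[string]usageSample
}

func newPerCallerCache() *perCallerCache {
	return &perCallerCache{cache: map[string]map[string]usageSample{}}
}

// usageNanoCores returns the usage rate over this caller's interval, or nil
// if there is no usable interval (first sample, or non-increasing timestamp).
func (c *perCallerCache) usageNanoCores(callerID, containerID string, s usageSample) *uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.cache[callerID] == nil {
		c.cache[callerID] = map[string]usageSample{}
	}
	prev, ok := c.cache[callerID][containerID]
	c.cache[callerID][containerID] = s
	if !ok || s.timestamp <= prev.timestamp {
		return nil
	}
	usage := (s.usageCoreNanoSeconds - prev.usageCoreNanoSeconds) *
		uint64(time.Second/time.Nanosecond) / uint64(s.timestamp-prev.timestamp)
	return &usage
}

func main() {
	c := newPerCallerCache()
	t0 := time.Now().UnixNano()
	fmt.Println(c.usageNanoCores("summary-api", "c1", usageSample{t0, 1e9}))                       // nil: first sample
	fmt.Println(*c.usageNanoCores("summary-api", "c1", usageSample{t0 + int64(time.Second), 2e9})) // ~1 core
}
```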
@yujuhong are you suggesting that we need a better way to handle this for 1.14, or post 1.14?
I agree we should add some more logging around the actual CPU statistics in the cached object to see what they look like. The CPU statistics could help shed some light on whether this is a caching issue or simply a time-recording issue.
I want to ask a different question. There are two problematic scenarios:
closing this PR since #74933 addresses this issue |
@michmike: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What type of PR is this?
What this PR does / why we need it:
This fixes the bug where there can be a panic when the nanoSeconds value is zero.
Which issue(s) this PR fixes:
Fixes #74667
Does this PR introduce a user-facing change?: