Wrong values for perf core events when starting the measurement #2629

Closed
wacuuu opened this issue Jul 23, 2020 · 4 comments · Fixed by #2632


@wacuuu

wacuuu commented Jul 23, 2020

Hi,

I noticed weird behavior in the metrics when using cAdvisor with perf + Prometheus + Grafana. If I configure more core perf events than the platform has counters (which triggers event scaling), I get spikes at the beginning of data collection for a given container. I did some investigation and found this function.

In some cases, the reading returns Value 0, Time Running 0 and a non-zero Time Enabled. This makes the function return an undefined number, by which I mean an absurdly high value that breaks all the stats. I tried to find out why this happens. The only thing I was able to find is this part of the perf tool documentation on event scaling:

This provides an estimate of what the count would have been, had the event been measured during the entire run. It is very important to understand this is an estimate not an actual count. Depending on the workload, there will be blind spots which can introduce errors during scaling.

This led me to the conclusion that, although the PID can be measured on a core, for some reason it is not, so we get a non-zero time enabled but no value and no time running. This effectively corrupts the metrics (rate charts get spikes that make them unreadable).
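For reference, the scaling described in the quoted documentation extrapolates the raw count $v$ using the enabled time $t_{\text{enabled}}$ and the running time $t_{\text{running}}$; the code snippet quoted later in this thread computes the same estimate through its scalingRatio $r$. With the reading reported here, both the raw count and the ratio are zero, so the estimate has no defined value:

$$
\hat{v} = v \cdot \frac{t_{\text{enabled}}}{t_{\text{running}}} = \frac{v}{r}, \qquad r = \frac{t_{\text{running}}}{t_{\text{enabled}}}
$$

$$
v = 0,\; t_{\text{running}} = 0,\; t_{\text{enabled}} > 0 \;\Rightarrow\; r = 0,\quad \hat{v} = \frac{0}{0} \;\text{(undefined; NaN in IEEE 754 arithmetic)}
$$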

My suggestion is to check whether the returned value equals 0 and, if it does, return a PerfStat with all fields set to 0, as if nothing happened.
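A minimal sketch of the suggested guard, assuming hypothetical stand-in types: `perfValue` holds only the three fields discussed here (Value, TimeEnabled, TimeRunning) and `perfStat` mirrors the fields of `info.PerfStat` used in this thread; neither is the real cAdvisor definition, and `scaledStat` is an illustrative name. Keeping Name and Cpu populated on the zeroed stat is also an assumption, so the series stays identifiable. Note that guarding on Value alone still divides by a zero ratio if a non-zero Value ever came with zero running time; checking `TimeRunning == 0` as well would cover that case.

```go
package main

import "fmt"

// perfValue is a hypothetical stand-in for a raw perf reading: the counter
// value plus the time the event was enabled and the time it actually ran.
type perfValue struct {
	Value       uint64
	TimeEnabled uint64
	TimeRunning uint64
}

// perfStat is a hypothetical stand-in for info.PerfStat.
type perfStat struct {
	Value        uint64
	Name         string
	ScalingRatio float64
	Cpu          int
}

// scaledStat applies event scaling, but returns zeroed numeric fields when the
// raw counter is 0, as suggested in this issue, instead of computing 0/0.
func scaledStat(perfData perfValue, name string, cpu int) perfStat {
	if perfData.Value == 0 {
		// Nothing was counted: report zero rather than an undefined estimate.
		return perfStat{Name: name, Cpu: cpu}
	}
	scalingRatio := 1.0
	if perfData.TimeEnabled != 0 {
		// Fraction of the enabled time during which the event was counted.
		scalingRatio = float64(perfData.TimeRunning) / float64(perfData.TimeEnabled)
	}
	return perfStat{
		// Extrapolate the raw count to the full enabled time.
		Value:        uint64(float64(perfData.Value) / scalingRatio),
		Name:         name,
		ScalingRatio: scalingRatio,
		Cpu:          cpu,
	}
}

func main() {
	// The problematic reading from this issue: enabled but never running.
	fmt.Println(scaledStat(perfValue{Value: 0, TimeEnabled: 1000, TimeRunning: 0}, "instructions", 0))
	// prints {0 instructions 0 0}

	// A multiplexed event that ran for half of its enabled time.
	fmt.Println(scaledStat(perfValue{Value: 500, TimeEnabled: 1000, TimeRunning: 500}, "instructions", 0))
	// prints {1000 instructions 0.5 0}
}
```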

@dashpole
Collaborator

cc @Creatone @katarzyna-z @iwankgb

@Creatone
Collaborator

Good point, I agree with your suggestion.

@iwankgb
Collaborator

iwankgb commented Jul 26, 2020

I would expect this code snippet to prevent the problem you are describing:

scalingRatio := 1.0
if perfData.TimeEnabled != 0 {
	// Fraction of the enabled time during which the event was counted.
	scalingRatio = float64(perfData.TimeRunning) / float64(perfData.TimeEnabled)
}
stat := info.PerfStat{
	// Extrapolate the raw count to the full enabled time.
	Value:        uint64(float64(perfData.Value) / scalingRatio),
	Name:         name,
	ScalingRatio: scalingRatio,
	Cpu:          cpu,
}

If TimeEnabled equals 0, scalingRatio remains 1.0 and Value is simply the raw count divided by 1 (zero for an empty reading: 0/1). The problem you are describing could occur if TimeEnabled is non-zero, TimeRunning is zero, and perfData.Value is zero.

@wacuuu
Author

wacuuu commented Jul 26, 2020

I would too. But a simple test showed me that 0.0/0.0 yields NaN, and converting that result to uint64 returns a ridiculously large number.
