Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ui: latency graphs reporting 0 latency incorrectly #12998

Closed
petermattis opened this issue Jan 19, 2017 · 6 comments · Fixed by #13106
Closed

ui: latency graphs reporting 0 latency incorrectly #12998

petermattis opened this issue Jan 19, 2017 · 6 comments · Fixed by #13106
Assignees
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

@petermattis
Copy link
Collaborator

screen shot 2017-01-19 at 11 31 50 am

The above is from a cluster that is performing ~2k ops/sec. It is impossible for the latency to be 0.

@mrtracy
Copy link
Contributor

mrtracy commented Jan 19, 2017

What cluster was this?

@mrtracy
Copy link
Contributor

mrtracy commented Jan 19, 2017

There's only one line there, but the 99th percentile graph shows a line for every node (unless you are looking at a single node). Is it possible that this node has no requests?

Edit: I took a look, yeah something might not be getting recorded correctly on the backend.

@petermattis
Copy link
Collaborator Author

This was a 1-node cluster so it was definitely receiving requests. I've seen this happen multiple times.

@mrtracy
Copy link
Contributor

mrtracy commented Jan 19, 2017

After reading the code, i'm surprised it ever has non-zero values; the sliding histogram is only advanced when the histogram values are read. This means that, immediately before returning values, the histogram might rotate all previously recorded data out of the histogram, resulting in a zero value.

Two possible fixes:

  • "tick" histograms when values are recorded, rather than when values are read. This would involve taking a timestamp whenever a value is recorded; probably not too expensive, but it's something to consider.
  • Set the histogram duration to a longer interval than the recording window. This is not as accurate as the solution above (some values will be recorded in the wrong window), but is probably accurate enough for our purposes and involves taking fewer timestamps. That would also align with the suggestion in ui: latency metrics P99/P50 on cluster page fluctuate a lot #12835, although too much smoothing might make the data unintuitive.

@rjnn
Copy link
Contributor

rjnn commented Jan 30, 2017

What's the current status on this issue?

@mrtracy
Copy link
Contributor

mrtracy commented Feb 1, 2017

While a "fix" was prepared to create non-zero values, we are looking into prioritizing a much bigger change to the time series system that will allow us to handle histograms correctly (see #7896). This will allow for data which is far more intuitive (no odd "window" sizes), as well as the correct computation of quantiles across the cluster.

@mrtracy mrtracy added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Feb 2, 2017
mrtracy pushed a commit to mrtracy/cockroach that referenced this issue Feb 3, 2017
Adds a derived configuration value "HistogramWindowDuration" to the server
config, which specifies the approximate duration for which individual samples
are retained in our sliding window histograms.

Previously, histogram windows were initialized using the MetricSampleWindow
value. However, this resulted in cockroachdb#12998; this is caused because the histogram
window is rotated when querying a histogram, not when recording a
sample. In many cases, the timing is such that all samples recorded are
discarded right before the histogram is queried, resulting in a zero value.

Because of this, and additionally issue cockroachdb#12835, we are lengthening the histogram
window duration for all histogram metrics. This includes all existing histograms
that were using MetricSampleWindow, and a few histograms which were using
constant values of 1 minute. The new value is arbitrarily chosen to be six times
the MetricSampleWindow configuration setting; under default settings, that
equates to a window of one minute.

The work here can be considered a temporary fix to alleviate the problems in a
couple of issues; the histogram system is going to be greatly overhauled in
issue cockroachdb#7896, part of which will be the removal of sliding-window histograms.

Fixes cockroachdb#12835
Fixes cockroachdb#12998
@rjnn rjnn mentioned this issue Feb 6, 2017
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants