ui: latency graphs reporting 0 latency incorrectly #12998

petermattis · 2017-01-19T16:32:52Z

The above is from a cluster that is performing ~2k ops/sec. It is impossible for the latency to be 0.

mrtracy · 2017-01-19T18:07:35Z

What cluster was this?

mrtracy · 2017-01-19T18:08:52Z

There's only one line there, but the 99th percentile graph shows a line for every node (unless you are looking at a single node). Is it possible that this node has no requests?

Edit: I took a look, yeah something might not be getting recorded correctly on the backend.

petermattis · 2017-01-19T18:11:30Z

This was a 1-node cluster so it was definitely receiving requests. I've seen this happen multiple times.

mrtracy · 2017-01-19T21:01:36Z

After reading the code, i'm surprised it ever has non-zero values; the sliding histogram is only advanced when the histogram values are read. This means that, immediately before returning values, the histogram might rotate all previously recorded data out of the histogram, resulting in a zero value.

Two possible fixes:

"tick" histograms when values are recorded, rather than when values are read. This would involve taking a timestamp whenever a value is recorded; probably not too expensive, but it's something to consider.
Set the histogram duration to a longer interval than the recording window. This is not as accurate as the solution above (some values will be recorded in the wrong window), but is probably accurate enough for our purposes and involves taking fewer timestamps. That would also align with the suggestion in ui: latency metrics P99/P50 on cluster page fluctuate a lot #12835, although too much smoothing might make the data unintuitive.

rjnn · 2017-01-30T15:20:24Z

What's the current status on this issue?

mrtracy · 2017-02-01T20:21:06Z

While a "fix" was prepared to create non-zero values, we are looking into prioritizing a much bigger change to the time series system that will allow us to handle histograms correctly (see #7896). This will allow for data which is far more intuitive (no odd "window" sizes), as well as the correct computation of quantiles across the cluster.

Adds a derived configuration value "HistogramWindowDuration" to the server config, which specifies the approximate duration for which individual samples are retained in our sliding window histograms. Previously, histogram windows were initialized using the MetricSampleWindow value. However, this resulted in cockroachdb#12998; this is caused because the histogram window is rotated when querying a histogram, not when recording a sample. In many cases, the timing is such that all samples recorded are discarded right before the histogram is queried, resulting in a zero value. Because of this, and additionally issue cockroachdb#12835, we are lengthening the histogram window duration for all histogram metrics. This includes all existing histograms that were using MetricSampleWindow, and a few histograms which were using constant values of 1 minute. The new value is arbitrarily chosen to be six times the MetricSampleWindow configuration setting; under default settings, that equates to a window of one minute. The work here can be considered a temporary fix to alleviate the problems in a couple of issues; the histogram system is going to be greatly overhauled in issue cockroachdb#7896, part of which will be the removal of sliding-window histograms. Fixes cockroachdb#12835 Fixes cockroachdb#12998

petermattis assigned mrtracy Jan 19, 2017

mrtracy mentioned this issue Jan 24, 2017

ui: Create HistogramWindowDuration config value #13106

Merged

mrtracy mentioned this issue Feb 1, 2017

ui: Store Histogram Bucket Counts in Time Series system #7896

Closed

mrtracy added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Feb 2, 2017

mrtracy closed this as completed in #13106 Feb 3, 2017

rjnn mentioned this issue Feb 6, 2017

distsql: add metrics #12143

Closed

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ui: latency graphs reporting 0 latency incorrectly #12998

ui: latency graphs reporting 0 latency incorrectly #12998

petermattis commented Jan 19, 2017

mrtracy commented Jan 19, 2017

mrtracy commented Jan 19, 2017 •

edited

Loading

petermattis commented Jan 19, 2017

mrtracy commented Jan 19, 2017 •

edited

Loading

rjnn commented Jan 30, 2017

mrtracy commented Feb 1, 2017

ui: latency graphs reporting 0 latency incorrectly #12998

ui: latency graphs reporting 0 latency incorrectly #12998

Comments

petermattis commented Jan 19, 2017

mrtracy commented Jan 19, 2017

mrtracy commented Jan 19, 2017 • edited Loading

petermattis commented Jan 19, 2017

mrtracy commented Jan 19, 2017 • edited Loading

rjnn commented Jan 30, 2017

mrtracy commented Feb 1, 2017

mrtracy commented Jan 19, 2017 •

edited

Loading

mrtracy commented Jan 19, 2017 •

edited

Loading