-
Notifications
You must be signed in to change notification settings - Fork 655
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Telemetry: fix polling gauges always reporting NaN on Prometheus and timers #1218
Telemetry: fix polling gauges always reporting NaN on Prometheus and timers #1218
Conversation
@@ -163,7 +163,17 @@ protected String getLockName(String name, String partition) { | |||
} | |||
|
|||
protected void internalPollSingle(PollContext ctx) { | |||
String monitorName = getClass().getSimpleName(); | |||
String monitorName = | |||
!StringUtils.isBlank(this.getName()) ? this.getName() : getClass().getSimpleName(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is was a bug, cause overridden function was totally ignored in metrics
@@ -67,12 +71,9 @@ public CommonPollingMonitor( | |||
this.discoveryStatusListener = discoveryStatusListener; | |||
this.lockService = lockService; | |||
this.scheduler = scheduler; | |||
this.instrumentation = new CommonPollingMonitorInstrumentation(registry); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Single instance per monitor, I know it's not needed, but given we just run a few, it's not a biggy, compared to fix all the tests and change the constructor of every subclass.
The key bit is the registry, which can be passed in for testing
} | ||
|
||
public void trackItemsCached(AtomicInteger numberOfItems, String monitor, String partition) { | ||
Gauge gauge = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor... but could move the registry.get into the if check bypassing the need for a cast
|
||
public void trackItemsOverThreshold( | ||
AtomicInteger numberOfItems, String monitor, String partition) { | ||
Gauge gauge = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above, minor thing just prefer inline checks on some of this :) PERSONAL thing vs. "required"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this covers the sitaution :) Thank you!
Not going to worry about my "nits" Just curious :) |
I would have fixed it, no problems, but I saw you merged! Hopefully it helps the folks reporting the issue! |
Fixes armory-plugins/armory-observability-plugin#34.
Long overdue, I know, as we have been running with it for more than a year now and forgot to upstream it 😓
Context
Relevant polling gauges are currently always NaN: they represent new items (e.g images cached) or above the limit, requiring manual intervention (via admin endpoint).
The lack of visibility over these metrics (just visible in logs) might be a big issues, given blocked caching might have consequences like engineers not be able to find they releases or triggers not firing.
The PR fixes this using the spectator library, trying to achieve a similar outcome we use internally at Wise in our production services with Micrometer, i.e. registering some variables with strong references and letting Prometheus scrapes observe the latest value.
In order to do that, we introduced a separate instrumentation class, which eventually could contain logging too, as per Domain Probes .
Testing tried to simulate multiple runs and simple concurrency scenarios.
P.S. I wrote this long time ago and cleaned up the internal PR, so there might be some caveats based on the internal use I might have missed but it does fix the issue reported.