Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Telemetry: fix polling gauges always reporting NaN on Prometheus and timers #1218

Merged
merged 1 commit into from
Jan 28, 2024

Conversation

massimo-pacher-tw
Copy link
Contributor

@massimo-pacher-tw massimo-pacher-tw commented Jan 26, 2024

Fixes armory-plugins/armory-observability-plugin#34.
Long overdue, I know, as we have been running with it for more than a year now and forgot to upstream it 😓

Context

Relevant polling gauges are currently always NaN: they represent new items (e.g images cached) or above the limit, requiring manual intervention (via admin endpoint).

The lack of visibility over these metrics (just visible in logs) might be a big issues, given blocked caching might have consequences like engineers not be able to find they releases or triggers not firing.

The PR fixes this using the spectator library, trying to achieve a similar outcome we use internally at Wise in our production services with Micrometer, i.e. registering some variables with strong references and letting Prometheus scrapes observe the latest value.

In order to do that, we introduced a separate instrumentation class, which eventually could contain logging too, as per Domain Probes .

Testing tried to simulate multiple runs and simple concurrency scenarios.

P.S. I wrote this long time ago and cleaned up the internal PR, so there might be some caveats based on the internal use I might have missed but it does fix the issue reported.

@@ -163,7 +163,17 @@ protected String getLockName(String name, String partition) {
}

protected void internalPollSingle(PollContext ctx) {
String monitorName = getClass().getSimpleName();
String monitorName =
!StringUtils.isBlank(this.getName()) ? this.getName() : getClass().getSimpleName();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is was a bug, cause overridden function was totally ignored in metrics

@@ -67,12 +71,9 @@ public CommonPollingMonitor(
this.discoveryStatusListener = discoveryStatusListener;
this.lockService = lockService;
this.scheduler = scheduler;
this.instrumentation = new CommonPollingMonitorInstrumentation(registry);
Copy link
Contributor Author

@massimo-pacher-tw massimo-pacher-tw Jan 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Single instance per monitor, I know it's not needed, but given we just run a few, it's not a biggy, compared to fix all the tests and change the constructor of every subclass.
The key bit is the registry, which can be passed in for testing

}

public void trackItemsCached(AtomicInteger numberOfItems, String monitor, String partition) {
Gauge gauge =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor... but could move the registry.get into the if check bypassing the need for a cast


public void trackItemsOverThreshold(
AtomicInteger numberOfItems, String monitor, String partition) {
Gauge gauge =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above, minor thing just prefer inline checks on some of this :) PERSONAL thing vs. "required"

Copy link
Member

@jasonmcintosh jasonmcintosh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this covers the sitaution :) Thank you!

@jasonmcintosh
Copy link
Member

Not going to worry about my "nits" Just curious :)

@jasonmcintosh jasonmcintosh added the ready to merge Approved and ready for merge label Jan 28, 2024
@mergify mergify bot added the auto merged label Jan 28, 2024
@mergify mergify bot merged commit 957000f into spinnaker:master Jan 28, 2024
4 checks passed
@massimo-pacher-tw
Copy link
Contributor Author

Not going to worry about my "nits" Just curious :)

I would have fixed it, no problems, but I saw you merged! Hopefully it helps the folks reporting the issue!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

NaN values on some igor metrics
3 participants