bug(metrics): Large negative values are reported when time.IsZero() in code. #2743

cep21 · 2023-07-13T18:06:32Z

Checks

I've already read https://github.com/actions/actions-runner-controller/blob/master/TROUBLESHOOTING.md and I'm sure my issue is not covered in the troubleshooting guide.
I'm not using a custom entrypoint in my runner image

Controller Version

0.27.4

Helm Chart Version

No response

CertManager Version

No response

Deployment Method

Helm

cert-manager installation

Unneeded for this bug.

Checks

This isn't a question or user support case (For Q&A and community support, go to Discussions. It might also be a good idea to contract with any of contributors and maintainers if your business is so critical and therefore you need priority support
I've read releasenotes before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
My actions-runner-controller version (v0.x.y) does support the feature
I've already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn't fix the issue
I've migrated to the workflow job webhook event (if you using webhook driven scaling)

Resource Definitions

Unneeded for this bug.

To Reproduce

Gather a large number of metrics and inspect the /metrics endpoint for large negative values.

Describe the bug

We see large negative values in our bucket timings. Code bug in additional context

Describe the expected behavior

The "run duration" of a job should never be a large negative value

Whole Controller Logs

Whole Runner Pod Logs

Additional Context

I'm looking at the function here:

actions-runner-controller/pkg/actionsmetrics/event_reader.go

Line 226 in e0a7e14

    
           func (reader *EventReader) fetchAndParseWorkflowJobLogs(ctx context.Context, e *gogithub.WorkflowJobEvent) (*ParseResult, error) {

The code ends in this

	return &ParseResult{
		ExitCode:  exitCode,
		QueueTime: startedTime.Sub(queuedTime),
		RunTime:   completedTime.Sub(startedTime),
	}, nil

I believe that sometimes, startedTime or completedTime, are never parsed and are still the zero value. When they are the zero value, the returned value is 0-<now> which is very large and reports as a large negative number in the /metrics endpoint.

Proposed fix

Rather than this

	return &ParseResult{
		ExitCode:  exitCode,
		QueueTime: startedTime.Sub(queuedTime),
		RunTime:   completedTime.Sub(startedTime),
	}, nil

Do this

var ret ParseResult
if !startedTime.IsZero() && !queuedTime.IsZero() {
  ret.QueueTime = startedTime.Sub(queuedTime)
}

And the same for RunTime.

Then on the caller side, if the metric QueueTime.IsZero() is true, don't report the zero value to /metrics historgram

The text was updated successfully, but these errors were encountered:

github-actions · 2023-07-13T18:07:14Z

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

cep21 · 2023-07-13T20:31:07Z

The PR that introduced this bug is #2057

cep21 added bug Something isn't working needs triage Requires review from the maintainers labels Jul 13, 2023

cep21 mentioned this issue Jul 18, 2023

Expose workflow job metrics via new actions-metrics-server #2057

Merged

nikola-jokic added community Community contribution and removed needs triage Requires review from the maintainers labels Sep 27, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bug(metrics): Large negative values are reported when time.IsZero() in code. #2743

bug(metrics): Large negative values are reported when time.IsZero() in code. #2743

cep21 commented Jul 13, 2023 •

edited

Loading

github-actions bot commented Jul 13, 2023

cep21 commented Jul 13, 2023

bug(metrics): Large negative values are reported when time.IsZero() in code. #2743

bug(metrics): Large negative values are reported when time.IsZero() in code. #2743

Comments

cep21 commented Jul 13, 2023 • edited Loading

Checks

Controller Version

Helm Chart Version

CertManager Version

Deployment Method

cert-manager installation

Checks

Resource Definitions

To Reproduce

Describe the bug

Describe the expected behavior

Whole Controller Logs

Whole Runner Pod Logs

Additional Context

Proposed fix

github-actions bot commented Jul 13, 2023

cep21 commented Jul 13, 2023

cep21 commented Jul 13, 2023 •

edited

Loading