Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug(metrics): Large negative values are reported when time.IsZero() in code. #2743

Open
7 tasks done
cep21 opened this issue Jul 13, 2023 · 2 comments
Open
7 tasks done
Labels
bug Something isn't working community Community contribution

Comments

@cep21
Copy link

cep21 commented Jul 13, 2023

Checks

Controller Version

0.27.4

Helm Chart Version

No response

CertManager Version

No response

Deployment Method

Helm

cert-manager installation

Unneeded for this bug.

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions. It might also be a good idea to contract with any of contributors and maintainers if your business is so critical and therefore you need priority support
  • I've read releasenotes before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
  • My actions-runner-controller version (v0.x.y) does support the feature
  • I've already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn't fix the issue
  • I've migrated to the workflow job webhook event (if you using webhook driven scaling)

Resource Definitions

Unneeded for this bug.

To Reproduce

Gather a large number of metrics and inspect the /metrics endpoint for large negative values.

Describe the bug

We see large negative values in our bucket timings. Code bug in additional context

Describe the expected behavior

The "run duration" of a job should never be a large negative value

Whole Controller Logs

.

Whole Runner Pod Logs

.

Additional Context

I'm looking at the function here:

func (reader *EventReader) fetchAndParseWorkflowJobLogs(ctx context.Context, e *gogithub.WorkflowJobEvent) (*ParseResult, error) {

The code ends in this

	return &ParseResult{
		ExitCode:  exitCode,
		QueueTime: startedTime.Sub(queuedTime),
		RunTime:   completedTime.Sub(startedTime),
	}, nil

I believe that sometimes, startedTime or completedTime, are never parsed and are still the zero value. When they are the zero value, the returned value is 0-<now> which is very large and reports as a large negative number in the /metrics endpoint.

Proposed fix

Rather than this

	return &ParseResult{
		ExitCode:  exitCode,
		QueueTime: startedTime.Sub(queuedTime),
		RunTime:   completedTime.Sub(startedTime),
	}, nil

Do this

var ret ParseResult
if !startedTime.IsZero() && !queuedTime.IsZero() {
  ret.QueueTime = startedTime.Sub(queuedTime)
}

And the same for RunTime.

Then on the caller side, if the metric QueueTime.IsZero() is true, don't report the zero value to /metrics historgram

@cep21 cep21 added bug Something isn't working needs triage Requires review from the maintainers labels Jul 13, 2023
@github-actions
Copy link
Contributor

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

@cep21
Copy link
Author

cep21 commented Jul 13, 2023

The PR that introduced this bug is #2057

@nikola-jokic nikola-jokic added community Community contribution and removed needs triage Requires review from the maintainers labels Sep 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working community Community contribution
Projects
None yet
Development

No branches or pull requests

2 participants