
Add workflow job metrics to Github webhook server #1814

Closed
ColinHeathman wants to merge 2 commits

Conversation


@ColinHeathman ColinHeathman commented Sep 19, 2022

Relates to issue #1551

This PR adds these metrics:

github_workflow_job_queue_duration_seconds
github_workflow_job_run_duration_seconds
github_workflow_job_conclusions_total
github_workflow_jobs_queued_total
github_workflow_jobs_started_total
github_workflow_jobs_completed_total
github_workflow_job_failures_total
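
For illustration only (not the PR's actual code), declaring metrics like these with prometheus/client_golang might look roughly as follows; the label names and histogram buckets are assumptions, not necessarily what the PR ships:

package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
    // Histogram for run duration; buckets are an illustrative guess (10s up to ~85min).
    githubWorkflowJobRunDurationSeconds = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "github_workflow_job_run_duration_seconds",
            Help:    "Run duration of tracked workflow jobs",
            Buckets: prometheus.ExponentialBuckets(10, 2, 10),
        },
        []string{"conclusion"},
    )
    // Plain counter for completed events.
    githubWorkflowJobsCompletedTotal = prometheus.NewCounter(
        prometheus.CounterOpts{
            Name: "github_workflow_jobs_completed_total",
            Help: "Total count of workflow job completed events received",
        },
    )
)

func init() {
    prometheus.MustRegister(
        githubWorkflowJobRunDurationSeconds,
        githubWorkflowJobsCompletedTotal,
    )
}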

@ColinHeathman

I have tested this locally with a basic installation. I am in the process of testing this change in an environment with some live traffic, so the PR will remain a draft until that's completed.

Nitpicks are welcome in the meantime

case "in_progress":

// discard events if the previous event wasn't tracked
previousEvent, ok := reader.trackedJobs[job_id]
Collaborator

Does this imply that this feature will work only when you have only one replica of the githubwebhookserver, but not for two or more replicas? (It's a standard webhook server, so it is expected to be scaled out to 2 or more replicas whenever the operator decides to do so.)

Contributor Author

Yes, that's what I had in mind here. If that's not an assumption I can make, then there needs to be some way to share state, or github_workflow_job_status should be dropped
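
To make the single-replica assumption concrete, here is a rough sketch of the stateful approach under discussion (type and field names are modeled on the diff above and are assumptions, not the PR's exact code): each webhook-server replica keeps its own in-memory map, so the "queued" event and the matching "in_progress" event must be delivered to the same replica for the lookup to succeed.

package webhook

import (
    "sync"
    "time"
)

type jobEvent struct {
    action     string
    receivedAt time.Time
}

type eventReader struct {
    mu          sync.Mutex
    trackedJobs map[int64]jobEvent // keyed by workflow job ID, local to this replica
}

// queueDuration returns how long the job sat queued, based only on events
// this replica has seen. If the "queued" event landed on another replica,
// the lookup fails and the sample is discarded.
func (r *eventReader) queueDuration(jobID int64, now time.Time) (time.Duration, bool) {
    r.mu.Lock()
    defer r.mu.Unlock()
    prev, ok := r.trackedJobs[jobID]
    if !ok || prev.action != "queued" {
        return 0, false
    }
    delete(r.trackedJobs, jobID)
    return now.Sub(prev.receivedAt), true
}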

Collaborator

@ColinHeathman We may introduce an optional dependency on an external distributed kv store to track jobs later. Or GitHub could enhance their API and/or the workflow job event payload so that we can use some information included in the event to measure the queue duration. In the meantime, though, we should omit this, as it isn't going to work universally.

Contributor Author

I've started a discussion here to enhance the API. I believe that is the cleanest path forward.

@mumoshu Is there another pending reason to add a kv store? I thought about it but I didn't think these metrics would justify the addition.

Alternatively, we could change github_workflow_job_status to a counter, rather than a gauge.
That way queue counts could be calculated by Prometheus, with github_workflow_job_status{job_status="queued"}[1m] - github_workflow_job_status{job_status="in_progress"}[1m]

I can't think of a way to make github_workflow_job_queue work without a kv store though

@mumoshu mumoshu Sep 21, 2022

Thanks for starting the discussion! Let's see how it goes.

Is there another pending reason to add a kv store?

No- this is the first time I thought we might need to add it.

I thought about it but I didn't think these metrics would justify the addition.

Same- if this isn't a short-term hard requirement, I'd like not to add the kv store as an ARC dependency. Might be too much.

we could change github_workflow_job_status to a counter, rather than a gauge.

Ah, counts might be good as well! Can we name the metric like github_workflow_jobs_total(job_status="...") though? If it's a count, I think that better follows the canonical metric name convention of Prometheus https://prometheus.io/docs/practices/naming/#metric-names
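
A minimal sketch of that naming, assuming prometheus/client_golang; the variable and label names here are illustrative (the PR as described above ended up with separate per-status counters instead):

package webhook

import "github.com/prometheus/client_golang/prometheus"

// One counter vector partitioned by job_status, incremented once per
// received workflow_job event.
var githubWorkflowJobsTotal = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "github_workflow_jobs_total",
        Help: "Total count of workflow job events received, by status",
    },
    []string{"job_status"},
)

// Example use on a webhook delivery (action is "queued", "in_progress",
// or "completed"):
//   githubWorkflowJobsTotal.WithLabelValues(action).Inc()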

@ColinHeathman ColinHeathman force-pushed the job-metrics branch 2 times, most recently from c04eaba to 94d6979 on October 4, 2022 00:04
@ColinHeathman ColinHeathman marked this pull request as ready for review October 4, 2022 00:05
@ColinHeathman

@mumoshu Please take a look at this new implementation.

This version is stateless. I was able to find the information I need for the metrics in the job logs. This uses the GH client to download and parse those logs, so there is a bit of a downside there, but I think having this be stateless is much better.
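
For illustration only (this is not the PR's actual helper): extracting a duration from downloaded job logs might look roughly like this, assuming each log line starts with an RFC 3339 timestamp.

package webhook

import (
    "bufio"
    "errors"
    "io"
    "strings"
    "time"
)

// parseJobLogDuration scans the raw log text of a job and returns the time
// between the first and last timestamped lines. Lines without a leading
// timestamp are skipped.
func parseJobLogDuration(logs io.Reader) (time.Duration, error) {
    var first, last time.Time
    scanner := bufio.NewScanner(logs)
    for scanner.Scan() {
        fields := strings.SplitN(scanner.Text(), " ", 2)
        ts, err := time.Parse(time.RFC3339Nano, fields[0])
        if err != nil {
            continue
        }
        if first.IsZero() {
            first = ts
        }
        last = ts
    }
    if err := scanner.Err(); err != nil {
        return 0, err
    }
    if first.IsZero() {
        return 0, errors.New("no timestamped lines found")
    }
    return last.Sub(first), nil
}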


mumoshu commented Oct 4, 2022

@ColinHeathman Hey! Thanks a lot for your effort. Sorry to say, but this approach isn't going to scale, and that's actually why we moved from pull-based autoscaling to webhook-based autoscaling.
The GitHub API rate limit can be quite insufficient when you're going to run hundreds of jobs per hour. AFAIK it's 1000/hour with a PAT, and 15000/hour with a GitHub App organizational installation.
You added an API call for every workflow_job event in the last commit. Assuming we might receive 3 workflow job events per job (queued, in_progress, and completed), this approach might mean ARC with a PAT can't handle more than 333 jobs/hour, which is quite low. We also call some APIs in other places, e.g. for runner graceful shutdown, so the actual limit would be lower.


mumoshu commented Oct 4, 2022

@ColinHeathman IMHO, making it stateful alone is fine, as long as the state is allowed to get lost and local. That said, I thought we'd be happy with your alternative mentioned in #1814 (comment), which is to make those counters so that you can still count events per status, while you can also aggregate counts at query time using e.g. Prometheus, which makes it possible to approximately count events per status without introducing a global state to ARC. WDYT?


mumoshu commented Oct 4, 2022

That said, your awesome fetchAndParseWorkflowJobLogs and the related code might fit better within another controller/command within ARC (named e.g. actions-metrics-server), or within another application dedicated to monitoring workflow jobs from the information provided by the GitHub Actions API, not from ARC.

In that dedicated controller/command/application, you might be able to poll those APIs less frequently than once per workflow job event, which prevents rate-limit issues while still providing the metrics you want.

Note though, providing workflow job event status count metrics per #1814 (comment) is orthogonal to the metrics derived from GitHub Actions. The former is for monitoring ARC and the latter is for monitoring GitHub Actions. Both look useful to me. Just that I think we shouldn't mix the two.
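
To sketch the polling idea (hypothetical names, not an existing ARC component): a dedicated metrics server could poll the Actions API on a fixed interval, so API usage is bounded by the interval rather than by webhook event volume.

package main

import (
    "context"
    "log"
    "time"
)

// pollWorkflowJobs drives a hypothetical listRecentWorkflowJobs function
// (a placeholder for whatever GitHub Actions API call the component would
// use) on a fixed interval instead of once per webhook event.
func pollWorkflowJobs(ctx context.Context, interval time.Duration,
    listRecentWorkflowJobs func(context.Context) error) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            if err := listRecentWorkflowJobs(ctx); err != nil {
                log.Printf("polling workflow jobs: %v", err)
            }
        }
    }
}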

@ColinHeathman

@ColinHeathman IMHO, making it stateful alone is fine, as long as the state is allowed to get lost. That said, I thought we'd be happy with your alternative mentioned in #1814 (comment), which is to make those counters so that you can still count events per status, while you can also aggregate counts at query time using e.g. Prometheus, which makes it possible to approximately count events per status without introducing a global state to ARC. WDYT?

@mumoshu I did make the change to counters here (the gauges are gone), so that's well and good.

Runtimes can be measured off of the action=completed event, but it's less accurate than the logs. I think losing that accuracy is fine though.

I currently don't see a way to get Queue times without either comparing events or making an API call for the logs. I think maybe that needs to be a separate issue or PR since it's going to be more difficult.
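
For reference, measuring run duration off the completed event could look roughly like this; the struct mirrors the started_at/completed_at fields of the workflow_job webhook payload, and the histogram is assumed to be one of the metrics listed in the PR description.

package webhook

import (
    "time"

    "github.com/prometheus/client_golang/prometheus"
)

// workflowJob mirrors just the payload fields needed here.
type workflowJob struct {
    StartedAt   time.Time `json:"started_at"`
    CompletedAt time.Time `json:"completed_at"`
    Conclusion  string    `json:"conclusion"`
}

// observeRunDuration records completed_at - started_at, labeled by
// conclusion. Less precise than parsing logs, but needs no extra API call.
func observeRunDuration(job workflowJob, runDuration *prometheus.HistogramVec) {
    if job.StartedAt.IsZero() || job.CompletedAt.IsZero() {
        return // payload did not carry usable timestamps
    }
    runDuration.WithLabelValues(job.Conclusion).
        Observe(job.CompletedAt.Sub(job.StartedAt).Seconds())
}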

@ColinHeathman

What do you think of the metric names? I want to be sure that they are correct

githubWorkflowJobConclusionsTotal = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "github_workflow_job_conclusions_total",
        Help: "Conclusions for tracked workflow jobs",
Collaborator

Suggested change:
- Help: "Conclusions for tracked workflow jobs",
+ Help: "Total count of conclusions for tracked workflow jobs",

It would probably be nicer to make it consistent with the other metric names.


mumoshu commented Oct 4, 2022

@ColinHeathman Thanks for your confirmation. The metrics names look great to me!


mumoshu commented Oct 4, 2022

Runtimes can be measured off of the action=completed event, but it's less accurate than the logs. I think losing that accuracy is fine though.

I'm fine with that too. We'd better add a comment in the code so we remember that the run_duration metric count is not going to be the same as workflow_job_completions_total when you have two or more ARC webhook-server pods.

I currently don't see a way to get Queue times without either comparing events or making an API call for the logs. I think maybe that needs to be a separate issue or PR since it's going to be more difficult.

I agree with you! If it's going to be something like what I've explained in #1814 (comment), that looks good to me.


mumoshu commented Nov 7, 2022

@ColinHeathman Hey! Is this ready for review? If it's working in your cluster, that's all right; I'll try resolving the merge conflicts, test it on my cluster, and add any necessary changes if you're busy.

@ColinHeathman

Hey @mumoshu, this does work in my cluster. We at BenchSci are currently using this as it is, since we have GH Enterprise and don't need to worry so much about the API rate limit.

I was planning on moving this to the dedicated controller/command/application as you suggested in #1814 (comment), but I haven't had the time to do the rework. I don't think I'll be able to get to it until next month.

Feel free to make changes :) I can try to help any way I can.


mumoshu commented Nov 30, 2022

@ColinHeathman Hey! I finally had some time to work on it, and #2057's what I intended. Would you mind taking a look when you have some time?


mumoshu commented Dec 7, 2022

@ColinHeathman Hey! I'm closing this in favor of #2057. Thank you again for doing 99% of the work in this pull request! I've featured you in our release note to make your awesome work more visible ❤️
#2068

@mumoshu mumoshu closed this Dec 7, 2022

mumoshu commented Dec 7, 2022

#2057 will be included in the next release of ARC, 0.27.0. Please feel free to give it a shot and report back with any issues or ideas 🙏

@mumoshu mumoshu added this to the v0.27.0 milestone Dec 7, 2022