Add workflow job metrics to Github webhook server #1814
Conversation
I have tested this locally with a basic installation. I am in the process of testing this change in an environment with some live traffic, so the PR will remain a draft until that's completed. Nitpicks are welcome in the meantime.
case "in_progress":
	// discard events if the previous event wasn't tracked
	previousEvent, ok := reader.trackedJobs[job_id]
Does this imply that this feature will work only when you have a single replica of the githubwebhookserver, but not for two or more replicas? (It's a standard webhook server, so it is expected to be scaled out to 2 or more replicas whenever the operator decides to do so.)
Yes, that's what I had in mind here. If that's not an assumption I can make, then there needs to be some way to share state, or github_workflow_job_status should be dropped.
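The replica problem discussed above can be sketched with a minimal, hypothetical stand-in for the tracker (the real struct lives in the webhook server; these trimmed types are illustrative only). Because the map is process-local, a follow-up event routed to a replica that never saw the matching earlier event has no previous state and must be discarded:

```go
package main

import "fmt"

// jobEvent is a hypothetical, trimmed-down stand-in for the
// webhook payload fields the tracker cares about.
type jobEvent struct {
	JobID  int64
	Status string // "queued", "in_progress", "completed"
}

// tracker keeps state in process-local memory, so each replica
// only sees the events the load balancer happens to route to it.
type tracker struct {
	trackedJobs map[int64]jobEvent
}

// observe records the event and returns the previously tracked event
// for the job, or ok=false if no earlier event was seen by this
// process, in which case the caller discards the event.
func (t *tracker) observe(e jobEvent) (prev jobEvent, ok bool) {
	prev, ok = t.trackedJobs[e.JobID]
	t.trackedJobs[e.JobID] = e
	return prev, ok
}

func main() {
	replicaA := &tracker{trackedJobs: map[int64]jobEvent{}}
	replicaB := &tracker{trackedJobs: map[int64]jobEvent{}}

	// "queued" lands on replica A...
	replicaA.observe(jobEvent{JobID: 42, Status: "queued"})

	// ...but "in_progress" is routed to replica B, which has no state
	// for job 42, so the event would be dropped there.
	_, ok := replicaB.observe(jobEvent{JobID: 42, Status: "in_progress"})
	fmt.Println(ok) // false
}
```

With a single replica every event for a job reaches the same map, so the lookup always succeeds after the first event; with two or more replicas the lookup fails whenever consecutive events are routed differently.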
@ColinHeathman We could probably introduce an optional dependency on an external distributed kv store to track jobs later. Or GitHub could enhance their API and/or the workflow job event payload so that we can use information included in the event to measure the queue duration. In the meantime, though, we should omit this, as it isn't going to work universally.
I've started a discussion here to enhance the API. I believe that is the cleanest path forwards.
@mumoshu Is there another pending reason to add a kv store? I thought about it but I didn't think these metrics would justify the addition.
Alternatively, we could change github_workflow_job_status to a counter, rather than a gauge. That way queue counts could be calculated by Prometheus, with github_workflow_job_status{job_status="queued"}[1m] - github_workflow_job_status{job_status="in_progress"}[1m].
I can't think of a way to make github_workflow_job_queue work without a kv store though.
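The counter-based alternative avoids shared state entirely: each replica just increments a per-status counter, and the queue depth is approximated at query time by subtracting in_progress increments from queued increments. A minimal stdlib sketch of the idea (the real implementation would use a prometheus/client_golang CounterVec; this hypothetical statusCounter only mirrors its semantics):

```go
package main

import (
	"fmt"
	"sync"
)

// statusCounter mimics a Prometheus CounterVec keyed by job_status.
// Counters only ever increase, so values from multiple replicas can
// be summed safely at query time without any coordination.
type statusCounter struct {
	mu     sync.Mutex
	counts map[string]float64
}

func newStatusCounter() *statusCounter {
	return &statusCounter{counts: map[string]float64{}}
}

// inc increments the counter for one job_status label value.
func (c *statusCounter) inc(status string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[status]++
}

// get returns the current count for one job_status label value.
func (c *statusCounter) get(status string) float64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[status]
}

func main() {
	c := newStatusCounter()

	// Three jobs queued, two of them picked up by runners.
	c.inc("queued")
	c.inc("queued")
	c.inc("queued")
	c.inc("in_progress")
	c.inc("in_progress")

	// Analogous to subtracting the two series in a PromQL query:
	// an approximation of jobs still waiting in the queue.
	fmt.Println(c.get("queued") - c.get("in_progress")) // 1
}
```

Because the subtraction happens at query time, missed or replica-split events only skew the approximation rather than breaking the metric, which is the trade-off being discussed here.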
Thanks for starting the discussion! Let's see how it goes.
> Is there another pending reason to add a kv store?

No, this is the first time I thought we might need to add it.

> I thought about it but I didn't think these metrics would justify the addition.

Same. If this isn't a short-term hard requirement, I'd like not to add the kv store as an ARC dependency. Might be too much.

> we could change github_workflow_job_status to a counter, rather than a gauge.

Ah, counts might be good as well! Can we name the metric like github_workflow_jobs_total(job_status="...") though? If it's a count, I think that better follows the canonical metric naming convention of Prometheus: https://prometheus.io/docs/practices/naming/#metric-names
(force-pushed from c04eaba to 94d6979)
@mumoshu Please take a look at this new implementation. This version is stateless. I was able to find the information I need for the metrics from the logs for the jobs. This uses the GH client to download those logs and parse them, so there is a bit of a downside there, but I think having this be stateless is much better.
@ColinHeathman Hey! Thanks a lot for your effort. Sorry to say, but this approach isn't going to scale, and that's actually why we moved from the pull-based autoscaling to webhook-based autoscaling.
@ColinHeathman IMHO, making it stateful alone is fine, as long as the state is allowed to get lost and local. That said, I thought we'd be happy with your alternative mentioned in #1814 (comment), which is to make those counters so that you can still count events per status, while you can also aggregate counts at query time using e.g. Prometheus, which makes it possible to approximately count events per status without introducing a global state to ARC. WDYT?
That said, in a dedicated controller/command/application you might be able to poll those APIs less frequently than once per workflow job event, which prevents rate-limit issues while providing the metrics you want. Note though, providing workflow job event status count metrics per #1814 (comment) is orthogonal to the metrics derived from GitHub Actions. The former is for monitoring ARC and the latter is for monitoring GitHub Actions. Both look useful to me. Just that I think we shouldn't mix the two.
@mumoshu I did make the change to counters here (the gauges are gone), so that's well and good. Runtimes can be measured; I currently don't see a way to get queue times without either comparing events or making an API call for the logs. I think that maybe needs to be a separate issue or PR, since it's going to be more difficult.
What do you think of the metric names? I want to be sure that they are correct.
githubWorkflowJobConclusionsTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "github_workflow_job_conclusions_total",
		Help: "Conclusions for tracked workflow jobs",
Suggested change:
-		Help: "Conclusions for tracked workflow jobs",
+		Help: "Total count of conclusions for tracked workflow jobs",
Making it consistent with the other metric names would probably be nicer.
@ColinHeathman Thanks for your confirmation. The metric names look great to me!
I'm fine with that too. We'd better add a comment in the code to remind us that the run_duration metric count is not going to be the same as workflow_job_completions_total when you have two or more ARC webhook-server pods.
I agree with you! If it's going to be something like what I've explained in #1814 (comment), that looks good to me.
@ColinHeathman Hey! Is this ready for review? If it's working for your cluster, that's all right. I'll try resolving merge conflicts, test it on my cluster, and add necessary changes if you're busy.
Hey @mumoshu, this does work in my cluster. We at BenchSci are currently using this as it is, since we have GH Enterprise and don't need to worry so much about the API rate limit. I was planning on moving this to the dedicated controller/command/application as you suggested in #1814 (comment), but I haven't had the time to do the rework. I don't think I'll be able to get to it until next month. Feel free to make changes :) I can try to help any way I can.
@ColinHeathman Hey! I finally had some time to work on it, and #2057 is what I intended. Would you mind taking a look when you have some time?
@ColinHeathman Hey! I'm closing this in favor of #2057. Thank you again for doing 99% of the work in this pull request! I've featured you in our release note to make your awesome work more visible ❤️
#2057 will be included in the next release of ARC, 0.27.0. Please feel free to give it a shot and report back any issues and ideas if any 🙏
Relates to issue #1551
This PR adds these metrics: