-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Metric tekton_pipelines_controller_pipelinerun_count #4468
Conversation
The following is the coverage report on the affected files.
|
e4ad6bf
to
45efbea
Compare
The following is the coverage report on the affected files.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly LGTM - would it be possible to write a test that captures the buggy behavior we were seeing before?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also add documentation somewhere that this metric will only increment on completion.
45efbea
to
e39a776
Compare
The following is the coverage report on the affected files.
|
e39a776
to
ccfecc1
Compare
The following is the coverage report on the affected files.
|
ccfecc1
to
d557199
Compare
The following is the coverage report on the affected files.
|
d557199
to
af4bc8c
Compare
The following is the coverage report on the affected files.
|
af4bc8c
to
2f0e507
Compare
The following is the coverage report on the affected files.
|
Hi @khrm, After digging into it, it seems like the equality check of I managed to move the logic of calling Finally, both |
f22f6ca
to
bcca2ed
Compare
The following is the coverage report on the affected files.
|
The following is the coverage report on the affected files.
|
@chuangw6 Good catch but I think that doesn't solve the issue of the recount. I move the I caught one more issue in my and your code due to which I had to use pipelineRunLister to solve the recount. Sometimes, two simultaneous reconciler calls were coming in which before and after conditions were different. Try the updated pr, it should resolve the issue. |
// We get latest pipelinerun cr already to avoid recount | ||
newPr, err := c.pipelineRunLister.PipelineRuns(pr.Namespace).Get(pr.Name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not super familiar with how/when the lister updates, but wouldn't pulling in the latest status mean that the current pr status and the "before" status always match, since the status is updated during reconcile
?
It's would be useful to add a test to make sure we're handling the reconciler state changes properly (i.e. making sure we're avoiding instances where before always equals after) - right now we're really only testing the DurationAndCount func directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reconcile
changes status but that isn't applied to the pipelienrun
object in Informer
cache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding aligns with @khrm, the status changes but the change is not visible through the lister.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my own knowledge - When would the lister value update? Is it because we're updating only the status that this is safe to do?
I'm a bit confused with the current impl/comments since we're getting the "latest" pipelinerun, but this is really the pipelinerun before the latest status update?
I played around with this change and this appears to be correct (I even tried adding some sleeps to try and trigger a race condition), but it feels like we're relying on non-obvious behavior of the lister. I'd have a lot more peace of mind if we could add a test to ensure that it's safe rely on this lister behavior and be able to catch if this assumption is no longer true (i.e. if we make an unrelated change that would cause the lister to start returning the true latest value).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, sounds good. @khrm please add a unit test including an update during the reconcile
in addition to what we have otherwise looks good to me 👍 Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I share @wlynch concern - this feels like introducing a potential race condition.
Wouldn't it be safer to extract the beforeCondition at the beginning of the reconcile so we can use it later?
That's what we do for events already - see https://github.com/tektoncd/pipeline/blob/7855cb720c3fc59e585c2e230f71e488fc9038be/pkg/reconciler/pipelinerun/pipelinerun.go#L148:L148.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check this comment.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: vdemeester The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Added condition check to avoid recount issue.
bcca2ed
to
f004865
Compare
The following is the coverage report on the affected files.
|
Hey @chuangw6 please provide an update if possible, whether these changes work for you. |
Hey @khrm @pritidesai, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this!
I would prefer if we didn't rely on the informer cache not being updated.
It should be a small change to rely on the condition read at the beginning of the reconcile, existing tests will still apply.
If you think you can iterate on this today, I'd be happy to wait for v0.33.0 and have this included in the next release.
@@ -319,6 +331,7 @@ func (c *Reconciler) resolvePipelineState( | |||
} | |||
|
|||
func (c *Reconciler) reconcile(ctx context.Context, pr *v1beta1.PipelineRun, getPipelineFunc resources.GetPipeline) error { | |||
defer c.durationAndCountMetrics(ctx, pr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about creating a copy of the PR condition in a beforeCondition
variable here, and pass that to durationAndCountMetrics
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@afrittoli When I was running a test, somehow the same event came twice even when the informer cache was updated. So I decided to use informer cache. Any idea why this happens? @wlynch
Is it due to the controller resync?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@afrittoli @wlynch Please check this comment:
// We get latest pipelinerun cr already to avoid recount | ||
newPr, err := c.pipelineRunLister.PipelineRuns(pr.Namespace).Get(pr.Name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I share @wlynch concern - this feels like introducing a potential race condition.
Wouldn't it be safer to extract the beforeCondition at the beginning of the reconcile so we can use it later?
That's what we do for events already - see https://github.com/tektoncd/pipeline/blob/7855cb720c3fc59e585c2e230f71e488fc9038be/pkg/reconciler/pipelinerun/pipelinerun.go#L148:L148.
As I wrote there, sometimes we get pipelineruns when status with old status even-though if we use Lister we get correct status. That's why we need to use Lister to avoid a recount. Is it due to controller resync? If so, why isn't it taking newer informer cache value?
https://tektoncd.slack.com/archives/CLCCEBUMU/p1650873192721789 |
Response from Matt Moor (Copied from Slack)
OK. So that event which was in queue (when previous event was processing), would have older condition than the one in Lister's cache. So I think it makes sense to use Lister cache here. |
Based on the latest discussion, it is safe to rely on the lister cache to receive the updates if there are any to the pipeline run status. /lgtm |
Added condition check to avoid recount issue.
Fixes: #4397
Changes
Submitter Checklist
As the author of this PR, please check off the items in this checklist:
functionality, content, code)
Release Notes