
Fix Metric tekton_pipelines_controller_pipelinerun_count #4468

Merged
1 commit merged into tektoncd:main on May 25, 2022

Conversation

khrm
Contributor

@khrm khrm commented Jan 12, 2022

Added a condition check to avoid the recount issue.
Fixes: #4397

Changes

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • Docs included if any changes are user facing
  • Tests included if any functionality added or changed
  • Follows the commit message standard
  • Meets the Tekton contributor standards (including functionality, content, code)
  • Release notes block below has been filled in or deleted (only if no user facing changes)

Release Notes

Fix `tekton_pipelines_controller_pipelinerun_count`, which was increasing without any new PipelineRuns being created.
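
For context, a minimal sketch of the condition check this PR adds (in Go; durationAndCount and count are illustrative names, not the exact patch — the real change lives in pkg/pipelinerunmetrics/metrics.go): only bump the counter when the Succeeded condition transitions into a terminal state during the current reconcile.

package sketch

import (
	"knative.dev/pkg/apis"
)

// durationAndCount sketches the guard: count only when the Succeeded
// condition became terminal during this reconcile, so re-reconciling an
// already-finished run is a no-op.
func durationAndCount(before, after *apis.Condition, count func(status string)) {
	if !after.IsTrue() && !after.IsFalse() {
		return // run not finished yet (condition missing or Unknown)
	}
	if before != nil && before.Status == after.Status {
		return // already finished before this reconcile: would be a recount
	}
	if after.IsTrue() {
		count("success")
	} else {
		count("failed")
	}
}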

@tekton-robot tekton-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Jan 12, 2022
@tekton-robot tekton-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jan 12, 2022
@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File                               Old Coverage  New Coverage  Delta
pkg/pipelinerunmetrics/metrics.go  82.0%         81.5%         -0.5

@wlynch wlynch (Member) left a comment

Mostly LGTM - would it be possible to write a test that captures the buggy behavior we were seeing before?

@wlynch wlynch (Member) left a comment

We should also add documentation somewhere that this metric will only increment on completion.

@tekton-robot tekton-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 17, 2022
@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File                               Old Coverage  New Coverage  Delta
pkg/pipelinerunmetrics/metrics.go  82.0%         82.6%         0.6

@tekton-robot tekton-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 28, 2022

@chuangw6
Member

chuangw6 commented Jan 28, 2022

Hi @khrm,
Thanks for creating the PR!
I pulled it to my local environment and ran some tests. I found that after applying the change, the two tekton_pipelines_controller_pipelinerun_count metrics ({status="success"} and {status="failed"}) are no longer reported at all.

After digging into it, it seems the equality check between the before and after conditions is always true, so the DurationAndCount function always returns nil without recording anything. The main reason appears to be that the ReconcileKind function's pr.IsDone block (the place where DurationAndCount is called) only reconciles already-finished objects and does some sanity checks, which explains why the before and after conditions remain equal when we call DurationAndCount. Instead, I believe we want to call DurationAndCount right after the PipelineRun's status is set to done or failed. That will not only report the metrics but also avoid the recount issue.

I managed to move the logic of calling DurationAndCount to the place where the PipelineRun has just completed (its status is changed to "succeeded"). Reporting failed PipelineRun metrics may be trickier to handle, since a PipelineRun can fail for various reasons; fortunately, all the errors returned after pr.Status.MarkFailed are propagated back to the ReconcileKind function (place1 & place2).

With that, both tekton_pipelines_controller_pipelinerun_count{status="success"} and tekton_pipelines_controller_pipelinerun_count{status="failed"} are now reported properly. The same logic applies to the TaskRun recount issue.
Here are two commits on my working dev branch. Please let me know if you have any questions; happy to clean up the existing PR or open another one. Thanks! And thanks @wlynch for providing help and sharing thoughts along the way.
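
For illustration, a hedged sketch of the placement described above — counting at the moment the terminal status is set (pr.Status.MarkFailed is the real call mentioned in the comment; failAndCount and recordCount are hypothetical names, not the actual Tekton helpers):

package sketch

import (
	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
)

// recordCount stands in for the metrics recorder call; hypothetical.
func recordCount(pr *v1beta1.PipelineRun) {}

// failAndCount marks the PipelineRun failed and records the count at the
// moment of the transition, so a later reconcile of the already-finished
// object has nothing left to count.
func failAndCount(pr *v1beta1.PipelineRun, reason, message string) {
	pr.Status.MarkFailed(reason, message)
	recordCount(pr)
}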

@khrm khrm force-pushed the bug/NbrOfPrunMetrics branch 2 times, most recently from f22f6ca to bcca2ed on January 28, 2022 23:19
@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File                                       Old Coverage  New Coverage  Delta
pkg/pipelinerunmetrics/metrics.go          82.0%         82.6%         0.6
pkg/reconciler/pipelinerun/pipelinerun.go  83.6%         83.6%         -0.0

@khrm
Contributor Author

khrm commented Jan 28, 2022

@chuangw6 Good catch, but I don't think that solves the recount issue. I moved the DurationAndCount call to after the call to c.reconcile, which fixed it.

I also caught one more issue in both my code and yours, because of which I had to use pipelineRunLister to solve the recount: sometimes two simultaneous reconciler calls came in, in each of which the before and after conditions differed.

Try the updated PR; it should resolve the issue.
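
A rough sketch of the lister-based idea (assuming the generated PipelineRun lister; beforeConditionFromLister is an illustrative helper, not the exact patch): read the run as the informer cache still sees it and use its Succeeded condition as the "before" state.

package sketch

import (
	"fmt"

	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
	listers "github.com/tektoncd/pipeline/pkg/client/listers/pipeline/v1beta1"
	"knative.dev/pkg/apis"
)

// beforeConditionFromLister reads the PipelineRun as the informer cache
// still sees it - i.e. before this reconcile's status update is applied -
// and returns its Succeeded condition for comparison with the new one.
func beforeConditionFromLister(l listers.PipelineRunLister, pr *v1beta1.PipelineRun) (*apis.Condition, error) {
	cached, err := l.PipelineRuns(pr.Namespace).Get(pr.Name)
	if err != nil {
		return nil, fmt.Errorf("getting %s/%s from lister: %w", pr.Namespace, pr.Name, err)
	}
	return cached.Status.GetCondition(apis.ConditionSucceeded), nil
}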

Comment on lines +227 to +243
// We get latest pipelinerun cr already to avoid recount
newPr, err := c.pipelineRunLister.PipelineRuns(pr.Namespace).Get(pr.Name)
Member

I'm not super familiar with how/when the lister updates, but wouldn't pulling in the latest status mean that the current pr status and the "before" status always match, since the status is updated during reconcile?

It would be useful to add a test to make sure we're handling the reconciler state changes properly (i.e. making sure we're avoiding instances where before always equals after) - right now we're really only testing the DurationAndCount func directly.

Contributor Author

reconcile changes the status, but that change isn't applied to the PipelineRun object in the informer cache.

Member

My understanding aligns with @khrm's: the status changes, but the change is not visible through the lister.

Member

For my own knowledge - when would the lister value update? Is it because we're updating only the status that this is safe to do?

I'm a bit confused by the current impl/comments, since we're getting the "latest" PipelineRun, but this is really the PipelineRun from before the latest status update?

I played around with this change and it appears to be correct (I even tried adding some sleeps to try to trigger a race condition), but it feels like we're relying on non-obvious behavior of the lister. I'd have a lot more peace of mind if we could add a test to ensure that it's safe to rely on this lister behavior, and to catch the case where this assumption is no longer true (i.e. if we make an unrelated change that causes the lister to start returning the true latest value).

Member

Sure, sounds good. @khrm, please add a unit test that includes an update during the reconcile, in addition to what we have. Otherwise this looks good to me 👍 Thanks!

Member

I share @wlynch's concern - this feels like it introduces a potential race condition.
Wouldn't it be safer to extract the beforeCondition at the beginning of the reconcile so we can use it later?
That's what we do for events already - see https://github.com/tektoncd/pipeline/blob/7855cb720c3fc59e585c2e230f71e488fc9038be/pkg/reconciler/pipelinerun/pipelinerun.go#L148:L148.

Contributor Author

Please check this comment.

#4468 (comment)

@tekton-robot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 3, 2022
Added condition check to avoid recount issue.
@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File                                       Old Coverage  New Coverage  Delta
pkg/pipelinerunmetrics/metrics.go          82.0%         82.6%         0.6
pkg/reconciler/pipelinerun/pipelinerun.go  83.9%         83.9%         0.0

@pritidesai
Member

@chuangw6 Good catch, but I don't think that solves the recount issue. I moved the DurationAndCount call to after the call to c.reconcile, which fixed it.

I also caught one more issue in both my code and yours, because of which I had to use pipelineRunLister to solve the recount: sometimes two simultaneous reconciler calls came in, in each of which the before and after conditions differed.

Try the updated PR; it should resolve the issue.

Hey @chuangw6, please provide an update if possible on whether these changes work for you.

@chuangw6
Member

chuangw6 commented Feb 11, 2022

Hey @khrm @pritidesai,
Thanks for the updates. As far as I can tell from testing locally, the count metrics are reported correctly with the updated PR.

@afrittoli afrittoli (Member) left a comment

Thanks for this!

I would prefer if we didn't rely on the informer cache not being updated.
It should be a small change to rely on the condition read at the beginning of the reconcile, and the existing tests will still apply.

If you think you can iterate on this today, I'd be happy to wait for v0.33.0 and have this included in the next release.

@@ -319,6 +331,7 @@ func (c *Reconciler) resolvePipelineState(
}

func (c *Reconciler) reconcile(ctx context.Context, pr *v1beta1.PipelineRun, getPipelineFunc resources.GetPipeline) error {
defer c.durationAndCountMetrics(ctx, pr)
Member

What about creating a copy of the PR condition in a beforeCondition variable here, and pass that to durationAndCountMetrics?
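
For reference, a hedged sketch of that alternative — the extra beforeCondition parameter and all names here are illustrative, not the actual Tekton method signatures:

package sketch

import (
	"context"

	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
	"knative.dev/pkg/apis"
)

type Reconciler struct{}

// durationAndCountMetrics is a stand-in; only the extra beforeCondition
// parameter matters for this sketch.
func (c *Reconciler) durationAndCountMetrics(ctx context.Context, pr *v1beta1.PipelineRun, before *apis.Condition) {
}

func (c *Reconciler) reconcile(ctx context.Context, pr *v1beta1.PipelineRun) error {
	// Copy the Succeeded condition before any status mutation; the
	// generated DeepCopy tolerates a nil condition (run not started).
	before := pr.Status.GetCondition(apis.ConditionSucceeded).DeepCopy()
	// Deferred call arguments are evaluated here, so "before" stays
	// pinned even though the work below may rewrite pr.Status.
	defer c.durationAndCountMetrics(ctx, pr, before)
	// ... actual reconcile work that may mark the run succeeded/failed ...
	return nil
}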

@khrm khrm (Contributor Author) commented Feb 23, 2022

@afrittoli When I was running a test, somehow the same event came in twice even when the informer cache was updated, so I decided to use the informer cache. Any idea why this happens? @wlynch

Is it due to the controller resync?

Contributor Author

@afrittoli @wlynch Please check this comment:

#4468 (comment)

@khrm
Contributor Author

khrm commented Apr 25, 2022

As I wrote there, sometimes we get PipelineRuns with an old status, even though the lister gives us the correct status. That's why we need to use the lister to avoid a recount. Is it due to controller resync? If so, why isn't it picking up the newer informer cache value?
I think it's OK to merge the PR, and I can see that the resulting count is correct.
This log is from the khrm#2 changes:

WARN  21:14:33 beforeCond :
 name test-finally-gct8w
 BeforeFromLister &{Type:Succeeded Status:Unknown Severity: LastTransitionTime:{Inner:2022-04-24 21:12:43 +0000 UTC} Reason:Running Message:Tasks Completed: 1 (Failed: 1, Cancelled 0), Incomplete: 2, Skipped: 0},
 BeforeAvailable &{Type:Succeeded Status:Unknown Severity: LastTransitionTime:{Inner:2022-04-24 21:12:43 +0000 UTC} Reason:Running Message:Tasks Completed: 1 (Failed: 1, Cancelled 0), Incomplete: 2, Skipped: 0},
 After &{Type:Succeeded Status:False Severity: LastTransitionTime:{Inner:2022-04-24 21:14:33.002968622 +0000 UTC m=+223.977069708} Reason:PipelineRunTimeout Message:PipelineRun "test-finally-gct8w" failed to finish within "2m0s"}
WARN  21:14:33 beforeCond :
 name test-finally-gct8w
 BeforeFromLister &{Type:Succeeded Status:False Severity: LastTransitionTime:{Inner:2022-04-24 21:14:33 +0000 UTC} Reason:PipelineRunTimeout Message:PipelineRun "test-finally-gct8w" failed to finish within "2m0s"},
 BeforeAvailable &{Type:Succeeded Status:Unknown Severity: LastTransitionTime:{Inner:2022-04-24 21:12:43 +0000 UTC} Reason:Running Message:Tasks Completed: 1 (Failed: 1, Cancelled 0), Incomplete: 2, Skipped: 0},
 After &{Type:Succeeded Status:False Severity: LastTransitionTime:{Inner:2022-04-24 21:14:33.019457645 +0000 UTC m=+223.993558729} Reason:PipelineRunTimeout Message:PipelineRun "test-finally-gct8w" failed to finish within "2m0s"}

https://tektoncd.slack.com/archives/CLCCEBUMU/p1650873192721789

@khrm
Contributor Author

khrm commented May 12, 2022

Response from Matt Moor (copied from Slack):

however, when it comes in the event should trigger a reconciliation with the new state

mattmoor:
The guarantee you get on lister cache consistency is that it is always at least as up to date as the original event that put the work into the workqueue, but:
  • multiple events could be coalesced before a key is processed, and
  • an event may not be received at the time of reconciliation (if an event comes in DURING key processing, it is queued for reprocessing regardless of how the active processing completes)

mattmoor:
If you are familiar with the concept of bounded staleness, the Lister cache's bound on staleness is the most recent received event prior to reconciliation.

mattmoor:
But it can be stale and reconcilers should tolerate that

OK. So an event that was already in the queue (while the previous event was being processed) would carry an older condition than the one in the lister's cache. So I think it makes sense to use the lister cache here.
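
To illustrate why the lister-based before condition tolerates a duplicate or coalesced event, a minimal sketch (isDone and shouldCount are illustrative names):

package sketch

import (
	corev1 "k8s.io/api/core/v1"
	"knative.dev/pkg/apis"
)

// isDone reports whether a Succeeded condition is terminal (True or False).
func isDone(c *apis.Condition) bool {
	return c != nil && c.Status != corev1.ConditionUnknown
}

// shouldCount is true only on the reconcile that performs the transition
// to done. On the queued duplicate event, the lister has already observed
// the terminal condition, so beforeFromLister is done and we skip.
func shouldCount(beforeFromLister, after *apis.Condition) bool {
	return isDone(after) && !isDone(beforeFromLister)
}

In the log above, the first reconcile sees BeforeFromLister with Status:Unknown and After with Status:False, so it counts; the queued duplicate sees BeforeFromLister already at Status:False and is skipped.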

@pritidesai
Member

Based on the latest discussion, it is safe to rely on the lister cache to receive updates, if any, to the PipelineRun status.

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label May 25, 2022
@tekton-robot tekton-robot merged commit 71fad61 into tektoncd:main May 25, 2022
@khrm khrm deleted the bug/NbrOfPrunMetrics branch February 28, 2023 04:38
Successfully merging this pull request may close these issues.

tekton_pipelines_controller_pipelinerun_count metric counter increases without having any pipeline executed