Fix Metric tekton_pipelines_controller_pipelinerun_count #4468

khrm · 2022-01-12T11:29:38Z

Added condition check to avoid recount issue.
Fixes: #4397

Changes

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

Docs included if any changes are user facing
Tests included if any functionality added or changed
Follows the commit message standard
Meets the Tekton contributor standards (including
functionality, content, code)
Release notes block below has been filled in or deleted (only if no user facing changes)

Release Notes

Fix  `tekton_pipelines_controller_pipelinerun_count` which was increasing without any new addition of pipelinerun.

tekton-robot · 2022-01-12T11:32:02Z

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/pipelinerunmetrics/metrics.go	82.0%	81.5%	-0.5

tekton-robot · 2022-01-12T11:57:11Z

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/pipelinerunmetrics/metrics.go	82.0%	81.5%	-0.5

wlynch

Mostly LGTM - would it be possible to write a test that captures the buggy behavior we were seeing before?

wlynch

We should also add documentation somewhere that this metric will only increment on completion.

tekton-robot · 2022-01-17T11:45:29Z

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/pipelinerunmetrics/metrics.go	82.0%	82.6%	0.6

tekton-robot · 2022-01-17T11:50:54Z

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/pipelinerunmetrics/metrics.go	82.0%	82.6%	0.6

tekton-robot · 2022-01-17T12:08:08Z

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/pipelinerunmetrics/metrics.go	82.0%	82.6%	0.6

tekton-robot · 2022-01-28T12:25:50Z

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/pipelinerunmetrics/metrics.go	82.0%	82.6%	0.6

tekton-robot · 2022-01-28T12:29:32Z

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/pipelinerunmetrics/metrics.go	82.0%	82.6%	0.6

chuangw6 · 2022-01-28T19:38:43Z

Hi @khrm,
Thanks for creating the pr!
I pulled it to my local and did some tests. I found out that after applying the change, the two metrics of tekton_pipelines_controller_pipelinerun_count ({status="success"} and {status="failed"}) are not reported anymore.

After digging into it, it seems like the equality check of before and after conditions is always true, and the consequence of that is the DurationAndCount function always returns nil. I guess the main reason is that, the ReconcileKind function's pr.IsDone block (this is the place where DurationAndCount is called) is just reconciling finished objects and just doing some sanity checks. This explains why before and after conditions remains equal when we call DurationAndCount. Instead, I believe we might want to do DurationAndCount right after the status of the pipelinerun is set DONE or FAILED. This will not only report the metrics but also avoid the recount issue.

I managed to move the logic of calling DurationAndCount to the place where the pipelinerun is just completed (status is changed to "succeeded"). Reporting failed pipelinerun metrics might be trickier to handle as we have to deal with the fact that a pipelinerun can fail for various reasons. Fortunately, the good thing is that all the errors returned after pr.Status.MarkFailed will be propagated back to ReconcileKind function (place1 & place2).

Finally, both tekton_pipelines_controller_pipelinerun_count{status:"success"} and tekton_pipelines_controller_pipelinerun_count{status:"failed"} are now reported properly. Same logic applies to the taskrun recount issue.
Here are two commits on my working dev branch. Please let me know if you have any questions. Happy to clean up the existing one or make another PR. Thanks! And Thanks @wlynch for providing help and sharing thoughts along the way.

tekton-robot · 2022-01-28T23:20:24Z

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/pipelinerunmetrics/metrics.go	82.0%	82.6%	0.6
pkg/reconciler/pipelinerun/pipelinerun.go	83.6%	83.6%	-0.0

tekton-robot · 2022-01-28T23:22:08Z

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/pipelinerunmetrics/metrics.go	82.0%	82.6%	0.6
pkg/reconciler/pipelinerun/pipelinerun.go	83.6%	83.6%	-0.0

khrm · 2022-01-28T23:22:40Z

@chuangw6 Good catch but I think that doesn't solve the issue of the recount. I move the DurationAndCount after the call of c.reconcile which fixed this.

I caught one more issue in my and your code due to which I had to use pipelineRunLister to solve the recount. Sometimes, two simultaneous reconciler calls were coming in which before and after conditions were different.

Try the updated pr, it should resolve the issue.

wlynch · 2022-01-31T22:40:53Z

pkg/reconciler/pipelinerun/pipelinerun.go

+		// We get latest pipelinerun cr already to avoid recount
+		newPr, err := c.pipelineRunLister.PipelineRuns(pr.Namespace).Get(pr.Name)


I'm not super familiar with how/when the lister updates, but wouldn't pulling in the latest status mean that the current pr status and the "before" status always match, since the status is updated during reconcile?

It's would be useful to add a test to make sure we're handling the reconciler state changes properly (i.e. making sure we're avoiding instances where before always equals after) - right now we're really only testing the DurationAndCount func directly.

reconcile changes status but that isn't applied to the pipelienrun object in Informer cache.

My understanding aligns with @khrm, the status changes but the change is not visible through the lister.

For my own knowledge - When would the lister value update? Is it because we're updating only the status that this is safe to do?

I'm a bit confused with the current impl/comments since we're getting the "latest" pipelinerun, but this is really the pipelinerun before the latest status update?

I played around with this change and this appears to be correct (I even tried adding some sleeps to try and trigger a race condition), but it feels like we're relying on non-obvious behavior of the lister. I'd have a lot more peace of mind if we could add a test to ensure that it's safe rely on this lister behavior and be able to catch if this assumption is no longer true (i.e. if we make an unrelated change that would cause the lister to start returning the true latest value).

sure, sounds good. @khrm please add a unit test including an update during the reconcile in addition to what we have otherwise looks good to me 👍 Thanks!

I share @wlynch concern - this feels like introducing a potential race condition.
Wouldn't it be safer to extract the beforeCondition at the beginning of the reconcile so we can use it later?
That's what we do for events already - see https://github.com/tektoncd/pipeline/blob/7855cb720c3fc59e585c2e230f71e488fc9038be/pkg/reconciler/pipelinerun/pipelinerun.go#L148:L148.

Please check this comment.

#4468 (comment)

tekton-robot · 2022-02-03T15:08:58Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [vdemeester]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Added condition check to avoid recount issue.

tekton-robot · 2022-02-03T18:47:23Z

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/pipelinerunmetrics/metrics.go	82.0%	82.6%	0.6
pkg/reconciler/pipelinerun/pipelinerun.go	83.9%	83.9%	0.0

pritidesai · 2022-02-11T18:11:22Z

@chuangw6 Good catch but I think that doesn't solve the issue of the recount. I move the DurationAndCount after the call of c.reconcile which fixed this.

I caught one more issue in my and your code due to which I had to use pipelineRunLister to solve the recount. Sometimes, two simultaneous reconciler calls were coming in which before and after conditions were different.

Try the updated pr, it should resolve the issue.

Hey @chuangw6 please provide an update if possible, whether these changes work for you.

chuangw6 · 2022-02-11T18:55:54Z

Hey @khrm @pritidesai,
Thanks for the updates. As far as I can see when testing on my local, the count matrices are reported correctly with the updated PR.

afrittoli

Thanks for this!

I would prefer if we didn't rely on the informer cache not being updated.
It should be a small change to rely on the condition read at the beginning of the reconcile, existing tests will still apply.

If you think you can iterate on this today, I'd be happy to wait for v0.33.0 and have this included in the next release.

afrittoli · 2022-02-16T11:38:04Z

pkg/reconciler/pipelinerun/pipelinerun.go

@@ -319,6 +331,7 @@ func (c *Reconciler) resolvePipelineState(
 }

 func (c *Reconciler) reconcile(ctx context.Context, pr *v1beta1.PipelineRun, getPipelineFunc resources.GetPipeline) error {
+	defer c.durationAndCountMetrics(ctx, pr)


What about creating a copy of the PR condition in a beforeCondition variable here, and pass that to durationAndCountMetrics?

@afrittoli When I was running a test, somehow the same event came twice even when the informer cache was updated. So I decided to use informer cache. Any idea why this happens? @wlynch

Is it due to the controller resync?

@afrittoli @wlynch Please check this comment:

#4468 (comment)

afrittoli · 2022-02-16T11:42:12Z

pkg/reconciler/pipelinerun/pipelinerun.go

+		// We get latest pipelinerun cr already to avoid recount
+		newPr, err := c.pipelineRunLister.PipelineRuns(pr.Namespace).Get(pr.Name)


I share @wlynch concern - this feels like introducing a potential race condition.
Wouldn't it be safer to extract the beforeCondition at the beginning of the reconcile so we can use it later?
That's what we do for events already - see https://github.com/tektoncd/pipeline/blob/7855cb720c3fc59e585c2e230f71e488fc9038be/pkg/reconciler/pipelinerun/pipelinerun.go#L148:L148.

khrm · 2022-04-25T09:11:34Z

As I wrote there, sometimes we get pipelineruns when status with old status even-though if we use Lister we get correct status. That's why we need to use Lister to avoid a recount. Is it due to controller resync? If so, why isn't it taking newer informer cache value?
I think it's OK to merge pr and I can see that count that comes out is correct.
This log is from khrm#2 changes

WARN  21:14:33 beforeCond :
 name test-finally-gct8w
 BeforeFromLister &{Type:Succeeded Status:Unknown Severity: LastTransitionTime:{Inner:2022-04-24 21:12:43 +0000 UTC} Reason:Running Message:Tasks Completed: 1 (Failed: 1, Cancelled 0), Incomplete: 2, Skipped: 0},
 BeforeAvailable &{Type:Succeeded Status:Unknown Severity: LastTransitionTime:{Inner:2022-04-24 21:12:43 +0000 UTC} Reason:Running Message:Tasks Completed: 1 (Failed: 1, Cancelled 0), Incomplete: 2, Skipped: 0},
 After &{Type:Succeeded Status:False Severity: LastTransitionTime:{Inner:2022-04-24 21:14:33.002968622 +0000 UTC m=+223.977069708} Reason:PipelineRunTimeout Message:PipelineRun "test-finally-gct8w" failed to finish within "2m0s"}
WARN  21:14:33 beforeCond :
 name test-finally-gct8w
 BeforeFromLister &{Type:Succeeded Status:False Severity: LastTransitionTime:{Inner:2022-04-24 21:14:33
+0000 UTC} Reason:PipelineRunTimeout Message:PipelineRun "test-finally-gct8w" failed to finish within "2m0s"},
 BeforeAvailable &{Type:Succeeded Status:Unknown Severity: LastTransitionTime:{Inner:2022-04-24 21:12:43 +0000 UTC} Reason:Running Message:Tasks Completed: 1 (Failed: 1, Cancelled 0), Incomplete: 2, Skipped: 0},
 After &{Type:Succeeded Status:False Severity: LastTransitionTime:{Inner:2022-04-24 21:14:33.019457645 +0000 UTC m=+223.993558729} Reason:PipelineRunTimeout Message:PipelineRun "test-finally-gct8w" failed to finish within "2m0s"}

https://tektoncd.slack.com/archives/CLCCEBUMU/p1650873192721789

khrm · 2022-05-12T07:44:22Z

Response from Matt Moor (Copied from Slack)




however, when it comes in the event should trigger a reconciliation with the new state


mattmoor  6 days ago
The guarantee you get on lister cache consistency is that it is always at least as up to date as the original event that put the work into the workqueue, but:
multiple events could be coalesced before a key is processed, and
an event may not be received at the time of reconciliation (if an event comes in DURING key processing, it is queued for reprocessing regardless of how the active processing completes)


[mattmoor
If you are familiar with the concept of bounded staleness, the Lister cache’s bound on staleness is the most recent received event prior to reconciliation.


mattmoor  6 days ago
But it can be stale and reconcilers should tolerate that

OK. So that event which was in queue (when previous event was processing), would have older condition than the one in Lister's cache. So I think it makes sense to use Lister cache here.

pritidesai · 2022-05-25T04:10:09Z

Based on the latest discussion, it is safe to rely on the lister cache to receive the updates if there are any to the pipeline run status.

/lgtm

tekton-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Jan 12, 2022

tekton-robot requested review from afrittoli and dibyom January 12, 2022 11:29

tekton-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jan 12, 2022

khrm mentioned this pull request Jan 12, 2022

tekton_pipelines_controller_pipelinerun_count metric counter increases without having any pipeline executed #4397

Closed

khrm force-pushed the bug/NbrOfPrunMetrics branch from e4ad6bf to 45efbea Compare January 12, 2022 11:51

wlynch requested changes Jan 12, 2022

View reviewed changes

wlynch reviewed Jan 12, 2022

View reviewed changes

khrm force-pushed the bug/NbrOfPrunMetrics branch from 45efbea to e39a776 Compare January 17, 2022 11:43

tekton-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 17, 2022

khrm force-pushed the bug/NbrOfPrunMetrics branch from e39a776 to ccfecc1 Compare January 17, 2022 11:47

khrm force-pushed the bug/NbrOfPrunMetrics branch from ccfecc1 to d557199 Compare January 17, 2022 12:00

khrm force-pushed the bug/NbrOfPrunMetrics branch from d557199 to af4bc8c Compare January 28, 2022 12:23

tekton-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 28, 2022

khrm force-pushed the bug/NbrOfPrunMetrics branch from af4bc8c to 2f0e507 Compare January 28, 2022 12:27

khrm force-pushed the bug/NbrOfPrunMetrics branch 2 times, most recently from f22f6ca to bcca2ed Compare January 28, 2022 23:19

wlynch reviewed Jan 31, 2022

View reviewed changes

vdemeester approved these changes Feb 3, 2022

View reviewed changes

tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 3, 2022

Fix Metric tekton_pipelines_controller_pipelinerun_count

f004865

Added condition check to avoid recount issue.

khrm force-pushed the bug/NbrOfPrunMetrics branch from bcca2ed to f004865 Compare February 3, 2022 18:43

afrittoli reviewed Feb 16, 2022

View reviewed changes

This was referenced Feb 22, 2022

Inconsistent results hitting metrics endpoint for tekton_pipelines_controller_pipelinerun_duration_seconds_ metric #4141

Closed

Tekton Taskrun Metrics are not accurate #3739

Closed

tekton-robot assigned pritidesai May 25, 2022

tekton-robot added the lgtm Indicates that a PR is ready to be merged. label May 25, 2022

tekton-robot merged commit 71fad61 into tektoncd:main May 25, 2022

khrm deleted the bug/NbrOfPrunMetrics branch February 28, 2023 04:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix Metric tekton_pipelines_controller_pipelinerun_count #4468

Fix Metric tekton_pipelines_controller_pipelinerun_count #4468

khrm commented Jan 12, 2022 •

edited

Loading

tekton-robot commented Jan 12, 2022

tekton-robot commented Jan 12, 2022

wlynch left a comment

wlynch left a comment

tekton-robot commented Jan 17, 2022

tekton-robot commented Jan 17, 2022

tekton-robot commented Jan 17, 2022

tekton-robot commented Jan 28, 2022

tekton-robot commented Jan 28, 2022

chuangw6 commented Jan 28, 2022 •

edited

Loading

tekton-robot commented Jan 28, 2022

tekton-robot commented Jan 28, 2022

khrm commented Jan 28, 2022 •

edited

Loading

wlynch Jan 31, 2022

khrm Feb 1, 2022

pritidesai Feb 11, 2022

wlynch Feb 11, 2022

pritidesai Feb 11, 2022

afrittoli Feb 16, 2022

khrm May 11, 2022

tekton-robot commented Feb 3, 2022

tekton-robot commented Feb 3, 2022

pritidesai commented Feb 11, 2022

chuangw6 commented Feb 11, 2022 •

edited

Loading

afrittoli left a comment

afrittoli Feb 16, 2022

khrm Feb 23, 2022 •

edited

Loading

khrm May 11, 2022

afrittoli Feb 16, 2022

khrm commented Apr 25, 2022

khrm commented May 12, 2022 •

edited

Loading

pritidesai commented May 25, 2022

		// We get latest pipelinerun cr already to avoid recount
		newPr, err := c.pipelineRunLister.PipelineRuns(pr.Namespace).Get(pr.Name)

Fix Metric tekton_pipelines_controller_pipelinerun_count #4468

Fix Metric tekton_pipelines_controller_pipelinerun_count #4468

Conversation

khrm commented Jan 12, 2022 • edited Loading

Changes

Submitter Checklist

Release Notes

tekton-robot commented Jan 12, 2022

tekton-robot commented Jan 12, 2022

wlynch left a comment

Choose a reason for hiding this comment

wlynch left a comment

Choose a reason for hiding this comment

tekton-robot commented Jan 17, 2022

tekton-robot commented Jan 17, 2022

tekton-robot commented Jan 17, 2022

tekton-robot commented Jan 28, 2022

tekton-robot commented Jan 28, 2022

chuangw6 commented Jan 28, 2022 • edited Loading

tekton-robot commented Jan 28, 2022

tekton-robot commented Jan 28, 2022

khrm commented Jan 28, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tekton-robot commented Feb 3, 2022

tekton-robot commented Feb 3, 2022

pritidesai commented Feb 11, 2022

chuangw6 commented Feb 11, 2022 • edited Loading

afrittoli left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

khrm Feb 23, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

khrm commented Apr 25, 2022

khrm commented May 12, 2022 • edited Loading

pritidesai commented May 25, 2022

khrm commented Jan 12, 2022 •

edited

Loading

chuangw6 commented Jan 28, 2022 •

edited

Loading

khrm commented Jan 28, 2022 •

edited

Loading

chuangw6 commented Feb 11, 2022 •

edited

Loading

khrm Feb 23, 2022 •

edited

Loading

khrm commented May 12, 2022 •

edited

Loading