Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fixing nightly unit test failure #3160

Closed
wants to merge 1 commit into from

Conversation

pritidesai
Copy link
Member

@pritidesai pritidesai commented Sep 1, 2020

Changes

Unless service account is explicitly set in taskrun spec, by default, service account is set to "default".

Scott's error capture says:

* failed to create task run pod "test-taskrun-run-success": translating TaskSpec to Pod: serviceaccounts "default" not found. Maybe missing or invalid Task foo/test-task

test-taskrun-run-success taskrun has no service account set. Later in the config map, default-service-account is set to pipelines. But if service account is not specified in the taskrun spec, task run controller sets the service account to default which is clear from the error message where controller is looking for service account default instead of pipelines.

It can also mean that the config map settings/updates done in the unit test are not visible to taskrun controller which needs fix.

This error is intermittent and time plays role here since the other unit tests in taskrun_test.go are creating default service account (and never deleted). If this unit test TestReconcile_ExplicitDefaultSA is the first one to start, we are missing default service account and fails with this kind of error.

Fixes #2815

/cc @bobcatfish @sbwsg

/kind bug

Submitter Checklist

These are the criteria that every PR should meet, please check them off as you
review them:

  • Includes tests (if functionality changed/added)
  • Includes docs (if user facing)
  • Commit messages follow commit message best practices
  • Release notes block has been filled in or deleted (only if no user facing changes)

See the contribution guide for more details.

Double check this list of stuff that's easy to miss:

Reviewer Notes

If API changes are included, additive changes must be approved by at least two OWNERS and backwards incompatible changes must be approved by more than 50% of the OWNERS, and they must first be added in a backwards compatible way.

Release Notes

Fixed nightly build failure due to unit test failure

Unless service account is explicitly set in taskrun spec, by default,
service account is set to "default".
@tekton-robot tekton-robot requested review from bobcatfish and a user September 1, 2020 22:43
@tekton-robot tekton-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Sep 1, 2020
@pritidesai
Copy link
Member Author

pritidesai commented Sep 1, 2020

@afrittoli @adshmh we could also look into configMap settings to see why default-service-account is not getting set to pipelines instead its set to default.

Copy link
Collaborator

@bobcatfish bobcatfish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error is intermittent and time plays role here since the other unit tests in taskrun_test.go are creating default service account (and never deleted). If this unit test TestReconcile_ExplicitDefaultSA is the first one to start, we are missing default service account and fails with this kind of error.

Great catch @pritidesai !!!

serviceaccounts "default" not found

I want to dig into this a little bit:

  • I don't quite understand why "default" wouldn't exist - afaik there is always a "default" serviceaccount in every namespace (i think?)
  • The error says "default" was not found, but the change in the PR changes the test to use "default" instead of "pipelines"

I think this makes sense if the problem was that that "pipelines" service account didnt existing, but the error above looks like it's the other way around - could you explain a bit more?

I also don't totally understand how this could have ever worked since I don't see anything creating the "pipelines" service account either 🤔

@@ -359,7 +359,7 @@ func TestReconcile_ExplicitDefaultSA(t *testing.T) {
tb.TaskRunServiceAccountName("test-sa"),
))
taskruns := []*v1beta1.TaskRun{taskRunSuccess, taskRunWithSaSuccess}
defaultSAName := "pipelines"
defaultSAName := "default"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking at the PR that added this change (#1227) and trying to understand the intent, i think the goal was to explicitly set the serviceaccount to something other than default, is that your understanding as well? (or maybe it just needs to be something other than "tekton" https://github.com/tektoncd/pipeline/blob/master/pkg/apis/config/testdata/config-defaults.yaml#L22 ?)

it looks like the default value is "default" so i think by changing this to "default" we might be removing the thing this test is meant to cover

Copy link
Member Author

@pritidesai pritidesai Sep 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup thats my understanding as well

This particular unit test is testing two things:

(1) taskrun without having specified service account in the spec unlike the next test case in the same function
(2) changing default service account through config map

Trying to think if it's possible to separate these two.

And I am not sure if we could be removing the thing this test is meant to cover because default value is tekton according to testdata and irrespective of what it is set to (other than tekton), the same value is set in config-map and sent to controller. 🤔

But the value(tekton) in testdata is not reflected as default service account either since Scott's error message says the taskrun controller is looking for default service account instead of tekton when task spec is missing service account.

I am still looking into this and would appreciate input from config map experts 🙏

Copy link
Member Author

@pritidesai pritidesai Sep 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran TestReconcile which has a list of tasks without any service account set in the task spec. So taskrun controller is relying on default service account. The test is creating service account as part of the fake client. Now, this test runs fine which means default service account is default and not tekton for taskrun controller and changing this to zoo in fake client, results in failure:

{"level":"error","ts":1599113959.2378,"logger":"fallback","caller":"taskrun/taskrun.go:377","msg":"Failed to create task run pod for taskrun \"test-taskrun-run-success\": failed to create task run pod \"test-taskrun-run-success\": translating TaskSpec to Pod: serviceaccounts \"default\" not found. Maybe missing or invalid Task foo/test-task", ...

Tracing this error takes us to the creds init where serviceaccount is hardcoded to default.

Should this be read from config-map or somewhere else instead of having it hardcoded to default? 🤔

Copy link
Member Author

@pritidesai pritidesai Sep 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am working on getting this hard coded default out of creds_init.go 👩‍💻 🤞

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have created PR #3168 to remove this hard coded default and also add it to the pipelinerun/taskrun spec.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the problem is the hardcoded default serviceaccount (although, it would be nice if it wasn't harcoded 😛 )

@pritidesai
Copy link
Member Author

I want to dig into this a little bit:

  • I don't quite understand why "default" wouldn't exist - afaik there is always a "default" serviceaccount in every namespace (i think?)
  • The error says "default" was not found, but the change in the PR changes the test to use "default" instead of "pipelines"

I think this makes sense if the problem was that that "pipelines" service account didnt existing, but the error above looks like it's the other way around - could you explain a bit more?

default service account would exist in a running instance 😜 Here, the validation or check is being done against the test asset (specifically, fake clients) and that's reason this unit test has to create/register the service account in that client.

This unit test creates pipelines service account and sets pipelines as default service account through config map but taskrun controller fails to read that change done in config map. Without any config map changes, for taskrun controller, default service account is default and thats why test is complaining that default does not exist in the fake client (my brain is fried at this point 😆 )

I also don't totally understand how this could have ever worked since I don't see anything creating the "pipelines" service account either 🤔

The unit test is creating pipelines service account. The test is checking if spec has set any service account which is not true in this case and results in saName to be empty. When saName is empty, its set to defaultSAName which is pipelines in this particular case.

@pritidesai pritidesai mentioned this pull request Sep 3, 2020
4 tasks
@bobcatfish
Copy link
Collaborator

@pritidesai i am still confused but I think it's just because I need to take some time to understand this and it's the evening before a long weekend and I just can't do it! Apologies for my confusion and weekend brain! I'll try to summarize it again but you've probably already addressed it:

  • The error in the PR description says the default service account doesnt exist, but the change in the PR is changing a test to USE that very service account that apparently didnt exist in the test failure

I think it makes sense to me that the default service account would exist on a running instance, but not in this mock environment, and i can see how the order of test execution could make that sometimes not get created, i just cant understand how the change in the PR goes along with that 😦 (and you've probably explained it already!!)

Happy to defer to you in the meantime and I appreciate your thorough analysis! I might be back with more questions later but in the meantime:

/approve

@tekton-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bobcatfish

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 3, 2020
@pritidesai
Copy link
Member Author

pritidesai commented Sep 4, 2020

No worries @bobcatfish yup I can understand it being so tricky and appreciate you following the entire analysis. This screen shot from Scott's error capture might be useful understanding and feel free to hold this unless you are sure.

Even when the config-defaults ConfigMap has DefaultServiceAccount set to pipelines and no service account specified in the taskrun spec, taskrun controller is looking for default instead of pipelines:

Screen Shot 2020-09-03 at 10 00 23 PM

Since I am not able to reproduce this, I am feeding off of this error capture and changing it to default.

I have opened another PR # 3168 related to the findings from this effort. PR #3168 is implementing changes to
set service account in taskrun spec when neither specified in the spec nor in config-defaults ConfigMap compared to what we have today - setting service account in taskrun spec if not specified in the spec but exists in config-defaults ConfigMap.

Copy link
Member

@vdemeester vdemeester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I think this is not the right fix 🙃

As far as I understand, we do create the serviceAccount during the tests. One thing that we do not do (and that we do when we setup tasks, …) is to make sure the informers are up-to-date too.

I take a quick look at it, and this is the "fix" I came up with master...vdemeester:3160-sa-unit-failure (@bobcatfish @pritidesai)

@@ -359,7 +359,7 @@ func TestReconcile_ExplicitDefaultSA(t *testing.T) {
tb.TaskRunServiceAccountName("test-sa"),
))
taskruns := []*v1beta1.TaskRun{taskRunSuccess, taskRunWithSaSuccess}
defaultSAName := "pipelines"
defaultSAName := "default"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the problem is the hardcoded default serviceaccount (although, it would be nice if it wasn't harcoded 😛 )

@pritidesai
Copy link
Member Author

Thanks @vdemeester I am closing this for now assuming this is fixed with PR #3189 🙏

@pritidesai pritidesai closed this Sep 28, 2020
@pritidesai pritidesai deleted the nighlty-unit-test branch September 28, 2020 20:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/bug Categorizes issue or PR as related to a bug. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Nightly unit test failure due to race condition when initializing config stores
4 participants