Validate TaskRun compatibility with the Affinity Assistant #2885

Conversation

jlpettersson
Member

@jlpettersson jlpettersson commented Jul 1, 2020

Changes

A TaskRun that mounts more than one PVC-backed workspace is incompatible
with the Affinity Assistant. But the TaskRun is not validated for
compatibility, so the TaskRun Pod gets stuck and the user gets little
information about why.

This commit adds validation of TaskRuns. When a TaskRun is associated with
an Affinity Assistant, it is checked that no more than one PVC workspace
is used; otherwise, the TaskRun fails with a TaskRunValidationFailed condition.

Proposed in #2829 (comment)
Closes #2864

/kind feature
/release-note-none

Submitter Checklist

These are the criteria that every PR should meet, please check them off as you
review them:

  • Includes tests (if functionality changed/added)
  • Includes docs (if user facing)
  • Commit messages follow commit message best practices
  • Release notes block has been filled in or deleted (only if no user facing changes)

See the contribution guide for more details.

Double check this list of stuff that's easy to miss:

Reviewer Notes

If API changes are included, additive changes must be approved by at least two OWNERS and backwards incompatible changes must be approved by more than 50% of the OWNERS, and they must first be added in a backwards compatible way.

@tekton-robot
Collaborator

@jlpettersson: Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tekton-robot tekton-robot added the do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. label Jul 1, 2020
@tekton-robot tekton-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jul 1, 2020
@tekton-robot
Collaborator

This PR cannot be merged: expecting exactly one kind/ label

Available kind/ labels are:

kind/question: Issues or PRs that are questions around the project or a particular feature
kind/bug: Categorizes issue or PR as related to a bug.
kind/flake: Categorizes issue or PR as related to a flakey test
kind/cleanup: Categorizes issue or PR as related to cleaning up code, process, or technical debt.
kind/design: Categorizes issue or PR as related to design.
kind/documentation: Categorizes issue or PR as related to documentation.
kind/feature: Categorizes issue or PR as related to a new feature.
kind/misc: Categorizes issue or PR as a miscellaneous one.

@tekton-robot tekton-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Jul 1, 2020
@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/taskrun/taskrun.go 76.5% 77.2% 0.7

@jlpettersson jlpettersson force-pushed the validate_taskrun_compatibility_with_aa branch from 6ef27af to 8fa78cc Compare July 1, 2020 17:39
@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/taskrun/taskrun.go 76.5% 77.2% 0.7

@jlpettersson
Member Author

/release-note-none

@tekton-robot tekton-robot added release-note-none Denotes a PR that doesnt merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jul 1, 2020
}

if pvcWorkspaces > 1 {
	return fmt.Errorf("TaskRun mounts more than one PersistentVolumeClaim - that is forbidden when the Affinity Assistant is enabled")

Should this be "mounts more than one writeable PVC"? I thought it was OK to receive multiple PVCs in read-only? Or will that also trip into potential deadlock scenarios?

Member Author

That is a good question @sbwsg

TL;DR: as written, the message matches the current implementation of the Affinity Assistant.

The Affinity Assistant ensures that TaskRun pods are scheduled to the same Node. This actually helps to avoid two different problems:

  1. All pods sharing a PVC (e.g. with accessMode ReadWriteOnce) can now execute in parallel. This may also be possible with other accessModes, e.g. ReadWriteMany, even without the Affinity Assistant.
  2. All pods sharing a PVC (e.g. with a zonal StorageClass) are scheduled to the same AZ, so the pipeline will not deadlock. This may also be possible with regional StorageClasses even without the Affinity Assistant, but on e.g. GCP, regional StorageClass volumes are only available in two of three AZs, so it may still be a problem. With accessMode ReadWriteMany, pods can execute in parallel within an AZ, but in regional clusters the volumes additionally need a regional StorageClass to avoid the problems that the Affinity Assistant solves.

The guidelines about "at most one writeable workspace" are for Tasks; e.g. they don't say whether it is a PVC, an emptyDir or a regional StorageClass. With "writeable" I also meant PVCs; e.g. using a Secret or ConfigMap workspace is still fine for Tasks, in addition to a PVC workspace.

I may add a new section to the workspace documentation to document more about this - it is a bit complicated.

Member Author

On the other hand, using a ReadWriteOnce PVC in readOnly mode with a regional StorageClass should work, but it is currently not supported/implemented by the Affinity Assistant. We may want to implement it, but it is a bit complicated too, and probably a rarely used use case.
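
For readers following this thread, the check under discussion can be sketched roughly as below. This is a minimal, self-contained illustration and not the code in this PR: the `Workspace` type, its `PVCClaimName` field, and the `validateWorkspaces` function are hypothetical stand-ins for the real Tekton types. Only the counting logic and the error message come from the diff above. Note that the sketch counts every PVC-backed workspace, read-only or not, which matches the behaviour described in the reply.

```go
package main

import (
	"errors"
	"fmt"
)

// Workspace is a hypothetical, simplified stand-in for a TaskRun workspace
// binding. The real Tekton type has more fields.
type Workspace struct {
	Name string
	// PVCClaimName is non-empty when the workspace is backed by a
	// PersistentVolumeClaim (assumed field, for illustration only).
	PVCClaimName string
}

// errTooManyPVCs carries the message used in the diff under review.
var errTooManyPVCs = errors.New("TaskRun mounts more than one PersistentVolumeClaim - that is forbidden when the Affinity Assistant is enabled")

// validateWorkspaces counts PVC-backed workspaces and rejects more than one
// when the TaskRun is associated with an Affinity Assistant. Workspaces that
// are not PVC-backed (e.g. Secret or ConfigMap) are not counted.
func validateWorkspaces(affinityAssistant bool, workspaces []Workspace) error {
	if !affinityAssistant {
		return nil
	}
	pvcWorkspaces := 0
	for _, w := range workspaces {
		if w.PVCClaimName != "" {
			pvcWorkspaces++
		}
	}
	if pvcWorkspaces > 1 {
		return errTooManyPVCs
	}
	return nil
}

func main() {
	ws := []Workspace{
		{Name: "source", PVCClaimName: "source-pvc"},
		{Name: "cache", PVCClaimName: "cache-pvc"},
		{Name: "config"}, // e.g. a ConfigMap workspace, not counted
	}
	fmt.Println(validateWorkspaces(true, ws))  // prints the error message
	fmt.Println(validateWorkspaces(false, ws)) // prints <nil>
}
```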

@ghost

ghost commented Jul 7, 2020

/approve

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 7, 2020
if err := validateWorkspaceCompatibilityWithAffinityAssistant(tr); err != nil {
	logger.Errorf("TaskRun %q workspaces are invalid: %v", tr.Name, err)
	tr.Status.MarkResourceFailed(podconvert.ReasonFailedValidation, err)
	return nil, nil, err
Member

This should be a PermanentError otherwise the controller will requeue this.

Member Author

Thanks! Good catch!

Member

@afrittoli afrittoli left a comment

Thanks for this, it looks good. There is only one issue: a permanent error should be returned instead.
One of the matrix unit tests checks all invalid situations and enforces the permanent error. I wonder if we could use that test instead of creating a new one?

@ghost

ghost commented Jul 17, 2020

I'm going to add this to the 0.15 milestone as it would be great to get the validation in place in time for the next release.

@ghost ghost added this to the Pipelines v0.15 milestone Jul 17, 2020
@ghost

ghost commented Jul 17, 2020

I'm also happy to pick this up @jlpettersson if you don't have time to finish. let me know!

@jlpettersson jlpettersson force-pushed the validate_taskrun_compatibility_with_aa branch from 8fa78cc to 48d704a Compare July 17, 2020 15:07
@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/taskrun/taskrun.go 76.5% 77.2% 0.7

@jlpettersson
Member Author

I'm also happy to pick this up @jlpettersson if you don't have time to finish. let me know!

@sbwsg I fixed the PermanentError, but I haven't investigated whether the tests can be refactored as @afrittoli suggested. It's up to you whether to add those improvements in this PR or in a separate PR after this one is merged.

Thanks @sbwsg and @afrittoli for review!

@ghost

ghost commented Jul 17, 2020

Awesome, this looks great to me as it is but I'll leave it to @afrittoli to give the PermanentError another look.

@vdemeester
Member

/lgtm
/hold
@afrittoli can you take a look 🙃

@tekton-robot tekton-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 24, 2020
@tekton-robot tekton-robot added lgtm Indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jul 24, 2020
Member

@afrittoli afrittoli left a comment

Thanks for the update. The change looks good, but unfortunately you'll need to rebase first.
/approve

@tekton-robot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: afrittoli, sbwsg

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jlpettersson jlpettersson force-pushed the validate_taskrun_compatibility_with_aa branch from 48d704a to 1cd4d8c Compare July 27, 2020 17:58
@tekton-robot tekton-robot removed the lgtm Indicates that a PR is ready to be merged. label Jul 27, 2020
@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/taskrun/taskrun.go 76.6% 77.3% 0.7

@jlpettersson
Member Author

/hold cancel

@tekton-robot tekton-robot removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jul 27, 2020
Member

@vdemeester vdemeester left a comment

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 28, 2020
@jlpettersson
Member Author

/test pull-tekton-pipeline-integration-tests

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/feature Categorizes issue or PR as related to a new feature. lgtm Indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesnt merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Validate conditions for affinity assistant to operate correctly
4 participants