
Pass SecurityContext in all the AppRepo Controller's jobs #6605

Closed

Conversation

antgamdia
Contributor

@antgamdia antgamdia commented Aug 10, 2023

Description of the change

Even though the sync jobs already support a security context (by means of each AppRepo CRD), this information was not available for cleanup jobs. Note that those jobs are spun up once a NotFound error is thrown when fetching an AppRepo.

The approach applied here is simply passing this information via command-line arguments. However, to maximize the usefulness of the feature, it is also taken into account in the sync jobs (if a config is passed via the AppRepo CRD, it will override the default one).
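
To make the approach concrete, here is a minimal sketch (illustrative only, not the actual controller code) of how the JSON-encoded flags could be parsed on the controller side; the flag names are the ones added in this PR (see the diff below), while the package layout, variable names and error handling are assumptions:

```go
// Sketch only: parse the controller-wide security context defaults passed as
// JSON via command-line arguments (flag names as introduced in this PR).
package main

import (
	"encoding/json"
	"flag"
	"log"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	var containerSCJSON, podSCJSON string
	flag.StringVar(&containerSCJSON, "default-container-security-context", "",
		"JSON-encoded default container securityContext for sync/cleanup jobs")
	flag.StringVar(&podSCJSON, "default-pod-security-context", "",
		"JSON-encoded default pod securityContext for sync/cleanup jobs")
	flag.Parse()

	var defaultContainerSC corev1.SecurityContext
	var defaultPodSC corev1.PodSecurityContext

	// json.Unmarshal ignores unknown keys, so the chart-level "enabled" toggle
	// present in the flag values does not break decoding into the k8s types.
	if containerSCJSON != "" {
		if err := json.Unmarshal([]byte(containerSCJSON), &defaultContainerSC); err != nil {
			log.Fatalf("invalid --default-container-security-context: %v", err)
		}
	}
	if podSCJSON != "" {
		if err := json.Unmarshal([]byte(podSCJSON), &defaultPodSC); err != nil {
			log.Fatalf("invalid --default-pod-security-context: %v", err)
		}
	}

	log.Printf("default container securityContext: %+v", defaultContainerSC)
	log.Printf("default pod securityContext: %+v", defaultPodSC)
}
```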

Besides, this PR is adding some minor fixes, namely:

  • fixing a duplicated key.
  • adding some recommended security context params (more or less, the ones used in other Bitnami charts, though Trivy still complains). I have since removed those changes because CI was failing... so I think they deserve their own PR.
  • fixing a missing conditional handling of the CronJob API (v1 vs v1beta1).
  • passing the whole security context in the AppRepo (as opposed to passing just a single field).
  • using podSecurityContext and containerSecurityContext where each actually applies.

Benefits

Users will be able to set a securityContext for all the jobs created by Kubeapps.

Possible drawbacks

N/A (let's see what CI thinks :P). Edit: it seems there are some issues during the chart installation; I will look at it next week. Update: after reverting some changes in the actual security context, CI is passing again.

Applicable issues

Additional information

Here is a diff of the helm template before/after the changes in this PR:

Details
diff --git a/out.yaml b/out.yaml
index a1cd13a03..b2218e54f 100644
--- a/out.yaml
+++ b/out.yaml
@@ -717,6 +717,9 @@ spec:
           image: docker.io/kubeapps/apprepository-controller:latest
           imagePullPolicy: "IfNotPresent"
           securityContext:
             runAsNonRoot: true
             runAsUser: 1001
           command:
@@ -733,6 +736,8 @@ spec:
             - --database-user=postgres
             - --database-name=assets
             - --repos-per-namespace=true
+            - --default-container-security-context='{"enabled":true,"runAsNonRoot":true,"runAsUser":1001}'
+            - --default-pod-security-context='{"enabled":true,"fsGroup":1001}'
           env:
             - name: REPO_SYNC_IMAGE
               value: docker.io/kubeapps/asset-syncer:latest
@@ -801,6 +806,9 @@ spec:
           image: docker.io/kubeapps/dashboard:latest
           imagePullPolicy: "IfNotPresent"
           securityContext:
             runAsNonRoot: true
             runAsUser: 1001
           env:
@@ -943,6 +951,9 @@ spec:
           image: docker.io/bitnami/nginx:1.25.1-debian-11-r37
           imagePullPolicy: "IfNotPresent"
           securityContext:
             runAsNonRoot: true
             runAsUser: 1001
           env:
@@ -1039,6 +1050,9 @@ spec:
           image: docker.io/kubeapps/kubeapps-apis:latest
           imagePullPolicy: "IfNotPresent"
           securityContext:
             runAsNonRoot: true
             runAsUser: 1001
           command:
@@ -1081,22 +1095,20 @@ spec:
               containerPort: 50051
           livenessProbe:
             failureThreshold: 6
-            initialDelaySeconds: 60
+            initialDelaySeconds: 10
             periodSeconds: 10
             successThreshold: 1
             timeoutSeconds: 5
             exec:
               command: ["grpc_health_probe", "-addr=:50051"]
-            initialDelaySeconds: 10
           readinessProbe:
             failureThreshold: 6
-            initialDelaySeconds: 0
+            initialDelaySeconds: 5
             periodSeconds: 10
             successThreshold: 1
             timeoutSeconds: 5
             exec:
               command: ["grpc_health_probe", "-addr=:50051"]
-            initialDelaySeconds: 5
           resources:
             limits:
               cpu: 250m
@@ -1286,5 +1298,13 @@ spec:
   url: https://charts.bitnami.com/bitnami
   syncJobPodTemplate:
     spec:
+      containers:
+        -
+          securityContext:
+            runAsNonRoot: true
+            runAsUser: 1001
       securityContext:
-        runAsUser: 1001
+        fsGroup: 1001

@netlify

netlify bot commented Aug 10, 2023

Deploy Preview for kubeapps-dev canceled.

Built without sensitive environment variables

🔨 Latest commit: 5f83dd8
🔍 Latest deploy log: https://app.netlify.com/sites/kubeapps-dev/deploys/64da3bbb5a74f40008df90e9

Contributor

@absoludity absoludity left a comment


Ouch - sorry it wasn't the easy fix I'd originally thought.

Even though the sync jobs already support a security context (by means of each AppRepo CRD), this information was not available for cleanup jobs. Note that those jobs are spun up once a NotFound error is thrown when fetching an AppRepo.

Right, that signals that it's too late to be running the clean-up job at that point. Kubernetes has a solution for that (see below).

The approach applied here is simply passing this information via command-line arguments. However, to maximize the usefulness of the feature, it is also taken into account in the sync jobs (if a config is passed via the AppRepo CRD, it will override the default one).

Why would we provide AppRepo CRD-specific security context, if the controller is going to run with a security context to use by default anyway?

See what you think, but I reckon the native way to solve this would be for us to instead use finalizers for the AppRepo CRDs (which allow the controller to run clean-up tasks before the resource is deleted).
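
For illustration, a rough, hypothetical sketch of that flow (none of these helper names come from the Kubeapps codebase): the controller adds a finalizer while the AppRepo is live, and once a deletion timestamp appears it runs the cleanup, with the full AppRepo still available, before removing the finalizer so Kubernetes can finish the delete.

```go
// Hypothetical sketch of a finalizer-based cleanup flow (not Kubeapps code).
package apprepo

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// cleanupFinalizer is an illustrative finalizer name, not an actual Kubeapps one.
const cleanupFinalizer = "kubeapps.com/apprepo-cleanup"

func hasFinalizer(obj metav1.Object) bool {
	for _, f := range obj.GetFinalizers() {
		if f == cleanupFinalizer {
			return true
		}
	}
	return false
}

// reconcileFinalizer adds the finalizer while the resource is live; when the
// resource is being deleted, it runs cleanup (the AppRepo, including its
// syncJobPodTemplate/securityContext, is still available at that point) and
// only then removes the finalizer so the deletion can complete.
func reconcileFinalizer(ctx context.Context, obj metav1.Object,
	cleanup func(context.Context) error,
	update func(context.Context, metav1.Object) error) error {

	if obj.GetDeletionTimestamp().IsZero() {
		if !hasFinalizer(obj) {
			obj.SetFinalizers(append(obj.GetFinalizers(), cleanupFinalizer))
			return update(ctx, obj)
		}
		return nil
	}
	if hasFinalizer(obj) {
		if err := cleanup(ctx); err != nil {
			return err // retried on the next reconcile; the finalizer keeps the CR around
		}
		var remaining []string
		for _, f := range obj.GetFinalizers() {
			if f != cleanupFinalizer {
				remaining = append(remaining, f)
			}
		}
		obj.SetFinalizers(remaining)
		return update(ctx, obj)
	}
	return nil
}
```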

@antgamdia
Contributor Author

Why would we provide AppRepo CRD-specific security context, if the controller is going to run with a security context to use by default anyway?

It does not really make much sense, I agree. The point is that, currently, we do support customization of the syncJobPodTemplate, as it is part of our CRD. Before this PR, the only way to pass this information was by setting the pod template in the CRD... which is precisely adding a per-repo security context (as opposed to having the jobs inherit the controller's context).

Whether or not using a custom security context in each repo actually makes sense... I don't know; I haven't thought about it thoroughly - though I have the impression it could be useful in some contexts where the sync job is run in a certain restricted namespace.

Coming back to this PR, we would now have two different ways of specifying a security context: 1) using the existing CRD's syncJobPodTemplate field and/or 2) setting a default value on the AppRepoController side.
The point is: what should the expected behavior be when upgrading Kubeapps (assuming existing jobs) and setting the controller-wide default?

  • A) If a controller-wide default is found AND a CRD-scoped config is found, then solely use the default -as is- in every job.
  • B) If a controller-wide default is found AND a CRD-scoped config is found, then solely use the CRD's -as is- in every job.
  • C) If a controller-wide default is found AND a CRD-scoped config is found, then use both and, if they conflict, preserve default's
  • D) If a controller-wide default is found AND a CRD-scoped config is found, then use both and, if they conflict, preserve CRD's

In this PR, I opted for option D, as I see it as the least disruptive option, maximizing backward compatibility.
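
To make the intended semantics of option D concrete, here is a minimal sketch (a hypothetical helper, not this PR's code) of a merge where the CRD-scoped context wins on any field it explicitly sets:

```go
// Hypothetical illustration of option D: start from the controller-wide
// default and let the per-AppRepo (CRD-scoped) security context override
// conflicting fields. Not the actual Kubeapps implementation.
package apprepo

import corev1 "k8s.io/api/core/v1"

func mergeContainerSecurityContext(def, repo *corev1.SecurityContext) *corev1.SecurityContext {
	if repo == nil {
		return def.DeepCopy()
	}
	if def == nil {
		return repo.DeepCopy()
	}
	merged := def.DeepCopy()
	// Only a few representative fields are shown; a full merge would cover them all.
	if repo.RunAsUser != nil {
		merged.RunAsUser = repo.RunAsUser
	}
	if repo.RunAsNonRoot != nil {
		merged.RunAsNonRoot = repo.RunAsNonRoot
	}
	if repo.ReadOnlyRootFilesystem != nil {
		merged.ReadOnlyRootFilesystem = repo.ReadOnlyRootFilesystem
	}
	return merged
}
```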

See what you think, but I reckon the native way to solve this would be for us to instead use finalizers for the AppRepo CRDs (which allow the controller to run clean-up tasks before the resource is deleted).

+1000, the current approach based upon syncJobs+cleanupJobs seems like a bad smell given the current myriad of k8s-native solutions... but if we start improving it... we would end up with something like the Flux HelmRepository.
Anyway, we could open a new issue for trying to, at least, move from clean-up jobs to finalizers.

@absoludity
Contributor

Why would we provide AppRepo CRD-specific security context, if the controller is going to run with a security context to use by default anyway?

It does not really make much sense, I agree. The point is that, currently, we do support customization of the syncJobPodTemplate, as it is part of our CRD. Before this PR, the only way to pass this information was by setting the pod template in the CRD... which is precisely adding a per-repo security context (as opposed to having the jobs inherit the controller's context).

Sorry, my thought was more that the AppRepo CRD-specific security context makes sense at the moment, but if we add a default security context then it no longer makes sense (to me?). In a multi-tenant use case, having an AppRepo-specific security context allows different users to utilise different resources. But if they all have access to the default context, then you no longer have multi-tenancy for those resources (nodes, or whatever is protected by the security context).

Whether or not using a custom security context in each repo actually makes sense... I don't know; I haven't thought about it thoroughly - though I have the impression it could be useful in some contexts where the sync job is run in a certain restricted namespace.

Yes.

Coming back to this PR, we would now have two different ways of specifying a security context: 1) using the existing CRD's syncJobPodTemplate field and/or 2) setting a default value on the AppRepoController side. The point is: what should the expected behavior be when upgrading Kubeapps (assuming existing jobs) and setting the controller-wide default?

* A) If a controller-wide default is found AND a CRD-scoped config is found, then solely use the default -as is- in every job.

* B) If a controller-wide default is found AND a CRD-scoped config is found, then solely use the CRD's -as is- in every job.

* C) If a controller-wide default is found AND a CRD-scoped config is found, then use both and, if they conflict, preserve default's

* D) If a controller-wide default is found AND a CRD-scoped config is found, then use both and, if they conflict, preserve CRD's

In this PR, I opted for option D, as I see it as the least disruptive option, maximizing backward compatibility.

Yes, it's less disruptive, but I'm worried it's adding tech-debt in a direction away from our goal, when it's a minor problem that could be solved in a way that moves us towards our goal.

See what you think, but I reckon the native way to solve this would be for us to instead use finalizers for the AppRepo CRDs (which allow the controller to run clean-up tasks before the resource is deleted).

+1000, the current approach based upon syncJobs+cleanupJobs seems like a bad smell given the current myriad of k8s-native solutions... but if we start improving it... we would end up with something like the Flux HelmRepository. Anyway, we could open a new issue for trying to, at least, move from clean-up jobs to finalizers.

Yeah - the question is: is there a positive step forward here that we could take that's simple and doesn't add something moving us away from our goal? Not sure, but one idea that comes to mind: we could instead update to run the cleanup code when the app repo is deleted by Kubeapps - it's not perfect as it won't cover the case where the app repo is deleted by the CLI, but it's a step towards doing the work in the controller triggered (later) by a finalizer. (I don't think it's a huge amount of work, the cleanup).

I'm going to leave a +1 on this, because I know that you'll think about this in more detail than I have, considering this info, and make a good decision based on what you think (either way). I just wanted to be sure you understood my concern about adding something that may hinder us in the future, when there may be other steps forward more aligned with the goal of using a finalizer.

@antgamdia
Contributor Author

Haven't had time to reply yet (will do it tomorrow), but TL;DR: I gave the finalizers approach a try... and it seems to work pretty well :)

apprepos.mp4

@antgamdia
Contributor Author

Yes, it's less disruptive, but I'm worried it's adding tech-debt in a direction away from our goal, when it's a minor problem that could be solved in a way that moves us towards our goal.

Yep, I agree. I was a bit reluctant to add more features related to the CR lifecycle and just went with the simplistic approach. But if we keep doing that, we will never get to fix that tech debt.

is there a positive step forward here that we could take that's simple and doesn't add something moving us away from our goal? Not sure, but one idea that comes to mind: we could instead update to run the cleanup code when the app repo is deleted by Kubeapps

I don't think that fully works, since it could even lead to some inconsistent jobs. I'm not sure that diverging the behavior between what Kubeapps does and what is expected from the controller when interacting with the k8s resources directly is a good idea.

So, in the end, I just added the finalizers thing. It wasn't that difficult to add: #6647 :) Let's move the technical discussion there.

I have also moved the extra changes I performed here to #6646.

Contributor

@absoludity absoludity left a comment


Excellent, thanks @antgamdia. I'll close this one after reviewing your others (so that we don't accidentally land it). Or maybe I'll switch the review instead.

antgamdia added a commit that referenced this pull request Aug 21, 2023
#6646)

### Description of the change

While working on #6605, I noticed I had a bunch of unrelated changes.
Instead of overloading the PR, I have extracted them into this one. The
main changes are:

- Apply the same fix in go Dockerfiles as we did for kubeapps-apis
(avoid downloading linters if `lint=false`). It helps reduce the build
time locally.
- Remove some duplicated keys in the yamls, as we already discussed.
- Add some missing apprepo-controller args in the tests, mainly just for
completeness.
- Fix some tests in the apprepo-controller. Mainly just swapping some
misplaced `get, want`.
- Handle cronjob v1beta1 vs v1 in a missing place.
- Pass the `podSecurityContext` and `containerSecurityContext` in the
pod template properly.
- Update a missing copyright header.
- Fix wrong values.yaml metadata (mainly `ocicatalog/ociCatalog`)

### Benefits

Get the aforementioned issues solved.

### Possible drawbacks

N/A (not adding new logic here)

### Applicable issues

- (partially) related to
#6545

### Additional information

N/A

---------

Signed-off-by: Antonio Gamez Diaz <agamez@vmware.com>
@antgamdia antgamdia closed this Aug 21, 2023
antgamdia added a commit that referenced this pull request Aug 22, 2023
### Description of the change

Even though the sync jobs already support a security context (by means of each
AppRepo CRD), this information was not available for cleanup jobs. This
is mainly due to the fact that those jobs are spun up once a NotFound
error is thrown when fetching an AppRepo.
However, Kubernetes does have a native approach for dealing with these
scenarios: finalizers.

In #6605 we proposed a
simplistic workaround based on adding more params to the controller...
but as suggested in
#6605 (comment),
moving to finalizers is a better long-term solution.


### Benefits

Cleanup jobs are now handled within an existing AppRepo context...
meaning we have the whole syncJobPodTemplate available to be used (i.e.,
securityContext and so on).

### Possible drawbacks

When dealing with finalizers in the past, I have often found it really
annoying when they get stuck and prevent the resource from being deleted. I
wonder if we should add some info to the FAQ on how to manually remove
the finalizers.

Additionally, and this might be something important: for the AppRepo
controller to be able to `update` AppRepos in namespaces other than
kubeapps (to add the finalizer), it now needs extra RBAC. Before, we
were just granting `...-appprepos-read`... but now we would need to grant
`...-write` as well... and I'm not sure we really want to do so.
WDYT, @absoludity?
Another idea is using an admission policy... but again, I'm not sure if we
want to deal with that...

~(I haven't modified the RBAC yet in this PR)~ The changes have finally
been performed.

### Applicable issues

- fixes #6545

### Additional information

This PR is based on top of
#6646, but the main change
to review is
6e70910.
The rest is mostly just moving code into separate files.

Also, I have been taking a look at `kubebuilder` to create a new
controller, based on `sigs.k8s.io/controller-runtime` rather than on the
workqueues we currently have. While it is pretty easy to get started
([see quickstart](https://book.kubebuilder.io/quick-start)), I think it
adds too much complexity (using kustomize, adding RBAC proxies,
Prometheus metrics, etc.).
I also quickly tried the k8s codegen scripts, but ran into some issues
with my setup... though perhaps that's the best option.

IMO, at some point we should start thinking about moving towards a new
state-of-the-art k8s controller boilerplate.

---------

Signed-off-by: Antonio Gamez Diaz <agamez@vmware.com>
@antgamdia antgamdia deleted the 6545-add-seccontext-apprepo branch December 18, 2023 10:14

Successfully merging this pull request may close these issues.

Cannot configure securityContext of sync and cleanup jobs