Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-3299: Update kmsv2 kep for beta #3804

Merged
merged 1 commit into from
Feb 7, 2023

Conversation

aramase
Copy link
Member

@aramase aramase commented Jan 31, 2023

Signed-off-by: Anish Ramasekar anish.ramasekar@gmail.com
Co-authored-by: Mo Khan i@monis.app

fixes kubernetes/kubernetes#114318

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 31, 2023
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/auth Categorizes an issue or PR as relevant to SIG Auth. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 31, 2023
@aramase aramase mentioned this pull request Jan 31, 2023
8 tasks
2. Run the e2e suite against a kind cluster that has kms v2 encryption enabled (as defined below).
3. Compare `request_duration_seconds`, `request_terminations_total`, `request_aborts_total` API server metrics between the two runs. The acceptable delta should be less than 20%.
4. Observe metrics from the reference implementation to determine time taken at each step of the encryption/decryption process.
5. Observe API server startup time with and without kms encryption enabled.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the acceptable delta in startup time for the apiserver w/ and w/o the feature enabled?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we defined the acceptable delta for API server metrics (latency) to be 20% we can do the same here. WDYT @enj?

Copy link
Member Author

@aramase aramase Feb 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@logicalhan Me and @enj had a conversation and this is the rough math we came up with:

Assuming 10ms for every KMS RPC request to be complete, for every 5000 resource encrypted, it will increase the startup time by roughly a minute (5000*10ms = 50s). Is this something you would like us to document in the KEP?

Comment on lines +420 to +422
1. Add new `identity` provider at the top of encryption config
1. Restart kube-apiserver
1. Run storage migration to migrate all the existing encrypted data to use the `identity` provider
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, does adding identity provider unencrypt the data in etcd?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you know if the migration has successfully completed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, does adding identity provider unencrypt the data in etcd?

By setting identity as the first provider in the list, that'll be used for writes (no encryption). The rest of the providers will be used for read. When the user runs kubectl get secrets --all-namespaces -o json | kubectl replace -f -, the kms providers will be used to decrypt the data and on write the identity provider will be used. At the end of this, all the data will be stored unencrypted.

How do you know if the migration has successfully completed?

We're proposing a metric to record the number of times a key is used for read and write (xref: kubernetes/kubernetes#115394). With this metric, when the API server is restarted after storage migration, the count for key used to read should be 0 indicating all the data is already unencrypted.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you considered using a condition on the CRD to denote that the unencryption has completed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have a CRD for KMS. As a first step, we were thinking about exposing enough details in the metrics + documentation on how to use the metric, that could be used to determine if the rotation is complete.

Copy link
Contributor

@deads2k deads2k Feb 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we still have the storage-version-migrator. explaining how to verify the writes with that API is worthwhile. A link in the PRR is sufficient to describe it.

1. Remove the KMS provider from the encryption config and restart kube-apiserver
2. At the end of these steps, all the data in etcd will be unencrypted.

More details are available [here](https://kubernetes.io/docs/tasks/administer-cluster/kms-provider/#disabling-encryption-at-rest)

Disabling this gate without first doing a storage migration to use a different encryption at rest mechanism will result in data loss.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not just data loss, right? It will basically break the cluster, since apiserver won't be able to read the objects from etcd.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not just data loss, right? It will basically break the cluster, since apiserver won't be able to read the objects from etcd.

That is only when all the resources in etcd are encrypted using KMS. Typically the configuration is only for a subset of resources (secrets, configmaps). In that scenario, those resources can't be retrieved if the KMS providers are removed without running storage migration. If the user adds the KMS providers back (only for read), they would still be able to retrieve the resources.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's imagine we encrypted secrets that are mounted in pods. This will prevent any future pods referencing secrets from starting, right? Given that secrets are the most likely thing to encrypt, it's worth calling out the consequences of the failure here.

keps/sig-auth/3299-kms-v2-improvements/README.md Outdated Show resolved Hide resolved
@logicalhan
Copy link
Member

/cc @deads2k

@deads2k
Copy link
Contributor

deads2k commented Feb 2, 2023

looks like there are a couple PRR updates for beta and this will be good.

@aramase
Copy link
Member Author

aramase commented Feb 6, 2023

@deads2k Updated the PRR sections. PTAL when you get a chance!

keps/sig-auth/3299-kms-v2-improvements/README.md Outdated Show resolved Hide resolved
keps/sig-auth/3299-kms-v2-improvements/README.md Outdated Show resolved Hide resolved
keps/sig-auth/3299-kms-v2-improvements/README.md Outdated Show resolved Hide resolved
Signed-off-by: Anish Ramasekar <anish.ramasekar@gmail.com>
Co-authored-by: Mo Khan <i@monis.app>
Signed-off-by: Anish Ramasekar <anish.ramasekar@gmail.com>
@deads2k
Copy link
Contributor

deads2k commented Feb 7, 2023

PRR lgtm. I can't approve without it picking up for sig-auth as well and I think that @ritazh and @enj own sig approval for this one. Ping me on slack when it's time for PRR as well.

@enj
Copy link
Member

enj commented Feb 7, 2023

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 7, 2023
@deads2k
Copy link
Contributor

deads2k commented Feb 7, 2023

/approve

for PRR

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aramase, deads2k, enj

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 7, 2023
@k8s-ci-robot k8s-ci-robot merged commit b1fdf4f into kubernetes:master Feb 7, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.27 milestone Feb 7, 2023
@aramase aramase deleted the aramase/3299/beta branch February 7, 2023 21:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/auth Categorizes an issue or PR as relevant to SIG Auth. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
Archived in project
Status: Done
Development

Successfully merging this pull request may close these issues.

[KMSv2] Update KEP-3299 for beta
6 participants