deploy/kubernetes/base/node.yaml: Fixing deployment on Bottlerocket. #247

Closed

tatodorov

@tatodorov tatodorov commented Sep 9, 2020

Updated the path where the efs-utils.conf, efs-utils.crt and
privateKey.pem files are written to /var/amazon/efs, which is one of
the few writable file systems in Bottlerocket.

Is this a bug fix or adding new feature?
This resolves issue #246.

What is this PR about? / Why do we need it?
This allows aws-efs-csi-driver to also be deployed on Bottlerocket OS, where /etc is read-only.

What testing is done?
aws-efs-csi-driver was successfully deployed on Amazon Linux 2 and Bottlerocket OS.

on Bottlerocket OS

NAME                              STATUS   ROLES    AGE   VERSION   INTERNAL-IP       EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION   CONTAINER-RUNTIME
ip-192-168-102-136.ec2.internal   Ready    <none>   14m   v1.17.9   192.168.102.136   <none>        Bottlerocket OS 1.0.1   5.4.50           containerd://1.3.7+unknown
ip-192-168-102-151.ec2.internal   Ready    <none>   14m   v1.17.9   192.168.102.151   <none>        Bottlerocket OS 1.0.1   5.4.50           containerd://1.3.7+unknown
ip-192-168-102-72.ec2.internal    Ready    <none>   14m   v1.17.9   192.168.102.72    <none>        Bottlerocket OS 1.0.1   5.4.50           containerd://1.3.7+unknown

kubectl -n kube-system get pods -l app=efs-csi-node -o wide
NAME                 READY   STATUS    RESTARTS   AGE   IP                NODE                              NOMINATED NODE   READINESS GATES
efs-csi-node-bpxd9   3/3     Running   0          20m   192.168.102.136   ip-192-168-102-136.ec2.internal   <none>           <none>
efs-csi-node-mgzpk   3/3     Running   0          19m   192.168.102.151   ip-192-168-102-151.ec2.internal   <none>           <none>
efs-csi-node-mwfm7   3/3     Running   0          20m   192.168.102.72    ip-192-168-102-72.ec2.internal    <none>           <none>

Logs from efs-plugin:

kubectl -n kube-system logs efs-csi-node-bpxd9 efs-plugin
I0909 20:33:43.690761       1 mount_linux.go:163] Cannot run systemd-run, assuming non-systemd OS
I0909 20:33:43.690825       1 mount_linux.go:164] systemd-run failed with: exit status 1
I0909 20:33:43.690834       1 mount_linux.go:165] systemd-run output:
I0909 20:33:43.691006       1 driver.go:87] Starting watchdog
I0909 20:33:43.691089       1 efs_watch_dog.go:174] Copying /etc/amazon/efs/efs-utils.conf since it doesn't exist
I0909 20:33:43.691155       1 efs_watch_dog.go:174] Copying /etc/amazon/efs/efs-utils.crt since it doesn't exist
I0909 20:33:43.691443       1 driver.go:93] Staring subreaper
I0909 20:33:43.691458       1 driver.go:96] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I0909 20:33:45.219276       1 node.go:242] NodeGetInfo: called with args
I0909 20:35:07.764987       1 node.go:226] NodeGetCapabilities: called with args
I0909 20:35:10.413111       1 node.go:226] NodeGetCapabilities: called with args
I0909 20:35:10.414248       1 node.go:51] NodePublishVolume: called with args volume_id:"fs-2f7ccfac:/default/test-efs-pod" target_path:"/var/lib/kubelet/pods/36fdc3f8-e12a-44e8-8c3b-ef75f266ccc2/volumes/kubernetes.io~csi/test-efs-pod-data/mount" volume_capability:<mount:<> access_mode:<mode:SINGLE_NODE_WRITER > > volume_context:<key:"encryptInTransit" value:"true" >
I0909 20:35:10.414314       1 node.go:167] NodePublishVolume: creating dir /var/lib/kubelet/pods/36fdc3f8-e12a-44e8-8c3b-ef75f266ccc2/volumes/kubernetes.io~csi/test-efs-pod-data/mount
I0909 20:35:10.414341       1 node.go:172] NodePublishVolume: mounting fs-2f7ccfac:/default/test-efs-pod at /var/lib/kubelet/pods/36fdc3f8-e12a-44e8-8c3b-ef75f266ccc2/volumes/kubernetes.io~csi/test-efs-pod-data/mount with options [tls]
I0909 20:35:10.414355       1 mount_linux.go:135] Mounting cmd (mount) with arguments ([-t efs -o tls fs-2f7ccfac:/default/test-efs-pod /var/lib/kubelet/pods/36fdc3f8-e12a-44e8-8c3b-ef75f266ccc2/volumes/kubernetes.io~csi/test-efs-pod-data/mount])

NOTE

This PR fixes the deployment of aws-efs-csi-driver on Bottlerocket OS, but there is still an issue attaching an EFS volume to a Pod when running on Bottlerocket OS.

Log from efs-plugin

Sep 10 14:56:08 ip-192-168-102-149.ec2.internal kubelet[3687]: E0910 14:56:08.234142    3687 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/efs.csi.aws.com^fs-2f7ccfac:/default/test-efs-pod podName: nodeName:}" failed. No retries permitted until 2020-09-10 14:58:10.234068793 +0000 UTC m=+2597.763747074 (durationBeforeRetry 2m2s). Error: "MountVolume.SetUp failed for volume \"test-efs-pod-data\" (UniqueName: \"kubernetes.io/csi/efs.csi.aws.com^fs-2f7ccfac:/default/test-efs-pod\") pod \"test-efs-pod-0\" (UID: \"1fdd7f4e-46ef-48c5-a1be-f71864d51449\") : kubernetes.io/csi: mounter.SetupAt failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Sep 10 14:56:26 ip-192-168-102-149.ec2.internal kubelet[3687]: E0910 14:56:26.914296    3687 kubelet.go:1681] Unable to attach or mount volumes for pod "test-efs-pod-0_default(1fdd7f4e-46ef-48c5-a1be-f71864d51449)": unmounted volumes=[test-efs-pod-data], unattached volumes=[test-efs-pod-data test-efs-pod-token-2zppv]: timed out waiting for the condition; skipping pod
Sep 10 14:56:26 ip-192-168-102-149.ec2.internal kubelet[3687]: E0910 14:56:26.915660    3687 pod_workers.go:191] Error syncing pod 1fdd7f4e-46ef-48c5-a1be-f71864d51449 ("test-efs-pod-0_default(1fdd7f4e-46ef-48c5-a1be-f71864d51449)"), skipping: unmounted volumes=[test-efs-pod-data], unattached volumes=[test-efs-pod-data test-efs-pod-token-2zppv]: timed out waiting for the condition

Log from kubelet

Sep 10 06:59:16 ip-192-168-102-157.ec2.internal kubelet[3589]: I0910 06:59:16.953435    3589 reconciler.go:209] operationExecutor.VerifyControllerAttachedVolume started for volume "test-efs-pod-data" (UniqueName: "kubernetes.io/csi/efs.csi.aws.com^fs-2f7ccfac:/default/test-efs-pod") pod "test-efs-pod-0" (UID: "31bec37a-0b16-441a-b323-70be44dcd57c")
Sep 10 06:59:16 ip-192-168-102-157.ec2.internal kubelet[3589]: I0910 06:59:16.954345    3589 reconciler.go:209] operationExecutor.VerifyControllerAttachedVolume started for volume "test-efs-pod-token-fjjz9" (UniqueName: "kubernetes.io/secret/31bec37a-0b16-441a-b323-70be44dcd57c-test-efs-pod-token-fjjz9") pod "test-efs-pod-0" (UID: "31bec37a-0b16-441a-b323-70be44dcd57c")
Sep 10 06:59:17 ip-192-168-102-157.ec2.internal kubelet[3589]: I0910 06:59:17.056503    3589 clientconn.go:104] parsed scheme: ""
Sep 10 06:59:17 ip-192-168-102-157.ec2.internal kubelet[3589]: I0910 06:59:17.056760    3589 clientconn.go:104] scheme "" not registered, fallback to default scheme
Sep 10 06:59:17 ip-192-168-102-157.ec2.internal kubelet[3589]: I0910 06:59:17.056874    3589 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/var/lib/kubelet/plugins/efs.csi.aws.com/csi.sock 0  <nil>}] <nil>}
Sep 10 06:59:17 ip-192-168-102-157.ec2.internal kubelet[3589]: I0910 06:59:17.056967    3589 clientconn.go:577] ClientConn switching balancer to "pick_first"
Sep 10 06:59:17 ip-192-168-102-157.ec2.internal kubelet[3589]: I0910 06:59:17.063981    3589 csi_attacher.go:310] kubernetes.io/csi: attacher.MountDevice STAGE_UNSTAGE_VOLUME capability not set. Skipping MountDevice...
Sep 10 06:59:17 ip-192-168-102-157.ec2.internal kubelet[3589]: I0910 06:59:17.064015    3589 operation_generator.go:587] MountVolume.MountDevice succeeded for volume "test-efs-pod-data" (UniqueName: "kubernetes.io/csi/efs.csi.aws.com^fs-2f7ccfac:/default/test-efs-pod") pod "test-efs-pod-0" (UID: "31bec37a-0b16-441a-b323-70be44dcd57c") device mount path "/var/lib/kubelet/plugins/kubernetes.io/csi/pv/test-efs-pod-data/globalmount"

on Amazon Linux 2 (EKS optimized)

NAME                              STATUS   ROLES    AGE   VERSION              INTERNAL-IP       EXTERNAL-IP   OS-IMAGE         KERNEL-VERSION                  CONTAINER-RUNTIME
ip-192-168-102-132.ec2.internal   Ready    <none>   53s   v1.17.9-eks-4c6976   192.168.102.132   <none>        Amazon Linux 2   4.14.193-149.317.amzn2.x86_64   docker://19.3.6
ip-192-168-102-142.ec2.internal   Ready    <none>   82s   v1.17.9-eks-4c6976   192.168.102.142   <none>        Amazon Linux 2   4.14.193-149.317.amzn2.x86_64   docker://19.3.6
ip-192-168-102-91.ec2.internal    Ready    <none>   91s   v1.17.9-eks-4c6976   192.168.102.91    <none>        Amazon Linux 2   4.14.193-149.317.amzn2.x86_64   docker://19.3.6

kubectl -n kube-system get pods -l app=efs-csi-node -o wide
NAME                 READY   STATUS    RESTARTS   AGE   IP                NODE                              NOMINATED NODE   READINESS GATES
efs-csi-node-dwfxh   3/3     Running   0          76s   192.168.102.142   ip-192-168-102-142.ec2.internal   <none>           <none>
efs-csi-node-sg2sn   3/3     Running   0          47s   192.168.102.132   ip-192-168-102-132.ec2.internal   <none>           <none>
efs-csi-node-xfbqx   3/3     Running   0          85s   192.168.102.91    ip-192-168-102-91.ec2.internal    <none>           <none>

Logs from efs-plugin:

kubectl -n kube-system logs efs-csi-node-sg2sn efs-plugin
I0909 19:44:01.135535       1 mount_linux.go:163] Cannot run systemd-run, assuming non-systemd OS
I0909 19:44:01.135652       1 mount_linux.go:164] systemd-run failed with: exit status 1
I0909 19:44:01.135664       1 mount_linux.go:165] systemd-run output: Failed to create bus connection: No such file or directory
I0909 19:44:01.135819       1 driver.go:87] Starting watchdog
I0909 19:44:01.135905       1 efs_watch_dog.go:174] Copying /etc/amazon/efs/efs-utils.conf since it doesn't exist
I0909 19:44:01.136091       1 efs_watch_dog.go:174] Copying /etc/amazon/efs/efs-utils.crt since it doesn't exist
I0909 19:44:01.137521       1 driver.go:93] Staring subreaper
I0909 19:44:01.137538       1 driver.go:96] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I0909 19:44:03.078723       1 node.go:242] NodeGetInfo: called with args
I0909 19:44:17.031525       1 node.go:226] NodeGetCapabilities: called with args
I0909 19:44:17.632663       1 node.go:226] NodeGetCapabilities: called with args
I0909 19:44:17.638825       1 node.go:51] NodePublishVolume: called with args volume_id:"fs-2f7ccfac:/default/test-efs-pod" target_path:"/var/lib/kubelet/pods/befebc5f-ff88-40b1-bab5-8ac778cbf258/volumes/kubernetes.io~csi/test-efs-pod-data/mount" volume_capability:<mount:<> access_mode:<mode:SINGLE_NODE_WRITER > > volume_context:<key:"encryptInTransit" value:"true" >
I0909 19:44:17.638922       1 node.go:167] NodePublishVolume: creating dir /var/lib/kubelet/pods/befebc5f-ff88-40b1-bab5-8ac778cbf258/volumes/kubernetes.io~csi/test-efs-pod-data/mount
I0909 19:44:17.638953       1 node.go:172] NodePublishVolume: mounting fs-2f7ccfac:/default/test-efs-pod at /var/lib/kubelet/pods/befebc5f-ff88-40b1-bab5-8ac778cbf258/volumes/kubernetes.io~csi/test-efs-pod-data/mount with options [tls]
I0909 19:44:17.638970       1 mount_linux.go:135] Mounting cmd (mount) with arguments ([-t efs -o tls fs-2f7ccfac:/default/test-efs-pod /var/lib/kubelet/pods/befebc5f-ff88-40b1-bab5-8ac778cbf258/volumes/kubernetes.io~csi/test-efs-pod-data/mount])

@k8s-ci-robot
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Sep 9, 2020
@k8s-ci-robot
Contributor

Welcome @tatodorov!

It looks like this is your first PR to kubernetes-sigs/aws-efs-csi-driver 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/aws-efs-csi-driver has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Contributor

Hi @tatodorov. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: tatodorov
To complete the pull request process, please assign d-nishi
You can assign the PR to them by writing /assign @d-nishi in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Sep 9, 2020
@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Sep 9, 2020
@wongma7
Contributor

wongma7 commented Oct 21, 2020

We rely on this path being persisted in case the driver gets restarted. I am not 100% sure about the details, but I have observed that if /etc/amazon/efs/privateKey.pem is not persisted and the driver gets restarted, mounts can hang. The private key is used to make TLS connections with EFS via stunnel.

We do not decide the path; efs-utils does: https://github.com/aws/efs-utils/search?q=%2Fetc%2Famazon%2Fefs&type=code

@jqmichael
Contributor

Since the PR only changes the path in the Volume, it shouldn't affect the path inside the EFS driver container, which is where efs-utils is running.

@wongma7
Contributor

wongma7 commented Oct 21, 2020

@juqing pointed out that the hostPath path is not the same as the mountPath within the container, so the container can still write to its own /etc/amazon/efs/, which is bind mounted to wherever on the host.

However, I am still reluctant to merge this as is because it would break updates: if a user has their privateKey stored on the host at /etc/amazon/efs, then after the update the driver would try to read it from /var/amazon/efs on the host instead. Thinking of a solution....

@wongma7
Contributor

wongma7 commented Oct 21, 2020

I think we need a bottlerocket overlay

@faarshad

I think we need a bottlerocket overlay

FYI, some folks who are migrating from regular Amazon Linux 2 (AL2) to Bottlerocket will need to run both AL2 and Bottlerocket in parallel for some time, and so will need to deploy the CSI driver on both. A separate overlay might fulfill Bottlerocket deployment needs, but it will not be generic!

@wongma7
Contributor

wongma7 commented Oct 21, 2020

If Bottlerocket/AL2 nodes have a well-known label distinguishing them, we could put nodeSelectors in the overlays and instruct users to deploy two DaemonSets. Or we could instruct users to label the nodes themselves and set nodeSelectors for the two DaemonSets accordingly. I want to minimize disruption for users, but I think it is unavoidable that they will have to do something special.

@faarshad

I favor the idea of users setting specific node labels on each of the two node types, with the overlay being generic enough to use a nodeSelector based on those labels.
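The label-plus-nodeSelector idea discussed above could look roughly like the sketch below. The label key `example.com/efs-csi-os`, its values, and the helper name `node_selector_patch` are all hypothetical, since the thread does not settle on any of them. The helper only builds a JSON merge patch that sets a nodeSelector on a DaemonSet's pod template; the `kubectl` commands that would use it are shown as comments because they need cluster access.

```shell
#!/bin/sh
# Hypothetical label key used to tell the two node types apart.
# The discussion does not name one; replace it with your own.
LABEL_KEY="example.com/efs-csi-os"

# Print a JSON merge patch that pins a DaemonSet's pods to nodes
# carrying LABEL_KEY=<value>. $1 is the label value, e.g. "bottlerocket".
node_selector_patch() {
    printf '{"spec":{"template":{"spec":{"nodeSelector":{"%s":"%s"}}}}}' \
        "$LABEL_KEY" "$1"
}

# Example usage (requires cluster access; names in <> are placeholders):
#   kubectl label node <bottlerocket-node> "$LABEL_KEY=bottlerocket"
#   kubectl label node <al2-node> "$LABEL_KEY=al2"
#   kubectl -n kube-system patch daemonset <bottlerocket-daemonset> \
#       --type merge -p "$(node_selector_patch bottlerocket)"
#   kubectl -n kube-system patch daemonset <al2-daemonset> \
#       --type merge -p "$(node_selector_patch al2)"
```

In an overlay-based deployment the same nodeSelector would more likely be written directly into each overlay's manifest; patching a live DaemonSet is used here only to keep the sketch short.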

@jqmichael
Contributor

jqmichael commented Dec 2, 2020

We had an offline chat.

In summary, for the upgrade case, where files are already present at /etc/amazon/efs on the host, the driver will keep writing to the same directory. Otherwise, the driver will start writing to /var/amazon/efs on the host. Within the container, all files will still be read from or written to /etc/amazon/efs.

The specific plan is to mount /var/amazon/efs, the new folder, into the driver container as is, and /etc/amazon/efs, the original folder, into the container as /etc/amazon/efs-x. At startup, the driver will check whether files are already present at /etc/amazon/efs-x. If so, it is the upgrade case, and the driver will add a symlink in the container from /etc/amazon/efs to /etc/amazon/efs-x. Otherwise, it is a fresh host/cluster, and the driver will point the same symlink /etc/amazon/efs at /var/amazon/efs.
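The startup check in this plan can be sketched as a small shell function. This is an illustration only: the function name `link_efs_state_dir` and its three arguments are hypothetical, the directories are passed in rather than hard-coded, and the real driver may implement the check differently. It also assumes the link path does not already exist as a real directory inside the container.

```shell
#!/bin/sh
# Hypothetical helper: choose which mounted host directory backs the path
# that efs-utils uses, then point a symlink at it.
#
# Usage: link_efs_state_dir LEGACY_DIR NEW_DIR LINK_PATH
#   LEGACY_DIR  host /etc/amazon/efs as mounted in the container (/etc/amazon/efs-x)
#   NEW_DIR     host /var/amazon/efs as mounted in the container (/var/amazon/efs)
#   LINK_PATH   path read and written inside the container (/etc/amazon/efs)
link_efs_state_dir() {
    legacy_dir=$1
    new_dir=$2
    link_path=$3

    # Upgrade case: the legacy directory exists and has at least one entry.
    if [ -d "$legacy_dir" ] && [ -n "$(ls -A "$legacy_dir" 2>/dev/null)" ]; then
        target=$legacy_dir
    else
        # Fresh host: use the new directory.
        target=$new_dir
    fi

    # -f replaces an existing link; -n keeps ln from following an existing
    # symlink to a directory. Assumes LINK_PATH is not a real directory.
    ln -sfn "$target" "$link_path"

    # Report the chosen target so callers can log it.
    printf '%s\n' "$target"
}
```

With the paths from the plan, the call would be `link_efs_state_dir /etc/amazon/efs-x /var/amazon/efs /etc/amazon/efs`.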

CC: @wongma7

@webern webern mentioned this pull request Dec 7, 2020
@wongma7
Contributor

wongma7 commented Jan 7, 2021

/close

Superseded by #286, which I am working on getting into the next release (I don't think I will make it this week, but I am shooting for next week).

Thanks all for productive discussion.

@wongma7 wongma7 closed this Jan 7, 2021
@k8s-ci-robot
Contributor

@tatodorov: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 7, 2021