
Unable to build cluster with CIS profile (cis-1.5) enabled #851

Closed
grk-pancham opened this issue Apr 6, 2021 · 36 comments

Comments

@grk-pancham

We are unable to build a cluster with the CIS profile (cis-1.5) enabled. I think it is failing the initial CIS benchmark checks and aborts with the error below due to unmet requirements. We are using containerd for the container runtime. Where are the initial setup requirements documented so that the initial CIS checks pass and rke2 can build the cluster successfully?

Error:
missing required user: unknow user etcd
invalid kernel parameter value vm.overcommit_memory=0 - expected 1
invalid kernel parameter value kernel.panic=0 - expected 10

Version: v1.19.7+rke2r1

Config:
write-kubeconfig-mode: "0600"
write-kubeconfig: /app/rke2/kube-config.yaml
data-dir: /app/rke2
cluster-cidr: "10.42.0.0/16"
service-cidr: "10.43.0.0/16"
disable:
  - rke2-canal
cloud-provider-name: "aws"
tls-san:
  - ""
node-name: ""
node-label:
  - "server=rke2-server-dev"
profile: "cis-1.5"
selinux: true

@brandond
Contributor

brandond commented Apr 6, 2021

Have you checked out the CIS hardening guide in the docs?
https://docs.rke2.io/security/hardening_guide/
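
For anyone hitting the same checks: a rough sketch of the host prep the hardening guide describes (verify the paths and values against the guide for your release; the sysctl file path differs between tarball and RPM installs):

# create the etcd system user the CIS check looks for
useradd -r -c "etcd user" -s /sbin/nologin -M etcd
# apply the kernel parameters the checks expect, using the example file RKE2 ships
cp -f /usr/local/share/rke2/rke2-cis-sysctl.conf /etc/sysctl.d/60-rke2-cis.conf
systemctl restart systemd-sysctl
sysctl vm.overcommit_memory kernel.panic   # should now report 1 and 10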

@grk-pancham
Author

Thanks Brad for pointing out the docs; I somehow missed that part.

I am able to get past the initial failure, but now the kubelet does not seem to start the other k8s services like the API server, scheduler, etc. The kubelet is trying to reach the API server to register itself, but the API server is not running, so it is basically stuck. Please advise.

@brandond
Contributor

brandond commented Apr 6, 2021

How long have you given it? Are you using a private registry or airgap image archive to mirror the images locally? It can take a bit to start up the first time as it pulls all the various images and it won't appear to be doing anything until the etcd and apiserver pods are running.
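
A couple of ways to watch the first start rather than waiting blind (a sketch; substitute your data-dir if you changed it from the default):

journalctl -u rke2-server -f
tail -f /var/lib/rancher/rke2/agent/containerd/containerd.log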

@grk-pancham
Author

grk-pancham commented Apr 6, 2021

I have given it long enough, but it looks like it is stuck. Looking at the kubelet logs, it seems it is unable to start the containerd process to spin up the k8s services. Any idea why it is failing to start containerd? Is it possible that the containerd configuration is missing something required for CIS to be met?

E0406 22:20:09.724172   17733 remote_runtime.go:113] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = failed to create containerd task: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: failed to set /proc/self/attr/keycreate on procfs: write /proc/self/attr/keycreate: invalid argument: unknown
E0406 22:20:09.724213   17733 kuberuntime_sandbox.go:69] CreatePodSandbox for pod "internal_kube-system(1be4fc34bdb6056763aa9650087de0fb)" failed: rpc error: code = Unknown desc = failed to create containerd task: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: failed to set /proc/self/attr/keycreate on procfs: write /proc/self/attr/keycreate: invalid argument: unknown
E0406 22:20:09.724228   17733 kuberuntime_manager.go:741] createPodSandbox for pod "internal_kube-system(1be4fc34bdb6056763aa9650087de0fb)" failed: rpc error: code = Unknown desc = failed to create containerd task: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: failed to set /proc/self/attr/keycreate on procfs: write /proc/self/attr/keycreate: invalid argument: unknown
E0406 22:20:09.724311   17733 pod_workers.go:191] Error syncing pod 1be4fc34bdb6056763aa9650087de0fb ("internal_kube-system(1be4fc34bdb6056763aa9650087de0fb)"), skipping: failed to "CreatePodSandbox" for "etcd-ip-10-12-137-185.us-gov-west-1.compute.internal_kube-system(1be4fc34bdb6056763aa9650087de0fb)" with CreatePodSandboxError: "CreatePodSandbox for pod \"internal_kube-system(1be4fc34bdb6056763aa9650087de0fb)\" failed: rpc error: code = Unknown desc = failed to create containerd task: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: failed to set /proc/self/attr/keycreate on procfs: write /proc/self/attr/keycreate: invalid argument: unknown"

@brandond
Contributor

brandond commented Apr 6, 2021

Is this on a selinux-enabled system? Did you install the correct selinux packages? What distro and kernel is this host running?

The best clue I have is at opencontainers/runc#2031 (comment) which suggests this is caused by older kernels + odd selinux configuration?
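
For reference, some quick checks for the questions above (nothing rke2-specific here, just standard RHEL tooling):

sestatus                                              # selinux mode and policy
cat /etc/redhat-release; uname -r                     # distro and kernel
rpm -qa | grep -E 'rke2-selinux|container-selinux'    # installed selinux policy packages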

@grk-pancham
Author

Correct. sestatus is enforcing.

We are using RHEL 7.9 and the kernel is 3.10.0-1160.15.2.el7.x86_64.

@brandond
Contributor

brandond commented Apr 7, 2021

We validate on RHEL7 and 8 and I haven't seen this. How did you install RKE2? Can you confirm that you installed from RPM, and have the required rke2-selinux packages (and their dependencies) installed?

Edit: I see that you have customized the data-dir: value. This makes your life MUCH harder on selinux-enabled systems, since all the selinux policies apply to files in their default path of /var/lib/rancher/rke2. Is there any reason in particular that you're changing that? Your life will be much easier if you can keep it in the default location - mount a different disk or partition there if necessary (and then reinstall rke2-selinux to trigger restorecon to label it properly), but don't change it.

We don't specifically call this out in the docs at the moment, but we do have other issues regarding it: #474 (comment)

@grk-pancham
Author

We installed rke2 from the tar file. How do I verify that the rke2-selinux dependencies are installed? If you can point me to some docs that would be great.

We changed the data-dir location since the root filesystem has only limited capacity and is not allowed to grow. Since the data dir will grow over time, I moved it to a new location on a separate EBS volume that we use. Should I remove data-dir and give it a try?

Also, how do I customize the location of the etcd database? I did not see any option in rke2 to customize it.

@brandond
Contributor

brandond commented Apr 7, 2021

We installed rke2 from the tar file. How do I verify that the rke2-selinux dependencies are installed? If you can point me to some docs that would be great.

It is recommended that RKE2 be installed from RPM on selinux-enabled systems, as this ensures that all the selinux dependencies are installed. The tarball install does not use the same paths for RKE2 binaries as the RPM, so even if you installed the rke2-selinux RPM alongside the tarball, it still would not fix your problem.

We changed the data-dir location since the root filesystem has only limited capacity and is not allowed to grow. Since the data dir will grow over time, I moved it to a new location on a separate EBS volume that we use. Should I remove data-dir and give it a try?

I would recommend mounting the secondary EBS volume at /var/lib/rancher and then using the default data-dir value so that you don't have to try to build your own selinux policy. Ensure that this path is mounted when you install RKE2 so that the selinux labels are set properly.

Also, how do I customize the location of the etcd database? I did not see any option in rke2 to customize it.

The etcd database cannot currently be individually relocated; it will always be at $DATADIR/server/db.
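
A rough sketch of the layout being suggested, with a placeholder device name: mount the volume at the default path before installing, so the labels end up on the right filesystem.

mkfs.xfs /dev/nvme1n1                 # placeholder device name for the EBS volume
mkdir -p /var/lib/rancher
echo '/dev/nvme1n1 /var/lib/rancher xfs defaults 0 0' >> /etc/fstab
mount /var/lib/rancher
# install rke2-server/rke2-selinux from RPM afterwards; if RKE2 was already
# installed, relabel the tree:
restorecon -R /var/lib/rancher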

@grk-pancham
Author

Hi Brad - I am unable to install rke2 using RPM. How do I fix this issue? I did not provide any version or install type before running the install script. Are the RPM repos available now?

failure: repodata/repomd.xml.asc from rancher-rke2-1.20-stable: [Errno 256] No more mirrors to try.
https://rpm.rancher.io/rke2/stable/1.20/centos/7/x86_64/repodata/repomd.xml.asc: [Errno 14] HTTPS Error 404 - Not Found

@grk-pancham
Author

Hi Brad - on a side note, I removed the data-dir and selinux options from the config file, keeping just the profile: cis-1.5 option. I was able to get to a point where it starts all the services, but when I check the node status with kubectl it says the node is NotReady. I found that the Calico install was not run. I dropped the two Tigera YAML files in the server/manifests folder, but it looks like rke2 did not pick them up to install Calico. I ran the Tigera YAML files manually but do not see the calico node created in the kube-system or tigera-operator namespace. How do I install Calico on rke2? I followed these docs to install Calico.

@brandond
Contributor

brandond commented Apr 7, 2021

That's not a file that exists as part of our Yum repo; I'm not sure why your system is looking for it. Is this what originally led you to installing via tarball? Can you compare your repo file?

[root@centos01 ~]# cat /etc/yum.repos.d/rancher-rke2.repo
[rancher-rke2-common-stable]
name=Rancher RKE2 Common (stable)
baseurl=https://rpm.rancher.io/rke2/stable/common/centos/7/noarch
enabled=1
gpgcheck=1
gpgkey=https://rpm.rancher.io/public.key
[rancher-rke2-1.20-stable]
name=Rancher RKE2 1.20 (stable)
baseurl=https://rpm.rancher.io/rke2/stable/1.20/centos/7/x86_64
enabled=1
gpgcheck=1
gpgkey=https://rpm.rancher.io/public.key

For the CNI issue - have you disabled canal on all of your servers? Are there any errors in the rke2-server logs regarding deployment of those manifests?
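
A rough way to check both questions (the config path is the default /etc/rancher/rke2/config.yaml; adjust if yours lives elsewhere):

grep -A1 'disable' /etc/rancher/rke2/config.yaml    # should list rke2-canal on every server
journalctl -u rke2-server --no-pager | grep -iE 'manifest|deploy' | tail -n 50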

@grk-pancham
Author

grk-pancham commented Apr 8, 2021

Hi Brad - I have the exact same yum repo file, except it references 1.19, but I am still not sure why yum install is failing on that specific URL.

[rancher-rke2-common-stable]
name=Rancher RKE2 Common (stable)
baseurl=https://rpm.rancher.io/rke2/stable/common/centos/7/noarch
enabled=1
gpgcheck=1
gpgkey=https://rpm.rancher.io/public.key
[rancher-rke2-1.19-stable]
name=Rancher RKE2 1.19 (stable)
baseurl=https://rpm.rancher.io/rke2/stable/1.19/centos/7/x86_64
enabled=1
gpgcheck=1
gpgkey=https://rpm.rancher.io/public.key
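
Worth ruling out stale metadata before digging further; a generic retry:

yum clean all
yum makecache
yum install -y rke2-server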

@grk-pancham
Author

grk-pancham commented Apr 22, 2021

Hi Brad - Sorry for the late reply. I was finally able to install the RKE2 RPMs with a local install. This time I see in the kubelet log file that the container runtime is not ready, even though I have copied the calico manifest files to the "/var/lib/rancher/rke2/server/manifests" directory. Does RKE2 run these manifests every time the rke2-server service starts, or only once? Any idea why kubelet would say the container runtime is not ready?

RPMS installed:
container-selinux-2.119.2-1.911c772.el7_8.noarch.rpm
rke2-common-1.19.9~rke2r1-0.el8.x86_64.rpm
rke2-selinux-0.4-1.el8.noarch.rpm
rke2-server-1.19.9~rke2r1-0.el8.x86_64.rpm

Error in kubelet
E0422 03:54:16.745706 17705 kubelet.go:2134] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized

Error in containerd
error="failed to create containerd task: OCI runtime create failed: container_linux.go:367: starting container process caused: process_linux.go:495: container init caused: failed to set /proc/self/attr/keycreate on procfs: write /proc/self/attr/keycreate: invalid argument: unknown"

@brandond
Contributor

brandond commented Apr 22, 2021

failed to set /proc/self/attr/keycreate on procfs: write /proc/self/attr/keycreate: invalid argument: unknown

opencontainers/runc#2030

This appears to be an error in runc, but should be fixed in the version included in RKE2. Do you only see this error when attempting to use your own CNI plugin, or do you get the same thing with Canal?

@grk-pancham
Author

grk-pancham commented Apr 22, 2021

Hi Brad,

These are the steps I performed before starting rke2-server via the service. It looks like I am missing some install step here; please advise.

  1. Installed containerd via the tar file. This installs the ctr client, and when I run ctr version I get the client and server version. Note that I have not installed the containerd.io package mentioned in the Kubernetes containerd install page ("Install the containerd.io package from the official Docker repositories. Instructions for setting up the Docker repository for your respective Linux distribution and installing the containerd.io package can be found at Install Docker Engine."). Is that required?
  2. Installed the RPMs below:
     container-selinux-2.119.2-1.911c772.el7_8.noarch.rpm
     rke2-common-1.19.9~rke2r1-0.el8.x86_64.rpm
     rke2-selinux-0.4-1.el8.noarch.rpm
     rke2-server-1.19.9~rke2r1-0.el8.x86_64.rpm
  3. Copied the calico manifests to the /var/lib/rancher/rke2/server/manifests directory.

After this I started the rke2 server. It starts the kubelet, which starts the pods for the API server, etc., but the pods fail to start because the CNI is not initialized, with the error I gave earlier. Does this mean that containerd is not installed properly, or that the CNI plugin is not installed? I thought Calico would install the CNI plugins. Do I need to follow the steps to install the CNI plugin mentioned on this page:
https://docs.projectcalico.org/getting-started/kubernetes/hardway/install-cni-plugin. I also see that the --network-plugin=cni parameter is missing in the kubelet process.

@brandond
Contributor

Wait, why are you running your own containerd? RKE2 includes its own containerd, and the selinux policies we install will only work for paths used by our containerd, not a user-provided containerd.

@grk-pancham
Author

That is good to know. I will skip the containerd install and try again.

So you do not think I need to install the CNI plugin for Calico that I mentioned earlier?

@grk-pancham
Author

Hi Brad - I skipped the manual containerd install but installed the CNI plugin as mentioned in the Calico page, and I still get the same error. Am I missing any step here?

@brandond
Contributor

At this point I would probably just follow the quick-start instructions and get a basic installation working. Once that is done, try again replacing canal with calico.
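
The quick-start flow, roughly (see https://docs.rke2.io/install/quickstart for the authoritative steps):

curl -sfL https://get.rke2.io | sh -
systemctl enable --now rke2-server.service
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
/var/lib/rancher/rke2/bin/kubectl get nodes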

@grk-pancham
Author

Hi Brad - As suggested, I started from the quick start and was finally able to build a cluster with RKE2 and Calico. However, I see that the DNS service is not running in kube-system and see errors in the CoreDNS pod. Any idea why this would happen?

.:53
[INFO] plugin/reload: Running configuration MD5 = 7da3877dbcacfd983f39051ecafd33bd
CoreDNS-1.6.9
linux/amd64, go1.15.8b5, 17665683
[ERROR] plugin/errors: 2 7533043099195048891.7991570519037476386. HINFO: read udp 192.168.88.130:60721->10.11.176.235:53: i/o timeout
[ERROR] plugin/errors: 2 7533043099195048891.7991570519037476386. HINFO: read udp 192.168.88.130:56845->10.11.176.134:53: i/o timeout
[ERROR] plugin/errors: 2 7533043099195048891.7991570519037476386. HINFO: read udp 192.168.88.130:54198->10.11.176.235:53: i/o timeout
[ERROR] plugin/errors: 2 7533043099195048891.7991570519037476386. HINFO: read udp 192.168.88.130:54023->10.11.176.235:53: i/o timeout
[ERROR] plugin/errors: 2 7533043099195048891.7991570519037476386. HINFO: read udp 192.168.88.130:53421->10.11.176.235:53: i/o timeout
[ERROR] plugin/errors: 2 7533043099195048891.7991570519037476386. HINFO: read udp 192.168.88.130:50191->10.11.176.235:53: i/o timeout
[ERROR] plugin/errors: 2 7533043099195048891.7991570519037476386. HINFO: read udp 192.168.88.130:33279->10.11.176.235:53: i/o timeout
[ERROR] plugin/errors: 2 7533043099195048891.7991570519037476386. HINFO: read udp 192.168.88.130:39477->10.11.176.235:53: i/o timeout
[ERROR] plugin/errors: 2 7533043099195048891.7991570519037476386. HINFO: read udp 192.168.88.130:51302->10.11.176.235:53: i/o timeout
[ERROR] plugin/errors: 2 7533043099195048891.7991570519037476386. HINFO: read udp 192.168.88.130:58005->10.11.176.235:53: i/o timeout

@brandond
Contributor

brandond commented Apr 23, 2021

Wait, is it not running, or is it running with errors?

I don't recognize any of those IP addresses - they're not in any of the normal cluster CIDR ranges. Are you able to identify them within your environment?

@grk-pancham
Author

I think this CIDR (192.*) is from the Calico manifests. I am going to update the CALICO_IPV4POOL_CIDR value to the cluster CIDR in the manifests and reinstall Calico. Will update you soon.
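
A sketch of that change, assuming the calico.yaml manifest style (the operator's custom-resources.yaml uses spec.calicoNetwork.ipPools[].cidr instead); the value here is the cluster-cidr from the config at the top of this issue:

# point Calico's default pool at the rke2 cluster-cidr before applying
sed -i 's|192\.168\.0\.0/16|10.42.0.0/16|' calico.yaml
grep -A1 CALICO_IPV4POOL_CIDR calico.yaml   # confirm the value now shows 10.42.0.0/16
kubectl apply -f calico.yaml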

@grk-pancham
Author

I updated the CALICO_IPV4POOL_CIDR value to the cluster CIDR in the Calico manifests and reinstalled Calico, but I am still getting the same error. The CoreDNS pods are up but show the errors I mentioned above. Any idea why CoreDNS is failing?

@brandond
Contributor

Did you rebuild the cluster? It's pretty hard to change cidrs once the cluster is up.

@grk-pancham
Author

Yes, I had to rebuild the cluster. Do we need to install any DNS add-on? According to the Kubernetes docs, I could be missing an add-on:
https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/
If you see that no CoreDNS Pod is running or that the Pod has failed/completed, the DNS add-on may not be deployed by default in your current environment and you will have to deploy it manually.

@brandond
Contributor

CoreDNS is your dns addon. Are you getting the exact same messages? Did you see the same messages when using the default CNI?
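
For reference, the checks from the dns-debugging page linked above, pointed at the CoreDNS pods (RKE2's packaged chart may use different label or service names, so adjust the selector if nothing matches):

kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=20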

@grk-pancham
Author

Correct, CoreDNS is failing with timeout errors.

I did not look into CoreDNS when I tried the default Canal. What do you mean by default CNI?

@grk-pancham
Author

You are right, Brad - I do not see any issues if I use the default Canal. I see this CoreDNS timeout only when I use Calico.

@grk-pancham
Author

Also, when I run the ctr client, I only get the client version; it times out and fails to return the server version. I see the containerd process is running. Why does it fail to show the server version?

ctr version
Client:
Version: v1.4.4-k3s1
Revision: 70786f0464ebb57cc75df378049a52850d71cc66
Go version: go1.15.8b5

ctr: failed to dial "/run/containerd/containerd.sock": context deadline exceeded
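
RKE2's bundled containerd listens on its own socket rather than the stock /run/containerd/containerd.sock, which would explain the dial timeout. A sketch of talking to it directly, assuming the default data-dir:

/var/lib/rancher/rke2/bin/ctr --address /run/k3s/containerd/containerd.sock version
# or use crictl with the config RKE2 writes out:
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
/var/lib/rancher/rke2/bin/crictl ps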

@grk-pancham
Author

Hi Brad - I just wanted to check: does Calico work on RKE2?

@grk-pancham
Author

Hi Brad - I was able to build the RKE2 k8s cluster with Calico. But I am facing a weird issue now. After the cluster was successfully built and tested, I found that the RKE2 binaries, like rke2 itself, have disappeared from our install dir. The install dir only has containerd binaries. We are installing RKE2 in a custom folder and not in the /var/lib/rancher/rke2 folder. Because of this I could not restart the cluster, since the rke2 binary is missing. Please advise.

@brandond
Contributor

brandond commented Apr 26, 2021

Which binaries are you missing? The main RKE2 binary should install to /usr/local/bin/rke2 or /usr/bin/rke2, depending on whether you're using the tarball or RPM. Everything else gets extracted from the runtime image to $DATADIR/data/$RELEASE/bin/ during startup.
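
Where things normally land, assuming the default data-dir:

which rke2                               # /usr/local/bin/rke2 (tarball) or /usr/bin/rke2 (RPM)
ls /var/lib/rancher/rke2/data/*/bin/     # kubelet, containerd, etc., extracted at startup
ls /var/lib/rancher/rke2/bin/            # kubectl, crictl, ctr convenience copies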

@grk-pancham
Author

I used INSTALL_RKE2_ARTIFACT_PATH to install the rke2 binaries in the /app/rke2 folder. Now they are missing from there after 12 hours. I successfully tested the cluster after running the rke2 server and everything was running fine.
BTW, they are also not at /usr/local/bin/rke2 or /usr/bin/rke2.

@brandond
Contributor

INSTALL_RKE2_ARTIFACT_PATH is just the path where the tarballs or RPMs and checksums should be found when the install script is run, it is NOT the location that RKE2 is installed to. We don't delete them at the end of the install script, so I am guessing something or someone else is responsible for their removal. Did you perhaps put them in a temporary directory that is cleaned up nightly?
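
A sketch of the air-gapped flow for contrast: the artifact path is only an input to the installer, not an install prefix (the directory name here is a placeholder).

INSTALL_RKE2_ARTIFACT_PATH=/app/rke2-artifacts sh install.sh
systemctl enable --now rke2-server.service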

@stale

stale bot commented Oct 23, 2021

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 180 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.

@stale stale bot added the status/stale label Oct 23, 2021
@stale stale bot closed this as completed Nov 7, 2021