Canal containers give selinux related error message #1691

nheinemans · 2019-10-11T11:22:58Z

RKE version:
0.3.0

Docker version: (docker version,docker info preferred)

Client: Docker Engine - Community
 Version:           19.03.3
 API version:       1.39 (downgraded from 1.40)
 Go version:        go1.12.10
 Git commit:        a872fc2f86
 Built:             Tue Oct  8 00:58:10 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.1
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.6
  Git commit:       4c52b90
  Built:            Wed Jan  9 19:06:30 2019
  OS/Arch:          linux/amd64
  Experimental:     false

Docker daemon.json:

{
  "selinux-enabled": true,
  "userland-proxy": false,
  "bip": "10.10.0.1/24",
  "fixed-cidr": "10.10.0.1/24"
}

Operating system and kernel: (cat /etc/os-release, uname -r preferred)
NAME="Red Hat Enterprise Linux"
VERSION="8.0 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.0"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.0 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8.0:GA"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.0
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.0"

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)
Doesn't matter

cluster.yml file:

cluster_name: name

nodes:
  - address: node1
    user: user
    ssh_key_path: /home/user/.ssh/id_rsa
    role: [controlplane,etcd,worker]
  - address: node2
    user: user
    ssh_key_path: /home/user/.ssh/id_rsa
    role: [controlplane,etcd,worker]
  - address: node3
    user: user
    ssh_key_path: /home/user/.ssh/id_rsa
    role: [controlplane,etcd,worker]

private_registries:
  - url: internal-registry
    is_default: true # All system images will be pulled using this registry. 

services:
  etcd:
    snapshot: true
    creation: 6h
    retention: 24h

Steps to Reproduce:
rke up
Once the cluster is built, I see problems with the canal pods:

kubectl -n kube-system get pods
NAME                                      READY   STATUS                  RESTARTS   AGE
canal-9vg2d                               1/2     Running                 0          45h
canal-ftfrv                               0/2     Init:CrashLoopBackOff   197        16h
canal-l5g2d                               2/2     Running                 0          147m
coredns-5c98fc7769-wbscd                  0/1     CrashLoopBackOff        487        45h
coredns-autoscaler-64c857cf7-qgqwc        1/1     Running                 0          167m
metrics-server-7cf4dfc846-2vvbl           1/1     Running                 34         167m
rke-coredns-addon-deploy-job-kn952        0/1     Completed               0          45h
rke-ingress-controller-deploy-job-f29cv   0/1     Completed               0          45h
rke-metrics-addon-deploy-job-hfsxx        0/1     Completed               0          45h
rke-network-plugin-deploy-job-lfnj4       0/1     Completed               0          45h

Looking into the cni-install pod, I see this error message:

mv: inter-device move failed: '/calico.conf.tmp' to '/host/etc/cni/net.d/10-canal.conflist'; unable to remove target: Permission denied
Failed to mv files. This may be caused by selinux configuration on the host, or something else.

Results:
Cluster doesn't work properly. Setting selinux to permissive is not really an option.


carloscarnero · 2019-10-11T12:59:48Z

Exactly the same thing happened to me after updating a cluster from CentOS 7.6 to 7.7, leading me to believe that something changed in SELinux in the transition (I've checked their release notes and found nothing.)

I "fixed" it by changing the network plugin and using plain flannel for the time being (which was... laborious) but, because of this, I still haven't upgraded CentOS on the other clusters. Also, see projectcalico/calico#2704.

leodotcloud · 2019-11-01T10:52:55Z

While trying to reproduce the problem on a couple of different cloud providers, I noticed that the ip_tables module is not loaded by default in RHEL 8/CentOS 8 VMs.

[root@ip-172-31-16-240 ~]# lsmod | grep ip_tables
[root@ip-172-31-16-240 ~]#

This causes problems with the install. Running modprobe ip_tables loads the module, and the installation then goes through fine with SELinux set to 'Enforcing'.

@nheinemans and @carloscarnero could you check if this step resolves your problem?
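For anyone who wants to script the lsmod check above across several nodes, here is a minimal sketch in Python. It assumes the input is text in the format of /proc/modules (one module per line, name in the first column). The helper names `loaded_modules` and `missing_modules` are made up for this illustration. Note that a module compiled into the kernel does not appear in this list, so an empty result from the check is a hint rather than proof.

```python
def loaded_modules(proc_modules_text):
    """Return the set of module names found in /proc/modules-style text."""
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            # The module name is the first whitespace-separated column.
            names.add(fields[0])
    return names


def missing_modules(required, proc_modules_text):
    """Return the required modules that are not listed, keeping the given order."""
    loaded = loaded_modules(proc_modules_text)
    return [name for name in required if name not in loaded]


# Hypothetical sample: a node where only xfs is loaded.
sample = "xfs 1556480 1 - Live 0x0000000000000000\n"
for name in missing_modules(["ip_tables"], sample):
    print(f"{name} is not loaded; try: modprobe {name}")
```

On a real node, the text would come from reading /proc/modules instead of the hard-coded sample.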

carloscarnero · 2019-11-01T12:06:30Z

I have not upgraded to CentOS 8 yet. Instead, I observed the problem going from 7.6 to 7.7. Thank you @leodotcloud for looking into this!

leodotcloud · 2019-11-01T14:40:28Z

@carloscarnero Where are your machines running (cloud/on-prem)? Any steps to reproduce the problem?

carloscarnero · 2019-11-01T16:48:13Z

> @carloscarnero Where are your machines running (cloud/on-prem)? Any steps to reproduce the problem?

The following is the rke configuration for a three-node on-premises 1.15.5 cluster (some data is obscured/anonymized) that uses an internal registry because this setup is (mostly) air-gapped:

---
cluster_name: development
nodes:
- address: cfdd9f3c.example.com
  user: dockeruser
  role:
  - controlplane
  - etcd
  - worker
- address: b5833011.example.com
  user: dockeruser
  role:
  - controlplane
  - etcd
  - worker
- address: 307309d8.example.com
  user: dockeruser
  role:
  - controlplane
  - etcd
  - worker
network:
  plugin: canal
dns:
  provider: coredns
  upstreamnameservers:
  - 8.8.8.8
ingress:
  provider: none
system_images:
  etcd: example.com/rancher/coreos-etcd:v3.3.10-rancher1
  alpine: example.com/rancher/rke-tools:v0.1.50
  nginx_proxy: example.com/rancher/rke-tools:v0.1.50
  cert_downloader: example.com/rancher/rke-tools:v0.1.50
  kubernetes: example.com/rancher/hyperkube:v1.15.5-rancher1
  kubernetes_services_sidecar: example.com/rancher/rke-tools:v0.1.50
  pod_infra_container: example.com/rancher/pause:3.1
  kubedns: example.com/rancher/k8s-dns-kube-dns-amd64:1.15.0
  dnsmasq: example.com/rancher/k8s-dns-dnsmasq-nanny-amd64:1.15.0
  kubedns_sidecar: example.com/rancher/k8s-dns-sidecar-amd64:1.15.0
  kubedns_autoscaler: example.com/rancher/cluster-proportional-autoscaler:1.3.0
  coredns: example.com/rancher/coredns:1.3.1
  coredns_autoscaler: example.com/rancher/cluster-proportional-autoscaler:1.3.0
  flannel: example.com/rancher/coreos-flannel:v0.11.0-rancher1
  flannel_cni: example.com/rancher/coreos-flannel-cni:v0.3.0-rancher5
  calico_node: example.com/rancher/calico-node:v3.7.4
  calico_cni: example.com/rancher/calico-cni:v3.7.4
  calico_controllers: example.com/rancher/calico-kube-controllers:v3.7.4
  calico_ctl: example.com/rancher/calico-ctl:v2.0.0
  canal_node: example.com/rancher/calico-node:v3.7.4
  canal_cni: example.com/rancher/calico-cni:v3.7.4
  canal_flannel: example.com/rancher/coreos-flannel:v0.11.0
  weave_node: example.com/rancher/weave-kube:2.5.2
  weave_cni: example.com/rancher/weave-npc:2.5.2
  ingress: example.com/rancher/nginx-ingress-controller:nginx-0.25.1-rancher1
  ingress_backend: example.com/rancher/nginx-ingress-controller-defaultbackend:1.5-rancher1
  metrics_server: example.com/rancher/metrics-server:v0.3.3

The nodes run CentOS 7.7, fully updated, and the documented requirements were taken into account during basic system configuration. SELinux is enabled and enforcing, of course, and that is what prevents calico/canal from starting.

leodotcloud · 2019-11-01T17:12:00Z

Thanks @carloscarnero for sharing the detailed info. I will try to reproduce it on my end. I have one more question. Do you use the stock CentOS ISO to bring up the machines or do you have any other customizations done?

carloscarnero · 2019-11-01T17:59:11Z

> Thanks @carloscarnero for sharing the detailed info. I will try to reproduce it on my end. I have one more question. Do you use the stock CentOS ISO to bring up the machines or do you have any other customizations done?

I'm using the CentOS minimal install option, practically vanilla. There are close to no customizations, except that I remove the firewalld service and install iptables, which is a fully supported option (besides, it has worked that way for close to two years.)

One more thing I have just discovered: I was wrong that this happened during the upgrade from CentOS 7.6 to 7.7... I just checked a non-upgraded cluster and it was already failing (fragment):

NAMESPACE     NAME                     READY  STATUS                  RESTARTS   AGE
kube-system   canal-2xvgm              0/2    Init:CrashLoopBackOff   4289       15d
kube-system   canal-7x2tq              2/2    Running                 0          27d
kube-system   canal-jsfrm              0/2    Init:CrashLoopBackOff   2814       9d
kube-system   canal-rpgpd              0/2    Init:CrashLoopBackOff   2808       9d
kube-system   canal-tbvt5              0/2    Init:CrashLoopBackOff   2808       9d
kube-system   canal-vqbh8              2/2    Running                 0          27d
kube-system   canal-xcfqb              0/2    Init:CrashLoopBackOff   2808       9d
kube-system   coredns-795fc698b-68qjv  1/1    Running                 4          27d
kube-system   coredns-795fc698b-xthjd  1/1    Running                 5          27d

The above comes from a seven-node cluster, configured with the same settings as before, and you can see that five pods failed, and two are running.

I am 100% certain that the OS settings are the same on all nodes, as they're managed via Ansible. The logs for the failing pods show exactly the same message as in the opening post of this issue:

mv: inter-device move failed: '/calico.conf.tmp' to …
  '/host/etc/cni/net.d/10-canal.conflist'; unable to …
  remove target: Permission denied

Every time the previous message pops up, there's a corresponding one on the SELinux audit log:

type=AVC msg=audit(1572642338.208:17704): avc: …
  denied  { unlink } for  pid=16000 comm="mv" …
  name="10-canal.conflist" dev="sda1" ino=891087 …
  scontext=system_u:system_r:container_t:s0:c387,c438 …
  tcontext=system_u:object_r:container_file_t:s0:c308,c873 …
  tclass=file permissive=0
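To make the audit record easier to read, here is a small Python sketch that splits an AVC line into its fields and compares the MCS categories of the source and target contexts. It assumes the "…" marks in the excerpt above are only line-wrap markers, so the sample below is the same record joined onto one line. The helpers `parse_avc` and `mcs_categories` are hypothetical names for this illustration, and the category parsing only handles simple comma-separated lists, not ranges.

```python
import re

# The record from the audit log above, joined onto a single line.
AVC_LINE = (
    'type=AVC msg=audit(1572642338.208:17704): avc: '
    'denied  { unlink } for  pid=16000 comm="mv" '
    'name="10-canal.conflist" dev="sda1" ino=891087 '
    'scontext=system_u:system_r:container_t:s0:c387,c438 '
    'tcontext=system_u:object_r:container_file_t:s0:c308,c873 '
    'tclass=file permissive=0'
)


def parse_avc(line):
    """Extract the denied permissions and the key=value fields of one AVC record."""
    perms = re.search(r"\{([^}]*)\}", line)
    pairs = re.findall(r'(\w+)=("[^"]*"|\S+)', line)
    fields = {key: value.strip('"') for key, value in pairs}
    fields["permissions"] = perms.group(1).split() if perms else []
    return fields


def mcs_categories(context):
    """Return the categories of a user:role:type:level:categories context."""
    parts = context.split(":", 4)
    # Contexts without a category part yield an empty set.
    return set(parts[4].split(",")) if len(parts) == 5 else set()


record = parse_avc(AVC_LINE)
source = mcs_categories(record["scontext"])
target = mcs_categories(record["tcontext"])
print(record["comm"], "was denied", record["permissions"], "on", record["name"])
print("process categories:", sorted(source))
print("file categories:   ", sorted(target))
print("categories match:  ", source == target)
```

In this record the process runs with c387,c438 while the existing file carries c308,c873. A category mismatch like this is a plausible explanation for why the container may not unlink the file, although the sketch itself only reports the difference.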

carloscarnero · 2019-11-02T03:16:47Z

From the discussion in projectcalico/calico#2704 it seems that

securityContext:
  privileged: true

is needed in order to properly handle SELinux systems. Thus, I edited the running canal daemonset with kubectl -n kube-system edit daemonset/canal and added those lines to the init container named install-cni.

After saving, the pods immediately reached the running state, and no more errors were logged. Maybe this suggests that those lines are missing in the template?
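For clusters where hand-editing the daemonset is impractical, the same change can be expressed as a JSON Patch (RFC 6902). The Python sketch below only builds the patch document from a daemonset that has already been fetched as JSON. The function name `privileged_patch` and the sample object are assumptions for this illustration, not part of RKE or of the canal template.

```python
import json


def privileged_patch(daemonset, container_name="install-cni"):
    """Build a JSON Patch that marks one init container as privileged."""
    pod_spec = daemonset["spec"]["template"]["spec"]
    for index, container in enumerate(pod_spec.get("initContainers", [])):
        if container.get("name") != container_name:
            continue
        path = f"/spec/template/spec/initContainers/{index}/securityContext"
        if "securityContext" in container:
            # Keep existing settings and add only the one key.
            return [{"op": "add", "path": path + "/privileged", "value": True}]
        return [{"op": "add", "path": path, "value": {"privileged": True}}]
    raise ValueError(f"no init container named {container_name!r}")


# Hypothetical, heavily trimmed daemonset with a single init container.
sample = {
    "spec": {"template": {"spec": {"initContainers": [{"name": "install-cni"}]}}}
}
print(json.dumps(privileged_patch(sample)))
```

The printed document could then be passed to kubectl -n kube-system patch daemonset canal --type=json -p '<patch>', which should have the same effect as the manual edit described above.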

carloscarnero · 2019-11-06T21:57:04Z

@leodotcloud I have tried the fix above on another cluster, and it works there as well.

superseb · 2020-03-05T19:20:52Z

RHEL8 support is tracked in rancher/rancher#23045.

To validate the new templates (should show privileged true in the new templates and nothing in the old templates):

Canal

kubectl get ds -n kube-system -l k8s-app=canal -o json | jq .items[].spec.template.spec.initContainers[].securityContext

Calico

kubectl get ds -n kube-system -l k8s-app=calico-node -o json | jq .items[].spec.template.spec.initContainers[].securityContext
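When checking many clusters, the jq output above still has to be read by eye. As a rough alternative, this Python sketch takes the JSON printed by kubectl get ds ... -o json and lists every init container that is not privileged. The function name `unprivileged_init_containers` and the sample data are hypothetical and only serve this example.

```python
import json


def unprivileged_init_containers(daemonset_list):
    """Return (daemonset, container) name pairs whose init container is not privileged."""
    missing = []
    for item in daemonset_list.get("items", []):
        ds_name = item.get("metadata", {}).get("name", "")
        pod_spec = item["spec"]["template"]["spec"]
        for container in pod_spec.get("initContainers", []):
            # A missing or null securityContext counts as not privileged.
            context = container.get("securityContext") or {}
            if context.get("privileged") is not True:
                missing.append((ds_name, container.get("name", "")))
    return missing


# Hypothetical, trimmed output resembling an old template (no securityContext).
old_template = json.loads("""
{"items": [{"metadata": {"name": "canal"},
            "spec": {"template": {"spec": {
                "initContainers": [{"name": "install-cni"}]}}}}]}
""")
print(unprivileged_init_containers(old_template))
```

With real data, the input would be loaded from the kubectl output (for example via json.load(sys.stdin)). An empty list corresponds to the "privileged: true" result expected from the new templates.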

superseb · 2020-03-05T20:33:30Z

@carloscarnero If you can test this change on some lab machines which are identical to the ones that were exhibiting the problem, that would be appreciated

soumyalj · 2020-03-05T22:26:31Z

Reproduced the issue with RKE version v0.3.2 and the Canal network plugin.
The security context in the template returns null, as shown below:

soumyas-MBP:rke soumya$ kubectl --kubeconfig kube_config_clusterzero.yml get ds -n kube-system -l k8s-app=canal -o json | jq .items[].spec.template.spec.initContainers[].securityContext 
null

Tested with RKE version v1.1.0-rc11.
Created a 3-node cluster with all 3 roles, on K8s version K8s1.15.10-rancher1-2, for each of the configurations below. The cluster came up successfully in each case, and the security context in the template returns privileged: true.

RHEL 7.7 nodes, native docker and SELINUX ON - Canal network plugin

kubectl --kubeconfig kube_config_clusterzero.yml get ds -n kube-system -l k8s-app=canal -o json | jq .items[].spec.template.spec.initContainers[].securityContext 
{
  "privileged": true
}

RHEL 7.7 nodes, native docker and SELINUX ON - Calico network plugin

soumyas-MBP:rke soumya$ kubectl --kubeconfig kube_config_clusterzero.yml get ds -n kube-system -l k8s-app=calico-node -o json | jq .items[].spec.template.spec.initContainers[].securityContext 
{
  "privileged": true
}
{
  "privileged": true
}

RHEL 7.7 nodes, native docker and SELINUX OFF - Canal network plugin

 kubectl get ds -n kube-system -l k8s-app=canal -o json | jq .items[].spec.template.spec.initContainers[].securityContext
{
  "privileged": true
}

RHEL 7.7 nodes, upstream docker and SELINUX ON - Canal network plugin.

kubectl get ds -n kube-system -l k8s-app=canal -o json | jq .items[].spec.template.spec.initContainers[].securityContext
{
  "privileged": true
}

Automation tests were also run on the above setups with Canal network plugin and no issues were found.

carloscarnero · 2020-03-05T22:26:32Z

> @carloscarnero If you can test this change on some lab machines which are identical to the ones that were exhibiting the problem, that would be appreciated

@superseb I'm not clear what I should test. I mean... should I use rke v1.1.0-rc11? If that's the case, should I test against one of that version's supported K8s?

EDIT: based on the previous comment, I will test with v1.1.0-rc11 and K8s1.15.10-rancher1-2. The operating system is CentOS 7.7, completely updated, with SELinux enabled and enforcing. This will take some time because all my setups are air-gapped and I have to prime the internal registry.

carloscarnero · 2020-03-06T15:48:35Z

Success using v1.1.0-rc11 and K8s1.15.10-rancher1-2 on CentOS 7.7 with enforcing SELinux! Note, however:

I had to use the CoreDNS images from v1.16.7-rancher1-2, rancher/coredns-coredns:1.6.2, instead of rancher/coredns-coredns:1.3.1 because the latter was failing with an error (pod logs reported that the --nodelabel option was incorrect, and I assumed that it was introduced later.)
I had to specify the calico_flexvol and canal_flexvol images in the config.yml because the nodes were trying to get them from the Internet, not sure why (that failed because this is an air-gapped setup.) I used the values from v1.16.7-rancher1-2.

Next test is upgrading from 1.15.5 to 1.15.10, and will report back in this very comment to avoid further noise.

EDIT: A cluster upgrade into 1.15.10 from 1.15.5 was successful! The canal pods are privileged and running properly.

superseb · 2020-03-11T12:13:29Z

Thanks for testing

deniseschannon added kind/bug team/ca labels Oct 17, 2019

deniseschannon added this to the v0.3.2 milestone Oct 17, 2019

alena1108 assigned leodotcloud Oct 17, 2019

alena1108 added the [zube]: Next Up label Oct 17, 2019

deniseschannon modified the milestones: v0.3.2, v1.0.0 Oct 22, 2019

sangeethah assigned soumyalj Nov 1, 2019

carloscarnero mentioned this issue Nov 2, 2019

Made calico init containers privileged in order to prevent issues on … rancher/kontainer-driver-metadata#75

Closed

deniseschannon modified the milestones: v1.0 - Rancher v2.3.3, v1.0.x - Rancher 2.3.x Nov 14, 2019

deniseschannon added [zube]: Backlog A and removed [zube]: Next Up labels Nov 14, 2019

deniseschannon assigned superseb and unassigned leodotcloud Dec 18, 2019

deniseschannon added the [zube]: Next Up label Dec 18, 2019

zube bot removed the [zube]: Backlog A label Dec 18, 2019

deniseschannon modified the milestones: v1.0.1 - Rancher 2.3.4, v1.1 - Rancher v2.4 Dec 18, 2019

superseb added [zube]: Working and removed [zube]: Next Up labels Jan 1, 2020

superseb mentioned this issue Jan 3, 2020

Add privileged true to fix SELinux issues rancher/kontainer-driver-metadata#105

Closed

superseb added [zube]: Peer Review and removed [zube]: Working labels Jan 3, 2020

deniseschannon added [zube]: Review and removed [zube]: Peer Review labels Feb 25, 2020

superseb mentioned this issue Feb 28, 2020

Add nodelocal DNS and add privileged to CNI rancher/kontainer-driver-metadata#149

Merged

maggieliu removed the team/ca label Mar 2, 2020

superseb added [zube]: To Test and removed [zube]: Review labels Mar 3, 2020

soumyalj closed this as completed Mar 10, 2020

zube bot added [zube]: Done and removed [zube]: To Test labels Mar 10, 2020
