Rook/Ceph on k3s on Fedora CoreOS is extremely slow #329

Closed
stellirin opened this issue Dec 15, 2019 · 22 comments
@stellirin

I use Fedora CoreOS to set up a 3-node k3s cluster on local hardware (3x Core i5, 12 GB RAM). I boot each node manually via PXE and the k3s cluster eventually configures itself as expected.

Each node has a second HDD so I want to set up a distributed file system with Rook/Ceph. I basically follow the default installation process (example instructions on medium.com).

I am able to bring up the Ceph cluster but provisioning is extremely slow. Most instructions suggest this should not take more than a few minutes, but for me on Fedora CoreOS it takes over 6 hours.

I have an issue open at rook/rook#4470 but I suspect it is something about Fedora CoreOS that causes this behaviour. Many commands executed by the Ceph processes hang for approximately 7 minutes, with top showing a process stuck at 100% CPU on a single core.

In the logs I see even trivial commands like ln -s get stuck:

2019-12-11 01:08:15.851205 I | Running command: /usr/sbin/vgcreate -s 1G --force --yes ceph-36ac78de-a0dd-47e7-b63f-1bcd937c500b /dev/sdb
2019-12-11 01:15:41.436758 I |  stdout: Physical volume "/dev/sdb" successfully created.
2019-12-11 01:15:41.484904 I |  stdout: Volume group "ceph-36ac78de-a0dd-47e7-b63f-1bcd937c500b" successfully created
2019-12-11 01:23:07.163973 I | Running command: /usr/sbin/lvcreate --yes -l 298 -n osd-data-8eadc725-d865-4be2-8ac7-25be2fca23ad ceph-36ac78de-a0dd-47e7-b63f-1bcd937c500b
2019-12-11 01:30:32.676035 I |  stdout: Logical volume "osd-data-8eadc725-d865-4be2-8ac7-25be2fca23ad" created.
2019-12-11 02:22:12.183093 I | Running command: /bin/ceph-authtool --gen-print-key
2019-12-11 02:29:37.255539 I | Running command: /bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new edd97953-8143-4a93-9ed1-ce86e922bc5b
2019-12-11 04:33:08.015851 I | Running command: /bin/ceph-authtool --gen-print-key
2019-12-11 04:40:34.765191 I | Running command: /bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-2
2019-12-11 04:47:59.120653 I | Running command: /bin/chown -h ceph:ceph /dev/ceph-36ac78de-a0dd-47e7-b63f-1bcd937c500b/osd-data-8eadc725-d865-4be2-8ac7-25be2fca23ad
2019-12-11 04:55:32.812044 I | Running command: /bin/chown -R ceph:ceph /dev/mapper/ceph--36ac78de--a0dd--47e7--b63f--1bcd937c500b-osd--data--8eadc725--d865--4be2--8ac7--25be2fca23ad
2019-12-11 05:02:57.915667 I | Running command: /bin/ln -s /dev/ceph-36ac78de-a0dd-47e7-b63f-1bcd937c500b/osd-data-8eadc725-d865-4be2-8ac7-25be2fca23ad /var/lib/ceph/osd/ceph-2/block
2019-12-11 05:10:05.876488 I | Running command: /bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o /var/lib/ceph/osd/ceph-2/activate.monmap

Unfortunately I'm new to Ceph so debugging this is slow, and I don't see this behaviour described anywhere else.

There are a lot of moving parts here, but any insight from the Fedora CoreOS side that may help explain and debug this behaviour will be greatly appreciated.

ignition.yaml

variant: fcos
version: 1.0.0

passwd:
  users:
    - name: core
      ssh_authorized_keys:
        - "ssh-rsa ..."

storage:

  disks:
    - device: /dev/sdb
      wipe_table: true

  files:
    - path: "/usr/local/bin/k3s"
      contents:
        source: "https://github.com/rancher/k3s/releases/download/v1.0.0/k3s"
        verification:
          hash: sha512-bbea8cd86a5c660e646b002a013593d25d9c20bb1194f2d9bd4ecaa60eb9d43bda368e313c5f09df6c6d5d2602e3235bb366336c1a2fd57f3042f05ff09f3de4
      mode: 0755
      user:
        name: root
      group:
        name: root
    - path: "/usr/local/etc/k3s/token"
      contents:
        # uuidgen | base64 -w 0
        source: "data:text/plain;charset=utf-8;base64,NWE5NzAzYzktOTcyOC00NjAyLWE0ZmYtNGRjY2Q0NDIwOGNkCg=="
      mode: 0400
      user:
        name: root
      group:
        name: root

systemd:
  units:

    - name: k3s.service
      enabled: true
      contents: |
        [Unit]
        Description=Lightweight Kubernetes
        Documentation=https://k3s.io
        After=network-online.target

        [Install]
        WantedBy=multi-user.target

        [Service]
        Type=notify
        KillMode=process
        Delegate=yes
        LimitNOFILE=infinity
        LimitNPROC=infinity
        LimitCORE=infinity
        TasksMax=infinity
        TimeoutStartSec=0
        Restart=always
        RestartSec=5s
        ExecStartPre=-/sbin/modprobe br_netfilter
        ExecStartPre=-/sbin/modprobe overlay
        ExecStart=/usr/local/bin/k3s server \
          --cluster-domain example.com \
          --token-file /usr/local/etc/k3s/token

rpm-ostree status

State: idle
AutomaticUpdates: disabled
Deployments:
● ostree://fedora:fedora/x86_64/coreos/testing
                   Version: 31.20191211.1 (2019-12-11T17:33:17Z)
                    Commit: f42843346587ae8cb8a022b5d7364864057fd2f6b54ca0b1dbf6f35faac63e0a
              GPGSignature: Valid signature by 7D22D5867F2A4236474BF7B850CB390B3C3359C4

@lucab
Contributor

lucab commented Dec 16, 2019

Thanks for the report.
It looks like those commands may be hanging for a long time, waiting on I/O. I've no specific Ceph knowledge, but those are usually symptoms of misconfigurations or networking troubles.

I would suggest checking the logs to see if you are hitting any timeouts.
Also, double-check network traffic: if Ceph traffic is running on top of the overlay network, there could be a bottleneck there. Silly MTU mismatches, DNS fallbacks, and misrouted multicast are common mistakes.

@lucab
Contributor

lucab commented Dec 16, 2019

Additionally, is this using Ceph via the in-kernel driver, or is it purely containerized userspace?

@stellirin
Author

Rook does claim it requires a modern kernel with the rbd module. However, lsmod shows that neither the rbd module nor even the ceph module has been loaded so far.

I added both modules to /etc/modules-load.d anyway and started a new cluster, but it hasn't had any effect so far; it's showing the same behaviour (it looks like the initial provisioning will again take 6 hours).
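
For reference, a modules-load.d drop-in is just the module names, one per line (the filename below is illustrative):

# /etc/modules-load.d/rbd.conf
rbd
ceph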

How do you recommend debugging I/O on FCOS? Right now it doesn't have any tools like iotop or iftop.

@stellirin
Author

I found rpm-ostree install and was able to get iftop and iotop installed. With both tools I don't see any significant load.

@jan--f

jan--f commented Dec 17, 2019

The issues seen here are unlikely to be related to Ceph itself, as this is the preparation procedure that runs before a new Ceph component is initialized. The log above is from a tool called ceph-volume, a Python script that sets up LVM volumes for the OSD (a Ceph daemon) to use.
So this could be related to LVM usage (though this seems to work in other k8s environments), or it could be caused by running a Python script.

@stellirin
Author

I have eliminated k3s, Rook, and Ceph as the cause of the issue. I am able to use the same Ignition config with Red Hat CoreOS (with trivial adjustments for the older Ignition version), and Ceph can start a cluster quickly.

NAME                       STATUS   ROLES    AGE     VERSION         INTERNAL-IP      EXTERNAL-IP   OS-IMAGE                                                   KERNEL-VERSION                CONTAINER-RUNTIME
k3s-server-0.example.net   Ready    master   7m20s   v1.16.3-k3s.2   192.168.16.101   <none>        Red Hat Enterprise Linux CoreOS 42.80.20191002.0 (Ootpa)   4.18.0-80.11.2.el8_0.x86_64   containerd://1.3.0-k3s.4
k3s-agent-0.example.net    Ready    worker   6m47s   v1.16.3-k3s.2   192.168.16.102   <none>        Red Hat Enterprise Linux CoreOS 42.80.20191002.0 (Ootpa)   4.18.0-80.11.2.el8_0.x86_64   containerd://1.3.0-k3s.4
k3s-agent-1.example.net    Ready    worker   5m47s   v1.16.3-k3s.2   192.168.16.103   <none>        Red Hat Enterprise Linux CoreOS 42.80.20191002.0 (Ootpa)   4.18.0-80.11.2.el8_0.x86_64   containerd://1.3.0-k3s.4

versus:

NAME                       STATUS   ROLES    AGE     VERSION         INTERNAL-IP      EXTERNAL-IP   OS-IMAGE                                KERNEL-VERSION           CONTAINER-RUNTIME
k3s-server-0.example.net   Ready    master   10m     v1.16.3-k3s.2   192.168.16.101   <none>        Fedora CoreOS preview 31.20191217.2.0   5.3.16-300.fc31.x86_64   containerd://1.3.0-k3s.4
k3s-agent-0.example.net    Ready    worker   6m38s   v1.16.3-k3s.2   192.168.16.102   <none>        Fedora CoreOS preview 31.20191217.2.0   5.3.16-300.fc31.x86_64   containerd://1.3.0-k3s.4
k3s-agent-1.example.net    Ready    worker   6m5s    v1.16.3-k3s.2   192.168.16.103   <none>        Fedora CoreOS preview 31.20191217.2.0   5.3.16-300.fc31.x86_64   containerd://1.3.0-k3s.4

The same log snippet as above: it takes ~16 seconds on Red Hat CoreOS versus ~4 hours (!) on Fedora CoreOS:

2020-01-02 11:20:04.738365 I | Running command: /usr/sbin/vgcreate -s 1G --force --yes ceph-c51e73c4-71ce-48e7-b6f7-4c4338916172 /dev/sdb
2020-01-02 11:20:05.178519 I |  stderr: Udev is running and DM_DISABLE_UDEV environment variable is set. Bypassing udev, LVM will manage logical volume symlinks in device directory.
2020-01-02 11:20:05.178832 I |  stderr: Udev is running and DM_DISABLE_UDEV environment variable is set. Bypassing udev, LVM will obtain device list by scanning device directory.
2020-01-02 11:20:05.452117 I |  stdout: Physical volume "/dev/sdb" successfully created.
2020-01-02 11:20:05.498573 I |  stdout: Volume group "ceph-c51e73c4-71ce-48e7-b6f7-4c4338916172" successfully created
2020-01-02 11:20:05.986049 I | Running command: /usr/sbin/lvcreate --yes -l 298 -n osd-data-cd39837c-7c20-4ab3-a0db-613fedb56ca0 ceph-c51e73c4-71ce-48e7-b6f7-4c4338916172
2020-01-02 11:20:06.428890 I |  stderr: Udev is running and DM_DISABLE_UDEV environment variable is set. Bypassing udev, LVM will manage logical volume symlinks in device directory.
2020-01-02 11:20:06.429223 I |  stderr: Udev is running and DM_DISABLE_UDEV environment variable is set. Bypassing udev, LVM will obtain device list by scanning device directory.
2020-01-02 11:20:06.443223 I |  stderr: Udev is running and DM_DISABLE_UDEV environment variable is set. Bypassing udev, device-mapper library will manage device nodes in device directory.
2020-01-02 11:20:06.536848 I |  stdout: Logical volume "osd-data-cd39837c-7c20-4ab3-a0db-613fedb56ca0" created.
2020-01-02 11:20:09.782896 I | Running command: /bin/ceph-authtool --gen-print-key
2020-01-02 11:20:10.247722 I | Running command: /bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new 233b1cfa-9c90-4988-bae6-538c42ea58cd
2020-01-02 11:20:18.700221 I | Running command: /bin/ceph-authtool --gen-print-key
2020-01-02 11:20:19.165869 I | Running command: /bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-1
2020-01-02 11:20:19.603050 I | Running command: /bin/chown -h ceph:ceph /dev/ceph-c51e73c4-71ce-48e7-b6f7-4c4338916172/osd-data-cd39837c-7c20-4ab3-a0db-613fedb56ca0
2020-01-02 11:20:20.037883 I | Running command: /bin/chown -R ceph:ceph /dev/mapper/ceph--c51e73c4--71ce--48e7--b6f7--4c4338916172-osd--data--cd39837c--7c20--4ab3--a0db--613fedb56ca0
2020-01-02 11:20:20.458045 I | Running command: /bin/ln -s /dev/ceph-c51e73c4-71ce-48e7-b6f7-4c4338916172/osd-data-cd39837c-7c20-4ab3-a0db-613fedb56ca0 /var/lib/ceph/osd/ceph-1/block
2020-01-02 11:20:20.895316 I | Running command: /bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o /var/lib/ceph/osd/ceph-1/activate.monmap

Could the log lines mentioning DM_DISABLE_UDEV point to something missing on the Fedora CoreOS side?

My Ignition files for both FCOS and RHCOS can be found at: https://github.com/stellirin/coreos-k3s-ceph

@mattrapa

mattrapa commented Mar 10, 2020

Did you manage to figure this out? I have encountered the same issue.

@tomaaron

We encountered the same problem; our workaround was to kill each OSD prepare pod, after which the OSDs came up. This is not really a permanent solution, but it might help.

@stellirin
Author

My 'solution' was to stay with the RedHat images instead of the Fedora images. So far I haven't had the opportunity to take another look (waiting six hours for each test run really sucks). I've put off upgrading my nodes because of this.

Anyway, this was an experiment for my home lab so I may just go with a different solution entirely if/when I need to rebuild the cluster.

@stellirin
Author

The behaviour of the latest Rook/Ceph is still the same on the latest Fedora CoreOS release, 31.20200223.3.0.

The latest release of RedHat CoreOS, 4.3.0, does not show the behaviour, so I used this when I rebuilt my cluster. I guess I probably won't revisit this topic again for some time as I want to have some stability for a few months.

@LorbusChris
Contributor

I'm wondering whether it would be the same behavior on standard Fedora vs. RHEL.

@stellirin
Author

Due to hardware failure I need to rebuild my cluster already. Can you suggest some test scenarios I can try to narrow down the cause?

@mcsaucy

mcsaucy commented May 2, 2020

I was trying to emulate this with VMs and attached strace to an unhappy ceph-volume process. Not sure if it's a red herring, but there's a whole lot of:

...
close(779248286)                        = -1 EBADF (Bad file descriptor)
close(779248287)                        = -1 EBADF (Bad file descriptor)
close(779248288)                        = -1 EBADF (Bad file descriptor)
close(779248289)                        = -1 EBADF (Bad file descriptor)
...

@mcsaucy

mcsaucy commented May 2, 2020

BLUF (bottom line up front):

[root@rook-ceph-osd-prepare-node0-mx2dn /]# ulimit -n
1073741816
[root@rook-ceph-osd-prepare-node0-mx2dn /]# time timeout 15 /usr/bin/python2.7 /usr/sbin/ceph-volume inventory --format json /dev/sdb || echo ":("

real    0m15.003s
user    0m0.062s
sys     0m0.030s
:(
[root@rook-ceph-osd-prepare-node0-mx2dn /]# ulimit -n 1024
[root@rook-ceph-osd-prepare-node0-mx2dn /]# time timeout 15 /usr/bin/python2.7 /usr/sbin/ceph-volume inventory --format json /dev/sdb || echo ":("
{"available": true, "lvs": [], "rejected_reasons": [], "sys_api": {"size": 5368709120.0, "scheduler_mode": "bfq", "rotational": "1", "vendor": "ATA", "human_readable_size": "5.00 GB", "sectors": 0, "sas_device_handle": "", "rev": "2.5+", "sas_address": "", "locked": 0, "sectorsize": "512", "removable": "0", "path": "/dev/sdb", "support_discard": "512", "model": "QEMU HARDDISK", "ro": "0", "nr_requests": "2", "partitions": {}}, "path": "/dev/sdb", "device_id": "QEMU_HARDDISK_QM00002"}

real    0m0.291s
user    0m0.148s
sys     0m0.077s
[root@rook-ceph-osd-prepare-node0-mx2dn /]#

Poking around some more, it looks like this Popen call in ceph-volume is causing the slowness. A minimal repro case is:

# this may hang for a long time depending upon how large `ulimit -n` is
python2 <<< "import subprocess; subprocess.Popen('true', close_fds=True)"

The behavior of close_fds in Python 2.7 seems to involve closing ~all fds up to ulimit -n, which translates to a billion wasted close(2) calls (~30 minutes for me) per Popen:

[root@rook-ceph-osd-prepare-node0-mx2dn /]# ulimit -n
1073741816
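
For illustration, the child-side cleanup that close_fds=True triggers on Python 2.7 behaves roughly like the loop below (a simplified sketch, not the actual CPython source; the real code uses os.closerange, which walks the same range in C):

# Rough sketch of what close_fds=True does in the forked child on Python 2.7.
import os

MAXFD = os.sysconf("SC_OPEN_MAX")   # 1073741816 with the limit shown above
for fd in xrange(3, MAXFD):
    try:
        os.close(fd)                # almost every call just returns EBADF
    except OSError:
        pass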

Looks like this was brought up in https://bugs.python.org/issue8052, but an strace of the testcase suggests this didn't get ported to 2.7:

# many matches
strace -f python2 <<< "import subprocess; subprocess.Popen('true', close_fds=True)" |& grep EBADF
# zero matches
strace -f python3 <<< "import subprocess; subprocess.Popen('true', close_fds=True)" |& grep EBADF

If this is in fact the issue, I'd be interested in how RHCOS differs here. We should be using the same impacted python2.7 binary on each side, so I wonder if the container on RHCOS gets a different max FD limit.

@mcsaucy

mcsaucy commented May 2, 2020

Dug into this some more.

Taking the original Ignition file, I was able to fully mitigate this issue by lowering k3s.service's LimitNOFILE from infinity to a smaller number (1048576, in my case). Provisioning time dropped from hours to ~10 minutes. Can you see if that works for you, @stellirin?

If so, then IMO there's nothing for FCOS (whose only crime is having high open-fd limits) to do here. If I'm right, the real issue is Python 2.7: since that's EOL and Rook is still running ceph-volume with it, moving it to Python 3 would be a good call. If that's not feasible, a potential hack would be to set a stricter rlimit for the number of open files (performance is tolerable at 1 million) somewhere in rook-land.
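
To sketch what that hack could look like (purely hypothetical, not anything Rook does today), a small wrapper could clamp the soft limit before handing off to ceph-volume, so every child process inherits the saner value:

# Hypothetical wrapper sketch: lower the soft open-files limit, then exec
# ceph-volume; the clamped limit is inherited, keeping the close_fds loop cheap.
import os
import resource
import sys

SANE_NOFILE = 1048576  # the value that proved tolerable above

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
if soft > SANE_NOFILE:
    resource.setrlimit(resource.RLIMIT_NOFILE, (SANE_NOFILE, hard))

os.execvp("ceph-volume", ["ceph-volume"] + sys.argv[1:])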

@dustymabe
Member

Wow @mcsaucy. Nice investigation!

@stellirin
Author

Thanks a lot @mcsaucy for investigating this! I really appreciate it. I should have guessed that this would be yet another issue caused by ulimits on a RedHat OS...

I rebuilt my cluster with LimitNOFILE=1048576 and the issue is fully mitigated for me on FCOS. Provisioning a new Rook/Ceph cluster takes the same amount of time as on RHCOS and k3OS.

A Google search for container LimitNOFILE actually brings up a few similar issues, with the result that upstream containerd and Docker both set the same limit, based on similar performance issues.

I'm also curious why this appears on FCOS but not RHCOS. One link suggests it is due to how the kernel is tuned, with sudo sysctl fs.nr_open giving fs.nr_open = 1073741816 on FCOS, which is what we see as ulimit -n inside the containers. I didn't have time to check the values on RHCOS.
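
A quick throwaway check inside the container (my own snippet, nothing official) makes the relationship visible:

# Print the kernel ceiling and the limit the containerized process actually gets.
import resource

with open("/proc/sys/fs/nr_open") as f:
    nr_open = f.read().strip()          # 1073741816 on FCOS

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("fs.nr_open=%s soft=%d hard=%d" % (nr_open, soft, hard))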

Ultimately, if the fix is good enough for the big container projects, it should be good enough for k3s.

@stellirin
Author

The fix is merged in upstream k3s.

erikwilson pushed a commit to erikwilson/k3s-ansible that referenced this issue May 11, 2020
When k3s is installed on an OS with default high ulimits, performance
issues can be observed. This was discovered on CoreOS where the default
value is 1073741816. Symptoms include very slow file operations; for
example, installing a Rook/Ceph cluster takes ~6 hours instead of ~10 minutes.

A google search for 'container LimitNOFILE' will show that most major
projects set this already, including the (unused) containerd systemd unit
found in this repository at /vendor/github.com/containerd/containerd/containerd.service

k3OS is not affected because the default there is already 1048576.

See description in coreos/fedora-coreos-tracker#329
@AndreKR

AndreKR commented Mar 18, 2021

@stellirin Is it possible you made a mistake in phrasing the comment? As I read this issue, having no limit (infinity) causes performance issues, right?

The comment currently says "Having non-zero limits causes performance issues" and then a non-zero limit is set, so until I found this issue I assumed I was supposed to set the limit to zero to get better performance.

@stellirin
Author

@AndreKR you're right, it does seem contradictory, but if I remember correctly I simply copied the comment from an upstream change on containerd or dockerd.

@AndreKR

AndreKR commented Mar 18, 2021

@stellirin Indeed I can find the same comment in multiple recommendations for Docker service files (1 2 3), but in all of them they actually seem to mean you should not set any limits, so exactly the opposite of what was found in this issue.

@travisby

I use a custom-built FCOS image for my own Kubernetes cluster (kubeadm, not k3s) and experienced this same issue.

Adding

$ cat /etc/systemd/system/containerd.service.d/override.conf
[Service]
LimitNOFILE=1048576

Allowed rook-ceph to finish building successfully (all other times I would get an error about ceph not responding).
