runc with --no-pivot retains global mount context #1602

Closed
nakato opened this issue Sep 30, 2017 · 6 comments · Fixed by #1606
Comments


nakato commented Sep 30, 2017

When a runc container is started with --no-pivot, the mount namespace of the process running under runc retains some form of the host's mount context. As a result, filesystems that were mounted before runc started cannot be fully unmounted from the system. This happens even though the unmounted filesystem does not appear in /proc/$PID/mounts.


Expected

A filesystem that is not referenced by the runc container is actually unmounted when it is unmounted from the system.

Actual

Filesystem remains mounted until runc containers are stopped.


Reproduction

Reproducible: Always

Runc version: v1.0.0-rc4
Kernel: 4.12.10-1-ARCH

This issue was originally noticed with physical devices, on Ubuntu Xenial with runc 1.0.0-rc2, but the following reproduction is on the above, as listed.

Prepare loopback mount

$ truncate -s 1G lofile
$ mkdir lomount
$ mkfs.xfs lofile
$ mount -o loop lofile lomount
# ps auxf | grep loop
root     22940  0.0  0.0      0     0 ?        S<   13:14   0:00  \_ [loop0]
root     22948  0.0  0.0      0     0 ?        S<   13:14   0:00  \_ [xfs-buf/loop0]
root     22952  0.0  0.0      0     0 ?        S<   13:14   0:00  \_ [xfs-data/loop0]
root     22953  0.0  0.0      0     0 ?        S<   13:14   0:00  \_ [xfs-conv/loop0]
root     22954  0.0  0.0      0     0 ?        S<   13:14   0:00  \_ [xfs-cil/loop0]
root     22956  0.0  0.0      0     0 ?        S<   13:14   0:00  \_ [xfs-log/loop0]
root     22958  0.0  0.0      0     0 ?        S    13:14   0:00  \_ [xfsaild/loop0]
# mount | grep lomount
/root/loopfs on /root/lomount type xfs (rw,relatime,attr2,inode64,noquota)

Prepare and run container

# docker pull gcr.io/google_containers/pause:0.8.0
# mkdir rootfs
# docker create gcr.io/google_containers/pause:0.8.0
$HASH
# docker export $HASH | tar -C rootfs -x
# oci-runtime-tool generate --args /pause --rootfs-path $(pwd)/rootfs > config.json
$ runc run --no-pivot pause
# cat /proc/$PAUSE_PID/mounts
/dev/mapper/home / ext4 rw,relatime,data=ordered 0 0
proc /proc proc rw,relatime 0 0
tmpfs /dev tmpfs rw,nosuid,size=65536k,mode=755 0 0
devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=666 0 0
shm /dev/shm tmpfs rw,nosuid,nodev,noexec,relatime,size=65536k 0 0
mqueue /dev/mqueue mqueue rw,nosuid,nodev,noexec,relatime 0 0
sysfs /sys sysfs ro,nosuid,nodev,noexec,relatime 0 0

# cat /proc/$PAUSE_PID/mountinfo 
274 229 254:2 /nakato/tmp/runc/x/usr/bin/rootfs / rw,relatime - ext4 /dev/mapper/home rw,data=ordered
275 274 0:63 / /proc rw,relatime - proc proc rw
276 274 0:64 / /dev rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755
277 276 0:65 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts rw,gid=5,mode=620,ptmxmode=666
278 276 0:66 / /dev/shm rw,nosuid,nodev,noexec,relatime - tmpfs shm rw,size=65536k
279 276 0:62 / /dev/mqueue rw,nosuid,nodev,noexec,relatime - mqueue mqueue rw
280 274 0:67 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs ro

# cat /proc/$PAUSE_PID/mountstats 
device /dev/mapper/home mounted on / with fstype ext4
device proc mounted on /proc with fstype proc
device tmpfs mounted on /dev with fstype tmpfs
device devpts mounted on /dev/pts with fstype devpts
device shm mounted on /dev/shm with fstype tmpfs
device mqueue mounted on /dev/mqueue with fstype mqueue
device sysfs mounted on /sys with fstype sysfs

Unmount filesystem

# umount /root/lomount
# mount | grep lomount

No output, looks unmounted.

# ps auxf | grep loop
root     22940  0.0  0.0      0     0 ?        S<   13:14   0:00  \_ [loop0]
root     22948  0.0  0.0      0     0 ?        S<   13:14   0:00  \_ [xfs-buf/loop0]
root     22952  0.0  0.0      0     0 ?        S<   13:14   0:00  \_ [xfs-data/loop0]
root     22953  0.0  0.0      0     0 ?        S<   13:14   0:00  \_ [xfs-conv/loop0]
root     22954  0.0  0.0      0     0 ?        S<   13:14   0:00  \_ [xfs-cil/loop0]
root     22956  0.0  0.0      0     0 ?        S<   13:14   0:00  \_ [xfs-log/loop0]
root     22958  0.0  0.0      0     0 ?        S    13:14   0:00  \_ [xfsaild/loop0]

The kernel threads are still here, so the filesystem is still mounted.

Stop the container

I hit Ctrl+C on the container window to stop the container. Runc and pause have exited.

# ps auxf | grep loop

No output, filesystem is now unmounted.


One repercussion of this issue is the inability to unmount and xfs_repair a physical disk that was mounted before runc started a namespaced process. The workaround is to unmount the filesystem, stop all runc containers, and then start them again, or to stop the filesystem from being mounted on boot and reboot the system.

I was also able to reproduce this behavior with ext4. The kernel thread for ext4 is jbd2/loop
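
The check used in the reproduction above (looking for filesystem kernel threads that still name the loop device) can be sketched as a small shell function. This is only a heuristic based on the thread names seen in this report (`xfsaild/loop0`, `jbd2/loop...`); the function name `fs_threads_for_dev` and the optional `PROC_ROOT` argument are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the names of tasks whose comm contains
# "/<device>", e.g. "xfsaild/loop0" for device "loop0".
# Usage: fs_threads_for_dev DEVNAME [PROC_ROOT]
# PROC_ROOT defaults to /proc; it is a parameter only so the function can be
# pointed at a test directory.
# Note: this is a substring match, so "loop1" also matches "loop10".
fs_threads_for_dev() {
    dev=$1
    procroot=${2:-/proc}
    for comm in "$procroot"/[0-9]*/comm; do
        # Skip processes that exited in the meantime (or an unmatched glob).
        [ -r "$comm" ] || continue
        name=$(cat "$comm" 2>/dev/null) || continue
        case $name in
            */"$dev"*) printf '%s\n' "$name" ;;
        esac
    done
}
```

Running `fs_threads_for_dev loop0` after the `umount` would be expected to still list the XFS threads while the --no-pivot container is running, and nothing once it has exited.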

This issue cannot be reproduced if pivot occurs.

Member

cyphar commented Sep 30, 2017

When you use --no-pivot you are telling runc to use chroot rather than pivot_root. One of the semantics of pivot_root is that the root mounts are swapped with another mount tree, giving you the ability to unmount the old root tree entirely. From memory you cannot do that with chroot (though I can double check this when I get back to my desk tomorrow).

Out of interest, is there a reason you cannot use pivot_root? It was added in Linux 2.3.41(!), and a lot of the mount improvements in Linux 3.x removed most of the old issues we had (that required --no-pivot).

Member

cyphar commented Sep 30, 2017

Also I'd be interested to see if the mount shows up in /proc/self/mountinfo. From memory, if the mount tree is broken then you won't see all of the mounts in /proc/self/mounts, but mountinfo should show you what references are held by the process's mount namespace.
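
To make mountinfo output easier to compare, a minimal sketch like the following can reduce each line to the mount ID, the root within the filesystem, the mount point, the filesystem type, and the source. The function name `mountinfo_summary` is hypothetical. It relies on the documented mountinfo layout, where a lone `-` separates the variable-length optional fields from the filesystem type and source.

```shell
#!/bin/sh
# Hypothetical helper: summarise mountinfo lines read from stdin as
#   mount-id root mount-point fstype source
mountinfo_summary() {
    awk '{
        # Optional fields (e.g. "shared:162") start at field 7 and end at "-".
        sep = 0
        for (i = 7; i <= NF; i++) if ($i == "-") { sep = i; break }
        if (sep == 0) next   # not a well-formed mountinfo line
        print $1, $4, $5, $(sep + 1), $(sep + 2)
    }'
}
```

For example, `mountinfo_summary < /proc/$PAUSE_PID/mountinfo` shows at a glance that the container's `/` has a root that is a subdirectory of the host filesystem rather than `/`.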

Author

nakato commented Oct 4, 2017

@cyphar Did the steps to reproduce the issue not work for you? Is there something I can do to clarify them for you?

/proc/self refers to the PID of my shell; the output of /proc/self/mountinfo and /proc/self/mounts is identical between my shell's view and runc's view. Both runc and my shell exist in the main mount namespace, while pause is in a different one.

# ls -lha /proc/self/ns/mnt 
lrwxrwxrwx 1 root root 0 Oct  4 10:19 /proc/self/ns/mnt -> 'mnt:[4026531840]'
# ls -lha /proc/${RUNC}/ns/mnt
lrwxrwxrwx 1 root root 0 Oct  4 10:07 /proc/$RUNC/ns/mnt -> 'mnt:[4026531840]'
# ls -lha /proc/${PAUSE}/ns/mnt
lrwxrwxrwx 1 root root 0 Oct  4 10:17 /proc/$PAUSE/ns/mnt -> 'mnt:[4026532458]'
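
The comparison above can be scripted. This is a small sketch, assuming only that `/proc/<pid>/ns/mnt` is a symlink whose target identifies the namespace, as the `ls` output shows; the function name `same_mnt_ns` and the optional `PROC_ROOT` argument are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: exit 0 if two PIDs share a mount namespace,
# 1 if they differ, 2 if either link cannot be read.
# Usage: same_mnt_ns PID1 PID2 [PROC_ROOT]
same_mnt_ns() {
    procroot=${3:-/proc}
    a=$(readlink "$procroot/$1/ns/mnt") || return 2
    b=$(readlink "$procroot/$2/ns/mnt") || return 2
    [ "$a" = "$b" ]
}
```

With the values shown here, `same_mnt_ns $$ $RUNC` would succeed and `same_mnt_ns $$ $PAUSE` would fail with status 1.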

/proc/self/mountinfo with runc running and the FS mounted.

18 62 0:18 / /sys rw,nosuid,nodev,noexec,relatime shared:2 - sysfs sysfs rw
19 62 0:4 / /proc rw,nosuid,nodev,noexec,relatime shared:22 - proc proc rw
20 62 0:6 / /dev rw,nosuid shared:18 - devtmpfs devtmpfs rw,size=8081548k,nr_inodes=2020387,mode=755
21 18 0:7 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime shared:3 - securityfs securityfs rw
22 20 0:19 / /dev/shm rw,nosuid,nodev shared:19 - tmpfs tmpfs rw
23 20 0:20 / /dev/pts rw,nosuid,noexec,relatime shared:20 - devpts devpts rw,gid=5,mode=620,ptmxmode=000
24 62 0:21 / /run rw,nosuid,nodev shared:21 - tmpfs tmpfs rw,mode=755
25 18 0:22 / /sys/fs/cgroup ro,nosuid,nodev,noexec shared:4 - tmpfs tmpfs ro,mode=755
26 25 0:23 / /sys/fs/cgroup/unified rw,nosuid,nodev,noexec,relatime shared:5 - cgroup2 cgroup rw
27 25 0:24 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:6 - cgroup cgroup rw,xattr,name=systemd
28 18 0:25 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime shared:16 - pstore pstore rw
29 18 0:26 / /sys/firmware/efi/efivars rw,nosuid,nodev,noexec,relatime shared:17 - efivarfs efivarfs rw
30 25 0:27 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime shared:7 - cgroup cgroup rw,cpuset
31 25 0:28 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime shared:8 - cgroup cgroup rw,net_cls,net_prio
32 25 0:29 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime shared:9 - cgroup cgroup rw,cpu,cpuacct
33 25 0:30 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime shared:10 - cgroup cgroup rw,freezer
34 25 0:31 / /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime shared:11 - cgroup cgroup rw,memory
35 25 0:32 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime shared:12 - cgroup cgroup rw,blkio
36 25 0:33 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime shared:13 - cgroup cgroup rw,devices
37 25 0:34 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime shared:14 - cgroup cgroup rw,perf_event
38 25 0:35 / /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime shared:15 - cgroup cgroup rw,pids
62 0 254:1 / / rw,relatime shared:1 - ext4 /dev/mapper/root rw,data=ordered
39 19 0:37 / /proc/sys/fs/binfmt_misc rw,relatime shared:23 - autofs systemd-1 rw,fd=23,pgrp=1,timeout=0,minproto=5,maxproto=5,direct
40 18 0:8 / /sys/kernel/debug rw,relatime shared:24 - debugfs debugfs rw
41 62 0:38 / /tmp rw,nosuid,nodev shared:25 - tmpfs tmpfs rw
42 20 0:39 / /dev/hugepages rw,relatime shared:26 - hugetlbfs hugetlbfs rw
43 20 0:17 / /dev/mqueue rw,relatime shared:27 - mqueue mqueue rw
44 18 0:40 / /sys/kernel/config rw,relatime shared:28 - configfs configfs rw
77 62 259:1 / /boot rw,relatime shared:29 - vfat /dev/nvme0n1p1 rw,fmask=0022,dmask=0022,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro
79 62 254:2 / /home rw,relatime shared:30 - ext4 /dev/mapper/home rw,data=ordered
249 24 0:45 / /run/user/1000 rw,nosuid,nodev,relatime shared:194 - tmpfs tmpfs rw,size=1618268k,mode=700,uid=1000,gid=1000
197 249 0:44 / /run/user/1000/gvfs rw,nosuid,nodev,relatime shared:143 - fuse.gvfsd-fuse gvfsd-fuse rw,user_id=1000,group_id=1000
203 18 0:46 / /sys/fs/fuse/connections rw,relatime shared:148 - fusectl fusectl rw
208 62 0:47 / /var/lib/docker rw,relatime shared:152 - btrfs /dev/mapper/docker rw,ssd,space_cache,subvolid=5,subvol=/
209 208 0:47 /plugins /var/lib/docker/plugins rw,relatime - btrfs /dev/mapper/docker rw,ssd,space_cache,subvolid=5,subvol=/plugins
218 208 0:47 /btrfs /var/lib/docker/btrfs rw,relatime - btrfs /dev/mapper/docker rw,ssd,space_cache,subvolid=5,subvol=/btrfs
223 62 7:0 / /root/lomount rw,relatime shared:162 - xfs /dev/loop0 rw,attr2,inode64,noquota

And /proc/self/mounts

sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
devtmpfs /dev devtmpfs rw,nosuid,size=8081548k,nr_inodes=2020387,mode=755 0 0
securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev 0 0
devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /run tmpfs rw,nosuid,nodev,mode=755 0 0
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/unified cgroup2 rw,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,name=systemd 0 0
pstore /sys/fs/pstore pstore rw,nosuid,nodev,noexec,relatime 0 0
efivarfs /sys/firmware/efi/efivars efivarfs rw,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_cls,net_prio 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/blkio cgroup rw,nosuid,nodev,noexec,relatime,blkio 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
/dev/mapper/root / ext4 rw,relatime,data=ordered 0 0
systemd-1 /proc/sys/fs/binfmt_misc autofs rw,relatime,fd=23,pgrp=1,timeout=0,minproto=5,maxproto=5,direct 0 0
debugfs /sys/kernel/debug debugfs rw,relatime 0 0
tmpfs /tmp tmpfs rw,nosuid,nodev 0 0
hugetlbfs /dev/hugepages hugetlbfs rw,relatime 0 0
mqueue /dev/mqueue mqueue rw,relatime 0 0
configfs /sys/kernel/config configfs rw,relatime 0 0
/dev/nvme0n1p1 /boot vfat rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro 0 0
/dev/mapper/home /home ext4 rw,relatime,data=ordered 0 0
tmpfs /run/user/1000 tmpfs rw,nosuid,nodev,relatime,size=1618268k,mode=700,uid=1000,gid=1000 0 0
gvfsd-fuse /run/user/1000/gvfs fuse.gvfsd-fuse rw,nosuid,nodev,relatime,user_id=1000,group_id=1000 0 0
fusectl /sys/fs/fuse/connections fusectl rw,relatime 0 0
/dev/mapper/docker /var/lib/docker btrfs rw,relatime,ssd,space_cache,subvolid=5,subvol=/ 0 0
/dev/mapper/docker /var/lib/docker/plugins btrfs rw,relatime,ssd,space_cache,subvolid=5,subvol=/plugins 0 0
/dev/mapper/docker /var/lib/docker/btrfs btrfs rw,relatime,ssd,space_cache,subvolid=5,subvol=/btrfs 0 0
/dev/loop0 /root/lomount xfs rw,relatime,attr2,inode64,noquota 0 0

And /proc/self/mountinfo after the unmount, with the --no-pivot runc container still holding the mount open.

18 62 0:18 / /sys rw,nosuid,nodev,noexec,relatime shared:2 - sysfs sysfs rw
19 62 0:4 / /proc rw,nosuid,nodev,noexec,relatime shared:22 - proc proc rw
20 62 0:6 / /dev rw,nosuid shared:18 - devtmpfs devtmpfs rw,size=8081548k,nr_inodes=2020387,mode=755
21 18 0:7 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime shared:3 - securityfs securityfs rw
22 20 0:19 / /dev/shm rw,nosuid,nodev shared:19 - tmpfs tmpfs rw
23 20 0:20 / /dev/pts rw,nosuid,noexec,relatime shared:20 - devpts devpts rw,gid=5,mode=620,ptmxmode=000
24 62 0:21 / /run rw,nosuid,nodev shared:21 - tmpfs tmpfs rw,mode=755
25 18 0:22 / /sys/fs/cgroup ro,nosuid,nodev,noexec shared:4 - tmpfs tmpfs ro,mode=755
26 25 0:23 / /sys/fs/cgroup/unified rw,nosuid,nodev,noexec,relatime shared:5 - cgroup2 cgroup rw
27 25 0:24 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:6 - cgroup cgroup rw,xattr,name=systemd
28 18 0:25 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime shared:16 - pstore pstore rw
29 18 0:26 / /sys/firmware/efi/efivars rw,nosuid,nodev,noexec,relatime shared:17 - efivarfs efivarfs rw
30 25 0:27 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime shared:7 - cgroup cgroup rw,cpuset
31 25 0:28 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime shared:8 - cgroup cgroup rw,net_cls,net_prio
32 25 0:29 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime shared:9 - cgroup cgroup rw,cpu,cpuacct
33 25 0:30 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime shared:10 - cgroup cgroup rw,freezer
34 25 0:31 / /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime shared:11 - cgroup cgroup rw,memory
35 25 0:32 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime shared:12 - cgroup cgroup rw,blkio
36 25 0:33 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime shared:13 - cgroup cgroup rw,devices
37 25 0:34 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime shared:14 - cgroup cgroup rw,perf_event
38 25 0:35 / /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime shared:15 - cgroup cgroup rw,pids
62 0 254:1 / / rw,relatime shared:1 - ext4 /dev/mapper/root rw,data=ordered
39 19 0:37 / /proc/sys/fs/binfmt_misc rw,relatime shared:23 - autofs systemd-1 rw,fd=23,pgrp=1,timeout=0,minproto=5,maxproto=5,direct
40 18 0:8 / /sys/kernel/debug rw,relatime shared:24 - debugfs debugfs rw
41 62 0:38 / /tmp rw,nosuid,nodev shared:25 - tmpfs tmpfs rw
42 20 0:39 / /dev/hugepages rw,relatime shared:26 - hugetlbfs hugetlbfs rw
43 20 0:17 / /dev/mqueue rw,relatime shared:27 - mqueue mqueue rw
44 18 0:40 / /sys/kernel/config rw,relatime shared:28 - configfs configfs rw
77 62 259:1 / /boot rw,relatime shared:29 - vfat /dev/nvme0n1p1 rw,fmask=0022,dmask=0022,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro
79 62 254:2 / /home rw,relatime shared:30 - ext4 /dev/mapper/home rw,data=ordered
249 24 0:45 / /run/user/1000 rw,nosuid,nodev,relatime shared:194 - tmpfs tmpfs rw,size=1618268k,mode=700,uid=1000,gid=1000
197 249 0:44 / /run/user/1000/gvfs rw,nosuid,nodev,relatime shared:143 - fuse.gvfsd-fuse gvfsd-fuse rw,user_id=1000,group_id=1000
203 18 0:46 / /sys/fs/fuse/connections rw,relatime shared:148 - fusectl fusectl rw
208 62 0:47 / /var/lib/docker rw,relatime shared:152 - btrfs /dev/mapper/docker rw,ssd,space_cache,subvolid=5,subvol=/
209 208 0:47 /plugins /var/lib/docker/plugins rw,relatime - btrfs /dev/mapper/docker rw,ssd,space_cache,subvolid=5,subvol=/plugins
218 208 0:47 /btrfs /var/lib/docker/btrfs rw,relatime - btrfs /dev/mapper/docker rw,ssd,space_cache,subvolid=5,subvol=/btrfs

And /proc/self/mounts after the unmount.

sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
devtmpfs /dev devtmpfs rw,nosuid,size=8081548k,nr_inodes=2020387,mode=755 0 0
securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev 0 0
devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /run tmpfs rw,nosuid,nodev,mode=755 0 0
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/unified cgroup2 rw,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,name=systemd 0 0
pstore /sys/fs/pstore pstore rw,nosuid,nodev,noexec,relatime 0 0
efivarfs /sys/firmware/efi/efivars efivarfs rw,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_cls,net_prio 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/blkio cgroup rw,nosuid,nodev,noexec,relatime,blkio 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
/dev/mapper/root / ext4 rw,relatime,data=ordered 0 0
systemd-1 /proc/sys/fs/binfmt_misc autofs rw,relatime,fd=23,pgrp=1,timeout=0,minproto=5,maxproto=5,direct 0 0
debugfs /sys/kernel/debug debugfs rw,relatime 0 0
tmpfs /tmp tmpfs rw,nosuid,nodev 0 0
hugetlbfs /dev/hugepages hugetlbfs rw,relatime 0 0
mqueue /dev/mqueue mqueue rw,relatime 0 0
configfs /sys/kernel/config configfs rw,relatime 0 0
/dev/nvme0n1p1 /boot vfat rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro 0 0
/dev/mapper/home /home ext4 rw,relatime,data=ordered 0 0
tmpfs /run/user/1000 tmpfs rw,nosuid,nodev,relatime,size=1618268k,mode=700,uid=1000,gid=1000 0 0
gvfsd-fuse /run/user/1000/gvfs fuse.gvfsd-fuse rw,nosuid,nodev,relatime,user_id=1000,group_id=1000 0 0
fusectl /sys/fs/fuse/connections fusectl rw,relatime 0 0
/dev/mapper/docker /var/lib/docker btrfs rw,relatime,ssd,space_cache,subvolid=5,subvol=/ 0 0
/dev/mapper/docker /var/lib/docker/plugins btrfs rw,relatime,ssd,space_cache,subvolid=5,subvol=/plugins 0 0
/dev/mapper/docker /var/lib/docker/btrfs btrfs rw,relatime,ssd,space_cache,subvolid=5,subvol=/btrfs 0 0

The mount being removed from my namespace is expected. However, it still exists within the mount namespace that is used to execute the pause command in this case.


While it may be using chroot, something more is happening here: the chroot is being executed inside a separate mount namespace.

For comparison, this is the mount namespace of pause running inside only a chroot.

# ls -lha /proc/self/ns/mnt 
lrwxrwxrwx 1 root root 0 Oct  4 10:23 /proc/self/ns/mnt -> 'mnt:[4026531840]'
# ls -lha /proc/$PAUSE/ns/mnt
lrwxrwxrwx 1 root root 0 Oct  4 10:23 /proc/29664/ns/mnt -> 'mnt:[4026531840]'

And /proc/$PAUSE/mounts and mountinfo are both empty, even though the pause command shares the mount namespace with the main system.

# cat /proc/$PAUSE/mounts
# cat /proc/$PAUSE/mountinfo 

I don't know how much more the chroot command does compared to chroot(). However, the mounts/mountinfo of a process running directly under the chroot command suggest that chroot obscures the mounts in a namespace. That would explain why I cannot see the mounted filesystem in /proc/$PID/mounts, even though it is certainly being kept mounted inside that mount namespace.


While I do not know the reason for this system using --no-pivot, I speculate this was done because the container filesystems reside on a read-only filesystem.

While I am working on moving these systems to use pivot_root, I don't think that is an appropriate resolution to this bug. If --no-pivot is a supported configuration with runc, and when being run with --no-pivot the user is incapable of cleanly unmounting a filesystem on a disk, the damage could be rather severe.

If I hit unmount in a UI or in the CLI and the unmount command succeeds with return code 0, I'm going to yank the drive out of a hot-swap bay or out of a USB port, and I'm unlikely to notice that something's not right until I get prompted to do a repair, or have data corruption.

There is no easy way for a user to detect that the filesystem is still mounted in the system with this bug present. The only reason I noticed was that, when I tried to repartition/reformat the drive, the kernel refused to update its partition map and reported that the partition table could not be updated.

If I had not noticed the message, and had not changed the number of partitions, it would have been extremely easy to reformat the "new" partitions and only notice the problem after a reboot. With servers that could be costly, as reboots can be months apart; with a laptop, the reboot could land at a very inopportune time when traveling.

Likewise, when I boot a system and get a message to run xfs_repair, I should be able to unmount the filesystem and run xfs_repair without having to reboot into single-user mode or shut down a bunch of unrelated services. The error messages are also thoroughly confusing: the user is told the filesystem is mounted when, from the user's view, it is not, which makes it very hard to start debugging.

Member

cyphar commented Oct 4, 2017

@nakato

chroot is done in a new mount namespace, because otherwise you cannot set up a container's rootfs. --no-pivot effectively just replaces the usage of pivot_root with chroot. It used to be the case that you couldn't use pivot_root on a read-only filesystem but I fixed that with #1125 several months ago.

> If --no-pivot is a supported configuration with runc, and when being run with --no-pivot the user is incapable of cleanly unmounting a filesystem on a disk, the damage could be rather severe.

I agree, I'm just wondering why we still have this as a supported configuration. The reason I asked why you couldn't use pivot_root was to see what use case you might have that necessitates --no-pivot and whether it can be resolved without keeping the chroot code around. chroot is less secure in some senses than pivot_root, partly due to the mount leakage issue you mention and partly because it can be escaped much more easily from a design standpoint.

I will try to reproduce the problem you describe. I was asking introductory questions to make sure that we don't hit an XY problem. I will cross-check a "normal" chroot against what runc does and double check where the problem lies.

Author

nakato commented Oct 4, 2017

Fair enough, I was just worried my reproduction steps were not detailed enough.
I asked someone else to take a stab at the instructions to make sure they were able to follow them and reproduce the issue. However, they tested with master and were unable to reproduce it.

With that in mind we sought out the commit that fixed the issue, and found that it was fixed with commit 4301b44

I am not the person who originally configured the systems this issue was found on, I just had to debug the issue. I'm fairly certain it was only done for the read only FS. And even if that was not fixed in 1.0.0-rc4, I would be mounting an overlayfs over the containers as a way to rectify this issue.

Sorry for the noise, I should have checked against master first. I'll close the issue. The only thing I can think of doing here now is to maybe write a test case so the issue does not inadvertently get reintroduced.

@nakato nakato closed this as completed Oct 4, 2017
Member

cyphar commented Oct 4, 2017

@nakato Ah, so it was caused by mount propagation (which was going to be my guess). Effectively, this occurs because chroot doesn't allow you to remove the mountpoints outside the chroot (by design). This means that if you use MS_PRIVATE mount propagation on / before doing a chroot, you have just given yourself a reference to a bunch of mounts that you cannot unmount. This is one of the issues I was getting at when I was talking about why pivot_root is preferred.

Maybe we should give a warning if you're doing MS_PRIVATE (basically anything other than MS_SLAVE or MS_SHARED) with --no-pivot. Because you're right that it is probably not something a user expects.

That patch just changes the default, you can still trigger it by setting "rootfsPropagation": "rprivate".
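
As a rough illustration of the kind of warning suggested above, here is a sketch that inspects a bundle's config.json before it is run with --no-pivot. It is not runc code: the function name `warn_private_propagation` is hypothetical, the value is extracted with a crude text match rather than a JSON parser, and an unset `rootfsPropagation` is treated as "no warning" because the effective default depends on the runc version.

```shell
#!/bin/sh
# Hypothetical helper: return 1 and print a warning when config.json requests
# a rootfsPropagation other than (r)slave or (r)shared.
# Usage: warn_private_propagation path/to/config.json
warn_private_propagation() {
    config=$1
    # Crude extraction: first "rootfsPropagation": "<value>" pair in the file.
    value=$(sed -n 's/.*"rootfsPropagation"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p' "$config" | head -n 1)
    case $value in
        private|rprivate|unbindable|runbindable)
            echo "warning: rootfsPropagation=$value with --no-pivot may keep host mounts referenced"
            return 1 ;;
    esac
    return 0
}
```

A wrapper script could call `warn_private_propagation config.json` before `runc run --no-pivot` and refuse to continue, or just log the message.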

I'm going to reopen this bug, if you don't mind.
