Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rootfs: make pivot_root(2) dance handle initramfs case #4434

Draft
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

cyphar
Copy link
Member

@cyphar cyphar commented Oct 10, 2024

While pivot_root(2) normally refuses to pivot a mount if you are running
with / as initramfs (because initramfs doesn't have a parent mount), you
can create a bind-mount of / and make that your new root to work around
this problem. This does use chroot(2), but this is only done temporarily
to set current->fs->root to the new mount. Once pivot_root(2) finishes,
the chroot(2) and / are gone.

Variants of this hack are fairly well known and is used all over the
place (see 1, 2) but until now we have forced users to have a far less
secure configuration with --no-pivot. This is a slightly modified
version that uses the container rootfs as the temporary spot for the /
clone -- this allows runc to continue working with read-only image-based
OS images.

Signed-off-by: Aleksa Sarai cyphar@cyphar.com

libcontainer/rootfs_linux.go Outdated Show resolved Hide resolved
@cyphar cyphar modified the milestones: 1.2.1, 1.3.0 Oct 21, 2024
@cyphar cyphar added the backport/1.2-todo A PR in main branch which needs to be backported to release-1.2 label Oct 23, 2024
@cyphar cyphar modified the milestones: 1.3.0, 1.2.1 Oct 25, 2024
@cyphar
Copy link
Member Author

cyphar commented Oct 25, 2024

Okay, I managed to test this and this version definitely works. The setup is not too complicated, but I'm not sure if we could practically test this within our CI (what would be a nice way of verifying the container ran inside the VM?).

Script used to create the initramfs (openSUSE)
#!/bin/bash

set -Eeuo pipefail

#sudo zypper in -y busybox syslinux skopeo umoci

sudo rm -rf boot-img/

[ -e runc ] || curl -sSL "https://github.com/opencontainers/runc/releases/download/v1.2.0/runc.amd64" -o runc

[ -d opensuse ] || skopeo copy docker://opensuse/tumbleweed:latest oci:opensuse:tumbleweed
[ -d bundle ] || sudo umoci unpack --image opensuse:tumbleweed ./bundle

mkdir -vp boot-img/

pushd boot-img

mkdir -p ./usr/bin
ln -sv usr/bin ./bin

# Copy runc.
ln ../runc ./usr/bin/runc
# Copy the rootfs bundle.
mkdir -p run
sudo cp -aR ../bundle ./run/bundle

# openSUSE makes /usr/bin/busybox non-static, and you can't ask busybox.install
# to install the static version. So install busybox using symlinks and then
# replace busybox with busybox-static.
busybox.install . --symlinks
cp -v /usr/bin/busybox-static ./usr/bin/busybox

# Boot into a shell.
cat >./init <<EOF
#!/bin/sh

echo "HELLO WORLD"

mkdir -p /proc
mount -t proc proc /proc

mkdir -p /sys
mount -t sysfs sysfs /sys

mkdir -p /sys/fs/cgroup
mount -t cgroup2 cgroup2 /sys/fs/cgroup

mkdir -p /tmp
mount -t tmpfs tmpfs /tmp

mkdir -p /dev
mount -t devtmpfs devtmpfs /dev
mkdir -p /dev/pts
mount -t devpts -o newinstance devpts /dev/pts
mkdir -p /dev/shm
mount --bind /tmp /dev/shm

/bin/sh
EOF
chmod +x ./init

# Build our init.cpio.
sudo find . | sudo cpio -o -H newc > ../init.cpio
popd

And you can then just do qemu-system-x86_64 -kernel /boot/vmlinuz -initrd ./init.cpio -m 2G -nographic -append console=ttyS0 to run a VM with this setup. You can verify this new version works by just doing runc run -b /run/bundle ctr.

@cyphar cyphar marked this pull request as ready for review October 25, 2024 07:08
@cyphar cyphar force-pushed the bind-pivot-root branch 2 times, most recently from 47dffa5 to e27cce8 Compare October 25, 2024 07:24
@cyphar
Copy link
Member Author

cyphar commented Oct 27, 2024

@kolyshkin @AkihiroSuda Do you want me to try to come up with a CI test for this case? I'm not really sure if there is a nice way of testing the output of qemu (doesn't GitHub block nested VMs as well?). Then again, maybe we could make the init script run runc and just parse the -nographic -append console=ttyS0 output?

@kolyshkin
Copy link
Contributor

kolyshkin commented Oct 28, 2024

I'm afraid it's going to be tough. In case qemu is not working in GHA (last time I checked it was only working on Mac OS X instances, but it was ~2y ago), try cirrus-ci, it uses GCP and with this instance it's possible to use nested virt (which we do to run vagrant-libvirt):

runc/.cirrus.yml

Lines 22 to 29 in 4ad9f7f

compute_engine_instance:
image_project: cirrus-images
image: family/docker-kvm
platform: linux
nested_virtualization: true
# CPU limit: `16 / NTASK`: see https://cirrus-ci.org/faq/#are-there-any-limits
cpu: 4
# Memory limit: `4GB * NCPU`

@AkihiroSuda
Copy link
Member

AkihiroSuda commented Oct 28, 2024

The default Linux instances of GHA now supports nested virt.
(Used in the CI of containerd, Lima, etc.)

@cyphar cyphar force-pushed the bind-pivot-root branch 3 times, most recently from 5a291f2 to 0dfe91b Compare October 28, 2024 07:22
@cyphar cyphar force-pushed the bind-pivot-root branch 3 times, most recently from 3c366c2 to 5cd283a Compare October 28, 2024 07:33
if err != nil {
// Always try to do pivot_root(2) because it's safe, and only fallback
// to the unsafe MS_MOVE+chroot(2) dance if pivot_root(2) fails.
logrus.Warnf("your container failed to start with pivot_root(2) (%v) -- please open a bug report to let us know about your usecase", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

End-users may not understand who are "us", as they don't execute runc directly.

We may link https://github.com/opencontainers/runc/issues for explicitness, but we may potentially get a report about some third-party product that we have never heard of.

Copy link
Member Author

@cyphar cyphar Oct 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already provide a link to the tracking issue for this deprecation if you pass --no-pivot (regardless of whether pivot_root(2) works), I don't think we need to provide the same link twice?

@cyphar cyphar force-pushed the bind-pivot-root branch 5 times, most recently from e3acf8d to beaf64a Compare October 28, 2024 08:34
@cyphar cyphar force-pushed the bind-pivot-root branch 2 times, most recently from ba51090 to 3563055 Compare October 29, 2024 06:19
@cyphar cyphar marked this pull request as ready for review October 29, 2024 06:30
@cyphar
Copy link
Member Author

cyphar commented Oct 29, 2024

Okay, I managed to get this working on AlmaLinux 9 so now it's ready for review again @kolyshkin @AkihiroSuda.

Copy link
Contributor

@kolyshkin kolyshkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one trivial nit; LGTM. This is nice!

While pivot_root(2) normally refuses to pivot a mount if you are running
with / as initramfs (because initramfs doesn't have a parent mount), you
can create a bind-mount of / and make that your new root to work around
this problem. This does use chroot(2), but this is only done temporarily
to set current->fs->root to the new mount. Once pivot_root(2) finishes,
the chroot(2) and / are gone.

Variants of this hack are fairly well known and is used all over the
place (see [1,2]) but until now we have forced users to have a far less
secure configuration with --no-pivot. This is a slightly modified
version that uses the container rootfs as the temporary spot for the /
clone -- this allows runc to continue working with read-only image-based
OS images.

[1]: containers/bubblewrap#592 (comment)
[2]: https://aconz2.github.io/2024/07/29/container-from-initramfs.html

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
run.go Outdated
Usage: "do not use pivot root to jail process inside rootfs. This should be used whenever the rootfs is on top of a ramdisk",
Usage: "do not use pivot root to jail process inside rootfs to work around limitations when running in an initramfs (this option is deprecated, insecure, and unnecessary now that pivot_root works with initramfs -- see <https://github.com/opencontainers/runc/issues/4435>)",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need the same change for create.go.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The (slightly more bold) alternative is to make the option hidden (in this case the description can be removed).

Copy link
Member Author

@cyphar cyphar Oct 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking of hiding it for 1.3.0 but keeping it visible for 1.2.z. I've added a new patch deprecating it, and I'll exclude that from the backport.

Despite the hardenings we've added to the MS_MOVE+chroot dance over the
years like commit 28a697c ("rootfs: umount all procfs and sysfs
with --no-pivot"), --no-pivot is fundamentally insecure and the primary
reason why people use it (to run containers from initramfs) can now be
done safely with pivot_root(2).

So we should always try to pivot_root(2) and give a warning to the user
that their configuration is insecure if we have to use the --no-pivot
fallback (users should not see this message in practice, because the
primary users that couldn't use pivot_root(2) now can and will
transparently use it if possible).

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
This includes bats_pipe and some other nice features we can use in our
tests.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
The runc wrapper we have had for a long time is useful for debugging
errors (because bats's built-in "run" doesn't provide the output of the
program in case of an error). However, it might be necessary to run
other programs in a similar wrapper.

In addition, it might be needed to add timeouts or use the bats_pipe
feature (bats v1.10.0) with similar wrapping. Adding support for this in
the sane_run wrapper makes it a little easier to use these things when
writing tests.

This is somewhat adapted from umoci's sane_run helper.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
The only way to run in the proper initramfs is to start a VM using a
custom initrd that runs runc. This should be a fairly reasonable
smoke-test that matches what minikube and kata do.

Unfortunately, running the right qemu for the native architecture on
various distros is a little different, so we need a helper function to
get it to work on both Debian and AlmaLinux.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Existing users will still be able to use it, but new users won't be
tempted into using this flag (and their containers should be able to run
without issue without this flag anyway).

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
return &os.PathError{Op: "pivot_root", Path: ".", Err: err}
pivotErr := unix.PivotRoot(".", ".")
if errors.Is(pivotErr, unix.EINVAL) {
// If pivot_root(2) failed with -EINVAL, one of the possible reasons is
Copy link
Member

@lifubang lifubang Oct 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are six reasons will cause -EINVAL:

       EINVAL new_root is not a mount point.

       EINVAL put_old is not at or underneath new_root.

       EINVAL The current root directory is not a mount point (because
              of an earlier [chroot(2)](https://man7.org/linux/man-pages/man2/chroot.2.html)).

       EINVAL The current root is on the rootfs (initial ramfs) mount;
              see NOTES.

       EINVAL Either the mount point at new_root, or the parent mount of
              that mount point, has propagation type MS_SHARED.

       EINVAL put_old is a mount point and has the propagation type
              MS_SHARED.

This PR tries to deal with the forth one, does this pr would cause some unexpected behaviors for the fifth one? (Maybe need to check once have a time)
The user case:
For docker, containerd, kubernetes, the rootfs is always mount as a overlay fs, it will always be changed as MS_PRIVATE by runc, but the parent mount of rootfs may have propagation type MS_SHARED.
Before this patch, it will return an error immediately, but after this patch, it will do some magic operation to let pivot_root(2) work, I think we should be careful.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The user case:
For docker, containerd, kubernetes, the rootfs is always mount as a overlay fs, it will always be changed as MS_PRIVATE by runc, but the parent mount of rootfs may have propagation type MS_SHARED.

For this case, this patch has the same security issue like --no-pivot, even this flag is not provided. I can escape from the container.
So, I think we should split this PR into two:

  1. Deprecate the flag no-pivot, this should be in milestone 1.2.1;
  2. Find a new way to handle initramfs case, this one can be moved to the next version if we can't fix it in a short time.

Copy link
Member Author

@cyphar cyphar Oct 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this case, this patch has the same security issue like --no-pivot, even this flag is not provided. I can escape from the container.

How? After pivot_root(2) the whole host mount tree is gone, no? chroot_fs_refs should move the current->fs->root to the pivot_root(2)'d path so the intermediate chroot(2) should not have any effect on the final mount setup AFAICS. Can you share a reproducer?

For docker, containerd, kubernetes, the rootfs is always mount as a overlay fs, it will always be changed as MS_PRIVATE by runc, but the parent mount of rootfs may have propagation type MS_SHARED.

We also set the parent to MS_PRVIATE in rootfsParentMountPrivate. (It already should not be the case that we get -EINVAL for that reason because we explicitly make sure that it works with rootfsParentMountPrivate.) Is that not sufficient?

The bind-mount of / keeps the propagation settings the same as well. So if the propagation was actually wrong, the second pivot_root(2) would still fail. I also don't see how you could use that to break out of the container -- AFAIK the reason the kernel requires the mount to be !shared is so that the mount tree changes during pivot_root(2) don't propagate to the host (and break it).

Find a new way to handle initramfs case, this one can be moved to the next version if we can't fix it in a short time.

The "magic" done is just creating a bind-mount of the root that can be used for pivot_root(2) (in an MS_PRIVATE mount so the mount itself will be MS_PRIVATE, to answer your question).

This is the only way of doing it (I spoke to the VFS maintainers a few weeks ago, and this is the only "right" way of doing it -- FWIW Lennart implied that systemd does this but they don't it seems).

It's a bit hard to understand what the problem is without an actual reproducer.

Deprecate the flag no-pivot, this should be in milestone 1.2.1;

We can't deprecate it without having a solution for minikube and kata...

Copy link
Member

@lifubang lifubang Nov 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't deprecate it without having a solution for minikube and kata...

Mark it deprecate not means we remove it right now, it's only to remind users that they should not use it except there is no way to run their case, and to inform users that we can't guarantee the container's security with --no-privot.
But yes, it's a little strange do it in the situation that we still have no solution for minikube and kata or other projects. I think we can find a solution to fix it. I suggest to split into two because maybe we need more time to find out this solution, but I think we should release v1.2.1 ASAP, then we can let runc v1.2.* could be tested by more and more users, especially in some downstream rc versions.

We also set the parent to MS_PRVIATE in rootfsParentMountPrivate. (It already should not be the case that we get -EINVAL for that reason because we explicitly make sure that it works with rootfsParentMountPrivate.) Is that not sufficient?

No. I have concern about it for a long time after #4417 merged, the commit is: 13a6f56. This commit has changed the behavior of rootfsParentMountPrivate:
1. Before this commit, we make / to MS_PRIVATE mount;
2. After this commit, if the rootfs is a overlay mount, we will only make rootfs to MS_PRIVATE mount.
@kolyshkin I don't know this behavior change will cause some issues, but we will not set parent to MS_PRIVATE is a fact, but maybe make rootfs MS_PRIVATE is enough.

EDIT: Sorry, I'm wrong, the parent mount point has always been changed to MS_PRIVATE by rootfsParentMountPrivate; So this bug is not related to this, I have sent the email to @cyphar, if you don't receive it, please let me know.

It's a bit hard to understand what the problem is without an actual reproducer.

I'll send you a email ASAP.

@cyphar cyphar marked this pull request as draft November 1, 2024 07:51
@rata
Copy link
Member

rata commented Nov 1, 2024

I'm assuming we won't include this in 1.2.1 as this is still a draft. We can include it later, though.

@AkihiroSuda AkihiroSuda modified the milestones: 1.2.1, 1.3.0 Nov 1, 2024
@kolyshkin
Copy link
Contributor

I'm assuming we won't include this in 1.2.1 as this is still a draft. We can include it later, though.

This looks more like a "new feature" than a "bug fix" for me, and so it is probably 1.3 material.

@cyphar
Copy link
Member Author

cyphar commented Nov 2, 2024

Yeah, 1.2.1 was the wrong thing to label it. (Especially since there are some subtleties I need to look into...)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/1.2-todo A PR in main branch which needs to be backported to release-1.2 impact/changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants