kctf is a script that uses nsjail inside a docker container, which is launched in kubernetes. The nsjail is for security/isolation, the docker is for easy setup, kubernetes is for scaling, and kctf is a wrapper that launches all of that on Google Cloud.
There are some basic examples in the kctf repo and more advanced examples in googlectf quals. The official walkthrough explains how to set kctf up and launch it.
kctf was clearly made with simple pwn chals in mind that operate on stdin/stdout. For any ... more interesting challenges, kctf can be a lot of work.
This document contains abstractions of what I learned struggling with it, and it will hopefully grow if I find the time for more additions.
You already have a docker that runs your ctf challenge, and want to use it in kctf?
- Ensure you do not need state beyond one tcp connection, because kctf might launch multiple docker instances for load-balancing and nsjail will not allow any second tcp connection to the same jail instance of the challenge.
- If you want to write to the filesystem, that can be enabled, but must be done explicitly, otherwise kubernetes will just default to non-writeable filesystem.
- Set the filesystem permissions in nsjail.cfg to rw: true if you want the challenge to modify the filesystem ... but usually you don't want that, and should instead mount tmpfs filesystems. But sometimes you might need this in order to even be able to create the tmpfs mountpoint... so, good to know.
- Also check out the section OverlayFS -- Example On Root -- below: I explain there how to make an overlayFS that is writable and per challenge jail instance.
- Specify it correctly in challenge.yaml as well, see this kctf github issue for an example of where exactly to specify this. It is in spec.podTemplate.template.spec.containers.securityContext.
- Sometimes it seemed to me that I also needed privileged: true when it complained about a read-only filesystem.
Aside from those worries, you will also need to have kctf-related scripts and nsjail in the docker. The easiest way to do so is to give your currently built stage a name in the Dockerfile (see docker multistage builds) and then start a new final stage where you copy everything over to the image the kctf examples use, e.g. the pwn example. Or take the latest from gcr.io/kctf-docker/challenge, I guess.
FROM some_image as chroot
# your challenge dockerfile stuff here.
FROM gcr.io/kctf-docker/challenge@sha256:eb0f8c3b97460335f9820732a42702c2fa368f7d121a671c618b45bbeeadab28 as kctf-stage
COPY --from=chroot / /chroot
COPY nsjail.cfg /home/user
CMD kctf_setup && \
kctf_drop_privs \
socat \
TCP-LISTEN:1337,reuseaddr,fork \
EXEC:"kctf_pow nsjail --config /home/user/nsjail.cfg -- /home/user/chal"
In this, the nsjail launches the challenge binary at /home/user/chal. You might wonder: "But the chal binary is actually at /chroot/home/user/chal?!". That is true, but the nsjail.cfg contains a config that mounts the /chroot (as seen from the docker container perspective) to / (as seen from inside the nsjail).
mount: [
{
src: "/chroot"
dst: "/"
is_bind: true
},
If you later want to only build the initial docker you had at the start, you can do so using
docker build --target chroot .
To run the newly created kctf-stage docker, you will have to use --privileged, lest you get "permission denied":
docker run --privileged -p 1337:1337 -it $(docker build -q --target kctf-stage .)
The socat line in the CMD of the example listens on port 1337 and launches an nsjail instance every time a connection comes in (thanks to fork, otherwise it would only take one connection). The communication then goes through the nsjail's stdin.
Inside the nsjail, we want to do the opposite, as per this web example. Note that this relies on /dev/null, so the nsjail.cfg now needs to mount /dev and /dev/null.
To do this, I create a script unsocat.sh that launches my server and then feeds stdin to the port 1337 it is listening on, and the output from it back to stdout, which in turn gets fed through the nsjail layer to the client.
#!/bin/bash
# unsocat.sh: This script maps stdin to a tcp port and the output from there back to stdout.
# But first, let's launch our server.
# This is the same command as I would use as docker CMD if I were not using nsjail/kctf.
# But:
# * in the background, thanks to the "&" at the end.
# * not printing output, thanks to the &>/dev/null redirect. Leave it out for debugging.
(python3 /chal/chalspecific_server.py &>/dev/null) &
#echo "Debug: Server Launched."
# Now connect to it, without buffering.
# Ensure that the tool of whichever option you pick (and stdbuf) is installed inside the chroot.
# Option 1:
#stdbuf -o0 -i0 nc "127.0.0.1" 1337
# Option 2: socat has the "forever" option to retry until it works.
stdbuf -o0 -i0 socat - "TCP:127.0.0.1:1337,forever"
#echo "Debug: Done."
In the Dockerfile, I execute this script now instead of my server directly:
# So far so good. Now let's put that into kctf and nsjail.
FROM gcr.io/kctf-docker/challenge@sha256:eb0f8c3b97460335f9820732a42702c2fa368f7d121a671c618b45bbeeadab28 as kctf-stage
COPY --from=chal / /chroot
RUN mkdir /chroot/kctf
COPY --chmod=555 unsocat.sh /chroot/kctf/
COPY --chmod=444 nsjail.cfg /kctf/nsjail.cfg
RUN echo "127.0.0.1 NSJAIL" >> /chroot/etc/hosts
CMD kctf_setup && \
kctf_drop_privs \
socat \
TCP-LISTEN:1337,reuseaddr,fork \
EXEC:"kctf_pow nsjail --config /kctf/nsjail.cfg -- /kctf/unsocat.sh"
Also, my challenge would normally listen on hostname:port. But now, it seems the hostname is NSJAIL and does not resolve... so we need to add that to the /etc/hosts file. I do this with the line RUN echo "127.0.0.1 NSJAIL" >> /chroot/etc/hosts in my Dockerfile's final stage.
Mounting /dev/null is not so hard, but if you write to it you might get /dev/null: Permission denied. The reason might be explained in this great answer (archive link):
In a very specific edge case, when /dev is a tmpfs with the sticky bit set and group- or world-writable, but the owner of /dev/null is neither the owner of /dev nor the current user, there is the problem that echo tries to create the file and gets denied.
Now, because I was specifically mounting /dev as a tmpfs instead of following the example in the kctf repo (I wanted to avoid having any devices I don't strictly need), I did have a tmpfs. And without specifying the mount option mode=, the default does set the sticky bit. Because I do not map root inside to root outside, the owner of /dev/null is nobody, but the owner of /dev is my current user. We have this exact case.
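You can verify that default outside of kctf; a minimal check (as root, the path is just an example) showing that a freshly mounted tmpfs gets mode 1777, i.e. sticky bit plus world-writable:
mkdir -p /tmp/devtest && mount -t tmpfs tmpfs /tmp/devtest
stat -c '%A %a' /tmp/devtest   # drwxrwxrwt 1777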
The fix then is to set the /dev permissions more like in the "real" environments, where it would be -o mode=755.
generic@motorbrot:~$ ls -la /dev/null
crw-rw-rw- 1 root root 1, 3 Okt 26 18:55 /dev/null
generic@motorbrot:~$ ls -la /dev/.
total 33
drwxr-xr-x 20 root root 6180 Nov 7 15:46 .
So in nsjail.cfg:
# Allow access to /dev/null
mount [
{
dst: "/dev"
fstype: "tmpfs"
rw: true,
options: "mode=755"
},
{
src: "/dev/null"
dst: "/dev/null"
rw: true
is_bind: true
}
]
kctf has an option to simulate a local "kind" kubernetes cluster. But you can also just launch your docker almost normally. It will usually need --privileged, but that is okay because the docker is not what is ensuring the security anyway.
docker build -t myimage .
docker run --privileged -p1337:1337 -it myimage
This is nice for a small deployment with few anticipated users, as it still has nsjail but no scaling kubernetes stuff.
But if you are planning to eventually use kubernetes for the CTF event, then please do first test it at least in kind, if not on the cloud already. I had the experience that GKE and Kind behave differently. E.g. I can access one challenge from another using kind, but not GKE.
At least in the case of one challenge, this also worked for me:
docker run --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add SYS_ADMIN -p 1337:1337 -it $(docker build -q --target kctf-stage .)
If you don't need it (e.g. nothing in the jail relies on newuidmap/newgidmap), you can also add --security-opt no-new-privileges
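Either way, a quick smoke test from another terminal only needs netcat (assuming you kept port 1337 as in the examples; if proof-of-work is enabled you will first be asked to solve it):
nc 127.0.0.1 1337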
List of Potential Mistakes / Errors to make them googleable:
- Don't put --target kctf-stage unless you have a docker build target stage with that name. You can just omit it to build your full image.
- If you need the new privileges capability, it might be because you use newgidmap. Then, the docker will launch correctly but nsjail will fail on connection, saying that /usr/bin/newgidmap failed.
- If you do not state --security-opt apparmor=unconfined, you might already get at startup mount: /kctf/.fullproc/proc: cannot mount none read-only if you are mounting something at startup.
- If you do not state --security-opt seccomp=unconfined, you will be told that
[E][2023-11-14T00:18:50+0000][1] initCloneNs():435 pivot_root('/tmp/nsjail.1000.root', '/tmp/nsjail.1000.root'): Operation not permitted
[F][2023-11-14T00:18:50+0000][1] runChild():483 Launching child process failed
[W][2023-11-14T00:18:50+0000][15] runChild():503 Received error message from the child process before it has been executed
[E][2023-11-14T00:18:50+0000][15] standaloneMode():275 Couldn't launch the child process
You can specify /bin/bash as the command that nsjail launches. This allows you to connect to it and look around, run the commands you need to run, and see how they fail.
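For example, temporarily swap the command at the end of the socat EXEC line from the examples above:
EXEC:"kctf_pow nsjail --config /kctf/nsjail.cfg -- /bin/bash"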
If you get this without even touching the kctf command, just by running the docker, it is not because of kubernetes but because of nsjail (probably).
Connecting over netcat as usual, but with the nsjail command launching /bin/bash, we can explore the filesystem inside the jail. But even though the permissions look fine, doing echo asdf > test.txt makes the output of the nsjail docker terminal show bin/bash: line 10: test.txt: Read-only file system. The culprit is likely seen in the nsjail.cfg: in my case, I have mounted /chroot to / inside the jail, as a bind mount.
mount {
src: "/chroot"
dst: "/"
is_bind: true
}
I could add rw: true to make it writable, but then it would modify the files on disk also for new instances of the challenge. So you might want to use a tmpfs for the writable files instead -- one tmpfs per jail instance:
mount {
dst: "/tmp"
fstype: "tmpfs"
rw: true
}
Perhaps you might also get an OverlayFS to work, but considering that that needs a tmpfs to store the changes in anyway, you might as well just stick with a copy of the original folder.
The advantages of the OverlayFS would be:
- no needless copy of large files.
- if one of the files can only be read by root, it is easier if it can be left in place. Making a copy would have to be done as root, which is possible but has to happen before kctf_drop_privs, and that is supposed to happen before the call to socat ...
The option X-mount.mkdir seems ideal to have the mount command create the missing mount points for us, but it does not work ("Invalid argument").
Without it, we need some pre-existing directories we can use in the overlay mount command (upperdir, lowerdir, workdir). And upperdir and workdir must be on the same filesystem to allow for atomic writing... so we must have a writable tmpfs and create two directories on it somehow. Otherwise, this config fails:
# using a tmpfs i can make an overlayfs to avoid copying
# a large pre-made folder. I don't have any large folders,
# but I am curious whether I can make it work.
#
# https://github.com/google/nsjail states:
# --mount|-m VALUE
# Arbitrary mount, format src:dst:fs_type:options
mount [
{
dst: "/tmp"
fstype: "tmpfs"
rw: true
},
{
fstype: "overlay"
options: "lowerdir=/chal,upperdir=/tmp/chal_upper,workdir=/tmp/chal_work"
dst: "/tmp/chal"
}
]
We cannot create these missing directories in advance, since the filesystem outside the nsjail is not supposed to be writable from inside. But well, I guess we can make an exception this time ... We have to create a tmpfs before launching nsjail every time socat receives a connection (give it some randomized name), create the upperdir and workdir as empty folders inside it, and then pass that name to the nsjail command using the command line flag --bindmount "$NAMEHERE":/tmp2.
But that fails with the warning
[W][2023-11-10T19:38:06+0000][1] remountPt():283 mount('/tmp2', flags:MS_REMOUNT|MS_BIND|MS_RELATIME): Invalid argument
If this happens to you, you have perhaps written something like this:
kctf_pow nsjail --bindmount "${mytmpfs}:/tmp2" --config /kctf/nsjail.cfg -- /bin/bash
If you change the order to the following, it works:
kctf_pow nsjail --config /kctf/nsjail.cfg --bindmount "${mytmpfs}:/tmp2" -- /bin/bash
The --config apparently must come first. Probably due to the mounting of / in the config, which comes too late otherwise? But well, this now also means that we must specify the overlay mount itself on the command line here instead of in the config (presumably because the overlay depends on the bindmounted tmpfs, which only appears on the command line after --config). So something like this should work in my opinion, but it does not work and I'm not sure what is missing:
#!/bin/bash
set -x
mytmpfs=$(mktemp -d)
# if something fails, try to cleanup before exiting
function onexit {
rm -r "$mytmpfs"
}
trap onexit EXIT
mkdir "${mytmpfs}/upperdir"
mkdir "${mytmpfs}/workdir"
TMP_INSIDE="/tmp"
TMP_OUTSIDE="$mytmpfs"
kctf_pow nsjail --config /kctf/nsjail.cfg --bindmount "${mytmpfs}:${TMP_INSIDE}" --mount ":/chal:overlay:lowerdir=/chal,upperdir=${TMP_INSIDE}/upperdir,workdir=${TMP_INSIDE}/workdir" -- /bin/bash
[W][2023-11-10T20:41:52+0000][1] mountPt():223 mount(''/chal' flags: type:'overlay' options:'lowerdir=/chal,upperdir=/tmp/upperdir,workdir=/tmp/workdir' dir:true') src:'none' dstpath:'/tmp/nsjail.1000.root//chal' failed: No such file or directory
[E][2023-11-10T20:41:52+0000][1] initCloneNs():414 Couldn't mount '/chal'
Instead of making nsjail create the overlayfs, we could also follow the approach from KNOXDEV/terraform-kubernetes-ctf-chal and first make the overlayfs and then pass it in to nsjail.
If you drop the privs too early, mount will complain mount: only root can use "--options" option. This means that we will have to move the kctf_drop_privs command from outside the socat call to inside the command it executes.
This Dockerfile and launch_nsjail.sh script excerpt create an overlayfs in the /tmp of the container (outside the nsjail instances), and mount it inside the jail onto a pre-existing folder /tmp2 (it was /chroot/tmp2 before entering the jail).
CMD kctf_setup && \
#kctf_drop_privs \ # Actually, I'll call this in launch_nsjail approach 3, inside socat.
socat \
TCP-LISTEN:1337,reuseaddr,fork \
EXEC:"/kctf/launch_nsjail.sh"
# File: launch_nsjail.sh
#!/bin/bash
set -x
mytmpfs=$(mktemp -d)
# if something fails, try to cleanup before exiting
function onexit {
rm -r "$mytmpfs"
}
trap onexit EXIT
# ---------
# Attempt 3: OverlayFS manually made
#
# Source: https://github.com/KNOXDEV/terraform-kubernetes-ctf-chal/blob/5d9b4e647f9769d08e6529ff8d48d86f98d861b3/docker-images/tunnelling/socat_entrypoint.sh#L3
# This approach must be run before dropping caps with kctf_drop_privs. It calls that itself.
# ---------
# on each connection, we need to create an overlayfs for nsjail to mount to
OUTSIDE_TEMP_DIR="$mytmpfs"
CHROOT_INSIDE="/tmp2" # TODO: actually just chroot to the whole overlayfs.
mkdir "$OUTSIDE_TEMP_DIR/upper" "$OUTSIDE_TEMP_DIR/work" "$OUTSIDE_TEMP_DIR/fs"
# this step is necessary to allow nsjail to mount the overlay once we drop privs
chmod -R o+x "$OUTSIDE_TEMP_DIR"
mount -t overlay overlay -o "lowerdir=/chroot,upperdir=$OUTSIDE_TEMP_DIR/upper,workdir=$OUTSIDE_TEMP_DIR/work" "$OUTSIDE_TEMP_DIR/fs"
kctf_drop_privs kctf_pow nsjail --config /kctf/nsjail.cfg \
--bindmount "${OUTSIDE_TEMP_DIR}/fs:${CHROOT_INSIDE}" \
--tmpfsmount /tmp \
--cwd "$CHROOT_INSIDE" \
-- /bin/bash
#-- /kctf/unsocat.sh
In the KNOXDEV repo, they actually just mount the overlayfs as their main chroot. That seems like a good option too, to avoid having weird paths.
# File: Dockerfile
CMD kctf_setup && \
#kctf_drop_privs \ # Actually, I'll call this in launch_nsjail approach 3, inside socat.
socat \
TCP-LISTEN:1337,reuseaddr,fork \
EXEC:"/kctf/launch_nsjail.sh"
# File: launch_nsjail.sh
#!/bin/bash
set -x
mytmpfs=$(mktemp -d)
mount -t tmpfs -o size=512M tmpfs "$mytmpfs"
# if something fails, try to cleanup before exiting
function onexit {
umount "$mytmpfs/fs"
umount "$mytmpfs"
rm -rf "$mytmpfs"
echo "$(date) : onexit done." 1>&2
}
trap onexit EXIT
# ---------
# Attempt 3: OverlayFS manually made, and then passed into nsjail as writable root /.
#
# Source: https://github.com/KNOXDEV/terraform-kubernetes-ctf-chal/blob/5d9b4e647f9769d08e6529ff8d48d86f98d861b3/docker-images/tunnelling/socat_entrypoint.sh#L3
# This approach must be run before dropping caps with kctf_drop_privs. It calls that itself.
# ---------
# on each connection, we need to create an overlayfs for nsjail to mount to
OUTSIDE_TEMP_DIR="$mytmpfs"
mkdir "$OUTSIDE_TEMP_DIR/upper" "$OUTSIDE_TEMP_DIR/work" "$OUTSIDE_TEMP_DIR/fs"
# "this step is necessary to allow nsjail to mount the overlay once we drop privs" - KNOXDEV
# But actually, i don't think that's true?
#chmod -R o+x "$OUTSIDE_TEMP_DIR"
mount -t overlay overlay -o "lowerdir=/chroot,upperdir=$OUTSIDE_TEMP_DIR/upper,workdir=$OUTSIDE_TEMP_DIR/work" "$OUTSIDE_TEMP_DIR/fs"
kctf_drop_privs kctf_pow nsjail \
--chroot "${OUTSIDE_TEMP_DIR}/fs" --rw \
--config /kctf/nsjail.cfg \
--tmpfsmount /tmp \
--cwd "/chal" \
-- /bin/bash
#-- /kctf/unsocat.sh
We can pass --rw because anything written from the jail only ends up in the temporary upperdir of the overlayfs. The trap cleans that up in the end. Of course, during the connection there might still be a DoS attack by writing a huge amount of data. This can be prevented by giving every jail its own tmpfs and capping its size with the size= mount option, and that is the reason why my example above mounts a new tmpfs at the start.
The nsjail.cfg for this example does not contain a mount for /.
Notably, when I got the order of --config and --chroot wrong here it was working anyway... so I assume nsjail handles --chroot specially.
In general, nsjail makes networking impossible for the jailed challenge, except for the one incoming tcp connection that you get in through stdin. Suppose you have a challenge where the ctf player wants to launch a connection to a server on the internet. How to do that?
Well, that was a huge pain to get working. My notes are a huge jumble and the challenge where I implemented this is still secret. But if you need to do this, let me know and I'll try to read myself back into it to make better notes here.
See also my confused answer attempt on serverfault back before I fully succeeded at it. And after I succeeded:
I managed to make outgoing network access work for a dynamic number of jails and without requiring one LAN ip address per jail. I did so using veth pairs and masquerading. I hope I will find time to write up how to do so once I have thoroughly tested it.
The basic idea was to give every nsjail instance inside the docker a veth pair that allows connecting from inside the jail to outside, and there use masquerading to get the request to the internet and back. This also needs routing to be set up.
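Very roughly, and purely to illustrate that idea (this is not my exact working setup; interface names and addresses here are made up), the outside-the-jail part could look like this:
# one veth pair per jail; the *_in end gets moved into that jail's network namespace
ip link add jail0_out type veth peer name jail0_in
ip addr add 10.200.0.1/30 dev jail0_out
ip link set jail0_out up
# move the inner end into the jail's netns (identified here by the init PID of that jail)
ip link set jail0_in netns "$JAIL_PID"
# inside the jail: bring jail0_in up as 10.200.0.2/30 with a default route via 10.200.0.1
# back outside: forward and masquerade towards the container's uplink (eth0 here)
sysctl -w net.ipv4.ip_forward=1
iptables -t nat -A POSTROUTING -s 10.200.0.0/30 -o eth0 -j MASQUERADE
iptables -A FORWARD -i jail0_out -o eth0 -j ACCEPT
iptables -A FORWARD -i eth0 -o jail0_out -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT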
Adding this to the nsjail config makes setuid binaries actually run as their owner:
disable_no_new_privs: true
More info in docs.kernel.org:
Any task can set no_new_privs. Once the bit is set, it is inherited across fork, clone, and execve and cannot be unset. With no_new_privs set, execve() promises not to grant the privilege to do anything that could not have been done without the execve call. For example, the setuid and setgid bits will no longer change the uid or gid; file capabilities will not add to the permitted set, and LSMs will not relax constraints after execve.
[...] Note that no_new_privs does not prevent privilege changes that do not involve execve(). An appropriately privileged task can still call setuid(2) and receive SCM_RIGHTS datagrams.
[...] and no_new_privs + chroot is considerable less dangerous than chroot by itself.
The last part sounds spooky though. Might want to be careful with this setting? I'm asking on serverfault. In the meantime, it makes sense to at least ensure that there are no setuid binaries owned by root, or if there are then at least that the "root" user in the jail is mapped to some non-root user outside the jail.
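Checking for that is just standard find usage; e.g., run in the container (outside the jail) to list setuid binaries owned by root in the chroot:
find /chroot -xdev -type f -perm -4000 -user root 2>/dev/null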
In nsjail.cfg
:
uidmap [
{inside_id: "3777", outside_id: "1000", use_newidmap: true},
{inside_id: "0", outside_id: "7777", use_newidmap: true}
]
gidmap [
{inside_id: "3777", outside_id: "1000", use_newidmap: true},
{inside_id: "0", outside_id: "7777", use_newidmap: true}
]
disable_no_new_privs: true
In Dockerfile: Ensure the user with uid and gid 7777 exists. The user with id 1000 already exists in the kctf docker image from the start. I am calling my fake root user and group for inside the jail "route" in the container (outside the jail), but that is arbitrary.
I also ensure that the user 1000 may map jail users to the (outside-) ids 1000 and 7777.
FROM gcr.io/kctf-docker/challenge@sha256:eb0f8c3b97460335f9820732a42702c2fa368f7d121a671c618b45bbeeadab28 as kctf-stage
RUN apt-get install -yq --no-install-recommends uidmap &&\
rm -rf /var/lib/apt/lists/* &&\
groupadd --system --gid 7777 route &&\
useradd --system --no-create-home --shell /sbin/nologin --uid 7777 route --gid 7777
COPY --from=chal / /chroot
RUN chown --from=root:root route:route -R /chroot
RUN echo "127.0.0.1 NSJAIL" >> /chroot/etc/hosts &&\
echo "1000:1000:1\n1000:0:1\n1000:7777:1" > /etc/subuid &&\
echo "1000:1000:1\n1000:0:1\n1000:7777:1" > /etc/subgid
CMD kctf_setup && \
kctf_drop_privs \
socat \
TCP-LISTEN:1337,reuseaddr,fork \
EXEC:"/kctf/launch_nsjail.sh"
This was surprisingly tricky. The hardest of the issues to figure out is probably the one about needing a user with id 1 in the jail. It must not be the current user... but it must exist. Here's a paste from my notes:
To summarize the points I was stuck at in case someone needs sudo as well in kctf inside the nsjail jail:
- make sure the sudoers file is inside the jail, not only outside lmao
- make sure all groups and uids actually exist inside the jail. An invalid uid/gid can make it fail
- make sure all permissions of sudo and its library are correct. It must be owned by the user who is considered root inside the jail
- if you chowned the sudo files, keep in mind that this will have removed the setuid bit from them, you need to re-add it (see the sketch after this list)
- all sudo-able uids/gids need to be mapped in the nsjail config to some valid outside uid/gid (not 100% sure that is necessary actually, if that user does not own any files). To do so for multiple users requires the use_newidmap: true syntax inside the list of uid mappings, and a file /etc/subuid resp. /etc/subgid outside the jail. The nsjail config syntax is rough to get right.
- the uid 1 must exist inside the jail and be mapped to something valid outside the jail, otherwise the kernel gets confused and refuses the hackaround that sudo is trying to do to avoid some problem I do not understand, from 13 years ago
- there must be the necessary privileges granted in the nsjail config. Maybe I can still make this stricter, but for now I have disable_no_new_privs: true and keep_caps: true. (Likely it will be enough to say cap: [ "CAP_SETFCAP", "CAP_SETUID", "CAP_SETGID" ] instead of keep_caps: true. Notice that all three are needed.)
- It is not necessary to map the root user inside to the root user outside if you do a chown + set chmod with suid bit again.
- if you need to debug sudo, you can enable logging like this:
# set up logging for real sudo
RUN echo "Debug sudo /tmp/log_sudo all@debug\nDebug sudoers.so /tmp/log_sudoers all@debug\n" >> /etc/sudo.conf
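For the chown/setuid point above, a minimal sketch of what I mean, assuming a Debian/Ubuntu-style sudo layout (your paths may differ) and the "route" user from above as the in-jail root:
# hypothetical excerpt of the final Dockerfile stage
RUN chown route:route /chroot/usr/bin/sudo /chroot/usr/lib/sudo/sudoers.so && \
    chmod 4755 /chroot/usr/bin/sudo   # chown cleared the setuid bit, so put it back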
Luckily my poc setup for getting sudo to run in nsjail (I have not yet run it inside the kctf kind cluster btw, that could still fail lol) is separate from my challenge, so if anyone needs to use sudo I guess it's easiest to just borrow my poc settings to see how the syntax works.
- My poc is based on ubuntu 18
- basing the chroot on python-alpine fails with new errors. Not recommended.
- the python-slim and python base images seem to have locked the root account, so sudo fails when run as root.
The easiest way to deal with this is to redesign the challenge. The second easiest way is to make the challenge contain a "jumpbox" shell from where the ctf player can operate. Because, well... there is just no way in kctf to reliably get traffic to the same instance through different tcp connections. And even if you do, it would then not reach the nsjail instance, unless you decided to go without nsjail.
But for some reason, it is configurable anyway, so maybe I am wrong.
When you want to have multiple users, you should map their user ids to existing user ids outside the jail. Otherwise, file permissions will be weird.
Ideally though, no user in the jail maps to the real root user id, because otherwise they could read the files that root can read (only the files they can see in the jail, at least).
To map multiple user ids, we need nsjail to use newuidmap and we need to specify in an /etc/subuid file inside the container (outside the jails) which user ids may be mapped to by which user. The syntax user:outsideUserID:N declares a user who may map a range of N ids starting at outsideUserID. For groups, a similar file /etc/subgid exists. The first part (user) may be a userID or a username.
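For example, with the numbers used further down on this page, a line like
1000:3777:1
in /etc/subuid means: the user with uid 1000 may map jail ids onto a range of length 1 starting at outside uid 3777; something like 1000:100000:65536 (made-up numbers) would allow a range of 65536 ids starting at 100000.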
Note that usually, you call nsjail after kctf_drop_privs, which looks like this in the image they use:
#!/bin/bash
# There are two copies of this file in the nsjail and healthcheck base images.
all_caps="-cap_0"
for i in $(seq 1 $(cat /proc/sys/kernel/cap_last_cap)); do
all_caps+=",-cap_${i}"
done
exec setpriv --init-groups --reset-env --reuid user --regid user --inh-caps=${all_caps} -- "$@"
So kctf_drop_privs makes the current user become user, and hence this user is the one you need to give permissions to. The user user (outside the nsjail) has id 1000.
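A hypothetical quick check inside the container of what that means in practice:
kctf_drop_privs id
# expected output, roughly: uid=1000(user) gid=1000(user) groups=1000(user)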
Usually, you'll want to map to user, because that makes things easier.
I have files belonging to a user with id 3777 that I bind-mount into the jail. That is, you would have in nsjail.cfg something like this:
uidmap [
{inside_id: "1000", outside_id: "3777", use_newidmap: true},
{inside_id: "0", outside_id: "0", use_newidmap: true}
]
gidmap [
{inside_id: "1000", outside_id: "3777", use_newidmap: true},
{inside_id: "0", outside_id: "0", use_newidmap: true}
]
And correspondingly in /etc/subuid and /etc/subgid the user and group id mappings allowed:
# Dockerfile:
RUN echo "1000:3777:1\n1000:0:1" > /etc/subuid &&\
echo "1000:3777:1\n1000:0:1" > /etc/subgid
Sidenote: it seems that /etc/subuid does not have to specify that a user is allowed to map to their own user id. So if you were to launch nsjail with this config as the (outside jail) user with id 3777 instead of the user with id 1000, it would still work. That can be confusing.
This is a questionable design choice because:
- The config above maps root inside the jail to root outside the jail. So if anyone attains root inside the jail, they can access the files they see like the real root user can.
[W][2023-11-13T13:52:05+0000][20] logParams():266 Process will be UID/EUID=0 in the global user namespace, and will have user root-level access to files
This is tangential though, I could as well have made an example without this issue.
- The config above maps the user user (uid 1000) inside the jail to the user edwald (uid 3777) outside the jail. But in my challenge I want edwald to be able to access certain files. By default, everything in the docker image has permissions set up so that it will work for mounting the things user needs... but for other users we run into things such as this message when I was trying to mount something from a temp folder outside the jail to a temp folder inside the jail:
[W][2023-11-13T13:52:05+0000][1] mountPt():217 mount(''/tmp/tmp.0zS1sNKHAL/chal' -> '/tmp/chal' flags:MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true') src:'/tmp/tmp.0zS1sNKHAL/chal' dstpath:'/tmp/nsjail.1000.root//tmp/chal' failed. Try fixing this problem by applying 'chmod o+x' to the '/tmp/tmp.0zS1sNKHAL/chal' directory and its ancestors: Permission denied
[E][2023-11-13T13:52:05+0000][1] initCloneNs():414 Couldn't mount '/tmp/chal'
We can debug such issues by printing ls -lah 2>&1 output to stderr in the dockerfile launch command, so we see the permissions (or use docker exec to enter the container and explore interactively). In this particular case, the reason for the error message seems to be that /tmp/tmp.0zS1sNKHAL in the container (outside the jail) apparently belongs to user and has permissions drwx------, emphasizing the "and its ancestors" part of the error message.
Using the user user, we would not have this problem because the folder belongs to user. If we're being stubborn and want to use our custom user edwald instead, we need to first run chmod o+x for that temporary folder in a script between kctf_drop_privs (we now run as user) and nsjail (we enter the jail); see the sketch below.
This solution will of course make this temporary directory listable by every user in the container. That should not be an issue, as nsjail mounts only that folder, and for each instance a separate such folder:
[I][2023-11-13T14:11:53+0000] Mount: '/tmp/tmp.5TT1WU4lwT/chal' -> '/tmp/chal' flags:MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
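If you do go that stubborn route, a hypothetical excerpt of launch_nsjail.sh (names as in the examples above) showing where the chmod would live:
# kctf_drop_privs has already happened, so we run as "user", who owns the mktemp dir
mytmpfs=$(mktemp -d)
mkdir "$mytmpfs/chal"
# ... populate "$mytmpfs/chal" with whatever the jail should see ...
# let other outside users (here: edwald, uid 3777) traverse into the bind-mount source and its ancestor
chmod o+x "$mytmpfs" "$mytmpfs/chal"
kctf_pow nsjail --config /kctf/nsjail.cfg --bindmount "$mytmpfs/chal:/tmp/chal" -- /bin/bash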
But still, the easier, cleaner, more maintainable way to go about this is to keep the standard user of the nsjail (uid 1000 inside the jail) mapped to the user user (uid 1000 outside the jail). Or at least to keep the outside user user and map our custom user from inside the jail to it (without creating a custom user in the container outside the jail).
This means that the files will need to be accessible to user now, outside the jail, not to edwald. You'll have to change some permissions. To do so, it is useful to be aware that:
- You can run chown to only change files with a specific user and group:
chown --from=root:root user:user -R /chroot
- You can copy from another docker image and specify the new owner in the same command:
COPY --from=chal --chown=user:user
- Any change of ownership will reset suid bits :(
In some cases, modifying all the file permissions in the right way will be just as annoying as just using the different custom user.
While it is not necessary to know this, sooner or later you will run into it while trying to understand what nsjail is doing.
My latest attempt at understanding capabilities, using [this link](https://blog.container-solutions.com/linux-capabilities-why-they-exist-and-how-they-work) and some mental gymnastics to construct a concept that makes sense to me, follows.
Other useful sources are the capabilities man page and the setpriv man page.
Refer to this while reading when you feel like it:
- Processes have the sets: Effective, Permitted, Inheritable, Ambient, Bounding.
- Files have the sets Permitted and Inheritable, and a single bit called the EffectiveBit.
- The Effective set is simply what caps the current process has.
  - Permitted capabilities become effective when the process makes a call to add them to the effective set.
- If the current process $P$ decides to launch a new process $P'$ by executing a file $F$, the new process will inherit the Inheritable set ... but only those capabilities that $F$ specifies that it needs, in the file-based inheritable set.
- When a new process is created, the inheritable set remains the same. It describes which caps can be inherited by new threads.
- Historically, many files are not aware that capabilities are a thing, and so did not set anything up in $F$. Running such a file would not inherit anything from the Inheritable set, because the inheritable set on the file $F$ does not specify that it wants any caps. To work around this, there is also an Ambient set.
- Anything in $P'(ambient)$ is automatically also in $P'(permitted)$.
- The Ambient set stays the same forever, unless the file $F$ is privileged, in which case it is zeroed out. But then... how does anything ever get into the ambient set?
  - There are also some important rules for adding/removing capabilities from the ambient set.
    - A capability can never be in a thread’s ambient set if it is not also in the inheritable and permitted sets.
    - Dropping a capability from either of those sets also removes it from the ambient set.
    - Non-root threads can add capabilities from their permitted set to the ambient set, which will allow their children to also use that capability with normal files.
- The Bounding set limits which capabilities a process may accumulate.
  - The bounding set is roughly intended to control which capabilities are available within a process tree.
  - A capability can be added to the inheritable set if the current thread has the CAP_SETPCAP capability and the capability is within the bounding set.
- On the file $F$, if we put something in the permitted set of the file, it will put that into the permitted set of the newly launched process. Unless the bounding set forbids that (by not containing that capability).
- tricky:
  The confusion comes when we consider that the bounding set does not control the inheritable set; you can keep capabilities in the inheritable set that are not in the bounding set. If a thread has a capability in its inherited set and executes a file with the capability in its inheritable set, the resultant process will have that capability in the permitted and effective sets, regardless of whether it exists in the bounding set.
  Removing a capability from the bounding set does not remove it from the thread's inherited set. However it does prevent the capability from being added back into the thread's inherited set in the future.
  (source: man page quote on askubuntu)
  • A capability can be added to the inheritable set only if it is currently present in the bounding set.
  • A capability can be added to the ambient set only if it is currently present in both the permitted and inheritable sets.
  • Notwithstanding the syntax offered by setpriv, the kernel does not permit capabilities to be added to the bounding set.
  If you drop a capability from the bounding set without also dropping it from the inheritable set, you are likely to become confused. Do not do that.
  - So I think this means that when a cap is in the inheritable set, we cannot get rid of it for sure simply by removing it from the bounding set... ==TODO== But why is that so?
  - On that note, notice that even if a file $F$ did not have a cap in its inheritable set, that capability will stay in the process's inheritable set (it just won't make it into the Permitted set).
- If the effectiveBit on the file $F$ is set, the process will on launch get all caps from the Permitted set into the Effective set. Otherwise, if the effectiveBit is not set, it will only get the Ambient set into the Effective set.
  - As everything in the ambient set must also be in the inheritable and permitted sets, the Ambient set is $\leq$ the Permitted set.
  - ==TODO== what is the point of that? Why not always give the permitted set? This answer might be explaining it but I don't really get it. Something about legacy binaries that do not know how to deal with capabilities?
    I guess it's basically "if the effectiveBit is set, make the Permitted set effective, otherwise do nothing until the binary moves the Permitted caps by itself to the effective set."
    A capability aware application, will start with no effective capabilities, and copy capabilities to the effective set when needed, and clear the effective set when done (reducing the harm that bugs can do).
    So set the bit if the binary needs some capabilities effective from the start, instead of just being allowed to make them effective.
- ==TODO== do files retain their capabilities on copy?
- So how do I fully get rid of a capability?
  - if it is in the Ambient set, the child will still have it
  - if it is in the inheritable set and the child file $F$ has it also in the (desired, if I got that right?) inheritable set, then the child will still have it in the permitted set.
    - And independent of $F$ it will stay in the inheritable set further down the chain (unless explicitly removed, which a process can do!!) even if it is removed from the bounding set.
  - if it is in the Permitted and Bounding Set, it stays permitted in the child.
  - If it is not in the bounding set, not in the ambient set, and not inheritable, then it will not be permitted in the child.
    - And I think in this case it won't be getting it back into the permitted set, and hence also not into the ambient or effective sets.
  - If it is in the bounding set, the current process will not have any advantage from it but a child might, if the file also has it set in its inheritable and permitted sets.
    yes the only way for an Inheritable capability to be promoted into the Permitted or Effective set is if the file is also marked with that capability.
    implementation detail: this needs xattrs.
- Dropping caps requires a cap as well, because otherwise:
  The example discussed there points out that if an unprivileged user can drop CAP_SETUID from the bounding set, then setuid-root programs can no longer drop privileges by changing UID from root to some other user.
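To poke at these sets concretely, a few standard commands (nothing kctf-specific) that I find handy:
grep Cap /proc/self/status        # CapInh, CapPrm, CapEff, CapBnd, CapAmb of the current shell, as hex masks
capsh --decode=00000000a80425fb   # decode such a hex mask into capability names
capsh --print                     # the same sets already decoded, plus uid/gid info
getcap /usr/bin/ping              # file capabilities (stored in the security.capability xattr)
# setcap cap_net_raw+ep /path/to/binary   # would set the file's Permitted set + EffectiveBit (needs root / CAP_SETFCAP)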