The affected code originates from the official Linux kernel from https://kernel.org/ and is part of the Netfilter nf_tables component (net/netfilter/nf_tables_api.c).
Netfilter nf_tables allows to update its configuration as an atomic operation. When using this feature, the user-mode clients send batch requests containing a list of basic operations. Netfilter nf_tables then processes all the operations within the batch as single transaction. When processing the batch, Netfilter nf_tables then checks the configuration state updates to ensure that each successive basic operation is valid and this also accounts for the state updates from all the previous operations within the batch. However, the currently implemented check is insufficient.
In our specific scenario we start with a Netfilter nf_tables configuration
that has an nft_rule
with lookup
expression on anonymous nft_set
, and
where the anonymous nft_set
contains some elements. Next, we send a batch
request containing the following two basic operations:
NFT_MSG_DELRULE
operation to delete thenft_rule
.
Note that this also implicitly deletes thelookup
expression and the anonymousnft_set
.NFT_MSG_DELSETELEM
operation to delete any of the elements of the deleted anonymousnft_set
.
The current version of Netfilter nf_tables accepts the above batch request.
It then calls nf_tables_commit_release() that appends released resources to
nf_tables_destroy_list
. The nf_tables_destroy_list
is then processed by
nf_tables_trans_destroy_work() that first deallocates resources related to
NFT_MSG_DELRULE
operation by calling:
nft_commit_release()
nf_tables_rule_destroy()
nf_tables_expr_destroy()
expr->ops->destroy() that points to nft_lookup_destroy()
nf_tables_destroy_set()
nft_set_destroy()
kvfree() that deallocates memory used by `nft_set`
before processing NFT_MSG_DELSETELEM
operation, where reference to the
deallocated nft_set
is accessed via nft_trans_elem_set() during the
following calls:
nft_commit_release()
nf_tables_set_elem_destroy()
nft_set_elem_ext()
Within nft_set_elem_ext() above, the memory location of the deallocated
nft_set
is accessed to determine location of nft_set_ext
:
static inline struct nft_set_ext *nft_set_elem_ext(const struct nft_set *set,
void *elem)
{
return elem + set->ops->elemsize;
}
for the operations that follow. So whenever the value of set->ops->elemsize
gets corrupted, certain unexpected memory location could be interpreted as
list of nft_expr
to be destroyed:
static void nf_tables_set_elem_destroy(const struct nft_ctx *ctx,
const struct nft_set *set, void *elem)
{
struct nft_set_ext *ext = nft_set_elem_ext(set, elem);
if (nft_set_ext_exists(ext, NFT_SET_EXT_EXPRESSIONS))
nft_set_elem_expr_destroy(ctx, nft_set_ext_expr(ext));
Exploiting the above vulnerability requires winning a race with nf_tables_trans_destroy_work() that executes from background worker thread from the Linux kernel. This seems to complicate practical exploitation even before we consider existing mitigations, such as hardening of kernel slab allocator, Kernel Address Space Layout Randomization (KASLR) and especially Control-Flow Integrity. However, the attached PoC proves that it is still possible to achieve reasonably reliable exploitation in practice.
In order to exploit the vulnerability we need to modify content of memory
from nft_set
after it is deallocated under nf_tables_rule_destroy(), but
before it is used under nf_tables_set_elem_destroy(). Both
nf_tables_rule_destroy() and nf_tables_set_elem_destroy() are called within
single invocation of nf_tables_trans_destroy_work() that executes from
background worker thread from the Linux kernel. Further, the deallcated
memory chunk is usually available for reuse only from the same CPU core.
When racing with nf_tables_trans_destroy_work(), we improve our chances by
adding a controlled delay for the background worker thread between it calls
nf_tables_rule_destroy() and nf_tables_set_elem_destroy(). For that we insert
an additional operation to destroy another nft_set
containing a large
number of elements. Additionally, we keep all the other CPU cores busy, such
that the background worker thread is likely to be scheduled on a specific CPU
core, so we can attempt to allocate a new structure from the same CPU core
just after it deallocates nft_set
under nf_tables_rule_destroy(). Our goal
is to allocate a new nft_set
of different type to reuse memory location of
the nft_set
deallocated under nf_tables_rule_destroy().
The new nft_set
type is selected to use a different value for
set->ops->elemsize
. So when the background worker thread finally calls
nf_tables_set_elem_destroy() to process NFT_MSG_DELSETELEM
operation, it
interprets its elem
argument incorrectly, such that the corrupted
nft_set_ext *ext
is a few bytes after the correct location. This means that
certain user-controlled data field of the original nft_set_ext
are now
interpreted as headers, resulting with type confusion.
One way to abuse this type confusion is by crafting the corrupted
nft_set_ext
headers with offsets values such that
nf_tables_set_elem_destroy() interprets content of any adjacent memory blocks
as the list of nft_expr
to destroy via the following calls:
nft_set_elem_expr_destroy()
__nft_set_elem_expr_destroy()
nf_tables_expr_destroy()
expr->ops->destroy()
At this point of exploitation, we do not yet have details of the kernel
memory layout. So it is not possible to craft absolute pointer addresses.
However, when crafting the corrupted nft_set_ext
headers we can still use
out-of-range offsets, such that expr->ops->destroy()
is called on certain
valid nft_expr
in the adjacent memory chunks.
For this we spray nft_log
expressions, with controlled NFTA_LOG_PREFIX.
That nft_log->prefix
is then deallocated by nft_log_destroy() once
expr->ops->destroy()
is called:
static void nft_log_destroy(const struct nft_ctx *ctx,
const struct nft_expr *expr)
{
struct nft_log *priv = nft_expr_priv(expr);
struct nf_loginfo *li = &priv->loginfo;
if (priv->prefix != nft_log_null_prefix)
kfree(priv->prefix);
Note that we can still access and even again deallocate this memory via the
other reference from the sprayed nft_log
expression.
Additionally, we can also control the size of nft_log->prefix
, such that it
can be allocated from any of the slabs kmalloc-{8, ..., 192}. Finally, the
refereed memory is interpreted as a string of characters by the kernel, so no
need to worry about corruptions when we overlay different objects over it.
This is essentially game over.
One inconvenience is that any NULL characters terminate nft_log->prefix
, so
we cannot read past NULL bytes when leaking memory content. This is addressed
in the next step, where we allocate nft_object->udata
to reuse
nft_log->prefix
memory chunk and destroy the nft_log
expression. This
deallocates nft_object->udata
memory, but now we can still use the
nft_object->udata
dangling pointer to leak memory content without
restrictions on NULL bytes.
Looking for suitable structures for the following steps, we decided on
nft_expr
allocated from nft_dynset_new(). These live in the same slabs as
nft_log->prefix
and nft_object->udata
. And also, we have reasonable
control over the allocation size, such that later we could easily switch
between slabs of different size if needed.
To use these structures, we create packet filter with nft_dynset
expression. And when we send any packets over the loopback interface,
nft_dynset
expression calls nft_dynset_new() to create new elements for the
associated nft_set
. The created elements are stateful expressions of the
following types:
-
nft_counter
to obtain the location ofnf_tables.ko
in kernel memory.
The structure includes a pointer tonft_counter_ops
innf_tables.ko
kernel module. We leak this pointer by readingnft_object->udata
. -
nft_quota
for arbitrary memory read and write.
We can repeatedly deallocate and reallocatenft_object->udata
to modify thenft_quota->consumed
pointer. Next, we performNFT_MSG_GETSETELEM
operation that calls nft_quota_do_dump() to read the content of the referenced memory and passes the result asNFTA_QUOTA_CONSUMED
attribute in the result. As for writes, we simply send packets over the loopback interface, where nft_quota_do_eval() calls:static inline bool nft_overquota(struct nft_quota *priv, const struct sk_buff *skb) { return atomic64_add_return(skb->len, priv->consumed) >=
to modify
nft_quota->consumed
.
We use the above arbitrary memory read to obtain base address of the kernel core. And then we proceed to modify "sbin" substring of "/sbin/modprobe" pathname, so it is replaced with "/tmp". The resulting pathname "//tmp/modprobe" is then used by the kernel to start a process with root privileges, where we control the file content.
Note that we didn't put any intentional effort to bypass Control-Flow Integrity. However, for each of the exploitation steps, we consciously picked the most flexible and the most robust primitives. Turns-out, that our selection somehow avoided any of the primitives that could potentially be blocked by Control-Flow Integrity. We are now curious to confirm with testing that the resulting exploit really works against systems with Control-Flow Integrity mitigations.