
net: remember the name of the lock chain (nftables) #2550

Open

adrianreber wants to merge 1 commit into criu-dev from 2024-12-17-nftables-lock-name
Conversation

adrianreber (Member)

Using libnftables, the chain used to lock the network is named ("CRIU-%d", real_pid). This leads to around 40 zdtm tests failing with errors like this:

Error: No such file or directory; did you mean table 'CRIU-62' in family inet?
delete table inet CRIU-86

The reason is that as soon as a process runs in a PID namespace, its real PID can be anything; only the PID inside the namespace is restored correctly. Relying on the real PID therefore does not work for the chain name.

Using the PID from the innermost namespace would lead to the chain being called 'CRIU-1' most of the time, which is also not really unique.

The uniqueness of the name was always problematic. With this change, all tests that rely on network locking pass again when the nftables backend is used.
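
For reference, the name generation this change replaces looks like this on criu-dev (criu/netfilter.c; reconstructed from the diffs quoted later in this thread, so the surrounding details may differ slightly):

int nftables_get_table(char *table, int n)
{
	/* root_item->pid->real is the dump-time real PID. On restore only
	 * the PID inside the namespace is recreated, so a restore running
	 * with PID namespaces computes a different name than the dump did
	 * and then fails to find the table. */
	if (snprintf(table, n, "inet CRIU-%d", root_item->pid->real) < 0) {
		pr_err("Cannot generate CRIU's nftables table name\n");
		return -1;
	}
	return 0;
}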

@adrianreber adrianreber force-pushed the 2024-12-17-nftables-lock-name branch from c483710 to 0305093 Compare December 17, 2024 13:46
@avagin avagin requested a review from mihalicyn December 17, 2024 16:25
@mihalicyn (Member)

Hi Adrian!

Can you tell me how and under which circumstances you caught this issue? It is not in our GitHub CI tests, right?

As far as I understand, the idea of your fix is to keep the nftables table name in the inventory image file instead of dynamically recalculating it on restore (using root_item->pid->real). Am I right?

The first question I have after going through this is: how did this work before?

P.S. I'll take a closer look into this; I haven't spent enough time yet to fully understand what's going on there.

@adrianreber (Member, Author)

The first question I have after going through this is: how did this work before?

It probably never did. We do not run the full test suite on a system without iptables using the nftables locking backend; only two or four tests run with the nftables backend.

@adrianreber (Member, Author)

Can you tell me how and under which circumstances you caught this issue?

I am trying to switch the default locking backend in Fedora and CentOS >= 10 from iptables to nftables, because iptables is no longer installed by default.

As far as I understand, the idea of your fix is to keep the nftables table name in the inventory image file instead of dynamically recalculating it on restore (using root_item->pid->real). Am I right?

Yes. The table name works if locking and unlocking happen in the same CRIU run, but across CRIU runs the existing approach does not work.

@mihalicyn (Member)

Ah, thanks for the clarifications!

I wonder if we can do something like this:

$ git diff
diff --git a/criu/netfilter.c b/criu/netfilter.c
index 9e78dc4b0..c558f9bf1 100644
--- a/criu/netfilter.c
+++ b/criu/netfilter.c
@@ -299,7 +299,7 @@ int nftables_lock_connection(struct inet_sk_desc *sk)
 
 int nftables_get_table(char *table, int n)
 {
-       if (snprintf(table, n, "inet CRIU-%d", root_item->pid->real) < 0) {
+       if (snprintf(table, n, "inet CRIU-%d", root_item->ids->pid_ns_id) < 0) {
                pr_err("Cannot generate CRIU's nftables table name\n");
                return -1;
        }

Yes, it's not a forward-compatible change and will break restore of images that were dumped with an older CRIU. In this form it only works for experimental purposes (and we would have to check root_item->ids->has_pid_ns_id too). But I'm curious whether it helps.

@mihalicyn (Member)

mihalicyn commented Dec 18, 2024

My idea is that instead of introducing a new field nft_lock_table in the inventory_entry just for this single purpose, we can use inventory_entry->root_ids->pid_ns_id or root_item->ids->pid_ns_id as a source of a unique CRIU run ID. We already do something like that when generating the criu_run_id value:

void util_init(void)
{
...
	/* Low 32 bits: the PID of this CRIU instance. */
	criu_run_id = getpid();
	/* High 32 bits: the inode number of our PID namespace, so the ID
	 * stays unique across CRIU instances in different PID namespaces. */
	if (!stat("/proc/self/ns/pid", &statbuf))
		criu_run_id |= (uint64_t)statbuf.st_ino << 32;

@adrianreber (Member, Author)

@mihalicyn I am happy to use whatever makes most sense.

What is pid_ns_id? Is that basically the inode of the PID NS? Or more? It still sounds like something we need to save somewhere in the image, right?

Yes, it's not a forward-compatible change and will break restore of images that were dumped with an older CRIU.

I don't think we have to worry about this. Currently it doesn't work at all.

Let me know which ID makes the most sense and I can rework this PR. I think the important part is that it has to come from some value in the checkpoint image and not be generated during restore.

@adrianreber (Member, Author)

@mihalicyn I think I understood your proposal now. The PR could be really simple, as pid_ns_id is already in the image. Let me try it out.

@adrianreber (Member, Author)

With this line it also passes all the zdtm test cases when I switch to the nftables locking backend, apart from a couple of tests that call iptables (which I did not install):

if (snprintf(table, n, "inet CRIU-%d", root_item->ids->pid_ns_id) < 0) {

That brings it down to a one-line change. Very good idea, @mihalicyn. Thanks.

How long can the pid_ns_id be? Currently the table variable is 32 characters long.

@adrianreber (Member, Author)

@mihalicyn Tests are happy, but root_item->ids->pid_ns_id is always 1 when running in the host PID namespace.

So that is not a good idea, I think, as it is not really unique.

@mihalicyn (Member)

mihalicyn commented Dec 18, 2024

Hey Adrian,

Is that basically the inode of the PID NS?

Yes, precisely.

It still sounds like something we need to save somewhere in the image, right?

We don't, as we already have it in the image anyway.

I don't think we have to worry about this. Currently it doesn't work at all.

Are we 100% sure that it doesn't work and never worked under any circumstances?

How long can the pid_ns_id be? Currently the variable table is set to 32 characters.

Hmm, it's uint32, so in a string representation I guess it's about 10 characters.
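
(Worst case: UINT32_MAX is 4294967295, i.e. 10 digits; with the 10-character "inet CRIU-" prefix and the terminating NUL that is 21 bytes, which fits comfortably into the 32-character table buffer.)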

Tests are happy, but root_item->ids->pid_ns_id is always 1 when running in the host PID namespace.

That's my bad, actually. To get the PID namespace inode number you need something like:

		ns = lookup_ns_by_id(root_item->ids->pid_ns_id, &pid_ns_desc);
		if (ns) {
			/* ns->kid is the inode number of the PID namespace;
			 * use it as the unique id for the table name */
			id = ns->kid;
		}

But yes, I don't think that even with this change pid_ns_id would be enough. I think we still need to add a new field to inventory_entry, but my point is to make it universal for different cases like this one and to name it criu_run_id or something like that. Also, we should clearly document when it is unique and for what purposes it must be used.

@mihalicyn (Member)

Also, we have the inventory_entry.dump_uptime field, which we could consume to get a certain degree of uniqueness.

@adrianreber (Member, Author)

Ah, okay. So let's use the criu_run_id and store it in the inventory.

I don't think we have to worry about this. Currently it doesn't work at all.

Are we 100% sure that it doesn't work and never worked under any circumstances?

I don't know. All tests with open TCP connections just hang during restore because the network locking cannot be disabled. According to zdtm it is so broken that it currently doesn't work.

Also, we have the inventory_entry.dump_uptime field, which we could consume to get a certain degree of uniqueness.

As an additional field in the nft table name, i.e. something like ("CRIU-%d-%" PRIx64, criu_run_id, inventory_entry.dump_uptime)?

Or instead of criu_run_id? Currently we collect the uptime rather late in the checkpointing process, definitely after the network locking. It seems to be used only by detect_pid_reuse(), which in turn appears to be relevant only during pre-dump when looking at the parent process. So we could move the uptime detection to an earlier point and then also use it in the network locking chain name. During restore we would then need to look at the criu_run_id and the uptime of the checkpointing run. That could work.
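
A rough sketch of what that combined name could look like (make_table_name and the way run_id/uptime are passed in are illustrative, not code from this PR):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical sketch: name the table after both the dump-time
 * criu_run_id and inventory_entry.dump_uptime. On restore both values
 * would have to be read back from the inventory image. Note that the
 * current 32-byte table buffer would need to grow to hold two 64-bit
 * hex values. */
static int make_table_name(char *table, int n, uint64_t run_id, uint64_t uptime)
{
	if (snprintf(table, n, "inet CRIU-%" PRIx64 "-%" PRIx64, run_id, uptime) < 0)
		return -1;
	return 0;
}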

@rst0git (Member)

rst0git commented Dec 18, 2024

The first question I have after going through this is: how did this work before?

It probably never did. We do not run the full test suite on a system without iptables using the nftables locking backend; only two or four tests run with the nftables backend.

Would it be possible to add a CI workflow or modify an existing one to run all tests with the nftables backend?

Using libnftables, the chain used to lock the network is named
("CRIU-%d", real_pid). This leads to around 40 zdtm tests failing
with errors like this:

Error: No such file or directory; did you mean table 'CRIU-62' in family inet?
delete table inet CRIU-86

The reason is that as soon as a process runs in a PID namespace, its
real PID can be anything; only the PID inside the namespace is restored
correctly. Relying on the real PID does not work for the chain name.

Using the PID from the innermost namespace would lead to the chain
being called 'CRIU-1' most of the time, which is also not really unique.

With this commit the chain is now named using the already existing CRIU
run ID. To be able to correctly restore the process and delete the
locking table, the CRIU run ID of the checkpointing run is now stored in
the inventory as dump_criu_run_id.

Signed-off-by: Adrian Reber <areber@redhat.com>
@adrianreber adrianreber force-pushed the 2024-12-17-nftables-lock-name branch from 0305093 to 30e76fd Compare December 20, 2024 10:12
@adrianreber (Member, Author)

@mihalicyn What do you think about the latest version? In my tests it works just as well as the previous version. It now uses criu_run_id as suggested.
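
Based on the review hunks further down, the reworked name generation amounts to roughly this (a sketch; the exact signature and fallback handling in the PR may differ):

int nftables_get_table(char *table, int n)
{
	/* dump_criu_run_id is the criu_run_id the dumping CRIU stored in
	 * the inventory image; on restore it is read back, so dump and
	 * restore compute the same table name. On dump it is unset and we
	 * fall back to our own criu_run_id. */
	uint64_t id = dump_criu_run_id ? dump_criu_run_id : criu_run_id;

	if (snprintf(table, n, "inet CRIU-%" PRIx64, id) < 0) {
		pr_err("Cannot generate CRIU's nftables table name\n");
		return -1;
	}
	return 0;
}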

@mihalicyn (Member)

Hi Adrian!

What do you think about the latest version? In my tests it works just as well as the previous version. It now uses criu_run_id as suggested.

Looks great! The only thing that worries me is that the idea behind criu_run_id was to make it relatively unique within one machine/system, but here we try to use it as "unique enough" for the purpose of migration between different machines, right? (That's not a question to you, but in general.) Maybe we should consider adding more randomness to it? And, even more important, as we add criu_run_id to the images we lose the ability to increase its size (from uint64_t to something bigger) in the future.

Also, I tried playing with the nftables-based locking mechanism on my machine and found that for the host-flavor test we have no problem with locking:

sudo ./test/zdtm.py run --ignore-taint -t zdtm/static/socket-tcp-local -f h
userns is supported
The kernel is tainted: '12288'
=== Run 1/1 ================ zdtm/static/socket-tcp-local
==================== Run zdtm/static/socket-tcp-local in h =====================
Start test
Running zdtm/static/socket-tcp-local.hook(--post-start)
./socket-tcp-local --pidfile=socket-tcp-local.pid --outfile=socket-tcp-local.out
Running zdtm/static/socket-tcp-local.hook(--pre-dump)
State      Recv-Q     Send-Q         Local Address:Port          Peer Address:Port     Process     
LISTEN     0          1                    0.0.0.0:8880               0.0.0.0:*                    
ESTAB      0          0                  127.0.0.1:8880             127.0.0.1:56452                
ESTAB      0          0                  127.0.0.1:56452            127.0.0.1:8880                 
Run criu dump
Running zdtm/static/socket-tcp-local.hook(--pre-restore)
Run criu restore
=[log]=> dump/zdtm/static/socket-tcp-local/55/1/restore.log
------------------------ grep Error ------------------------
b'(00.006320) pie: 55: 55: Restored'
b'(00.006413) Running post-restore scripts'
b'(00.006475) pidfile: Wrote pid 55 to /home/alex/storage/dev/criu/test/zdtm/static/socket-tcp-local.pid (2 bytes)'
b'(00.006480) net: Unlock network'
------------------------ ERROR OVER ------------------------
Running zdtm/static/socket-tcp-local.hook(--post-restore)
State      Recv-Q     Send-Q         Local Address:Port          Peer Address:Port     Process     
LISTEN     0          1                    0.0.0.0:8880               0.0.0.0:*                    
ESTAB      0          0                  127.0.0.1:8880             127.0.0.1:56452                
ESTAB      0          0                  127.0.0.1:56452            127.0.0.1:8880                 
Check TCP images
Send the 15 signal to  55
Wait for zdtm/static/socket-tcp-local(55) to die for 0.100000
Running zdtm/static/socket-tcp-local.hook(--clean)
Removing dump/zdtm/static/socket-tcp-local/55
==================== Test zdtm/static/socket-tcp-local PASS ====================

While for -f ns/-f uns we have the problem you described. BUT! When -f ns/-f uns is used, locking happens inside a net namespace, so we can't expect any issues with the uniqueness of nftables table names, right? My point is that we can simplify things for the case when a net namespace is dumped, while for the case when we don't use namespaces we can still use the simple PID-based approach. Tell me if I'm saying something stupid.

@mihalicyn (Member)

mihalicyn commented Dec 23, 2024

Just for the sake of demonstrating my point, something like this:

--- a/criu/netfilter.c
+++ b/criu/netfilter.c
@@ -14,6 +14,7 @@
 #include "util.h"
 #include "common/list.h"
 #include "files.h"
+#include "namespaces.h"
 #include "netfilter.h"
 #include "sockets.h"
 #include "sk-inet.h"
@@ -299,7 +300,19 @@ int nftables_lock_connection(struct inet_sk_desc *sk)
 
 int nftables_get_table(char *table, int n)
 {
-       if (snprintf(table, n, "inet CRIU-%d", root_item->pid->real) < 0) {
+       int table_id = 0;
+
+       if (root_ns_mask & CLONE_NEWNET) {
+               table_id = 0;
+       } else if (!(root_ns_mask & CLONE_NEWPID)) {
+               table_id = root_item->pid->real;
+       } else {
+               // here we need something unique
+               // as we don't have a net namespace and table name conflict is possible
+               // also, we *do* have a pid namespace and root_item->pid->real makes no sense.
+       }
+
+       if (snprintf(table, n, "inet CRIU-%d", table_id) < 0) {
                pr_err("Cannot generate CRIU's nftables table name\n");
                return -1;
        }

This fixes tests for -f h and -f ns/-f uns. At the same time, it's still broken in a setup where a process uses a PID namespace but no net namespace.

@adrianreber (Member, Author)

Also, I tried playing with the nftables-based locking mechanism on my machine and found that for the host-flavor test we have no problem with locking:

Right, because pid_real is restored correctly.

While for -f ns/-f uns we have the problem you described. BUT! When -f ns/-f uns is used, locking happens inside a net namespace, so we can't expect any issues with the uniqueness of nftables table names, right? My point is that we can simplify things for the case when a net namespace is dumped, while for the case when we don't use namespaces we can still use the simple PID-based approach. Tell me if I'm saying something stupid.

You are right, it works for -f h. But I am not sure that having different code paths for the host network namespace and separate namespaces is such a good idea. It doesn't seem to break anything if we always use criu_run_id. Having one less if is not much, but if we do not gain anything from the split, we could leave it out.

Maybe we should consider adding more randomness to it?

No strong opinion here, but it might be good.

And, even more important, as we add criu_run_id to the images we lose the ability to increase its size (from uint64_t to something bigger) in the future.

We can just deprecate the protobuf field at some point in the future and use a new one if we feel that is necessary.

@@ -229,6 +229,8 @@ static const char *unix_conf_entries[] = {
"max_dgram_qlen",
};

extern char nft_lock_table[32];
Member

I guess we don't need this anymore.

* information is needed to identify the name of the network
* locking table.
*/
dump_criu_run_id = he->dump_criu_run_id;
Member

if (he->has_dump_criu_run_id) {
	dump_criu_run_id = he->dump_criu_run_id;
}

{
if (snprintf(table, n, "inet CRIU-%d", root_item->pid->real) < 0) {
if (snprintf(table, n, "inet CRIU-%" PRIx64, id) < 0) {
Member

	/*
	 * Keep compatibility with images
	 * without he->dump_criu_run_id field.
	 */
	if (!id) {
		if (!(root_ns_mask & CLONE_NEWPID)) {
			id = root_item->pid->real;
		} else {
			pr_err("Cannot generate CRIU's nftables table name because of issue #2550\n");
			return -1;
		}
	}

	if (snprintf(table, n, "inet CRIU-%" PRIx64, id) < 0) {

What do you think about this?

@mihalicyn (Member)

mihalicyn commented Dec 23, 2024

We can just deprecate the protobuf field at some point in the future and use a new one if we feel that is necessary.

I agree.

I am not sure that having different code paths for the host network namespace and separate namespaces is such a good idea.

Yeah, I agree. I just wanted to make sure I understood the problem right, and this code example was the easiest way to show the different scenarios we have and when it works and when it doesn't.

But we still need some extra checks for compatibility reasons, IMHO.

In general, this PR looks great to me. Thanks for working on this, Adrian!


if (nftables_get_table(table, sizeof(table)))
if (dump_criu_run_id == 0)
Member

I think we need to introduce a boolean parameter here to determine whether we are on the restore or the dump codepath, as dump_criu_run_id == 0 can occur in two different cases: when we deal with an old image (without the dump_criu_run_id field) or when we are on the restore codepath.
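
A minimal sketch of what I mean (the restore flag and the fallback are illustrative, assuming the compatibility hunk suggested above):

int nftables_get_table(char *table, int n, bool restore)
{
	/* On dump we always use our own run id; on restore we use the id
	 * saved in the inventory image by the dumping CRIU. */
	uint64_t id = restore ? dump_criu_run_id : criu_run_id;

	if (restore && !id) {
		/* Old image without dump_criu_run_id: fall back as in the
		 * compatibility hunk above. */
		if (!(root_ns_mask & CLONE_NEWPID)) {
			id = root_item->pid->real;
		} else {
			pr_err("Cannot generate CRIU's nftables table name because of issue #2550\n");
			return -1;
		}
	}

	if (snprintf(table, n, "inet CRIU-%" PRIx64, id) < 0) {
		pr_err("Cannot generate CRIU's nftables table name\n");
		return -1;
	}
	return 0;
}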
