zebra_apic threads not started after FRR service restart #16747
Comments
I have been using the following script to reproduce the issue. I modified the original service file to permit faster restarts so that the issue shows up sooner, but it appears with the default configuration as well. To reproduce it with the default configuration, change the sleep to match the service configuration.
As I understand we should see |
If the threads are missing and the owner of the socket is root, then the issue occurred. The missing
Fixes: FRRouting#16747 Signed-off-by: Donatas Abraitis <donatas@opensourcerouting.org>
Could you test this PR #16749?
I still see the issue where the |
I looked with a friend at the code and we think that the issue might be in
The above
Could you show the whole |
This is the entire content of the
Can you reopen it since the issue persists?
I'm also seeing the wrong permissions issue for
Currently testing that theory by running the restart script with FRR built with capabilities enabled. @ToshikiRen do you have capabilities enabled in your package? If not, this might be the issue.
Actually I just saw the thread, and you do have them disabled.
The more I look at the code, the more I am convinced that for AFAICT there is nothing preventing another process from raising the privs, or it being run while privs are raised in these cases, just that raising and lowering privs isn't done in parallel. And it also fits the observed state, the socket was created as Script still running without failing in a loop. |
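The suspected race can be illustrated with a toy model. This is only a sketch of the idea, not FRR's code: `ProcessPrivs`, `create_socket` and `demo` are hypothetical names, and a plain string stands in for the process-wide effective user. A lock keeps raising and lowering from overlapping, but nothing stops another thread from doing "unprivileged" work while privileges are raised.

```python
import threading


class ProcessPrivs:
    """Toy model of process-wide privileges (the build-without-libcap case)."""

    def __init__(self):
        self._lock = threading.Lock()  # serialises raise/lower only
        self.effective_user = "frr"    # shared by every thread in the process

    def __enter__(self):
        self._lock.acquire()
        self.effective_user = "root"   # raise privileges for the whole process
        return self

    def __exit__(self, *exc):
        self.effective_user = "frr"    # lower privileges again
        self._lock.release()
        return False


def create_socket(privs):
    """Unguarded work: the 'socket' is owned by whoever the process is right now."""
    return privs.effective_user


def demo():
    privs = ProcessPrivs()
    raised = threading.Event()
    created = threading.Event()
    owners = []

    def privileged_worker():
        with privs:
            raised.set()              # privileges are now raised
            created.wait(timeout=5)   # stay raised while the other thread runs

    def api_worker():
        raised.wait(timeout=5)
        owners.append(create_socket(privs))  # runs inside the raised window
        created.set()

    threads = [threading.Thread(target=privileged_worker),
               threading.Thread(target=api_worker)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Owner seen during the raised window, and owner seen after lowering.
    return owners[0], create_socket(privs)


if __name__ == "__main__":
    print(demo())
```

In this model the socket created inside the window ends up owned by `root`, while the same call after lowering yields `frr`. If the privilege state were kept per thread instead of per process, the second thread would never observe the raised state.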
Looking at 7bfe765, building without it is already highly discouraged in 10, so maybe libcap should just be mandatory, since building without it is not just a performance penalty but actually broken (at least in 9.1+). That might be easier than trying to fix this.
Without libcap privilege is attained per process instead of per-thread. Apart from being slow due to the synchronisation [1], it is even broken as it does not guard unprivileged executions from being run in privileged context [2]. Fix this by always enabling libcap. [1] FRRouting/frr@7bfe765 [2] FRRouting/frr#16747 Signed-off-by: Jonas Gorski <jonas.gorski@bisdn.de>
@eqvinox since you authored that commit, what's your preference here? Trying to fix the per-process privileges code to ensure that zserv creates the unix socket in an unprivileged context (and make it even slower), or making libcap mandatory (and probably ripping out the per-process code)? I checked the code, and AFAICT zebra's zserv is the only user with NULL privileges here. The code seems to have been that way for quite a while (I only checked 8.2). I did not check whether there are differences preventing it on older versions (e.g. startup order/non-existent threading).
Discussed in #16638
Originally posted by ToshikiRen August 23, 2024
zebra_apic threads not started after FRR service restart (happens after multiple restarts, but not every time; the occurrence appears mostly random)
The issue is that no routes are sent from routing daemons (e.g., BGP) to the kernel.
Questions:
I managed to reproduce the issue on a box with the following configuration, without configuring BGP peers:
The error from the logs when the issue occurs during restart:
FRR version: 9.1
Show version output:
When the issue occurs the zebra zserv.api socket is owned by root instead of frr:
Looking at the code, it seems the only case in which root would own this socket is when a TCP connection is used, but that is not the case in our configuration.
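For anyone who wants to automate the ownership check during a restart loop, a minimal sketch follows. The helper names are hypothetical, the socket path is an assumption that may differ per installation, and the expected user `frr` is taken from the description above.

```python
import os
import pwd


def socket_owner(path):
    """Return the name of the user owning `path` (numeric uid if unknown)."""
    uid = os.stat(path).st_uid
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        return str(uid)


def is_affected(path, expected_user="frr"):
    """True when the socket is owned by someone other than `expected_user`."""
    return socket_owner(path) != expected_user


if __name__ == "__main__":
    # Assumed location of the zserv.api socket; adjust to your setup.
    path = "/var/run/frr/zserv.api"
    if os.path.exists(path):
        print(path, "owner:", socket_owner(path), "affected:", is_affected(path))
```

This only covers the socket ownership symptom. It does not check whether the zebra_apic threads are running.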
I have seen this issue on the latest frr release (10.1) as well.