Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bgp_rpki module crashes in rare instances. #15260

Closed
donaldsharp opened this issue Jan 30, 2024 · 2 comments
Closed

bgp_rpki module crashes in rare instances. #15260

donaldsharp opened this issue Jan 30, 2024 · 2 comments
Assignees
Labels
triage Needs further investigation
Milestone

Comments

@donaldsharp
Copy link
Member

Code crash:

(gdb) 
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=136763135739456) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=136763135739456) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=136763135739456, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007c62a7e42476 in __GI_raise (sig=6) at ../sysdeps/posix/raise.c:26
#4  0x00007c62a8b382b8 in core_handler (signo=6, siginfo=0x7c62a61f9f70, context=0x7c62a61f9e40) at lib/sigevent.c:248
#5  <signal handler called>
#6  __pthread_kill_implementation (no_tid=0, signo=6, threadid=136763135739456) at ./nptl/pthread_kill.c:44
#7  __pthread_kill_internal (signo=6, threadid=136763135739456) at ./nptl/pthread_kill.c:78
#8  __GI___pthread_kill (threadid=136763135739456, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#9  0x00007c62a7e42476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#10 0x00007c62a7e287f3 in __GI_abort () at ./stdlib/abort.c:79
#11 0x00007c62a8b7cf45 in _zlog_assert_failed (xref=0x7c62a8c214a0 <rcu_read_lock._xref>, extra=0x0) at lib/zlog.c:670
#12 0x00007c62a8ab80d6 in rcu_read_lock () at lib/frrcu.c:294
#13 0x00007c62a8b7c8ee in vzlog_notls (xref=0x0, prio=2, fmt=0x7c62a8bc4fbb "%s:%d: %s(): assertion (%s) failed", ap=0x7c62a61fb250) at lib/zlog.c:425
#14 0x00007c62a8b7c58c in vzlogx (xref=0x0, prio=2, fmt=0x7c62a8bc4fbb "%s:%d: %s(): assertion (%s) failed", ap=0x7c62a61fb250) at lib/zlog.c:627
#15 0x00007c62a8b7d0a4 in zlog (prio=2, fmt=0x7c62a8bc4fbb "%s:%d: %s(): assertion (%s) failed") at ./lib/zlog.h:73
#16 0x00007c62a8b7d000 in _zlog_assert_failed (xref=0x7c62a8c214a0 <rcu_read_lock._xref>, extra=0x0) at lib/zlog.c:687
#17 0x00007c62a8ab80d6 in rcu_read_lock () at lib/frrcu.c:294
#18 0x00007c62a8b7c8ee in vzlog_notls (xref=0x7c62a8da42c0 <rpki_create_socket._xref>, prio=3, fmt=0x7c62a8d9ee95 "getaddrinfo: %s", ap=0x7c62a61fb890)
    at lib/zlog.c:425
#19 0x00007c62a8b7c58c in vzlogx (xref=0x7c62a8da42c0 <rpki_create_socket._xref>, prio=3, fmt=0x7c62a8d9ee95 "getaddrinfo: %s", ap=0x7c62a61fb890) at lib/zlog.c:627
#20 0x00007c62a8d952e5 in zlog_ref (xref=0x7c62a8da42c0 <rpki_create_socket._xref>, fmt=0x7c62a8d9ee95 "getaddrinfo: %s") at ./lib/zlog.h:84
#21 0x00007c62a8d98dba in rpki_create_socket (_cache=0x5dba80cf8570) at bgpd/bgp_rpki.c:1337
#22 0x00007c62a80ea643 in ?? () from /lib/x86_64-linux-gnu/librtr.so.0
#23 0x00007c62a80eaf69 in ?? () from /lib/x86_64-linux-gnu/librtr.so.0
#24 0x00007c62a7e94ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#25 0x00007c62a7f26850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb) ```

Bisect of the code base leads us to:

commit b28fd4e (HEAD, refs/bisect/bad)
Author: Louis Scalbert louis.scalbert@6wind.com
Date: Tue Jan 23 11:52:59 2024 +0100

bgpd: fix res validity in rpki_create_socket

Fix coverity scanner issue 1575912 where res pointer is supposed to
valid in:

> socket = vrf_socket(res->ai_family, ...)

but is checked for validity a few lines later.

Note that vrf_getaddrinfo returns an error code if getaddrinfo() fails
to allocate res and in this case, rpki_create_socket() returns.

Fixes: a951752 ("bgpd: create cache server socket in vrf")
Signed-off-by: Louis Scalbert <louis.scalbert@6wind.com>```
@donaldsharp donaldsharp added the triage Needs further investigation label Jan 30, 2024
@louis-6wind
Copy link
Contributor

I get the bug everytime with this patch.

diff --git a/bgpd/bgp_rpki.c b/bgpd/bgp_rpki.c
index 94275762f6..8f66f99338 100644
--- a/bgpd/bgp_rpki.c
+++ b/bgpd/bgp_rpki.c
@@ -1333,11 +1333,10 @@ static int rpki_create_socket(void *_cache)
 	frr_with_privs (&bgpd_privs) {
 		ret = vrf_getaddrinfo(host, port, &hints, &res, vrf->vrf_id);
 	}
-	if (ret != 0) {
-		flog_err_sys(EC_LIB_SOCKET, "getaddrinfo: %s",
-			     gai_strerror(ret));
-		return -1;
-	}
+//	if (ret != 0) {
+		zlog_warn("getaddrinfo: debug");
+//		return -1;
+//	}
 
 	frr_with_privs (&bgpd_privs) {
 		socket = vrf_socket(res->ai_family, res->ai_socktype,

The issue is because rpki_create_socket() is a hook that is called from the rtrlib library. I need to find the root cause

@louis-6wind
Copy link
Contributor

full backtrace

#0  raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f6e21723798 in core_handler (signo=6, siginfo=0x7f6e1e502ef0, context=0x7f6e1e502dc0) at lib/sigevent.c:248
#2  <signal handler called>
#3  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#4  0x00007f6e2144e537 in __GI_abort () at abort.c:79
#5  0x00007f6e2176348e in _zlog_assert_failed (xref=0x7f6e2180c920 <_xref.16>, extra=0x0) at lib/zlog.c:670
#6  0x00007f6e216b1eda in rcu_read_lock () at lib/frrcu.c:294
#7  0x00007f6e21762da8 in vzlog_notls (xref=0x0, prio=2, fmt=0x7f6e217afe50 "%s:%d: %s(): assertion (%s) failed", ap=0x7f6e1e504248) at lib/zlog.c:425
#8  0x00007f6e217632fb in vzlogx (xref=0x0, prio=2, fmt=0x7f6e217afe50 "%s:%d: %s(): assertion (%s) failed", ap=0x7f6e1e504248) at lib/zlog.c:627
#9  0x00007f6e217621f5 in zlog (prio=2, fmt=0x7f6e217afe50 "%s:%d: %s(): assertion (%s) failed") at lib/zlog.h:73
#10 0x00007f6e21763596 in _zlog_assert_failed (xref=0x7f6e2180c920 <_xref.16>, extra=0x0) at lib/zlog.c:687
#11 0x00007f6e216b1eda in rcu_read_lock () at lib/frrcu.c:294
#12 0x00007f6e21762da8 in vzlog_notls (xref=0x7f6e21a50040 <_xref.68>, prio=4, fmt=0x7f6e21a4999f "getaddrinfo: debug", ap=0x7f6e1e504878) at lib/zlog.c:425
#13 0x00007f6e217632fb in vzlogx (xref=0x7f6e21a50040 <_xref.68>, prio=4, fmt=0x7f6e21a4999f "getaddrinfo: debug", ap=0x7f6e1e504878) at lib/zlog.c:627
#14 0x00007f6e21a3f774 in zlog_ref (xref=0x7f6e21a50040 <_xref.68>, fmt=0x7f6e21a4999f "getaddrinfo: debug") at ./lib/zlog.h:84
#15 0x00007f6e21a451b2 in rpki_create_socket (_cache=0x55729149cc30) at bgpd/bgp_rpki.c:1337
#16 0x00007f6e2120e7b7 in tr_tcp_open (tr_socket=0x5572914d1520) at /home/lscalber/git/rtrlib/rtrlib/transport/tcp/tcp_transport.c:111
#17 0x00007f6e2120e212 in tr_open (socket=0x5572914b5e00) at /home/lscalber/git/rtrlib/rtrlib/transport/transport.c:16
#18 0x00007f6e2120faa2 in rtr_fsm_start (rtr_socket=0x557290e17180) at /home/lscalber/git/rtrlib/rtrlib/rtr/rtr.c:130
#19 0x00007f6e218b7ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#20 0x00007f6e21527a2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

@Jafaral Jafaral added this to the 10.0 milestone Jan 30, 2024
louis-6wind added a commit to louis-6wind/frr that referenced this issue Jan 31, 2024
Fix the following crash when logging from rpki_create_socket():

> #0  raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:50
> #1  0x00007f6e21723798 in core_handler (signo=6, siginfo=0x7f6e1e502ef0, context=0x7f6e1e502dc0) at lib/sigevent.c:248
> #2  <signal handler called>
> #3  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
> FRRouting#4  0x00007f6e2144e537 in __GI_abort () at abort.c:79
> FRRouting#5  0x00007f6e2176348e in _zlog_assert_failed (xref=0x7f6e2180c920 <_xref.16>, extra=0x0) at lib/zlog.c:670
> FRRouting#6  0x00007f6e216b1eda in rcu_read_lock () at lib/frrcu.c:294
> FRRouting#7  0x00007f6e21762da8 in vzlog_notls (xref=0x0, prio=2, fmt=0x7f6e217afe50 "%s:%d: %s(): assertion (%s) failed", ap=0x7f6e1e504248) at lib/zlog.c:425
> FRRouting#8  0x00007f6e217632fb in vzlogx (xref=0x0, prio=2, fmt=0x7f6e217afe50 "%s:%d: %s(): assertion (%s) failed", ap=0x7f6e1e504248) at lib/zlog.c:627
> FRRouting#9  0x00007f6e217621f5 in zlog (prio=2, fmt=0x7f6e217afe50 "%s:%d: %s(): assertion (%s) failed") at lib/zlog.h:73
> FRRouting#10 0x00007f6e21763596 in _zlog_assert_failed (xref=0x7f6e2180c920 <_xref.16>, extra=0x0) at lib/zlog.c:687
> FRRouting#11 0x00007f6e216b1eda in rcu_read_lock () at lib/frrcu.c:294
> FRRouting#12 0x00007f6e21762da8 in vzlog_notls (xref=0x7f6e21a50040 <_xref.68>, prio=4, fmt=0x7f6e21a4999f "getaddrinfo: debug", ap=0x7f6e1e504878) at lib/zlog.c:425
> FRRouting#13 0x00007f6e217632fb in vzlogx (xref=0x7f6e21a50040 <_xref.68>, prio=4, fmt=0x7f6e21a4999f "getaddrinfo: debug", ap=0x7f6e1e504878) at lib/zlog.c:627
> FRRouting#14 0x00007f6e21a3f774 in zlog_ref (xref=0x7f6e21a50040 <_xref.68>, fmt=0x7f6e21a4999f "getaddrinfo: debug") at ./lib/zlog.h:84
> FRRouting#15 0x00007f6e21a451b2 in rpki_create_socket (_cache=0x55729149cc30) at bgpd/bgp_rpki.c:1337
> FRRouting#16 0x00007f6e2120e7b7 in tr_tcp_open (tr_socket=0x5572914d1520) at rtrlib/rtrlib/transport/tcp/tcp_transport.c:111
> FRRouting#17 0x00007f6e2120e212 in tr_open (socket=0x5572914b5e00) at rtrlib/rtrlib/transport/transport.c:16
> FRRouting#18 0x00007f6e2120faa2 in rtr_fsm_start (rtr_socket=0x557290e17180) at rtrlib/rtrlib/rtr/rtr.c:130
> FRRouting#19 0x00007f6e218b7ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
> FRRouting#20 0x00007f6e21527a2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

rpki_create_socket() is a hook function called from the rtrlib library.
The issue arises because rtrlib initiates its own separate pthread in which
it runs the hook, which does not establish an FRR RCU context. Consequently,
this leads to failures in the logging mechanism that relies on RCU.

Initialize a new FRR pthread context from the rtrlib pthread with a
valid RCU context to allow logging from the rpki_create_socket() and
dependent functions.

Link: FRRouting#15260
Fixes: a951752 ("bgpd: create cache server socket in vrf")
Signed-off-by: Louis Scalbert <louis.scalbert@6wind.com>
louis-6wind added a commit to louis-6wind/frr that referenced this issue Feb 2, 2024
Fix the following crash when logging from rpki_create_socket():

> #0  raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:50
> #1  0x00007f6e21723798 in core_handler (signo=6, siginfo=0x7f6e1e502ef0, context=0x7f6e1e502dc0) at lib/sigevent.c:248
> #2  <signal handler called>
> #3  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
> FRRouting#4  0x00007f6e2144e537 in __GI_abort () at abort.c:79
> FRRouting#5  0x00007f6e2176348e in _zlog_assert_failed (xref=0x7f6e2180c920 <_xref.16>, extra=0x0) at lib/zlog.c:670
> FRRouting#6  0x00007f6e216b1eda in rcu_read_lock () at lib/frrcu.c:294
> FRRouting#7  0x00007f6e21762da8 in vzlog_notls (xref=0x0, prio=2, fmt=0x7f6e217afe50 "%s:%d: %s(): assertion (%s) failed", ap=0x7f6e1e504248) at lib/zlog.c:425
> FRRouting#8  0x00007f6e217632fb in vzlogx (xref=0x0, prio=2, fmt=0x7f6e217afe50 "%s:%d: %s(): assertion (%s) failed", ap=0x7f6e1e504248) at lib/zlog.c:627
> FRRouting#9  0x00007f6e217621f5 in zlog (prio=2, fmt=0x7f6e217afe50 "%s:%d: %s(): assertion (%s) failed") at lib/zlog.h:73
> FRRouting#10 0x00007f6e21763596 in _zlog_assert_failed (xref=0x7f6e2180c920 <_xref.16>, extra=0x0) at lib/zlog.c:687
> FRRouting#11 0x00007f6e216b1eda in rcu_read_lock () at lib/frrcu.c:294
> FRRouting#12 0x00007f6e21762da8 in vzlog_notls (xref=0x7f6e21a50040 <_xref.68>, prio=4, fmt=0x7f6e21a4999f "getaddrinfo: debug", ap=0x7f6e1e504878) at lib/zlog.c:425
> FRRouting#13 0x00007f6e217632fb in vzlogx (xref=0x7f6e21a50040 <_xref.68>, prio=4, fmt=0x7f6e21a4999f "getaddrinfo: debug", ap=0x7f6e1e504878) at lib/zlog.c:627
> FRRouting#14 0x00007f6e21a3f774 in zlog_ref (xref=0x7f6e21a50040 <_xref.68>, fmt=0x7f6e21a4999f "getaddrinfo: debug") at ./lib/zlog.h:84
> FRRouting#15 0x00007f6e21a451b2 in rpki_create_socket (_cache=0x55729149cc30) at bgpd/bgp_rpki.c:1337
> FRRouting#16 0x00007f6e2120e7b7 in tr_tcp_open (tr_socket=0x5572914d1520) at rtrlib/rtrlib/transport/tcp/tcp_transport.c:111
> FRRouting#17 0x00007f6e2120e212 in tr_open (socket=0x5572914b5e00) at rtrlib/rtrlib/transport/transport.c:16
> FRRouting#18 0x00007f6e2120faa2 in rtr_fsm_start (rtr_socket=0x557290e17180) at rtrlib/rtrlib/rtr/rtr.c:130
> FRRouting#19 0x00007f6e218b7ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
> FRRouting#20 0x00007f6e21527a2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

rpki_create_socket() is a hook function called from the rtrlib library.
The issue arises because rtrlib initiates its own separate pthread in which
it runs the hook, which does not establish an FRR RCU context. Consequently,
this leads to failures in the logging mechanism that relies on RCU.

Initialize a new FRR pthread context from the rtrlib pthread with a
valid RCU context to allow logging from the rpki_create_socket() and
dependent functions.

Link: FRRouting#15260
Fixes: a951752 ("bgpd: create cache server socket in vrf")
Signed-off-by: Louis Scalbert <louis.scalbert@6wind.com>
cscarpitta pushed a commit to cscarpitta/frr that referenced this issue Feb 9, 2024
Fix the following crash when logging from rpki_create_socket():

> #0  raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:50
> #1  0x00007f6e21723798 in core_handler (signo=6, siginfo=0x7f6e1e502ef0, context=0x7f6e1e502dc0) at lib/sigevent.c:248
> FRRouting#2  <signal handler called>
> FRRouting#3  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
> FRRouting#4  0x00007f6e2144e537 in __GI_abort () at abort.c:79
> FRRouting#5  0x00007f6e2176348e in _zlog_assert_failed (xref=0x7f6e2180c920 <_xref.16>, extra=0x0) at lib/zlog.c:670
> FRRouting#6  0x00007f6e216b1eda in rcu_read_lock () at lib/frrcu.c:294
> FRRouting#7  0x00007f6e21762da8 in vzlog_notls (xref=0x0, prio=2, fmt=0x7f6e217afe50 "%s:%d: %s(): assertion (%s) failed", ap=0x7f6e1e504248) at lib/zlog.c:425
> FRRouting#8  0x00007f6e217632fb in vzlogx (xref=0x0, prio=2, fmt=0x7f6e217afe50 "%s:%d: %s(): assertion (%s) failed", ap=0x7f6e1e504248) at lib/zlog.c:627
> FRRouting#9  0x00007f6e217621f5 in zlog (prio=2, fmt=0x7f6e217afe50 "%s:%d: %s(): assertion (%s) failed") at lib/zlog.h:73
> FRRouting#10 0x00007f6e21763596 in _zlog_assert_failed (xref=0x7f6e2180c920 <_xref.16>, extra=0x0) at lib/zlog.c:687
> FRRouting#11 0x00007f6e216b1eda in rcu_read_lock () at lib/frrcu.c:294
> FRRouting#12 0x00007f6e21762da8 in vzlog_notls (xref=0x7f6e21a50040 <_xref.68>, prio=4, fmt=0x7f6e21a4999f "getaddrinfo: debug", ap=0x7f6e1e504878) at lib/zlog.c:425
> FRRouting#13 0x00007f6e217632fb in vzlogx (xref=0x7f6e21a50040 <_xref.68>, prio=4, fmt=0x7f6e21a4999f "getaddrinfo: debug", ap=0x7f6e1e504878) at lib/zlog.c:627
> FRRouting#14 0x00007f6e21a3f774 in zlog_ref (xref=0x7f6e21a50040 <_xref.68>, fmt=0x7f6e21a4999f "getaddrinfo: debug") at ./lib/zlog.h:84
> FRRouting#15 0x00007f6e21a451b2 in rpki_create_socket (_cache=0x55729149cc30) at bgpd/bgp_rpki.c:1337
> FRRouting#16 0x00007f6e2120e7b7 in tr_tcp_open (tr_socket=0x5572914d1520) at rtrlib/rtrlib/transport/tcp/tcp_transport.c:111
> FRRouting#17 0x00007f6e2120e212 in tr_open (socket=0x5572914b5e00) at rtrlib/rtrlib/transport/transport.c:16
> FRRouting#18 0x00007f6e2120faa2 in rtr_fsm_start (rtr_socket=0x557290e17180) at rtrlib/rtrlib/rtr/rtr.c:130
> FRRouting#19 0x00007f6e218b7ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
> FRRouting#20 0x00007f6e21527a2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

rpki_create_socket() is a hook function called from the rtrlib library.
The issue arises because rtrlib initiates its own separate pthread in which
it runs the hook, which does not establish an FRR RCU context. Consequently,
this leads to failures in the logging mechanism that relies on RCU.

Initialize a new FRR pthread context from the rtrlib pthread with a
valid RCU context to allow logging from the rpki_create_socket() and
dependent functions.

Link: FRRouting#15260
Fixes: a951752 ("bgpd: create cache server socket in vrf")
Signed-off-by: Louis Scalbert <louis.scalbert@6wind.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
triage Needs further investigation
Projects
None yet
Development

No branches or pull requests

3 participants