libnetconf deadlock with multiple notification subscribers #199
Hi, I wasn't able to reproduce the issue, but I see the problem from your description. The locks in the libnetconf session are a mess and I'm not sure about the bugfix, since I'm not able to reproduce the problem. Therefore, the bugfix is available in a separate branch.
Hi, unfortunately this fix doesn't solve the problem. Would it be possible to make the method void ncntf_dispatch_stop(struct nc_session *session) part of the public API so that we can call it from outside? Best regards
I'm afraid that making ncntf_dispatch_stop() public would just make things more complicated and wouldn't solve anything. By the way, by "still have the crash" you mean you have a deadlock, right? Can you examine it further and check where the threads are waiting? Because ncntf_dispatch_stop() has changed, and it should now let ncntf_dispatch_send() continue (by releasing the session mutex).
Hi, I can only reproduce the problem if I exit all sessions (with subscriptions) simultaneously (I'm using MobaXterm to type exit in all netopeer-cli instances at the same time). When I enter the deadlock, this is what I get in the lib backtrace: So exit and disconnect have different effects on the server, which might help in understanding the problem. The reason I asked to make ncntf_dispatch_stop public is to allow us to first call this method and then delete the session; this way we have full control of the threads when stopping them. NOTE: here we are talking about the deadlock issue only; the "crash" is a different issue that I'm still trying to reproduce/debug, and I may open a new issue for it later.
And the other thread is looping in ncntf_dispatch_stop() around session.c:1200, right? But then the thread in ncntf_dispatch_send() continues from notifications.c:2406 back to the beginning of the while loop, locks the mutex, sees the ntf_stop flag (:2395), breaks out of the loop (:2398), and returns from ncntf_dispatch_send(). Where is the deadlock?
Not exactly, please see all the backtraces below. Note that we do not detach the libnetconf notification thread, so when we want to delete a NETCONF session we issue nc_session_free and try to join the notification thread.
(create 4 NETCONF sessions)
[New Thread 0x7f90f39f5700 (LWP 2926)] <- NetconfSession 1
(subscribe all of them to notifications)
[New Thread 0x7f90b9ad8700 (LWP 2932)] <- Subscriber 1
(type exit in all netopeer-cli instances)
[Thread 0x7f90f39f5700 (LWP 2926) exited] <- Session 1
^C
Thread 19 (Thread 0x7f90b92d7700 (LWP 2933)):
Thread 18 (Thread 0x7f90b9ad8700 (LWP 2932)): (thread that destroys the NetconfSessions)
When something bad happened to the session and this state was detected by ncntf_dispatch_send(), the session was correctly closed, but the dispatch thread kept running in a never-ending loop. Fixes #199
OK, that seems like a different bug compared to the previous deadlock. What about the committed patch (still the
This happens with the patch from the ncntflocks branch.
With the new one (a few minutes old)?
No, let me try :) I didn't notice it. I'll let you know about the result.
I've repeated the test with the new fix and I'm not able to reproduce the issue anymore. |
Great! Please, check also if the patch changed something with #201. |
Hi, there was another issue, #200, so I had to make further changes, which are currently available in a separate branch
Hi
We have two NETCONF sessions enabled, both subscribed to notifications.
Suddenly both connections are lost (for example, the terminals where both netopeer-cli instances are running are closed), and we enter a deadlock.
This only occurs if more than one client is subscribed to notifications and they exit at the same time (at least, that is how I reproduce the problem 100% of the time).
A short description of the problem follows:
The NETCONF session thread tries to free the session by issuing nc_session_free, which locks mut_session. It then calls the ncntf_dispatch_stop function, where it stays in an endless loop:
On the other end, the notification thread running ncntf_dispatch_send tries to lock mut_session and gets stuck, as the mutex is already held by the NETCONF session thread:
This means the NETCONF session thread never gets out of the loop mentioned above, since the notification thread is stuck and can never check the ntf_stop variable and thus never clears the ntf_active flag.
A backtrace snippet of the relevant threads follows:
Thread 8 (Thread 0x7ff856ffd700 (LWP 60656)):
#0 ncntf_dispatch_stop (session=0x7ff8440020b0) at src/notifications.c:2343
#1 0x00007ff866a0faba in nc_session_close (session=0x7ff8440020b0, reason=NC_SESSION_TERM_CLOSED) at src/session.c:1201
#2 0x00007ff866a0fd52 in nc_session_free (session=0x7ff8440020b0) at src/session.c:1343
#3 0x00007ff867e4e373 in NetconfSession::~NetconfSession (this=0x7ff8340008c0, __in_chrg=) at src/NetconfSession.cpp:68
Thread 2 (Thread 0x7ff839e0f700 (LWP 60928)):
#0 0x00007ff8667eaf4d in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007ff8667e6d1d in _L_lock_840 () from /lib64/libpthread.so.0
#2 0x00007ff8667e6c3a in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007ff866a1cd2a in ncntf_dispatch_send (session=0x7ff8440020b0, subscribe_rpc=0x7ff844005130) at src/notifications.c:2546
#4 0x00007ff867e52a7a in NetconfSessionNotification::run (this=0x7ff834000c60) at src/NetconfSessionNotification.cpp:44
mut_session = {
__data = {
__lock = 2,
__count = 1,
__owner = 60656,
__nusers = 1,
__kind = 1,
__spins = 0,
__list = {
__prev = 0x0,
__next = 0x0
}
},
Best Regards