
libnetconf deadlock with multiple notification subscribers #199

Closed
ntadas opened this issue Jun 2, 2016 · 16 comments

@ntadas
Contributor

ntadas commented Jun 2, 2016

Hi

We have 2 NETCONF sessions open, both of them subscribed to notifications.
Suddenly both connections are lost (for example, the terminals where both netopeer-cli instances are running are closed), and we end up in a deadlock.
This only occurs if more than one client is subscribed to notifications and they exit at the same time (at least that is how I reproduce the problem 100% of the time).

Here is a short description of the problem:

The NETCONF session thread tries to free the session by calling nc_session_free, which locks mut_session. It then calls ncntf_dispatch_stop, where it stays in an endless loop:

    while (session->ntf_active) {
        DBG_UNLOCK("mut_ntf");
        pthread_mutex_unlock(&(session->mut_ntf));
        usleep(NCNTF_DISPATCH_SLEEP);
        DBG_LOCK("mut_ntf");
        pthread_mutex_lock(&(session->mut_ntf));
    }

On the other side, the notification thread running ncntf_dispatch_send tries to lock mut_session and gets stuck, since the mutex is already held by the NETCONF session thread:

    if (nacm_check_notification(ntf, session) == NACM_PERMIT) {
        DBG_LOCK("mut_session");
        pthread_mutex_lock(&(session->mut_session));
        DBG_LOCK("mut_ntf");
        pthread_mutex_lock(&(session->mut_ntf));

This means the NETCONF session thread never gets out of the loop above, since the notification thread is stuck and can never check the ntf_stop flag and clear the ntf_active flag.
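
To make the interaction easier to follow, here is a minimal standalone model of the two threads (the struct and variable names mimic libnetconf's mut_session, mut_ntf, ntf_active and ntf_stop, but this is illustrative code only, not the library source; when built with -lpthread and run it hangs by design, reproducing the deadlock pattern):

    /* Illustrative model of the deadlock, NOT libnetconf code. */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    struct fake_session {
        pthread_mutex_t mut_session;
        pthread_mutex_t mut_ntf;
        int ntf_active;   /* set while the dispatch thread is running */
        int ntf_stop;     /* request for the dispatch thread to stop  */
    };

    /* Models nc_session_free() -> ncntf_dispatch_stop(): mut_session is
     * already held by this thread while it waits for ntf_active to clear. */
    static void *teardown_thread(void *arg)
    {
        struct fake_session *s = arg;

        pthread_mutex_lock(&s->mut_session);   /* taken by the session teardown */
        pthread_mutex_lock(&s->mut_ntf);
        s->ntf_stop = 1;
        while (s->ntf_active) {                /* waits for the dispatcher ...  */
            pthread_mutex_unlock(&s->mut_ntf);
            usleep(1000);
            pthread_mutex_lock(&s->mut_ntf);
        }
        pthread_mutex_unlock(&s->mut_ntf);
        pthread_mutex_unlock(&s->mut_session);
        return NULL;
    }

    /* Models ncntf_dispatch_send(): it needs mut_session before it can reach
     * the point where it would notice ntf_stop and clear ntf_active. */
    static void *dispatch_thread(void *arg)
    {
        struct fake_session *s = arg;

        pthread_mutex_lock(&s->mut_session);   /* blocks forever: held above */
        pthread_mutex_lock(&s->mut_ntf);
        if (s->ntf_stop) {
            s->ntf_active = 0;                 /* never reached */
        }
        pthread_mutex_unlock(&s->mut_ntf);
        pthread_mutex_unlock(&s->mut_session);
        return NULL;
    }

    int main(void)
    {
        struct fake_session s = { .ntf_active = 1, .ntf_stop = 0 };
        pthread_t t1, t2;

        pthread_mutex_init(&s.mut_session, NULL);
        pthread_mutex_init(&s.mut_ntf, NULL);
        pthread_create(&t1, NULL, teardown_thread, &s);
        usleep(10000);                         /* let teardown grab mut_session */
        pthread_create(&t2, NULL, dispatch_thread, &s);
        pthread_join(t1, NULL);                /* hangs: this is the deadlock */
        pthread_join(t2, NULL);
        printf("finished (not expected)\n");
        return 0;
    }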

Here is a backtrace snippet of the relevant threads:

Thread 8 (Thread 0x7ff856ffd700 (LWP 60656)):
#0 ncntf_dispatch_stop (session=0x7ff8440020b0) at src/notifications.c:2343
#1 0x00007ff866a0faba in nc_session_close (session=0x7ff8440020b0, reason=NC_SESSION_TERM_CLOSED) at src/session.c:1201
#2 0x00007ff866a0fd52 in nc_session_free (session=0x7ff8440020b0) at src/session.c:1343
#3 0x00007ff867e4e373 in NetconfSession::~NetconfSession (this=0x7ff8340008c0, __in_chrg=) at src/NetconfSession.cpp:68

Thread 2 (Thread 0x7ff839e0f700 (LWP 60928)):
#0 0x00007ff8667eaf4d in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007ff8667e6d1d in _L_lock_840 () from /lib64/libpthread.so.0
#2 0x00007ff8667e6c3a in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007ff866a1cd2a in ncntf_dispatch_send (session=0x7ff8440020b0, subscribe_rpc=0x7ff844005130) at src/notifications.c:2546
#4 0x00007ff867e52a7a in NetconfSessionNotification::run (this=0x7ff834000c60) at src/NetconfSessionNotification.cpp:44

mut_session = {
  __data = {
    __lock = 2,
    __count = 1,
    __owner = 60656,
    __nusers = 1,
    __kind = 1,
    __spins = 0,
    __list = {
      __prev = 0x0,
      __next = 0x0
    }
  },

Best Regards

@rkrejci
Contributor

rkrejci commented Jun 2, 2016

Hi, I wasn't able to reproduce the issue, but I see the problem from your description. The locks in the libnetconf session are a mess and I'm not sure about the bugfix since I'm not able to reproduce the problem. The bugfix is therefore available in a separate branch, ncntflocks; please try it and see if it works for you. If so, I will merge it into master.

@ntadas
Contributor Author

ntadas commented Jun 3, 2016

Hi

Unfortunately this fix doesn't solve the problem.
If we exit or close 2 sessions that have notification subscriptions we still have the crash.

Would it be possible to make the method void ncntf_dispatch_stop(struct nc_session *session) available (part of the public API) so that we can call it from outside?

Best Regards

@rkrejci
Contributor

rkrejci commented Jun 7, 2016

I'm afraid that making ncntf_dispatch_stop() public would just make things more complicated and wouldn't solve anything. By the way, by "still have the crash" you mean you have a deadlock, right? Can you examine it further and check where the threads are waiting? ncntf_dispatch_stop() has changed and it should now let ncntf_dispatch_send() continue (by releasing the session mutex).

@ntadas
Contributor Author

ntadas commented Jun 7, 2016

Hi,

I can only reproduce the problem if I exit all sessions (with subscriptions) simultaneously (I'm using MobaXterm to type exit in all netopeer-cli instances at the same time).
If I do a simple disconnect/connect I don't see the issue.

When I hit the deadlock, this is what I get in the library backtrace:
#0 0x00007fbd6ec5a49d in nanosleep () from /lib64/libc.so.6
#1 0x00007fbd6ec8b0d4 in usleep () from /lib64/libc.so.6
#2 0x00007fbd72e8d8af in ncntf_dispatch_send (session=0x7fbd28000aa0, subscribe_rpc=0x7fbd280149a0) at src/notifications.c:2405

So exit and disconnect have different effects on the server; this might help to understand the problem.

The reason I was asking to make ncntf_dispatch_stop public is that it would allow us to first call this method and then delete the session; this way we have full control over the threads when stopping them.
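
Roughly, the teardown order we have in mind, assuming ncntf_dispatch_stop() were exported, would be the following (a sketch only; close_netconf_session() and the notif_thread handle are our own names, not libnetconf API):

    /* Sketch of the intended teardown order, assuming ncntf_dispatch_stop()
     * were part of the public libnetconf API (it currently is not).
     * notif_thread is our own thread that runs ncntf_dispatch_send(). */
    #include <pthread.h>
    #include <libnetconf.h>   /* struct nc_session, nc_session_free() */

    void close_netconf_session(struct nc_session *session, pthread_t notif_thread)
    {
        ncntf_dispatch_stop(session);       /* 1. ask the dispatch loop to stop      */
        pthread_join(notif_thread, NULL);   /* 2. wait until it has actually stopped */
        nc_session_free(session);           /* 3. only then free the session         */
    }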

NOTE: here we are talking about the deadlock issue only; the "crash" is another issue that I'm still trying to reproduce/debug, and I may open a new issue for it later.

@rkrejci
Contributor

rkrejci commented Jun 7, 2016

And the other thread is looping in ncntf_dispatch_stop() around session.c:1200, right? But then the thread in ncntf_dispatch_send() continues from notifications.c:2406 back to the beginning of the while loop, locks the mutex, gets the ntf_stop flag (:2395), breaks the loop (:2398) and returns from the ncntf_dispatch_send() function. Where is the deadlock?
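
For reference, the handshake I mean looks roughly like this (a paraphrase of the ncntf_dispatch_send() loop, not the exact code; the :NNNN numbers refer to lines in the current notifications.c):

    /* Paraphrase of the ncntf_dispatch_send() send loop, not the exact code. */
    while (1) {
        pthread_mutex_lock(&(session->mut_ntf));
        if (session->ntf_stop) {             /* stop requested by teardown (:2395) */
            session->ntf_active = 0;
            pthread_mutex_unlock(&(session->mut_ntf));
            break;                           /* leave the loop (:2398) and return  */
        }
        pthread_mutex_unlock(&(session->mut_ntf));
        /* ... otherwise send any pending notification, sleep (:2405) and
         * continue from :2406 back to the top of the loop ... */
        usleep(NCNTF_DISPATCH_SLEEP);
    }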

@ntadas
Contributor Author

ntadas commented Jun 7, 2016

Not exactly; please see all the backtraces below.
Almost all the threads are "dead", but 2 notification threads are still running and do not exit.
We try to join the threads so we can delete everything safely, but these 2 threads are still running.

Please note that we do not detach the libnetconf notification thread, so when we want to delete a NETCONF session we call nc_session_free and then try to join the notification thread.
Since the notification thread never exits, we get stuck while joining.
So it's not exactly a libnetconf deadlock, but rather a zombie thread / thread leak.
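
To illustrate, our teardown on the application side currently boils down to this (simplified from our C++ NetconfSession wrapper into plain C; destroy_netconf_session() and notif_thread are our own names, not libnetconf API):

    /* Simplified model of our current teardown (see the ~NetconfSession frames
     * in the backtraces below): free the session first, then join the thread
     * that runs ncntf_dispatch_send(). If that thread never returns, the
     * pthread_join() blocks forever. */
    #include <pthread.h>
    #include <libnetconf.h>   /* struct nc_session, nc_session_free() */

    void destroy_netconf_session(struct nc_session *session, pthread_t notif_thread)
    {
        nc_session_free(session);           /* closes the session                       */
        pthread_join(notif_thread, NULL);   /* hangs here if the dispatch loop lives on */
    }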

(create 4 Netconf Sessions )

[New Thread 0x7f90f39f5700 (LWP 2926)] <- NetconfSession 1
[New Thread 0x7f90d1ff3700 (LWP 2927)] <- NetconfSession 2
[New Thread 0x7f90d17f2700 (LWP 2928)] <- NetconfSession 3
[New Thread 0x7f90d0ff1700 (LWP 2929)] <- NetconfSession 4

(subscribe all of them to notifications)

[New Thread 0x7f90b9ad8700 (LWP 2932)] <- Subscriber 1
[New Thread 0x7f90b92d7700 (LWP 2933)] <- Subscriber 2
[New Thread 0x7f90b8ad6700 (LWP 2934)] <- Subscriber 3
[New Thread 0x7f90a3fff700 (LWP 2935)] <- Subscriber 4

(type exit in all netopeer-cli instances)

[Thread 0x7f90f39f5700 (LWP 2926) exited] <- Session 1
[Thread 0x7f90d0ff1700 (LWP 2929) exited] <- Session 4
[Thread 0x7f90d17f2700 (LWP 2928) exited] <- Session 3
[Thread 0x7f90d1ff3700 (LWP 2927) exited] <- Session 2
[Thread 0x7f90a3fff700 (LWP 2935) exited] <- subscriber 4
[Thread 0x7f90b8ad6700 (LWP 2934) exited] <- subscriber 3

^C
Program received signal SIGINT, Interrupt.
0x00007f90fdc31790 in sem_wait () from /lib64/libpthread.so.0
(gdb) thread apply all bt

Thread 19 (Thread 0x7f90b92d7700 (LWP 2933)):
#0 0x00007f90f9c2f49d in nanosleep () from /lib64/libc.so.6
#1 0x00007f90f9c600d4 in usleep () from /lib64/libc.so.6
#2 0x00007f90fde638af in ncntf_dispatch_send (session=0x7f90ac000aa0, subscribe_rpc=0x7f90ac002320) at src/notifications.c:2405
#3 0x00007f90ff290f6e in NetconfSessionNotification::run (this=0x7f90cc016d90) at Netconf/src/NetconfSessionNotification.cpp:47

Thread 18 (Thread 0x7f90b9ad8700 (LWP 2932)):
#0 0x00007f90f9c2f49d in nanosleep () from /lib64/libc.so.6
#1 0x00007f90f9c600d4 in usleep () from /lib64/libc.so.6
#2 0x00007f90fde638af in ncntf_dispatch_send (session=0x7f90a4000aa0, subscribe_rpc=0x7f90a4002af0) at src/notifications.c:2405
#3 0x00007f90ff290f6e in NetconfSessionNotification::run (this=0x7f90cc0221d0) at Netconf/src/NetconfSessionNotification.cpp:47

(Thread that destroys the NetconfSessions)
Thread 8 (Thread 0x7f90f29f3700 (LWP 724)):
#0 0x00007f90fdc2cef7 in pthread_join () from /lib64/libpthread.so.0
#1 0x00007f90fa4fffe7 in std::thread::join() () from /lib64/libstdc++.so.6
#2 0x00007f90ff28f3e0 in ThreadBase::join (this=0x7f90cc0221d0) at ThreadPool/Thread.h:60
#3 0x00007f90ff28c8d6 in NetconfSession::~NetconfSession (this=0x7f90cc021e30, __in_chrg=) at Netconf/src/NetconfSession.cpp:69
#4 0x00007f90ff28c936 in NetconfSession::~NetconfSession (this=0x7f90cc021e30, __in_chrg=) at Netconf/src/NetconfSession.cpp:70

rkrejci added a commit that referenced this issue Jun 7, 2016
When something bad happened to the session and this state was
detected by ncntf_dispatch_send(), the session was correctly closed, but
the dispatch thread was still running in a never-ending loop.

Fixes #199
@rkrejci
Contributor

rkrejci commented Jun 7, 2016

OK, that seems like a different bug compared to the previous deadlock. What about the committed patch (still in the ncntflocks branch)?

@ntadas
Contributor Author

ntadas commented Jun 7, 2016

This happens with the patch from the ncntflocks branch.

@rkrejci
Contributor

rkrejci commented Jun 7, 2016

With the new one (a few minutes old)?

@ntadas
Contributor Author

ntadas commented Jun 7, 2016

No, let me try :) I didn't notice it. I'll let you know the result.

@ntadas
Contributor Author

ntadas commented Jun 7, 2016

I've repeated the test with the new fix and I can no longer reproduce the issue.
I think you can do the merge to the main branch.
Thanks a lot.

@rkrejci
Contributor

rkrejci commented Jun 7, 2016

Great! Please also check whether the patch changed anything with #201.

@rkrejci
Contributor

rkrejci commented Jun 15, 2016

Hi, there was another issue, #200, so I had to make further changes, which are currently available in a separate branch, deadlockfix. Could you please verify that the code in this branch still works for you?

@ntadas
Contributor Author

ntadas commented Jun 16, 2016

Hi,
Sorry, I didn't have time yet to verify this one; I'm struggling with debugging #205.
I'll still try to test it this week.

@ntadas
Contributor Author

ntadas commented Jun 20, 2016

Hi,
Sorry for the delay. I've run some tests and it seems fine; everything is working as before.

@rkrejci
Contributor

rkrejci commented Jun 22, 2016

Thanks
