Bluetooth: bt_recv deadlock on supervision timeout with pending GATT Write Commands #23364
Comments
Console log, reproduced with no callback assigned:
And the btmon log: |
It is probably due to this line:
But I see it has been modified to follow the supervision timeout so that it times out properly. Does that solve the problem, or does it still deadlock after that? |
@Vudentz yes, it's the K_FOREVER. My modifications were to identify where the deadlock was. Using a timeout releases the deadlock, but I am not sure whether that is a solution or a workaround. Using a timeout means the disconnect complete is delayed by that timeout, hence the timeout is only a workaround. |
We may actually have to introduce a timeout to bt_conn_send_cb or somehow detect what the supervision timeout is. @jhedberg, thoughts? But I wonder: if we receive a disconnect from the remote side during that time, is the RX thread unblocked? We may have to treat the disconnect event differently in order to handle this properly. |
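To make the trade-off between waiting forever and waiting with a timeout concrete, here is a minimal sketch in plain C using POSIX semaphores. It is not Zephyr code: `tx_pool`, `tx_pool_init`, `tx_alloc_timeout` and `tx_free` are hypothetical names standing in for the host's pool of free TX contexts, and the semaphore count stands in for the number of free contexts. The timed wait returns control to the caller when the pool stays empty, whereas an unbounded wait would never return if the only thread able to free a context is the one waiting.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <semaphore.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for the host's pool of free TX contexts. */
struct tx_pool {
    sem_t free_tx; /* count == number of free contexts */
};

static int tx_pool_init(struct tx_pool *pool, unsigned int count)
{
    return sem_init(&pool->free_tx, 0, count);
}

/* Take one context, waiting at most timeout_ms milliseconds.
 * Returns true on success, false if the pool stayed empty for the
 * whole timeout (the caller can then fail the send instead of hanging).
 */
static bool tx_alloc_timeout(struct tx_pool *pool, long timeout_ms)
{
    struct timespec deadline;

    assert(timeout_ms >= 0);

    /* sem_timedwait() takes an absolute CLOCK_REALTIME deadline. */
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec += 1;
        deadline.tv_nsec -= 1000000000L;
    }

    while (sem_timedwait(&pool->free_tx, &deadline) != 0) {
        if (errno != EINTR) {
            return false; /* ETIMEDOUT: nobody freed a context in time */
        }
    }
    return true;
}

/* Return one context to the pool. */
static void tx_free(struct tx_pool *pool)
{
    sem_post(&pool->free_tx);
}
```

With a pool of two contexts, two allocations succeed and a third times out until `tx_free` is called. As noted in the comment above, the cost of this approach is that whatever the waiting thread would have processed next (for example a disconnect complete) is delayed by up to the timeout.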
One proposal from @joerchan was to use |
Right, and it might be important data, so we can't just discard it. Are there any other solutions on the table currently? Btw, we want to default BT_CONN_TX_MAX to BT_L2CAP_TX_BUF_COUNT + 1 to try to avoid blocking at bt_conn_send_cb. |
One idea that comes to mind is to split the disconnect processing into two parts: one that's safe to do in |
Btw, my suggestion above is not so different from how other |
I can look into it. |
The above branch will make it easy to reproduce (it's a branch with the patch from the issue description applied). Steps to reproduce (timing related):
Reproducible in latest master too (62b9854): |
My current notes related to the fix in #25954:
# Below is the common behavior of the Controller for both combined
# Host+Controller and Controller-only builds.
#
# This is part of the Controller design, and helps "a" Host implementation.
# Credit for the design goes to the numerous contributors to the Controller
# over the years.
# Priority of HCI commands, ACL data and ISO data packet "down" to the
# Controller (not reviewed all current transport drivers for use of this value)
CONFIG_BT_HCI_TX_PRIO=7
# Priority of "High" priority HCI packet from the Controller sent "up" to the
# Host:
# 1. Command Complete
# 2. Disconnection Complete Event
# 3. Number of Completed Packets Event
# 4. Data Buffer Overflow Event
#
# Note: Having a higher priority ensures these packets can mitigate stalled Tx
# and "normal" Rx thread.
#
# Fun fact: Angel number 768 signifies a lot of events to come, and number 6
# represents change!
# Use of these numbers is just a coincidence?!
#
CONFIG_BT_DRIVER_RX_HIGH_PRIO=6
# Priority of "normal" HCI packet from the Controller sent "up" to the Host
CONFIG_BT_RX_PRIO=8
# Tx thread stack dependent on the Controller's program stack usage (call depth)
CONFIG_BT_HCI_TX_STACK_SIZE=1280
# CONFIG_BT_HCI_TX_STACK_SIZE_WITH_PROMPT is not set
# "High" priority and "normal" priority threads in the Controller
CONFIG_BT_CTLR_RX_PRIO_STACK_SIZE=448
CONFIG_BT_CTLR_RX_STACK_SIZE=896
# Rx thread stack dependent on the transport from the Controller to the Host
CONFIG_BT_RX_STACK_SIZE=768 |
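The split between "High" priority and "normal" HCI events described in the notes above can be sketched as a small classifier. This is only an illustration of the idea, not the Zephyr driver code: the function name `hci_evt_is_high_prio` and the macro names are hypothetical. The event codes are the ones assigned in the Bluetooth Core Specification, and the set of events is taken from the list in the notes.

```c
#include <stdbool.h>
#include <stdint.h>

/* HCI event codes as assigned in the Bluetooth Core Specification.
 * The macro names are made up for this sketch.
 */
#define HCI_EVT_DISCONN_COMPLETE      0x05
#define HCI_EVT_CMD_COMPLETE          0x0e
#define HCI_EVT_NUM_COMPLETED_PACKETS 0x13
#define HCI_EVT_DATA_BUF_OVERFLOW     0x1a

/* Returns true for events that the notes above route through the
 * "High" priority path, so they can still be processed while the
 * "normal" Rx thread is stalled. Everything else goes the normal way.
 */
static bool hci_evt_is_high_prio(uint8_t evt_code)
{
    switch (evt_code) {
    case HCI_EVT_CMD_COMPLETE:
    case HCI_EVT_DISCONN_COMPLETE:
    case HCI_EVT_NUM_COMPLETED_PACKETS:
    case HCI_EVT_DATA_BUF_OVERFLOW:
        return true;
    default:
        return false;
    }
}
```

The point of the classification is that Number of Completed Packets and Disconnection Complete are exactly the events that release TX resources, so handling them on a thread with a higher priority than the "normal" Rx thread gives a blocked sender a chance to make progress.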
Describe the bug
Sending GATT Write Commands from a peripheral as fast as possible from the main loop, while the peer central is performing GATT service discovery, puts the system into a deadlock when a supervision timeout occurs.
Preliminary analysis shows the below callpath leading to the deadlock:
The deadlock is caused by the use of all free_tx by the call path below:
To Reproduce
Steps to reproduce the behavior:
On a Linux PC, use BlueZ with hci_usb (nrf52840_pca10056) as the controller. (The issue should be reproducible using the internal controller too, but this has not been tried.)
bluetoothctl scan on
<ctrl-c>
bluetoothctl connect <bdaddr of the peripheral sample>
Expected behavior
Peripheral sample should gracefully disconnect with reason as supervision timeout and resume to connectable advertisements.
Impact
showstopper
Screenshots or console output
The log below is from the host modified by adding a timeout in conn_tx_alloc
Environment (please complete the following information):
Additional context
Patch file:
0001-Peripheral-with-Write-Commands.patch.txt
btmon log:
btmon_recv_thread_stall.txt