Bluetooth: bt_recv deadlock on supervision timeout with pending GATT Write Commands #23364

cvinayak · 2020-03-10T08:22:01Z

Describe the bug
Performing GATT write commands from a peripheral as fast as possible from the main loop while the peer central is performing GATT service discovery and a supervision timeout puts the system in a deadlock.

Preliminary analysis shows the below callpath leading to the deadlock:

recv_thread
- bt_recv
-- hci_acl
--- bt_conn_recv
---- bt_l2cap_recv
----- l2cap_chan_recv
------ chan->ops->recv
------- bt_att_recv
-------- handler->func
--------- att_read_req
---------- bt_att_create_pdu
----------- bt_l2cap_send_cb
------------ bt_conn_send_cb
------------- conn_tx_alloc

The deadlock is caused by use of all free_tx by the below callpath:

bt_gatt_write_without_response_cb
- bt_att_create_pdu
- bt_att_send (with cb)
-- att_send
--- bt_l2cap_send_cb
---- bt_conn_send_cb
----- conn_tx_alloc

To Reproduce
Steps to reproduce the behavior:

Apply the attached patch to the upstream/master repo
mkdir build; cd build
cmake -GNinja -DBOARD=nrf52_pca10040 ../samples/bluetooth/peripheral
ninja
ninja flash

On a linux PC use BlueZ and hci_usb (nrf52840_pca10056) as the controller. (The issue should be reproducible using internal controller too, not tried though).

bluetoothctl scan on
<ctrl-c>
bluetoothctrl connect <bdaddr of the peripheral sample>
Take the peripheral device away to cause supervision timeout
Try few times (issue reproducible in 1 in 5 supervision timeout tries)

Expected behavior
Peripheral sample should gracefully disconnect with reason as supervision timeout and resume to connectable advertisements.

Impact
showstopper

Screenshots or console output
Below log is for the modified host by adding timeout in conn_tx_alloc

*** Booting Zephyr OS build v2.2.0-rc3-68-gbf8caf395646  ***
Bluetooth initialized
Advertising successfully started
Connected: e5:e0:45:33:29:e6 (random)
[00:00:00.007,354] <inf> fs_nvs: 6 Sectors of 4096 bytes
[00:00Write: count= 0, len= 0, rate= 0 bps.
:00.007,354] <inf> fs_nvs: alloc wra: 0, f70
[00:00:00.007,354] <inf> fs_nvs: data wra: 0, 124
[00:00:00.008,789] <inf> bt_hci_core: HW Platform: Nordic Semiconductor (0x0002)
[00:00:00.008,789] <inf> bt_hci_core: HW Variant: nRF52x (0x0002)
[00:00:00.008,819] <inf> bt_hci_core: Firmware: Standard Bluetooth controller (0x00) Version 2.2 Build 0
[00:00:00.009,094] <inf> bt_hci_core: No ID address. App must call settings_load()
[00:00:00.011,230] <err> settings: set-value failure. key: bt/ccc/38cada5db6360 error(-2)
[00:00:00.012,359] <inf> bt_hci_core: Identity: c1:ab:66:13:11:2e (random)
[00:00:00.012,359] <inf> bt_hci_core: HCI: version 5.1 (0x0a) revision 0x0000, manufacturer 0x05f1
[00:00:00.012,359] <inf> bt_hci_core: LMP: version 5.1 (0x0a) subver 0xffff
Write: count= 58, len= 13256, rate= 107932 bps.

D1

D0

D3

D2

D4

D5
conn_tx_alloc timeout.
Write cmd failed (-55).
conn_tx_alloc timeout.

D6

D99
Disconnected (reason 0x08)
Write cmd failed (-57).

Environment (please complete the following information):

OS: Linux
Toolchain: gnuarmemb gcc version 9.2.0 (Arch Repository)
Commit SHA or Version used: bf8caf3

Additional context
Add any other context about the problem here.

Patch file:
0001-Peripheral-with-Write-Commands.patch.txt

btmon log:
btmon_recv_thread_stall.txt

The text was updated successfully, but these errors were encountered:

cvinayak · 2020-03-10T11:13:47Z

Console log with reproduced with no callback assigned:

*** Booting Zephyr OS build v2.2.0-rc3-69-gd8697ba32e2b  ***
Bluetooth initialized
Advertising successfully started
Connected: e5:e0:45:33:29:e6 (random)
[00:00:00.007,324] <inf> fs_nvs: 6 Sectors of 4096 bytes
[00:00:00.007,324] <inf> fs_nvs: alloc wra: 0, f70
[00:00:00.007,354] <inf> fs_nvs: data wra: 0, 124
[00:00:00.008,758] <inf> bt_hci_core: HW Platform: Nordic Semiconductor (0x0002)
[00:00:00.008,758] <inf> bt_hci_core: HW Variant: nRF52x (0x0002)
[00:00:00.008,789] <inf> bt_hci_core: Firmware: Standard Bluetooth controller  App must call settings_load()
[00:00:00.011,199] <err> settings: set-value failure. key: bt/ccc/38cada5db6360 error(-2)
[00:00:00.012,329] <inf> bt_hci_core: Identity: c1:ab:66:13:11:2e (random)
[00:00:00.012,329] <inf> bt_hci_core: HCI: version 5.1 (0x0a) revision 0x0000, manufacturer 0x05f1
[00:00:00.012,329] <inf> bt_hci_core: LMP: version 5.1 (0x0a) subver 0xfff
D1

D0

D3

D2

D4

D5
conn_tx_alloc timeout.

D6

D6

D99
Disconnected (reason 0x08)
Write cmd failed (-57).

And the btmon log:
btmon_20200310_nocb.log

Vudentz · 2020-03-10T16:42:57Z

It is probably due to this line:

return k_fifo_get(&free_tx, K_FOREVER);

But I see it has been modified to follow the supervision timeout so it timeout properly, does that solves the problem or it still deadlocks after that?

cvinayak · 2020-03-10T17:09:51Z

@Vudentz yes, its the K_FOREVER. My modifications where to identify where the deadlock was. Using a timeout releases the deadlock, but not sure if thats a solution or workaround. Using a timeout means the disconnect complete is delayed by that timeout, hence use of timeout is only a workaround.

Vudentz · 2020-03-10T17:27:23Z

We may actually have to introduce a timeout bt_conn_send_cb or somehow detect what is the supervision timeout, @jhedberg thoughts? But I wonder if during that we receive a disconnect from the remote side does the RX thread is unblocked? We may have to treat disconnect event differently in order to handle this properly.

jhedberg · 2020-03-10T17:29:02Z

We may have to treat disconnect event differently in order to handle this properly.

One proposal from @joerchan was to use bt_recv_prio() for the disconnect event, however that comes with its own issues related to ordering (you could receive packets for a connection after processing its disconnect).

Vudentz · 2020-03-10T17:49:35Z

We may have to treat disconnect event differently in order to handle this properly.

One proposal from @joerchan was to use bt_recv_prio() for the disconnect event, however that comes with its own issues related to ordering (you could receive packets for a connection after processing its disconnect).

Right and it might be important data so we can't just discard, are there any other solutions on the table currently? Btw, we want to default BT_CONN_TX_MAX to BT_L2CAP_TX_BUF_COUNT + 1 to try to avoid blocking at bt_conn_send_cb.

jhedberg · 2020-03-10T18:45:48Z

Right and it might be important data so we can't just discard, are there any other solutions on the table currently? Btw, we want to default BT_CONN_TX_MAX to BT_L2CAP_TX_BUF_COUNT + 1 to try to avoid blocking at bt_conn_send_cb.

One idea that comes to mind is to split the disconnect processing into two parts: one that's safe to do in bt_recv_prio() (e.g. marking the TX path defunct, so that you can't send anymore but the connection can still receive packets), and another that's done in the normal RX thread (until which point you can process RX packets but not send anything, and here we finally set the disconnected state on the bt_conn object).

jhedberg · 2020-03-10T18:49:53Z

Btw, my suggestion above is not so different from how other bt_recv_prio() handling is done: the all the existing handlers do is to release some resource (e.g. give a semaphore) which releases some other thread that does the "finalisation" or second stage of the event handling.

carlescufi · 2020-05-14T16:03:03Z

@Vudentz @cvinayak @joerchan @jhedberg This is a high bug so we need to fix it during the 2.3 stabilization period. Is there something we can do here, either as a patch or a mitigation, during the next few days?

joerchan · 2020-05-15T08:56:48Z

I can look into it.

cvinayak · 2020-06-03T11:43:36Z

cvinayak@0b01ebd

Above branch will make it easy to reproduce (its a branch with the patch in the issue description applied).

Steps to reproduce (timing related):

build and flash the peripheral sample from the branch.
build and flash a vanilla central_hr sample in another board.
When central_hr connects and peripheral sample prints out Connected: <bdaddr>, reset the central_hr board to cause linkloss (linkloss needs to occur during service discovery).
peripheral sample will get stuck in conn_tx_alloc function in host/conn.c

Reproducible in latest master too (62b9854):
cvinayak@55153be

cvinayak · 2024-12-21T05:57:04Z

My current notes related to the fix in #25954:

# Below is the common behavior of the Controller for both combined
# Host+Controller and Controller-only builds.
#
# This is part of the Controller design, and helps "a" Host implementation
# The design is credits to numerous contributors to the Controller over the
# years.

# Priority of HCI commands, ACL data and ISO data packet "down" to the
# Controller (not reviewed all current transport drivers for use of this value)
CONFIG_BT_HCI_TX_PRIO=7

# Priority of "High" priority HCI packet from the Controller sent "up" to the
# Host:
#   1. Command Complete
#   2. Disconnection Complete Event
#   3. Number of Completed Packets Event
#   4. Data Buffer Overflow Event
#
# Note: Having a higher priority ensures these packets can mitigate stalled Tx
#       and "normal" Rx thread.
#
# Fun fact: Angel number 768 signifies a lot of events to come, and number 6
#           represents change!
#           Use of these numbers is just a coincidence?!
#
CONFIG_BT_DRIVER_RX_HIGH_PRIO=6

# Priority of "normal" HCI packet from the Controller sent "up" to the Host
CONFIG_BT_RX_PRIO=8

# Tx thread stack dependent on the Controller's program stack usage (call depth)
CONFIG_BT_HCI_TX_STACK_SIZE=1280
# CONFIG_BT_HCI_TX_STACK_SIZE_WITH_PROMPT is not set

# "High" priority and "normal" priority threads in the Controller
CONFIG_BT_CTLR_RX_PRIO_STACK_SIZE=448
CONFIG_BT_CTLR_RX_STACK_SIZE=896

# Rx thread stack dependent on the transport from the Controller to the Host
CONFIG_BT_RX_STACK_SIZE=768

cvinayak added bug The issue is a bug, or the PR is fixing a bug priority: high High impact/importance bug area: Bluetooth labels Mar 10, 2020

carlescufi assigned jhedberg, Vudentz and joerchan Mar 10, 2020

cvinayak changed the title ~~Bluetooth: bt_recv deadlock on supervision timeout and pending GATT Write Commands~~ Bluetooth: bt_recv deadlock on supervision timeout with pending GATT Write Commands Mar 10, 2020

n3rd4i mentioned this issue Mar 12, 2020

Mitigation in case system [created] threads hang/non-responsive #23423

Closed

joerchan mentioned this issue Jun 2, 2020

kernel: k_queue_get return NULL before timeout #25904

Closed

carlescufi mentioned this issue Jun 2, 2020

kernel/queue: Remove interior use of k_poll() #25906

Merged

joerchan mentioned this issue Jun 3, 2020

Bluetooth: Fix RX thread deadlock #25954

Merged

AndreyDodonov-EH mentioned this issue Jun 25, 2020

Bluetooth: can't close bt_driver log output #20297

Closed

carlescufi closed this as completed in #25954 Jul 6, 2020

alwa-nordic mentioned this issue Aug 15, 2023

A disconnect shortly after a pairing completes results in pairing failed #61465

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bluetooth: bt_recv deadlock on supervision timeout with pending GATT Write Commands #23364

Bluetooth: bt_recv deadlock on supervision timeout with pending GATT Write Commands #23364

cvinayak commented Mar 10, 2020 •

edited

Loading

cvinayak commented Mar 10, 2020

Vudentz commented Mar 10, 2020 •

edited

Loading

cvinayak commented Mar 10, 2020

Vudentz commented Mar 10, 2020 •

edited

Loading

jhedberg commented Mar 10, 2020

Vudentz commented Mar 10, 2020 •

edited

Loading

jhedberg commented Mar 10, 2020

jhedberg commented Mar 10, 2020

carlescufi commented May 14, 2020

joerchan commented May 15, 2020

cvinayak commented Jun 3, 2020 •

edited

Loading

cvinayak commented Dec 21, 2024

Bluetooth: bt_recv deadlock on supervision timeout with pending GATT Write Commands #23364

Bluetooth: bt_recv deadlock on supervision timeout with pending GATT Write Commands #23364

Comments

cvinayak commented Mar 10, 2020 • edited Loading

cvinayak commented Mar 10, 2020

Vudentz commented Mar 10, 2020 • edited Loading

cvinayak commented Mar 10, 2020

Vudentz commented Mar 10, 2020 • edited Loading

jhedberg commented Mar 10, 2020

Vudentz commented Mar 10, 2020 • edited Loading

jhedberg commented Mar 10, 2020

jhedberg commented Mar 10, 2020

carlescufi commented May 14, 2020

joerchan commented May 15, 2020

cvinayak commented Jun 3, 2020 • edited Loading

cvinayak commented Dec 21, 2024

cvinayak commented Mar 10, 2020 •

edited

Loading

Vudentz commented Mar 10, 2020 •

edited

Loading

Vudentz commented Mar 10, 2020 •

edited

Loading

Vudentz commented Mar 10, 2020 •

edited

Loading

cvinayak commented Jun 3, 2020 •

edited

Loading