Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bluetooth: bt_recv deadlock on supervision timeout with pending GATT Write Commands #23364

Closed
cvinayak opened this issue Mar 10, 2020 · 12 comments · Fixed by #25954
Closed

Bluetooth: bt_recv deadlock on supervision timeout with pending GATT Write Commands #23364

cvinayak opened this issue Mar 10, 2020 · 12 comments · Fixed by #25954
Assignees
Labels
area: Bluetooth bug The issue is a bug, or the PR is fixing a bug priority: high High impact/importance bug

Comments

@cvinayak
Copy link
Contributor

cvinayak commented Mar 10, 2020

Describe the bug
Performing GATT write commands from a peripheral as fast as possible from the main loop while the peer central is performing GATT service discovery and a supervision timeout puts the system in a deadlock.

Preliminary analysis shows the below callpath leading to the deadlock:

recv_thread
- bt_recv
-- hci_acl
--- bt_conn_recv
---- bt_l2cap_recv
----- l2cap_chan_recv
------ chan->ops->recv
------- bt_att_recv
-------- handler->func
--------- att_read_req
---------- bt_att_create_pdu
----------- bt_l2cap_send_cb
------------ bt_conn_send_cb
------------- conn_tx_alloc

The deadlock is caused by use of all free_tx by the below callpath:

bt_gatt_write_without_response_cb
- bt_att_create_pdu
- bt_att_send (with cb)
-- att_send
--- bt_l2cap_send_cb
---- bt_conn_send_cb
----- conn_tx_alloc

To Reproduce
Steps to reproduce the behavior:

  1. Apply the attached patch to the upstream/master repo
  2. mkdir build; cd build
  3. cmake -GNinja -DBOARD=nrf52_pca10040 ../samples/bluetooth/peripheral
  4. ninja
  5. ninja flash

On a linux PC use BlueZ and hci_usb (nrf52840_pca10056) as the controller. (The issue should be reproducible using internal controller too, not tried though).

  1. bluetoothctl scan on
  2. <ctrl-c>
  3. bluetoothctrl connect <bdaddr of the peripheral sample>
  4. Take the peripheral device away to cause supervision timeout
  5. Try few times (issue reproducible in 1 in 5 supervision timeout tries)

Expected behavior
Peripheral sample should gracefully disconnect with reason as supervision timeout and resume to connectable advertisements.

Impact
showstopper

Screenshots or console output
Below log is for the modified host by adding timeout in conn_tx_alloc

*** Booting Zephyr OS build v2.2.0-rc3-68-gbf8caf395646  ***
Bluetooth initialized
Advertising successfully started
Connected: e5:e0:45:33:29:e6 (random)
[00:00:00.007,354] <inf> fs_nvs: 6 Sectors of 4096 bytes
[00:00Write: count= 0, len= 0, rate= 0 bps.
:00.007,354] <inf> fs_nvs: alloc wra: 0, f70
[00:00:00.007,354] <inf> fs_nvs: data wra: 0, 124
[00:00:00.008,789] <inf> bt_hci_core: HW Platform: Nordic Semiconductor (0x0002)
[00:00:00.008,789] <inf> bt_hci_core: HW Variant: nRF52x (0x0002)
[00:00:00.008,819] <inf> bt_hci_core: Firmware: Standard Bluetooth controller (0x00) Version 2.2 Build 0
[00:00:00.009,094] <inf> bt_hci_core: No ID address. App must call settings_load()
[00:00:00.011,230] <err> settings: set-value failure. key: bt/ccc/38cada5db6360 error(-2)
[00:00:00.012,359] <inf> bt_hci_core: Identity: c1:ab:66:13:11:2e (random)
[00:00:00.012,359] <inf> bt_hci_core: HCI: version 5.1 (0x0a) revision 0x0000, manufacturer 0x05f1
[00:00:00.012,359] <inf> bt_hci_core: LMP: version 5.1 (0x0a) subver 0xffff
Write: count= 58, len= 13256, rate= 107932 bps.

D1

D0

D3

D2

D4

D5
conn_tx_alloc timeout.
Write cmd failed (-55).
conn_tx_alloc timeout.

D6

D99
Disconnected (reason 0x08)
Write cmd failed (-57).

Environment (please complete the following information):

  • OS: Linux
  • Toolchain: gnuarmemb gcc version 9.2.0 (Arch Repository)
  • Commit SHA or Version used: bf8caf3

Additional context
Add any other context about the problem here.

Patch file:
0001-Peripheral-with-Write-Commands.patch.txt

btmon log:
btmon_recv_thread_stall.txt

@cvinayak cvinayak added bug The issue is a bug, or the PR is fixing a bug priority: high High impact/importance bug area: Bluetooth labels Mar 10, 2020
@cvinayak cvinayak changed the title Bluetooth: bt_recv deadlock on supervision timeout and pending GATT Write Commands Bluetooth: bt_recv deadlock on supervision timeout with pending GATT Write Commands Mar 10, 2020
@cvinayak
Copy link
Contributor Author

Console log with reproduced with no callback assigned:

*** Booting Zephyr OS build v2.2.0-rc3-69-gd8697ba32e2b  ***
Bluetooth initialized
Advertising successfully started
Connected: e5:e0:45:33:29:e6 (random)
[00:00:00.007,324] <inf> fs_nvs: 6 Sectors of 4096 bytes
[00:00:00.007,324] <inf> fs_nvs: alloc wra: 0, f70
[00:00:00.007,354] <inf> fs_nvs: data wra: 0, 124
[00:00:00.008,758] <inf> bt_hci_core: HW Platform: Nordic Semiconductor (0x0002)
[00:00:00.008,758] <inf> bt_hci_core: HW Variant: nRF52x (0x0002)
[00:00:00.008,789] <inf> bt_hci_core: Firmware: Standard Bluetooth controller  App must call settings_load()
[00:00:00.011,199] <err> settings: set-value failure. key: bt/ccc/38cada5db6360 error(-2)
[00:00:00.012,329] <inf> bt_hci_core: Identity: c1:ab:66:13:11:2e (random)
[00:00:00.012,329] <inf> bt_hci_core: HCI: version 5.1 (0x0a) revision 0x0000, manufacturer 0x05f1
[00:00:00.012,329] <inf> bt_hci_core: LMP: version 5.1 (0x0a) subver 0xfff
D1

D0

D3

D2

D4

D5
conn_tx_alloc timeout.

D6

D6

D99
Disconnected (reason 0x08)
Write cmd failed (-57).

And the btmon log:
btmon_20200310_nocb.log

@Vudentz
Copy link
Contributor

Vudentz commented Mar 10, 2020

It is probably due to this line:

return k_fifo_get(&free_tx, K_FOREVER);

But I see it has been modified to follow the supervision timeout so it timeout properly, does that solves the problem or it still deadlocks after that?

@cvinayak
Copy link
Contributor Author

@Vudentz yes, its the K_FOREVER. My modifications where to identify where the deadlock was. Using a timeout releases the deadlock, but not sure if thats a solution or workaround. Using a timeout means the disconnect complete is delayed by that timeout, hence use of timeout is only a workaround.

@Vudentz
Copy link
Contributor

Vudentz commented Mar 10, 2020

We may actually have to introduce a timeout bt_conn_send_cb or somehow detect what is the supervision timeout, @jhedberg thoughts? But I wonder if during that we receive a disconnect from the remote side does the RX thread is unblocked? We may have to treat disconnect event differently in order to handle this properly.

@jhedberg
Copy link
Member

We may have to treat disconnect event differently in order to handle this properly.

One proposal from @joerchan was to use bt_recv_prio() for the disconnect event, however that comes with its own issues related to ordering (you could receive packets for a connection after processing its disconnect).

@Vudentz
Copy link
Contributor

Vudentz commented Mar 10, 2020

We may have to treat disconnect event differently in order to handle this properly.

One proposal from @joerchan was to use bt_recv_prio() for the disconnect event, however that comes with its own issues related to ordering (you could receive packets for a connection after processing its disconnect).

Right and it might be important data so we can't just discard, are there any other solutions on the table currently? Btw, we want to default BT_CONN_TX_MAX to BT_L2CAP_TX_BUF_COUNT + 1 to try to avoid blocking at bt_conn_send_cb.

@jhedberg
Copy link
Member

Right and it might be important data so we can't just discard, are there any other solutions on the table currently? Btw, we want to default BT_CONN_TX_MAX to BT_L2CAP_TX_BUF_COUNT + 1 to try to avoid blocking at bt_conn_send_cb.

One idea that comes to mind is to split the disconnect processing into two parts: one that's safe to do in bt_recv_prio() (e.g. marking the TX path defunct, so that you can't send anymore but the connection can still receive packets), and another that's done in the normal RX thread (until which point you can process RX packets but not send anything, and here we finally set the disconnected state on the bt_conn object).

@jhedberg
Copy link
Member

Btw, my suggestion above is not so different from how other bt_recv_prio() handling is done: the all the existing handlers do is to release some resource (e.g. give a semaphore) which releases some other thread that does the "finalisation" or second stage of the event handling.

@carlescufi
Copy link
Member

@Vudentz @cvinayak @joerchan @jhedberg This is a high bug so we need to fix it during the 2.3 stabilization period. Is there something we can do here, either as a patch or a mitigation, during the next few days?

@joerchan
Copy link
Contributor

I can look into it.

@cvinayak
Copy link
Contributor Author

cvinayak commented Jun 3, 2020

cvinayak@0b01ebd

Above branch will make it easy to reproduce (its a branch with the patch in the issue description applied).

Steps to reproduce (timing related):

  1. build and flash the peripheral sample from the branch.
  2. build and flash a vanilla central_hr sample in another board.
  3. When central_hr connects and peripheral sample prints out Connected: <bdaddr>, reset the central_hr board to cause linkloss (linkloss needs to occur during service discovery).
  4. peripheral sample will get stuck in conn_tx_alloc function in host/conn.c

Reproducible in latest master too (62b9854):
cvinayak@55153be

@cvinayak
Copy link
Contributor Author

My current notes related to the fix in #25954:

# Below is the common behavior of the Controller for both combined
# Host+Controller and Controller-only builds.
#
# This is part of the Controller design, and helps "a" Host implementation
# The design is credits to numerous contributors to the Controller over the
# years.

# Priority of HCI commands, ACL data and ISO data packet "down" to the
# Controller (not reviewed all current transport drivers for use of this value)
CONFIG_BT_HCI_TX_PRIO=7

# Priority of "High" priority HCI packet from the Controller sent "up" to the
# Host:
#   1. Command Complete
#   2. Disconnection Complete Event
#   3. Number of Completed Packets Event
#   4. Data Buffer Overflow Event
#
# Note: Having a higher priority ensures these packets can mitigate stalled Tx
#       and "normal" Rx thread.
#
# Fun fact: Angel number 768 signifies a lot of events to come, and number 6
#           represents change!
#           Use of these numbers is just a coincidence?!
#
CONFIG_BT_DRIVER_RX_HIGH_PRIO=6

# Priority of "normal" HCI packet from the Controller sent "up" to the Host
CONFIG_BT_RX_PRIO=8

# Tx thread stack dependent on the Controller's program stack usage (call depth)
CONFIG_BT_HCI_TX_STACK_SIZE=1280
# CONFIG_BT_HCI_TX_STACK_SIZE_WITH_PROMPT is not set

# "High" priority and "normal" priority threads in the Controller
CONFIG_BT_CTLR_RX_PRIO_STACK_SIZE=448
CONFIG_BT_CTLR_RX_STACK_SIZE=896

# Rx thread stack dependent on the transport from the Controller to the Host
CONFIG_BT_RX_STACK_SIZE=768

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Bluetooth bug The issue is a bug, or the PR is fixing a bug priority: high High impact/importance bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants