Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pkg/semtech-loramac: fix deadlock when sending unconfirmed messages #11535

Closed

Conversation

aabadie
Copy link
Contributor

@aabadie aabadie commented May 16, 2019

Contribution description

This PR is attempt to fix #11530. Even if the problem was well explained in #11530, I had a hard time figuring out what was going on with the messages exchanged between the MAC and the caller thread in the loramac adaption code. I end up to the conclusion that one TX done message should'nt be sent from the mcps_confirm event callback, as it was received by the semtech-loramac_recv function. And in the case of data effectively received, the RX message was blocking the event loop. On the next send, calling the send function in the mac, which is using a msg_send_receive, was causing the deadlock.

The proposed solution in this PR is to use a background timer to send a TX done message to the MAC thread after an amount time enough to ensure received data are processed in semtech-loramac_recv. This message, is not sent anymore from the mcps_confirm event. The RX message is then correctly retrieved by the caller thread. And no deadlock on future sends.
For me, it's working quite reliably now with this.

I tried other solutions, like setting a message queue to the caller thread but I found this was a workaround not really fixing the original issue.

Maybe there's a better fix that I didn't think of, suggestions are welcome.

@ParksProjets would you like to have a look ang give some feedback ?

Testing procedure

Repeat the procedure described in #11530, the deadlock doesn't occur anymore with this PR

Issues/PRs references

fixes #11530

aabadie added 2 commits May 16, 2019 17:05
In uncnf tx mode, when no data is received from the server there's no easy way to be notified from the MAC that the TX is complete. Before there was a hack in the mcps_confirm event but this triggers a deadlock in the case where data are effectively received. To workaround this, we use an xtimer to send a TX done message with an offset greater than the RX windows: this ensures that normal RX data are received by the called thread and the caller thread doesn't remain locked when no data is received after an unconfirmed message is sent
@aabadie aabadie added Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors) Area: pkg Area: External package ports Area: LoRa Area: LoRa radio support labels May 16, 2019
@aabadie aabadie requested a review from jia200x May 16, 2019 15:28
@jia200x
Copy link
Member

jia200x commented May 16, 2019

Thanks for tackling this.

I tried other solutions, like setting a message queue to the caller thread but I found this was a workaround not really fixing the original issue.

Why?
Strictly speaking the TX Done status should be reported by the Confirm primitive and any kind of data reception by Indication (although the pkg seems to call Indication even for ACKs...). I think it would simplify a lot the implementation if we use the messages as signals instead of a synchronization between the caller and MAC.

Also, "Indication right after Confirm" is only valid for class A, so we would need a message queue anyway for class B and class C.

What do you think about:

  • Sending a message with the packet to the caller in case of an Indication with data (same principle as GNRC's _pass_on_pkt)
  • Decide if we send an extra message with TX status. In most cases this is handled internally by the MAC (increment tx_failed counter), but I'm aware most LoRaWAN applications directly interact with the MAC.

This way we could also send other information to the caller (Link Check, etc) and the caller can't block or de-synchronize the MAC layer.
What do you think?

@ParksProjets
Copy link
Contributor

One way you could do it is to read McpsInd flag from LoRaMacFlags variable (which strangely is not static). This flag indicates that an Indication event will be trigger soon.

But in fact, the real problem is to use Indication event for confirming an uplink. I found some other potential bugs that can occur by doing so.

For example a single message (even CONFIRMED) we sent can produce two data responses if the LoRa server has two downlinks to make. It will send it a first downlink containing the first message data, an Ack and FramePending set to true. The semtech_loramac package will automatically make a second uplink (because of FramePending) and the LoRa server will send its second downlink. As semtech_loramac_recv now only returns when this second message is received, data from the first downlink will be lost.

Also, I noticed in LoRaMAC-node source code that there is a case where Indication event is not called (even with CONFIRMED message). It happens when the LoRa server send again a CONFIRMED downlink because it didn't received an Ack from the device. In that case, the application will be stuck forever on semtech_loramac_recv.

One way to solve all of these problems is to use Confirm event (and not Indication event) to confirm that an uplink has been sent, as expected by LoRaMAC-node.
For handling downlink data, we can imagine that the application could define a receiving thread or a function called when a downlink message arrives (triggered by Indication event from LoRaMAC-node). That's what is done in GNRC LoRaWAN.

@aabadie
Copy link
Contributor Author

aabadie commented May 17, 2019

Thanks for your help @jia200x @ParksProjets.

Your suggestion of only use mcps confirm event to confirm the TX is interesting, that would indeed make this more robust. For receiving messages, I prefer the thread approach over the callback function.

See #11541 that provides this change.

@aabadie
Copy link
Contributor Author

aabadie commented May 19, 2019

closing in favor of #11541

@aabadie aabadie closed this May 19, 2019
@aabadie aabadie deleted the pr/pkg/semtech_loramac_uncnf_deadlock branch May 20, 2019 06:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: LoRa Area: LoRa radio support Area: pkg Area: External package ports Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pkg/semtech_loramac: deadlock with UNCONFIRMED messages
3 participants