Random disconnect during transmission over WiFi #114
Comments
Original comment by Bart Cox (Bitbucket: bcox_pv).
Original comment by Nate Koenig (Bitbucket: Nathan Koenig).
Original comment by Nate Koenig (Bitbucket: Nathan Koenig). Thanks for the information. I'll work on reproducing your test setup. In the meantime, can you try out pull request #416? That PR feels related to this issue, but it's a shot in the dark right now.
Original comment by Bart Cox (Bitbucket: bcox_pv). I’ve run the revision of pull request #416 but the same problem persists on WiFi. The bench test finishes successfully, but when running a publisher/subscriber example, all transmission halts after 250 seconds on average (sample size is 20). For completeness, I ran the same code on a fully wired connection and encountered no errors (sample size is again 20; the tests were stopped after 15 minutes because no errors had occurred).
Original comment by Nate Koenig (Bitbucket: Nathan Koenig). Okay. Thanks for the info. I'm still digging into the problem.
Original comment by Nate Koenig (Bitbucket: Nathan Koenig).
Original comment by Carlos Agüero (Bitbucket: caguero, GitHub: caguero). See pull request #436. Bart Cox (bartcox), could you confirm that the pull request fixes your issue (only for pub/sub for now)?
Closing in order to triage the issue tracker. Please re-open if the problem persists.
Original report (archived issue) by Bart Cox (Bitbucket: bcox_pv).
Description
When we use ignition transport over WiFi, we experience long delays in communication via (asynchronous) service calls and disconnects on pub/sub traffic. These seem to be accompanied by frequent disconnects and reconnects detected in the discovery layer. Interestingly, the delays seem to hit a few nodes (but not all) at once and to resolve at the same time as well. We have been able to rule out deadlock-like situations, because our nodes still accept and process service requests from nodes not affected by the delay. Once the delay resolves, the messages seem to arrive all at once.
We reproduced the problem with the basic example code from the source. When running the basic examples publisher.cc and subscriber.cc over WiFi, disconnection callbacks fire at random while both machines are still connected to the same network. The publisher/subscriber example shows similar communication problems: it disconnects within a few minutes and, in severe cases, within seconds.
To rule out external factors, we used an isolated network with no other active clients, on a professional-grade router and access point, but that seemed to have no influence on the robustness of the connections. During our tests we were also able to exclude Ubuntu versions (16.04/18.04), client hardware/architecture, and ignition-transport versions (5.xx - 7.xx) as causes.
When we run the same tests on the same machines over a wired network, no long delays or disconnects occur; the connection is stable.
Steps to Reproduce
Expected behavior:
No disconnection callbacks when the machine is connected to the (wireless) network
Actual behavior:
After 2 minutes the subscriber gets a disconnect callback and stops receiving messages. The publisher keeps sending messages.
Reproduces how often:
Periodically.
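For readers unfamiliar with how such a disconnect callback can fire while the link is still up: heartbeat-based discovery layers generally mark a peer as gone once no heartbeat has arrived for some silence interval. The sketch below illustrates that general idea only. `LivenessTracker`, its methods, and the peer name in the usage note are hypothetical and are not taken from ignition-transport's source.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical helper (not ignition-transport code): tracks the last time a
// heartbeat was seen from each peer and expires peers that stay silent.
class LivenessTracker
{
 public:
  explicit LivenessTracker(std::chrono::milliseconds silence)
    : silence_(silence)
  {
    assert(silence.count() > 0);
  }

  // Record that a heartbeat from `peer` arrived at time `now`.
  void OnHeartbeat(const std::string &peer, Clock::time_point now)
  {
    lastSeen_[peer] = now;
  }

  // Remove and return every peer whose last heartbeat is older than the
  // silence interval. A real discovery layer would fire its disconnect
  // callback for each returned peer.
  std::vector<std::string> Expire(Clock::time_point now)
  {
    std::vector<std::string> gone;
    for (auto it = lastSeen_.begin(); it != lastSeen_.end();)
    {
      if (now - it->second > silence_)
      {
        gone.push_back(it->first);
        it = lastSeen_.erase(it);
      }
      else
      {
        ++it;
      }
    }
    return gone;
  }

 private:
  std::chrono::milliseconds silence_;
  std::map<std::string, Clock::time_point> lastSeen_;
};
```

With a 3000 ms silence interval, a peer that last sent a heartbeat at time t0 is still considered alive at t0 + 2000 ms and is expired at t0 + 3001 ms, even if the peer itself is healthy and only its heartbeats were dropped on the way.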
Versions
Additional context
Our first assumption was that UDP multicast traffic carrying discovery information might get lost over a WiFi connection. We therefore experimented with different parameter sets in the discovery layer, such as a lower heartbeat interval and a higher silence interval. Only a longer silence interval improved performance in our tests, and only at large values of 20 seconds or more.
We also tried forcing all traffic over unicast by modifying the relay functionality so that all discovery-related messages are sent over unicast within the same network (but not relayed). We had hoped this would lead to more stable connections, but we did not see any significant improvement.
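To make the heartbeat/silence trade-off concrete, here is a rough back-of-the-envelope model, not a measurement. It assumes each heartbeat is lost independently with probability `lossProb`, and that a false disconnect needs `silenceMs / heartbeatMs` consecutive lost heartbeats. Both are simplifying assumptions (WiFi multicast loss tends to be bursty), and the function names are hypothetical. The expected wait uses the standard result for the number of trials until a run of n consecutive events of probability p: (p^-n - 1) / (1 - p).

```cpp
#include <cassert>
#include <cmath>

// Approximate number of consecutive heartbeats that must be lost before the
// silence interval elapses (assumption: integer division is close enough).
int ConsecutiveLossesNeeded(int heartbeatMs, int silenceMs)
{
  assert(heartbeatMs > 0 && silenceMs >= heartbeatMs);
  return silenceMs / heartbeatMs;
}

// Expected time, in seconds, until the first run of lost heartbeats long
// enough to trigger a false disconnect, under independent per-heartbeat loss.
double ExpectedSecondsToFalseDisconnect(double lossProb, int heartbeatMs,
                                        int silenceMs)
{
  assert(lossProb > 0.0 && lossProb < 1.0);
  const int n = ConsecutiveLossesNeeded(heartbeatMs, silenceMs);
  // Expected number of heartbeat slots until n losses in a row.
  const double slots = (std::pow(lossProb, -n) - 1.0) / (1.0 - lossProb);
  return slots * heartbeatMs / 1000.0;
}
```

Under this model the expected time to a false disconnect grows roughly exponentially with the number of heartbeats that fit into the silence interval, which is at least consistent with the observation that only much longer silence intervals helped. Shortening the heartbeat interval at a fixed silence interval has the same effect in the model, so the fact that it did not help in practice may hint that the losses are not independent.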