Random disconnect during transmission over WiFi #114

Closed
osrf-migration opened this issue Dec 13, 2019 · 8 comments
Labels
bug Something isn't working

Comments

@osrf-migration

Original report (archived issue) by Bart Cox (Bitbucket: bcox_pv).


Description

When we use ignition transport over WiFi, we experience long delays on (asynchronous) service calls and disconnects on pub/sub traffic. These are accompanied by frequent disconnect and connect events detected in the discovery layer. Interestingly, the delays affect a few nodes (but not all) at once and resolve at the same time as well. We have been able to rule out deadlock-like situations, because our nodes still accept and process service requests from nodes not affected by the delay. Once the delay resolves, the pending messages seem to arrive all at once.

We reproduced this problem with the basic example code from the source. When running the basic examples publisher.cc and subscriber.cc over WiFi, disconnection callbacks fire at random while both machines are still connected to the same network. The publisher/subscriber example typically disconnects within a few minutes, and in severe cases within seconds.

To rule out external factors, we used an isolated network with no other active clients on a professional-grade router and access point, but that had no influence on the robustness of the connections. During our tests we were also able to exclude the Ubuntu version (16.04/18.04), the client hardware/architecture, and the ignition-transport version (5.xx - 7.xx) as causes.

When we run the same tests on the same machines over a wired network, no long delays or disconnects occur; the connection is stable.

Steps to Reproduce

  • Use Ubuntu 18.04
  • Install dependencies
sudo apt-get update \
          && sudo apt-get -y install \
            gnupg lsb-release \
            cmake pkg-config cppcheck git mercurial build-essential curl \
            libprotobuf-dev protobuf-compiler libprotoc-dev libzmq3-dev uuid-dev \
            doxygen ruby-ronn libsqlite3-dev g++-8 \
          && sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-8 800 --slave /usr/bin/g++ g++ /usr/bin/g++-8 --slave /usr/bin/gcov gcov /usr/bin/gcov-8
echo "deb http://packages.osrfoundation.org/gazebo/ubuntu-stable $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/gazebo-stable.list
echo "deb http://packages.osrfoundation.org/gazebo/ubuntu-prerelease $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/gazebo-prerelease.list
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys D2486D2DD83DB69272AFE98867170598AF249743
  • Install ignition libraries
sudo apt-get update \
          && sudo apt-get -y install \
            libignition-cmake2-dev \
            libignition-math6-dev \
            libignition-msgs4-dev \
            libignition-tools-dev
  • Build ignition-transport 7.7.0 from source

hg clone https://osrf-migration.github.io/ignition-gh-pages/#!/ignitionrobotics/ign-transport/ ign-transport \
&& cd ign-transport \
&& hg up ignition-transport7_7.0.0 \
&& mkdir -p build \
&& cd build \
&& cmake ../ \
&& make -j4 \
&& sudo make install -j4 \
&& cd ../example \
&& mkdir -p build \
&& cd build \
&& cmake .. \
&& make -j4
  • Run the publisher example from the source code on machine A

export IGN_PARTITION=transmission_test 
export IGN_VERBOSE=1 
export IGN_IP=${OWN_IP} 
./build/publisher
  • Run the subscriber example from the source code on machine B

export IGN_PARTITION=transmission_test 
export IGN_VERBOSE=1 
export IGN_IP=${OWN_IP} 
./build/subscriber
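
The steps above leave ${OWN_IP} as a placeholder. A minimal sketch of one way to fill it in on each machine, assuming the iproute2 `ip` tool is installed and that the WiFi interface is called wlan0 (a hypothetical name; check `ip link` for the real one). The helper name extract_ipv4 is made up for this example.

```shell
# extract_ipv4: reads `ip -4 -o addr show` output on stdin and prints the
# first IPv4 address without its prefix length.
extract_ipv4() {
  awk '{ split($4, a, "/"); print a[1]; exit }'
}

# On a real machine (wlan0 is an assumed interface name):
# OWN_IP=$(ip -4 -o addr show dev wlan0 | extract_ipv4)

# Sample line in the one-line format printed by `ip -4 -o addr show`:
printf '3: wlan0    inet 192.168.1.23/24 brd 192.168.1.255 scope global wlan0\n' | extract_ipv4
# prints 192.168.1.23
```

Setting IGN_IP this way makes sure each node advertises the address of the wireless interface rather than that of another interface on the same machine.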

Expected behavior:

No disconnection callbacks when the machine is connected to the (wireless) network

Actual behavior:

After 2 minutes the subscriber gets a disconnect callback and stops receiving messages. The publisher keeps sending messages.

Reproduces how often:

Periodically.

Versions

  • Ubuntu 18.04
  • source install
  • ignition-transport 7.7.0

Additional context

Our first assumption was that UDP multicast traffic carrying discovery information might get lost over a WiFi connection. Therefore we have been experimenting with different parameter sets in the discovery layer, such as a lower heartbeat interval, a higher silence interval, etc. Only a longer silence interval improved performance in our tests, and only at large values of 20 seconds or more.

We have further tried forcing all the traffic over unicast by modifying the relay functionality such that all discovery-related messages are sent over unicast within the same network (but not relayed). We hoped this would lead to more stable connections, but we did not see any significant improvement.
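
To make the heartbeat/silence interval reasoning above concrete, here is a small illustrative model. It is not ignition-transport's discovery code: the function name detect_disconnects, the 1000 ms heartbeat period and the 3000 ms silence window are all assumptions chosen for the example. It only shows the general idea that a peer is declared gone when no heartbeat arrives within the silence interval, so a short burst of lost multicast packets is enough to trigger a disconnect unless the window is large.

```shell
# detect_disconnects SILENCE_MS T1 T2 ...
# Given the arrival times (ms, ascending) of a peer's heartbeats, prints the
# time at which the peer would be declared disconnected, one per line.
detect_disconnects() {
  local silence_ms=$1
  shift
  local prev=""
  local t
  for t in "$@"; do
    if [ -n "$prev" ] && [ $((t - prev)) -gt "$silence_ms" ]; then
      # No heartbeat arrived within the silence window that started at prev.
      echo $((prev + silence_ms))
    fi
    prev=$t
  done
}

# Heartbeats sent every 1000 ms; those at 3000, 4000 and 5000 ms are lost.
detect_disconnects 3000 0 1000 2000 6000 7000    # prints 5000
detect_disconnects 20000 0 1000 2000 6000 7000   # prints nothing
```

In this toy model three consecutive lost heartbeats are enough to cross a 3000 ms window, while a 20000 ms window rides through the gap, which is consistent with the observation that only large silence intervals helped.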

@osrf-migration
Author

Original comment by Bart Cox (Bitbucket: bcox_pv).


  • Edited issue description

@osrf-migration
Author

Original comment by Nate Koenig (Bitbucket: Nathan Koenig).


  • set assignee_account_id to "557058:095b1e12-74ed-4e20-b44f-2f0745b616e0"
  • set assignee to "nkoenig (Bitbucket: nkoenig, GitHub: nkoenig)"

@osrf-migration
Author

Original comment by Nate Koenig (Bitbucket: Nathan Koenig).


Thanks for the information. I'll work on reproducing your test setup. In the meantime, can you try out pull request #416? That PR feels related to this issue, but it's a shot in the dark right now.

@osrf-migration
Author

Original comment by Bart Cox (Bitbucket: bcox_pv).


I’ve run the revision of pull request #416, but the same problem persists on WiFi.

The bench test finishes successfully, but when running a publisher/subscriber example, all transmission halts after 250 seconds on average (sample size is 20).

For completeness I ran the same code on a fully wired connection and no errors were encountered (sample size is again 20 and the tests were stopped after 15 minutes due to the lack of any errors).
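
For anyone repeating this measurement, a rough sketch of how the time until transmission halts could be computed from a timestamped subscriber log. It assumes the subscriber prints one line per received message and that each line was prefixed with an epoch timestamp while it was captured, for example with `./build/subscriber | while IFS= read -r line; do printf '%s %s\n' "$(date +%s)" "$line"; done > run1.log`. The helper names time_to_halt and mean and the sample numbers are made up for illustration; they are not data from the tests above.

```shell
# time_to_halt: reads lines of the form "<epoch_seconds> <message>" on stdin
# and prints the seconds elapsed between the first and the last line.
time_to_halt() {
  awk 'NR == 1 { first = $1 } { last = $1 } END { if (NR > 0) print last - first }'
}

# mean: averages one number per line, printed with one decimal place.
mean() {
  awk '{ sum += $1 } END { if (NR > 0) printf "%.1f\n", sum / NR }'
}

# Synthetic log: messages at t=100, 101 and 130, then silence.
printf '100 msg\n101 msg\n130 msg\n' | time_to_halt   # prints 30

# Synthetic per-run results, averaged over three runs.
printf '10\n20\n30\n' | mean                           # prints 20.0
```

Running time_to_halt over each run's log and piping the results into mean gives an average time until the halt.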

@osrf-migration
Author

Original comment by Nate Koenig (Bitbucket: Nathan Koenig).


Okay. Thanks for the info. I'm still digging into the problem.

@osrf-migration
Author

Original comment by Nate Koenig (Bitbucket: Nathan Koenig).


  • changed assignee_account_id from "557058:095b1e12-74ed-4e20-b44f-2f0745b616e0" to "557058:4ded1ddf-947e-4154-bbd1-3dba24f1bdbd"
  • changed assignee from "nkoenig (Bitbucket: nkoenig, GitHub: nkoenig)" to "caguero (Bitbucket: caguero, GitHub: caguero)"

@osrf-migration
Author

Original comment by Carlos Agüero (Bitbucket: caguero, GitHub: caguero).


See pull request #436. Bart Cox (bartcox), could you confirm that the pull request fixes your issue (only for pub/sub for now)?

@osrf-migration added the major and bug labels on Apr 15, 2020
@nkoenig
Contributor

nkoenig commented May 8, 2020

Closing in order to triage the issue tracker. Please re-open if the problem persists.

@nkoenig closed this as completed on May 8, 2020
nkoenig added a commit that referenced this issue Aug 28, 2020
* Remove warnings using ZMQ 4.3.1 or greater.
* Do not use ZMQ_CPP11
* Win debugging.
* backport improved compiler support for std::filesystem
* Close branch backport_compiler_filesystem
* Restore original Playback::Start and add overload with new parameter to fix ABI
* bump to 7.2.2 and update changelog

* Close branch fix_abi_7

* Write to disk from a background thread in log recorder

* Update Changelog

* Move `dataWriterState = true` to Recorder::Implementation::DataWriterThread() thread.

* Revert moving dataWriterState

* Failing test with incorrect time stamps

* Correctly record message reception time stamp

* Reorder functions

* Specify buffer size in MB rather than number of elements in the data queue

* Flush any remaining data to log file when stopping the Recorder

* Codecheck. The rvalue ref is used to ensure that std::vector is always moved

* Version update

* Added tag ignition-transport7_7.3.0~pre1 for changeset 173fae6c362d

* recorder.cc: include <optional>

* Add console message to indicate buffer being flushed

* Add <numeric>

* Close branch async_recorder

* Prepare for 7.3.0

* Added tag ignition-transport7_7.3.0 for changeset 367d4f1bfcf7

* Configurable buffer sizes.

* Fix typo.

* Changelog.

* Clarify high water mark policy.

* Tweak documentation and error messages.

* Close branch issue_116_transport7

* fix line lengths

* Close branch codecheck7

* Update default values for the high water marks

* Update buffer default values

* Changelog.md edited online with Bitbucket

* Close branch default_hwm

* Adding connection message.

* ConnectionMsg implementation.

* Test

* No control socket.

* Preserve ABI.

* Discard registrations when needed.

* Tweaks.

* Changelog

* Fix issue #114.

* Close branch discovery_extended_p2

* 7.4.0

* Move changelog entry

* Close branch ign-transport7-4

* Added tag ignition-transport7_7.4.0 for changeset 083e7bf41080

* Protobuf warnings

* Close branch proto_deprecations

* Close branch issue_111

* Windows warnings

* revert commit to release branch

* Fix version for send_falgs command

* Close branch ign-transport7_fix_send_flags

* Backport pull request #441

* updates

* Added another check

* Close branch issue_118

* mv hgignore

Signed-off-by: claireyywang <clairewang@openrobotics.org>

* add gitignore

Signed-off-by: claireyywang <clairewang@openrobotics.org>

* [ign-transport7] Update BitBucket links (#123)

* [ign-transport7] Update BitBucket links

Signed-off-by: Louise Poubel <louise@openrobotics.org>

* changelog pull-requests

* Apply suggestions from code review

* Update tutorials/07_relay.md

Co-authored-by: Marya Belanger <marya@openrobotics.org>

* [ign-transport7] Workflow updates (#132)

* [ign-transport7] Workflow updates

Signed-off-by: Louise Poubel <louise@openrobotics.org>

* Helper function to get a valid topic name (#153)

Signed-off-by: Louise Poubel <louise@openrobotics.org>

* Remove Windows warnings (#151)

Signed-off-by: Carlos Aguero <caguero@openrobotics.org>

* Remove warnings on Homebrew (#150)

Signed-off-by: Carlos Aguero <caguero@openrobotics.org>
Co-authored-by: Louise Poubel <louise@openrobotics.org>

* Bump to 7.5.0 (#156)

Signed-off-by: Louise Poubel <louise@openrobotics.org>

* Modernize actions CI (#158)

Signed-off-by: Louise Poubel <louise@openrobotics.org>

* remove ci-bionic

Signed-off-by: Louise Poubel <louise@openrobotics.org>

* add focal

Signed-off-by: Louise Poubel <louise@openrobotics.org>

* msgs5

Signed-off-by: Louise Poubel <louise@openrobotics.org>

* Suppress focal-specific warnings (#159)

* Suppress focal-specific warnings

Signed-off-by: Michael Carroll <michael@openrobotics.org>

* Warn when lsb_release isn't present

Signed-off-by: Michael Carroll <michael@openrobotics.org>

* Adding header guard.

Signed-off-by: Carlos Agüero <caguero@openrobotics.org>

* Include correct header file for version check

Signed-off-by: Michael Carroll <michael@openrobotics.org>

* Added more debug output

Signed-off-by: Nate Koenig <nate@openrobotics.org>

* Fix focal test and codecheck

Signed-off-by: Nate Koenig <nate@openrobotics.org>

* Change endtime expectation

Signed-off-by: Carlos Agüero <caguero@openrobotics.org>

Co-authored-by: Carlos Agüero <caguero@openrobotics.org>
Co-authored-by: Nate Koenig <nate@openrobotics.org>

Co-authored-by: Carlos Aguero <caguero@osrfoundation.org>
Co-authored-by: Steve Peters <scpeters@openrobotics.org>
Co-authored-by: Steve Peters <scpeters@osrfoundation.org>
Co-authored-by: Carlos Agüero <cen.aguero@gmail.com>
Co-authored-by: Addisu Z. Taddese <addisu@openrobotics.org>
Co-authored-by: Nate Koenig <natekoenig@gmail.com>
Co-authored-by: Carlos Aguero <caguero@openrobotics.org>
Co-authored-by: Jose Luis Rivero <jrivero@osrfoundation.org>
Co-authored-by: claireyywang <clairewang@openrobotics.org>
Co-authored-by: Marya Belanger <marya@openrobotics.org>
Co-authored-by: Michael Carroll <michael@openrobotics.org>
Co-authored-by: Nate Koenig <nate@openrobotics.org>
chapulina pushed a commit that referenced this issue Sep 1, 2020