🧑‍🌾 Flaky test test_generic_pub_sub__rmw_connextdds in linux and linux-rhel nightlies #124

Closed
Crola1702 opened this issue Jul 3, 2024 · 3 comments · Fixed by #125
@Crola1702
Description

test_generic_pub_sub__rmw_connextdds is failing in the linux and linux-rhel nightlies. As it is a flaky test, it fails mostly in repeated jobs.

Reference build:

Test regressions:

Log output (see test report full log output):

AssertionError: 2 != 1 : expected 1 event matching rmw_publisher_handle=24180656 in events: <Long output...>

Flaky ratio:

| job_name | last_fail | first_fail | build_count | failure_count | failure_percentage |
|---|---|---|---|---|---|
| nightly_linux-rhel_repeated | 2024-07-02 | 2024-06-22 | 15 | 11 | 73.33 |
| nightly_linux-rhel_release | 2024-07-02 | 2024-06-27 | 15 | 6 | 40.0 |
| nightly_linux_repeated | 2024-07-01 | 2024-06-22 | 14 | 9 | 64.29 |
| nightly_linux-rhel_debug | 2024-07-01 | 2024-06-22 | 15 | 6 | 40.0 |
| nightly_linux_release | 2024-06-30 | 2024-06-22 | 15 | 2 | 13.33 |
| nightly_linux_debug | 2024-06-30 | 2024-06-30 | 15 | 1 | 6.67 |
| Rci__nightly-connext_ubuntu_noble_amd64 | 2024-06-29 | 2024-06-22 | 11 | 6 | 54.55 |
@christophebedard (Member) commented Jul 3, 2024

Thanks for opening this issue. I usually look at the nightly CI results from the bot's comment every day, and I've been noticing that a number of these tracing tests have started failing somewhat consistently.

This test_generic_pub_sub test runs the test_generic_ping & test_generic_pong executables, which are pretty much equivalent to the test_pub_sub test and the test_ping & test_pong executables. Then the following sequence is expected:

  1. The *_ping process has a timer that periodically publishes a message on the /ping topic
  2. The *_pong process receives the message from the /ping topic, immediately publishes a message on the /pong topic, and then shuts down
  3. The *_ping process receives the message from the /pong topic and shuts down

Due to potential timing differences in the launch order, the initial /ping message has to be published periodically until the process/node is shut down; otherwise, the *_pong process might not get the first/only /ping message, leading to the test hanging.
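For illustration, here is a minimal rclpy sketch of that ping-side behavior (the actual test executables use the generic pub/sub API and their own wiring; the class name, topic types, and structure here are simplified assumptions): a 500 ms timer keeps republishing /ping until a /pong reply arrives, at which point the node stops spinning.

```python
# Illustrative sketch only -- not the actual test_generic_ping source.
# A timer republishes /ping every 500 ms until a /pong reply is received,
# so a late-starting pong process still gets a message.
import rclpy
from rclpy.node import Node
from std_msgs.msg import String  # placeholder message type


class PingNode(Node):  # hypothetical name, for illustration
    def __init__(self):
        super().__init__('ping')
        self.done = False
        self.pub = self.create_publisher(String, 'ping', 10)
        self.sub = self.create_subscription(String, 'pong', self.on_pong, 10)
        # Keep publishing every 500 ms in case the pong process is not up yet.
        self.timer = self.create_timer(
            0.5, lambda: self.pub.publish(String(data='ping')))

    def on_pong(self, msg):
        # Reply received: the ping->pong sequence completed, stop publishing.
        self.done = True


def main():
    rclpy.init()
    node = PingNode()
    while rclpy.ok() and not node.done:
        rclpy.spin_once(node, timeout_sec=0.1)
    node.destroy_node()
    rclpy.shutdown()


if __name__ == '__main__':
    main()
```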

The test only expects 1 initial /ping message, which has worked in the past*. However, if the machine is too slow, or if the rmw implementation** takes too long to initialize, a second message might be needed to properly trigger the ping->pong sequence. If we look at the timestamps of the 2 ros2:rmw_publish events, they are ~500 ms apart, which is the period of the initial timer: 1719987494798313704 ns - 1719987494298538374 ns ≈ 499.8 ms.

I just need to relax these kinds of assertions and expect >= 1 /ping message instead. I should also do the same for test_pub_sub.
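A rough sketch of what that relaxation could look like, in the unittest style of the failure above (the helper method, event dict layout, and field names are assumptions for illustration, not the actual tracetools_test code):

```python
# Hypothetical sketch of the relaxed check; event/field names are illustrative.
import unittest


class TestPubSubAssertion(unittest.TestCase):

    def check_ping_publish_count(self, events, ping_pub_handle):
        matching = [
            e for e in events
            if e.get('_name') == 'ros2:rmw_publish'
            and e.get('rmw_publisher_handle') == ping_pub_handle
        ]
        # Before: self.assertEqual(len(matching), 1)
        # After: the initial timer may fire more than once before /pong
        # arrives, so accept one or more /ping publications.
        self.assertGreaterEqual(
            len(matching), 1,
            f'expected >= 1 event matching rmw_publisher_handle={ping_pub_handle}',
        )

    def test_relaxed_check(self):
        # Minimal fabricated trace data to exercise the check.
        events = [
            {'_name': 'ros2:rmw_publish', 'rmw_publisher_handle': 24180656},
            {'_name': 'ros2:rmw_publish', 'rmw_publisher_handle': 24180656},
        ]
        self.check_ping_publish_count(events, 24180656)


if __name__ == '__main__':
    unittest.main()
```

The key change is assertEqual -> assertGreaterEqual, so extra timer-driven /ping publications no longer fail the test.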

(*) I realize that this is kind of incoherent given the paragraph above 😅

(**) we've only recently started running these tests against all rmw implementations (#116)

@clalancette (Contributor)

> Due to potential timing differences in the launch order, the initial /ping message has to be published periodically until the process/node is shut down; otherwise, the *_pong process might not get the first/only /ping message, leading to the test hanging.

I'm not sure if this changes the nature of the test, but would using a transient_local topic help? That would ensure that no matter what order ping and pong come up in, pong will always get one message.
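For reference, this is roughly what TRANSIENT_LOCAL durability looks like in rclpy (an illustrative sketch, not the actual test code; the topic and message type are placeholders). A transient_local publisher keeps its last message(s) so that a subscription starting later, with compatible durability, still receives them:

```python
# Illustrative sketch of the transient_local idea (not the actual test code).
import rclpy
from rclpy.node import Node
from rclpy.qos import QoSDurabilityPolicy, QoSProfile, QoSReliabilityPolicy
from std_msgs.msg import String  # placeholder message type

# Keep the last message and deliver it to late-joining transient_local subscriptions.
LATCHED_QOS = QoSProfile(
    depth=1,
    reliability=QoSReliabilityPolicy.RELIABLE,
    durability=QoSDurabilityPolicy.TRANSIENT_LOCAL,
)


def main():
    rclpy.init()
    node = Node('ping')
    pub = node.create_publisher(String, 'ping', LATCHED_QOS)
    # Published once; a pong node that starts later can still receive it.
    pub.publish(String(data='ping'))
    rclpy.spin(node)


if __name__ == '__main__':
    main()
```

The pong-side subscription would also need TRANSIENT_LOCAL durability; a volatile subscription does not receive the stored sample.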

@christophebedard (Member)

Huh, I don't know why I didn't think of that 🤦 yeah, that sounds like the perfect solution. I'll give that a shot.
