Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race condition between roscore and roslaunch --wait #1097

Closed
drigz opened this issue Jul 12, 2017 · 9 comments
Closed

Race condition between roscore and roslaunch --wait #1097

drigz opened this issue Jul 12, 2017 · 9 comments

Comments

@drigz
Copy link
Contributor

drigz commented Jul 12, 2017

Steps to reproduce

echo "<launch></launch>" > launch.xml
roscore & roslaunch --wait launch.xml

Rarely, both roscore and roslaunch will start the rosout node. If this happens, the rosout nodes will conflict forever as they are both started with respawn="true".

To make this more likely to happen, you can do the following:

edit _setup() in roslaunch/launch.py:
        self._launch_master()
        import random
        time.sleep(random.random())
        self._launch_core_nodes() 
edit _wait_for_master() in roslaunch/rlutil.py:
    while not is_running:
        # time.sleep(0.1)
        is_running = m.is_running()
run this script and look for multiple "started core service" lines:
    #!/bin/bash
    export PYTHONUNBUFFERED=yes
    roscore | sed -n "/core service/ {s/^/roscore: /; p}" &
    for i in $(seq 1 100); do
      roslaunch --wait launch.xml | sed -n "/core service/ {s/^/roslaunch $i: /; p}" &
    done
    sleep 5
    pkill -P $$
    wait

Explanation

If you run roslaunch --wait, it will wait for the master before starting, but the --wait flag has no effect on the starting of core nodes. In roslaunch/launch.py, we have:

        # start up the core: master + core nodes defined in core.xml
        self._launch_master()
        self._launch_core_nodes() 

unconditionally of whether --core or --wait is used. In the --wait case, rlutil._wait_for_master has already been called, and _launch_master() notices that the master is already running so does nothing. However, _launch_core_nodes may still try to launch rosout if the roscore hasn't started it yet.

Workaround

Run this command before roslaunch --wait:

until rosnode info rosout | grep Pid; do sleep 1; done
@dirk-thomas
Copy link
Member

I guess it is an uncommon use case to start roscore as well as roslaunch in parallel. Currently that will have the race you described. The right solution for this would be to add an option --no-core to roslaunch to never attempt to start the core on its own and always rely on it being started externally.

Beside that I am not sure why you want to invoke roscore and roslaunch in parallel in the first plac. You could just invoke roslaunch and it will start the core automatically. Usually the core is only started separately in order to be able to stop / restart the launch file without restarting the core. But since you invoke these commands together that doesn't seem to be your use case. Maybe describing your use case will help to understand why you are trying to do it.

Anyway regarding the mentioned new option it would be great if you could provide a PR implementing it if that is of use for your application.

@drigz
Copy link
Contributor Author

drigz commented Jul 12, 2017

Thanks for the suggestion!

We have a containerized setup running on Kubernetes. When you create a deployment, it starts a roscore container along with containers for any ROS tools that are desired (eg nav, motion planning, network bridging). Some of these ROS tools are just single nodes, so there's no risk, but others use roslaunch to bring up multiple nodes, in which case we see this problem.

Does anybody start the rosmaster without starting the core nodes? It seems like roscore is the only supported way of starting a master - f that's the case, we could just make --wait skip the core services, since whatever started the master will start them. This would be simpler and less error prone, since most users wouldn't find out about --no-core. What do you think?

@dirk-thomas
Copy link
Member

The --wait option has a different semantic at the moment. It still starts the core but then wait until the core is up and available before starting any node. Existing code will rely on that behavior. So changing the semantic of the existing option would likely break those.

Therefore adding a new option for this is better. Both options could even make sense to be used together or not. E.g. using --no-core without --wait if you want to start the nodes even though the core might not be ready yet. Or using both options to wait for the external core to be available.

@drigz
Copy link
Contributor Author

drigz commented Jul 12, 2017

The --wait option has a different semantic at the moment. It still starts the core but then wait until the core is up and available before starting any node

I don't quite understand this. As I understand, --wait means the first thing done is to wait for a master to come up, ie it doesn't wait for the core nodes/params:

    if options.wait_for_master:
        if options.core:
            parser.error("--wait cannot be used with roscore")
        rlutil._wait_for_master()            

From then on, the option is largely ignored, and _launch_master() and _launch_core_nodes() are both called, even though _launch_master() has no effect (since we already waited for an existing master) and _launch_core_nodes() introduces the race condition described (but normally has no effect, since roscore has already launched the core nodes).

If the --wait option waits for the master as before, but doesn't call _launch_core_nodes(), code will not break (except for the case that they are running rosmaster directly without running roscore, and assuming that roslaunch --wait will start core nodes without starting a master, which seems unlikely).

I share your caution about changing the behavior of the existing code, but the current behaviour of --wait is broken in the case of a simultaneous launch, and leaving it broken by default seems the bigger risk.

E.g. using --no-core without --wait if you want to start the nodes even though the core might not be ready yet.

This would not be reliable - roslaunch can start nodes before the master is running, but it can't set params.

@mikepurvis
Copy link
Member

mikepurvis commented Jul 12, 2017

I definitely assumed the behaviour of --wait was to skip starting the core nodes.

We don't currently but have contemplated an architecture of starting roscore + N roslaunch --wait instances each in their own supervisor/upstart/systemd background job (so they can be stopped/started/disabled etc using standard tools and APIs). It seems like attempting to do that with the behaviour as it currently is would fail in subtle and hard-to-debug ways. And I don't really see the use case for the other direction— what would be the circumstance where it would be desirable to wait for the rosmaster, but still launch your own instance of /rosout?

Given that the documentation literally says:

--wait
Delay the launch until a roscore is detected.

I would support making the existing flag suppress all aspects of the roscore (rosmaster, rosout node).

@tappan-at-git
Copy link
Contributor

I definitely assumed the behaviour of --wait was to skip starting the core nodes.

Same. I've been fighting this issue for an application much like drigz's. The current behavior seems non-intuitive, and not well described by the documentation.

@dirk-thomas
Copy link
Member

My memory of the existing options was wrong. I had to re-read the relevant code to understand what it actually does. Sorry for that.

I agree with your idea to change the behavior that when passing --wait roslaunch should not attempt to start the core nodes. 👍

One way of achieving this (without passing the flag through multiple levels of API) would be to update _launch_master to return a boolean if a master was started by that function. The following invocation of _launch_core_nodes could than be made conditional only if that was the case. Otherwise assuming the master was started externally (and using rosmaster is discouraged) and therefore expecting the core nodes to be already running. Does that sound like it would work for you? Since the change was really minimal I created #1098. It would be great it you could give it a try and confirm that it addresses your use case.

@drigz
Copy link
Contributor Author

drigz commented Jul 13, 2017

Great idea - that's a super simple way to fix it, and fixes the issue in my reproduction setup. Thanks!

@dirk-thomas
Copy link
Member

@drigz Thank you for testing the patch. Now I just have to get CI to turn green 😉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants