Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using different typesupports on same node and topic ends in segfault #265

Closed
gonzodepedro opened this issue Mar 21, 2019 · 29 comments · Fixed by #342 or #350
Closed

Using different typesupports on same node and topic ends in segfault #265

gonzodepedro opened this issue Mar 21, 2019 · 29 comments · Fixed by #342 or #350
Labels
bug Something isn't working help wanted Extra attention is needed in progress Actively being worked on (Kanban column)

Comments

@gonzodepedro
Copy link

Bug report

Required Info:

  • Operating System:
    • Ubuntu 18.04
  • Installation type:
    • Source
  • Version or commit hash:
    • Head
  • DDS implementation:
    • Fast RTPS
  • Client library (if applicable):
    • N/A

Steps to reproduce issue

Create a node that registers to an already registered topic but with a different typesupport. If we assume that the node is using rosout log, and that rosout regsiters a publisher in the /rosout topic with C typesupport. You can subscribe to the /rosout topic with a CPP typesupport.

#include <iostream>
#include "rclcpp/rclcpp.hpp"
#include <rcl_interfaces/msg/log.hpp>

class MinimalSubscriber : public rclcpp::Node
{
public:
  MinimalSubscriber()
  : Node("minimal_subscriber")
  {
    subscription_ = this->create_subscription<rcl_interfaces::msg::Log>(
      "rosout",
      [this](rcl_interfaces::msg::Log::UniquePtr msg) {
      RCLCPP_INFO(this->get_logger(), "I heard: '%s'", msg->msg.c_str());
    });
  }

private:
  rclcpp::Subscription<rcl_interfaces::msg::Log>::SharedPtr subscription_;
};

int main(int argc, char * argv[])
{
  rclcpp::init(argc, argv);
  rclcpp::spin(std::make_shared<MinimalSubscriber>());
  rclcpp::shutdown();
  return 0;
}

Expected behavior

Both registered publisher and subscribers should be able to publish and receive messages for the topic.

Actual behavior

The process exits with a segfault when receiving a message on the subscribed topic.

Additional information

Two different typesupports cannot be in the same topic at the same time.
This is due to the fact that rmw_fastrtps uses the same name to register both typesupports in the FastRTPS Domain. See code

Disambiguate with a different name (appending or prepending a string) is not possible because the name of the typesupport is directly related to the name of the topic. See code and fastrtps TopicData

Probably happening in other implementations.

@gonzodepedro
Copy link
Author

This only happens in fastrtps.
Connext and OpenSplice don't show this behavior.

@hidmic
Copy link
Contributor

hidmic commented Mar 28, 2019

Is there anyone from eProsima we can ping to take a look at this?

@clalancette
Copy link
Contributor

Is there anyone from eProsima we can ping to take a look at this?

Generally @richiware is our point of contact with eProsima. @richiware any thoughts about this issue?

@richiware
Copy link
Contributor

I will try to check it this week.

@richiware
Copy link
Contributor

I was not able to check it. We are working on the next release and I didn't have time. We have this issue tracked on our tracker system. We will address it at some point. Sorry.

@dirk-thomas
Copy link
Member

This is really a blocker for FastRTPS compatibility. So please consider working on a patch early in order to ensure a fix will be available in the upcoming release.

@LuisGP
Copy link

LuisGP commented Apr 24, 2019

Hi everybody,

We have been taking a look into this and the problem is that in this case we are trying to register two different types (the C typesupport, and the CPP typesupport) with the same name in the same participant, and this behavior isn't allowed by Fast-RTPS.

Once the participant registers one typesupport, it will always use that typesupport. So, when the application tries to write C data with the CPP typesupport registered, it will crash.

We will discuss internally how to solve it, in the meantime, do you have any ideas?

In any case, is ROS2 thought to use C and CPP typesupport mixed in the same application?

@dirk-thomas
Copy link
Member

is ROS2 thought to use C and CPP typesupport mixed in the same application?

Yes, in general we can have multiple different representations on the ROS side which all map to the same type for DDS. I am not sure if that N-to-1 mapping can happen within the rmw_fastrtps_cpp package or needs to be supported by FastRTPS natively.

@LuisGP
Copy link

LuisGP commented Apr 25, 2019

Thank you @dirk-thomas
We need to think carefully about how to support that N-to-1 mapping.
Just to throw some light on our throughs, we need to be able to switch between typesupport with the same name, taking into account that the participant currently has a 1-1 relationship between types and names and without losing DDS type name (and thus, DDS matching).
Any suggestion will be welcome.

@dirk-thomas
Copy link
Member

Please see ros2/rcl_interfaces#61 for a workaround (passing _log_disable_rosout:=true).

@sloretz
Copy link
Contributor

sloretz commented Jun 21, 2019

Commenting that this appears to prevent a lifecycle node from subscribing to the .../transition_event topic of another lifecycle node. A C++ subscriber crashes when taking data because the C typesupport was established for type by rcl_lifecycle.

backtrace
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007ffff6a7b801 in __GI_abort () at abort.c:79
#2  0x00007ffff6ac4897 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7ffff6bf1b9a "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3  0x00007ffff6acb90a in malloc_printerr (str=str@entry=0x7ffff6befefd "realloc(): invalid pointer") at malloc.c:5350
#4  0x00007ffff6ad3eba in __GI___libc_realloc (oldmem=0x5555562c1980, bytes=1) at malloc.c:3174
#5  0x00007ffff4bfb432 in rosidl_generator_c__String__assignn () from /opt/ros/dashing/lib/librosidl_generator_c.so
#6  0x00007fffe9bf4e47 in ?? () from /opt/ros/dashing/lib/liblifecycle_msgs__rosidl_typesupport_fastrtps_c.so
#7  0x00007fffe9bf5329 in ?? () from /opt/ros/dashing/lib/liblifecycle_msgs__rosidl_typesupport_fastrtps_c.so
#8  0x00007ffff317370c in ?? () from /opt/ros/dashing/lib/librmw_fastrtps_cpp.so
#9  0x00007ffff2b37ce2 in rmw_fastrtps_shared_cpp::TypeSupport::deserialize(eprosima::fastrtps::rtps::SerializedPayload_t*, void*) () from /opt/ros/dashing/lib/librmw_fastrtps_shared_cpp.so
#10 0x00007ffff2583830 in eprosima::fastrtps::SubscriberHistory::takeNextData(void*, eprosima::fastrtps::SampleInfo_t*) () from /opt/ros/dashing/lib/libfastrtps.so.1
#11 0x00007ffff2b34fc3 in rmw_fastrtps_shared_cpp::_take(char const*, rmw_subscription_t const*, void*, bool*, rmw_message_info_t*, rmw_subscription_allocation_t*) ()
   from /opt/ros/dashing/lib/librmw_fastrtps_shared_cpp.so
#12 0x00007ffff6826d28 in rcl_take () from /opt/ros/dashing/lib/librcl.so
#13 0x00007ffff7667a2c in rclcpp::executor::Executor::execute_subscription(std::shared_ptr<rclcpp::SubscriptionBase>) () from /opt/ros/dashing/lib/librclcpp.so
#14 0x00007ffff7668b25 in rclcpp::executor::Executor::execute_any_executable(rclcpp::executor::AnyExecutable&) () from /opt/ros/dashing/lib/librclcpp.so
#15 0x00007ffff766ef2f in rclcpp::executors::SingleThreadedExecutor::spin() () from /opt/ros/dashing/lib/librclcpp.so
#16 0x00005555555586c8 in main (argc=2, argv=0x7fffffffe7e8) at /home/sloretz/workspaces/dashing/src/lifecycle_manager/src/lifecycle_manager_standalone.cpp:27

@nburek
Copy link
Contributor

nburek commented Aug 8, 2019

For anyone who is blocked on having their node subscribe to rosout when using rclcpp you can work around this issue by disabling logging to rosout on the node that you have subscribing to rosout. The example of how to do this using the ros1_bridge as an example is below.

When using ros2 run:
ros2 run ros1_bridge dynamic_bridge __log_disable_rosout:=true

In your launch file when using ros2 launch:

launch_ros.actions.Node(package='ros1_bridge`,
  node_executable='dynamic_bridge',
  arguments=['__log_disable_rosout:=true'])

@mm318
Copy link
Member

mm318 commented Aug 28, 2019

So __log_disable_rosout:=true needs to come after --ros-args now?

How are you supposed to write that as a launch_ros.actions.Node() in a launch file now?

Looking at ros2/launch_ros#52, launch_ros.actions.Node(arguments=['__log_disable_rosout:=true']) will put __log_disable_rosout:=true before the --ros-args flag.

@ivanpauno
Copy link
Member

So __log_disable_rosout:=true needs to come after --ros-args now?

Yes

Looking at ros2/launch_ros#52, launch_ros.actions.Node(arguments=['__log_disable_rosout:=true']) will put __log_disable_rosout:=true before the --ros-args flag.

No. First args is added to cmd, then --ros-args is added followed by remapping rules and a parameter file. See:
https://github.com/ros2/launch_ros/blob/413ad50fc6e759b4fad60a80a8297de0c475b628/launch_ros/launch_ros/actions/node.py#L129-L132

If you want to pass __log_disable_rosout, you have to do:
launch_ros.actions.Node(arguments=['--ros-args', '__log_disable_rosout:=true'])

We could add an extra parameter in the constructor, called ros_arguments, as syntactic sugar for this.

@mm318
Copy link
Member

mm318 commented Aug 29, 2019

Won't

launch_ros.actions.Node(arguments=['--ros-args', '__log_disable_rosout:=true'])

result in an extraneous --ros-args at the end, like

--ros-args __log_disable_rosout:=true --ros-args

That looks non-ideal.

We could add an extra parameter in the constructor, called ros_arguments, as syntactic sugar for this.

That would be great!

@nburek
Copy link
Contributor

nburek commented Aug 29, 2019

This seems like a conversation that needs to be turned into an issue on roslaunch to support passing these kinds of parameters properly.

@ivanpauno
Copy link
Member

Won't

launch_ros.actions.Node(arguments=['--ros-args', '__log_disable_rosout:=true'])

result in an extraneous --ros-args at the end, like

--ros-args __log_disable_rosout:=true --ros-args

Yes, that will exactly happen, and it's handled correctly in rcl. See ros2/rcl#477.

This seems like a conversation that needs to be turned into an issue on roslaunch to support passing these kinds of parameters properly.

FYI ros2/launch_ros#60.

@dirk-thomas
Copy link
Member

dirk-thomas commented Nov 5, 2019

@richiware @JaimeMartin Another report about this problem: ros2/rcl_interfaces#86 Has there been any progress on resolving this issue?

@JaimeMartin
Copy link
Contributor

@dirk-thomas this issue was not prioritized sorry. I can assign resources to this as soon we finish the release of Fast RTPS for eloquent, scheduled for next week.

@MiguelCompany
Copy link
Collaborator

@dirk-thomas I just created PR #342 which should solve this. Merry Xmas! 🎅

@wjwwood
Copy link
Member

wjwwood commented Jan 9, 2020

Pending merge of #342, and also a patch to remove a workaround/TODO in the ros1 bridge.

hidmic added a commit to ros2/ros1_bridge that referenced this issue Jan 9, 2020
Signed-off-by: Michel Hidalgo <michel@ekumenlabs.com>
@hidmic
Copy link
Contributor

hidmic commented Jan 15, 2020

I rushed to merge #342...


As you may see on ros2/ros1_bridge#233, tests are passing for rmw_fastrtps_cpp but failing for rmw_fastrtps_dynamic_cpp. Both test output and backtraces show the problem remains to be the same that was reported in #265, caused by typesupport data being always the first one that got into the per participant type registers that Fast-RTPS keeps.

So I took a deeper look at this patch, in the context of all rmw_fastrtps_*_cpp packages' implementation (which I should've done before merging). #342 does not correct the fundamental issue described above. Instead it piggybacks payloads with pointers to typesupport data. That seems to work for rmw_fastrtps_cpp, as most serialization/deserialization work is carried out by callbacks in said typesupport data. It's actually broken if messages are bounded in size (as you may get the bounded size for another typesupport). For rmw_fastrtps_dynamic_cpp, the patch simply doesn't work, as larger portions of the serialization/deserialization code depend on the typesupport that was actually registered (as you may see here).


I'm sorry for not providing this feedback earlier in the review process (and definitely before merging). Moreover, we did not see this issue in CI because rmw_fastrtps_dynamic_cpp was not included in the run. @MiguelCompany I don't know what a proper solution for this problem would look like, but I'm under the impression #342 is not it. Leaving bad typesupport data laying around feels wrong.

@MiguelCompany
Copy link
Collaborator

It's actually broken if messages are bounded in size (as you may get the bounded size for another typesupport)

Both typesupports should return the same max_serialized_size in this case, so I don't think the patch is broken for that use case.

So for rmw_fastrtps_cpp we are covered, as the same object type was being created for both typesupport. The only difference was the callbacks passed in the constructor, which are now passed on every call.

For rmw_fastrtps_dynamic_cpp the thing is different, I see it now. A different object type (MessageTypeSupport_c or MessageTypeSupport_cpp) is created when registering the first time, and that is the reason for the seg_faults. So we have to find a way to get to the same approach: using a single instance capable of handling different introspection structures for the same type

@hidmic
Copy link
Contributor

hidmic commented Jan 27, 2020

Both typesupports should return the same max_serialized_size in this case, so I don't think the patch is broken for that use case.

I see. It's still slightly odd to ask for the serialization size to the "wrong" type support, but OK.

A different object type (MessageTypeSupport_c or MessageTypeSupport_cpp) is created when registering the first time, and that is the reason for the seg_faults.

Precisely, yes.

So we have to find a way to get to the same approach: using a single instance capable of handling different introspection structures for the same type

That sounds reasonable to me. If I may, I think it'd also be desirable to not have multiple apparent ways to get typesupport (e.g. registering typesupport with a domain but not really using it later because there's no way to tell if it's the right one).

@MiguelCompany do you think you can provide a patch in the near future?

@MiguelCompany
Copy link
Collaborator

@hidmic I have it planned for next week.

@ivanpauno
Copy link
Member

@MiguelCompany any news?

@MiguelCompany
Copy link
Collaborator

Sorry for the delay. Last three weeks I've been busy with non-coding tasks. I have just open #350, which should fix this issue on rmw_fastrtps_dynamic_cpp.

@claireyywang claireyywang added help wanted Extra attention is needed in progress Actively being worked on (Kanban column) labels Feb 26, 2020
dirk-thomas pushed a commit to ros2/ros1_bridge that referenced this issue Feb 28, 2020
Signed-off-by: Michel Hidalgo <michel@ekumenlabs.com>
@swaroophs
Copy link

Hello folks,

I am seeing a similar issue where I have a node subscribing to rosout exiting ungracefully with a realloc (): invalid pointer message. Could someone confirm if this change has been ported to eloquent and works there? I am suspecting it is something similar since if I launch my node with arguments=['__log_disable_rosout:=true'], everything seems to work as expected.

Please let me know if you need any more information.

Best,
Swarooph

@dirk-thomas
Copy link
Member

Could someone confirm if this change has been ported to eloquent and works there?

@mjcarroll Can you please comment about the status of releasing #342 into Eloquent. It seems it has been marked for backporting since January?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working help wanted Extra attention is needed in progress Actively being worked on (Kanban column)
Projects
None yet