Missing checks for rmw handle in rclpy_create_publisher #826

squizz617 · 2021-09-22T20:12:29Z

Required Info:

Operating System: Ubuntu 20.04
Installation type: Source install
Version or commit hash: foxy
DDS implementation: Fast-RTPS
Client library (if applicable): rclpy

Feature request

Hi, this issue is in the gray area between bug report and feature request.

When an application creates a publisher through rclcpp, it invokes the following rcl APIs:

rcl_get_zero_initialized_publisher to get a handle,
rcl_publisher_init to initialize the publisher,
rcl_publisher_get_rmw_handle followed by a NULL check to make sure there really is rmw_handle.

However, the last check is missing in rclpy_create_publisher of rclpy. In other words, rclpy may think it successfully created a publisher even when rmw_handle is asynchronously set to NULL.
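
The three-step sequence above, including the check that rclpy skips, can be modeled in a few lines of Python. This is only a sketch under stated assumptions: `FakePublisher`, `fake_publisher_init`, `racy_init` and `create_publisher_checked` are hypothetical stand-ins, not the rcl or rclpy API, and the real check would live in the C code of rclpy_create_publisher.

```python
from typing import Callable, Optional


class FakePublisher:
    """Hypothetical stand-in for rcl_publisher_t; not the real rcl type."""

    def __init__(self) -> None:
        # Mirrors a zero-initialized publisher: no rmw handle yet.
        self.rmw_handle: Optional[object] = None


def fake_publisher_init(publisher: FakePublisher) -> None:
    """Stand-in for rcl_publisher_init: attaches an rmw handle."""
    publisher.rmw_handle = object()


def racy_init(publisher: FakePublisher) -> None:
    """Simulates the handle being cleared right after a successful init."""
    fake_publisher_init(publisher)
    publisher.rmw_handle = None


def create_publisher_checked(
    init: Callable[[FakePublisher], None] = fake_publisher_init,
) -> FakePublisher:
    """Model of the sequence: zero-initialize, init, then verify the handle."""
    publisher = FakePublisher()       # step 1: zero-initialized publisher
    init(publisher)                   # step 2: initialize the publisher
    if publisher.rmw_handle is None:  # step 3: the check rclpy lacks
        raise RuntimeError("publisher has no rmw handle after init")
    return publisher
```

Calling `create_publisher_checked(racy_init)` raises at creation time instead of handing a broken publisher back to the caller.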

I did find cases when this becomes problematic. Consider an rclpy application that creates a publisher and publishes messages. For example:

After a call to rcl_publisher_init, if for any reason publisher->impl->rmw_handle becomes NULL, that error goes undetected in rclpy_create_publisher.
Then, rclpy_create_publisher returns publisher_capsule back to node.py.
The returned capsule is used for creating a rclpy.publisher.Publisher instance.
During the initialization of the Publisher instance, a QoSEventHandler object is created, which internally calls _rclpy.rclpy_create_event (in rclpy qos_event.py) -> rcl_publisher_event_init (in rclpy _rclpy_qos_event.c) -> rmw_publisher_event_init (in rcl event.c) -> rmw_publisher_event_init (rmw_implementation) -> rmw_publisher_event_init (rmw_fastrtps). There, it segfaults when dereferencing a null pointer.

Core dump:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f7781a45090 in rmw_publisher_event_init () from /home/seulbae/workspace/ros2_foxy/install/rmw_fastrtps_cpp/lib/librmw_fastrtps_cpp.so 
---
RAX  0x0
RBP  0x7ffdd13adb00 —▸ 0x7ffdd13adb40 —▸ 0x7ffdd13adbc0 —▸ 0x7ffdd13adc40 —▸ 0x7f778264e400 ◂— ...
RSP  0x7ffdd13adae0 —▸ 0x7ffdd13adb20 ◂— 0x0
RIP  0x7f7781a45090 (rmw_publisher_event_init+27) ◂— 0xf0458b4808488b48
---
 ► 0x7f7781a45090 <rmw_publisher_event_init+27>    mov    rcx, qword ptr [rax + 8] 
---
pwndbg> bt
#0  0x00007f7781a45090 in rmw_publisher_event_init () from /home/seulbae/workspace/ros2_foxy/install/rmw_fastrtps_cpp/lib/librmw_fastrtps_cpp.so
#1  0x00007f7782034a62 in rmw_publisher_event_init () from /home/seulbae/workspace/ros2_foxy/install/rmw_implementation/lib/librmw_implementation.so
#2  0x00007f7782051a44 in rcl_publisher_event_init () from /home/seulbae/workspace/ros2_foxy/install/rcl/lib/librcl.so
#3  0x00007f7783dac5f7 in rclpy_create_event () from /home/seulbae/workspace/ros2_foxy/install/rclpy/lib/python3.8/site-packages/rclpy/_rclpy.cpython-38-x86_64-linux-gnu.so

The effect of the missing handle only shows up at a location far away from the fault site, which makes debugging tricky.
Such an issue could have been prevented by sanity-checking the result of rcl_publisher_get_rmw_handle, as rclcpp does. An rclcpp application does not suffer from the same issue, because the missing rmw handle is caught right away.

P.S. The documentation for rcl states the following about rcl_publisher_get_rmw_handle:

The returned handle is made invalid if the publisher is finalized or if rcl_shutdown() is called. The returned handle is not guaranteed to be valid for the life time of the publisher as it may be finalized and recreated itself. Therefore it is recommended to get the handle from the publisher using this function each time it is needed and avoid use of the handle concurrently with functions that might change it.

Any thoughts?
Thanks!

fujitatomoya · 2021-11-05T23:55:27Z

could you elaborate a bit? i am not sure i understand correctly. it seems like you faced an actual problem? if so, could you provide reproducible sample code that causes this problem?

are you suggesting that we should add NULL check for publisher_impl?

rclpy/rclpy/rclpy/publisher.py, lines 57 to 58 in f2cb25b:

        self.event_handlers: QoSEventHandler = event_callbacks.create_event_handlers(
            callback_group, publisher_impl, topic)

squizz617 · 2021-11-08T21:14:52Z

Hi @fujitatomoya , thanks for the reply.

I'm suggesting that rclpy/_rclpy.c::rclpy_create_publisher should perform a NULL check on the publisher's rmw handle, as rcl suggests and as the rclcpp publisher already does.

The core dump I posted above is from a racy environment, in which the rmw handle was overwritten by NULL before rclpy_create_publisher returned.

fujitatomoya · 2021-11-08T21:51:09Z

NULL check would be okay to add, but is that really enough to avoid the problem you are describing? even after the NULL check, i think there is still a race condition: in a multi-threaded program the handle could become NULL afterwards. that is why it raises an exception to user space, if i am not mistaken.

squizz617 · 2021-11-08T22:02:28Z

Right, the race condition won't necessarily be circumvented even with the NULL check. However the point is, with the NULL check added to the right spot, the race will be called out immediately and the debugging will get much easier as we won't need to back-trace all the way from the place where the null ptr is dereferenced, which can be pretty far from the fault site.
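
As a rough illustration of this point (a sketch only; `FakePublisher`, `create_publisher`, `init_event` and `where_it_fails` are hypothetical stand-ins, not rclpy code): a check at creation time reports a handle that is already NULL right there, while a handle cleared after the check still fails later, at the point of use.

```python
from typing import Optional


class FakePublisher:
    """Hypothetical stand-in for a publisher holding an rmw handle."""

    def __init__(self, rmw_handle: Optional[object]) -> None:
        self.rmw_handle = rmw_handle


def create_publisher(rmw_handle: Optional[object]) -> FakePublisher:
    """Creation with the proposed check: fails at the fault site."""
    publisher = FakePublisher(rmw_handle)
    if publisher.rmw_handle is None:
        raise RuntimeError("create_publisher: rmw handle is NULL")
    return publisher


def init_event(publisher: FakePublisher) -> str:
    """Later use of the handle; in the C code this is where the segfault hit."""
    if publisher.rmw_handle is None:
        raise RuntimeError("init_event: rmw handle is NULL")
    return "event initialized"


def where_it_fails(handle_at_create: Optional[object],
                   clear_after_create: bool) -> str:
    """Return the name of the step that reports the NULL handle, or 'none'."""
    try:
        publisher = create_publisher(handle_at_create)
        if clear_after_create:
            publisher.rmw_handle = None  # race window after the check
        init_event(publisher)
    except RuntimeError as err:
        return str(err).split(":")[0]
    return "none"
```

Here `where_it_fails(None, False)` names `create_publisher`, whereas a handle cleared after the check (`where_it_fails(object(), True)`) is still only noticed in `init_event`, which is the residual race mentioned above.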

fujitatomoya · 2021-11-08T22:06:05Z

I am okay to add NULL check, would you mind considering PR?

squizz617 · 2021-11-08T22:16:41Z

Sure thing. Will open a PR and let you know! Thanks.

squizz617 · 2021-11-18T22:19:53Z

I opened PR #851 for this.

clalancette assigned sloretz Oct 18, 2021

fujitatomoya added the help wanted and more-information-needed labels Nov 5, 2021

squizz617 mentioned this issue Nov 18, 2021

check if publisher's rmw handle is NULL #851

Closed
