
node topic statistics broken #1626

Open
kikass13 opened this issue Feb 22, 2019 · 5 comments

@kikass13

Hello,

after we tried testing the contents of the /statistics topic via 'rosparam set enable_statistics true' on our robot, our host's CPU usage went nuts.

After that I tried to debug the roscpp statistics class ...

  1. Original Kinetic statistics analysis and bugfix
  • I put some chrono timers in the callback function of the statistics class, so that I can visualize them later (a rough sketch of this instrumentation follows after this list)
  • I started a cpp node with 20 publishers (200 Hz TX frequency for each one)
  • I started a cpp node with 20 subscribers
  • The x axis in the graph is the element index, the y axis is the processing time of the callback (in us)
  • The red circles mark the points where the callback function actually calculates stuff and publishes its contents onto the /statistics topic
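For reference, a minimal sketch of the kind of chrono timing instrumentation I mean (illustrative names and structure, not the actual roscpp code):

```cpp
// Minimal sketch of the timing instrumentation, not the actual roscpp code:
// wrap the statistics callback body in std::chrono timers and store the
// elapsed microseconds for later plotting.
#include <chrono>
#include <vector>

std::vector<long long> callback_durations_us;  // one entry per processed message

void timedStatisticsCallback(/* message, connection info, ... */)
{
  const auto start = std::chrono::steady_clock::now();

  // ... original statistics bookkeeping and (occasional) publishing ...

  const auto stop = std::chrono::steady_clock::now();
  callback_durations_us.push_back(
      std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count());
}
```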

[image: per-message processing time of the original statistics callback]

As can be seen in the image, the spikes occur when the class calculates and publishes its contents. IMPORTANT: the time window between publishes is doubled over time (halving the publish frequency each time).

I found the problem here (Kinetic and Melodic):

if (stats.last_publish + ros::Duration(pub_frequency_) < received_time)

This line has to be changed to this:
if (stats_it->second.last_publish + ros::Duration(1/pub_frequency_) < received_time)
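
For context, here is a minimal sketch of the intended check (the variable names follow the snippets above; pub_frequency_ is assumed to hold the publish frequency in Hz, so the period between publishes is its inverse):

```cpp
// Minimal sketch, not the full roscpp code: publish at most once per period.
// pub_frequency_ is a frequency in Hz, so the period between publishes is
// 1.0 / pub_frequency_; passing the frequency directly to ros::Duration
// treats Hz as seconds.
#include <ros/ros.h>

struct StatData
{
  ros::Time last_publish;
  // ... accumulated window statistics ...
};

void maybePublish(StatData& stats, const ros::Time& received_time, double pub_frequency_)
{
  if (stats.last_publish + ros::Duration(1.0 / pub_frequency_) < received_time)
  {
    // ... build and publish the statistics message ...
    stats.last_publish = received_time;
  }
}
```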

By doing that, we get a much more stable (and much less resource-heavy) frequency in the timing analysis:

[image: per-message processing time of the statistics callback after the fix]

  2. Performance improvements

I didn't like the performance and the way this std::map<std::string, struct StatData> map_ object was handled inside the callback ...
so I tried to tweak some things by only using the original iterator and directly changing the data structures, instead of doing multiple (slow, binary-tree) map lookups of the topic names. (I guess a std::unordered_map is not suitable because there are some code parts where the order of the elements inside the map matters.)
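
A rough sketch of the idea, with simplified data and names rather than the actual roscpp implementation: resolve the topic once and keep the resulting iterator/reference instead of repeating the lookup.

```cpp
// Minimal sketch: one O(log n) lookup per callback instead of several
// repeated lookups by topic name.
#include <cstdint>
#include <map>
#include <string>
#include <ros/ros.h>

struct StatData
{
  ros::Time last_publish;
  uint64_t dropped_msgs = 0;
  // ... further per-connection statistics ...
};

std::map<std::string, StatData> map_;

void callback(const std::string& topic, const ros::Time& received_time)
{
  // Insert a default entry if the topic is new; otherwise reuse the existing one.
  auto stats_it = map_.emplace(topic, StatData()).first;
  StatData& stats = stats_it->second;

  // Every further access goes through the cached iterator/reference,
  // not through repeated map_[topic] or map_.find(topic) calls, e.g.:
  // stats.dropped_msgs += ...;
  // if (stats.last_publish + ros::Duration(1.0 / pub_frequency_) < received_time) { ... }
  (void)received_time;  // silence unused-parameter warnings in this sketch
  (void)stats;
}
```

The point is simply that each extra access by topic name costs another tree traversal, which adds up in a callback that runs for every received message.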

The results, after messing around with the code and changing each access to the map to its original iterator, can be seen here:

[image: per-message processing time of the statistics callback after the iterator-based rework]

medianTime = 3 us
which means my code performs around 12 times faster on my machine (Intel i5, 2.2 GHz, publishers and subscribers on localhost [loopback])


I can provide a merge request if needed, but I first want to get some feedback on this issue ...
I really don't want to deploy my own fork of roscpp, so I tried providing my results here.

Maybe there's something wrong with our system or I made a mistake, so please lecture me if I did a boo boo :) ...

Thanks.

Nick Fiege
StreetScooter GmbH
Autonomous Logistics

@wentz89
Contributor

wentz89 commented Feb 22, 2019

I experienced the same behavior (#1621 - got closed). It would be nice if it gets fixed.

@kikass13
Author

#1621 thanks, we found the same bug at the same time, lel

@dirk-thomas
Member

@kikass13 Please try the patch proposed in #1625 and comment if it addresses the problem for you.

@kikass13
Author

I proposed the same code change in #1627 as described in #1625. It would work, but that is not enough; the bigger picture is a performance issue as well (we are talking about a function that is called EVERY time a message is received).

I tested it, and as described in the images above, it will fix the 'CPU leaking' (ultra-high resource hunger when used in a busy, topic-heavy environment).

I understand that these are two separate things, I just want to get a better picture of the necessary steps towards this feature.

@wentz89
Contributor

wentz89 commented Feb 26, 2019

I also investigated a bit more. So for my test:

  • start a cpp node (100 Hz publish frequency) with 20 publishers (a rough sketch of such a node is shown below)
  • start a cpp node (100 Hz frequency) with 20 subscribers (to the publishers)
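
A rough sketch of such a publisher node (topic names and message type are assumptions; the actual test nodes may differ):

```cpp
// Minimal sketch of a test node with 20 publishers at 100 Hz.
// Topic names and the std_msgs/String payload are assumptions for illustration.
#include <ros/ros.h>
#include <std_msgs/String.h>
#include <string>
#include <vector>

int main(int argc, char** argv)
{
  ros::init(argc, argv, "pub20");
  ros::NodeHandle nh;

  std::vector<ros::Publisher> pubs;
  for (int i = 0; i < 20; ++i)
    pubs.push_back(nh.advertise<std_msgs::String>("topic_" + std::to_string(i), 10));

  std_msgs::String msg;
  msg.data = "test";

  ros::Rate rate(100);  // 100 Hz as in the test setup
  while (ros::ok())
  {
    for (auto& pub : pubs)
      pub.publish(msg);
    ros::spinOnce();
    rate.sleep();
  }
  return 0;
}
```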

Edit 1: This is without the patch proposed in #1625. I'm working on Ubuntu 16.04, kernel 4.15.0-45-generic, x86 64-bit.

I'm using another node to monitor process values. This information is collected with the psutil library.

I want to share the plots of the process information since they show that there is a problem.

The first plot is without statistics enabled:
[image: sub20 process metrics, statistics disabled]

The second plot is with statistics enabled:
[image: sub20 process metrics, statistics enabled]

One can see different process information of the subscriber node "/Sub20/" over time. These metrics are described in detail in the psutil documentation (https://psutil.readthedocs.io/en/latest/). Note that the plots have different scalings depending on the actual values.

I just want to point out the difference in CPU usage - from around 6-10 % without statistics enabled up to 10-90 % with statistics enabled - and an increase in physical memory usage over time with statistics enabled (see memory_info_rss). Surely not much, but still visible.

CPU usage is in the top-left corner; physical memory usage is in the middle on the right.

Edit 2:
I recognized a pattern in the high CPU usage, io_counters_read/write, and memory usage in the second plot, which suggests a correlation with the times a message is published on the /statistics topic.

The following plot shows time on the x axis and the number of evaluated messages per statistics message on the y axis. The topic has a high frequency (100 Hz), and one can see the effect of the dynamic time window, in combination with the wrong use of the frequency value, leading to only 4 statistics messages in 120 s.
[image: with_stat_test100 - evaluated messages per statistics message over time]

There is a correlation between the time a statistics message was sent and:

  • a drop in CPU usage
  • a jump in io_counters_read/write
  • a halt in the growth of physical memory usage - which later resumes

Edit 3: Checking out another setup, I changed the parameter /statistics_window_max_size to 10 seconds, and this somehow fixes the high CPU usage and the increase in memory usage. There are still some differences compared to the plot without statistics enabled, but nothing looks critical to me.

What looks like more of a problem to me is that there are still connections with only a few or no messages on /statistics. This might be caused by the high frequency and the large number (20) of subscribers in one node.
