
node topic statistics broken #1626

Open
kikass13 opened this issue Feb 22, 2019 · 5 comments

@kikass13

Hello,

after we tried testing the contents of the /statistics topic via 'rosparam set enable_statistics true' on our robot, our host's CPU usage went nuts.

After that I tried to debug the roscpp statistics class ...

  1. Original Kinetic statistics analysis and bugfix
  • I put some chrono timers in the callback function of the statistics class, so that I can visualize them later (a rough sketch of this instrumentation follows after this list)
  • I started a cpp node with 20 publishers (200 Hz TX frequency for each one)
  • I started a cpp node with 20 subscribers
  • The x axis in the graph is the element index, the y axis is the processing time of the callback (in us)
  • The red circles mark the points where the callback function actually calculates stuff and publishes its contents onto the /statistics topic
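For reference, a minimal sketch of the kind of chrono timing instrumentation I mean (illustrative names and structure, not the actual roscpp code):

```cpp
// Minimal sketch of the timing instrumentation, not the actual roscpp code:
// wrap the statistics callback body in std::chrono timers and store the
// elapsed microseconds for later plotting.
#include <chrono>
#include <vector>

std::vector<long long> callback_durations_us;  // one entry per processed message

void timedStatisticsCallback(/* message, connection info, ... */)
{
  const auto start = std::chrono::steady_clock::now();

  // ... original statistics bookkeeping and (occasional) publishing ...

  const auto stop = std::chrono::steady_clock::now();
  callback_durations_us.push_back(
      std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count());
}
```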

[image: per-message processing time of the original statistics callback]

As can be seen in the image, the spikes occur when the class calculates and publishes its contents. IMPORTANT: the time window between publishes is doubled over time (halving the publish frequency each time).

I found the problem here (Kinetic and Melodic):

if (stats.last_publish + ros::Duration(pub_frequency_) < received_time)

This line has to be changed to this:
if (stats_it->second.last_publish + ros::Duration(1/pub_frequency_) < received_time)
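
For context, here is a minimal sketch of the intended check (the variable names follow the snippets above; pub_frequency_ is assumed to hold the publish frequency in Hz, so the period between publishes is its inverse):

```cpp
// Minimal sketch, not the full roscpp code: publish at most once per period.
// pub_frequency_ is a frequency in Hz, so the period between publishes is
// 1.0 / pub_frequency_; passing the frequency directly to ros::Duration
// treats Hz as seconds.
#include <ros/ros.h>

struct StatData
{
  ros::Time last_publish;
  // ... accumulated window statistics ...
};

void maybePublish(StatData& stats, const ros::Time& received_time, double pub_frequency_)
{
  if (stats.last_publish + ros::Duration(1.0 / pub_frequency_) < received_time)
  {
    // ... build and publish the statistics message ...
    stats.last_publish = received_time;
  }
}
```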

By doing that, we get a much more stable (and much less resource-heavy) frequency in the timing analysis:

[image: per-message processing time of the statistics callback after the fix]

  2. Performance improvements

I didn't like the performance and the way this std::map<std::string, struct StatData> map_ object was handled inside the callback ...
so I tried to tweak some things by only using the original iterator and directly changing the data structures, instead of doing multiple (slow, binary-tree) map lookups of the topic names. (I guess a std::unordered_map is not suitable because there are some code parts where the order of the elements inside the map matters.)
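
A rough sketch of the idea, with simplified data and names rather than the actual roscpp implementation: resolve the topic once and keep the resulting iterator/reference instead of repeating the lookup.

```cpp
// Minimal sketch: one O(log n) lookup per callback instead of several
// repeated lookups by topic name.
#include <cstdint>
#include <map>
#include <string>
#include <ros/ros.h>

struct StatData
{
  ros::Time last_publish;
  uint64_t dropped_msgs = 0;
  // ... further per-connection statistics ...
};

std::map<std::string, StatData> map_;

void callback(const std::string& topic, const ros::Time& received_time)
{
  // Insert a default entry if the topic is new; otherwise reuse the existing one.
  auto stats_it = map_.emplace(topic, StatData()).first;
  StatData& stats = stats_it->second;

  // Every further access goes through the cached iterator/reference,
  // not through repeated map_[topic] or map_.find(topic) calls, e.g.:
  // stats.dropped_msgs += ...;
  // if (stats.last_publish + ros::Duration(1.0 / pub_frequency_) < received_time) { ... }
  (void)received_time;  // silence unused-parameter warnings in this sketch
  (void)stats;
}
```

The point is simply that each extra access by topic name costs another tree traversal, which adds up in a callback that runs for every received message.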

The results, after messing around with the code and changing each access to the map to its original iterator, can be seen here:

[image: per-message processing time of the statistics callback after the iterator-based rework]

medianTime = 3 us
which means my code performs around 12 times faster on my machine (Intel i5, 2.2 GHz, publishers and subscribers on localhost [loopback])


I can provide a merge request if needed, but I first want to get some feedback on this issue ...
I really don't want to deploy my own fork of roscpp, so I tried providing my results here.

Maybe there's something wrong with our system or I made a mistake, so please lecture me if I did a boo boo :) ...

Thanks.

Nick Fiege
StreetScooter GmbH
Autonomous Logistics

@wentz89
Contributor

wentz89 commented Feb 22, 2019

I experienced the same behavior (#1621 - got closed). It would be nice if it gets fixed.

@kikass13
Author

#1621 thanks, we found the same bug at the same time, lel

@dirk-thomas
Member

@kikass13 Please try the patch proposed in #1625 and comment if it addresses the problem for you.

@kikass13
Author

I proposed the same code change in #1627 as described in #1625. It would work, but that is not enough; the bigger picture is a performance issue as well (we are talking about a function that is called EVERY time a message is received).

I tested it, and as described in the images above, it will fix the 'CPU leaking' (ultra-high resource hunger when used in a busy, topic-heavy environment).

I understand that these are two separate things, I just want to get a better picture of the necessary steps towards this feature.

@wentz89
Contributor

wentz89 commented Feb 26, 2019

I also investigated a bit more. So for my test:

  • start a cpp node (100 Hz publish frequency) with 20 publishers (a rough sketch of such a node is shown below)
  • start a cpp node (100 Hz frequency) with 20 subscribers (to the publishers)
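
A rough sketch of such a publisher node (topic names and message type are assumptions; the actual test nodes may differ):

```cpp
// Minimal sketch of a test node with 20 publishers at 100 Hz.
// Topic names and the std_msgs/String payload are assumptions for illustration.
#include <ros/ros.h>
#include <std_msgs/String.h>
#include <string>
#include <vector>

int main(int argc, char** argv)
{
  ros::init(argc, argv, "pub20");
  ros::NodeHandle nh;

  std::vector<ros::Publisher> pubs;
  for (int i = 0; i < 20; ++i)
    pubs.push_back(nh.advertise<std_msgs::String>("topic_" + std::to_string(i), 10));

  std_msgs::String msg;
  msg.data = "test";

  ros::Rate rate(100);  // 100 Hz as in the test setup
  while (ros::ok())
  {
    for (auto& pub : pubs)
      pub.publish(msg);
    ros::spinOnce();
    rate.sleep();
  }
  return 0;
}
```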

Edit 1: This is without the patch proposed in #1625. I'm working on Ubuntu 16.04, kernel 4.15.0-45-generic, x86 64-bit.

I'm using another node to monitor process values. This information is collected with the psutil library.

I want to share the plots of the process information since they show that there is a problem.

The first plot is without statistics enabled:
[image: sub20 process metrics, statistics disabled]

The second plot is with statistics enabled:
[image: sub20 process metrics, statistics enabled]

One can see different process information of the subscriber node "/Sub20/" over time. These metrics are described in detail in the psutil documentation (https://psutil.readthedocs.io/en/latest/). Note that the plots have different scalings depending on the actual values.

I just want to point out the difference in CPU usage - from around 6-10 % without statistics enabled up to 10-90 % with statistics enabled - and an increase in physical memory usage over time with statistics enabled (see memory_info_rss). Surely not much, but still visible.

CPU usage is in the top-left corner; physical memory usage is in the middle on the right.

Edit 2:
I recognized a pattern in the high CPU usage, io_counters_read/write, and memory usage in the second plot, which suggests a correlation with the times a message is published on the /statistics topic.

The following plot shows time on the x axis and the number of evaluated messages per statistics message on the y axis. The topic has a high frequency (100 Hz), and one can see the effect of the dynamic time window, in combination with the wrong use of the frequency value, leading to only 4 statistics messages in 120 s.
[image: with_stat_test100 - evaluated messages per statistics message over time]

There is a correlation between the time a statistics message was sent and:

  • a drop in CPU usage
  • a jump in io_counters_read/write
  • a halt in the growth of physical memory usage - which later resumes

Edit 3: Checking out another setup, I changed the parameter /statistics_window_max_size to 10 seconds, and this somehow fixes the high CPU usage and the increase in memory usage. There are still some differences compared to the plot without statistics enabled, but nothing looks critical to me.

What looks like more of a problem to me is that there are still connections with only a few or no messages on /statistics. This might be caused by the high frequency and the large number (20) of subscribers in one node.
