Talwai/agent dev mode #1577

talwai · 2015-04-24T16:03:17Z

This adds a "developer mode" to the agent, which can be enabled with the --profile command-line flag or with the developer_mode setting in agentConfig.

Developer mode can be enabled at the check level or the collector level. When enabled at the check level, e.g. ./agent.py check nginx --profile, the check.run() function is profiled with cProfile and the pstats output is dumped to the log.
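As a rough illustration of the check-level behavior, the sketch below runs a callable under cProfile and sends the pstats summary to a logger. The helper name `run_profiled` and the stand-in `fake_check_run` are hypothetical and not taken from this PR, and the code is written for Python 3.

```python
import cProfile
import io
import logging
import pstats

log = logging.getLogger("dev_mode_sketch")


def run_profiled(func, *args, **kwargs):
    """Run func under cProfile and return (result, pstats report text).

    Hypothetical helper; the wrapper in the PR may look different.
    """
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args, **kwargs)
    finally:
        profiler.disable()

    # Render the stats into a string so they can be written to the log.
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats("cumulative").print_stats(10)
    report = buf.getvalue()
    log.debug("Profile of %s:\n%s", getattr(func, "__name__", func), report)
    return result, report


def fake_check_run():
    # Stand-in for check.run(); does a little work so the profile is non-empty.
    return sum(i * i for i in range(1000))
```

Calling `run_profiled(fake_check_run)` returns the check's result together with the report text, so the caller can decide where the report should go.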

When enabled at the collector level, developer mode does the following:

- Every run of the collector loop is profiled with cProfile, in the same way as above.
- The AgentMetrics check (checks.d/agent_metrics.py) is run at the end of each collector loop. This check can be configured to run a set of additional psutil.Process methods on the current process. These additional metrics are flushed to the dispatcher as well as dumped to the log. See conf.d/agent_metrics.yaml.default for the default configuration, which can be extended as needed. Currently, a setting under process_metrics is ignored if there is no corresponding psutil.Process method.
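The "ignored if there is no corresponding psutil.Process method" behavior can be pictured with a small lookup by name. This is only a sketch: `collect_process_metrics` and `FakeProcess` are made-up names, and `FakeProcess` stands in for psutil.Process so that the example runs without psutil installed.

```python
import logging

log = logging.getLogger("dev_mode_sketch")


def collect_process_metrics(proc, method_names):
    """Call each named zero-argument method on proc, skipping unknown names."""
    results = {}
    for name in method_names:
        method = getattr(proc, name, None)
        if not callable(method):
            # Mirrors the described behavior: unknown settings are ignored.
            log.debug("Skipping %r: no such method on %s",
                      name, type(proc).__name__)
            continue
        results[name] = method()
    return results


class FakeProcess(object):
    """Stand-in for psutil.Process (assumption, for illustration only)."""

    def num_threads(self):
        return 4

    def cpu_percent(self):
        return 1.5
```

With psutil installed, the same function could presumably be called as `collect_process_metrics(psutil.Process(), ["num_threads", "memory_info"])`.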

degemer · 2015-04-24T20:08:44Z

Hey @talwai, I tested it a little and have some remarks (not about the code). You have probably already discussed this with @LeoCavaille, but I'm going to add my 2 cents.

I was expecting a lot more "by check" metrics, i.e. at the end of every run, display some stats for each check (and not only globally), for instance run time, memory consumption before/after, and IO read/write (basically the metrics you're already displaying for each run, maybe tagged by check if you want to upload them to DD).
Also, about these metrics, could you display them in a nicer way? Sure, once they're uploaded to DD it should be really easy to analyze them, but a readable display in the logs would be a real plus.
Actually, I was thinking of something like the info display. 😄 (collector being the "global" metrics)

consul:
  run_time: 1.01s
  memory_before: 60785934
  memory_after: 60785935
  io read: 2
  ...
pgbouncer:
  run_time: 6.61s
  memory_before: 60785834
  memory_after: 60785840
  io read: 255
  ...
collector:
  run_time: 15.04s
  memory_before: 60785834
  memory_after: 60786001
  io read: 256
  ...

If you do something like this, I think the profiler should be a separate option, not activated by default with "developer mode", because we don't always need this level of detail when debugging a check. And maybe also a per-check profiler? (Not sure it's needed, just a suggestion.)

Another thing: I don't think you should put agent_metrics.py in checks.d, because when you do this it becomes visible in

initialized checks.d checks: ['agent_metrics', 'network', 'pgbouncer', 'ntp', 'consul', 'nginx']

and a standard user shouldn't see this (even if it doesn't do anything).

The new agent_metrics are really nice! 👍 I think you could just prefix them with datadog.agent.collector instead of datadog.agent. (And as you wrote in your TODOs, some are probably rates and not gauges.)

I just realized that we already have some agent_metrics here: https://github.com/DataDog/dd-agent/blob/master/checks/agent_metrics.py. You might want to put yours there too (or maybe you split them on purpose?).

talwai · 2015-04-24T21:06:20Z

@degemer Thanks for the feedback!

> I was expecting a lot more "by check" metrics, i.e. at the end of every run, display some stats for each check (and not only globally), for instance run time, memory consumption after/before, io read/write

I agree that we should be getting more data about the performance of individual checks, and not just the collector run as a whole. I can extend wrap_profiling to include some before and after psutil calls. I do wonder, though, whether the extra system calls would hinder the performance of the check itself, i.e. check.run() becomes psutil.get_memory_info(); check.run(); psutil.get_memory_info() for every check.
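A minimal sketch of the before/after idea follows, assuming Python 3. The names `memory_snapshot` and `run_with_stats` are hypothetical. The sketch uses `psutil.Process().memory_info()`, the spelling in current psutil releases (the comment above uses the older `get_memory_info` name), and falls back to an empty snapshot when psutil is not installed.

```python
import time

try:
    import psutil
except ImportError:  # psutil is optional in this sketch
    psutil = None


def memory_snapshot():
    """Return RSS/VMS of the current process, or {} when psutil is missing."""
    if psutil is None:
        return {}
    info = psutil.Process().memory_info()
    return {"rss": info.rss, "vms": info.vms}


def run_with_stats(run, snapshot=memory_snapshot):
    """Call run() between two snapshots; report both plus the run time."""
    before = snapshot()
    start = time.monotonic()
    result = run()
    elapsed = time.monotonic() - start
    after = snapshot()
    stats = {"run_time": elapsed, "before": before, "after": after}
    return result, stats
```

Making the snapshot function a parameter keeps the overhead question open to measurement: the same wrapper can be timed with a no-op snapshot and with the psutil-backed one.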

> Also, about these metrics, could you display them in a nicer way?

Yep, the output is ugly. I will improve the formatting of the log dumps.

> I think that the profiler should be another option, not activated by default with the "developer mode", because we don't always need this level of details when debugging a check. And maybe also a profiler by check ? (not sure it's needed, just a suggestion)

I get where you're coming from. In fact, I silenced the profiler dump myself when debugging the AgentMetrics check because it was mostly unhelpful. I wonder whether that is the case in general, though. It is easy enough to make them two separate options, but I would be interested in hearing other opinions.

> Another thing: I don't think you should put agent_metrics.py in checks.d, because when you do this it becomes visible in
> initialized checks.d checks: ['agent_metrics', 'network', 'pgbouncer', 'ntp', 'consul', 'nginx']

The reasons it's in there are:

1. It appears that that's where all the new-style checks go.
2. So that it can be tested via AgentCheckTest.

I realize that it is possible to test without the scaffolding, but my thought was that all tests for new-style checks should follow the same pattern. If the only issue with locating it in checks.d/ is the log output, then I would prefer to just special-case it so it doesn't show up. Admittedly there may be other issues with having it in checks.d/ that I'm not aware of.

> I just realized that we already have some agent_metrics here: https://github.com/DataDog/dd-agent/blob/master/checks/agent_metrics.py, you might want to put yours there too. (or maybe you split them on purpose ?)

Basically checks.d/agent_metrics.py is all of checks/agent_metrics.py and more. I opted against simply extending the class in place because it uses the old Check API, and if we plan to deprecate that API we should probably stop using it internally.

I will also fix the metric namespace and change gauge -> rate as appropriate.

LeoCavaille · 2015-04-28T19:39:08Z

utils/profile.py

@@ -0,0 +1,44 @@
+#3p
+import psutil


You probably don't need psutil here!?

LeoCavaille · 2015-04-28T19:59:23Z

Nice! Good to see progress; this will be super helpful in helping the community design more performant checks. I made a few remarks after a first look.
Two things that would be super cool to have:

- Rather than logging a dozen lines of pstats at every run (which might not be exactly what you want), could profiling be enabled at a higher level for a while, accumulating the pstats for the whole collector?
- Optionally, the ability to dump that to a file instead of the logs, so that pstats can be used for offline forensics later.

What do you think?

remh · 2015-04-28T20:05:31Z

checks/__init__.py

@@ -22,6 +22,7 @@

 # 3rd party
 import yaml
+import psutil


psutil is actually not available on a source install without a compiler :/
Can you catch the import error and disable developer mode accordingly?

talwai · 2015-04-28T21:50:24Z

@LeoCavaille thanks for the review! I think both your ideas are good, if I understand correctly. I will look into moving the profiler higher in the call stack, and dumping cumulative statistics after longer intervals, rather than on every collector run. This should give us more useful higher-level insight than the current setup does. I will also add an option for dumping pstats to a file instead of / in addition to logging.

talwai · 2015-04-30T15:26:35Z

The updated default functionality is as follows:

- agent.py developer mode: Enables more granular checks in checks.d/agent_metrics.py. These checks are flushed to Datadog and also dumped to log.info. A pstats profile is dumped to the file ./collector-stats.dmp after COLLECTOR_PROFILE_INTERVAL runs (both the file name and the interval are candidates for moving to agentConfig). Additionally, every check calls the static method AgentCheck._collect_stats before and after its run. This collects some psutil stats, the defaults of which are specified here: https://github.com/DataDog/dd-agent/blob/talwai/agent_dev_mode/checks/__init__.py#L37 (maybe also a candidate for living in agentConfig). These stats are prettified and dumped to log.info.
- check developer mode: Collects the pstats output of the check run and dumps it to log.debug. (Maybe it can dump to a file as well? Seems overkill.) Additionally, it calls AgentCheck._collect_stats() before and after its run and outputs the data in the agent.py info format, e.g.:

$ python agent.py check ntp --profile
2015-04-30 11:20:44,471 | INFO | dd.collector | config(config.py:922) | initialized checks.d checks: ['ntp', 'network']
2015-04-30 11:20:44,471 | INFO | dd.collector | config(config.py:923) | initialization failed checks.d checks: []
2015-04-30 11:20:44,471 | INFO | dd.collector | checks.collector(collector.py:435) | Running check ntp
Metrics:  [('ntp.offset', 1430407244, 0.02178668975830078, {'hostname': 'Aadityas-MacBook-Pro.local', 'type': 'gauge'})]
Events:  []
Service Checks:  [{'status': 0, 'tags': None, 'timestamp': 1430407244.5470452, 'check': 'ntp.in_sync', 'host_name': 'Aadityas-MacBook-Pro.local', 'message': None, 'id': 1}]
    ntp
    ---
      - instance #0 [OK] Last run duration: 0.0944309234619
      - Collected 1 metric, 0 events & 1 service check
      - Stats: 
            Memory Before (RSS): 18092032
            Memory After (RSS): 18108416
            Memory Before (VMS): 2534723584
            Memory After (VMS): 2534723584
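The interval-based profile dump described in this comment could be sketched roughly as follows. The function names are hypothetical, the interval value is an assumption (the thread does not state one), and the collector loop is reduced to a plain callable.

```python
import cProfile
import pstats

COLLECTOR_PROFILE_INTERVAL = 3  # assumed value; the thread does not give one


def profile_collector_runs(run_once, total_runs, dump_path,
                           interval=COLLECTOR_PROFILE_INTERVAL):
    """Profile run_once() continuously, dumping stats every `interval` runs.

    Returns the number of dumps written to dump_path.
    """
    profiler = cProfile.Profile()
    profiler.enable()
    profiled_runs = 0
    dumps = 0
    for _ in range(total_runs):
        run_once()
        profiled_runs += 1
        if profiled_runs >= interval:
            profiler.disable()
            profiler.dump_stats(dump_path)  # binary file readable by pstats
            dumps += 1
            # Start a fresh accumulation window.
            profiler = cProfile.Profile()
            profiler.enable()
            profiled_runs = 0
    profiler.disable()
    return dumps


def load_profile(dump_path):
    """Load a dumped profile for offline inspection."""
    return pstats.Stats(dump_path)
```

A dump written this way can be inspected later with, for example, `load_profile("collector-stats.dmp").sort_stats("cumulative").print_stats(10)`.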

remh · 2015-05-08T14:43:40Z

checks/__init__.py


 # 3rd party
 import yaml

+try:
+    import psutil
+    PSUTIL_PRESENT = True


You don't need this flag.
To test if psutil is here you can just do

if psutil is not None

e.g.

dd-agent/checks/system/win32.py

Lines 5 to 8 in 1bef95f

try:
    import psutil
except ImportError:
    psutil = None
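Building on the import guard quoted above, disabling developer mode when psutil is missing might look like the sketch below. `effective_developer_mode` is a hypothetical helper, not a function from the agent.

```python
import logging

try:
    import psutil
except ImportError:
    psutil = None

log = logging.getLogger("dev_mode_sketch")


def effective_developer_mode(requested, psutil_module=psutil):
    """Return whether developer mode should actually be switched on.

    The psutil_module parameter exists only to make the sketch testable.
    """
    if requested and psutil_module is None:
        log.warning("psutil is not installed; disabling developer mode")
        return False
    return bool(requested)
```

Checking `psutil is None` directly means no separate PSUTIL_PRESENT flag has to be kept in sync with the import.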

remh · 2015-05-08T17:27:44Z

Great work!
We should probably add the option in the datadog.conf by the way.

Also, does it work properly on Windows?

remh · 2015-05-12T18:09:11Z

agent.py

+                        profiler.disable_profiling()
+                        profiled = False
+                        collector_profiled_runs = 0
+                    except Exception:


Any idea what kind of exception can be raised here?
If you just want to play it safe and make sure that the profiler doesn't crash the agent (which is a good thing), can you log the raised exception, please?
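One way to keep the broad except while still recording what went wrong is sketched below. `safely_disable` is a hypothetical wrapper; only the `disable_profiling()` call comes from the diff above.

```python
import logging

log = logging.getLogger("dev_mode_sketch")


def safely_disable(profiler):
    """Disable profiling without letting a profiler error escape.

    Returns True on success, False if an exception was caught and logged.
    """
    try:
        profiler.disable_profiling()
        return True
    except Exception:
        # log.exception records the message plus the full traceback.
        log.exception("Unable to collect profiling output")
        return False
```

The agent keeps running either way, and the traceback in the log shows which exception types actually occur in practice.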

talwai · 2015-05-13T22:21:57Z

@remh This works on Windows.

remh · 2015-05-14T14:24:26Z

checks/__init__.py

@@ -564,6 +648,15 @@ def run(self):
                    tb=traceback.format_exc()
                )
            instance_statuses.append(instance_status)
+
+        if self.in_developer_mode and self.name != 'agent_metrics':


Nitpick, but you should use your AGENT_METRICS_CHECK_NAME constant.

remh · 2015-05-14T14:35:36Z

Looks great apart from the last comments!

It should be ready to merge once those are fixed.

remh · 2015-05-14T15:21:45Z

config.py

 LEGACY_DATADOG_URLS = [
    "app.datadoghq.com",
    "app.datad0g.com",
 ]

+#Checks whose log output to suppress, unless explicitly asked for
+HIDDEN_CHECKS = ['agent_metrics']


You should use the constant you defined here.
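Reusing the constant here could look like the following sketch. Where AGENT_METRICS_CHECK_NAME is actually defined is not shown in this thread, so defining it inline is an assumption, and `visible_checks` is a hypothetical helper.

```python
# Assumed to be defined once and imported wherever the name is needed.
AGENT_METRICS_CHECK_NAME = "agent_metrics"

# Checks whose log output to suppress, unless explicitly asked for
HIDDEN_CHECKS = [AGENT_METRICS_CHECK_NAME]


def visible_checks(check_names, hidden=HIDDEN_CHECKS):
    """Return the check names that should appear in log output."""
    return [name for name in check_names if name not in hidden]
```

With a single constant, renaming the check cannot leave a stale string literal behind in config.py or checks/__init__.py.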


coveralls · 2016-12-29T07:47:55Z

Changes Unknown when pulling 915ab28 on talwai/agent_dev_mode into ** on master**.

coveralls · 2016-12-29T13:21:20Z

Changes Unknown when pulling 915ab28 on talwai/agent_dev_mode into ** on master**.

coveralls · 2017-01-20T11:02:52Z

Changes Unknown when pulling 915ab28 on talwai/agent_dev_mode into ** on master**.

talwai force-pushed the talwai/agent_dev_mode branch from 2c3b4cc to 3122e10 Compare April 27, 2015 21:32

LeoCavaille reviewed Apr 28, 2015

remh reviewed Apr 28, 2015

talwai force-pushed the talwai/agent_dev_mode branch 3 times, most recently from 8febb1d to af59b19 Compare April 30, 2015 00:07

talwai force-pushed the talwai/agent_dev_mode branch 2 times, most recently from f3db44f to 726abef Compare May 7, 2015 02:44

remh reviewed May 8, 2015

talwai force-pushed the talwai/agent_dev_mode branch from 4b80cc1 to 4d51cb2 Compare May 12, 2015 17:06

remh reviewed May 12, 2015

talwai force-pushed the talwai/agent_dev_mode branch 5 times, most recently from 384a798 to 9036155 Compare May 13, 2015 22:20

remh reviewed May 14, 2015

talwai force-pushed the talwai/agent_dev_mode branch 2 times, most recently from 0885520 to ac9e194 Compare May 14, 2015 15:11

remh reviewed May 14, 2015

[dev] Add an Agent Developer Mode for collecting fine-grained performance metrics

915ab28

talwai force-pushed the talwai/agent_dev_mode branch from ac9e194 to 915ab28 Compare May 14, 2015 15:46

talwai added a commit that referenced this pull request May 14, 2015

Merge pull request #1577 from DataDog/talwai/agent_dev_mode

6a53aa2

Talwai/agent dev mode

talwai merged commit 6a53aa2 into master May 14, 2015

talwai deleted the talwai/agent_dev_mode branch May 20, 2015 20:55
