Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Talwai/agent dev mode #1577

Merged
merged 1 commit into from
May 14, 2015
Merged

Talwai/agent dev mode #1577

merged 1 commit into from
May 14, 2015

Conversation

talwai
Copy link
Contributor

@talwai talwai commented Apr 24, 2015

This adds a "developer mode" to the agent, that can be enabled by the --profile command line flag or by the developer_mode setting in agentConfig.

Developer mode can be enabled at the check level or the collector level. When enabled at the check
level, e.g. ./agent.py check nginx --profile, the check.run() function is profiled with cProfile and the pstats output is dumped to the log.

When enabled at the collector level, the following behavior is enabled:

  • Every run of the collector loop is profiled with cProfile, in the same way as above
  • The AgentMetrics check (checks.d/agent_metrics.py) is run at the end of each collector loop. This check can be configured to run a set of additional psutil.Process methods on the current process. These additional metrics are flushed to the dispatcher as well as dumped to the log. See conf.d/agent_metrics.yaml.default for the default configuration, which can be extended as needed. Currently a setting under process_metrics is ignored if there is no corresponding psutil.Process method.

@degemer
Copy link
Member

degemer commented Apr 24, 2015

Hey @talwai, I tested it a little and have some remarks (not concerning the code) about it, you probably already talked about it with @LeoCavaille, but I'm going to add my 2 cents.

I was expecting a lot more "by check" metrics, ie at the end of every run, display some stats for each check (and not only globally), for instance run time, memory consumption after/before, io read/write (basically the metrics you're already displaying for each run, and maybe tag them by check if you want to upload them to DD).
Also, about these metrics, could you display them in a nicer way ? Sure, once it's uploaded to DD, it should be really easy to analyze them, but a display in the logs would be a real plus.
Actually I was thinking something like the info display. 😄 (collector being the "global" metrics)

consul:
  run_time: 1.01s
  memory_before: 60785934
  memory_after: 60785935
  io read: 2
  ...
pgbouncer:
  run_time: 6.61s
  memory_before: 60785834
  memory_after: 60785840
  io read: 255
  ...
collector:
  run_time: 15.04s
  memory_before: 60785834
  memory_after: 60786001
  io read: 256
  ...

If you do something like this, I think that the profiler should be another option, not activated by default with the "developer mode", because we don't always need this level of details when debugging a check. And maybe also a profiler by check ? (not sure it's needed, just a suggestion)

Another thing: I don't think you should put agent_metrics.py in checks.d, because when you do this it becomes visible in

initialized checks.d checks: ['agent_metrics', 'network', 'pgbouncer', 'ntp', 'consul', 'nginx']

and a standard user shouldn't see this (event if it doesn't do anything).

The new agent_metrics are really nice! 👍 I think you could just prefix them by datadog.agent.collector instead of datadog.agent. (and as you wrote in your TODOs, some are probably rates and not gauges)

I just realized that we already have some agent_metrics here: https://github.com/DataDog/dd-agent/blob/master/checks/agent_metrics.py, you might want to put yours there too. (or maybe you split them on purpose ?)

@talwai
Copy link
Contributor Author

talwai commented Apr 24, 2015

@degemer Thanks for the feedback!

I was expecting a lot more "by check" metrics, ie at the end of every run, display some stats for each check (and not only globally), for instance run time, memory consumption after/before, io read/write

I agree that we should be getting more data about the performance of individual checks, and not just the collector run as a whole. I can extend wrap_profiling to include some before and after psutil calls. Though I wonder if the extra system calls would hinder the performance of the check itself i.e. check.run() becomes psutil.get_memory_info() ; check.run() ; psutil.get_memory_info() for every check.

Also, about these metrics, could you display them in a nicer way ?

Yep, the output is ugly. Will improve the formatting for the log dumps

I think that the profiler should be another option, not activated by default with the "developer mode", because we don't always need this level of details when debugging a check. And maybe also a profiler by check ? (not sure it's needed, just a suggestion)

I get where you're coming from. In fact, I myself silenced the profiler dump when debugging the AgentMetrics check because it was mostly unhelpful. I wonder if that is the case in general though. Easy enough to make them two separate options, but would be interested in hearing other opinions.

Another thing: I don't think you should put agent_metrics.py in checks.d, because when you do this it becomes visible in
initialized checks.d checks: ['agent_metrics', 'network', 'pgbouncer', 'ntp', 'consul', 'nginx']

The reason it's in there is:

  1. Because it appears that that's where all the new-style checks go
  2. So that it can be tested via AgentCheckTest.
    I realize that it is possible to test without the scaffolding, but my thought was that all tests for new-style checks should follow the same pattern. If the only issue with locating it in checks.d/ is the log output, then I would prefer to just special-case it so it doesn't show up. Admittedly there may be other issues with having it in checks.d/ that I'm not aware of.

I just realized that we already have some agent_metrics here: https://github.com/DataDog/dd-agent/blob/master/checks/agent_metrics.py, you might want to put yours there too. (or maybe you split them on purpose ?)

Basically checks.d/agent_metrics.py is all of checks/agent_metrics.py and more. I opted against simply extending the class in place because it uses the old Check API, and if we plan to deprecate that API we should probably stop using it internally.

Will also fix the metric namespace and change gauge -> rate as appropriate

@@ -0,0 +1,44 @@
#3p
import psutil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably don't need psutil here!?

@LeoCavaille
Copy link
Member

Nice! Good to see progress, this will prove super helpful to help the community design more performant checks. Made a few remarks just giving a first look.
Two ideas of stuff that it would be super cool to have:

  • rather than just logging a dozen lines of pstats in logs at every run (and sometimes it might not be exactly what you want actually), could it be possible to enable profiling at a higher level for a while and accumulate the whole collector pstats?
  • and optionally to be able to dump that to a file instead of logs so that you can use pstats for forensics later on offline.
    What do you think?

@@ -22,6 +22,7 @@

# 3rd party
import yaml
import psutil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

psutil is actually not available on source install without a compiler :/
Can you try and catch an import error and disable the developer mode accordingly ?

@talwai
Copy link
Contributor Author

talwai commented Apr 28, 2015

@LeoCavaille thanks for the review! I think both your ideas are good, if I understand correctly. I will look into moving the profiler higher in the call stack, and dumping cumulative statistics after longer intervals, rather than on every collector run. This should give us more useful higher-level insight than the current setup does. I will also add an option for dumping pstats to a file instead of / in addition to logging.

@talwai talwai force-pushed the talwai/agent_dev_mode branch 3 times, most recently from 8febb1d to af59b19 Compare April 30, 2015 00:07
@talwai
Copy link
Contributor Author

talwai commented Apr 30, 2015

The updated default functionality is as follows:

  1. agent.py developer mode: Enables more granular checks in checks.d/agent_metrics.py. These checks are flushed to Datadog and also dumped to log.info. A pstats profile is dumped to the file ./collector-stats.dmp after COLLECTOR_PROFILE_INTERVAL runs ( both the file name and the interval are candidates for moving to agentConfig). Additionally every check calls the static method AgentCheck._collect_stats before and after its run. This collects some psutil stats, the defaults of which are specified here: https://github.com/DataDog/dd-agent/blob/talwai/agent_dev_mode/checks/__init__.py#L37 (maybe also a candidate for living in agentConfig). These stats are prettified and dumped to log.info
  2. check developer mode: This collects pstats output of the check run and dumps to log.debug. (Maybe it can dump to a file as well? Seems overkill). Additionally it calls AgentCheck._collect_stats() before and after its run and outputs data in the agent.py info format. e.g.:
$ python agent.py check ntp --profile
2015-04-30 11:20:44,471 | INFO | dd.collector | config(config.py:922) | initialized checks.d checks: ['ntp', 'network']
2015-04-30 11:20:44,471 | INFO | dd.collector | config(config.py:923) | initialization failed checks.d checks: []
2015-04-30 11:20:44,471 | INFO | dd.collector | checks.collector(collector.py:435) | Running check ntp
Metrics:  [('ntp.offset', 1430407244, 0.02178668975830078, {'hostname': 'Aadityas-MacBook-Pro.local', 'type': 'gauge'})]
Events:  []
Service Checks:  [{'status': 0, 'tags': None, 'timestamp': 1430407244.5470452, 'check': 'ntp.in_sync', 'host_name': 'Aadityas-MacBook-Pro.local', 'message': None, 'id': 1}]
    ntp
    ---
      - instance #0 [OK] Last run duration: 0.0944309234619
      - Collected 1 metric, 0 events & 1 service check
      - Stats: 
            Memory Before (RSS): 18092032
            Memory After (RSS): 18108416
            Memory Before (VMS): 2534723584
            Memory After (VMS): 2534723584

@talwai talwai force-pushed the talwai/agent_dev_mode branch 2 times, most recently from f3db44f to 726abef Compare May 7, 2015 02:44

# 3rd party
import yaml

try:
import psutil
PSUTIL_PRESENT = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need this flag.
To test if psutil is here you can just do

if psutil is not None

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e.g.

try:
import psutil
except ImportError:
psutil = None

@remh
Copy link
Contributor

remh commented May 8, 2015

Great work!
We should probably add the option in the datadog.conf by the way.

Also does it work properly on windows?

profiler.disable_profiling()
profiled = False
collector_profiled_runs = 0
except Exception:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any idea what kind of exception can be raised here ?
If you just want to play it safe to make sure that the profiler doesn't crash the agent (which is a good thing) can you log the raised exception please ?

@talwai talwai force-pushed the talwai/agent_dev_mode branch 5 times, most recently from 384a798 to 9036155 Compare May 13, 2015 22:20
@talwai
Copy link
Contributor Author

talwai commented May 13, 2015

@remh This works on windows

@@ -564,6 +648,15 @@ def run(self):
tb=traceback.format_exc()
)
instance_statuses.append(instance_status)

if self.in_developer_mode and self.name != 'agent_metrics':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick but you should use your

AGENT_METRICS_CHECK_NAME

constant

@remh
Copy link
Contributor

remh commented May 14, 2015

Looks great beside the last comments!

It should be ready to merge once those are fixed

@talwai talwai force-pushed the talwai/agent_dev_mode branch 2 times, most recently from 0885520 to ac9e194 Compare May 14, 2015 15:11
LEGACY_DATADOG_URLS = [
"app.datadoghq.com",
"app.datad0g.com",
]

#Checks whose log output to suppress, unless explicitly asked for
HIDDEN_CHECKS = ['agent_metrics']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should use the constant you defined here.

talwai added a commit that referenced this pull request May 14, 2015
@talwai talwai merged commit 6a53aa2 into master May 14, 2015
@talwai talwai deleted the talwai/agent_dev_mode branch May 20, 2015 20:55
@coveralls
Copy link

Coverage Status

Changes Unknown when pulling 915ab28 on talwai/agent_dev_mode into ** on master**.

@coveralls
Copy link

Coverage Status

Changes Unknown when pulling 915ab28 on talwai/agent_dev_mode into ** on master**.

@coveralls
Copy link

Coverage Status

Changes Unknown when pulling 915ab28 on talwai/agent_dev_mode into ** on master**.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants