Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restart agent if it uses too much memory: #429

Merged
merged 1 commit into from
Apr 19, 2013
Merged

Restart agent if it uses too much memory: #429

merged 1 commit into from
Apr 19, 2013

Conversation

echohead
Copy link
Contributor

Augment the Watchdog to restart the agent process
not only if it has been running longer than the restart
interval, but also if the agent process exceeds a
memory threshold.

The obvious way to do this would be setrlimit with a max
RSS size, and allow the process to die with a MemoryError
when it is exceeded, but the Tornado library being used
swallows MemoryErrors and allows the process to continue.

In order to ensure the process is terminated, the current
memory usage is checked in the Watchdog-reset method, and
the process is aborted if the threshold is exceeded.

Augment the Watchdog to restart the agent process
not only if it has been running longer than the restart
interval, but also if the agent process exceeds a
memory threshold.

The obvious way to do this would be setrlimit with a max
RSS size, and allow the process to die with a MemoryError
when it is exceeded, but the Tornado library being used
swallows MemoryErrors and allows the process to continue.

In order to ensure the process is terminated, the current
memory usage is checked in the Watchdog-reset method, and
the process is aborted if the threshold is exceeded.
@echohead
Copy link
Contributor Author

echohead commented Apr 1, 2013

Looks like the hard-coded timeout and memory limit in the new test don't work on the Travis test VM.

@ghost ghost assigned alq666 Apr 1, 2013
@alq666
Copy link
Member

alq666 commented Apr 2, 2013

@echohead I'm wondering whether we should put it here or make it part of /etc/security/limits.conf.d/ so as to catch every single case of going over quota. The nice thing about the systemic approach in limits.conf is that we get that for free but it requires a reboot to take (and testing has been done for us by generations of linux users). Is that an approach you've considered?

@mastrolinux
Copy link
Contributor

Why not to use an event-listener in supervisord? http://supervisord.org/events.html#configuring-an-event-listener it is in the superlance plugin https://superlance.readthedocs.org/en/latest/memmon.html

@clofresh
Copy link
Contributor

clofresh commented Apr 5, 2013

I do like the supervisor plugin, but sadly it doesn't cover CentOS 5 users, since they don't run with supervisor. Good to know about the memmon plugin though.

@echohead
Copy link
Contributor Author

echohead commented Apr 7, 2013

alq666 - putting it under /etc/security/limits.conf.d would make sense to me. I was basically just throwing this out there for discussion.

While of course any approach will have its tradeoffs, a couple of nice things about this approach:

  • no system restart requrired
  • no dependencies (it just uses the kernel's setrlimit syscall)

Also, this approach was chosen with a motivation of achieving the desired behavior with minimal modification to the agent:
instead of restart yourself if you've been running >N seconds, it would be restart if (running >N seconds) || (memory-usage >M)

@@ -208,6 +214,11 @@ def self_destruct(signum, frame):


def reset(self):
# self destruct if using too much memory, as tornado will swallow MemoryErrors
mem_usage_kb = int(os.popen('ps -p %d -o %s | tail -1' % (os.getpid(), 'rss')).read())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code won't work on Windows.
I think you should use the resource python module: http://docs.python.org/2/library/resource.html#resource-usage
@clofresh any thought ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@remh The Watchdog isn't used on Windows at all so it should be okay.

@remh
Copy link
Contributor

remh commented Apr 19, 2013

Thanks @mastrolinux it looks good !

remh added a commit that referenced this pull request Apr 19, 2013
Restart agent if it uses too much memory:
@remh remh merged commit 7e317a3 into DataDog:master Apr 19, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants