Only log '[num] workers' message when it changes. #1078
Conversation
@@ -473,6 +473,8 @@ def manage_workers(self):
         Maintain the number of workers by spawning or killing
         as required.
         """
+        orig_num_workers = self.num_workers
+
         if len(self.WORKERS.keys()) < self.num_workers:
I would rather tag here if a change has been done and test against it instead of comparing numbers. Thoughts?
In my experience, using a mutable has_changed-type flag tends to open up a class of errors where the flag does not get set appropriately after modification. Particularly when it happens in a separate method call (can kill_worker/spawn_workers modify num_workers? Pretty sure. Do they always? I don't know). That kind of bug crops up more during future code changes/maintenance.
It also makes the if-statement harder to reason about: does this flag represent that the value being printed out changed? That it could have changed? Should have? If I see multiple "3 workers" messages in sequence, is that expected or a bug?
This is a short method, but those are the reasons I tend to prefer the state-comparison approach when it's not a computationally difficult one.
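To make the contrast concrete, here is a rough sketch of the state-comparison pattern being argued for. This is not the PR's exact diff (which is truncated above); the snapshot and comparison details here are assumptions for illustration only.

    def manage_workers(self):
        workers_before = len(self.WORKERS)  # snapshot of the value we log

        # ... spawn or kill workers as required (unchanged) ...

        # Compare current state against the snapshot instead of maintaining
        # a mutable has_changed flag in each branch.
        if len(self.WORKERS) != workers_before:
            self.log.debug("{0} workers".format(len(self.WORKERS)),
                           extra={"metric": "gunicorn.workers",
                                  "value": len(self.WORKERS),
                                  "mtype": "gauge"})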
More concretely, as someone unfamiliar with the codebase, I didn't know if spawn_workers would necessarily change that number, or could be capped by memory or some other configuration value. It could be expected to try to spawn each time but not necessarily succeed at increasing that number; I'd have to look deeper to see if that was a (long-term) guarantee.
(I hope that makes it clear where my head was at here.)
I think your reasoning is sound, but if the flag is localized to this function then it's pretty fail-proof.
At the top of the function, just set count_changed = False.
In each of the loops you can set count_changed = True.
At the bottom of the function you can check count_changed.
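A minimal sketch of that flag-based version of manage_workers, as I read the suggestion (not code from this PR):

    def manage_workers(self):
        """\
        Maintain the number of workers by spawning or killing
        as required.
        """
        count_changed = False

        if len(self.WORKERS.keys()) < self.num_workers:
            self.spawn_workers()
            count_changed = True

        workers = sorted(self.WORKERS.items(), key=lambda w: w[1].age)
        while len(workers) > self.num_workers:
            (pid, _) = workers.pop(0)
            self.kill_worker(pid, signal.SIGTERM)
            count_changed = True

        # Log only if we attempted to change the number of workers.
        if count_changed:
            self.log.debug("{0} workers".format(len(workers)),
                           extra={"metric": "gunicorn.workers",
                                  "value": len(workers),
                                  "mtype": "gauge"})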
I actually think I slightly prefer what I just described, for exactly the reasons you say. While it should be clear from the code that self.num_workers must change or the loops would not terminate, it's perhaps clearest to just set a flag when we're attempting to change the number.
It's conceivable that self.num_workers could change between the loops so that it is first increased and then decreased, so if the edge case of logging the same number twice bothers us then your way is indeed better.
What if kill_worker moved the worker from self.WORKERS to a dead worker list? I don't think the ESRCH block is necessary because reap_workers should still get the SIGCHLD.
Then manage_workers becomes three steps:
Scratch that last part, not three steps. manage_workers can stay the same, I think.
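For concreteness, a purely hypothetical sketch of the dead-worker-list idea floated above; self.DEAD_WORKERS and the exact bookkeeping are assumptions, not code from this PR or from #1084:

    def kill_worker(self, pid, sig):
        # Hypothetical: remove the worker from the live table right away and
        # park it in a dead-worker dict, so the ESRCH branch is no longer
        # needed; reap_workers still receives SIGCHLD and does final cleanup.
        worker = self.WORKERS.pop(pid, None)
        if worker is not None:
            self.DEAD_WORKERS[pid] = worker  # assumed attribute, e.g. {} set in __init__
        os.kill(pid, sig)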
@erydo maybe let's make a PR to clean this dance up a bit and then let's rebase this. If you're up for it.
Sure thing, I'll give it a stab.
See #1084.
This PR is on hold until #1084 is completed.
Otherwise when debug logging is on, the message prints every second even with no system activity.
(Rebasing against current master.) Force-pushed from 258eba0 to 09357ed.
This is easier and safer than only logging when we detect that self.WORKERS has changed or that `spawn_worker` or `kill_worker` has been done.
@tilgovi @benoitc — I've updated this to no longer rely on the synchronicity/asynchronicity of spawn_worker/kill_worker. This was possibly the right approach to begin with (no error conditions!) and it decouples this from #1084, which might take some more time to merge in.
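Sketched out, the idea in this revision is roughly the following (simplified; variable names and exact placement are assumptions, with the real attribute shown in the diff hunks further down):

    class Arbiter(object):

        # class-level default; compared against on every pass
        last_logged_worker_count = None

        def manage_workers(self):
            # ... spawn or kill workers as required ...

            active = len(self.WORKERS)
            if self.last_logged_worker_count != active:
                self.log.debug("{0} workers".format(active),
                               extra={"metric": "gunicorn.workers",
                                      "value": active,
                                      "mtype": "gauge"})
                self.last_logged_worker_count = active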
Quite like the idea :) While we are here, maybe we could also just log where the number of workers really changes: in spawn workers and when we reap them. Something like the patch below. I didn't test it though. Thoughts?

diff --git a/gunicorn/arbiter.py b/gunicorn/arbiter.py
index b7ee05d..85fa5ea 100644
--- a/gunicorn/arbiter.py
+++ b/gunicorn/arbiter.py
@@ -55,6 +55,7 @@ class Arbiter(object):
os.environ["SERVER_SOFTWARE"] = SERVER_SOFTWARE
self._num_workers = None
+ self.worker_count = None
self.setup(app)
self.pidfile = None
@@ -464,6 +465,7 @@ class Arbiter(object):
if not worker:
continue
worker.tmp.close()
+ self._log_numworkers()
except OSError as e:
if e.errno != errno.ECHILD:
raise
@@ -475,6 +477,7 @@ class Arbiter(object):
"""
if len(self.WORKERS.keys()) < self.num_workers:
self.spawn_workers()
+ self._log_numworkers()
workers = self.WORKERS.items()
workers = sorted(workers, key=lambda w: w[1].age)
@@ -482,10 +485,6 @@ class Arbiter(object):
(pid, _) = workers.pop(0)
self.kill_worker(pid, signal.SIGTERM)
- self.log.debug("{0} workers".format(len(workers)),
- extra={"metric": "gunicorn.workers",
- "value": len(workers),
- "mtype": "gauge"})
def spawn_worker(self):
self.worker_age += 1
@@ -563,8 +562,20 @@ class Arbiter(object):
try:
worker = self.WORKERS.pop(pid)
worker.tmp.close()
+ self._log_numworkers()
self.cfg.worker_exit(self, worker)
return
except (KeyError, OSError):
return
raise
+
+    def _log_numworkers(self):
+        nworker_count = len(self.WORKERS)
+        if self.worker_count != nworker_count:
+            self.log.debug("{0} workers".format(nworker_count),
+                           extra={"metric": "gunicorn.workers",
+                                  "value": nworker_count,
+                                  "mtype": "gauge"})
+            self.worker_count = nworker_count
+
@@ -51,6 +51,8 @@ class Arbiter(object):
         if name[:3] == "SIG" and name[3] != "_"
     )
 
+    last_logged_worker_count = None
+
Shouldn't it probably be an instance variable?
The instance variable is created once it's assigned to for the first time. I'd be fine moving it into __init__.
@benoitc — I would make the same argument as I made at the beginning of this PR: the patch I've proposed here has no edge cases and is resilient to refactorings, instead of trying to find and maintain every place that modifies self.WORKERS. Additionally, the patch in your comment would log every change in self.WORKERS.
Hrm, the patch in manage_workers only logs if new workers need to be spawned, and after it. Did I miss smth? The point of logging there and in reap workers is to only log when the number really did change, not before it changes on kill. I think it would be more accurate, although we can optimise the reap workers case. This is at least the intention :)
Ah, you're right, I misread your patch! For some reason I misremembered spawning as being in a loop like killing is. I'd still stand by the maintenance argument, though: if it's desirable to also log once workers finish dying, I'd suggest that change be made separately, and it should likely be based on #1084, which consolidates that worker cleanup work.
All of this depends on what is expected from this metric: supervising the number of active workers, or supervising the number of workers alive (running). Maybe we should indeed have 2 metrics, and I guess you're right that your patch is enough to tell the number of active workers and log it as such. I commented on the patch with one last change, then let's commit it :)
@@ -55,6 +55,8 @@ def __init__(self, app):
         os.environ["SERVER_SOFTWARE"] = SERVER_SOFTWARE
 
         self._num_workers = None
+        self._last_logged_active_worker_count = None
last_logged is not needed. Let's just call it self._active_worker_count.
OK, if no one objects I will merge this PR and do the renaming right after. OK?
👍
Only log '[num] workers' message when it changes.
Otherwise when debug logging is on, the message prints every second even with no system activity, drowning out more useful log messages.