This repository has been archived by the owner on Apr 22, 2023. It is now read-only.

simple/test-cluster-worker-death.js fails #3198

Closed
wants to merge 3 commits into from

Conversation


@coltrane coltrane commented May 3, 2012

There are two problems with this test:

1- The test fails initially because it expects worker.on('exit', ...) to send a statusCode, but as of 5f08c3c, that event sends an instance of the new Worker class.

2- After fixing the first problem, the test fails because it expects to receive cluster.on('death', ...), but that event name was changed to 'exit' via the combined effects of 5f08c3c and 90ce5b3.

These are both breaks in backward compatibility to 0.6.x. I'm guessing that #1 is not a big deal. The following thoughts pertain to #2.

I suspect that the intent of 90ce5b3 (@isaacs, please correct me if I'm wrong) was only to rename the event that's emitted by the worker, while leaving the cluster's event alone, thus restoring backward compatibility; but 5f08c3c had previously unified that logic so that the event names were tied together.

I'll be glad to prepare a fix, but I need some feedback on how to proceed:

  • Is the backward break intentional? If so, we can just "fix" the test so it looks for the exit event from cluster instead of the death event.
  • Or do we want to preserve backward compatibility? In that case, we'll need to make some changes in cluster.js to restore the event names to what they used to be.

@isaacs

isaacs commented May 1, 2012

The event on the Worker should be named 'exit'. The cluster module has been basically rewritten, so breaking backwards compatibility is ok in this case.

I think @bnoordhuis had pointed out that the worker argument added in 5f08c3c is unnecessary, since it's already this, so maybe we should just remove it.

So, patch welcome:

  1. Remove Worker argument from the worker exit event, so it's just worker.emit('exit', code)
  2. Rename 'death' event listener in the test to be an 'exit' event listener instead.
  3. Make sure to update any docs so that everything is in sync.
  4. Make sure you sign the CLA if you haven't already :)

- changed worker `death` to `exit`.
- corrected argument type expected by worker `exit` handler.
worker 'exit' event now emits arguments consistent with the
corresponding event in child_process module.
@coltrane
Author

coltrane commented May 3, 2012

Commits are attached. CLA is signed.

In addition to the points @isaacs identified, I've also included a couple of new tests that focus on the behavior of cluster's worker processes when they are killed and/or exited.

this.process.once('exit', function(exitCode, signalCode) {
  prepareExit(self, 'dead');
  self.emit('exit', exitCode, signalCode);
  cluster.emit('exit', self);
});
@AndreasMadsen
Member


This is so confusing: why does worker.on('exit') receive exitCode and signalCode, but cluster.on('exit') doesn't? But fixing that would be confusing too, since you would have two syntaxes:

worker.on('exit', function (exit, signal) {});
cluster.on('exit', function (worker, exit, signal) {});

Personally I would like to see this left as is, because userland can access this information through worker.process.exitCode and worker.process.signalCode. However, we should probably extend an example so it shows how to obtain exitCode and signalCode.

@coltrane
Author


Other than @bnoordhuis's already-stated reason, I think another benefit of this change is that it makes worker.on('exit') match child_process's 'exit' event. This makes sense to me: ideally, anything a developer learns about using the child_process module should transfer to the workers produced by the cluster module as well.

using child_process:

var cp = child_process.fork(process.argv[1]);
cp.on('exit', function(code, signal) {
   // ...
});

using cluster:

var worker = cluster.fork();
worker.on('exit', function(code, signal) {
  // ... works the same as child_process
});

@AndreasMadsen
Member

@isaacs if you feel that the worker argument in the Worker events should be removed, then please remember to remove it from the listening event too.

@coltrane
Author

coltrane commented May 4, 2012

@AndreasMadsen I was thinking the same thing. I actually started last night to eliminate the worker argument from all the Worker events, but got side-tracked while updating the documentation. I'll try to get back to it tonight, and I'll push the changes ASAP.


While we're on the subject of confusing :) -- I'd like to offer a thought / make a suggestion:

I think it's confusing to have the child_process instance exposed as part of the worker (worker.process). Of course it's necessary to have access to the child process, but as it stands, it's easy to end up with some really odd stuff:

var worker = cluster.fork();

// oops, the author probably intended to listen on `worker`, not on `worker.process`
worker.process.on('exit', function(code,signal) { ... })

// calling `worker.process.disconnect()` is actually subtly different from calling 
// `worker.disconnect()`.
worker.process.disconnect();

More examples of similar conundrums are possible. There are so many different possibilities here that it would be very difficult to get test coverage for them all.

In my opinion, it would be better if the cluster module were designed so that Worker "extends" the api of ChildProcess, rather than wrapping it with a subtly different api. This would achieve complete consistency between the two interfaces, and simplify things at the same time.

If we did this, here's how I imagine it might turn out:

var worker = cluster.fork();

if( cluster.isMaster ) {

  // in the Master process...

  // `worker.process` would be removed, and all the `ChildProcess` 
  // api is exposed as part of `Worker`. so `Worker` *is a* `ChildProcess`...
  assert.ok( worker.process === undefined );

  // properties of child_process are available directly
  var child_pid = worker.pid;

  // methods too...
  worker.kill('SIGHUP');

  // methods can be overridden with cluster-specific logic when needed
  worker.disconnect()

  // Worker can add new methods/properties too, as needed
  console.log("Worker ID = %s",  worker.uniqueID);
  console.log("Suicide?  %s",  worker.suicide);
  worker.destroy();

  // the same things are true of events...

  // events inherited from child_process
  worker.on('exit', function(code,signal) { ... });
  worker.on('close', function() { ... });
  worker.on('disconnect', function() { ... });

  // worker defines additional events as needed...
  worker.on('message', function() { ... });
  worker.on('online', function() { ... });
  worker.on('listening', function() { ... });
}

if( cluster.isChild ) {

  // in the Child process...

  // There is no special `Worker` object instance.  Instead, let the 
  // Cluster module "extend" the `process` object itself -- the same 
  // way the child_process module already does.  Keep `cluster.worker` 
  // for convenience and clarity - but it would just point to the existing
  // `process` object.
  assert.ok( cluster.worker === process );

  // child_process's existing functions (eg: `process.send()`, 
  // `process.disconnect()`, etc) could be overridden by cluster-specific 
  // functions as needed.  Allowing cluster-specific logic to be injected, while 
  // preserving the pre-existing interface from child_process.

  // calls like `process.disconnect()` would invoke a function from Cluster, 
  // which would delegate internally to the corresponding child_process function.
  // this is the *only* version of `disconnect()` available in the public api
  // whether you're coding inside a worker process, or a plain old child_process.
  process.disconnect() 

  // of course, `cluster.worker.*` is equivalent to `process.*`
  // so, instead, we could have spelled it like this...
  cluster.worker.disconnect();

  // Worker adds the `worker.destroy()` method, which is 
  // not present in child_process.
  process.destroy();

  // the 'exit' event is already defined by `process`. no changes needed.
  process.on('exit', function() { ... });

  // child_process adds a few events...
  process.on('message', ...);
  process.on('disconnect', ...);

  // cluster could add specific events too, if needed.
  // ...but I don't think there are any additional ones 
  // for the child at this time (?)
}

@AndreasMadsen, I think your recent changes add valuable and much-needed functionality to the cluster module. Thank you for taking the time and effort to contribute that. I also think that, with a little additional effort (which I'm happy to contribute) to clean up and unify with process and child_process, we could make this part of the API simpler and much easier to use for the majority of Node's audience.

I'm interested to hear thoughts, discussion, etc. (Thanks for reading this far).

@AndreasMadsen
Member

Hi @coltrane, funny, I too was in the process of writing a long comment about this before I knew you were. I have read your comment twice, but it is too late here, so I won't give you a quick response today; I'll try to give a well-deserved response tomorrow.

However, in the interest of objectivity, I will still give you what I wrote before reading yours.


This is mostly taken from my original discussion with myself when I designed the new cluster API. I'm sorry if there is too much irrelevant stuff; I have tried to sort most of it out.

Taking a look at the original cluster implementation https://github.com/joyent/node/blob/v0.6/lib/cluster.js#L161, it is clear that cluster.fork() is meant to look like the child_process.fork() API. This seems quite idealistic; at this point it shows best in the way worker.on('message') leaks internal RPC communication. However, it has since become clear that implementing graceful methods like .disconnect required more than just child_process.disconnect. Isolating child_process from the return object of cluster.fork() therefore seems like a desired solution. On the other hand, child_process.fork() has a lot of useful API that would be lost, and the increase in complexity would be too high if both child_process.fork() and cluster.fork() had to be updated each time a bug is fixed. The solution should therefore be to move the child_process.fork() result to a .process property. This also allows a nice Worker class abstraction that can be used in both workers and the master, since process and child_process.fork() have almost the same API.

cluster.fork() should therefore be based on child_process.fork(), but should only inherit a subset of the API, since cluster is a subset of child_process where a more closed API is desirable (it usually reduces failures), and should userland want more, worker.process is always an option.

Is the above correct? I don't know. But trust me, I have rewritten that cluster module 4 times, and my conclusion so far is that cluster.fork() should not have exactly the same API as child_process.fork().

The next decision is events. LearnBoost/cluster has a lot of useful events, but to keep everything simple, listening is probably the most useful and the biggest difference between a normal child process and a worker. A worker should also have the same events as child_process.fork(). Should this be named death or exit? Let's choose death because of 8652848#L1R177 (names can be changed); it will also indicate a difference between child_process.fork() and cluster.fork() death, should such a difference show up. The next question is the arguments. The address argument in listening was very useful in LearnBoost/cluster for making abstractions in the master, so that should definitely be included. I also like the current idea of having global cluster events; this way one can be sure to catch all information easily, and the worker argument seems to be required for such a feature, so no changes there. Making abstractions is always important, so userland should be allowed to switch between cluster.on and worker.on without making changes to their code. Since the address argument is already part of the listening event, and it would be a bad idea to change the position of the worker argument, the worker should probably be placed first, with all other arguments after it.

tl;dr

I hope you didn't start here :)

I must admit I never thought about the exitCode and signalCode arguments; if those are important and API bloat is not an issue, then sure, add them. But make it consistent and include them on both the worker and cluster objects, so userland can switch between cluster.on and worker.on without worrying about argument positions shifting. In the end, if userland has to do:

var signalCode = this === cluster ? arguments[0].process.signalCode : arguments[1];

they will probably just do

var worker = (this === cluster ? arguments[0] : this);
var signalCode = worker.process.signalCode;

And then why not just allow them to do:

var signalCode = worker.process.signalCode;

@isaacs

isaacs commented May 5, 2012

@coltrane Landed on a62dd44. Thanks!

@AndreasMadsen I think your points are valid. But it's a bit odd to emit the object itself as an argument to its own event listener, when this already points at it. Nothing else in node works that way. It's better if the worker's events match a ChildProcess object's events as much as possible, or else cluster seems like too much of a stranger.

@isaacs isaacs closed this May 5, 2012
@AndreasMadsen
Member

@isaacs today you really surprised me and made me sad at the same time; congratulations.

The reason is not just that you closed this without letting the discussion finish, or at least giving me some time to evaluate @coltrane's concerns, but that it seems to me you didn't even read my comment:

#3198 (comment)

@isaacs if you feel that the worker argument in the Worker events should be removed, then please remember to remove it from the listening event too.

And please also remember to remove it from fork and online.

https://github.com/joyent/node/pull/3198/files#r773549

This is so confusing: why does worker.on('exit') receive exitCode and signalCode, but cluster.on('exit') doesn't?

@AndreasMadsen
Member

@coltrane, as promised, here are my thoughts on your proposed API.

To make everything clear, let's set this up in a pros/cons matrix :)

| | pros | cons |
| --- | --- | --- |
| your API | The API is very transparent between a child process object and a worker object. This of course results in a more beautiful and easy-to-remember API. | The API is based on overwriting the child process object. This makes it impossible or very difficult to go beyond the edge of the cluster module. |
| before a62dd44 | The worker API is recognizably different from the child process API, making it clear that there are differences. It also allows low-level access to the underlying child process object. | The worker API is different from the child process API, which increases the API surface; among humans this often leads to some confusion. |

In the design process I did actually consider your API; it is cleaner, but it lacks the ability to go beyond the immediate capabilities of the given API. However, with a simple .process property it is clear that the Worker API is not meant to be like the child process API. I understand this may lead to some confusion, but node has never tried to protect userland, and the worker API isn't that different from the child process API: with this patch, the only differences are the worker argument in events and that .kill is called .destroy. The latter was chosen because killing a worker is usually very destructive, and .destroy indicates that, much like socket.destroy.

As for your concerns about test cases, I don't see your point. A worker is simply a child process with a different API surface, because the needs aren't the same. If you call worker.process.disconnect it will disconnect the child process itself; however, if you call worker.disconnect it will disconnect the worker in the most graceful way from a worker perspective, with all the server instances and so on.

@isaacs today we are discovering new land; "nothing else works this way" because there is nothing in node that directly translates to this case. And @isaacs, there are cases where we send extra arguments to a handler that really aren't needed. The buffer argument in fs.read is an example of such a case, and so of course are the signalCode and exitCode in a child process exit event.

@isaacs

isaacs commented May 5, 2012

@AndreasMadsen I saw your comment. Yes, failing to remove the worker arg from worker.on('online') and 'listening' was an oversight on my part. That will be fixed shortly, thanks for the reminder :)

The .process member is still there as a reference to the underlying ChildProcess object, however, so I'm not sure what you mean when you say it "lacks the ability to go beyond the immediate capabilities of the given API".

The bottom line is that it's better to reduce redundancy and make the API surfaces more consistent throughout node.

isaacs added a commit to isaacs/node-v0.x-archive that referenced this pull request May 5, 2012
Regarding discussion in nodejs#3198.  Passing the worker as an argument
to an event emitted on the worker is redundant, and an unnecessary
break in consistency vs the events on the ChildProcess objects.

It was removed from 'exit', but 'listening' and others were
overlooked.  This corrects that oversight.
@AndreasMadsen
Member

The .process member is still there as a reference to the underlying ChildProcess object, however, so I'm not sure what you mean when you say it "lacks the ability to go beyond the immediate capabilities of the given API".

@isaacs that was only a response to the API proposed by coltrane; it has nothing to do with the current API in node/master.

@coltrane
Author

coltrane commented May 8, 2012

@AndreasMadsen, thanks for your reply. A few thoughts and questions...

coltrane's API
Con - The API is based on overwriting the child process object. This makes it impossible or very difficult to go beyond the edge of the cluster module.

Can you explain this further? What do you mean by "go beyond the edge of the cluster module"?

As for your concerns about test cases, I don't see your point. A worker is simply a child process with a different API surface, because the needs aren't the same.

I will try to explain: worker is an instance of Worker, and worker.process is an instance of ChildProcess. It is reasonable, then, to expect the methods of worker.process to behave the same way as the corresponding methods of a "regular" ChildProcess instance. But that's not the case: Worker actually has side effects that change the behavior of some of the child_process functions.

For example: inside a "regular" child process, calling process.disconnect() closes the IPC channel but leaves the process running. But if you're inside a worker process, then process.disconnect() (or cluster.worker.process.disconnect()) will close the IPC channel and then exit the process.

This is the very kind of inconsistency that leads me to suggest hiding worker.process entirely and exposing a single child_process-like interface through worker. But if we don't do that, then we'll need to either (a) fix the inconsistencies so that worker.process behaves the same as a raw ChildProcess, or (b) document the differences clearly, so that userland programmers know what to expect.

In either case, it is not sufficient to have tested the child_process module on its own; to get full coverage, we would need to re-test the ChildProcess functions via the Worker. This quickly becomes a complicated task if you really try to test all the different combinations of methods and events on worker and worker.process in both parent and child.

@AndreasMadsen
Member

Can you explain this further? What do you mean by "go beyond the edge of the cluster module"?

Sure. In many ways it is like the HTTP module. If you create a normal http server:

http.createServer(function (req, res) { });

The given API surface is http.ServerRequest and http.ServerResponse, and those are based on the HTTP-protocol use case. However, you are allowed to "go beyond the edge of the http module" by using req.connection. This property will, like worker.process, give you access to the underlying net.Socket object.

cluster.worker.process.disconnect() will close the IPC channel, then exit the process.

It is a known issue, caused by this code: https://github.com/joyent/node/blob/master/lib/cluster.js#L499-503
I'm currently waiting for #3007 to be decided, since that will tell me how many critical use cases there are.

In either case, it is not sufficient to have tested the child_process module on its own to get full coverage.

I agree, we should definitely test as many use cases as possible. However, cluster is considered experimental and, unfortunate as it is, that means there may be bugs and design flaws. These issues will have to be solved as they arise, but it is a long process.

To return to the .connection property in the http module: it has a .pause() method (and many others), but this is not tested in combination with the http module. Sure, it probably should be tested, but to expect that one can write a module, test all possible use cases, and say "done" is a futile point of view (in my opinion, of course).

@coltrane
Author

coltrane commented May 9, 2012

@AndreasMadsen that makes sense, thank you for explaining. I'll think about this some more with the new perspective you've provided.
