Different method names than "occupied()" and "unoccupied()" #32

Closed
wlandau opened this issue Mar 6, 2023 · 14 comments
Comments

wlandau (Owner) commented Mar 6, 2023

No description provided.

shikokuchuo (Contributor) commented Mar 6, 2023

Leaving aside the toilet humour...

busy() is what NNG uses. The analogous free() is unfortunate, though. available()?

wlandau (Owner, Author) commented Mar 6, 2023

Yeah, busy() is a good one.

inactive() vs active()?

Or maybe connected() vs disconnected() would be more descriptive? I would need to move the existing connected() method to something like listening() (which would make more sense anyway because it describes what the mirai client is doing.)
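
For illustration, here is a minimal R6 sketch of how the connected()/disconnected() pair might read on a worker handle. The class and the `connection` field are hypothetical placeholders, not the real crew or mirai API:

```r
library(R6)

# Illustrative stand-in class to show the candidate method names only.
worker_handle <- R6Class("worker_handle",
  public = list(
    connection = FALSE,
    connected = function() isTRUE(self$connection),
    disconnected = function() !isTRUE(self$connection)
  )
)

handle <- worker_handle$new()
handle$connected()     # FALSE: no websocket connection observed yet
handle$disconnected()  # TRUE
```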

shikokuchuo (Contributor) commented

I think the most natural might actually be busy() and free(). "Free" only has other meanings for programmers. Do you need both, though? I would have thought one would be enough.

I love the idea of calling free() on workers in any case :)

For listening() - simply ready()?

wlandau (Owner, Author) commented Mar 7, 2023

For the mirai client, I think I will go with listen() and listening() because these are specific and unobtrusive, and I will go with terminate() because it is what I am using as a shutdown term across all the R6 classes.

For the workers/servers, the more specific and technical versions of these terms help me think about what is going on and wrap my head around issues like #31. busy() and free() are nice and idiomatic to NNG, but I confuse them with whether or not a task is running. In this lexicon, a worker can be "busy" in the sense that the connection is active, but also "idle" because it is not running a task. That is why I like "connected" and "disconnected".

For #31 and auto-scaling, I also need to consider other aspects of a worker/server than just "connected". I need to be able to determine if a worker is "active", i.e. able to accept a task. I need to know how many "active" workers there are in relation to the user-defined worker limit and the number of tasks in the queue. "active" is much more complicated than just "connected" because:

  1. A worker can take an unpredictable amount of time to start up, not only because of the overhead of various AWS services, but also due to complicated renv environments, a long queueing time on a SLURM cluster, etc. crew needs to be able to handle these cases without giving up too early, because then it would restart workers and create a mess. I think the grace period for starting up should be around the 95th percentile of observed startup times (see the sketch after this list).
  2. Alternatively, a worker can start up quickly, run a job quickly, and then idle out before the crew client even notices what is happening. (crew has no daemons of its own.) Like (1), this scenario looks like a "disconnected" worker. However, it is very different. In (1), we should give a chance for the worker to start up and accept the task. In (2), the worker is already done and needs to restart.
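
As a rough sketch of the grace-period idea from point 1, assuming a made-up vector of observed startup times (the numbers are illustrative, not real measurements):

```r
# Hypothetical startup timings in seconds, e.g. collected from prior launches.
observed_startup_seconds <- c(12, 15, 18, 22, 30, 41, 55, 90, 120, 300)

# Use the 95th percentile of observed startup times as the grace period
# before a worker that has never connected is considered lost.
grace_period <- stats::quantile(observed_startup_seconds, probs = 0.95)
grace_period
#> 95%
#> 219
```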

So I want to use "joined" to describe a worker which has either:

  1. completed at least one task, or
  2. appeared as connected in the output of daemons() at some point since it launched, as observed by the client.

So when crew submits a task, is the worker "active" (able to accept the task)? In my current thinking, there are 4 cases:

| | Has joined at some point | Never joined |
| --- | --- | --- |
| Currently connected | Worker is "active". | Worker is "active". Label as "joined". |
| Currently disconnected | Worker is "inactive". Force-terminate the worker process if running. If auto-scaling up, call another mirai::server() and immediately consider the new worker "active". | Worker is "active" if within the (generously long) startup period, "inactive" if afterwards. If "inactive" and auto-scaling up, force-terminate the old process, call another mirai::server(), and immediately consider the new worker "active". |
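
Here is the table above expressed as a small predicate in R. The function name and arguments are hypothetical; the real crew implementation may differ:

```r
# Hypothetical helper: is a worker "active", i.e. able to accept a task?
worker_active <- function(connected, joined, seconds_since_launch, startup_window) {
  if (connected) {
    # Either cell of the "currently connected" row: the worker is active.
    TRUE
  } else if (joined) {
    # Disconnected after having joined: the worker finished or idled out.
    # It is inactive and should be force-terminated (and relaunched if scaling up).
    FALSE
  } else {
    # Never joined: active only while still inside the startup grace period.
    seconds_since_launch < startup_window
  }
}

worker_active(connected = FALSE, joined = FALSE,
              seconds_since_launch = 30, startup_window = 3600)  # TRUE: still starting up
worker_active(connected = FALSE, joined = TRUE,
              seconds_since_launch = 30, startup_window = 3600)  # FALSE: needs a relaunch
```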

I think I will need to add a strong disclaimer in crew that

  1. We are trusting the workers to start up and join within a generous time window, and
  2. The user should monitor the crew jobs with the tools of the specific platform, keep track of them manually, and terminate dangling jobs as necessary.

@shikokuchuo, am I missing something?

wlandau (Owner, Author) commented Mar 7, 2023

Hmmm... maybe "joined" and "connected" are too similar. I need a word for "has connected at some point since it launched"...

brendanf commented Mar 7, 2023

Using targets with clustermq on Slurm, I sometimes get extreme queue times for some of the workers, on the order of days. I think these end gracefully if the main job finishes before they actually start, but I haven't tested that carefully. I know you're still focusing on getting the basic infrastructure working using the callr backend, but once you dive into more distributed resources, are you opposed to checking the status of a cluster job by asking the cluster manager? I know this definitely gets into the weeds of specific cluster managers, and possibly even particular installations.

wlandau (Owner, Author) commented Mar 7, 2023

are you opposed to checking the status of a cluster job by asking the cluster manager?

I would strongly prefer not to do that because it would need to be called every time a task is pushed, and frequent pushes could overwhelm squeue. Similarly, polling the AWS web API could get expensive. These operations could also be slow. Even the is_alive() method polling callr processes is a bottleneck in the profiling studies I have done so far.
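
As a rough illustration of that polling cost (not crew's actual profiling code; timings vary by machine), repeatedly calling is_alive() on a callr background process looks like this:

```r
library(callr)

# Start a throwaway background R process and time 1000 liveness checks.
proc <- r_bg(function() Sys.sleep(30))
system.time(for (i in seq_len(1000)) proc$is_alive())
proc$kill()
```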

Also, I am not sure the direct job status would be as useful as the mirai websocket connection. For example, a job could still be running but in an unusable crashed state. (By imposing a max startup time (albeit a lengthy one), crew will force-terminate these jobs once the startup time is elapsed.)

I sometimes get extreme queue times for some of the workers, on the order of days.

That's the most extreme queue time length I have ever heard of. How busy is the cluster when that happens? Do you have a way to check empirically?

For the majority of cases, I would expect the queue and startup time to be only a few minutes at most.

wlandau (Owner, Author) commented Mar 7, 2023

If it takes a day for SLURM to start the job, versus only a few seconds to start R and connect to the websocket, then polling the connection versus polling SLURM would agree for almost the entire startup duration.

brendanf commented Mar 7, 2023

This is definitely happening when the cluster is busy, and I don't think there's anything wrong with the behavior of targets or clustermq.

In those situations Slurm will often start one worker at a time over a period of several hours. I think there may be a mechanism in place where my priority goes down as the number of jobs I am already running goes up. Recently I started three pipelines at almost the same time, and Slurm started all the workers for the first pipeline before starting on the second, then all the workers for the second pipeline before the third. The third pipeline was the case where the master had to wait more than 1 day for the first worker to report in.

The clustermq backend seems to handle this situation as well as can be expected; once the worker finally starts and reports in, then it starts getting tasks assigned to it. However, I can imagine this being an issue for crew, if it thinks that all the workers who don't connect within 1 hour (a very generous startup time) are dead, and starts submitting more jobs, which also sit in the queue...

wlandau (Owner, Author) commented Mar 7, 2023

However, I can imagine this being an issue for crew, if it thinks that all the workers who don't connect within 1 hour (a very generous startup time) are dead, and starts submitting more jobs, which also sit in the queue...

Yes. The startup time will be configurable, so you could set it to 24 hours or even Inf. And if you set it to an hour but the workers always take a day to start, crew will terminate the queued job before it assigns a new one so the number of workers does not creep past the maximum you set in advance.

brendanf commented Mar 7, 2023

If it takes a day for SLURM to start the job, versus only a few seconds to start R and connect to the websocket, then polling the connection versus polling SLURM would agree for almost the entire startup duration.

I agree that, as long as the worker has reported back before the "max startup time", then there is no need to ask SLURM anything. I am mostly thinking of the case where the worker has passed the maximum startup time, and crew needs to decide whether to give up on it and launch a new worker. Or maybe to abort with an error because there may be something wrong with the cluster config. At that point it would be helpful to know whether SLURM considers the worker job to be queueing, running, or dead. I don't imagine that this needs to be polled frequently at all.
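
A sketch of what such an occasional check might look like, assuming the SLURM job ID is known. The helper itself is hypothetical; the squeue flags (-j, -h, -o %T) are standard SLURM field codes:

```r
# Hypothetical helper: ask SLURM once for the state of a single job.
# Not something that would be called on every task push.
slurm_job_state <- function(job_id) {
  state <- suppressWarnings(system2(
    "squeue",
    args = c("-j", shQuote(job_id), "-h", "-o", "%T"),
    stdout = TRUE,
    stderr = FALSE
  ))
  # Empty output means SLURM no longer lists the job (finished or never existed).
  if (length(state) == 0) "GONE" else state
}

# slurm_job_state("123456")  # e.g. "PENDING", "RUNNING", or "GONE"
```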

brendanf commented Mar 7, 2023

Yes. The startup time will be configurable, so you could set it to 24 hours or even Inf. And if you set it to an hour but the workers always take a day to start, crew will terminate the queued job before it assigns a new one so the number of workers does not creep past the maximum you set in advance.

That seems like a reasonable solution. If I want crew to wait 3 days for Slurm to start my job, then I can just tell it that.

wlandau (Owner, Author) commented Mar 7, 2023

Or maybe to abort with an error because there may be something wrong with the cluster config.

That is a good idea. I will consider a configurable policy for startup timeouts.

At that point it would be helpful to know whether SLURM considers the worker job to be queueing, running, or dead. I don't imagine that this needs to be polled frequently at all.

I think the outcome would be the same in all 3 cases: after all this time, the worker still has not connected to the websocket, and so we need to make sure it is terminated. In the case of an already terminated worker, with a descriptive enough job name, I do not think this would be a problem.

For the way I want to design crew, adding a new backend should be as simple as coding up how to launch and terminate a worker.
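
For example, a backend plugin could boil down to something like the following sketch. The names and the callr-based launcher are just placeholders standing in for a real platform-specific launch command:

```r
library(callr)

# Hypothetical minimal backend: all a platform needs to supply is how to
# launch a worker process and how to terminate it.
local_backend <- list(
  launch = function() r_bg(function() Sys.sleep(30)),  # stand-in for a real worker launch
  terminate = function(handle) handle$kill()
)

handle <- local_backend$launch()
local_backend$terminate(handle)
```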

wlandau (Owner, Author) commented Mar 9, 2023

Closing the original thread because I think I have the terminology figured out in #31 (comment).

wlandau closed this as completed Mar 9, 2023