Different method names than "occupied()" and "unoccupied()" #32
Comments
Leaving aside the toilet humour...
Yeah,
Or maybe
I think the most natural might actually be I love the idea of calling For
For the workers/servers, the more specific and technical versions of these terms help me think about what is going on and wrap my head around issues like #31. For #31 and auto-scaling, I also need to consider other aspects of a worker/server than just "connected". I need to be able to determine if a worker is "active", i.e. able to accept a task. I need to know how many "active" workers there are in relation to the user-defined worker limit and the number of tasks in the queue. "active" is much more complicated than just "connected" because:
So I want to use "joined" to describe a worker which
So when
I think I will need to add a strong disclaimer in
@shikokuchuo, am I missing something?
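
To make the "connected" vs. "active" distinction concrete, here is a minimal sketch in R of how a launcher might track each worker and decide how many new workers to launch. This is not crew's actual implementation; the record fields and function names (`new_worker()`, `is_active()`, `workers_to_launch()`) are hypothetical.

```r
# Hypothetical worker record; field names are illustrative, not crew's API.
new_worker <- function(handle) {
  list(
    handle    = handle,     # job handle from the launcher, e.g. a SLURM job ID
    launched  = Sys.time(), # when the launch was requested
    connected = FALSE,      # TRUE once the worker has dialed into the websocket
    busy      = FALSE       # TRUE while the worker is running a task
  )
}

# A worker is "active" if it can accept a task right now:
# it has connected and is not currently busy.
is_active <- function(worker) {
  isTRUE(worker$connected) && !isTRUE(worker$busy)
}

# Auto-scaling decision: launch new workers only up to the user-defined
# worker limit, and only for queued tasks that active workers cannot
# already absorb.
workers_to_launch <- function(workers, queued_tasks, worker_limit) {
  active <- sum(vapply(workers, is_active, logical(1)))
  max(0L, min(worker_limit - length(workers), queued_tasks - active))
}
```

In this framing, "connected" is just one boolean, while "active" combines the connection state with availability, which is why the auto-scaling logic needs more than the connection status alone.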
Hmmm... maybe "joined" and "connected" are too similar. I need a word for "has connected at some point since it launched"...
Using
I would strongly prefer not to do that because it would need to be called every time a task is pushed, and frequent pushes could overwhelm SLURM. Also, I am not sure the direct job status would be as useful as the websocket connection status.
That's the most extreme queue time I have ever heard of. How busy is the cluster when that happens? Do you have a way to check empirically? For the majority of cases, I would expect the queue and startup time to be only a few minutes at most.
If it takes a day for SLURM to start the job, versus only a few seconds to start R and connect to the websocket, then polling the connection versus polling SLURM would agree for almost the entire startup duration.
This is definitely happening when the cluster is busy, and I don't think there's anything wrong with the behavior. In those situations Slurm will often start one worker at a time over a period of several hours. I think there may be a mechanism in place where my priority goes down as the number of jobs I am already running goes up. Recently I started three pipelines at almost the same time, and Slurm started all the workers for the first pipeline before starting on the second, then all the workers for the second pipeline before the third. The third pipeline was the case where the master had to wait more than 1 day for the first worker to report in.
Yes. The startup time will be configurable, so you could set it to 24 hours or even longer.
I agree that, as long as the worker has reported back before the "max startup time", there is no need to ask SLURM anything. I am mostly thinking of the case where the worker has passed the maximum startup time and still has not connected.
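
A rough sketch of that policy, continuing the hypothetical worker record above (again, not crew's API): the launcher only acts on the scheduler after the configurable startup window has expired, so SLURM is never polled on every task push.

```r
# The startup window has expired if the worker still has not connected after
# the configurable startup timeout (in seconds).
startup_expired <- function(worker, startup_timeout = 3600) {
  !isTRUE(worker$connected) &&
    difftime(Sys.time(), worker$launched, units = "secs") > startup_timeout
}

# Only when the window expires does the launcher touch the scheduler at all,
# calling a backend-specific terminate function (e.g. an scancel call for
# SLURM) so a replacement worker can be launched.
reconcile_worker <- function(worker, terminate, startup_timeout = 3600) {
  if (startup_expired(worker, startup_timeout)) {
    terminate(worker$handle)
  }
  worker
}
```

A configurable policy could then be as small as choosing the `startup_timeout` value and the `terminate` behavior per backend.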
That seems like a reasonable solution. If I want
That is a good idea. I will consider a configurable policy for startup timeouts.
I think the outcome would be the same in all 3 cases: after all this time, the worker still has not connected to the websocket, and so we need to make sure it is terminated. In the case of an already terminated worker, with a descriptive enough job name, I do not think this would be a problem. For the way I want to design crew, adding a new backend should be as simple as coding up how to launch and terminate a worker.
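
As a hedged illustration of that design, a backend could reduce to two functions; the list structure and the exact `sbatch`/`scancel` arguments below are illustrative, not crew's real interface.

```r
# Hypothetical minimal backend: all it supplies is how to launch a worker and
# how to terminate it. Connection tracking, auto-scaling, and startup timeouts
# stay in the shared launcher logic.
slurm_backend <- list(
  launch = function(command, name) {
    # Submit a batch job that runs the worker process; a descriptive job name
    # makes the worker easy to find and cancel later.
    system2("sbatch", c("--job-name", name, "--wrap", shQuote(command)))
  },
  terminate = function(name) {
    # Cancel by job name; if the job already finished, this is a harmless no-op.
    system2("scancel", c("--name", name))
  }
)
```

A local backend could implement the same two functions with, say, `system2()` process calls, without the rest of the launcher changing.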
Closing the original thread because I think I have the terminology figured out in #31 (comment).