Queue jobs in internal queue instead of dumping all jobs on cluster at once #46

sandeepklr · 2015-06-21T14:53:34Z

Hi Dan,

I have modified the code to add an internal queue that caps the number of jobs running on the cluster at any one time.

Please find below a summary of the changes:

- Use the max_processes parameter as the maximum number of cluster jobs to run at once.
- Create a Session object before starting the JobMonitor and embed the Session object in the job monitor.
  - Everywhere that used a session_id now uses the embedded session object in the JobMonitor.
- Function _submit_jobs() is no longer used. All jobs are submitted from the JobMonitor using _append_job_to_session().
- The check_alive() function has been refactored into two functions, check_alive() and check_job_status():
  - check_alive() is still called every time the local heartbeat is received.
  - check_alive() goes through the queue and looks for jobs to remove, either because they have finished or because they have hit the maximum number of resubmits after errors. Depending on the number of empty slots, new jobs are spun up.
- all_jobs_done() is now simplified to just check that ALL jobs have been processed on the cluster.
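
For readers skimming the summary above, here is a minimal sketch of the running-window idea. It is not the code in this PR: the class `WindowedQueue`, the methods `fill_slots()` and `process_queue()`, the `MAX_RESUBMITS` constant, and the string statuses are all hypothetical names chosen for illustration. Only `max_processes` and `all_jobs_done()` are names borrowed from the description.

```python
from collections import deque

MAX_RESUBMITS = 3  # hypothetical cap on resubmits after errors


class WindowedQueue:
    """Keep at most max_processes jobs on the cluster at any time."""

    def __init__(self, jobs, max_processes):
        self.pending = deque(jobs)   # jobs not yet submitted
        self.running = []            # jobs currently occupying a slot
        self.done = []               # jobs finished or given up on
        self.resubmits = {}          # job -> number of resubmits so far
        self.max_processes = max_processes

    def fill_slots(self, submit):
        # Submit pending jobs until the window is full or nothing is left.
        while self.pending and len(self.running) < self.max_processes:
            job = self.pending.popleft()
            submit(job)
            self.running.append(job)

    def process_queue(self, status_of, submit):
        # Drop finished jobs and jobs that exhausted their resubmits,
        # resubmit failed jobs in place, then refill the empty slots.
        still_running = []
        for job in self.running:
            status = status_of(job)
            if status == "finished":
                self.done.append(job)
            elif status == "error":
                count = self.resubmits.get(job, 0)
                if count < MAX_RESUBMITS:
                    self.resubmits[job] = count + 1
                    submit(job)
                    still_running.append(job)
                else:
                    self.done.append(job)
            else:
                still_running.append(job)
        self.running = still_running
        self.fill_slots(submit)

    def all_jobs_done(self):
        # Done only when nothing is pending and nothing is running.
        return not self.pending and not self.running
```

In this sketch a caller would invoke `fill_slots()` once at startup and `process_queue()` on each heartbeat, passing in whatever submission and status-lookup callables the scheduler binding provides.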


landscape-bot · 2015-06-21T14:57:35Z

Repository health increased by 25% when pulling 174ab9b on sandeepklr:master into c291881 on pygridtools:master.

landscape-bot · 2015-06-23T05:52:58Z

Repository health increased by 26% when pulling a3a0b7b on sandeepklr:master into c291881 on pygridtools:master.


desilinguist · 2021-04-26T22:14:35Z

Hi @sandeepklr, can you please refresh this PR if you are still interested in merging it? Thanks!

Create a running-window queue to batch submission jobs instead of dumping all jobs on the cluster queue at once.

174ab9b

Fix modified _append_to_session() method call in _resubmit().

a3a0b7b

sandeepklr added 2 commits July 6, 2015 16:18

Introducing a MAX_BOOTUP_TIME variable to account for jobs never starting up.

1f0f403

Kill zombie jobs even when not attempting to resubmit them.

946e3e5
