julia_v_slurm_MWE

Minimal working example (MWE) that reproduces a malfunction of Julia's Distributed workers on a Slurm cluster.

In this example, we use our standard procedure to launch parallel jobs on a Slurm cluster.

functions.jl contains a function with an infinite parallel loop that keeps the workers busy during the test.
job_julia.jl launches the parallel Julia workers across all nodes. The option :auto in my_procs means that each node gets as many workers as it has physical cores.
job.slurm sets the sbatch options and launches the Julia code.

MWE

We submit the job from a terminal on the cluster access node:

$ bash job.slurm 2
Submitted batch job jobid
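
For scripting the steps of this test, it can help to capture the job ID instead of copying it by hand. The following is a minimal sketch, assuming the submission prints the usual "Submitted batch job <id>" line shown above; the helper name parse_job_id is hypothetical and not part of this repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the repository): extract the numeric job ID
# from sbatch-style output read on stdin.
parse_job_id() {
    # Print only the digits that follow "Submitted batch job ".
    sed -n 's/^Submitted batch job \([0-9][0-9]*\).*/\1/p'
}
```

A possible use is jobid=$(bash job.slurm 2 | parse_job_id), after which "$jobid" can be passed to squeue -j or scancel.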

We check that the job is running on the nodes listed in the machinefile. Then we cancel the job manually to simulate an error:

$ scancel jobid

Although the job has disappeared from the Slurm queue, the Julia workers are still running as zombies on the cluster nodes.
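
One way to confirm that workers survived the cancellation is to count them in a process listing from each node. This is only a sketch: it assumes the workers were started by Distributed with the --worker command-line flag, and the helper name count_julia_workers is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count Julia worker processes in a process listing
# read on stdin (one process per line, e.g. from `ps -o pid=,args=`).
# Assumption: workers carry the `--worker` flag on their command line.
count_julia_workers() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # the count is 0, so `|| true` keeps the function's status clean.
    grep -c 'julia.*--worker' || true
}
```

For example, ssh node01 ps -u "$USER" -o pid=,args= | count_julia_workers would report the number of leftover workers on a node named node01 (a placeholder hostname), assuming the same user name on both machines.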

We cannot close them from Julia with rmprocs(workers()...) because there is no Julia REPL on the access node. The only way to kill these processes is to ssh into each affected node and kill the zombie processes by hand.
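
The manual cleanup described above could be sketched as a loop over the nodes in the machinefile. This is an illustration under stated assumptions, not the procedure used in this repository: it assumes one plain hostname at the start of each machinefile line, passwordless ssh to the nodes, and workers started with the --worker flag. The function names and the DRY_RUN variable are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for cleaning up leftover Julia workers.

# Print the unique hostnames found in the first column of the machinefile,
# skipping blank lines.
list_nodes() {
    awk 'NF && !seen[$1]++ { print $1 }' "$1"
}

# Kill the current user's Julia workers on every node of the machinefile.
# With DRY_RUN=1 the ssh commands are printed instead of executed.
kill_leftover_workers() {
    local machinefile="$1" node
    # The [j] bracket keeps the pattern from matching the remote shell
    # that carries this very command line. $(id -u) is left unexpanded
    # here so that it is evaluated on the remote node.
    local remote_cmd='pkill -u "$(id -u)" -f "[j]ulia.*--worker"'
    for node in $(list_nodes "$machinefile"); do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "ssh $node $remote_cmd"
        else
            ssh "$node" "$remote_cmd" || echo "no workers killed on $node" >&2
        fi
    done
}
```

Running DRY_RUN=1 kill_leftover_workers machinefile first shows which commands would be sent, which is a sensible check before killing anything on shared nodes.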

CarlosP24/julia_v_slurm_MWE
