
Erroneous resource allocation in partition "E880" #60

Open
pantaray opened this issue Jun 21, 2024 · 1 comment
Assignees: pantaray
Labels: bug

@pantaray (Member)

Describe the problem
Allocating a distributed computing client with custom CPU/mem settings in the E880 partition does not actually allocate the specified resources.

Steps To Reproduce

from datetime import timedelta

from acme import ParallelMap, esi_cluster_setup

myClient = esi_cluster_setup(n_workers=3, cores_per_worker=16, mem_per_worker="5GB",
                             partition="E880", timeout=timedelta(minutes=10).total_seconds(),
                             n_workers_startup=1, verbose=True, debug=True)

This produces sbatch scripts that are missing the CPU specification and therefore fall back to the default core allocation:

#!/usr/bin/env bash
#SBATCH -J dask-worker
#SBATCH -p E880
#SBATCH -n 1
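
For reference, the script above can be reproduced without submitting any jobs; the snippet below assumes the client returned by esi_cluster_setup exposes the underlying dask-jobqueue cluster through the standard cluster attribute:

# Print the sbatch script dask-jobqueue would submit (no job is launched)
print(myClient.cluster.job_script())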

Additional Information
Changing the underlying SLURMCluster call to

from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(queue="E880", cores=16, memory="8GB", processes=16)

fixes the problem:

scontrol show job 26317414
JobId=26317414 JobName=dask-worker
   ...
   NumNodes=1 NumCPUs=16 NumTasks=1 CPUs/Task=16 ReqB:S:C:T=0:0:*:*
   TRES=cpu=16,mem=8G,node=1,billing=16
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=16 MinMemoryNode=8G MinTmpDiskNode=0
   ...

A possible fix in esi_cluster_setup could be the following change to line 203:

processes_per_worker = kwargs.pop("processes_per_worker", cores_per_worker)
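
For context, here is a minimal sketch of how that default could feed into the underlying SLURMCluster call; the surrounding invocation is an illustrative assumption, not the actual code in acme/dask_helpers.py:

# Hypothetical surrounding code: with processes_per_worker defaulting to
# cores_per_worker, the full CPU request is carried into the sbatch script
processes_per_worker = kwargs.pop("processes_per_worker", cores_per_worker)
cluster = SLURMCluster(queue=partition,
                       cores=cores_per_worker,
                       memory=mem_per_worker,
                       processes=processes_per_worker)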
pantaray added the bug label Jun 21, 2024
pantaray self-assigned this Jun 21, 2024
@pantaray (Member, Author)

An additional bug emerged here: job_directives_skip removes every line from the generated sbatch script that contains the specified string, e.g.,

cluster = SLURMCluster(cores=32, memory="4000MB", processes=4, queue="E880",
                       job_cpu=56, job_directives_skip=['--mem'])

print(cluster.job_script())

#!/usr/bin/env bash
#SBATCH -J dask-worker
#SBATCH -p E880
#SBATCH -n 1
#SBATCH --cpus-per-task=56
#SBATCH -t 00:30:00

cluster = SLURMCluster(cores=32, memory="4000MB", processes=4, queue="E880",
                       job_cpu=56, job_directives_skip=['t'])

print(cluster.job_script())

#!/usr/bin/env bash
#SBATCH -J dask-worker
#SBATCH -p E880
#SBATCH -n 1
#SBATCH --mem=4G
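
Here every #SBATCH line containing the letter "t" is dropped, including --cpus-per-task and the -t walltime directive. Assuming the matching really is a plain substring comparison on each generated line, a workaround sketch is to pass the complete directive text instead of a bare fragment:

# Skip only the intended directive by matching its full text
cluster = SLURMCluster(cores=32, memory="4000MB", processes=4, queue="E880",
                       job_cpu=56, job_directives_skip=['-t 00:30:00'])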

pantaray added a commit that referenced this issue Jun 26, 2024
- modified `SLURMCluster` invocation so that resource specs are
  correctly propagated to SLURM workers (closes #60)
- fixed type declaration in `slurm_cluster_setup`: `mem_per_worker` can
  be `None`
- adapted corresponding test

On branch dev
Changes to be committed:
	modified:   acme/dask_helpers.py
	modified:   acme/tests/test_dask.py