
Erroneous resource allocation in partition "E880" #60

Open
pantaray opened this issue Jun 21, 2024 · 1 comment
Assignees: pantaray
Labels: bug

@pantaray (Member)

Describe the problem
Allocating a distributed computing client with custom CPU/mem settings in the E880 partition does not actually allocate the specified resources.

Steps To Reproduce

from datetime import timedelta

from acme import ParallelMap, esi_cluster_setup

myClient = esi_cluster_setup(n_workers=3, cores_per_worker=16, mem_per_worker="5GB",
                             partition="E880", timeout=timedelta(minutes=10).total_seconds(),
                             n_workers_startup=1, verbose=True, debug=True)

This produces sbatch scripts that are missing the CPU specification and therefore fall back to the default core allocation:

#!/usr/bin/env bash
#SBATCH -J dask-worker
#SBATCH -p E880
#SBATCH -n 1
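
For reference, the script above can be reproduced without submitting any jobs; the snippet below assumes the client returned by esi_cluster_setup exposes the underlying dask-jobqueue cluster through the standard cluster attribute:

# Print the sbatch script dask-jobqueue would submit (no job is launched)
print(myClient.cluster.job_script())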

Additional Information
Changing the underlying SLURMCluster call to

from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(queue="E880", cores=16, memory="8GB", processes=16)

fixes the problem:

scontrol show job 26317414
JobId=26317414 JobName=dask-worker
   ...
   NumNodes=1 NumCPUs=16 NumTasks=1 CPUs/Task=16 ReqB:S:C:T=0:0:*:*
   TRES=cpu=16,mem=8G,node=1,billing=16
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=16 MinMemoryNode=8G MinTmpDiskNode=0
   ...

A possible fix in esi_cluster_setup could be the following change to line 203:

processes_per_worker = kwargs.pop("processes_per_worker", cores_per_worker)
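
For context, here is a minimal sketch of how that default could feed into the underlying SLURMCluster call; the surrounding invocation is an illustrative assumption, not the actual code in acme/dask_helpers.py:

# Hypothetical surrounding code: with processes_per_worker defaulting to
# cores_per_worker, the full CPU request is carried into the sbatch script
processes_per_worker = kwargs.pop("processes_per_worker", cores_per_worker)
cluster = SLURMCluster(queue=partition,
                       cores=cores_per_worker,
                       memory=mem_per_worker,
                       processes=processes_per_worker)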
pantaray added the bug label Jun 21, 2024
pantaray self-assigned this Jun 21, 2024
@pantaray (Member, Author)

An additional bug emerged here: job_directives_skip removes every line from the generated sbatch script that contains the specified string, e.g.,

cluster = SLURMCluster(cores=32, memory="4000MB", processes=4, queue="E880",
                       job_cpu=56, job_directives_skip=['--mem'])

print(cluster.job_script())

#!/usr/bin/env bash
#SBATCH -J dask-worker
#SBATCH -p E880
#SBATCH -n 1
#SBATCH --cpus-per-task=56
#SBATCH -t 00:30:00

cluster = SLURMCluster(cores=32, memory="4000MB", processes=4, queue="E880",
                       job_cpu=56, job_directives_skip=['t'])

print(cluster.job_script())

#!/usr/bin/env bash
#SBATCH -J dask-worker
#SBATCH -p E880
#SBATCH -n 1
#SBATCH --mem=4G
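
Here every #SBATCH line containing the letter "t" is dropped, including --cpus-per-task and the -t walltime directive. Assuming the matching really is a plain substring comparison on each generated line, a workaround sketch is to pass the complete directive text instead of a bare fragment:

# Skip only the intended directive by matching its full text
cluster = SLURMCluster(cores=32, memory="4000MB", processes=4, queue="E880",
                       job_cpu=56, job_directives_skip=['-t 00:30:00'])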

pantaray added a commit that referenced this issue Jun 26, 2024
- modified `SLURMCluster` invocation so that resource specs are
  correctly propagated to SLURM workers (closes #60)
- fixed type declaration in `slurm_cluster_setup`: `mem_per_worker` can
  be `None`
- adapted corresponding test

On branch dev
Changes to be committed:
	modified:   acme/dask_helpers.py
	modified:   acme/tests/test_dask.py