AWS ParallelCluster Node 2.5.0
We're excited to announce the release of AWS ParallelCluster Node 2.5.0.
This is associated with AWS ParallelCluster v2.5.0.
Enhancements
- Slurm:
  - Add support for scheduling with GPU options. Currently supports the following GPU-related options:
    -G/--gpus, --gpus-per-task, --gpus-per-node, --gres=gpu, --cpus-per-gpu (see the submission examples after this list).
  - Add gres.conf and slurm_parallelcluster_gres.conf in order to enable GPU options. slurm_parallelcluster_gres.conf is automatically generated by the node daemon and contains GPU information from compute instances. If you need to specify additional GRES options manually, modify gres.conf and avoid changing slurm_parallelcluster_gres.conf when possible (an illustrative gres.conf entry follows this list).
  - Integrated GPU requirements into the scaling logic: the cluster scales automatically to satisfy the GPU/CPU requirements of pending jobs. When submitting GPU jobs, CPU/node/task information is not required but is preferred in order to avoid ambiguity. If only GPU requirements are specified, the cluster scales up to the minimum number of nodes required to satisfy all GPU requirements.
  - Slurm daemons now keep running when the cluster is stopped, for better stability. However, submitting jobs while the cluster is stopped is not recommended.
  - Change jobwatcher logic to consider both GPUs and CPUs when making scaling decisions for Slurm jobs. In general, the cluster scales up to the minimum number of nodes needed to satisfy all GPU/CPU requirements (a minimal sketch of this computation follows this list).
- Reduce the number of calls made to the ASG by nodewatcher to avoid throttling, especially during cluster scale-down.
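As an illustration of the new GPU scheduling options, the two submissions below show a GPU-only request and the preferred fully specified form. The job script job.sh and the 4-GPU instance type are assumptions made for the example, not part of the release:

```sh
# GPU-only request: 8 GPUs in total for the job. On a compute fleet with
# 4 GPUs per instance, the cluster scales up to the minimum number of
# nodes that satisfies the request: 2 nodes.
sbatch --gpus=8 job.sh

# Preferred form: task and CPU information is also given, avoiding ambiguity.
sbatch --ntasks=4 --gpus-per-task=1 --cpus-per-gpu=4 job.sh
```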
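If additional GRES resources have to be declared manually, they belong in gres.conf, while slurm_parallelcluster_gres.conf should be left to the node daemon. A minimal, purely illustrative entry (the device paths are an assumption for a 4-GPU instance):

```
# gres.conf - manually maintained GRES entries only (illustrative example).
# Do not edit slurm_parallelcluster_gres.conf; the node daemon regenerates it.
Name=gpu File=/dev/nvidia[0-3]
```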
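The scaling decision described above can be pictured with a short sketch. This is not the actual jobwatcher code; it is a minimal illustration that assumes a homogeneous compute fleet and jobs that expose total GPU/CPU counts:

```python
import math

def nodes_needed(pending_jobs, gpus_per_node, vcpus_per_node):
    """Estimate the minimum number of nodes that satisfies both the GPU
    and the CPU requirements of all pending jobs (illustrative only)."""
    total_gpus = sum(job.get("gpus", 0) for job in pending_jobs)
    total_cpus = sum(job.get("cpus", 0) for job in pending_jobs)
    nodes_for_gpus = math.ceil(total_gpus / gpus_per_node)
    nodes_for_cpus = math.ceil(total_cpus / vcpus_per_node)
    # Scale to whichever requirement needs more nodes.
    return max(nodes_for_gpus, nodes_for_cpus)

# Two pending jobs asking for 6 GPUs and 16 vCPUs in total, on an assumed
# instance type with 4 GPUs and 32 vCPUs per node: 2 nodes are needed.
print(nodes_needed([{"gpus": 4, "cpus": 8}, {"gpus": 2, "cpus": 8}],
                   gpus_per_node=4, vcpus_per_node=32))
```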
Changes
- Increase the maximum number of SQS messages that sqswatcher can process in a single batch from 50 to 200. This improves scaling time, especially with increased ASG launch rates.
- Increase the faulty node termination timeout from 1 minute to 5 minutes, giving the scheduler additional time to recover when under heavy load.
Bug Fixes
- Fix jobwatcher behaviour that was marking nodes locked by nodewatcher as busy even if they had already been removed from the ASG desired count. In rare circumstances, this caused the cluster to overscale.
- Better handling of errors that occur when adding/removing nodes from the scheduler config.
- Fix a bug that caused failures in sqswatcher when ADD and REMOVE events for the same host were fetched together.
Support
Need help / have a feature request?
AWS Support: https://console.aws.amazon.com/support/home
ParallelCluster Issues tracker on GitHub: https://github.com/aws/aws-parallelcluster
The HPC Forum on the AWS Forums page: https://forums.aws.amazon.com/forum.jspa?forumID=192