Releases: aws/aws-parallelcluster-node

AWS ParallelCluster v2.7.0

19 May 08:37

We're excited to announce the release of AWS ParallelCluster Node 2.7.0

This is associated with AWS ParallelCluster v2.7.0

ENHANCEMENTS

  • sqswatcher: The daemon is now compatible with VPC Endpoints so that SQS messages can be passed without traversing the public internet.

AWS ParallelCluster v2.6.1

09 Apr 23:04

We're excited to announce the release of AWS ParallelCluster Node 2.6.1.

Upgrade

How to upgrade?

sudo pip install --upgrade aws-parallelcluster

ENHANCEMENTS

  • Improved the management of SQS messages and retries to speed-up recovery times when failures occur.
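The capped-retry behaviour can be sketched roughly as follows. This is a minimal illustration in Python; the function and constant names are hypothetical, not the daemon's actual internals:

```python
import time

# Illustrative sketch of capped retries with exponential backoff for SQS
# message processing; names are hypothetical, not sqswatcher's real API.
MAX_RETRIES = 3

def process_with_retries(message, handler, sleep=time.sleep):
    """Try handler(message) up to MAX_RETRIES times, backing off between
    attempts; return True on success, False if all attempts fail."""
    for attempt in range(MAX_RETRIES):
        try:
            handler(message)
            return True
        except Exception:
            sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s
    return False
```

Bounding the retries is what keeps recovery time predictable: a message that repeatedly fails is given up on instead of blocking the queue indefinitely.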

CHANGES

  • Do not launch a replacement for an unhealthy or unresponsive node until that node has been terminated. This makes the cluster slower at provisioning new nodes when failures occur, but prevents temporary over-scaling with respect to the expected capacity.
  • Increase parallelism when starting slurmd on compute nodes that join the cluster from 10 to 30.
  • Reduce the verbosity of messages logged by the node daemons.
  • Do not dump logs to /home/logs when nodewatcher encounters a failure and terminates the node. CloudWatch can be used to debug such failures.
  • Reduce the number of retries for failed REMOVE events in sqswatcher.

BUG FIXES

  • Fixed a bug in the ordering and retrying of SQS messages that was causing, under certain circumstances of heavy load, the scheduler configuration to be left in an inconsistent state.
  • Delete from the queue the REMOVE events that are discarded due to a hostname collision with another event fetched as part of the same sqswatcher iteration.
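The collision handling can be pictured with a small sketch. The helper name is hypothetical; the real sqswatcher batch logic differs in detail:

```python
def partition_events(batch):
    """Split a batch of (action, hostname) events: keep the first event
    seen per hostname, and collect colliding REMOVE events so they can be
    deleted from the queue instead of being retried forever.
    Illustrative sketch, not the daemon's actual code."""
    seen = {}       # hostname -> first event, in arrival order
    to_delete = []  # discarded REMOVEs to delete from the queue
    for action, host in batch:
        if host not in seen:
            seen[host] = (action, host)
        elif action == "REMOVE":
            to_delete.append((action, host))
        # other colliding events are simply dropped in this sketch
    return list(seen.values()), to_delete
```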

Support

Need help / have a feature request?
AWS Support: https://console.aws.amazon.com/support/home
ParallelCluster Issues tracker on GitHub: https://github.com/aws/aws-parallelcluster
The HPC Forum on the AWS Forums page: https://forums.aws.amazon.com/forum.jspa?forumID=192

AWS ParallelCluster v2.6.0

26 Feb 20:35

We're excited to announce the release of AWS ParallelCluster Node 2.6.0

This is associated with AWS ParallelCluster v2.6.0

Changes

  • Remove logic that was adding compute node identities to the known_hosts file for all OSs except CentOS 6

Bug Fixes

  • Fix Torque issue that was limiting the max number of running jobs to the max size of the cluster.

AWS ParallelCluster v2.5.1

13 Dec 16:33

We're excited to announce the release of AWS ParallelCluster Node 2.5.1.

This is associated with AWS ParallelCluster v2.5.1.

Bug Fixes

  • Fix a bug in sqswatcher that caused the daemon to crash when more than 100 DynamoDB tables are present in the cluster region.

AWS ParallelCluster v2.5.0

15 Nov 22:39
2b08d83

We're excited to announce the release of AWS ParallelCluster Node 2.5.0.

This is associated with AWS ParallelCluster v2.5.0.

Enhancements

  • Slurm:
    • Add support for scheduling with GPU options. Currently supports the following GPU-related options: —G/——gpus, ——gpus-per-task, ——gpus-per-node, ——gres=gpu, ——cpus-per-gpu.
    • Add gres.conf and slurm_parallelcluster_gres.conf in order to enable GPU options. slurm_parallelcluster_gres.conf is automatically generated by node daemon and contains GPU information from compute instances. If need to specify additional GRES options manually, please modify gres.conf and avoid changing slurm_parallelcluster_gres.conf when possible.
    • Integrated GPU requirements into scaling logic, cluster will scale automatically to satisfy GPU/CPU requirements for pending jobs. When submitting GPU jobs, CPU/node/task information is not required but preferred in order to avoid ambiguity. If only GPU requirements are specified, cluster will scale up to the minimum number of nodes required to satisfy all GPU requirements.
    • Slurm daemons will now keep running when cluster is stopped for better stability. However, it is not recommended to submit jobs when the cluster is stopped.
    • Change jobwatcher logic to consider both GPU and CPU when making scaling decision for slurm jobs. In general, cluster will scale up to the minimum number of nodes needed to satisfy all GPU/CPU requirements.
  • Reduce number of calls to ASG in nodewatcher to avoid throttling, especially at cluster scale-down.
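The GPU/CPU scaling decision reduces to a small piece of arithmetic, sketched here with illustrative names (not the actual jobwatcher code):

```python
import math

def nodes_required(gpus_req, cpus_req, gpus_per_node, cpus_per_node):
    """Minimum number of nodes satisfying both the GPU and the CPU demand
    of pending jobs; illustrative version of the scaling decision."""
    by_gpu = math.ceil(gpus_req / gpus_per_node) if gpus_per_node else 0
    by_cpu = math.ceil(cpus_req / cpus_per_node)
    # the cluster scales to the larger of the two requirements
    return max(by_gpu, by_cpu)
```

For example, a job asking for 8 GPUs on 4-GPU instances needs 2 nodes even if its CPU demand fits on one.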

Changes

  • Increase the maximum number of SQS messages that sqswatcher can process in a single batch from 50 to 200. This improves scaling time, especially with increased ASG launch rates.
  • Increase the faulty-node termination timeout from 1 minute to 5 minutes in order to give the scheduler some additional time to recover when under heavy load.

Bug Fixes

  • Fix jobwatcher behaviour that was marking nodes locked by the nodewatcher as busy even after they had already been removed from the ASG desired count. In rare circumstances, this was causing cluster over-scaling.
  • Better handling of errors that occur when adding/removing nodes from the scheduler config.
  • Fix a bug that was causing failures in sqswatcher when an ADD and a REMOVE event for the same host were fetched together.

AWS ParallelCluster v2.4.1

29 Jul 10:32
fc3ffe9

We're excited to announce the release of AWS ParallelCluster Node 2.4.1.

This is associated with AWS ParallelCluster v2.4.1.

Enhancements

  • Torque:
    • process nodes added to or removed from the cluster in batches in order to speed up cluster scaling.
    • scale up only if required slots/nodes can be satisfied
    • scale down if pending jobs have unsatisfiable CPU/nodes requirements
    • add support for jobs in hold/suspended state (this includes job dependencies)
    • automatically terminate and replace faulty or unresponsive compute nodes
    • add retries in case of failures when adding or removing nodes
    • add support for ncpus reservation and multi nodes resource allocation (e.g. -l nodes=2:ppn=3+3:ppn=6)
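The multi-node resource syntax above can be illustrated with a small parser sketch; the helper name is hypothetical, and the real Torque plugin handles more of the specification:

```python
def parse_nodes_resource(spec):
    """Parse a Torque -l nodes= specification such as '2:ppn=3+3:ppn=6'
    into (total_nodes, total_slots). Illustrative sketch only."""
    nodes = slots = 0
    for chunk in spec.split("+"):       # '+' separates node groups
        parts = chunk.split(":")
        count = int(parts[0])           # number of nodes in this group
        ppn = 1                         # processors per node, default 1
        for p in parts[1:]:
            if p.startswith("ppn="):
                ppn = int(p[4:])
        nodes += count
        slots += count * ppn
    return nodes, slots
```

So `-l nodes=2:ppn=3+3:ppn=6` asks for 5 nodes and 24 slots in total, which is what the scaling logic has to satisfy.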

Changes

  • Drop support for Python 2. Node daemons now support Python >= 3.5.
  • Torque: trigger a scheduling cycle every 1 minute when there are pending jobs in the queue. This is done in order
    to speed up jobs scheduling with a dynamic cluster size.

Bug Fixes

  • Restore logic that automatically adds compute node identities to the known_hosts file.
  • Slurm: fix issue that was causing the daemons to fail when the cluster is stopped and an empty compute nodes file
    is imported in Slurm config.
  • Torque: fix command to disable hosts in the scheduler before termination.

AWS ParallelCluster v2.4.0

11 Jun 15:31
9dbff99

We're excited to announce the release of AWS ParallelCluster Node 2.4.0.

This is associated with AWS ParallelCluster v2.4.0.

Enhancements

  • Dynamically fetch compute instance type and cluster size in order to support updates
  • SGE:
    • process nodes added to or removed from the cluster in batches in order to speed up cluster scaling.
    • scale up only if required slots/nodes can be satisfied
    • scale down if pending jobs have unsatisfiable CPU/nodes requirements
    • add support for jobs in hold/suspended state (this includes job dependencies)
    • automatically terminate and replace faulty or unresponsive compute nodes
    • add retries in case of failures when adding or removing nodes
  • Slurm:
    • scale up only if required slots/nodes can be satisfied
    • scale down if pending jobs have unsatisfiable CPU/nodes requirements
    • automatically terminate and replace faulty or unresponsive compute nodes
  • Dump logs of replaced failing compute nodes to shared home directory

Changes

  • SQS messages that fail to be processed are re-queued at most 3 times, rather than indefinitely
  • Reset idletime to 0 when the host becomes essential for the cluster (because of min size of ASG or because there are
    pending jobs in the scheduler queue)
  • SGE: a node is considered busy when it is in one of the following states: "u", "C", "s", "d", "D", "E", "P", "o".
    This allows a quick replacement of the node without waiting for the nodewatcher to terminate it.
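The busy-state check amounts to set membership, roughly as follows (illustrative helper, not the daemon's actual code):

```python
# Busy states listed in the release note; a qstat state string may
# contain several state characters at once (e.g. "au").
SGE_BUSY_STATES = {"u", "C", "s", "d", "D", "E", "P", "o"}

def node_is_busy(state_string):
    """True if any character of the SGE state string is a busy state;
    hypothetical helper for illustration."""
    return any(s in SGE_BUSY_STATES for s in state_string)
```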

Bug Fixes

  • Slurm: add "BeginTime", "NodeDown", "Priority" and "ReqNodeNotAvail" to the pending reasons that trigger
    a cluster scaling
  • Add a timeout on remote commands execution so that the daemons are not stuck if the compute node is unresponsive
  • Fix an edge case that was causing the nodewatcher to hang forever in case the node had become essential to the
    cluster during a call to self_terminate.
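The remote-command timeout can be sketched with Python's subprocess module; assume the daemons wrap the actual remote execution differently:

```python
import subprocess

def run_with_timeout(cmd, timeout=60):
    """Run a command with a hard timeout so the daemon cannot hang on an
    unresponsive node; return its stdout, or None on timeout.
    Illustrative sketch, not the daemons' real execution wrapper."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout)
        return result.stdout
    except subprocess.TimeoutExpired:
        return None  # caller treats the node as unresponsive
```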

AWS ParallelCluster 2.3.1

03 Apr 09:00
9fae22d

We're excited to announce the release of AWS ParallelCluster Node 2.3.1.

This is associated with AWS ParallelCluster v2.3.1.

Changes

  • sqswatcher: Slurm - dynamically adjust max cluster size based on ASG settings
  • sqswatcher: Slurm - use FUTURE state for dummy nodes to prevent the Slurm daemon from contacting nonexistent nodes
  • sqswatcher: Slurm - dynamically change the number of configured FUTURE nodes based on the actual nodes that join the cluster. The max size of the cluster seen by the scheduler always matches the max capacity of the ASG.
  • sqswatcher: Slurm - process nodes added to or removed from the cluster in batches. This speeds up cluster scaling which is able to react with a delay of less than 1 minute to variations in the ASG capacity.
  • sqswatcher: Slurm - add support for job dependencies and pending reasons. The cluster won't scale up if the job cannot start due to an unsatisfied dependency.
  • Slurm - set ReturnToService=1 in scheduler config in order to recover instances that were initially marked as down due to a transient issue.
  • sqswatcher: remove DynamoDB table creation
  • improve and standardize shell command execution
  • add retries on failures and exceptions
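Keeping the scheduler's view of the maximum cluster size in sync with the ASG reduces to simple arithmetic, sketched here with hypothetical names:

```python
def future_nodes_needed(asg_max_size, joined_nodes):
    """Number of FUTURE placeholder nodes to keep configured so that the
    max cluster size seen by Slurm matches the ASG max capacity.
    Illustrative arithmetic, not the daemon's actual code."""
    return max(asg_max_size - joined_nodes, 0)
```

As real nodes join the cluster, the count of FUTURE placeholders shrinks accordingly, so the scheduler never addresses more nodes than the ASG can provide.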

Bug Fixes

  • sqswatcher: Slurm - set compute nodes to DRAIN state before removing them from cluster. This prevents the scheduler from submitting a job to a node that is being terminated.
  • sqswatcher: Slurm - Fix host removal

AWS ParallelCluster 2.2.1

28 Feb 13:47

We're excited to announce the release of AWS ParallelCluster Node 2.2.1.

This is associated with AWS ParallelCluster v2.2.1.

Features

  • Support for FSx Lustre with Centos 7
  • Check AWS EC2 account limits before starting cluster creation
  • Allow users to force job deletion with SGE scheduler

Changes

  • Set default value to compute for placement_group option
  • pcluster ssh: use private IP when the public one is not available
  • pcluster ssh: now works also when stack is not completed as long as the master IP is available

Bugfixes

  • awsbsub: fix file upload with absolute path
  • pcluster ssh: fix issue that was preventing the command from working correctly when stack status is UPDATE_ROLLBACK_COMPLETE
  • Fix block device conversion to correctly attach EBS nvme volumes
  • Wait for Torque scheduler initialization before completing master node setup
  • pcluster version: now works also when no ParallelCluster config is present
  • Improve nodewatcher daemon logic to detect whether an SGE compute node has running jobs

AWS ParallelCluster 2.1.1

08 Jan 14:38

We're excited to announce the release of AWS ParallelCluster Node 2.1.1.

This is associated with AWS ParallelCluster v2.1.1.

Features

  • Support for AWS Beijing Region (cn-north-1) and Ningxia Region (cn-northwest-1)

Bugfixes

  • No longer schedule jobs on compute nodes that are terminating
