Performance study for HPC applications across Google, AWS, and Azure clouds ☁️

converged-computing/performance-study

Performance Study

This study tested HPC application performance across three clouds and on-premises HPC. The repository is organized as follows:

  • docker: includes container builds for different environments. Containers are shared between environments when possible to reduce redundancy.

    • google: includes Google builds for each of CPU and GPU.
    • aws: includes AWS builds for each of CPU and GPU. The distinguishing feature is building with libfabric for EFA.
    • azure: includes Microsoft Azure builds for each of CPU and GPU, targeting InfiniBand.
  • experiments: organized first by cloud, and then by the underlying environment. Each includes a README with the full experiment protocol (and usually the commands to run).

    • Google Cloud includes HPC Toolkit (Compute Engine), and GKE (Kubernetes) for each of CPU and GPU.
    • Amazon Web Services includes ParallelCluster (EC2), and EKS (Kubernetes) for each of CPU and GPU.
    • Microsoft Azure includes CycleCloud (VMs), and AKS (Kubernetes) for each of CPU and GPU.
  • analysis: includes preliminary plots for data exploration. Note that not all are finalized.

  • paper: includes a subset of plots that have been cleaned up and refined for use in publications.

Experiments

"Bare Metal"

  • Microsoft Azure CycleCloud CPU (date)
    • size 32 (abhik done 6 apps 8/28/2024, done milroy 8/30/2024)
    • size 64 (abhik done 6 apps 8/28/2024, done milroy 8/30/2024)
    • size 128 (done milroy 8/30/2024)
    • size 256 (done milroy 8/31/2024)
  • Microsoft Azure CycleCloud GPU (date)
    • size 4 (milroy and ani 8/31/2024)
    • size 8 (milroy and ani 8/31/2024)
    • size 16 (milroy and ani 8/31/2024)
    • size 32 (milroy and ani 8/31/2024)
  • AWS GPU Parallel Cluster
    • size 32 (not going to do, could not build image)
    • size 64 (not going to do, could not build image)
    • size 128 (not going to do, could not build image)
    • size 256 (not going to do, could not build image)
  • AWS CPU Parallel Cluster
    • size 32 (done milroy 8/29/2024-8/30/2024)
    • size 64 (done ani 8/29/2024-8/30/2024)
    • size 128 (done ani 8/29/2024-8/30/2024)
    • size 256 (done ani 8/29/2024-8/30/2024)
  • Google Cloud Compute Engine CPU (redone several times due to app configurations)
    • size 32 (vsoch done 8/26/2024)
    • size 64 (vsoch done 8/26/2024)
    • size 128 (vsoch done 8/27/2024)
    • size 256 (vsoch done 8/27/2024)
  • Google Compute Engine GPU
    • done on llnl-flux
    • New VM and automation needed with Terraform (vsoch, early 9/2024)
    • size 4 (vsoch 9/6/2024)
    • size 8 (vsoch 9/7/2024)
    • size 16 (vsoch 9/8/2024)
    • size 32 (vsoch 9/8/2024)
    • quicksilver and osu all reduce need runs at all sizes (vsoch 9/9/2024)

Kubernetes

  • Microsoft Azure AKS CPU
    • size 32 (vsoch done 8/24/2024), redone with placement (vsoch 8/28/2024)
    • size 64 (vsoch done 8/24/2024), redone with placement (vsoch 8/28/2024)
    • size 128 (vsoch done 8/28/2024)
    • size 256 (vsoch TBA 8/29/2024)
  • Google Cloud GKE CPU
    • size 32 (vsoch done 8/21/2024)
    • size 64 (vsoch done 8/22/2024)
    • size 128 (vsoch done 8/23/2024)
    • size 256 (vsoch done 8/23/2024)
  • AWS CPU EKS
    • size 32 (vsoch done 8/21/2024-8/22/2024)
    • size 64 (vsoch done 8/22/2024)
    • size 128 (vsoch done 8/22/2024)
    • size 256 (vsoch done on 8/31/2024)
  • AWS GPU EKS
    • size 4 (done vsoch 8/26/2024, milroy lammps/osu 8/27/2024)
    • size 8 (done vsoch 8/26/2024, milroy lammps/osu 8/27/2024)
    • size 16 (done vsoch, milroy lammps/osu 8/27/2024)
    • size 32 not possible, could not get more than 16 nodes from AWS
  • Google Cloud GKE GPU
    • size 4 (done vsoch 8/29/2024)
    • size 8 (done vsoch TBA 8/29/2024)
    • size 16 (done vsoch 8/30/2024)
    • size 32 (done vsoch 8/30/2024)
    • milroy figured out installing latest drivers - key to success here!
  • Microsoft Azure AKS GPU
    • size 4 (done vsoch 8/31/2024)
    • size 8 (done vsoch 8/31/2024)
    • size 16 (done vsoch 8/31/2024)
    • size 32 (done vsoch 8/31/2024)
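
The checklists above amount to a matrix of cloud, environment, processor type, and size. As a rough illustration only (not part of the study's tooling), the matrix can be enumerated with a short Python sketch. The identifiers `SIZES`, `ENVIRONMENTS`, `is_skipped`, and `experiment_matrix` are hypothetical names chosen for this example, as are the lowercase environment labels. The sketch assumes every GPU setup uses sizes 4 to 32, even though the AWS GPU Parallel Cluster entry above lists 32 to 256; that whole setup is treated as skipped either way.

```python
from itertools import product

# Sizes per processor type, taken from the checklists above.
# Assumption: all GPU setups use 4-32 and all CPU setups use 32-256.
SIZES = {"cpu": (32, 64, 128, 256), "gpu": (4, 8, 16, 32)}

# Environments per cloud; the labels are hypothetical shorthand for the
# "bare metal" and Kubernetes setups named in this README.
ENVIRONMENTS = {
    "google": ("compute-engine", "gke"),
    "aws": ("parallel-cluster", "eks"),
    "azure": ("cyclecloud", "aks"),
}


def is_skipped(cloud, env, kind, size):
    """Return True for combinations the checklists mark as not run."""
    # AWS GPU Parallel Cluster: the image could not be built at any size.
    if (cloud, env, kind) == ("aws", "parallel-cluster", "gpu"):
        return True
    # AWS GPU EKS: no more than 16 nodes could be obtained.
    return (cloud, env, kind, size) == ("aws", "eks", "gpu", 32)


def experiment_matrix():
    """Yield every (cloud, environment, kind, size) combination."""
    for cloud, envs in ENVIRONMENTS.items():
        for env, kind in product(envs, SIZES):
            for size in SIZES[kind]:
                yield cloud, env, kind, size


if __name__ == "__main__":
    for combo in experiment_matrix():
        status = "skipped" if is_skipped(*combo) else "planned"
        print(*combo, status)
```

Under these assumptions the full matrix has 48 combinations, of which 5 are marked as skipped.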

License

HPCIC DevTools is distributed under the terms of the MIT license. All new contributions must be made under this license.

See LICENSE, COPYRIGHT, and NOTICE for details.

SPDX-License-Identifier: (MIT)

LLNL-CODE-842614