
Image build


The appliance always uses the Ansible in ansible/site.yml, but it can be used in two ways:

  1. To directly configure nodes (baremetal or VM).
  2. To build compute or login node images using Packer, which can subsequently be deployed to nodes.

Note that building control node images is not currently supported.
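
A rough sketch of the two invocations, with the target environment activated (the Packer template and variable-file names below are assumptions; check the packer/ directory in your checkout for the actual names):

    # 1. Direct configuration of existing nodes:
    $ ansible-playbook ansible/site.yml

    # 2. Building compute/login node images with Packer (illustrative file names):
    $ cd packer
    $ packer build -var-file=<environment>/builder.pkrvars.hcl openstack.pkr.hcl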

It is recommended that:

  • An initial deployment is done by directly configuring nodes, probably on a smaller cluster than the final goal. This enables any bugs in config, networking, DNS etc. to be worked out. The ansible/adhoc/hpctests.yml playbook can be run to check that hardware/software stack performance is as expected (see the example after this list).
  • Once happy, it is strongly recommended that images are built and deployed to the final cluster. This ensures that nodes can always be replaced and protects against changes in upstream repos (e.g. OpenHPC releases) which make subsequent directly-configured nodes incompatible with the existing cluster.
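
For example, assuming the relevant environment has been activated:

    $ ansible-playbook ansible/adhoc/hpctests.yml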

If using dev/staging/prod environments consider:

  • Using direct configuration in dev
  • Testing built images in staging
  • Only deploying images in prod

Testing images in a staging environment needs to consider the cluster-specific state built into the images. This is typically:

  • The address for the slurmctld, any filesystem server addresses and monitoring service addresses
  • The contents of the environment's secrets.yml such as the Munge key

By default, addresses are defined in the appliance config using the inventory_hostname for the relevant node. With working DNS it should therefore be possible to create a staging cluster with names matching production in a separate network, which can run unmodified production images. The only differences will be in the definition of the Slurm partitions (assuming a different cluster size) on the Slurm control node. Alternatively, compute node images could be tested in a temporary partition of a production cluster, and login node images tested in a similar way, just by modifying the partition/login node definitions.

By default, yum update is enabled in Packer builds but not for direct configuration. This ensures that running Ansible against an existing cluster does not (by default) perform updates. To perform updates during direct configuration (i.e. on a live cluster), override the variables defined in environments/common/inventory/group_vars/all/update.yml; the ansible/adhoc/update-packages.yml playbook can also be used to avoid running the whole of ansible/site.yml.
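
As a minimal sketch, an environment-specific override might look like the following (the variable name is illustrative; see environments/common/inventory/group_vars/all/update.yml for the actual variable names):

    # environments/<environment>/inventory/group_vars/all/update.yml
    update_enable: true

Package updates can then be run on their own with:

    $ ansible-playbook ansible/adhoc/update-packages.yml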

Alternatively, updates could be performed on a pre-built image either by:

  • Booting the image, making changes (manually or via Ansible) and snapshotting it (see the sketch after this list).

  • Using tools such as virt-customize:

    $ virt-customize -a <image.qcow2> --run-command "dnf upgrade -y <pkg_name>"
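
As a sketch of the first approach on an OpenStack cloud (server and image names are placeholders):

    $ openstack server image create --name <new-image-name> <booted-server>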
    