Setup guide for an InfiniBand HPC cluster based on Dell PowerEdge nodes running Ubuntu Server 20.04 LTS. We will use Slurm as a job scheduler.
Warning: if you have a ConnectX-3 InfiniBand card, you must use Ubuntu Server 20.04. Later versions of Ubuntu no longer support drivers for these cards. If you have a ConnectX-4 card, this should not be a problem and you can upgrade to newer versions of the OS. Check the compatibility of your card before continuing!
As an example, this guide was written for the following hardware configuration:
- Dell PowerEdge R620 master node (2x 8-core Intel Xeon E5-2680 w/ 128 GB RAM, RAID support)
- 4x Dell PowerEdge C6220 slave nodes (2x 8-core Intel Xeon E5-2680 w/ 256 GB RAM)
- Mellanox InfiniBand switch
- Create at least 1 USB key with Ubuntu Server 20.04 LTS
Note: Rufus may sometimes cause issues; it is recommended to create the stick with UNetbootin instead
- If the servers have RAID cards, you need to configure the disks in the RAID BIOS settings first, otherwise the installer will not be able to see the disks. For the Dell PowerEdge R620 master node, follow this guide
- Boot from USB and install the OS with the default options (1 single partition taking up all the disk space, untoggle the LVM checkmark)
- Enable OpenSSH in settings, leave everything else as default
This is necessary for being able to connect between nodes via ssh using the InfiniBand interface. The following must be done on all nodes:
sudo apt install rdma-core opensm infiniband-diags ibverbs-utils
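Note : the IB fabric also needs a subnet manager. On Ubuntu the opensm package usually starts its service automatically on installation; if ibstat keeps reporting the port as Initializing instead of Active, make sure a subnet manager is running somewhere on the fabric (either on the switch or on one node), for example:
sudo systemctl enable --now opensm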
- Check that the cards are correctly recognised with the commands ibstat or ibv_devinfo
- Allow non-root users to use IB:
sudo chmod go+rw /dev/infiniband/umad0
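Note : this chmod does not persist across reboots. If you want the permission to be permanent, a udev rule along the following lines should work (the file name /etc/udev/rules.d/90-ib-umad.rules and the mode are just examples):
KERNEL=="umad*", SUBSYSTEM=="infiniband_mad", MODE="0666"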
- Check under which name the IB card appears with ifconfig -a. The output should look something like this:
eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 172.16.27.31 netmask 255.255.255.0 broadcast 172.16.27.255
inet6 fe80::1a03:73ff:feff:b45a prefixlen 64 scopeid 0x20<link>
ether 18:03:73:ff:b4:5a txqueuelen 1000 (Ethernet)
RX packets 922145 bytes 454874133 (454.8 MB)
RX errors 0 dropped 16504 overruns 0 frame 0
TX packets 1402282 bytes 546407927 (546.4 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device memory 0x91a20000-91a3ffff
eno2: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
ether 18:03:73:ff:b4:5b txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device memory 0x91a00000-91a1ffff
ibp3s0: flags=4163<BROADCAST,MULTICAST> mtu 4092 <---
unspec 80-00-02-08-FE-80-00-00-00-00-00-00-00-00-00-00 txqueuelen 256 (UNSPEC) <---
RX packets 0 bytes 0 (0.0 B) <---
RX errors 0 dropped 0 overruns 0 frame 0 <---
TX packets 0 bytes 0 (0.0 B) <---
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 <---
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 11485 bytes 1888342 (1.8 MB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 11485 bytes 1888342 (1.8 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
- Assign the IP address to each IB card manually:
sudo ip addr add 10.0.0.11/24 dev ibp3s0
Here 10.0.0.11 must be a unique IP address for each node (for convenience, use the same subnet and just change the last number)
- If you want this to be set at startup, create the /etc/systemd/network/ibp3s0.network file with the following content (Name and Address must be tailored to your specific situation):
[Match]
Name=ibp3s0

[Network]
Address=10.0.0.11/24
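To apply the file without rebooting (Ubuntu Server 20.04 already runs systemd-networkd as the netplan backend), restarting the service should be enough:
sudo systemctl restart systemd-networkd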
- Check that the connection is correctly established with the route command. The IB network should show up
- You should now be able to ssh between the various nodes using IPoIB
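For example, assuming a second node was configured as 10.0.0.12:
ping -c 3 10.0.0.12
ssh <node-user>@10.0.0.12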
If your machines are also connected to an Ethernet switch, it's a good idea to assign static IP addresses to each machine, so that if you want to connect directly to a specific node, you will always know its IP address.
- Identify the currently used network interface with ip addr. Look for the Ethernet interface that is UP (eno1 in the example below):
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000 <---
link/ether 18:03:73:ff:b4:5a brd ff:ff:ff:ff:ff:ff
inet 172.16.27.31/24 brd 172.16.27.255 scope global eno1
valid_lft forever preferred_lft forever
inet6 fe80::1a03:73ff:feff:b45a/64 scope link
valid_lft forever preferred_lft forever
3: eno2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
link/ether 18:03:73:ff:b4:5b brd ff:ff:ff:ff:ff:ff
4: ibp3s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc fq_codel state UP group default qlen 256
link/infiniband 80:00:02:08:fe:80:00:00:00:00:00:00:50:6b:4b:03:00:80:34:41 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
inet 10.0.0.11/24 brd 10.0.0.255 scope global ibp3s0
valid_lft forever preferred_lft forever
inet6 fe80::526b:4b03:80:3441/64 scope link
valid_lft forever preferred_lft forever
- Identify the current gateway with route -n:
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         172.16.27.5     0.0.0.0         UG    0      0        0 eno1    <--- default gateway
10.0.0.0        0.0.0.0         255.255.255.0   U     0      0        0 ibp3s0
172.16.27.0     0.0.0.0         255.255.255.0   U     0      0        0 eno1
- Edit the /etc/netplan/*.yaml file and update the entry for the network interface, setting dhcp4: no and adding the required addresses, gateway, and nameservers (for DNS purposes, you can use 8.8.8.8 and 1.1.1.1):
# This is the network config written by 'subiquity'
network:
  ethernets:
    eno1:                              <--- Currently used network interface
      dhcp4: no                        <--- Must be *no*
      addresses:
        - 172.16.27.31/24              <--- The IP address we want
      gateway4: 172.16.27.5            <--- Default gateway
      nameservers:
        addresses: [8.8.8.8, 1.1.1.1]
    eno2:
      dhcp4: true
  version: 2
- Apply the changes:
sudo netplan apply
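A quick way to verify that the static address and gateway are in place:
ip addr show eno1
ip route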
If your cluster is located in a location difficult to access (a remote datacenter), it's always a good idea to enable Wake On Lan (WOL). This enables you to restart nodes remotely (for example, after a power outage) without requiring physical access to the servers.
Note : WOL magic packets can only be sent from a machine on the same physical network as the "sleeping" one; they do not work over a VPN.
- Create the /etc/systemd/system/wol.service file with the following content (make sure the network interface is the connected one!):
[Unit]
Description=Enable Wake On Lan
[Service]
Type=oneshot
ExecStart = /usr/sbin/ethtool --change eno1 wol g <--- we are connected via Ethernet on eno1
[Install]
WantedBy=basic.target
Note : make sure the path to ethtool is correct. Check with which ethtool if you are not sure!
- If you want to enable WOL without having to reboot the server:
sudo ethtool --change eno1 wol g
- Enable the service:
sudo systemctl enable wol.service
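To wake a node later, send a magic packet from another machine on the same network, for example with the wakeonlan package (the MAC address below is the one of eno1 from the example ifconfig output; use your own):
sudo apt install wakeonlan
wakeonlan 18:03:73:ff:b4:5a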
This step will allow us to "propagate" folders from the master to the slave nodes. This is especially useful for syncing home directories and program installation folders.
- All nodes: edit the /etc/hosts file and add a line for each node in the cluster using the IPoIB address:
Note : comment out the 127.0.1.1 entry, otherwise the node will try to use this when attempting a connection.
127.0.0.1 localhost
#127.0.1.1 snorlax-01 <---
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
# Add all the cluster nodes to the Host list using IPoIB addresses
10.0.0.10 snorlax-master <---
10.0.0.11 snorlax-01 <---
10.0.0.12 snorlax-02 <---
10.0.0.13 snorlax-03 <---
10.0.0.14 snorlax-04 <---
- Master node:
sudo apt install nfs-kernel-server
- Slave nodes:
sudo apt install nfs-common
- All nodes: create the folders you want to sync (e.g., /shared)
- Master node: edit the /etc/exports file adding the following line (add more lines if you are syncing more folders):
/shared *(rw,sync)
- Master node: restart the NFS service with
sudo service nfs-kernel-server restart
- Slave nodes: edit the /etc/fstab file and add the following line (adjust for your specific configuration):
snorlax-master:/shared /shared nfs defaults 0 0
- Slave nodes: remount all partitions with
sudo mount -a
Note : this step must be done every time the nodes are rebooted!
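If you prefer the share to come back by itself after a reboot, network-aware mount options in /etc/fstab may help (a possible variant; x-systemd.automount mounts the share on first access, once the network is up):
snorlax-master:/shared /shared nfs defaults,_netdev,x-systemd.automount 0 0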
- Check that folders are correctly synced (create a file on the master node and see if it propagates to the slaves), and set the required permissions with the chmod command (otherwise slave nodes may not be able to access the shared folder contents)
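For example (sync-test is just a throwaway file name):
# on the master node
touch /shared/sync-test
# on any slave node
ls -l /shared/sync-test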
- All nodes:
sudo apt-get install -y libmunge-dev libmunge2 munge
- Master node:
sudo dd if=/dev/urandom bs=1 count=1024 | sudo tee /etc/munge/munge.key
- Master node:
sudo chown munge:munge /etc/munge/munge.key
- Master node:
sudo chmod 400 /etc/munge/munge.key
- Master node (once for each slave node):
scp /etc/munge/munge.key <node-user>@<node-ip>:/etc/munge/munge.key
- All nodes: edit the /etc/passwd file and modify the munge entry to:
munge:x:501:501::/var/run/munge:/sbin/nologin
Note : if problems arise, make sure the munge user has access to all its folders (/etc/munge, /var/log/munge, /var/lib/munge, /run/munge)
- All nodes:
sudo systemctl enable munge
- All nodes:
sudo systemctl start munge
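To check that authentication works across the cluster, a credential generated on one node must decode on another (this assumes ssh access to the slave nodes):
munge -n | unmunge
munge -n | ssh snorlax-01 unmunge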
- All nodes:
sudo apt-get install -y slurm-wlm
- All nodes: make sure the following folders and files exist. If not, create them:
/etc/slurm-llnl
/var/spool/slurm
/var/log/slurm_jobacct.log
- Apply sudo chown -R slurm:slurm <FOLDER> to each folder and file in the previous point
- Generate a slurm.conf file using the online configuration tool, according to your cluster configuration (an illustrative excerpt is shown after the note below). A few things to keep in mind:
- Set your master node's hostname in SlurmctldHost
- Set CPUs as appropriate, and optionally Sockets, CoresPerSocket and ThreadsPerCore (use lscpu to find out what you have exactly)
- Set StateSaveLocation to /var/spool/slurm
- Set MpiDefault to MPI-PMI2 if you are using OpenMPI 3.1.4 or later
- Set ProctrackType to cgroup
- Make sure SelectType is set to cons_tres and SelectTypeParameters to CR_CPU_Memory
- Set JobAcctGatherType to Linux and AccountingStorageType to FileTxt
- All nodes: copy the generated slurm.conf file to /etc/slurm-llnl.
Note : the slurm.conf file must be exactly the same on all nodes!
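For reference, a minimal illustrative excerpt matching the settings above and the example cluster of this guide (hostnames and hardware values are assumptions; run slurmd -C on a slave node to get the exact NodeName line for your machines):
SlurmctldHost=snorlax-master
MpiDefault=pmi2
ProctrackType=proctrack/cgroup
StateSaveLocation=/var/spool/slurm
SelectType=select/cons_tres
SelectTypeParameters=CR_CPU_Memory
JobAcctGatherType=jobacct_gather/linux
AccountingStorageType=accounting_storage/filetxt
AccountingStorageLoc=/var/log/slurm_jobacct.log
NodeName=snorlax-0[1-4] Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=250000 State=UNKNOWN
PartitionName=main Nodes=snorlax-0[1-4] Default=YES MaxTime=INFINITE State=UP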
- All nodes: create the /etc/slurm-llnl/cgroup.conf file with the following content:
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
- Master node:
sudo touch /var/slurmctld.pid; sudo chown slurm:slurm /var/slurmctld.pid
- Slave nodes:
sudo touch /var/slurmd.pid; sudo chown slurm:slurm /var/slurmd.pid
- Master node: add the line User=slurm to /lib/systemd/system/slurmctld.service
- Slave nodes: add the line User=root to /lib/systemd/system/slurmd.service
- All nodes: register the relevant users with Slurm accounting:
sacctmgr create user name=<USERNAME> account=<GROUP>
- Master node:
sudo systemctl enable slurmctld
- Master node:
sudo systemctl start slurmctld
- Slave nodes:
sudo systemctl enable slurmd
- Slave nodes:
sudo systemctl start slurmd
Note : if the service cannot start, there might be issues with the ownership of PID files. Try the following:
- Create the /etc/tmpfiles.d/slurm.conf file with the following content:
d /run/slurm 0770 root slurm -
- Edit the slurm.conf file and update the new PID file locations (/run/slurm/...)
- Edit the /lib/systemd/system/slurm*.service files with the same info
- Reboot all systems
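Once slurmctld and slurmd are running everywhere, a quick sanity check from the master node (the example cluster has 4 slave nodes):
sinfo
srun -N 4 hostname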
- Configure prolog, taskprolog and epilog scripts in /etc/slurm-llnl if you need something to be done at the start/end of each job (edit the slurm.conf file indicating the path to the scripts if you use them; see the example entries after the scripts below). For example, the following files create temporary directories on each node at the start of a job, export a $TMPDIR environment variable accessible within the slurm script, and delete the temporary folder at the end of a job (even if it crashed):
prolog
#!/bin/bash
scratch_dir=/scratch/${SLURM_JOB_USER}/${SLURM_JOB_ID}
/bin/mkdir -p ${scratch_dir}
/bin/chmod 700 ${scratch_dir}
/bin/chown ${SLURM_JOB_USER} ${scratch_dir}
task_prolog
#!/bin/bash
scratch_dir=/scratch/${SLURM_JOB_USER}/${SLURM_JOB_ID}
echo "export TMPDIR=${scratch_dir}"
epilog
#!/bin/bash
scratch_dir=/scratch/${SLURM_JOB_USER}/${SLURM_JOB_ID}
/bin/rm -rf ${scratch_dir}
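To make Slurm actually use these scripts, the corresponding slurm.conf entries would look something like this (the paths are assumptions based on the file names above; the scripts must exist and be executable on every node, e.g. via the shared folder):
Prolog=/etc/slurm-llnl/prolog
TaskProlog=/etc/slurm-llnl/task_prolog
Epilog=/etc/slurm-llnl/epilog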
The Modules package is a tool that simplifies shell initialization and lets users easily modify their environment during a session using modulefiles. This enables users to set up PATH exports and environment variables automatically for use with specific programs.
- Download the desired version of environment modules from the official website
- Install TCL via
sudo apt install tcl-dev
- Enter the program folder and configure the setup according to where you want to put the modulefiles:
./configure --prefix=/shared/modules-5.3.0 --modulefilesdir=/shared/modules-5.3.0/modulefiles
- Run make and make install
- Enable the initialization of modules at startup via a symbolic link:
sudo ln -s PREFIX/init/profile.sh /etc/profile.d/modules.sh
- Create a modulefile for each program you need to configure. Example for Orca 5.0.4 with OpenMPI-4.1.4:
/shared/modules-5.3.0/modulefiles/orca/orca-5.0.4
#%Module1.0#####################################################################
##
## module to load orca-5.0.4
##
proc ModulesHelp { } {
global version prefix
puts stderr "\tLoad orca-5.0.4"
}
module-whatis "Load orca-5.0.4"
# for Tcl script use only
set version 5.0.4
set prefix /shared/orca_5_0_4_linux_x86-64_openmpi411
prereq openmpi/openmpi-4.1.4
prepend-path PATH $prefix
setenv ORCA_HOME $prefix
- Modules can now be enabled via the following syntax:
module load orca/orca-5.0.4
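Inside a Slurm job script this typically looks like the following sketch (job name, task count and input/output file names are placeholders; adapt to your own modulefiles):
#!/bin/bash
#SBATCH --job-name=orca-test
#SBATCH --ntasks=8
module load openmpi/openmpi-4.1.4
module load orca/orca-5.0.4
# ORCA must be called with its full path for parallel runs
$ORCA_HOME/orca input.inp > output.out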
Cockpit allows sysadmins to monitor the performance of an HPC cluster in real time, and optionally to carry out maintenance from a web interface. Setup is very easy and should not require any particular configuration steps. Simply install the software on all nodes and enable the service:
sudo apt install cockpit cockpit-pcp
sudo systemctl start cockpit
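If you also want Cockpit to come back automatically after a reboot, enabling its socket unit should be sufficient:
sudo systemctl enable --now cockpit.socket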
The Cockpit web interface can be found at the IP address of the corresponding machine, on port 9090:
https://172.16.27.30:9090
Login credentials are the same as for the machine.
Note : you will receive a security warning when visiting the website, as we did not install a trusted certificate for that "web page". You can safely ignore the warning and proceed with the login.