SlurmGCP. Draft topology doc
mr0re1 committed Oct 29, 2024
1 parent 19fa5ca commit b07a4f3
Showing 1 changed file with 58 additions and 0 deletions.
58 changes: 58 additions & 0 deletions docs/slurm-topology.md
@@ -0,0 +1,58 @@
# Network topology aware scheduling

Slurm can be [configured](https://slurm.schedmd.com/topology.html) to support topology-aware
resource allocation to optimize job performance.

If you are using Slurm via Cluster Toolkit, the Slurm Topology Plugin is automatically configured with:

```ini
TopologyPlugin=topology/tree
TopologyParam=SwitchAsNodeRank
```

This does two things:

* **Minimizes inter-rack communication:** For jobs smaller than the full cluster size, Slurm will assign the job to as few racks as possible.
* **Optimizes rank placement:** Within a job, the Slurm node rank (used to assign global Slurm / MPI ranks) is ordered by the switch that the node is on, so that ranks are ordered by rack.
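
The effect of ordering ranks by switch can be pictured with a small sketch. This is an illustration only, not Slurm's implementation: the function name `order_by_switch`, the node names, and the switch names are all hypothetical, and the input is assumed to be one `<node> <leaf_switch>` pair per line.

```sh
#!/bin/bash
# Illustration only: this is NOT how Slurm computes node ranks internally.
# Input (stdin): one "<node> <leaf_switch>" pair per line (hypothetical names).
# Output: "<rank> <node> <switch>", with nodes grouped by switch first,
# then ordered by node name within each switch.
order_by_switch() {
  sort -k2,2 -k1,1V | awk '{ print NR - 1, $1, $2 }'
}

# Hypothetical cluster where node names do not follow rack boundaries.
printf '%s\n' \
  'node-0 rack-b' \
  'node-1 rack-a' \
  'node-2 rack-b' \
  'node-3 rack-a' | order_by_switch
```

In this made-up example, nodes on the same switch end up with consecutive ranks, so neighboring ranks tend to share a rack even though the node names alone would interleave the racks.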

SlurmGCP automatically updates topology information for all nodes in the cluster, according to their [physical location](https://cloud.google.com/compute/docs/instances/use-compact-placement-policies#verify-vm-location).

> [!NOTE]
> The physical location information is only available for VMs configured with a placement policy.
> VMs without a defined placement policy will be assigned a less efficient 'fake' topology.

## Inspect topology

You can inspect the topology used by Slurm by running:

```sh
scontrol show topology

# Or by listing the configuration file (TODO: confirm this path):
cat /etc/slurm/topology.conf
```
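
For larger clusters it can help to condense the configuration file to just the leaf switches. The following is a minimal sketch, assuming the usual `topology.conf` layout of `SwitchName=... Nodes=...` lines with exactly that key capitalization; the helper name `list_leaf_switches` and the sample switch and node names are hypothetical.

```sh
#!/bin/bash
# list_leaf_switches FILE
# Prints "<switch> <nodes>" for every line that defines both SwitchName= and
# Nodes=, i.e. the leaf switches. Lines that only list Switches= are skipped.
# ASSUMPTION: keys are spelled exactly "SwitchName" and "Nodes".
list_leaf_switches() {
  awk '
    /^[[:space:]]*#/ { next }   # skip comment lines
    {
      name = ""; nodes = ""
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^SwitchName=/) name  = substr($i, 12)  # text after "SwitchName="
        if ($i ~ /^Nodes=/)      nodes = substr($i, 7)   # text after "Nodes="
      }
      if (name != "" && nodes != "") print name, nodes
    }
  ' "$1"
}

# Demo on a hypothetical file; on a real cluster, pass the actual topology.conf.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
# hypothetical content
SwitchName=root Switches=s[0-1]
SwitchName=s0 Nodes=node-[1,3]
SwitchName=s1 Nodes=node-[0,2]
EOF
list_leaf_switches "$sample"
rm -f "$sample"
```

The node lists are printed as written in the file (hostlist expressions are not expanded), which keeps the sketch independent of any Slurm tooling.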

To inspect the "real" topology and verify the physical host placement, you can list the `physical_host` property of nodes:

```sh
#!/bin/bash

# /home/where.sh - print this machine's hostname and its physical_host value
echo "$(hostname) $(curl 'http://metadata.google.internal/computeMetadata/v1/instance/attributes/physical_host' -H 'Metadata-Flavor: Google' -s)"
```

```sh
srun --nodelist={nodes_to_inspect} -l /home/where.sh | sort -V
```
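
To turn that listing into a per-rack summary, the output can be piped through a small filter. This is a sketch under assumptions: it assumes each line ends with a `physical_host` value of the form `/<cluster>/<rack>/<host>`, and the helper name `nodes_per_rack` and the sample values are hypothetical. Lines whose last field does not have that shape are counted as `unknown`.

```sh
#!/bin/bash
# nodes_per_rack
# Input (stdin): lines like "<task>: <hostname> <physical_host>", as printed
# by `srun -l /home/where.sh`.
# ASSUMPTION: physical_host has the form /<cluster>/<rack>/<host>.
# Output: "<cluster>/<rack> <node count>", sorted; anything else is "unknown".
nodes_per_rack() {
  awk '
    NF == 0 { next }                      # ignore empty lines
    {
      n = split($NF, part, "/")           # "/c/r/h" -> "", "c", "r", "h"
      if (n == 4 && part[1] == "") count[part[2] "/" part[3]]++
      else                         count["unknown"]++
    }
    END { for (rack in count) print rack, count[rack] }
  ' | sort
}

# Demo with made-up values.
printf '%s\n' \
  '0: node-0 /c1/r1/h1' \
  '1: node-1 /c1/r1/h2' \
  '2: node-2 /c1/r2/h3' | nodes_per_rack
```

On a cluster, the same filter could be appended to the command above, for example `srun --nodelist={nodes_to_inspect} -l /home/where.sh | nodes_per_rack`. A job that was packed well should show few distinct racks.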

## Drawbacks

Updates to `topology.conf` require reconfiguration of the Slurm controller. This can be a costly operation that affects the responsiveness of the controller.

You have the option to disable the Slurm Topology Plugin (along with automatic updates) by providing the following settings to the controller module:

```yaml
settings:
  cloud_parameters:
    topology_plugin: none # TODO: confirm this value
```
