# Network topology aware scheduling

Slurm can be [configured](https://slurm.schedmd.com/topology.html) to support topology-aware
resource allocation to optimize job performance.

If you are using Slurm via ClusterToolkit, the Slurm Topology Plugin is automatically configured with:

```ini
TopologyPlugin=topology/tree
TopologyParam=SwitchAsNodeRank
```

This does two things:

* **Minimizes inter-rack communication:** For jobs smaller than the full cluster, Slurm assigns the job to as few racks as possible.
* **Optimizes rank placement:** Within a job, the Slurm node rank (used to assign global Slurm/MPI ranks) is ordered by the switch that the node is on, so that ranks are grouped by rack. See the sketch after this list.

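With `topology/tree`, Slurm reads a `topology.conf` that groups nodes under switches. As a minimal illustrative sketch (the switch and node names here are hypothetical, not what SlurmGCP actually generates), a two-rack cluster might be described as:

```ini
# Hypothetical topology.conf: one leaf switch per rack, joined by a root switch.
SwitchName=rack0 Nodes=node-[0-15]
SwitchName=rack1 Nodes=node-[16-31]
SwitchName=root Switches=rack[0-1]
```

With `SwitchAsNodeRank`, nodes under `rack0` get lower node ranks than nodes under `rack1`, so a 16-node job is packed onto a single rack and its ranks stay contiguous.
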
SlurmGCP automatically updates topology information for all nodes in the cluster, according to their [physical location](https://cloud.google.com/compute/docs/instances/use-compact-placement-policies#verify-vm-location).

> [!NOTE]
> The physical location information is only available for VMs configured with a placement policy.
> VMs without a defined placement policy will be assigned a less efficient 'fake' topology.
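
If your nodes run without a placement policy, the nodeset modules expose a setting to create one. As a hedged sketch (`enable_placement` is the variable name used by recent schedmd-slurm-gcp nodeset modules; check your module version's documentation):

```yaml
settings:
  enable_placement: true  # assumed setting; requests a compact placement policy for the nodeset
```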

## Inspect topology

You can inspect the topology used by Slurm by running:

```sh
scontrol show topology

# Or by listing the configuration file:
cat /etc/slurm/topology.conf
```

To inspect the "real" topology and verify the physical host placement, you can list the `physical_host` property of nodes:

```sh
#!/bin/bash

# /home/where.sh - echo this machine's hostname and its physicalHost
echo "$(hostname) $(curl 'http://metadata.google.internal/computeMetadata/v1/instance/attributes/physical_host' -H 'Metadata-Flavor: Google' -s)"
```

```sh
srun --nodelist={nodes_to_inspect} -l /home/where.sh | sort -V
```
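
Each output line pairs the task rank and hostname with that node's `physical_host` path. The values below are fabricated placeholders; the slash-separated sections denote progressively finer physical groupings (per the placement documentation linked above), so nodes that share leading sections are physically close:

```text
# Illustrative output only; fabricated values.
0: node-0 /a1b2c3/d4e5f6/a7b8c9
1: node-1 /a1b2c3/d4e5f6/d0e1f2
2: node-2 /a1b2c3/e9f0a1/b3c4d5
```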

## Drawbacks

Updates to `topology.conf` require a reconfiguration of the Slurm controller. This can be a costly operation that affects the responsiveness of the controller.

You can disable the Slurm Topology Plugin (along with automatic updates) by providing the following settings to the controller module:

```yaml
settings:
  cloud_parameters:
    topology_plugin: none
```