# Example Configs

This directory contains a set of example YAML files that can be fed into gHPC
to create a blueprint.

## Instructions

Ensure your `project_id` is set and that other deployment variables, such as
`zone` and `region`, are set correctly under `vars` before creating and
deploying an example config.

Note that global variables defined under `vars` are automatically passed to
resources whenever a resource has an input that matches the variable name.

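For example, a minimal `vars` block (the values below are placeholders to
replace with your own project, region, and zone) might look like:

```yaml
vars:
  project_id: my-gcp-project  # placeholder: replace with your GCP project ID
  region: us-central1
  zone: us-central1-a
```
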
### (Optional) Setting up a remote terraform state

The following block configures terraform to point to an existing GCS bucket to
store and manage the terraform state. Add your own bucket name and (optionally)
a service account in the configuration. If not set, the terraform state will be
stored locally within the generated blueprint.

Add this block to the top level of your input YAML:

```yaml
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: a_bucket
    impersonate_service_account: a_bucket_reader@project.iam.gserviceaccount.com
```

## Config Descriptions

### hpc-cluster-small.yaml

Creates a basic auto-scaling Slurm cluster with mostly default settings. The
blueprint also creates a new VPC network and a filestore instance mounted to
`/home`.

There are two partitions in this example: `debug` and `compute`. The `debug`
partition uses `n2-standard-2` VMs, which should work out of the box without
needing to request additional quota. The purpose of the `debug` partition is to
make sure that first-time users are not immediately blocked by quota
limitations.

#### Compute Partition

There is a `compute` partition that achieves higher performance. Any
performance analysis should be done on the `compute` partition. By default it
uses `c2-standard-60` VMs with placement groups enabled. You may need to
request additional quota for `C2 CPUs` in the region you are deploying in. You
can select the compute partition using the `srun -p compute` argument.

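For example, once the cluster is deployed and you are on a cluster node (such
as the login node), a trivial placeholder job can be submitted to the
`compute` partition with:

```shell
# Run a placeholder workload (hostname) on one node of the compute partition
srun -p compute -N 1 hostname
```
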
Quota required for this example:

* Cloud Filestore API: Basic SSD (Premium) capacity (GB) per region: **2660 GB**
* Compute Engine API: Persistent Disk SSD (GB): **~10 GB**
* Compute Engine API: N2 CPUs: **12**
* Compute Engine API: C2 CPUs: **60/node** up to 1200 - _only needed for
  `compute` partition_
* Compute Engine API: Affinity Groups: **one for each job in parallel** - _only
  needed for `compute` partition_
* Compute Engine API: Resource policies: **one for each job in parallel** -
  _only needed for `compute` partition_

### hpc-cluster-high-io.yaml

Creates a Slurm cluster with tiered file systems for higher performance. It
connects to the default VPC of the project and creates two partitions and a
login node.

File systems:

* The homefs mounted at `/home` is a default "PREMIUM" tier filestore with
  2.5 TiB of capacity.
* The projectsfs mounted at `/projects` is a high scale SSD filestore instance
  with 10 TiB of capacity.
* The scratchfs mounted at `/scratch` is a
  [DDN Exascaler Lustre](../resources/third-party/file-system/DDN-EXAScaler/README.md)
  file system designed for high IO performance, with ~10 TiB of capacity.

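Once logged into the cluster, a quick way to confirm the three mount points
(a sketch, assuming the file systems mount as configured above) is:

```shell
# Show capacity and usage for the three shared file systems
df -h /home /projects /scratch
```
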
There are two partitions in this example: `low_cost` and `compute`. The
`low_cost` partition uses `n2-standard-4` VMs. This partition can be used for
debugging and workloads that do not require high performance.

Similar to the small example, there is a
[compute partition](#compute-partition) that should be used for any performance
analysis.

Quota required for this example:

* Cloud Filestore API: Basic SSD (Premium) capacity (GB) per region: **2660 GB**
* Cloud Filestore API: High Scale SSD capacity (GB) per region: **10240 GiB** -
  _min quota request is 61440 GiB_
* Compute Engine API: Persistent Disk SSD (GB): **~14000 GB**
* Compute Engine API: N2 CPUs: **158**
* Compute Engine API: C2 CPUs: **60/node** up to 12,000 - _only needed for
  `compute` partition_
* Compute Engine API: Affinity Groups: **one for each job in parallel** - _only
  needed for `compute` partition_
* Compute Engine API: Resource policies: **one for each job in parallel** -
  _only needed for `compute` partition_

### spack-gromacs.yaml

Spack is an HPC software package manager. This example creates a small Slurm
cluster with software installed using
[Spack](../resources/scripts/spack-install/README.md). The controller will
install and configure Spack, and install [gromacs](https://www.gromacs.org/)
using Spack. Spack is installed in a shared location (`/apps`) via filestore.
This build leverages the startup-script resource and can be applied to any
cluster by using the output of the spack-install or startup-script resources.

The installation occurs as part of the Slurm startup script; a warning message
will be displayed upon SSHing to the login node indicating that configuration
is still active. To track the status of the overall startup script, run the
following command on the login node:

```shell
sudo tail -f /var/log/messages
```

Spack-specific installation logs will be sent to the spack_log as configured in
your YAML, by default `/var/log/spack.log` on the login node:

```shell
sudo tail -f /var/log/spack.log
```

Once the Slurm and Spack installation is complete, Spack will be available on
the login node. To use Spack on the controller or compute nodes, the following
command must be run first:

```shell
source /apps/spack/share/spack/setup-env.sh
```

To load the gromacs module, use spack:

```shell
spack load gromacs
```

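As a quick sketch of using the installation, a short verification run might
look like the following; note that the GROMACS binary name depends on how
Spack built it (`gmx` or `gmx_mpi` are common), so adjust accordingly:

```shell
# Load the module and check the GROMACS version on a compute node.
# gmx_mpi is an assumed binary name for an MPI-enabled build; it may be gmx instead.
spack load gromacs
srun -p compute -N 1 gmx_mpi --version
```
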
**_NOTE:_** Installing Spack compilers and libraries in this example can take
1-2 hours to run on startup. To decrease this time in future deployments,
consider including a Spack build cache as described in the comments of the
example.

### omnia-cluster.yaml

Creates a simple Omnia cluster, with an omnia-manager node and 2 omnia-compute
nodes, on the pre-existing default network. Omnia will be automatically
installed after the nodes are provisioned. All nodes mount a filestore instance
on `/home`.

### image-builder.yaml

This blueprint helps create custom VM images by applying necessary software and
configurations to existing images, such as the [HPC VM Image][hpcimage].
Using a custom VM image can be more scalable than installing software using
boot-time startup scripts because:

* it avoids reliance on the continued availability of package repositories
* VMs will join an HPC cluster and execute workloads more rapidly due to
  reduced boot-time configuration
* machines are guaranteed to boot with the static set of packages that was
  available when the custom image was created, with no potential for some
  machines to be upgraded relative to others based upon their creation time

[hpcimage]: https://cloud.google.com/compute/docs/instances/create-hpc-vm

**Note**: it is important _not to modify_ the subnetwork name in either of the
two resource groups without modifying it in both. These _must_ match!

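As an illustrative sketch of keeping that value consistent (the setting name
`subnetwork_name` and the group layout below are hypothetical placeholders;
check each resource's `variables.tf` for the real input names):

```yaml
# Hypothetical sketch only: the same subnetwork name passed to both resource groups.
resource_groups:
- group: network
  resources:
  - source: resources/network/vpc
    kind: terraform
    id: builder-network
    settings:
      subnetwork_name: image-builder-subnet  # must match the value below
- group: packer
  resources:
  - source: resources/packer/custom-image
    kind: packer
    id: custom-image
    settings:
      subnetwork_name: image-builder-subnet  # must match the value above
```
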
#### Custom Network (resource group)

A tool called [Packer](https://packer.io) builds custom VM images by creating
short-lived VMs, executing scripts on them, and saving the boot disk as an
image that can be used by future VMs. The short-lived VM must operate in a
network that:

* has outbound access to the internet for downloading software
* has SSH access from the machine running Packer so that local files/scripts
  can be copied to the VM

This resource group creates such a network, while using [Cloud NAT][cloudnat]
and [Identity-Aware Proxy (IAP)][iap] to allow outbound traffic and inbound SSH
connections without exposing the machine to the internet on a public IP
address.

[cloudnat]: https://cloud.google.com/nat/docs/overview
[iap]: https://cloud.google.com/iap/docs/using-tcp-forwarding

#### Packer Template (resource group)

The Packer template in this resource group accepts a list of Ansible playbooks
that will be run on the VM to customize it. Although it defaults to creating
VMs with a public IP address, it can easily be set to use [IAP][iap] for SSH
tunneling by following the
[example in its README](../resources/packer/custom-image/README.md).

## Config Schema

A user-defined config should follow the schema below:

```yaml
# Required: Name your blueprint; this will also be the name of the directory
# the blueprint is created in.
blueprint_name: MyBlueprintName
# Top-level variables; these will be pulled from if a required variable is not
# provided as part of a resource. Any variables can be set here by the user;
# labels will be treated differently, as they will be applied to all created
# GCP resources.
vars:
  project_id: GCP_PROJECT_ID
  # https://cloud.google.com/compute/docs/regions-zones
  region: us-central1
  zone: us-central1-a
  # https://cloud.google.com/resource-manager/docs/creating-managing-labels
  labels:
    global_label: label_value
# Many resources can be added from local and remote directories.
resource_groups:
- group: groupName
  resources:
  # Local source, prefixed with ./ (/ and ../ also accepted)
  - source: ./resources/role/resource-name # Required: Points to the resource directory.
    kind: < terraform | packer > # Required: Type of resource; currently choose from terraform or packer.
    id: <a unique id> # Required: Name of this resource, used to uniquely identify it.
    # Optional: All configured settings for the resource. For terraform, each
    # variable listed in variables.tf can be set here; settings are mandatory if
    # no default was provided and they are not defined elsewhere (like the
    # top-level vars).
    settings:
      setting1: value1
      setting2:
      - value2a
      - value2b
      setting3:
        key3a: value3a
        key3b: value3b
  # Embedded resource (part of the toolkit), prefixed with resources/
  - source: resources/role/resource-name
  # GitHub resource over SSH, prefixed with git@github.com
  - source: git@github.com:org/repo.git//resources/role/resource-name
  # GitHub resource over HTTPS, prefixed with github.com
  - source: github.com/org/repo//resources/role/resource-name
```

## Writing Config YAML

The input YAML is composed of three primary parts: top-level parameters, global
variables, and resource groups. These are described in more detail below.

### Top Level Parameters

* **blueprint_name** (required): Name of this set of blueprints. This also
  defines the name of the directory the blueprints will be created in.

### Global Variables

```yaml
vars:
  region: "us-west-1"
  labels:
    "user-defined-global-label": "slurm-cluster"
  ...
```

Global variables are set under the `vars` field at the top level of the YAML.
These variables can be explicitly referenced in resources as
[config variables](#config-variables). Any resource setting (input) that is not
explicitly provided and that exactly matches a global variable name will
automatically be set to the global value.

Global variables should be used with care. Resource default settings with the
same name as a global variable, and not explicitly set, will be overwritten by
the global variable.

The global `labels` variable is a special case, as it will be appended to the
labels found in resource settings, whereas normally an explicit resource
setting would be left unchanged. This ensures that global labels can be set
alongside resource-specific labels. Precedence is given to the resource-specific
labels if a collision occurs. Default resource labels will still be overwritten
by global labels.

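As an illustrative sketch of this precedence (the resource source, id, and
label keys below are placeholders rather than values from a real example
config):

```yaml
vars:
  labels:
    env: production  # global label, applied to every resource that accepts labels
resource_groups:
- group: primary
  resources:
  - source: resources/role/resource-name
    kind: terraform
    id: example-resource
    settings:
      labels:
        env: development  # resource-specific label wins on collision
        owner: hpc-team   # merged alongside the global labels
```
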
The HPC Toolkit uses special reserved labels for monitoring each deployment.
These are set automatically, but can be overridden through global vars or
resource settings. They include:

* ghpc_blueprint: the name of the blueprint the deployment was created from
* ghpc_deployment: the name of the specific deployment of the blueprint
* ghpc_role: the role of a given resource, e.g. compute, network, or
  file-system. By default, it will be taken from the folder immediately
  containing the resource. Example: a resource with the source path
  `./resources/network/vpc` will have `network` as its `ghpc_role` label by
  default.

### Resource Groups

Resource groups allow distinct sets of resources to be defined and deployed as
a group. A resource group can only contain resources of a single kind; for
example, a resource group may not mix Packer and Terraform resources.

For Terraform resources, a top-level main.tf will be created for each resource
group so that different groups can be created or destroyed independently.

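For example (a sketch; the directory names follow the `blueprint_name` and
`group` values from the schema above, and the exact layout of the generated
blueprint may differ), each group can then be managed with the standard
Terraform workflow from its own subdirectory:

```shell
# Deploy one resource group independently of the others
cd MyBlueprintName/groupName
terraform init
terraform apply
```
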
A resource group is made of two fields, group and resources. They are described
in more detail below.

#### Group

Defines the name of the group. Each group must have a unique name. The name
will be used to create the subdirectory in the blueprint directory that the
resource group will be defined in.

#### Resources

Resources are the building blocks of an HPC environment. They can be composed
to create complex deployments using the config YAML. Several resources are
provided by default in the [resources](../resources/README.md) folder.

To learn more about how to refer to a resource in a YAML, please consult the
[resources README file](../resources/README.md).

## Variables

Variables can be used to refer both to values defined elsewhere in the config
and to the outputs and structure of other resources.

### Config Variables

Variables in a ghpc config YAML can refer to global variables or the outputs of
other resources. For global and resource variables, the syntax is as follows:

```yaml
vars:
  zone: us-central1-a
resource_groups:
- group: primary
  resources:
  - source: path/to/resource/1
    id: resource1
    ...
  - source: path/to/resource/2
    ...
    settings:
      key1: $(vars.zone)
      key2: $(resource1.name)
```

The variable is referred to by its source, either `vars` for global variables
or the resource ID for resource variables, followed by the name of the value
being referenced. The entire variable is then wrapped in `$()`.

Currently, references to variable attributes and string operations with
variables are not supported.

### Literal Variables

Formerly known as passthrough variables.

Literal variables are not interpreted by `ghpc` directly, but are rather passed
through to the underlying resource. Literal variables should only be used by
those familiar with the underlying resource technology (Terraform or Packer);
no validation will be done before deployment to ensure that they reference
something that exists.

Literal variables are occasionally needed when referring to the data structure
of the underlying resource. For example, take the
[hpc-cluster-high-io.yaml](./hpc-cluster-high-io.yaml) example config. The
DDN-EXAScaler resource requires a subnetwork self link, which is not currently
an output of either network resource; therefore it is necessary to refer to the
primary network self link through Terraform itself:

```yaml
subnetwork_self_link: ((module.network1.primary_subnetwork.self_link))
```

Here the network1 module is referenced; the Terraform module name is the same
as the ID in the `ghpc` config. From the module we can refer to its underlying
variables as deeply as we need, in this case the self_link of its
primary_subnetwork.

The entire text of the variable is wrapped in double parentheses, indicating
that everything inside will be provided as-is to the resource.

Whenever possible, config variables are preferred over literal variables.
`ghpc` will perform basic validation, making sure all config variables are
defined before creating a blueprint, which makes debugging quicker and easier.