Merge pull request #2 from heyealex/community-paths
Update paths to community resources throughout the codebase
heyealex authored Apr 26, 2022
2 parents a2af9c7 + 8ae40e6 commit 602ceae
Showing 36 changed files with 148 additions and 201 deletions.
10 changes: 6 additions & 4 deletions README.md
@@ -360,8 +360,9 @@ resume.py ERROR: ... "Quota 'C2_CPUS' exceeded. Limit: 300.0 in region europe-we

The solution here is to [request more of the specified quota](#gcp-quotas),
`C2 CPUs` in the example above. Alternatively, you could switch the partition's
-[machine_type](https://github.com/GoogleCloudPlatform/hpc-toolkit/tree/main/resources/third-party/compute/SchedMD-slurm-on-gcp-partition#input_machine_type)
-, to one which has sufficient quota.
+[machine type][partition-machine-type], to one which has sufficient quota.

+[partition-machine-type]: community/resources/compute/SchedMD-slurm-on-gcp-partition/README.md#input_machine_type

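As a rough illustration only (the resource id, `use` list, and node counts below are assumptions, not taken from a shipped example), a partition switched to an N2 machine type with available quota might look like:

```yaml
# Hypothetical partition definition; machine_type is the only point of interest here.
- source: ./community/resources/compute/SchedMD-slurm-on-gcp-partition
  kind: terraform
  id: compute_partition
  use: [network1, homefs]
  settings:
    partition_name: compute
    max_node_count: 20
    machine_type: n2-standard-64  # an N2 alternative when C2 quota is unavailable
```
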
#### Placement Groups

@@ -379,10 +380,11 @@ $ cat /var/log/slurm/resume.log
resume.py ERROR: group operation failed: Requested minimum count of 6 VMs could not be created.
```

-One way to resolve this is to set
-[enable_placement](https://github.com/GoogleCloudPlatform/hpc-toolkit/tree/main/resources/third-party/compute/SchedMD-slurm-on-gcp-partition#input_enable_placement)
+One way to resolve this is to set [enable_placement][partition-enable-placement]
to `false` on the partition in question.

+[partition-enable-placement]: https://github.com/GoogleCloudPlatform/hpc-toolkit/tree/main/community/resources/compute/SchedMD-slurm-on-gcp-partition#input_enable_placement

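As a sketch under the same assumptions (the resource id and `use` entries are illustrative), the partition could be declared with placement disabled:

```yaml
- source: ./community/resources/compute/SchedMD-slurm-on-gcp-partition
  kind: terraform
  id: compute_partition
  use: [network1, homefs]
  settings:
    partition_name: compute
    max_node_count: 200
    enable_placement: false  # avoid the group-creation failure shown above
```
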
### Terraform Deployment

When `terraform apply` fails, Terraform generally provides a useful error
114 changes: 0 additions & 114 deletions community/examples/README.md
@@ -31,76 +31,6 @@ terraform_backend_defaults:
## Config Descriptions
### hpc-cluster-small.yaml
Creates a basic auto-scaling SLURM cluster with mostly default settings. The
blueprint also creates a new VPC network, and a filestore instance mounted to
`/home`.

There are 2 partitions in this example: `debug` and `compute`. The `debug`
partition uses `n2-standard-2` VMs, which should work out of the box without
needing to request additional quota. The purpose of the `debug` partition is to
make sure that first time users are not immediately blocked by quota
limitations.

#### Compute Partition

There is a `compute` partition that achieves higher performance. Any
performance analysis should be done on the `compute` partition. By default it
uses `c2-standard-60` VMs with placement groups enabled. You may need to request
additional quota for `C2 CPUs` in the region you are deploying in. You can
select the compute partition using the `srun -p compute` argument.

Quota required for this example:

* Cloud Filestore API: Basic SSD (Premium) capacity (GB) per region: **2660 GB**
* Compute Engine API: Persistent Disk SSD (GB): **~10 GB**
* Compute Engine API: N2 CPUs: **12**
* Compute Engine API: C2 CPUs: **60/node** up to 1200 - _only needed for
`compute` partition_
* Compute Engine API: Affinity Groups: **one for each job in parallel** - _only
needed for `compute` partition_
* Compute Engine API: Resource policies: **one for each job in parallel** -
_only needed for `compute` partition_

### hpc-cluster-high-io.yaml

Creates a slurm cluster with tiered file systems for higher performance. It
connects to the default VPC of the project and creates two partitions and a
login node.

File systems:

* The homefs mounted at `/home` is a default "PREMIUM" tier filestore with
2.5TiB of capacity
* The projectsfs is mounted at `/projects` and is a high scale SSD filestore
instance with 10TiB of capacity.
* The scratchfs is mounted at `/scratch` and is a
[DDN Exascaler Lustre](../resources/third-party/file-system/DDN-EXAScaler/README.md)
file system designed for high IO performance. The capacity is ~10TiB.

There are two partitions in this example: `low_cost` and `compute`. The
`low_cost` partition uses `n2-standard-4` VMs. This partition can be used for
debugging and workloads that do not require high performance.

Similar to the small example, there is a
[compute partition](#compute-partition) that should be used for any performance
analysis.

Quota required for this example:

* Cloud Filestore API: Basic SSD (Premium) capacity (GB) per region: **2660 GB**
* Cloud Filestore API: High Scale SSD capacity (GB) per region: **10240 GiB** - _min
quota request is 61440 GiB_
* Compute Engine API: Persistent Disk SSD (GB): **~14000 GB**
* Compute Engine API: N2 CPUs: **158**
* Compute Engine API: C2 CPUs: **60/node** up to 12,000 - _only needed for
`compute` partition_
* Compute Engine API: Affinity Groups: **one for each job in parallel** - _only
needed for `compute` partition_
* Compute Engine API: Resource policies: **one for each job in parallel** -
_only needed for `compute` partition_

### spack-gromacs.yaml
Spack is an HPC software package manager. This example creates a small slurm
@@ -152,50 +82,6 @@ omnia-manager node and 2 omnia-compute nodes, on the pre-existing default
network. Omnia will be automatically installed after the nodes are provisioned.
All nodes mount a filestore instance on `/home`.

### image-builder.yaml

This Blueprint helps create custom VM images by applying necessary software and
configurations to existing images, such as the [HPC VM Image][hpcimage].
Using a custom VM image can be more scalable than installing software using
boot-time startup scripts because

* it avoids reliance on continued availability of package repositories
* VMs will join an HPC cluster and execute workloads more rapidly due to reduced
boot-time configuration
* machines are guaranteed to boot with a static set of packages available when
the custom image was created. No potential for some machines to be upgraded
relative to others based upon their creation time!

[hpcimage]: https://cloud.google.com/compute/docs/instances/create-hpc-vm

**Note**: it is important _not to modify_ the subnetwork name in one of the
two resource groups without also modifying it in the other. These _must_ match!

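A minimal sketch of that constraint, using the resource-group layout described in this blueprint; the group names, resource sources, and the `subnetwork_name` field are illustrative assumptions rather than the literal contents of image-builder.yaml:

```yaml
resource_groups:
- group: network            # builds the short-lived build network (see below)
  resources:
  - source: resources/network/vpc
    kind: terraform
    id: network1
    settings:
      subnetwork_name: image-builder-subnet   # must match the value below
- group: packer
  resources:
  - source: resources/packer/custom-image
    kind: packer
    id: custom-image
    settings:
      subnetwork_name: image-builder-subnet   # keep identical to the network group
```
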
#### Custom Network (resource group)

A tool called [Packer](https://packer.io) builds custom VM images by creating
short-lived VMs, executing scripts on them, and saving the boot disk as an
image that can be used by future VMs. The short-lived VM must operate in a
network that

* has outbound access to the internet for downloading software
* has SSH access from the machine running Packer so that local files/scripts
can be copied to the VM

This resource group creates such a network while using [Cloud NAT][cloudnat]
and [Identity-Aware Proxy (IAP)][iap] to allow outbound traffic and inbound SSH
connections without exposing the machine to the internet on a public IP address.

[cloudnat]: https://cloud.google.com/nat/docs/overview
[iap]: https://cloud.google.com/iap/docs/using-tcp-forwarding

#### Packer Template (resource group)

The Packer template in this resource group accepts a list of Ansible playbooks
which will be run on the VM to customize it. Although it defaults to creating
VMs with a public IP address, it can be easily set to use [IAP][iap] for SSH
tunneling following the [example in its README](../resources/packer/custom-image/README.md).

## Config Schema

A user defined config should follow the following schema:
59 changes: 59 additions & 0 deletions community/resources/README.md
@@ -0,0 +1,59 @@
# Community Resources

To learn more about using and writing resources, see the [core resources
documentation](../../resources/README.md).

## Compute

* [**SchedMD-slurm-on-gcp-partition**](compute/SchedMD-slurm-on-gcp-partition/README.md):
Creates a SLURM partition that can be used by the
SchedMD-slurm-on-gcp-controller resource.

## Database

* [**slurm-cloudsql-federation**](database/slurm-cloudsql-federation/README.md):
Creates a [Google SQL Instance](https://cloud.google.com/sql/) meant to be
integrated with a
[slurm controller](./scheduler/SchedMD-slurm-on-gcp-controller/README.md).

## File System

* [**nfs-server**](file-system/nfs-server/README.md): Creates a VM instance and
configures an NFS server that can be mounted by other VM instances.

* [**DDN-EXAScaler**](file-system/DDN-EXAScaler/README.md): Creates a
[DDN Exascaler Lustre](https://www.ddn.com/partners/google-cloud-platform/)
file system. This resource has
[license costs](https://console.developers.google.com/marketplace/product/ddnstorage/exascaler-cloud).

## Project

* [**new-project**](project/new-project/README.md): Creates a Google Cloud project.

* [**service-account**](project/service-account/README.md): Creates [service
accounts](https://cloud.google.com/iam/docs/service-accounts) for a GCP project.

* [**service-enablement**](project/service-enablement/README.md): Allows
enabling various APIs for a Google Cloud project.

## Scripts

* [**omnia-install**](scripts/omnia-install/README.md): Installs SLURM via Omnia
onto a cluster of compute VMs.

* [**spack-install**](scripts/spack-install/README.md): Creates a startup script
to install Spack on an instance or the Slurm controller.

* [**wait-for-startup**](scripts/wait-for-startup/README.md): Waits for
successful completion of a startup script on a compute VM.

## Scheduler

* [**SchedMD-slurm-on-gcp-controller**](scheduler/SchedMD-slurm-on-gcp-controller/README.md):
Creates a SLURM controller node using
[slurm-gcp](https://github.com/SchedMD/slurm-gcp/tree/master/tf/modules/controller)

* [**SchedMD-slurm-on-gcp-login-node**](scheduler/SchedMD-slurm-on-gcp-login-node/README.md):
Creates a SLURM login node using
[slurm-gcp](https://github.com/SchedMD/slurm-gcp/tree/master/tf/modules/login)
community/resources/compute/SchedMD-slurm-on-gcp-partition/README.md
@@ -13,7 +13,7 @@ Create a partition resource with a max node count of 200, named "compute",
connected to a resource subnetwork and with homefs mounted.

```yaml
- source: ./resources/third-party/compute/SchedMD-slurm-on-gcp-partition
- source: ./community/resources/compute/SchedMD-slurm-on-gcp-partition
kind: terraform
id: compute_partition
settings:
2 changes: 1 addition & 1 deletion community/resources/file-system/nfs-server/README.md
@@ -10,7 +10,7 @@ files with other clients over a network via the
### Example

```yaml
- source: resources/file-system/nfs-server
- source: ./community/resources/file-system/nfs-server
kind: terraform
id: homefs
settings:
2 changes: 1 addition & 1 deletion community/resources/project/new-project/README.md
@@ -9,7 +9,7 @@ This module is meant for use with Terraform 0.13.
### Example

```yaml
- source: ./resources/project/new-project
- source: ./community/resources/project/new-project
kind: terraform
id: project
settings:
2 changes: 1 addition & 1 deletion community/resources/project/service-account/README.md
@@ -5,7 +5,7 @@ Allows creation of service accounts for a Google Cloud Platform project.
### Example

```yaml
- source: ./resources/service-account
- source: ./community/resources/project/service-account
kind: terraform
id: service_acct
settings:
2 changes: 1 addition & 1 deletion community/resources/project/service-enablement/README.md
@@ -5,7 +5,7 @@ Allows management of multiple API services for a Google Cloud Platform project.
### Example

```yaml
- source: ./resources/service-enablement
- source: ./community/resources/project/service-enablement
kind: terraform
id: services-api
settings:
community/resources/scheduler/SchedMD-slurm-on-gcp-controller/README.md
@@ -9,7 +9,7 @@ More information about Slurm On GCP can be found at the [project's GitHub page](
### Example

```yaml
- source: ./resources/third-party/scheduler/SchedMD-slurm-on-gcp-controller
- source: ./community/resources/scheduler/SchedMD-slurm-on-gcp-controller
kind: terraform
id: slurm_controller
settings:
community/resources/scheduler/SchedMD-slurm-on-gcp-login-node/README.md
@@ -11,7 +11,7 @@ resource.
### Example

```yaml
- source: ./resources/third-party/scheduler/SchedMD-slurm-on-gcp-login-node
- source: ./community/resources/scheduler/SchedMD-slurm-on-gcp-login-node
kind: terraform
id: slurm_login
settings:
12 changes: 6 additions & 6 deletions community/resources/scripts/spack-install/README.md
@@ -35,26 +35,26 @@ https://www.googleapis.com/auth/devstorage.read_write
As an example, below is a possible definition of a Spack installation.

```yaml
- source: ./resources/scripts/spack-install
- source: ./community/resources/scripts/spack-install
kind: terraform
id: spack
settings:
install_dir: /apps/spack
install_dir: /sw/spack
spack_url: https://github.com/spack/spack
spack_ref: v0.17.0
spack_cache_url:
- mirror_name: 'gcs_cache'
mirror_url: gs://example-buildcache/linux-centos7
configs:
- type: 'single-config'
value: 'config:install_tree:/apps/spack/opt'
value: 'config:install_tree:/sw/spack/opt'
scope: 'site'
- type: 'file'
scope: 'site'
value: |
config:
build_stage:
- /apps/spack/stage
- /sw/spack/stage
- type: 'file'
scope: 'site'
value: |
@@ -91,7 +91,7 @@ Following the above description of this resource, it can be added to a Slurm
deployment via the following:
```yaml
- source: resources/third-party/scheduler/SchedMD-slurm-on-gcp-controller
- source: ./community/resources/scheduler/SchedMD-slurm-on-gcp-controller
kind: terraform
id: slurm_controller
use: [spack]
@@ -116,7 +116,7 @@ Alternatively, it can be added as a startup script via:
destination: install_spack_deps.yml
- type: shell
content: $(spack.startup_script)
destination: "/apps/spack-install.sh"
destination: "/sw/spack-install.sh"
```
<!-- BEGINNING OF PRE-COMMIT-TERRAFORM DOCS HOOK -->
2 changes: 1 addition & 1 deletion community/resources/scripts/wait-for-startup/README.md
@@ -15,7 +15,7 @@ up a node.
### Example

```yaml
- source: ./resources/scripts/wait-for-startup
- source: ./community/resources/scripts/wait-for-startup
kind: terraform
id: wait
settings:
2 changes: 1 addition & 1 deletion examples/README.md
@@ -76,7 +76,7 @@ File systems:
* The projectsfs is mounted at `/projects` and is a high scale SSD filestore
instance with 10TiB of capacity.
* The scratchfs is mounted at `/scratch` and is a
-[DDN Exascaler Lustre](../resources/third-party/file-system/DDN-EXAScaler/README.md)
+[DDN Exascaler Lustre](../community/resources/file-system/DDN-EXAScaler/README.md)
file system designed for high IO performance. The capacity is ~10TiB.

There are two partitions in this example: `low_cost` and `compute`. The
10 changes: 5 additions & 5 deletions examples/hpc-cluster-high-io.yaml
@@ -48,14 +48,14 @@ resource_groups:
size_gb: 10240
local_mount: /projects

- source: resources/third-party/file-system/DDN-EXAScaler
- source: ./community/resources/file-system/DDN-EXAScaler
kind: terraform
id: scratchfs
use: [network1]
settings:
local_mount: /scratch

- source: resources/third-party/compute/SchedMD-slurm-on-gcp-partition
- source: ./community/resources/compute/SchedMD-slurm-on-gcp-partition
kind: terraform
id: low_cost_partition
use:
@@ -71,7 +71,7 @@ resource_groups:
machine_type: n2-standard-4

# This compute_partition is far more performant than low_cost_partition.
- source: resources/third-party/compute/SchedMD-slurm-on-gcp-partition
- source: ./community/resources/compute/SchedMD-slurm-on-gcp-partition
kind: terraform
id: compute_partition
use:
@@ -83,7 +83,7 @@ resource_groups:
max_node_count: 200
partition_name: compute

- source: resources/third-party/scheduler/SchedMD-slurm-on-gcp-controller
- source: ./community/resources/scheduler/SchedMD-slurm-on-gcp-controller
kind: terraform
id: slurm_controller
use:
@@ -94,7 +94,7 @@ resource_groups:
- low_cost_partition # low cost partition will be default as it is listed first
- compute_partition

- source: resources/third-party/scheduler/SchedMD-slurm-on-gcp-login-node
- source: ./community/resources/scheduler/SchedMD-slurm-on-gcp-login-node
kind: terraform
id: slurm_login
use: