
Commit

Add module to support deployment of cluster toolkit module (#269)
* class skeleton

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* initial changes

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* fix linting

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* refactor

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* fix pytype

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* tests

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* tests

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* fix formatting

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* fix formatting

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* add changes to github workflows

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* refactor tests

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* add init

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* fix path

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* change deployment dir

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* fix echo

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* set env variables

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* set deployment dir

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* add gcloud config

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* fix gcloud

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* debug

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* add missing cred file

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* run tests

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* change env variable

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* debug

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* mv credentials file

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* fix

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* update dockerfile commit

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* add volumes to output

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* fix formatting

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* ignore these tests

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* fix formatting

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* add tests and staging files

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* fix staging and formatting

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* fix typo

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* add tests for a3

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* add a3 tests description

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* fix linting

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* fix pylint

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* fix yaml

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* set removing container after command execution to true

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* fix linting

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* add blueprint directory

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* remove blueprint

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* change to repo path instead of download from gh

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* fix formatting

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* fix import

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* rm unused function

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* tests

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* add missing file and set enable_private_endpoints to false

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* fix linting

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* fix unit tests

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* uncomment tests

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* fix linting

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* review fixes (refactor of everything)

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* remove a3 ultra files

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* tests refactor

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* uncomment deploy

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* fix pylint

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* fix pyink

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* fix pytype

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* fix unit tests

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* move blueprint to separate dir; remove nodepool from blueprint; add blueprint versioning draft

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* fix integration tests

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* move hardcoded numerical values to methods args

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* fix linting

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* remove build method and move it to initialize; fix unit tests

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* add uploads to path in docker_manager

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* enhance docstrings

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* fix unit tests and formatting

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

* review fixes

Signed-off-by: Piotr Pawłowski <ppawl@google.com>

---------

Signed-off-by: Piotr Pawłowski <ppawl@google.com>
pawloch00 authored Dec 12, 2024
1 parent 6bb8914 commit dba38bd
Showing 20 changed files with 1,208 additions and 400 deletions.
41 changes: 40 additions & 1 deletion .github/workflows/build_tests.yaml
@@ -26,6 +26,10 @@ env:
   PATHWAYS_WORKLOAD_NAME: xpkpw-build-${{ github.run_attempt }}
   CLUSTER_ARGUMENTS: "--network=${{secrets.NETWORK_NAME}} --subnetwork=${{secrets.SUBNETWORK_NAME}} --maintenance-window=23:50"
   RUN_ID: "pr-${{ github.event.number }}"
+  PROJECT_ID: ${{secrets.PROJECT_NAME}}
+  DEPLOYMENT_NAME: "xpk-ctk-int"
+  ZONE: us-central2-a
+  REGION: us-central2
 
 jobs:
   run-unit-tests:
@@ -42,10 +46,45 @@ jobs:
       run : make install-dev
     - name: Run unit tests
       run: make run-unittests
 
+  run-integration-tests:
+    runs-on: [ubuntu-22.04]
+    needs: [run-unit-tests]
+    concurrency: # We support one build or nightly test to run at a time currently.
+      group: build-test-cluster-group
+      cancel-in-progress: false
+    steps:
+    - uses: actions/checkout@v4
+    - uses: actions/setup-python@v5
+      with:
+        python-version: '3.10'
+    - uses: 'google-github-actions/auth@v2'
+      with:
+        credentials_json: '${{ secrets.GCP_SA_KEY }}'
+    - uses: google-github-actions/setup-gcloud@v2
+      with:
+        version: '>= 363.0.0'
+        install_components: 'beta,gke-gcloud-auth-plugin'
+    - name: Verify gcp setup
+      run: gcloud info
+    - name: Install dependencies
+      run : make install-dev
+    - name: "Set auth cidr"
+      run: echo "AUTH_CIDR=$(curl api.ipify.org)/32" >> $GITHUB_ENV
+    - name: "Set GCLOUD_CFG_PATH"
+      run: echo "GCLOUD_CFG_PATH=/home/runner/work/xpk/xpk/" >> $GITHUB_ENV
+    - name: "Copy credentials"
+      run: cp $GOOGLE_APPLICATION_CREDENTIALS $GCLOUD_CFG_PATH/application_default_credentials.json
+    - name: "Set DEPLOYMENT_DIR"
+      run: echo "DEPLOYMENT_DIR=$HOME/deployment" >> $GITHUB_ENV
+    - name: Create deployment dir
+      run: mkdir -p $DEPLOYMENT_DIR
+    - name: Run integration tests
+      run: make run-integrationtests
+
   cluster-create-and-delete:
     runs-on: [ubuntu-22.04]
-    needs: [run-unit-tests]
+    needs: [run-integration-tests]
     concurrency: # We support one nightly test and one build test for each branch to run at a time currently.
       group: build-test-cluster-group-${{ github.ref }}
       cancel-in-progress: false
5 changes: 4 additions & 1 deletion Makefile
@@ -30,7 +30,10 @@ install-pytest:
 
 .PHONY: run-unittests
 run-unittests:
-	pytest src/xpk/
+	pytest -vv src/xpk/core/tests/unit/
+
+run-integrationtests:
+	pytest src/xpk/core/tests/integration/
 
 .PHONY: install-kjob
 install-kjob: install-kubectl
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -62,7 +62,7 @@ dev = [
 version = {attr = "xpk.core.core.__version__"}
 
 [tool.setuptools]
-packages = ["xpk", "xpk.parser", "xpk.core", "xpk.commands", "xpk.utils"]
+packages = ["xpk", "xpk.parser", "xpk.core", "xpk.commands", "xpk.utils", "xpk.core.blueprint"]
 package-dir = {"" = "src"}
 
 [tool.pyink]
6 changes: 6 additions & 0 deletions src/xpk/blueprints/a3mega/config-map.yaml.tftpl
@@ -0,0 +1,6 @@
+kind: ConfigMap
+apiVersion: v1
+metadata:
+  name: ${name}
+data:
+  h100-mega-80gb-8: "${num_nodes}"
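
The template above leaves two placeholders, `${name}` and `${num_nodes}`, to be filled in at deployment time. The `.tftpl` extension suggests Terraform's template rendering is the real consumer; the sketch below only imitates that substitution with Python's standard `string.Template`, which happens to share the `${...}` syntax for simple variables. The function name `render_config_map` and the example ConfigMap name are hypothetical and do not come from the commit.

```python
from string import Template

# Copy of the template added in this commit, with YAML indentation restored.
CONFIG_MAP_TEMPLATE = """\
kind: ConfigMap
apiVersion: v1
metadata:
  name: ${name}
data:
  h100-mega-80gb-8: "${num_nodes}"
"""


def render_config_map(name: str, num_nodes: int) -> str:
  """Fills the two placeholders; raises KeyError if one is missing."""
  return Template(CONFIG_MAP_TEMPLATE).substitute(
      name=name, num_nodes=num_nodes
  )


# Hypothetical values, chosen only for illustration.
rendered = render_config_map("example-resources-configmap", 2)
print(rendered)
```

`substitute` is used instead of `safe_substitute` so that a forgotten variable fails loudly rather than leaving a literal `${...}` in the manifest.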
73 changes: 73 additions & 0 deletions src/xpk/blueprints/a3mega/kueue-xpk-configuration.yaml.tftpl
@@ -0,0 +1,73 @@
+apiVersion: kueue.x-k8s.io/v1beta1
+kind: ResourceFlavor
+metadata:
+  name: 1xh100-mega-80gb-8
+spec:
+  nodeLabels:
+    cloud.google.com/gke-accelerator: nvidia-h100-mega-80gb
+---
+
+apiVersion: kueue.x-k8s.io/v1beta1
+kind: ClusterQueue
+metadata:
+  name: cluster-queue
+spec:
+  preemption:
+    reclaimWithinCohort: Never # Don't preempt other queues in the cohort.
+    withinClusterQueue: LowerPriority
+  namespaceSelector: {} # match all.
+  resourceGroups:
+  - coveredResources: ["nvidia.com/gpu"]
+    flavors:
+    - name: 1xh100-mega-80gb-8
+      resources:
+      - name: "nvidia.com/gpu"
+        nominalQuota: ${num_chips}
+---
+apiVersion: kueue.x-k8s.io/v1beta1
+kind: LocalQueue
+metadata:
+  namespace: default
+  name: multislice-queue
+spec:
+  clusterQueue: cluster-queue
+---
+apiVersion: scheduling.k8s.io/v1
+kind: PriorityClass
+metadata:
+  name: very-low
+value: 100
+globalDefault: false
+description: "Very Low"
+---
+apiVersion: scheduling.k8s.io/v1
+kind: PriorityClass
+metadata:
+  name: low
+value: 250
+globalDefault: false
+description: "Low"
+---
+apiVersion: scheduling.k8s.io/v1
+kind: PriorityClass
+metadata:
+  name: medium
+value: 500
+globalDefault: false
+description: "Medium"
+---
+apiVersion: scheduling.k8s.io/v1
+kind: PriorityClass
+metadata:
+  name: high
+value: 750
+globalDefault: false
+description: "High"
+---
+apiVersion: scheduling.k8s.io/v1
+kind: PriorityClass
+metadata:
+  name: very-high
+value: 1000
+globalDefault: false
+description: "Very High"
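
The Kueue configuration above takes a single input, `${num_chips}`, as the GPU quota of the ClusterQueue, and defines five PriorityClass objects. The following sketch shows one way the quota could be derived and how the priority values order workloads. It assumes eight GPUs per node, inferred only from the `-8` suffix of the `h100-mega-80gb-8` flavor name; the commit shown here does not state how `num_chips` is actually computed, and the helper names are hypothetical.

```python
# Assumption: the "-8" suffix in "h100-mega-80gb-8" means 8 GPUs per node.
GPUS_PER_NODE = 8

# Priority values copied from the PriorityClass objects in the template.
PRIORITY_CLASSES = {
    "very-low": 100,
    "low": 250,
    "medium": 500,
    "high": 750,
    "very-high": 1000,
}


def nominal_quota(num_nodes: int, gpus_per_node: int = GPUS_PER_NODE) -> int:
  """Returns a candidate value for ${num_chips} under the assumption above."""
  if num_nodes < 0:
    raise ValueError("num_nodes must be non-negative")
  return num_nodes * gpus_per_node


def classes_highest_first() -> list[str]:
  """Lists priority class names from highest to lowest value."""
  return sorted(PRIORITY_CLASSES, key=PRIORITY_CLASSES.get, reverse=True)


print(nominal_quota(2))
print(classes_highest_first())
```

With `withinClusterQueue: LowerPriority`, a pending workload may preempt running workloads in the same queue that have a lower priority value, so the ordering returned by `classes_highest_first` is also the order in which classes can displace one another.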