Nina xpk gpu h100 #87

Merged (38 commits, Mar 21, 2024)

Commits
- `08b7dd2` Update xpk.py to support GKE cluster creation for H100. (yangyuwei, Feb 8, 2024)
- `15f9c21` Fix pylint warnings. (yangyuwei, Feb 8, 2024)
- `a9b6354` Update xpk.py to support workload creation and deletion on H100 GPUs. (yangyuwei, Feb 27, 2024)
- `f1aeb3c` Fix Kueue coveredResources config. (yangyuwei, Feb 28, 2024)
- `da23bbb` Make changes to provide consistent user experience of running command… (yangyuwei, Feb 29, 2024)
- `cda06bc` Change an env flag from USE_GPUDIRECT_TCPX to USE_GPUDIRECT. (yangyuwei, Mar 13, 2024)
- `fbb9f24` address comments (NinaCai, Mar 15, 2024)
- `289b33c` Merge branch 'main' into nina-xpk-gpu-h100 (NinaCai, Mar 15, 2024)
- `ec8e66b` test create a cluster now (NinaCai, Mar 15, 2024)
- `69cbff8` fix typo (NinaCai, Mar 15, 2024)
- `f9bc61a` use exact command (NinaCai, Mar 15, 2024)
- `2d0670f` add command for running workload (NinaCai, Mar 15, 2024)
- `aa8a69c` change get_cluster_configmap to deal with warnings (NinaCai, Mar 16, 2024)
- `697b145` change kueue to v0.6.1 (NinaCai, Mar 18, 2024)
- `e3d6893` remove trailing whitespaces (NinaCai, Mar 18, 2024)
- `1559b6f` remove trailing whitespace (NinaCai, Mar 18, 2024)
- `c85aa06` resolve too many arguments error (NinaCai, Mar 18, 2024)
- `bba177a` address comments (NinaCai, Mar 19, 2024)
- `e906fd1` change env_format (NinaCai, Mar 20, 2024)
- `82facb6` delete workload delete config func (NinaCai, Mar 20, 2024)
- `2bcbc6a` remove pull_request check (NinaCai, Mar 20, 2024)
- `5548983` add comment for env_format (NinaCai, Mar 20, 2024)
- `8ae3c13` remove workload_delete_yaml (NinaCai, Mar 20, 2024)
- `000908f` remove device/tpu_type (NinaCai, Mar 20, 2024)
- `5e55995` num_slices represents the num of nodepools (NinaCai, Mar 20, 2024)
- `bdfac47` introduce num_nodes, maxtext to workload (NinaCai, Mar 20, 2024)
- `f6ee23e` rebase main and resolve conflicts (NinaCai, Mar 20, 2024)
- `dff04e2` make num_nodes an optional argument (NinaCai, Mar 20, 2024)
- `bddfd35` address comments (NinaCai, Mar 21, 2024)
- `702af71` rebase main branch (NinaCai, Mar 21, 2024)
- `e4220d2` resolve whitespace trailing (NinaCai, Mar 21, 2024)
- `119dd21` resolve pw_resources error (NinaCai, Mar 21, 2024)
- `4c0b4af` add back add_pw_resource_flavors func (NinaCai, Mar 21, 2024)
- `0a22e1c` remove dup func (NinaCai, Mar 21, 2024)
- `68b81fa` move pw_resources_kueue (NinaCai, Mar 21, 2024)
- `2757292` remove duplicate code (NinaCai, Mar 21, 2024)
- `31e7d19` add both cpu and tpu to the condition (NinaCai, Mar 21, 2024)
- `381228a` change the code based on pylint suggestions (NinaCai, Mar 21, 2024)
README.md: 109 additions, 0 deletions

To use XPK for GPUs, pass the `device-type` flag.

```shell
# (cluster create command truncated in this diff view)
    --reservation=$RESERVATION_ID
```

* Cluster Delete (deprovision capacity):

```shell
python3 xpk.py cluster delete \
--cluster xpk-test
```

* Cluster List (see provisioned capacity):

```shell
python3 xpk.py cluster list
```

* Cluster Describe (see capacity):

```shell
python3 xpk.py cluster describe \
--cluster xpk-test
```


* Cluster Cacheimage (enables faster start times):

```shell
python3 xpk.py cluster cacheimage \
--cluster xpk-test --docker-image gcr.io/your_docker_image \
--device-type=h100-80gb-8
```


* [Install NVIDIA GPU device drivers](https://cloud.google.com/container-optimized-os/docs/how-to/run-gpus#install)
```shell
# List available driver versions
```

```shell
--command="echo hello world"
```

* Workload Delete (delete training job):

```shell
python3 xpk.py workload delete \
--workload xpk-test-workload --cluster xpk-test
```

This will delete only the `xpk-test-workload` workload in the `xpk-test` cluster.

* Workload Delete (delete all training jobs in the cluster):

```shell
python3 xpk.py workload delete \
--cluster xpk-test
```

This will delete all the workloads in the `xpk-test` cluster. Deletion will only begin if you type `y` or `yes` at the prompt.

* Workload Delete supports filtering, so you can delete only the jobs that match your criteria.
  * Filter by Job: `filter-by-job`

```shell
python3 xpk.py workload delete \
--cluster xpk-test --filter-by-job=$USER
```

This will delete all the workloads in the `xpk-test` cluster whose names start with `$USER`. Deletion will only begin if you type `y` or `yes` at the prompt.

* Filter by Status: `filter-by-status`

```shell
python3 xpk.py workload delete \
--cluster xpk-test --filter-by-status=QUEUED
```

This will delete all the workloads in the `xpk-test` cluster whose status is Admitted or Evicted and which have zero running VMs. Deletion will only begin if you type `y` or `yes` at the prompt. Status can be one of: `EVERYTHING`, `FINISHED`, `RUNNING`, `QUEUED`, `FAILED`, `SUCCESSFUL`.
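
Status filters also compose with ordinary shell loops. A minimal sketch, assuming you want to sweep the terminal statuses one at a time; the `dry_run` switch below is a local convention for previewing commands, not an xpk flag:

```shell
# Sweep terminal statuses; set dry_run=0 to actually execute the deletions.
dry_run=1
for status in FAILED SUCCESSFUL; do
  cmd="python3 xpk.py workload delete --cluster xpk-test --filter-by-status=$status"
  if [ "$dry_run" -eq 1 ]; then
    echo "$cmd"    # preview only
  else
    $cmd
  fi
done
```

With `dry_run=1` this prints one delete command per status so you can inspect them before running anything destructive.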

* Workload List (see training jobs):

```shell
python3 xpk.py workload list \
--cluster xpk-test
```

* Example Workload List Output:

The example below shows five jobs with different statuses:

* `user-first-job-failed`: matches **filter-by-status** `FINISHED` and `FAILED`.
* `user-second-job-success`: matches **filter-by-status** `FINISHED` and `SUCCESSFUL`.
* `user-third-job-running`: matches **filter-by-status** `RUNNING`.
* `user-forth-job-in-queue`: matches **filter-by-status** `QUEUED`.
* `user-fifth-job-preempted`: matches **filter-by-status** `QUEUED`.

```
Jobset Name Created Time Priority TPU VMs Needed TPU VMs Running/Ran TPU VMs Done Status Status Message Status Time
user-first-job-failed 2023-1-1T1:00:00Z medium 4 4 <none> Finished JobSet failed 2023-1-1T1:05:00Z
user-second-job-success 2023-1-1T1:10:00Z medium 4 4 4 Finished JobSet finished successfully 2023-1-1T1:14:00Z
user-third-job-running 2023-1-1T1:15:00Z medium 4 4 <none> Admitted Admitted by ClusterQueue cluster-queue 2023-1-1T1:16:00Z
user-forth-job-in-queue 2023-1-1T1:16:05Z medium 4 <none> <none> Admitted couldn't assign flavors to pod set slice-job: insufficient unused quota for google.com/tpu in flavor 2xv4-8, 4 more need 2023-1-1T1:16:10Z
user-fifth-job-preempted 2023-1-1T1:10:05Z low 4 <none> <none> Evicted Preempted to accommodate a higher priority Workload 2023-1-1T1:10:00Z
```
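
Because the listing is plain text, standard tools can post-process it. A small sketch using a captured copy of the output above; in practice you would pipe the `xpk.py workload list` command itself:

```shell
# Count how many jobs in a captured listing reached the Finished state.
listing='user-first-job-failed     ... Finished
user-second-job-success   ... Finished
user-third-job-running    ... Admitted'
finished=$(printf '%s\n' "$listing" | grep -c 'Finished')
echo "$finished"   # prints 2
```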

* Workload List supports filtering, so you can view only the jobs that match your criteria.

* Filter by Status: `filter-by-status`

Filter the workload list by job status. Status can be one of: `EVERYTHING`, `FINISHED`, `RUNNING`, `QUEUED`, `FAILED`, `SUCCESSFUL`.
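
No example command is shown for this filter, but by analogy with the delete filter above the invocation shape is presumably the same. A sketch that also guards the status value locally before calling xpk; the validation wrapper is illustrative, not part of xpk:

```shell
# Allowed status values, per the list above.
valid="EVERYTHING FINISHED RUNNING QUEUED FAILED SUCCESSFUL"
status="RUNNING"
case " $valid " in
  *" $status "*) : ;;  # recognized status, proceed
  *) echo "unknown status: $status" >&2; exit 1 ;;
esac
list_cmd="python3 xpk.py workload list --cluster xpk-test --filter-by-status=$status"
echo "$list_cmd"
```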

* Filter by Job: `filter-by-job`

Filter the workload list by the name of a job.

```shell
python3 xpk.py workload list \
--cluster xpk-test --filter-by-job=$USER
```

Reviewer comment (Collaborator): These lines can be deleted; they are already in the README in earlier sections and are not GPU-specific.

## CPU usage

To use XPK for CPUs, pass the `device-type` flag.