Nina xpk gpu h100 #87

Merged
NinaCai merged 38 commits into main from nina-xpk-gpu-h100 on Mar 21, 2024

Conversation

@NinaCai (Collaborator) commented Mar 15, 2024

Fixes / Features

  • Add cluster and workload creation for A3 cluster

Testing / Documentation

Ran xpk cluster create successfully in integ tests

  • [x] Tests pass
  • [x] Appropriate changes to documentation are included in the PR

@NinaCai NinaCai requested a review from Obliviour as a code owner March 15, 2024 01:07
@Obliviour Obliviour self-assigned this Mar 18, 2024
xpk.py (resolved review thread)
xpk.py Outdated
@@ -851,7 +1077,8 @@ def add_env_config(args):
Args:
args: user provided arguments for running the command.
"""
env = {'JOBSET_NAME': args.workload}
device_type = args.tpu_type if args.tpu_type else args.device_type
env = {} if device_type == h100_device_type else {'JOBSET_NAME': args.workload}
Collaborator:

Curious why the JOBSET_NAME variable isn't needed on h100s?

@NinaCai (Collaborator, Author) commented Mar 19, 2024:

The JobSet name is set in gpu_workload_create_yaml. Adding {'JOBSET_NAME': args.workload} wouldn't cause errors, but it isn't necessary.

Collaborator:

Sounds good, thanks Nina!

xpk.py (outdated diff; review thread resolved)
xpk.py Outdated
- "bash"
- "-c"
- |
echo XPK Start: $(date) ; _sigterm() ( kill -SIGTERM $!;); trap _sigterm SIGTERM; (cd /deps && bash gpu_multi_process_run.sh) & PID=$!; while kill -0 $PID 2>/dev/null; do sleep 5; done; EXIT_CODE=$? ; echo XPK End: $(date); echo EXIT_CODE=$EXIT_CODE; echo Main app is done > /usr/share/maxtext/workload_terminated
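For readability, here is the same entrypoint expanded with comments (a sketch only; the one-liner above is what the diff actually contains):

```shell
echo XPK Start: $(date)

# Forward SIGTERM to the most recent background job ($!).
_sigterm() ( kill -SIGTERM $!; )
trap _sigterm SIGTERM

# Launch the workload in the background and remember its PID.
(cd /deps && bash gpu_multi_process_run.sh) &
PID=$!

# Poll every 5 seconds until the workload process exits;
# `kill -0` only tests that the process still exists.
while kill -0 $PID 2>/dev/null; do sleep 5; done
EXIT_CODE=$?

echo XPK End: $(date)
echo EXIT_CODE=$EXIT_CODE

# Signal the sidecar that the main application has finished.
echo Main app is done > /usr/share/maxtext/workload_terminated
```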
Collaborator:

"echo Main app is done > /usr/share/maxtext/workload_terminated" seems MaxText specific --> will this work on non-maxtext workloads?

Collaborator (Author):

It works for --command "echo goodbye". Are there any other workloads you'd like me to test?

Collaborator:

Why is a MaxText-specific command run in the general xpk command?

If it is only needed for MaxText, /usr/share/maxtext/workload_terminated shouldn't be in the general xpk command, right? What is /usr/share/maxtext/workload_terminated for?

xpk.py Outdated
workload_delete_yaml = """apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
name: {args.workload}
annotations:
alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool # 1:1 job replica to node pool assignment
{annotation_config}
Collaborator:

Hmm, thinking about the deletion process more: currently we only support deleting TPU jobs, sad. And in order to support GPUs, we would also need to add a new required argument for the device type.

I am wondering if there is a way to avoid this new required argument, since it would break current user flows and shouldn't be needed in theory.

@danielvegamyhre do you have any thoughts here? We want to delete a jobset just from its name. Do we need the annotation?

Collaborator:

I talked with @danielvegamyhre offline and he suggested that we delete the jobset by name using a CLI command or the Python SDK. Let's use the CLI command for now since that aligns with our other command usage.

From Daniel:

You can delete a jobset by its namespaced name; you'll just need to do it using a different kubectl command, or the python sdk: `kubectl delete jobset <name> -n <namespace>`.

So the ask here is to move to `kubectl delete jobset {NAME} -n default`. That minimizes the complexity of the workload delete code by a lot, yay.

The namespace we use is default (https://github.com/google/xpk/blob/main/xpk.py#L1548). Feel free to create a const var for the namespace if you'd like.
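A minimal sketch of that suggestion with a namespace constant (names here are illustrative, not the PR's actual code):

```python
# Namespace that xpk deploys workloads into.
_DEFAULT_NAMESPACE = 'default'


def workload_delete_command(workload_name: str) -> str:
  """Builds the kubectl command that deletes a JobSet by name.

  The same command works for TPU, GPU, and CPU workloads, so workload
  delete no longer needs a device-type argument.
  """
  return f'kubectl delete jobset {workload_name} -n {_DEFAULT_NAMESPACE}'
```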

xpk.py (two resolved review threads)
xpk.py Outdated
f' --additional-node-network network={args.cluster}-net-4,subnetwork={args.cluster}-sub-4'
' --no-enable-autoupgrade --scopes="https://www.googleapis.com/auth/cloud-platform"'
)
else: # other gpu types
Collaborator:

We also have CPU types, which would fall into this else statement. Can you add an `elif system.accelerator_type == AcceleratorType['GPU']` case for GPUs (if needed)?

Or add a comment saying "other GPU and CPU types".
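For concreteness, a minimal sketch of the suggested branching (the `AcceleratorType` mapping mirrors how it is used elsewhere in xpk.py; the constant value and flag strings are placeholder assumptions):

```python
# Assumed enum-like mapping, mirroring AcceleratorType['GPU'] usage in xpk.py.
AcceleratorType = {'TPU': 1, 'GPU': 2, 'CPU': 3}
h100_device_type = 'h100-80gb-8'  # assumed value of the PR's constant


def extra_create_flags(accelerator_type: int, device_type: str) -> str:
  """Returns the create-command flags for the given accelerator."""
  if device_type == h100_device_type:
    # A3-specific flags, e.g. the --additional-node-network lines above.
    return ' --additional-node-network ...'
  elif accelerator_type == AcceleratorType['GPU']:
    # Other GPU types.
    return ' ...other-gpu-flags...'
  else:
    # CPU types take the generic path.
    return ''
```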

@@ -1598,6 +2073,51 @@ def enable_kueue_crds(args, system) -> int:
return 0


def get_kueue_covered_resources_config(args, cluster_hardware_name, resource_type, total_chips) -> str:
Collaborator:

Just to note, this might conflict with the Pathways changes (https://github.com/google/xpk/pull/74/files). Not sure which PR will go in first, but I am happy to work through the merge conflict.

I like this function, so hopefully we can use it in the Pathways PR. cc @RoshaniN

Collaborator (Author):

Looks like the Pathways PR is already merged. How should we proceed with any potential conflicts?

Collaborator:

I made a helper to add resources to the kueue config for Pathways; we can work on adding the kueue config conditionally:

If --enable-pathways, Pathways resources are added.
If GPU, GPU resources are added.
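A hypothetical sketch of that conditional composition (the helper and the resource strings are stand-ins for the real kueue config snippets):

```python
def build_covered_resources(enable_pathways: bool, is_gpu: bool) -> list:
  """Composes the kueue covered-resources config conditionally."""
  resources = []
  if enable_pathways:
    resources.append('pathways-cpu-resources')  # stand-in snippet
  if is_gpu:
    resources.append('gpu-resources')  # stand-in snippet
  return resources


# e.g. a GPU cluster without Pathways:
print(build_covered_resources(enable_pathways=False, is_gpu=True))
```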

xpk.py Outdated
debugging_dashboard_id = get_gke_debugging_dashboard(args)

device_type = args.tpu_type if args.tpu_type else args.device_type
if device_type == h100_device_type:
Collaborator:

nit: you can use `system.device_type == h100_device_type` here

xpk.py Outdated
command = f'kubectl delete -f {str(tmp.file.name)}'
device_type = args.tpu_type if args.tpu_type else args.device_type
if device_type == h100_device_type:
command = f'kubectl delete jobset {workload} -n default'
Collaborator:

Thanks Nina, I think we don't even need to check whether it is a TPU / CPU / GPU.

Can we use `kubectl delete jobset {workload} -n default` for all delete cases and drop the --tpu-type and --device-type required arguments from the workload_delete_parser?

This lets us avoid creating new required arguments and delete the no-longer-needed workload_delete_yaml (big yay).

@Obliviour Obliviour requested a review from RoshaniN March 20, 2024 17:53
xpk.py (resolved review thread)
@RoshaniN (Collaborator) left a comment:

Thank you for your changes, Nina!

Please pull in the latest changes; you may have to solve a few merge conflicts, but I am happy to help as needed.

Most of the Pathways code is behind the enable-pathways and use-pathways flags. I believe most of your changes are compatible.

xpk.py Outdated
if args.enable_pathways:
command += (' --enable-ip-alias ')
command += (f' --create-subnetwork name={args.cluster}-subnetwork')
command += ('--release-channel rapid --enable-autoscaling --location-policy=BALANCED'
Collaborator:

missing space before `--release-channel`, which is causing the build test to fail

README.md (resolved review thread)
README.md Outdated
Comment on lines 478 to 547
* Workload Delete (delete all training jobs in the cluster):

```shell
python3 xpk.py workload delete \
--cluster xpk-test
```

This will delete all the workloads in the `xpk-test` cluster. Deletion will only begin if you type `y` or `yes` at the prompt.

* Workload Delete supports filtering. Delete a portion of jobs that match user criteria.
* Filter by Job: `filter-by-job`

```shell
python3 xpk.py workload delete \
--cluster xpk-test --filter-by-job=$USER
```

This will delete all the workloads in the `xpk-test` cluster whose names start with `$USER`. Deletion will only begin if you type `y` or `yes` at the prompt.

* Filter by Status: `filter-by-status`

```shell
python3 xpk.py workload delete \
--cluster xpk-test --filter-by-status=QUEUED
```

This will delete all the workloads in the `xpk-test` cluster whose status is Admitted or Evicted and whose number of running VMs is 0. Deletion will only begin if you type `y` or `yes` at the prompt. Status can be: `EVERYTHING`, `FINISHED`, `RUNNING`, `QUEUED`, `FAILED`, `SUCCESSFUL`.

* Workload List (see training jobs):

```shell
python3 xpk.py workload list \
--cluster xpk-test
```

* Example Workload List Output:

The example below shows five jobs of different statuses:

* `user-first-job-failed`: **filter-status** is `FINISHED` and `FAILED`.
* `user-second-job-success`: **filter-status** is `FINISHED` and `SUCCESSFUL`.
* `user-third-job-running`: **filter-status** is `RUNNING`.
* `user-forth-job-in-queue`: **filter-status** is `QUEUED`.
* `user-fifth-job-in-queue-preempted`: **filter-status** is `QUEUED`.

```
Jobset Name Created Time Priority TPU VMs Needed TPU VMs Running/Ran TPU VMs Done Status Status Message Status Time
user-first-job-failed 2023-1-1T1:00:00Z medium 4 4 <none> Finished JobSet failed 2023-1-1T1:05:00Z
user-second-job-success 2023-1-1T1:10:00Z medium 4 4 4 Finished JobSet finished successfully 2023-1-1T1:14:00Z
user-third-job-running 2023-1-1T1:15:00Z medium 4 4 <none> Admitted Admitted by ClusterQueue cluster-queue 2023-1-1T1:16:00Z
user-forth-job-in-queue 2023-1-1T1:16:05Z medium 4 <none> <none> Admitted couldn't assign flavors to pod set slice-job: insufficient unused quota for google.com/tpu in flavor 2xv4-8, 4 more need 2023-1-1T1:16:10Z
user-fifth-job-preempted 2023-1-1T1:10:05Z low 4 <none> <none> Evicted Preempted to accommodate a higher priority Workload 2023-1-1T1:10:00Z
```

* Workload List supports filtering. Observe a portion of jobs that match user criteria.

* Filter by Status: `filter-by-status`

Filter the workload list by the status of the respective jobs.
Status can be: `EVERYTHING`, `FINISHED`, `RUNNING`, `QUEUED`, `FAILED`, `SUCCESSFUL`.

* Filter by Job: `filter-by-job`

Filter the workload list by the name of a job.

```shell
python3 xpk.py workload list \
--cluster xpk-test --filter-by-job=$USER
```

Collaborator:

These lines can be deleted. They are already in the README in earlier sections and are not GPU-specific.

xpk.py Outdated
@@ -1344,6 +1726,15 @@ def create_cluster_configmaps(args, system):
Returns:
0 if successful and 1 otherwise.
"""
device_type = args.tpu_type if args.tpu_type else args.device_type
if device_type == h100_device_type:
data = f'{device_type}: "{1 * int(args.num_nodes)}"'
Collaborator:

nit "1 * int(args.num_nodes)" can be int(args.num_nodes)

xpk.py Outdated
@@ -3072,10 +3678,18 @@ def workload_create(args) -> int:
container = get_main_and_sidecar_container(args, system, docker_image)
# Get GKE debugging dashboard only when sidecar container is deployed for TPU workloads
debugging_dashboard_id = get_gke_debugging_dashboard(args)
else:
elif system.accelerator_type == AcceleratorType['CPU']:
Collaborator:

We also want this to run in the TPU case.

xpk.py Outdated
Comment on lines 3732 to 3763
if args.use_pathways:
# Ensure the cluster and CPU nodepools were created with --enable-pathways
all_node_pools = get_all_nodepools_programmatic(args)
desired_pw_cpu_node_pools = {'cpu-user-np', 'cpu-rm-np', 'cpu-proxy-np'}
if not desired_pw_cpu_node_pools.issubset(set(all_node_pools[0])):
xpk_print(
'Cluster needs to be created with --enable-pathways to run Pathways workloads.'
)
xpk_exit(1)

# Ensure device type is TPUs - currently Pathways supports TPUs only.
if system.accelerator_type != AcceleratorType['TPU']:
xpk_print(
'Currently, Pathways workloads can only be run on TPUs.'
)
xpk_exit(1)

yml_string = pw_workload_create_yaml.format(args=args,
system=system,
container=container,
accelerator_label=create_accelerator_label(system.accelerator_type, system),
machine_label=create_machine_label(system.accelerator_type, system),
pathways_rm_args = get_pathways_rm_args(args),
pathways_worker_args = get_pathways_worker_args(args),
pathways_proxy_args = get_pathways_proxy_args(args),
resource_type=resource_type,
local_queue_name=_LOCAL_QUEUE_NAME)
tmp = write_temporary_file(yml_string)
command = f'kubectl apply -f {str(tmp.file.name)}'
return_code = run_command_with_updates(command, 'Creating Workload', args)
return_code = run_command_with_updates(command, 'Creating a Pathways Workload', args)


Collaborator:

This Pathways code (+3732 to +3763) is duplicated; it already exists above. We can delete it here.

@Obliviour (Collaborator) left a comment:

Thank you for this change; h100s in XPK is awesome!

@RoshaniN (Collaborator) commented:

Nina was able to check that Pathways functionality still works after these changes! Thank you for the change!

@NinaCai NinaCai requested a review from RoshaniN March 21, 2024 19:26
@RoshaniN (Collaborator) left a comment:

Thanks Nina for verifying Pathways functionality!

@NinaCai NinaCai merged commit c794fbe into main Mar 21, 2024
4 checks passed
@NinaCai NinaCai deleted the nina-xpk-gpu-h100 branch March 21, 2024 19:28