Nina xpk gpu h100 #87

Merged
NinaCai merged 38 commits into main from nina-xpk-gpu-h100 on Mar 21, 2024

Conversation

@NinaCai (Collaborator) commented Mar 15, 2024

Fixes / Features

  • Add cluster and workload creation for A3 cluster

Testing / Documentation

Ran xpk cluster create successfully in integ tests

  • [x] Tests pass
  • [x] Appropriate changes to documentation are included in the PR

@NinaCai NinaCai requested a review from Obliviour as a code owner March 15, 2024 01:07
@Obliviour Obliviour self-assigned this Mar 18, 2024
xpk.py (resolved review thread)
xpk.py Outdated
@@ -851,7 +1077,8 @@ def add_env_config(args):
Args:
args: user provided arguments for running the command.
"""
env = {'JOBSET_NAME': args.workload}
device_type = args.tpu_type if args.tpu_type else args.device_type
env = {} if device_type == h100_device_type else {'JOBSET_NAME': args.workload}
Collaborator:

Curious why the JOBSET_NAME variable isn't needed on h100s?

@NinaCai (Collaborator, Author) commented Mar 19, 2024:

The JobSet name is set in gpu_workload_create_yaml. Adding {'JOBSET_NAME': args.workload} wouldn't cause errors, but it isn't necessary.

Collaborator:

Sounds good, thanks Nina!

xpk.py (outdated diff; review thread resolved)
xpk.py Outdated
- "bash"
- "-c"
- |
echo XPK Start: $(date) ; _sigterm() ( kill -SIGTERM $!;); trap _sigterm SIGTERM; (cd /deps && bash gpu_multi_process_run.sh) & PID=$!; while kill -0 $PID 2>/dev/null; do sleep 5; done; EXIT_CODE=$? ; echo XPK End: $(date); echo EXIT_CODE=$EXIT_CODE; echo Main app is done > /usr/share/maxtext/workload_terminated
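For readability, here is the same entrypoint expanded with comments (a sketch only; the one-liner above is what the diff actually contains):

```shell
echo XPK Start: $(date)

# Forward SIGTERM to the most recent background job ($!).
_sigterm() ( kill -SIGTERM $!; )
trap _sigterm SIGTERM

# Launch the workload in the background and remember its PID.
(cd /deps && bash gpu_multi_process_run.sh) &
PID=$!

# Poll every 5 seconds until the workload process exits;
# `kill -0` only tests that the process still exists.
while kill -0 $PID 2>/dev/null; do sleep 5; done
EXIT_CODE=$?

echo XPK End: $(date)
echo EXIT_CODE=$EXIT_CODE

# Signal the sidecar that the main application has finished.
echo Main app is done > /usr/share/maxtext/workload_terminated
```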
Collaborator:

"echo Main app is done > /usr/share/maxtext/workload_terminated" seems MaxText specific --> will this work on non-maxtext workloads?

Collaborator (Author):

It works for --command "echo goodbye". Are there any other workloads you'd like me to test?

Collaborator:

Why is a MaxText-specific command run in the general xpk command?

If it is only needed for MaxText, /usr/share/maxtext/workload_terminated shouldn't be in the general xpk command, right? What is /usr/share/maxtext/workload_terminated for?

xpk.py Outdated
workload_delete_yaml = """apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
name: {args.workload}
annotations:
alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool # 1:1 job replica to node pool assignment
{annotation_config}
Collaborator:

Hmm, thinking about the deletion process more: currently we only support deleting TPU jobs, sad. And in order to support GPUs, we would also need to add a new required argument for the device type.

I am wondering if there is a way to avoid this new required argument, since it would break current user flows and shouldn't be needed in theory.

@danielvegamyhre do you have any thoughts here? We want to delete a jobset just from its name. Do we need the annotation?

Collaborator:

I talked with @danielvegamyhre offline and he suggested that we delete the jobset by name using a CLI command or the Python SDK. Let's use the CLI command for now since that aligns with our other command usage.

From Daniel:

You can delete a jobset by its namespaced name; you'll just need to do it using a different kubectl command, or the python sdk: `kubectl delete jobset <name> -n <namespace>`.

So the ask here is to move to `kubectl delete jobset {NAME} -n default`. That minimizes the complexity of the workload delete code by a lot, yay.

The namespace we use is default (https://github.com/google/xpk/blob/main/xpk.py#L1548). Feel free to create a const var for the namespace if you'd like.
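A minimal sketch of that suggestion with a namespace constant (names here are illustrative, not the PR's actual code):

```python
# Namespace that xpk deploys workloads into.
_DEFAULT_NAMESPACE = 'default'


def workload_delete_command(workload_name: str) -> str:
  """Builds the kubectl command that deletes a JobSet by name.

  The same command works for TPU, GPU, and CPU workloads, so workload
  delete no longer needs a device-type argument.
  """
  return f'kubectl delete jobset {workload_name} -n {_DEFAULT_NAMESPACE}'
```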

xpk.py (two resolved review threads)
xpk.py Outdated
f' --additional-node-network network={args.cluster}-net-4,subnetwork={args.cluster}-sub-4'
' --no-enable-autoupgrade --scopes="https://www.googleapis.com/auth/cloud-platform"'
)
else: # other gpu types
Collaborator:

We also have CPU types, which would fall into this else statement. Can you add an `elif system.accelerator_type == AcceleratorType['GPU']` case for GPUs (if needed)?

Or add a comment saying "other GPU and CPU types".
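For concreteness, a minimal sketch of the suggested branching (the `AcceleratorType` mapping mirrors how it is used elsewhere in xpk.py; the constant value and flag strings are placeholder assumptions):

```python
# Assumed enum-like mapping, mirroring AcceleratorType['GPU'] usage in xpk.py.
AcceleratorType = {'TPU': 1, 'GPU': 2, 'CPU': 3}
h100_device_type = 'h100-80gb-8'  # assumed value of the PR's constant


def extra_create_flags(accelerator_type: int, device_type: str) -> str:
  """Returns the create-command flags for the given accelerator."""
  if device_type == h100_device_type:
    # A3-specific flags, e.g. the --additional-node-network lines above.
    return ' --additional-node-network ...'
  elif accelerator_type == AcceleratorType['GPU']:
    # Other GPU types.
    return ' ...other-gpu-flags...'
  else:
    # CPU types take the generic path.
    return ''
```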

@@ -1598,6 +2073,51 @@ def enable_kueue_crds(args, system) -> int:
return 0


def get_kueue_covered_resources_config(args, cluster_hardware_name, resource_type, total_chips) -> str:
Collaborator:

Just to note, this might conflict with the Pathways changes (https://github.com/google/xpk/pull/74/files). Not sure which PR will go in first, but I am happy to work through the merge conflict.

I like this function, so hopefully we can use it in the Pathways PR. cc @RoshaniN

Collaborator (Author):

Looks like the Pathways PR is already merged. How should we proceed with any potential conflicts?

Collaborator:

I made a helper to add resources to the kueue config for Pathways; we can work on adding the kueue config conditionally:

If --enable-pathways, Pathways resources are added.
If GPU, GPU resources are added.
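A hypothetical sketch of that conditional composition (the helper and the resource strings are stand-ins for the real kueue config snippets):

```python
def build_covered_resources(enable_pathways: bool, is_gpu: bool) -> list:
  """Composes the kueue covered-resources config conditionally."""
  resources = []
  if enable_pathways:
    resources.append('pathways-cpu-resources')  # stand-in snippet
  if is_gpu:
    resources.append('gpu-resources')  # stand-in snippet
  return resources


# e.g. a GPU cluster without Pathways:
print(build_covered_resources(enable_pathways=False, is_gpu=True))
```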

xpk.py Outdated
debugging_dashboard_id = get_gke_debugging_dashboard(args)

device_type = args.tpu_type if args.tpu_type else args.device_type
if device_type == h100_device_type:
Collaborator:

nit: you can use `system.device_type == h100_device_type` here

xpk.py Outdated
command = f'kubectl delete -f {str(tmp.file.name)}'
device_type = args.tpu_type if args.tpu_type else args.device_type
if device_type == h100_device_type:
command = f'kubectl delete jobset {workload} -n default'
Collaborator:

Thanks Nina, I think we don't even need to check whether it is a TPU / CPU / GPU.

Can we use `kubectl delete jobset {workload} -n default` for all delete cases and drop the --tpu-type and --device-type required arguments from the workload_delete_parser?

This lets us avoid creating new required arguments and delete the no-longer-needed workload_delete_yaml (big yay).

@Obliviour Obliviour requested a review from RoshaniN March 20, 2024 17:53
xpk.py (resolved review thread)
@RoshaniN (Collaborator) left a comment:

Thank you for your changes, Nina!

Please pull in the latest changes; you may have to solve a few merge conflicts, but I am happy to help as needed.

Most of the Pathways code is behind the enable-pathways and use-pathways flags. I believe most of your changes are compatible.

xpk.py Outdated
if args.enable_pathways:
command += (' --enable-ip-alias ')
command += (f' --create-subnetwork name={args.cluster}-subnetwork')
command += ('--release-channel rapid --enable-autoscaling --location-policy=BALANCED'
Collaborator:

missing space before `--release-channel`, which is causing the build test to fail

README.md (resolved review thread)
README.md Outdated
Comment on lines 478 to 547
* Workload Delete (delete all training jobs in the cluster):

```shell
python3 xpk.py workload delete \
--cluster xpk-test
```

This will delete all the workloads in the `xpk-test` cluster. Deletion will only begin if you type `y` or `yes` at the prompt.

* Workload Delete supports filtering. Delete a portion of jobs that match user criteria.
* Filter by Job: `filter-by-job`

```shell
python3 xpk.py workload delete \
--cluster xpk-test --filter-by-job=$USER
```

This will delete all the workloads in the `xpk-test` cluster whose names start with `$USER`. Deletion will only begin if you type `y` or `yes` at the prompt.

* Filter by Status: `filter-by-status`

```shell
python3 xpk.py workload delete \
--cluster xpk-test --filter-by-status=QUEUED
```

This will delete all the workloads in the `xpk-test` cluster whose status is Admitted or Evicted and whose number of running VMs is 0. Deletion will only begin if you type `y` or `yes` at the prompt. Status can be: `EVERYTHING`, `FINISHED`, `RUNNING`, `QUEUED`, `FAILED`, `SUCCESSFUL`.

* Workload List (see training jobs):

```shell
python3 xpk.py workload list \
--cluster xpk-test
```

* Example Workload List Output:

The example below shows five jobs of different statuses:

* `user-first-job-failed`: **filter-status** is `FINISHED` and `FAILED`.
* `user-second-job-success`: **filter-status** is `FINISHED` and `SUCCESSFUL`.
* `user-third-job-running`: **filter-status** is `RUNNING`.
* `user-forth-job-in-queue`: **filter-status** is `QUEUED`.
* `user-fifth-job-in-queue-preempted`: **filter-status** is `QUEUED`.

```
Jobset Name Created Time Priority TPU VMs Needed TPU VMs Running/Ran TPU VMs Done Status Status Message Status Time
user-first-job-failed 2023-1-1T1:00:00Z medium 4 4 <none> Finished JobSet failed 2023-1-1T1:05:00Z
user-second-job-success 2023-1-1T1:10:00Z medium 4 4 4 Finished JobSet finished successfully 2023-1-1T1:14:00Z
user-third-job-running 2023-1-1T1:15:00Z medium 4 4 <none> Admitted Admitted by ClusterQueue cluster-queue 2023-1-1T1:16:00Z
user-forth-job-in-queue 2023-1-1T1:16:05Z medium 4 <none> <none> Admitted couldn't assign flavors to pod set slice-job: insufficient unused quota for google.com/tpu in flavor 2xv4-8, 4 more need 2023-1-1T1:16:10Z
user-fifth-job-preempted 2023-1-1T1:10:05Z low 4 <none> <none> Evicted Preempted to accommodate a higher priority Workload 2023-1-1T1:10:00Z
```

* Workload List supports filtering. Observe a portion of jobs that match user criteria.

* Filter by Status: `filter-by-status`

Filter the workload list by the status of the respective jobs.
Status can be: `EVERYTHING`, `FINISHED`, `RUNNING`, `QUEUED`, `FAILED`, `SUCCESSFUL`.

* Filter by Job: `filter-by-job`

Filter the workload list by the name of a job.

```shell
python3 xpk.py workload list \
--cluster xpk-test --filter-by-job=$USER
```

Collaborator:

These lines can be deleted. They are already in the README in earlier sections and are not GPU-specific.

xpk.py Outdated
@@ -1344,6 +1726,15 @@ def create_cluster_configmaps(args, system):
Returns:
0 if successful and 1 otherwise.
"""
device_type = args.tpu_type if args.tpu_type else args.device_type
if device_type == h100_device_type:
data = f'{device_type}: "{1 * int(args.num_nodes)}"'
Collaborator:

nit "1 * int(args.num_nodes)" can be int(args.num_nodes)

xpk.py Outdated
@@ -3072,10 +3678,18 @@ def workload_create(args) -> int:
container = get_main_and_sidecar_container(args, system, docker_image)
# Get GKE debugging dashboard only when sidecar container is deployed for TPU workloads
debugging_dashboard_id = get_gke_debugging_dashboard(args)
else:
elif system.accelerator_type == AcceleratorType['CPU']:
Collaborator:

We also want this to run in the TPU case.

xpk.py Outdated
Comment on lines 3732 to 3763
if args.use_pathways:
# Ensure the cluster and CPU nodepools were created with --enable-pathways
all_node_pools = get_all_nodepools_programmatic(args)
desired_pw_cpu_node_pools = {'cpu-user-np', 'cpu-rm-np', 'cpu-proxy-np'}
if not desired_pw_cpu_node_pools.issubset(set(all_node_pools[0])):
xpk_print(
'Cluster needs to be created with --enable-pathways to run Pathways workloads.'
)
xpk_exit(1)

# Ensure device type is TPUs - currently Pathways supports TPUs only.
if system.accelerator_type != AcceleratorType['TPU']:
xpk_print(
'Currently, Pathways workloads can only be run on TPUs.'
)
xpk_exit(1)

yml_string = pw_workload_create_yaml.format(args=args,
system=system,
container=container,
accelerator_label=create_accelerator_label(system.accelerator_type, system),
machine_label=create_machine_label(system.accelerator_type, system),
pathways_rm_args = get_pathways_rm_args(args),
pathways_worker_args = get_pathways_worker_args(args),
pathways_proxy_args = get_pathways_proxy_args(args),
resource_type=resource_type,
local_queue_name=_LOCAL_QUEUE_NAME)
tmp = write_temporary_file(yml_string)
command = f'kubectl apply -f {str(tmp.file.name)}'
return_code = run_command_with_updates(command, 'Creating Workload', args)
return_code = run_command_with_updates(command, 'Creating a Pathways Workload', args)


Collaborator:

This Pathways code (+3732 to +3763) is duplicated; it already exists above. We can delete it here.

@Obliviour (Collaborator) left a comment:

Thank you for this change; h100s in XPK is awesome!

@RoshaniN (Collaborator) commented:

Nina was able to check that Pathways functionality still works after these changes! Thank you for the change!

@NinaCai NinaCai requested a review from RoshaniN March 21, 2024 19:26
@RoshaniN (Collaborator) left a comment:

Thanks Nina for verifying Pathways functionality!

@NinaCai NinaCai merged commit c794fbe into main Mar 21, 2024
4 checks passed
@NinaCai NinaCai deleted the nina-xpk-gpu-h100 branch March 21, 2024 19:28