Nina xpk gpu h100 #87
Conversation
… for TPUs and GPUs.
xpk.py (Outdated)

```diff
@@ -851,7 +1077,8 @@ def add_env_config(args):
   Args:
     args: user provided arguments for running the command.
   """
-  env = {'JOBSET_NAME': args.workload}
+  device_type = args.tpu_type if args.tpu_type else args.device_type
+  env = {} if device_type == h100_device_type else {'JOBSET_NAME': args.workload}
```
Curious why the JOBSET_NAME variable isn't needed on h100s?
We set the JobSet name under the gpu_workload_create_yaml. Adding `{'JOBSET_NAME': args.workload}` won't lead to errors, but it is not necessary.
Sounds good, thanks Nina!
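To summarize the thread, the reviewed change can be sketched as a small helper. This is a sketch only: the value of `h100_device_type` is an assumption, and `base_env` is a hypothetical name, not xpk's actual API.

```python
# Sketch of the reviewed behavior. Assumption: h100_device_type resolves to
# the H100 device-type string; base_env is a hypothetical helper name.
h100_device_type = 'h100-80gb-8'

def base_env(workload: str, device_type: str) -> dict:
    # For H100s the JobSet name is set in gpu_workload_create_yaml, so
    # JOBSET_NAME is unnecessary (though harmless) in the env config.
    if device_type == h100_device_type:
        return {}
    return {'JOBSET_NAME': workload}
```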
xpk.py (Outdated)

```yaml
- "bash"
- "-c"
- |
  echo XPK Start: $(date) ; _sigterm() ( kill -SIGTERM $!;); trap _sigterm SIGTERM; (cd /deps && bash gpu_multi_process_run.sh) & PID=$!; while kill -0 $PID 2>/dev/null; do sleep 5; done; EXIT_CODE=$? ; echo XPK End: $(date); echo EXIT_CODE=$EXIT_CODE; echo Main app is done > /usr/share/maxtext/workload_terminated
```
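The one-liner above is dense; here is an equivalent, commented restructuring as a shell function. This is a sketch: in the actual container the command is `cd /deps && bash gpu_multi_process_run.sh` and the marker file is `/usr/share/maxtext/workload_terminated`; the function form and its parameters are added here only for illustration. Note that the original captures `$?` of the polling loop, while this sketch uses `wait` to recover the workload's real exit code.

```shell
#!/bin/bash
# run_workload CMD DONE_FILE: run CMD in the background, forward SIGTERM to it,
# poll until it exits, then write the "done" marker file for sidecars.
run_workload() {
  local cmd="$1" done_file="$2" pid exit_code
  echo "XPK Start: $(date)"
  # Forward SIGTERM (e.g. from Kubernetes) to the workload process.
  trap 'kill -SIGTERM "$pid" 2>/dev/null' TERM
  bash -c "$cmd" &
  pid=$!
  # kill -0 only checks that the process still exists; it sends no signal.
  while kill -0 "$pid" 2>/dev/null; do sleep 1; done
  # Recover the workload's real exit code from bash's remembered job status.
  wait "$pid"; exit_code=$?
  echo "XPK End: $(date)"
  echo "EXIT_CODE=$exit_code"
  # Signal sidecar containers that the main app finished.
  echo "Main app is done" > "$done_file"
  return "$exit_code"
}
```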
"echo Main app is done > /usr/share/maxtext/workload_terminated" seems MaxText-specific. Will this work on non-MaxText workloads?
It works for `--command "echo goodbye"`. Are there any other workloads you are interested in testing?
Why is there a MaxText-specific command run in the general xpk command? If it is only needed for MaxText, `/usr/share/maxtext/workload_terminated` shouldn't be in the general xpk command, right? Also, what is `/usr/share/maxtext/workload_terminated` for?
xpk.py (Outdated)

```python
workload_delete_yaml = """apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: {args.workload}
  annotations:
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool # 1:1 job replica to node pool assignment
    {annotation_config}
```
Hmm, thinking about the deletion process more: currently we only support deleting TPU jobs, sad. In order to support GPUs, we would also need to add a new required argument for the device type.
I am wondering if there is a way to avoid this new required argument, since it will break current user flows and shouldn't be needed in theory.
@danielvegamyhre do you have any thoughts here? We want to delete a jobset just from its name. Do we need the annotation?
I talked with @danielvegamyhre offline and he suggested that we delete the jobset by name using a CLI command or the Python SDK. Let's use the CLI command for now since that aligns with our other command usage.

From Daniel:

> You can delete a jobset by namespaced name; you'll just need to do it using a different kubectl command, or the python sdk: `kubectl delete jobset -n`

So the ask here is to move to `kubectl delete jobset {NAME} -n default`. That minimizes complexity in the workload delete code by a lot, yay.

The namespace we use is `default` (https://github.com/google/xpk/blob/main/xpk.py#L1548). Feel free to create a const var for the namespace if you'd like.
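The suggested simplification can be sketched as follows. The helper name is hypothetical, not xpk's API; xpk's actual delete flow builds a command string and runs it via its command runner.

```python
# Sketch of the suggested delete-by-name flow. JOBSET_NAMESPACE follows the
# review's suggestion of a namespace constant; the helper name is hypothetical.
JOBSET_NAMESPACE = 'default'

def workload_delete_command(workload: str, namespace: str = JOBSET_NAMESPACE) -> str:
    # Deleting by namespaced name works for TPU, CPU, and GPU jobsets alike,
    # so no device-type argument (and no workload_delete_yaml) is needed.
    return f'kubectl delete jobset {workload} -n {namespace}'
```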
xpk.py
Outdated
f' --additional-node-network network={args.cluster}-net-4,subnetwork={args.cluster}-sub-4' | ||
' --no-enable-autoupgrade --scopes="https://www.googleapis.com/auth/cloud-platform"' | ||
) | ||
else: # other gpu types |
We also have CPU types which would fall into this else statement. Can you add an `elif system.accelerator_type == AcceleratorType['GPU']` case for GPUs (if needed)? Or add a comment saying "other GPU and CPU types".
```diff
@@ -1598,6 +2073,51 @@ def enable_kueue_crds(args, system) -> int:
   return 0


+def get_kueue_covered_resources_config(args, cluster_hardware_name, resource_type, total_chips) -> str:
```
Just to note, this might conflict with the Pathways changes (https://github.com/google/xpk/pull/74/files). Not sure which PR will go in first, but I am happy to work through the merge conflict.
I like this function, so hopefully we can use it in the Pathways PR. cc @RoshaniN
Looks like the Pathways PR is merged already. How do we proceed here for any potential conflicts?
I made a helper to add resources to the kueue config for Pathways; we can work on adding the kueue config conditionally:
* If `--enable-pathways`, Pathways resources are added.
* If GPU, GPU resources are added.
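The conditional assembly described above might look roughly like this. This is illustrative only: the resource names and function shape are assumptions, not xpk's actual kueue config code.

```python
# Illustrative sketch of conditionally assembling kueue covered resources.
# Resource names here are assumptions for illustration, not xpk's constants.
def covered_resources(is_gpu: bool, enable_pathways: bool) -> list:
    resources = ['nvidia.com/gpu'] if is_gpu else ['google.com/tpu']
    if enable_pathways:
        # Pathways adds CPU node pools, so cover CPU and memory too.
        resources += ['cpu', 'memory']
    return resources
```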
xpk.py (Outdated)

```python
    debugging_dashboard_id = get_gke_debugging_dashboard(args)

  device_type = args.tpu_type if args.tpu_type else args.device_type
  if device_type == h100_device_type:
```
Nit: you can use `system.device_type == h100_device_type` here.
xpk.py (Outdated)

```python
  command = f'kubectl delete -f {str(tmp.file.name)}'
  device_type = args.tpu_type if args.tpu_type else args.device_type
  if device_type == h100_device_type:
    command = f'kubectl delete jobset {workload} -n default'
```
Thanks Nina! I think we don't even need to parse whether it is a TPU / CPU / GPU.
Can we use `kubectl delete jobset {workload} -n default` for all delete cases and delete the `--tpu-type` and `--device-type` required arguments from the workload_delete_parser?
This allows us to avoid creating new required arguments, and to delete the no-longer-needed workload_delete_yaml (big yay).
Thank you for your changes, Nina!
Please pull in the latest changes; you may have to solve a few merge conflicts, but I am happy to help as needed.
Most of the Pathways code is behind the enable-pathways and use-pathways flags. I believe most of your changes are compatible.
xpk.py (Outdated)

```python
  if args.enable_pathways:
    command += (' --enable-ip-alias ')
    command += (f' --create-subnetwork name={args.cluster}-subnetwork')
    command += ('--release-channel rapid --enable-autoscaling --location-policy=BALANCED'
```
There is a missing space before `--release-channel`, which is causing the build test to fail.
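The failure comes from Python's adjacent-string-literal concatenation, which joins literals with no separator, so a flag string that does not start with a space fuses with the previous one. A minimal demonstration (illustrative strings, not the real gcloud command):

```python
# Adjacent string literals are concatenated with no separator, so a flag
# string missing its leading space fuses with the previous flag.
broken = ('--enable-autoscaling'
          '--release-channel rapid')
fixed = ('--enable-autoscaling'
         ' --release-channel rapid')
assert broken == '--enable-autoscaling--release-channel rapid'
assert fixed == '--enable-autoscaling --release-channel rapid'
```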
README.md (Outdated)

* Workload Delete (delete all training jobs in the cluster):

  ```shell
  python3 xpk.py workload delete \
  --cluster xpk-test
  ```

  This will delete all the workloads in the `xpk-test` cluster. Deletion will only begin if you type `y` or `yes` at the prompt.

* Workload Delete supports filtering. Delete a portion of jobs that match user criteria.

  * Filter by Job: `filter-by-job`

    ```shell
    python3 xpk.py workload delete \
    --cluster xpk-test --filter-by-job=$USER
    ```

    This will delete all the workloads in the `xpk-test` cluster whose names start with `$USER`. Deletion will only begin if you type `y` or `yes` at the prompt.

  * Filter by Status: `filter-by-status`

    ```shell
    python3 xpk.py workload delete \
    --cluster xpk-test --filter-by-status=QUEUED
    ```

    This will delete all the workloads in the `xpk-test` cluster that have the status Admitted or Evicted, and where the number of running VMs is 0. Deletion will only begin if you type `y` or `yes` at the prompt. Status can be: `EVERYTHING`, `FINISHED`, `RUNNING`, `QUEUED`, `FAILED`, `SUCCESSFUL`.

* Workload List (see training jobs):

  ```shell
  python3 xpk.py workload list \
  --cluster xpk-test
  ```

* Example Workload List Output:

  The below example shows five jobs of different statuses:

  * `user-first-job-failed`: **filter-status** is `FINISHED` and `FAILED`.
  * `user-second-job-success`: **filter-status** is `FINISHED` and `SUCCESSFUL`.
  * `user-third-job-running`: **filter-status** is `RUNNING`.
  * `user-forth-job-in-queue`: **filter-status** is `QUEUED`.
  * `user-fifth-job-in-queue-preempted`: **filter-status** is `QUEUED`.

  ```
  Jobset Name               Created Time       Priority  TPU VMs Needed  TPU VMs Running/Ran  TPU VMs Done  Status    Status Message                                                                                                            Status Time
  user-first-job-failed     2023-1-1T1:00:00Z  medium    4               4                    <none>        Finished  JobSet failed                                                                                                             2023-1-1T1:05:00Z
  user-second-job-success   2023-1-1T1:10:00Z  medium    4               4                    4             Finished  JobSet finished successfully                                                                                              2023-1-1T1:14:00Z
  user-third-job-running    2023-1-1T1:15:00Z  medium    4               4                    <none>        Admitted  Admitted by ClusterQueue cluster-queue                                                                                    2023-1-1T1:16:00Z
  user-forth-job-in-queue   2023-1-1T1:16:05Z  medium    4               <none>               <none>        Admitted  couldn't assign flavors to pod set slice-job: insufficient unused quota for google.com/tpu in flavor 2xv4-8, 4 more need  2023-1-1T1:16:10Z
  user-fifth-job-preempted  2023-1-1T1:10:05Z  low       4               <none>               <none>        Evicted   Preempted to accommodate a higher priority Workload                                                                       2023-1-1T1:10:00Z
  ```

* Workload List supports filtering. Observe a portion of jobs that match user criteria.

  * Filter by Status: `filter-by-status`

    Filter the workload list by the status of the respective jobs.
    Status can be: `EVERYTHING`, `FINISHED`, `RUNNING`, `QUEUED`, `FAILED`, `SUCCESSFUL`.

  * Filter by Job: `filter-by-job`

    Filter the workload list by the name of a job.

    ```shell
    python3 xpk.py workload list \
    --cluster xpk-test --filter-by-job=$USER
    ```
These lines can be deleted. They are already in the README in earlier sections and are not GPU specific.
xpk.py (Outdated)

```diff
@@ -1344,6 +1726,15 @@ def create_cluster_configmaps(args, system):
   Returns:
     0 if successful and 1 otherwise.
   """
+  device_type = args.tpu_type if args.tpu_type else args.device_type
+  if device_type == h100_device_type:
+    data = f'{device_type}: "{1 * int(args.num_nodes)}"'
```
Nit: `1 * int(args.num_nodes)` can be `int(args.num_nodes)`.
xpk.py (Outdated)

```diff
@@ -3072,10 +3678,18 @@ def workload_create(args) -> int:
     container = get_main_and_sidecar_container(args, system, docker_image)
     # Get GKE debugging dashboard only when sidecar container is deployed for TPU workloads
     debugging_dashboard_id = get_gke_debugging_dashboard(args)
-  else:
+  elif system.accelerator_type == AcceleratorType['CPU']:
```
We also want this to run in the TPU case.
xpk.py (Outdated)

```python
  if args.use_pathways:
    # Ensure the cluster and CPU nodepools were created with --enable-pathways
    all_node_pools = get_all_nodepools_programmatic(args)
    desired_pw_cpu_node_pools = {'cpu-user-np', 'cpu-rm-np', 'cpu-proxy-np'}
    if not desired_pw_cpu_node_pools.issubset(set(all_node_pools[0])):
      xpk_print(
          'Cluster needs to be created with --enable-pathways to run Pathways workloads.'
      )
      xpk_exit(1)

    # Ensure device type is TPUs - currently Pathways supports TPUs only.
    if system.accelerator_type != AcceleratorType['TPU']:
      xpk_print(
          'Currently, Pathways workloads can only be run on TPUs.'
      )
      xpk_exit(1)

    yml_string = pw_workload_create_yaml.format(
        args=args,
        system=system,
        container=container,
        accelerator_label=create_accelerator_label(system.accelerator_type, system),
        machine_label=create_machine_label(system.accelerator_type, system),
        pathways_rm_args=get_pathways_rm_args(args),
        pathways_worker_args=get_pathways_worker_args(args),
        pathways_proxy_args=get_pathways_proxy_args(args),
        resource_type=resource_type,
        local_queue_name=_LOCAL_QUEUE_NAME)
    tmp = write_temporary_file(yml_string)
    command = f'kubectl apply -f {str(tmp.file.name)}'
    return_code = run_command_with_updates(command, 'Creating Workload', args)
    return_code = run_command_with_updates(command, 'Creating a Pathways Workload', args)
```
This Pathways code (+3732 to +3763) is duplicated. Already exists above. We can delete it here.
Thank you for this change, h100s in XPK is awesome!

Nina was able to check that Pathways functionality still works after these changes! Thank you for the change!
Thanks Nina for verifying Pathways functionality!
Fixes / Features

Testing / Documentation
* Ran `xpk cluster create` successfully in integ tests