Adding instructions to README for Pathways - XPK.
RoshaniN committed Mar 8, 2024
1 parent ac8e5da commit 95d66ce
Showing 2 changed files with 44 additions and 4 deletions.
45 changes: 43 additions & 2 deletions README.md
@@ -124,6 +124,16 @@ all zones.
--num-slices=4 --spot
```

* Cluster Create for Pathways:
A Pathways-compatible cluster can be created by passing `--enable-pathways`:
```shell
python3 xpk.py cluster create \
--cluster xpk-pw-test \
--num-slices=4 --on-demand \
--tpu-type=v5litepod-16 \
--enable-pathways
```

* Cluster Create can be called again with the same `--cluster name` to modify
the number of slices or retry failed steps.

@@ -195,8 +205,39 @@ all zones.

```shell
python3 xpk.py workload create \
--workload xpk-test-workload --command "echo goodbye" --cluster \
xpk-test --tpu-type=v5litepod-16
--workload xpk-test-workload --command "echo goodbye" \
--cluster xpk-test \
--tpu-type=v5litepod-16
```

* Workload Create for Pathways:
A Pathways workload can be submitted with `--use-pathways` on a Pathways-enabled cluster (one created with `--enable-pathways`).

Pathways workload example:
```shell
python3 xpk.py workload create \
--workload xpk-pw-test \
--num-slices=1 \
--tpu-type=v5litepod-16 \
--use-pathways \
--cluster xpk-pw-test \
--docker-name='user-workload' \
--docker-image=<maxtext docker image> \
--command='bash /usr/pathways/ifrt/maxtext_entrypoint.sh base_output_directory=<output directory> dataset_path=<dataset path> per_device_batch_size=1 enable_checkpointing=false enable_profiler=false remat_policy=full global_parameter_scale=4 steps=300 max_target_length=2048 use_iota_embed=true reuse_example_batch=1 dataset_type=synthetic attention=flash gcs_metrics=True run_name=$(USER)-pw-xpk-test-1'
```

A regular workload can also be submitted on a Pathways-enabled cluster (one created with `--enable-pathways`).

Regular workload example:
```shell
python3 xpk.py workload create \
--workload xpk-regular-test \
--num-slices=1 \
--tpu-type=v5litepod-16 \
--cluster xpk-pw-test \
--docker-name='user-workload' \
--docker-image=<maxtext docker image> \
--command='python3 MaxText/train.py MaxText/configs/base.yml base_output_directory=<output directory> dataset_path=<dataset path> per_device_batch_size=1 enable_checkpointing=false enable_profiler=false remat_policy=full global_parameter_scale=4 steps=300 max_target_length=2048 use_iota_embed=true reuse_example_batch=1 dataset_type=synthetic attention=flash gcs_metrics=True run_name=$(USER)-pw-xpk-test-1'
```
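
The two workload examples above differ only in the `--use-pathways` flag and in the `--command` payload. A minimal sketch of that difference follows, assuming a hypothetical helper `workload_create_args` that is not part of xpk; it only assembles flags that appear in the examples above, and the shortened `steps=300` commands are placeholders for the full MaxText argument lists.

```python
import shlex


def workload_create_args(workload, cluster, tpu_type, command,
                         num_slices=1, docker_image=None, use_pathways=False):
    """Build an argv list for `xpk.py workload create`.

    Hypothetical helper for illustration only; it uses just the flags
    shown in the README examples above.
    """
    args = [
        "python3", "xpk.py", "workload", "create",
        "--workload", workload,
        "--cluster", cluster,
        f"--tpu-type={tpu_type}",
        f"--num-slices={num_slices}",
    ]
    if use_pathways:
        # Only meaningful on a cluster created with --enable-pathways.
        args.append("--use-pathways")
    if docker_image is not None:
        args.append(f"--docker-image={docker_image}")
    args.append(f"--command={command}")
    return args


# Pathways workload: same cluster, Pathways entrypoint, --use-pathways set.
pathways = workload_create_args(
    "xpk-pw-test", "xpk-pw-test", "v5litepod-16",
    command="bash /usr/pathways/ifrt/maxtext_entrypoint.sh steps=300",
    use_pathways=True,
)

# Regular workload on the same Pathways-enabled cluster: no --use-pathways.
regular = workload_create_args(
    "xpk-regular-test", "xpk-pw-test", "v5litepod-16",
    command="python3 MaxText/train.py MaxText/configs/base.yml steps=300",
)

# shlex.join quotes the --command value so the line can be pasted into a shell.
print(shlex.join(pathways))
print(shlex.join(regular))
```

Returning a list rather than a single string keeps the `--command` value as one argument, so it can be passed to `subprocess.run` without extra quoting.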

### Set `max-restarts` for production jobs
3 changes: 1 addition & 2 deletions xpk.py
@@ -1326,7 +1326,7 @@ def run_gke_cluster_create_command(args) -> int:
f' --project={args.project} --region={zone_to_region(args.zone)}'
f' --cluster-version={args.gke_version} --location-policy=BALANCED'
f' --machine-type={machine_type}'
' --scopes=storage-full,gke-default'
' --scopes=storage-full,gke-default'
f' {args.custom_cluster_arguments}'
)

@@ -1805,7 +1805,6 @@ def enable_kueue_crds(args, system) -> int:
cluster_queue_name=_CLUSTER_QUEUE_NAME,
local_queue_name=_LOCAL_QUEUE_NAME,
)
print(yml_string)

tmp = write_temporary_file(yml_string)
command = f'kubectl apply -f {str(tmp.file.name)}'