Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure proxy and server images are only provided with --use-pathways. #100

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
82 changes: 43 additions & 39 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -125,15 +125,6 @@ all zones.
--num-slices=4 --spot
```

* Cluster Create for Pathways:
Pathways compatible cluster can be created using `--enable-pathways`
```shell
python3 xpk.py cluster create \
--cluster xpk-pw-test \
--num-slices=4 --on-demand \
--tpu-type=v5litepod-16 \
--enable-pathways
```

* Cluster Create can be called again with the same `--cluster name` to modify
the number of slices or retry failed steps.
Expand Down Expand Up @@ -211,36 +202,6 @@ all zones.
--tpu-type=v5litepod-16
```

* Workload Create for Pathways:
Pathways workload can be submitted using `--use-pathways` on a Pathways enabled cluster (created with `--enable-pathways`)

Pathways workload example:
```shell
python3 xpk.py workload create \
--workload xpk-pw-test \
--num-slices=1 \
--tpu-type=v5litepod-16 \
--use-pathways \
--cluster xpk-pw-test \
--docker-name='user-workload' \
--docker-image=<maxtext docker image> \
--command='bash /usr/pathways/ifrt/maxtext_entrypoint.sh base_output_directory=<output directory> dataset_path=<dataset path> per_device_batch_size=1 enable_checkpointing=false enable_profiler=false remat_policy=full global_parameter_scale=4 steps=300 max_target_length=2048 use_iota_embed=true reuse_example_batch=1 dataset_type=synthetic attention=flash gcs_metrics=True run_name=$(USER)-pw-xpk-test-1'
```

Regular workload can also be submitted on a Pathways enabled cluster (created with `--enable-pathways`)

Pathways workload example:
```shell
python3 xpk.py workload create \
--workload xpk-regular-test \
--num-slices=1 \
--tpu-type=v5litepod-16 \
--cluster xpk-pw-test \
--docker-name='user-workload' \
--docker-image=<maxtext docker image> \
--command='python3 MaxText/train.py MaxText/configs/base.yml base_output_directory=<output directory> dataset_path=<dataset path> per_device_batch_size=1 enable_checkpointing=false enable_profiler=false remat_policy=full global_parameter_scale=4 steps=300 max_target_length=2048 use_iota_embed=true reuse_example_batch=1 dataset_type=synthetic attention=flash gcs_metrics=True run_name=$(USER)-pw-xpk-test-1'
```

### Set `max-restarts` for production jobs

* `--max-restarts <value>`: By default, this is 0. This will restart the job ""
Expand Down Expand Up @@ -354,6 +315,49 @@ checkpointing so the job restarts near where it was interrupted.
python3 xpk.py workload list \
--cluster xpk-test --filter-by-job=$USER
```
## Pathways on XPK

* Cluster Create for Pathways:
Pathways compatible cluster can be created using `--enable-pathways`
```shell
python3 xpk.py cluster create \
--cluster xpk-pw-test \
--num-slices=4 --on-demand \
--tpu-type=v5litepod-16 \
--enable-pathways
```

* Workload Create for Pathways:
Pathways workload can be submitted using `--use-pathways` on a Pathways enabled cluster (created with `--enable-pathways`)

Pathways workload example:
```shell
python3 xpk.py workload create \
--workload xpk-pw-test \
--num-slices=1 \
--tpu-type=v5litepod-16 \
--cluster xpk-pw-test \
--use-pathways \
--server-image=<Pathways server image> \
--proxy-server-image=<Pathways proxy server image> \
--docker-name='user-workload' \
--docker-image=<maxtext docker image> \
Comment on lines +343 to +344
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--base-docker-image flow is more recommended for xpk users so that local changes can be iterated on. Is that possible for pathways?

Copy link
Collaborator Author

@RoshaniN RoshaniN Mar 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, the docker-image provided by the users is a fixed docker image, but if the others deps (mentioned above) are merged, I can update docker-image to base-docker-image! I would also prefer that!

--command='bash /usr/pathways/ifrt/maxtext_entrypoint.sh base_output_directory=<output directory> dataset_path=<dataset path> per_device_batch_size=1 enable_checkpointing=false enable_profiler=false remat_policy=full global_parameter_scale=4 steps=300 max_target_length=2048 use_iota_embed=true reuse_example_batch=1 dataset_type=synthetic attention=flash gcs_metrics=True run_name=$(USER)-pw-xpk-test-1'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious if you have a simpler example. This seems like a complicated README case example to maintain.

Like in the non-pathways workload command example we just print hello world. Is that possible here too?

Copy link
Collaborator Author

@RoshaniN RoshaniN Mar 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the question, Victor! Let me check if other dependencies are up-to-date. If so, I can simplify it to more like "python3 Maxtext...", however, we really want the workload to actually train!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

however, we really want the workload to actually train!

Agreed this is the ultimately goal but we should go one step at a time. How about we have both a simple workload and a MaxText example?

I want users to be able to verify they have xpk + system working before trying to train.

```

Regular workload can also be submitted (by omitting `--use-pathways`) on a Pathways enabled cluster (i.e a cluster created with `--enable-pathways`)

Pathways workload example:
```shell
python3 xpk.py workload create \
--workload xpk-regular-test \
--num-slices=1 \
--tpu-type=v5litepod-16 \
--cluster xpk-pw-test \
--docker-name='user-workload' \
--docker-image=<maxtext docker image> \
--command='python3 MaxText/train.py MaxText/configs/base.yml base_output_directory=<output directory> dataset_path=<dataset path> per_device_batch_size=1 enable_checkpointing=false enable_profiler=false remat_policy=full global_parameter_scale=4 steps=300 max_target_length=2048 use_iota_embed=true reuse_example_batch=1 dataset_type=synthetic attention=flash gcs_metrics=True run_name=$(USER)-pw-xpk-test-1'
```

## Inspector
* Inspector provides debug info to understand cluster health, and why workloads are not running.
Expand Down
45 changes: 40 additions & 5 deletions xpk.py
Original file line number Diff line number Diff line change
Expand Up @@ -3200,6 +3200,35 @@ def setup_docker_image(args) -> tuple[int, str]:

return 0, docker_image

def check_use_pathways(args) -> bool:
"""
Both --proxy-server-image and --server-image need to be set if --use-pathways is set.
Neither can be provided if --use-pathways is not provided.
Args:
args: user provided arguments for running the command.

Returns:
bool: Whether the expected images are provided with --use-pathways.
"""
return args.use_pathways == (args.proxy_server_image is not None and args.server_image is not None)
Obliviour marked this conversation as resolved.
Show resolved Hide resolved

def validate_pathways_docker_images(args) -> bool:
"""Validates the existence of Pathways server and proxy docker images in the project.

Args:
args: user provided arguments for running the command.

Returns:
bool: whether the Pathways proxy and server images are valid.
"""
proxy_server_return_code = validate_docker_image(args.proxy_server_image, args)
server_return_code = validate_docker_image(args.server_image, args)
if proxy_server_return_code > 0 or server_return_code > 0:
return False
else:
return True


def get_main_and_sidecar_container(args, system, docker_image) -> str:
"""Generate yaml for main and sidecar container.
Args:
Expand Down Expand Up @@ -3379,7 +3408,6 @@ def get_pathways_proxy_args(args) -> str:
- --pathways_ifrt_proxy_server_resource_manager={args.workload}-rm-0-0.{args.workload}:38677
- --pathways_ifrt_proxy_server_port=38676
- --pathways_tmp_dir_pattern={args.pathways_gcs_location}
- --pathways_xprof_trace_enable_bulk_upload=true
- --pathways_plaque_network=gcp"""
if args.use_pathways:
return yaml.format(args=args)
Expand Down Expand Up @@ -3721,6 +3749,15 @@ def workload_create(args) -> int:
if setup_docker_image_code != 0:
xpk_exit(setup_docker_image_code)

if not check_use_pathways(args):
xpk_print('--proxy-server-image and --server-image need to be provided',
'with --use-pathways.')
xpk_exit(1)

if args.use_pathways and not validate_pathways_docker_images(args):
xpk_print('Please check the proxy-server-image or server-image as advised!')
xpk_exit(1)

add_env_config(args)

debugging_dashboard_id = None
Expand Down Expand Up @@ -4857,17 +4894,15 @@ def directory_path_type(value):
workload_pathways_workload_arguments.add_argument(
'--proxy-server-image',
type=str,
default='gcr.io/cloud-tpu-v2-images/pathways/pathways-demo:proxy_server',
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lmk if you do want to keep the ability to have a default and we can think through how this would look!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it is super important, as folks would need to replace it during iteration!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

during iteration

Can you go into this more? will customers need to replace this often? curious if we should make it simpler if so.

help=(
'Please provide the proxy server image for Pathways.'
'Please provide the proxy server image for Pathways. This argument needs to be used with --use-pathways.'
),
)
workload_pathways_workload_arguments.add_argument(
'--server-image',
type=str,
default='gcr.io/cloud-tpu-v2-images/pathways/pathways-demo:server',
help=(
'Please provide the server image for Pathways.'
'Please provide the server image for Pathways. This argument needs to be used with --use-pathways.'
),
)
workload_pathways_workload_arguments.add_argument(
Expand Down
Loading