[UX] remove all uses of deprecated sky spot #4173

Status: Merged (3 commits, Oct 26, 2024). Showing changes from 1 commit.

docs/source/examples/managed-jobs.rst (2 changes: 1 addition & 1 deletion)

@@ -93,7 +93,7 @@ We can launch it with the following:
 setup: |
 # Fill in your wandb key: copy from https://wandb.ai/authorize
 # Alternatively, you can use `--env WANDB_API_KEY=$WANDB_API_KEY`
-# to pass the key in the command line, during `sky spot launch`.
+# to pass the key in the command line, during `sky jobs launch`.
 echo export WANDB_API_KEY=[YOUR-WANDB-API-KEY] >> ~/.bashrc
 
 pip install -e .

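The snippet above notes that the key can also be passed at launch time rather than written into `setup`. A minimal sketch of that alternative, using a placeholder job name and task YAML that are not part of this PR:

```bash
# Copy the key from https://wandb.ai/authorize, then forward it to the managed job.
export WANDB_API_KEY=<your-wandb-api-key>
sky jobs launch -n my-job task.yaml --env WANDB_API_KEY=$WANDB_API_KEY
```
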
docs/source/reference/faq.rst (2 changes: 1 addition & 1 deletion)

@@ -38,7 +38,7 @@ How to ensure my workdir's ``.git`` is synced up for managed spot jobs?
 Currently, there is a difference in whether ``.git`` is synced up depending on the command used:
 
 - For regular ``sky launch``, the workdir's ``.git`` is synced up by default.
-- For managed spot jobs ``sky spot launch``, the workdir's ``.git`` is excluded by default.
+- For managed spot jobs ``sky jobs launch``, the workdir's ``.git`` is excluded by default.
 
 In the second case, to ensure the workdir's ``.git`` is synced up for managed spot jobs, you can explicitly add a file mount to sync it up:

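The file-mount example that follows this hunk is collapsed in the diff view. For context, such an explicit mount is typically written along these lines; the source and destination paths here are illustrative assumptions, not text from the PR:

```yaml
# Hypothetical sketch: explicitly sync the local repo's .git into the remote workdir.
file_mounts:
  ~/sky_workdir/.git: ~/your-repo/.git
```
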
examples/managed_job_with_storage.yaml (2 changes: 1 addition & 1 deletion)

@@ -3,7 +3,7 @@
 # Runs a task that uses cloud buckets for uploading and accessing files.
 #
 # Usage:
-# sky spot launch -c spot-storage examples/managed_job_with_storage.yaml
+# sky jobs launch -c spot-storage examples/managed_job_with_storage.yaml
 # sky down spot-storage
 
 resources:

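The body of the YAML is collapsed beyond `resources:`. For readers unfamiliar with the example, a task that writes results to a cloud bucket generally pairs `resources` with a bucket-backed `file_mounts` entry roughly like the following; the bucket name and mount path are illustrative assumptions, not the file's actual contents:

```yaml
# Hypothetical sketch of a bucket-backed output directory for a managed job.
file_mounts:
  /outputs:
    name: my-sky-outputs-bucket   # created by SkyPilot if it does not already exist
    mode: MOUNT

run: |
  echo "hello from the job" > /outputs/result.txt
```
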
llm/axolotl/axolotl-spot.yaml (2 changes: 1 addition & 1 deletion)

@@ -4,7 +4,7 @@
 # HF_TOKEN=abc BUCKET=<unique-name> sky launch -c axolotl-spot axolotl-spot.yaml --env HF_TOKEN --env BUCKET -i30 --down
 #
 # Managed spot (auto-recovery; for full runs):
-# HF_TOKEN=abc BUCKET=<unique-name> sky spot launch -n axolotl-spot axolotl-spot.yaml --env HF_TOKEN --env BUCKET
+# HF_TOKEN=abc BUCKET=<unique-name> sky jobs launch -n axolotl-spot axolotl-spot.yaml --env HF_TOKEN --env BUCKET
 
 name: axolotl

llm/axolotl/readme.md (2 changes: 1 addition & 1 deletion)

@@ -22,5 +22,5 @@ ssh -L 8888:localhost:8888 axolotl-spot
 
 Launch managed spot instances (auto-recovery; for full runs):
 ```
-HF_TOKEN=abc BUCKET=<unique-name> sky spot launch -n axolotl-spot axolotl-spot.yaml --env HF_TOKEN --env BUCKET
+HF_TOKEN=abc BUCKET=<unique-name> sky jobs launch -n axolotl-spot axolotl-spot.yaml --env HF_TOKEN --env BUCKET
 ```
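
Once a managed job like the one above is launched, it can be tracked with the managed-jobs CLI; a brief sketch, where the job ID is whatever the queue reports:

```bash
# List managed jobs, then stream logs for the job ID shown in the queue.
sky jobs queue
sky jobs logs <job-id>
```
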
llm/falcon/README.md (12 changes: 6 additions & 6 deletions; several changed line pairs appear to differ only in trailing whitespace)

@@ -1,6 +1,6 @@
 # Finetuning Falcon with SkyPilot
 
-This README contains instructions on how to use SkyPilot to finetune Falcon-7B and Falcon-40B, an open-source LLM that rivals many current closed-source models, including ChatGPT.
+This README contains instructions on how to use SkyPilot to finetune Falcon-7B and Falcon-40B, an open-source LLM that rivals many current closed-source models, including ChatGPT.
 
 * [Blog post](https://huggingface.co/blog/falcon)
 * [Repo](https://huggingface.co/tiiuae/falcon-40b)

@@ -16,10 +16,10 @@ sky check
 See the Falcon SkyPilot YAML for [training](train.yaml). Serving is currently a work in progress and a YAML will be provided for that soon! We are also working on adding an evaluation step to evaluate the model you finetuned compared to the base model.
 
 ## Running Falcon on SkyPilot
-Finetuning `Falcon-7B` and `Falcon-40B` require GPUs with 80GB memory,
+Finetuning `Falcon-7B` and `Falcon-40B` require GPUs with 80GB memory,
 but `Falcon-7b-sharded` requires only 40GB memory. Thus,
 * If your GPU has 40 GB memory or less (e.g., Nvidia A100): use `ybelkada/falcon-7b-sharded-bf16`.
-* If your GPU has 80 GB memory (e.g., Nvidia A100-80GB): you can also use `tiiuae/falcon-7b` and `tiiuae/falcon-40b`.
+* If your GPU has 80 GB memory (e.g., Nvidia A100-80GB): you can also use `tiiuae/falcon-7b` and `tiiuae/falcon-40b`.
 
 Try `sky show-gpus --all` for supported GPUs.

@@ -32,13 +32,13 @@ Steps for training on your cloud(s):
 1. In [train.yaml](train.yaml), set the following variables in `envs`:
 
 - Replace the `OUTPUT_BUCKET_NAME` with a unique name. SkyPilot will create this bucket for you to store the model weights.
-- Replace the `WANDB_API_KEY` to your own key.
-- Replace the `MODEL_NAME` with your desired base model.
+- Replace the `WANDB_API_KEY` to your own key.
+- Replace the `MODEL_NAME` with your desired base model.
 
 2. **Training the Falcon model using spot instances**:
 
 ```bash
-sky spot launch -n falcon falcon.yaml
+sky jobs launch --use-spot -n falcon falcon.yaml
 ```
 
 Currently, such `A100-80GB:1` spot instances are only available on AWS and GCP.

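One detail worth noting about the hunk above: unlike the deprecated `sky spot launch`, `sky jobs launch` does not assume spot instances, which is presumably why this command gains an explicit `--use-spot`. A sketch of the two variants, reusing the names from the README:

```bash
# Managed job on spot instances (auto-recovers from preemptions).
sky jobs launch --use-spot -n falcon falcon.yaml

# Same managed job on on-demand instances: no preemptions, typically higher cost.
sky jobs launch -n falcon falcon.yaml
```
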
llm/vicuna-llama-2/README.md (2 changes: 1 addition & 1 deletion)

@@ -122,7 +122,7 @@ sky launch --no-use-spot ...
 
 [SkyPilot Managed Spot](https://skypilot.readthedocs.io/en/latest/examples/spot-jobs.html) is a library built on top of SkyPilot that helps users run jobs on spot instances without worrying about interruptions. That is the tool used by the LMSYS organization to train the first version of Vicuna (more details can be found in their [launch blog post](https://lmsys.org/blog/2023-03-30-vicuna/) and [example](https://github.com/skypilot-org/skypilot/tree/master/llm/vicuna)). With this, the training cost can be reduced from $1000 to **\$300**.

Review comment (Collaborator): Might be good to quickly update this too :)
Reply (Collaborator, PR author): Good catch!

-To use SkyPilot Managed Spot, you can simply replace `sky launch` with `sky spot launch` in the above command:
+To use SkyPilot Managed Spot, you can simply replace `sky launch` with `sky jobs launch` in the above command:
 
 ```bash
 sky spot launch -n vicuna train.yaml \

llm/vicuna/README.md (4 changes: 2 additions & 2 deletions)

@@ -63,14 +63,14 @@ Steps for training on your cloud(s):
 2. **Training the Vicuna-7B model on 8 A100 GPUs (80GB memory) using spot instances**:
 ```bash
 # Launch it on managed spot to save 3x cost
-sky spot launch -n vicuna train.yaml
+sky jobs launch -n vicuna train.yaml
 ```
 Note: if you would like to see the training curve on W&B, you can add `--env WANDB_API_KEY` to the above command, which will propagate your local W&B API key in the environment variable to the job.
 
 [Optional] Train a larger 13B model
 ```
 # Train a 13B model instead of the default 7B
-sky spot launch -n vicuna-7b train.yaml --env MODEL_SIZE=13
+sky jobs launch -n vicuna-7b train.yaml --env MODEL_SIZE=13
 
 # Use *unmanaged* spot instances (i.e., preemptions won't get auto-recovered).
 # Unmanaged spot provides a better interactive development experience but is vulnerable to spot preemptions.

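The collapsed lines just below mention unmanaged spot as an alternative. For contrast, that path goes through a plain `sky launch` cluster rather than the managed-jobs controller; a minimal sketch with a hypothetical cluster name:

```bash
# Unmanaged spot: an ordinary cluster on spot instances; preemptions are NOT auto-recovered.
sky launch --use-spot -c vicuna-dev train.yaml
```
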
tests/backward_compatibility_tests.sh (4 changes: 2 additions & 2 deletions)

@@ -167,8 +167,8 @@ MANAGED_JOB_JOB_NAME=${CLUSTER_NAME}-${uuid:0:4}
 if [ "$start_from" -le 7 ]; then
 conda activate sky-back-compat-master
 rm -r ~/.sky/wheels || true
-sky spot launch -d --cloud ${CLOUD} -y --cpus 2 --num-nodes 2 -n ${MANAGED_JOB_JOB_NAME}-7-0 "echo hi; sleep 1000"
-sky spot launch -d --cloud ${CLOUD} -y --cpus 2 --num-nodes 2 -n ${MANAGED_JOB_JOB_NAME}-7-1 "echo hi; sleep 400"
+sky jobs launch -d --cloud ${CLOUD} -y --cpus 2 --num-nodes 2 -n ${MANAGED_JOB_JOB_NAME}-7-0 "echo hi; sleep 1000"
+sky jobs launch -d --cloud ${CLOUD} -y --cpus 2 --num-nodes 2 -n ${MANAGED_JOB_JOB_NAME}-7-1 "echo hi; sleep 400"
 conda activate sky-back-compat-current
 rm -r ~/.sky/wheels || true
 s=$(sky jobs queue | grep ${MANAGED_JOB_JOB_NAME}-7 | grep "RUNNING" | wc -l)