Merge pull request #126 from DataBiosphere/dev

PR for 0.2.1 release

wnojopra authored Oct 4, 2018
2 parents 4bac0ae + 85f14ff commit c968683

Showing 39 changed files with 913 additions and 272 deletions.
133 changes: 89 additions & 44 deletions README.md
@@ -24,7 +24,6 @@ Note: `dsub` was written for Python 2.7 and production users of `dsub`
should continue using Python 2.7. As of `dsub` v0.2.0, we have enabled
experimental support of Python 3.5+.


### Pre-installation steps

1. This is optional, but whether installing from PyPI or from github,
@@ -123,25 +122,26 @@ charges using it.

The dsub logs and output files will be written to a bucket. Create a
bucket using the [storage browser](https://cloud.google.com/storage/browser?project=)
or run the command-line utility [gsutil](https://cloud.google.com/storage/docs/gsutil),
included in the Cloud SDK.

gsutil mb gs://my-bucket

Change `my-bucket` to a unique name that follows the
[bucket-naming conventions](https://cloud.google.com/storage/docs/bucket-naming).

(By default, the bucket will be in the US, but you can change or
refine the [location](https://cloud.google.com/storage/docs/bucket-locations)
setting with the `-l` option.)

1. Run a dsub job and wait for completion.

Here is a very simple "Hello World" test:

dsub \
--provider google-v2 \
--project my-cloud-project \
--zones "us-central1-*" \
--regions us-central1 \
--logging gs://my-bucket/logging/ \
--output OUT=gs://my-bucket/output/out.txt \
--command 'echo "Hello World" > "${OUT}"' \
@@ -157,27 +157,51 @@

gsutil cat gs://my-bucket/output/out.txt

## Backend providers

Where possible, `dsub` supports developing and testing locally (for faster
iteration) and then running at scale.

To this end, `dsub` provides multiple "backend providers", each of which
implements a consistent runtime environment. The current providers are:

- local
- google (the default, but deprecated)
- google-v2

More details on the runtime environment implemented by the backend providers
can be found in [dsub backend providers](./docs/providers/README.md).
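
For example, the Hello World job above can be iterated on locally with the
`local` provider before moving to the cloud. A minimal sketch (assuming Docker
is installed, and using local paths for logging and output):

    dsub \
      --provider local \
      --logging /tmp/dsub-test/logging/ \
      --output OUT=/tmp/dsub-test/output/out.txt \
      --command 'echo "Hello World" > "${OUT}"' \
      --wait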

### Deprecation of the `google` provider

The original `dsub` provider was the `google` provider, built on top of the
Google Genomics Pipelines API `v1alpha2`. The Pipelines API `v1alpha2` has
been deprecated and will be turned down at the end of 2018.
For more details, see the
[Cloud Genomics v1alpha2 Migration Guide](https://cloud.google.com/genomics/docs/how-tos/migration).

The replacement for `v1alpha2` is [v2alpha1](https://cloud.google.com/genomics/reference/rest/v2alpha1/pipelines/run).
`dsub` has added the `google-v2` provider, which uses `v2alpha1` as the backend
for running `dsub` jobs on Google Cloud.

**`dsub` users are encouraged today to use the `google-v2` provider. At the
end of 2018, the Pipelines API `v1alpha2` will be turned down and the `google`
provider for `dsub` will be removed.**

### Migrating existing code from `google` to `google-v2`

To migrate existing `dsub` calls from the `google` provider to the `google-v2`
provider:

- Add `--provider google-v2` to your command-line
- Use `--machine-type` (default is `n1-standard-1`) instead of `--min-cpu`
  and `--min-ram`.

The `--machine-type` value can be one of the
[Predefined Machine Types](https://cloud.google.com/compute/docs/machine-types#predefined_machine_types)
or a
[Custom Machine Type](https://cloud.google.com/compute/docs/machine-types#custom_machine_types).
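
For example, a sketch using the predefined `n1-highmem-2` type (or, assuming
custom machine type support, a hypothetical `custom-1-4096` for 1 vCPU and
4 GB of memory):

    dsub \
      --provider google-v2 \
      ... \
      --machine-type n1-highmem-2 \
      --script my-script.sh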

The `google` provider supports `--min-cpu` and `--min-ram`. A plan to support
these flags for `google-v2` is being evaluated.
See [google-v2 support](https://github.com/DataBiosphere/dsub/issues/114).

## `dsub` features

The following sections show how to run more complex jobs.
@@ -190,9 +214,7 @@ hello example above.
You can also save your script to a file, like `hello.sh`. Then you can run:

dsub \
... \
--script hello.sh
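
A hypothetical `hello.sh` might simply write a message to the output path set
up by an `--output OUT=...` flag, as in the Hello World example above:

    #!/bin/bash

    echo "Hello World" > "${OUT}"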

If your script has dependencies that are not stored in your Docker image,
Expand All @@ -205,9 +227,7 @@ By default, dsub uses a stock Ubuntu image. You can change the image
by passing the `--image` flag.

dsub \
... \
--image ubuntu:16.04 \
--script hello.sh

@@ -216,9 +236,7 @@ by passing the `--image` flag.
You can pass environment variables to your script using the `--env` flag.

dsub \
... \
--env MESSAGE=hello \
--command 'echo ${MESSAGE}'
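
The `--env` flag can be repeated to pass multiple variables. A sketch with
illustrative variable names:

    dsub \
      ... \
      --env SAMPLE_ID=sample1 \
      --env REFERENCE=hg38 \
      --command 'echo "Processing ${SAMPLE_ID} against ${REFERENCE}"'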

@@ -278,9 +296,7 @@ can find in the [Google Cloud Console](https://console.cloud.google.com).
To specify input and output files, use the `--input` and `--output` flags:

dsub \
... \
--input INPUT_FILE=gs://my-bucket/my-input-file \
--output OUTPUT_FILE=gs://my-bucket/my-output-file \
--command 'cat ${INPUT_FILE} > ${OUTPUT_FILE}'
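
The `--input` flag also accepts wildcards in the filename. A sketch (assuming
`.bam` files exist under the path; the environment variable then holds a local
path pattern that the shell expands):

    dsub \
      ... \
      --input INPUT_FILES=gs://my-bucket/*.bam \
      --command 'ls -l ${INPUT_FILES}'
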
@@ -300,15 +316,42 @@ To copy folders rather than files, use the `--input-recursive` or
`--output-recursive` flags:

dsub \
... \
--input-recursive FOLDER=gs://my-bucket/my-folder \
--command 'find ${FOLDER} -name "foo*"'

#### Mounting buckets

The `google-v2` provider supports mounting a Cloud Storage bucket using
[Cloud Storage FUSE](https://cloud.google.com/storage/docs/gcs-fuse). This
capability is currently experimental, but may be most useful when:

1. You have a large input file in Cloud Storage over which your code makes
a single read pass or only needs to read a small range of bytes.
2. You have a large set of resource files in Cloud Storage, your code only reads
a subset of those files, and the decision of which files to read is determined
at runtime.

Writing to a mounted bucket is not recommended.

Please read
[Key differences from a POSIX file system](https://cloud.google.com/storage/docs/gcs-fuse#notes)
and [Semantics](https://github.com/GoogleCloudPlatform/gcsfuse/blob/master/docs/semantics.md)
before using Cloud Storage FUSE.

To mount a Cloud Storage bucket with the `google-v2` provider, use the `--mount`
command line flag:

--mount MYBUCKET=gs://mybucket

The bucket will be mounted to a local path given by the environment variable
`${MYBUCKET}`. Inside your script, you can reference the local path using the
environment variable.
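
For example, a minimal sketch (assuming `gs://mybucket` contains
`my-file.txt`):

    dsub \
      --provider google-v2 \
      ... \
      --mount MYBUCKET=gs://mybucket \
      --command 'cat "${MYBUCKET}/my-file.txt"'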

##### Notice

For the `google` provider, as a getting started convenience, if
`--input-recursive` or `--output-recursive`
are used, `dsub` will automatically check for and, if needed, install the
[Google Cloud SDK](https://cloud.google.com/sdk/docs/) in the Docker container
at runtime (before your script executes).
@@ -322,6 +365,9 @@ If you use a Debian or Ubuntu Docker image, you are encouraged to use the
If you use a Red Hat or CentOS Docker image, you are encouraged to use the
[package installation instructions](https://cloud.google.com/sdk/downloads#yum).

**Installation of the Cloud SDK into your Docker image is not needed for the
`local` or the `google-v2` providers.**

### Setting resource requirements

`dsub` tasks run using the `local` provider will use the resources available on
@@ -377,15 +423,11 @@ The file may be read from the local filesystem (on the machine you're calling
For example, suppose `my-tasks.tsv` contains 101 lines: a one-line header and
100 lines of parameters for tasks to run. Then:

dsub ... --tasks ./my-tasks.tsv

will create a job with 100 tasks, while:

dsub ... --tasks ./my-tasks.tsv 1-10

will create a job with 10 tasks, one for each of lines 2 through 11.
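
Each column header in the tasks file names a parameter (`--env`, `--input`, or
`--output`), and each subsequent row supplies the values for one task. A
hypothetical `my-tasks.tsv` (columns must be tab-separated; the names here are
illustrative):

    --env SAMPLE_ID    --input INPUT_VCF             --output OUTPUT_FILE
    sample1            gs://my-bucket/sample1.vcf    gs://my-bucket/results/sample1.out
    sample2            gs://my-bucket/sample2.vcf    gs://my-bucket/results/sample2.out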

@@ -423,14 +465,14 @@ For more details, see [Checking Status and Troubleshooting Jobs](docs/troublesho

The `dstat` command displays the status of jobs:

dstat --provider google-v2 --project my-cloud-project

With no additional arguments, dstat will display a list of *running* jobs for
the current `USER`.

To display the status of a specific job, use the `--jobs` flag:

dstat --provider google-v2 --project my-cloud-project --jobs job-id

For a batch job, the output will list all *running* tasks.
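
To see tasks in any state, not just *running* ones, add the `--status` flag,
for example:

    dstat --provider google-v2 --project my-cloud-project --jobs job-id --status '*'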

@@ -449,13 +491,15 @@ each job includes:

Metadata can be used to cancel a job or individual tasks within a batch job.

For more details, see [Checking Status and Troubleshooting Jobs](docs/troubleshooting.md).

#### Summarizing job status

By default, dstat outputs one line per task. If you're using a batch job with
many tasks then you may benefit from `--summary`.

```
$ dstat --provider google-v2 --project my-project --summary
Job Name        Status         Task Count
-------------   -------------  -------------
@@ -472,24 +516,25 @@ how many are failed/canceled.
The `ddel` command will delete running jobs.

By default, only jobs submitted by the current user will be deleted.
Use the `--users` flag to specify other users, or `'*'` for all users.

To delete a running job:

ddel --provider google-v2 --project my-cloud-project --jobs job-id

If the job is a batch job, all running tasks will be deleted.

To delete specific tasks:

ddel \
--provider google-v2 \
--project my-cloud-project \
--jobs job-id \
--tasks task-id1 task-id2

To delete all running jobs for the current user:

ddel --provider google-v2 --project my-cloud-project --jobs '*'

## What next?

6 changes: 2 additions & 4 deletions docs/code.md
@@ -36,13 +36,11 @@ a simple file rewrite operation using

**Be sure to enclose your `command string` in single quotes and not double
quotes. If you use double quotes, the command will be expanded in your local
shell before being passed to `dsub`. If your command flag was in double quotes
as:**

dsub \
... \
--env MESSAGE=hello \
--command "echo ${MESSAGE}"

30 changes: 15 additions & 15 deletions docs/job_control.md
@@ -61,35 +61,35 @@ dsub ... --after "${JOB_A}" "${JOB_B}"
Here is the output of a sample run:

```
$ JOBID_A=$(dsub --provider google-v2 --project "${MYPROJECT}" --regions us-central1 \
--logging "gs://${MYBUCKET}/logging/" \
--command 'echo "hello from job A"')
Job: echo--<user>--180924-112256-64
Launched job-id: echo--<user>--180924-112256-64
To check the status, run:
dstat --provider google-v2 --project ${MYPROJECT} --jobs 'echo--<user>--180924-112256-64' --status '*'
To cancel the job, run:
ddel --provider google-v2 --project ${MYPROJECT} --jobs 'echo--<user>--180924-112256-64'
$ echo "${JOBID_A}"
echo--<user>--180924-112256-64
$ JOBID_B=... (similar)
$ JOBID_C=$(dsub --provider google-v2 --project "${MYPROJECT}" --regions us-central1 \
--logging "gs://${MYBUCKET}/logging/" \
--command 'echo "job C"' --after "${JOBID_A}" "${JOBID_B}")
Waiting for predecessor jobs to complete...
Waiting for: echo--<user>--180924-112256-64, echo--<user>--180924-112259-48.
echo--<user>--180924-112256-64: SUCCESS
Waiting for: echo--<user>--180924-112259-48.
echo--<user>--180924-112259-48: SUCCESS
Launched job-id: echo--<user>--180924-112302-87
To check the status, run:
dstat --provider google-v2 --project ${MYPROJECT} --jobs 'echo--<user>--180924-112302-87' --status '*'
To cancel the job, run:
ddel --provider google-v2 --project ${MYPROJECT} --jobs 'echo--<user>--180924-112302-87'
echo--<user>--180924-112302-87
```

## --after is blocking