Merge pull request #296 from DataBiosphere/dev

PR for 0.5.0 release

wnojopra authored Sep 3, 2024
2 parents 0cbbdb5 + 88e8b72 commit dba9b86
Showing 39 changed files with 182 additions and 479 deletions.
86 changes: 36 additions & 50 deletions README.md
@@ -52,7 +52,7 @@ your shell.

#### Install the Google Cloud SDK

While not used directly by `dsub` for the `google-v2` or `google-cls-v2` providers, you are likely to want to install the command line tools found in the [Google
While not used directly by `dsub` for the `google-batch` or `google-cls-v2` providers, you are likely to want to install the command line tools found in the [Google
Cloud SDK](https://cloud.google.com/sdk/).
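
Once the SDK is installed, a typical first step is to initialize it and log in
(a sketch; `gcloud init` walks through account and default-project setup):

    gcloud init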

If you will be using the `local` provider for faster job development,
@@ -156,13 +156,13 @@ You'll get quicker turnaround times and won't incur cloud charges using it.

### Getting started on Google Cloud

`dsub` supports the use of two different APIs from Google Cloud for running
tasks. Google Cloud is transitioning from `Genomics v2alpha1`
to [Cloud Life Sciences v2beta](https://cloud.google.com/life-sciences/docs/reference/rest).
`dsub` currently supports the [Cloud Life Sciences v2beta](https://cloud.google.com/life-sciences/docs/reference/rest)
API from Google Cloud and is developing support for the [Batch](https://cloud.google.com/batch/docs/reference/rest)
API from Google Cloud.

`dsub` supports both APIs with the (old) `google-v2` and (new) `google-cls-v2`
providers respectively. `google-v2` is the current default provider. `dsub`
will be transitioning to make `google-cls-v2` the default in coming releases.
`dsub` supports the v2beta API with the `google-cls-v2` provider.
`google-cls-v2` is the current default provider. `dsub` will be transitioning to
make `google-batch` the default in coming releases.

The steps for getting started differ slightly as indicated in the steps below:

@@ -171,13 +171,14 @@

1. Enable the APIs:

- For the `v2alpha1` API (provider: `google-v2`):
- For the `v2beta` API (provider: `google-cls-v2`):

[Enable the Genomics, Storage, and Compute APIs](https://console.cloud.google.com/flows/enableapi?apiid=genomics,storage_component,compute_component&redirect=https://console.cloud.google.com).
[Enable the Cloud Life Sciences, Storage, and Compute APIs](https://console.cloud.google.com/flows/enableapi?apiid=lifesciences.googleapis.com,storage.googleapis.com,compute.googleapis.com&redirect=https://console.cloud.google.com)

- For the `v2beta` API (provider: `google-cls-v2`):
- For the `batch` API (provider: `google-batch`):

[Enable the Batch, Storage, and Compute APIs](https://console.cloud.google.com/flows/enableapi?apiid=batch.googleapis.com,storage.googleapis.com,compute.googleapis.com&redirect=https://console.cloud.google.com).

[Enable the Cloud Life Sciences, Storage, and Compute APIs](https://console.cloud.google.com/flows/enableapi?apiid=lifesciences.googleapis.com,storage_component,compute_component&redirect=https://console.cloud.google.com)
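
  These APIs can also be enabled from the command line with the standard
  `gcloud` tool (a sketch; the service names mirror those in the console links
  above):

      # Assumes your default project is set, e.g. via `gcloud config set project my-cloud-project`
      gcloud services enable \
        lifesciences.googleapis.com \
        batch.googleapis.com \
        storage.googleapis.com \
        compute.googleapis.com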

1. Provide [credentials](https://developers.google.com/identity/protocols/application-default-credentials)
so `dsub` can call Google APIs:
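
  On a local workstation, one common way to provide Application Default
  Credentials is with the `gcloud` CLI (a sketch; the collapsed list in the
  original README describes the full set of options):

      gcloud auth application-default login
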
@@ -202,10 +203,10 @@ The steps for getting started differ slightly as indicated in the steps below:

1. Run a very simple "Hello World" `dsub` job and wait for completion.

- For the `v2alpha1` API (provider: `google-v2`):
- For the `v2beta` API (provider: `google-cls-v2`):

dsub \
--provider google-v2 \
--provider google-cls-v2 \
--project my-cloud-project \
--regions us-central1 \
--logging gs://my-bucket/logging/ \
@@ -216,10 +217,10 @@ The steps for getting started differ slightly as indicated in the steps below:
Change `my-cloud-project` to your Google Cloud project, and `my-bucket` to
the bucket you created above.

- For the `v2beta` API (provider: `google-cls-v2`):
- For the `batch` API (provider: `google-batch`):

dsub \
--provider google-cls-v2 \
--provider google-batch \
--project my-cloud-project \
--regions us-central1 \
--logging gs://my-bucket/logging/ \
@@ -246,14 +247,13 @@ To this end, `dsub` provides multiple "backend providers", each of which
implements a consistent runtime environment. The current providers are:

- local
- google-v2 (the default)
- google-cls-v2
- google-cls-v2 (the default)
- google-batch (*new*)

More details on the runtime environment implemented by the backend providers
can be found in [dsub backend providers](https://github.com/DataBiosphere/dsub/blob/main/docs/providers/README.md).
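
As a quick way to exercise a provider without creating any cloud resources, a
minimal `local` provider run might look like this (a sketch; the logging path
is just a local directory):

    dsub \
      --provider local \
      --logging /tmp/dsub-test/logging/ \
      --command 'echo "hello from the local provider"' \
      --wait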

### Differences between `google-v2`, `google-cls-v2` and `google-batch`
### Differences between `google-cls-v2` and `google-batch`

The `google-cls-v2` provider is built on the Cloud Life Sciences `v2beta` API.
This API is very similar to its predecessor, the Genomics `v2alpha1` API.
@@ -265,29 +265,15 @@ Details of Cloud Life Sciences versus Batch can be found in this
[Migration Guide](https://cloud.google.com/batch/docs/migrate-to-batch-from-cloud-life-sciences).

`dsub` largely hides the differences between the APIs, but there are a
few difference to note:

- `v2beta` and Cloud Batch are regional services, `v2alpha1` is a global service

What this means is that with `v2alpha1`, the metadata about your tasks
(called "operations"), is stored in a global database, while with `v2beta` and
Cloud Batch, the metadata about your tasks are stored in a regional database. If
your operation/job information needs to stay in a particular region, use the
`v2beta` or Batch API (the `google-cls-v2` or `google-batch` provider), and
specify the `--location` where your operation/job information should be stored.
few differences to note:

- The `--regions` and `--zones` flags can be omitted when using `google-cls-v2` and `google-batch`
- `google-batch` requires jobs to run in one region

The `--regions` and `--zones` flags for `dsub` specify where the tasks should
run. More specifically, this specifies what Compute Engine Zones to use for
the VMs that run your tasks.

With the `google-v2` provider, there is no default region or zone, and thus
one of the `--regions` or `--zones` flags is required.

With `google-cls-v2` and `google-batch`, the `--location` flag defaults to
`us-central1`, and if the `--regions` and `--zones` flags are omitted, the
`location` will be used as the default `regions` list.
run. The `google-cls-v2` provider allows you to specify a multi-region like `US`,
multiple regions, or multiple zones across regions. With the `google-batch`
provider, you must specify either one region or multiple zones within a single
region.
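
For example (a sketch; project, bucket, and region values are placeholders):

    # google-cls-v2: one or more regions may be listed
    dsub \
      --provider google-cls-v2 \
      --project my-cloud-project \
      --regions us-central1 us-east1 \
      --logging gs://my-bucket/logging/ \
      --command 'echo "hello"'

    # google-batch: a single region (or zones within a single region)
    dsub \
      --provider google-batch \
      --project my-cloud-project \
      --regions us-central1 \
      --logging gs://my-bucket/logging/ \
      --command 'echo "hello"'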

## `dsub` features

@@ -463,15 +449,15 @@ mounting read-only:
[Compute Engine Image](https://cloud.google.com/compute/docs/images) that you
pre-create.

The `google-v2` and `google-cls-v2` providers support these methods of
The `google-cls-v2` and `google-batch` providers support these methods of
providing access to resource data.

The `local` provider supports mounting a
local directory in a similar fashion to support your local development.

##### Mounting a Google Cloud Storage bucket

To have the `google-v2`, `google-cls-v2`, or `google-batch` provider mount a
To have the `google-cls-v2` or `google-batch` provider mount a
Cloud Storage bucket using
[Cloud Storage FUSE](https://cloud.google.com/storage/docs/gcs-fuse), use the
`--mount` command line flag:
@@ -488,15 +474,15 @@

##### Mounting an existing persistent disk

To have the `google-v2` or `google-cls-v2` provider mount a persistent disk that
To have the `google-cls-v2` or `google-batch` provider mount a persistent disk that
you have pre-created and populated, use the `--mount` command line flag and the
url of the source disk:

--mount RESOURCES="https://www.googleapis.com/compute/v1/projects/your-project/zones/your_disk_zone/disks/your-disk"

##### Mounting a persistent disk, created from an image

To have the `google-v2` or `google-cls-v2` provider mount a persistent disk created from an image,
To have the `google-cls-v2` or `google-batch` provider mount a persistent disk created from an image,
use the `--mount` command line flag and the url of the source image and the size
(in GB) of the disk:
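
An illustrative value for the flag, assuming the "image URL, space, disk size
in GB" form used by `dsub` (project and image names are placeholders):

    --mount RESOURCES="https://www.googleapis.com/compute/v1/projects/your-project/global/images/your-image 50"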

@@ -527,7 +513,7 @@ path using the environment variable.
`dsub` tasks run using the `local` provider will use the resources available on
your local machine.

`dsub` tasks run using the `google`, `google-v2`, or `google-cls-v2` providers can take advantage
`dsub` tasks run using the `google-cls-v2` or `google-batch` providers can take advantage
of a wide range of CPU, RAM, disk, and hardware accelerator (e.g., GPU) options.
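
For example, a task needing more CPU and memory than the defaults might be
submitted like this (a sketch; values are placeholders, and `--min-ram` is in
GB):

    dsub \
      --provider google-cls-v2 \
      --project my-cloud-project \
      --regions us-central1 \
      --logging gs://my-bucket/logging/ \
      --min-cores 4 \
      --min-ram 16 \
      --command 'echo "more cores and RAM"'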

See the [Compute Resources](https://github.com/DataBiosphere/dsub/blob/main/docs/compute_resources.md)
@@ -634,14 +620,14 @@ For more details, see [Checking Status and Troubleshooting Jobs](https://github.

The `dstat` command displays the status of jobs:

dstat --provider google-v2 --project my-cloud-project
dstat --provider google-cls-v2 --project my-cloud-project

With no additional arguments, dstat will display a list of *running* jobs for
the current `USER`.

To display the status of a specific job, use the `--jobs` flag:

dstat --provider google-v2 --project my-cloud-project --jobs job-id
dstat --provider google-cls-v2 --project my-cloud-project --jobs job-id

For a batch job, the output will list all *running* tasks.

@@ -673,7 +659,7 @@ By default, dstat outputs one line per task. If you're using a batch job with
many tasks then you may benefit from `--summary`.

```
$ dstat --provider google-v2 --project my-project --status '*' --summary
$ dstat --provider google-cls-v2 --project my-project --status '*' --summary
Job Name Status Task Count
------------- ------------- -------------
@@ -694,25 +680,25 @@ Use the `--users` flag to specify other users, or `'*'` for all users.
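
For example (a sketch):

    dstat --provider google-cls-v2 --project my-cloud-project --users '*'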

To delete a running job:

ddel --provider google-v2 --project my-cloud-project --jobs job-id
ddel --provider google-cls-v2 --project my-cloud-project --jobs job-id

If the job is a batch job, all running tasks will be deleted.

To delete specific tasks:

ddel \
--provider google-v2 \
--provider google-cls-v2 \
--project my-cloud-project \
--jobs job-id \
--tasks task-id1 task-id2

To delete all running jobs for the current user:

ddel --provider google-v2 --project my-cloud-project --jobs '*'
ddel --provider google-cls-v2 --project my-cloud-project --jobs '*'

## Service Accounts and Scope (Google providers only)

When you run the `dsub` command with the `google-v2` or `google-cls-v2`
When you run the `dsub` command with the `google-cls-v2` or `google-batch`
provider, there are two different sets of credentials to consider:

- Account submitting the `pipelines.run()` request to run your command/script on a VM
12 changes: 6 additions & 6 deletions docs/job_control.md
@@ -61,22 +61,22 @@ dsub ... --after "${JOB_A}" "${JOB_B}"
Here is the output of a sample run:

```
$ JOBID_A=$(dsub --provider google-v2 --project "${MYPROJECT}" --regions us-central1 \
$ JOBID_A=$(dsub --provider google-cls-v2 --project "${MYPROJECT}" --regions us-central1 \
--logging "gs://${MYBUCKET}/logging/" \
--command 'echo "hello from job A"')
Job: echo--<user>--180924-112256-64
Launched job-id: echo--<user>--180924-112256-64
To check the status, run:
dstat --provider google-v2 --project ${MYPROJECT} --jobs 'echo--<user>--180924-112256-64' --status '*'
dstat --provider google-cls-v2 --project ${MYPROJECT} --jobs 'echo--<user>--180924-112256-64' --status '*'
To cancel the job, run:
ddel --provider google-v2 --project ${MYPROJECT} --jobs 'echo--<user>--180924-112256-64'
ddel --provider google-cls-v2 --project ${MYPROJECT} --jobs 'echo--<user>--180924-112256-64'
$ echo "${JOBID_A}"
echo--<user>--180924-112256-64
$ JOBID_B=... (similar)
$ JOBID_C=$(dsub --provider google-v2 --project "${MYPROJECT}" --regions us-central1 \
$ JOBID_C=$(dsub --provider google-cls-v2 --project "${MYPROJECT}" --regions us-central1 \
--logging "gs://${MYBUCKET}/logging/" \
--command 'echo "job C"' --after "${JOBID_A}" "${JOBID_B}")
Waiting for predecessor jobs to complete...
@@ -86,9 +86,9 @@ Waiting for: echo--<user>--180924-112259-48.
echo--<user>--180924-112259-48: SUCCESS
Launched job-id: echo--<user>--180924-112302-87
To check the status, run:
dstat --provider google-v2 --project ${MYPROJECT} --jobs 'echo--<user>--180924-112302-87' --status '*'
dstat --provider google-cls-v2 --project ${MYPROJECT} --jobs 'echo--<user>--180924-112302-87' --status '*'
To cancel the job, run:
ddel --provider google-v2 --project ${MYPROJECT} --jobs 'echo--<user>--180924-112302-87'
ddel --provider google-cls-v2 --project ${MYPROJECT} --jobs 'echo--<user>--180924-112302-87'
echo--<user>--180924-112302-87
```

24 changes: 12 additions & 12 deletions docs/providers/README.md
@@ -10,8 +10,8 @@ implements a consistent runtime environment. The current supported providers
are:

- local
- google-v2 (the default)
- google-cls-v2 (*new*)
- google-cls-v2 (the default)
- google-batch (*new*)

## Runtime environment

@@ -194,13 +194,13 @@ During execution, `runner.sh` writes the following files to record task state:
The `local` provider does not support resource-related flags such as
`--min-cpu`, `--min-ram`, `--boot-disk-size`, or `--disk-size`.

### `google-v2` and `google-cls-v2` providers
### `google-cls-v2` and `google-batch` providers

The `google-v2` and `google-cls-v2` providers share a significant amount of
their implementation. The `google-v2` provider utilizes the Google Genomics
Pipelines API `v2alpha1`
while the `google-cls-v2` provider utilizes the Google Cloud Life Sciences
The `google-cls-v2` and `google-batch` providers share a significant amount of
their implementation. The `google-cls-v2` provider utilizes the Google Cloud Life Sciences
Pipelines API [v2beta](https://cloud.google.com/life-sciences/docs/apis)
while the `google-batch` provider utilizes the Google Cloud
[Batch API](https://cloud.google.com/batch/docs/reference/rest)
to queue a request for the following sequence of events:

1. Create a Google Compute Engine
@@ -282,7 +282,7 @@ its status is `RUNNING`.

#### Logging

The `google-v2` provider saves 3 log files to Cloud Storage, every 5 minutes
The `google-cls-v2` and `google-batch` providers save 3 log files to Cloud Storage every 5 minutes
to the `--logging` location specified to `dsub`:

- `[prefix].log`: log generated by all containers running on the VM
@@ -293,7 +293,7 @@ Logging paths and the `[prefix]` are discussed further in [Logging](../logging.m

#### Resource requirements

The `google-v2` and `google-cls-v2` providers support many resource-related
The `google-cls-v2` and `google-batch` providers support many resource-related
flags to configure the Compute Engine VMs that tasks run on, such as
`--machine-type` or `--min-cores` and `--min-ram`, as well as `--boot-disk-size`
and `--disk-size`. Additional provider-specific parameters are available
@@ -311,12 +311,12 @@ large Docker images are used, as such images need to be pulled to the boot disk.
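
A sketch combining several of these flags (values are placeholders; disk sizes
are in GB):

    dsub \
      --provider google-cls-v2 \
      --project my-cloud-project \
      --regions us-central1 \
      --logging gs://my-bucket/logging/ \
      --machine-type n1-standard-8 \
      --boot-disk-size 50 \
      --disk-size 200 \
      --command 'echo "bigger VM and disks"'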

#### Provider specific parameters

The following `dsub` parameters are specific to the `google-v2` and
`google-cls-v2` providers:
The following `dsub` parameters are specific to the `google-cls-v2` and
`google-batch` providers:

* [Location resources](https://cloud.google.com/about/locations)

- `--location` (`google-cls-v2` only):
- `--location`:
- Specifies the Google Cloud region to which the pipeline request will be
sent and where operation metadata will be stored. The associated dsub task
may be executed in another region if the `--regions` or `--zones`