Merge pull request #296 from DataBiosphere/dev

PR for 0.5.0 release

wnojopra authored Sep 3, 2024
2 parents 0cbbdb5 + 88e8b72 commit dba9b86
Showing 39 changed files with 182 additions and 479 deletions.
86 changes: 36 additions & 50 deletions README.md
@@ -52,7 +52,7 @@ your shell.

#### Install the Google Cloud SDK

While not used directly by `dsub` for the `google-v2` or `google-cls-v2` providers, you are likely to want to install the command line tools found in the [Google
While not used directly by `dsub` for the `google-batch` or `google-cls-v2` providers, you are likely to want to install the command line tools found in the [Google
Cloud SDK](https://cloud.google.com/sdk/).
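
Once the SDK is installed, a typical first step is to initialize it and log in
(a sketch; `gcloud init` walks through account and default-project setup):

    gcloud init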

If you will be using the `local` provider for faster job development,
@@ -156,13 +156,13 @@ You'll get quicker turnaround times and won't incur cloud charges using it.

### Getting started on Google Cloud

`dsub` supports the use of two different APIs from Google Cloud for running
tasks. Google Cloud is transitioning from `Genomics v2alpha1`
to [Cloud Life Sciences v2beta](https://cloud.google.com/life-sciences/docs/reference/rest).
`dsub` currently supports the [Cloud Life Sciences v2beta](https://cloud.google.com/life-sciences/docs/reference/rest)
API from Google Cloud and is developing support for the [Batch](https://cloud.google.com/batch/docs/reference/rest)
API from Google Cloud.

`dsub` supports both APIs with the (old) `google-v2` and (new) `google-cls-v2`
providers respectively. `google-v2` is the current default provider. `dsub`
will be transitioning to make `google-cls-v2` the default in coming releases.
`dsub` supports the v2beta API with the `google-cls-v2` provider.
`google-cls-v2` is the current default provider. `dsub` will be transitioning to
make `google-batch` the default in coming releases.

The steps for getting started differ slightly as indicated in the steps below:

@@ -171,13 +171,14 @@

1. Enable the APIs:

- For the `v2alpha1` API (provider: `google-v2`):
- For the `v2beta` API (provider: `google-cls-v2`):

[Enable the Genomics, Storage, and Compute APIs](https://console.cloud.google.com/flows/enableapi?apiid=genomics,storage_component,compute_component&redirect=https://console.cloud.google.com).
[Enable the Cloud Life Sciences, Storage, and Compute APIs](https://console.cloud.google.com/flows/enableapi?apiid=lifesciences.googleapis.com,storage.googleapis.com,compute.googleapis.com&redirect=https://console.cloud.google.com)

- For the `v2beta` API (provider: `google-cls-v2`):
- For the `batch` API (provider: `google-batch`):

[Enable the Batch, Storage, and Compute APIs](https://console.cloud.google.com/flows/enableapi?apiid=batch.googleapis.com,storage.googleapis.com,compute.googleapis.com&redirect=https://console.cloud.google.com).

[Enable the Cloud Life Sciences, Storage, and Compute APIs](https://console.cloud.google.com/flows/enableapi?apiid=lifesciences.googleapis.com,storage_component,compute_component&redirect=https://console.cloud.google.com)
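
  These APIs can also be enabled from the command line with the standard
  `gcloud` tool (a sketch; the service names mirror those in the console links
  above):

      # Assumes your default project is set, e.g. via `gcloud config set project my-cloud-project`
      gcloud services enable \
        lifesciences.googleapis.com \
        batch.googleapis.com \
        storage.googleapis.com \
        compute.googleapis.com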

1. Provide [credentials](https://developers.google.com/identity/protocols/application-default-credentials)
so `dsub` can call Google APIs:
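
  On a local workstation, one common way to provide Application Default
  Credentials is with the `gcloud` CLI (a sketch; the collapsed list in the
  original README describes the full set of options):

      gcloud auth application-default login
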
@@ -202,10 +203,10 @@ The steps for getting started differ slightly as indicated in the steps below:

1. Run a very simple "Hello World" `dsub` job and wait for completion.

- For the `v2alpha1` API (provider: `google-v2`):
- For the `v2beta` API (provider: `google-cls-v2`):

dsub \
--provider google-v2 \
--provider google-cls-v2 \
--project my-cloud-project \
--regions us-central1 \
--logging gs://my-bucket/logging/ \
@@ -216,10 +217,10 @@ The steps for getting started differ slightly as indicated in the steps below:
Change `my-cloud-project` to your Google Cloud project, and `my-bucket` to
the bucket you created above.

- For the `v2beta` API (provider: `google-cls-v2`):
- For the `batch` API (provider: `google-batch`):

dsub \
--provider google-cls-v2 \
--provider google-batch \
--project my-cloud-project \
--regions us-central1 \
--logging gs://my-bucket/logging/ \
@@ -246,14 +247,13 @@ To this end, `dsub` provides multiple "backend providers", each of which
implements a consistent runtime environment. The current providers are:

- local
- google-v2 (the default)
- google-cls-v2
- google-cls-v2 (the default)
- google-batch (*new*)

More details on the runtime environment implemented by the backend providers
can be found in [dsub backend providers](https://github.com/DataBiosphere/dsub/blob/main/docs/providers/README.md).
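
As a quick way to exercise a provider without creating any cloud resources, a
minimal `local` provider run might look like this (a sketch; the logging path
is just a local directory):

    dsub \
      --provider local \
      --logging /tmp/dsub-test/logging/ \
      --command 'echo "hello from the local provider"' \
      --wait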

### Differences between `google-v2`, `google-cls-v2` and `google-batch`
### Differences between `google-cls-v2` and `google-batch`

The `google-cls-v2` provider is built on the Cloud Life Sciences `v2beta` API.
This API is very similar to its predecessor, the Genomics `v2alpha1` API.
@@ -265,29 +265,15 @@ Details of Cloud Life Sciences versus Batch can be found in this
[Migration Guide](https://cloud.google.com/batch/docs/migrate-to-batch-from-cloud-life-sciences).

`dsub` largely hides the differences between the APIs, but there are a
few difference to note:

- `v2beta` and Cloud Batch are regional services, `v2alpha1` is a global service

What this means is that with `v2alpha1`, the metadata about your tasks
(called "operations"), is stored in a global database, while with `v2beta` and
Cloud Batch, the metadata about your tasks are stored in a regional database. If
your operation/job information needs to stay in a particular region, use the
`v2beta` or Batch API (the `google-cls-v2` or `google-batch` provider), and
specify the `--location` where your operation/job information should be stored.
few differences to note:

- The `--regions` and `--zones` flags can be omitted when using `google-cls-v2` and `google-batch`
- `google-batch` requires jobs to run in one region

The `--regions` and `--zones` flags for `dsub` specify where the tasks should
run. More specifically, this specifies what Compute Engine Zones to use for
the VMs that run your tasks.

With the `google-v2` provider, there is no default region or zone, and thus
one of the `--regions` or `--zones` flags is required.

With `google-cls-v2` and `google-batch`, the `--location` flag defaults to
`us-central1`, and if the `--regions` and `--zones` flags are omitted, the
`location` will be used as the default `regions` list.
run. The `google-cls-v2` provider allows you to specify a multi-region like `US`,
multiple regions, or multiple zones across regions. With the `google-batch`
provider, you must specify either one region or multiple zones within a single
region.
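
For example (a sketch; project, bucket, and region values are placeholders):

    # google-cls-v2: one or more regions may be listed
    dsub \
      --provider google-cls-v2 \
      --project my-cloud-project \
      --regions us-central1 us-east1 \
      --logging gs://my-bucket/logging/ \
      --command 'echo "hello"'

    # google-batch: a single region (or zones within a single region)
    dsub \
      --provider google-batch \
      --project my-cloud-project \
      --regions us-central1 \
      --logging gs://my-bucket/logging/ \
      --command 'echo "hello"'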

## `dsub` features

@@ -463,15 +449,15 @@ mounting read-only:
[Compute Engine Image](https://cloud.google.com/compute/docs/images) that you
pre-create.

The `google-v2` and `google-cls-v2` providers support these methods of
The `google-cls-v2` and `google-batch` providers support these methods of
providing access to resource data.

The `local` provider supports mounting a
local directory in a similar fashion to support your local development.

##### Mounting a Google Cloud Storage bucket

To have the `google-v2`, `google-cls-v2`, or `google-batch` provider mount a
To have the `google-cls-v2` or `google-batch` provider mount a
Cloud Storage bucket using
[Cloud Storage FUSE](https://cloud.google.com/storage/docs/gcs-fuse), use the
`--mount` command line flag:
@@ -488,15 +474,15 @@

##### Mounting an existing persistent disk

To have the `google-v2` or `google-cls-v2` provider mount a persistent disk that
To have the `google-cls-v2` or `google-batch` provider mount a persistent disk that
you have pre-created and populated, use the `--mount` command line flag and the
url of the source disk:

--mount RESOURCES="https://www.googleapis.com/compute/v1/projects/your-project/zones/your_disk_zone/disks/your-disk"

##### Mounting a persistent disk, created from an image

To have the `google-v2` or `google-cls-v2` provider mount a persistent disk created from an image,
To have the `google-cls-v2` or `google-batch` provider mount a persistent disk created from an image,
use the `--mount` command line flag and the url of the source image and the size
(in GB) of the disk:
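
An illustrative value for the flag, assuming the "image URL, space, disk size
in GB" form used by `dsub` (project and image names are placeholders):

    --mount RESOURCES="https://www.googleapis.com/compute/v1/projects/your-project/global/images/your-image 50"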

@@ -527,7 +513,7 @@ path using the environment variable.
`dsub` tasks run using the `local` provider will use the resources available on
your local machine.

`dsub` tasks run using the `google`, `google-v2`, or `google-cls-v2` providers can take advantage
`dsub` tasks run using the `google-cls-v2` or `google-batch` providers can take advantage
of a wide range of CPU, RAM, disk, and hardware accelerator (e.g., GPU) options.
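
For example, a task needing more CPU and memory than the defaults might be
submitted like this (a sketch; values are placeholders, and `--min-ram` is in
GB):

    dsub \
      --provider google-cls-v2 \
      --project my-cloud-project \
      --regions us-central1 \
      --logging gs://my-bucket/logging/ \
      --min-cores 4 \
      --min-ram 16 \
      --command 'echo "more cores and RAM"'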

See the [Compute Resources](https://github.com/DataBiosphere/dsub/blob/main/docs/compute_resources.md)
@@ -634,14 +620,14 @@ For more details, see [Checking Status and Troubleshooting Jobs](https://github.

The `dstat` command displays the status of jobs:

dstat --provider google-v2 --project my-cloud-project
dstat --provider google-cls-v2 --project my-cloud-project

With no additional arguments, dstat will display a list of *running* jobs for
the current `USER`.

To display the status of a specific job, use the `--jobs` flag:

dstat --provider google-v2 --project my-cloud-project --jobs job-id
dstat --provider google-cls-v2 --project my-cloud-project --jobs job-id

For a batch job, the output will list all *running* tasks.

@@ -673,7 +659,7 @@ By default, dstat outputs one line per task. If you're using a batch job with
many tasks then you may benefit from `--summary`.

```
$ dstat --provider google-v2 --project my-project --status '*' --summary
$ dstat --provider google-cls-v2 --project my-project --status '*' --summary
Job Name Status Task Count
------------- ------------- -------------
@@ -694,25 +680,25 @@ Use the `--users` flag to specify other users, or `'*'` for all users.
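
For example (a sketch):

    dstat --provider google-cls-v2 --project my-cloud-project --users '*'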

To delete a running job:

ddel --provider google-v2 --project my-cloud-project --jobs job-id
ddel --provider google-cls-v2 --project my-cloud-project --jobs job-id

If the job is a batch job, all running tasks will be deleted.

To delete specific tasks:

ddel \
--provider google-v2 \
--provider google-cls-v2 \
--project my-cloud-project \
--jobs job-id \
--tasks task-id1 task-id2

To delete all running jobs for the current user:

ddel --provider google-v2 --project my-cloud-project --jobs '*'
ddel --provider google-cls-v2 --project my-cloud-project --jobs '*'

## Service Accounts and Scope (Google providers only)

When you run the `dsub` command with the `google-v2` or `google-cls-v2`
When you run the `dsub` command with the `google-cls-v2` or `google-batch`
provider, there are two different sets of credentials to consider:

- Account submitting the `pipelines.run()` request to run your command/script on a VM
12 changes: 6 additions & 6 deletions docs/job_control.md
@@ -61,22 +61,22 @@ dsub ... --after "${JOB_A}" "${JOB_B}"
Here is the output of a sample run:

```
$ JOBID_A=$(dsub --provider google-v2 --project "${MYPROJECT}" --regions us-central1 \
$ JOBID_A=$(dsub --provider google-cls-v2 --project "${MYPROJECT}" --regions us-central1 \
--logging "gs://${MYBUCKET}/logging/" \
--command 'echo "hello from job A"')
Job: echo--<user>--180924-112256-64
Launched job-id: echo--<user>--180924-112256-64
To check the status, run:
dstat --provider google-v2 --project ${MYPROJECT} --jobs 'echo--<user>--180924-112256-64' --status '*'
dstat --provider google-cls-v2 --project ${MYPROJECT} --jobs 'echo--<user>--180924-112256-64' --status '*'
To cancel the job, run:
ddel --provider google-v2 --project ${MYPROJECT} --jobs 'echo--<user>--180924-112256-64'
ddel --provider google-cls-v2 --project ${MYPROJECT} --jobs 'echo--<user>--180924-112256-64'
$ echo "${JOBID_A}"
echo--<user>--180924-112256-64
$ JOBID_B=... (similar)
$ JOBID_C=$(dsub --provider google-v2 --project "${MYPROJECT}" --regions us-central1 \
$ JOBID_C=$(dsub --provider google-cls-v2 --project "${MYPROJECT}" --regions us-central1 \
--logging "gs://${MYBUCKET}/logging/" \
--command 'echo "job C"' --after "${JOBID_A}" "${JOBID_B}")
Waiting for predecessor jobs to complete...
@@ -86,9 +86,9 @@ Waiting for: echo--<user>--180924-112259-48.
echo--<user>--180924-112259-48: SUCCESS
Launched job-id: echo--<user>--180924-112302-87
To check the status, run:
dstat --provider google-v2 --project ${MYPROJECT} --jobs 'echo--<user>--180924-112302-87' --status '*'
dstat --provider google-cls-v2 --project ${MYPROJECT} --jobs 'echo--<user>--180924-112302-87' --status '*'
To cancel the job, run:
ddel --provider google-v2 --project ${MYPROJECT} --jobs 'echo--<user>--180924-112302-87'
ddel --provider google-cls-v2 --project ${MYPROJECT} --jobs 'echo--<user>--180924-112302-87'
echo--<user>--180924-112302-87
```

24 changes: 12 additions & 12 deletions docs/providers/README.md
@@ -10,8 +10,8 @@ implements a consistent runtime environment. The current supported providers
are:

- local
- google-v2 (the default)
- google-cls-v2 (*new*)
- google-cls-v2 (the default)
- google-batch (*new*)

## Runtime environment

@@ -194,13 +194,13 @@ During execution, `runner.sh` writes the following files to record task state:
The `local` provider does not support resource-related flags such as
`--min-cpu`, `--min-ram`, `--boot-disk-size`, or `--disk-size`.

### `google-v2` and `google-cls-v2` providers
### `google-cls-v2` and `google-batch` providers

The `google-v2` and `google-cls-v2` providers share a significant amount of
their implementation. The `google-v2` provider utilizes the Google Genomics
Pipelines API `v2alpha1`
while the `google-cls-v2` provider utilizes the Google Cloud Life Sciences
The `google-cls-v2` and `google-batch` providers share a significant amount of
their implementation. The `google-cls-v2` provider utilizes the Google Cloud Life Sciences
Pipelines API [v2beta](https://cloud.google.com/life-sciences/docs/apis)
while the `google-batch` provider utilizes the Google Cloud
[Batch API](https://cloud.google.com/batch/docs/reference/rest)
to queue a request for the following sequence of events:

1. Create a Google Compute Engine
@@ -282,7 +282,7 @@ its status is `RUNNING`.

#### Logging

The `google-v2` provider saves 3 log files to Cloud Storage, every 5 minutes
The `google-cls-v2` and `google-batch` providers save 3 log files to Cloud Storage every 5 minutes
to the `--logging` location specified to `dsub`:

- `[prefix].log`: log generated by all containers running on the VM
@@ -293,7 +293,7 @@ Logging paths and the `[prefix]` are discussed further in [Logging](../logging.m

#### Resource requirements

The `google-v2` and `google-cls-v2` providers support many resource-related
The `google-cls-v2` and `google-batch` providers support many resource-related
flags to configure the Compute Engine VMs that tasks run on, such as
`--machine-type` or `--min-cores` and `--min-ram`, as well as `--boot-disk-size`
and `--disk-size`. Additional provider-specific parameters are available
@@ -311,12 +311,12 @@ large Docker images are used, as such images need to be pulled to the boot disk.
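
A sketch combining several of these flags (values are placeholders; disk sizes
are in GB):

    dsub \
      --provider google-cls-v2 \
      --project my-cloud-project \
      --regions us-central1 \
      --logging gs://my-bucket/logging/ \
      --machine-type n1-standard-8 \
      --boot-disk-size 50 \
      --disk-size 200 \
      --command 'echo "bigger VM and disks"'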

#### Provider specific parameters

The following `dsub` parameters are specific to the `google-v2` and
`google-cls-v2` providers:
The following `dsub` parameters are specific to the `google-cls-v2` and
`google-batch` providers:

* [Location resources](https://cloud.google.com/about/locations)

- `--location` (`google-cls-v2` only):
- `--location`:
- Specifies the Google Cloud region to which the pipeline request will be
sent and where operation metadata will be stored. The associated dsub task
may be executed in another region if the `--regions` or `--zones`