Snakemake executor plugin: googlebatch

Author: Vanessa Sochat

This is the Google Batch external executor plugin for Snakemake. If you are migrating from Google Life Sciences, see this documentation. For the underlying Python SDK, see google-cloud-batch on GitHub.

Installation

Install this plugin with pip or mamba, e.g.:

pip install snakemake-executor-plugin-googlebatch

Usage

In order to use the plugin, run Snakemake (>=8.0) in the folder where your workflow code and config resides (containing either workflow/Snakefile or Snakefile) with the corresponding value for the executor flag:

snakemake --executor googlebatch --default-resources --jobs N ...

with N being the number of jobs you want to run in parallel and ... being any additional arguments you want to use (see below). The machine on which you run Snakemake must have the executor plugin installed, and, depending on the type of the executor plugin, have access to the target service of the executor plugin (e.g. an HPC middleware like slurm with the sbatch command, or internet access to submit jobs to some cloud provider, e.g. azure).

The flag --default-resources ensures that Snakemake auto-calculates the mem and disk resources for each job, based on the input file size. The values assumed there are conservative and should usually suffice. However, you can always override those defaults by specifying the resources in your Snakemake rules or via the --set-resources flag.

Depending on the executor plugin, you might either rely on a shared local filesystem or use a remote filesystem or storage. For the latter, you additionally have to use a suitable storage plugin (see section storage plugins in the sidebar of this catalog) and, where applicable, check for further recommendations in the sections below.

All arguments can also be persisted via a profile, so that they don’t have to be specified on each invocation. Here, this would mean the following entries inside the profile:

executor: googlebatch
default_resources: []

For specifying other default resources than the built-in ones, see the docs.

Settings

The executor plugin has the following settings (which can be passed via command line, the workflow or environment variables, if provided in the respective columns):

CLI argument	Description	Default	Required
`--googlebatch-project VALUE`	The name of the Google Project.	`None`	✓
`--googlebatch-region VALUE`	The name of the Google Project region (e.g., us-central1)	`None`	✓
`--googlebatch-container VALUE`	A custom container for use with Google Batch COS	`None`	✗
`--googlebatch-docker-password VALUE`	A docker registry password for COS if credentials are required	`None`	✗
`--googlebatch-docker-username VALUE`	A docker registry username for COS if credentials are required	`None`	✗
`--googlebatch-machine-type VALUE`	Google Cloud machine type or VM (mpitune on c2 and c2d family)	`'c2-standard-4'`	✗
`--googlebatch-labels VALUE`	Comma separated key value pairs to label job (e.g., model=a3,stage=test)	`''`	✗
`--googlebatch-image-family VALUE`	Google Cloud image family (defaults to hpc-centos-7)	`'hpc-centos-7'`	✗
`--googlebatch-image-project VALUE`	Selected image project (defaults to cloud-hpc-image-public)	`'cloud-hpc-image-public'`	✗
`--googlebatch-work-tasks VALUE`	The default number of work tasks (these are NOT MPI ranks)	`1`	✗
`--googlebatch-cpu-milli VALUE`	Milliseconds per cpu-second	`1000`	✗
`--googlebatch-boot-disk-gb VALUE`	Boot disk size (GB)	`None`	✗
`--googlebatch-network VALUE`	The URL of an existing network resource	`None`	✗
`--googlebatch-subnetwork VALUE`	The URL of an existing subnetwork resource	`None`	✗
`--googlebatch-service-account VALUE`	The email of a custom compute service account	`None`	✗
`--googlebatch-boot-disk-type VALUE`	Boot disk type. (e.g., gcloud compute disk-types list)	`None`	✗
`--googlebatch-boot-disk-image VALUE`	Boot disk image (e.g., batch-debian, batch-centos)	`None`	✗
`--googlebatch-work-tasks-per-node VALUE`	The default number of work tasks per node (NOT MPI ranks)	`1`	✗
`--googlebatch-memory VALUE`	Memory in MiB	`1000`	✗
`--googlebatch-mount-path VALUE`	Mount path for Google bucket (if defined)	`'/mnt/share'`	✗
`--googlebatch-retry-count VALUE`	Retry count (default to 1)	`1`	✗
`--googlebatch-max-run-duration VALUE`	Maximum run duration, string (e.g., 3600s)	`'3600s'`	✗
`--googlebatch-snippets VALUE`	One or more snippets to add to the Google Batch task setup	`None`	✗
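
The flag names in the table follow a regular pattern, so a launcher script can assemble the command line from a dictionary of settings. The following is a minimal sketch of that idea, not something the plugin provides: `build_command` and the settings dictionary are hypothetical names, and only flags listed in the table are used.

```python
import shlex


def build_command(settings, jobs=1):
    """Assemble a snakemake command line from googlebatch settings.

    Keys use underscores (e.g. "machine_type") and are mapped to the
    --googlebatch-<name> flags listed in the settings table.
    """
    cmd = ["snakemake", "--executor", "googlebatch", "--jobs", str(jobs)]
    for key, value in settings.items():
        flag = "--googlebatch-" + key.replace("_", "-")
        cmd.extend([flag, str(value)])
    return cmd


cmd = build_command(
    {"project": "myproject", "region": "us-central1", "machine_type": "c2-standard-4"}
)
# Print a copy-pasteable shell command.
print(shlex.join(cmd))
```

The list returned by `build_command` could also be passed to `subprocess.run` directly, which avoids shell quoting issues.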

Further details

Setup

You’ll likely want to start by setting up application default credentials. The easiest way to do this is to run:

gcloud auth application-default login

Quick Start

The basic usage is, from a directory with your Snakefile, to ask for googlebatch as the executor.

$ snakemake --jobs 1 --executor googlebatch

You are minimally required to provide a project and region, and can do this through the environment or command line:

export SNAKEMAKE_GOOGLEBATCH_PROJECT=myproject
export SNAKEMAKE_GOOGLEBATCH_REGION=us-central1
snakemake --jobs 1 --executor googlebatch

snakemake --jobs 1 --executor googlebatch --googlebatch-project myproject --googlebatch-region us-central1
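
If you rely on the environment, a wrapper script can check that the required variables are set before submitting anything. This is a rough sketch under the assumption that you use the two variables shown above; `missing_settings` is a hypothetical helper and it does not account for values passed on the command line instead.

```python
import os

# Assumption: the required settings are read from SNAKEMAKE_GOOGLEBATCH_<NAME>,
# as shown above for PROJECT and REGION.
REQUIRED = ("PROJECT", "REGION")


def missing_settings(environ=None):
    """Return the names of required settings absent from the environment."""
    environ = os.environ if environ is None else environ
    return [
        name
        for name in REQUIRED
        if not environ.get("SNAKEMAKE_GOOGLEBATCH_" + name)
    ]


# Example with an explicit mapping instead of the real environment.
example_env = {"SNAKEMAKE_GOOGLEBATCH_PROJECT": "myproject"}
print(missing_settings(example_env))  # ['REGION']
```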

You can provide one or more custom arguments, as listed in the settings table above, to customize your batch run. Note that batch offers setup snippets to help with more complex setups (e.g., MPI). See batch snippets for more information.

Logging

For an interactive run from the command line, we provide status updates in your local console. For full logs, go to the Google Cloud Batch interface, click on your job of interest, and then open the “Logs” tab. If you don’t see logs, look in the “Events” tab, as there is usually an error with your configuration (e.g., an unknown image or family). The logging API must be enabled for this to work.

Isolated Logs

If you need to retrieve logs for a job outside of this context (e.g., after a run or in a Pythonic test), you can use the script provided in the example directory. Here is how to run it using the local poetry environment. You can either provide --project and --region or export the environment variables for them described above.

#                                       <jobid>
poetry run python example/show-logs.py a-898674

Note that this is currently provided as a helper script because the Google Cloud API limits impose a rate limit of 60 requests per minute:

Number of entries.list requests: 60 per minute, per Google Cloud project

For some perspective, a “hello world” job will produce over 3K lines of logs, and (without a sleep between calls) the rate limit is hit very easily. We are currently assessing strategies to deliver full logs to .snakemake logging files without running into this rate limit. It looks possible to create “sinks” using Pub/Sub, however this would add an extra API dependency (and cost).
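
One simple way to stay under a per-minute quota is to space requests evenly. The sketch below illustrates that idea only; it is not how the helper script is implemented, and `RequestThrottle` is a hypothetical name. The clock and sleep functions are injectable so the example can run instantly with a fake clock.

```python
import time


class RequestThrottle:
    """Space calls at least 60 / max_per_minute seconds apart."""

    def __init__(self, max_per_minute=60, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / max_per_minute
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = None

    def wait(self):
        """Block until the next request is allowed, then reserve a slot."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.interval


# Demonstrate with a fake clock so no real time passes.
fake_now = [0.0]


def fake_sleep(seconds):
    fake_now[0] += seconds


throttle = RequestThrottle(60, clock=lambda: fake_now[0], sleep=fake_sleep)
for _ in range(3):
    throttle.wait()
    # A request to the logging API would go here (omitted).
print(fake_now[0])  # 2.0
```

With the default arguments, calling `wait()` before each request pauses for real, about one second between calls at 60 requests per minute.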

Choosing an Image

You can read about how to choose an image here. Note that the image family and project must match, or your job will not run (but will have an event in the online table that indicates a mismatch). Since this is a changing set we do not validate it, however we suggest that you check before running so that you do not waste time. I am not entirely sure how to choose correctly, because there is some information in the linked documentation, but this listing offers different information:

gcloud compute images list | grep cos

Batch Snippets

Batch, by way of running on virtual machines, can support more complex custom setups or run steps such as MPI. However, these setups are non-trivial, so a custom snippet can be added if you choose. There are two types of snippets:

named, built-in snippets provided by the googlebatch executor plugin here
your custom snippet provided via a script file (not implemented yet)

Depending on its functionality, a named snippet might add custom logic to the setup or to the final runnable step. Examples for providing both are shown below. A snippet is treated as custom if it is a JSON or YAML file that exists. Snippets are added in the order in which you provide them. To provide more than one, pass them as a comma-separated list.

$ snakemake --jobs 1 --executor googlebatch --googlebatch-bucket snakemake-cache-dinosaur --googlebatch-snippets intel-mpi
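
To make the rule above concrete (comma-separated, order preserved, custom if the entry is an existing JSON or YAML file), here is a small sketch. It is not the plugin's code: `classify_snippets` is a hypothetical helper, accepting the `.yml` extension is an assumption, and custom snippets are described above as not implemented yet.

```python
import os


def classify_snippets(spec, exists=os.path.exists):
    """Split a comma-separated snippet spec, preserving order.

    Returns (name, kind) pairs, where kind is "custom" for an existing
    .json/.yaml/.yml file and "named" otherwise.
    """
    result = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        is_file = item.lower().endswith((".json", ".yaml", ".yml")) and exists(item)
        result.append((item, "custom" if is_file else "named"))
    return result


# The exists check is injected here so the example does not touch the disk.
print(classify_snippets("intel-mpi,setup.yaml", exists=lambda p: p == "setup.yaml"))
# [('intel-mpi', 'named'), ('setup.yaml', 'custom')]
```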

Additional Environment Variables

The following environment variables are available within any Google Batch run:

BATCH_TASK_INDEX: The index of the workflow step (which Google Batch calls a “task”)
GOOGLEBATCH_DOCKER_PASSWORD: your docker registry password if using the container operating system (COS) and your container requires credentials
GOOGLEBATCH_DOCKER_USERNAME: the same, but the username

GPU

The Google Batch executor uses the same designation for GPUs as core Snakemake. However, you should keep compatibility of machine type with the GPU that you selected in mind. For example, if you select gpu_nvidia=1 you will need an n1-* family machine type.
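
A mismatch between GPU and machine type can be caught before submission. The sketch below covers only the single case named above (gpu_nvidia with an n1-* machine type); `check_gpu_machine_type` is a hypothetical helper, the plugin may perform different checks, and the fallback machine type is the default from the settings table.

```python
def check_gpu_machine_type(resources):
    """Return a warning string if gpu_nvidia is requested on a non-n1 machine.

    `resources` is a plain dict of rule resources; returns None if no
    problem was detected.
    """
    gpus = resources.get("gpu_nvidia", 0)
    # Fall back to the default machine type from the settings table.
    machine = resources.get("googlebatch_machine_type", "c2-standard-4")
    if gpus and not machine.startswith("n1-"):
        return (
            f"gpu_nvidia={gpus} requested but machine type is "
            f"{machine!r}; expected an n1-* machine type"
        )
    return None


print(check_gpu_machine_type({"gpu_nvidia": 1}))
print(check_gpu_machine_type({"gpu_nvidia": 1, "googlebatch_machine_type": "n1-standard-8"}))
```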

Step Options

The following options are allowed for batch steps. They cover most of the command line arguments.

googlebatch_machine_type

This will define the machine type for a particular step, overriding the default from the command line.

rule hello_world:
    output:
        "...",
    resources:
        googlebatch_machine_type="c3-standard-112"
    shell:
        "..."

Note that for MPI workloads, mpitune configurations are validated on c2 and c2d instances only.

googlebatch_image_family

This will define the image family for a particular step, overriding the default from the command line.

rule hello_world:
    output:
        "...",
    resources:
        googlebatch_image_family="hpc-centos-7"
    shell:
        "..."

Note that the way to get updated names is to run:

gcloud compute images list \
    --project=batch-custom-image \
    --no-standard-images

And see this page for more details.

googlebatch_image_project

This will define the image project for a particular step, overriding the default from the command line.

rule hello_world:
    output:
        "...",
    resources:
        googlebatch_image_project="cloud-hpc-image-public"
    shell:
        "..."

googlebatch_bucket

This will define the bucket for a particular step, overriding the default from the command line.

rule hello_world:
    output:
        "...",
    resources:
        googlebatch_bucket="my-snakemake-batch-bucket"
    shell:
        "..."

googlebatch_mount_path

This will define the mount path for a bucket for a particular step, overriding the default from the command line.

rule hello_world:
    output:
        "...",
    resources:
        googlebatch_mount_path="/mnt/workflow"
    shell:
        "..."

googlebatch_work_tasks

This will define the work tasks for a particular step, overriding the default from the command line.

rule hello_world:
    output:
        "...",
    resources:
        googlebatch_work_tasks=1
    shell:
        "..."

googlebatch_network

The URL of an existing network resource (e.g., projects/{project}/global/networks/{network})

rule hello_world:
    output:
        "...",
    resources:
        googlebatch_network="projects/{project}/global/networks/{network}"
    shell:
        "..."

googlebatch_subnetwork

The URL of an existing subnetwork resource (e.g., projects/{project}/regions/{region}/subnetworks/{subnetwork})

rule hello_world:
    output:
        "...",
    resources:
        googlebatch_subnetwork="projects/{project}/regions/{region}/subnetworks/{subnetwork}"
    shell:
        "..."

googlebatch_service_account

The email of custom compute service account to be used by Batch (e.g., snakemake-sa@projectid.iam.gserviceaccount.com)

rule hello_world:
    output:
        "...",
    resources:
        googlebatch_service_account="snakemake-sa@projectid.iam.gserviceaccount.com"
    shell:
        "..."

googlebatch_cpu_milli

This will define the CPU request for a particular step in milli-CPU units (in Google Batch, 1000 corresponds to one full CPU), overriding the default from the command line.

rule hello_world:
    output:
        "...",
    resources:
        googlebatch_cpu_milli=2000
    shell:
        "..."
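
In the Google Batch API, this value is expressed in milli-CPU units, where 1000 corresponds to one CPU. If you think in cores, a conversion along these lines may help; `cores_to_cpu_milli` is a hypothetical helper, not part of the plugin.

```python
def cores_to_cpu_milli(cores):
    """Convert a (possibly fractional) CPU count to milli-CPU units."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return int(round(cores * 1000))


print(cores_to_cpu_milli(2))    # 2000
print(cores_to_cpu_milli(0.5))  # 500
```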

googlebatch_work_tasks_per_node

This will define the work tasks per node (Google batch calls these tasks) for a particular step, overriding the default from the command line.

rule hello_world:
    output:
        "...",
    resources:
        googlebatch_work_tasks_per_node=2
    shell:
        "..."

googlebatch_memory

This will define the memory for a particular step as an integer in MiB, overriding the default from the command line.

rule hello_world:
    output:
        "...",
    resources:
        googlebatch_memory=2000
    shell:
        "..."

googlebatch_boot_disk_type

This is the boot disk type.

rule hello_world:
    output:
        "...",
    resources:
        googlebatch_boot_disk_type="pd-standard"
    shell:
        "..."

googlebatch_boot_disk_image

This is the boot disk image. If not set, we use the family defined for the job.

rule hello_world:
    output:
        "...",
    resources:
        googlebatch_boot_disk_image="batch-centos"
    shell:
        "..."

googlebatch_boot_disk_gb

The size of the boot disk in GB. This needs to be 30 (default) or larger.

rule hello_world:
    output:
        "...",
    resources:
        googlebatch_boot_disk_gb=40
    shell:
        "..."

googlebatch_retry_count

This will define the number of retries for a step, overriding the default from the command line.

rule hello_world:
    output:
        "...",
    resources:
        googlebatch_retry_count=2
    shell:
        "..."

googlebatch_max_run_duration

This will define the maximum run duration for a step, overriding the default from the command line.

rule hello_world:
    output:
        "...",
    resources:
        googlebatch_max_run_duration="3600s"
    shell:
        "..."
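
The duration is a string of seconds followed by "s", as in the "3600s" example. If you prefer to reason in hours or minutes, a helper along these lines can produce that string; `to_run_duration` is a hypothetical name, and the sketch assumes whole seconds are sufficient.

```python
from datetime import timedelta


def to_run_duration(delta):
    """Format a timedelta as whole seconds with an "s" suffix, e.g. "3600s"."""
    return f"{int(delta.total_seconds())}s"


print(to_run_duration(timedelta(hours=1)))     # 3600s
print(to_run_duration(timedelta(minutes=90)))  # 5400s
```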

googlebatch_labels

This will define the extra labels to add to the Google Batch job.

rule hello_world:
    output:
        "...",
    resources:
        googlebatch_labels="model=c3,stage=test"
    shell:
        "..."
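
The label string uses comma-separated key=value pairs. To validate such a string before passing it to a rule, something like the following sketch could be used; `parse_labels` is a hypothetical helper and does not necessarily match how the plugin parses the value.

```python
def parse_labels(spec):
    """Parse "key=value,key2=value2" into a dict, rejecting malformed pairs."""
    labels = {}
    for pair in spec.split(","):
        pair = pair.strip()
        if not pair:
            continue
        key, sep, value = pair.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"expected key=value, got {pair!r}")
        labels[key.strip()] = value.strip()
    return labels


print(parse_labels("model=c3,stage=test"))  # {'model': 'c3', 'stage': 'test'}
```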

googlebatch_container

A container to use only with image_family set to batch-cos* (see here for how to see VM choices)

rule hello_world:
    output:
        "...",
    resources:
        googlebatch_container="ghcr.io/rse-ops/atacseq:app-latest"
    shell:
        "..."

googlebatch_snippets

One or more named (or file-derived) snippets to add to setup.

rule hello_world:
    output:
        "...",
    resources:
        googlebatch_snippets="mpi,myscript.sh"
    shell:
        "..."