Snakemake executor plugin: googlebatch

Author: Vanessa Sochat

This is the Google Batch external executor plugin for Snakemake. If you are migrating from Google Life Sciences, see this documentation. For the underlying Python SDK, see google-cloud-batch on GitHub.

Installation

Install this plugin with pip or mamba, e.g.:

pip install snakemake-executor-plugin-googlebatch

Usage

In order to use the plugin, run Snakemake (>=8.0) with the corresponding value for the executor flag:

snakemake --executor googlebatch ...

with ... being any additional arguments you want to use.

The executor plugin has the following settings:

Settings

| CLI argument | Description | Default |
|---|---|---|
| --googlebatch-project VALUE | The name of the Google Project. | None |
| --googlebatch-region VALUE | The name of the Google Project region (e.g., us-central1) | None |
| --googlebatch-container VALUE | A custom container for use with Google Batch COS | None |
| --googlebatch-docker-password VALUE | A docker registry password for COS if credentials are required | None |
| --googlebatch-docker-username VALUE | A docker registry username for COS if credentials are required | None |
| --googlebatch-machine-type VALUE | Google Cloud machine type or VM (mpitune on c2 and c2d family) | 'c2-standard-4' |
| --googlebatch-labels VALUE | Comma separated key value pairs to label job (e.g., model=a3,stage=test) | '' |
| --googlebatch-image-family VALUE | Google Cloud image family (defaults to hpc-centos-7) | 'hpc-centos-7' |
| --googlebatch-image-project VALUE | Selected image project (defaults to cloud-hpc-image-public) | 'cloud-hpc-image-public' |
| --googlebatch-work-tasks VALUE | The default number of work tasks (these are NOT MPI ranks) | 1 |
| --googlebatch-cpu-milli VALUE | Milliseconds per cpu-second | 1000 |
| --googlebatch-boot-disk-gb VALUE | Boot disk size (GB) | None |
| --googlebatch-network VALUE | The URL of an existing network resource | None |
| --googlebatch-subnetwork VALUE | The URL of an existing subnetwork resource | None |
| --googlebatch-boot-disk-type VALUE | Boot disk type (e.g., gcloud compute disk-types list) | None |
| --googlebatch-boot-disk-image VALUE | Boot disk image (e.g., batch-debian, batch-centos) | None |
| --googlebatch-work-tasks-per-node VALUE | The default number of work tasks per node (NOT MPI ranks) | 1 |
| --googlebatch-memory VALUE | Memory in MiB | 1000 |
| --googlebatch-mount-path VALUE | Mount path for Google bucket (if defined) | '/mnt/share' |
| --googlebatch-retry-count VALUE | Retry count (defaults to 1) | 1 |
| --googlebatch-max-run-duration VALUE | Maximum run duration, string (e.g., 3600s) | '3600s' |
| --googlebatch-snippets VALUE | One or more snippets to add to the Google Batch task setup | None |

Further details

Setup

You’ll likely want to start by setting up application default credentials. The easiest way to do this is to run:

gcloud auth application-default login

Quick Start

The basic usage is, from a directory with your Snakefile, to ask for googlebatch as the executor.

$ snakemake --jobs 1 --executor googlebatch

You are minimally required to provide a project and region, and can do this through the environment or command line:

export SNAKEMAKE_GOOGLEBATCH_PROJECT=myproject
export SNAKEMAKE_GOOGLEBATCH_REGION=us-central1
snakemake --jobs 1 --executor googlebatch

or

snakemake --jobs 1 --executor googlebatch --googlebatch-project myproject --googlebatch-region us-central1
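
The way the two forms relate can be pictured with a small Python sketch. It assumes, without confirmation from the plugin source, that a command-line value takes precedence over the matching SNAKEMAKE_GOOGLEBATCH_* variable. The names resolve_setting and require_project_and_region are hypothetical and are not plugin functions.

```python
import os


def resolve_setting(name, cli_value=None, environ=None):
    """Return the CLI value if given, else SNAKEMAKE_GOOGLEBATCH_<NAME>.

    Assumption: CLI beats environment. This is an illustration only.
    """
    environ = os.environ if environ is None else environ
    if cli_value is not None:
        return cli_value
    return environ.get(f"SNAKEMAKE_GOOGLEBATCH_{name.upper()}")


def require_project_and_region(cli_args, environ=None):
    """Return (project, region), raising if either one is missing."""
    values = {}
    for name in ("project", "region"):
        value = resolve_setting(name, cli_args.get(name), environ)
        if not value:
            raise ValueError(f"Missing required setting: {name}")
        values[name] = value
    return values["project"], values["region"]
```

Failing early like this, before anything is submitted, is cheaper than finding out about a missing project or region from a failed job.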

You can provide one or more custom arguments, as shown in the table above, to customize your batch run. Note that batch offers setup snippets to help with more complex setups (e.g., MPI). See Batch Snippets for more information.

Logging

For an interactive run from the command line, we provide status updates in your local console. For full logs, go to the Google Cloud Batch interface, click on your job of interest, and open the “Logs” tab. If you don’t see logs, look in the “Events” tab, since a missing log usually means there is an error in your configuration (e.g., an unknown image or family). The logging API must be enabled for this to work.

Isolated Logs

If you need to retrieve logs for a job outside of this context (e.g., after a run or in a Pythonic test), you can use the script provided in the example directory. Here is how to run it using the local poetry environment. You can either provide --project and --region or export the environment variables for them described above.

#                                       <jobid>
poetry run python example/show-logs.py a-898674

Note that this is currently provided as a helper script because the Google Cloud API enforces a rate limit of 60 requests per minute:

Number of entries.list requests: 60 per minute, per Google Cloud project

For some perspective, a “hello world” job will produce over 3K lines of logs, and (without a sleep between calls) the rate limit is hit very easily. We are currently assessing strategies to deliver full logs to .snakemake logging files without hitting this rate limit. It looks possible to create “sinks” using Pub/Sub, but this would add an extra API dependency (and cost).
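
To illustrate the "sleep between calls" idea, here is a minimal Python sketch that spaces out paginated requests so they stay near the 60 per minute quota. It is not the code in example/show-logs.py. The fetch_page callable is a hypothetical stand-in for whatever client call returns one page of log entries plus a token for the next page.

```python
import time


class MinIntervalThrottle:
    """Keep consecutive calls at least 60 / max_per_minute seconds apart."""

    def __init__(self, max_per_minute=60, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_per_minute
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now


def fetch_all_pages(fetch_page, throttle):
    """Collect entries from a paginated source, throttling each request.

    fetch_page(token) is hypothetical: it returns (entries, next_token),
    with next_token set to None on the last page.
    """
    entries, token = [], None
    while True:
        throttle.wait()
        page, token = fetch_page(token)
        entries.extend(page)
        if token is None:
            return entries
```

The clock and sleep functions are injectable so the throttle can be exercised in tests without actually waiting.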

Choosing an Image

You can read about how to choose an image here. Note that the image family and project must match, or your job will not run (and will show an event that indicates a mismatch in the online table). Since the set of images changes over time, we do not validate it, but we suggest that you check before running to avoid wasting time. I am not entirely sure how to choose correctly, because the documentation gives some information but this listing offers different information:

gcloud compute images list | grep cos

Batch Snippets

Because Batch runs on virtual machines, it can support more complex custom setups or run steps such as MPI. However, these setups are non-trivial, so you can optionally add a custom snippet. There are two types of snippets:

  • named, built-in snippets provided by the googlebatch executor plugin here

  • your custom snippet provided via a script file (not implemented yet)

Depending on its functionality, a named snippet might add custom logic to the setup or to the final runnable step. Examples for providing both are shown below. A snippet is treated as custom if it is a json or yaml file that exists. Snippets are added in the order you provide them. To provide more than one, use a comma separated list.

$ snakemake --jobs 1 --executor googlebatch --googlebatch-bucket snakemake-cache-dinosaur --googlebatch-snippets intel-mpi
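
The rules above (comma separated, order preserved, custom if it is an existing json or yaml file) can be sketched in Python as follows. This is an illustration under those stated rules, not the plugin's own parser, and split_snippets is a hypothetical name.

```python
import os


def split_snippets(value, exists=os.path.exists):
    """Split a comma separated snippet string into ordered (kind, name) pairs.

    A name counts as "custom" only if it ends in .json or .yaml and the
    file exists. Everything else is treated as a named, built-in snippet.
    """
    result = []
    for name in (part.strip() for part in value.split(",")):
        if not name:
            continue
        is_file = name.endswith((".json", ".yaml")) and exists(name)
        result.append(("custom" if is_file else "named", name))
    return result
```

For example, split_snippets("intel-mpi") yields a single named entry, assuming no file called intel-mpi.json or intel-mpi.yaml is involved.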

Additional Environment Variables

The following environment variables are available within any Google batch run:

  • BATCH_TASK_INDEX: The index of the workflow step (which Google Batch calls a “task”)

  • GOOGLEBATCH_DOCKER_PASSWORD: your docker registry password if using the container operating system (COS) and your container requires credentials

  • GOOGLEBATCH_DOCKER_USERNAME: the same, but the username
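
As a sketch of how a script running inside a step might use BATCH_TASK_INDEX, the Python below splits a list of work items across tasks. The helper names and the fallback to index 0 when the variable is unset (e.g., for a local test) are assumptions for illustration.

```python
import os


def current_task_index(environ=None):
    """Read BATCH_TASK_INDEX, falling back to 0 when it is not set."""
    environ = os.environ if environ is None else environ
    return int(environ.get("BATCH_TASK_INDEX", "0"))


def items_for_task(items, task_count, environ=None):
    """Return the round-robin share of items for the current task."""
    index = current_task_index(environ)
    return items[index::task_count]
```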

GPU

The Google Batch executor uses the same designation for GPUs as core Snakemake. However, you should keep compatibility of machine type with the GPU that you selected in mind. For example, if you select gpu_nvidia=1 you will need an n1-* family machine type.
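
A quick pre-flight check can catch this mismatch before a job is submitted. The Python sketch below encodes only the single pairing mentioned above (gpu_nvidia with an n1-* machine type). Other GPU and machine family combinations are not covered, and check_gpu_machine_type is a hypothetical helper, not part of the plugin.

```python
DEFAULT_MACHINE_TYPE = "c2-standard-4"  # default listed under Settings


def check_gpu_machine_type(resources):
    """Raise if gpu_nvidia is requested with a machine type outside n1-*."""
    gpus = int(resources.get("gpu_nvidia", 0))
    machine_type = resources.get("googlebatch_machine_type", DEFAULT_MACHINE_TYPE)
    if gpus > 0 and not machine_type.startswith("n1-"):
        raise ValueError(
            f"gpu_nvidia={gpus} expects an n1-* machine type, got {machine_type!r}"
        )
    return machine_type
```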

Step Options

The following options are allowed for batch steps. They cover most of the command-line arguments.

googlebatch_machine_type

This will define the machine type for a particular step, overriding the default from the command line.

rule hello_world:
    output:
        "...",
    resources:
        googlebatch_machine_type="c3-standard-112"
    shell:
        "..."

Note that for MPI workloads, mpitune configurations are validated on c2 and c2d instances only.

googlebatch_image_family

This will define the image family for a particular step, overriding the default from the command line.

rule hello_world:
    output:
        "...",
    resources:
        googlebatch_image_family="hpc-centos-7"
    shell:
        "..."

Note that the way to get updated names is to run:

gcloud compute images list \
    --project=batch-custom-image \
    --no-standard-images

And see this page for more details.

googlebatch_image_project

This will define the image project for a particular step, overriding the default from the command line.

rule hello_world:
    output:
        "...",
    resources:
        googlebatch_image_project="cloud-hpc-image-public"
    shell:
        "..."

googlebatch_bucket

This will define the bucket for a particular step, overriding the default from the command line.

rule hello_world:
    output:
        "...",
    resources:
        googlebatch_bucket="my-snakemake-batch-bucket"
    shell:
        "..."

googlebatch_mount_path

This will define the mount path for a bucket for a particular step, overriding the default from the command line.

rule hello_world:
    output:
        "...",
    resources:
        googlebatch_mount_path="/mnt/workflow"
    shell:
        "..."

googlebatch_work_tasks

This will define the work tasks for a particular step, overriding the default from the command line.

rule hello_world:
    output:
        "...",
    resources:
        googlebatch_work_tasks=1
    shell:
        "..."

googlebatch_network

The URL of an existing network resource (e.g., projects/{project}/global/networks/{network})

rule hello_world:
    output:
        "...",
    resources:
        googlebatch_network="projects/{project}/global/networks/{network}"
    shell:
        "..."

googlebatch_subnetwork

The URL of an existing subnetwork resource (e.g., projects/{project}/regions/{region}/subnetworks/{subnetwork})
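
Both resource URLs follow fixed patterns, so they can be assembled from their parts. A minimal Python sketch follows. The helper names are hypothetical, and only the URL patterns come from the descriptions above.

```python
def network_url(project, network):
    """Build the network resource URL from its parts."""
    return f"projects/{project}/global/networks/{network}"


def subnetwork_url(project, region, subnetwork):
    """Build the subnetwork resource URL from its parts."""
    return f"projects/{project}/regions/{region}/subnetworks/{subnetwork}"
```

The resulting string is what goes into the resource, as in the rule below.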

rule hello_world:
    output:
        "...",
    resources:
        googlebatch_subnetwork="projects/{project}/regions/{region}/subnetworks/{subnetwork}"
    shell:
        "..."

googlebatch_cpu_milli

This will define the milliseconds per cpu-second for a particular step, overriding the default from the command line.
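
Reading the default of 1000 as one full CPU, which matches Google Batch's milliCPU convention as far as we understand it, a CPU count can be converted to this unit with a small helper. The function name is hypothetical.

```python
def cpus_to_milli(cpus):
    """Convert a (possibly fractional) CPU count to milliCPU units."""
    if cpus <= 0:
        raise ValueError("cpus must be positive")
    return int(round(cpus * 1000))
```

Under that reading, the value 2000 in the rule below corresponds to two CPUs.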

rule hello_world:
    output:
        "...",
    resources:
        googlebatch_cpu_milli=2000
    shell:
        "..."

googlebatch_work_tasks_per_node

This will define the work tasks per node (Google Batch calls these tasks) for a particular step, overriding the default from the command line.

rule hello_world:
    output:
        "...",
    resources:
        googlebatch_work_tasks_per_node=2
    shell:
        "..."

googlebatch_memory

This will define the memory for a particular step as an integer in MiB, overriding the default from the command line.

rule hello_world:
    output:
        "...",
    resources:
        googlebatch_memory=2000
    shell:
        "..."

googlebatch_boot_disk_type

This is the boot disk type.

rule hello_world:
    output:
        "...",
    resources:
        googlebatch_boot_disk_type="pd-standard"
    shell:
        "..."

googlebatch_boot_disk_image

This is the boot disk image. If not set, we use the family defined for the job.

rule hello_world:
    output:
        "...",
    resources:
        googlebatch_boot_disk_image="batch-centos"
    shell:
        "..."

googlebatch_boot_disk_gb

The size of the boot disk in GB. This needs to be 30 (default) or larger.
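
If the size is computed rather than typed by hand, the lower bound can be checked up front. A minimal Python sketch follows. validate_boot_disk_gb is a hypothetical helper, and the minimum of 30 is taken from the sentence above.

```python
MIN_BOOT_DISK_GB = 30  # lower bound stated in the description above


def validate_boot_disk_gb(size_gb):
    """Return size_gb unchanged, or raise if it is below the minimum."""
    if size_gb < MIN_BOOT_DISK_GB:
        raise ValueError(
            f"Boot disk must be at least {MIN_BOOT_DISK_GB} GB, got {size_gb}"
        )
    return size_gb
```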

rule hello_world:
    output:
        "...",
    resources:
        googlebatch_boot_disk_gb=40
    shell:
        "..."

googlebatch_retry_count

This will define the retry count for a step, overriding the default from the command line.

rule hello_world:
    output:
        "...",
    resources:
        googlebatch_retry_count=2
    shell:
        "..."

googlebatch_max_run_duration

This will define the max run duration for a step, overriding the default from the command line.
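
The value is a string of whole seconds with an "s" suffix (e.g., 3600s). When a limit is easier to think of in hours or minutes, a helper like the hypothetical to_duration_string below can produce that form from a timedelta.

```python
from datetime import timedelta


def to_duration_string(duration):
    """Format a timedelta as whole seconds with an 's' suffix."""
    seconds = int(duration.total_seconds())
    if seconds <= 0:
        raise ValueError("duration must be positive")
    return f"{seconds}s"
```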

rule hello_world:
    output:
        "...",
    resources:
        googlebatch_max_run_duration="3600s"
    shell:
        "..."

googlebatch_labels

This will define the extra labels to add to the Google Batch job.
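
Labels are given as comma separated key=value pairs. The Python sketch below shows one way such a string could be turned into a dictionary and checked before use. It is not the plugin's parser, and parse_labels is a hypothetical name.

```python
def parse_labels(value):
    """Parse 'key=value,key=value' into a dict, rejecting malformed pairs."""
    labels = {}
    for pair in value.split(","):
        pair = pair.strip()
        if not pair:
            continue
        key, sep, val = pair.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"Expected key=value, got {pair!r}")
        labels[key.strip()] = val.strip()
    return labels
```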

rule hello_world:
    output:
        "...",
    resources:
        googlebatch_labels="model=c3,stage=test"
    shell:
        "..."

googlebatch_container

A container to use only with image_family set to batch-cos* (see here for how to see VM choices)

rule hello_world:
    output:
        "...",
    resources:
        googlebatch_container="ghcr.io/rse-ops/atacseq:app-latest"
    shell:
        "..."

googlebatch_snippets

One or more named (or file-derived) snippets to add to setup.

rule hello_world:
    output:
        "...",
    resources:
        googlebatch_snippets="mpi,myscript.sh"
    shell:
        "..."