# Snakemake executor plugin: kubernetes


This is the Snakemake executor plugin for Kubernetes, which allows Snakemake workflows to be executed in a distributed fashion on a Kubernetes cluster.

## Installation

Install this plugin with pip or mamba, e.g.:

```bash
pip install snakemake-executor-plugin-kubernetes
```
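
If you prefer mamba, the plugin can typically be installed from the bioconda channel. This sketch assumes the package is published there under the same name, with the conda-forge and bioconda channels configured:

```bash
mamba install -c conda-forge -c bioconda snakemake-executor-plugin-kubernetes
```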

## Usage

To use the plugin, run Snakemake (>=8.0) with the corresponding value for the `--executor` flag:

```bash
snakemake --executor kubernetes ...
```

where `...` stands for any additional arguments you want to use.

### Settings

The executor plugin has the following settings:

| CLI argument | Description | Default | Choices | Required | Type |
|---|---|---|---|---|---|
| `--kubernetes-namespace VALUE` | The namespace to use for submitted jobs. | `'default'` | | | |
| `--kubernetes-cpu-scalar VALUE` | K8s reserves some proportion of available CPUs for its own use. So, where an underlying node may have 8 CPUs, only e.g. 7600 milliCPUs are allocatable to k8s pods (i.e. snakemake jobs). As 8 > 7.6, k8s can't find a node with enough CPU resource to run such jobs. This argument acts as a global scalar on each job's CPU request, so that e.g. a job whose rule definition asks for 8 CPUs will request 7600m CPUs from k8s, allowing it to utilise one entire node. N.B.: the job itself would still see the original value, i.e. the value substituted in `{threads}`. | `0.95` | | | |
| `--kubernetes-service-account-name VALUE` | This argument allows the use of custom service accounts for kubernetes pods. If specified, `serviceAccountName` will be added to the pod specs. This is e.g. needed when using workload identity, which is enforced when using Google Cloud GKE Autopilot. | `None` | | | |
| `--kubernetes-privileged VALUE` | Create privileged containers for jobs. | `False` | | | |
| `--kubernetes-persistent-volumes VALUE` | Mount the given persistent volumes under the given paths in each job container (`<name>:<path>`). | `None` | | | |
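
For example, an invocation combining several of these settings might look like the sketch below; the namespace, service account, and persistent volume names are placeholders, not values the plugin requires:

```bash
# Run up to 10 jobs in parallel as pods in the "pipelines" namespace,
# using a custom service account and mounting a persistent volume.
snakemake --executor kubernetes \
    --jobs 10 \
    --kubernetes-namespace pipelines \
    --kubernetes-cpu-scalar 0.95 \
    --kubernetes-service-account-name snakemake-sa \
    --kubernetes-persistent-volumes shared-data:/mnt/data
```

With `--kubernetes-cpu-scalar 0.95`, a rule that asks for 8 CPUs would request 7600m CPUs from Kubernetes (8 × 0.95 = 7.6 CPUs), as described above.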

## Further details

### GPU Scheduling with the Snakemake Kubernetes Executor on Google Kubernetes Engine

Below are instructions for using the GPU support in the Snakemake Kubernetes executor plugin. This feature allows you to specify the GPU vendor (NVIDIA or AMD) in your Snakemake resource definitions. The plugin then configures the pod with the appropriate node selectors, tolerations, and resource requests so that GPU-accelerated jobs are scheduled on the correct nodes in GKE.


#### Overview

The GPU support in the plugin enables you to:

- Specify the number of GPUs required for a job.
- Indicate the GPU vendor via a new resource key (e.g., `gpu_manufacturer="nvidia"` or `gpu_manufacturer="amd"`).
- Automatically set node selectors and tolerations based on the GPU vendor.

With these changes, your job will be scheduled only on GPU-enabled nodes, and the GKE autoscaler will be able to provision GPU nodes as needed.


#### Prerequisites

- **GKE Environment:** Ensure you have a functioning GKE cluster, with autoscaling enabled if required.
- **GPU-Enabled Node Pool:**
  - GKE automatically labels and taints GPU nodes via the official device plugins:
    - NVIDIA: `nvidia.com/gpu`
    - AMD: `amd.com/gpu`
  - Validate that these taints match your configuration.
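
As a quick check, assuming you have `kubectl` access to the cluster, you can inspect a GPU node's labels and taints (`<node-name>` is a placeholder for one of your GPU nodes):

```bash
# GKE labels GPU nodes with cloud.google.com/gke-accelerator=<gpu-type>
kubectl get nodes --show-labels | grep -i accelerator

# Show the taints on a specific GPU node
kubectl describe node <node-name> | grep -A 3 Taints
```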

#### Resource Definition

```
resources:
    gpu=1,
    gpu_manufacturer="nvidia",  # Allowed values: "nvidia" or "amd"
    machine_type="n1-standard-16",
    scale=0                     # Optional (default=1)
```

- `gpu`: the number of GPUs to request.
  - Note: this currently only works for multiple GPUs within a single node; the current implementation cannot request GPUs across multiple nodes.
- `gpu_manufacturer`: specifies the GPU vendor. Use `"nvidia"` for NVIDIA GPUs or `"amd"` for AMD GPUs.
- `machine_type`: the machine type for the GPU-enabled node. This is NOT the GPU type.
- `scale`: allows conditionally including resource limits (GPU, threads, memory, etc.), with those limits set equal to the resource requests.
  - If `scale=1` (the default), the limits are omitted entirely. This is how the plugin currently operates and allows pods to scale up as needed.
  - If `scale=0`, the resource limits are set explicitly for each requested resource type.
- You can define any of the other Snakemake resource types here as normal.
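
Putting this together, a complete rule using these resources might look like the following sketch; the rule name, files, and command are hypothetical, not part of the plugin:

```
rule train_model:
    input:
        "data/train.csv"            # hypothetical input
    output:
        "results/model.pt"          # hypothetical output
    threads: 8
    resources:
        mem_mb=32000,
        gpu=1,
        gpu_manufacturer="nvidia",
        machine_type="n1-standard-16",
        scale=0                     # set explicit limits equal to the requests
    shell:
        "python train.py --input {input} --output {output} --threads {threads}"
```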

#### Debugging Tips

- Failing to schedule on the GPU node:
  - Inspect your GPU node and validate that your tolerations match the default taints listed above.
- Failing to see a performance boost despite successful scheduling:
  - Verify that you are executing the Snakemake workflow in an environment with the correct GPU-accelerated libraries.
  - The NVIDIA/CUDA Docker container is recommended.
  - You can also run `nvidia-smi` within the Snakemake rule execution to validate and monitor GPU usage (see the sketch after this list).
- Issues with large jobs failing to schedule:
  - Many default Kubernetes cluster configurations include a limit range or another admission controller that requires both resource requests and resource limits for very large jobs. Try setting `scale=0` and ensure the resource requests are within the limits of your cluster configuration.
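
As a minimal sketch of the `nvidia-smi` check mentioned above (the rule name and log file are hypothetical):

```
rule check_gpu:
    output:
        "logs/gpu_check.txt"        # hypothetical log file
    resources:
        gpu=1,
        gpu_manufacturer="nvidia"
    shell:
        # Record the GPUs visible to the container; fails if no GPU is attached
        "nvidia-smi > {output}"
```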