Readability and automation

With Snakemake, data analysis workflows are defined via an easy-to-read, adaptable, yet powerful specification language on top of Python. Each step is defined by a "rule", which describes how to generate a set of output files from a set of input files (e.g. using a shell command). Wildcards (in curly braces) generalize a rule so that it applies to many files. Dependencies between rules are determined automatically.

rule select_by_country:
    input:
        "data/worldcitiespop.csv"
    output:
        "by-country/{country}.csv"
    shell:
        "xsv search -s Country '{wildcards.country}' "
        "{input} > {output}"
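
To make the role of wildcards concrete, the following is a minimal sketch of how a requested file name could be matched against an output pattern to infer wildcard values. It is an illustration only, not Snakemake's actual implementation; the helper name `match_wildcards` is hypothetical, and the sketch assumes each wildcard may match any non-empty string and appears only once in the pattern.

```python
import re

def match_wildcards(pattern: str, path: str):
    """Return a dict of wildcard values if `path` fits `pattern`, else None."""
    regex_parts = []
    pos = 0
    for m in re.finditer(r"\{(\w+)\}", pattern):
        # Escape the literal text before the wildcard, then add a named group.
        regex_parts.append(re.escape(pattern[pos:m.start()]))
        regex_parts.append(f"(?P<{m.group(1)}>.+)")
        pos = m.end()
    regex_parts.append(re.escape(pattern[pos:]))
    match = re.fullmatch("".join(regex_parts), path)
    return match.groupdict() if match else None

# Requesting "by-country/de.csv" fixes the value of the `country` wildcard.
print(match_wildcards("by-country/{country}.csv", "by-country/de.csv"))
```

Once the wildcard values are known, they can be substituted into the rule's input and shell sections, which is how one rule can serve every country.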

Portability

Through integration with the Conda package manager and containers, all software dependencies of each workflow step are automatically deployed upon execution.

rule select_by_country:
    input:
        "data/worldcitiespop.csv"
    output:
        "by-country/{country}.csv"
    conda:
        "envs/xsv.yaml"
    shell:
        "xsv search -s Country '{wildcards.country}' "
        "{input} > {output}"

Scripting integration

Rapidly implement analysis steps via direct script and Jupyter notebook integration, supporting Python, R, Julia, Rust, and Bash without requiring any boilerplate code.

rule select_by_country:
    input:
        "data/worldcitiespop.csv"
    output:
        "by-country/{country}.csv"
    script:
        "scripts/select_by_country.R"
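
The rule above points to an R script. As a rough sketch of what a Python counterpart might look like, the example below reads the rule's input, output, and wildcards from the `snakemake` object that is made available to integrated scripts. The function name, the column name `Country`, the sample data, and the stand-in object used for the demo are assumptions for illustration; the exact-match filter is a simplification of the `xsv search` call shown earlier.

```python
import csv
import os
import tempfile
from types import SimpleNamespace

def select_by_country(snakemake) -> None:
    """Copy the rows whose Country column equals the requested wildcard."""
    with open(snakemake.input[0], newline="") as src, \
         open(snakemake.output[0], "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row["Country"] == snakemake.wildcards.country:
                writer.writerow(row)

# Stand-alone demo with a stand-in object; inside a workflow, the real
# `snakemake` object is provided to the script instead.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "worldcitiespop.csv")
    dst = os.path.join(tmp, "de.csv")
    with open(src, "w", newline="") as handle:
        handle.write("Country,City\nde,essen\nfr,paris\n")
    fake = SimpleNamespace(
        input=[src],
        output=[dst],
        wildcards=SimpleNamespace(country="de"),
    )
    select_by_country(fake)
    with open(dst, newline="") as handle:
        result = handle.read()

print(result)
```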

Modularization

Easily create and employ reusable tool or library wrappers, split your data analysis into well-separated modules, and compose multi-modal analyses by combining entire workflows from various sources.

rule convert_to_pdf:
    input:
        "{prefix}.svg"
    output:
        "{prefix}.pdf"
    wrapper:
        "0.47.0/utils/cairosvg"

"Turing completeness"

Because Snakemake is a syntactical extension of Python, you can implement arbitrary logic beyond the plain definition of rules. Rules can be generated conditionally, arbitrary Python logic can be used to perform aggregations, and configuration and metadata can be obtained and postprocessed in any required way.

def get_data(wildcards):
    # use arbitrary Python logic to
    # aggregate over the required input files
    return ...

rule plot_histogram:
    input:
        get_data
    output:
        "plots/hist.svg"
    script:
        "scripts/plot-hist.py"
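
As one possible illustration of such an input function, the sketch below aggregates over a sample sheet and returns every matching input file. The `SAMPLES` mapping and the `data/{cohort}/{sample}.txt` layout are hypothetical; in a real workflow they would typically be derived from the configuration or a metadata table.

```python
from types import SimpleNamespace

# Hypothetical sample sheet; a real workflow might load this from its config.
SAMPLES = {"cohort_a": ["s1", "s2"], "cohort_b": ["s3"]}

def get_data(wildcards):
    """Return every per-sample input file, across all cohorts."""
    return [
        f"data/{cohort}/{sample}.txt"
        for cohort, samples in sorted(SAMPLES.items())
        for sample in samples
    ]

# Stand-in for the wildcards object that would be passed in by the workflow.
print(get_data(SimpleNamespace()))
```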

Human Readability

The logic of production workflows can become complex, involving many lookups and dynamic decisions. Snakemake offers semantic helper functions for lookups, branching, and aggregation that avoid the need for plain Python code as shown above, and allow you to express complex logic in a human-readable and self-contained way.

rule plot_histogram:
    input:
        branch(
            lookup(dpath="histogram/somedata", within=config),
            then="data/somedata.txt",
            otherwise="data/someotherdata.txt"
        )
    output:
        "plots/hist.svg"
    script:
        "scripts/plot-hist.py"
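
For comparison, the plain-Python logic that the helpers above stand in for might look roughly like the sketch below. The functions `lookup_path` and `choose_input` are hypothetical and only approximate the intent of the rule; the real helpers are evaluated by Snakemake itself and may treat missing keys differently.

```python
def lookup_path(path: str, within: dict, default=None):
    """Walk a nested dict along a slash-separated path."""
    node = within
    for key in path.split("/"):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node

def choose_input(config: dict) -> str:
    # Pick the input file depending on whether the config entry is truthy.
    if lookup_path("histogram/somedata", config):
        return "data/somedata.txt"
    return "data/someotherdata.txt"

print(choose_input({"histogram": {"somedata": True}}))
print(choose_input({"histogram": {"somedata": False}}))
```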

Dynamic workflows

Snakemake lets you define workflows that are dynamically updated at runtime. So-called checkpoints mark steps after which the remaining workflow is re-evaluated based on the results produced so far. Further, input can be provided as Python queues, thereby enabling a workflow to continuously receive new input data (e.g. while a certain measurement is conducted).

rule all:
    input:
        from_queue(all_results, finish_sentinel=...)

checkpoint somestep:
    input:
        "samples/{sample}.txt"
    output:
        "somestep/{sample}.txt"
    shell:
        "somecommand {input} > {output}"
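
The queue-based pattern can be pictured with the standard library alone. The sketch below shows a producer that pushes file names into a `queue.Queue` and then a sentinel, and a consumer that reads until the sentinel arrives. It illustrates the general idea only and says nothing about how `from_queue` is implemented; the sentinel object and the file names are hypothetical.

```python
import queue
import threading

FINISH = object()  # hypothetical sentinel marking the end of the stream

def producer(q: queue.Queue) -> None:
    # Simulate results arriving over time, then signal completion.
    for i in range(3):
        q.put(f"results/measurement_{i}.txt")
    q.put(FINISH)

def consume(q: queue.Queue) -> list:
    received = []
    while True:
        item = q.get()  # blocks until the next item is available
        if item is FINISH:
            break
        received.append(item)
    return received

all_results = queue.Queue()
threading.Thread(target=producer, args=(all_results,)).start()
print(consume(all_results))
```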

Transparency and data provenance

Automatic, interactive, self-contained reports ensure full transparency from results down to used steps, parameters, code, and software. The reports can moreover contain embedded results (from images, to PDFs and even interactive HTML) enabling a comprehensive reporting that combines analysis results with data provenance information.

Scalability

Workflows scale seamlessly from single-core to multicore machines, clusters, or the cloud, without modification of the workflow definition and with automatic avoidance of redundant computations.
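
One simple way to picture the avoidance of redundant computations is a timestamp comparison: a step only needs to run if its output is missing or older than one of its inputs. The sketch below implements just that check with a hypothetical helper `needs_rerun`; Snakemake's own criteria are richer than this, so treat it as an illustration of the principle rather than of the actual behavior.

```python
import os
import tempfile

def needs_rerun(inputs, output) -> bool:
    """True if `output` is missing or older than any of its inputs."""
    if not os.path.exists(output):
        return True
    out_mtime = os.path.getmtime(output)
    return any(os.path.getmtime(path) > out_mtime for path in inputs)

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "input.csv")
    dst = os.path.join(tmp, "output.csv")
    open(src, "w").close()
    print(needs_rerun([src], dst))   # True: the output does not exist yet
    open(dst, "w").close()
    os.utime(src, (1_000, 1_000))    # set explicit timestamps for the demo
    os.utime(dst, (2_000, 2_000))
    print(needs_rerun([src], dst))   # False: the output is newer than the input
```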

Configurability

Snakemake is extremely flexible and configurable. Numerous options allow you to adapt its behavior to the needs of the data analysis at hand and the underlying infrastructure. Options can be provided via the command line interface or persisted via system-wide, user-specific, and workflow-specific profiles.

executor: slurm
software-deployment-method:
  - conda
latency-wait: 60
default-storage-provider: fs
shared-fs-usage:
  - persistence
  - software-deployment
  - sources
  - source-cache
local-storage-prefix:
  /local/work/$USER/snakemake-scratch

Extensibility

Snakemake has a powerful plugin system that allows you to extend various functionalities with alternative implementations. Via stable and well-defined interfaces, plugins can evolve independently of Snakemake, and mutual update requirements are minimized. Currently, execution backends and remote storage support are implemented via plugins. In the future, we will extend this to other areas, such as workflow scheduling, reporting, software deployment, and more.

Authors and Contributors

Groups, Institutes, Companies, and Organizations

University of Duisburg-Essen
Science for Life Laboratory
Broad Institute of MIT and Harvard
On Sabbatical
The GLOBE Institute – University of Copenhagen
CERN
CUBI Core Unit Bioinformatics, Berlin Institute of Health
UC Santa Cruz
Icahn School of Medicine at Mount Sinai
group.one
上海交通大学
@AnyBody
Karolinska Institutet
Netherlands Institute of Ecology (NIOO-KNAW)
Pairwise
Solynta
Gekkonid Scientific
LHCb
Oyat Consulting
EMBL
Data Science @google
University of Colorado School of Medicine
Princess Margaret Cancer Centre, University Health Network
Seoul National University
@neherlab @nextstrain
Princeton University
Spotify
Universität Duisburg-Essen
University of Mainz
@fulcrumgenomics
Clinical Genomics Uppsala / Uppsala University
Earth Sciences New Zealand (formerly GNS Science)
Helmholtz Centre for Environmental Research
AWS
University of California, Davis
BIH
@cid-harvard
@novartis
NTNU
Regeneron Pharmaceuticals
Sentieon
UZH Zurich
Washington University School of Medicine
Exact Sciences
@seqera.io
University Medicine Essen
Technical University of Munich
Université de Montréal
@cnio-bu
University of Vienna
Barcelona Supercomputing Center
@ENCCS
ImmunoScape
Gymrek Lab, UCSD
@insilicoconsulting
University of Duisburg Essen
DKTK/DKFZ
@TileDB-Inc
Stanford University
ETH
DKRZ
@BiomeSense
University of Helsinki
Institute for Health Metrics and Evaluation
@bihealth
Data Science Centre, EMBL
Max Delbrück Center for Molecular Medicine
@sib-swiss
Alva Genomics
University of Pennsylvania
University College London
VantAI
IMS Nanofabrication GmbH
Sorbonne Université, Paris
@common-workflow-language
Brabant Water
Esox Biologics
@GenomicsUA @lyft
Genedata AG
Bioinformatics Software Engineer at Novartis Institutes for BioMedical Research (NIBR)
@RWTH-EBC
@txbiomed
@TRON-Bioinformatics
Saarland University
Center for eResearch, University of Auckland
University of Chicago
@Syngenta
A. C. Camargo Cancer Center
WEHI
University of Utah Center for High Performance Computing
Anthropic
@nanoporetech
@idiap
Kahneman-Treisman Center
University of Colorado Anschutz School of Medicine
University of Wisconsin-Madison
Ascend Analytics
@Elembio
Medical College of Wisconsin
University of Pittsburgh / Center for Craniofacial and Dental Genetics
Freie Universität Berlin
@USF-HII
DataDotOrg
NIAID
Caltech
www.hzdr.de
HHU Düsseldorf
Seqoia
Red Hat
Amsterdam UMC
@mpinb
@open-energy-transition
Duke-NUS
@BlueRiverTechnology
Gustave Roussy
TU Delft
German Aerospace Center (DLR)
IOB
Novartis
LPC Caen - IN2P3 - CNRS
Sunagawa Lab @ ETH Zürich
University of Tartu
Dartmouth College, @dandi, @Debian, @DataLad, @neurodebian, @PyMVPA, @fail2ban
University of Michigan, Ann Arbor
WCIP | University of Glasgow
Helmholtz-Zentrum Dresden-Rossendorf e.V.
@seqeralabs
National Institutes of Health
IBIS (Institut de Biologie Intégrative et des Systèmes)
TU Berlin
Roche
Friedrich Schiller University Jena
University of Jyväskylä
Swiss Ornithological Institute
Montreal Clinical Research Institute
4Catalyzer
@tempuslabs
ETHZ
Wellcome Sanger Institute
SPD
@Adobe
Westerdijk Fungal Biodiversity Institute
Sarepta Therapeutics
@manaakiwhenua
Mount Sinai Hospital (@marcoralab)
CNRS
The Gladstone Institutes
Wesleyan University
Norges Miljø- & Biovitenskapelige Universitet (NMBU)
Freelancer for hire. Maybe.
Paul Scherrer Institute
Morgridge Institute for Research
𝐈𝐍𝐑𝐈𝐀🇫🇷 Nat. Inst. for DigitSci & Tech
Predictive Neuroscience Lab, University Hospital Essen
Oslo University Hospital
@JRC-IET, C3
Georgia Tech
FGCZ, ETHZ | UZH
Treelogic & University of Alacant
Pasqal
Stockholm Universitetet
TRON gGmbH Mainz
Hochschule Darmstadt
City, University of London
Harvard, USA
@Quantco
LPNHE - CNRS - Sorbonne Université
Institute for Molecular Bioscience, University of Queensland
@bio-raum @CVUA-RRW
DTU biosustain
VBCF
Stony Brook Medicine
La Jolla Institute for Allergy and Immunology @LJI-Bioinformatics @IEDB
Fred Hutchinson Cancer Research Center; Howard Hughes Medical Institute
CSIRO
Erlangen Centre for Astroparticle Physics
@blab @nextstrain
Daylily Informatics
Vertex Pharmaceuticals