A framework for reproducible data analysis

Readability and automation

With Snakemake, data analysis workflows are defined via an easy to read, adaptable, yet powerful specification language on top of Python. Steps are defined by "rules", which denote how to generate a set of output files from a set of input files (e.g. using a shell command). Wildcards (in curly braces) provide generalization. Dependencies between rules are determined automatically.

rule select_by_country:
    input:
        "data/worldcitiespop.csv"
    output:
        "by-country/{country}.csv"
    shell:
        "xsv search -s Country '{wildcards.country}' "
        "{input} > {output}"
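To make the role of wildcards more concrete, the following plain-Python sketch shows how an output pattern could be matched against a requested file to obtain wildcard values. It is an illustration only, not Snakemake's actual implementation; the helper names `pattern_to_regex` and `match_wildcards` are hypothetical, and the sketch assumes that each wildcard appears once in the pattern and matches any non-empty string.

```python
import re


def pattern_to_regex(pattern):
    """Turn a pattern like 'by-country/{country}.csv' into a compiled regex."""
    # Splitting on a capturing group alternates literal text and wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)          # literal text, matched verbatim
        else:
            regex += f"(?P<{part}>.+)"        # wildcard, matched as a named group
    return re.compile(regex)


def match_wildcards(pattern, target):
    """Return the wildcard values if `target` fits `pattern`, else None."""
    match = pattern_to_regex(pattern).fullmatch(target)
    return match.groupdict() if match else None


print(match_wildcards("by-country/{country}.csv", "by-country/de.csv"))
# → {'country': 'de'}
```

A request for by-country/de.csv therefore fits the rule above with country set to "de", while a file such as plots/hist.svg does not fit and would have to be produced by a different rule.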

Portability

Through integration with the Conda package manager and containers, all software dependencies of each workflow step are automatically deployed upon execution.

rule select_by_country:
    input:
        "data/worldcitiespop.csv"
    output:
        "by-country/{country}.csv"
    conda:
        "envs/xsv.yaml"
    shell:
        "xsv search -s Country '{wildcards.country}' "
        "{input} > {output}"

Scripting integration

Rapidly implement analysis steps via direct script and Jupyter notebook integration supporting Python, R, Julia, Rust, and Bash, without requiring any boilerplate code.

rule select_by_country:
    input:
        "data/worldcitiespop.csv"
    output:
        "by-country/{country}.csv"
    script:
        "scripts/select_by_country.R"
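The rule above points to an R script. As a rough illustration of what such a script might contain, here is a hypothetical Python counterpart (the file name scripts/select_by_country.py is an assumption, not part of the original). It relies on the `snakemake` object that Snakemake makes available inside Python scripts, and it filters by exact equality on the Country column, which is a simplification of the pattern search that `xsv search` performs.

```python
import csv


def select_by_country(in_path, out_path, country):
    """Copy the header and all rows whose 'Country' column equals `country`."""
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row["Country"] == country:
                writer.writerow(row)


# When executed via the `script:` directive, Snakemake injects an object named
# `snakemake`; the guard keeps the file importable outside of a workflow.
if "snakemake" in globals():
    select_by_country(
        snakemake.input[0],
        snakemake.output[0],
        snakemake.wildcards.country,
    )
```

Keeping the logic in a function and the Snakemake-specific part in a small guard makes the step easy to test on its own.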

Modularization

Easily create and employ re-usable tool or library wrappers, split your data analysis into well-separated modules, and compose multi-modal analyses by combining entire workflows from various sources.

rule convert_to_pdf:
    input:
        "{prefix}.svg"
    output:
        "{prefix}.pdf"
    wrapper:
        "0.47.0/utils/cairosvg"

"Turing completeness"

Because Snakemake is a syntactical extension of Python, you can implement arbitrary logic beyond the plain definition of rules. Rules can be generated conditionally, arbitrary Python logic can be used to perform aggregations, and configuration and metadata can be obtained and postprocessed in any required way.

def get_data(wildcards):
    # use arbitrary Python logic to
    # aggregate over the required input files
    return ...

rule plot_histogram:
    input:
        get_data
    output:
        "plots/hist.svg"
    script:
        "scripts/plot-hist.py"
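As one possible way to fill in the elided body of such an input function, the sketch below aggregates over a list of samples. The `config` dictionary, its keys `samples` and `exclude`, and the data/{sample}.txt naming scheme are all assumptions made for illustration; a real workflow would obtain them from its own configuration.

```python
from types import SimpleNamespace

# Hypothetical configuration; in a workflow this would usually come from a config file.
config = {"samples": ["a", "b", "c"], "exclude": ["b"]}


def get_data(wildcards):
    """Return one input file per sample that has not been excluded."""
    return [
        f"data/{sample}.txt"
        for sample in config["samples"]
        if sample not in config["exclude"]
    ]


# Outside of Snakemake, an empty namespace can stand in for the wildcards object.
print(get_data(SimpleNamespace()))
# → ['data/a.txt', 'data/c.txt']
```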

Human Readability

The logic of production workflows can become complex, involving many lookups and dynamic decisions. Snakemake offers semantic helper functions for lookups, branching, and aggregation that avoid the need for plain Python code as shown above and make it possible to express complex logic in a human-readable and self-contained way.

rule plot_histogram:
    input:
        branch(
            lookup(dpath="histogram/somedata", within=config),
            then="data/somedata.txt",
            otherwise="data/someotherdata.txt"
        )
    output:
        "plots/hist.svg"
    script:
        "scripts/plot-hist.py"
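For comparison, the decision expressed by the helpers above can be written out in plain Python roughly as follows. This is an emulation for illustration, not the code behind Snakemake's `lookup` and `branch`; the names `lookup_dpath` and `branch_plain`, the example `config` content, and the choice to return a default for missing keys are assumptions.

```python
def lookup_dpath(dpath, within, default=None):
    """Follow a slash-separated path of keys through nested dictionaries."""
    node = within
    for key in dpath.split("/"):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


def branch_plain(condition, then, otherwise):
    """Pick `then` if the condition is truthy, else `otherwise`."""
    return then if condition else otherwise


# Hypothetical configuration content.
config = {"histogram": {"somedata": True}}

chosen = branch_plain(
    lookup_dpath("histogram/somedata", within=config),
    then="data/somedata.txt",
    otherwise="data/someotherdata.txt",
)
print(chosen)
# → data/somedata.txt
```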

Dynamic workflows

Snakemake supports workflows that are updated dynamically at runtime: so-called checkpoints mark steps after which the workflow is adapted. Further, input can be provided as Python queues, enabling a workflow to continuously receive new input data (e.g. while a certain measurement is being conducted).

rule all:
    input:
        from_queue(all_results, finish_sentinel=...)

checkpoint somestep:
    input:
        "samples/{sample}.txt"
    output:
        "somestep/{sample}.txt"
    shell:
        "somecommand {input} > {output}"
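The queue-based idea can be illustrated with the Python standard library alone. In the sketch below, a producer thread puts file paths on a `queue.Queue` and finally a sentinel object, and a consumer reads until it sees that sentinel. This only demonstrates the general producer/consumer pattern with a finish sentinel; it does not show how Snakemake processes the queue internally, and the names `FINISH`, `producer`, and `consume` as well as the example paths are hypothetical.

```python
import queue
import threading

FINISH = object()  # sentinel signalling that no further items will arrive


def producer(q, paths):
    """Simulate a measurement that reports result files one by one."""
    for path in paths:
        q.put(path)
    q.put(FINISH)


def consume(q):
    """Collect items from the queue until the sentinel is seen."""
    received = []
    while True:
        item = q.get()
        if item is FINISH:
            break
        received.append(item)
    return received


all_results = queue.Queue()
worker = threading.Thread(
    target=producer, args=(all_results, ["somestep/a.txt", "somestep/b.txt"])
)
worker.start()
items = consume(all_results)
worker.join()
print(items)
# → ['somestep/a.txt', 'somestep/b.txt']
```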

Transparency and data provenance

Automatic, interactive, self-contained reports ensure full transparency from results down to the steps, parameters, code, and software used. The reports can also contain embedded results (from images to PDFs and even interactive HTML), enabling comprehensive reporting that combines analysis results with data provenance information.

Scalability

Workflows scale seamlessly from single-core to multicore machines, clusters, or the cloud, without modification of the workflow definition and with automatic avoidance of redundant computations.
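One common way to avoid redundant computations in file-based workflows is to compare modification times of inputs and outputs. The sketch below shows that single criterion in plain Python; it is a deliberately simplified illustration, the function name `needs_rerun` is hypothetical, and Snakemake's actual decision logic is not reproduced here and may take further information into account.

```python
import os
import tempfile


def needs_rerun(inputs, outputs):
    """Return True if an output is missing or older than any input."""
    if not all(os.path.exists(path) for path in outputs):
        return True
    oldest_output = min(os.path.getmtime(path) for path in outputs)
    return any(os.path.getmtime(path) > oldest_output for path in inputs)


# Small demonstration with temporary files and explicit timestamps.
workdir = tempfile.mkdtemp()
infile = os.path.join(workdir, "in.txt")
outfile = os.path.join(workdir, "out.txt")
open(infile, "w").close()
print(needs_rerun([infile], [outfile]))   # output missing
# → True

open(outfile, "w").close()
os.utime(infile, (100, 100))              # input older than output
os.utime(outfile, (200, 200))
print(needs_rerun([infile], [outfile]))
# → False
```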

Configurability

Snakemake is extremely flexible and configurable. Numerous options allow you to adapt its behavior to the needs of the data analysis at hand and the underlying infrastructure. Options can be provided via the command line interface or persisted via system-wide, user-specific, and workflow-specific profiles.

executor: slurm
software-deployment-method:
  - conda
latency-wait: 60
default-storage-provider: fs
shared-fs-usage:
  - persistence
  - software-deployment
  - sources
  - source-cache
local-storage-prefix: /local/work/$USER/snakemake-scratch

Extensibility

Snakemake has a powerful plugin system that allows various functionalities to be extended with alternative implementations. Via stable and well-defined interfaces, plugins can evolve independently of Snakemake, and mutual update requirements are minimized. Currently, execution backends and remote storage support are implemented via plugins. In the future, we will extend this to other areas, such as workflow scheduling, reporting, software deployment, and more.

Community driven development

Snakemake has a vibrant developer community that continuously improves the system. Every year, we conduct a one-week hackathon with 40-50 participants. The 2025 hackathon at CERN was sponsored by the FAIROS-HEP research network. The 2026 hackathon at TU Munich is sponsored by BASF.

Authors and Contributors

Groups, Institutes, Companies, and Organizations

  • University of Duisburg-Essen
  • Science for Life Laboratory
  • On Sabbatical
  • The GLOBE Institute – University of Copenhagen
  • Medizinische Genetik Mainz, Limbach Genetics
  • 上海交通大学
  • CERN
  • University of Utah
  • University of Queensland | Frazer Institute
  • Icahn School of Medicine at Mount Sinai
  • group.one
  • @Nykredit
  • Karolinska Institutet
  • University of Mainz
  • Pairwise
  • Netherlands Institute of Ecology (NIOO-KNAW)
  • Oyat Consulting
  • Solynta
  • Gekkonid Scientific
  • LHCb
  • Data Science @google
  • EMBL
  • University of Colorado School of Medicine
  • Seoul National University
  • Universität Duisburg-Essen
  • @neherlab @nextstrain
  • Princeton University
  • @bihealth
  • @fulcrumgenomics
  • TRON gGmbH Mainz
  • Spotify
  • Clinical Genomics Uppsala / Uppsala University
  • @open-energy-transition
  • @TRON-Bioinformatics
  • Paul Scherrer Institute
  • Swansea Academy of Advanced Computing, Swansea University
  • Helmholtz Centre for Environmental Research
  • AWS
  • University of California, Davis
  • Earth Sciences New Zealand (formerly GNS Science)
  • @novartis
  • BIH
  • NTNU
  • @cid-harvard
  • UZH Zurich
  • Université de Montréal
  • Sentieon
  • Barcelona Supercomputing Center
  • Regeneron Pharmaceuticals
  • DKTK/DKFZ
  • Stanford University
  • @cnio-bu
  • University Medicine Essen
  • Exact Sciences
  • @insilicoconsulting
  • Technical University of Munich
  • @common-workflow-language
  • @seqera.io
  • Center for eResearch, University of Auckland
  • @mimer-ai, @ENCCS, @coderefinery, @RI-SE
  • University of Duisburg Essen
  • Gymrek Lab, UCSD
  • Montreal Clinical Research Institute
  • ImmunoScape
  • Washington University School of Medicine
  • Giesselmann IT
  • DKFZ/Heidelberg University
  • @TileDB-Inc
  • Institute for Health Metrics and Evaluation
  • LPC Caen - IN2P3 - CNRS
  • University of Helsinki
  • Data Science Centre, EMBL
  • Esox Biologics
  • ETHZ
  • IMS Nanofabrication GmbH
  • Proxima
  • Sorbonne Université, Paris
  • @sib-swiss
  • @BiomeSense
  • University College London
  • Max Delbrück center for molecular medicine
  • @Alva-Genomics
  • University of Pennsylvania
  • @GenomicsUA @lyft
  • Genedata AG
  • Mount Sinai Hospital (@marcoralab)
  • @RWTH-EBC
  • Bioinformatics Software Engineer at Novartis Institutes for BioMedical Research (NIBR)
  • BASF AP
  • Vilnius University & CERN
  • @txbiomed
  • Brabant Water
  • UB
  • University of Wisconsin-Madison
  • Georgia Tech
  • Saarland University
  • @Syngenta
  • @nanoporetech
  • @idiap
  • University of Colorado Anschutz School of Medicine
  • Humane and Sustainable Food Lab at Stanford University
  • TU Delft
  • IOB
  • Medical College of Wisconsin
  • University of Pittsburgh / Center for Craniofacial and Dental Genetics
  • Freie Universität Berlin
  • @USF-HII
  • NIAID
  • Caltech
  • www.hzdr.de
  • HHU Düsseldorf
  • Seqoia
  • @arcjet
  • A. C. Camargo Cancer Center
  • WEHI
  • University of Utah Center for High Performance Computing, @chpc-uofu
  • @mpinb
  • University of Basel
  • Duke-NUS
  • @BlueRiverTechnology
  • Gustave Roussy
  • Novartis
  • Sunagawa Lab @ ETH Zürich
  • @FredHutch / @blab / @nextstrain
  • University of Tartu
  • Dartmouth College, @dandi, @Debian, @DataLad, @neurodebian, @PyMVPA, @fail2ban
  • University of Michigan, Ann Arbor
  • China Agricultural University
  • Amutable
  • Amsterdam UMC
  • @tempuslabs
  • University of Glasgow
  • Helmholtz-Zentrum Dresden-Rossendorf e.V.
  • @seqeralabs
  • National Institutes of Health
  • IBIS (Institut de Biologie Intégrative et des Systèmes)
  • Technische Universität Berlin
  • Roche
  • Friedrich Schiller University Jena
  • University of Jyväskylä
  • Swiss Ornithological Institute
  • 4Catalyzer
  • Institute for Molecular Bioscience, University of Queensland
  • SPD
  • @Adobe
  • Westerdijk Fungal Biodiversity Institute
  • Sarepta Therapeutics
  • @manaakiwhenua
  • CNRS
  • The Gladstone Institutes
  • CosmoStat, CEA Paris-Saclay
  • Freelancer for hire. Maybe.
  • Vestas
  • @bio-raum @CVUA-RRW
  • Predictive Neuroscience Lab, University Hospital Essen
  • Oslo University Hospital
  • @JRC-IET, C3
  • CNRS, LIRMM, University of Montpellier
  • FGCZ, ETHZ | UZH
  • Treelogic & University of Alacant
  • Pasqal
  • Stockholm Universitetet
  • Hochschule Darmstadt
  • CORE
  • Harvard, USA
  • @Quantco
  • LPNHE - CNRS - Sorbonne Université
  • Ascend Analytics
  • DTU biosustain
  • VBCF
  • Stony Brook Medicine
  • La Jolla Institute for Allergy and Immunology @LJI-Bioinformatics @IEDB
  • Fred Hutchinson Cancer Research Center; Howard Hughes Medical Institute
  • CSIRO
  • Erlangen Centre for Astroparticle Physics
  • @blab @nextstrain
  • Daylily Informatics
  • Illumina
  • Morgridge Institute for Research
  • INRIA 🇫🇷 Nat. Inst. for DigitSci & Tech