odcambc/dumpling

A Snakemake-based pipeline for analysing deep mutational scanning experiments.

Overview

Latest release: v0.9.0, Last update: 2026-06-18

Share link: https://snakemake.github.io/snakemake-workflow-catalog?wf=odcambc/dumpling

Quality control: linting: failed; formatting: failed

Wrappers: bio/fastqc bio/multiqc bio/samtools/faidx

Deployment

Step 1: Install Snakemake and Snakedeploy

Snakemake and Snakedeploy are best installed via the Conda package manager. It is recommended to install conda via Miniforge. Run

conda create -c conda-forge -c bioconda -c nodefaults --name snakemake snakemake snakedeploy

to install both Snakemake and Snakedeploy in an isolated environment. For all following commands ensure that this environment is activated via

conda activate snakemake

For other installation methods, refer to the Snakemake and Snakedeploy documentation.

Step 2: Deploy workflow

With Snakemake and Snakedeploy installed, the workflow can be deployed as follows. First, create an appropriate project working directory on your system and enter it:

mkdir -p path/to/project-workdir
cd path/to/project-workdir

In all following steps, we will assume that you are inside of that directory. Then run

snakedeploy deploy-workflow https://github.com/odcambc/dumpling . --tag v0.9.0

Snakedeploy will create two folders, workflow and config. The former contains the deployment of the chosen workflow as a Snakemake module, the latter contains configuration files which will be modified in the next step in order to configure the workflow to your needs.

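Step 3: Configure workflow

Configure the workflow by editing the files in the config folder, following the Configuration section below.

Step 4: Run workflow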

The deployment method is controlled using the --software-deployment-method (short --sdm) argument.

To run the workflow using apptainer/singularity, use

snakemake --cores all --sdm apptainer

To run the workflow with automatic deployment of all required software via conda/mamba, use

snakemake --cores all --sdm conda

Snakemake will automatically detect the main Snakefile in the workflow subfolder and execute the workflow module that has been defined by the deployment in step 2.

For further options such as cluster and cloud execution, see the Snakemake documentation.

Step 5: Generate report

After finalizing your data analysis, you can automatically generate an interactive visual HTML report for inspection of results together with parameters and code inside of the browser using

snakemake --report report.zip

Configuration

The following section is imported from the workflow’s config/README.md.

Dumpling configuration

A dumpling run is defined by a config YAML plus the files it points to. Copy example.yaml to config/<your_experiment>.yaml, edit, and pass it with --configfile. Every key is validated against ../workflow/schemas/config.schema.yaml; a configuration generator is also available.

example.yaml ships with sensible defaults; the full list of options follows.

Inputs: paths and files

Key	Meaning
`experiment`	Unique experiment name; names output dirs/files.
`data_dir`	Directory holding all raw reads (flat, not in subfolders).
`ref_dir`	Directory holding reference FASTA.
`reference`	Reference FASTA filename (nucleotide).
`experiment_file`	Experiment CSV (see below).
`variants_file`	Designed-variants CSV (see below).
`oligo_file`	(optional) DIMPLE oligo CSV, used to regenerate `variants_file`.
`orf`	ORF coordinate range within the reference, e.g. `"141-1568"`.

Experiment CSV

This file defines the experimental organization of the data.

Columns: sample (unique), condition, replicate, time (or bin for FACS), tile number (for tiled amplicon sequencing), file (filename prefix, i.e. without _R1_001.fastq.gz/_R2_001.fastq.gz). For run_cosmos, exactly two conditions must additionally set a phenotype column, one to 1 and the other to 2 (the two sequential phenotypes that cosmos models). The file is validated against ../workflow/schemas/experiments.schema.yaml.
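
As a rough illustration of the layout described above, the following sketch reads a small experiment CSV, checks that sample names are unique, and builds the paired-read filenames implied by the file prefix. The example rows, the `tile` column header, and the helper names are assumptions made for illustration; this is not dumpling's own loader, and the schema remains the authoritative check.

```python
import csv
import io
from collections import Counter

# Hypothetical experiment CSV. Column names follow the description above,
# except "tile", which is an assumed header for the tile number.
EXAMPLE_CSV = """sample,condition,replicate,time,tile,file
s1,baseline,1,0,1,1_S1_L001
s2,baseline,2,0,1,2_S2_L001
s3,drug,1,1,1,3_S3_L001
s4,drug,2,1,1,4_S4_L001
"""


def read_experiment(text):
    """Parse the CSV text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def duplicate_samples(rows):
    """Return sample names that appear more than once (should be empty)."""
    counts = Counter(row["sample"] for row in rows)
    return sorted(name for name, n in counts.items() if n > 1)


def expected_fastqs(row):
    """Build the R1/R2 filenames implied by the `file` prefix."""
    prefix = row["file"]
    return (f"{prefix}_R1_001.fastq.gz", f"{prefix}_R2_001.fastq.gz")


rows = read_experiment(EXAMPLE_CSV)
print(duplicate_samples(rows))
print(expected_fastqs(rows[0]))
```

Only the _R1_001.fastq.gz/_R2_001.fastq.gz suffixes named above are handled here; other naming patterns would need their own cases.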

Reference FASTA

A nucleotide FASTA covering the expected mapped region, placed in ref_dir (default references/) and named via the reference key. Set the ORF coordinates within this file via orf. The coordinates are 1-based (identical to SnapGene numbering) and are required for correct variant calling.
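
The coordinate convention can be made concrete with a short sketch that converts an orf string into a Python slice. It assumes both coordinates are inclusive, which matches SnapGene-style ranges but is not stated explicitly above; the function names and the toy sequence are hypothetical.

```python
def parse_orf(orf):
    """Split a "start-stop" string into two 1-based integers."""
    start_text, stop_text = orf.split("-")
    start, stop = int(start_text), int(stop_text)
    if start < 1 or stop < start:
        raise ValueError(f"invalid ORF range: {orf!r}")
    return start, stop


def extract_orf(sequence, orf):
    """Return the ORF, treating both coordinates as inclusive (assumed)."""
    start, stop = parse_orf(orf)
    if stop > len(sequence):
        raise ValueError("ORF extends past the end of the reference")
    # 1-based inclusive [start, stop] maps to the 0-based slice [start-1, stop).
    return sequence[start - 1:stop]


start, stop = parse_orf("141-1568")
print(stop - start + 1)  # ORF length in nucleotides
print(extract_orf("GGATGAAATAGCC", "3-11"))
```

Under this reading, a quick sanity check is that the ORF length is a multiple of three, as it is for the "141-1568" example.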

Designed-variants CSV

dumpling standardizes variant nomenclature and drops variants that aren’t designed (or are likely errors). Columns: count (init 0), pos, mutation_type (S/M/D/I/X), name (e.g. A123T), codon, mutation (subtype incl. indel length, e.g. D_3), length, hgvs.

Generate it from a DIMPLE oligo_file by setting regenerate_variants: true. To skip designed-variant filtering entirely (e.g. random mutagenesis), set noprocess: true.

Note that when filtering is enabled, the counts of the filtered-out variants are still saved in a separate “rejected” counts file.
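
Conceptually, designed-variant filtering partitions the observed counts into designed and rejected sets. The sketch below shows that idea with plain dictionaries; the function name, the variant names, and the counts are made up, and the real pipeline works on the full CSV columns rather than on names alone.

```python
def split_by_design(observed_counts, designed_names, noprocess=False):
    """Partition observed variant counts into accepted and rejected."""
    if noprocess:
        # No designed-variant filtering: keep everything that was observed.
        return dict(observed_counts), {}
    designed = set(designed_names)
    # Designed variants start at 0 and keep that count if never observed.
    accepted = {name: observed_counts.get(name, 0) for name in designed_names}
    rejected = {
        name: count
        for name, count in observed_counts.items()
        if name not in designed
    }
    return accepted, rejected


observed = {"A123T": 42, "A123A": 7, "Q5R": 1}  # hypothetical counts
designed = ["A123T", "A123A", "G10D"]           # hypothetical design
accepted, rejected = split_by_design(observed, designed)
print(accepted)
print(rejected)
```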

What to run

Key	Default	Effect
`scoring_backend`	`rosace`	Scoring model: `rosace`, `lilace`, or `rosace_aa`. `lilace`/`rosace_aa` need parsed variant metadata → incompatible with `noprocess: true`.
`lilace_seed`	`null`	Random seed for the lilace backend; `null` runs nondeterministically. No effect unless `scoring_backend: lilace`.
`enrich2`	`true`	Also run Enrich2 alongside the backend.
`keep_enrich_h5`	`false`	Keep Enrich2 `.h5` stores (large; otherwise `temp()`). No effect unless `enrich2`.
`deposit_to_mavedb`	`true`	Emit MaveDB-ready CSVs per condition under `results/{exp}/deposit/mavedb/`.
`run_cosmos`	`false`	Run cosmos multi-phenotype decomposition. Needs exactly two conditions with `phenotype` slots 1/2 in the experiment CSV. Slow (a model per position). Tune via the `cosmos:` block.
`run_qc`	`true`	Run FastQC + MultiQC.
`noprocess`	`false`	Skip designed-variant filtering.
`remove_zeros`	`false`	Drop zero/unobserved-count variants before scoring.
`regenerate_variants`	`false`	Rebuild `variants_file` from `oligo_file`.
`baseline_condition`	—	Condition used as the untreated/input baseline (library-quality QC).
`max_deletion_length`	`0`	Max designed in-frame deletion (codons); longer insdel-resolved deletions rejected. `0` disables.
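
Some of the constraints in this table interact, so a pre-flight check can be useful before launching a long run. The following is a minimal sketch of such a check on a config dict; `check_run_options` and its `phenotypes` argument are hypothetical helpers, not part of dumpling, whose own validation runs when the workflow is loaded.

```python
def check_run_options(config, phenotypes=()):
    """Return a list of problems found in a config dict (empty if none).

    `phenotypes` holds the phenotype values set in the experiment CSV,
    one per condition that sets the column.
    """
    problems = []
    backend = config.get("scoring_backend", "rosace")
    if backend not in {"rosace", "lilace", "rosace_aa"}:
        problems.append(f"unknown scoring_backend: {backend}")
    if backend in {"lilace", "rosace_aa"} and config.get("noprocess", False):
        problems.append(f"{backend} is incompatible with noprocess: true")
    if config.get("run_cosmos", False) and sorted(phenotypes) != [1, 2]:
        problems.append(
            "run_cosmos needs exactly two conditions with phenotype 1 and 2"
        )
    return problems


print(check_run_options({"scoring_backend": "lilace", "noprocess": True}))
print(check_run_options({"run_cosmos": True}, phenotypes=[2, 1]))
```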

Optional `cosmos:` block

When run_cosmos: true, tune with include_type / exclude_type / x_gmm_n_components / min_num_variants_per_group.

Mapping and read processing

Key	Default	Effect
`aligner`	`bbmap`	`bbmap` (richer QC) or `minimap2` (much faster, slightly less QC).
`kmers`	`15`	BBMap k-mer length.
`sam`	`"1.3"`	SAM version for BBMap (required for GATK compatibility).
`min_q`	`30`	GATK minimum base quality.
`min_variant_obs`	`3`	Minimum number of observations for GATK AnalyzeSaturationMutagenesis to include a variant.
`bbtools_compression`	`pigz`	BBTools (de)compression: `pigz` (parallel), `bgzip`, or `none` (gzip; use if pigz absent or BBTools hangs). Legacy `bbtools_use_bgzip: true/false` still accepted with a warning.
`adapters`	—	Adapter FASTA for BBDuk trimming; `example.yaml` defaults to the bundled BBTools adapters.
`contaminants`	—	Contaminant FASTAs for BBDuk removal; `example.yaml` defaults to PhiX + sequencing artifacts.
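
The handling of the legacy `bbtools_use_bgzip` key can be pictured as a small translation step. This sketch assumes the new key wins when both are present, which the documentation does not specify; the function name is hypothetical.

```python
import warnings

VALID_COMPRESSION = ("pigz", "bgzip", "none")


def resolve_bbtools_compression(config):
    """Pick the compression backend, translating the legacy boolean key."""
    if "bbtools_compression" in config:
        # Assumption: the new key takes precedence over the legacy one.
        value = config["bbtools_compression"]
    elif "bbtools_use_bgzip" in config:
        warnings.warn(
            "bbtools_use_bgzip is deprecated; use bbtools_compression",
            DeprecationWarning,
            stacklevel=2,
        )
        value = "bgzip" if config["bbtools_use_bgzip"] else "none"
    else:
        value = "pigz"  # documented default
    if value not in VALID_COMPRESSION:
        raise ValueError(f"unsupported bbtools_compression: {value!r}")
    return value


print(resolve_bbtools_compression({}))
print(resolve_bbtools_compression({"bbtools_use_bgzip": False}))
```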

Resources (memory)

Per-tool memory allocations in MB, one per heavy rule. Each becomes that rule’s resources: mem_mb, which a cluster scheduler turns into sbatch --mem (Java rules set -Xmx a fixed headroom below this value). The defaults are reasonable; override them only to fit a specific machine or cluster.

Key	Default (MB)	Rule
`mem_bbduk`	2000	trim/clean
`mem_bbmerge`	2000	merge/correct
`mem_bbmap`	12000	bbmap map + index (heaviest)
`mem_minimap2`	1000	minimap2 map + index
`mem_gatk`	6000	GATK AnalyzeSaturationMutagenesis
`mem_process_sample`	2000	per-sample variant processing
`mem_fastqc`	1024	FastQC
`mem_rosace`	16000	Rosace scoring
`mem_rosace_aa`	16000	rosace-aa scoring
`mem_lilace`	16000	Lilace scoring
`mem_cosmos`	4000	cosmos (run_cosmos)

mem (GB) is a legacy single BBTools-heap knob, superseded by the mem_* allocations but retained for out-of-tree use.
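
To make the relationship between `mem_mb` and the Java heap concrete, here is a small sketch of the arithmetic. The headroom value is an assumption (the source only says it is fixed), and both helper names are hypothetical; the combined request for the trim/clean/correct rule follows the 2*mem_bbduk + mem_bbmerge formula given in the parameter description for mem_bbduk.

```python
HEADROOM_MB = 500  # hypothetical value; the source only says "a fixed headroom"


def java_heap_flag(mem_mb, headroom_mb=HEADROOM_MB):
    """Derive a JVM -Xmx flag that sits below the scheduler allocation."""
    if mem_mb <= headroom_mb:
        raise ValueError("allocation must exceed the headroom")
    return f"-Xmx{mem_mb - headroom_mb}m"


def trim_clean_correct_mem(config):
    """Total mem_mb for the trim/clean/correct rule: two bbduk JVMs plus bbmerge."""
    mem_bbduk = config.get("mem_bbduk", 2000)
    mem_bbmerge = config.get("mem_bbmerge", 2000)
    return 2 * mem_bbduk + mem_bbmerge


print(java_heap_flag(12000))       # default mem_bbmap
print(trim_clean_correct_mem({}))  # documented defaults
```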

Local tool overrides

For platforms where the conda/renv environments don’t resolve (e.g. ARM Macs), point a rule at a locally-installed tool instead:

samtools_local, rosace_local, lilace_local, rosace_aa_local — all false by default. See the main README “Troubleshooting” section.

Workflow parameters

The following table is automatically parsed from the workflow’s config.schema.y(a)ml file.

Parameter	Type	Description	Required	Default
data_dir	string	directory containing reads	yes
experiment	string	Unique experiment name, used to name output directories and files	yes
experiment_file	string	Location of experiment setup definition file in CSV format	yes
samples	array	Array of experiment file names (truncated at lane number)
ref_dir	string	directory where reference files are located	yes
reference	string	reference file name	yes
variants_file	string	designed variants file name		None
oligo_file	string	oligo list file name		None
orf	string	ORF coordinates in reference file. Format: start-stop	yes
kmers	integer	kmer length for bbduk		15
min_q	integer	minimum Q score for bases to be analyzed by GATK ASM		30
min_variant_obs	integer	minimum number of variant observations for GATK ASM to include		3
sam	string	cigar string version for bbmap - 1.3 uses M for = or X, and 1.4 uses = or X.		1.3
samtools_local	boolean	Flag for whether to use local samtools or wrapper version.		false
scoring_backend	string	Scoring backend to run for variant effect inference. `rosace` (default) produces per-variant scalar scores; `lilace` is the alternative mean-variance model; `rosace_aa` (pimentellab/rosace-aa) extends rosace with position + amino-acid substitution effect decomposition. Both `lilace` and `rosace_aa` require parsed variant metadata (wildtype/mutation/type columns) and are incompatible with `noprocess: true`; validate_scoring_backend_mode rejects the combo.		rosace
rosace_local	boolean	Flag for whether to use ROSACE installed on system-wide R or through a conda-controlled R environment.		false
lilace_local	boolean	Flag for whether to use Lilace installed on system-wide R or through a conda-controlled R environment.		false
rosace_aa_local	boolean	Flag for whether to use rosace-aa installed on system-wide R or through a conda-controlled R environment.		false
bbtools_compression	string	BBTools (de)compression backend for fastq IO. pigz (default) parallelizes across each rule’s threads, typically saving 30-40s/sample on 8GB+ compressed inputs vs the single-threaded bgzip path. bgzip uses BBTools’ built-in bgzip; equivalent to the old bbtools_use_bgzip: true knob. none disables bgzip and falls back to gzip; equivalent to the old bbtools_use_bgzip: false knob (kept for systems where pigz isn’t available or BBTools hangs with bgzip). The deprecated bbtools_use_bgzip bool is still accepted with a deprecation warning; it is translated to bgzip/none accordingly.		pigz
aligner	string	Read aligner to use for mapping trimmed, cleaned, and corrected reads to the reference. bbmap (default) emits BBTools-format per-position histograms (covstats, ehist, etc.). minimap2 (-ax sr short-read preset) is substantially faster on small DMS references and produces samtools-based QC outputs (samtools stats + flagstat) that MultiQC autodiscovers. See tasks/performance-audit.md Track C.		bbmap
adapters	['string', 'array']	adapter file(s) locations for bbduk. Can be single file string or array of files.
contaminants	['string', 'array']	contaminant reference file(s) locations for bbduk. Can be single file string or array of files.
mem	integer	memory for bbtools (in GB)		16
mem_fastqc	integer	memory for FastQC (in MB)		1024
mem_rosace	integer	memory for Rosace (in MB)		16000
mem_rosace_aa	integer	memory for rosace-aa (in MB)		16000
mem_lilace	integer	memory for Lilace (in MB)		16000
mem_bbduk	integer	Memory allocation for each bbduk step in the trim/clean/correct rule (in MB). Becomes the rule’s resources mem_mb (the trim_clean_correct rule requests 2*mem_bbduk + mem_bbmerge, for its concurrent JVMs); the bbduk -Xmx heap is derived a headroom below it. Used by cluster schedulers.		2000
mem_bbmerge	integer	Memory allocation for the bbmerge step in trim/clean/correct (in MB).		2000
mem_bbmap	integer	Memory allocation for bbmap mapping and index building (in MB). bbmap is the heaviest mapping step (~8.5 GB RSS on real DMS data); the -Xmx heap is derived a headroom below this. Becomes resources mem_mb for the scheduler.		12000
mem_minimap2	integer	Memory allocation for minimap2 mapping and index building (in MB).		1000
mem_gatk	integer	Memory allocation for the GATK AnalyzeSaturationMutagenesis rule (in MB). The GATK JVM heap is pinned a headroom below this via --java-options so it doesn’t overrun the cgroup allocation on a cluster.		6000
mem_process_sample	integer	Memory allocation for the per-sample variant-processing rule (in MB).		2000
mem_cosmos	integer	Memory allocation for the run_cosmos rule (in MB).		4000
run_cosmos	boolean	Run cosmos (pimentellab/cosmos) as part of the build, like enrich2: an analysis layered on top of the chosen scoring backend, producing the per-position direct/indirect decomposition results/{experiment}/cosmos/{experiment}_cosmos_results.csv (format_cosmos prepares its input). Requires exactly two conditions assigned phenotype slots 1 and 2 in the experiment CSV (cosmos models two sequential phenotypes). Off by default (opt-in). Note that cosmos fits a model per position (~tens of seconds each, serial), so enabling it can add hours on a large library. Tune via the cosmos settings block.		false
cosmos		Optional cosmos run settings (used by run_cosmos). All keys optional with sensible defaults: include_type (default ["missense"]), exclude_type (synonymous/nonsense + indel labels), x_gmm_n_components (default 2), min_num_variants_per_group (default 10).
lilace_seed	['integer', 'null']	Random seed passed to lilace::lilace_fit_model’s Stan sampling. null (default) means a fresh seed is generated at rule-invocation time from Sys.time() salted with the condition name; the chosen seed is always logged so users can grep the rule log to reproduce. Set to a positive integer to make the chain init bit-identical across runs. Rosace’s equivalent seed is hardcoded to 100 upstream (in rosace::MCMCRunStan) and not configurable here; rosace runs are already bit-identical across reruns by construction. There is intentionally no rosace_seed knob: adding one would silently no-op because RunRosace doesn’t accept a seed argument.
enrich2	boolean	Flag for whether to run Enrich2 in addition to the chosen scoring backend.		true
keep_enrich_h5	boolean	Whether to keep Enrich2’s intermediate HDF5 (.h5) store files. dumpling only consumes Enrich2’s exported tsv scores, so by default (false) the stores are declared as Snakemake temp() outputs and removed once scoring completes, reclaiming disk space (Snakemake owns the deletion; the pipeline never calls rm/find itself). Set true to retain them for debugging Enrich2 internals. No effect unless enrich2 is true.		false
deposit_to_mavedb	boolean	Flag for whether to emit a MaveDB-formatted score CSV per experimental condition as part of the default all target. When true (default), results/{experiment}/deposit/mavedb/{condition}_mavedb.csv is produced alongside the regular scoring outputs. The format_mavedb rule’s output is cheap (~seconds, runs after scoring completes) and directly uploadable to MaveDB. Set to false to skip deposit-format generation.		true
remove_zeros	boolean	Flag for whether to remove unobserved and zero-count variants before processing with Enrich.		false
regenerate_variants	boolean	Flag for whether to regenerate the designed variants file from an oligo list.		false
noprocess	boolean	Flag for whether called variants are filtered or not.		false
run_qc	boolean	Flag for whether QC should be performed.		true
baseline_condition	string	Name of condition containing baseline samples. Scores will not be generated for this condition.
max_deletion_length	integer	Maximum length (in codons) of a designed deletion in the library. Insdels that resolve to an in-frame deletion longer than this are rejected. Set to 0 (or any non-positive value) to disable the length cap.		0
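
The max_deletion_length rule described in the table can be read as the following check. This is only a sketch of the stated behavior (cap measured in codons, non-positive values disable it); the function name and its nucleotide-based input are assumptions, not dumpling's implementation.

```python
def deletion_allowed(deleted_nt, max_deletion_length=0):
    """Decide whether an in-frame deletion passes the length cap.

    deleted_nt is the number of deleted nucleotides; the cap is in codons.
    """
    if deleted_nt <= 0 or deleted_nt % 3 != 0:
        raise ValueError("expected a positive, in-frame deletion length")
    if max_deletion_length <= 0:
        return True  # cap disabled
    return deleted_nt // 3 <= max_deletion_length


print(deletion_allowed(9, max_deletion_length=3))   # 3 codons, at the cap
print(deletion_allowed(12, max_deletion_length=3))  # 4 codons, over the cap
print(deletion_allowed(300))                        # cap disabled by default
```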

Linting and formatting

Linting results

Using workflow specific profile workflow/profiles/default for setting default command line arguments.
WARNING [dumpling]: Could not resolve fastq pair for '1_S1_L001': Could not find matching R1 and R2 fastq files for prefix '1_S1_L001' in /tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/example/data. Expected files matching patterns like 1_S1_L001_R1_001.fastq.gz, 1_S1_L001_R1.fq.gz, 1_S1_L001_1.fastq, etc.
WARNING [dumpling]: Could not resolve fastq pair for '2_S2_L001': Could not find matching R1 and R2 fastq files for prefix '2_S2_L001' in /tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/example/data. Expected files matching patterns like 2_S2_L001_R1_001.fastq.gz, 2_S2_L001_R1.fq.gz, 2_S2_L001_1.fastq, etc.
WARNING [dumpling]: Could not resolve fastq pair for '3_S3_L001': Could not find matching R1 and R2 fastq files for prefix '3_S3_L001' in /tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/example/data. Expected files matching patterns like 3_S3_L001_R1_001.fastq.gz, 3_S3_L001_R1.fq.gz, 3_S3_L001_1.fastq, etc.
WARNING [dumpling]: Could not resolve fastq pair for '4_S4_L001': Could not find matching R1 and R2 fastq files for prefix '4_S4_L001' in /tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/example/data. Expected files matching patterns like 4_S4_L001_R1_001.fastq.gz, 4_S4_L001_R1.fq.gz, 4_S4_L001_1.fastq, etc.
WARNING [dumpling]: Could not resolve fastq pair for '1_S1_L001': Could not find matching R1 and R2 fastq files for prefix '1_S1_L001' in /tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/example/data. Expected files matching patterns like 1_S1_L001_R1_001.fastq.gz, 1_S1_L001_R1.fq.gz, 1_S1_L001_1.fastq, etc.
WARNING [dumpling]: Could not resolve fastq pair for '5_S5_L001': Could not find matching R1 and R2 fastq files for prefix '5_S5_L001' in /tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/example/data. Expected files matching patterns like 5_S5_L001_R1_001.fastq.gz, 5_S5_L001_R1.fq.gz, 5_S5_L001_1.fastq, etc.
WARNING [dumpling]: Could not resolve fastq pair for '6_S6_L001': Could not find matching R1 and R2 fastq files for prefix '6_S6_L001' in /tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/example/data. Expected files matching patterns like 6_S6_L001_R1_001.fastq.gz, 6_S6_L001_R1.fq.gz, 6_S6_L001_1.fastq, etc.
WARNING [dumpling]: Could not resolve fastq pair for '7_S7_L001': Could not find matching R1 and R2 fastq files for prefix '7_S7_L001' in /tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/example/data. Expected files matching patterns like 7_S7_L001_R1_001.fastq.gz, 7_S7_L001_R1.fq.gz, 7_S7_L001_1.fastq, etc.
WARNING [dumpling]: Could not resolve fastq pair for '1_S1_L001': Could not find matching R1 and R2 fastq files for prefix '1_S1_L001' in /tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/example/data. Expected files matching patterns like 1_S1_L001_R1_001.fastq.gz, 1_S1_L001_R1.fq.gz, 1_S1_L001_1.fastq, etc.
WARNING [dumpling]: Could not resolve fastq pair for '2_S2_L001': Could not find matching R1 and R2 fastq files for prefix '2_S2_L001' in /tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/example/data. Expected files matching patterns like 2_S2_L001_R1_001.fastq.gz, 2_S2_L001_R1.fq.gz, 2_S2_L001_1.fastq, etc.
WARNING [dumpling]: Could not resolve fastq pair for '2_S2_L001': Could not find matching R1 and R2 fastq files for prefix '2_S2_L001' in /tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/example/data. Expected files matching patterns like 2_S2_L001_R1_001.fastq.gz, 2_S2_L001_R1.fq.gz, 2_S2_L001_1.fastq, etc.
WARNING [dumpling]: Could not resolve fastq pair for '1_S1_L001': Could not find matching R1 and R2 fastq files for prefix '1_S1_L001' in /tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/example/data. Expected files matching patterns like 1_S1_L001_R1_001.fastq.gz, 1_S1_L001_R1.fq.gz, 1_S1_L001_1.fastq, etc.
WARNING [dumpling]: Could not resolve fastq pair for '2_S2_L001': Could not find matching R1 and R2 fastq files for prefix '2_S2_L001' in /tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/example/data. Expected files matching patterns like 2_S2_L001_R1_001.fastq.gz, 2_S2_L001_R1.fq.gz, 2_S2_L001_1.fastq, etc.
WARNING [dumpling]: Could not resolve fastq pair for '3_S3_L001': Could not find matching R1 and R2 fastq files for prefix '3_S3_L001' in /tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/example/data. Expected files matching patterns like 3_S3_L001_R1_001.fastq.gz, 3_S3_L001_R1.fq.gz, 3_S3_L001_1.fastq, etc.
FileNotFoundError in file "/tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/workflow/rules/common.smk", line 215:
Data directory example/data does not exist
  File "/tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/workflow/rules/common.smk", line 447, in <module>
  File "/tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/workflow/rules/common.smk", line 215, in validate_config

Formatting results

[DEBUG] In file "/tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/workflow/rules/enrich.smk":  Formatted content is different from original
[WARNING] In file "/tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/workflow/rules/process.smk":  Keyword "input" at line 22 has comments under a value.
	PEP8 recommends block comments appear before what they describe
(see https://www.python.org/dev/peps/pep-0008/#id30)
[WARNING] In file "/tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/workflow/rules/process.smk":  Keyword "input" at line 50 has comments under a value.
	PEP8 recommends block comments appear before what they describe
(see https://www.python.org/dev/peps/pep-0008/#id30)
[INFO] 1 file(s) would be changed 😬
[INFO] 13 file(s) would be left unchanged 🎉

snakefmt version: 0.11.5