odcambc/dumpling

A Snakemake-based pipeline for analysing deep mutational scanning experiments.

Overview

Latest release: v0.9.0, Last update: 2026-06-18

Share link: https://snakemake.github.io/snakemake-workflow-catalog?wf=odcambc/dumpling

Quality control: linting: failed, formatting: failed

Wrappers: bio/fastqc bio/multiqc bio/samtools/faidx

Deployment

Step 1: Install Snakemake and Snakedeploy

Snakemake and Snakedeploy are best installed via the Conda package manager. It is recommended to install conda via Miniforge. Run

conda create -c conda-forge -c bioconda -c nodefaults --name snakemake snakemake snakedeploy

to install both Snakemake and Snakedeploy in an isolated environment. For all following commands ensure that this environment is activated via

conda activate snakemake

For other installation methods, refer to the Snakemake and Snakedeploy documentation.

Step 2: Deploy workflow

With Snakemake and Snakedeploy installed, the workflow can be deployed as follows. First, create an appropriate project working directory on your system and enter it:

mkdir -p path/to/project-workdir
cd path/to/project-workdir

In all following steps, we will assume that you are inside of that directory. Then run

snakedeploy deploy-workflow https://github.com/odcambc/dumpling . --tag v0.9.0

Snakedeploy will create two folders, workflow and config. The former contains the deployment of the chosen workflow as a Snakemake module, the latter contains configuration files which will be modified in the next step in order to configure the workflow to your needs.

Step 3: Configure workflow

To configure the workflow, adapt config/config.yml to your needs following the instructions below.

Step 4: Run workflow

The deployment method is controlled using the --software-deployment-method (short --sdm) argument.

To run the workflow using apptainer/singularity, use

snakemake --cores all --sdm apptainer

To run the workflow with automatic deployment of all required software via conda/mamba, use

snakemake --cores all --sdm conda

Snakemake will automatically detect the main Snakefile in the workflow subfolder and execute the workflow module that has been defined by the deployment in step 2.

For further options such as cluster and cloud execution, see the docs.

Step 5: Generate report

After finalizing your data analysis, you can automatically generate an interactive visual HTML report for inspection of results together with parameters and code inside of the browser using

snakemake --report report.zip

Configuration

The following section is imported from the workflow’s config/README.md.

Dumpling configuration

A dumpling run is defined by a config YAML plus the files it points to. Copy example.yaml to config/<your_experiment>.yaml, edit, and pass it with --configfile. Every key is validated against ../workflow/schemas/config.schema.yaml; a configuration generator is also available.

The example.yaml should have sensible defaults; the full list of options is given below.

Inputs: paths and files

| Key | Meaning |
| --- | --- |
| experiment | Unique experiment name; names output dirs/files. |
| data_dir | Directory holding all raw reads (flat, not in subfolders). |
| ref_dir | Directory holding reference FASTA. |
| reference | Reference FASTA filename (nucleotide). |
| experiment_file | Experiment CSV (see below). |
| variants_file | Designed-variants CSV (see below). |
| oligo_file | (optional) DIMPLE oligo CSV, used to regenerate variants_file. |
| orf | ORF coordinate range within the reference, e.g. "141-1568". |

Experiment CSV

This file defines the experimental organization of the data.

Columns: sample (unique), condition, replicate, time (or bin for FACS), tile number (for tiled amplicon sequencing), file (filename prefix, i.e. without _R1_001.fastq.gz/_R2_001.fastq.gz). For run_cosmos, exactly two conditions additionally set a phenotype column to 1 and 2 (the two sequential phenotypes cosmos models). Validated against ../workflow/schemas/experiments.schema.yaml.
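The rules above can be sketched as a small check. This is a minimal illustration, not dumpling's own validation: `check_experiment_rows` and `expected_fastqs` are hypothetical helpers, the header spellings are assumed to match the column names listed above, and the sample and condition names in the example CSV are made up.

```python
import csv
import io


def check_experiment_rows(rows, run_cosmos=False):
    """Return a list of problems found in experiment-CSV rows (dicts)."""
    problems = []

    # Every sample name must be unique.
    seen = set()
    for row in rows:
        if row["sample"] in seen:
            problems.append(f"duplicate sample: {row['sample']}")
        seen.add(row["sample"])

    if run_cosmos:
        # Map each phenotype slot to the set of conditions that claim it.
        slots = {}
        for row in rows:
            slot = (row.get("phenotype") or "").strip()
            if slot:
                slots.setdefault(slot, set()).add(row["condition"])
        conditions = set().union(*slots.values()) if slots else set()
        if set(slots) != {"1", "2"} or len(conditions) != 2:
            problems.append(
                "run_cosmos needs exactly two conditions in phenotype slots 1 and 2"
            )
    return problems


def expected_fastqs(prefix):
    """Paired-read filenames implied by a `file` prefix."""
    return (f"{prefix}_R1_001.fastq.gz", f"{prefix}_R2_001.fastq.gz")


# Made-up example data; only the column layout follows the text above.
EXAMPLE_CSV = """sample,condition,replicate,time,file,phenotype
s1,abundance,1,0,1_S1_L001,1
s2,abundance,1,1,2_S2_L001,1
s3,surface,1,0,3_S3_L001,2
s4,surface,1,1,4_S4_L001,2
"""
rows = list(csv.DictReader(io.StringIO(EXAMPLE_CSV)))
problems = check_experiment_rows(rows, run_cosmos=True)
```

Here `problems` is empty because the two conditions occupy slots 1 and 2 and every sample name is distinct. The authoritative rules remain those in the experiments schema.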

Reference FASTA

A nucleotide FASTA covering the expected mapped region, placed in ref_dir (default references/) and named via the reference key. Set the ORF coordinates within this file via orf. The coordinates are 1-based (identical to SnapGene numbering) and are required for correct variant calling.
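To make the coordinate convention concrete, here is a minimal sketch of how an orf string could be interpreted. `parse_orf` and `extract_orf` are hypothetical helpers, not part of dumpling, and the sketch assumes both ends of the range are inclusive, which the text above does not state explicitly.

```python
def parse_orf(orf):
    """Parse an `orf` string such as "141-1568" into 1-based (start, stop)."""
    start_text, stop_text = orf.split("-")
    start, stop = int(start_text), int(stop_text)
    if start < 1 or stop < start:
        raise ValueError(f"invalid ORF range: {orf!r}")
    return start, stop


def extract_orf(reference, orf):
    """Slice the ORF out of a reference sequence.

    Assumes both coordinates are inclusive, so the 1-based range
    [start, stop] becomes the Python slice [start - 1:stop].
    """
    start, stop = parse_orf(orf)
    if stop > len(reference):
        raise ValueError("ORF extends past the end of the reference")
    return reference[start - 1:stop]


start, stop = parse_orf("141-1568")
orf_length = stop - start + 1  # number of nucleotides in the ORF
codons = orf_length // 3       # whole codons, if the range is in frame
```

Under the inclusive assumption, the example range from the table above spans 1428 nucleotides, a whole number of codons, which is a quick sanity check worth running on any orf value before a long pipeline run.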

Designed-variants CSV

dumpling standardizes variant nomenclature and drops variants that aren’t designed (or are likely errors). Columns: count (init 0), pos, mutation_type (S/M/D/I/X), name (e.g. A123T), codon, mutation (subtype incl. indel length, e.g. D_3), length, hgvs.

Generate it from a DIMPLE oligo_file by setting regenerate_variants: true. To skip designed-variant filtering entirely (e.g. random mutagenesis), set noprocess: true.

Note that when filtering is enabled, the counts of the filtered-out variants are still saved to a “rejected” counts file.
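The idea of separating designed from rejected variants can be sketched as follows. This is an illustration only: `split_by_design` is a hypothetical helper, matching purely on the name column is an assumption, and all variant names other than A123T are made up.

```python
def split_by_design(observed_counts, designed_names):
    """Split observed variant counts into designed and rejected tables."""
    # Designed variants start at count 0, mirroring the count column above.
    designed = {name: 0 for name in designed_names}
    rejected = {}
    for name, count in observed_counts.items():
        if name in designed:
            designed[name] += count
        else:
            # Not in the design: keep the count, but in a separate table.
            rejected[name] = rejected.get(name, 0) + count
    return designed, rejected


designed, rejected = split_by_design(
    {"A123T": 40, "G45D": 7, "Q9X": 2},
    designed_names=["A123T", "G45D", "L50P"],
)
```

Designed variants that were never observed stay in the table with a count of 0, while undesigned observations are kept aside rather than discarded.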

What to run

| Key | Default | Effect |
| --- | --- | --- |
| scoring_backend | rosace | Scoring model: rosace, lilace, or rosace_aa. lilace/rosace_aa need parsed variant metadata → incompatible with noprocess: true. |
| lilace_seed | null | Random seed for the lilace backend; null runs nondeterministically. No effect unless scoring_backend: lilace. |
| enrich2 | true | Also run Enrich2 alongside the backend. |
| keep_enrich_h5 | false | Keep Enrich2 .h5 stores (large; otherwise temp()). No effect unless enrich2. |
| deposit_to_mavedb | true | Emit MaveDB-ready CSVs per condition under results/{exp}/deposit/mavedb/. |
| run_cosmos | false | Run cosmos multi-phenotype decomposition. Needs exactly two conditions with phenotype slots 1/2 in the experiment CSV. Slow (a model per position). Tune via the cosmos: block. |
| run_qc | true | Run FastQC + MultiQC. |
| noprocess | false | Skip designed-variant filtering. |
| remove_zeros | false | Drop zero/unobserved-count variants before scoring. |
| regenerate_variants | false | Rebuild variants_file from oligo_file. |
| baseline_condition | | Condition used as the untreated/input baseline (library-quality QC). |
| max_deletion_length | 0 | Max designed in-frame deletion (codons); longer insdel-resolved deletions rejected. 0 disables. |
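The interaction between scoring_backend and noprocess is easy to get wrong, so a small pre-flight check may help. The sketch below is a hypothetical stand-in written for illustration; it is not the workflow's own validation code, and `check_backend_mode` is an invented name.

```python
BACKENDS = {"rosace", "lilace", "rosace_aa"}
# Backends that need parsed variant metadata, per the table above.
NEEDS_VARIANT_METADATA = {"lilace", "rosace_aa"}


def check_backend_mode(config):
    """Return the scoring backend, or raise if the combination is invalid."""
    backend = config.get("scoring_backend", "rosace")
    if backend not in BACKENDS:
        raise ValueError(f"unknown scoring_backend: {backend!r}")
    if backend in NEEDS_VARIANT_METADATA and config.get("noprocess", False):
        raise ValueError(
            f"scoring_backend {backend!r} is incompatible with noprocess: true"
        )
    return backend


backend = check_backend_mode({"scoring_backend": "rosace", "noprocess": True})
```

The default backend, rosace, is accepted together with noprocess: true, whereas lilace or rosace_aa with the same flag raises an error.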

Optional cosmos: block

When run_cosmos: true, tune with include_type / exclude_type / x_gmm_n_components / min_num_variants_per_group.
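As a rough picture of how a partial cosmos block combines with defaults, consider the sketch below. The default values are the ones quoted in the workflow parameters table; `resolve_cosmos_settings` is a hypothetical helper, and the exclude_type default is deliberately left out because its exact labels are not listed in this document.

```python
COSMOS_DEFAULTS = {
    "include_type": ["missense"],
    "x_gmm_n_components": 2,
    "min_num_variants_per_group": 10,
}
# exclude_type is a valid key, but its default labels are not given here.
KNOWN_KEYS = set(COSMOS_DEFAULTS) | {"exclude_type"}


def resolve_cosmos_settings(config):
    """Merge a user-supplied cosmos block over the documented defaults."""
    if not config.get("run_cosmos", False):
        return None  # cosmos settings are unused when run_cosmos is off
    user = config.get("cosmos") or {}
    unknown = set(user) - KNOWN_KEYS
    if unknown:
        raise ValueError(f"unknown cosmos keys: {sorted(unknown)}")
    return {**COSMOS_DEFAULTS, **user}


settings = resolve_cosmos_settings(
    {"run_cosmos": True, "cosmos": {"x_gmm_n_components": 3}}
)
```

Only the overridden key changes; the remaining keys fall back to their defaults.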

Mapping and read processing

| Key | Default | Effect |
| --- | --- | --- |
| aligner | bbmap | bbmap (richer QC) or minimap2 (much faster, slightly less QC). |
| kmers | 15 | BBMap k-mer length. |
| sam | "1.3" | SAM version for BBMap (required for GATK compatibility). |
| min_q | 30 | GATK minimum base quality. |
| min_variant_obs | 3 | Minimum number of observations for GATK AnalyzeSaturationMutagenesis to include a variant. |
| bbtools_compression | pigz | BBTools (de)compression: pigz (parallel), bgzip, or none (gzip; use if pigz absent or BBTools hangs). Legacy bbtools_use_bgzip: true/false still accepted with a warning. |
| adapters | | Adapter FASTA for BBDuk trimming; example.yaml defaults to the bundled BBTools adapters. |
| contaminants | | Contaminant FASTAs for BBDuk removal; example.yaml defaults to PhiX + sequencing artifacts. |
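The legacy bbtools_use_bgzip translation described for bbtools_compression can be pictured as below. This is a sketch under assumptions: `resolve_bbtools_compression` is a hypothetical name, and letting the new key win when both keys are present is a guess, not documented behavior.

```python
import warnings

VALID_COMPRESSION = ("pigz", "bgzip", "none")


def resolve_bbtools_compression(config):
    """Pick the BBTools compression backend, honoring the legacy boolean."""
    if "bbtools_compression" in config:
        # Assumption: the new key takes precedence over the legacy one.
        value = config["bbtools_compression"]
    elif "bbtools_use_bgzip" in config:
        warnings.warn(
            "bbtools_use_bgzip is deprecated; use bbtools_compression",
            DeprecationWarning,
        )
        # true maps to bgzip, false maps to none (plain gzip).
        value = "bgzip" if config["bbtools_use_bgzip"] else "none"
    else:
        value = "pigz"  # documented default
    if value not in VALID_COMPRESSION:
        raise ValueError(f"invalid bbtools_compression: {value!r}")
    return value


choice = resolve_bbtools_compression({})
```

With an empty config the documented default, pigz, is selected; a config containing only the legacy boolean yields bgzip or none and emits a deprecation warning.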

Resources (memory)

Per-tool allocations in MB, one per heavy rule. Each becomes that rule’s resources: mem_mb, which a cluster scheduler turns into sbatch --mem (Java rules set -Xmx a fixed headroom below this value). The defaults are sane; override them only to fit a specific machine or cluster.

| Key | Default (MB) | Rule |
| --- | --- | --- |
| mem_bbduk | 2000 | trim/clean |
| mem_bbmerge | 2000 | merge/correct |
| mem_bbmap | 12000 | bbmap map + index (heaviest) |
| mem_minimap2 | 1000 | minimap2 map + index |
| mem_gatk | 6000 | GATK AnalyzeSaturationMutagenesis |
| mem_process_sample | 2000 | per-sample variant processing |
| mem_fastqc | 1024 | FastQC |
| mem_rosace | 16000 | Rosace scoring |
| mem_rosace_aa | 16000 | rosace-aa scoring |
| mem_lilace | 16000 | Lilace scoring |
| mem_cosmos | 4000 | cosmos (run_cosmos) |

mem (GB) is a legacy single BBTools-heap knob, superseded by the mem_* allocations but retained for out-of-tree use.
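The arithmetic behind these allocations can be sketched as follows. The 2*mem_bbduk + mem_bbmerge request for the trim/clean/correct rule is taken from the workflow parameters table. The size of the Java heap headroom is not stated in this document, so `java_heap_flag` takes it as an explicit argument, and treating it as an absolute number of MB is an assumption. Both function names are hypothetical.

```python
def trim_clean_correct_mem_mb(config):
    """Memory request for the combined trim/clean/correct rule, in MB.

    Two bbduk JVMs and one bbmerge JVM run concurrently, hence
    2 * mem_bbduk + mem_bbmerge.
    """
    mem_bbduk = config.get("mem_bbduk", 2000)
    mem_bbmerge = config.get("mem_bbmerge", 2000)
    return 2 * mem_bbduk + mem_bbmerge


def java_heap_flag(mem_mb, headroom_mb):
    """Build an -Xmx flag that sits `headroom_mb` below the allocation."""
    heap_mb = mem_mb - headroom_mb
    if heap_mb <= 0:
        raise ValueError("headroom must be smaller than the allocation")
    return f"-Xmx{heap_mb}m"


total_mb = trim_clean_correct_mem_mb({})           # defaults from the table
flag = java_heap_flag(12000, headroom_mb=1000)     # headroom value is illustrative
```

With the default values the combined rule requests 6000 MB. Keeping the heap below the scheduler allocation leaves room for the JVM's non-heap memory, which is why raising a mem_* key is usually safer than editing heap flags directly.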

Local tool overrides

For platforms where the conda/renv environments don’t resolve (e.g. ARM Macs), point a rule at a locally-installed tool instead:

samtools_local, rosace_local, lilace_local, rosace_aa_local — all false by default. See the main README “Troubleshooting” section.

Workflow parameters

The following table is automatically parsed from the workflow’s config.schema.y(a)ml file.

| Parameter | Type | Description | Required | Default |
| --- | --- | --- | --- | --- |
| data_dir | string | directory containing reads | yes | |
| experiment | string | sample condition that will be compared during differential expression analysis (e.g. a treatment, a tissue time, a disease) | yes | |
| experiment_file | string | Location of experiment setup definition file in CSV format | yes | |
| samples | array | Array of experiment file names (truncated at lane number) | | |
| ref_dir | string | directory where reference files are located | yes | |
| reference | string | reference file name | yes | |
| variants_file | string | designed variants file name | | None |
| oligo_file | string | oligo list file name | | None |
| orf | string | ORF coordinates in reference file. Format: start-stop | yes | |
| kmers | integer | kmer length for bbduk | | 15 |
| min_q | integer | minimum Q score for bases to be analyzed by GATK ASM | | 30 |
| min_variant_obs | integer | minimum number of variant observations for GATK ASM to include | | 3 |
| sam | string | cigar string version for bbmap - 1.3 uses M for = or X, and 1.4 uses = or X. | | 1.3 |
| samtools_local | boolean | Flag for whether to use local samtools or wrapper version. | | false |
| scoring_backend | string | Scoring backend to run for variant effect inference. rosace (default) produces per-variant scalar scores; lilace is the alternative mean-variance model; rosace_aa (pimentellab/rosace-aa) extends rosace with position + amino-acid substitution effect decomposition. Both lilace and rosace_aa require parsed variant metadata (wildtype/mutation/type columns) and are incompatible with noprocess: true; validate_scoring_backend_mode rejects the combo. | | rosace |
| rosace_local | boolean | Flag for whether to use ROSACE installed on system-wide R or through a conda-controlled R environment. | | false |
| lilace_local | boolean | Flag for whether to use Lilace installed on system-wide R or through a conda-controlled R environment. | | false |
| rosace_aa_local | boolean | Flag for whether to use rosace-aa installed on system-wide R or through a conda-controlled R environment. | | false |
| bbtools_compression | string | BBTools (de)compression backend for fastq IO. pigz (default) parallelizes across each rule’s threads, typically saving 30-40s/sample on 8GB+ compressed inputs vs the single-threaded bgzip path. bgzip uses BBTools’ built-in bgzip; equivalent to the old bbtools_use_bgzip: true knob. none disables bgzip and falls back to gzip; equivalent to the old bbtools_use_bgzip: false knob (kept for systems where pigz isn’t available or BBTools hangs with bgzip). The deprecated bbtools_use_bgzip bool is still accepted with a deprecation warning; it is translated to bgzip/none accordingly. | | pigz |
| aligner | string | Read aligner to use for mapping trim/clean/correct’d reads to the reference. bbmap (default) emits BBTools-format per-position histograms (covstats, ehist, etc.). minimap2 (-ax sr short-read preset) is substantially faster on small DMS references and produces samtools-based QC outputs (samtools stats + flagstat) that MultiQC autodiscovers. See tasks/performance-audit.md Track C. | | bbmap |
| adapters | [‘string’, ‘array’] | adapter file(s) locations for bbduk. Can be single file string or array of files. | | |
| contaminants | [‘string’, ‘array’] | contaminant reference file(s) locations for bbduk. Can be single file string or array of files. | | |
| mem | integer | memory for bbtools (in Gb) | | 16 |
| mem_fastqc | integer | memory for bbtools (in Mb) | | 1024 |
| mem_rosace | integer | memory for Rosace (in Mb) | | 16000 |
| mem_rosace_aa | integer | memory for rosace-aa (in Mb) | | 16000 |
| mem_lilace | integer | memory for Lilace (in Mb) | | 16000 |
| mem_bbduk | integer | Memory allocation for each bbduk step in the trim/clean/correct rule (in Mb). Becomes the rule’s resources mem_mb (the trim_clean_correct rule requests 2*mem_bbduk + mem_bbmerge, for its concurrent JVMs); the bbduk -Xmx heap is derived a headroom below it. Used by cluster schedulers. | | 2000 |
| mem_bbmerge | integer | Memory allocation for the bbmerge step in trim/clean/correct (in Mb). | | 2000 |
| mem_bbmap | integer | Memory allocation for bbmap mapping and index building (in Mb). bbmap is the heaviest mapping step (~8.5 GB RSS on real DMS data); the -Xmx heap is derived a headroom below this. Becomes resources mem_mb for the scheduler. | | 12000 |
| mem_minimap2 | integer | Memory allocation for minimap2 mapping and index building (in Mb). | | 1000 |
| mem_gatk | integer | Memory allocation for the GATK AnalyzeSaturationMutagenesis rule (in Mb). The GATK JVM heap is pinned a headroom below this via --java-options so it doesn’t overrun the cgroup allocation on a cluster. | | 6000 |
| mem_process_sample | integer | Memory allocation for the per-sample variant-processing rule (in Mb). | | 2000 |
| mem_cosmos | integer | Memory allocation for the run_cosmos rule (in Mb). | | 4000 |
| run_cosmos | boolean | Run cosmos (pimentellab/cosmos) as part of the build, like enrich2: an analysis layered on top of the chosen scoring backend, producing the per-position direct/indirect decomposition results/{experiment}/cosmos/{experiment}_cosmos_results.csv (format_cosmos prepares its input). Requires exactly two conditions assigned phenotype slots 1 and 2 in the experiment CSV (cosmos models two sequential phenotypes). Off by default (opt-in). NOTE: cosmos fits a model per position (~tens of seconds each, serial), so enabling it can add hours on a large library. Tune via the cosmos settings block. | | false |
| cosmos | | Optional cosmos run settings (used by run_cosmos). All keys optional with sensible defaults: include_type (default [“missense”]), exclude_type (synonymous/nonsense + indel labels), x_gmm_n_components (default 2), min_num_variants_per_group (default 10). | | |
| lilace_seed | [‘integer’, ‘null’] | Random seed passed to lilace::lilace_fit_model’s Stan sampling. null (default) means a fresh seed is generated at rule-invocation time from Sys.time() salted with the condition name; the chosen seed is always logged so users can grep the rule log to reproduce. Set to a positive integer to make the chain init bit-identical across runs. Rosace’s equivalent seed is hardcoded to 100 upstream (in rosace::MCMCRunStan) and not configurable here; rosace runs are already bit-identical across reruns by construction. There is intentionally no rosace_seed knob: adding one would silently no-op because RunRosace doesn’t accept a seed argument. | | |
| enrich2 | boolean | Flag for whether to run Enrich2 in addition to the chosen scoring backend. | | true |
| keep_enrich_h5 | boolean | Whether to keep Enrich2’s intermediate HDF5 (.h5) store files. dumpling only consumes Enrich2’s exported tsv scores, so by default (false) the stores are declared as Snakemake temp() outputs and removed once scoring completes, reclaiming disk space (Snakemake owns the deletion; the pipeline never calls rm/find itself). Set true to retain them for debugging Enrich2 internals. No effect unless enrich2 is true. | | false |
| deposit_to_mavedb | boolean | Flag for whether to emit a MaveDB-formatted score CSV per experimental condition as part of the default all target. When true (default), results/{experiment}/deposit/mavedb/{condition}_mavedb.csv is produced alongside the regular scoring outputs; the format_mavedb rule’s output is cheap (~seconds, runs after scoring completes) and directly uploadable to MaveDB. Set to false to skip deposit-format generation. | | true |
| remove_zeros | boolean | Flag for whether to remove unobserved and zero-count variants before processing with Enrich. | | false |
| regenerate_variants | boolean | Flag for whether to regenerate the designed variants file from an oligo list. | | false |
| noprocess | boolean | Flag for whether called variants are filtered or not. | | false |
| run_qc | boolean | Flag for whether QC should be performed. | | true |
| baseline_condition | string | Name of condition containing baseline samples. Scores will not be generated for this condition. | | |
| max_deletion_length | integer | Maximum length (in codons) of a designed deletion in the library. Insdels that resolve to an in-frame deletion longer than this are rejected. Set to 0 (or any non-positive value) to disable the length cap. | | 0 |

Linting and formatting

Linting results
Using workflow specific profile workflow/profiles/default for setting default command line arguments.
WARNING [dumpling]: Could not resolve fastq pair for '1_S1_L001': Could not find matching R1 and R2 fastq files for prefix '1_S1_L001' in /tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/example/data. Expected files matching patterns like 1_S1_L001_R1_001.fastq.gz, 1_S1_L001_R1.fq.gz, 1_S1_L001_1.fastq, etc.
WARNING [dumpling]: Could not resolve fastq pair for '2_S2_L001': Could not find matching R1 and R2 fastq files for prefix '2_S2_L001' in /tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/example/data. Expected files matching patterns like 2_S2_L001_R1_001.fastq.gz, 2_S2_L001_R1.fq.gz, 2_S2_L001_1.fastq, etc.
WARNING [dumpling]: Could not resolve fastq pair for '3_S3_L001': Could not find matching R1 and R2 fastq files for prefix '3_S3_L001' in /tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/example/data. Expected files matching patterns like 3_S3_L001_R1_001.fastq.gz, 3_S3_L001_R1.fq.gz, 3_S3_L001_1.fastq, etc.
WARNING [dumpling]: Could not resolve fastq pair for '4_S4_L001': Could not find matching R1 and R2 fastq files for prefix '4_S4_L001' in /tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/example/data. Expected files matching patterns like 4_S4_L001_R1_001.fastq.gz, 4_S4_L001_R1.fq.gz, 4_S4_L001_1.fastq, etc.
WARNING [dumpling]: Could not resolve fastq pair for '1_S1_L001': Could not find matching R1 and R2 fastq files for prefix '1_S1_L001' in /tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/example/data. Expected files matching patterns like 1_S1_L001_R1_001.fastq.gz, 1_S1_L001_R1.fq.gz, 1_S1_L001_1.fastq, etc.
WARNING [dumpling]: Could not resolve fastq pair for '5_S5_L001': Could not find matching R1 and R2 fastq files for prefix '5_S5_L001' in /tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/example/data. Expected files matching patterns like 5_S5_L001_R1_001.fastq.gz, 5_S5_L001_R1.fq.gz, 5_S5_L001_1.fastq, etc.
WARNING [dumpling]: Could not resolve fastq pair for '6_S6_L001': Could not find matching R1 and R2 fastq files for prefix '6_S6_L001' in /tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/example/data. Expected files matching patterns like 6_S6_L001_R1_001.fastq.gz, 6_S6_L001_R1.fq.gz, 6_S6_L001_1.fastq, etc.
WARNING [dumpling]: Could not resolve fastq pair for '7_S7_L001': Could not find matching R1 and R2 fastq files for prefix '7_S7_L001' in /tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/example/data. Expected files matching patterns like 7_S7_L001_R1_001.fastq.gz, 7_S7_L001_R1.fq.gz, 7_S7_L001_1.fastq, etc.
WARNING [dumpling]: Could not resolve fastq pair for '1_S1_L001': Could not find matching R1 and R2 fastq files for prefix '1_S1_L001' in /tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/example/data. Expected files matching patterns like 1_S1_L001_R1_001.fastq.gz, 1_S1_L001_R1.fq.gz, 1_S1_L001_1.fastq, etc.
WARNING [dumpling]: Could not resolve fastq pair for '2_S2_L001': Could not find matching R1 and R2 fastq files for prefix '2_S2_L001' in /tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/example/data. Expected files matching patterns like 2_S2_L001_R1_001.fastq.gz, 2_S2_L001_R1.fq.gz, 2_S2_L001_1.fastq, etc.
WARNING [dumpling]: Could not resolve fastq pair for '2_S2_L001': Could not find matching R1 and R2 fastq files for prefix '2_S2_L001' in /tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/example/data. Expected files matching patterns like 2_S2_L001_R1_001.fastq.gz, 2_S2_L001_R1.fq.gz, 2_S2_L001_1.fastq, etc.
WARNING [dumpling]: Could not resolve fastq pair for '1_S1_L001': Could not find matching R1 and R2 fastq files for prefix '1_S1_L001' in /tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/example/data. Expected files matching patterns like 1_S1_L001_R1_001.fastq.gz, 1_S1_L001_R1.fq.gz, 1_S1_L001_1.fastq, etc.
WARNING [dumpling]: Could not resolve fastq pair for '2_S2_L001': Could not find matching R1 and R2 fastq files for prefix '2_S2_L001' in /tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/example/data. Expected files matching patterns like 2_S2_L001_R1_001.fastq.gz, 2_S2_L001_R1.fq.gz, 2_S2_L001_1.fastq, etc.
WARNING [dumpling]: Could not resolve fastq pair for '3_S3_L001': Could not find matching R1 and R2 fastq files for prefix '3_S3_L001' in /tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/example/data. Expected files matching patterns like 3_S3_L001_R1_001.fastq.gz, 3_S3_L001_R1.fq.gz, 3_S3_L001_1.fastq, etc.
FileNotFoundError in file "/tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/workflow/rules/common.smk", line 215:
Data directory example/data does not exist
  File "/tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/workflow/rules/common.smk", line 447, in <module>
  File "/tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/workflow/rules/common.smk", line 215, in validate_config
Formatting results
[DEBUG] 
[DEBUG] 
[DEBUG] 
[DEBUG] 
[DEBUG] 
[DEBUG] 
[DEBUG] 
[DEBUG] In file "/tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/workflow/rules/enrich.smk":  Formatted content is different from original
[DEBUG] 
[DEBUG] 
[DEBUG] 
[DEBUG] 
[WARNING] In file "/tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/workflow/rules/process.smk":  Keyword "input" at line 22 has comments under a value.
	PEP8 recommends block comments appear before what they describe
(see https://www.python.org/dev/peps/pep-0008/#id30)
[WARNING] In file "/tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/workflow/rules/process.smk":  Keyword "input" at line 50 has comments under a value.
	PEP8 recommends block comments appear before what they describe
(see https://www.python.org/dev/peps/pep-0008/#id30)
[DEBUG] 
[DEBUG] 
[DEBUG] 
[INFO] 1 file(s) would be changed 😬
[INFO] 13 file(s) would be left unchanged 🎉

snakefmt version: 0.11.5