odcambc/dumpling
A Snakemake-based pipeline for analysing deep mutational scanning experiments.
Overview
Latest release: v0.9.0, Last update: 2026-06-18
Share link: https://snakemake.github.io/snakemake-workflow-catalog?wf=odcambc/dumpling
Quality control: linting: failed formatting: failed
Wrappers: bio/fastqc bio/multiqc bio/samtools/faidx
Deployment
Step 1: Install Snakemake and Snakedeploy
Snakemake and Snakedeploy are best installed via the Conda package manager. It is recommended to install conda via Miniforge. Run
conda create -c conda-forge -c bioconda -c nodefaults --name snakemake snakemake snakedeploy
to install both Snakemake and Snakedeploy in an isolated environment. For all following commands ensure that this environment is activated via
conda activate snakemake
For other installation methods, refer to the Snakemake and Snakedeploy documentation.
Step 2: Deploy workflow
With Snakemake and Snakedeploy installed, the workflow can be deployed as follows. First, create an appropriate project working directory on your system and enter it:
mkdir -p path/to/project-workdir
cd path/to/project-workdir
In all following steps, we will assume that you are inside of that directory. Then run
snakedeploy deploy-workflow https://github.com/odcambc/dumpling . --tag v0.9.0
Snakedeploy will create two folders, workflow and config. The former contains the deployment of the chosen workflow as a Snakemake module; the latter contains configuration files, which you will modify in the next step to configure the workflow to your needs.
Step 3: Configure workflow
To configure the workflow, adapt config/config.yml to your needs following the instructions below.
Step 4: Run workflow
The deployment method is controlled using the --software-deployment-method (short --sdm) argument.
To run the workflow using apptainer/singularity, use
snakemake --cores all --sdm apptainer
To run the workflow with automatic deployment of all required software via conda/mamba, use
snakemake --cores all --sdm conda
Snakemake will automatically detect the main Snakefile in the workflow subfolder and execute the workflow module that has been defined by the deployment in step 2.
For further options such as cluster and cloud execution, see the docs.
Step 5: Generate report
After finalizing your data analysis, you can automatically generate an interactive visual HTML report for inspection of results together with parameters and code inside of the browser using
snakemake --report report.zip
Configuration
The following section is imported from the workflow’s config/README.md.
Dumpling configuration
A dumpling run is defined by a config YAML plus the files it points to. Copy
example.yaml to config/<your_experiment>.yaml, edit, and pass
it with --configfile. Every key is validated against
../workflow/schemas/config.schema.yaml;
a configuration generator is also available.
example.yaml ships with sensible defaults; the full list of options follows.
Inputs: paths and files
| Key | Meaning |
|---|---|
| experiment | Unique experiment name; names output dirs/files. |
| data_dir | Directory holding all raw reads (flat, not in subfolders). |
| ref_dir | Directory holding reference FASTA. |
| reference | Reference FASTA filename (nucleotide). |
| experiment_file | Experiment CSV (see below). |
| variants_file | Designed-variants CSV (see below). |
| oligo_file | (optional) DIMPLE oligo CSV, used to regenerate |
| orf | ORF coordinate range within the reference, e.g. |
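As a rough illustration of how these keys fit together, the following Python sketch serializes a minimal config to YAML. The key names come from the table above; every value (experiment name, paths, ORF range) is a hypothetical placeholder, and the helper names to_yaml and write_config are invented for this example and are not part of dumpling.

```python
from pathlib import Path

# Hypothetical values for illustration only; replace them with your own.
config = {
    "experiment": "my_experiment",
    "data_dir": "data",
    "ref_dir": "references",
    "reference": "my_reference.fasta",
    "experiment_file": "config/my_experiment.csv",
    "variants_file": "config/my_variants.csv",
    "orf": "1-300",
}


def to_yaml(mapping):
    """Serialize a flat mapping to YAML, double-quoting every value."""
    lines = []
    for key, value in mapping.items():
        escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'{key}: "{escaped}"')
    return "\n".join(lines) + "\n"


def write_config(path, mapping):
    """Write the mapping as YAML to path, creating parent directories."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(to_yaml(mapping))
    return path
```

Calling write_config("config/my_experiment.yaml", config) would produce a file that can then be passed to Snakemake with --configfile. Quoting every value keeps a range such as 1-300 a string, which is the type the schema table gives for orf.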
Experiment CSV
This file defines the experimental organization of the data.
Columns: sample (unique), condition, replicate, time (or bin for FACS),
tile number (for tiled amplicon sequencing), file (filename prefix, i.e. without
_R1_001.fastq.gz/_R2_001.fastq.gz). For run_cosmos, exactly two conditions
must additionally have the phenotype column set to 1 and 2 (the two sequential
phenotypes that cosmos models). Validated against
../workflow/schemas/experiments.schema.yaml.
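The column layout can be sketched in Python as follows. This is only an illustration, not dumpling's own validation: the column name tile is an assumption (the text above only says "tile number"), the condition names and values are hypothetical, and check_rows covers just two of the stated rules (unique samples, and file given as a prefix rather than a full filename).

```python
import csv
import io

# Column names follow the list above; "tile" is an assumed column name.
COLUMNS = ["sample", "condition", "replicate", "time", "tile", "file"]
READ_SUFFIXES = ("_R1_001.fastq.gz", "_R2_001.fastq.gz")

# Hypothetical rows; the file prefixes mimic Illumina-style names.
rows = [
    {"sample": "s1", "condition": "baseline", "replicate": 1, "time": 0, "tile": 1, "file": "1_S1_L001"},
    {"sample": "s2", "condition": "treated", "replicate": 1, "time": 1, "tile": 1, "file": "2_S2_L001"},
]


def check_rows(rows):
    """Return a list of problems found in the rows (empty if none)."""
    problems = []
    seen = set()
    for row in rows:
        if row["sample"] in seen:
            problems.append(f"duplicate sample: {row['sample']}")
        seen.add(row["sample"])
        if row["file"].endswith(READ_SUFFIXES):
            problems.append(f"file should be a prefix, not a filename: {row['file']}")
    return problems


def to_csv(rows):
    """Render the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

The authoritative check remains the experiments schema; a pre-flight check like this only catches the most common mistakes before the workflow starts.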
Reference FASTA
A nucleotide FASTA covering the expected mapped region, placed in ref_dir
(default references/) and named via the reference key. Set the ORF coordinates
within this file via orf; these are 1-based (identical to SnapGene
numbering) and are required for correct variant calling.
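To make the coordinate convention concrete, here is a small Python sketch that converts a start-stop ORF string into a 0-based slice. It assumes the stop coordinate is inclusive, as in SnapGene-style selections; the function names and the multiple-of-three check are illustrative assumptions and not dumpling's implementation.

```python
def parse_orf(orf):
    """Parse 'start-stop' (1-based, stop assumed inclusive) into a 0-based slice."""
    start_text, stop_text = orf.split("-")
    start, stop = int(start_text), int(stop_text)
    if start < 1 or stop < start:
        raise ValueError(f"invalid ORF range: {orf!r}")
    # 1-based inclusive [start, stop] maps to 0-based half-open [start - 1, stop).
    return slice(start - 1, stop)


def extract_orf(reference, orf):
    """Return the ORF subsequence, checking bounds and codon alignment."""
    orf_slice = parse_orf(orf)
    if orf_slice.stop > len(reference):
        raise ValueError("ORF extends past the end of the reference")
    sequence = reference[orf_slice]
    if len(sequence) % 3 != 0:
        raise ValueError("ORF length is not a multiple of 3")
    return sequence
```

For example, with the toy reference GGATGAAATAGCC, the range 3-11 selects ATGAAATAG: the third through eleventh bases, nine nucleotides, three codons.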
Designed-variants CSV
dumpling standardizes variant nomenclature and drops variants that aren’t
designed (or are likely errors). Columns: count (init 0), pos,
mutation_type (S/M/D/I/X), name (e.g. A123T), codon,
mutation (subtype incl. indel length, e.g. D_3), length, hgvs.
Generate it from a DIMPLE oligo_file by setting regenerate_variants: true.
To skip designed-variant filtering entirely (e.g. random mutagenesis), set
noprocess: true.
Note that when filtering is enabled, counts for the filtered-out variants are still saved to a “rejected” counts file.
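The idea of designed-variant filtering can be sketched in a few lines of Python. This is a simplified illustration that assumes variants are keyed by their name (for example A123T); it is not dumpling's actual code, and the function name split_by_design is hypothetical.

```python
def split_by_design(observed_counts, designed_names):
    """Split observed variant counts into designed (kept) and rejected."""
    designed = set(designed_names)
    kept = {name: n for name, n in observed_counts.items() if name in designed}
    rejected = {name: n for name, n in observed_counts.items() if name not in designed}
    return kept, rejected
```

Every observed variant ends up in exactly one of the two dictionaries, so no counts are silently discarded.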
What to run
| Key | Default | Effect |
|---|---|---|
| scoring_backend | rosace | Scoring model: |
| lilace_seed | null | Random seed for the lilace backend; |
| enrich2 | true | Also run Enrich2 alongside the backend. |
| keep_enrich_h5 | false | Keep Enrich2 |
| deposit_to_mavedb | true | Emit MaveDB-ready CSVs per condition under |
| run_cosmos | false | Run cosmos multi-phenotype decomposition. Needs exactly two conditions with |
| run_qc | true | Run FastQC + MultiQC. |
| noprocess | false | Skip designed-variant filtering. |
| remove_zeros | false | Drop zero/unobserved-count variants before scoring. |
| regenerate_variants | false | Rebuild |
| baseline_condition | — | Condition used as the untreated/input baseline (library-quality QC). |
| max_deletion_length | 0 | Max designed in-frame deletion (codons); longer insdel-resolved deletions rejected. |
Optional cosmos: block
When run_cosmos: true, tune with include_type / exclude_type /
x_gmm_n_components / min_num_variants_per_group.
Mapping and read processing
| Key | Default | Effect |
|---|---|---|
| aligner | bbmap | bbmap or minimap2. |
| kmers | 15 | BBMap k-mer length. |
| sam | 1.3 | SAM version for BBMap (required for GATK compatibility). |
| min_q | 30 | GATK minimum base quality. |
| min_variant_obs | 3 | Minimum number of observations for GATK AnalyzeSaturationMutagenesis to include a variant. |
| bbtools_compression | pigz | BBTools (de)compression: |
| adapters | — | Adapter FASTA for BBDuk trimming; |
| contaminants | — | Contaminant FASTAs for BBDuk removal; |
Resources (memory)
Per-tool allocations in MB, one per heavy rule; each becomes that rule’s
resources: mem_mb, which a cluster scheduler turns into sbatch --mem (Java
rules derive -Xmx a fixed headroom below). Defaults are sane — override only
to fit a specific machine/cluster.
| Key | Default (MB) | Rule |
|---|---|---|
| mem_bbduk | 2000 | trim/clean |
| mem_bbmerge | 2000 | merge/correct |
| mem_bbmap | 12000 | bbmap map + index (heaviest) |
| mem_minimap2 | 1000 | minimap2 map + index |
| mem_gatk | 6000 | GATK AnalyzeSaturationMutagenesis |
| mem_process_sample | 2000 | per-sample variant processing |
| mem_fastqc | 1024 | FastQC |
| mem_rosace | 16000 | Rosace scoring |
| mem_rosace_aa | 16000 | rosace-aa scoring |
| mem_lilace | 16000 | Lilace scoring |
| mem_cosmos | 4000 | cosmos (run_cosmos) |
mem (GB) is a legacy single BBTools-heap knob, superseded by the mem_*
allocations but retained for out-of-tree use.
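The relationship between these allocations and the derived values can be sketched in Python. The formula 2*mem_bbduk + mem_bbmerge is taken from the schema description of mem_bbduk; the headroom of 500 MB is a purely hypothetical figure, since the actual headroom is not stated here, and both function names are invented for this example.

```python
# Hypothetical headroom; the real value used by the workflow is not stated here.
JVM_HEADROOM_MB = 500


def trim_clean_correct_mem_mb(mem_bbduk=2000, mem_bbmerge=2000):
    """Memory requested by the trim/clean/correct rule for its concurrent JVMs."""
    return 2 * mem_bbduk + mem_bbmerge


def jvm_heap_flag(mem_mb, headroom_mb=JVM_HEADROOM_MB):
    """Build an -Xmx flag sitting a fixed headroom below the rule's allocation."""
    heap_mb = mem_mb - headroom_mb
    if heap_mb <= 0:
        raise ValueError("allocation is too small for the requested headroom")
    return f"-Xmx{heap_mb}m"
```

With the defaults, the trim/clean/correct rule would request 6000 MB, and under the assumed headroom a 12000 MB bbmap allocation would yield -Xmx11500m. Keeping the heap below the allocation leaves room for JVM overhead outside the heap.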
Local tool overrides
For platforms where the conda/renv environments don’t resolve (e.g. ARM Macs), point a rule at a locally-installed tool instead:
samtools_local, rosace_local, lilace_local, rosace_aa_local — all
false by default. See the main README “Troubleshooting” section.
Workflow parameters
The following table is automatically parsed from the workflow’s config.schema.y(a)ml file.
| Parameter | Type | Description | Required | Default |
|---|---|---|---|---|
| data_dir | string | directory containing reads | yes | |
| experiment | string | sample condition that will be compared during differential expression analysis (e.g. a treatment, a tissue time, a disease) | yes | |
| experiment_file | string | Location of experiment setup definition file in CSV format | yes | |
| samples | array | Array of experiment file names (truncated at lane number) | | |
| ref_dir | string | directory where reference files are located | yes | |
| reference | string | reference file name | yes | |
| variants_file | string | designed variants file name | | None |
| oligo_file | string | oligo list file name | | None |
| orf | string | ORF coordinates in reference file. Format: start-stop | yes | |
| kmers | integer | kmer length for bbduk | | 15 |
| min_q | integer | minimum Q score for bases to be analyzed by GATK ASM | | 30 |
| min_variant_obs | integer | minimum number of variant observations for GATK ASM to include | | 3 |
| sam | string | cigar string version for bbmap - 1.3 uses M for = or X, and 1.4 uses = or X. | | 1.3 |
| samtools_local | boolean | Flag for whether to use local samtools or wrapper version. | | false |
| scoring_backend | string | Scoring backend to run for variant effect inference. [...] produces per-variant scalar scores; [...] mean-variance model; [...] with position + amino-acid substitution effect decomposition. Both [...] (wildtype/mutation/type columns) and are incompatible with [...] | | rosace |
| rosace_local | boolean | Flag for whether to use ROSACE installed on system-wide R or through a conda-controlled R environment. | | false |
| lilace_local | boolean | Flag for whether to use Lilace installed on system-wide R or through a conda-controlled R environment. | | false |
| rosace_aa_local | boolean | Flag for whether to use rosace-aa installed on system-wide R or through a conda-controlled R environment. | | false |
| bbtools_compression | string | BBTools (de)compression backend for fastq IO.
pigz (default) parallelizes across each rule’s threads, typically saving
30-40s/sample on 8GB+ compressed inputs vs the single-threaded bgzip path.
bgzip uses BBTools’ built-in bgzip; equivalent to the old
bbtools_use_bgzip: true knob.
none disables bgzip and falls back to gzip; equivalent to the old
bbtools_use_bgzip: false knob (kept for systems where pigz isn’t
available or BBTools hangs with bgzip).
The deprecated bbtools_use_bgzip bool is still accepted with a
deprecation warning; it is translated to bgzip/none accordingly.
| | pigz |
| aligner | string | Read aligner to use for mapping trim/clean/correct’d reads to the reference.
bbmap (default) emits BBTools-format per-position histograms (covstats, ehist, etc.).
minimap2 (-ax sr short-read preset) is substantially faster on small DMS
references and produces samtools-based QC outputs (samtools stats + flagstat)
that MultiQC autodiscovers. See tasks/performance-audit.md Track C.
| | bbmap |
| adapters | [‘string’, ‘array’] | adapter file(s) locations for bbduk. Can be single file string or array of files. | | |
| contaminants | [‘string’, ‘array’] | contaminant reference file(s) locations for bbduk. Can be single file string or array of files. | | |
| mem | integer | memory for bbtools (in Gb) | | 16 |
| mem_fastqc | integer | memory for bbtools (in Mb) | | 1024 |
| mem_rosace | integer | memory for Rosace (in Mb) | | 16000 |
| mem_rosace_aa | integer | memory for rosace-aa (in Mb) | | 16000 |
| mem_lilace | integer | memory for Lilace (in Mb) | | 16000 |
| mem_bbduk | integer | Memory allocation for each bbduk step in the trim/clean/correct rule (in
Mb). Becomes the rule’s resources mem_mb (the trim_clean_correct rule
requests 2*mem_bbduk + mem_bbmerge, for its concurrent JVMs); the bbduk
-Xmx heap is derived a headroom below it. Used by cluster schedulers.
| | 2000 |
| mem_bbmerge | integer | Memory allocation for the bbmerge step in trim/clean/correct (in Mb). | | 2000 |
| mem_bbmap | integer | Memory allocation for bbmap mapping and index building (in Mb). bbmap is
the heaviest mapping step (~8.5 GB RSS on real DMS data); the -Xmx heap is
derived a headroom below this. Becomes resources mem_mb for the scheduler.
| | 12000 |
| mem_minimap2 | integer | Memory allocation for minimap2 mapping and index building (in Mb). | | 1000 |
| mem_gatk | integer | Memory allocation for the GATK AnalyzeSaturationMutagenesis rule (in Mb).
The GATK JVM heap is pinned a headroom below this via --java-options so it
doesn’t overrun the cgroup allocation on a cluster.
| | 6000 |
| mem_process_sample | integer | Memory allocation for the per-sample variant-processing rule (in Mb). | | 2000 |
| mem_cosmos | integer | Memory allocation for the run_cosmos rule (in Mb). | | 4000 |
| run_cosmos | boolean | Run cosmos (pimentellab/cosmos) as part of the build, like enrich2:
an analysis layered on top of the chosen scoring backend, producing the
per-position direct/indirect decomposition
results/{experiment}/cosmos/{experiment}_cosmos_results.csv
(format_cosmos prepares its input). Requires exactly two conditions
assigned phenotype slots 1 and 2 in the experiment CSV (cosmos models
two sequential phenotypes). Off by default (opt-in) — NOTE cosmos fits a
model per position (~tens of seconds each, serial), so enabling it can add
hours on a large library. Tune via the cosmos settings block.
| | false |
| cosmos | | Optional cosmos run settings (used by run_cosmos). All keys optional with
sensible defaults: include_type
(default [“missense”]), exclude_type (synonymous/nonsense + indel labels),
x_gmm_n_components (default 2), min_num_variants_per_group (default 10).
| | |
| lilace_seed | [‘integer’, ‘null’] | Random seed passed to lilace::lilace_fit_model’s Stan sampling.
null (default) means a fresh seed is generated at rule-invocation
time from Sys.time() salted with the condition name; the chosen
seed is always logged so users can grep the rule log to reproduce.
Set to a positive integer to make the chain init bit-identical
across runs.
Rosace’s equivalent seed is hardcoded to 100 upstream (in
rosace::MCMCRunStan) and not configurable here; rosace runs are
already bit-identical across reruns by construction. There is
intentionally no rosace_seed knob — adding one would silently
no-op because RunRosace doesn’t accept a seed argument.
| | |
| enrich2 | boolean | Flag for whether to run Enrich2 in addition to the chosen scoring backend. | | true |
| keep_enrich_h5 | boolean | Whether to keep Enrich2’s intermediate HDF5 (.h5) store files. dumpling
only consumes Enrich2’s exported tsv scores, so by default (false) the
stores are declared as Snakemake temp() outputs and removed once scoring
completes, reclaiming disk space (Snakemake owns the deletion — the
pipeline never calls rm/find itself). Set true to retain them for
debugging Enrich2 internals. No effect unless enrich2 is true.
| | false |
| deposit_to_mavedb | boolean | Flag for whether to emit a MaveDB-formatted score CSV per experimental
condition as part of the default all target. When true (default),
results/{experiment}/deposit/mavedb/{condition}_mavedb.csv is
produced alongside the regular scoring outputs — the format_mavedb
rule’s output is cheap (~seconds, runs after scoring completes) and
directly uploadable to MaveDB. Set to false to skip deposit-format
generation.
| | true |
| remove_zeros | boolean | Flag for whether to remove unobserved and zero-count variants before processing with Enrich. | | false |
| regenerate_variants | boolean | Flag for whether to regenerate the designed variants file from an oligo list. | | false |
| noprocess | boolean | Flag for whether called variants are filtered or not. | | false |
| run_qc | boolean | Flag for whether QC should be performed. | | true |
| baseline_condition | string | Name of condition containing baseline samples. Scores will not be generated for this condition. | | |
| max_deletion_length | integer | Maximum length (in codons) of a designed deletion in the library. Insdels that resolve to an in-frame deletion longer than this are rejected. Set to 0 (or any non-positive value) to disable the length cap. | | 0 |
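The max_deletion_length rule described in the last row can be illustrated with a short Python sketch. It follows the stated behaviour (deletions longer than the cap are rejected; 0 or any non-positive value disables the cap), but the function name deletion_allowed is hypothetical and this is not the workflow's own code.

```python
def deletion_allowed(deletion_codons, max_deletion_length=0):
    """Return True if an in-frame deletion of this length passes the cap."""
    if max_deletion_length <= 0:
        # A non-positive cap disables the length check entirely.
        return True
    return deletion_codons <= max_deletion_length
```

With a cap of 3, a 3-codon deletion is kept and a 5-codon deletion is rejected; with the default of 0, both are kept.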
Linting and formatting
Linting results
Using workflow specific profile workflow/profiles/default for setting default command line arguments.
WARNING [dumpling]: Could not resolve fastq pair for '1_S1_L001': Could not find matching R1 and R2 fastq files for prefix '1_S1_L001' in /tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/example/data. Expected files matching patterns like 1_S1_L001_R1_001.fastq.gz, 1_S1_L001_R1.fq.gz, 1_S1_L001_1.fastq, etc.
WARNING [dumpling]: Could not resolve fastq pair for '2_S2_L001': Could not find matching R1 and R2 fastq files for prefix '2_S2_L001' in /tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/example/data. Expected files matching patterns like 2_S2_L001_R1_001.fastq.gz, 2_S2_L001_R1.fq.gz, 2_S2_L001_1.fastq, etc.
WARNING [dumpling]: Could not resolve fastq pair for '3_S3_L001': Could not find matching R1 and R2 fastq files for prefix '3_S3_L001' in /tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/example/data. Expected files matching patterns like 3_S3_L001_R1_001.fastq.gz, 3_S3_L001_R1.fq.gz, 3_S3_L001_1.fastq, etc.
WARNING [dumpling]: Could not resolve fastq pair for '4_S4_L001': Could not find matching R1 and R2 fastq files for prefix '4_S4_L001' in /tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/example/data. Expected files matching patterns like 4_S4_L001_R1_001.fastq.gz, 4_S4_L001_R1.fq.gz, 4_S4_L001_1.fastq, etc.
WARNING [dumpling]: Could not resolve fastq pair for '1_S1_L001': Could not find matching R1 and R2 fastq files for prefix '1_S1_L001' in /tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/example/data. Expected files matching patterns like 1_S1_L001_R1_001.fastq.gz, 1_S1_L001_R1.fq.gz, 1_S1_L001_1.fastq, etc.
WARNING [dumpling]: Could not resolve fastq pair for '5_S5_L001': Could not find matching R1 and R2 fastq files for prefix '5_S5_L001' in /tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/example/data. Expected files matching patterns like 5_S5_L001_R1_001.fastq.gz, 5_S5_L001_R1.fq.gz, 5_S5_L001_1.fastq, etc.
WARNING [dumpling]: Could not resolve fastq pair for '6_S6_L001': Could not find matching R1 and R2 fastq files for prefix '6_S6_L001' in /tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/example/data. Expected files matching patterns like 6_S6_L001_R1_001.fastq.gz, 6_S6_L001_R1.fq.gz, 6_S6_L001_1.fastq, etc.
WARNING [dumpling]: Could not resolve fastq pair for '7_S7_L001': Could not find matching R1 and R2 fastq files for prefix '7_S7_L001' in /tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/example/data. Expected files matching patterns like 7_S7_L001_R1_001.fastq.gz, 7_S7_L001_R1.fq.gz, 7_S7_L001_1.fastq, etc.
WARNING [dumpling]: Could not resolve fastq pair for '1_S1_L001': Could not find matching R1 and R2 fastq files for prefix '1_S1_L001' in /tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/example/data. Expected files matching patterns like 1_S1_L001_R1_001.fastq.gz, 1_S1_L001_R1.fq.gz, 1_S1_L001_1.fastq, etc.
WARNING [dumpling]: Could not resolve fastq pair for '2_S2_L001': Could not find matching R1 and R2 fastq files for prefix '2_S2_L001' in /tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/example/data. Expected files matching patterns like 2_S2_L001_R1_001.fastq.gz, 2_S2_L001_R1.fq.gz, 2_S2_L001_1.fastq, etc.
WARNING [dumpling]: Could not resolve fastq pair for '2_S2_L001': Could not find matching R1 and R2 fastq files for prefix '2_S2_L001' in /tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/example/data. Expected files matching patterns like 2_S2_L001_R1_001.fastq.gz, 2_S2_L001_R1.fq.gz, 2_S2_L001_1.fastq, etc.
WARNING [dumpling]: Could not resolve fastq pair for '1_S1_L001': Could not find matching R1 and R2 fastq files for prefix '1_S1_L001' in /tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/example/data. Expected files matching patterns like 1_S1_L001_R1_001.fastq.gz, 1_S1_L001_R1.fq.gz, 1_S1_L001_1.fastq, etc.
WARNING [dumpling]: Could not resolve fastq pair for '2_S2_L001': Could not find matching R1 and R2 fastq files for prefix '2_S2_L001' in /tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/example/data. Expected files matching patterns like 2_S2_L001_R1_001.fastq.gz, 2_S2_L001_R1.fq.gz, 2_S2_L001_1.fastq, etc.
WARNING [dumpling]: Could not resolve fastq pair for '3_S3_L001': Could not find matching R1 and R2 fastq files for prefix '3_S3_L001' in /tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/example/data. Expected files matching patterns like 3_S3_L001_R1_001.fastq.gz, 3_S3_L001_R1.fq.gz, 3_S3_L001_1.fastq, etc.
FileNotFoundError in file "/tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/workflow/rules/common.smk", line 215:
Data directory example/data does not exist
 File "/tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/workflow/rules/common.smk", line 447, in <module>
 File "/tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/workflow/rules/common.smk", line 215, in validate_config
Formatting results
[DEBUG]
[DEBUG]
[DEBUG]
[DEBUG]
[DEBUG]
[DEBUG]
[DEBUG]
[DEBUG] In file "/tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/workflow/rules/enrich.smk": Formatted content is different from original
[DEBUG]
[DEBUG]
[DEBUG]
[DEBUG]
[WARNING] In file "/tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/workflow/rules/process.smk": Keyword "input" at line 22 has comments under a value.
 PEP8 recommends block comments appear before what they describe
(see https://www.python.org/dev/peps/pep-0008/#id30)
[WARNING] In file "/tmp/tmp0jusz0mq/odcambc-dumpling-0fc2b14/workflow/rules/process.smk": Keyword "input" at line 50 has comments under a value.
 PEP8 recommends block comments appear before what they describe
(see https://www.python.org/dev/peps/pep-0008/#id30)
[DEBUG]
[DEBUG]
[DEBUG]
[INFO] 1 file(s) would be changed 😬
[INFO] 13 file(s) would be left unchanged 🎉

snakefmt version: 0.11.5