snakemake-workflows/dna-seq-varlociraptor
A Snakemake workflow for calling small and structural variants under any kind of scenario (tumor/normal, tumor/normal/relapse, germline, pedigree, populations) via the unified statistical model of Varlociraptor.
Overview
Topics: varlociraptor sciworkflows snakemake reproducibility genomic-variant-calling
Latest release: v5.14.0, Last update: 2025-03-09
Linting: linting: passed, Formatting:formatting: failed
Deployment
Step 1: Install Snakemake and Snakedeploy
Snakemake and Snakedeploy are best installed via the Mamba package manager (a drop-in replacement for conda). If you have neither Conda nor Mamba, it is recommended to install Miniforge. More details regarding Mamba can be found here.
When using Mamba, run
mamba create -c conda-forge -c bioconda --name snakemake snakemake snakedeploy
to install both Snakemake and Snakedeploy in an isolated environment. For all following commands ensure that this environment is activated via
conda activate snakemake
Step 2: Deploy workflow
With Snakemake and Snakedeploy installed, the workflow can be deployed as follows. First, create an appropriate project working directory on your system and enter it:
mkdir -p path/to/project-workdir
cd path/to/project-workdir
In all following steps, we will assume that you are inside of that directory. Then run
snakedeploy deploy-workflow https://github.com/snakemake-workflows/dna-seq-varlociraptor . --tag v5.14.0
Snakedeploy will create two folders, workflow
and config
. The former contains the deployment of the chosen workflow as a Snakemake module, the latter contains configuration files which will be modified in the next step in order to configure the workflow to your needs.
Step 3: Configure workflow
To configure the workflow, adapt config/config.yml
to your needs following the instructions below.
Step 4: Run workflow
The deployment method is controlled using the --software-deployment-method
(short --sdm
) argument.
To run the workflow with automatic deployment of all required software via conda
/mamba
, use
snakemake --cores all --sdm conda
Snakemake will automatically detect the main Snakefile
in the workflow
subfolder and execute the workflow module that has been defined by the deployment in step 2.
For further options such as cluster and cloud execution, see the docs.
Step 5: Generate report
After finalizing your data analysis, you can automatically generate an interactive visual HTML report for inspection of results together with parameters and code inside of the browser using
snakemake --report report.zip
Configuration
The following section is imported from the workflow’s config/README.md
.
To configure this workflow, modify config/config.yaml
according to your needs, following the explanations provided in the file.
Add samples to config/samples.tsv
. For each sample, the columns sample_name
, alias
, platform
, datatype
, calling
and group
have to be defined.
- Samples within the same
group
can be referenced in a joint Calling scenario via theiralias
es. -
alias
es represent the name of the sample within its group. They are meant to be some abstract description of the sample type to be used in the Calling scenario, and should thus be used consistently across groups. A classic example would be a combination of thetumor
andnormal
aliases. - The
platform
column needs to contain the used sequencing plaform (one of 'CAPILLARY', 'LS454', 'ILLUMINA', 'SOLID', 'HELICOS', 'IONTORRENT', 'ONT', 'PACBIO’). - The purity column is required when being used with the default scenario. If it is unknown, it can be set to
1.0
. - The same
sample_name
entry can be used multiple times within asamples.tsv
sample sheet, with only the value in thegroup
column differing between repeated rows. This way, you can use the same sample for variant calling in different groups, for example if you use a panel of normal samples when you don't have matched normal samples for tumor variant calling. - The
datatype
column specifies what kind of data each sample corresponds to. This can either berna
ordna
. - The
calling
column sets the kind of analysis to be performed. This can be eitherfusions
,variants
or both (comma separated). Fusion calling is still under developement and should be considered as experimental.
If mutational burdens shall be estimated for a sample, the to be used events
from the calling scenario (see below) have to be specified in an additional column mutational_burden_events
. Multiple events have to be separated by commas within that column.
Missing values can be specified by empty columns or by writing NA
. Lines can be commented out with #
.
For each sample, add one or more sequencing units (runs, lanes or replicates) to the unit sheet config/units.tsv
.
- Each unit has a
unit_name
. This can be a running number, or an actual run, lane or replicate id. - Each unit has a
sample_name
, which associates it with the biological sample it comes from. This information is used to merged all the units of a sample before read mapping and duplicate marking. - For each unit, you need to specify either of these columns:
-
fq1
only for single end reads. This can point to any FASTQ file on your system -
fq1
andfq2
for paired end reads. These can point to any FASTQ files on your system -
sra
only: specify an SRA (sequence read archive) accession (starting with e.g. ERR or SRR). The pipeline will automatically download the corresponding paired end reads from SRA. - If both local files (
fq1
,fq2
) and SRA accession (sra
) are available, the local files will be used.
-
- Define adapters in the
adapters
column, by putting cutadapt arguments in quotation marks (e.g."-a ACGCGATCG -A GCTAGCGTACT"
).
Missing values can be specified by empty columns or by writing NA
. Lines can be commented out with #
.
Varlociraptor supports integrated uncertainty aware calling and filtering of variants for arbitrary scenarios. These are defined as so-called scenarios, via a variant calling grammar.
- For each group, a scenario is rendered via YTE.
- Therefore, edit the template scenario (
scenario.yaml
) according to your needs. The sample sheet is available for YTE rendering as a pandas data frame in the variablesamples
. This allows to customize the scenario according to the contents of the sample sheet. You can therefore add additional columns to the sample sheet (e.g. purity) and access them in the scenario template, in order to pass the information to Varlociraptor. - Example scenarios for various use cases can be found in the scenario catalog.
For panel data the pipeline allows trimming of amplicon primers on both ends of a fragment but also on a single end only.
In case of single end primers these are supposed to be located at the left end of a read.
When primer trimming is enabled, primers have to be defined either directly in the config.yaml
or in a seperate tsv-file.
Defining primers directly in the config file is prefered when all samples come from the same primer set.
In case of different panels, primers have to be set panel-wise in a seperate tsv-file.
For each panel the following columns need to be set: panel
, fa1
and fa2
(optional).
Additionally, for each sample the corresponding panel must be defined in samples.tsv
(column panel
).
If a panel is not provided for a sample, trimming will not be performed on that sample.
For single primer trimming only, the first entry in the config (respective in the tsv file) needs to be defined.
For annotating UMIs two additional columns in sample.tsv
must be set:
-
umi_read
: this can be either of the following options:-
fq1
if the UMIs are part of read 1 -
fq2
if the UMIs are part of read 2 -
both
if there are UMIs in both paired end reads - the path to an additional fastq file containing just the UMI of each fragment in fq1 and fq2 (with the same read names)
-
-
umi_read_structure
: A read structure defining the UMI position in each UMI record (see https://github.com/fulcrumgenomics/fgbio/wiki/Read-Structures). Ifboth
reads contain a UMI, specify a read structure for both with whitespace in between (for example,8M142T 8M142T
). In case a separate fastq file only containg UMI sequences is set the read structure needs to be+M
. Read names of UMI records must match the corresponding records of the sample fastqs.
Linting and formatting
Linting results
None
Formatting results
[DEBUG]
[DEBUG]
[DEBUG]
[DEBUG]
[DEBUG] In file "/tmp/tmpnh2isxjs/snakemake-workflows-dna-seq-varlociraptor-5dc02f6/workflow/rules/common.smk": Formatted content is different from original
[DEBUG]
[DEBUG] In file "/tmp/tmpnh2isxjs/snakemake-workflows-dna-seq-varlociraptor-5dc02f6/workflow/rules/maf.smk": Formatted content is different from original
[DEBUG]
[DEBUG]
[DEBUG]
[DEBUG]
[DEBUG]
[DEBUG]
[DEBUG]
[DEBUG]
[DEBUG]
[DEBUG]
[DEBUG]
[DEBUG]
[DEBUG]
... (truncated)