snakemake-workflows/dna-seq-varlociraptor

A Snakemake workflow for calling small and structural variants under any kind of scenario (tumor/normal, tumor/normal/relapse, germline, pedigree, populations) via the unified statistical model of Varlociraptor.

Overview

Topics: varlociraptor sciworkflows snakemake reproducibility genomic-variant-calling

Latest release: v5.14.0, Last update: 2025-03-09

Linting: passed, Formatting: failed

Deployment

Step 1: Install Snakemake and Snakedeploy

Snakemake and Snakedeploy are best installed via the Mamba package manager (a drop-in replacement for Conda). If you have neither Conda nor Mamba, it is recommended to install Miniforge. More details regarding Mamba can be found in the Mamba documentation.

When using Mamba, run

mamba create -c conda-forge -c bioconda --name snakemake snakemake snakedeploy

to install both Snakemake and Snakedeploy in an isolated environment. For all following commands, ensure that this environment is activated via

conda activate snakemake

Step 2: Deploy workflow

With Snakemake and Snakedeploy installed, the workflow can be deployed as follows. First, create an appropriate project working directory on your system and enter it:

mkdir -p path/to/project-workdir
cd path/to/project-workdir

In all following steps, we will assume that you are inside that directory. Then run

snakedeploy deploy-workflow https://github.com/snakemake-workflows/dna-seq-varlociraptor . --tag v5.14.0

Snakedeploy will create two folders, workflow and config. The former contains the deployment of the chosen workflow as a Snakemake module, the latter contains configuration files which will be modified in the next step in order to configure the workflow to your needs.
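
Based on the files referenced later in this README, the working directory should then look roughly like this (additional files may be present depending on the workflow version):

path/to/project-workdir/
├── workflow/          (the deployed workflow module, including the main Snakefile)
└── config/
    ├── config.yaml    (main workflow configuration)
    ├── samples.tsv    (sample sheet, see Configuration below)
    └── units.tsv      (unit sheet, see Configuration below)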

Step 3: Configure workflow

To configure the workflow, adapt config/config.yaml to your needs, following the instructions below.

Step 4: Run workflow

The deployment method is controlled using the --software-deployment-method (short --sdm) argument.

To run the workflow with automatic deployment of all required software via conda/mamba, use

snakemake --cores all --sdm conda
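
To preview the jobs that would be executed without running anything, a dry run can be performed first using Snakemake's standard -n (--dry-run) flag:

snakemake --cores all --sdm conda -n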

Snakemake will automatically detect the main Snakefile in the workflow subfolder and execute the workflow module that has been defined by the deployment in step 2.

For further options such as cluster and cloud execution, see the docs.
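
As an illustration only (this assumes Snakemake ≥ 8 with the snakemake-executor-plugin-slurm installed, which is not covered by the installation step above), a run on a Slurm cluster could look roughly like

snakemake --sdm conda --executor slurm --jobs 100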

Step 5: Generate report

After finalizing your data analysis, you can automatically generate an interactive visual HTML report for inspecting results together with parameters and code in the browser, using

snakemake --report report.zip

Configuration

The following section is imported from the workflow’s config/README.md.

General settings

To configure this workflow, modify config/config.yaml according to your needs, following the explanations provided in the file.

Sample sheet

Add samples to config/samples.tsv (an example sheet is shown at the end of this section). For each sample, the columns sample_name, alias, platform, datatype, calling and group have to be defined.

  • Samples within the same group can be referenced in a joint calling scenario via their aliases.
  • Aliases represent the name of a sample within its group. They are meant to be an abstract description of the sample type, to be used in the calling scenario, and should thus be used consistently across groups. A classic example would be the combination of the tumor and normal aliases.
  • The platform column needs to contain the used sequencing platform (one of 'CAPILLARY', 'LS454', 'ILLUMINA', 'SOLID', 'HELICOS', 'IONTORRENT', 'ONT', 'PACBIO').
  • The purity column is required when using the default scenario. If the purity is unknown, it can be set to 1.0.
  • The same sample_name entry can be used multiple times within samples.tsv, with only the value in the group column differing between repeated rows. This way, you can use the same sample for variant calling in different groups, for example when using a panel of normal samples because no matched normal samples are available for tumor variant calling.
  • The datatype column specifies what kind of data each sample corresponds to. This can be either rna or dna.
  • The calling column sets the kind of analysis to be performed. This can be fusions, variants, or both (comma-separated). Fusion calling is still under development and should be considered experimental.

If mutational burdens shall be estimated for a sample, the events from the calling scenario (see below) that shall be used for the estimation have to be specified in an additional column mutational_burden_events. Multiple events have to be separated by commas within that column.

Missing values can be specified by empty columns or by writing NA. Lines can be commented out with #.
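
As an illustration, a minimal samples.tsv for a tumor/normal pair could look like the following (tab-separated in the actual file; columns are aligned here for readability, and all values are made up for this example):

sample_name      alias   platform  datatype  calling   group     purity
patient1_tumor   tumor   ILLUMINA  dna       variants  patient1  0.75
patient1_normal  normal  ILLUMINA  dna       variants  patient1  1.0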

Unit sheet

For each sample, add one or more sequencing units (runs, lanes or replicates) to the unit sheet config/units.tsv (an example sheet is shown at the end of this section).

  • Each unit has a unit_name. This can be a running number, or an actual run, lane or replicate id.
  • Each unit has a sample_name, which associates it with the biological sample it comes from. This information is used to merge all units of a sample before read mapping and duplicate marking.
  • For each unit, you need to specify one of the following input sources:
    • fq1 only, for single-end reads. This can point to any FASTQ file on your system.
    • fq1 and fq2, for paired-end reads. These can point to any FASTQ files on your system.
    • sra only: specify an SRA (Sequence Read Archive) accession (starting with e.g. ERR or SRR). The pipeline will automatically download the corresponding paired-end reads from SRA.
    • If both local files (fq1, fq2) and an SRA accession (sra) are available, the local files will be used.
  • Define adapters in the adapters column, by putting cutadapt arguments in quotation marks (e.g. "-a ACGCGATCG -A GCTAGCGTACT").

Missing values can be specified by empty columns or by writing NA. Lines can be commented out with #.
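
For example, a units.tsv referencing local paired-end FASTQ files for the samples above could look like this (tab-separated in the actual file; paths and adapter sequences are purely illustrative):

unit_name  sample_name      fq1                             fq2                             sra  adapters
lane1      patient1_tumor   reads/patient1_tumor_R1.fq.gz   reads/patient1_tumor_R2.fq.gz        "-a ACGCGATCG -A GCTAGCGTACT"
lane1      patient1_normal  reads/patient1_normal_R1.fq.gz  reads/patient1_normal_R2.fq.gz       "-a ACGCGATCG -A GCTAGCGTACT"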

Calling scenario

Varlociraptor supports integrated, uncertainty-aware calling and filtering of variants for arbitrary settings. These settings are defined as so-called scenarios, via a variant calling grammar.

  • For each group, a scenario is rendered via YTE.
  • Therefore, edit the template scenario (scenario.yaml) according to your needs. The sample sheet is available for YTE rendering as a pandas data frame in the variable samples. This allows customizing the scenario according to the contents of the sample sheet: you can add additional columns to the sample sheet (e.g. purity) and access them in the scenario template, in order to pass the information to Varlociraptor (see the sketch after this list).
  • Example scenarios for various use cases can be found in the scenario catalog.
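
For orientation only, a heavily simplified tumor/normal scenario template might look roughly like the sketch below. The exact grammar, the way samples are accessed via the samples data frame, and the YTE expression syntax should be taken from the deployed scenario.yaml and the scenario catalog, not from this sketch:

samples:
  tumor:
    universe: "[0.0,1.0]"
    contamination:
      by: "normal"
      # hypothetical: read the purity column added to the sample sheet
      fraction: ?(1.0 - float(samples.loc["patient1_tumor", "purity"]))
  normal:
    universe: "[0.0,0.5[ | 0.5 | 1.0"
events:
  somatic: "tumor:]0.0,1.0] & normal:0.0"
  germline: "tumor:]0.0,1.0] & (normal:0.5 | normal:1.0)"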

Primer trimming

For panel data, the pipeline allows trimming of amplicon primers on both ends of a fragment, or on a single end only. In the case of single-end primers, these are expected to be located at the left end of a read. When primer trimming is enabled, primers have to be defined either directly in config.yaml or in a separate TSV file. Defining primers directly in the config file is preferred when all samples come from the same primer set. If different panels are used, primers have to be set panel-wise in a separate TSV file, with the following columns per panel: panel, fa1 and fa2 (optional). Additionally, for each sample the corresponding panel must be defined in samples.tsv (column panel). If no panel is provided for a sample, trimming will not be performed on that sample. For single-primer trimming, only the first entry in the config (or in the TSV file, respectively) needs to be defined.
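
As an illustration, a panel-wise primer TSV file using the columns named above could look like this (tab-separated; file names are made up), with the corresponding panel name set in the panel column of samples.tsv for each sample:

panel    fa1                     fa2
panel_a  primers/panel_a_fwd.fa  primers/panel_a_rev.fa
panel_b  primers/panel_b_fwd.fa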

Annotating UMIs

For annotating UMIs, two additional columns in samples.tsv must be set:

  • umi_read: this can be either of the following options:
    • fq1 if the UMIs are part of read 1
    • fq2 if the UMIs are part of read 2
    • both if there are UMIs in both paired end reads
    • the path to an additional fastq file containing just the UMI of each fragment in fq1 and fq2 (with the same read names)
  • umi_read_structure: A read structure defining the UMI position in each UMI record (see https://github.com/fulcrumgenomics/fgbio/wiki/Read-Structures). If both reads contain a UMI, specify a read structure for each, separated by whitespace (for example, 8M142T 8M142T). If a separate FASTQ file containing only UMI sequences is used, the read structure needs to be +M. Read names of UMI records must match the corresponding records of the sample FASTQs (see the sketch after this list).
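
As a sketch, for 150 bp paired-end reads with an 8 bp UMI at the start of both reads, the two columns in samples.tsv could be filled as follows (values illustrative):

umi_read  umi_read_structure
both      8M142T 8M142T

If instead a separate FASTQ file holds the UMIs, it could look like this:

umi_read                        umi_read_structure
reads/patient1_tumor_umi.fq.gz  +M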

Linting and formatting

Linting results

None

Formatting results

[DEBUG] In file "/tmp/tmpnh2isxjs/snakemake-workflows-dna-seq-varlociraptor-5dc02f6/workflow/rules/common.smk":  Formatted content is different from original
[DEBUG] In file "/tmp/tmpnh2isxjs/snakemake-workflows-dna-seq-varlociraptor-5dc02f6/workflow/rules/maf.smk":  Formatted content is different from original

... (truncated)