PathoGenOmics-Lab/VIPERA

A Snakemake workflow for SARS-CoV-2 Viral Intra-Patient Evolution Reporting and Analysis

Overview

Latest release: v1.2.2, Last update: 2025-08-01

Linting: failed, Formatting: failed

Topics: bioinformatics intrahost sars-cov-2 virus-evolution reporting snakemake

Deployment

Step 1: Install Snakemake and Snakedeploy

Snakemake and Snakedeploy are best installed via the Mamba package manager (a drop-in replacement for Conda). If you have neither Conda nor Mamba, it is recommended to install Miniforge. More details can be found in the Mamba documentation.

When using Mamba, run

mamba create -c conda-forge -c bioconda --name snakemake snakemake snakedeploy

to install both Snakemake and Snakedeploy in an isolated environment. For all following commands ensure that this environment is activated via

conda activate snakemake

Step 2: Deploy workflow

With Snakemake and Snakedeploy installed, the workflow can be deployed as follows. First, create an appropriate project working directory on your system and enter it:

mkdir -p path/to/project-workdir
cd path/to/project-workdir

In all following steps, we will assume that you are inside of that directory. Then run

snakedeploy deploy-workflow https://github.com/PathoGenOmics-Lab/VIPERA . --tag v1.2.2

Snakedeploy will create two folders, workflow and config. The former contains the deployment of the chosen workflow as a Snakemake module, the latter contains configuration files which will be modified in the next step in order to configure the workflow to your needs.

Step 3: Configure workflow

To configure the workflow, adapt the configuration files in the config folder (config.yaml and targets.yaml) to your needs following the instructions below.

Step 4: Run workflow

The deployment method is controlled using the --software-deployment-method (short --sdm) argument.

To run the workflow with automatic deployment of all required software via conda/mamba, use

snakemake --cores all --sdm conda

Snakemake will automatically detect the main Snakefile in the workflow subfolder and execute the workflow module that has been defined by the deployment in step 2.

For further options such as cluster and cloud execution, see the Snakemake documentation.

Step 5: Generate report

After finalizing your data analysis, you can automatically generate an interactive visual HTML report for inspection of results together with parameters and code inside of the browser using

snakemake --report report.zip

Configuration

The following section is imported from the workflow’s config/README.md.

Instructions

To run VIPERA, an environment with Snakemake version 7.19 or later is needed (see the Snakemake docs for setup instructions).

This guide provides command-line instructions for running VIPERA with Snakemake versions prior to 8. All configuration parameters are fully cross-compatible. The original publication used Snakemake 7.32, but newer versions can also be used with only minor changes (see the Snakemake migration guide). For example, existing profiles are cross-compatible, but the --use-conda flag is deprecated starting with Snakemake 8; use --software-deployment-method conda instead.

Inputs and outputs

The workflow requires a set of FASTA files (one per target sample), a corresponding set of BAM files (also one per target sample), and a metadata table in CSV format with one row per sample. The metadata must include the following columns: a unique sample identifier (default column ID, used to match sequencing files with metadata), the date the sample was collected (default CollectionDate), the location where the sample was collected (default ResidenceCity), and the GISAID accession (default GISAIDEPI). The default column names can be customized if needed via the workflow parameters.
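As a quick sanity check before running the workflow, the metadata header can be compared against the expected column names. The following is a minimal sketch and not part of VIPERA: the function name missing_columns and the example row are hypothetical, and the column list assumes the default names given above.

```python
import csv
import io

# Default column names as listed above; adjust them if you customized the parameters.
REQUIRED_COLUMNS = ["ID", "CollectionDate", "ResidenceCity", "GISAIDEPI"]


def missing_columns(csv_text, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    return [name for name in required if name not in header]


# Made-up metadata with one sample and no GISAID accession column.
example = "ID,CollectionDate,ResidenceCity\nsample1,2020-03-01,Valencia\n"
print(missing_columns(example))  # prints ['GISAIDEPI']
```

An empty list means that all default columns are present.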

These parameters are set in two configuration files in YAML format: config.yaml (for general workflow settings) and targets.yaml (for specific dataset-related settings). You must modify the latter so that the SAMPLES and METADATA parameters point to your data. The OUTPUT_DIRECTORY parameter should point to your desired results directory.

The script build_targets.py simplifies the process of creating the targets configuration file. To run this script, you need to have PyYAML installed. It takes a list of sample names, a directory with BAM and FASTA files, the path to the metadata table and the name of your dataset as required inputs. Then, it searches the directory for files that have the appropriate extensions and sample names and adds them to the configuration file.
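The file search described above can be illustrated roughly as follows. This is a sketch under stated assumptions, not the actual build_targets.py: the function names collect_targets and build_config are hypothetical, and files are assumed to be named after the sample with a .bam or .fasta extension, which may differ from how the real script matches them.

```python
from pathlib import Path


def collect_targets(sample_names, directory):
    """Map each sample name to the BAM and FASTA files found for it."""
    directory = Path(directory)
    samples = {}
    for name in sample_names:
        entry = {}
        for key, suffix in (("bam", ".bam"), ("fasta", ".fasta")):
            # Assumption: files are named <sample><suffix> somewhere under the directory.
            matches = sorted(directory.rglob(f"{name}{suffix}"))
            if matches:
                entry[key] = str(matches[0])
        samples[name] = entry
    return samples


def build_config(dataset, sample_names, directory, metadata):
    """Assemble a mapping with the keys used in the targets configuration file."""
    return {
        "OUTPUT_NAME": dataset,
        "SAMPLES": collect_targets(sample_names, directory),
        "METADATA": str(metadata),
        "OUTPUT_DIRECTORY": "output",
        "CONTEXT_FASTA": None,
        "MAPPING_REFERENCES_FASTA": None,
    }
```

With PyYAML installed, the resulting mapping could be written out with yaml.safe_dump; None values are serialized as null.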

An example file could look like this:

OUTPUT_NAME:
  "your-dataset-name"
SAMPLES:
  sample1:
    bam: "path/to/sorted/bam1.bam"
    fasta: "path/to/sequence1.fasta"
  sample2:
    bam: "path/to/sorted/bam2.bam"
    fasta: "path/to/sequence2.fasta"
  ...
METADATA:
  "path/to/metadata.csv"
OUTPUT_DIRECTORY:
  "output"
CONTEXT_FASTA:
  null
MAPPING_REFERENCES_FASTA:
  null

This information may also be provided through the --config parameter.

Automated construction of a context dataset

Setting the CONTEXT_FASTA parameter to null (default) will enable the automatic download of sequences from the GISAID EpiCoV SARS-CoV-2 database. An unset parameter has the same effect. To enable this, you must also sign up to the GISAID platform and provide your credentials by creating and filling an additional configuration file (default: config/gisaid.yaml) as follows:

USERNAME: "your-username"
PASSWORD: "your-password"

A set of samples that meet the spatial, temporal and phylogenetic criteria set through the download_context rule will be retrieved automatically from GISAID. These criteria are:

Location matching the place(s) of sampling of the target samples
Collection date within the time window that includes 95% of the date distribution of the target samples (2.5% is trimmed at each end to account for extreme values) ± 2 weeks
Pango lineage matching that of the target samples
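The temporal criterion can be sketched as follows. This is an illustration only: the function names are hypothetical, and the linear-interpolation quantile is an assumption, so the workflow's own implementation may compute slightly different bounds.

```python
from datetime import date, timedelta


def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted, non-empty list."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def context_window(dates, trim=0.025, padding=timedelta(weeks=2)):
    """Return (start, end): the central 95% of the dates, widened by two weeks."""
    ordinals = sorted(d.toordinal() for d in dates)
    start = date.fromordinal(round(quantile(ordinals, trim)))
    end = date.fromordinal(round(quantile(ordinals, 1 - trim)))
    return start - padding, end + padding


# Made-up example: 41 consecutive daily samples starting on 2021-01-01.
dates = [date(2021, 1, 1) + timedelta(days=i) for i in range(41)]
print(context_window(dates))  # (2020-12-19, 2021-02-23) as date objects
```

In this example the trimmed window runs from 2021-01-02 to 2021-02-09, and the two-week padding extends it on both sides.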

Then, a series of checkpoint steps are executed for quality assurance:

Remove context samples whose GISAID ID matches that of any target sample
Enforce a minimum number of samples to have at least as many possible combinations as random subsample replicates for the diversity assessment (set in config.yaml)
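The second checkpoint is a combinatorial condition: the number of distinct subsets that can be drawn from the context dataset must be at least the number of replicates. A minimal sketch, assuming that each replicate is a subset of fixed size drawn without replacement (the function name and the subset size are hypothetical, not taken from the workflow):

```python
from math import comb


def min_context_samples(subset_size, replicates):
    """Smallest n such that C(n, subset_size) >= replicates."""
    if subset_size < 1:
        raise ValueError("subset_size must be at least 1")
    n = subset_size
    # C(n, k) grows without bound in n for k >= 1, so this loop terminates.
    while comb(n, subset_size) < replicates:
        n += 1
    return n


# With subsets of 10 samples and 1000 replicates, 14 context samples suffice,
# since C(14, 10) = 1001 while C(13, 10) = 286.
print(min_context_samples(10, 1000))  # prints 14
```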

The workflow will continue its execution until completion if the obtained context dataset passes these checkpoints. Otherwise, the execution will be terminated and, to continue the analysis, an external context dataset must be provided through the CONTEXT_FASTA parameter. This can be done by editing targets.yaml or via the command line:

snakemake --config CONTEXT_FASTA="path/to/fasta"

Important: The GISAID EpiCoV database is proprietary and not openly accessible. For details, refer to the GISAID Terms of Use. VIPERA uses GISAIDR to automate access to GISAID data. However, this access can be unstable or occasionally fail due to changes in the platform. Possible workarounds are documented in the GISAIDR repository (e.g. issues #55 and #58). If programmatic access fails, a suitable context dataset must be manually provided by setting the CONTEXT_FASTA parameter to the path of a FASTA file. As a last resort, some of the analyses can be allowed to run even if context-dependent rules fail by passing the --keep-going flag to Snakemake. To replicate our work, the automatic context dataset is available via DOI: 10.55876/gis8.250718er (EPI_SET_250718er). Read more about EPI_SETs here.

Mapping reference sequence

Setting MAPPING_REFERENCES_FASTA to null (default) will enable the automatic download of the reference sequence(s) that were used to map the reads and generate the BAM files. An unset parameter has the same effect. If the required sequence is not publicly available, or you already have it at your disposal, it can be provided manually by setting the parameter to the path of the reference FASTA file.

Workflow configuration variables

All of the following variables are pre-defined in config.yaml:

ALIGNMENT_REFERENCE: NCBI accession number of the reference record for sequence alignment.
PROBLEMATIC_VCF: URL or path of a VCF file containing problematic genome positions for masking.
FEATURES_JSON: path of a JSON file containing name equivalences of genome features for data visualization.
GENETIC_CODE_JSON: path of a JSON file containing a genetic code for gene translation.
TREE_MODEL: substitution model used by IQTREE (see docs).
UFBOOT_REPS: ultrafast bootstrap replicates for IQTREE (see UFBoot).
SHALRT_REPS: Shimodaira–Hasegawa approximate likelihood ratio test bootstrap replicates for IQTREE (see SH-aLRT).
VC: variant calling configuration:
- MAX_DEPTH: maximum depth at a position for samtools mpileup (option -d).
- MIN_QUALITY: minimum base quality for samtools mpileup (option -Q).
- IVAR_QUALITY: minimum base quality for ivar variants (option -q).
- IVAR_FREQ: minimum frequency threshold for ivar variants (option -t).
- IVAR_DEPTH: minimum read depth for ivar variants (option -m).
DEMIX: demixing configuration:
- MIN_QUALITY: minimum quality for freyja variants (option --minq).
- COV_CUTOFF: minimum depth for freyja demix (option --covcut).
- MIN_ABUNDANCE: minimum lineage estimated abundance for freyja demix (option --eps).
WINDOW: configuration of the sliding window used to count nucleotide variants per site:
- WIDTH: number of sites within windows.
- STEP: number of sites between windows.
GISAID: automatic context download configuration.
- CREDENTIALS: path of the GISAID credentials in YAML format.
- DATE_COLUMN: name of the column that contains sampling dates (YYYY-MM-DD) in the input target metadata.
- LOCATION_COLUMN: name of the column that contains sampling locations (e.g. city names) in the input target metadata.
- ACCESSION_COLUMN: name of the column that contains GISAID EPI identifiers in the input target metadata.
DIVERSITY_REPS: number of random sample subsets of the context dataset for the nucleotide diversity comparison.
USE_BIONJ: use the BIONJ algorithm (Gascuel, 1997) instead of NJ (neighbor-joining; Saitou & Nei, 1987) to reconstruct phylogenetic trees from pairwise distances.
COR: configuration for correlation analyses of allele frequency data over time and between variants. This parameter controls how correlation tests are performed using R’s cor.test and cor functions (see R documentation).
- METHOD: correlation method to use. Valid options are “pearson” (default), “kendall”, or “spearman”.
- EXACT: boolean flag indicating whether to compute an exact p-value when possible. This option applies only to certain methods and may be set to null (default) to let R decide automatically.
LOG_PY_FMT: logging format string for Python scripts.
PLOTS: path of the R script that sets the design and style of data visualizations.
PLOT_GENOME_REGIONS: path of a CSV file containing genome regions, e.g. SARS-CoV-2 non-structural protein (NSP) coordinates, for data visualization.
REPORT_QMD: path of the report template in Quarto markdown (QMD) format.
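To illustrate how the WINDOW settings interact, the sketch below lists window coordinates for a given WIDTH and STEP. It assumes that STEP is the distance between consecutive window starts and that trailing partial windows are dropped; both are assumptions for illustration, and the function name is hypothetical.

```python
def window_bounds(genome_length, width, step):
    """Return (start, end) pairs, 1-based and inclusive, for each full window."""
    return [
        (start, start + width - 1)
        for start in range(1, genome_length - width + 2, step)
    ]


# Toy example: a 10-site genome with WIDTH=4 and STEP=3.
print(window_bounds(10, 4, 3))  # prints [(1, 4), (4, 7), (7, 10)]
```

A STEP smaller than WIDTH yields overlapping windows, as in the example.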

Workflow graphs

To generate a simplified rule graph, run:

snakemake --rulegraph | dot -Tpng > .rulegraph.png

To generate the directed acyclic graph (DAG) of all rules to be executed, run:

snakemake --forceall --dag | dot -Tpng > .dag.png

Run modes

To run the analysis with the default configuration, run the following command (change the -c/--cores argument to use a different number of CPUs):

snakemake --use-conda -c4

To run the analysis in an HPC environment using SLURM, we provide a default SLURM profile configuration as an example that should be modified to fit your needs. Read more about profiles in the Snakemake documentation. To use the profile, run one of the following commands (with Snakemake 8 or later, first install the Snakemake executor plugin for SLURM):

snakemake --slurm --profile profile/slurm  # Snakemake v7
snakemake --profile profile/slurm          # Snakemake v8+

Additionally, we offer the option of running the workflow within a containerized environment using a pre-built Docker image, provided that Apptainer/Singularity is available on the system. This eliminates the need for further conda package downloads and environment configuration. To do that, simply add the option --use-apptainer to any of the previous commands.

Running VIPERA with Apptainer in the Windows Subsystem for Linux (WSL) may fail because the default file permissions configuration conflicts with Snakemake’s containerized conda environment activation mechanism. Running the containerized workflow on WSL is therefore not advised. Known issues also arise when using non-default temporary directories and Snakemake shadow directories. To avoid them, use the default temporary directory (e.g. export TMPDIR=/tmp on Linux machines) and specify the shadow prefix (--shadow-prefix /tmp) before executing the containerized workflow.

Linting and formatting

Linting results

Lints for snakefile /tmp/tmphuxussnv/PathoGenOmics-Lab-VIPERA-2b93521/workflow/rules/common.smk:
    * Absolute path "/"sequences.fasta" in line 52:
      Do not define absolute paths inside of the workflow, since this renders
      your workflow irreproducible on other machines. Use path relative to the
      working directory instead, or make the path configurable via a config
      file.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/configuration.html#configuration

Lints for snakefile /tmp/tmphuxussnv/PathoGenOmics-Lab-VIPERA-2b93521/workflow/rules/fasta.smk:
    * Absolute path "/g" in line 28:
      Do not define absolute paths inside of the workflow, since this renders
      your workflow irreproducible on other machines. Use path relative to the
      working directory instead, or make the path configurable via a config
      file.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/configuration.html#configuration
    * Absolute path "/f" in line 55:
      Do not define absolute paths inside of the workflow, since this renders
      your workflow irreproducible on other machines. Use path relative to the
      working directory instead, or make the path configurable via a config
      file.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/configuration.html#configuration
    * Absolute path "/f" in line 70:
      Do not define absolute paths inside of the workflow, since this renders
      your workflow irreproducible on other machines. Use path relative to the
      working directory instead, or make the path configurable via a config
      file.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/configuration.html#configuration
    * Absolute path "/f" in line 74:
      Do not define absolute paths inside of the workflow, since this renders
      your workflow irreproducible on other machines. Use path relative to the
      working directory instead, or make the path configurable via a config
      file.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/configuration.html#configuration

Lints for snakefile /tmp/tmphuxussnv/PathoGenOmics-Lab-VIPERA-2b93521/workflow/rules/asr.smk:
    * Absolute path "/f" in line 10:
      Do not define absolute paths inside of the workflow, since this renders
      your workflow irreproducible on other machines. Use path relative to the
      working directory instead, or make the path configurable via a config
      file.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/configuration.html#configuration
    * Absolute path "/f" in line 13:
      Do not define absolute paths inside of the workflow, since this renders
      your workflow irreproducible on other machines. Use path relative to the
      working directory instead, or make the path configurable via a config
      file.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/configuration.html#configuration
    * Absolute path "/f" in line 34:
      Do not define absolute paths inside of the workflow, since this renders
      your workflow irreproducible on other machines. Use path relative to the
      working directory instead, or make the path configurable via a config
      file.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/configuration.html#configuration

Lints for snakefile /tmp/tmphuxussnv/PathoGenOmics-Lab-VIPERA-2b93521/workflow/rules/demix.smk:
    * Absolute path "/"{sample}/{sample}_depth.txt" in line 11:
      Do not define absolute paths inside of the workflow, since this renders
      your workflow irreproducible on other machines. Use path relative to the
      working directory instead, or make the path configurable via a config
      file.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/configuration.html#configuration
    * Absolute path "/"{sample}/{sample}_variants.tsv" in line 12:
      Do not define absolute paths inside of the workflow, since this renders
      your workflow irreproducible on other machines. Use path relative to the
      working directory instead, or make the path configurable via a config
      file.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/configuration.html#configuration
    * Absolute path "/"{sample}/{sample}_depth.txt" in line 31:
      Do not define absolute paths inside of the workflow, since this renders
      your workflow irreproducible on other machines. Use path relative to the
      working directory instead, or make the path configurable via a config
      file.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/configuration.html#configuration
    * Absolute path "/"{sample}/{sample}_variants.tsv" in line 32:
      Do not define absolute paths inside of the workflow, since this renders
      your workflow irreproducible on other machines. Use path relative to the
      working directory instead, or make the path configurable via a config
      file.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/configuration.html#configuration
    * Absolute path "/"{sample}/{sample}_demixed.tsv" in line 37:
      Do not define absolute paths inside of the workflow, since this renders
      your workflow irreproducible on other machines. Use path relative to the
      working directory instead, or make the path configurable via a config
      file.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/configuration.html#configuration
    * Absolute path "/"{sample}/{sample}_demixed.tsv" in line 56:
      Do not define absolute paths inside of the workflow, since this renders
      your workflow irreproducible on other machines. Use path relative to the
      working directory instead, or make the path configurable via a config
      file.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/configuration.html#configuration

Lints for snakefile /tmp/tmphuxussnv/PathoGenOmics-Lab-VIPERA-2b93521/workflow/rules/vaf.smk:
    * Absolute path "/g" in line 30:
      Do not define absolute paths inside of the workflow, since this renders
      your workflow irreproducible on other machines. Use path relative to the
      working directory instead, or make the path configurable via a config
      file.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/configuration.html#configuration
    * Absolute path "/'{wildcards.sample}" in line 52:
      Do not define absolute paths inside of the workflow, since this renders
      your workflow irreproducible on other machines. Use path relative to the
      working directory instead, or make the path configurable via a config
      file.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/configuration.html#configuration

Lints for snakefile /tmp/tmphuxussnv/PathoGenOmics-Lab-VIPERA-2b93521/workflow/rules/context.smk:
    * Absolute path "/"sequences.fasta" in line 26:
      Do not define absolute paths inside of the workflow, since this renders
      your workflow irreproducible on other machines. Use path relative to the
      working directory instead, or make the path configurable via a config
      file.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/configuration.html#configuration
    * Absolute path "/"metadata.csv" in line 27:
      Do not define absolute paths inside of the workflow, since this renders
      your workflow irreproducible on other machines. Use path relative to the
      working directory instead, or make the path configurable via a config
      file.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/configuration.html#configuration
    * Absolute path "/"duplicate_accession_ids.txt" in line 28:
      Do not define absolute paths inside of the workflow, since this renders
      your workflow irreproducible on other machines. Use path relative to the
      working directory instead, or make the path configurable via a config
      file.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/configuration.html#configuration
    * Absolute path "/"nextalign" in line 45:
      Do not define absolute paths inside of the workflow, since this renders
      your workflow irreproducible on other machines. Use path relative to the
      working directory instead, or make the path configurable via a config
      file.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/configuration.html#configuration
    * Absolute path "/"nextalign" in line 46:
      Do not define absolute paths inside of the workflow, since this renders
      your workflow irreproducible on other machines. Use path relative to the
      working directory instead, or make the path configurable via a config
      file.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/configuration.html#configuration
    * Absolute path "/"nextalign" in line 61:
      Do not define absolute paths inside of the workflow, since this renders
      your workflow irreproducible on other machines. Use path relative to the
      working directory instead, or make the path configurable via a config
      file.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/configuration.html#configuration
    * Absolute path "/"nextalign" in line 65:
      Do not define absolute paths inside of the workflow, since this renders
      your workflow irreproducible on other machines. Use path relative to the
      working directory instead, or make the path configurable via a config
      file.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/configuration.html#configuration
    * Absolute path "/f" in line 84:
      Do not define absolute paths inside of the workflow, since this renders
      your workflow irreproducible on other machines. Use path relative to the
      working directory instead, or make the path configurable via a config
      file.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/configuration.html#configuration
    * Absolute path "/"nextalign" in line 85:
      Do not define absolute paths inside of the workflow, since this renders
      your workflow irreproducible on other machines. Use path relative to the
      working directory instead, or make the path configurable via a config
      file.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/configuration.html#configuration
    * Absolute path "/^>/{{p=seen[$0]++}}!p" in line 96:
      Do not define absolute paths inside of the workflow, since this renders
      your workflow irreproducible on other machines. Use path relative to the
      working directory instead, or make the path configurable via a config
      file.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/configuration.html#configuration

Lints for snakefile /tmp/tmphuxussnv/PathoGenOmics-Lab-VIPERA-2b93521/workflow/rules/report.smk:
    * Absolute path "/f" in line 40:
      Do not define absolute paths inside of the workflow, since this renders
      your workflow irreproducible on other machines. Use path relative to the
      working directory instead, or make the path configurable via a config
      file.

... (truncated)

Formatting results

[DEBUG] 
[DEBUG] In file "/tmp/tmphuxussnv/PathoGenOmics-Lab-VIPERA-2b93521/workflow/rules/report.smk":  Formatted content is different from original
[DEBUG] 
[DEBUG] In file "/tmp/tmphuxussnv/PathoGenOmics-Lab-VIPERA-2b93521/workflow/rules/common.smk":  Formatted content is different from original
[DEBUG] 
[DEBUG] In file "/tmp/tmphuxussnv/PathoGenOmics-Lab-VIPERA-2b93521/workflow/rules/fetch.smk":  Formatted content is different from original
[DEBUG] 
[DEBUG] In file "/tmp/tmphuxussnv/PathoGenOmics-Lab-VIPERA-2b93521/workflow/rules/demix.smk":  Formatted content is different from original
[DEBUG] 
[DEBUG] In file "/tmp/tmphuxussnv/PathoGenOmics-Lab-VIPERA-2b93521/workflow/rules/vaf.smk":  Formatted content is different from original
[DEBUG] 
[DEBUG] In file "/tmp/tmphuxussnv/PathoGenOmics-Lab-VIPERA-2b93521/workflow/rules/fasta.smk":  Formatted content is different from original
[DEBUG] 
[DEBUG] In file "/tmp/tmphuxussnv/PathoGenOmics-Lab-VIPERA-2b93521/workflow/rules/distances.smk":  Formatted content is different from original
[DEBUG] 
[DEBUG] In file "/tmp/tmphuxussnv/PathoGenOmics-Lab-VIPERA-2b93521/workflow/rules/evolution.smk":  Formatted content is different from original
[DEBUG] 
[DEBUG] In file "/tmp/tmphuxussnv/PathoGenOmics-Lab-VIPERA-2b93521/workflow/rules/pangolin.smk":  Formatted content is different from original
[DEBUG] 
[DEBUG] In file "/tmp/tmphuxussnv/PathoGenOmics-Lab-VIPERA-2b93521/workflow/Snakefile":  Formatted content is different from original
[DEBUG] 
[DEBUG] In file "/tmp/tmphuxussnv/PathoGenOmics-Lab-VIPERA-2b93521/workflow/rules/asr.smk":  Formatted content is different from original
[DEBUG] 
[DEBUG] In file "/tmp/tmphuxussnv/PathoGenOmics-Lab-VIPERA-2b93521/workflow/rules/context.smk":  Formatted content is different from original
[INFO] 12 file(s) would be changed 😬

snakefmt version: 0.11.0