cbg-ethz/V-pipe

V-pipe is a pipeline designed for analysing NGS data of short viral genomes

Overview

Topics: ngs snakemake conda biohackeu20 virus sequencing bioinformatics bioinformatics-pipeline biohackcovid20 sars-cov-2 sarscov2 hiv genomics biohackeu21 biohackeu22

Latest release: v3.0.0.pre1, Last update: 2025-01-24

Linting: failed, Formatting: passed

Deployment

Step 1: Install Snakemake and Snakedeploy

Snakemake and Snakedeploy are best installed via the Mamba package manager (a drop-in replacement for Conda). If you have neither Conda nor Mamba, it is recommended to install Miniforge. More details regarding Mamba can be found in the Mamba documentation.

When using Mamba, run

mamba create -c conda-forge -c bioconda --name snakemake snakemake snakedeploy

to install both Snakemake and Snakedeploy in an isolated environment. For all following commands ensure that this environment is activated via

conda activate snakemake

Step 2: Deploy workflow

With Snakemake and Snakedeploy installed, the workflow can be deployed as follows. First, create an appropriate project working directory on your system and enter it:

mkdir -p path/to/project-workdir
cd path/to/project-workdir

In all following steps, we will assume that you are inside of that directory. Then run

snakedeploy deploy-workflow https://github.com/cbg-ethz/V-pipe . --tag v3.0.0.pre1

Snakedeploy will create two folders, workflow and config. The former contains the deployment of the chosen workflow as a Snakemake module, the latter contains configuration files which will be modified in the next step in order to configure the workflow to your needs.

Step 3: Configure workflow

To configure the workflow, adapt config/config.yml to your needs following the instructions below.

Step 4: Run workflow

The deployment method is controlled using the --software-deployment-method (short --sdm) argument.

To run the workflow with automatic deployment of all required software via conda/mamba, use

snakemake --cores all --sdm conda

To run the workflow using a combination of conda and apptainer/singularity for software deployment, use

snakemake --cores all --sdm conda apptainer

Snakemake will automatically detect the main Snakefile in the workflow subfolder and execute the workflow module that has been defined by the deployment in step 2.

For further options such as cluster and cloud execution, see the Snakemake documentation.

Step 5: Generate report

After finalizing your data analysis, you can automatically generate an interactive visual HTML report for inspection of results together with parameters and code inside of the browser using

snakemake --report report.zip

Configuration

The following section is imported from the workflow’s config/README.md.

Configuring V-pipe

In order to start using V-pipe, you need to provide three things:

  1. Samples in a specific directory structure
  2. (optional) TSV file listing the samples
  3. Configuration file

The utils subdirectory provides tools that can assist in importing sample files and structuring them.

Configuration file

The V-pipe workflow is customized using a structured configuration file called config.yaml, config.json or, for backward compatibility, vpipe.config (INI-like format).

This configuration file is a text file with a basic structure composed of sections, properties and values. In YAML or JSON, use the language's associative arrays (dictionaries) on two levels: the first for sections, the second for properties. In the older INI format, sections are written in square brackets, and each property is followed by its value.

Furthermore, additional options can be specified on the command line: use Snakemake's --configfile to pass additional YAML/JSON configuration files, and/or Snakemake's --config to pass sections and properties in YAML flow style/JSON syntax.

Here is an example of config.yaml:

general:
  virus_base_config: hiv

input:
  datadir: samples
  samples_file: config/samples.tsv

output:
  datadir: results
  snv: true
  local: true
  global: false
  visualization: true
  QA: true

At minimum, a valid configuration MUST provide a reference sequence against which to align the short reads from the raw data. This can be done in several ways:

  • by using a virus base config that will provide default presets for specific viruses
  • by directly passing a reference .fasta file in the section input -> property reference that will override the default
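
The minimum requirement above can be checked before launching a run. The following is a minimal sketch, assuming the configuration has already been loaded into a nested dictionary (sections, then properties); the function name `has_reference_source` is hypothetical and not part of V-pipe.

```python
def has_reference_source(config: dict) -> bool:
    """Return True if the config names a virus base config or a reference file."""
    general = config.get("general") or {}
    inputs = config.get("input") or {}
    # Either general -> virus_base_config or input -> reference is enough.
    return bool(general.get("virus_base_config")) or bool(inputs.get("reference"))


# The example config shown earlier selects the hiv base config.
example = {"general": {"virus_base_config": "hiv"}, "input": {"datadir": "samples"}}
print(has_reference_source(example))  # True
```

A configuration that sets neither property fails this check. It does not verify that the named base config or .fasta file actually exists.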

virus base config

We provide virus-specific base configuration files which contain handy defaults for some viruses.

Currently, the following virus base configs are available:

  • hiv: provides HXB2 as a reference sequence for HIV, and sets the default aligner to ngshmmalign.
  • sars-cov-2: provides NC_045512.2 as a reference sequence for SARS-CoV-2, sets the default aligner to bwa and sets the variant calling to be done against the reference instead of the cohort's consensus. In addition, a look-up for the recent versions of the ARTIC protocol is provided; this makes it possible to set a per-sample protocol in the sample table, and to turn on amplicon trimming (see amplicon protocols).

configuration manual

An exhaustive list of the available configuration options, with more information about each, can be found in config.html or online.

legacy V-pipe 1.xx/2.xx users

You can re-use your old configuration from a legacy V-pipe v1.x/2.x installation or from the sars-cov2 branch, provided you keep the following caveats in mind:

  • The older INI-like syntax is still supported for a vpipe.config configuration file.
    • This configuration will be overridden by config.yaml or config.json; you might want to delete those files from your working directory if you are not using them.
  • V-pipe starting from version 2.99.1 follows the Standardized usage rules of the Snakemake Workflow Catalog
    • This defines a newer directory structure
      • the samples TSV table is now expected to be in config/samples.tsv (use the section input -> property samples_file to override).
      • the per-sample output is no longer written in the same samples/ directory as the input, but in a separate directory called results/ (use the section output -> property datadir to override).
      • the cohort-wide output is no longer written in a separate variants/ directory, but at the base of the output datadir, i.e., by default in results/ (use the section output -> property cohortdir to specify a different path relative to the output datadir).
    • Add the following sections and properties to your vpipe.config configuration file to bring back the legacy behaviour:
[input]
datadir=samples
samples_file=samples.tsv

[output]
datadir=samples
cohortdir=../variants
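
To illustrate how the INI-like sections and properties map onto the two-level structure used by the YAML/JSON formats, here is a minimal sketch using Python's standard configparser module. It is not how V-pipe itself reads vpipe.config; the helper name `ini_to_sections` is hypothetical, and all values are kept as plain strings.

```python
import configparser

# The legacy-behaviour snippet from the text above.
LEGACY_INI = """\
[input]
datadir=samples
samples_file=samples.tsv

[output]
datadir=samples
cohortdir=../variants
"""


def ini_to_sections(text: str) -> dict:
    """Parse INI-like text into {section: {property: value}} with string values."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep property names case-sensitive (e.g. QA)
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}


sections = ini_to_sections(LEGACY_INI)
print(sections["output"]["cohortdir"])  # ../variants
```

The resulting dictionary has the same shape as the YAML example earlier in this section: one key per section, each holding its properties.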

As of version 2.99.1, only the analysis of viral sequencing data has been extensively tested and is guaranteed stable. For other more advanced functionality you might want to wait until a future release.

samples tsv

File containing sample unique identifiers and dates as tab-separated values.

Example: here, we have two samples from patient 1 and one sample from patient 2:

patient1	20100113
patient1	20110202
patient2	20081130

By default, V-pipe searches for a file named config/samples.tsv; if this file does not exist, a list of samples is built by searching the contents of the input datadir.
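
The lookup order described above can be sketched as follows. This is an illustration only, not V-pipe's actual code: the function name `list_samples` is hypothetical, and the fallback assumes that each `<datadir>/<sample>/<date>/` directory corresponds to one sample.

```python
import csv
from pathlib import Path


def list_samples(samples_file: str, datadir: str) -> list[tuple[str, str]]:
    """Return (sample, date) pairs from the TSV, or from the directory tree if absent."""
    tsv = Path(samples_file)
    if tsv.is_file():
        with tsv.open(newline="") as handle:
            rows = csv.reader(handle, delimiter="\t")
            # Skip blank or incomplete lines; the first two columns identify a sample.
            return [(row[0], row[1]) for row in rows if len(row) >= 2]
    # Fallback (assumption): every <datadir>/<sample>/<date>/ directory is one sample.
    return sorted(
        (date_dir.parent.name, date_dir.name)
        for date_dir in Path(datadir).glob("*/*")
        if date_dir.is_dir()
    )
```

Calling `list_samples("config/samples.tsv", "samples")` from the project working directory would return one pair per row of the example table above, provided the file exists.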

read-length

The samples' read-length is used for critical steps of the pipeline (e.g., quality filtering). Its value can be set in several ways:

  • by default, V-pipe expects a read-length of 250 bp

  • this default can be globally overridden in the configuration file in section input -> property read_length

    input:
      read_length: 150
  • the samples TSV file can contain an optional third column specifying the read length. This is particularly useful when samples are sequenced using protocols with different read lengths.

    patient1	20100113	150
    patient1	20110202	200
    patient2	20081130	150

    The utils subdirectory contains mass-importer tools that can generate this third column while importing samples.
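
The three ways of setting the read length can be combined into a single lookup. The sketch below assumes that a per-sample value in the third TSV column takes precedence over the global input -> read_length property, which in turn takes precedence over the 250 bp default; this precedence and the function name `resolve_read_length` are assumptions made for illustration.

```python
DEFAULT_READ_LENGTH = 250  # default stated in the text above


def resolve_read_length(row: list[str], config: dict) -> int:
    """Pick the read length for one row of the samples TSV."""
    # 1. A non-empty third column in the samples TSV wins (assumed precedence).
    if len(row) >= 3 and row[2].strip():
        return int(row[2])
    # 2. Otherwise use input -> read_length from the configuration, if set.
    configured = (config.get("input") or {}).get("read_length")
    if configured is not None:
        return int(configured)
    # 3. Otherwise fall back to the documented default.
    return DEFAULT_READ_LENGTH


print(resolve_read_length(["patient1", "20110202", "200"], {}))  # 200
```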

amplicon protocols

Samples can be the result of PCR amplification. This can require some additional processing, e.g., primers might need trimming:

output:
  trim_primers: true

In order to complete these steps, additional information needs to be provided, e.g., a BED file describing the primers to be trimmed.

  • This can be specified globally with several properties in the configuration file in section input:

    input:
      primers_bedfile: references/primers/SARS-CoV-2.primer.bed
      inserts_bedfile: references/primers/SARS-CoV-2.insert.bed
  • The samples TSV file can contain an optional fourth column specifying the protocol:

    • When different samples have been processed with different library protocols, a lookup table with per-protocol specifics (primers BED and FASTA files) can be provided in a YAML file. In references/primers.yaml:
      v41:
        name: SARS-CoV-2 ARTIC V4.1
        inserts_bedfile: references/primers/v41/SARS-CoV-2.insert.bed
        primers_bedfile: references/primers/v41/SARS-CoV-2.primer.bed
      v4:
        name: SARS-CoV-2 ARTIC V4
        inserts_bedfile: references/primers/v4/SARS-CoV-2.insert.bed
        primers_bedfile: references/primers/v4/SARS-CoV-2.primer.bed
      v3:
        name: SARS-CoV-2 ARTIC V3
        inserts_bedfile: references/primers/v3/nCoV-2019.insert.bed
        primers_bedfile: references/primers/v3/nCoV-2019.primer.bed
    • In the configuration file, this lookup table can then be specified in section input -> property protocols_file. In config/config.yaml:
      input:
        protocols_file: references/primers.yaml
    • The short name can now be referenced in the fourth column of the samples TSV file. In config/samples.tsv:
      sample_a	20211108	250	v3
      sample_b	20220214	250	v4

    This is useful if multiple different amplicon schemes have been used over the lifetime of a long-running project, as new variants appear over time with SNVs that require adapting amplicons.

  • A virus base config can provide defaults for either of the above, e.g., sars-cov-2 provides BED files for ARTIC v3, v4 and v4.1.
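
The per-sample protocol lookup described above can be sketched as follows. The dictionary mirrors two entries of the references/primers.yaml example (written as a Python literal so that no YAML library is needed); the function name `primers_for` and the fallback to the global input properties when the fourth column is empty are assumptions for illustration, not V-pipe's actual implementation.

```python
# Two entries copied from the references/primers.yaml example above.
PROTOCOLS = {
    "v4": {
        "name": "SARS-CoV-2 ARTIC V4",
        "inserts_bedfile": "references/primers/v4/SARS-CoV-2.insert.bed",
        "primers_bedfile": "references/primers/v4/SARS-CoV-2.primer.bed",
    },
    "v3": {
        "name": "SARS-CoV-2 ARTIC V3",
        "inserts_bedfile": "references/primers/v3/nCoV-2019.insert.bed",
        "primers_bedfile": "references/primers/v3/nCoV-2019.primer.bed",
    },
}


def primers_for(row: list[str], protocols: dict, global_input: dict):
    """Return the primers BED file for one row of the samples TSV."""
    if len(row) >= 4 and row[3].strip():
        short_name = row[3].strip()
        if short_name not in protocols:
            raise KeyError(f"unknown protocol short name: {short_name!r}")
        return protocols[short_name]["primers_bedfile"]
    # No fourth column: fall back to the global input section (assumption).
    return global_input.get("primers_bedfile")


print(primers_for(["sample_a", "20211108", "250", "v3"], PROTOCOLS, {}))
# references/primers/v3/nCoV-2019.primer.bed
```

Raising an error for an unknown short name, rather than silently skipping trimming, is a design choice of this sketch: a typo in the samples table is then caught early.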

samples

V-pipe expects the input samples to be organized in a two-level directory hierarchy.

  • The first level can be, e.g., patient samples or biological replicates of an experiment.
  • The second level can be, e.g., different sampling dates or different sequencing runs of the same sample.
  • Inside each second-level directory, the sub-directory raw_data/ holds the sequencing data in FASTQ format (optionally compressed with gzip).

For example:

📁samples
├──📁patient1
│  ├──📁20100113
│  │  └──📁raw_data
│  │     ├──🧬patient1_20100113_R1.fastq
│  │     └──🧬patient1_20100113_R2.fastq
│  └──📁20110202
│     └──📁raw_data
│        ├──🧬patient1_20110202_R1.fastq
│        └──🧬patient1_20110202_R2.fastq
└──📁patient2
   └──📁20081130
      └──📁raw_data
         ├──🧬patient2_20081130_R1.fastq.gz
         └──🧬patient2_20081130_R2.fastq.gz

The utils subdirectory contains mass-importer tools to assist you in generating this hierarchy.
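
Before running the workflow, it can help to confirm that every sample follows the hierarchy shown above. The following is a minimal sketch; the function name `missing_raw_data` is hypothetical, and the accepted file suffixes (.fastq and .fastq.gz) are an assumption based on the example tree rather than a complete list of what V-pipe accepts.

```python
from pathlib import Path

# Suffixes taken from the example tree above (assumption: not exhaustive).
FASTQ_SUFFIXES = (".fastq", ".fastq.gz")


def missing_raw_data(datadir: str) -> list[str]:
    """List <sample>/<date> directories that lack FASTQ files in raw_data/."""
    problems = []
    for date_dir in sorted(Path(datadir).glob("*/*")):
        if not date_dir.is_dir():
            continue
        raw = date_dir / "raw_data"
        has_fastq = raw.is_dir() and any(
            entry.name.endswith(FASTQ_SUFFIXES) for entry in raw.iterdir()
        )
        if not has_fastq:
            problems.append(f"{date_dir.parent.name}/{date_dir.name}")
    return problems
```

An empty list from `missing_raw_data("samples")` means every second-level directory contains a raw_data/ folder with at least one FASTQ file.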

Linting and formatting

Linting results

VPIPE_BASEDIR = /tmp/tmpwk1tqad0/cbg-ethz-V-pipe-01c271e/workflow
/tmp/tmpwk1tqad0/cbg-ethz-V-pipe-01c271e/workflow/rules/common.smk:916: SyntaxWarning: invalid escape sequence '\.'
  if config.input["paired"]:
ImportError in file /tmp/tmpwk1tqad0/cbg-ethz-V-pipe-01c271e/workflow/rules/common.smk, line 21:
cannot import name 'load_configfile' from 'snakemake.io' (/home/runner/micromamba/envs/snakemake-workflow-catalog/lib/python3.12/site-packages/snakemake/io.py)
  File "/tmp/tmpwk1tqad0/cbg-ethz-V-pipe-01c271e/workflow/rules/common.smk", line 21, in <module>

Formatting results

None