solida-core/astra

Snakemake pipeline that performs variant calling (with DeepVariant) and variant annotation (with VEP) starting from tumor BAM files.

Overview

Latest release: None, Last update: 2024-11-21

Linting: failed, Formatting: failed

Deployment

Step 1: Install Snakemake and Snakedeploy

Snakemake and Snakedeploy are best installed via the Mamba package manager (a drop-in replacement for Conda). If you have neither Conda nor Mamba, it is recommended to install Miniforge. More details regarding Mamba can be found in the Mamba documentation.

When using Mamba, run

mamba create -c conda-forge -c bioconda --name snakemake snakemake snakedeploy

to install both Snakemake and Snakedeploy in an isolated environment. For all following commands, ensure that this environment is activated via

conda activate snakemake

Step 2: Deploy workflow

With Snakemake and Snakedeploy installed, the workflow can be deployed as follows. First, create an appropriate project working directory on your system and enter it:

mkdir -p path/to/project-workdir
cd path/to/project-workdir

In all following steps, we will assume that you are inside of that directory. Then run

snakedeploy deploy-workflow https://github.com/solida-core/astra . --tag None

Snakedeploy will create two folders, workflow and config. The former contains the deployment of the chosen workflow as a Snakemake module, the latter contains configuration files which will be modified in the next step in order to configure the workflow to your needs.

Step 3: Configure workflow

To configure the workflow, adapt config/config.yaml to your needs following the instructions in the Configuration section below.

Step 4: Run workflow

The deployment method is controlled using the --software-deployment-method (short --sdm) argument.

To run the workflow with automatic deployment of all required software via conda/mamba, use

snakemake --cores all --sdm conda

To run the workflow using apptainer/singularity, use

snakemake --cores all --sdm apptainer

Snakemake will automatically detect the main Snakefile in the workflow subfolder and execute the workflow module that has been defined by the deployment in step 2.

For further options such as cluster and cloud execution, see the Snakemake documentation.

Step 5: Generate report

After finalizing your data analysis, you can automatically generate an interactive visual HTML report for inspection of results together with parameters and code inside of the browser using

snakemake --report report.zip

Configuration

The following section is imported from the workflow’s config/README.md.

Configuration File: `config/config.yaml`

The config/config.yaml file is the main configuration file for the Astra pipeline. It contains several sections that define paths, resources, and parameters required for pipeline execution. You can modify the values according to your project and computational environment.

1. Samples, Units, and Reheader Files

samples: config/samples.tsv
units: config/units.tsv
reheader: config/reheader.tsv

samples: Path to the samples.tsv file, which defines the list of samples to be processed.
units: Path to the units.tsv file, which specifies the technical units associated with the samples.
reheader: Optional path to a reheader.tsv file, used to modify sample identifiers during processing.

2. Paths

paths:
    workdir: "/path/to/workdir"
    results_dir: "/path/to/results_dir"

workdir: Directory where the pipeline will store intermediate files and logs. Ensure that there is sufficient disk space.
results_dir: Directory where final results will be saved after pipeline execution.

3. Resources

resources:
    reference: "/path/to/reference/reference_genome.fasta"
    regions: "/path/to/reference/regions.bed"

reference: Path to the reference genome file in FASTA format. This will be used for alignment and variant calling.
regions: Path to a BED file specifying genomic regions of interest (optional). This file helps restrict analyses to specific regions of the genome.

4. Parameters

params:
  deepVariant:
    model_type: "WES"  # options [WGS, WES, PACBIO, ONT_R104, HYBRID_PACBIO_ILLUMINA]
  vep:
    resources: "/path/to/vep_resources"
    reference_version: "hg19"  # options [hg19, hg38] or [GRCh37, GRCh38]
    cache_version: "106"

deepVariant.model_type: Defines the model type for the DeepVariant tool. Choose from the following options:
- WGS: Whole genome sequencing
- WES: Whole exome sequencing
- PACBIO: PacBio sequencing
- ONT_R104: Oxford Nanopore R104 model
- HYBRID_PACBIO_ILLUMINA: Hybrid PacBio and Illumina sequencing
vep.resources: Path to the resources required for the Variant Effect Predictor (VEP).
vep.reference_version: Specifies the reference genome version for VEP. Options include hg19, hg38, GRCh37, and GRCh38.
vep.cache_version: Defines the VEP cache version. Make sure it matches the version of the resources.
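
As a rough illustration of the options listed above, the following sketch checks a `params` section against the allowed values before a run. It is not part of the Astra pipeline: the function name `check_params` is hypothetical, and the configuration is written as a plain Python dictionary mirroring the YAML example so that no YAML library is assumed.

```python
# Allowed values copied from the option lists in the configuration example.
ALLOWED_MODEL_TYPES = {"WGS", "WES", "PACBIO", "ONT_R104", "HYBRID_PACBIO_ILLUMINA"}
ALLOWED_REFERENCE_VERSIONS = {"hg19", "hg38", "GRCh37", "GRCh38"}


def check_params(params):
    """Return a list of problems found in the `params` section (hypothetical helper)."""
    problems = []

    model_type = params.get("deepVariant", {}).get("model_type")
    if model_type not in ALLOWED_MODEL_TYPES:
        problems.append(f"deepVariant.model_type: unsupported value {model_type!r}")

    reference_version = params.get("vep", {}).get("reference_version")
    if reference_version not in ALLOWED_REFERENCE_VERSIONS:
        problems.append(f"vep.reference_version: unsupported value {reference_version!r}")

    return problems


# Dictionary form of the YAML example shown in this section.
example = {
    "deepVariant": {"model_type": "WES"},
    "vep": {
        "resources": "/path/to/vep_resources",
        "reference_version": "hg19",
        "cache_version": "106",
    },
}

print(check_params(example))  # prints []
```

The comparison is case-sensitive, so a value such as "wes" would be reported; whether the pipeline itself tolerates other spellings is not stated here.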

Sample Files

The Astra pipeline requires two essential files for defining sample-related information: config/samples.tsv and config/units.tsv.

Units File: `config/units.tsv`

In the units.tsv file, each row represents a unit for a given sample. A unit typically corresponds to a specific sequencing run or file associated with the sample, and it contains paths to the relevant BAM, BAI, and MD5 checksum files. This file is essential for linking the raw sequencing data with the samples defined in samples.tsv.

The file contains 3 tab-separated columns:

sample: The generic sample name. This name will also be listed in the samples.tsv file. A sample can appear multiple times in the units.tsv file if it has multiple units associated with it (e.g., different sequencing runs or lanes).
unit: A unique identifier for the unit. The unit is composed of three parts joined by dots, in the order flowcell_id.lane.sample_id:
- flowcell_id: An identifier for the flowcell or sequencing instrument.
- lane: The lane ID for the specific sequencing run. If the BAM file comes from multiple lanes or if the lane is unknown, the lane parameter can be set to L000.
- sample_id: The same sample ID as in the sample column.
bam: The absolute path to the BAM file for this unit. This is the aligned sequence data.

Example of `units.tsv`:

sample        unit                           bam_path
SampleA       Flowcell1.L001.SampleA       /abs_path/to/data/Flowcell1.L001.SampleA.bam
SampleB       Flowcell2.L002.SampleB       /abs_path/to/data/Flowcell2.L002.SampleB.bam
SampleA       Flowcell3.L000.SampleA       /abs_path/to/data/Flowcell3.L000.SampleA.bam  # Unknown or multiple lanes
SampleC       Flowcell4.L001.SampleC       /abs_path/to/data/Flowcell4.L001.SampleC.bam

SampleA has two units: one from lane L001 and another from lane L000, where L000 indicates that the lane is unknown or that the BAM file is from multiple lanes.
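
The naming convention for units can be checked with a few lines of Python. This is a minimal sketch, not code from the pipeline: the names `Unit` and `parse_unit` are hypothetical, and it assumes that the flowcell and lane parts never contain a dot (a sample ID containing dots would still be kept whole).

```python
from typing import NamedTuple


class Unit(NamedTuple):
    """The three parts of a unit identifier."""
    flowcell_id: str
    lane: str
    sample_id: str


def parse_unit(unit, sample):
    """Split a unit name and verify that it ends with the expected sample ID."""
    # Split on the first two dots only, so the remainder is taken as the sample ID.
    parts = unit.split(".", 2)
    if len(parts) != 3:
        raise ValueError(f"{unit!r} does not have three dot-separated parts")
    parsed = Unit(*parts)
    if parsed.sample_id != sample:
        raise ValueError(f"{unit!r} does not end with sample ID {sample!r}")
    return parsed


print(parse_unit("Flowcell3.L000.SampleA", "SampleA").lane)  # prints L000
```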

Samples File: `config/samples.tsv`

In the config/samples.tsv file, each row contains information for a single sample, including a list of all associated units.

The file has 2 tab-separated columns:

sample: The generic sample name. This should match the name listed in the units.tsv file.
units: A comma-separated list of units (the unit names reported in the units.tsv file) associated with the given sample.

Example of `samples.tsv`:

sample        units
SampleA       Flowcell1.L001.SampleA,Flowcell3.L000.SampleA
SampleB       Flowcell2.L002.SampleB
SampleC       Flowcell4.L001.SampleC

In the example:

SampleA has two units associated with it, Flowcell1.L001.SampleA and Flowcell3.L000.SampleA.
SampleB has one unit, Flowcell2.L002.SampleB.
SampleC has one unit, Flowcell4.L001.SampleC.

This structure ensures that each sample is linked to its respective units, which can come from different sequencing runs, lanes, or flowcells.
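
Because the two files must agree, it can be useful to cross-check them before starting a run. The sketch below is one possible way to do so and is not part of the workflow: the function names are hypothetical, the example tables are embedded as strings rather than read from config/, and only the sample, unit, and units columns are used.

```python
import csv
import io

# Tab-separated copies of the example tables from this section.
UNITS_TSV = (
    "sample\tunit\tbam_path\n"
    "SampleA\tFlowcell1.L001.SampleA\t/abs_path/to/data/Flowcell1.L001.SampleA.bam\n"
    "SampleB\tFlowcell2.L002.SampleB\t/abs_path/to/data/Flowcell2.L002.SampleB.bam\n"
    "SampleA\tFlowcell3.L000.SampleA\t/abs_path/to/data/Flowcell3.L000.SampleA.bam\n"
    "SampleC\tFlowcell4.L001.SampleC\t/abs_path/to/data/Flowcell4.L001.SampleC.bam\n"
)
SAMPLES_TSV = (
    "sample\tunits\n"
    "SampleA\tFlowcell1.L001.SampleA,Flowcell3.L000.SampleA\n"
    "SampleB\tFlowcell2.L002.SampleB\n"
    "SampleC\tFlowcell4.L001.SampleC\n"
)


def read_tsv(text):
    """Parse tab-separated text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def find_mismatches(samples_text, units_text):
    """Return (units missing from units.tsv, units missing from samples.tsv)."""
    declared = {
        (row["sample"], unit.strip())
        for row in read_tsv(samples_text)
        for unit in row["units"].split(",")
    }
    present = {(row["sample"], row["unit"]) for row in read_tsv(units_text)}
    return sorted(declared - present), sorted(present - declared)


print(find_mismatches(SAMPLES_TSV, UNITS_TSV))  # prints ([], [])
```

To check real files, the same functions could be fed the contents of config/samples.tsv and config/units.tsv read with `open(...).read()`.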

Hash File (Optional): `config/reheader.tsv`

The reheader.tsv file contains a mapping between the LIMS (Laboratory Information Management System) identifiers and the client identifiers for each sample. This mapping is used to update or reheader the sample names in the workflow.