solida-core/astra
Snakemake pipeline that performs variant calling (with DeepVariant) and variant annotation (with VEP) starting from tumor BAM files.
Overview
Latest release: None, Last update: 2024-11-21
Linting: failed, Formatting: failed
Deployment
Step 1: Install Snakemake and Snakedeploy
Snakemake and Snakedeploy are best installed via the Mamba package manager (a drop-in replacement for Conda). If you have neither Conda nor Mamba, it is recommended to install Miniforge. More details regarding Mamba can be found in the Mamba documentation.
When using Mamba, run
mamba create -c conda-forge -c bioconda --name snakemake snakemake snakedeploy
to install both Snakemake and Snakedeploy in an isolated environment. For all following commands ensure that this environment is activated via
conda activate snakemake
Step 2: Deploy workflow
With Snakemake and Snakedeploy installed, the workflow can be deployed as follows. First, create an appropriate project working directory on your system and enter it:
mkdir -p path/to/project-workdir
cd path/to/project-workdir
In all following steps, we will assume that you are inside of that directory. Then run
snakedeploy deploy-workflow https://github.com/solida-core/astra . --tag None
Snakedeploy will create two folders, workflow and config. The former contains the deployment of the chosen workflow as a Snakemake module, the latter contains configuration files which will be modified in the next step in order to configure the workflow to your needs.
Step 3: Configure workflow
To configure the workflow, adapt config/config.yml to your needs following the instructions below.
Step 4: Run workflow
The deployment method is controlled using the --software-deployment-method (short --sdm) argument.
To run the workflow with automatic deployment of all required software via conda/mamba, use
snakemake --cores all --sdm conda
To run the workflow using apptainer/singularity, use
snakemake --cores all --sdm apptainer
Snakemake will automatically detect the main Snakefile in the workflow subfolder and execute the workflow module that has been defined by the deployment in step 2.
For further options such as cluster and cloud execution, see the Snakemake documentation.
Step 5: Generate report
After finalizing your data analysis, you can automatically generate an interactive visual HTML report for inspection of results together with parameters and code inside of the browser using
snakemake --report report.zip
Configuration
The following section is imported from the workflow’s config/README.md.
The config/config.yaml file is the main configuration file for the Astra pipeline.
It contains several sections that define paths, resources, and parameters required for pipeline execution.
You can modify the values according to your project and computational environment.
samples: config/samples.tsv
units: config/units.tsv
reheader: config/reheader.tsv
- samples: Path to the samples.tsv file, which defines the list of samples to be processed.
- units: Path to the units.tsv file, which specifies the technical units associated with the samples.
- reheader: Optional path to a reheader.tsv file, used to modify sample identifiers during processing.
paths:
  workdir: "/path/to/workdir"
  results_dir: "/path/to/results_dir"
- workdir: Directory where the pipeline will store intermediate files and logs. Ensure that there is sufficient disk space.
- results_dir: Directory where final results will be saved after pipeline execution.
resources:
  reference: "/path/to/reference/reference_genome.fasta"
  regions: "/path/to/reference/regions.bed"
- reference: Path to the reference genome file in FASTA format. This will be used for alignment and variant calling.
- regions: Path to a BED file specifying genomic regions of interest (optional). This file helps restrict analyses to specific regions of the genome.
params:
  deepVariant:
    model_type: "WES" # options [WGS, WES, PACBIO, ONT_R104, HYBRID_PACBIO_ILLUMINA]
  vep:
    resources: "/path/to/vep_resources"
    reference_version: "hg19" # options [hg19, hg38] or [GRCh37, GRCh38]
    cache_version: "106"
- deepVariant.model_type: Defines the model type for the DeepVariant tool. Choose from the following options:
  - WGS: Whole genome sequencing
  - WES: Whole exome sequencing
  - PACBIO: PacBio sequencing
  - ONT_R104: Oxford Nanopore R104 model
  - HYBRID_PACBIO_ILLUMINA: Hybrid PacBio and Illumina sequencing
- vep.resources: Path to the resources required for the Variant Effect Predictor (VEP).
- vep.reference_version: Specifies the reference genome version for VEP. Options include hg19, hg38, GRCh37, and GRCh38.
- vep.cache_version: Defines the VEP cache version. Make sure it matches the version of the resources.
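The allowed values listed above can be checked before launching a run. The following is a minimal sketch in Python, not part of the pipeline itself: the helper name check_params and the error messages are hypothetical, and the params dictionary is assumed to have been loaded from the params section of the configuration file.

```python
# Allowed values, copied from the option lists in the configuration comments.
DEEPVARIANT_MODEL_TYPES = {"WGS", "WES", "PACBIO", "ONT_R104", "HYBRID_PACBIO_ILLUMINA"}
VEP_REFERENCE_VERSIONS = {"hg19", "hg38", "GRCh37", "GRCh38"}


def check_params(params):
    """Return a list of problems found in the params section.

    Hypothetical helper for illustration; the pipeline may validate
    its configuration differently (or not at all).
    """
    problems = []
    model_type = params.get("deepVariant", {}).get("model_type")
    if model_type not in DEEPVARIANT_MODEL_TYPES:
        problems.append(f"unknown deepVariant.model_type: {model_type!r}")
    reference_version = params.get("vep", {}).get("reference_version")
    if reference_version not in VEP_REFERENCE_VERSIONS:
        problems.append(f"unknown vep.reference_version: {reference_version!r}")
    return problems


# The example values from the configuration shown above.
params = {
    "deepVariant": {"model_type": "WES"},
    "vep": {
        "resources": "/path/to/vep_resources",
        "reference_version": "hg19",
        "cache_version": "106",
    },
}
print(check_params(params))  # prints []
```

An empty list means both values are among the documented options; the check does not verify that the VEP cache version matches the resources on disk.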
The Astra pipeline requires two essential files for defining sample-related information: config/samples.tsv and config/units.tsv.
In the units.tsv file, each row represents a unit for a given sample. A unit typically corresponds to a specific sequencing run or file associated with the sample, and it contains paths to the relevant BAM, BAI, and MD5 checksum files. This file is essential for linking the raw sequencing data with the samples defined in samples.tsv.
The file contains 3 tab-separated columns:
- sample: The generic sample name. This name will also be listed in the samples.tsv file. A sample can appear multiple times in the units.tsv file if it has multiple units associated with it (e.g., different sequencing runs or lanes).
- unit: A unique identifier for the unit. The unit is composed of three parts:
  - flowcell_id: An identifier for the flowcell or sequencing instrument.
  - lane: The lane ID for the specific sequencing run. If the BAM file comes from multiple lanes or if the lane is unknown, the lane parameter can be set to L000.
  - sample_id: The same sample ID as in the sample column.
- bam: The absolute path to the BAM file for this unit. This is the aligned sequence data.
sample unit bam_path
SampleA Flowcell1.L001.SampleA /abs_path/to/data/Flowcell1.L001.SampleA.bam
SampleB Flowcell2.L002.SampleB /abs_path/to/data/Flowcell2.L002.SampleB.bam
SampleA Flowcell3.L000.SampleA /abs_path/to/data/Flowcell3.L000.SampleA.bam # Unknown or multiple lanes
SampleC Flowcell4.L001.SampleC /abs_path/to/data/Flowcell4.L001.SampleC.bam
- SampleA has two units: one from lane L001 and another from lane L000, where L000 indicates that the lane is unknown or that the BAM file is from multiple lanes.
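The three-part unit identifier described above can be split and checked against the sample column. This is a minimal sketch, assuming the parts are joined by dots as in the example rows and that none of the parts contains a dot itself; the names Unit, parse_unit and check_units_row are hypothetical and not taken from the pipeline.

```python
from typing import NamedTuple


class Unit(NamedTuple):
    flowcell_id: str
    lane: str
    sample_id: str


def parse_unit(unit: str) -> Unit:
    """Split 'flowcell_id.lane.sample_id' into its three parts."""
    parts = unit.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected flowcell_id.lane.sample_id, got {unit!r}")
    return Unit(*parts)


def check_units_row(sample: str, unit: str) -> Unit:
    """Check that the sample_id part of the unit matches the sample column."""
    parsed = parse_unit(unit)
    if parsed.sample_id != sample:
        raise ValueError(f"unit {unit!r} does not belong to sample {sample!r}")
    return parsed


# Rows taken from the units.tsv example above.
for sample, unit in [
    ("SampleA", "Flowcell1.L001.SampleA"),
    ("SampleA", "Flowcell3.L000.SampleA"),
]:
    parsed = check_units_row(sample, unit)
    print(parsed.flowcell_id, parsed.lane, parsed.sample_id)
```

If flowcell or sample identifiers may contain dots in your data, the split rule would need to be adapted.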
In the config/samples.tsv file, each row contains information for a single sample, including a list of all associated units.
The file has 2 tab-separated columns:
- sample: The generic sample name. This should match the name listed in the units.tsv file.
- units: A comma-separated list of units (the unit names reported in the units.tsv file) associated with the given sample.
sample units
SampleA Flowcell1.L001.SampleA,Flowcell3.L000.SampleA
SampleB Flowcell2.L002.SampleB
SampleC Flowcell4.L001.SampleC
In the example:
- SampleA has two units associated with it, Flowcell1.L001.SampleA and Flowcell3.L000.SampleA.
- SampleB has one unit, Flowcell2.L002.SampleB.
- SampleC has one unit, Flowcell4.L001.SampleC.
This structure ensures that each sample is linked to its respective units, which can come from different sequencing runs, lanes, or flowcells.
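Because samples.tsv lists exactly the units that units.tsv assigns to each sample, its content can be derived from units.tsv. The sketch below illustrates that relationship using only the Python standard library; the function names units_by_sample and samples_tsv are hypothetical, the column headers are those of the examples above, and the trailing comment in the units.tsv example is left out.

```python
import csv
import io
from collections import defaultdict

# The units.tsv example from above, written with explicit tab separators.
UNITS_TSV = "\n".join([
    "sample\tunit\tbam_path",
    "SampleA\tFlowcell1.L001.SampleA\t/abs_path/to/data/Flowcell1.L001.SampleA.bam",
    "SampleB\tFlowcell2.L002.SampleB\t/abs_path/to/data/Flowcell2.L002.SampleB.bam",
    "SampleA\tFlowcell3.L000.SampleA\t/abs_path/to/data/Flowcell3.L000.SampleA.bam",
    "SampleC\tFlowcell4.L001.SampleC\t/abs_path/to/data/Flowcell4.L001.SampleC.bam",
]) + "\n"


def units_by_sample(units_tsv_text):
    """Group unit names by sample, keeping the order of first appearance."""
    reader = csv.DictReader(io.StringIO(units_tsv_text), delimiter="\t")
    grouped = defaultdict(list)
    for row in reader:
        grouped[row["sample"]].append(row["unit"])
    return dict(grouped)


def samples_tsv(grouped):
    """Render the grouping as samples.tsv text (sample, comma-separated units)."""
    lines = ["sample\tunits"]
    for sample, units in grouped.items():
        lines.append(f"{sample}\t{','.join(units)}")
    return "\n".join(lines) + "\n"


print(samples_tsv(units_by_sample(UNITS_TSV)), end="")
```

The same grouping can be used the other way round, to confirm that a hand-written samples.tsv agrees with units.tsv.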
The reheader.tsv file contains a mapping between the LIMS (Laboratory Information Management System) identifiers and the client identifiers for each sample. This mapping is used to update or reheader the sample names in the workflow.
The file has 2 tab-separated columns:
- LIMS: The identifier used by the LIMS system for a sample. This identifier is often used for internal tracking and management purposes.
- Client: The identifier used by the client for the same sample. This name will replace the LIMS identifier at the end of the workflow, for delivering results to the client.
LIMS Client
SampleA C1234
SampleB C5678
SampleC C91011
In the example:
- The sample with LIMS identifier SampleA is associated with client identifier C1234.
- The sample with LIMS identifier SampleB is associated with client identifier C5678.
- The sample with LIMS identifier SampleC is associated with client identifier C91011.
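The LIMS-to-client mapping amounts to a simple lookup table. The following sketch shows one way to read it and apply it; load_reheader and client_name are hypothetical names, and falling back to the LIMS identifier when a sample is missing from the table is an assumption made here, not documented pipeline behaviour.

```python
import csv
import io

# The reheader.tsv example from above, written with explicit tab separators.
REHEADER_TSV = "LIMS\tClient\nSampleA\tC1234\nSampleB\tC5678\nSampleC\tC91011\n"


def load_reheader(text):
    """Read reheader.tsv content into a {LIMS id: client id} dictionary."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["LIMS"]: row["Client"] for row in reader}


def client_name(sample, mapping):
    """Return the client identifier, or the LIMS identifier if unmapped (assumption)."""
    return mapping.get(sample, sample)


mapping = load_reheader(REHEADER_TSV)
print(client_name("SampleA", mapping))  # prints C1234
print(client_name("SampleD", mapping))  # prints SampleD (not in the table)
```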
Linting and formatting
Linting results
EmptyDataError in file /tmp/tmp4j08bsl1/workflow/rules/common.smk, line 13:
No columns to parse from file
File "/tmp/tmp4j08bsl1/workflow/rules/common.smk", line 13, in <module>
File "/home/runner/micromamba/envs/snakemake-workflow-catalog/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1026, in read_csv
File "/home/runner/micromamba/envs/snakemake-workflow-catalog/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 620, in _read
File "/home/runner/micromamba/envs/snakemake-workflow-catalog/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1620, in __init__
File "/home/runner/micromamba/envs/snakemake-workflow-catalog/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1898, in _make_engine
File "/home/runner/micromamba/envs/snakemake-workflow-catalog/lib/python3.12/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 93, in __init__
File "parsers.pyx", line 581, in pandas._libs.parsers.TextReader.__cinit__
Formatting results
[DEBUG]
[DEBUG] In file "/tmp/tmp4j08bsl1/workflow/rules/preprocessing.smk": Formatted content is different from original
[DEBUG]
[DEBUG] In file "/tmp/tmp4j08bsl1/workflow/rules/common.smk": Formatted content is different from original
[DEBUG]
[DEBUG] In file "/tmp/tmp4j08bsl1/workflow/rules/delivery.smk": Formatted content is different from original
[DEBUG]
[DEBUG] In file "/tmp/tmp4j08bsl1/workflow/rules/annotation.smk": Formatted content is different from original
[DEBUG]
[DEBUG] In file "/tmp/tmp4j08bsl1/workflow/rules/metrics.smk": Formatted content is different from original
[DEBUG]
[WARNING] In file "/tmp/tmp4j08bsl1/workflow/rules/calling.smk": Keyword "shell" at line 25 has comments under a value.
PEP8 recommends block comments appear before what they describe
(see https://www.python.org/dev/peps/pep-0008/#id30)
[DEBUG] In file "/tmp/tmp4j08bsl1/workflow/rules/calling.smk": Formatted content is different from original
[DEBUG]
[DEBUG] In file "/tmp/tmp4j08bsl1/workflow/Snakefile": Formatted content is different from original
[INFO] 7 file(s) would be changed 😬
snakefmt version: 0.10.2