monagrland/MB_Pipeline
Metabarcoding Pipeline for Illumina Sequencing Data
Overview
Latest release: v3.0.4, Last update: 2024-05-21
Linting: failed, Formatting: passed
Deployment
Step 1: Install Snakemake and Snakedeploy
Snakemake and Snakedeploy are best installed via the Mamba package manager (a drop-in replacement for conda). If you have neither Conda nor Mamba, it is recommended to install Miniforge. More details regarding Mamba can be found here.
When using Mamba, run
mamba create -c conda-forge -c bioconda --name snakemake snakemake snakedeploy
to install both Snakemake and Snakedeploy in an isolated environment. For all following commands ensure that this environment is activated via
conda activate snakemake
Step 2: Deploy workflow
With Snakemake and Snakedeploy installed, the workflow can be deployed as follows. First, create an appropriate project working directory on your system and enter it:
mkdir -p path/to/project-workdir
cd path/to/project-workdir
In all following steps, we will assume that you are inside of that directory. Then run
snakedeploy deploy-workflow https://github.com/monagrland/MB_Pipeline . --tag v3.0.4
Snakedeploy will create two folders, workflow and config. The former contains the deployment of the chosen workflow as a Snakemake module; the latter contains configuration files which will be modified in the next step in order to configure the workflow to your needs.
Step 3: Configure workflow
To configure the workflow, adapt config/config.yaml to your needs following the instructions below.
Step 4: Run workflow
The deployment method is controlled using the --software-deployment-method (short --sdm) argument.
To run the workflow with automatic deployment of all required software via conda/mamba, use
snakemake --cores all --sdm conda
Snakemake will automatically detect the main Snakefile in the workflow subfolder and execute the workflow module that has been defined by the deployment in step 2.
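Optionally, a dry run can be used to preview the jobs that would be executed before launching the full workflow; this is standard Snakemake usage rather than a pipeline-specific command:
snakemake -n --cores all --sdm conda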
For further options such as cluster and cloud execution, see the docs.
Step 5: Generate report
After finalizing your data analysis, you can automatically generate an interactive visual HTML report for inspection of results together with parameters and code inside of the browser using
snakemake --report report.zip
Configuration
The following section is imported from the workflow’s config/README.md.
To configure the MB_Pipeline metabarcoding workflow, modify the configuration file config.yaml and the input file list reads.tsv in this folder with your desired options and the paths to your input files and reference databases.
Sample names and paths to the input files for the pipeline should be listed in reads.tsv in tab-separated format (see the example after this list), with the following fields:
- sample - Short identifier for the sample (alphanumeric and underscore only)
- lib (optional) - If there is more than one read file/library for a given sample, disambiguate them here, e.g. as libraries 1, 2, ...
- fwd - Path to forward read file, relative to the Snakemake working directory. Absolute paths are not recommended.
- rev - Path to reverse read file if reads are paired-end; leave blank if reads are single-end
Other columns in the TSV file will be ignored.
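For illustration, a minimal reads.tsv could look like the following, assuming a header row with the field names; the sample names and file paths here are hypothetical, and the columns are separated by tabs:

sample      lib  fwd                              rev
sample_01   1    data/sample_01_lib1_R1.fastq.gz  data/sample_01_lib1_R2.fastq.gz
sample_01   2    data/sample_01_lib2_R1.fastq.gz  data/sample_01_lib2_R2.fastq.gz
sample_02   1    data/sample_02_R1.fastq.gz       data/sample_02_R2.fastq.gz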
Pipeline parameters are specified in the config.yaml file in this folder.
Paths in the config file are relative to the working directory that Snakemake is run from; alternatively, absolute paths can be specified but are not recommended.
reads_table: config/reads.tsv
If paired-end reads are used, specify this with the paired key. This key accepts only true or false.
paired: true
Adapter trimming options passed to Cutadapt are under adapter_trimming_options. The 5' primer sequence (5p), 3' primer sequence (3p), and minimum number of overlapping bases to be trimmed (min_overlap) are specified with keywords. Other Cutadapt options can be specified with their command line flags under others_common, the same way one would usually specify them in a Cutadapt command.
adapter_trimming_options:
  5p: "CCNGAYATRGCNTTTYCCNCG"
  3p: "TANACYTCNGGRTGNCCRAARAAYCA"
  min_overlap: 23
  others_common: "--minimum-length 1"
To merge the forward and reverse reads in a read pair, the --fastq_mergepairs argument of the VSEARCH tool is used; other parameters can also be specified and will be passed verbatim to VSEARCH.
merge_options:
- "--fastq_allowmergestagger"
- "--fastq_minovlen 100"
- "--fastq_maxdiffs 15"
- "--fastq_eeout"
Quality filtering parameters will also be passed verbatim to VSEARCH.
filter_options:
- "--fastq_maxee 1.0"
- "--fastq_minlen 200"
- "--fastq_maxlen 500"
- "--fastq_maxns 0"
- "--fasta_width 0"
Exactly identical sequences are removed with VSEARCH --derep_fulllength.
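For reference, a standalone dereplication call with VSEARCH typically looks like the sketch below; the file names are illustrative only, as the pipeline performs this step internally.

# Collapse exact duplicates and record per-sequence abundances in the headers
vsearch --derep_fulllength filtered.fasta --sizeout --output derep.fasta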
Protein-coding sequences (e.g. the mitochondrial cytochrome oxidase I marker sequence) can be processed differently from non-coding sequences (e.g. tRNA or rRNA markers). Instead of filtering for chimeras with UCHIME (see below), putative pseudogenes are removed, i.e. sequences with excessive in-frame stop codons or whose translation does not match an HMM of the target protein.
Activate the coding sequence-specific subworkflow with:
protein_coding: true
The genetic code and an HMM file of the target protein should be supplied to screen translated sequences for pseudogenes; PCR chimeras should also be filtered out in this step. The approach is adapted from Porter & Hajibabaei, 2021.
coding:
  frame: 3 # default for the Leray fragment of mtCOI
  code: 5 # Genetic code, must not be a stopless code
  hmm: null # path to the HMM file, relative to workdir
Entropy ratio-based distance denoising with DnoisE (see below) is only available for coding sequences; the reading frame must be specified.
Denoising is performed by default with UNOISE (Edgar, 2016) as implemented in VSEARCH, but DnoisE (see below) is an option for coding sequences. Specify either unoise or dnoise for the key method under denoising.
denoising:
  method: 'unoise'
  alpha: 5
  minsize: 8
Both methods use the parameters alpha and minsize.
The parameter alpha controls the tradeoff between "sensitivity to small differences against an increase in the number of bad sequences which are wrongly predicted to be good." Higher values of alpha retain more sequences (more sensitive, more bad sequences), whereas lower values retain fewer (less sensitive, fewer bad sequences).
minsize is the minimum number of sequences represented by a cluster after denoising.
Choosing dnoise performs entropy ratio-based distance denoising with DnoisE (Antich et al., 2022) instead of VSEARCH UNOISE.
The expected reading frame of the amplified metabarcoding fragment should be known from the PCR primers used; it denotes the codon position (1, 2, or 3) of the first base in the fragment.
DnoisE can calculate the entropy ratio of codon positions 2 and 3 for different values of the denoising parameter alpha and the minimum cluster size, to help in choosing suitable values for alpha and minsize. Specify the ranges of alpha and minsize values to be tested:
test_entropy_ratio: false # true to test denoising parameter values
dnoise_opts:
  alpha_range: # Range of alpha to test for entropy ratio diagnostics
    [1, 5, 10]
  minsize_range: # Range of minsize to test for entropy ratio diagnostics
    [2, 10]
The pipeline first runs with the default alpha and minsize values (see above), and also produces plots of entropy ratio vs. alpha and minsize for the specified ranges. After reviewing the plots, the user can update the desired values for alpha and minsize under the key denoising and rerun the pipeline if necessary.
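For example, a rerun after inspecting the plots might use an updated configuration along these lines; the alpha and minsize values shown here are purely hypothetical and should be taken from your own diagnostics:

test_entropy_ratio: false
denoising:
  method: 'dnoise'
  alpha: 10   # hypothetical value chosen from the entropy-ratio plots
  minsize: 2  # hypothetical value chosen from the entropy-ratio plots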
Entropy ratio-based denoising with DnoisE is only possible for protein-coding genes.
Non-target sequences (e.g. pseudogenes, non-target amplicons) and technical artefacts (e.g. PCR chimeras) should be removed from the sequences.
Coding sequences are screened with an HMM of the target protein, and for excessive in-frame stop codons, to remove pseudogenes (see above). This procedure should also indirectly remove PCR chimeras.
Non-coding sequences are screened for PCR chimeras with UCHIME de-novo implemented in VSEARCH.
Reads are aligned to the final set of ASVs to calculate abundance per ASV per sample.
community_table_options:
- "--id 0.97"
- "--strand plus"
The phylogenetic tree can be built with either FastTree or IQ-TREE.
treeprog: "fasttree" # Options: fasttree, iqtree
There are two methods for taxonomic classification implemented so far: the SINTAX algorithm as implemented in VSEARCH (sintax), and an experimental heuristic method using a two-step comparison against databases of narrower and broader scope (multilvl).
More than one method can be used: specify the option as a list under class_method:
class_method: ["sintax", "multilvl"]
For the sintax and multilvl methods, the database(s) should contain unaligned reference sequences for the target gene in Fasta format, along with the taxonomic classification in SINTAX format in the sequence header, which is described in the VSEARCH manual:
The reference database must contain taxonomic information in the header of each sequence in the form of a string starting with ";tax=" and followed by a comma-separated list of up to eight taxonomic identifiers. Each taxonomic identifier must start with an indication of the rank by one of the letters d (for domain), k (kingdom), p (phylum), c (class), o (order), f (family), g (genus), or s (species). The letter is followed by a colon (:) and the name of that rank. Commas and semicolons are not allowed in the name of the rank.
Example: ">X80725_S000004313;tax=d:Bacteria,p:Proteobacteria,c:Gammaproteobacteria,o:Enterobacteriales,f:Enterobacteriaceae,g:Escherichia/Shigella,s:Escherichia_coli".
The database file can also be pre-formatted into UDB format with the --makeudb_usearch option in VSEARCH. For large databases, preparing a UDB file is faster, as it avoids re-indexing every time the pipeline is run.
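A UDB file can be prepared in advance with a command along the following lines; the file names are placeholders for your own reference database:

# Index the reference Fasta into UDB format for faster repeated searches
vsearch --makeudb_usearch refdb.fasta --output refdb.udb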
The paths to the reference databases are specified under the key dbpaths, keyed by the classification methods named above (sintax or multilvl), and further keyed by an alias for each database; the alias should not contain spaces or the . character. In the example below, the reference database for the sintax classifier has the alias "test":
dbpaths: # Ref database used by all methods
  sintax:
    test:
      "testdata/db/bold_coi-5p_test.fasta"
The output files will contain the classification method and reference database alias in their filenames.
The alias is necessary so that multiple reference databases can be applied in parallel, e.g. to compare the results when using two different databases curated in different ways. In the example below, the ASVs will be classified with SINTAX against two different databases, named "test" and "test2":
dbpaths:
  sintax:
    test: # Alias for first database
      "testdata/db/bold_coi-5p_test.fasta"
    test2: # Alias for second database
      "path/to/db2.fasta"
If the multilvl classifier is used, two sets of databases are applied: one (or more) of narrower taxonomic scope for the first step, and one of broader taxonomic scope for the second step. These are specified under the database alias with the narrow and broad keys respectively.
dbpaths:
  multilvl:
    test: # alias for the db set
      narrow:
        - "testdata/db/bold_coi-5p_test.fasta"
      broad:
        "testdata/db/bold_coi-5p_test.fasta"
The cutoff scores for accepting results from the various classifiers are given under class_thresholds, keyed by the classification method. multilvl requires two thresholds, for the narrow and broad steps.
class_thresholds:
  sintax: 0.8
  multilvl:
    narrow: 0.90 # threshold to use with narrow DB
    broad: 0.8 # threshold for broad DB