markschl/uvsnake
USEARCH/VSEARCH-based pipeline for amplicon data processing
Overview
Latest release: v0.1.1, Last update: 2024-11-20
Linting: passed, Formatting: passed
Deployment
Step 1: Install Snakemake and Snakedeploy
Snakemake and Snakedeploy are best installed via the Mamba package manager (a drop-in replacement for conda). If you have neither Conda nor Mamba, it is recommended to install Miniforge. More details regarding Mamba can be found here.
When using Mamba, run
```shell
mamba create -c conda-forge -c bioconda --name snakemake snakemake snakedeploy
```
to install both Snakemake and Snakedeploy in an isolated environment. For all following commands ensure that this environment is activated via
```shell
conda activate snakemake
```
Step 2: Deploy workflow
With Snakemake and Snakedeploy installed, the workflow can be deployed as follows. First, create an appropriate project working directory on your system and enter it:
```shell
mkdir -p path/to/project-workdir
cd path/to/project-workdir
```
In all following steps, we will assume that you are inside of that directory. Then run
```shell
snakedeploy deploy-workflow https://github.com/markschl/uvsnake . --tag v0.1.1
```
Snakedeploy will create two folders, `workflow` and `config`. The former contains the deployment of the chosen workflow as a Snakemake module, the latter contains configuration files which will be modified in the next step in order to configure the workflow to your needs.
Step 3: Configure workflow
To configure the workflow, adapt `config/config.yaml` to your needs following the instructions below.
Step 4: Run workflow
The deployment method is controlled using the `--software-deployment-method` (short `--sdm`) argument. To run the workflow with automatic deployment of all required software via `conda`/`mamba`, use
```shell
snakemake --cores all --sdm conda
```
Snakemake will automatically detect the main `Snakefile` in the `workflow` subfolder and execute the workflow module that has been defined by the deployment in step 2.
For further options such as cluster and cloud execution, see the docs.
Step 5: Generate report
After finalizing your data analysis, you can automatically generate an interactive visual HTML report for inspection of results together with parameters and code inside of the browser using
```shell
snakemake --report report.zip
```
Configuration
The following section is imported from the workflow's `config/README.md`.
The configuration lives in the `config` directory and is named `config.yaml`. It roughly looks like this:
```yaml
defaults:
  program: usearch
  maxaccepts: 1
  maxrejects: 1
primers:
  forward:
    - fwd_name: SEQUENCE
  reverse:
    - rev_name: SEQUENCE
  trim_settings:
    min_overlap: 15
    max_error_rate: 0.1
    min_length: 100
merge:
  overlap_ident: 75
  max_diffs: 1000
filter:
  max_error_rate: 0.002
uparse:
  min_size: 2
unoise3:
  min_size: 8
otutab:
  ident_threshold: 97
sintax:
  db: taxonomic_database.fasta.gz
  db_format: utax
  confidence: 0.8
```
The command `snakedeploy deploy-workflow https://github.com/markschl/uvsnake . --tag v0.1.1` will copy a simple configuration template to `config/config.yaml`. All its sections can be modified according to your needs. In the following, we describe the available options.
The `primers` section needs to contain the correct primer sequences:

```yaml
primers:
  forward:
    - fwd_name: SEQUENCE
  reverse:
    - rev_name: SEQUENCE
```
Mixes of primer oligos and multiple primer combinations are possible (see details below).
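For instance, a forward primer mix together with an explicit list of combinations could be configured as follows (primer names and sequences are placeholders; the commented reference configuration further below documents all options):

```yaml
primers:
  forward:
    - fwd_primer1: SEQUENCE1
    # a mix of two oligos that start at the same position in the locus:
    - fwd_primer2: SEQUENCE2, SEQUENCE3
  reverse:
    - rev_primer1: SEQUENCE4
  # only analyze this forward/reverse pairing:
  combinations:
    - fwd_primer1...rev_primer1
```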
The sample file `config/samples.tsv` should contain a list of paths of all demultiplexed FASTQ files to be processed. In the following, we use the tool make_sample_tab, which was created exactly for this purpose:
```shell
make_sample_tab -d path/to/fastq_files/run1 -f simple
```
The message indicates that paired-end Illumina reads were found in the `run1` directory:
```
Automatically inferred sample pattern: 'illumina'
120 samples from 'run1' (paired-end) written to samples_run1_paired.tsv
```
The content of `samples_run1_paired.tsv` may look like this:
```
id	R1	R2
sample1	path/to/fastq_files/run1/sample1_R1.fastq.gz	path/to/fastq_files/run1/sample1_R2.fastq.gz
sample2	path/to/fastq_files/run1/sample2_R1.fastq.gz	path/to/fastq_files/run1/sample2_R2.fastq.gz
(...)
```
We rename the file to `config/samples.tsv`:
```shell
mv samples_run1_paired.tsv config/samples.tsv
```
Alternatively, specify a custom location for the sample file in `config/config.yaml`:
```yaml
sample_file: samples_run1_paired.tsv
```
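For reference, the three-column sample table can be read with a few lines of Python (an illustrative sketch, not part of the workflow):

```python
import csv
import io

# Sketch: parse a sample table laid out like config/samples.tsv
# (illustrative only; here the table is inlined as a string).
sample_tab = io.StringIO(
    "id\tR1\tR2\n"
    "sample1\trun1/sample1_R1.fastq.gz\trun1/sample1_R2.fastq.gz\n"
)
# map each sample id to its (R1, R2) read file pair
samples = {row["id"]: (row["R1"], row["R2"])
           for row in csv.DictReader(sample_tab, delimiter="\t")}
print(samples["sample1"][0])  # → run1/sample1_R1.fastq.gz
```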
The pipeline is configured to use VSEARCH for all steps, but if you want to use USEARCH (`program: usearch`), first obtain the software here, and make sure that it is installed as `usearch` in `$PATH`. Alternatively, specify `usearch_binary: path/to/usearch`.
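A quick way to check whether a `usearch` binary is visible on `$PATH` (a shell sketch; the download itself must be done manually from the USEARCH website):

```shell
# Check whether 'usearch' is on PATH; if not, point the config to the binary.
if command -v usearch >/dev/null 2>&1; then
    echo "usearch found at: $(command -v usearch)"
else
    echo "usearch not in PATH; set 'usearch_binary: /path/to/usearch' in config/config.yaml"
fi
```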
In the following, the structure of `config/config.yaml` is shown along with detailed comments. The following sections are mandatory (described below): `primers`, `merge`, `filter`, `otutab`. The following sections are optional: `usearch_binary`, `sample_file`, `uparse`, `unoise3` and `sintax`.
`uparse`, `unoise3` and `sintax` are needed if the corresponding target rule is actually run.
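The mandatory/optional split can be expressed as a quick sanity check in a few lines of Python (a hypothetical helper for illustration, not part of uvsnake):

```python
# Hypothetical helper illustrating the mandatory/optional config sections
# described above; not part of the uvsnake pipeline.
MANDATORY = {"primers", "merge", "filter", "otutab"}
OPTIONAL = {"defaults", "input", "usearch_binary", "sample_file",
            "uparse", "unoise3", "sintax"}

def check_sections(config: dict) -> list:
    """Return a list of problems found in a parsed config dict."""
    problems = ["missing mandatory section: " + s
                for s in sorted(MANDATORY - config.keys())]
    problems += ["unknown section: " + s
                 for s in sorted(config.keys() - MANDATORY - OPTIONAL)]
    return problems

# Example: a config missing 'otutab' and containing a typo ('sintex')
cfg = {"primers": {}, "merge": {}, "filter": {}, "sintex": {}}
print(check_sections(cfg))
# → ['missing mandatory section: otutab', 'unknown section: sintex']
```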
```yaml
defaults:
  # Program to use by default for read merging, denoising and OTU table construction
  # The output should still be similar between USEARCH and VSEARCH, except for
  # the read merging, where VSEARCH is more conservative (see comment below)
  # This default can still be overridden by the individual program settings
  # (set them to 'usearch' or 'vsearch' instead of 'default')
  program: vsearch  # {vsearch, usearch}
  # Termination options
  # https://drive5.com/usearch/manual/termination_options.html
  # Their values have a strong impact on speed and sensitivity
  # * setting both to 0 searches the whole database
  # * setting to values > 1 results in a more precise, but slower search/clustering
  # These are the default settings of USEARCH and VSEARCH
  # The unoise3, uparse, post_cluster, otutab and sintax commands all
  # accept these settings as well, which overrides these defaults
  maxaccepts: 1
  maxrejects: 1

# Default search settings
# Path to USEARCH binary
# (download here: https://drive5.com/usearch/)
# By default, we assume that the binary is in PATH as 'usearch'
usearch_binary: usearch

# Sample file, which is a three-column file with the following header:
# id  R1  R2
# The R1/R2 sample paths must be absolute, or relative to the
# target directory where the results go (not the Snakemake directory).
# (QIIME2 manifest files are accepted, too)
input:
  sample_file: manifest_paired_run1.tsv

# Primers/primer combinations
primers:
  forward:
    # this is a list with one or several primers (depending on what you have in the run)
    - fwd_primer1: SEQUENCE1
    # Instead of or in addition to using degenerate primers, there can also
    # be a mix of different oligos, which all start at the same position
    # in the locus
    - fwd_primer2: SEQUENCE2, SEQUENCE3
  reverse:
    - rev_primer1: SEQUENCE4
  # (optional) list of primer combinations
  # (if not supplied, all combinations are used)
  # Every combination will obtain its own results directory
  combinations:
    - ITS3-KYO2...ITS4
    - fITS7...ITS4
  # Primer trimming settings
  trim_settings:
    min_overlap: 15
    max_error_rate: 0.15
    # keep only amplicons of at least this length (shorter ones may be adapters)
    min_length: 100

# Paired-end merging
# Note on 'program' setting:
# VSEARCH is more conservative: more reads of potentially bad quality
# will remain unmerged with the message
# "alignment score too low, or score drop too high" regardless of the options
# specified (overlap_ident, max_diffs)
# https://groups.google.com/g/vsearch-forum/c/ojqZYbEyUw4/m/9RMhPQbXAwAJ
merge:
  overlap_ident: 75  # percent
  # max. nucleotide differences
  # here, we don't care about this (setting the number very high),
  # we only rely on overlap identity
  max_diffs: 1000
  # (optional) other settings, overriding above defaults
  # program: usearch

# Read filtering before denoising/clustering
# (done by VSEARCH)
filter:
  # Maximum per-base error rate
  # (e.g. 0.002 per bp means 0.8 errors per 400 bp)
  max_error_rate: 0.002

# UPARSE options (https://doi.org/10.1038/nmeth.2604)
# USEARCH command: https://www.drive5.com/usearch/manual/cmd_cluster_otus.html
# The clustering threshold is fixed at 97%
# (uparse target rule, creates 'otus.fasta')
uparse:
  # min. OTU size (USEARCH has 2 as default, discarding singletons)
  min_size: 2
  # (optional) other settings, overriding above defaults
  # program: usearch
  # maxaccepts: 8
  # maxrejects: 8

# UNOISE3 settings
# USEARCH command: https://www.drive5.com/usearch/manual/cmd_unoise3.html
# (unoise3 target rule, creates 'denoised.fasta')
unoise3:
  # min. zOTU size (USEARCH has 8 as default)
  min_size: 8
  # (optional) other settings, overriding above defaults
  # program: usearch
  # maxaccepts: 8
  # maxrejects: 8

# Mapping options for creation of the OTU table
# for USEARCH: https://www.drive5.com/usearch/manual/cmd_otutab.html
# with VSEARCH, we use the standard -usearch_global
otutab:
  # similarity threshold (USEARCH -otutab uses 97% as default)
  ident_threshold: 97
  # extra: true will create two additional output files (may be large!):
  # * BAM file for inspection of mapped reads (VSEARCH only)
  # * A TSV file (named ..._search.txt.gz) mapping each read to each cluster.
  # They are placed in workdir/otutab_mapping
  extra: true
  # (optional) other settings, overriding above defaults
  # program: usearch
  maxaccepts: 64
  maxrejects: 64

# Taxonomy assignment options
sintax:
  # path to database (can be GZIP-compressed)
  db: unite.fasta.gz
  # database format (example has only kingdom and order defined):
  # * QIIME-style: >id k__Kingdom;p__;o__Order;f__;g__;s__;
  # * UTAX-style headers: >id;tax=k:Kingdom,o:Order;
  db_format: qiime  # {qiime, utax}
  # bootstrap threshold
  confidence: 0.9
  # (optional) other settings, overriding above defaults
  # program: usearch
  # # (only used by USEARCH:)
  # maxaccepts: 8
  # maxrejects: 8
```
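To illustrate the `max_error_rate` filter above: expected-error filtering sums the per-base error probabilities implied by the Phred quality scores, and a read passes if this sum, divided by the read length, stays below the threshold. A minimal sketch of the idea (illustrative only, not the pipeline's actual code):

```python
# Sketch of per-base expected-error filtering (illustrative, not uvsnake's
# actual code): a Phred score Q implies an error probability of 10**(-Q/10),
# and the expected number of errors in a read is the sum of these probabilities.

def expected_errors(quals):
    """Expected number of errors given per-base Phred scores."""
    return sum(10 ** (-q / 10) for q in quals)

def passes_filter(quals, max_error_rate=0.002):
    """Keep the read if its per-base expected error rate is low enough."""
    return expected_errors(quals) / len(quals) <= max_error_rate

# A 400 bp read with uniform Q30 (error probability 0.001 per base) has
# 0.4 expected errors, i.e. a rate of 0.001, which passes the 0.002 threshold.
print(passes_filter([30] * 400))  # → True
```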