markschl/uvsnake
USEARCH/VSEARCH-based pipeline for amplicon data processing
Overview
Latest release: v0.1.1, Last update: 2024-11-20
Linting: passed, Formatting: passed
Deployment
Step 1: Install Snakemake and Snakedeploy
Snakemake and Snakedeploy are best installed via the Mamba package manager (a drop-in replacement for conda). If you have neither Conda nor Mamba, it is recommended to install Miniforge. More details regarding Mamba can be found here.
When using Mamba, run
```shell
mamba create -c conda-forge -c bioconda --name snakemake snakemake snakedeploy
```
to install both Snakemake and Snakedeploy in an isolated environment. For all following commands ensure that this environment is activated via
```shell
conda activate snakemake
```
Step 2: Deploy workflow
With Snakemake and Snakedeploy installed, the workflow can be deployed as follows. First, create an appropriate project working directory on your system and enter it:
```shell
mkdir -p path/to/project-workdir
cd path/to/project-workdir
```
In all following steps, we will assume that you are inside of that directory. Then run
```shell
snakedeploy deploy-workflow https://github.com/markschl/uvsnake . --tag v0.1.1
```
Snakedeploy will create two folders, `workflow` and `config`. The former contains the deployment of the chosen workflow as a Snakemake module, the latter contains configuration files which will be modified in the next step in order to configure the workflow to your needs.
Step 3: Configure workflow
To configure the workflow, adapt `config/config.yaml` to your needs following the instructions below.
Step 4: Run workflow
The deployment method is controlled using the `--software-deployment-method` (short `--sdm`) argument. To run the workflow with automatic deployment of all required software via `conda`/`mamba`, use
```shell
snakemake --cores all --sdm conda
```
Snakemake will automatically detect the main `Snakefile` in the `workflow` subfolder and execute the workflow module that has been defined by the deployment in step 2.
For further options such as cluster and cloud execution, see the docs.
Step 5: Generate report
After finalizing your data analysis, you can automatically generate an interactive visual HTML report for inspection of results together with parameters and code inside of the browser using
```shell
snakemake --report report.zip
```
Configuration
The following section is imported from the workflow's `config/README.md`.
The configuration lives in the `config` directory and is named `config.yaml`. It roughly looks like this:
```yaml
defaults:
  program: usearch
  maxaccepts: 1
  maxrejects: 1
primers:
  forward:
    - fwd_name: SEQUENCE
  reverse:
    - rev_name: SEQUENCE
  trim_settings:
    min_overlap: 15
    max_error_rate: 0.1
    min_length: 100
merge:
  overlap_ident: 75
  max_diffs: 1000
filter:
  max_error_rate: 0.002
uparse:
  min_size: 2
unoise3:
  min_size: 8
otutab:
  ident_threshold: 97
sintax:
  db: taxonomic_database.fasta.gz
  db_format: utax
  confidence: 0.8
```
The command `snakedeploy deploy-workflow https://github.com/markschl/uvsnake . --tag v0.1.1` will copy a simple configuration template to `config/config.yaml`. All its sections can be modified according to your needs. In the following, we describe the available options.
The `primers` section needs to contain the correct primer sequences:

```yaml
primers:
  forward:
    - fwd_name: SEQUENCE
  reverse:
    - rev_name: SEQUENCE
```
Mixes of primer oligos and multiple primer combinations are possible (see details below).
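For instance, a forward primer mix together with an explicit list of combinations could be configured as follows (primer names and sequences are placeholders; the commented reference configuration further below documents all options):

```yaml
primers:
  forward:
    - fwd_primer1: SEQUENCE1
    # a mix of two oligos that start at the same position in the locus:
    - fwd_primer2: SEQUENCE2, SEQUENCE3
  reverse:
    - rev_primer1: SEQUENCE4
  # only analyze this forward/reverse pairing:
  combinations:
    - fwd_primer1...rev_primer1
```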
The sample file `config/samples.tsv` should contain a list of paths of all demultiplexed FASTQ files to be processed. In the following, we use the tool make_sample_tab, which was created exactly for this purpose:
```shell
make_sample_tab -d path/to/fastq_files/run1 -f simple
```
The message indicates that paired-end Illumina reads were found in the `run1` directory:
```
Automatically inferred sample pattern: 'illumina'
120 samples from 'run1' (paired-end) written to samples_run1_paired.tsv
```
The content of `samples_run1_paired.tsv` may look like this:
```
id	R1	R2
sample1	path/to/fastq_files/run1/sample1_R1.fastq.gz	path/to/fastq_files/run1/sample1_R2.fastq.gz
sample2	path/to/fastq_files/run1/sample2_R1.fastq.gz	path/to/fastq_files/run1/sample2_R2.fastq.gz
(...)
```
We rename the file to `config/samples.tsv`:
```shell
mv samples_run1_paired.tsv config/samples.tsv
```
Alternatively, specify a custom location for the sample file in `config/config.yaml`:
```yaml
sample_file: samples_run1_paired.tsv
```
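For reference, the three-column sample table can be read with a few lines of Python (an illustrative sketch, not part of the workflow):

```python
import csv
import io

# Sketch: parse a sample table laid out like config/samples.tsv
# (illustrative only; here the table is inlined as a string).
sample_tab = io.StringIO(
    "id\tR1\tR2\n"
    "sample1\trun1/sample1_R1.fastq.gz\trun1/sample1_R2.fastq.gz\n"
)
# map each sample id to its (R1, R2) read file pair
samples = {row["id"]: (row["R1"], row["R2"])
           for row in csv.DictReader(sample_tab, delimiter="\t")}
print(samples["sample1"][0])  # → run1/sample1_R1.fastq.gz
```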
The pipeline is configured to use VSEARCH for all steps, but if you want to use USEARCH (`program: usearch`), first obtain the software here, and make sure that it is installed as `usearch` in `$PATH`. Alternatively, specify `usearch_binary: path/to/usearch`.
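A quick way to check whether a `usearch` binary is visible on `$PATH` (a shell sketch; the download itself must be done manually from the USEARCH website):

```shell
# Check whether 'usearch' is on PATH; if not, point the config to the binary.
if command -v usearch >/dev/null 2>&1; then
    echo "usearch found at: $(command -v usearch)"
else
    echo "usearch not in PATH; set 'usearch_binary: /path/to/usearch' in config/config.yaml"
fi
```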
In the following, the structure of `config/config.yaml` is shown along with detailed comments. The following sections are mandatory (described below): `primers`, `merge`, `filter`, `otutab`. The following sections are optional: `usearch_binary`, `sample_file`, `uparse`, `unoise3` and `sintax`.
`uparse`, `unoise3` and `sintax` are needed if the corresponding target rule is actually run.
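The mandatory/optional split can be expressed as a quick sanity check in a few lines of Python (a hypothetical helper for illustration, not part of uvsnake):

```python
# Hypothetical helper illustrating the mandatory/optional config sections
# described above; not part of the uvsnake pipeline.
MANDATORY = {"primers", "merge", "filter", "otutab"}
OPTIONAL = {"defaults", "input", "usearch_binary", "sample_file",
            "uparse", "unoise3", "sintax"}

def check_sections(config: dict) -> list:
    """Return a list of problems found in a parsed config dict."""
    problems = ["missing mandatory section: " + s
                for s in sorted(MANDATORY - config.keys())]
    problems += ["unknown section: " + s
                 for s in sorted(config.keys() - MANDATORY - OPTIONAL)]
    return problems

# Example: a config missing 'otutab' and containing a typo ('sintex')
cfg = {"primers": {}, "merge": {}, "filter": {}, "sintex": {}}
print(check_sections(cfg))
# → ['missing mandatory section: otutab', 'unknown section: sintex']
```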
```yaml
defaults:
  # Program to use by default for read merging, denoising and OTU table construction
  # The output should still be similar between USEARCH and VSEARCH, except for
  # the read merging, where VSEARCH is more conservative (see comment below)
  # This default can still be overridden by the individual program settings
  # (set them to 'usearch' or 'vsearch' instead of 'default')
  program: vsearch  # {vsearch, usearch}
  # Termination options
  # https://drive5.com/usearch/manual/termination_options.html
  # Their values have a strong impact on speed and sensitivity
  # * setting both to 0 searches the whole database
  # * setting to values > 1 results in a more precise, but slower search/clustering
  # These are the default settings of USEARCH and VSEARCH
  # The unoise3, uparse, post_cluster, otutab and sintax commands all
  # accept these settings as well, which overrides these defaults
  maxaccepts: 1
  maxrejects: 1

# Default search settings
# Path to USEARCH binary
# (download here: https://drive5.com/usearch/)
# By default, we assume that the binary is in PATH as 'usearch'
usearch_binary: usearch

# Sample file, which is a three-column file with the following header:
# id  R1  R2
# The R1/R2 sample paths must be absolute, or relative to the
# target directory where the results go (not the Snakemake directory).
# (QIIME2 manifest files are accepted, too)
input:
  sample_file: manifest_paired_run1.tsv

# Primers/primer combinations
primers:
  forward:
    # this is a list with one or several primers (depending on what you have in the run)
    - fwd_primer1: SEQUENCE1
    # Instead of or in addition to using degenerate primers, there can also
    # be a mix of different oligos, which all start at the same position
    # in the locus
    - fwd_primer2: SEQUENCE2, SEQUENCE3
  reverse:
    - rev_primer1: SEQUENCE4
  # (optional) list of primer combinations
  # (if not supplied, all combinations are used)
  # Every combination will obtain its own results directory
  combinations:
    - ITS3-KYO2...ITS4
    - fITS7...ITS4
  # Primer trimming settings
  trim_settings:
    min_overlap: 15
    max_error_rate: 0.15
    # keep only amplicons of at least this length (shorter ones may be adapters)
    min_length: 100

# Paired-end merging
# Note on 'program' setting:
# VSEARCH is more conservative: more reads of potentially bad quality
# will remain unmerged with the message
# "alignment score too low, or score drop too high" regardless of the options
# specified (overlap_ident, max_diffs)
# https://groups.google.com/g/vsearch-forum/c/ojqZYbEyUw4/m/9RMhPQbXAwAJ
merge:
  overlap_ident: 75  # percent
  # max. nucleotide differences
  # here, we don't care about this (setting the number very high),
  # we only rely on overlap identity
  max_diffs: 1000
  # (optional) other settings, overriding above defaults
  # program: usearch

# Read filtering before denoising/clustering
# (done by VSEARCH)
filter:
  # Maximum per-base error rate
  # (e.g. 0.002 per bp means 0.8 errors per 400 bp)
  max_error_rate: 0.002

# UPARSE options (https://doi.org/10.1038/nmeth.2604)
# USEARCH command: https://www.drive5.com/usearch/manual/cmd_cluster_otus.html
# The clustering threshold is fixed at 97%
# (uparse target rule, creates 'otus.fasta')
uparse:
  # min. OTU size (USEARCH has 2 as default, discarding singletons)
  min_size: 2
  # (optional) other settings, overriding above defaults
  # program: usearch
  # maxaccepts: 8
  # maxrejects: 8

# UNOISE3 settings
# USEARCH command: https://www.drive5.com/usearch/manual/cmd_unoise3.html
# (unoise3 target rule, creates 'denoised.fasta')
unoise3:
  # min. zOTU size (USEARCH has 8 as default)
  min_size: 8
  # (optional) other settings, overriding above defaults
  # program: usearch
  # maxaccepts: 8
  # maxrejects: 8

# Mapping options for creation of the OTU table
# for USEARCH: https://www.drive5.com/usearch/manual/cmd_otutab.html
# with VSEARCH, we use the standard -usearch_global
otutab:
  # similarity threshold (USEARCH -otutab uses 97% as default)
  ident_threshold: 97
  # extra: true will create two additional output files (may be large!):
  # * BAM file for inspection of mapped reads (VSEARCH only)
  # * A TSV file (named ..._search.txt.gz) mapping each read to each cluster.
  # They are placed in workdir/otutab_mapping
  extra: true
  # (optional) other settings, overriding above defaults
  # program: usearch
  maxaccepts: 64
  maxrejects: 64

# Taxonomy assignment options
sintax:
  # path to database (can be GZIP-compressed)
  db: unite.fasta.gz
  # database format (example has only kingdom and order defined):
  # * QIIME-style: >id k__Kingdom;p__;o__Order;f__;g__;s__;
  # * UTAX-style headers: >id;tax=k:Kingdom,o:Order;
  db_format: qiime  # {qiime, utax}
  # bootstrap threshold
  confidence: 0.9
  # (optional) other settings, overriding above defaults
  # program: usearch
  # # (only used by USEARCH:)
  # maxaccepts: 8
  # maxrejects: 8
```
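To illustrate the `max_error_rate` filter above: expected-error filtering sums the per-base error probabilities implied by the Phred quality scores, and a read passes if this sum, divided by the read length, stays below the threshold. A minimal sketch of the idea (illustrative only, not the pipeline's actual code):

```python
# Sketch of per-base expected-error filtering (illustrative, not uvsnake's
# actual code): a Phred score Q implies an error probability of 10**(-Q/10),
# and the expected number of errors in a read is the sum of these probabilities.

def expected_errors(quals):
    """Expected number of errors given per-base Phred scores."""
    return sum(10 ** (-q / 10) for q in quals)

def passes_filter(quals, max_error_rate=0.002):
    """Keep the read if its per-base expected error rate is low enough."""
    return expected_errors(quals) / len(quals) <= max_error_rate

# A 400 bp read with uniform Q30 (error probability 0.001 per base) has
# 0.4 expected errors, i.e. a rate of 0.001, which passes the 0.002 threshold.
print(passes_filter([30] * 400))  # → True
```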