arog-bioinfo/MAW-Annotation
Overview
Latest release: None, Last update: 2026-06-14
Share link: https://snakemake.github.io/snakemake-workflow-catalog?wf=arog-bioinfo/MAW-Annotation
Quality control: linting: passed formatting: passed
Deployment
Step 1: Install Snakemake and Snakedeploy
Snakemake and Snakedeploy are best installed via the Conda package manager. It is recommended to install conda via Miniforge. Run
conda create -c conda-forge -c bioconda -c nodefaults --name snakemake snakemake snakedeploy
to install both Snakemake and Snakedeploy in an isolated environment. For all following commands ensure that this environment is activated via
conda activate snakemake
For other installation methods, refer to the Snakemake and Snakedeploy documentation.
Step 2: Deploy workflow
With Snakemake and Snakedeploy installed, the workflow can be deployed as follows. First, create an appropriate project working directory on your system and enter it:
mkdir -p path/to/project-workdir
cd path/to/project-workdir
In all following steps, we will assume that you are inside of that directory. Then run
snakedeploy deploy-workflow https://github.com/arog-bioinfo/MAW-Annotation . --tag None
Snakedeploy will create two folders, workflow and config. The former contains the deployment of the chosen workflow as a Snakemake module, the latter contains configuration files which will be modified in the next step in order to configure the workflow to your needs.
Step 3: Configure workflow
To configure the workflow, adapt config/config.yaml to your needs following the instructions below.
Step 4: Run workflow
The deployment method is controlled using the --software-deployment-method (short --sdm) argument.
To run the workflow using apptainer/singularity, use
snakemake --cores all --sdm apptainer
To run the workflow using a combination of conda and apptainer/singularity for software deployment, use
snakemake --cores all --sdm conda apptainer
To run the workflow with automatic deployment of all required software via conda/mamba, use
snakemake --cores all --sdm conda
Snakemake will automatically detect the main Snakefile in the workflow subfolder and execute the workflow module that has been defined by the deployment in step 2.
For further options such as cluster and cloud execution, see the docs.
Step 5: Generate report
After finalizing your data analysis, you can automatically generate an interactive visual HTML report for inspection of results together with parameters and code inside of the browser using
snakemake --report report.zip
Configuration
The following section is imported from the workflow’s config/README.md.
Workflow configuration
The workflow processes one or more Metagenome-Assembled Genomes (MAGs) per run. Configure inputs and tool options in config/config.yaml.
General input
sample_sheet: path to a TSV file containing sample names, input FASTA paths, and domains.
Sample sheet format
The sample sheet is a tab-separated file. Required columns:
sample: unique identifier/name for the MAG or isolate.
path: path to the input genome file in FASTA format (.fasta, .fna, .fa).
domain: annotation path for the sample. Use prok for prokaryotic samples and euk for eukaryotic samples.
The workflow dynamically processes all rows defined in this sheet.
Example:
sample path domain
sample_a data/sample_a.fna prok
sample_b data/sample_b.fasta euk
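A sample sheet can be checked before starting a run. The following is a minimal sketch, not part of the workflow itself: the function name validate_sample_sheet and the specific checks (required columns, unique sample names, the three listed FASTA extensions, and the prok/euk domain values) are assumptions derived from the column descriptions above.

```python
import csv
import io

REQUIRED_COLUMNS = ("sample", "path", "domain")
VALID_DOMAINS = {"prok", "euk"}
FASTA_SUFFIXES = (".fasta", ".fna", ".fa")


def validate_sample_sheet(handle):
    """Return a list of problems found in a tab-separated sample sheet."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]

    problems = []
    seen = set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        sample = (row["sample"] or "").strip()
        path = (row["path"] or "").strip()
        domain = (row["domain"] or "").strip()
        if not sample:
            problems.append(f"line {line_no}: empty sample name")
        elif sample in seen:
            problems.append(f"line {line_no}: duplicate sample '{sample}'")
        seen.add(sample)
        if not path.endswith(FASTA_SUFFIXES):
            problems.append(f"line {line_no}: unexpected extension in '{path}'")
        if domain not in VALID_DOMAINS:
            problems.append(f"line {line_no}: domain must be prok or euk, got '{domain}'")
    return problems


# The example sheet from this section, written with real tab characters.
example = (
    "sample\tpath\tdomain\n"
    "sample_a\tdata/sample_a.fna\tprok\n"
    "sample_b\tdata/sample_b.fasta\teuk\n"
)
print(validate_sample_sheet(io.StringIO(example)))  # prints []
```

To check a file on disk, pass an open handle instead, for example open("config/samples.tsv", newline="").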
Optional QA filtering
qa_filter.enabled: when true, filter samples before annotation targets are expanded.
qa_filter.min_completeness: minimum completeness required for a sample to pass.
qa_filter.max_contamination: maximum contamination allowed for a sample to pass.
qa_filter.checkm2_reports: CheckM2 TSV reports for prokaryotic samples. Reports must include Name, Completeness, and Contamination.
qa_filter.eukcc_reports: EukCC CSV reports for eukaryotic samples. Reports must include bin, completeness, and contamination.
qa_filter.missing_sample: behavior when a sample is absent from QA reports. Supported values are error, keep, and drop.
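The filtering rule described by these options can be sketched as follows. This is an illustration only, not the workflow's actual code: the function name apply_qa_filter and the qa_metrics mapping are hypothetical, and treating the thresholds as inclusive (>= and <=) is an assumption.

```python
def apply_qa_filter(samples, qa_metrics, min_completeness=50.0,
                    max_contamination=10.0, missing_sample="error"):
    """Return the sample names that pass the QA thresholds.

    samples: iterable of sample names from the sample sheet.
    qa_metrics: dict mapping sample name -> (completeness, contamination),
        assumed to have been read from the CheckM2/EukCC reports beforehand.
    """
    if missing_sample not in {"error", "keep", "drop"}:
        raise ValueError(f"unsupported missing_sample value: {missing_sample}")

    kept = []
    for name in samples:
        if name not in qa_metrics:
            # Sample absent from every QA report: follow missing_sample.
            if missing_sample == "error":
                raise KeyError(f"sample '{name}' not found in QA reports")
            if missing_sample == "keep":
                kept.append(name)
            continue
        completeness, contamination = qa_metrics[name]
        # Assumption: thresholds are inclusive.
        if completeness >= min_completeness and contamination <= max_contamination:
            kept.append(name)
    return kept


metrics = {"sample_a": (92.3, 1.5), "sample_b": (41.0, 0.8)}
print(apply_qa_filter(["sample_a", "sample_b", "sample_c"], metrics,
                      missing_sample="drop"))  # prints ['sample_a']
```

With missing_sample set to keep, sample_c would be retained even though no report mentions it; with error, the call would raise instead.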
Prokaryotic annotation path
prodigal.extra: optional extra options string passed to the Prodigal wrapper.
bakta.db: path to the Bakta database directory.
bakta.extra: optional extra options string passed to the Bakta wrapper.
gtdbtk.data_dir: path to the GTDB-Tk reference database directory.
gtdbtk.extra: optional extra options string passed to the GTDB-Tk wrapper.
recognizer_prok.resources_dir: path to the prokaryotic reCOGnizer resources database directory.
recognizer_prok.extra: optional extra options string passed to the prokaryotic reCOGnizer wrapper.
upimapi.db: UPIMAPI built-in database name to use, for example swissprot. Leave empty when using upimapi.db_custom.
upimapi.db_custom: path to a custom UPIMAPI database FASTA. Leave empty when using upimapi.db.
upimapi.resources_dir: path to the UPIMAPI resources database directory.
upimapi.extra: optional extra options string passed to the UPIMAPI wrapper.
upimapi.skip_db_check_if_exists: when true, automatically add --skip-db-check only if the selected UPIMAPI database FASTA already exists in upimapi.resources_dir or upimapi.db_custom exists.
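Because upimapi.db and upimapi.db_custom are alternatives, a small check can catch a configuration that sets both or neither. This is a hedged sketch: the helper select_upimapi_database is hypothetical, and rejecting the case where both values are empty is an assumption, since the description above only says to leave one empty when using the other.

```python
def select_upimapi_database(upimapi_cfg):
    """Return ("builtin", name) or ("custom", path) for a upimapi config section."""
    db = (upimapi_cfg.get("db") or "").strip()
    db_custom = (upimapi_cfg.get("db_custom") or "").strip()
    # Assumption: exactly one of the two options should be non-empty.
    if bool(db) == bool(db_custom):
        raise ValueError("set exactly one of upimapi.db and upimapi.db_custom")
    if db_custom:
        return ("custom", db_custom)
    return ("builtin", db)


print(select_upimapi_database({"db": "swissprot", "db_custom": ""}))
# prints ('builtin', 'swissprot')
```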
Eukaryotic annotation path
metaeuk.db: path to the MetaEuk reference database, such as a UniProt database.
metaeuk.extra: optional extra options string passed to the MetaEuk wrapper.
recognizer_euk.resources_dir: path to the eukaryotic reCOGnizer resources database directory.
recognizer_euk.custom_db: path to a KOG/custom database for eukaryotic reCOGnizer. Leave empty to disable a custom eukaryotic database.
recognizer_euk.extra: optional extra options string passed to the eukaryotic reCOGnizer wrapper.
Thread presets
threads: dictionary containing computational resource presets.
threads.high: thread count for high-resource steps.
threads.medium: thread count for medium-resource steps.
threads.low: thread count for low-resource steps.
Example config
# ====================
# General Input
# ====================
sample_sheet: "config/samples.tsv"

# ====================
# Quality Filtering
# ====================
qa_filter:
  enabled: false
  min_completeness: 50.0
  max_contamination: 10.0
  checkm2_reports: []
  eukcc_reports: []
  missing_sample: "error"

# ====================
# Prokaryotic Annotation
# ====================
# --------------------
# Prodigal
# --------------------
prodigal:
  extra: "-p meta -f gff"

# --------------------
# Bakta
# --------------------
bakta:
  db: "resources/bakta_db/db-light"
  extra: ""

# --------------------
# GTDB-Tk
# --------------------
gtdbtk:
  data_dir: "resources/gtdbtk_db"
  extra: ""

# --------------------
# reCOGnizer Prokaryotic
# --------------------
recognizer_prok:
  resources_dir: "resources/recognizer_db"
  extra: ""

# --------------------
# UPIMAPI
# --------------------
upimapi:
  db: "swissprot"
  db_custom: ""
  resources_dir: "resources/upimapi_db"
  extra: ""
  skip_db_check_if_exists: true

# ====================
# Eukaryotic Annotation
# ====================
# --------------------
# MetaEuk
# --------------------
metaeuk:
  db: "resources/metaeuk_db/uniprot_db"
  extra: ""

# --------------------
# reCOGnizer Eukaryotic
# --------------------
recognizer_euk:
  resources_dir: "resources/recognizer_db"
  custom_db: ""
  extra: ""

# ====================
# Computational Resources
# ====================
threads:
  high: 16
  medium: 8
  low: 1
Workflow parameters
The following table is automatically parsed from the workflow’s config.schema.y(a)ml file.
| Parameter | Type | Description | Required | Default |
|---|---|---|---|---|
| sample_sheet | string | path to sample sheet, mandatory | yes | config/samples.tsv |
| qa_filter | | external CheckM2/EukCC QA filtering applied before target expansion | yes | |
| . enabled | boolean | enable filtering by external QA reports | | false |
| . min_completeness | number | minimum completeness required to keep a sample | | 50.0 |
| . max_contamination | number | maximum contamination allowed to keep a sample | | 10.0 |
| . checkm2_reports | array | CheckM2 TSV report paths for prokaryotic samples | | [] |
| . eukcc_reports | array | EukCC CSV report paths for eukaryotic samples | | [] |
| . missing_sample | string | behavior when a sample is missing from its QA report | | error |
| prodigal | | parameters for Prodigal gene prediction | yes | |
| . extra | string | extra CLI options passed to Prodigal wrapper | | |
| bakta | | parameters for Bakta annotation | yes | |
| . db | string | path to Bakta database directory | yes | |
| . extra | string | extra CLI options passed to Bakta wrapper | | |
| gtdbtk | | parameters for GTDB-Tk classification | yes | |
| . data_dir | string | path to GTDB-Tk database directory | yes | |
| . extra | string | extra CLI options passed to GTDB-Tk wrapper | | |
| metaeuk | | parameters for MetaEuk gene prediction | yes | |
| . db | string | path to MetaEuk reference database | yes | |
| . extra | string | extra CLI options passed to MetaEuk wrapper | | |
| recognizer_prok | | parameters for prokaryotic reCOGnizer domain annotation | yes | |
| . resources_dir | string | path to prokaryotic reCOGnizer database directory | yes | |
| . extra | string | extra CLI options passed to prokaryotic reCOGnizer wrapper | | |
| recognizer_euk | | parameters for eukaryotic reCOGnizer domain annotation | yes | |
| . resources_dir | string | path to eukaryotic reCOGnizer database directory | yes | |
| . custom_db | string | path to KOG/custom database for eukaryotic reCOGnizer; empty disables custom DB | | |
| . extra | string | extra CLI options passed to eukaryotic reCOGnizer wrapper | | |
| upimapi | | parameters for UPIMAPI functional annotation | yes | |
| . db | string | UPIMAPI built-in database name to use (for example, swissprot) | | |
| . db_custom | string | path to a custom UPIMAPI database FASTA; leave empty when using db | | |
| . resources_dir | string | path to UPIMAPI resources database directory | | |
| . extra | string | extra CLI options passed to UPIMAPI wrapper | | |
| . skip_db_check_if_exists | boolean | automatically pass --skip-db-check when the selected UPIMAPI database FASTA already exists in the configured resources directory | | true |
| threads | | computational resources presets | yes | |
| . high | integer | | | |
| . medium | integer | | | |
| . low | integer | | | |
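The entries marked as required in the table can be checked against a configuration before launching a run. The sketch below is illustrative only: the helper missing_required_keys is hypothetical, the REQUIRED_KEYS mapping is transcribed by hand from the table rather than read from the schema file, and the configuration is assumed to be already loaded into a Python dict (for example with a YAML parser).

```python
# Top-level sections marked "yes" in the table, each with its required sub-keys.
REQUIRED_KEYS = {
    "sample_sheet": [],
    "qa_filter": [],
    "prodigal": [],
    "bakta": ["db"],
    "gtdbtk": ["data_dir"],
    "metaeuk": ["db"],
    "recognizer_prok": ["resources_dir"],
    "recognizer_euk": ["resources_dir"],
    "upimapi": [],
    "threads": [],
}


def missing_required_keys(config):
    """Return dotted names of required entries that are absent from config."""
    missing = []
    for section, children in REQUIRED_KEYS.items():
        if section not in config:
            missing.append(section)
            continue
        value = config[section]
        for child in children:
            if not isinstance(value, dict) or child not in value:
                missing.append(f"{section}.{child}")
    return missing


# A deliberately incomplete config: bakta lacks db, and threads is absent.
partial = {
    "sample_sheet": "config/samples.tsv",
    "qa_filter": {"enabled": False},
    "prodigal": {"extra": "-p meta -f gff"},
    "bakta": {"extra": ""},
    "gtdbtk": {"data_dir": "resources/gtdbtk_db"},
    "metaeuk": {"db": "resources/metaeuk_db/uniprot_db"},
    "recognizer_prok": {"resources_dir": "resources/recognizer_db"},
    "recognizer_euk": {"resources_dir": "resources/recognizer_db"},
    "upimapi": {"db": "swissprot"},
}
print(missing_required_keys(partial))  # prints ['bakta.db', 'threads']
```

Snakemake workflows commonly validate the configuration against the schema file at startup, so a check like this is mainly useful for catching mistakes earlier, outside of a run.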
Linting and formatting
Linting results
All tests passed!
Formatting results
All tests passed!