merfre/Hull_Microbiome_Cluster_Workflow
Snakemake workflow for microbiome research cluster analysis using fastp, Kraken2, and BIOM.
Overview
Latest release: v1.2, Last update: 2024-05-27
Linting: failed, Formatting: failed
Deployment
Step 1: Install Snakemake and Snakedeploy
Snakemake and Snakedeploy are best installed via the Mamba package manager (a drop-in replacement for Conda). If you have neither Conda nor Mamba, it is recommended to install Miniforge. More details can be found in the Mamba documentation.
When using Mamba, run
mamba create -c conda-forge -c bioconda --name snakemake snakemake snakedeploy
to install both Snakemake and Snakedeploy in an isolated environment. For all following commands ensure that this environment is activated via
conda activate snakemake
Step 2: Deploy workflow
With Snakemake and Snakedeploy installed, the workflow can be deployed as follows. First, create an appropriate project working directory on your system and enter it:
mkdir -p path/to/project-workdir
cd path/to/project-workdir
In all following steps, we will assume that you are inside of that directory. Then run
snakedeploy deploy-workflow https://github.com/merfre/Hull_Microbiome_Cluster_Workflow . --tag v1.2
Snakedeploy will create two folders, workflow and config. The former contains the deployment of the chosen workflow as a Snakemake module; the latter contains configuration files, which will be modified in the next step to configure the workflow to your needs.
Step 3: Configure workflow
To configure the workflow, adapt config/config.yml to your needs following the instructions below.
Step 4: Run workflow
The deployment method is controlled using the --software-deployment-method (short --sdm) argument.
To run the workflow with automatic deployment of all required software via conda/mamba, use
snakemake --cores all --sdm conda
Snakemake will automatically detect the main Snakefile in the workflow subfolder and execute the workflow module that has been defined by the deployment in step 2.
For further options such as cluster and cloud execution, see the Snakemake documentation.
Step 5: Generate report
After finalizing your data analysis, you can automatically generate an interactive visual HTML report for inspection of results together with parameters and code inside of the browser using
snakemake --report report.zip
Configuration
The following section is imported from the workflow's config/README.md.
The 'config' directory contains the main configuration file, 'config.yaml', which specifies the samples to analyse and the parameters to use throughout the workflow.
This file is split into four sections:
- The first is at the top of the file and contains general information for the workflow, including the location of the desired libraries, the metadata file for the samples, and the software environment to use for analysis.
- The second section is titled 'Analysis options' and allows certain analysis steps to be toggled on or off. Each analysis option is described below its title, including which software is used for that step. When an option is set to 'True', it is included in the next workflow run. When it is set to 'False', it is excluded and the software described in that step is not used in the next run.
- The third section is labeled 'Database locations' and specifies the location of the reference databases for analysis. This workflow requires one reference: a Kraken2 database for taxonomy assignment.
- The final section is titled 'Parameters' and lists the adjustable parameters used in this workflow. Each parameter has a title, an assigned value, and a description. The parameters are organized by the software that uses them. For instance, fastp is used for initial quality control and has six adjustable parameters.
Only one entry must be specified before using this workflow: the path to the "metadata_file" for the samples you wish to analyze.
The metadata file should be placed in this config directory as a tab-delimited table. Ideally this is a file you already maintain to track your experiments, and it likely only requires changes to the column names for compatibility with this workflow. At a minimum, the table needs one row for each sample you wish to analyze and columns that specify:
- The run the sample was sequenced in, in a column titled "Run". This should also be the name of the directory or library in the "resources" folder of this workflow where the sample is located.
- The nanopore barcode of the sample, in a column titled "Barcode". This is the barcode number assigned to the sample prior to sequencing and used by Guppy for demultiplexing.
- The sample's ID, in a column labeled "Sample_ID". This is a unique identifier that will be assigned to the sample's concatenated fastq file and all future results.
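The column requirements above can be checked before a run with a small script. The following is a minimal sketch, not part of the workflow itself: the function name check_metadata, the example rows (run01, soil_A, soil_B), and the barcode formatting are hypothetical and only illustrate the three required columns.

```python
import csv
import io

# Column names required by the workflow's metadata table
REQUIRED_COLUMNS = ("Run", "Barcode", "Sample_ID")


def check_metadata(handle):
    """Return a list of problems found in a tab-delimited metadata table."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    problems = [f"missing column: {c}" for c in REQUIRED_COLUMNS if c not in header]
    if problems:
        return problems
    seen = set()
    # Data rows start on line 2 because line 1 is the header
    for line_no, row in enumerate(reader, start=2):
        for col in REQUIRED_COLUMNS:
            if not (row[col] or "").strip():
                problems.append(f"line {line_no}: empty {col}")
        sample_id = (row["Sample_ID"] or "").strip()
        if sample_id and sample_id in seen:
            problems.append(f"line {line_no}: duplicate Sample_ID {sample_id}")
        seen.add(sample_id)
    return problems


# Hypothetical example table; real values come from your own experiments
example = "Run\tBarcode\tSample_ID\nrun01\t01\tsoil_A\nrun01\t02\tsoil_B\n"
print(check_metadata(io.StringIO(example)))
```

To check a real file, pass an open file handle instead of the in-memory example, for instance open("config/metadata.tsv", newline=""), where the file name is again an assumption.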
Linting and formatting
Linting results
/tmp/tmpe2zczg64/merfre-Hull_Microbiome_Cluster_Workflow-dd5584e/workflow/Snakefile:37: SyntaxWarning: invalid escape sequence '\/'
RUNS = "[^\/]+"
/tmp/tmpe2zczg64/merfre-Hull_Microbiome_Cluster_Workflow-dd5584e/workflow/Snakefile:38: SyntaxWarning: invalid escape sequence '\/'
/tmp/tmpe2zczg64/merfre-Hull_Microbiome_Cluster_Workflow-dd5584e/workflow/Snakefile:39: SyntaxWarning: invalid escape sequence '\/'
### Optional analyses ###
/tmp/tmpe2zczg64/merfre-Hull_Microbiome_Cluster_Workflow-dd5584e/workflow/rules/biom_all_samples.smk:50: SyntaxWarning: invalid escape sequence '\#'
/tmp/tmpe2zczg64/merfre-Hull_Microbiome_Cluster_Workflow-dd5584e/workflow/rules/biom_all_samples.smk:64: SyntaxWarning: invalid escape sequence '\#'
/tmp/tmpe2zczg64/merfre-Hull_Microbiome_Cluster_Workflow-dd5584e/workflow/rules/biom_individual.smk:50: SyntaxWarning: invalid escape sequence '\#'
/tmp/tmpe2zczg64/merfre-Hull_Microbiome_Cluster_Workflow-dd5584e/workflow/rules/biom_individual.smk:64: SyntaxWarning: invalid escape sequence '\#'
Lints for snakefile /tmp/tmpe2zczg64/merfre-Hull_Microbiome_Cluster_Workflow-dd5584e/workflow/Snakefile:
* Path composition with '+' in line 25:
This becomes quickly unreadable. Usually, it is better to endure some
redundancy against having a more readable workflow. Hence, just repeat
common prefixes. If path composition is unavoidable, use pathlib or
(python >= 3.6) string formatting with f"...".
Also see:
* Path composition with '+' in line 16:
This becomes quickly unreadable. Usually, it is better to endure some
... (truncated)
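The "invalid escape sequence" warnings above arise because Python treats a backslash in an ordinary string literal as the start of an escape, and '\/' is not a recognized one. As a minimal sketch of two common ways to silence the warning for the pattern reported at Snakefile line 37, assuming RUNS is used as a regular expression (the names RUNS_RAW and RUNS_PLAIN and the test strings are illustrative, not taken from the workflow):

```python
import re

# Raw string: the backslash reaches the regex engine unchanged, with no warning
RUNS_RAW = r"[^\/]+"
# Equivalent pattern without escaping, since "/" is not special in Python regexes
RUNS_PLAIN = "[^/]+"

for pattern in (RUNS_RAW, RUNS_PLAIN):
    # A name without a slash matches in full
    assert re.fullmatch(pattern, "run01") is not None
    # A name containing a slash does not
    assert re.fullmatch(pattern, "run01/sub") is None
```

Both forms match the same strings; the raw-string form keeps the original pattern text, while the plain form drops the unnecessary escape.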
Formatting results
[DEBUG]
[WARNING] In file "/tmp/tmpe2zczg64/merfre-Hull_Microbiome_Cluster_Workflow-dd5584e/workflow/rules/biom_all_samples.smk": Keyword "shell" at line 14 has comments under a value.
PEP8 recommends block comments appear before what they describe
(see https://www.python.org/dev/peps/pep-0008/#id30)
<unknown>:1: SyntaxWarning: invalid escape sequence '\#'
[DEBUG] In file "/tmp/tmpe2zczg64/merfre-Hull_Microbiome_Cluster_Workflow-dd5584e/workflow/rules/biom_all_samples.smk": Formatted content is different from original
[DEBUG]
[WARNING] In file "/tmp/tmpe2zczg64/merfre-Hull_Microbiome_Cluster_Workflow-dd5584e/workflow/rules/biom_individual.smk": Keyword "shell" at line 14 has comments under a value.
PEP8 recommends block comments appear before what they describe
(see https://www.python.org/dev/peps/pep-0008/#id30)
<unknown>:1: SyntaxWarning: invalid escape sequence '\#'
[DEBUG] In file "/tmp/tmpe2zczg64/merfre-Hull_Microbiome_Cluster_Workflow-dd5584e/workflow/rules/biom_individual.smk": Formatted content is different from original
[DEBUG]
<unknown>:1: SyntaxWarning: invalid escape sequence '\/'
<unknown>:1: SyntaxWarning: invalid escape sequence '\/'
<unknown>:1: SyntaxWarning: invalid escape sequence '\/'
[ERROR] In file "/tmp/tmpe2zczg64/merfre-Hull_Microbiome_Cluster_Workflow-dd5584e/workflow/Snakefile": InvalidPython: Black error:
Cannot parse: 78:0: else:
... (truncated)