merfre/Hull_Microbiome_Cluster_Workflow

Snakemake workflow for microbiome research cluster analysis using fastp, Kraken2, and BIOM.

Overview

Latest release: v1.2, Last update: 2024-05-27

Linting: failed, Formatting: failed

Deployment

Step 1: Install Snakemake and Snakedeploy

Snakemake and Snakedeploy are best installed via the Mamba package manager (a drop-in replacement for Conda). If you have neither Conda nor Mamba, it is recommended to install Miniforge. More details can be found in the Mamba documentation.

When using Mamba, run

mamba create -c conda-forge -c bioconda --name snakemake snakemake snakedeploy

to install both Snakemake and Snakedeploy in an isolated environment. For all following commands ensure that this environment is activated via

conda activate snakemake

Step 2: Deploy workflow

With Snakemake and Snakedeploy installed, the workflow can be deployed as follows. First, create an appropriate project working directory on your system and enter it:

mkdir -p path/to/project-workdir
cd path/to/project-workdir

In all following steps, we will assume that you are inside of that directory. Then run

snakedeploy deploy-workflow https://github.com/merfre/Hull_Microbiome_Cluster_Workflow . --tag v1.2

Snakedeploy will create two folders, workflow and config. The former contains the deployment of the chosen workflow as a Snakemake module; the latter contains configuration files, which will be modified in the next step to configure the workflow to your needs.

Step 3: Configure workflow

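To configure the workflow, adapt the main configuration file in the config folder to your needs following the instructions below. This page gives its name as config/config.yml, while the workflow's own documentation below calls it config.yaml; check which name exists in your deployment.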

Step 4: Run workflow

The deployment method is controlled using the --software-deployment-method (short --sdm) argument.

To run the workflow with automatic deployment of all required software via conda/mamba, use

snakemake --cores all --sdm conda

Snakemake will automatically detect the main Snakefile in the workflow subfolder and execute the workflow module that has been defined by the deployment in step 2.

For further options such as cluster and cloud execution, see the Snakemake documentation.

Step 5: Generate report

After finalizing your data analysis, you can automatically generate an interactive visual HTML report for inspection of results together with parameters and code inside of the browser using

snakemake --report report.zip

Configuration

The following section is imported from the workflow’s config/README.md.

Hull Microbiome Cluster Workflow Configuration

The 'config' directory contains the main configuration file, 'config.yaml', which specifies the samples to analyse and the parameters to use throughout the workflow.

Reading the configuration file

This file is split into four sections:

  1. The first is at the top of the file and contains general information for the workflow, including the location of the desired libraries, the metadata file for the samples, and the software environment to use for analysis.

  2. The second section is titled 'Analysis options' and allows certain analysis steps to be toggled on or off. Each analysis option is described below its title, including which software is used for that step. An option set to 'True' is included in the next workflow run. An option set to 'False' is excluded, and the software described in that step is not used for the next run.

  3. The next section is below the second and labeled 'Database locations'. In this section the location of the reference databases for analysis is specified. For this workflow, there is one reference required, a Kraken2 database for taxonomy assignment.

  4. The final section, located below the third, is titled 'Parameters'. This section contains the list of adjustable parameters used in this workflow. Each parameter has a title, a value assigned to it, and a description. The parameters are organized by the software that uses them. For instance, fastp is used for initial quality control and has six adjustable parameters for its performance.
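
As a rough illustration of how the 'True'/'False' toggles described above could be read once the file is parsed, here is a minimal Python sketch. The dictionary stands in for the parsed configuration, and every key name in it (analysis_options, quality_control, assembly) is a hypothetical placeholder rather than a key taken from the workflow's actual file.

```python
# Hypothetical, trimmed-down stand-in for the parsed configuration file.
# All key names below are illustrative assumptions, not the workflow's real keys.
config = {
    "metadata_file": "config/metadata.tsv",
    "analysis_options": {
        "quality_control": True,
        "assembly": False,
    },
}


def enabled_steps(cfg):
    """Return the names of analysis options set to True, in file order."""
    return [name for name, on in cfg["analysis_options"].items() if on is True]


print(enabled_steps(config))
```

Only options whose value is exactly True are returned, which mirrors the description above: anything set to False is left out of the next run.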

Setting up the configuration file

Before using this workflow, only one entry must be specified: the path to the "metadata_file" for the samples you wish to analyze.

Metadata file requirements

The metadata file supplied for this workflow should be placed in this config directory as a tab-delimited table. Ideally this is a file you have already written to track your experiments, so it likely needs only changes to the column names to be compatible with this workflow. At a minimum, the table needs one row for each sample you wish to analyze and columns that specify:

  1. The run in which the sample was sequenced, in a column titled "Run". This should also be the name of the directory or library that contains the sample within the "resources" folder of this workflow.

  2. The nanopore barcode of the sample, in a column titled "Barcode". This is the barcode number assigned to the sample prior to sequencing and used by Guppy for demultiplexing.

  3. The sample's ID in a column labeled "Sample_ID", which is a unique identifier that will be assigned to this sample's concatenated fastq file and all future results.
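
The column requirements above can be checked before a run with a few lines of Python. The following is a minimal sketch, not part of the workflow: the function name check_metadata and the example rows are made up for illustration, and the only things taken from the text are the tab delimiter, the three column titles, and the requirement that each Sample_ID is unique.

```python
import csv
import io

# Column titles required by the metadata file description above.
REQUIRED_COLUMNS = ("Run", "Barcode", "Sample_ID")


def check_metadata(handle):
    """Validate a tab-delimited metadata table and return its rows as dicts."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"missing required column(s): {', '.join(missing)}")
    rows = list(reader)
    seen = set()
    for row in rows:
        sample_id = row["Sample_ID"]
        if sample_id in seen:
            raise ValueError(f"duplicate Sample_ID: {sample_id}")
        seen.add(sample_id)
    return rows


# Made-up example table; run names, barcodes and IDs are placeholders.
example = "Run\tBarcode\tSample_ID\nrun01\t1\tsampleA\nrun01\t2\tsampleB\n"
rows = check_metadata(io.StringIO(example))
print(len(rows))
```

To check a real file, pass an open file handle instead of the in-memory example, for instance `check_metadata(open(path, newline=""))` with `path` pointing at your own metadata file.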

Linting and formatting

Linting results

/tmp/tmpe2zczg64/merfre-Hull_Microbiome_Cluster_Workflow-dd5584e/workflow/Snakefile:37: SyntaxWarning: invalid escape sequence '\/'
  RUNS = "[^\/]+"
/tmp/tmpe2zczg64/merfre-Hull_Microbiome_Cluster_Workflow-dd5584e/workflow/Snakefile:38: SyntaxWarning: invalid escape sequence '\/'

/tmp/tmpe2zczg64/merfre-Hull_Microbiome_Cluster_Workflow-dd5584e/workflow/Snakefile:39: SyntaxWarning: invalid escape sequence '\/'
  ### Optional analyses ###
/tmp/tmpe2zczg64/merfre-Hull_Microbiome_Cluster_Workflow-dd5584e/workflow/rules/biom_all_samples.smk:50: SyntaxWarning: invalid escape sequence '\#'
/tmp/tmpe2zczg64/merfre-Hull_Microbiome_Cluster_Workflow-dd5584e/workflow/rules/biom_all_samples.smk:64: SyntaxWarning: invalid escape sequence '\#'
/tmp/tmpe2zczg64/merfre-Hull_Microbiome_Cluster_Workflow-dd5584e/workflow/rules/biom_individual.smk:50: SyntaxWarning: invalid escape sequence '\#'
/tmp/tmpe2zczg64/merfre-Hull_Microbiome_Cluster_Workflow-dd5584e/workflow/rules/biom_individual.smk:64: SyntaxWarning: invalid escape sequence '\#'
Lints for snakefile /tmp/tmpe2zczg64/merfre-Hull_Microbiome_Cluster_Workflow-dd5584e/workflow/Snakefile:
    * Path composition with '+' in line 25:
      This becomes quickly unreadable. Usually, it is better to endure some
      redundancy against having a more readable workflow. Hence, just repeat
      common prefixes. If path composition is unavoidable, use pathlib or
      (python >= 3.6) string formatting with f"...".
      Also see:

    * Path composition with '+' in line 16:
      This becomes quickly unreadable. Usually, it is better to endure some
      redundancy against having a more readable workflow. Hence, just repeat
      common prefixes. If path composition is unavoidable, use pathlib or
      (python >= 3.6) string formatting with f"...".
      Also see:


Lints for rule preqc_stats (line 9, /tmp/tmpe2zczg64/merfre-Hull_Microbiome_Cluster_Workflow-dd5584e/workflow/rules/fastp.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files
    * Specify a conda environment or container for each rule.:
      This way, the used software for each specific step is documented, and the
      workflow can be executed on any machine without prerequisites.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#integrated-package-management
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#running-jobs-in-containers

Lints for rule fastp (line 36, /tmp/tmpe2zczg64/merfre-Hull_Microbiome_Cluster_Workflow-dd5584e/workflow/rules/fastp.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files
    * Specify a conda environment or container for each rule.:
      This way, the used software for each specific step is documented, and the
      workflow can be executed on any machine without prerequisites.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#integrated-package-management
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#running-jobs-in-containers

Lints for rule postqc_stats (line 85, /tmp/tmpe2zczg64/merfre-Hull_Microbiome_Cluster_Workflow-dd5584e/workflow/rules/fastp.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files
    * Specify a conda environment or container for each rule.:
      This way, the used software for each specific step is documented, and the
      workflow can be executed on any machine without prerequisites.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#integrated-package-management
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#running-jobs-in-containers

Lints for rule kraken2 (line 9, /tmp/tmpe2zczg64/merfre-Hull_Microbiome_Cluster_Workflow-dd5584e/workflow/rules/kraken2.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files
    * Specify a conda environment or container for each rule.:
      This way, the used software for each specific step is documented, and the
      workflow can be executed on any machine without prerequisites.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#integrated-package-management
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#running-jobs-in-containers
    * Param rcf_input_directory is a prefix of input or output file but hardcoded:
      If this is meant to represent a file path prefix, it will fail when
      running workflow in environments without a shared filesystem. Instead,
      provide a function that infers the appropriate prefix from the input or
      output file, e.g.: lambda w, input: os.path.splitext(input[0])[0]
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#non-file-parameters-for-rules
      https://snakemake.readthedocs.io/en/stable/tutorial/advanced.html#tutorial-input-functions

Lints for rule kraken_to_biom (line 9, /tmp/tmpe2zczg64/merfre-Hull_Microbiome_Cluster_Workflow-dd5584e/workflow/rules/biom_all_samples.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files
    * Specify a conda environment or container for each rule.:
      This way, the used software for each specific step is documented, and the
      workflow can be executed on any machine without prerequisites.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#integrated-package-management
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#running-jobs-in-containers

Lints for rule biom_to_tsv (line 36, /tmp/tmpe2zczg64/merfre-Hull_Microbiome_Cluster_Workflow-dd5584e/workflow/rules/biom_all_samples.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files
    * Specify a conda environment or container for each rule.:
      This way, the used software for each specific step is documented, and the
      workflow can be executed on any machine without prerequisites.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#integrated-package-management
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#running-jobs-in-containers

Lints for rule kraken_to_biom_individual (line 9, /tmp/tmpe2zczg64/merfre-Hull_Microbiome_Cluster_Workflow-dd5584e/workflow/rules/biom_individual.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files
    * Specify a conda environment or container for each rule.:
      This way, the used software for each specific step is documented, and the
      workflow can be executed on any machine without prerequisites.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#integrated-package-management
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#running-jobs-in-containers

Lints for rule biom_to_tsv_individual (line 36, /tmp/tmpe2zczg64/merfre-Hull_Microbiome_Cluster_Workflow-dd5584e/workflow/rules/biom_individual.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files
    * Specify a conda environment or container for each rule.:
      This way, the used software for each specific step is documented, and the
      workflow can be executed on any machine without prerequisites.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#integrated-package-management
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#running-jobs-in-containers

Lints for rule fasta_conversion (line 9, /tmp/tmpe2zczg64/merfre-Hull_Microbiome_Cluster_Workflow-dd5584e/workflow/rules/metaflye.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files
    * Specify a conda environment or container for each rule.:
      This way, the used software for each specific step is documented, and the
      workflow can be executed on any machine without prerequisites.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#integrated-package-management
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#running-jobs-in-containers

Lints for rule metaflye (line 34, /tmp/tmpe2zczg64/merfre-Hull_Microbiome_Cluster_Workflow-dd5584e/workflow/rules/metaflye.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files
    * Specify a conda environment or container for each rule.:
      This way, the used software for each specific step is documented, and the
      workflow can be executed on any machine without prerequisites.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#integrated-package-management
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#running-jobs-in-containers

Lints for rule assembly_stat_report (line 76, /tmp/tmpe2zczg64/merfre-Hull_Microbiome_Cluster_Workflow-dd5584e/workflow/rules/metaflye.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files
    * Specify a conda environment or container for each rule.:
      This way, the used software for each specific step is documented, and the
      workflow can be executed on any machine without prerequisites.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#integrated-package-management
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#running-jobs-in-containers
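
Two of the findings in the linting log above lend themselves to a short Python illustration: the SyntaxWarning for '\/' and the "Path composition with '+'" lint. The sketch below is one possible way to address them, not the workflow author's fix. The RUNS pattern is quoted from the log; the variable names base, run and sample and their values are placeholders chosen for the example.

```python
import re
from pathlib import PurePosixPath

# "\/" in an ordinary string literal triggers the SyntaxWarning shown in the log.
# A forward slash needs no escaping in a regular expression, so the backslash
# can simply be dropped; a raw string is the alternative when one is wanted.
RUNS = "[^/]+"
RUNS_RAW = r"[^\/]+"  # raw string: no warning, matches the same text

for pattern in (RUNS, RUNS_RAW):
    assert re.fullmatch(pattern, "run01") is not None
    assert re.fullmatch(pattern, "run01/extra") is None

# Path composition without '+'. Folder and file names here are placeholders.
base, run, sample = "resources", "run01", "sampleA"
with_fstring = f"{base}/{run}/{sample}.fastq"
with_pathlib = str(PurePosixPath(base) / run / f"{sample}.fastq")
print(with_fstring == with_pathlib)
```

The same raw-string approach would likely silence the '\#' warnings too, although the affected lines of the .smk files are not shown in the log, so that remains an assumption.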

Formatting results

[DEBUG] 
[WARNING] In file "/tmp/tmpe2zczg64/merfre-Hull_Microbiome_Cluster_Workflow-dd5584e/workflow/rules/biom_all_samples.smk":  Keyword "shell" at line 14 has comments under a value.
	PEP8 recommends block comments appear before what they describe
(see https://www.python.org/dev/peps/pep-0008/#id30)
<unknown>:1: SyntaxWarning: invalid escape sequence '\#'
[DEBUG] In file "/tmp/tmpe2zczg64/merfre-Hull_Microbiome_Cluster_Workflow-dd5584e/workflow/rules/biom_all_samples.smk":  Formatted content is different from original
[DEBUG] 
[WARNING] In file "/tmp/tmpe2zczg64/merfre-Hull_Microbiome_Cluster_Workflow-dd5584e/workflow/rules/biom_individual.smk":  Keyword "shell" at line 14 has comments under a value.
	PEP8 recommends block comments appear before what they describe
(see https://www.python.org/dev/peps/pep-0008/#id30)
<unknown>:1: SyntaxWarning: invalid escape sequence '\#'
[DEBUG] In file "/tmp/tmpe2zczg64/merfre-Hull_Microbiome_Cluster_Workflow-dd5584e/workflow/rules/biom_individual.smk":  Formatted content is different from original
[DEBUG] 
<unknown>:1: SyntaxWarning: invalid escape sequence '\/'
<unknown>:1: SyntaxWarning: invalid escape sequence '\/'
<unknown>:1: SyntaxWarning: invalid escape sequence '\/'
[ERROR] In file "/tmp/tmpe2zczg64/merfre-Hull_Microbiome_Cluster_Workflow-dd5584e/workflow/Snakefile":  InvalidPython: Black error:

Cannot parse: 78:0: else:


[DEBUG] In file "/tmp/tmpe2zczg64/merfre-Hull_Microbiome_Cluster_Workflow-dd5584e/workflow/Snakefile":  
[WARNING] In file "/tmp/tmpe2zczg64/merfre-Hull_Microbiome_Cluster_Workflow-dd5584e/workflow/rules/fastp.smk":  Keyword "shell" at line 14 has comments under a value.
	PEP8 recommends block comments appear before what they describe
(see https://www.python.org/dev/peps/pep-0008/#id30)
[WARNING] In file "/tmp/tmpe2zczg64/merfre-Hull_Microbiome_Cluster_Workflow-dd5584e/workflow/rules/fastp.smk":  Keyword "shell" at line 36 has comments under a value.
	PEP8 recommends block comments appear before what they describe
(see https://www.python.org/dev/peps/pep-0008/#id30)
[WARNING] In file "/tmp/tmpe2zczg64/merfre-Hull_Microbiome_Cluster_Workflow-dd5584e/workflow/rules/fastp.smk":  Keyword "shell" at line 59 has comments under a value.
	PEP8 recommends block comments appear before what they describe
(see https://www.python.org/dev/peps/pep-0008/#id30)
[DEBUG] In file "/tmp/tmpe2zczg64/merfre-Hull_Microbiome_Cluster_Workflow-dd5584e/workflow/rules/fastp.smk":  Formatted content is different from original
[DEBUG] 
[WARNING] In file "/tmp/tmpe2zczg64/merfre-Hull_Microbiome_Cluster_Workflow-dd5584e/workflow/rules/metaflye.smk":  Keyword "output" at line 26 has comments under a value.
	PEP8 recommends block comments appear before what they describe
(see https://www.python.org/dev/peps/pep-0008/#id30)
[WARNING] In file "/tmp/tmpe2zczg64/merfre-Hull_Microbiome_Cluster_Workflow-dd5584e/workflow/rules/metaflye.smk":  Keyword "shell" at line 31 has comments under a value.
	PEP8 recommends block comments appear before what they describe
(see https://www.python.org/dev/peps/pep-0008/#id30)
[WARNING] In file "/tmp/tmpe2zczg64/merfre-Hull_Microbiome_Cluster_Workflow-dd5584e/workflow/rules/metaflye.smk":  Keyword "shell" at line 49 has comments under a value.
	PEP8 recommends block comments appear before what they describe
(see https://www.python.org/dev/peps/pep-0008/#id30)
[DEBUG] In file "/tmp/tmpe2zczg64/merfre-Hull_Microbiome_Cluster_Workflow-dd5584e/workflow/rules/metaflye.smk":  Formatted content is different from original
[DEBUG] 
[WARNING] In file "/tmp/tmpe2zczg64/merfre-Hull_Microbiome_Cluster_Workflow-dd5584e/workflow/rules/kraken2.smk":  Keyword "shell" at line 19 has comments under a value.
	PEP8 recommends block comments appear before what they describe
(see https://www.python.org/dev/peps/pep-0008/#id30)
[DEBUG] In file "/tmp/tmpe2zczg64/merfre-Hull_Microbiome_Cluster_Workflow-dd5584e/workflow/rules/kraken2.smk":  Formatted content is different from original
[INFO] 1 file(s) raised parsing errors 🤕
[INFO] 5 file(s) would be changed 😬

snakefmt version: 0.10.2