baerlachlan/smk-rnaseq-gatk-variants

Snakemake implementation of the GATK best-practices workflow for RNA-seq short variant discovery

Overview

Latest release: v1.0.3, Last update: 2024-08-22

Linting: failed, Formatting: failed

Deployment

Step 1: Install Snakemake and Snakedeploy

Snakemake and Snakedeploy are best installed via the Mamba package manager (a drop-in replacement for Conda). If you have neither Conda nor Mamba, installing Miniforge is recommended. More details can be found in the Mamba documentation.

When using Mamba, run

mamba create -c conda-forge -c bioconda --name snakemake snakemake snakedeploy

to install both Snakemake and Snakedeploy in an isolated environment. For all following commands, ensure that this environment is activated via

conda activate snakemake

Step 2: Deploy workflow

With Snakemake and Snakedeploy installed, the workflow can be deployed as follows. First, create an appropriate project working directory on your system and enter it:

mkdir -p path/to/project-workdir
cd path/to/project-workdir

In all following steps, we will assume that you are inside of that directory. Then run

snakedeploy deploy-workflow https://github.com/baerlachlan/smk-rnaseq-gatk-variants . --tag v1.0.3

Snakedeploy will create two folders, workflow and config. The former contains the deployment of the chosen workflow as a Snakemake module; the latter contains configuration files, which will be modified in the next step to configure the workflow to your needs.

Step 3: Configure workflow

To configure the workflow, adapt config/config.yaml to your needs following the instructions below.

Step 4: Run workflow

The deployment method is controlled using the --software-deployment-method (short --sdm) argument.

To run the workflow with automatic deployment of all required software via conda/mamba, use

snakemake --cores all --sdm conda

Snakemake will automatically detect the main Snakefile in the workflow subfolder and execute the workflow module that has been defined by the deployment in step 2.

For further options such as cluster and cloud execution, see the Snakemake documentation.

Step 5: Generate report

After finalizing your data analysis, you can automatically generate an interactive visual HTML report for inspection of results together with parameters and code inside of the browser using

snakemake --report report.zip

Configuration

The following section is imported from the workflow’s config/README.md.

Workflow config

The workflow requires configuration by modification of config/config.yaml. Follow the explanations provided as comments in the file.

Sample & unit config

The configuration of samples and units is specified as tab-separated value (.tsv) files. Each .tsv file requires specific columns (see below); extra columns may be present but will not be used.

samples.tsv

The default path for the sample sheet is config/samples.tsv. This may be changed via configuration in config/config.yaml.

samples.tsv requires only one column named sample, which contains the desired names of the samples. Each sample name must be unique and correspond to one physical sample. Biological and technical replicates should be specified as separate samples.
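
As an illustration of the layout described above, a minimal sample sheet could be written as follows. The sample names (ctrl_rep1, ctrl_rep2, treat_rep1) are placeholders, and the file is written to a temporary directory rather than config/ so that nothing existing is overwritten. This is only a sketch under those assumptions, not part of the workflow itself.

```shell
# Write an illustrative sample sheet; the sample names are hypothetical.
demo_dir="$(mktemp -d)"
printf 'sample\n' > "$demo_dir/samples.tsv"
printf '%s\n' ctrl_rep1 ctrl_rep2 treat_rep1 >> "$demo_dir/samples.tsv"

# Sanity check: collect any sample name that occurs more than once
# (the variable stays empty if all names are unique).
duplicates="$(tail -n +2 "$demo_dir/samples.tsv" | sort | uniq -d)"
echo "duplicate sample names: ${duplicates:-none}"
```

Each replicate gets its own row here, matching the requirement that replicates are listed as separate samples.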

units.tsv

The default path for the unit sheet is config/units.tsv. This may be changed via configuration in config/config.yaml.

units.tsv requires four columns, named sample, unit, fq1 and fq2. Each row of the units sheet corresponds to a single sequencing unit. Therefore, for each sample specified in samples.tsv, one or more sequencing units should be present. unit values must be unique within each sample. A common example of an experiment with multiple sequencing units is a sample split across several runs/lanes.

For each unit, the paths to the FASTQ files must be specified in the fq1 and fq2 columns. Both columns must exist; however, the fq2 column may be left empty for single-end sequencing experiments. This is how the workflow determines whether single- or paired-end rules are run.
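
The unit sheet rules above can be sketched as follows. The sample names, unit names and FASTQ paths are hypothetical, and the file is written to a temporary directory rather than config/. The sheet mixes paired-end and single-end rows only to show both cases; whether the workflow accepts both layouts in one sheet is not stated here. The awk check is an illustration of the rules, not the workflow's own validation.

```shell
# Write an illustrative unit sheet; sample names, unit names and FASTQ paths are hypothetical.
demo_dir="$(mktemp -d)"
{
  printf 'sample\tunit\tfq1\tfq2\n'
  printf 'ctrl_rep1\tlane1\tdata/ctrl_rep1_L001_R1.fastq.gz\tdata/ctrl_rep1_L001_R2.fastq.gz\n'
  printf 'ctrl_rep1\tlane2\tdata/ctrl_rep1_L002_R1.fastq.gz\tdata/ctrl_rep1_L002_R2.fastq.gz\n'
  printf 'treat_rep1\tlane1\tdata/treat_rep1_L001.fastq.gz\t\n'
} > "$demo_dir/units.tsv"

# Classify each unit by whether fq2 is empty, and flag repeated (sample, unit) pairs.
layout="$(awk -F '\t' '
  NR == 1 { next }                                        # skip the header row
  seen[$1 FS $2]++ { print "duplicate unit: " $1 "/" $2 } # unit must be unique within a sample
  { print $1 "\t" $2 "\t" ($4 == "" ? "single-end" : "paired-end") }
' "$demo_dir/units.tsv")"
printf '%s\n' "$layout"
```

The sample ctrl_rep1 is split across two lanes and therefore has two rows, while the row with an empty fq2 field is reported as single-end.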

Linting and formatting

Linting results

Lints for rule genome_get (line 1, /tmp/tmpq65g3tiz/baerlachlan-smk-rnaseq-gatk-variants-9ad4331/workflow/rules/refs.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule genome_index (line 26, /tmp/tmpq65g3tiz/baerlachlan-smk-rnaseq-gatk-variants-9ad4331/workflow/rules/refs.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule genome_dict (line 53, /tmp/tmpq65g3tiz/baerlachlan-smk-rnaseq-gatk-variants-9ad4331/workflow/rules/refs.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule annotation_get (line 80, /tmp/tmpq65g3tiz/baerlachlan-smk-rnaseq-gatk-variants-9ad4331/workflow/rules/refs.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule star_index (line 105, /tmp/tmpq65g3tiz/baerlachlan-smk-rnaseq-gatk-variants-9ad4331/workflow/rules/refs.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule known_variants_get (line 134, /tmp/tmpq65g3tiz/baerlachlan-smk-rnaseq-gatk-variants-9ad4331/workflow/rules/refs.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule known_variants_index (line 164, /tmp/tmpq65g3tiz/baerlachlan-smk-rnaseq-gatk-variants-9ad4331/workflow/rules/refs.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule gtf_to_bed (line 1, /tmp/tmpq65g3tiz/baerlachlan-smk-rnaseq-gatk-variants-9ad4331/workflow/rules/intervals.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule bed_to_intervals (line 28, /tmp/tmpq65g3tiz/baerlachlan-smk-rnaseq-gatk-variants-9ad4331/workflow/rules/intervals.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule fastqc_raw (line 1, /tmp/tmpq65g3tiz/baerlachlan-smk-rnaseq-gatk-variants-9ad4331/workflow/rules/fastqc.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule fastqc_trim (line 29, /tmp/tmpq65g3tiz/baerlachlan-smk-rnaseq-gatk-variants-9ad4331/workflow/rules/fastqc.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule fastqc_align (line 57, /tmp/tmpq65g3tiz/baerlachlan-smk-rnaseq-gatk-variants-9ad4331/workflow/rules/fastqc.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule trim_se (line 1, /tmp/tmpq65g3tiz/baerlachlan-smk-rnaseq-gatk-variants-9ad4331/workflow/rules/trim.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule trim_pe (line 30, /tmp/tmpq65g3tiz/baerlachlan-smk-rnaseq-gatk-variants-9ad4331/workflow/rules/trim.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule trim_md5 (line 64, /tmp/tmpq65g3tiz/baerlachlan-smk-rnaseq-gatk-variants-9ad4331/workflow/rules/trim.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files
    * Specify a conda environment or container for each rule.:
      This way, the used software for each specific step is documented, and the
      workflow can be executed on any machine without prerequisites.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#integrated-package-management
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#running-jobs-in-containers

Lints for rule align (line 1, /tmp/tmpq65g3tiz/baerlachlan-smk-rnaseq-gatk-variants-9ad4331/workflow/rules/align.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule align_md5 (line 31, /tmp/tmpq65g3tiz/baerlachlan-smk-rnaseq-gatk-variants-9ad4331/workflow/rules/align.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files
    * Specify a conda environment or container for each rule.:
      This way, the used software for each specific step is documented, and the
      workflow can be executed on any machine without prerequisites.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#integrated-package-management
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#running-jobs-in-containers

Lints for rule assign_read_groups (line 1, /tmp/tmpq65g3tiz/baerlachlan-smk-rnaseq-gatk-variants-9ad4331/workflow/rules/assign_read_groups.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule assign_read_groups_index (line 28, /tmp/tmpq65g3tiz/baerlachlan-smk-rnaseq-gatk-variants-9ad4331/workflow/rules/assign_read_groups.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule assign_read_groups_md5 (line 50, /tmp/tmpq65g3tiz/baerlachlan-smk-rnaseq-gatk-variants-9ad4331/workflow/rules/assign_read_groups.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files
    * Specify a conda environment or container for each rule.:
      This way, the used software for each specific step is documented, and the
      workflow can be executed on any machine without prerequisites.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#integrated-package-management
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#running-jobs-in-containers

Lints for rule group_umis_se (line 1, /tmp/tmpq65g3tiz/baerlachlan-smk-rnaseq-gatk-variants-9ad4331/workflow/rules/group_umis.smk):
    * No log directive defined:

... (truncated)

Formatting results

[DEBUG] In file "/tmp/tmpq65g3tiz/baerlachlan-smk-rnaseq-gatk-variants-9ad4331/workflow/rules/align.smk":  Formatted content is different from original
[INFO] 1 file(s) would be changed 😬
[INFO] 12 file(s) would be left unchanged 🎉

snakefmt version: 0.10.2