snakemake-workflows/rna-seq-kallisto-sleuth
A Snakemake workflow for differential expression analysis of RNA-seq data with Kallisto and Sleuth.
Overview
Latest release: v4.0.0, Last update: 2025-12-18
Linting: linting: passed, Formatting: formatting: passed
Topics: snakemake kallisto sleuth rna-seq differential-expression sciworkflows reproducibility
Wrappers: bio/bwa/index bio/bwa/mem bio/fastp bio/kallisto/index bio/kallisto/quant bio/reference/ensembl-annotation bio/reference/ensembl-sequence bio/samtools/index utils/datavzrd
Deployment
Step 1: Install Snakemake and Snakedeploy
Snakemake and Snakedeploy are best installed via the Conda. It is recommended to install conda via Miniforge. Run
conda create -c conda-forge -c bioconda -c nodefaults --name snakemake snakemake snakedeploy
to install both Snakemake and Snakedeploy in an isolated environment. For all following commands ensure that this environment is activated via
conda activate snakemake
For other installation methods, refer to the Snakemake and Snakedeploy documentation.
Step 2: Deploy workflow
With Snakemake and Snakedeploy installed, the workflow can be deployed as follows. First, create an appropriate project working directory on your system and enter it:
mkdir -p path/to/project-workdir
cd path/to/project-workdir
In all following steps, we will assume that you are inside of that directory. Then run
snakedeploy deploy-workflow https://github.com/snakemake-workflows/rna-seq-kallisto-sleuth . --tag v4.0.0
Snakedeploy will create two folders, workflow and config. The former contains the deployment of the chosen workflow as a Snakemake module, the latter contains configuration files which will be modified in the next step in order to configure the workflow to your needs.
Step 3: Configure workflow
To configure the workflow, adapt config/config.yml to your needs following the instructions below.
Step 4: Run workflow
The deployment method is controlled using the --software-deployment-method (short --sdm) argument.
To run the workflow with automatic deployment of all required software via conda/mamba, use
snakemake --cores all --sdm conda
Snakemake will automatically detect the main Snakefile in the workflow subfolder and execute the workflow module that has been defined by the deployment in step 2.
For further options such as cluster and cloud execution, see the docs.
Step 5: Generate report
After finalizing your data analysis, you can automatically generate an interactive visual HTML report for inspection of results together with parameters and code inside of the browser using
snakemake --report report.zip
Configuration
The following section is imported from the workflow’s config/README.md.
General settings
To configure this workflow, modify the following files to reflect your dataset and differential expression analysis model:
config/samples.tsv: samples sheet with covariates and conditionsconfig/units.tsv: (sequencing) units sheet with raw data pathsconfig/config.yaml: general workflow configuration and differential expression model setup
For the samples.tsv and units.tsv, we explain the expected columns right here, in this README.md file.
For the config.yaml file, all entries are explained in detail in its comments.
samples sheet
For each biological sample, add a line to the sample sheet in config/samples.tsv.
The column sample is required and gives the sample name.
Additional columns can specify covariates (including batch effects) and conditions.
Please ensure that these column names can be interpreted as proper variable names.
Especially, they should not start with numeric chars (1vs2) or special (non-letter) symbols (+compound_x).
These columns can then be used in the diffexp: models: specification section in config/config.yaml (see below).
Missing values can be specified by empty columns or by writing NA.
units sheet
For each sample, add one or more sequencing unit lines (runs, lanes or replicates) to the unit sheet in config/units.tsv.
Missing values can be specified by empty columns or by writing NA.
input file options
For each unit, provide exactly one out of the following options for input files:
The path to two paired-end FASTQ files in the columns
fq1,fq2.The path to a single-end FASTQ file in the column
fq1. For single-end data, you also need to specifyfragment_len_meanandfragment_len_sd, which should usually be available from your sequencing provider.The path to a single-end BAM file in the column
bam_singleThe path to a paired-end BAM file in the column
bam_paired
adapter trimming and read filtering
Finally, you can provide settings for the adapter trimming with fastp (see the fastp documentation) via the units.tsv columns fastp_adapters and fastp_extra.
However, if you leave those two columns empty (no whitespace!), fastp will auto-detect adapters and the workflow will set sensible defaults for trimming of RNA-seq data.
If you use this automatic inference, make sure to double-check the Detected read[12] adapter: entries in the resulting fastp HTML report.
This is part of the final snakemake report of the workflow, or can be found in the sample-specific folders under results/trimmed/, once a sample has been processed this far.
If the auto-detection didn’t work at all (empty Detected read[12] adapter: entries), or the Occurrences in the Adapters section are lower than you would expect, please ensure that you find out which adapters were used and configure the adapter trimming manually:
In the column fastp_adapters, you can specify known adapter sequences to be trimmed off by fastp, including the command-line argument for the trimming.
For example, specify the following string in this column: --adapter_sequence=AGATCGGAAGAGCACACGTCTGAACTCCAGTCA --adapter_sequence_r2=AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
If you don’t know the adapters used, leave this empty (an empty string, containing no whitespace), and fastp will auto-detect the adapters that need to be trimmed.
If you want to make the auto-detection explicit for paired-end samples, you can also specify --detect_adapter_for_pe.
In the column fastp_extra, you can specify further fastp command-line settings.
If you leave this empty (an empty string, containing no whitespace), the workflow will set its default:
--length_required 33: This length filtering ensures that the resulting reads are all usable with the recommended k-mer size of30forkallistoquantification.--trim_poly_x --poly_x_min_len 7: This poly-X trimming removes polyA tails if they are 7 nucleotides or longer. It is run after adapter trimming.--trim_poly_g --poly_g_min_len 7: This poly-G trimming removes erroneous G basecalls at the tails of reads, if they are 7 nucleotides or longer. These Gs are artifacts in Illumina data from machines with a one channel or two channel color chemistry. We currently set this by default, because the auto-detection for the respective machines is lacking the latest machine types. When the above-linked pull request is updated and merged, we can remove this and rely on the auto-detection. If you want to specify additional command line options, we recommend always including those parameters in your units.tsv, as well. Here’s the full concatenation for copy-pasting:
--length_required 33 --trim_poly_x --poly_x_min_len 7 --trim_poly_g --poly_g_min_len 7
Lexogen 3’ QuantSeq adapter trimming
For this data, adapter trimming should automatically work as expected with the use of fastp.
The above-listed defaults are equivalent to an adaptation of the Lexogen read preprocessing recommendations for 3’ FWD QuantSeq data with cutadapt.
The fastp equivalents, including minimal deviations from the recommendations, are motivated as follows:
-m: In cutadapt, this is the short version of--minimum-length. Thefastpequivalent is--length_required. In addition, we increase the minimum length to a default of 33 that makes more sense for kallisto quantification.-O: Here,fastpdoesn’t have an equivalent option, so we currently have to live with the suboptimal default of4. This is greater than themin_overlap=3used here; but smaller than the value of7, a threshold that we have found avoids removing randomly matching sequences when combined with the typical Illuminamax_error_rate=0.005.-a "polyA=A{20}": This can be replaced by withfastp’s dedicated option for--trim_poly_xtail removal (which is run after adapter trimming).-a "QUALITY=G{20}": This can be replaced byfastp’s dedicated option for the removal of artifactual trailingGs in Illumina data from machines with a one channel or two channel color chemistry:--trim_poly_g. This is automatically activated for earlier Illumina machine models with this chemistry, but we recommend to activate it manually in thefastp_extracolumn of yourconfig/units.tsvfile for now, as there are newer models that are not auto-detected, yet. Also, we recommend to set--poly_g_min_len 7, to avoid trimming spurious matches of G-only stretches at the end of reads.-n: With the dedicatedfastpoptions getting applied in the right order, this option is not needed any more.-a "r1adapter=A{18}AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC;min_overlap=3;max_error_rate=0.100000": We remove A{18}, as this is handled by--trim_poly_x.fastpuses the slightly highermin_overlapequivalent of4, which is currently hard-coded (and not exposed as a command-line argument). Because of this, we cannot set themax_error_rateto the Illumina error rate of about0.005.-g "r1adapter=AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC;min_overlap=20": This is not needed any more, asfastpsearches the read sequence for adapter sequences from the start of the read (see thefastpadapter search code).--discard-trimmed: We omit this, as adapter sequence removal early in the read will leave short remaining read sequences that are subsequently filtered by--length_required.
config.yaml
This file contains the general workflow configuration and the setup for the differential expression analysis performed by sleuth.
Configurable options should be explained in the comments above the respective entry, so the easiest way to set it up for your workflow is to carefully read through the config/config.yaml file and adjust it to your needs.
If something is unclear, don’t hesitate to file an issue in the rna-seq-kallisto-sleuth GitHub repository.
Linting and formatting
Linting results
All tests passed!
Formatting results
All tests passed!