snakemake-workflows/rna-longseq-de-isoform
long read differential expression analysis and splice variant analysis
Overview
Latest release: v2.3.0, Last update: 2025-08-13
Linting: failed, Formatting: passed
Topics: differential-expression gene-annotation isoform-quantification
Wrappers: bio/minimap2/aligner bio/minimap2/index bio/qualimap/bamqc bio/samtools/index bio/samtools/sort bio/samtools/stats bio/samtools/view
Deployment
Step 1: Install Snakemake and Snakedeploy
Snakemake and Snakedeploy are best installed via the Mamba package manager (a drop-in replacement for Conda). If you have neither Conda nor Mamba, it is recommended to install Miniforge. More details can be found in the Mamba documentation.
When using Mamba, run
mamba create -c conda-forge -c bioconda --name snakemake snakemake snakedeploy
to install both Snakemake and Snakedeploy in an isolated environment. For all following commands ensure that this environment is activated via
conda activate snakemake
Step 2: Deploy workflow
With Snakemake and Snakedeploy installed, the workflow can be deployed as follows. First, create an appropriate project working directory on your system and enter it:
mkdir -p path/to/project-workdir
cd path/to/project-workdir
In all following steps, we will assume that you are inside of that directory. Then run
snakedeploy deploy-workflow https://github.com/snakemake-workflows/rna-longseq-de-isoform . --tag v2.3.0
Snakedeploy will create two folders, workflow and config. The former contains the deployment of the chosen workflow as a Snakemake module, the latter contains configuration files which will be modified in the next step in order to configure the workflow to your needs.
Step 3: Configure workflow
To configure the workflow, adapt config/config.yml to your needs following the instructions below.
Step 4: Run workflow
The deployment method is controlled using the --software-deployment-method (short --sdm) argument.
To run the workflow with automatic deployment of all required software via conda/mamba, use
snakemake --cores all --sdm conda
Snakemake will automatically detect the main Snakefile in the workflow subfolder and execute the workflow module that has been defined by the deployment in step 2.
For further options such as cluster and cloud execution, see the docs.
Step 5: Generate report
After finalizing your data analysis, you can automatically generate an interactive visual HTML report for inspection of results together with parameters and code inside of the browser using
snakemake --report report.zip
Configuration
The following section is imported from the workflow’s config/README.md.
Transcriptome Differential Expression Workflow
This workflow facilitates the analysis of transcriptomic data obtained from Oxford Nanopore long-read sequencing, including alignment, quantification, differential expression analysis, alternative splicing analysis and quality control. Alignment parameters may need to be adjusted for optimal performance with other sequencing technologies.
Configuration Files
To set up the workflow, modify the following files to reflect your dataset and analysis parameters:
config/samples.csv: Contains sample information and experimental design.
config/config.yml: General workflow configuration and parameter settings.
samples.csv
Each line in samples.csv represents a biological sample with associated metadata. The required columns are:
sample: Unique identifier that matches the sample file in the input directory
condition: Experimental condition or treatment group
batch: Batch of samples
The following columns forward optional metadata:
platform: Sequencing platform
purity: Purity value
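The column requirements above can be checked before starting a run. The following is a minimal sketch, not part of the workflow itself. It assumes that samples.csv is comma-separated with a header row, and the sample names and values in EXAMPLE are hypothetical.

```python
import csv
import io

# Required columns as listed in this section.
REQUIRED_COLUMNS = ("sample", "condition", "batch")


def check_samples(handle):
    """Parse a sample sheet and raise ValueError if it is malformed.

    Assumption: comma-separated values with a header row.
    """
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError("missing required columns: " + ", ".join(missing))
    rows = list(reader)
    names = [row["sample"] for row in rows]
    duplicates = sorted({name for name in names if names.count(name) > 1})
    if duplicates:
        raise ValueError("duplicate sample names: " + ", ".join(duplicates))
    return rows


# Hypothetical sample sheet used only for illustration.
EXAMPLE = """sample,condition,batch,platform,purity
ctrl_1,control,b1,nanopore,0.9
treat_1,treated,b1,nanopore,0.8
"""

rows = check_samples(io.StringIO(EXAMPLE))
print([row["sample"] for row in rows])
```

To check a real file, pass an open file handle, for example `check_samples(open("config/samples.csv", newline=""))`.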
config.yml
The config.yml file contains the main configuration parameters for the workflow.
General Workflow Parameters
inputdir
: Directory containing input samples.
Reference Genome Parameters
Since Salmon requires transcriptomic alignments for quantification, a transcriptome is constructed using genomic and annotation reference data. These reference files can be provided locally or retrieved automatically from NCBI using an accession number.
ref:
species: Name of the species.
genome: Path to the genome file (FASTA format, can be omitted if remote retrieval through accession number is preferred).
annotation: Path to the annotation file (GFF or GTF format, can be omitted if remote retrieval through accession number is preferred).
accession: NCBI accession number (used if local files are not provided).
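The choice between local files and remote retrieval can be illustrated with a short sketch. This is not the workflow's own code. It assumes that local files take precedence only when both genome and annotation are set, and that the accession is used otherwise. The function name reference_source and the example values are hypothetical.

```python
def reference_source(ref):
    """Return "local" or "remote" for a ref configuration mapping.

    Assumption: both genome and annotation must be given to use local
    files; otherwise the NCBI accession is required.
    """
    genome = ref.get("genome")
    annotation = ref.get("annotation")
    accession = ref.get("accession")
    if genome and annotation:
        return "local"
    if accession:
        return "remote"
    raise ValueError("provide genome and annotation paths, or an accession")


# Hypothetical configurations used only for illustration.
local_ref = {
    "species": "Homo sapiens",
    "genome": "refs/genome.fa",
    "annotation": "refs/annotation.gff",
}
remote_ref = {"species": "Homo sapiens", "accession": "GCF_000001405.40"}

print(reference_source(local_ref))
print(reference_source(remote_ref))
```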
Read Filtering
As this workflow is designed for long-read sequencing, a custom Python script is used to filter out short reads that may represent contamination or sequencing artifacts, ensuring that only reads meeting the specified length threshold are used for analysis.
read_filter:
min_length: Minimum read length to retain. Can be left at 0 to consider all reads.
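The effect of min_length can be sketched as follows. This is an illustration of length-based filtering, not the workflow's actual script. It assumes uncompressed FASTQ input with exactly four lines per record, and the function name filter_fastq and the example reads are hypothetical.

```python
import io


def filter_fastq(lines, min_length=0):
    """Yield FASTQ records whose sequence has at least min_length bases.

    Assumption: each record spans exactly four lines
    (header, sequence, separator, qualities).
    """
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            if len(record[1]) >= min_length:
                yield tuple(record)
            record = []


# Hypothetical reads used only for illustration.
EXAMPLE = (
    "@read_short\nACGT\n+\nIIII\n"
    "@read_long\nACGTACGTAC\n+\nIIIIIIIIII\n"
)

kept = list(filter_fastq(io.StringIO(EXAMPLE), min_length=8))
print([rec[0] for rec in kept])
```

With min_length left at 0, every record passes the length check, which matches the documented behaviour of considering all reads.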
Alignment (minimap2)
Alignment is performed using Minimap2. A comprehensive explanation of its parameters can be found in the Minimap2 documentation.
minimap2:
index_opts: Used to define additional options for indexing.
opts: Used to define additional mapping options.
maximum_secondary: Maximum number of secondary alignments (-N in the minimap2 documentation).
secondary_score_ratio: Score ratio for secondary alignments (-p in the minimap2 documentation).
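To show how these settings relate to the minimap2 command line, here is a minimal sketch that assembles an argument list from a minimap2 configuration mapping. The workflow itself runs minimap2 through Snakemake wrappers, so this is not its implementation. The function name minimap2_command, the file names and the example values are hypothetical; -a (SAM output), -N and -p are regular minimap2 options.

```python
import shlex


def minimap2_command(cfg, index, reads):
    """Build a minimap2 argument list from a config mapping (sketch)."""
    cmd = ["minimap2", "-a"]  # -a: write alignments in SAM format
    cmd += shlex.split(cfg.get("opts", ""))  # free-form extra mapping options
    cmd += ["-N", str(cfg["maximum_secondary"])]
    cmd += ["-p", str(cfg["secondary_score_ratio"])]
    cmd += [index, reads]
    return cmd


# Hypothetical values used only for illustration.
example_cfg = {
    "opts": "-x map-ont",
    "maximum_secondary": 100,
    "secondary_score_ratio": 1.0,
}

command = minimap2_command(example_cfg, "transcriptome.mmi", "sample.fastq")
print(" ".join(command))
```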
Alignment Processing (Samtools)
Since downstream analysis tools require different alignment formats, Samtools is used to convert SAM files into BAM format for Salmon quantification, or into sorted SAM files for FLAIR splice-isoform analysis. Additionally, Samtools generates alignment statistics that serve as quality control metrics. More details can be found in the Samtools documentation.
samtools:
samtobam_opts: Additional options for SAM to BAM conversion.
bamsort_opts: Additional options for sorting BAM files.
bamindex_opts: Additional options for indexing sorted BAM files.
bamstats_opts: Additional options for generating alignment statistics.
Quantification (Salmon)
Transcripts are quantified using Salmon in alignment-based mode. To ensure accurate quantification, Salmon requires information about the strandedness of the sequencing reads. More details can be found in the Salmon documentation.
quant:
salmon_libtype: Library type for Salmon quantification.
Differential Expression Analysis (DESeq2)
Differential expression analysis is performed using PyDESeq2, which models raw read counts with a negative binomial distribution and estimates dispersion parameters to identify differentially expressed genes. See the PyDESeq2 documentation for more details.
deseq2:
fit_type: Type of fitting of dispersions to the mean intensity. Will set the fit type for the DEA and the vst transformation. If needed, it can be set separately for each method.
  parametric: fit a dispersion-mean relation via a robust gamma-family GLM.
  mean: use the mean of gene-wise dispersion estimates.
design_factors: List of design factors for the analysis.
lfc_null: The log2 fold change under the null hypothesis for the Wald test.
alt_hypothesis: The alternative hypothesis for computing Wald p-values. By default, the normal Wald test assesses deviation of the estimated log fold change from the null hypothesis, as given by lfc_null. The alternative hypothesis corresponds to what the user wants to find rather than the null hypothesis.
point_width: Marker size for the MA-plot.
mincount: Minimum count threshold, genes below the threshold will be removed from analysis.
alpha: Type I error cutoff value.
threshold_plot: Number of top differentially expressed genes to plot in additional heatmap.
colormap: Colormap for heatmaps.
figtype: Figure output format (e.g., png).
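The roles of lfc_null and alpha can be illustrated with the textbook two-sided Wald test. This is a sketch of the general statistical idea, not PyDESeq2's implementation, and it covers only the default case without alt_hypothesis. The function names and the example numbers are hypothetical.

```python
import math


def normal_sf(z):
    """Survival function of the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def wald_p_value(lfc, lfc_se, lfc_null=0.0):
    """Two-sided Wald p-value for an estimated log2 fold change.

    lfc is the estimate, lfc_se its standard error, and lfc_null the
    log2 fold change assumed under the null hypothesis.
    """
    z = (lfc - lfc_null) / lfc_se
    return 2.0 * normal_sf(abs(z))


# Hypothetical estimate: log2 fold change 1.5 with standard error 0.5.
alpha = 0.05
p_default = wald_p_value(1.5, 0.5)               # null hypothesis: lfc = 0
p_shifted = wald_p_value(1.5, 0.5, lfc_null=1.0)  # null hypothesis: lfc = 1

print(p_default < alpha, p_shifted < alpha)
```

In this example the same estimate is significant against a null of 0 but not against a null of 1, which is why raising lfc_null makes the test more conservative.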
Isoform Analysis (FLAIR)
FLAIR is used to identify alternative splice isoforms in full-length transcripts obtained from long-read sequencing. It then quantifies these transcripts and performs differential expression analysis on the corresponding genes with splice-isoforms. More details can be found in the FLAIR documentation.
isoform_analysis:
FLAIR: Enable FLAIR alternative isoform analysis (true or false).
qscore: Minimum MAPQ for read alignment (--quality for FLAIR modules).
exp_thresh: Minimum read count expression threshold for differential expression analysis. Genes with fewer counts will be removed from analysis.
col_opts: Additional options for the flair collapse module.
Protein Annotation (lambda)
Lambda aligns sequences of differentially expressed genes or transcripts against indexed protein databases (e.g. UniProt). This process is similar to BLAST, enabling identification of similar proteins and functional annotation of transcripts.
Protein Annotation:
lambda: Enable lambda sequence alignment (true or false).
uniref: URL of the indexed UniRef database from the lambda wiki.
num_matches: Maximum number of proteins identified per sequence.
Linting and formatting
Linting results
WorkflowError in file "/tmp/tmp3_gm_3tj/snakemake-workflows-rna-longseq-de-isoform-46ea827/workflow/Snakefile", line 11:
Workflow defines configfile config/config.yml but it is not present or accessible (full checked path: /tmp/tmp3_gm_3tj/snakemake-workflows-rna-longseq-de-isoform-46ea827/config/config.yml).
Formatting results
All tests passed!