snakemake-workflows/rna-longseq-de-isoform

long read differential expression analysis and splice variant analysis

Overview

Latest release: v2.5.0, Last update: 2025-12-09

Linting: failed, Formatting: passed

Topics: differential-expression gene-annotation isoform-quantification

Wrappers: bio/minimap2/aligner bio/minimap2/index bio/qualimap/bamqc bio/reference/ensembl-annotation bio/reference/ensembl-sequence bio/samtools/index bio/samtools/sort bio/samtools/stats bio/samtools/view

Deployment

Step 1: Install Snakemake and Snakedeploy

Snakemake and Snakedeploy are best installed via Conda. It is recommended to install Conda via Miniforge. Run

conda create -c conda-forge -c bioconda -c nodefaults --name snakemake snakemake snakedeploy

to install both Snakemake and Snakedeploy in an isolated environment. For all following commands ensure that this environment is activated via

conda activate snakemake

For other installation methods, refer to the Snakemake and Snakedeploy documentation.

Step 2: Deploy workflow

With Snakemake and Snakedeploy installed, the workflow can be deployed as follows. First, create an appropriate project working directory on your system and enter it:

mkdir -p path/to/project-workdir
cd path/to/project-workdir

In all following steps, we will assume that you are inside of that directory. Then run

snakedeploy deploy-workflow https://github.com/snakemake-workflows/rna-longseq-de-isoform . --tag v2.5.0

Snakedeploy will create two folders, workflow and config. The former contains the deployment of the chosen workflow as a Snakemake module, the latter contains configuration files which will be modified in the next step in order to configure the workflow to your needs.

Step 3: Configure workflow

To configure the workflow, adapt config/config.yml to your needs following the instructions below.

Step 4: Run workflow

The deployment method is controlled using the --software-deployment-method (short --sdm) argument.

To run the workflow with automatic deployment of all required software via conda/mamba, use

snakemake --cores all --sdm conda

Snakemake will automatically detect the main Snakefile in the workflow subfolder and execute the workflow module that has been defined by the deployment in step 2.

For further options such as cluster and cloud execution, see the Snakemake documentation.

Step 5: Generate report

After finalizing your data analysis, you can automatically generate an interactive visual HTML report for inspection of results together with parameters and code inside of the browser using

snakemake --report report.zip

Configuration

The following section is imported from the workflow’s config/README.md.

Transcriptome Differential Expression Workflow

This workflow facilitates the analysis of transcriptomic data obtained from Oxford Nanopore long-read sequencing, including alignment, quantification, differential expression analysis, alternative splicing analysis and quality control. For data from other sequencing technologies, the alignment parameters may need to be adjusted for optimal performance.

Configuration Files

To set up the workflow, modify the following files to reflect your dataset and analysis parameters:

config/samples.csv: Contains sample information and experimental design.
config/config.yml: General workflow configuration and parameter settings.

samples.csv

Each line in samples.csv represents a biological sample with associated metadata. The required columns are:

sample: Unique identifier that matches the sample file in the input directory
condition: Experimental condition or treatment group
batch: Batch of samples

The following columns provide optional metadata:

platform: Sequencing platform
purity: Purity value
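
The column requirements above can be checked before a run with a small script. The following is a minimal sketch, not part of the workflow: it assumes a comma-separated file with a header row, and the function name check_sample_sheet as well as the sample values are made up for illustration.

```python
import csv
import io

# Required columns as listed in this README.
REQUIRED_COLUMNS = ("sample", "condition", "batch")


def check_sample_sheet(handle):
    """Return the rows of a sample sheet, raising ValueError on obvious problems."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"missing required columns: {', '.join(missing)}")
    rows = list(reader)
    names = [row["sample"] for row in rows]
    # Sample identifiers are expected to be unique.
    duplicates = sorted({name for name in names if names.count(name) > 1})
    if duplicates:
        raise ValueError(f"duplicate sample names: {', '.join(duplicates)}")
    return rows


# Hypothetical sample sheet; in practice, open config/samples.csv instead.
example = io.StringIO(
    "sample,condition,batch,platform,purity\n"
    "ctrl_1,control,b1,nanopore,0.95\n"
    "treat_1,treated,b1,nanopore,0.90\n"
)
rows = check_sample_sheet(example)
print([row["sample"] for row in rows])
```

To check a real sheet, pass an open file handle, for example open("config/samples.csv", newline="").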

config.yml

The config.yml file contains the main configuration parameters for the workflow.

General Workflow Parameters

inputdir: Directory containing input samples.

Reference Genome Parameters

Since Salmon requires transcriptomic alignments for quantification, a transcriptome is constructed using genomic and annotation reference data. These reference files can be provided locally or retrieved automatically, either from NCBI using an accession number or from Ensembl using the species identifier, build and release.

ref:
- species: Name of the species.
- genome: Path to the genome file (FASTA format, can be omitted if remote retrieval through accession number is preferred).
- annotation: Path to the annotation file (GFF or GTF format, can be omitted if remote retrieval through accession number is preferred).
- accession: NCBI accession number (used if remote NCBI data is preferred).
- ensembl_species: Ensembl species identifier (used if remote Ensembl data is preferred).
- build: Ensembl data release build (only required if using Ensembl download)
- release: Ensembl data release number (only required if using Ensembl download)
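
The options above describe three ways to supply reference data. The sketch below shows one way to reason about which source a given ref block selects. It is a hypothetical helper, not the workflow's own logic: the function name reference_source, the order of precedence (local files first, then NCBI, then Ensembl) and all example values are assumptions.

```python
def reference_source(ref):
    """Guess which reference source a `ref` configuration block selects.

    The precedence used here is an assumption for illustration only.
    """
    if ref.get("genome") and ref.get("annotation"):
        return "local"
    if ref.get("accession"):
        return "ncbi"
    if ref.get("ensembl_species") and ref.get("build") and ref.get("release"):
        return "ensembl"
    raise ValueError(
        "ref needs local genome and annotation files, an NCBI accession, "
        "or ensembl_species together with build and release"
    )


# Hypothetical configuration blocks mirroring the keys listed above.
local_ref = {"species": "Homo sapiens", "genome": "ref/genome.fa", "annotation": "ref/genes.gtf"}
ncbi_ref = {"species": "Homo sapiens", "accession": "GCF_000001405.40"}
ensembl_ref = {"species": "Homo sapiens", "ensembl_species": "homo_sapiens", "build": "GRCh38", "release": "112"}

for block in (local_ref, ncbi_ref, ensembl_ref):
    print(reference_source(block))
```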

Read Filtering

As this workflow is designed for long-read sequencing, a custom Python script is used to filter out short reads that may represent contamination or sequencing artifacts, ensuring that only reads meeting the specified length threshold are used for analysis.

read_filter:
- min_length: Minimum read length to retain. Can be left at 0 to consider all reads.
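
The effect of min_length can be illustrated with a short length filter. This is a minimal sketch and not the workflow's actual script: it assumes uncompressed FASTQ with exactly four lines per record, and the function name filter_fastq and the example reads are made up.

```python
import io


def filter_fastq(lines, min_length=0):
    """Yield 4-line FASTQ records whose sequence has at least min_length bases."""
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            # record[1] is the sequence line of the FASTQ record.
            if len(record[1]) >= min_length:
                yield tuple(record)
            record = []


# Made-up reads: one with 4 bases and one with 12 bases.
example = io.StringIO(
    "@read_short\nACGT\n+\nIIII\n"
    "@read_long\nACGTACGTACGT\n+\nIIIIIIIIIIII\n"
)
kept = [rec[0] for rec in filter_fastq(example, min_length=10)]
print(kept)
```

With min_length set to 0, every record passes the comparison, which matches the documented behaviour of considering all reads.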

Alignment (minimap2)

Alignment is performed using Minimap2. A comprehensive explanation of its parameters can be found in the Minimap2 documentation.

minimap2:
- index_opts: Used to define additional options for indexing.
- opts: Used to define additional mapping options.
- maximum_secondary: Maximum number of secondary alignments (-N in the minimap2 documentation).
- secondary_score_ratio: Score ratio for secondary alignments (-p in the minimap2 documentation).
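
To show how these settings relate to a minimap2 call, the sketch below assembles a command line from them. It is illustrative only: the workflow itself runs minimap2 through a Snakemake wrapper, and the function name minimap2_command, the file names and the option values are assumptions. The flags -a (SAM output), -x (preset), -N and -p are taken from the minimap2 manual.

```python
import shlex


def minimap2_command(index, reads, opts="", maximum_secondary=None, secondary_score_ratio=None):
    """Build a minimap2 argument list from configuration-style values."""
    cmd = ["minimap2", "-a"]  # -a requests SAM output
    cmd += shlex.split(opts)  # free-form extra mapping options
    if maximum_secondary is not None:
        cmd += ["-N", str(maximum_secondary)]
    if secondary_score_ratio is not None:
        cmd += ["-p", str(secondary_score_ratio)]
    cmd += [index, reads]
    return cmd


# Hypothetical values; adjust to your own data.
cmd = minimap2_command(
    "transcriptome.mmi",
    "sample.fastq",
    opts="-x map-ont",
    maximum_secondary=100,
    secondary_score_ratio=0.99,
)
print(" ".join(shlex.quote(part) for part in cmd))
```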

Alignment Processing (Samtools)

Since the downstream analysis tools require different alignment formats, Samtools is used to convert SAM files into BAM format for Salmon quantification or into sorted SAM files for FLAIR splice-isoform analysis. Additionally, Samtools generates alignment statistics that serve as quality control metrics. More details can be found in the Samtools documentation.

samtools:
- samtobam_opts: Additional options for SAM to BAM conversion.
- bamsort_opts: Additional options for sorting BAM files.
- bamindex_opts: Additional options for indexing sorted BAM files.
- bamstats_opts: Additional options for generating alignment statistics.

Quantification (Salmon)

Transcripts are quantified using Salmon in alignment-based mode. To ensure accurate quantification, Salmon requires information about the strandedness of the sequencing reads. More details can be found in the Salmon documentation.

quant:
- salmon_libtype: Library type for Salmon quantification.

Differential Expression Analysis (DESeq2)

Differential expression analysis is performed using PyDESeq2 to model raw read counts with a negative binomial distribution, estimating dispersion parameters to identify differentially expressed genes. See the PyDESeq2 documentation for more details.

deseq2:
- design_factors: List of categorical factors used in the model (e.g. condition). These must appear as columns in your samples.csv.
- batch_effect: List of variables representing batch or technical effects to correct for during normalization.
- fit_type: Design formula for DESeq2. If left empty (""), the workflow automatically constructs a formula combining all batch_effect and design_factors with interactions.
- lfc_null: The log2 fold change under the null hypothesis for the Wald test.
- alt_hypothesis: The alternative hypothesis for computing Wald p-values. By default, the normal Wald test assesses deviation of the estimated log fold change from the null hypothesis, as given by lfc_null. The alternative hypothesis corresponds to what the user wants to find rather than the null hypothesis.
- mincount: Minimum count threshold. Genes below the threshold will be removed from the analysis.
- alpha: Type I error cutoff value.
- threshold_plot: Number of top differentially expressed genes to plot in an additional heatmap.
- colormap: Colormap for heatmaps.
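
The roles of lfc_null and alt_hypothesis can be illustrated with a textbook Wald test. This is a simplified sketch using only the standard library, not PyDESeq2's implementation: the function names, the "greater" and "less" labels as used here and the example numbers are assumptions, and PyDESeq2's own handling of alternative hypotheses may differ in detail.

```python
import math


def normal_sf(z):
    """Upper-tail probability of the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def wald_p_value(lfc, lfc_se, lfc_null=0.0, alt_hypothesis=None):
    """Textbook Wald p-value for an estimated log2 fold change.

    lfc: estimated log2 fold change; lfc_se: its standard error.
    """
    z = (lfc - lfc_null) / lfc_se
    if alt_hypothesis is None:
        # Two-sided test: any deviation from lfc_null counts as evidence.
        return 2.0 * normal_sf(abs(z))
    if alt_hypothesis == "greater":
        # One-sided test: only fold changes above lfc_null count.
        return normal_sf(z)
    if alt_hypothesis == "less":
        # One-sided test: only fold changes below lfc_null count.
        return normal_sf(-z)
    raise ValueError(f"unsupported alt_hypothesis: {alt_hypothesis!r}")


# Made-up estimate: log2 fold change of 1.0 with standard error 0.5.
p_default = wald_p_value(lfc=1.0, lfc_se=0.5)
p_shifted = wald_p_value(lfc=1.0, lfc_se=0.5, lfc_null=1.0)
print(p_default, p_shifted)
```

Raising lfc_null moves the reference point of the test, so the same estimate yields a larger p-value when it lies close to the null value.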

Isoform Analysis (FLAIR)

FLAIR is used to identify alternative splice isoforms in full-length transcripts obtained from long-read sequencing. It then quantifies these transcripts and performs differential expression analysis on the corresponding genes with splice-isoforms. More details can be found in the FLAIR documentation.

isoform_analysis:
- FLAIR: Enable FLAIR alternative isoform analysis (true or false).
- qscore: Minimum MAPQ for read alignment. --quality for FLAIR modules.
- exp_thresh: Minimum read count expression threshold for differential expression analysis. Genes with fewer counts will be removed from the analysis.
- col_opts: Additional options for flair collapse module.

Protein Annotation (lambda)

Lambda aligns sequences of differentially expressed genes or transcripts against indexed protein databases (e.g. UniProt). This process is similar to BLAST, enabling identification of similar proteins and functional annotation of transcripts.

Protein Annotation:
- lambda: Enable lambda sequence alignment (true or false).
- uniref: URL of the pre-formatted indexed UniRef database from the lambda wiki.
- num_matches: Maximum number of proteins identified per sequence.

Linting and formatting

Linting results

WorkflowError in file "/tmp/tmp2pz_spsz/snakemake-workflows-rna-longseq-de-isoform-16012bb/workflow/rules/commons.smk", line 137:
no valid samples found, allowed extensions are: '('.fastq', '.fq', '.fastq.gz', '.fq.gz')'
  File "/tmp/tmp2pz_spsz/snakemake-workflows-rna-longseq-de-isoform-16012bb/workflow/rules/qc.smk", line 43, in <module>
  File "/tmp/tmp2pz_spsz/snakemake-workflows-rna-longseq-de-isoform-16012bb/workflow/rules/commons.smk", line 137, in aggregate_input

Formatting results

All tests passed!