FeelLiao/rna-seq-std

RNA-Seq analysis workflow based on snakemake

Overview

Topics: rna-seq rna-seq-analysis snakemake

Latest release: v1.1.0, Last update: 2025-04-04

Linting: linting: failed, Formatting: formatting: failed

Deployment

Step 1: Install Snakemake and Snakedeploy

Snakemake and Snakedeploy are best installed via the Mamba package manager (a drop-in replacement for conda). If you have neither Conda nor Mamba, it is recommended to install Miniforge. More details regarding Mamba can be found here.

When using Mamba, run

mamba create -c conda-forge -c bioconda --name snakemake snakemake snakedeploy

to install both Snakemake and Snakedeploy in an isolated environment. For all following commands ensure that this environment is activated via

conda activate snakemake

Step 2: Deploy workflow

With Snakemake and Snakedeploy installed, the workflow can be deployed as follows. First, create an appropriate project working directory on your system and enter it:

mkdir -p path/to/project-workdir
cd path/to/project-workdir

In all following steps, we will assume that you are inside of that directory. Then run

snakedeploy deploy-workflow https://github.com/FeelLiao/rna-seq-std . --tag v1.1.0

Snakedeploy will create two folders, workflow and config. The former contains the deployment of the chosen workflow as a Snakemake module, the latter contains configuration files which will be modified in the next step in order to configure the workflow to your needs.

Step 3: Configure workflow

To configure the workflow, adapt config/config.yml to your needs following the instructions below.

Step 4: Run workflow

The deployment method is controlled using the --software-deployment-method (short --sdm) argument.

To run the workflow with automatic deployment of all required software via conda/mamba, use

snakemake --cores all --sdm conda

Snakemake will automatically detect the main Snakefile in the workflow subfolder and execute the workflow module that has been defined by the deployment in step 2.

For further options such as cluster and cloud execution, see the docs.

Step 5: Generate report

After finalizing your data analysis, you can automatically generate an interactive visual HTML report for inspection of results together with parameters and code inside of the browser using

snakemake --report report.zip

Configuration

The following section is imported from the workflow’s config/README.md.

General configuration

To configure this workflow, modify config/config.yaml according to your needs, following the explanations provided in the file.

Sample setup (samples)

The sample file is specified via comma-separated files (.csv).

samples

The default sample sheet is config/samples.tsv (as configured in config/config.yaml).

Each sample refers to an actual physical sample, and replicates (both biological and technical) are specified as separate samples. For each sample, you will always have to specify a sample name by sample.

In addition, you need to specify the group of samples. When replicates are contained in an experiment, the file name of a sample usually contains the sample name and replicates number separated by “-“.

Two more columns are needed to be fulfilled, read1 and read2, which represent the sample path.

Finally, extra column is not necessary unless you have more complicated experiment design.

Here, we propose that you have a folder containing raw reads data, and the filenames fit the formation sample-replicates_R1/2.fastq. Hence, we provide a python script config/sample_pre.py to process the sample information of rawdata in that situation. Before you use this workflow, just run python config/sample_pre.py -h to see how this script will help you in generating sample_sheet.csv.

python config/sample_pre.py -h
# usage: sample_pre.py [-h] [-i INPUT] [-e EXTENSION] [-o OUTPUT]

# Automatic create sample sheet for this RNA-seq pipeline

# options:
#   -h, --help            show this help message and exit
#   -i INPUT, --input INPUT
#                         Path of your raw data [default: ../rawdata]
#   -e EXTENSION, --extension EXTENSION
#                         file extension of raw data [default: fastq]
#   -o OUTPUT, --output OUTPUT
#                         output path of sample sheet [default: .]

# version: 0.1.0

Reference genome (ref)

The reference genome and genome annotation used for RNA-Seq

genome

Define the reference genome to use. file can be fasta and gz format.

You could download this via NCBI or EMBL database.

annotation

Which genome annotation to use. Usually, the more precise the annotation is, the more persuasive quantification results is.

Both gff and gtf format are acceptable.

Clean

If you want to clean the output, default is true. If you set this to true, the output will be removed after the workflow finished. This is useful when you want to save disk space. Only the intermediate files will be removed, the final output will be kept.

Reports

If you want to generate reports, default is true. If you set this to true, the report will be generated after the workflow finished. This is useful when you want to check the quality of the workflow. The report will be generated in the out/reports directory.

SRA download configuration (SRA)

To successfully run the SRA download , you need a stable connection to NCBI, or its a annoying time consuming task.

This workflow will download sra and decompress these files into fastq format automatically. This means that the output files are only the decompress fastq files, not contain sra original files.

After the sra workflow finished, you will get a sample file named sample_sheet_sra.csv in the same directory of acc_list. For downstream analysis, you need to fulfill the sample information yourself.

You can run this by targeting sra in snakemake commandline.

snakemake sra -c 30 --use-conda --conda-cleanup-pkgs

activate

When you plan to use SRA download, you need to set this to true. If you set this to false, downstream analysis of these sra files will be unavailable.

If your plan to use SRA for downstream analysis, please change samples in config file after you run sra target.

acc_list

Path to SRA accession list. Usually, you can download it using NCBI SRA run selector.

output_dir

The directory that will store the processed sra files, usually in fastq format.

featureCounts configuration (featureCounts)

Specify the parameters that featureCounts use. For more information, see official website

New Gene

If you want to add new gene annotation, default is false. If you set this to true, the new gene annotation will be added to the existing gene annotation.

activate

When you plan to add new gene annotation, you need to set this to true.

stringtie_params

The parameters that stringtie use. For more information, see official website

Linting and formatting

Linting results

FileNotFoundError in file /tmp/tmp_bxwk0r1/FeelLiao-rna-seq-std-3bf4544/workflow/rules/common.smk, line 11:
[Errno 2] No such file or directory: 'config/sample_sheet.csv'
  File "/tmp/tmp_bxwk0r1/FeelLiao-rna-seq-std-3bf4544/workflow/rules/common.smk", line 11, in <module>
  File "/home/runner/micromamba/envs/snakemake-workflow-catalog/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1026, in read_csv
  File "/home/runner/micromamba/envs/snakemake-workflow-catalog/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 620, in _read
  File "/home/runner/micromamba/envs/snakemake-workflow-catalog/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1620, in __init__
  File "/home/runner/micromamba/envs/snakemake-workflow-catalog/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1880, in _make_engine
  File "/home/runner/micromamba/envs/snakemake-workflow-catalog/lib/python3.12/site-packages/pandas/io/common.py", line 873, in get_handle

Formatting results

[DEBUG] 
[DEBUG] In file "/tmp/tmp_bxwk0r1/FeelLiao-rna-seq-std-3bf4544/workflow/rules/new_gene.smk":  Formatted content is different from original
[DEBUG] 
[DEBUG] In file "/tmp/tmp_bxwk0r1/FeelLiao-rna-seq-std-3bf4544/workflow/rules/align.smk":  Formatted content is different from original
[DEBUG] 
<unknown>:1: SyntaxWarning: invalid escape sequence '\$'
[DEBUG] In file "/tmp/tmp_bxwk0r1/FeelLiao-rna-seq-std-3bf4544/workflow/rules/lncRNA.smk":  Formatted content is different from original
[DEBUG] 
[DEBUG] In file "/tmp/tmp_bxwk0r1/FeelLiao-rna-seq-std-3bf4544/workflow/Snakefile":  Formatted content is different from original
[DEBUG] 
[WARNING] In file "/tmp/tmp_bxwk0r1/FeelLiao-rna-seq-std-3bf4544/workflow/rules/counts.smk":  Keyword "output" at line 30 has comments under a value.
	PEP8 recommends block comments appear before what they describe
(see https://www.python.org/dev/peps/pep-0008/#id30)
[DEBUG] In file "/tmp/tmp_bxwk0r1/FeelLiao-rna-seq-std-3bf4544/workflow/rules/counts.smk":  Formatted content is different from original
[DEBUG] 
[DEBUG] In file "/tmp/tmp_bxwk0r1/FeelLiao-rna-seq-std-3bf4544/workflow/rules/post.smk":  Formatted content is different from original
[DEBUG] 
[DEBUG] In file "/tmp/tmp_bxwk0r1/FeelLiao-rna-seq-std-3bf4544/workflow/rules/common.smk":  Formatted content is different from original
[DEBUG] 
[DEBUG] In file "/tmp/tmp_bxwk0r1/FeelLiao-rna-seq-std-3bf4544/workflow/rules/ref.smk":  Formatted content is different from original

... (truncated)