MPUSP/snakemake-bacterial-riboseq

Bacterial-Riboseq: A Snakemake workflow for the analysis of riboseq data in bacteria.

Overview

Topics: bioinformatics-pipeline conda riboseq ribosome-profiling singularity snakemake workflow

Latest release: v1.4.0, Last update: 2025-02-02

Linting: linting: passed, Formatting:formatting: passed

Deployment

Step 1: Install Snakemake and Snakedeploy

Snakemake and Snakedeploy are best installed via the Mamba package manager (a drop-in replacement for conda). If you have neither Conda nor Mamba, it is recommended to install Miniforge. More details regarding Mamba can be found here.

When using Mamba, run

mamba create -c conda-forge -c bioconda --name snakemake snakemake snakedeploy

to install both Snakemake and Snakedeploy in an isolated environment. For all following commands ensure that this environment is activated via

conda activate snakemake

Step 2: Deploy workflow

With Snakemake and Snakedeploy installed, the workflow can be deployed as follows. First, create an appropriate project working directory on your system and enter it:

mkdir -p path/to/project-workdir
cd path/to/project-workdir

In all following steps, we will assume that you are inside of that directory. Then run

snakedeploy deploy-workflow https://github.com/MPUSP/snakemake-bacterial-riboseq . --tag v1.4.0

Snakedeploy will create two folders, workflow and config. The former contains the deployment of the chosen workflow as a Snakemake module, the latter contains configuration files which will be modified in the next step in order to configure the workflow to your needs.

Step 3: Configure workflow

To configure the workflow, adapt config/config.yml to your needs following the instructions below.

Step 4: Run workflow

The deployment method is controlled using the --software-deployment-method (short --sdm) argument.

To run the workflow with automatic deployment of all required software via conda/mamba, use

snakemake --cores all --sdm conda

To run the workflow using a combination of conda and apptainer/singularity for software deployment, use

snakemake --cores all --sdm conda apptainer

Snakemake will automatically detect the main Snakefile in the workflow subfolder and execute the workflow module that has been defined by the deployment in step 2.

For further options such as cluster and cloud execution, see the docs.

Step 5: Generate report

After finalizing your data analysis, you can automatically generate an interactive visual HTML report for inspection of results together with parameters and code inside of the browser using

snakemake --report report.zip

Configuration

The following section is imported from the workflow’s config/README.md.

snakemake-bacterial-riboseq

Usage

The usage of this workflow is described in the Snakemake Workflow Catalog.

If you use this workflow in a paper, don't forget to give credits to the authors by citing the URL of this (original) repository and its DOI (see above).

Workflow overview

This workflow is a best-practice workflow for the analysis of ribosome footprint sequencing (Ribo-Seq) data. The workflow is built using snakemake and consists of the following steps:

Obtain genome database in fasta and gff format (python, NCBI Datasets)
1. Using automatic download from NCBI with a RefSeq ID
2. Using user-supplied files
Check quality of input sequencing data (FastQC)
Cut adapters and filter by length and/or sequencing quality score (cutadapt)
Deduplicate reads by unique molecular identifier (UMI, umi_tools)
Map reads to the reference genome (STAR aligner)
Sort and index for aligned seq data (samtools)
Filter reads by feature type (bedtools)
Generate summary report for all processing steps (MultiQC)
Shift ribo-seq reads according to the ribosome's P-site alignment (R, ORFik)
Calculate basic gene-wise statistics such as RPKM (R, ORFik)
Return report as HTML and PDF files (R markdown, weasyprint)

If you want to contribute, report issues, or suggest features, please get in touch on github.

Installation

Step 1: Clone this repository

git clone https://github.com/MPUSP/snakemake-bacterial-riboseq.git
cd snakemake-bacterial-riboseq

Step 2: Install dependencies

It is recommended to install snakemake and run the workflow with conda, mamba or micromamba.

# download Miniconda3 installer
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
# install Conda (respond by 'yes')
bash miniconda.sh
# update Conda
conda update -y conda
# install Mamba
conda install -n base -c conda-forge -y mamba

Step 3: Create snakemake environment

This step creates a new conda environment called snakemake-bacterial-riboseq.

# create new environment with dependencies & activate it
mamba create -c conda-forge -c bioconda -n snakemake-bacterial-riboseq snakemake pandas
conda activate snakemake-bacterial-riboseq

Note:

All other dependencies for the workflow are automatically pulled as conda environments by snakemake, when running the workflow with the --use-conda parameter (recommended).

Running the workflow

Input data

Reference genome

An NCBI Refseq ID, e.g. GCF_000006945.2. Find your genome assembly and corresponding ID on NCBI genomes. Alternatively use a custom pair of *.fasta file and *.gff file that describe the genome of choice.

Important requirements when using custom *.fasta and *.gff files:

*.gff genome annotation must have the same chromosome/region name as the *.fasta file (example: NC_003197.2)
*.gff genome annotation must have gene and CDS type annotation that is automatically parsed to extract transcripts
all chromosomes/regions in the *.gff genome annotation must be present in the *.fasta sequence
but not all sequences in the *.fasta file need to have annotated genes in the *.gff file

Read data

Ribosome footprint sequencing data in *.fastq.gz format. The currently supported input data are single-end, strand-specific reads. Input data files are supplied via a mandatory table, whose location is indicated in the config.yml file (default: samples.tsv). The sample sheet has the following layout:

sample	condition	replicate	data_folder	fq1
RPF-RTP1	RPF-RTP	1	data	RPF-RTP1_R1_001.fastq.gz
RPF-RTP2	RPF-RTP	2	data	RPF-RTP2_R1_001.fastq.gz

Some configuration parameters of the pipeline may be specific for your data and library preparation protocol. The options should be adjusted in the config.yml file. For example:

Minimum and maximum read length after adapter removal (see option cutadapt: default). Here, the test data has a minimum read length of 15 + 7 = 22 (2 nt on 5'end + 5 nt on 3'end), and a maximum of 45 + 7 = 52.
Unique molecular identifiers (UMIs). For example, the protocol by McGlincy & Ingolia, 2017 creates a UMI that is located on both the 5'-end (2 nt) and the 3'-end (5 nt). These UMIs are extracted with umi_tools (see options umi_extraction: method and pattern).

Example configuration files for different sequencing protocols can be found in resources/protocols/.

Execution

To run the workflow from command line, change the working directory.

cd path/to/snakemake-bacterial-riboseq

Adjust options in the default config file config/config.yml. Before running the entire workflow, you can perform a dry run using:

snakemake --dry-run

To run the complete workflow with test files using conda, execute the following command. The definition of the number of compute cores is mandatory.

snakemake --cores 10 --use-conda --directory .test

Parameters

This table lists all parameters that can be used to run the workflow.

parameter	type	details	default
samplesheet
path	str	path to samplesheet, mandatory	"config/samples.tsv"
get_genome
database	str	one of `manual`, `ncbi`	`ncbi`
assembly	str	RefSeq ID	`GCF_000006785.2`
fasta	str	optional path to fasta file	Null
gff	str	optional path to gff file	Null
gff_source_type	str	list of name/value pairs for GFF source	see config file
cutadapt
fivep_adapter	str	sequence of the 5' adapter	Null
threep_adapter	str	sequence of the 3' adapter	`ATCGTAGATCGGAAGAGCACACGTCTGAA`
default	str	additional options passed to `cutadapt`	[`-q 10` , `-m 22` , `-M 52`, `--overlap=3`]
umi_extraction
method	str	one of `string` or `regex`, see manual	`regex`
pattern	str	string or regular expression	`^(?P<umi_0>.{5}).*(?P<umi_1>.{2})$`
umi_dedup
options	str	default options for deduplication	see config file
star
index	str	location of genome index; if Null, is made	Null
genomeSAindexNbases	num	length of pre-indexing string, see STAR man	9
multi	num	max number of loci read is allowed to map	10
sam_multi	num	max number of alignments reported for read	1
intron_max	num	max length of intron; 0 = automatic choice	1
default	str	default options for STAR aligner	see config file
extract_features
biotypes	str	biotypes to exclude from mapping	[`rRNA`, `tRNA`]
CDS	str	CDS type to include for mapping	[`protein_coding`]
bedtools_intersect
defaults	str	remove hits, sense strand, min overlap 20%	[`-v` , `-s` , `-f 0.2`]
annotate_orfs
window_size	num	size of 5'-UTR added to CDS	30
shift_reads
window_size	num	start codon window to determine shift	30
read_length	num	size range of reads to use for shifting	[27, 45]
end_alignment	str	end used for alignment of RiboSeq reads	`3prime`
shift_table	str	optional table with offsets per read length	Null
export_bigwig	str	export shifted reads as bam file	True
export_ofst	str	export shifted reads as ofst file	False
skip_shifting	str	skip read shifting entirely	False
skip_length_filter	str	skip filtering reads by length	False
multiqc
config	str	path to multiqc config	`config/multiqc_config.yml`
report
export_figures	bool	export figures as `.svg` and `.png`	True
export_dir	str	sub-directory for figure export	`figures/`
figure_width	num	standard figure width in px	875
figure_height	num	standard figure height in px	500
figure_resolution	num	standard figure resolution in dpi	125

Linting and formatting

Linting results

None

Formatting results

None