MPUSP/snakemake-bacterial-riboseq
Bacterial-Riboseq: A Snakemake workflow for the analysis of riboseq data in bacteria.
Overview
Topics: bioinformatics-pipeline conda riboseq ribosome-profiling singularity snakemake workflow
Latest release: v1.4.0, Last update: 2025-02-02
Linting: passed, Formatting: passed
Deployment
Step 1: Install Snakemake and Snakedeploy
Snakemake and Snakedeploy are best installed via the Mamba package manager (a drop-in replacement for Conda). If you have neither Conda nor Mamba, it is recommended to install Miniforge. More details can be found in the Mamba documentation.
When using Mamba, run
mamba create -c conda-forge -c bioconda --name snakemake snakemake snakedeploy
to install both Snakemake and Snakedeploy in an isolated environment. For all following commands ensure that this environment is activated via
conda activate snakemake
Step 2: Deploy workflow
With Snakemake and Snakedeploy installed, the workflow can be deployed as follows. First, create an appropriate project working directory on your system and enter it:
mkdir -p path/to/project-workdir
cd path/to/project-workdir
In all following steps, we will assume that you are inside of that directory. Then run
snakedeploy deploy-workflow https://github.com/MPUSP/snakemake-bacterial-riboseq . --tag v1.4.0
Snakedeploy will create two folders, workflow and config. The former contains the deployment of the chosen workflow as a Snakemake module, the latter contains configuration files which will be modified in the next step in order to configure the workflow to your needs.
Step 3: Configure workflow
To configure the workflow, adapt config/config.yml to your needs following the instructions below.
Step 4: Run workflow
The deployment method is controlled using the --software-deployment-method (short --sdm) argument.
To run the workflow with automatic deployment of all required software via conda/mamba, use
snakemake --cores all --sdm conda
To run the workflow using a combination of conda and apptainer/singularity for software deployment, use
snakemake --cores all --sdm conda apptainer
Snakemake will automatically detect the main Snakefile in the workflow subfolder and execute the workflow module that has been defined by the deployment in step 2.
For further options such as cluster and cloud execution, see the Snakemake documentation.
Step 5: Generate report
After finalizing your data analysis, you can automatically generate an interactive visual HTML report for inspection of results together with parameters and code inside of the browser using
snakemake --report report.zip
Configuration
The following section is imported from the workflow’s config/README.md.
The usage of this workflow is described in the Snakemake Workflow Catalog.
If you use this workflow in a paper, don't forget to give credit to the authors by citing the URL of this (original) repository and its DOI (see above).
This workflow is a best-practice workflow for the analysis of ribosome footprint sequencing (Ribo-Seq) data. The workflow is built using snakemake and consists of the following steps:
- Obtain genome database in fasta and gff format (python, NCBI Datasets)
  - Using automatic download from NCBI with a RefSeq ID
  - Using user-supplied files
- Check quality of input sequencing data (FastQC)
- Cut adapters and filter by length and/or sequencing quality score (cutadapt)
- Deduplicate reads by unique molecular identifier (UMI, umi_tools)
- Map reads to the reference genome (STAR aligner)
- Sort and index aligned sequencing data (samtools)
- Filter reads by feature type (bedtools)
- Generate summary report for all processing steps (MultiQC)
- Shift ribo-seq reads according to the ribosome's P-site alignment (R, ORFik)
- Calculate basic gene-wise statistics such as RPKM (R, ORFik)
- Return report as HTML and PDF files (R markdown, weasyprint)
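To make the RPKM statistic from the step list concrete, here is a minimal sketch of the standard definition (reads per kilobase per million mapped reads). It is not the workflow's or ORFik's own code; the function name `rpkm` and the example numbers are hypothetical.

```python
def rpkm(read_count: int, gene_length_nt: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of gene length per million mapped reads."""
    if gene_length_nt <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and library size must be positive")
    # 1e9 = 1e3 (nt per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (gene_length_nt * total_mapped_reads)


# Hypothetical example: 500 reads on a 1,000 nt gene,
# in a library of 2 million mapped reads.
print(rpkm(500, 1000, 2_000_000))
```

Normalizing by both gene length and library size makes values comparable between genes and between samples, which raw read counts are not.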
If you want to contribute, report issues, or suggest features, please get in touch on GitHub.
Step 1: Clone this repository
git clone https://github.com/MPUSP/snakemake-bacterial-riboseq.git
cd snakemake-bacterial-riboseq
Step 2: Install dependencies
It is recommended to install snakemake and run the workflow with conda, mamba or micromamba.
# download Miniconda3 installer
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
# install Conda (respond by 'yes')
bash miniconda.sh
# update Conda
conda update -y conda
# install Mamba
conda install -n base -c conda-forge -y mamba
Step 3: Create snakemake environment
This step creates a new conda environment called snakemake-bacterial-riboseq.
# create new environment with dependencies & activate it
mamba create -c conda-forge -c bioconda -n snakemake-bacterial-riboseq snakemake pandas
conda activate snakemake-bacterial-riboseq
Note: All other dependencies for the workflow are automatically pulled as conda environments by snakemake when running the workflow with the --use-conda parameter (recommended).
The reference genome is specified by an NCBI RefSeq ID, e.g. GCF_000006945.2. Find your genome assembly and corresponding ID on NCBI genomes. Alternatively, use a custom pair of *.fasta and *.gff files that describe the genome of choice.
Important requirements when using custom *.fasta and *.gff files:
- *.gff genome annotation must have the same chromosome/region name as the *.fasta file (example: NC_003197.2)
- *.gff genome annotation must have gene and CDS type annotation that is automatically parsed to extract transcripts
- all chromosomes/regions in the *.gff genome annotation must be present in the *.fasta sequence
- but not all sequences in the *.fasta file need to have annotated genes in the *.gff file
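The requirements above can be checked before starting a run. The following is a minimal sketch using only the Python standard library; it is not part of the workflow, and the function names and toy inputs are hypothetical. It assumes a plain tab-separated GFF whose first column is the region name and whose third column is the feature type.

```python
def fasta_ids(fasta_text):
    """Collect sequence IDs (first word after '>') from FASTA text."""
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">") and line[1:].split():
            ids.add(line[1:].split()[0])
    return ids


def gff_seqids_and_types(gff_text):
    """Collect region names (column 1) and feature types (column 3)."""
    seqids, types = set(), set()
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        fields = line.split("\t")
        if len(fields) >= 3:
            seqids.add(fields[0])
            types.add(fields[2])
    return seqids, types


def check_genome_pair(fasta_text, gff_text):
    """Return a list of problems; an empty list means the pair looks usable."""
    problems = []
    seqids, types = gff_seqids_and_types(gff_text)
    missing = sorted(seqids - fasta_ids(fasta_text))
    if missing:
        problems.append("GFF regions missing from FASTA: " + ", ".join(missing))
    for required in ("gene", "CDS"):
        if required not in types:
            problems.append("GFF has no '" + required + "' features")
    return problems


# Toy inputs: the FASTA has one extra sequence without annotation, which is allowed.
FASTA = ">NC_003197.2 chromosome\nATGAAATAA\n>extra_contig\nACGT\n"
GFF = (
    "##gff-version 3\n"
    "NC_003197.2\tRefSeq\tgene\t1\t9\t.\t+\t.\tID=gene-1\n"
    "NC_003197.2\tRefSeq\tCDS\t1\t9\t.\t+\t0\tID=cds-1;Parent=gene-1\n"
)
print(check_genome_pair(FASTA, GFF))
```

The check is one-directional on purpose: regions named in the GFF must exist in the FASTA, while unannotated FASTA sequences are not reported.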
Ribosome footprint sequencing data in *.fastq.gz format. The currently supported input data are single-end, strand-specific reads. Input data files are supplied via a mandatory table, whose location is indicated in the config.yml file (default: samples.tsv). The sample sheet has the following layout:
sample | condition | replicate | data_folder | fq1 |
---|---|---|---|---|
RPF-RTP1 | RPF-RTP | 1 | data | RPF-RTP1_R1_001.fastq.gz |
RPF-RTP2 | RPF-RTP | 2 | data | RPF-RTP2_R1_001.fastq.gz |
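A sample sheet with this layout can be sanity-checked with a few lines of Python. This is a minimal sketch, not the workflow's own validation: it assumes the file is tab-separated (as the .tsv extension suggests), and the uniqueness check on sample names and the function name `read_samplesheet` are assumptions made for illustration.

```python
import csv
import io

REQUIRED_COLUMNS = ("sample", "condition", "replicate", "data_folder", "fq1")


def read_samplesheet(handle):
    """Parse a tab-separated sample sheet and run basic sanity checks."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError("missing columns: " + ", ".join(missing))
    rows = list(reader)
    names = [row["sample"] for row in rows]
    if len(names) != len(set(names)):
        # Assumption: each sample name identifies exactly one row.
        raise ValueError("sample names must be unique")
    for row in rows:
        if not row["fq1"].endswith(".fastq.gz"):
            raise ValueError("not a *.fastq.gz file: " + row["fq1"])
    return rows


# In-memory copy of the example table shown above.
EXAMPLE = (
    "sample\tcondition\treplicate\tdata_folder\tfq1\n"
    "RPF-RTP1\tRPF-RTP\t1\tdata\tRPF-RTP1_R1_001.fastq.gz\n"
    "RPF-RTP2\tRPF-RTP\t2\tdata\tRPF-RTP2_R1_001.fastq.gz\n"
)
samples = read_samplesheet(io.StringIO(EXAMPLE))
for row in samples:
    print(row["sample"], row["data_folder"] + "/" + row["fq1"])
```

For a real file, pass an open file handle instead of the `io.StringIO` object, for example `open("config/samples.tsv", newline="")`.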
Some configuration parameters of the pipeline may be specific for your data and library preparation protocol. The options should be adjusted in the config.yml file. For example:
- Minimum and maximum read length after adapter removal (see option cutadapt: default). Here, the test data has a minimum read length of 15 + 7 = 22 and a maximum of 45 + 7 = 52, where the 7 extra nt are the UMI (2 nt on the 5'-end + 5 nt on the 3'-end).
- Unique molecular identifiers (UMIs). For example, the protocol by McGlincy & Ingolia, 2017 creates a UMI that is located on both the 5'-end (2 nt) and the 3'-end (5 nt). These UMIs are extracted with umi_tools (see options umi_extraction: method and pattern).
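The sketch below shows what a regex-based UMI pattern does to a single read, using Python's built-in `re` module rather than umi_tools itself. The pattern is the default listed for umi_extraction: pattern in the parameter table; as written, it captures the first 5 and the last 2 characters of the read string. The function name `split_umi` and the example read are hypothetical.

```python
import re

# Default pattern from the parameter table: named groups starting with
# "umi_" mark the bases that are treated as UMI.
UMI_PATTERN = re.compile(r"^(?P<umi_0>.{5}).*(?P<umi_1>.{2})$")
UMI_LENGTH = 5 + 2  # total UMI bases per read


def split_umi(read):
    """Return (umi, insert) for a read, or None if the read is too short."""
    match = UMI_PATTERN.match(read)
    if match is None:
        return None
    umi = match.group("umi_0") + match.group("umi_1")
    insert = read[match.end("umi_0"):match.start("umi_1")]
    return umi, insert


# Hypothetical 23 nt read: 5 nt UMI part + 16 nt footprint + 2 nt UMI part.
read = "ACGTA" + "GGGGCCCCTTTTAAAA" + "CA"
print(split_umi(read))

# Read length limits before UMI removal, matching the cutadapt defaults
# -m 22 and -M 52 for footprints of 15 to 45 nt.
print(15 + UMI_LENGTH, 45 + UMI_LENGTH)
```

Because the UMI bases are still attached when cutadapt filters by length, the length limits have to include them, which is where the extra 7 nt come from.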
Example configuration files for different sequencing protocols can be found in resources/protocols/.
To run the workflow from command line, change the working directory.
cd path/to/snakemake-bacterial-riboseq
Adjust options in the default config file config/config.yml.
Before running the entire workflow, you can perform a dry run using:
snakemake --dry-run
To run the complete workflow with test files using conda, execute the following command. The definition of the number of compute cores is mandatory.
snakemake --cores 10 --use-conda --directory .test
This table lists all parameters that can be used to run the workflow.
parameter | type | details | default |
---|---|---|---|
samplesheet | | | |
path | str | path to samplesheet, mandatory | "config/samples.tsv" |
get_genome | | | |
database | str | one of manual, ncbi | ncbi |
assembly | str | RefSeq ID | GCF_000006785.2 |
fasta | str | optional path to fasta file | Null |
gff | str | optional path to gff file | Null |
gff_source_type | str | list of name/value pairs for GFF source | see config file |
cutadapt | | | |
fivep_adapter | str | sequence of the 5' adapter | Null |
threep_adapter | str | sequence of the 3' adapter | ATCGTAGATCGGAAGAGCACACGTCTGAA |
default | str | additional options passed to cutadapt | [-q 10, -m 22, -M 52, --overlap=3] |
umi_extraction | | | |
method | str | one of string or regex, see manual | regex |
pattern | str | string or regular expression | ^(?P<umi_0>.{5}).*(?P<umi_1>.{2})$ |
umi_dedup | | | |
options | str | default options for deduplication | see config file |
star | | | |
index | str | location of genome index; if Null, is made | Null |
genomeSAindexNbases | num | length of pre-indexing string, see STAR manual | 9 |
multi | num | max number of loci read is allowed to map | 10 |
sam_multi | num | max number of alignments reported for read | 1 |
intron_max | num | max length of intron; 0 = automatic choice | 1 |
default | str | default options for STAR aligner | see config file |
extract_features | | | |
biotypes | str | biotypes to exclude from mapping | [rRNA, tRNA] |
CDS | str | CDS type to include for mapping | [protein_coding] |
bedtools_intersect | | | |
defaults | str | remove hits, sense strand, min overlap 20% | [-v, -s, -f 0.2] |
annotate_orfs | | | |
window_size | num | size of 5'-UTR added to CDS | 30 |
shift_reads | | | |
window_size | num | start codon window to determine shift | 30 |
read_length | num | size range of reads to use for shifting | [27, 45] |
end_alignment | str | end used for alignment of RiboSeq reads | 3prime |
shift_table | str | optional table with offsets per read length | Null |
export_bigwig | str | export shifted reads as bigwig file | True |
export_ofst | str | export shifted reads as ofst file | False |
skip_shifting | str | skip read shifting entirely | False |
skip_length_filter | str | skip filtering reads by length | False |
multiqc | | | |
config | str | path to multiqc config | config/multiqc_config.yml |
report | | | |
export_figures | bool | export figures as .svg and .png | True |
export_dir | str | sub-directory for figure export | figures/ |
figure_width | num | standard figure width in px | 875 |
figure_height | num | standard figure height in px | 500 |
figure_resolution | num | standard figure resolution in dpi | 125 |
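A few of the constraints in the table can be expressed as a small pre-flight check. This is a minimal sketch under stated assumptions, not the workflow's own schema validation: the nested dictionary layout mirroring the table sections, the rule that fasta and gff must both be set when database is manual, and the function name `check_config` are all assumptions made for illustration.

```python
def check_config(config):
    """Return a list of problems found in a config dictionary."""
    problems = []

    genome = config.get("get_genome", {})
    database = genome.get("database")
    if database not in ("manual", "ncbi"):
        problems.append("get_genome: database must be 'manual' or 'ncbi'")
    elif database == "manual":
        # Assumption: manual mode needs both user-supplied files.
        for key in ("fasta", "gff"):
            if not genome.get(key):
                problems.append("get_genome: " + key + " must be set for 'manual'")
    elif not genome.get("assembly"):
        problems.append("get_genome: assembly (RefSeq ID) must be set for 'ncbi'")

    method = config.get("umi_extraction", {}).get("method")
    if method not in ("string", "regex"):
        problems.append("umi_extraction: method must be 'string' or 'regex'")

    lengths = config.get("shift_reads", {}).get("read_length", [])
    if len(lengths) != 2 or lengths[0] > lengths[1]:
        problems.append("shift_reads: read_length must be [min, max]")

    return problems


# Defaults copied from the table above (only the keys that are checked).
DEFAULTS = {
    "get_genome": {"database": "ncbi", "assembly": "GCF_000006785.2",
                   "fasta": None, "gff": None},
    "umi_extraction": {"method": "regex",
                       "pattern": "^(?P<umi_0>.{5}).*(?P<umi_1>.{2})$"},
    "shift_reads": {"read_length": [27, 45]},
}
print(check_config(DEFAULTS))
```

Returning a list of messages instead of raising on the first problem lets all issues be reported in one pass.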