WestGermanGenomeCenter/circrna_detection

circs_snake : a snakemake-based circRNA detection workflow

Overview

Topics: circular rna rna-seq rna-seq-pipeline rnaseq-pipeline

Latest release: None, Last update: 2021-09-30

Linting: failed, Formatting: failed

Deployment

Step 1: Install Snakemake and Snakedeploy

Snakemake and Snakedeploy are best installed via the Mamba package manager (a drop-in replacement for Conda). If you have neither Conda nor Mamba, it is recommended to install Miniforge. More details can be found in the Mamba documentation.

When using Mamba, run

mamba create -c conda-forge -c bioconda --name snakemake snakemake snakedeploy

to install both Snakemake and Snakedeploy in an isolated environment. For all following commands ensure that this environment is activated via

conda activate snakemake

Step 2: Deploy workflow

With Snakemake and Snakedeploy installed, the workflow can be deployed as follows. First, create an appropriate project working directory on your system and enter it:

mkdir -p path/to/project-workdir
cd path/to/project-workdir

In all following steps, we will assume that you are inside of that directory. Then run

snakedeploy deploy-workflow https://github.com/WestGermanGenomeCenter/circrna_detection . --tag None

Snakedeploy will create two folders, workflow and config. The former contains the deployment of the chosen workflow as a Snakemake module, the latter contains configuration files which will be modified in the next step in order to configure the workflow to your needs.

Step 3: Configure workflow

To configure the workflow, adapt config/config.yml to your needs following the instructions below.

Step 4: Run workflow

The deployment method is controlled using the --software-deployment-method (short --sdm) argument.

To run the workflow with automatic deployment of all required software via conda/mamba, use

snakemake --cores all --sdm conda

To run the workflow using apptainer/singularity, use

snakemake --cores all --sdm apptainer

To run the workflow using a combination of conda and apptainer/singularity for software deployment, use

snakemake --cores all --sdm conda apptainer

Snakemake will automatically detect the main Snakefile in the workflow subfolder and execute the workflow module that has been defined by the deployment in step 2.

For further options such as cluster and cloud execution, see the Snakemake documentation.

Step 5: Generate report

After finalizing your data analysis, you can automatically generate an interactive visual HTML report for inspection of results together with parameters and code inside of the browser using

snakemake --report report.zip

Configuration

The following section is imported from the workflow’s config/README.md.

user's manual for circs_snake

circs_snake is a multi-pipeline workflow for detecting circRNAs in RNA-seq data.

This readme is meant to help you, the user, understand what circs_snake tries to do, so that you can use it or change it to suit your needs and environment. For a first rough overview, let's look at a DAG of this pipeline with two input samples.

Here you can see that (starting from the top) we have four major "starting points":

the parental pipeline flow (starting with rule r01): does the vote, normalization and preparation steps
find_circ (starting with fc_b, fc_a is a rule unpacking .fastq.gz files if this is the given format)
DCC (starting with dcc_b, dcc_a is a rule unpacking .fastq.gz files if this is the given format)
CIRCexplorer1 (starting with cx_b, cx_a is a rule unpacking .fastq.gz files if this is the given format)

Each of the pipelines is run twice here, since we have two input samples in this example. The exception is the parental pipeline, which is run only once for each dataset. Another visualization of the same flow is below, making this a little clearer:

Here you can see what happens with the data: First all three pipelines (find_circ, DCC, CIRCexplorer1) are run on each sample, resulting in one file for each sample for each pipeline. An example output file at this stage looks like this:

These files are summarized in steps r06a, r06b and r06c, which produce a .mat1 file for each pipeline. The columns in this file are: circRNA coordinates, strand, sample name, detected quantity, quality, quality, RefSeq annotation. Next, annotation is added and the data is summarized into a .mat2 file (r07a, r07b, r07c). These pipeline-specific .mat2 files are then voted on: circRNA coordinates are overlapped, and only circRNAs found by all three pipelines (3/3 overlaps) are kept. Finally, the voted results are normalized, giving three normalized and voted circRNA data files as the main output of this pipeline. An example output file is provided as example_output_norm_voted_dcc_hg19.csv.
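
The voting step described above can be sketched in Python. This is a minimal illustration, not the workflow's own code: it assumes that each circRNA is keyed by an exact (chromosome, start, end, strand) tuple and that "overlap" means identical keys, whereas the real rules may compare coordinates differently. The function name vote and the example calls are made up for this sketch.

```python
from typing import Iterable

# Assumption for illustration: a circRNA is identified by
# (chromosome, start, end, strand). The real .mat2 layout may differ.
Circ = tuple[str, int, int, str]


def vote(calls_per_pipeline: dict[str, Iterable[Circ]]) -> set[Circ]:
    """Keep only circRNAs reported by every pipeline (3/3 for three pipelines)."""
    call_sets = [set(calls) for calls in calls_per_pipeline.values()]
    if not call_sets:
        return set()
    # A circRNA survives the vote only if it is in all call sets.
    return set.intersection(*call_sets)


if __name__ == "__main__":
    # Made-up example calls, one list per pipeline.
    calls = {
        "find_circ": [("chr1", 100, 500, "+"), ("chr2", 40, 90, "-")],
        "DCC": [("chr1", 100, 500, "+"), ("chr3", 10, 20, "+")],
        "CIRCexplorer1": [("chr1", 100, 500, "+"), ("chr2", 40, 90, "-")],
    }
    print(sorted(vote(calls)))  # prints [('chr1', 100, 500, '+')]
```

In this example the chr2 candidate is dropped because DCC did not report it, which is the effect of requiring 3/3 agreement.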

before you can run this

Before you can run this workflow, you need:

snakemake installed
the find_circ scripts from the official website (http://circbase.org/cgi-bin/downloads.cgi, "Custom scripts for finding circRNAs"; unpack them and edit find_circ_conf.yaml accordingly)
DCC and CIRCexplorer1 installed (install or download, then edit the config.yaml files accordingly)
a reference genome index built for STAR and Bowtie2, as well as the reference genome in .fa and .gtf format (other annotation data is in the data/ dir; edit the config.yaml files accordingly)
all other software dependencies should be handled by snakemake, see the env.yaml files

The config.yaml files are for my specific deployment; yours will differ. You only need to change the directories for each of the needed files / folders, and you can also change pipeline-specific parameters to your liking. I attached hg19 and hg38 example config.yaml files to ease your adaptation.

And that's it! An example of how to execute the pipeline is given in howtostart.sh, a cluster config example is given in cluster_config.yaml, and an example samplesheet is given as well (samples.tsv).

the samplesheet and expected files

Given this as samples.tsv:

samples
"SRR3184300"
"SRR3184285"

the workflow expects:

SRR3184300_1.fastq and SRR3184300_2.fastq, plus SRR3184285_1.fastq and SRR3184285_2.fastq, in the root directory of this workflow (path/to/circs_snake/); put the .fastq files there. The lane identifier is changeable in the config.yaml:

lane_ident1:
 "_1"
lane_ident2:
 "_2"
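
The naming rule above (sample name + lane identifier + .fastq) can be checked with a short Python sketch. It is only an illustration under assumptions: the helper names read_samples and expected_fastqs are hypothetical, the sheet is assumed to have a single column with a "samples" header, and the defaults mirror the lane_ident1 / lane_ident2 values shown above.

```python
import csv
import io


def read_samples(tsv_text: str) -> list[str]:
    """Return sample names from a one-column sheet with a 'samples' header."""
    # csv.reader removes the double quotes around each sample name.
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    rows = [row[0].strip() for row in reader if row and row[0].strip()]
    return rows[1:]  # drop the header row


def expected_fastqs(samples, lane_ident1="_1", lane_ident2="_2", ext=".fastq"):
    """Build the (mate 1, mate 2) file names expected for each sample."""
    return [(f"{s}{lane_ident1}{ext}", f"{s}{lane_ident2}{ext}") for s in samples]


if __name__ == "__main__":
    sheet = 'samples\n"SRR3184300"\n"SRR3184285"\n'
    for mate1, mate2 in expected_fastqs(read_samples(sheet)):
        print(mate1, mate2)
```

Comparing this list against the files actually present in the workflow root is a quick way to catch a wrong lane identifier before starting a run.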

The workflow itself creates the needed .tsv file from the input .fastq files in its root directory. You can also create it yourself, see scripts/snake_infile_creator.pl. (In the parental Snakefile, rule r03 creates it from a .fastq file list previously created by rule r02.)
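
If you prefer to build the samplesheet by hand, the idea can be sketched in Python as follows. This is not a port of scripts/snake_infile_creator.pl and may behave differently from it: the function names are hypothetical, and the sketch assumes that a sample counts only when both mates are present and that the sheet uses the quoted one-column format shown earlier.

```python
import os


def samples_from_fastqs(filenames, lane_ident1="_1", lane_ident2="_2", ext=".fastq"):
    """Return sorted sample names for which both mate files are listed."""
    names = set(filenames)
    suffix1 = lane_ident1 + ext
    samples = []
    for name in names:
        if name.endswith(suffix1):
            sample = name[: -len(suffix1)]
            # Only keep the sample if its second mate exists too.
            if sample + lane_ident2 + ext in names:
                samples.append(sample)
    return sorted(samples)


def format_samplesheet(samples):
    """Render a one-column sheet: 'samples' header, then quoted names."""
    lines = ["samples"] + [f'"{s}"' for s in samples]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Scan the current directory, assumed here to be the workflow root.
    print(format_samplesheet(samples_from_fastqs(os.listdir("."))), end="")
```

Redirecting the printed output to samples.tsv would give a sheet in the same shape as the example, with samples in alphabetical order.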

how to start a typical circs_snake run:

Copy or move paired-end, trimmed and QC'ed .fastq files into circs_snake/.
Check that the lane identifier is correct in all config.yaml files (change it if needed).
Run snakemake (for more options, see howtorun.sh).

Linting and formatting

Linting results

Lints for snakefile /tmp/tmpojt17qj5/Snakefile:
    * Absolute path "/cx_out/"+config[" in line 12:
      Do not define absolute paths inside of the workflow, since this renders
      your workflow irreproducible on other machines. Use path relative to the
      working directory instead, or make the path configurable via a config
      file.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/configuration.html#configuration
    * Absolute path "/"+" in line 12:
      Do not define absolute paths inside of the workflow, since this renders
      your workflow irreproducible on other machines. Use path relative to the
      working directory instead, or make the path configurable via a config
      file.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/configuration.html#configuration
    * Absolute path "/dc_out/"+config[" in line 13:
      Do not define absolute paths inside of the workflow, since this renders
      your workflow irreproducible on other machines. Use path relative to the
      working directory instead, or make the path configurable via a config
      file.

... (truncated)

Formatting results

[INFO] 1 file(s) would be changed 😬

snakefmt version: 0.4.3