mukherjeelab/CLIPs4U
Snakemake workflow for PAR-CLIP data analysis
Overview
Topics:
Latest release: None, Last update: 2024-07-12
Linting: linting: failed, Formatting:formatting: failed
Deployment
Step 1: Install Snakemake and Snakedeploy
Snakemake and Snakedeploy are best installed via the Mamba package manager (a drop-in replacement for conda). If you have neither Conda nor Mamba, it is recommended to install Miniforge. More details regarding Mamba can be found here.
When using Mamba, run
mamba create -c conda-forge -c bioconda --name snakemake snakemake snakedeploy
to install both Snakemake and Snakedeploy in an isolated environment. For all following commands ensure that this environment is activated via
conda activate snakemake
Step 2: Deploy workflow
With Snakemake and Snakedeploy installed, the workflow can be deployed as follows. First, create an appropriate project working directory on your system and enter it:
mkdir -p path/to/project-workdir
cd path/to/project-workdir
In all following steps, we will assume that you are inside of that directory. Then run
snakedeploy deploy-workflow https://github.com/mukherjeelab/CLIPs4U . --tag None
Snakedeploy will create two folders, workflow
and config
. The former contains the deployment of the chosen workflow as a Snakemake module, the latter contains configuration files which will be modified in the next step in order to configure the workflow to your needs.
Step 3: Configure workflow
To configure the workflow, adapt config/config.yml
to your needs following the instructions below.
Step 4: Run workflow
The deployment method is controlled using the --software-deployment-method
(short --sdm
) argument.
To run the workflow with automatic deployment of all required software via conda
/mamba
, use
snakemake --cores all --sdm conda
Snakemake will automatically detect the main Snakefile
in the workflow
subfolder and execute the workflow module that has been defined by the deployment in step 2.
For further options such as cluster and cloud execution, see the docs.
Step 5: Generate report
After finalizing your data analysis, you can automatically generate an interactive visual HTML report for inspection of results together with parameters and code inside of the browser using
snakemake --report report.zip
Configuration
The following section is imported from the workflow’s config/README.md
.
Most of the parameters for CLIPs4U run have to be set up in your custom YAML file (for example config.yaml
).
You can edit config.yaml
, which is an empty config file in any text editor and set all parameters.
NOTE: only 3'adapter and 5'adapter sequences are absolutely required.
NOTE: when you do not provide path to your input file(s) CLIPs4U will run with test dataset (ZFP36.fq.gz) and your config file.
The configuration file contains seven main sections:
- exec and input paths and hardware settings
- genome file and genomic annotations
- adapter trimming and reads collapsing
- removing repetitive elements
- genome alignment
- PARalyzer params
- motif enrichment params
Default configuration file is stored in config/default_config.yaml
Exec and input paths and hardware settings section contains two fields; the input_files
field where you specify path(s) to input fastq.gz file(s). When you have more than one input, separate paths by comma. In the threads
field you specify maximum naumber of processors used by the rules within workflow.
Section from default config is shown below.
{
#exec and input paths and hardware settings
'input_files': "", #(REQUIRED)absolute path(s) must be provided by user in config file, if > 1, comma separated
'threads': 4,
Genome file and genomic annotations contains five fields. The first one specify organism. Right now you can choose human or mouse ("hs"
or "mm"
). If you want to run your analysis with data from other organism(s) open an issue and|or write an e-mail to marcin.sajek@gmail.com and we will prepare the data and|or modify scripts.
genome_fasta
and gtf
fields contain paths to genome sequence and gtf files. When empty (""
) both genome and its annotation will be downloaded from gencode. Gencode version of genome and gtf is specified in the genome_version
field. The default version for human genome is 45, and for mouse - 35.
NOTE: Gencode versions for both organism must be greater than 4
annot_rank
field contains path to annotation ranking file. If empty, default annot rank file will be used. This file contain categories from gene_type
field of gtf file, and if the peak is mapping to region with multiple annotations, lowest rank is chosen. The most abundant RNA types have highest ranks.
Section from default config is shown below.
'organism' : "hs", #(OPTIONAL) allowed values are "hs" for human and "mm" for mouse
'genome_fasta' : "", #(OPTIONAL) you can specify your own genome file, if not specified GRCh38/GRCm39 primary assembly from gencode v45 will be automatically downloaded
'gtf' : "", #(OPTIONAL) you can specify your own gtf file (compatible with genome), if not specified basic annotation for GRCh38/GRCm39 primary assembly from gencode v45 will be automatically downloaded
'genome_version' : "", #(OPTIONAL) you can specify your own genome/gtf version; default: GRCh38/GRCm39 primary assembly from gencode v45/35
'annot_rank' : "", #(OPTIONAL) you can specify your own annotation rank file, suggesed when using your own genome; default: CLIPs4U/annotation/{organism}/annot_ranks.txt
Adapter trimming and reads collapsing section contain adapter trimming parameters. Please familiarize yourself with cutadapt before filling this section. For some analyses it may be better to split adapter trimming into two rounds. If you want to do it in one step fill cutadapt_params1
field and leave cutadapt_params2
field with empty quotes cutadapt_params2 : ""
, which is a default setting. If yor sequencing was performed on NextSeq or NovaSeq machine consider setting nextseq
to True
, to remove a ‘dark cycle’ encoded as string of Gs. If you have UMIs in your read set the length of UMI in umi
field. If your UMIs are in the 5' end of the read use positive integer, if at the 3' end, negative integer.
Section from default config is shown below.
#adapter trimming and reads collapsing
'three_prime_adapter' : "", #(REQUIRED!!!) put your 3'adapter sequence
'five_prime_adapter' : "", #(REQUIRED!!!) your 5'adapter sequence
'nextseq' : False, #(OPTIONAL) put True or 1 if your sequencing run was on NextSeq machine
'cutadapt_params1' : "-b CGTACGCGGGTTTAAACGA -b CTCATCTTGGTCGTACGCGGAATAGTTTAAACTGT -n 3 -j 0 -m 15 -M 50 --max-n 1", #(OPTIONAL) - additional cutadapt parameters
'cutadapt_params2' : "", #(OPTIONAL) leave empty to ommit this step, if you want to perform second trimming put all cutadapt parameters for 2nd trimming here (including adapters)
'umi' : "", #(OPTIONAL) leave empty if you don't have UMIs, for 5' UMIs positive integer equal UMI length, if 3'UMIs - negative integer equal UMI length
TIP: -j 0
in cutadapt determined automatically numbers of available threads.
TIP: Reads that are to long can cause further issues with PARalyzer (e.g. reported here). Consider setting a max read length parameter, e.g. -M 50
.
Removing repetitive elements section is skipped with default settings. User can perform removing repetitive element before genome alignent by setting rem_rep
to True
. CLIPs4U is using STAR for this task. It is using default repetitive elments index and STAR parameters if rep_idx
and star_rem_reps_params
are empty.
Section from default config is shown below.
#removing repetitive elements
'rem_rep' : "", #(OPTIONAL) leave empty to omit this step, put True or 1 if you want to remove repetitive elements
'rep_idx' : "", #(OPTIONAL) default: CLIPs4U/annotation/{organism}/rep_idx
'star_rem_reps_params' : "--runMode alignReads --genomeLoad NoSharedMemory --alignEndsType EndToEnd --outSAMunmapped Within --outFilterMultimapNmax 30 --outFilterMultimapScoreRange 1 --outSAMtype BAM Unsorted --outFilterType BySJout --outBAMcompression 10 --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --outSAMattributes All --outSAMmode Full", #(OPTIONAL)
Genome alignment section of the configuration file contains a parameters list passed to aligner. Default aligner is bowtie. User can switch aligner to STAR by filling aligner
field with one of "STAR"
, "star"
, "S"
or "s"
. Remaining fields contain paths to bowtie or STAR indexes (generated automatically for selected aligner if left empty) and aligner parameters (filled with default settings when empty).
Section from default config is shown below.
#genome alignment
'aligner' : "", #(OPTIONAL) default bowtie, use 'S', 's', 'STAR' or 'star' to switch to STAR
'bowtie_index_dir' : "", #(OPTIONAL) default - will be created automatically in your working dir if aligner will be set to bowtie
'bowtie_index_params' : "", #(OPTIONAL)
'bowtie_map_params' : "-v 1 -m 10 --best --strata -S", #(OPTIONAL)
'star_index_dir' : "", #(OPTIONAL) default - will be created automatically in your working dir if aligner will be set to STAR
'star_index_params' : "--runMode genomeGenerate --sjdbOverhang 100", #(OPTIONAL)
'star_map_params' : "--runMode alignReads --genomeLoad NoSharedMemory --alignEndsType EndToEnd --outSAMunmapped Within --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1 --outSAMattributes All --outSAMtype BAM Unsorted --outFilterType BySJout --outFilterMismatchNoverReadLmax 0.05 --outSAMattrRGline ID:foo --outStd Log --outBAMcompression 10 --outSAMmode Full", #(OPTIONAL)
The PARalyzer parameters section is directly used to create .ini file for PARalyzer run and setup memory limit for PARalyzer (paralyzer_memory
field). To properly fill all fields please read PARalyzer manual. Field ADDITIONAL_NUCLEOTIDES_BEYOND_SIGNAL
is required only if signal_extension : ADDITIONAL_NUCLEOTIDES_BEYOND_SIGNAL
. In other cases it will be skipped. If your library does not contain UMIs you may consider switching use_pcr_duplicates
to True
. Then PARalyzer will take into account all reads without PCR duplicates discrimination.
Section from default config is shown below.
#PARalyzer params
'paralyzer_memory' : "4G", #(OPTIONAL)
'BANDWIDTH' : 3, #(OPTIONAL)
'CONVERSION' : "T>C", #(OPTIONAL), possible choices "T>C", "G>A"
'MINIMUM_READ_COUNT_PER_GROUP' : 5, #(OPTIONAL),
'MINIMUM_READ_COUNT_PER_CLUSTER' : 2, #(OPTIONAL)
'MINIMUM_READ_COUNT_FOR_KDE' : 3, #(OPTIONAL)
'MINIMUM_CLUSTER_SIZE' : 11, #(OPTIONAL)
'MINIMUM_CONVERSION_LOCATIONS_FOR_CLUSTER' : 2, #(OPTIONAL)
'MINIMUM_CONVERSION_COUNT_FOR_CLUSTER' : 2, #(OPTIONAL)
'MINIMUM_READ_COUNT_FOR_CLUSTER_INCLUSION' : 1, #(OPTIONAL)
'MINIMUM_READ_LENGTH' : 15, #(OPTIONAL)
'MAXIMUM_NUMBER_OF_NON_CONVERSION_MISMATCHES' : 1, #(OPTIONAL)
'signal_extension' : "EXTEND_BY_READ", ##(OPTIONAL) one of 'EXTEND_BY_READ', 'HAFNER_APPROACH', 'ADDITIONAL_NUCLEOTIDES_BEYOND_SIGNAL'
'ADDITIONAL_NUCLEOTIDES_BEYOND_SIGNAL' : 2, #only valid when signal_extension : 'ADDITIONAL_NUCLEOTIDES_BEYOND_SIGNAL'
'use_pcr_duplicates' : False, #(OPTIONAL) change to True to force PARalyzer to treat pcr duplicates as individual reads (will increase number of clusters, but also false positives)
The last section contains motif enrichment parameters. Motif enrichmantis performed by one of the tools from MEME. Default value for motif_enrichment_method
is meme, user can also choose dreme or streme. motif_enrichment_params
field should be filled with parameters specific for selected method.
NOTE: For motif enrichment analysis keep alphabet as -dna
. Do not switch it to -rna
. Final motifs will be changed to RNA alphabet automatically in the next steps.
Section from default config is shown below.
#motif enrichment params
'motif_enrichment_method' : "meme", #(OPTIONAL), possible choices "meme", "dreme", "streme"
'motif_enrichment_params' : "-dna -mod anr -evt 0.05 -minw 3", #(OPTIONAL); be aware that some prameters are specific for enrichment method chosen
}
Linting and formatting
Linting results
Traceback (most recent call last):
File "/home/runner/micromamba-root/envs/snakemake-workflow-catalog/lib/python3.12/site-packages/snakemake/cli.py", line 2011, in args_to_api
any_lint = workflow_api.lint()
^^^^^^^^^^^^^^^^^^^
File "/home/runner/micromamba-root/envs/snakemake-workflow-catalog/lib/python3.12/site-packages/snakemake/api.py", line 337, in _handle_no_dag
return method(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/runner/micromamba-root/envs/snakemake-workflow-catalog/lib/python3.12/site-packages/snakemake/api.py", line 354, in lint
workflow.include(
File "/home/runner/micromamba-root/envs/snakemake-workflow-catalog/lib/python3.12/site-packages/snakemake/workflow.py", line 1399, in include
exec(compile(code, snakefile.get_path_or_uri(), "exec"), self.globals)
File "/tmp/tmpz4ql9bk1/workflow/Snakefile", line 11, in <module>
from scripts.genomic_files_and_indexes import prepare_genomes_and_indexes
File "/tmp/tmpz4ql9bk1/workflow/scripts/genomic_files_and_indexes.py", line 9, in <module>
... (truncated)
Formatting results
[DEBUG]
[DEBUG] In file "/tmp/tmpz4ql9bk1/workflow/Snakefile": Formatted content is different from original
[DEBUG]
[DEBUG] In file "/tmp/tmpz4ql9bk1/workflow/rules/genomes_indexes.smk": Formatted content is different from original
[DEBUG]
[DEBUG] In file "/tmp/tmpz4ql9bk1/workflow/rules/quality.smk": Formatted content is different from original
[DEBUG]
[WARNING] In file "/tmp/tmpz4ql9bk1/workflow/rules/alignment.smk": Keyword "output" at line 7 has comments under a value.
PEP8 recommends block comments appear before what they describe
(see https://www.python.org/dev/peps/pep-0008/#id30)
[DEBUG] In file "/tmp/tmpz4ql9bk1/workflow/rules/alignment.smk": Formatted content is different from original
[DEBUG]
[DEBUG] In file "/tmp/tmpz4ql9bk1/workflow/rules/remove_rep_elements.smk": Formatted content is different from original
[DEBUG]
[DEBUG] In file "/tmp/tmpz4ql9bk1/workflow/rules/generate_report.smk": Formatted content is different from original
[DEBUG]
[DEBUG] In file "/tmp/tmpz4ql9bk1/workflow/rules/trimming.smk": Formatted content is different from original
[DEBUG]
[DEBUG] In file "/tmp/tmpz4ql9bk1/workflow/rules/bigwigs.smk": Formatted content is different from original
[DEBUG]
... (truncated)