MPUSP/snakemake-crispr-guides
A Snakemake workflow for the design of small guide RNAs (sgRNAs) for CRISPR applications.
Overview
Latest release: v1.5.1, Last update: 2025-11-04
Linting: passed, Formatting: passed
Topics: bioinformatics-pipeline crispr crispr-design guide-rna-library python3 r-markdown snakemake workflow
Deployment
Step 1: Install Snakemake and Snakedeploy
Snakemake and Snakedeploy are best installed via Conda. It is recommended to install Conda via Miniforge. Run
conda create -c conda-forge -c bioconda -c nodefaults --name snakemake snakemake snakedeploy
to install both Snakemake and Snakedeploy in an isolated environment. For all following commands ensure that this environment is activated via
conda activate snakemake
For other installation methods, refer to the Snakemake and Snakedeploy documentation.
Step 2: Deploy workflow
With Snakemake and Snakedeploy installed, the workflow can be deployed as follows. First, create an appropriate project working directory on your system and enter it:
mkdir -p path/to/project-workdir
cd path/to/project-workdir
In all following steps, we will assume that you are inside of that directory. Then run
snakedeploy deploy-workflow https://github.com/MPUSP/snakemake-crispr-guides . --tag v1.5.1
Snakedeploy will create two folders, workflow and config. The former contains the deployment of the chosen workflow as a Snakemake module, the latter contains configuration files which will be modified in the next step in order to configure the workflow to your needs.
Step 3: Configure workflow
To configure the workflow, adapt config/config.yml to your needs following the instructions below.
Step 4: Run workflow
The deployment method is controlled using the --software-deployment-method (short --sdm) argument.
To run the workflow using a combination of conda and apptainer/singularity for software deployment, use
snakemake --cores all --sdm conda apptainer
To run the workflow with automatic deployment of all required software via conda/mamba, use
snakemake --cores all --sdm conda
Snakemake will automatically detect the main Snakefile in the workflow subfolder and execute the workflow module that has been defined by the deployment in step 2.
For further options such as cluster and cloud execution, see the docs.
Step 5: Generate report
After finalizing your data analysis, you can automatically generate an interactive visual HTML report for inspection of results together with parameters and code inside of the browser using
snakemake --report report.zip
Configuration
The following section is imported from the workflow’s config/README.md.
Running the workflow
Input data
The workflow requires the following input:
- An NCBI RefSeq ID, e.g. GCF_000006945.2. Find your genome assembly and corresponding ID on NCBI genomes
- OR use a custom pair of *.fasta file and *.gff file that describe the genome of choice
Important requirements when using custom *.fasta and *.gff files:
- *.gff genome annotation must have the same chromosome/region name as the *.fasta file (example: NC_003197.2)
- *.gff genome annotation must have gene and CDS type annotation that is automatically parsed to extract transcripts
- *.gff genome annotation must have additional qualifiers Name=..., ID=..., and Parent=... for CDSs
- all chromosomes/regions in the *.gff genome annotation must be present in the *.fasta sequence
- but not all sequences in the *.fasta file need to have annotated genes in the *.gff file
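As a rough pre-flight check for custom input files, the following sketch tests two of the requirements above: that every chromosome/region named in the GFF is present in the FASTA, and that gene and CDS features exist. It is not part of the workflow; the function names (`fasta_ids`, `gff_summary`, `check_inputs`) and the tiny in-memory example data are hypothetical, and the check assumes tab-separated GFF3 with nine columns.

```python
import io


def fasta_ids(handle):
    """Return the set of sequence IDs (first word of each '>' header line)."""
    ids = set()
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def gff_summary(handle):
    """Return (sequence IDs, feature types) found in a GFF file."""
    seqids, types = set(), set()
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # skip lines that are not complete feature records
        seqids.add(cols[0])
        types.add(cols[2])
    return seqids, types


def check_inputs(fasta_handle, gff_handle):
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    fa = fasta_ids(fasta_handle)
    seqids, types = gff_summary(gff_handle)
    missing = sorted(seqids - fa)
    if missing:
        problems.append("GFF regions missing from FASTA: " + ", ".join(missing))
    for required in ("gene", "CDS"):
        if required not in types:
            problems.append("GFF has no '" + required + "' features")
    return problems


# Tiny in-memory example; real use would pass open file handles instead.
fasta = io.StringIO(">NC_003197.2 example chromosome\nATGCATGC\n>plasmid1\nATGC\n")
gff = io.StringIO(
    "##gff-version 3\n"
    "NC_003197.2\tRefSeq\tgene\t1\t8\t.\t+\t.\tID=gene0;Name=abc\n"
    "NC_003197.2\tRefSeq\tCDS\t1\t8\t.\t+\t0\tID=cds0;Name=abc;Parent=gene0\n"
)
print(check_inputs(fasta, gff))
```

The extra `plasmid1` sequence without annotation is accepted, in line with the last requirement. The Name/ID/Parent qualifiers are not checked here.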
Parameters
This table lists all parameters that can be used to run the workflow.
| parameter | type | details | default |
|---|---|---|---|
| GET_GENOME | | | |
| database | string | one of | |
| assembly | string | RefSeq ID | |
| fasta | path | optional input | |
| gff | path | optional input | |
| gff_source_type | list | allowed source types in GFF file | |
| DESIGN_GUIDES | | | |
| target_region | numeric | use subset of regions for testing | |
| target_type | string | specify targets for guide design (see below) | |
| tss_window | numeric | upstream/downstream window around TSS | |
| tiling_window | numeric | window size for intergenic regions | |
| tiling_min_dist | numeric | min distance between TSS and intergenic region | |
| circular | logical | is the genome circular? | |
| canonical | logical | only canonical PAM sites are included | |
| strands | string | target | |
| spacer_length | numeric | desired length of guides | |
| guide_aligner | string | one of | |
| crispr_enzyme | string | CRISPR enzyme ID | |
| score_methods | string | see crisprScore package | default scores are listed below |
| score_weights | numeric | opt. weights when calculating mean score | |
| restriction_sites | string | sequences to omit in entire guide | |
| bad_seeds | string | sequences to omit in seed region | |
| no_target_controls | numeric | number of non targeting guides (neg. controls) | 100 |
| FILTER_GUIDES | | | |
| filter_best_per_gene | numeric | max number of guides to return per gene | |
| filter_best_per_tile | numeric | max number of guides to return per ig/tile | |
| filter_score_threshold | numeric | mean score to use as lower limit | |
| filter_multi_targets | logical | remove guides that perfectly match >1 target | |
| filter_rna | logical | remove guides that target e.g. rRNA or tRNA | |
| gc_content_range | numeric | range of allowed GC content | |
| fiveprime_linker | string | optionally add 5’ linker to each guide | |
| threeprime_linker | string | optionally add 3’ linker to each guide | |
| export_as_gff | logical | export result table to | |
| export_as_fasta | logical | export result table to | |
| REPORT | | | |
| show_examples | numeric | number of genes to show guide position | |
| show_genomic_range | numeric | genome start and end pos to show tiling guides | |
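To give a feel for what sequence-based filters such as gc_content_range and restriction_sites could do to a single spacer, here is a minimal sketch. It is an illustration only, not the workflow's code: the function names, the percent scale, the inclusive bounds, and the example range of 30 to 70 are all assumptions, and only literal forward-strand matches of forbidden sequences are checked.

```python
def gc_percent(seq):
    """GC content of a sequence in percent (0-100)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def passes_filters(spacer, gc_range=(30, 70), forbidden=()):
    """True if GC content lies within gc_range (inclusive, assumed to be percent)
    and none of the forbidden sequences occurs anywhere in the spacer."""
    spacer = spacer.upper()
    low, high = gc_range
    if not low <= gc_percent(spacer) <= high:
        return False
    return not any(site.upper() in spacer for site in forbidden)


# Hypothetical spacers and a hypothetical forbidden site.
print(passes_filters("ACGTACGTACGTACGTACGT"))
print(passes_filters("ACGTACGTACGTACGTACGT", forbidden=("GTAC",)))
print(passes_filters("AAAAAAAAAAAAAAAAAAAT"))
```

The first spacer has 50% GC and no forbidden site, the second contains the forbidden sequence, and the third falls below the assumed lower GC bound.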
Target type
One of the most important options is to specify the type of target with the target_type parameter. The pipeline can generate up to three different types of guide RNAs:
- guides for targets - these are typically genes, promoters or other annotated genetic elements determined from the supplied GFF file. The pipeline will try to find the best guides by position and score targeting the defined window around the start of the gene/feature (parameter tss_window). The number of guides is specified with filter_best_per_gene.
- guides for intergenic regions - for non-annotated regions (or in the absence of any targets), the pipeline attempts to design guide RNAs using a ‘tiling’ approach. This means that the supplied genome is subdivided into ‘tiles’ (bins) of width tiling_window, and the best guide RNAs per window are selected. The number of guides is specified with filter_best_per_tile.
- guides not targeting anything - this type of guide RNA is most useful as a negative control, in order to gauge the effect of the genetic background on mutant selection without targeting a gene. These guides are random nucleotide sequences with the same length as the target guide RNAs. The no-target control guides are named NTC_<number> and exported in a separate table (results/filter_guides/guideRNAs_ntc.csv). Only a reduced set of checks, such as off-target binding, is performed for these guides. Most on-target checks are omitted because these guides have no defined binding site, strand, or other typical guide properties.
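The tiling approach and the no-target controls described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the workflow's implementation: the function names, the 1-based inclusive coordinates, the (position, score) representation of a guide, and numbering the controls from 1 are hypothetical, and tiling_min_dist and genome circularity are ignored.

```python
import random


def make_tiles(genome_length, tiling_window):
    """Split positions 1..genome_length into consecutive (start, end) tiles.
    Coordinates are 1-based and inclusive; the last tile may be shorter."""
    if tiling_window <= 0:
        raise ValueError("tiling_window must be positive")
    return [
        (start, min(start + tiling_window - 1, genome_length))
        for start in range(1, genome_length + 1, tiling_window)
    ]


def best_per_tile(guides, tiles, filter_best_per_tile):
    """Keep the top-scoring guides in each tile.
    `guides` is a list of (position, score) tuples."""
    selected = {}
    for start, end in tiles:
        in_tile = [g for g in guides if start <= g[0] <= end]
        in_tile.sort(key=lambda g: g[1], reverse=True)
        selected[(start, end)] = in_tile[:filter_best_per_tile]
    return selected


def random_ntc_guides(n, spacer_length, seed=0):
    """Generate n random spacers named NTC_<number> (numbering from 1 is assumed)."""
    rng = random.Random(seed)
    return {
        "NTC_" + str(i): "".join(rng.choice("ACGT") for _ in range(spacer_length))
        for i in range(1, n + 1)
    }


tiles = make_tiles(2500, 1000)
print(tiles)  # [(1, 1000), (1001, 2000), (2001, 2500)]
guides = [(10, 0.9), (500, 0.4), (900, 0.7), (1500, 0.8)]
print(best_per_tile(guides, tiles, 2))
print(list(random_ntc_guides(3, 20)))  # ['NTC_1', 'NTC_2', 'NTC_3']
```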
Off-target scores
The pipeline maps each guide RNA to the target genome and, by default, counts the number of alternative alignments with 1, 2, 3, or 4 mismatches. All guide RNAs that also map to any other position with up to 4 mismatches are removed.
An exception to this rule is made for guides that perfectly match multiple targets when filter_multi_targets is set to False (default: True). The reasoning is that genomes often contain duplicated genes/targets, and the default but sometimes undesired behavior is to remove all guides that target two or more duplicates. If set to False, these guides are not removed, and duplicated genes are targeted even if they are located at different sites.
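The off-target rule can be illustrated with a brute-force sketch that counts alignments by number of mismatches and then applies the filter. This is only one reading of the rule above and is not how the workflow does it (the workflow uses a dedicated aligner, see guide_aligner). The function names are hypothetical, PAM sites are ignored, and the genome is treated as linear.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def count_alignments(guide, genome, max_mismatches=4):
    """Count sites on both strands by number of mismatches.
    Returns a list where index k holds the number of sites with k mismatches."""
    counts = [0] * (max_mismatches + 1)
    n = len(guide)
    # A set avoids double counting when the guide equals its reverse complement.
    for query in {guide, reverse_complement(guide)}:
        for i in range(len(genome) - n + 1):
            mm = hamming(query, genome[i:i + n])
            if mm <= max_mismatches:
                counts[mm] += 1
    return counts


def keep_guide(counts, filter_multi_targets=True):
    """Drop guides with any imperfect alignment; guides with several perfect
    matches are kept only when filter_multi_targets is False."""
    if sum(counts[1:]) > 0:
        return False
    if filter_multi_targets:
        return counts[0] == 1
    return counts[0] >= 1


# Toy example with a 6 nt guide and a duplicated target site.
counts = count_alignments("ACGGTC", "ACGGTCAAACGGTC", max_mismatches=0)
print(counts, keep_guide(counts), keep_guide(counts, filter_multi_targets=False))
```

In the toy example the guide matches two identical sites, so it is removed under the default setting and kept when filter_multi_targets is False.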
On-target scores
The R crisprScore package offers more on-target scores than the set included by default. It is important to note that the computation of some scores does not necessarily make sense for the design of every CRISPR library. For example, several scores were obtained from analysis of Cas9 cutting efficiency in human cell lines. For such scores it is questionable whether they are useful for the design of a different type of library, for example a dCas9 CRISPR inhibition library for bacteria.
Another good reason to exclude some scores is the computational resources they require. Deep learning-derived scores in particular are calculated by machine learning models that require considerable extra disk space (downloaded and installed via basilisk and conda environments) and processing power (orders of magnitude longer computation time).
Users can look up all available scores on the R crisprScore GitHub page and decide which ones should be included. In addition, the default behavior of the pipeline is to compute an average score and select the top N guides based on it. The average score is the weighted mean of all single scores, and the score_weights can be defined in the config/config.yml file. If a score should be excluded from the ranking, its weight can simply be set to zero.
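The weighted mean and the top-N selection can be written out as a short sketch. The function names, the dictionary layout, and the example scores are hypothetical and only illustrate the arithmetic; the workflow itself may scale or combine scores differently.

```python
def weighted_mean_score(scores, score_weights):
    """Weighted mean of per-method scores.
    A weight of zero removes that score from the ranking."""
    total_weight = sum(score_weights[name] for name in scores)
    if total_weight == 0:
        raise ValueError("at least one score needs a non-zero weight")
    weighted_sum = sum(value * score_weights[name] for name, value in scores.items())
    return weighted_sum / total_weight


def top_n(guides, score_weights, n):
    """Rank guides (name -> {method: score}) by weighted mean and keep the best n."""
    ranked = sorted(
        guides,
        key=lambda g: weighted_mean_score(guides[g], score_weights),
        reverse=True,
    )
    return ranked[:n]


# Hypothetical scores for three guides; tssdist is excluded via a zero weight.
weights = {"ruleset1": 1, "crisprater": 1, "tssdist": 0}
guides = {
    "guide_a": {"ruleset1": 0.5, "crisprater": 0.25, "tssdist": 1.0},
    "guide_b": {"ruleset1": 1.0, "crisprater": 0.5, "tssdist": 0.0},
    "guide_c": {"ruleset1": 0.0, "crisprater": 0.0, "tssdist": 1.0},
}
print(weighted_mean_score(guides["guide_a"], weights))  # 0.375
print(top_n(guides, weights, 2))  # ['guide_b', 'guide_a']
```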
The default scores are:
- ruleset1, ruleset3, crisprater, and crisprscan from the crisprScore package
- tssdist as an additional score representing the relative distance to the promoter. Only relevant for CRISPRi repression
- genrich as an additional score representing the G enrichment in the -4 to -14 nt region of a spacer (Miao & Jahn et al., 2023). Only relevant for CRISPRi repression
Strand specificity
The strand specificity is important for some CRISPR applications. In contrast to the crisprDesign package, functions were added to allow the design of guide RNAs that target either both strands, or just the coding (non-template) strand, or the template strand. This can be defined with the strands parameter in the config file.
- For CRISPRi (inhibition) experiments, the literature recommends targeting the coding strand for the CDS or both strands for the promoter (Larson et al., Nat Prot, 2013)
- this pipeline will automatically filter guides for the chosen strand
- for example, if only guides for the coding (non-template) strand are desired, genes on the “+” strand will be targeted with reverse-complement guides (“-”), and genes on the “-” strand with “+” guides.
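The strand logic from the example above can be captured in a small helper. This is a sketch, not the workflow's code: the function name and the option strings "coding", "template", and "both" are assumed labels for the three choices, and the "template" case is inferred by symmetry from the coding-strand example.

```python
def guide_strands(gene_strand, strands="coding"):
    """Return the set of guide strands ('+' and/or '-') to keep for a gene."""
    if gene_strand not in ("+", "-"):
        raise ValueError("gene_strand must be '+' or '-'")
    opposite = "-" if gene_strand == "+" else "+"
    if strands == "both":
        return {"+", "-"}
    if strands == "coding":
        # Guides for the coding (non-template) strand are reverse-complement
        # to it, so they lie on the strand opposite to the gene.
        return {opposite}
    if strands == "template":
        # Assumed by symmetry: guides on the same strand as the gene.
        return {gene_strand}
    raise ValueError("strands must be 'coding', 'template' or 'both'")


print(guide_strands("+", "coding"))  # {'-'}
print(guide_strands("-", "coding"))  # {'+'}
```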
Linting and formatting
Linting results
All tests passed!
Formatting results
All tests passed!