magwenelab/WeavePop
Workflow to Explore and Analyze Variants of Eukaryotic Populations
Overview
Topics: copy-number-variant database eukaryotic-genomes genomic-variants population-genomics shiny-app variant-effect-prediction
Latest release: None, Last update: 2025-06-26
Linting: failed, Formatting: failed
Deployment
Step 1: Install Snakemake and Snakedeploy
Snakemake and Snakedeploy are best installed via the Mamba package manager (a drop-in replacement for Conda). If you have neither Conda nor Mamba, it is recommended to install Miniforge. More details about Mamba can be found in its documentation.
When using Mamba, run
mamba create -c conda-forge -c bioconda --name snakemake snakemake snakedeploy
to install both Snakemake and Snakedeploy in an isolated environment. For all following commands ensure that this environment is activated via
conda activate snakemake
Step 2: Deploy workflow
With Snakemake and Snakedeploy installed, the workflow can be deployed as follows. First, create an appropriate project working directory on your system and enter it:
mkdir -p path/to/project-workdir
cd path/to/project-workdir
In all following steps, we will assume that you are inside of that directory. Then run
snakedeploy deploy-workflow https://github.com/magwenelab/WeavePop . --tag None
Snakedeploy will create two folders, `workflow` and `config`. The former contains the deployment of the chosen workflow as a Snakemake module; the latter contains configuration files, which will be modified in the next step to configure the workflow to your needs.
Step 3: Configure workflow
To configure the workflow, adapt `config/config.yaml` to your needs following the instructions below.
Step 4: Run workflow
The deployment method is controlled using the `--software-deployment-method` (short `--sdm`) argument. To run the workflow with automatic deployment of all required software via `conda`/`mamba`, use
snakemake --cores all --sdm conda
Snakemake will automatically detect the main `Snakefile` in the `workflow` subfolder and execute the workflow module that was defined by the deployment in step 2.
For further options such as cluster and cloud execution, see the docs.
Step 5: Generate report
After finalizing your data analysis, you can automatically generate an interactive visual HTML report for inspection of results together with parameters and code inside of the browser using
snakemake --report report.zip
Configuration
The following section is imported from the workflow's `config/README.md`.
Configuration of the analysis workflow of WeavePop
To configure the workflow you need to provide input files, edit the configuration file `config/config.yaml`, and edit the execution profile `config/default/config.yaml` (for local execution) or `config/slurm/config.yaml` (for SLURM execution).
In the descriptions below, a mention of "specified in `field:`" refers to a field in `config/config.yaml`. When a field is written as `field1: field2:`, the second field is nested inside the first.
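For example, the nested notation `annotate_references: activate:` corresponds to the following YAML structure in `config/config.yaml` (the values shown are only illustrative):

```yaml
annotate_references:
  activate: True            # written as "annotate_references: activate:" in this README
  fasta: "main_ref.fasta"   # hypothetical file name
  gff: "main_ref.gff"       # hypothetical file name
```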
Input files and their configuration
FASTQ files:
Paired-end short-read FASTQ files, one forward and one reverse file for each sample. The names of these files should be the names used in the metadata `sample` column, followed by an extension specified in `fastq_suffix1:` and `fastq_suffix2:`. The files for all samples should be in one directory, specified in `fastqs_directory:`. They can be gzip compressed.
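As an illustration, with the hypothetical settings below, a sample named `S01` in the metadata would be expected at `data/fastq/S01_R1.fastq.gz` and `data/fastq/S01_R2.fastq.gz`:

```yaml
fastqs_directory: "data/fastq"   # hypothetical directory holding all FASTQ files
fastq_suffix1: "_R1.fastq.gz"    # suffix of the forward-read files
fastq_suffix2: "_R2.fastq.gz"    # suffix of the reverse-read files
```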
Reference genomes:
The names of the files must match the `lineage` column of the metadata (e.g. `VNI.fasta` and `VNI.gff`). They should be in one directory, specified in `references_directory:`.
Providing annotations: Specify this with `annotate_references: activate: False` and provide the FASTA and GFF files for each reference genome.
Annotating with a main reference: Specify this with `annotate_references: activate: True`. If you want a common naming scheme for the genes, or don't have an annotation (GFF file) for all your reference genomes, you can provide one annotated main reference to annotate the rest using Liftoff. For this, provide the FASTA and GFF files for the main reference, specify the file names in `annotate_references: fasta:` and `annotate_references: gff:`, and provide only the FASTA file for each of the other reference genomes. If you activate the annotation of the reference genomes, all of them will be annotated.
Metadata:
A comma-separated table with one sample per row. Example. Path specified in `metadata:`.
Mandatory columns with these exact names:
- `sample`: sample ID used in the FASTQ file names (no special characters or spaces).
- `lineage`: lineage or group name that associates the sample with a reference genome (no special characters or spaces).
- `strain`: strain name (a "common name" for each sample; it can be the same as `sample` if you don't have a different one).

If plotting will be activated, you need one metadata column to color the samples by. Specify the name of this column in `plotting: metadata2color:`. More columns with free format are allowed.
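A minimal metadata file consistent with the columns above could look like this (all names are hypothetical; `location` is an example of a free-format extra column that could be used for `plotting: metadata2color:`):

```
sample,lineage,strain,location
S01,VNI,strain_A,site_1
S02,VNI,strain_B,site_1
S03,VNII,strain_C,site_2
```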
Chromosomes:
A comma-separated table with one row per chromosome per lineage. Example. Path specified in `chromosomes:`.
Mandatory columns with these exact names:
- `lineage`: Lineage name (the same as in the metadata table and the names of the reference files).
- `accession`: Sequence ID of the chromosome in the FASTA and GFF of the reference of each lineage. Make sure each chromosome ID is not repeated in this file.
- `chromosome`: Common name of the chromosome, e.g. chr01, 1, VNI_chr01.
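For example, a chromosomes table covering two lineages might look like this (the sequence IDs are placeholders; use the actual IDs from your reference FASTA and GFF files):

```
lineage,accession,chromosome
VNI,VNI_seq_1,chr01
VNI,VNI_seq_2,chr02
VNII,VNII_seq_1,chr01
```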
Exclude samples (optional):
If you want to exclude some of the samples in your metadata file from all analyses, you can provide a file with a list of sample names to exclude, without a column name. Specify its path in `samples_to_exclude:`.
Repeats database (optional):
Database of repetitive sequences in FASTA format to use for RepeatMasker. Needed for the CNV, plotting, and database modules. If you don't need accurate identification of repeats, specify `use_fake_database: True` and don't provide this file. We recommend the RepBase database: download it, extract the files, and concatenate them all into one FASTA file, e.g. `config/RepBase_<version>.fasta`. Specify its path in `repeats_database:`.
Genetic features for plots (optional):
If you want genetic features to be plotted in the depth and MAPQ plots, provide a comma-separated table with one row per gene. Example. Specify its path in `plotting: loci:`.
Mandatory columns with these exact names:
- `gene_id`: the gene IDs (IDs from the reference genome's GFF).
- `feature`: name of the feature (locus, pathway, centromere, individual gene name, etc.) the gene belongs to. Max 8 features.
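A small loci table could look like this (gene IDs and feature names are hypothetical):

```
gene_id,feature
GENE_0001,centromere
GENE_0102,mating_locus
GENE_0103,mating_locus
```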
Output
A project directory relative to `WeavePop/` can be specified in `project_directory:`; otherwise the output will be in `WeavePop/results/`. A run ID can be specified in `run_id:` to append it as a suffix to the results and logs directories.
Module activation and parameters
By default only the required modules are run. You can activate the optional modules by setting `module-name: activate: True` in the corresponding section of the configuration file. Activating the database and plotting modules is enough to run everything. Some of the modules also have additional parameters to configure; the instructions are in the configuration file itself.
Using a container for RepeatMasker and RepeatModeler: If these programs don't work properly in the Conda environment (which is likely to happen when running in SLURM), they can be run in an Apptainer container. For this you need to set `cnv: repeats: use_container: True` in `config/config.yaml`, and in the execution profile (`config/default/config.yaml` or `config/slurm/config.yaml`) set `use-apptainer: True` and `apptainer-args: "--bind $(pwd):$(pwd)"`.
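Combined, the container settings described above would look like the sketch below; only the fields shown are taken from these instructions, and the surrounding structure is an assumption:

```yaml
# In config/config.yaml:
cnv:
  repeats:
    use_container: True

# In the execution profile (config/default/config.yaml or config/slurm/config.yaml):
use-apptainer: True
apptainer-args: "--bind $(pwd):$(pwd)"
```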
Configuration of the join datasets workflow
By default the analysis workflow is activated (`workflow: "analysis"`). If you have already used it for more than one dataset and want to combine the results, you can use the join datasets workflow (`workflow: "join_datasets"`). For this you need to provide the paths to the results directories of the datasets you want to join in `dataset_paths:` and names for them in `dataset_names:`. By default the output will be in `WeavePop/results/`; if you are using the same project directory as in the analysis workflow, you should specify a different `run_id:`.
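A configuration for joining two datasets might look like this sketch. The field names come from the instructions above, but the list syntax and the example paths are assumptions, so check the comments in `config/config.yaml` for the exact expected format:

```yaml
workflow: "join_datasets"
dataset_paths:              # results directories of previously analyzed datasets (hypothetical paths)
  - "projectA/results/"
  - "projectB/results/"
dataset_names:              # a name for each dataset, in the same order
  - "datasetA"
  - "datasetB"
run_id: "joined"            # distinct run ID to avoid overwriting earlier results
```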
Execution profiles
The execution profiles are YAML files with the command-line options for the execution of the workflow. They are located in `config/default/config.yaml` for local execution and `config/slurm/config.yaml` for SLURM execution. You can edit them and run the workflow with `snakemake --profile config/default`, or put the options on the command line.
Linting and formatting
Linting results
No validator found for JSON Schema version identifier 'http://json-schema.org/draft-04/schema#'
Defaulting to validator for JSON Schema version 'https://json-schema.org/draft/2020-12/schema'
Note that schema file may not be validated correctly.
Formatting results
[DEBUG] In file "/tmp/tmpik8gb79v/workflow/Snakefile": Formatted content is different from original
[INFO] 1 file(s) would be changed 😬
[INFO] 17 file(s) would be left unchanged 🎉

snakefmt version: 0.11.0