magwenelab/WeavePop

Workflow to Explore and Analyze Variants of Eukaryotic Populations

Overview

Topics: copy-number-variant database eukaryotic-genomes genomic-variants population-genomics shiny-app variant-effect-prediction

Latest release: None, Last update: 2025-06-26

Linting: failed, Formatting: failed

Deployment

Step 1: Install Snakemake and Snakedeploy

Snakemake and Snakedeploy are best installed via the Mamba package manager (a drop-in replacement for Conda). If you have neither Conda nor Mamba, it is recommended to install Miniforge. More details can be found in the Mamba documentation.

When using Mamba, run

mamba create -c conda-forge -c bioconda --name snakemake snakemake snakedeploy

to install both Snakemake and Snakedeploy in an isolated environment. For all following commands ensure that this environment is activated via

conda activate snakemake

Step 2: Deploy workflow

With Snakemake and Snakedeploy installed, the workflow can be deployed as follows. First, create an appropriate project working directory on your system and enter it:

mkdir -p path/to/project-workdir
cd path/to/project-workdir

In all following steps, we will assume that you are inside of that directory. Then run

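snakedeploy deploy-workflow https://github.com/magwenelab/WeavePop . --tag None

No release is listed for this workflow, so None is a placeholder rather than a real tag. Replace it with an existing tag, or use --branch with the name of a branch of the repository instead of --tag.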

Snakedeploy will create two folders, workflow and config. The former contains the deployment of the chosen workflow as a Snakemake module, the latter contains configuration files which will be modified in the next step in order to configure the workflow to your needs.

Step 3: Configure workflow

To configure the workflow, adapt config/config.yaml to your needs following the instructions below.

Step 4: Run workflow

The deployment method is controlled using the --software-deployment-method (short --sdm) argument.

To run the workflow with automatic deployment of all required software via conda/mamba, use

snakemake --cores all --sdm conda

Snakemake will automatically detect the main Snakefile in the workflow subfolder and execute the workflow module that has been defined by the deployment in step 2.

For further options such as cluster and cloud execution, see the Snakemake documentation.

Step 5: Generate report

After finalizing your data analysis, you can automatically generate an interactive visual HTML report for inspection of results together with parameters and code inside of the browser using

snakemake --report report.zip

Configuration

The following section is imported from the workflow’s config/README.md.

Configuration of the analysis workflow of WeavePop

To configure the workflow you need to provide the input files and edit both the configuration file config/config.yaml and the execution profile: config/default/config.yaml for local execution or config/slurm/config.yaml for SLURM execution.

In the descriptions below, “specified in field:” refers to a field in config/config.yaml. When two fields are written together, as in field1: field2:, the second field is nested inside the first.

Input files and their configuration

FASTQ files:

Paired-end short-read FASTQ files, one forward and one reverse file for each sample. The names of these files should be the names used in the metadata sample column, followed by an extension specified in fastq_suffix1: and fastq_suffix2:. The files for all samples should be in one directory specified in fastqs_directory:. They can be gzip compressed.
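The naming rule above can be checked before starting a run. The following is a minimal sketch, not part of WeavePop: the function names, the sample IDs, the directory, and the suffix values are hypothetical and should be replaced by the values from your metadata and config/config.yaml.

```python
from pathlib import Path


def expected_fastqs(samples, fastqs_directory, suffix1, suffix2):
    """Map each sample to its (forward, reverse) FASTQ paths."""
    directory = Path(fastqs_directory)
    return {
        sample: (directory / f"{sample}{suffix1}", directory / f"{sample}{suffix2}")
        for sample in samples
    }


def missing_fastqs(samples, fastqs_directory, suffix1, suffix2):
    """Return the expected FASTQ paths that are not present on disk."""
    pairs = expected_fastqs(samples, fastqs_directory, suffix1, suffix2)
    return [str(path) for pair in pairs.values() for path in pair if not path.exists()]


if __name__ == "__main__":
    # Hypothetical values; use the ones from your metadata and config/config.yaml.
    for path in missing_fastqs(["sampleA", "sampleB"], "data/fastqs", "_1.fq.gz", "_2.fq.gz"):
        print("missing:", path)
```

An empty result means every sample has both a forward and a reverse file with the configured suffixes.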

Reference genomes:

The names of the files must be the ones in the lineage column of the metadata (e.g. VNI.fasta and VNI.gff). They should be in one directory specified in references_directory:.

  • Providing annotations: Specify it with annotate_references: activate: False. Provide the FASTA and GFF files for each reference genome.

  • Annotating with main reference: Specify it with annotate_references: activate: True. If you want a common naming scheme for the genes, or you don’t have an annotation (GFF file) for all of your reference genomes, you can provide one annotated main reference and use it to annotate the rest with Liftoff. For this, provide the FASTA and GFF files of the main reference, specify their file names in annotate_references: fasta: and annotate_references: gff:, and provide only the FASTA file for each of the other reference genomes. If you activate the annotation of the reference genomes, all of them will be annotated.
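The two modes differ only in which files must be present. The sketch below lists the files each mode expects; it is an illustration under assumptions, not WeavePop code. It assumes the .fasta and .gff extensions from the VNI example above and that the main reference files sit in the same references directory; the function name and the main reference file names are hypothetical.

```python
from pathlib import Path


def expected_reference_files(lineages, references_directory, annotate,
                             main_fasta=None, main_gff=None):
    """List the reference files expected for the chosen annotation mode."""
    directory = Path(references_directory)
    files = set()
    for lineage in lineages:
        # A FASTA file is needed for every lineage in both modes.
        files.add(directory / f"{lineage}.fasta")
        if not annotate:
            # Without Liftoff annotation, each lineage brings its own GFF.
            files.add(directory / f"{lineage}.gff")
    if annotate:
        # Assumption: the main reference lives in the same directory.
        files.add(directory / main_fasta)
        files.add(directory / main_gff)
    return sorted(f.as_posix() for f in files)


if __name__ == "__main__":
    print(expected_reference_files(["VNI"], "refs", annotate=False))
    print(expected_reference_files(["VNI"], "refs", annotate=True,
                                   main_fasta="main.fasta", main_gff="main.gff"))
```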

Metadata:

A comma-separated table with one sample per row. Specify its path in metadata:.
Mandatory columns with these exact names:

  • sample: sample ID used in the FASTQ file names (no special characters or spaces).

  • lineage: lineage or group name that associates the sample with a reference genome (no special characters or spaces).

  • strain: strain name (a “common name” for each sample, it can be the same as sample if you don’t have a different one).

  • If plotting is activated, you also need one metadata column that is used to color the samples. Specify the name of this column in plotting: metadata2color:. Additional free-format columns are allowed.
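The column requirements above lend themselves to a quick pre-flight check. This is a minimal sketch using only the standard library; it is not the validation WeavePop itself performs. The pattern used for “no special characters”, the duplicate-sample check, and the example values other than VNI are assumptions.

```python
import csv
import io
import re

REQUIRED = ("sample", "lineage", "strain")
# Assumption: "no special characters or spaces" is read as letters, digits, "_", "-" and ".".
SAFE_NAME = re.compile(r"^[A-Za-z0-9_.-]+$")


def check_metadata(handle, color_column=None):
    """Return a list of problems found in a metadata CSV opened in text mode."""
    reader = csv.DictReader(handle)
    columns = reader.fieldnames or []
    needed = list(REQUIRED) + ([color_column] if color_column else [])
    problems = [f"missing column: {name}" for name in needed if name not in columns]
    if problems:
        return problems
    seen = set()
    for line_no, row in enumerate(reader, start=2):
        for column in ("sample", "lineage"):
            if not SAFE_NAME.match(row[column] or ""):
                problems.append(f"line {line_no}: invalid {column} {row[column]!r}")
        # Assumption: one sample per row implies sample IDs are unique.
        if row["sample"] in seen:
            problems.append(f"line {line_no}: duplicate sample {row['sample']!r}")
        seen.add(row["sample"])
    return problems


if __name__ == "__main__":
    example = io.StringIO("sample,lineage,strain\nsampleA,VNI,strainA\n")
    print(check_metadata(example))
```

Pass the column named in plotting: metadata2color: as color_column to confirm it exists when plotting is activated.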

Chromosomes:

A comma-separated table with one row per chromosome per lineage. Specify its path in chromosomes:.
Mandatory columns with these exact names:

  • lineage: Lineage name (the same as in the metadata table and the names of the reference files).

  • accession: Sequence ID of the chromosomes in the FASTA and GFF of the reference of each lineage. Make sure each chromosome ID is not repeated in this file.

  • chromosome: Common name of the chromosome, e.g. chr01, 1, VNI_chr01.
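Two of the rules above, unique accessions and lineages that match the metadata, can be tested mechanically. The following sketch assumes the table is a plain CSV with the three columns listed; the function name, the accession values, and the lineage VNII are illustrative only.

```python
import csv
import io
from collections import Counter

REQUIRED = ("lineage", "accession", "chromosome")


def check_chromosomes(handle, metadata_lineages):
    """Return problems found in a chromosomes CSV opened in text mode."""
    reader = csv.DictReader(handle)
    missing = [name for name in REQUIRED if name not in (reader.fieldnames or [])]
    if missing:
        return [f"missing column: {name}" for name in missing]
    rows = list(reader)
    # Each chromosome accession may appear only once in the whole file.
    counts = Counter(row["accession"] for row in rows)
    problems = [f"repeated accession: {acc}" for acc, n in counts.items() if n > 1]
    # Every lineage used in the metadata should have chromosomes listed.
    listed = {row["lineage"] for row in rows}
    problems += [
        f"lineage without chromosomes: {name}"
        for name in sorted(set(metadata_lineages) - listed)
    ]
    return problems


if __name__ == "__main__":
    table = io.StringIO("lineage,accession,chromosome\nVNI,acc1,chr01\nVNI,acc2,chr02\n")
    print(check_chromosomes(table, ["VNI"]))
```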

Exclude samples (optional):

If you want to exclude some of the samples in your metadata file from all analyses, you can provide a file with a list of the sample names to exclude. The file must not have a column name (header). Specify its path in samples_to_exclude:.
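To preview which samples would remain after exclusion, a sketch like the one below can help. It assumes the exclusion file holds one sample name per line, which the description above implies but does not state; the function names and sample IDs are hypothetical.

```python
import io


def read_exclusions(handle):
    """Read sample names from a header-less list, ignoring blank lines."""
    return {line.strip() for line in handle if line.strip()}


def keep_samples(samples, excluded):
    """Return the samples that are not excluded, preserving their order."""
    return [sample for sample in samples if sample not in excluded]


if __name__ == "__main__":
    exclusions = read_exclusions(io.StringIO("sampleB\n"))
    print(keep_samples(["sampleA", "sampleB", "sampleC"], exclusions))
```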

Repeats database (optional):

Database of repetitive sequences in FASTA format to use with RepeatMasker. It is needed for the CNV, plotting, and database modules. If you don’t need an accurate identification of repeats, set use_fake_database: True and don’t provide this file. We recommend the RepBase database: download it, extract the files, and concatenate them all into one FASTA file, config/RepBase_<version>.fasta. Specify its path in repeats_database:.
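The concatenation step can be done with any tool; one possible sketch in Python follows. The function name and the default file pattern are assumptions, since the names of the extracted RepBase files depend on the release you download, so adjust the pattern to match them.

```python
from pathlib import Path


def concatenate_fasta(source_dir, output_fasta, pattern="*.fasta"):
    """Concatenate all files matching pattern under source_dir into one FASTA.

    Returns the number of files written to the output.
    """
    out_path = Path(output_fasta)
    sources = sorted(
        path for path in Path(source_dir).rglob(pattern)
        if path.is_file() and path.resolve() != out_path.resolve()
    )
    written = 0
    with out_path.open("w") as out:
        for source in sources:
            text = source.read_text()
            if not text:
                continue
            # Make sure a record from one file never runs into the next header.
            out.write(text if text.endswith("\n") else text + "\n")
            written += 1
    return written
```

For example, concatenate_fasta("extracted_repbase", "config/combined_repeats.fasta") would merge every matching file under a hypothetical extracted_repbase directory.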

Genetic features for plots (optional):

If you want genetic features to be plotted in the depth and MAPQ plots, provide a comma-separated table with one row per gene. Specify its path in plotting: loci:.

Mandatory columns with these exact names:

  • gene_id: with the gene IDs (IDs of the reference genome’s GFF).

  • feature: name of the feature (locus, pathway, centromere, individual gene name, etc.) the gene belongs to. Max 8 features.
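Because at most 8 features are supported, it can be useful to count them before plotting. This is a small sketch under the assumption that the table is a plain CSV with the two columns above; the function names, gene IDs, and feature names in the example are hypothetical.

```python
import csv
import io

MAX_FEATURES = 8  # limit stated in the column description above


def features_in_loci(handle):
    """Map each feature name to the list of gene IDs assigned to it."""
    features = {}
    for row in csv.DictReader(handle):
        features.setdefault(row["feature"], []).append(row["gene_id"])
    return features


def too_many_features(features):
    """True when the table defines more features than can be plotted."""
    return len(features) > MAX_FEATURES


if __name__ == "__main__":
    table = io.StringIO("gene_id,feature\ngene1,locusA\ngene2,locusA\ngene3,centromere\n")
    found = features_in_loci(table)
    print(found, too_many_features(found))
```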

Output

A project directory relative to WeavePop/ can be specified in project_directory:. Otherwise the output will be in WeavePop/results/.
A run ID can be specified in run_id: to append it as a suffix to the results and logs directories.

Module activation and parameters

By default only the required modules are run.
You can activate the optional modules by setting module-name: activate: True in the corresponding section of the configuration file. Activating the database and plotting modules is enough to run everything.
Some of the modules also have additional parameters to configure. The instructions are in the configuration file itself.

Using a container for RepeatMasker and RepeatModeler: If these programs don’t work properly in the Conda environment (which is likely to happen when running in SLURM), they can be run in an Apptainer container. For this, set cnv: repeats: use_container: True in config/config.yaml and, in the execution profile (config/default/config.yaml or config/slurm/config.yaml), set use-apptainer: True and apptainer-args: "--bind $(pwd):$(pwd)".

Configuration of the join datasets workflow

By default the analysis workflow is activated (workflow: "analysis"). If you have already used it for more than one dataset and want to combine the results, you can use the join datasets workflow (workflow: "join_datasets"). For this you need to provide the paths to the results directories of the datasets you want to join in dataset_paths: and names for them in dataset_names:. By default the output will be in WeavePop/results/. If you are using the same project directory as in the analysis workflow, you should specify a different run_id:.

Execution profiles

The execution profiles are YAML files with the command-line options for the execution of the workflow. They are located in config/default/config.yaml for local execution and config/slurm/config.yaml for SLURM execution. You can edit them and run the workflow with snakemake --profile config/default, or pass the options directly on the command line.

Linting and formatting

Linting results

No validator found for JSON Schema version identifier 'http://json-schema.org/draft-04/schema#'
Defaulting to validator for JSON Schema version 'https://json-schema.org/draft/2020-12/schema'
Note that schema file may not be validated correctly.

Formatting results

[DEBUG] In file "/tmp/tmpik8gb79v/workflow/Snakefile":  Formatted content is different from original
[INFO] 1 file(s) would be changed 😬
[INFO] 17 file(s) would be left unchanged 🎉

snakefmt version: 0.11.0