SarahSaadain/Single_Copy_Gene_Selector

This repository contains a Snakemake pipeline for selecting the best single-copy genes (SCGs) for downstream ancient DNA (aDNA) analysis.

Overview

Latest release: v1.2.0, Last update: 2026-05-15

Share link: https://snakemake.github.io/snakemake-workflow-catalog?wf=SarahSaadain/Single_Copy_Gene_Selector

Quality control: linting: failed, formatting: failed

Wrappers: bio/busco, bio/bwa-mem2/index, bio/bwa-mem2/mem, bio/minimap2/aligner, bio/minimap2/index, bio/samtools/index, bio/samtools/view

Deployment

Step 1: Install Snakemake and Snakedeploy

Snakemake and Snakedeploy are best installed via the Conda package manager. It is recommended to install Conda via Miniforge. Run

conda create -c conda-forge -c bioconda -c nodefaults --name snakemake snakemake snakedeploy

to install both Snakemake and Snakedeploy in an isolated environment. For all following commands ensure that this environment is activated via

conda activate snakemake

For other installation methods, refer to the Snakemake and Snakedeploy documentation.

Step 2: Deploy workflow

With Snakemake and Snakedeploy installed, the workflow can be deployed as follows. First, create an appropriate project working directory on your system and enter it:

mkdir -p path/to/project-workdir
cd path/to/project-workdir

In all following steps, we will assume that you are inside of that directory. Then run

snakedeploy deploy-workflow https://github.com/SarahSaadain/Single_Copy_Gene_Selector . --tag v1.2.0

Snakedeploy will create two folders, workflow and config. The former contains the deployment of the chosen workflow as a Snakemake module; the latter contains the configuration files, which you modify to adapt the workflow to your needs.

Step 4: Run workflow

The deployment method is controlled using the --software-deployment-method (short --sdm) argument.

To run the workflow with automatic deployment of all required software via conda/mamba, use

snakemake --cores all --sdm conda

Snakemake will automatically detect the main Snakefile in the workflow subfolder and execute the workflow module that has been defined by the deployment in step 2.

For further options such as cluster and cloud execution, see the docs.

Step 5: Generate report

After finalizing your data analysis, you can automatically generate an interactive visual HTML report for inspection of results together with parameters and code inside of the browser using

snakemake --report report.zip

Configuration

The following section is imported from the workflow’s config/README.md.

Pipeline Setup Guide

Requirements

The only manual prerequisite is Snakemake ≥ 7.0. All other dependencies (BUSCO, BWA, SAMtools, pysam, numpy, etc.) are automatically installed by the pipeline when it runs.

To install Snakemake:

conda install -c bioconda -c conda-forge snakemake

Input Data

The pipeline requires two inputs per species:

Reference genome — a FASTA file of the assembly against which BUSCO will be run to identify candidate SCGs. This should be a high-quality modern genome assembly.

Reads — one or more FASTQ files (gzipped) containing the reads to be mapped to the SCG library. These can be a mix of modern and ancient samples; providing both is strongly recommended (see the main README for why).

Configuration

All pipeline parameters are controlled through config.yaml. A minimal working example is shown below:

# config.yaml - Configuration file for SCG Selector Workflow
species:
  demo:
    name: demo
    lineage: drosophilidae_odb12
    settings:
      num_top_scgs: 100
      min_length_scg: 1500
    reads:
      - /demo_data/demo.fastq.gz
    reference: /demo_data/Dfun_reference_genome.fasta

Configuration fields

Field	Description
`species.<id>.name`	A short identifier for the species or run. Used to name output files.
`species.<id>.lineage`	The BUSCO lineage dataset to use (e.g. `drosophilidae_odb12`). Must match a dataset available in your BUSCO installation.
`species.<id>.settings.num_top_scgs`	Number of top-ranked SCGs to retain after scoring. Default: `100`.
`species.<id>.settings.min_length_scg`	Minimum SCG sequence length in base pairs. Shorter candidates are discarded. Default: `1500`.
`species.<id>.reads`	List of paths to input FASTQ files (gzipped). At least one file is required. Add one list entry per sample.
`species.<id>.reference`	Path to the reference genome FASTA file.
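
The field requirements above can be checked before launching a run. The following is a minimal sketch, not part of the pipeline: the function names `check_species` and `resolve_settings` are hypothetical, and the config is assumed to be already loaded into a plain dictionary shaped like one species block of `config.yaml`.

```python
# Hypothetical pre-flight check; field names follow the table above,
# but the helper functions themselves are illustrative assumptions.
REQUIRED_FIELDS = ("name", "lineage", "reads", "reference")
DEFAULT_SETTINGS = {"num_top_scgs": 100, "min_length_scg": 1500}


def check_species(species_id, entry):
    """Return a list of human-readable problems for one species block."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not entry.get(field):
            problems.append(f"{species_id}: missing or empty '{field}'")
    reads = entry.get("reads") or []
    if not isinstance(reads, list):
        problems.append(f"{species_id}: 'reads' must be a list of paths")
    else:
        for path in reads:
            # The documentation asks for gzipped FASTQ input.
            if not str(path).endswith(".gz"):
                problems.append(f"{species_id}: '{path}' does not look gzipped")
    return problems


def resolve_settings(entry):
    """Merge user-provided settings over the documented defaults."""
    return {**DEFAULT_SETTINGS, **(entry.get("settings") or {})}
```

Calling `check_species("demo", config["species"]["demo"])` on the minimal example should return an empty list. Whether the pipeline itself falls back to the defaults when `settings` is omitted is not stated in the source, so `resolve_settings` only mirrors the documented default values.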

Adding multiple species or samples

Multiple species blocks can coexist in the same config file, and multiple read files can be listed under a single species. Each read file is treated as an independent sample and mapped separately to the SCG library:

species:
  drosophila:
    name: drosophila
    lineage: drosophilidae_odb12
    settings:
      num_top_scgs: 100
      min_length_scg: 1500
    reads:
      - /data/modern_sample.fastq.gz
      - /data/ancient_sample_1.fastq.gz
      - /data/ancient_sample_2.fastq.gz
    reference: /data/reference.fasta

Running the Pipeline

Once your config.yaml is set up, run the pipeline from the repository root with:

snakemake --cores <N> --use-conda

Replace <N> with the number of CPU cores to use. For a dry run (to check that the workflow is correctly configured without executing anything):

snakemake --configfile config.yaml --cores 1 --use-conda --dry-run
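
The commands in this guide use `--use-conda`, while the deployment steps earlier use `--sdm conda`; the latter is the spelling introduced with Snakemake 8. As a rough sketch for scripted runs, the hypothetical helper below assembles the argument list for either case. The function name and the `major_version` parameter are assumptions for illustration; only the flags themselves come from Snakemake.

```python
def snakemake_command(cores, major_version=8, dry_run=False, configfile=None):
    """Build a snakemake argument list suitable for subprocess.run()."""
    cmd = ["snakemake", "--cores", str(cores)]
    # Snakemake 8+ prefers --sdm conda; older releases use --use-conda.
    cmd += ["--sdm", "conda"] if major_version >= 8 else ["--use-conda"]
    if configfile:
        cmd += ["--configfile", configfile]
    if dry_run:
        cmd.append("--dry-run")
    return cmd
```

For example, `snakemake_command(1, major_version=7, dry_run=True, configfile="config.yaml")` yields the arguments of the dry-run command shown above, in a slightly different order.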

Output Files

The pipeline produces the following outputs per species:

File	Description
`results/<name>/scg_stats/<sample>.json`	Per-contig coverage statistics for each input BAM (depth, breadth, etc.).
`results/<name>/<species>_best_scgs.tsv`	Tab-separated ranked list of SCGs with scores for breadth, depth variation, and depth consistency.
`results/<name>/<species>_scg_summary.json`	Full JSON summary including per-sample raw stats, aggregated metrics, and scoring details for every SCG.
`results/<name>/<species>_relevant_scg.fasta`	FASTA file containing the sequences of the top-ranked SCGs.

The TSV and JSON outputs are both sorted by final score in descending order, so the top entries are the highest-quality SCGs recommended for downstream use.
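
To check that a run has finished, it can help to list the files expected for a species. The sketch below rests on two assumptions that the source does not confirm: that both the `<name>` and `<species>` placeholders resolve to the `name` field of the config, and that `<sample>` is the read file name without its `.fastq.gz` suffix. The helper names are hypothetical.

```python
from pathlib import PurePosixPath


def sample_name(read_path):
    """Assumed convention: file name without the .fastq.gz / .fq.gz suffix."""
    name = PurePosixPath(read_path).name
    for suffix in (".fastq.gz", ".fq.gz"):
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name


def expected_outputs(name, reads):
    """List the per-sample and per-species outputs from the table above."""
    base = PurePosixPath("results") / name
    per_sample = [str(base / "scg_stats" / f"{sample_name(r)}.json") for r in reads]
    per_species = [
        str(base / f"{name}_best_scgs.tsv"),
        str(base / f"{name}_scg_summary.json"),
        str(base / f"{name}_relevant_scg.fasta"),
    ]
    return per_sample + per_species
```

Comparing this list against the contents of `results/` gives a quick completeness check; adjust the naming rules if the actual output names differ.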

Linting and formatting

Linting results

SCG Selector Pipeline 1.1.0 run:
	Date:               2026-05-18 06:37:57
	Process ID:         2700
	Platform:           Linux-6.17.0-1013-azure-x86_64-with-glibc2.39; #13~24.04.1-Ubuntu SMP Wed Apr 15 16:52:17 UTC 2026
	Host:               runnervmrw5os
	User:               runner
	Conda:              26.3.2
	Python:             3.13.12
	Snakemake:          9.17.2
	Conda env:          snakemake-workflow-catalog (/home/runner/work/snakemake-workflow-catalog/snakemake-workflow-catalog/.pixi/envs/default)
	Command:            /home/runner/work/snakemake-workflow-catalog/snakemake-workflow-catalog/.pixi/envs/default/bin/snakemake --lint
	Base directory:     /tmp/tmpr0s8fq_d/SarahSaadain-Single_Copy_Gene_Selector-c8e16cb/workflow
	Working directory:  /tmp/tmpr0s8fq_d/SarahSaadain-Single_Copy_Gene_Selector-c8e16cb
	Config file(s):     /tmp/tmpr0s8fq_d/SarahSaadain-Single_Copy_Gene_Selector-c8e16cb/config/config.yaml
[2026-05-18 06:37:57 (UTC)] [INFO] Detected species:
- demo [demo]
Lints for snakefile /tmp/tmpr0s8fq_d/SarahSaadain-Single_Copy_Gene_Selector-c8e16cb/workflow/rules/initialize_pipeline.smk:
    * Environment variable CONDA_DEFAULT_ENV"] + " (" + os.environ["CONDA_PREFIX used but not asserted with envvars directive in line 126.:
      Asserting existence of environment variables with the envvars directive
      ensures proper error messages if the user fails to invoke a workflow with
      all required environment variables defined. Further, it allows snakemake
      to pass them on in case of distributed execution.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/configuration.html#environment-variables
    * Path composition with '+' in line 32:
      This becomes quickly unreadable. Usually, it is better to endure some
      redundancy against having a more readable workflow. Hence, just repeat
      common prefixes. If path composition is unavoidable, use pathlib or
      (python >= 3.6) string formatting with f"...".
    * Path composition with '+' in line 126:
      This becomes quickly unreadable. Usually, it is better to endure some
      redundancy against having a more readable workflow. Hence, just repeat
      common prefixes. If path composition is unavoidable, use pathlib or
      (python >= 3.6) string formatting with f"...".

Lints for snakefile /tmp/tmpr0s8fq_d/SarahSaadain-Single_Copy_Gene_Selector-c8e16cb/workflow/rules/map_reads_to_library.smk:
    * Mixed rules and functions in same snakefile.:
      Small one-liner functions used only once should be defined as lambda
      expressions. Other functions should be collected in a common module, e.g.
      'rules/common.smk'. This makes the workflow steps more readable.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/modularization.html#includes

Lints for rule prepare_reference (line 5, /tmp/tmpr0s8fq_d/SarahSaadain-Single_Copy_Gene_Selector-c8e16cb/workflow/rules/identify_scg_with_busco.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files
    * Specify a conda environment or container for each rule.:
      This way, the used software for each specific step is documented, and the
      workflow can be executed on any machine without prerequisites.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#integrated-package-management
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#running-jobs-in-containers

Lints for rule prepare_scg_library (line 43, /tmp/tmpr0s8fq_d/SarahSaadain-Single_Copy_Gene_Selector-c8e16cb/workflow/rules/identify_scg_with_busco.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule trigger_map_reads_to_scg_library (line 36, /tmp/tmpr0s8fq_d/SarahSaadain-Single_Copy_Gene_Selector-c8e16cb/workflow/rules/map_reads_to_library.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files
    * Specify a conda environment or container for each rule.:
      This way, the used software for each specific step is documented, and the
      workflow can be executed on any machine without prerequisites.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#integrated-package-management
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#running-jobs-in-containers

Lints for rule remove_unmapped_reads_from_bam_reads_to_library (line 111, /tmp/tmpr0s8fq_d/SarahSaadain-Single_Copy_Gene_Selector-c8e16cb/workflow/rules/map_reads_to_library.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule index_bam_reads_to_library (line 125, /tmp/tmpr0s8fq_d/SarahSaadain-Single_Copy_Gene_Selector-c8e16cb/workflow/rules/map_reads_to_library.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule filter_top_scgs_tes (line 28, /tmp/tmpr0s8fq_d/SarahSaadain-Single_Copy_Gene_Selector-c8e16cb/workflow/rules/rank_scg.smk):
    * No log directive defined:

... (truncated)
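
Two of the lints reported for initialize_pipeline.smk concern string concatenation: paths composed with '+', and environment variables read via `os.environ[...]` and joined with '+'. The sketch below shows one possible way to write such code with f-strings and pathlib, as the linter suggests. It is not the workflow's actual code: the function names are hypothetical, and the fallback to "unknown" is an alternative to (not a replacement for) the `envvars` directive the linter recommends.

```python
import os
from pathlib import Path


def describe_conda_env(environ=os.environ):
    """Format 'name (prefix)' without failing when the variables are unset."""
    name = environ.get("CONDA_DEFAULT_ENV", "unknown")
    prefix = environ.get("CONDA_PREFIX", "unknown")
    # An f-string replaces the '+' concatenation flagged by the linter.
    return f"{name} ({prefix})"


def species_result_dir(results_root, species_name):
    """Compose a path with pathlib instead of string concatenation."""
    return Path(results_root) / species_name
```

Using `.get()` with a default keeps the startup banner from raising a `KeyError` when the workflow runs outside a Conda environment; declaring the variables with `envvars:` would instead make Snakemake fail early with a clear message.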

Formatting results

[DEBUG] 
[DEBUG] In file "/tmp/tmpr0s8fq_d/SarahSaadain-Single_Copy_Gene_Selector-c8e16cb/workflow/Snakefile":  Formatted content is different from original
[DEBUG] 
[DEBUG] In file "/tmp/tmpr0s8fq_d/SarahSaadain-Single_Copy_Gene_Selector-c8e16cb/workflow/rules/rank_scg.smk":  Formatted content is different from original
[DEBUG] 
[DEBUG] In file "/tmp/tmpr0s8fq_d/SarahSaadain-Single_Copy_Gene_Selector-c8e16cb/workflow/rules/identify_scg_with_busco.smk":  Formatted content is different from original
[DEBUG] 
[DEBUG] In file "/tmp/tmpr0s8fq_d/SarahSaadain-Single_Copy_Gene_Selector-c8e16cb/workflow/rules/initialize_pipeline.smk":  Formatted content is different from original
[DEBUG] 
[DEBUG] In file "/tmp/tmpr0s8fq_d/SarahSaadain-Single_Copy_Gene_Selector-c8e16cb/workflow/rules/map_reads_to_library.smk":  Formatted content is different from original
[INFO] 5 file(s) would be changed 😬

snakefmt version: 0.11.5