SarahSaadain/Single_Copy_Gene_Selector

This repository contains a Snakemake pipeline that identifies the best single-copy genes (SCGs) for downstream analysis of ancient DNA (aDNA).

Overview

Latest release: v1.2.0, Last update: 2026-05-15

Share link: https://snakemake.github.io/snakemake-workflow-catalog?wf=SarahSaadain/Single_Copy_Gene_Selector

Quality control: linting failed, formatting failed

Wrappers: bio/busco bio/bwa-mem2/index bio/bwa-mem2/mem bio/minimap2/aligner bio/minimap2/index bio/samtools/index bio/samtools/view

Deployment

Step 1: Install Snakemake and Snakedeploy

Snakemake and Snakedeploy are best installed via the Conda package manager. It is recommended to install conda via Miniforge. Run

conda create -c conda-forge -c bioconda -c nodefaults --name snakemake snakemake snakedeploy

to install both Snakemake and Snakedeploy in an isolated environment. For all following commands ensure that this environment is activated via

conda activate snakemake

For other installation methods, refer to the Snakemake and Snakedeploy documentation.

Step 2: Deploy workflow

With Snakemake and Snakedeploy installed, the workflow can be deployed as follows. First, create an appropriate project working directory on your system and enter it:

mkdir -p path/to/project-workdir
cd path/to/project-workdir

In all following steps, we will assume that you are inside of that directory. Then run

snakedeploy deploy-workflow https://github.com/SarahSaadain/Single_Copy_Gene_Selector . --tag v1.2.0

Snakedeploy will create two folders, workflow and config. The former contains the deployment of the chosen workflow as a Snakemake module, the latter contains configuration files which will be modified in the next step in order to configure the workflow to your needs.

Step 3: Configure workflow

To configure the workflow, adapt config/config.yaml to your needs following the instructions below.

Step 4: Run workflow

The deployment method is controlled using the --software-deployment-method (short --sdm) argument.

To run the workflow with automatic deployment of all required software via conda/mamba, use

snakemake --cores all --sdm conda

Snakemake will automatically detect the main Snakefile in the workflow subfolder and execute the workflow module that has been defined by the deployment in step 2.

For further options such as cluster and cloud execution, see the docs.

Step 5: Generate report

After finalizing your data analysis, you can automatically generate an interactive visual HTML report for inspection of results together with parameters and code inside of the browser using

snakemake --report report.zip

Configuration

The following section is imported from the workflow’s config/README.md.

Pipeline Setup Guide

Requirements

The only manual prerequisite is Snakemake ≥ 7.0. All other dependencies (BUSCO, BWA, SAMtools, pysam, numpy, etc.) are automatically installed by the pipeline when it runs.

To install Snakemake:

conda install -c bioconda -c conda-forge snakemake

Input Data

The pipeline requires two inputs per species:

Reference genome — a FASTA file of the assembly against which BUSCO will be run to identify candidate SCGs. This should be a high-quality modern genome assembly.

Reads — one or more FASTQ files (gzipped) containing the reads to be mapped to the SCG library. These can be a mix of modern and ancient samples; providing both is strongly recommended (see the main README for why).
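Before starting a long run, it can help to check that both inputs look the way the pipeline expects. The following is a minimal sketch, not part of the workflow itself: the function name check_inputs is hypothetical, and the checks (first byte of the FASTA, gzip magic bytes of each read file) are assumptions about what a sensible pre-flight test might cover.

```python
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def check_inputs(reference, reads):
    """Return a list of problems; an empty list means the inputs look usable.

    Hypothetical helper, not taken from the pipeline's code base.
    """
    problems = []

    ref = Path(reference)
    if not ref.is_file():
        problems.append(f"reference not found: {ref}")
    else:
        with ref.open("rb") as handle:
            # An uncompressed FASTA file starts with a '>' header line.
            if handle.read(1) != b">":
                problems.append(f"reference does not start with a FASTA header: {ref}")

    if not reads:
        problems.append("no read files given")
    for read in map(Path, reads):
        if not read.is_file():
            problems.append(f"reads not found: {read}")
            continue
        with read.open("rb") as handle:
            if handle.read(2) != GZIP_MAGIC:
                problems.append(f"reads are not gzip-compressed: {read}")

    return problems
```

Calling check_inputs with the reference path and the list of read paths from the config returns an empty list when nothing looks wrong. The sketch only inspects the first bytes of each file, so it does not prove that a file is valid FASTA or FASTQ.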

Configuration

All pipeline parameters are controlled through config.yaml. A minimal working example is shown below:

# config.yaml - Configuration file for SCG Selector Workflow
species:
  demo:
    name: demo
    lineage: drosophilidae_odb12
    settings:
      num_top_scgs: 100
      min_length_scg: 1500
    reads:
      - /demo_data/demo.fastq.gz
    reference: /demo_data/Dfun_reference_genome.fasta

Configuration fields

| Field | Description |
| --- | --- |
| `species.<id>.name` | A short identifier for the species or run. Used to name output files. |
| `species.<id>.lineage` | The BUSCO lineage dataset to use (e.g. drosophilidae_odb12). Must match a dataset available in your BUSCO installation. |
| `settings.num_top_scgs` | Number of top-ranked SCGs to retain after scoring. Default: 100. |
| `settings.min_length_scg` | Minimum SCG sequence length in base pairs. Shorter candidates are discarded. Default: 1500. |
| `reads` | List of paths to input FASTQ files (gzipped). At least one file is required. Add one path per line for multiple samples. |
| `reference` | Path to the reference genome FASTA file. |
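The field list above can be read as a small schema. The sketch below shows one way such a schema could be checked on an already-parsed config dictionary. It is illustrative only: normalize_species is a hypothetical name, treating all four non-settings fields as required is an assumption, and so is the idea that the documented defaults apply when the settings block is omitted.

```python
# Assumption: these four keys must be present in every species block.
REQUIRED_SPECIES_KEYS = ("name", "lineage", "reads", "reference")

# Defaults as documented in the field list.
DEFAULT_SETTINGS = {"num_top_scgs": 100, "min_length_scg": 1500}


def normalize_species(config):
    """Validate each species block and fill in the documented defaults.

    Hypothetical helper; the pipeline may validate its config differently.
    """
    normalized = {}
    for species_id, block in config.get("species", {}).items():
        missing = [key for key in REQUIRED_SPECIES_KEYS if key not in block]
        if missing:
            raise ValueError(f"species '{species_id}' is missing: {', '.join(missing)}")
        if not block["reads"]:
            raise ValueError(f"species '{species_id}' needs at least one read file")
        # User-supplied settings override the defaults key by key.
        settings = {**DEFAULT_SETTINGS, **(block.get("settings") or {})}
        normalized[species_id] = {**block, "settings": settings}
    return normalized
```

The function expects the dictionary that a YAML parser would produce from config.yaml, so it can be tested without touching the file system.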

Adding multiple species or samples

Multiple species blocks can coexist in the same config file, and multiple read files can be listed under a single species. Each read file is treated as an independent sample and mapped separately to the SCG library:

species:
  drosophila:
    name: drosophila
    lineage: drosophilidae_odb12
    settings:
      num_top_scgs: 100
      min_length_scg: 1500
    reads:
      - /data/modern_sample.fastq.gz
      - /data/ancient_sample_1.fastq.gz
      - /data/ancient_sample_2.fastq.gz
    reference: /data/reference.fasta
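Because every read file is handled as its own sample, a config like the one above expands into one mapping job per file. The following sketch illustrates that expansion on a parsed config dictionary. The function names are hypothetical, and the rule for deriving a sample name from the file name is an assumption, since the actual naming scheme is not described here.

```python
from pathlib import Path


def sample_name(read_path):
    """Derive a sample name by dropping a FASTQ suffix (assumed naming rule)."""
    name = Path(read_path).name
    for suffix in (".fastq.gz", ".fq.gz", ".fastq", ".fq"):
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name


def mapping_jobs(config):
    """Yield one (species_id, sample, read_path) tuple per listed read file."""
    for species_id, block in config["species"].items():
        for read_path in block["reads"]:
            yield species_id, sample_name(read_path), read_path
```

For the example config above, mapping_jobs yields three tuples for the drosophila block, one per entry under reads.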

Running the Pipeline

Once your config.yaml is set up, run the pipeline from the repository root with:

snakemake --cores <N> --use-conda

Replace <N> with the number of CPU cores to use. For a dry run (to check that the workflow is correctly configured without executing anything):

snakemake --configfile config.yaml --cores 1 --use-conda --dry-run

Output Files

The pipeline produces the following outputs per species:

| File | Description |
| --- | --- |
| `results/<name>/scg_stats/<sample>.json` | Per-contig coverage statistics for each input BAM (depth, breadth, etc.). |
| `results/<name>/<species>_best_scgs.tsv` | Tab-separated ranked list of SCGs with scores for breadth, depth variation, and depth consistency. |
| `results/<name>/<species>_scg_summary.json` | Full JSON summary including per-sample raw stats, aggregated metrics, and scoring details for every SCG. |
| `results/<name>/<species>_relevant_scg.fasta` | FASTA file containing the sequences of the top-ranked SCGs. |

The TSV and JSON outputs are both sorted by final score in descending order, so the top entries are the highest-quality SCGs recommended for downstream use.
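Since the ranked TSV is already sorted, picking the best SCGs for downstream use amounts to reading the first rows. The sketch below shows this with the standard csv module. It assumes the TSV has a header row, and the column names in the example data (scg_id, final_score) are hypothetical placeholders, not the pipeline's actual headers.

```python
import csv
import io
from itertools import islice


def top_rows(tsv_text, n):
    """Return the first n data rows of a ranked TSV as dictionaries.

    Relies on the file already being sorted by final score, best first.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return list(islice(reader, n))


# Made-up example data; real column names may differ.
example = "scg_id\tfinal_score\nscg_a\t0.97\nscg_b\t0.91\nscg_c\t0.80\n"
print([row["scg_id"] for row in top_rows(example, 2)])  # ['scg_a', 'scg_b']
```

To apply this to a real run, read the contents of the best_scgs TSV into a string and pass it in place of the example data.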

Linting and formatting

Linting results
SCG Selector Pipeline 1.1.0 run:
    Date:               2026-05-18 06:37:57
    Process ID:         2700
    Platform:           Linux-6.17.0-1013-azure-x86_64-with-glibc2.39; #13~24.04.1-Ubuntu SMP Wed Apr 15 16:52:17 UTC 2026
    Host:               runnervmrw5os
    User:               runner
    Conda:              26.3.2
    Python:             3.13.12
    Snakemake:          9.17.2
    Conda env:          snakemake-workflow-catalog (/home/runner/work/snakemake-workflow-catalog/snakemake-workflow-catalog/.pixi/envs/default)
    Command:            /home/runner/work/snakemake-workflow-catalog/snakemake-workflow-catalog/.pixi/envs/default/bin/snakemake --lint
    Base directory:     /tmp/tmpr0s8fq_d/SarahSaadain-Single_Copy_Gene_Selector-c8e16cb/workflow
    Working directory:  /tmp/tmpr0s8fq_d/SarahSaadain-Single_Copy_Gene_Selector-c8e16cb
    Config file(s):     /tmp/tmpr0s8fq_d/SarahSaadain-Single_Copy_Gene_Selector-c8e16cb/config/config.yaml
[2026-05-18 06:37:57 (UTC)] [INFO] Detected species:
- demo [demo]
Lints for snakefile /tmp/tmpr0s8fq_d/SarahSaadain-Single_Copy_Gene_Selector-c8e16cb/workflow/rules/initialize_pipeline.smk:
    * Environment variable CONDA_DEFAULT_ENV"] + " (" + os.environ["CONDA_PREFIX used but not asserted with envvars directive in line 126.:
      Asserting existence of environment variables with the envvars directive
      ensures proper error messages if the user fails to invoke a workflow with
      all required environment variables defined. Further, it allows snakemake
      to pass them on in case of distributed execution.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/configuration.html#environment-variables
    * Path composition with '+' in line 32:
      This becomes quickly unreadable. Usually, it is better to endure some
      redundancy against having a more readable workflow. Hence, just repeat
      common prefixes. If path composition is unavoidable, use pathlib or
      (python >= 3.6) string formatting with f"...".
    * Path composition with '+' in line 126:
      This becomes quickly unreadable. Usually, it is better to endure some
      redundancy against having a more readable workflow. Hence, just repeat
      common prefixes. If path composition is unavoidable, use pathlib or
      (python >= 3.6) string formatting with f"...".

Lints for snakefile /tmp/tmpr0s8fq_d/SarahSaadain-Single_Copy_Gene_Selector-c8e16cb/workflow/rules/map_reads_to_library.smk:
    * Mixed rules and functions in same snakefile.:
      Small one-liner functions used only once should be defined as lambda
      expressions. Other functions should be collected in a common module, e.g.
      'rules/common.smk'. This makes the workflow steps more readable.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/modularization.html#includes

Lints for rule prepare_reference (line 5, /tmp/tmpr0s8fq_d/SarahSaadain-Single_Copy_Gene_Selector-c8e16cb/workflow/rules/identify_scg_with_busco.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files
    * Specify a conda environment or container for each rule.:
      This way, the used software for each specific step is documented, and the
      workflow can be executed on any machine without prerequisites.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#integrated-package-management
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#running-jobs-in-containers

Lints for rule prepare_scg_library (line 43, /tmp/tmpr0s8fq_d/SarahSaadain-Single_Copy_Gene_Selector-c8e16cb/workflow/rules/identify_scg_with_busco.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule trigger_map_reads_to_scg_library (line 36, /tmp/tmpr0s8fq_d/SarahSaadain-Single_Copy_Gene_Selector-c8e16cb/workflow/rules/map_reads_to_library.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files
    * Specify a conda environment or container for each rule.:
      This way, the used software for each specific step is documented, and the
      workflow can be executed on any machine without prerequisites.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#integrated-package-management
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#running-jobs-in-containers

Lints for rule remove_unmapped_reads_from_bam_reads_to_library (line 111, /tmp/tmpr0s8fq_d/SarahSaadain-Single_Copy_Gene_Selector-c8e16cb/workflow/rules/map_reads_to_library.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule index_bam_reads_to_library (line 125, /tmp/tmpr0s8fq_d/SarahSaadain-Single_Copy_Gene_Selector-c8e16cb/workflow/rules/map_reads_to_library.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule filter_top_scgs_tes (line 28, /tmp/tmpr0s8fq_d/SarahSaadain-Single_Copy_Gene_Selector-c8e16cb/workflow/rules/rank_scg.smk):
    * No log directive defined:

... (truncated)
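Two of the lints above (path composition with '+', and environment variables read without being asserted) have straightforward Python-level remedies that the linter itself points to: pathlib and f-strings. The sketch below illustrates the pattern in isolation. Both function names are hypothetical, the code is not taken from initialize_pipeline.smk, and falling back to "unknown" for a missing variable is an assumption about the desired behavior.

```python
import os
from pathlib import Path


def results_path(base, species, filename):
    """Compose a path with pathlib instead of chaining strings with '+'."""
    return Path(base) / species / filename


def describe_conda_env(environ=os.environ):
    """Describe the active conda environment without raising on missing variables.

    Uses .get() with a fallback instead of os.environ["..."], so a run outside
    conda reports "unknown" rather than failing with a KeyError.
    """
    name = environ.get("CONDA_DEFAULT_ENV", "unknown")
    prefix = environ.get("CONDA_PREFIX", "unknown")
    return f"{name} ({prefix})"
```

Inside a Snakefile, the linter additionally recommends declaring required variables with the envvars directive, which is Snakemake syntax and therefore not shown in this plain Python sketch.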
Formatting results
[DEBUG]
[DEBUG] In file "/tmp/tmpr0s8fq_d/SarahSaadain-Single_Copy_Gene_Selector-c8e16cb/workflow/Snakefile":  Formatted content is different from original
[DEBUG]
[DEBUG] In file "/tmp/tmpr0s8fq_d/SarahSaadain-Single_Copy_Gene_Selector-c8e16cb/workflow/rules/rank_scg.smk":  Formatted content is different from original
[DEBUG]
[DEBUG] In file "/tmp/tmpr0s8fq_d/SarahSaadain-Single_Copy_Gene_Selector-c8e16cb/workflow/rules/identify_scg_with_busco.smk":  Formatted content is different from original
[DEBUG]
[DEBUG] In file "/tmp/tmpr0s8fq_d/SarahSaadain-Single_Copy_Gene_Selector-c8e16cb/workflow/rules/initialize_pipeline.smk":  Formatted content is different from original
[DEBUG]
[DEBUG] In file "/tmp/tmpr0s8fq_d/SarahSaadain-Single_Copy_Gene_Selector-c8e16cb/workflow/rules/map_reads_to_library.smk":  Formatted content is different from original
[INFO] 5 file(s) would be changed 😬

snakefmt version: 0.11.5