SarahSaadain/Single_Copy_Gene_Selector

This repository contains a Snakemake pipeline for selecting the best single-copy genes (SCGs) for downstream ancient DNA (aDNA) analysis.

Overview

Latest release: None, Last update: 2026-03-07

Share link: https://snakemake.github.io/snakemake-workflow-catalog?wf=SarahSaadain/Single_Copy_Gene_Selector

Quality control: linting: failed, formatting: failed

Wrappers: bio/busco, bio/bwa-mem2/index, bio/bwa-mem2/mem, bio/samtools/index, bio/samtools/sort, bio/samtools/view

Deployment

Step 1: Install Snakemake and Snakedeploy

Snakemake and Snakedeploy are best installed via the Conda package manager. It is recommended to install conda via Miniforge. Run

conda create -c conda-forge -c bioconda -c nodefaults --name snakemake snakemake snakedeploy

to install both Snakemake and Snakedeploy in an isolated environment. For all following commands, ensure that this environment is activated via

conda activate snakemake

For other installation methods, refer to the Snakemake and Snakedeploy documentation.

Step 2: Deploy workflow

With Snakemake and Snakedeploy installed, the workflow can be deployed as follows. First, create an appropriate project working directory on your system and enter it:

mkdir -p path/to/project-workdir
cd path/to/project-workdir

In all following steps, we will assume that you are inside of that directory. Then run

snakedeploy deploy-workflow https://github.com/SarahSaadain/Single_Copy_Gene_Selector . --tag None

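Snakedeploy will create two folders, workflow and config. The former contains the deployment of the chosen workflow as a Snakemake module; the latter contains the configuration files that you will adapt in the next step. Note that this workflow has no tagged release yet (see "Latest release" above), so --tag None does not refer to an existing tag; in that case, pass --branch with the name of the repository's default branch instead of --tag.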

Step 3: Configure workflow

To configure the workflow, adapt config/config.yaml to your needs following the instructions below.

Step 4: Run workflow

The deployment method is controlled using the --software-deployment-method (short --sdm) argument.

To run the workflow with automatic deployment of all required software via conda/mamba, use

snakemake --cores all --sdm conda

Snakemake will automatically detect the main Snakefile in the workflow subfolder and execute the workflow module that has been defined by the deployment in step 2.

For further options such as cluster and cloud execution, see the Snakemake documentation.

Step 5: Generate report

After finalizing your data analysis, you can automatically generate an interactive visual HTML report for inspection of results together with parameters and code inside of the browser using

snakemake --report report.zip

Configuration

The following section is imported from the workflow’s config/README.md.

Pipeline Setup Guide

Requirements

The only manual prerequisite is Snakemake ≥ 7.0. All other dependencies (BUSCO, BWA, SAMtools, pysam, numpy, etc.) are automatically installed by the pipeline when it runs.

To install Snakemake:

conda install -c bioconda -c conda-forge snakemake

Input Data

The pipeline requires two inputs per species:

Reference genome — a FASTA file of the assembly against which BUSCO will be run to identify candidate SCGs. This should be a high-quality modern genome assembly.

Reads — one or more FASTQ files (gzipped) containing the reads to be mapped to the SCG library. These can be a mix of modern and ancient samples; providing both is strongly recommended (see the main README for why).

Configuration

All pipeline parameters are controlled through config.yaml. A minimal working example is shown below:

# config.yaml - Configuration file for aDNA pipeline
species:
  demo:
    name: demo
    lineage: drosophilidae_odb12
    settings:
      num_top_scgs: 100
      min_length_scg: 1500
    reads:
      - /demo_data/demo.fastq.gz
    reference: /demo_data/Dfun_reference_genome.fasta

Configuration fields

| Field | Description |
| --- | --- |
| species.<id>.name | A short identifier for the species or run. Used to name output files. |
| species.<id>.lineage | The BUSCO lineage dataset to use (e.g. drosophilidae_odb12). Must match a dataset available in your BUSCO installation. |
| species.<id>.settings.num_top_scgs | Number of top-ranked SCGs to retain after scoring. Default: 100. |
| species.<id>.settings.min_length_scg | Minimum SCG sequence length in base pairs. Shorter candidates are discarded. Default: 1500. |
| species.<id>.reads | List of paths to input FASTQ files (gzipped). At least one file is required. Add one path per line for multiple samples. |
| species.<id>.reference | Path to the reference genome FASTA file. |
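As a rough illustration of the constraints listed above, the following sketch checks one species block before a run. It is not part of the pipeline: the function name check_species_block is hypothetical, the config is passed in as a plain dictionary (as it would look after loading the YAML file), and the ".gz" suffix test is only a heuristic for "gzipped FASTQ".

```python
from typing import Any

# Defaults as documented in the field descriptions above.
DEFAULT_SETTINGS = {"num_top_scgs": 100, "min_length_scg": 1500}


def check_species_block(species_id: str, block: dict[str, Any]) -> list[str]:
    """Return human-readable problems found in one species block (hypothetical helper)."""
    problems = []

    # name, lineage and reference are expected to be non-empty strings.
    for key in ("name", "lineage", "reference"):
        value = block.get(key)
        if not isinstance(value, str) or not value:
            problems.append(f"{species_id}: '{key}' must be a non-empty string")

    # At least one read file is required.
    reads = block.get("reads")
    if not isinstance(reads, list) or not reads:
        problems.append(f"{species_id}: 'reads' must list at least one FASTQ file")
    else:
        for path in reads:
            if not str(path).endswith(".gz"):  # heuristic, not a format check
                problems.append(f"{species_id}: '{path}' does not look gzipped")

    # Missing settings fall back to the documented defaults (assumption).
    settings = {**DEFAULT_SETTINGS, **(block.get("settings") or {})}
    for key in DEFAULT_SETTINGS:
        value = settings[key]
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            problems.append(f"{species_id}: settings.{key} must be a positive integer")

    return problems


demo = {
    "name": "demo",
    "lineage": "drosophilidae_odb12",
    "settings": {"num_top_scgs": 100, "min_length_scg": 1500},
    "reads": ["/demo_data/demo.fastq.gz"],
    "reference": "/demo_data/Dfun_reference_genome.fasta",
}
print(check_species_block("demo", demo))  # prints []
```

Returning a list of messages rather than raising on the first problem makes it possible to report every misconfigured field in one pass.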

Adding multiple species or samples

Multiple species blocks can coexist in the same config file, and multiple read files can be listed under a single species. Each read file is treated as an independent sample and mapped separately to the SCG library:

species:
  drosophila:
    name: drosophila
    lineage: drosophilidae_odb12
    settings:
      num_top_scgs: 100
      min_length_scg: 1500
    reads:
      - /data/modern_sample.fastq.gz
      - /data/ancient_sample_1.fastq.gz
      - /data/ancient_sample_2.fastq.gz
    reference: /data/reference.fasta
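To make the "one read file, one sample" rule concrete, the sketch below expands a config dictionary into one mapping job per read file. The helper names and the rule for deriving a sample label from the file name (stripping .fastq.gz or .fq.gz) are assumptions made for illustration; the pipeline may name its samples differently.

```python
from pathlib import PurePosixPath


def sample_name(read_path: str) -> str:
    """Derive a sample label from a read file name (hypothetical naming rule)."""
    name = PurePosixPath(read_path).name
    for suffix in (".fastq.gz", ".fq.gz"):
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name


def mapping_jobs(config: dict) -> list[tuple[str, str, str]]:
    """List one (species name, sample label, read file) job per read file."""
    jobs = []
    for block in config["species"].values():
        for read_path in block["reads"]:
            jobs.append((block["name"], sample_name(read_path), read_path))
    return jobs


config = {
    "species": {
        "drosophila": {
            "name": "drosophila",
            "reads": [
                "/data/modern_sample.fastq.gz",
                "/data/ancient_sample_1.fastq.gz",
                "/data/ancient_sample_2.fastq.gz",
            ],
        }
    }
}
for job in mapping_jobs(config):
    print(job)
```

With the example config above, this yields three independent jobs for the drosophila block, one per listed read file.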

Running the Pipeline

Once your config.yaml is set up, run the pipeline from the repository root with:

snakemake --cores <N> --use-conda

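Replace <N> with the number of CPU cores to use. On Snakemake 8 and later, --sdm conda (as used in the deployment steps above) is the equivalent of --use-conda. For a dry run (to check that the workflow is correctly configured without executing anything):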

snakemake --configfile config.yaml --cores 1 --use-conda --dry-run

Output Files

The pipeline produces the following outputs per species:

| File | Description |
| --- | --- |
| results/<name>/scg_stats/<sample>.json | Per-contig coverage statistics for each input BAM (depth, breadth, etc.). |
| results/<name>/<species>_best_scgs.tsv | Tab-separated ranked list of SCGs with scores for breadth, depth variation, and depth consistency. |
| results/<name>/<species>_scg_summary.json | Full JSON summary including per-sample raw stats, aggregated metrics, and scoring details for every SCG. |
| results/<name>/<species>_relevant_scg.fasta | FASTA file containing the sequences of the top-ranked SCGs. |

The TSV and JSON outputs are both sorted by final score in descending order, so the top entries are the highest-quality SCGs recommended for downstream use.
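For readers unfamiliar with the metrics named above, the following sketch shows one common way to compute breadth and depth variation from per-base depths, and how a length filter plus a descending sort yields a top-N list. The scoring formula here is purely hypothetical and is not the pipeline's actual scoring; the function names and the input layout (a dictionary of per-base depth lists) are likewise assumptions.

```python
from statistics import mean, pstdev


def coverage_stats(depths: list[int]) -> dict[str, float]:
    """Compute breadth, mean depth and coefficient of variation for one SCG."""
    if not depths:
        raise ValueError("depths must not be empty")
    covered = sum(1 for d in depths if d > 0)
    avg = mean(depths)
    # Coefficient of variation: spread of depth relative to its mean.
    cv = pstdev(depths) / avg if avg > 0 else float("inf")
    return {"breadth": covered / len(depths), "mean_depth": avg, "depth_cv": cv}


def rank_scgs(per_scg_depths, min_length_scg=1500, num_top_scgs=100):
    """Filter by length, score, and return the top entries, best first."""
    rows = []
    for scg_id, depths in per_scg_depths.items():
        if len(depths) < min_length_scg:
            continue  # shorter candidates are discarded
        stats = coverage_stats(depths)
        # Hypothetical score: reward breadth, penalise uneven depth.
        score = stats["breadth"] / (1.0 + stats["depth_cv"])
        rows.append((scg_id, score))
    rows.sort(key=lambda row: row[1], reverse=True)  # descending by score
    return rows[:num_top_scgs]


toy = {
    "even": [10, 10, 10, 10],   # fully covered, uniform depth
    "patchy": [0, 0, 20, 20],   # half covered, uneven depth
    "short": [10, 10],          # below the length cut-off used below
}
print(rank_scgs(toy, min_length_scg=3, num_top_scgs=2))
# prints [('even', 1.0), ('patchy', 0.25)]
```

The toy thresholds are chosen only so the example fits on a few lines; the documented defaults are 1500 bp and 100 SCGs.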

Linting and formatting

Linting results

SCG Selector Pipeline 0.0.1-27201c5 run:
	Date:               2026-03-09 07:44:46
	Platform:           Linux-6.14.0-1017-azure-x86_64-with-glibc2.39; #17~24.04.1-Ubuntu SMP Mon Dec  1 20:10:50 UTC 2025
	Host:               runnervm0kj6c
	User:               runner
	Conda:              25.11.1
	Python:             3.12.8
	Snakemake:          9.16.3
	Conda env:          snakemake-workflow-catalog (/home/runner/work/snakemake-workflow-catalog/snakemake-workflow-catalog/.pixi/envs/default)
	Command:            /home/runner/work/snakemake-workflow-catalog/snakemake-workflow-catalog/.pixi/envs/default/bin/snakemake --lint
	Base directory:     /tmp/tmptdd91ahp/workflow
	Working directory:  /tmp/tmptdd91ahp
	Config file(s):     /tmp/tmptdd91ahp/config/config.yaml
Detected species:
- demo [demo]
Lints for snakefile /tmp/tmptdd91ahp/workflow/rules/initialize_pipeline.smk:
    * Environment variable CONDA_DEFAULT_ENV"] + " (" + os.environ["CONDA_PREFIX used but not asserted with envvars directive in line 125.:
      Asserting existence of environment variables with the envvars directive
      ensures proper error messages if the user fails to invoke a workflow with
      all required environment variables defined. Further, it allows snakemake
      to pass them on in case of distributed execution.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/configuration.html#environment-variables
    * Path composition with '+' in line 32:
      This becomes quickly unreadable. Usually, it is better to endure some
      redundancy against having a more readable workflow. Hence, just repeat
      common prefixes. If path composition is unavoidable, use pathlib or
      (python >= 3.6) string formatting with f"...".
    * Path composition with '+' in line 125:
      This becomes quickly unreadable. Usually, it is better to endure some
      redundancy against having a more readable workflow. Hence, just repeat
      common prefixes. If path composition is unavoidable, use pathlib or
      (python >= 3.6) string formatting with f"...".

Lints for snakefile /tmp/tmptdd91ahp/workflow/rules/map_reads_to_library.smk:
    * Mixed rules and functions in same snakefile.:
      Small one-liner functions used only once should be defined as lambda
      expressions. Other functions should be collected in a common module, e.g.
      'rules/common.smk'. This makes the workflow steps more readable.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/modularization.html#includes

Lints for rule prepare_reference (line 5, /tmp/tmptdd91ahp/workflow/rules/identify_scg_with_busco.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files
    * Specify a conda environment or container for each rule.:
      This way, the used software for each specific step is documented, and the
      workflow can be executed on any machine without prerequisites.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#integrated-package-management
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#running-jobs-in-containers

Lints for rule prepare_scg_library (line 43, /tmp/tmptdd91ahp/workflow/rules/identify_scg_with_busco.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule trigger_map_reads_to_scg_library (line 34, /tmp/tmptdd91ahp/workflow/rules/map_reads_to_library.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files
    * Specify a conda environment or container for each rule.:
      This way, the used software for each specific step is documented, and the
      workflow can be executed on any machine without prerequisites.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#integrated-package-management
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#running-jobs-in-containers

Lints for rule convert_sam_to_bam_reads_to_library (line 85, /tmp/tmptdd91ahp/workflow/rules/map_reads_to_library.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule remove_unmapped_reads_from_bam_reads_to_library (line 96, /tmp/tmptdd91ahp/workflow/rules/map_reads_to_library.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule move_deduplicated_to_library_mapped (line 143, /tmp/tmptdd91ahp/workflow/rules/map_reads_to_library.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.

... (truncated)
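The lints reported for initialize_pipeline.smk concern string concatenation with '+' and unguarded access to environment variables. The sketch below shows, in plain Python, the style the linter recommends: f-strings or pathlib for composition, and a fallback when a variable is missing. The function names are hypothetical, and the original expression is only inferred from the lint message; the workflow's actual code is not shown in this document. Within a Snakefile, the linter additionally suggests declaring such variables with the envvars directive.

```python
import os
from pathlib import Path
from typing import Mapping


def describe_conda_env(environ: Mapping[str, str] = os.environ) -> str:
    """Format the active conda environment without assuming the variables exist."""
    name = environ.get("CONDA_DEFAULT_ENV", "unknown")
    prefix = environ.get("CONDA_PREFIX", "unknown")
    # f-string instead of name + " (" + prefix + ")"
    return f"{name} ({prefix})"


def scg_stats_path(results_dir: str, species: str, sample: str) -> Path:
    """Build an output path with pathlib instead of '+' concatenation."""
    return Path(results_dir) / species / "scg_stats" / f"{sample}.json"


print(describe_conda_env({"CONDA_DEFAULT_ENV": "snakemake", "CONDA_PREFIX": "/opt/env"}))
# prints snakemake (/opt/env)
print(scg_stats_path("results", "demo", "sample1").as_posix())
# prints results/demo/scg_stats/sample1.json
```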

Formatting results

[DEBUG] In file "/tmp/tmptdd91ahp/workflow/Snakefile":  Formatted content is different from original
[DEBUG] In file "/tmp/tmptdd91ahp/workflow/rules/rank_scg.smk":  Formatted content is different from original
[DEBUG] In file "/tmp/tmptdd91ahp/workflow/rules/identify_scg_with_busco.smk":  Formatted content is different from original
[DEBUG] In file "/tmp/tmptdd91ahp/workflow/rules/initialize_pipeline.smk":  Formatted content is different from original
[DEBUG] In file "/tmp/tmptdd91ahp/workflow/rules/map_reads_to_library.smk":  Formatted content is different from original
[INFO] 5 file(s) would be changed 😬

snakefmt version: 0.11.4