SarahSaadain/Single_Copy_Gene_Selector
This repository contains a Snakemake pipeline for determining the best single-copy genes (SCGs) for downstream ancient DNA (aDNA) analysis.
Overview
Latest release: v1.2.0, Last update: 2026-05-15
Share link: https://snakemake.github.io/snakemake-workflow-catalog?wf=SarahSaadain/Single_Copy_Gene_Selector
Quality control: linting: failed, formatting: failed
Wrappers: bio/busco, bio/bwa-mem2/index, bio/bwa-mem2/mem, bio/minimap2/aligner, bio/minimap2/index, bio/samtools/index, bio/samtools/view
Deployment
Step 1: Install Snakemake and Snakedeploy
Snakemake and Snakedeploy are best installed via the Conda package manager. It is recommended to install conda via Miniforge. Run
conda create -c conda-forge -c bioconda -c nodefaults --name snakemake snakemake snakedeploy
to install both Snakemake and Snakedeploy in an isolated environment. For all following commands ensure that this environment is activated via
conda activate snakemake
For other installation methods, refer to the Snakemake and Snakedeploy documentation.
Step 2: Deploy workflow
With Snakemake and Snakedeploy installed, the workflow can be deployed as follows. First, create an appropriate project working directory on your system and enter it:
mkdir -p path/to/project-workdir
cd path/to/project-workdir
In all following steps, we will assume that you are inside of that directory. Then run
snakedeploy deploy-workflow https://github.com/SarahSaadain/Single_Copy_Gene_Selector . --tag v1.2.0
Snakedeploy will create two folders, workflow and config. The former contains the deployment of the chosen workflow as a Snakemake module, the latter contains configuration files which will be modified in the next step in order to configure the workflow to your needs.
Step 3: Configure workflow
To configure the workflow, adapt config/config.yaml to your needs following the instructions below.
Step 4: Run workflow
The deployment method is controlled using the --software-deployment-method (short --sdm) argument.
To run the workflow with automatic deployment of all required software via conda/mamba, use
snakemake --cores all --sdm conda
Snakemake will automatically detect the main Snakefile in the workflow subfolder and execute the workflow module that has been defined by the deployment in step 2.
For further options such as cluster and cloud execution, see the docs.
Step 5: Generate report
After finalizing your data analysis, you can automatically generate an interactive visual HTML report for inspection of results together with parameters and code inside of the browser using
snakemake --report report.zip
Configuration
The following section is imported from the workflow’s config/README.md.
Pipeline Setup Guide
Requirements
The only manual prerequisite is Snakemake ≥ 7.0. All other dependencies (BUSCO, BWA, SAMtools, pysam, numpy, etc.) are automatically installed by the pipeline when it runs.
To install Snakemake:
conda install -c bioconda -c conda-forge snakemake
Input Data
The pipeline requires two inputs per species:
Reference genome — a FASTA file of the assembly against which BUSCO will be run to identify candidate SCGs. This should be a high-quality modern genome assembly.
Reads — one or more FASTQ files (gzipped) containing the reads to be mapped to the SCG library. These can be a mix of modern and ancient samples; providing both is strongly recommended (see the main README for why).
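Because the reads are expected as gzipped FASTQ, a quick check before launching the pipeline can catch a wrongly compressed or empty input early. The following is a minimal sketch using only the Python standard library; the helper names `looks_gzipped` and `first_fastq_record` are hypothetical and not part of the pipeline, and the demonstration file is generated on the fly rather than taken from the demo data.

```python
import gzip
import tempfile
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes


def looks_gzipped(path):
    """Return True if the file begins with the gzip magic number."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def first_fastq_record(path):
    """Return the first four-line FASTQ record of a gzipped file."""
    with gzip.open(path, "rt") as handle:
        return [handle.readline().rstrip("\n") for _ in range(4)]


# Demonstration on a throwaway file (hypothetical read, not pipeline data).
with tempfile.TemporaryDirectory() as tmp:
    demo_reads = Path(tmp) / "demo.fastq.gz"
    with gzip.open(demo_reads, "wt") as out:
        out.write("@read1\nACGT\n+\nIIII\n")
    is_gzipped = looks_gzipped(demo_reads)
    record = first_fastq_record(demo_reads)

print(is_gzipped, record)
```

A well-formed record starts with an `@` header line and has a `+` separator on the third line, so inspecting `record` is usually enough to spot a file that is not FASTQ at all.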
Configuration
All pipeline parameters are controlled through config.yaml. A minimal working example is shown below:
# config.yaml - Configuration file for SCG Selector Workflow
species:
  demo:
    name: demo
    lineage: drosophilidae_odb12
    settings:
      num_top_scgs: 100
      min_length_scg: 1500
    reads:
      - /demo_data/demo.fastq.gz
    reference: /demo_data/Dfun_reference_genome.fasta
Configuration fields
| Field | Description |
|---|---|
| name | A short identifier for the species or run. Used to name output files. |
| lineage | The BUSCO lineage dataset to use (e.g. …). |
| num_top_scgs | Number of top-ranked SCGs to retain after scoring. Default: … |
| min_length_scg | Minimum SCG sequence length in base pairs. Shorter candidates are discarded. Default: … |
| reads | List of paths to input FASTQ files (gzipped). At least one file is required. Add one path per line for multiple samples. |
| reference | Path to the reference genome FASTA file. |
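The field descriptions above imply a few structural rules that can be checked before a run. The sketch below operates on a plain dictionary mirroring the example config; the function `validate_species` is hypothetical, and treating `name`, `lineage`, `reads`, and `reference` as required while `settings` stays optional is an assumption based on the defaults mentioned above, not a statement about how the pipeline itself validates its input.

```python
# Assumed-required fields; 'settings' is treated as optional because its
# options are documented with defaults.
REQUIRED_FIELDS = ("name", "lineage", "reads", "reference")


def validate_species(key, block):
    """Return a list of problems found in one species block (empty if none)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in block:
            problems.append(f"{key}: missing field '{field}'")

    # At least one read file must be listed.
    if "reads" in block:
        reads = block["reads"]
        if not isinstance(reads, list) or len(reads) == 0:
            problems.append(f"{key}: 'reads' must list at least one FASTQ path")

    # Numeric settings, when given, should be positive integers.
    settings = block.get("settings") or {}
    for option in ("num_top_scgs", "min_length_scg"):
        value = settings.get(option)
        if value is None:
            continue
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            problems.append(f"{key}: '{option}' must be a positive integer")
    return problems


# Dictionary equivalent of the minimal example config.
config = {
    "species": {
        "demo": {
            "name": "demo",
            "lineage": "drosophilidae_odb12",
            "settings": {"num_top_scgs": 100, "min_length_scg": 1500},
            "reads": ["/demo_data/demo.fastq.gz"],
            "reference": "/demo_data/Dfun_reference_genome.fasta",
        }
    }
}

for species_key, species_block in config["species"].items():
    print(species_key, validate_species(species_key, species_block))
```

Returning a list of messages instead of raising on the first problem lets all issues in a config be reported in one pass.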
Adding multiple species or samples
Multiple species blocks can coexist in the same config file, and multiple read files can be listed under a single species. Each read file is treated as an independent sample and mapped separately to the SCG library:
species:
  drosophila:
    name: drosophila
    lineage: drosophilidae_odb12
    settings:
      num_top_scgs: 100
      min_length_scg: 1500
    reads:
      - /data/modern_sample.fastq.gz
      - /data/ancient_sample_1.fastq.gz
      - /data/ancient_sample_2.fastq.gz
    reference: /data/reference.fasta
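To make the "one read file, one sample" rule concrete, the following sketch expands a config dictionary into one mapping job per species and read file. The helper names and the way a sample name is derived from the file name are assumptions for illustration only; the pipeline may name its samples differently.

```python
from pathlib import Path

# Dictionary equivalent of the multi-sample example config.
config = {
    "species": {
        "drosophila": {
            "name": "drosophila",
            "lineage": "drosophilidae_odb12",
            "settings": {"num_top_scgs": 100, "min_length_scg": 1500},
            "reads": [
                "/data/modern_sample.fastq.gz",
                "/data/ancient_sample_1.fastq.gz",
                "/data/ancient_sample_2.fastq.gz",
            ],
            "reference": "/data/reference.fasta",
        }
    }
}


def sample_name(read_path):
    """Derive a sample name by stripping a FASTQ suffix (assumed convention)."""
    name = Path(read_path).name
    for suffix in (".fastq.gz", ".fq.gz"):
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name


def mapping_jobs(config):
    """List one (species, sample, read file) job per configured read file."""
    jobs = []
    for species_key, block in config["species"].items():
        for read_path in block["reads"]:
            jobs.append((species_key, sample_name(read_path), read_path))
    return jobs


jobs = mapping_jobs(config)
for job in jobs:
    print(job)
```

With the config above this yields three independent jobs for the single species `drosophila`, one per read file.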
Running the Pipeline
Once your config.yaml is set up, run the pipeline from the repository root with:
snakemake --cores <N> --use-conda
Replace <N> with the number of CPU cores to use. For a dry run (to check that the workflow is correctly configured without executing anything):
snakemake --configfile config.yaml --cores 1 --use-conda --dry-run
Output Files
The pipeline produces the following outputs per species:
| File | Description |
|---|---|
| … | Per-contig coverage statistics for each input BAM (depth, breadth, etc.). |
| … | Tab-separated ranked list of SCGs with scores for breadth, depth variation, and depth consistency. |
| … | Full JSON summary including per-sample raw stats, aggregated metrics, and scoring details for every SCG. |
| … | FASTA file containing the sequences of the top-ranked SCGs. |
The TSV and JSON outputs are both sorted by final score in descending order, so the top entries are the highest-quality SCGs recommended for downstream use.
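The combined effect of `min_length_scg`, `num_top_scgs`, and the descending sort can be sketched as follows. This is an illustration only: the record keys `id`, `length`, and `final_score`, the example values, and the order of filtering and ranking are assumptions, and the pipeline's actual scoring is not reproduced here.

```python
def select_top_scgs(candidates, num_top_scgs, min_length_scg):
    """Drop short candidates, sort by score (descending), keep the top N."""
    long_enough = [c for c in candidates if c["length"] >= min_length_scg]
    ranked = sorted(long_enough, key=lambda c: c["final_score"], reverse=True)
    return ranked[:num_top_scgs]


# Hypothetical candidates; real scores come from the pipeline's ranking step.
candidates = [
    {"id": "scg_a", "length": 2000, "final_score": 0.90},
    {"id": "scg_b", "length": 1200, "final_score": 0.95},  # too short
    {"id": "scg_c", "length": 1800, "final_score": 0.70},
    {"id": "scg_d", "length": 1500, "final_score": 0.80},
]

top = select_top_scgs(candidates, num_top_scgs=2, min_length_scg=1500)
print([c["id"] for c in top])
```

Here `scg_b` is excluded despite its high score because it falls below the length threshold, which is why the two settings need to be chosen together.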
Linting and formatting
Linting results
SCG Selector Pipeline 1.1.0 run:
  Date: 2026-05-18 06:37:57
  Process ID: 2700
  Platform: Linux-6.17.0-1013-azure-x86_64-with-glibc2.39; #13~24.04.1-Ubuntu SMP Wed Apr 15 16:52:17 UTC 2026
  Host: runnervmrw5os
  User: runner
  Conda: 26.3.2
  Python: 3.13.12
  Snakemake: 9.17.2
  Conda env: snakemake-workflow-catalog (/home/runner/work/snakemake-workflow-catalog/snakemake-workflow-catalog/.pixi/envs/default)
  Command: /home/runner/work/snakemake-workflow-catalog/snakemake-workflow-catalog/.pixi/envs/default/bin/snakemake --lint
  Base directory: /tmp/tmpr0s8fq_d/SarahSaadain-Single_Copy_Gene_Selector-c8e16cb/workflow
  Working directory: /tmp/tmpr0s8fq_d/SarahSaadain-Single_Copy_Gene_Selector-c8e16cb
  Config file(s): /tmp/tmpr0s8fq_d/SarahSaadain-Single_Copy_Gene_Selector-c8e16cb/config/config.yaml
[2026-05-18 06:37:57 (UTC)] [INFO] Detected species:
- demo [demo]
Lints for snakefile /tmp/tmpr0s8fq_d/SarahSaadain-Single_Copy_Gene_Selector-c8e16cb/workflow/rules/initialize_pipeline.smk:
    * Environment variable CONDA_DEFAULT_ENV"] + " (" + os.environ["CONDA_PREFIX used but not asserted with envvars directive in line 126.:
      Asserting existence of environment variables with the envvars directive
      ensures proper error messages if the user fails to invoke a workflow with
      all required environment variables defined. Further, it allows snakemake
      to pass them on in case of distributed execution.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/configuration.html#environment-variables
    * Path composition with '+' in line 32:
      This becomes quickly unreadable. Usually, it is better to endure some
      redundancy against having a more readable workflow. Hence, just repeat
      common prefixes. If path composition is unavoidable, use pathlib or
      (python >= 3.6) string formatting with f"...".
    * Path composition with '+' in line 126:
      This becomes quickly unreadable. Usually, it is better to endure some
      redundancy against having a more readable workflow. Hence, just repeat
      common prefixes. If path composition is unavoidable, use pathlib or
      (python >= 3.6) string formatting with f"...".

Lints for snakefile /tmp/tmpr0s8fq_d/SarahSaadain-Single_Copy_Gene_Selector-c8e16cb/workflow/rules/map_reads_to_library.smk:
    * Mixed rules and functions in same snakefile.:
      Small one-liner functions used only once should be defined as lambda
      expressions. Other functions should be collected in a common module, e.g.
      'rules/common.smk'. This makes the workflow steps more readable.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/modularization.html#includes

Lints for rule prepare_reference (line 5, /tmp/tmpr0s8fq_d/SarahSaadain-Single_Copy_Gene_Selector-c8e16cb/workflow/rules/identify_scg_with_busco.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files
    * Specify a conda environment or container for each rule.:
      This way, the used software for each specific step is documented, and the
      workflow can be executed on any machine without prerequisites.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#integrated-package-management
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#running-jobs-in-containers

Lints for rule prepare_scg_library (line 43, /tmp/tmpr0s8fq_d/SarahSaadain-Single_Copy_Gene_Selector-c8e16cb/workflow/rules/identify_scg_with_busco.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule trigger_map_reads_to_scg_library (line 36, /tmp/tmpr0s8fq_d/SarahSaadain-Single_Copy_Gene_Selector-c8e16cb/workflow/rules/map_reads_to_library.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files
    * Specify a conda environment or container for each rule.:
      This way, the used software for each specific step is documented, and the
      workflow can be executed on any machine without prerequisites.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#integrated-package-management
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#running-jobs-in-containers

Lints for rule remove_unmapped_reads_from_bam_reads_to_library (line 111, /tmp/tmpr0s8fq_d/SarahSaadain-Single_Copy_Gene_Selector-c8e16cb/workflow/rules/map_reads_to_library.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule index_bam_reads_to_library (line 125, /tmp/tmpr0s8fq_d/SarahSaadain-Single_Copy_Gene_Selector-c8e16cb/workflow/rules/map_reads_to_library.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule filter_top_scgs_tes (line 28, /tmp/tmpr0s8fq_d/SarahSaadain-Single_Copy_Gene_Selector-c8e16cb/workflow/rules/rank_scg.smk):
    * No log directive defined:

... (truncated)
Formatting results
[DEBUG]
[DEBUG] In file "/tmp/tmpr0s8fq_d/SarahSaadain-Single_Copy_Gene_Selector-c8e16cb/workflow/Snakefile": Formatted content is different from original
[DEBUG]
[DEBUG] In file "/tmp/tmpr0s8fq_d/SarahSaadain-Single_Copy_Gene_Selector-c8e16cb/workflow/rules/rank_scg.smk": Formatted content is different from original
[DEBUG]
[DEBUG] In file "/tmp/tmpr0s8fq_d/SarahSaadain-Single_Copy_Gene_Selector-c8e16cb/workflow/rules/identify_scg_with_busco.smk": Formatted content is different from original
[DEBUG]
[DEBUG] In file "/tmp/tmpr0s8fq_d/SarahSaadain-Single_Copy_Gene_Selector-c8e16cb/workflow/rules/initialize_pipeline.smk": Formatted content is different from original
[DEBUG]
[DEBUG] In file "/tmp/tmpr0s8fq_d/SarahSaadain-Single_Copy_Gene_Selector-c8e16cb/workflow/rules/map_reads_to_library.smk": Formatted content is different from original
[INFO] 5 file(s) would be changed 😬

snakefmt version: 0.11.5