SarahSaadain/Single_Copy_Gene_Selector
This repository contains a Snakemake pipeline for selecting the best single-copy genes (SCGs) for downstream ancient DNA (aDNA) analysis.
Overview
Latest release: None, Last update: 2026-03-07
Share link: https://snakemake.github.io/snakemake-workflow-catalog?wf=SarahSaadain/Single_Copy_Gene_Selector
Quality control: linting: failed, formatting: failed
Wrappers: bio/busco bio/bwa-mem2/index bio/bwa-mem2/mem bio/samtools/index bio/samtools/sort bio/samtools/view
Deployment
Step 1: Install Snakemake and Snakedeploy
Snakemake and Snakedeploy are best installed via the Conda package manager. It is recommended to install conda via Miniforge. Run
conda create -c conda-forge -c bioconda -c nodefaults --name snakemake snakemake snakedeploy
to install both Snakemake and Snakedeploy in an isolated environment. For all following commands ensure that this environment is activated via
conda activate snakemake
For other installation methods, refer to the Snakemake and Snakedeploy documentation.
Step 2: Deploy workflow
With Snakemake and Snakedeploy installed, the workflow can be deployed as follows. First, create an appropriate project working directory on your system and enter it:
mkdir -p path/to/project-workdir
cd path/to/project-workdir
In all following steps, we will assume that you are inside of that directory. Then run
snakedeploy deploy-workflow https://github.com/SarahSaadain/Single_Copy_Gene_Selector . --tag None
Snakedeploy will create two folders, workflow and config. The former contains the deployment of the chosen workflow as a Snakemake module, the latter contains configuration files which will be modified in the next step in order to configure the workflow to your needs.
Step 3: Configure workflow
To configure the workflow, adapt config/config.yaml to your needs following the instructions below.
Step 4: Run workflow
The deployment method is controlled using the --software-deployment-method (short --sdm) argument.
To run the workflow with automatic deployment of all required software via conda/mamba, use
snakemake --cores all --sdm conda
Snakemake will automatically detect the main Snakefile in the workflow subfolder and execute the workflow module that has been defined by the deployment in step 2.
For further options such as cluster and cloud execution, see the docs.
Step 5: Generate report
After finalizing your data analysis, you can automatically generate an interactive visual HTML report for inspection of results together with parameters and code inside of the browser using
snakemake --report report.zip
Configuration
The following section is imported from the workflow’s config/README.md.
Pipeline Setup Guide
Requirements
The only manual prerequisite is Snakemake ≥ 7.0. All other dependencies (BUSCO, BWA, SAMtools, pysam, numpy, etc.) are automatically installed by the pipeline when it runs.
To install Snakemake:
conda install -c bioconda -c conda-forge snakemake
Input Data
The pipeline requires two inputs per species:
Reference genome — a FASTA file of the assembly against which BUSCO will be run to identify candidate SCGs. This should be a high-quality modern genome assembly.
Reads — one or more FASTQ files (gzipped) containing the reads to be mapped to the SCG library. These can be a mix of modern and ancient samples; providing both is strongly recommended (see the main README for why).
Configuration
All pipeline parameters are controlled through config.yaml. A minimal working example is shown below:
# config.yaml - Configuration file for aDNA pipeline
species:
  demo:
    name: demo
    lineage: drosophilidae_odb12
    settings:
      num_top_scgs: 100
      min_length_scg: 1500
    reads:
      - /demo_data/demo.fastq.gz
    reference: /demo_data/Dfun_reference_genome.fasta
Configuration fields
| Field | Description |
|---|---|
| name | A short identifier for the species or run. Used to name output files. |
| lineage | The BUSCO lineage dataset to use (e.g. drosophilidae_odb12, as in the example above). |
| num_top_scgs | Number of top-ranked SCGs to retain after scoring. The example above uses 100. |
| min_length_scg | Minimum SCG sequence length in base pairs. Shorter candidates are discarded. The example above uses 1500. |
| reads | List of paths to input FASTQ files (gzipped). At least one file is required. Add one path per line for multiple samples. |
| reference | Path to the reference genome FASTA file. |
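The field requirements above can be checked before starting a run. The following is a minimal sketch, not part of the pipeline: the function name validate_species_config is hypothetical, the config is passed as an already-parsed dictionary, and the settings block is assumed to be optional.

```python
# Hypothetical pre-flight check; field names follow the example config above.
# Assumption: "settings" is optional, the other four fields are required.
REQUIRED_FIELDS = ("name", "lineage", "reads", "reference")


def validate_species_config(config: dict) -> list[str]:
    """Return a list of problems; an empty list means the config looks usable."""
    problems = []
    species = config.get("species") or {}
    if not species:
        problems.append("no species defined")
    for key, entry in species.items():
        # Every species block needs the required fields, and at least one read file.
        for field in REQUIRED_FIELDS:
            if not entry.get(field):
                problems.append(f"{key}: missing '{field}'")
        # Reads are expected to be gzipped FASTQ files.
        for path in entry.get("reads") or []:
            if not str(path).endswith(".gz"):
                problems.append(f"{key}: read file is not gzipped: {path}")
    return problems


example = {
    "species": {
        "demo": {
            "name": "demo",
            "lineage": "drosophilidae_odb12",
            "settings": {"num_top_scgs": 100, "min_length_scg": 1500},
            "reads": ["/demo_data/demo.fastq.gz"],
            "reference": "/demo_data/Dfun_reference_genome.fasta",
        }
    }
}
print(validate_species_config(example))  # prints []
```

The check only inspects the structure of the config. It does not test whether the files exist or whether the lineage name is a valid BUSCO dataset.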
Adding multiple species or samples
Multiple species blocks can coexist in the same config file, and multiple read files can be listed under a single species. Each read file is treated as an independent sample and mapped separately to the SCG library:
species:
  drosophila:
    name: drosophila
    lineage: drosophilidae_odb12
    settings:
      num_top_scgs: 100
      min_length_scg: 1500
    reads:
      - /data/modern_sample.fastq.gz
      - /data/ancient_sample_1.fastq.gz
      - /data/ancient_sample_2.fastq.gz
    reference: /data/reference.fasta
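To illustrate how one read file per sample expands into independent mapping jobs, here is a small sketch. The helper names sample_name and mapping_jobs are hypothetical, and deriving the sample label from the file name is an assumption; the pipeline may name samples differently.

```python
from pathlib import PurePosixPath


def sample_name(read_path: str) -> str:
    """Derive a sample label from a FASTQ path (assumed rule, not the pipeline's)."""
    name = PurePosixPath(read_path).name
    for suffix in (".fastq.gz", ".fq.gz"):
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name


def mapping_jobs(config: dict) -> list[tuple[str, str]]:
    """List one (species, sample) pair per read file, in config order."""
    return [
        (species, sample_name(path))
        for species, entry in config["species"].items()
        for path in entry["reads"]
    ]


config = {
    "species": {
        "drosophila": {
            "reads": [
                "/data/modern_sample.fastq.gz",
                "/data/ancient_sample_1.fastq.gz",
                "/data/ancient_sample_2.fastq.gz",
            ],
        }
    }
}
for species, sample in mapping_jobs(config):
    print(species, sample)
```

With the config above, this yields three jobs for the single species drosophila, one per read file.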
Running the Pipeline
Once your config.yaml is set up, run the pipeline from the repository root with:
snakemake --cores <N> --use-conda
Replace <N> with the number of CPU cores to use. For a dry run (to check that the workflow is correctly configured without executing anything):
snakemake --configfile config.yaml --cores 1 --use-conda --dry-run
Output Files
The pipeline produces the following outputs per species:
| Output | Description |
|---|---|
| Coverage statistics | Per-contig coverage statistics for each input BAM (depth, breadth, etc.). |
| Ranked SCG list (TSV) | Tab-separated ranked list of SCGs with scores for breadth, depth variation, and depth consistency. |
| Scoring summary (JSON) | Full JSON summary including per-sample raw stats, aggregated metrics, and scoring details for every SCG. |
| Top SCG sequences (FASTA) | FASTA file containing the sequences of the top-ranked SCGs. |
The TSV and JSON outputs are both sorted by final score in descending order, so the top entries are the highest-quality SCGs recommended for downstream use.
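For downstream scripts that consume the ranked TSV, selecting the top entries might look like the sketch below. The column names scg_id and final_score and the function top_scgs are assumptions made for illustration; check the header of the actual output file before relying on them.

```python
import csv
import io


def top_scgs(tsv_text: str, n: int, score_column: str = "final_score") -> list[str]:
    """Return the IDs of the n highest-scoring SCGs from tab-separated text."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    # The pipeline output is described as already sorted; sorting again keeps
    # the sketch correct even if the rows arrive in a different order.
    rows.sort(key=lambda row: float(row[score_column]), reverse=True)
    return [row["scg_id"] for row in rows[:n]]


# Made-up rows using the assumed column names.
demo_tsv = "scg_id\tfinal_score\nscgA\t0.91\nscgB\t0.42\nscgC\t0.77\n"
print(top_scgs(demo_tsv, 2))  # prints ['scgA', 'scgC']
```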
Linting and formatting
Linting results
SCG Selector Pipeline 0.0.1-27201c5 run:
  Date: 2026-03-09 07:44:46
  Platform: Linux-6.14.0-1017-azure-x86_64-with-glibc2.39; #17~24.04.1-Ubuntu SMP Mon Dec 1 20:10:50 UTC 2025
  Host: runnervm0kj6c
  User: runner
  Conda: 25.11.1
  Python: 3.12.8
  Snakemake: 9.16.3
  Conda env: snakemake-workflow-catalog (/home/runner/work/snakemake-workflow-catalog/snakemake-workflow-catalog/.pixi/envs/default)
  Command: /home/runner/work/snakemake-workflow-catalog/snakemake-workflow-catalog/.pixi/envs/default/bin/snakemake --lint
  Base directory: /tmp/tmptdd91ahp/workflow
  Working directory: /tmp/tmptdd91ahp
  Config file(s): /tmp/tmptdd91ahp/config/config.yaml
[2026-03-09 07:44:46 (UTC)] [INFO] Detected species:
- demo [demo]
Lints for snakefile /tmp/tmptdd91ahp/workflow/rules/initialize_pipeline.smk:
    * Environment variable CONDA_DEFAULT_ENV"] + " (" + os.environ["CONDA_PREFIX used but not asserted with envvars directive in line 125.:
      Asserting existence of environment variables with the envvars directive
      ensures proper error messages if the user fails to invoke a workflow with
      all required environment variables defined. Further, it allows snakemake
      to pass them on in case of distributed execution.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/configuration.html#environment-variables
    * Path composition with '+' in line 32:
      This becomes quickly unreadable. Usually, it is better to endure some
      redundancy against having a more readable workflow. Hence, just repeat
      common prefixes. If path composition is unavoidable, use pathlib or
      (python >= 3.6) string formatting with f"...".
    * Path composition with '+' in line 125:
      This becomes quickly unreadable. Usually, it is better to endure some
      redundancy against having a more readable workflow. Hence, just repeat
      common prefixes. If path composition is unavoidable, use pathlib or
      (python >= 3.6) string formatting with f"...".

Lints for snakefile /tmp/tmptdd91ahp/workflow/rules/map_reads_to_library.smk:
    * Mixed rules and functions in same snakefile.:
      Small one-liner functions used only once should be defined as lambda
      expressions. Other functions should be collected in a common module, e.g.
      'rules/common.smk'. This makes the workflow steps more readable.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/modularization.html#includes

Lints for rule prepare_reference (line 5, /tmp/tmptdd91ahp/workflow/rules/identify_scg_with_busco.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files
    * Specify a conda environment or container for each rule.:
      This way, the used software for each specific step is documented, and the
      workflow can be executed on any machine without prerequisites.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#integrated-package-management
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#running-jobs-in-containers

Lints for rule prepare_scg_library (line 43, /tmp/tmptdd91ahp/workflow/rules/identify_scg_with_busco.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule trigger_map_reads_to_scg_library (line 34, /tmp/tmptdd91ahp/workflow/rules/map_reads_to_library.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files
    * Specify a conda environment or container for each rule.:
      This way, the used software for each specific step is documented, and the
      workflow can be executed on any machine without prerequisites.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#integrated-package-management
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#running-jobs-in-containers

Lints for rule convert_sam_to_bam_reads_to_library (line 85, /tmp/tmptdd91ahp/workflow/rules/map_reads_to_library.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule remove_unmapped_reads_from_bam_reads_to_library (line 96, /tmp/tmptdd91ahp/workflow/rules/map_reads_to_library.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule move_deduplicated_to_library_mapped (line 143, /tmp/tmptdd91ahp/workflow/rules/map_reads_to_library.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.

... (truncated)
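The "Path composition with '+'" lint reported above recommends pathlib or f-strings instead of string concatenation. The sketch below shows the three styles side by side; the directory and file names are made up for illustration and do not come from the workflow's rules.

```python
from pathlib import Path

# Hypothetical path parts, chosen only for this example.
base_dir = "results"
species = "demo"

# Pattern the linter flags: joining path parts with '+'.
concatenated = base_dir + "/" + species + "/" + "scg_library.fasta"

# Alternatives named in the lint message: an f-string or pathlib.
formatted = f"{base_dir}/{species}/scg_library.fasta"
joined = Path(base_dir) / species / "scg_library.fasta"

# All three spell the same relative path.
print(concatenated == formatted == joined.as_posix())  # prints True
```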
Formatting results
[DEBUG]
[DEBUG] In file "/tmp/tmptdd91ahp/workflow/Snakefile": Formatted content is different from original
[DEBUG]
[DEBUG] In file "/tmp/tmptdd91ahp/workflow/rules/rank_scg.smk": Formatted content is different from original
[DEBUG]
[DEBUG] In file "/tmp/tmptdd91ahp/workflow/rules/identify_scg_with_busco.smk": Formatted content is different from original
[DEBUG]
[DEBUG] In file "/tmp/tmptdd91ahp/workflow/rules/initialize_pipeline.smk": Formatted content is different from original
[DEBUG]
[DEBUG] In file "/tmp/tmptdd91ahp/workflow/rules/map_reads_to_library.smk": Formatted content is different from original
[INFO] 5 file(s) would be changed 😬

snakefmt version: 0.11.4