SarahSaadain/PastForward
This project provides a pipeline for analyzing raw ancient DNA (aDNA) sequencing data obtained from a sequencing facility. The pipeline includes various Snakemake workflows that process and analyze the data and generate reports on sequence quality, which help decide whether an aDNA extraction and sequencing run was successful; it further polishes the data for downstream analyses.
Overview
Latest release: None, Last update: 2026-03-14
Share link: https://snakemake.github.io/snakemake-workflow-catalog?wf=SarahSaadain/PastForward
Quality control: linting failed, formatting failed
Topics: adna dna-sequencing genomics pipeline snakemake ancient-dna ancient-dna-analysis ancientdna genome genome-mapping multiqc raw-reads bioinformatics bioinformatics-pipeline short-read-mapping short-reads bam fasta fastq
Deployment
Step 1: Install Snakemake and Snakedeploy
Snakemake and Snakedeploy are best installed via the Conda package manager. It is recommended to install conda via Miniforge. Run
conda create -c conda-forge -c bioconda -c nodefaults --name snakemake snakemake snakedeploy
to install both Snakemake and Snakedeploy in an isolated environment. For all following commands ensure that this environment is activated via
conda activate snakemake
For other installation methods, refer to the Snakemake and Snakedeploy documentation.
Step 2: Deploy workflow
With Snakemake and Snakedeploy installed, the workflow can be deployed as follows. First, create an appropriate project working directory on your system and enter it:
mkdir -p path/to/project-workdir
cd path/to/project-workdir
In all following steps, we will assume that you are inside of that directory. Then run
snakedeploy deploy-workflow https://github.com/SarahSaadain/PastForward . --tag None
Snakedeploy will create two folders, workflow and config. The former contains the deployment of the chosen workflow as a Snakemake module, the latter contains configuration files which will be modified in the next step in order to configure the workflow to your needs.
Step 3: Configure workflow
To configure the workflow, adapt config/config.yml to your needs following the instructions below.
Step 4: Run workflow
The deployment method is controlled using the --software-deployment-method (short --sdm) argument.
To run the workflow with automatic deployment of all required software via conda/mamba, use
snakemake --cores all --sdm conda
Snakemake will automatically detect the main Snakefile in the workflow subfolder and execute the workflow module that has been defined by the deployment in step 2.
For further options such as cluster and cloud execution, see the docs.
Step 5: Generate report
After finalizing your data analysis, you can automatically generate an interactive HTML report for inspecting results, together with parameters and code, in the browser using
snakemake --report report.zip
Configuration
The following section is imported from the workflow’s config/README.md.
Setup Overview
Install Snakemake
To install Snakemake, you can use conda, which is a package manager that simplifies the installation of software and its dependencies. You can create a new conda environment for Snakemake and install it using the following commands:
conda create -c conda-forge -c bioconda -c nodefaults -n snakemake snakemake
conda activate snakemake
snakemake --help
Refer to the Snakemake documentation for more installation options and details.
Setup Instructions
Before running the pipeline, ensure you have an environment with Snakemake and the required dependencies installed.
Required dependencies for pipeline processing will be installed automatically, except for the contamination analysis tools. Those have to be installed separately and their details need to be added to the config file.
The pipeline supports ECMSD for contamination analysis. Ensure ECMSD is configured in the `config.yaml` file under `contamination_analysis`.
The pipeline supports Centrifuge for contamination analysis. Ensure Centrifuge is configured in the `config.yaml` file under `contamination_analysis`.
You need to add species details to the pipeline (config and files).
Your reads should be renamed according to the naming convention specified below.
Folder Structure
Species Folders
The project contains folders for different species, which contain the raw data, processed data, and results for each species.
The species folders should be placed in the root folder of your pipeline.
Providing Raw Data
The pipeline supports automatically moving the raw reads to the <species>/raw/reads/ folder as well as the reference to the <species>/raw/ref/ folder. Simply provide the files in the <species> folder. Alternatively, you can manually move the files to the respective folders.
Provide the raw reads in the `<species>/raw/reads/` folder.
Provide the reference in the `<species>/raw/ref/` folder.
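The automatic file placement described above can be approximated with a short script. This is an illustrative sketch, not the pipeline's actual code: the folder names follow the convention above, but the sorting rule (compressed FASTQ files go to `raw/reads/`, FASTA files to `raw/ref/`) and the helper name `sort_raw_files` are assumptions.

```python
from pathlib import Path
import shutil

# Hypothetical helper: sort files dropped into a species folder into the
# raw/reads/ and raw/ref/ subfolders expected by the pipeline.
READ_SUFFIXES = (".fastq.gz",)             # reads are compressed FASTQ
REF_SUFFIXES = (".fasta", ".fa", ".fna")   # assumed reference extensions

def sort_raw_files(species_dir: Path) -> None:
    reads_dir = species_dir / "raw" / "reads"
    ref_dir = species_dir / "raw" / "ref"
    reads_dir.mkdir(parents=True, exist_ok=True)
    ref_dir.mkdir(parents=True, exist_ok=True)
    for f in species_dir.iterdir():
        if not f.is_file():
            continue
        name = f.name.lower()
        if name.endswith(READ_SUFFIXES):
            shutil.move(str(f), reads_dir / f.name)
        elif name.endswith(REF_SUFFIXES):
            shutil.move(str(f), ref_dir / f.name)
```

Files that match neither suffix list are left in place, mirroring the manual alternative of moving files yourself.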
When adding a new species, make sure that:
the species folder is placed in the root folder of your pipeline
the folder name matches the species key defined in `config.yaml` under `species:`
Folder Structure
All other folders will be created and populated automatically
The folder `<species>/processed/` contains the intermediate files created during processing. Most of these files are marked as temporary and will be deleted at the end of the pipeline. Some files are kept to allow reprocessing from different points in case something fails.
The folder `<species>/results/` contains the final results and reports.
General read-processing data will be in either `processed` or `results`.
Everything related to a reference will have a `<reference>` folder under `processed` or `results`. Typically, only the `results` folder contains the information required for further analysis. In case more information is needed, the original files can often be found in the `processed` folder.
Some exceptions are `*.sam` and unsorted `*.bam` files, which are deleted to save storage space. Most other files are kept to allow reprocessing the pipeline from different points in case something fails. If a step should be repeated, the relevant files need to be deleted manually.
RAW Reads Filenames
The pipeline expects input read files to follow a standardized naming convention:
<Individual>_[<FreeText>_]R<1/2>[_<FreeText>].fastq.gz
Following this convention ensures proper organization and automated processing within the pipeline.
Filename Components:
`<Individual>` – A unique identifier for the sample or individual.
`<FreeText>` – Any additional text or identifier that can be included in the filename. Typically, this is used to differentiate between different samples within the same individual, e.g. the original sample name.
`R<1/2>` – Indicates the read pair number, typically `R1` for the first read and `R2` for the second read.
`.fastq.gz` – The expected file extension, indicating compressed FASTQ format. Only `.fastq.gz` files are supported.
Notes:
The name must contain `_R1` and, if paired-end, `_R2`.
For paired-end data, the names of the read files must be identical except for `_R1` and `_R2`.
Individual names/IDs will be used to name the output files as well as in the reports and plots.
Example:
Bger1_326862_S37_R1_001.fastq.gz
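The convention can be checked with a small regular expression. This sketch is illustrative, not the pipeline's implementation; the helper name `parse_read_filename` and the exact pattern are assumptions based on the rules above. It extracts the individual ID and read-pair number from a conforming filename.

```python
import re

# Pattern for <Individual>_[<FreeText>_]R<1/2>[_<FreeText>].fastq.gz
# (illustrative; the pipeline may implement this differently)
FILENAME_RE = re.compile(
    r"^(?P<individual>[^_]+)"   # individual/sample identifier
    r"(?:_.+?)?"                # optional free text
    r"_R(?P<read>[12])"         # read pair number
    r"(?:_.+?)?"                # optional free text
    r"\.fastq\.gz$"
)

def parse_read_filename(filename: str):
    """Return (individual, read_number) or None if the name does not match."""
    m = FILENAME_RE.match(filename)
    if m is None:
        return None
    return m.group("individual"), int(m.group("read"))
```

For the example above, `parse_read_filename("Bger1_326862_S37_R1_001.fastq.gz")` yields `("Bger1", 1)`.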
Configuration File Structure for aDNA Pipeline (config.yaml)
The config.yaml file is used to configure the aDNA pipeline. It contains settings such as the project name, the species list, and the pipeline stages with their process steps.
Global Settings
project_name: Name of the aDNA project.
Pipeline Settings
Defines the overall pipeline behavior, including execution controls and process details.
Pipeline Stages and Process Steps
The pipeline is broken into stages (e.g., `raw_reads_processing`, `reference_processing`).
Each stage contains multiple process steps (e.g., `adapter_removal`, `deduplication`, …).
Both stages and process steps can be controlled with `execute: true/false` flags to enable or disable them.
Some process steps include additional configurable settings (e.g., adapter sequences, database paths, …).
If an enabled process step requires data from a previous stage that is disabled in the config, the pipeline will execute the disabled process step anyway.
Important Defaults
You do not need to specify all stages or process steps explicitly.
Any stage or process step not provided in the config defaults to `execute: true` and will be executed.
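The default-to-true rule can be expressed as a small lookup helper. This is a sketch of the rule as described above, not the pipeline's actual code; the function name `should_execute` is hypothetical.

```python
# Sketch of the "missing means execute: true" rule for pipeline config lookups.
def should_execute(pipeline_cfg, stage, step=None):
    node = pipeline_cfg.get(stage)
    if node is None:
        return True          # stage not listed -> defaults to execute: true
    if step is not None:
        node = node.get(step)
        if node is None:
            return True      # step not listed -> defaults to execute: true
    return node.get("execute", True)  # flag absent -> defaults to true

cfg = {
    "raw_reads_processing": {
        "execute": True,
        "quality_filtering": {"execute": False},
    },
}
```

With this config, `should_execute(cfg, "reference_processing")` is true because the stage is absent, while `should_execute(cfg, "raw_reads_processing", "quality_filtering")` is false because the step is explicitly disabled.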
Example config.yaml
```yaml
# config.yaml - Configuration file for aDNA pipeline
# This file contains settings for various stages of the pipeline

project_name: "aDNA_Project"

# Pipeline stages and their configurations
pipeline:

  # Global settings
  global:
    # When true, existing output files will be skipped to avoid re-computation (Default: true)
    skip_existing_files: true

  # Stages of the pipeline

  # Raw reads processing
  # Includes quality checking, adapter removal, quality filtering, merging,
  # contamination analysis, and statistical analysis
  raw_reads_processing:
    # When true, this stage will be executed (Default: true)
    execute: true

    # Sub-stages with their respective settings

    # Quality checking of raw reads
    quality_checking_raw:
      # When true, this sub-stage will be executed (Default: true)
      execute: true

    # Adapter removal from raw reads
    adapter_removal:
      # When true, this sub-stage will be executed (Default: true)
      execute: true
      # Settings for adapter removal
      settings:
        # Minimum quality score for adapter removal
        min_quality: 0
        # Minimum length of reads after adapter removal
        min_length: 0
        # Optional: Adapter sequences for read 1 and read 2
        # If not provided, fastp will try to identify adapters automatically
        adapters_sequences:
          # Adapter sequence for read 1
          r1: "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
          # Adapter sequence for read 2
          r2: "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"

    # Quality checking of trimmed reads
    quality_checking_trimmed:
      # When true, this sub-stage will be executed (Default: true)
      execute: true

    # Quality filtering of trimmed reads
    quality_filtering:
      # When true, this sub-stage will be executed (Default: true)
      execute: true
      settings:
        # Minimum quality score for quality filtering
        min_quality: 15
        # Minimum length of reads after quality filtering
        min_length: 30

    # Quality checking of quality-filtered reads
    quality_checking_quality_filtered:
      # When true, this sub-stage will be executed (Default: true)
      execute: true

    # Quality checking of merged reads
    quality_checking_merged:
      # When true, this sub-stage will be executed (Default: true)
      execute: true

    # Contamination analysis
    contamination_analysis:
      # When true, this sub-stage will be executed (Default: true)
      execute: true
      tools:
        # ECMSD tool settings for contamination analysis
        ecmsd:
          # When true, this tool will be executed (Default: true)
          execute: true
          settings:
            # Optional: Path to the conda environment for ECMSD
            # If not provided, the default environment will be used
            #conda_env: "../../../../envs/ecmsd.yaml"
            # Path to the ECMSD executable
            # Currently, ECMSD cannot be installed via conda. Provide the path to the shell script that runs ECMSD.
            executable: "/path/to/ecmsd/shell/ECMSD.sh"
        # Centrifuge tool settings for contamination analysis
        centrifuge:
          # When true, this tool will be executed (Default: true)
          execute: true
          settings:
            # Optional: Path to the conda environment for Centrifuge
            # If not provided, the default environment will be used
            #conda_env: "../../../../envs/centrifuge.yaml"
            # Path to the Centrifuge index
            index: "/path/to/centrifuge_index"

    # Statistical analysis
    statistical_analysis:
      # When true, this sub-stage will be executed (Default: true)
      execute: true

  # Reference processing
  reference_processing:
    # When true, this stage will be executed (Default: true)
    execute: true

    # Deduplication settings
    deduplication:
      # When true, this sub-stage will be executed (Default: true)
      execute: false
      settings:
        # To increase performance, deduplication is done per cluster of contigs
        # The setting below defines how the contigs will be clustered
        # Optional: Maximum number of contigs per cluster (Default: 500 if not specified)
        max_contigs_per_cluster: 500

    # Damage rescaling settings for mapDamage2
    damage_rescaling:
      # When true, this sub-stage will be executed (Default: true)
      execute: true

    # Damage analysis settings for mapDamage2
    damage_analysis:
      # When true, this sub-stage will be executed (Default: true)
      execute: true

    # Endogenous reads analysis settings
    endogenous_reads_analysis:
      # When true, this sub-stage will be executed (Default: true)
      execute: true

    # Coverage analysis settings
    coverage_analysis:
      # When true, this sub-stage will be executed (Default: true)
      execute: true

# Species details
species:
  Bger:
    name: "Blatella germanica"
  Dsim:
    name: "Drosophila simulans"
```
## Linting and formatting
(linting-sarahsaadain-pastforward)=
:::{dropdown} Linting results
<div style="max-height: 400px; overflow-y: auto; padding: 0;">
```{code-block}
:linenos:
[2026-03-16 08:26:55 (UTC)] [INFO] PastForward 0.0.1-894b1d4 run:
[2026-03-16 08:26:55 (UTC)] [INFO] Date: 2026-03-16 08:26:55
[2026-03-16 08:26:55 (UTC)] [INFO] Platform: Linux-6.14.0-1017-azure-x86_64-with-glibc2.39; #17~24.04.1-Ubuntu SMP Mon Dec 1 20:10:50 UTC 2025
[2026-03-16 08:26:55 (UTC)] [INFO] Host: runnervm46oaq
[2026-03-16 08:26:55 (UTC)] [INFO] User: runner
[2026-03-16 08:26:55 (UTC)] [INFO] Conda: 26.1.1
[2026-03-16 08:26:55 (UTC)] [INFO] Python: 3.12.13
[2026-03-16 08:26:55 (UTC)] [INFO] Snakemake: 9.16.3
[2026-03-16 08:26:55 (UTC)] [INFO] Conda env: snakemake-workflow-catalog (/home/runner/work/snakemake-workflow-catalog/snakemake-workflow-catalog/.pixi/envs/default)
[2026-03-16 08:26:55 (UTC)] [INFO] Command: /home/runner/work/snakemake-workflow-catalog/snakemake-workflow-catalog/.pixi/envs/default/bin/snakemake --lint
[2026-03-16 08:26:55 (UTC)] [INFO] Base directory: /tmp/tmp6jhzexeh/workflow
[2026-03-16 08:26:55 (UTC)] [INFO] Working directory: /tmp/tmp6jhzexeh
[2026-03-16 08:26:55 (UTC)] [INFO] Config file(s): /tmp/tmp6jhzexeh/config/config.yaml
[2026-03-16 08:26:55 (UTC)] [INFO] Loaded configuration:
global:
skip_existing_files: true
raw_reads_processing:
execute: true
quality_checking_raw:
execute: true
adapter_removal:
execute: true
settings:
min_quality: 0
min_length: 0
quality_checking_trimmed:
execute: true
quality_filtering:
execute: true
settings:
min_quality: 15
min_length: 30
quality_checking_quality_filtered:
execute: true
quality_checking_merged:
execute: true
contamination_analysis:
execute: true
tools:
ecmsd:
execute: true
settings: null
centrifuge:
execute: true
settings:
include_human_taxid: true
statistical_analysis:
execute: true
reference_processing:
execute: true
deduplication:
execute: false
settings:
max_contigs_per_cluster: 500
damage_rescaling:
execute: true
damage_analysis:
execute: true
endogenous_reads_analysis:
execute: true
coverage_analysis:
execute: true
dynamics:
execute: true
teplotter:
execute: true
pf_normalization:
execute: true
[2026-03-16 08:26:55 (UTC)] [INFO] Detected species:
- Dmel Demo [demo]
[2026-03-16 08:26:55 (UTC)] [ERROR] AttributeError in file "/tmp/tmp6jhzexeh/workflow/rules/raw_read/analytics/contamination/check_contamination_ecmsd.smk", line 1:
'NoneType' object has no attribute 'get'
File "/tmp/tmp6jhzexeh/workflow/rules/raw_read/analytics/contamination/check_contamination_ecmsd.smk", line 1, in <module>