SarahSaadain/PastForward

This project contains a pipeline to analyze raw ancient DNA (aDNA) sequencing data obtained from the sequencing facility. The pipeline includes various Snakemake workflows that process and analyze the data, generate reports on sequence quality (which help decide whether an aDNA extraction and sequencing run was successful), and further polish the data for downstream analyses.

Overview

Latest release: None, Last update: 2026-03-14

Share link: https://snakemake.github.io/snakemake-workflow-catalog?wf=SarahSaadain/PastForward

Quality control: linting failed; formatting failed

Topics: adna dna-sequencing genomics pipeline snakemake ancient-dna ancient-dna-analysis ancientdna genome genome-mapping multiqc raw-reads bioinformatics bioinformatics-pipeline short-read-mapping short-reads bam fasta fastq

Deployment

Step 1: Install Snakemake and Snakedeploy

Snakemake and Snakedeploy are best installed via the Conda package manager. It is recommended to install conda via Miniforge. Run

conda create -c conda-forge -c bioconda -c nodefaults --name snakemake snakemake snakedeploy

to install both Snakemake and Snakedeploy in an isolated environment. For all following commands ensure that this environment is activated via

conda activate snakemake

For other installation methods, refer to the Snakemake and Snakedeploy documentation.

Step 2: Deploy workflow

With Snakemake and Snakedeploy installed, the workflow can be deployed as follows. First, create an appropriate project working directory on your system and enter it:

mkdir -p path/to/project-workdir
cd path/to/project-workdir

In all following steps, we will assume that you are inside of that directory. Then run

snakedeploy deploy-workflow https://github.com/SarahSaadain/PastForward . --tag None

Snakedeploy will create two folders, workflow and config. The former contains the deployment of the chosen workflow as a Snakemake module, the latter contains configuration files which will be modified in the next step in order to configure the workflow to your needs.

Step 3: Configure workflow

To configure the workflow, adapt config/config.yaml to your needs following the instructions below.

Step 4: Run workflow

The deployment method is controlled using the --software-deployment-method (short --sdm) argument.

To run the workflow with automatic deployment of all required software via conda/mamba, use

snakemake --cores all --sdm conda

Snakemake will automatically detect the main Snakefile in the workflow subfolder and execute the workflow module that has been defined by the deployment in step 2.

For further options such as cluster and cloud execution, see the docs.

Step 5: Generate report

After finalizing your data analysis, you can automatically generate an interactive visual HTML report for inspection of results together with parameters and code inside of the browser using

snakemake --report report.zip

Configuration

The following section is imported from the workflow’s config/README.md.

Setup Overview

Install Snakemake

To install Snakemake, you can use conda, which is a package manager that simplifies the installation of software and its dependencies. You can create a new conda environment for Snakemake and install it using the following commands:

conda create -c conda-forge -c bioconda -c nodefaults -n snakemake snakemake
conda activate snakemake
snakemake --help

Refer to the Snakemake documentation for more installation options and details.

Setup Instructions

  • Before running the pipeline, ensure you have an environment with Snakemake and the required dependencies installed.

  • Required dependencies for pipeline processing will be installed automatically, except for the contamination analysis tools. Those have to be installed separately and their details need to be added to the config file.

    • The pipeline supports ECMSD and Centrifuge for contamination analysis. Ensure the chosen tools are configured in the config.yaml file under contamination_analysis.

  • You need to add species details to the pipeline (config and files).

  • Your reads should be renamed according to the naming convention specified below.

Folder Structure

Species Folders

The project contains folders for different species, which contain the raw data, processed data, and results for each species.

The species folders should be placed in the root folder of your pipeline.

Providing Raw Data

The pipeline supports automatically moving the raw reads to the <species>/raw/reads/ folder as well as the reference to the <species>/raw/ref/ folder. Simply provide the files in the <species> folder. Alternatively, you can manually move the files to the respective folders.

  • provide the raw reads in <species>/raw/reads/ folder

  • provide the reference in <species>/raw/ref/ folder

When adding a new species, make sure that

  • the species folder is placed in the root folder of your pipeline

  • the folder name matches the species key defined in config.yaml below species:
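
Put together, a species folder for the key Bger could look like this (an illustrative sketch; the automatically created folders are described in the next section):

```
Bger/                # folder name matches the species key in config.yaml
├── raw/
│   ├── reads/       # raw reads (see naming convention below)
│   └── ref/         # reference files
├── processed/       # created automatically; intermediary files
└── results/         # created automatically; final results and reports
```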

Folder Structure

All other folders will be created and populated automatically.

  • Folder <species>/processed/ contains the intermediary files during processing. Most of these files are marked as temporary and will be deleted at the end of the pipeline. Some files are kept to allow reprocessing the pipeline from different points in case something fails.

  • Folder <species>/results/ contains the final results and reports.

General read-processing data will be in either processed or results.

Everything related to a reference will have a <reference> folder under processed or results. Typically, only the results folder will contain the information required for further analysis. In case more information is required, the original files can often be found in the processed folder.

Some exceptions are *.sam files and unsorted *.bam files, which are deleted to save storage space. Most other files are kept to allow reprocessing the pipeline from different points in case something fails. If a step should be repeated, the relevant files need to be deleted manually.

RAW Reads Filenames

The pipeline expects input read files to follow a standardized naming convention:

<Individual>_[<FreeText>_]R<1/2>[_<FreeText>].fastq.gz

Following this convention ensures proper organization and automated processing within the pipeline.

Filename Components:
  • <Individual> – A unique identifier for the sample or individual.

  • <FreeText> – Any additional text or identifier that can be included in the filename. Typically used to differentiate between samples from the same individual, e.g. the original sample name.

  • R<1/2> – Indicates the read pair number, typically R1 for the first read and R2 for the second read.

  • .fastq.gz – The expected file extension, indicating compressed FASTQ format. Only .fastq.gz files are supported.

Notes:

  • This name must contain _R1 and, if paired-end, _R2.

  • For paired-end data, the name of the reads must be identical except for _R1 and _R2.

  • Individual names/IDs will be used to name the output files as well as in the reports and plots.

Example:

Bger1_326862_S37_R1_001.fastq.gz
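
The convention can be approximated with a regular expression for a quick sanity check before running the pipeline. This is a sketch of one reading of the rules above, not a pattern taken from the pipeline source:

```python
import re

# Approximation of <Individual>_[<FreeText>_]R<1/2>[_<FreeText>].fastq.gz
# (assumed character classes; the pipeline's actual matching may differ)
READ_NAME = re.compile(
    r"^[A-Za-z0-9]+"        # <Individual>: unique sample/individual ID
    r"(?:_[A-Za-z0-9]+)*"   # optional free-text parts
    r"_R[12]"               # read pair number (R1 or R2)
    r"(?:_[A-Za-z0-9]+)*"   # optional trailing free text
    r"\.fastq\.gz$"         # only .fastq.gz is supported
)

def check_read_name(filename: str) -> bool:
    """Return True if the filename follows the naming convention."""
    return READ_NAME.match(filename) is not None

print(check_read_name("Bger1_326862_S37_R1_001.fastq.gz"))  # True
print(check_read_name("Bger1_R1.fq.gz"))                    # False: wrong extension
```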

Configuration File Structure for aDNA Pipeline (config.yaml)

The config.yaml file is used to configure the aDNA pipeline. It contains settings such as project name, the species list and the pipeline stages and their process steps.

Global Settings

  • project_name: Name of the aDNA project.

Pipeline Settings

Defines the overall pipeline behavior, including execution controls and process details.

Pipeline Stages and Process Steps

  • The pipeline is broken into stages (e.g., raw_reads_processing, reference_processing).

  • Each stage contains multiple process steps (e.g., adapter_removal, deduplication, …).

  • Both stages and process steps can be controlled with execute: true/false flags to enable or disable them.

  • Some process steps include additional configurable settings (e.g., adapter sequences, database paths, …).

  • If an enabled process step requires data from a previous stage which is disabled in the config, the pipeline will execute the disabled process step anyway.

Important Defaults

  • You do not need to specify all stages or process steps explicitly.

  • Any stage or process step not provided in the config defaults to execute: true and will be executed.
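
The default behavior described above could be modeled like this (a minimal sketch; `should_execute` and its traversal are hypothetical, not the pipeline's actual code):

```python
def should_execute(config: dict, *path: str) -> bool:
    """Return the effective execute flag for a stage or process step.

    Mirrors the documented defaults: anything missing from the config,
    or present without an 'execute' key, defaults to True.
    """
    node = config
    for key in path:
        node = node.get(key) if isinstance(node, dict) else None
        if node is None:
            return True  # stage/step not configured -> default execute: true
    return node.get("execute", True) if isinstance(node, dict) else True

cfg = {
    "raw_reads_processing": {
        "execute": True,
        "quality_checking_raw": {"execute": False},
    }
}

print(should_execute(cfg, "raw_reads_processing"))                          # True
print(should_execute(cfg, "raw_reads_processing", "quality_checking_raw"))  # False
print(should_execute(cfg, "raw_reads_processing", "adapter_removal"))       # True (not configured)
```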

Example config.yaml

# config.yaml - Configuration file for aDNA pipeline
# This file contains settings for various stages of the pipeline

project_name: "aDNA_Project"

# Pipeline stages and their configurations
pipeline:

  # Global settings
  global:
    # When true, existing output files will be skipped to avoid re-computation (Default: true)
    skip_existing_files: true

  # Stages of the pipeline

  # Raw reads processing
  # Includes quality checking, adapter removal, quality filtering, merging, 
  # contamination analysis, and statistical analysis
  raw_reads_processing:
    # When true, this stage will be executed. (Default: true)
    execute: true

    # Sub-stages with their respective settings
    # Quality checking of raw reads
    quality_checking_raw:
      # When true, this sub-stage will be executed (Default: true)
      execute: true
    
    # Adapter removal from raw reads
    adapter_removal:
      # When true, this sub-stage will be executed (Default: true)
      execute: true

      # Settings for adapter removal
      settings: 
        # Minimum quality score for adapter removal
        min_quality: 0
        # Minimum length of reads after adapter removal
        min_length: 0
        # Optional: Adapter sequences for read 1 and read 2
        # If not provided, fastp will try to identify adapters automatically
        adapters_sequences:
          # Adapter sequence for read 1
          r1: "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
          # Adapter sequence for read 2
          r2: "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT" 
    
    # Quality checking of trimmed reads
    quality_checking_trimmed:
      # When true, this sub-stage will be executed (Default: true)
      execute: true

    # Quality filtering of trimmed reads
    quality_filtering:
      # When true, this sub-stage will be executed (Default: true)
      execute: true
      settings:
        # Minimum quality score for quality filtering
        min_quality: 15
        # Minimum length of reads after quality filtering
        min_length: 30

    # Quality checking of quality-filtered reads
    quality_checking_quality_filtered:
      # When true, this sub-stage will be executed (Default: true)
      execute: true

    # Quality checking of merged reads
    quality_checking_merged:
      # When true, this sub-stage will be executed (Default: true)
      execute: true

    # Contamination analysis
    contamination_analysis:
      # When true, this sub-stage will be executed (Default: true)
      execute: true
      tools:
        # ECMSD tool settings for contamination analysis
        ecmsd:
          # When true, this tool will be executed (Default: true)
          execute: true
          settings:
            # Optional: Path to the conda environment for ECMSD
            # If not provided, the default environment will be used
            #conda_env: "../../../../envs/ecmsd.yaml"
            # Path to the ECMSD executable
            # Currently, ECMSD cannot be installed via conda. Provide the path to the shell script that runs ECMSD.
            executable: "/path/to/ecmsd/shell/ECMSD.sh"
        # Centrifuge tool settings for contamination analysis
        centrifuge:
          # When true, this tool will be executed (Default: true)
          execute: true
          settings:
            # Optional: Path to the conda environment for Centrifuge
            # If not provided, the default environment will be used
            #conda_env: "../../../../envs/centrifuge.yaml"
            # Path to the Centrifuge index
            index: "/path/to/centrifuge_index"
    
    # Statistical analysis
    statistical_analysis:
      # When true, this sub-stage will be executed (Default: true)
      execute: true

  # Reference processing
  reference_processing:
    # When true, this stage will be executed (Default: true)
    execute: true
    
    # Deduplication settings
    deduplication:
      # When true, this sub-stage will be executed (Default: true)
      execute: false
      settings:
        # To increase performance, deduplication will be done per cluster of contigs
        # The setting below defines how the contigs will be clustered
        # Optional: Maximum number of contigs per cluster
        max_contigs_per_cluster: 500
    
    # Damage rescaling settings for mapDamage2
    damage_rescaling:
      # When true, this sub-stage will be executed (Default: true)
      execute: true

    # Damage analysis settings for mapDamage2
    damage_analysis:
      # When true, this sub-stage will be executed (Default: true)
      execute: true

    # Endogenous reads analysis settings
    endogenous_reads_analysis: 
      # When true, this sub-stage will be executed (Default: true)
      execute: true

    # Coverage analysis settings
    coverage_analysis: 
      # When true, this sub-stage will be executed (Default: true)
      execute: true

# Species details
species:
  Bger:
    name: "Blattella germanica"
  Dsim:
    name: "Drosophila simulans"


## Linting and formatting

(linting-sarahsaadain-pastforward)=
:::{dropdown} Linting results

<div style="max-height: 400px; overflow-y: auto; padding: 0;">

```{code-block}
:linenos:

[2026-03-16 08:26:55 (UTC)] [INFO] PastForward 0.0.1-894b1d4 run:
[2026-03-16 08:26:55 (UTC)] [INFO] 	Date:               2026-03-16 08:26:55
[2026-03-16 08:26:55 (UTC)] [INFO] 	Platform:           Linux-6.14.0-1017-azure-x86_64-with-glibc2.39; #17~24.04.1-Ubuntu SMP Mon Dec  1 20:10:50 UTC 2025
[2026-03-16 08:26:55 (UTC)] [INFO] 	Host:               runnervm46oaq
[2026-03-16 08:26:55 (UTC)] [INFO] 	User:               runner
[2026-03-16 08:26:55 (UTC)] [INFO] 	Conda:              26.1.1
[2026-03-16 08:26:55 (UTC)] [INFO] 	Python:             3.12.13
[2026-03-16 08:26:55 (UTC)] [INFO] 	Snakemake:          9.16.3
[2026-03-16 08:26:55 (UTC)] [INFO] 	Conda env:          snakemake-workflow-catalog (/home/runner/work/snakemake-workflow-catalog/snakemake-workflow-catalog/.pixi/envs/default)
[2026-03-16 08:26:55 (UTC)] [INFO] 	Command:            /home/runner/work/snakemake-workflow-catalog/snakemake-workflow-catalog/.pixi/envs/default/bin/snakemake --lint
[2026-03-16 08:26:55 (UTC)] [INFO] 	Base directory:     /tmp/tmp6jhzexeh/workflow
[2026-03-16 08:26:55 (UTC)] [INFO] 	Working directory:  /tmp/tmp6jhzexeh
[2026-03-16 08:26:55 (UTC)] [INFO] 	Config file(s):     /tmp/tmp6jhzexeh/config/config.yaml
[2026-03-16 08:26:55 (UTC)] [INFO] Loaded configuration:
global:
  skip_existing_files: true
raw_reads_processing:
  execute: true
  quality_checking_raw:
    execute: true
  adapter_removal:
    execute: true
    settings:
      min_quality: 0
      min_length: 0
  quality_checking_trimmed:
    execute: true
  quality_filtering:
    execute: true
    settings:
      min_quality: 15
      min_length: 30
  quality_checking_quality_filtered:
    execute: true
  quality_checking_merged:
    execute: true
  contamination_analysis:
    execute: true
    tools:
      ecmsd:
        execute: true
        settings: null
      centrifuge:
        execute: true
        settings:
          include_human_taxid: true
  statistical_analysis:
    execute: true
reference_processing:
  execute: true
  deduplication:
    execute: false
    settings:
      max_contigs_per_cluster: 500
  damage_rescaling:
    execute: true
  damage_analysis:
    execute: true
  endogenous_reads_analysis:
    execute: true
  coverage_analysis:
    execute: true
dynamics:
  execute: true
  teplotter:
    execute: true
  pf_normalization:
    execute: true

[2026-03-16 08:26:55 (UTC)] [INFO] Detected species:
- Dmel Demo [demo]
[2026-03-16 08:26:55 (UTC)] [ERROR] AttributeError in file "/tmp/tmp6jhzexeh/workflow/rules/raw_read/analytics/contamination/check_contamination_ecmsd.smk", line 1:
'NoneType' object has no attribute 'get'
  File "/tmp/tmp6jhzexeh/workflow/rules/raw_read/analytics/contamination/check_contamination_ecmsd.smk", line 1, in <module>