vollgerlab/DSA-phasing

Overview

Latest release: None, Last update: 2026-02-18

Linting: failed, Formatting: passed

Deployment

Step 1: Install Snakemake and Snakedeploy

Snakemake and Snakedeploy are best installed via Conda. It is recommended to install Conda via Miniforge. Run

conda create -c conda-forge -c bioconda -c nodefaults --name snakemake snakemake snakedeploy

to install both Snakemake and Snakedeploy in an isolated environment. For all following commands, ensure that this environment is activated via

conda activate snakemake

For other installation methods, refer to the Snakemake and Snakedeploy documentation.

Step 2: Deploy workflow

With Snakemake and Snakedeploy installed, the workflow can be deployed as follows. First, create an appropriate project working directory on your system and enter it:

mkdir -p path/to/project-workdir
cd path/to/project-workdir

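All following steps assume that you are inside that directory. Then run the command below. Because this workflow has no tagged release yet (see Overview), the value None after --tag is a placeholder: replace it with an existing tag, or use --branch with a branch name instead.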

snakedeploy deploy-workflow https://github.com/vollgerlab/DSA-phasing . --tag None

Snakedeploy will create two folders, workflow and config. The former contains the deployment of the chosen workflow as a Snakemake module, the latter contains configuration files which will be modified in the next step in order to configure the workflow to your needs.

Step 3: Configure workflow

To configure the workflow, adapt config/config.yml to your needs following the instructions below.

Step 4: Run workflow

The deployment method is controlled using the --software-deployment-method (short --sdm) argument.

To run the workflow with automatic deployment of all required software via conda/mamba, use

snakemake --cores all --sdm conda

Snakemake will automatically detect the main Snakefile in the workflow subfolder and execute the workflow module that has been defined by the deployment in step 2.

For further options such as cluster and cloud execution, see the Snakemake documentation.

Step 5: Generate report

After finalizing your data analysis, you can automatically generate an interactive visual HTML report for inspection of results together with parameters and code inside of the browser using

snakemake --report report.zip

Configuration

The following section is imported from the workflow’s config/README.md.

DSA-phasing Configuration

This document describes all configuration options available for the DSA-phasing workflow. Configuration is primarily done through config.yaml, and the workflow validates all options against workflow/config.schema.yaml.

Required Configuration

manifest

  • Type: String (file path)

  • Description: Path to a table containing the manifest of samples to be processed. This file should contain sample information including paths to input files.

  • Example: manifest: "test/manifest.tbl"

Optional Configuration

max_threads

  • Type: Integer

  • Default: 64

  • Description: Maximum number of threads for parallel processing operations throughout the workflow.

  • Example: max_threads: 32

ont

  • Type: Boolean

  • Default: false

  • Description: Enable Oxford Nanopore Technologies (ONT) specific processing. When true, affects output requirements and final CRAM file selection to optimize for ONT data characteristics.

  • Example: ont: true

set-sm

  • Type: Boolean

  • Default: false

  • Description: Whether to set sample names in BAM headers to match the manifest. When enabled, BAM headers will be modified to ensure consistency with the provided manifest.

  • Example: set-sm: true

mm2_preset

  • Type: String

  • Default: "lr:hq"

  • Description: Minimap2 preset parameter for alignment. Common options include 'lr:hq' for high-quality long reads, 'map-ont' for ONT reads, or 'map-pb' for PacBio reads.

  • Example: mm2_preset: "map-ont"

mm2_extra_options

  • Type: String

  • Default: "" (empty)

  • Description: Additional command-line options to pass to minimap2 during alignment. Allows fine-tuning of alignment parameters beyond the preset.

  • Example: mm2_extra_options: "-k19 -w10"
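To illustrate how mm2_preset and mm2_extra_options might combine, the following minimal Python sketch assembles a minimap2 argument list. The helper build_minimap2_args and the file names dsa.fa and reads.fastq are hypothetical, and the workflow's actual rule may build its command differently. Only -x (preset), -t (threads) and -a (SAM output) are taken from minimap2 itself. The extra options are placed after -x because minimap2 lets later options override values set by the preset.

```python
import shlex


def build_minimap2_args(reference, reads, mm2_preset="lr:hq",
                        mm2_extra_options="", threads=64):
    """Assemble a minimap2 argument list (hypothetical helper, not the workflow's code)."""
    args = ["minimap2", "-a", "-x", mm2_preset, "-t", str(threads)]
    # shlex.split honours shell-style quoting and returns [] for an empty string.
    args += shlex.split(mm2_extra_options)
    args += [reference, reads]
    return args


# File names below are placeholders for illustration only.
cmd = build_minimap2_args("dsa.fa", "reads.fastq", mm2_preset="map-ont",
                          mm2_extra_options="-k19 -w10", threads=32)
print(shlex.join(cmd))
```

With the default empty mm2_extra_options, nothing is appended and the preset alone determines the alignment parameters.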

min_mapq

  • Type: Integer

  • Default: 1

  • Description: Minimum mapping quality threshold for haplotype assignment. Reads below this threshold still appear in the output but have their HP tag cleared (set to unphased). The original assignment is preserved in the oh tag for debugging.

  • Example: min_mapq: 20
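The effect of min_mapq described above can be sketched in a few lines of Python. This is a simplified model, not the workflow's implementation: reads are plain dicts rather than BAM records, the helper apply_min_mapq is hypothetical, and clearing the HP tag is modelled as removing it, which is an assumption about what "unphased" looks like in the output.

```python
def apply_min_mapq(read, min_mapq=1):
    """Clear the haplotype of a read whose MAPQ is below min_mapq.

    `read` is a plain dict standing in for an alignment record:
    {"mapq": int, "tags": {"HP": int, ...}}. Hypothetical model only.
    """
    tags = dict(read["tags"])  # copy so the input record is left untouched
    if "HP" in tags and read["mapq"] < min_mapq:
        # Keep the original assignment in the oh tag, then drop HP (unphased).
        tags["oh"] = tags.pop("HP")
    return {"mapq": read["mapq"], "tags": tags}


reads = [
    {"mapq": 60, "tags": {"HP": 1}},  # passes the threshold, stays phased
    {"mapq": 5, "tags": {"HP": 2}},   # below the threshold, HP is cleared
]
for r in reads:
    print(apply_min_mapq(r, min_mapq=20))
```

Both reads remain in the output; only the tags of the low-MAPQ read change.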

reset_mapq

  • Type: Integer

  • Default: disabled

  • Description: Reset MAPQ of mapped reads to this value during haplotagging. The original MAPQ is preserved in the om tag. Useful because DSA-alignment MAPQ values may not be meaningful for downstream tools that filter on MAPQ. By default, resets after haplotype assignment so min_mapq filtering uses the original value (see reset_mapq_before).

  • Example: reset_mapq: 60

reset_mapq_before

  • Type: Boolean

  • Default: false

  • Description: When true (and reset_mapq is set), reset MAPQ before haplotype assignment. This means min_mapq filtering will use the new MAPQ value instead of the original. When false (default), MAPQ is reset after assignment so filtering uses the original alignment MAPQ.

  • Example: reset_mapq_before: true
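The interaction between min_mapq, reset_mapq and reset_mapq_before is easiest to see with a worked example. The sketch below is a simplified, hypothetical model of the ordering described above. The function haplotag, its arguments, and the representation of a cleared HP tag as an absent key are all assumptions made for illustration.

```python
def haplotag(mapq, hp, min_mapq=1, reset_mapq=None, reset_mapq_before=False):
    """Return (mapq, tags) for one mapped read under the documented ordering.

    Hypothetical model: `hp` is the haplotype the read would be assigned,
    and a cleared HP tag is represented by omitting it from the tags.
    """
    tags = {}
    if reset_mapq is not None:
        tags["om"] = mapq  # the original MAPQ is preserved in the om tag
        if reset_mapq_before:
            mapq = reset_mapq  # the min_mapq check below sees the new value

    if mapq >= min_mapq:
        tags["HP"] = hp
    else:
        tags["oh"] = hp  # original assignment kept for debugging

    if reset_mapq is not None and not reset_mapq_before:
        mapq = reset_mapq  # reset only after the min_mapq decision
    return mapq, tags


# A read with original MAPQ 5, with min_mapq 20 and reset_mapq 60:
print(haplotag(5, 1, min_mapq=20, reset_mapq=60))
print(haplotag(5, 1, min_mapq=20, reset_mapq=60, reset_mapq_before=True))
```

In the first call the read is filtered on its original MAPQ of 5 and ends up unphased with MAPQ 60. In the second call the MAPQ is raised to 60 first, so the read passes the min_mapq check and keeps its haplotype.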

ft_nuc_params

  • Type: String

  • Default: "" (empty)

  • Description: Additional parameters for the ft add-nucleosomes command in the modkit rule. Used for nucleosome detection and modification analysis.

  • Example: ft_nuc_params: "--nucleosome-length 60"

shared_reference

  • Type: String (file path) or List of file paths

  • Default: [] (empty, disabled)

  • Description: Path to a shared reference genome. When provided, the workflow will realign the final merged CRAMs to this reference in addition to the DSA-aligned outputs. Useful for comparing samples aligned to different DSAs on a common coordinate system.

  • Example: shared_reference: "/path/to/hg38.fa"
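Because shared_reference accepts either a single path or a list of paths, code that consumes it typically normalizes the value to a list first. A minimal sketch of that idea follows. The helper normalize_shared_reference and the second path /path/to/chm13.fa are hypothetical, and treating an empty string as disabled is an assumption.

```python
def normalize_shared_reference(value):
    """Return shared_reference as a list of paths (hypothetical helper)."""
    if value is None:
        return []
    if isinstance(value, str):
        # A single path; an empty string is treated as disabled (assumption).
        return [value] if value else []
    return list(value)


examples = (
    "/path/to/hg38.fa",
    ["/path/to/hg38.fa", "/path/to/chm13.fa"],  # second path is a placeholder
    [],
)
for raw in examples:
    refs = normalize_shared_reference(raw)
    print(refs, "-> realignment enabled" if refs else "-> disabled")
```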

keep_read_assignments

  • Type: Boolean

  • Default: false

  • Description: When true, read-to-haplotype assignment TSV files are saved to results/{sm}/ instead of being placed in temp/ and cleaned up. Useful for downstream analysis of phasing results.

  • Example: keep_read_assignments: true

Configuration Validation

The workflow automatically validates all configuration options against the schema defined in workflow/config.schema.yaml. Invalid configurations will cause the workflow to fail with descriptive error messages.
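As a rough illustration of what validation plus defaults amounts to, the Python sketch below checks the one required key and fills in the defaults documented above. It is not the workflow's validation code, which relies on the schema file. The names DEFAULTS and resolve_config are hypothetical, and representing the "disabled" default of reset_mapq as None is an assumption.

```python
DEFAULTS = {
    "max_threads": 64,
    "ont": False,
    "set-sm": False,
    "mm2_preset": "lr:hq",
    "mm2_extra_options": "",
    "min_mapq": 1,
    "reset_mapq": None,  # documented default is "disabled"; None is an assumption
    "reset_mapq_before": False,
    "ft_nuc_params": "",
    "shared_reference": [],
    "keep_read_assignments": False,
}


def resolve_config(user_config):
    """Check the required key and fill documented defaults (illustrative only)."""
    if "manifest" not in user_config:
        raise ValueError("'manifest' is a required property")
    if not isinstance(user_config["manifest"], str):
        raise TypeError("'manifest' must be a string (file path)")
    # User-supplied values take precedence over the defaults.
    return {**DEFAULTS, **user_config}


config = resolve_config({"manifest": "test/manifest.tbl", "ont": True})
print(config["manifest"], config["ont"], config["max_threads"])

try:
    resolve_config({})  # an empty configuration lacks the required manifest
except ValueError as err:
    print("validation failed:", err)
```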

Example Configuration

# Required
manifest: "test/manifest.tbl"

# Optional (showing non-default values)
max_threads: 32
ont: true
set-sm: true
mm2_preset: "map-ont"
mm2_extra_options: "-k19"
min_mapq: 20
ft_nuc_params: "--nucleosome-length 60"

Notes

  • Only manifest is required; all other options have sensible defaults

  • Boolean values should be lowercase: true or false

  • String values should be quoted if they contain special characters

  • The workflow will print the loaded configuration to stderr for verification
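When a configuration is generated by a script rather than written by hand, the quoting and boolean conventions in the notes above can be applied mechanically. The following sketch uses only the Python standard library. The helper render_config is hypothetical and handles just the value types this configuration uses (booleans, integers, strings, and lists of strings).

```python
import json


def render_config(options):
    """Render a flat option dict as YAML text (hypothetical helper)."""
    lines = []
    for key, value in options.items():
        if isinstance(value, bool):  # test bool before int: bool subclasses int
            rendered = "true" if value else "false"
        elif isinstance(value, int):
            rendered = str(value)
        else:
            # json.dumps yields a double-quoted string (or a flow-style list
            # of such strings), which is also valid YAML.
            rendered = json.dumps(value)
        lines.append(f"{key}: {rendered}")
    return "\n".join(lines) + "\n"


text = render_config({
    "manifest": "test/manifest.tbl",
    "ont": True,
    "min_mapq": 20,
    "mm2_extra_options": "-k19 -w10",
})
print(text, end="")
```

Quoting every string is stricter than YAML requires, but it avoids having to decide which characters count as special.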

Linting and formatting

Linting results

Using workflow specific profile workflow/profiles/default for setting default command line arguments.
WorkflowError in file "/tmp/tmphv7u24p1/workflow/Snakefile", line 15:
Error validating config file.
ValidationError: 'manifest' is a required property

Failed validating 'required' in schema:
    {'$schema': 'https://json-schema.org/draft/2020-12/schema',
     'description': 'Configuration schema for the DSA-phasing pipeline',
     'properties': {'manifest': {'type': 'string',
                                 'description': 'Path to a table with the '
                                                'manifest of samples to be '
                                                'processed.'}},
     'required': ['manifest'],
     '$id': 'file:///tmp/tmphv7u24p1/workflow/config.schema.yaml'}

On instance:
    {}

Formatting results

All tests passed!