rabiafidan/pure-faf

Formalin artefact filtering of tumour sample VCF files

Overview

Latest release: v1.1, Last update: 2025-12-10

Share link: https://snakemake.github.io/snakemake-workflow-catalog?wf=rabiafidan/pure-faf

Quality control: linting: failed, formatting: failed

Topics: cancer-genomics ffpe variant-filtration

Deployment

Step 1: Install Snakemake and Snakedeploy

Snakemake and Snakedeploy are best installed via the Conda package manager. It is recommended to install conda via Miniforge. Run

conda create -c conda-forge -c bioconda -c nodefaults --name snakemake snakemake snakedeploy

to install both Snakemake and Snakedeploy in an isolated environment. For all following commands ensure that this environment is activated via

conda activate snakemake

For other installation methods, refer to the Snakemake and Snakedeploy documentation.

Step 2: Deploy workflow

With Snakemake and Snakedeploy installed, the workflow can be deployed as follows. First, create an appropriate project working directory on your system and enter it:

mkdir -p path/to/project-workdir
cd path/to/project-workdir

In all following steps, we will assume that you are inside of that directory. Then run

snakedeploy deploy-workflow https://github.com/rabiafidan/pure-faf . --tag v1.1

Snakedeploy will create two folders, workflow and config. The former contains the deployment of the chosen workflow as a Snakemake module, the latter contains configuration files which will be modified in the next step in order to configure the workflow to your needs.

Step 4: Run workflow

The deployment method is controlled using the --software-deployment-method (short --sdm) argument.

To run the workflow using a combination of conda and apptainer/singularity for software deployment, use

snakemake --cores all --sdm conda apptainer

Snakemake will automatically detect the main Snakefile in the workflow subfolder and execute the workflow module that has been defined by the deployment in step 2.

For further options such as cluster and cloud execution, see the docs.

Step 5: Generate report

After finalizing your data analysis, you can automatically generate an interactive visual HTML report for inspection of results together with parameters and code inside of the browser using

snakemake --report report.zip

Configuration

The following section is imported from the workflow’s config/README.md.

Workflow overview

This workflow is a best-practice workflow for variant filtering of formalin-fixed tumour samples. It is built with Snakemake and consists of the following steps:

Annotating the VCF with a custom panel of normals (PON)
Applying a general quality filter with criteria including PON, AD and ROQ
Splitting variants into confidence tiers using an AD threshold
Intersecting low-confidence indels with a second tool’s indel calls. The intersection is specific to indel type (insertion or deletion) and based on position only (the alt allele doesn’t matter).
Running microsec twice with different parameters for low-confidence and high-confidence tiers
Merging the results back and annotating the VCF with microsec results
Applying different filtering methods on the annotated VCF
Applying a post-filter VAF filter
Quality control using FFPEsig repaired and unrepaired signatures
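The quality filter, tier split and indel intersection steps above can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the workflow's actual code: the function names and dictionary keys are hypothetical, the direction of each comparison (values at or above a threshold pass; AD at or below the tier threshold counts as low-confidence) is assumed from the parameter descriptions, and the PON filter is left out. Default values are taken from the "VAULT setup" column of the parameter table.

```python
def passes_quality(variant, min_ad=4, min_roq=20, min_dp=0):
    """Return True if a variant meets the AD, ROQ and DP thresholds (PON check omitted)."""
    return (
        variant["AD"] >= min_ad
        and variant["ROQ"] >= min_roq
        and variant["DP"] >= min_dp
    )


def split_tiers(variants, ad_tier_threshold=10):
    """Split variants into (low, high) confidence lists by alt allele count."""
    low = [v for v in variants if v["AD"] <= ad_tier_threshold]
    high = [v for v in variants if v["AD"] > ad_tier_threshold]
    return low, high


def indel_type(ref, alt):
    """Classify an indel as 'ins' or 'del' from allele lengths; None for non-indels."""
    if len(alt) > len(ref):
        return "ins"
    if len(alt) < len(ref):
        return "del"
    return None


def supported_indels(low_tier_indels, second_caller_indels):
    """Keep low-confidence indels whose (chrom, pos, type) is also called by the second tool."""
    seen = {
        (v["chrom"], v["pos"], indel_type(v["ref"], v["alt"]))
        for v in second_caller_indels
    }
    return [
        v for v in low_tier_indels
        if (v["chrom"], v["pos"], indel_type(v["ref"], v["alt"])) in seen
    ]
```

Because the lookup key contains only chromosome, position and indel type, two insertions at the same position match even if their inserted sequences differ, which mirrors the position-based intersection described in the step list.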

See the workflow diagram.

Running the workflow

Input data

You need the following files:

Mutect2 VCF files (with AD, DP and ROQ annotations)
Strelka2 indel VCF (in theory, indel calls from any other secondary variant caller should also work)
Sample BAM/CRAM file
Human reference genome FASTA used for alignment and variant calling

You need the following information:

Tumour and normal sample names
Sequencing read length
Sequencing adapters

Using the input files and information, you need to create a sample sheet as follows:

tumour_name	normal_name	Mutect2_vcf	Strelka2_indel_vcf	tumour_bam_cram	sequencing_read_length	sequencing_adapter_1	sequencing_adapter_2
K1058_K1058_HB005_D02	K1058_K1058_BC001_D01	path/to/ROQ_flagged_K1058_HB005_D02.vcf.gz	path/to/K1058_HB005_D02_vs_K1058_BC001_D01/K1058_HB005_D02_vs_K1058_BC001_D01.strelka.somatic_indels.vcf.gz	path/to/K1058_HB005_D02/K1058_HB005_D02.recal.cram	300	AGCGAGAT-CCAGTATC	AGCGAGAT-CCAGTATC
K1058_K1058_HB003_D02	K1058_K1058_BC001_D01	path/to/ROQ_flagged_K1058_HB003_D02.vcf.gz	path/to/K1058_HB003_D02_vs_K1058_BC001_D01/K1058_HB003_D02_vs_K1058_BC001_D01.strelka.somatic_indels.vcf.gz	path/to/K1058_HB003_D02/K1058_HB003_D02.recal.cram	300	CTTGCTAG-TCATCTCC	CTTGCTAG-TCATCTCC
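Before launching the workflow, it can help to check that the sample sheet has the expected columns and no empty fields. The following is a minimal sketch, assuming the sheet is tab-separated as in the example above; `read_samplesheet` and `EXPECTED_COLUMNS` are hypothetical names used for illustration and are not part of the workflow.

```python
import csv

# Column names as they appear in the example sample sheet header.
EXPECTED_COLUMNS = [
    "tumour_name", "normal_name", "Mutect2_vcf", "Strelka2_indel_vcf",
    "tumour_bam_cram", "sequencing_read_length",
    "sequencing_adapter_1", "sequencing_adapter_2",
]


def read_samplesheet(path):
    """Read a tab-separated sample sheet and return a list of row dicts."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
        if missing:
            raise ValueError(f"missing columns: {missing}")
        rows = list(reader)
    for i, row in enumerate(rows, start=2):  # line 1 is the header
        if any(not row[c] for c in EXPECTED_COLUMNS):
            raise ValueError(f"empty field on line {i}")
        if not row["sequencing_read_length"].isdigit():
            raise ValueError(f"non-integer read length on line {i}")
    return rows
```

Calling `read_samplesheet("path/to/samplesheet.tsv")` returns one dictionary per sample, or raises a `ValueError` naming the first problem it finds.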

Parameters

You then need to configure the workflow by updating config/config.yaml. The table below lists all parameters that can be used to run the workflow. All parameters except exclude_samples are mandatory.

parameter	type	details	VAULT setup
samplesheet
path	str	path to samplesheet.
ref_genome
ref_genome_version	str	choose one of GRCh38 or GRCh37, case-insensitive.
ref_genome_fasta	str	path to the human reference genome.
VCF quality filter thresholds
AD	num	minimum alternative allele count	4
ROQ	num	ROQ threshold for read orientation quality	20
DP	num	minimum read depth	0 (no separate DP filtering)
PON parameters
PON_alpha	num	Beta test significance level	0.05
Microsec parameters
AD_tier_threshold	num	maximum AD for low-confidence tier	10
pval_threshold_high_AD	num	p-value threshold for the `fun_analysis` function for the high-confidence tier	1e-4
pval_threshold_low_AD	num	p-value threshold for the `fun_analysis` function for the low-confidence tier	1e-6
Post-filter thresholds
VAF	array	a list of VAF values you want to apply after formalin filtering. All will appear in the QC plot.
Other parameters
odir	str	output directory. This will appear under results/
exclude_samples	array	a list of sample names you want to exclude from the analysis. Default is empty list.
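To illustrate how a list of VAF values might be applied after formalin filtering, here is a minimal sketch. It assumes that VAF is computed as alt reads divided by ref plus alt reads and that each configured value acts as a minimum; both are assumptions, and the function names and dictionary keys are hypothetical rather than taken from the workflow.

```python
def vaf(ref_count, alt_count):
    """Variant allele fraction: alt reads over ref + alt reads (0.0 if no reads)."""
    total = ref_count + alt_count
    return alt_count / total if total else 0.0


def filter_by_vaf(variants, thresholds):
    """Return {threshold: variants with VAF >= threshold} for each threshold."""
    return {
        t: [v for v in variants if vaf(v["ref_count"], v["alt_count"]) >= t]
        for t in thresholds
    }
```

Each threshold produces its own filtered set, in line with the separate VAF-filtered VCFs listed in the output folder.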

Output

results/odir/
    ├── logs
        ├── 1-step1_sample1.log
        ├── 2-step2_sample1.log
        ├── ...
    ├── filtered_vcf #MAIN OUTPUT FOLDER
        ├── 1-sample1_msec_all_filtered.vcf.gz #passing all microsec filters
        ├── 1-sample2_vault_filtered.vcf.gz #filtering criteria used in VAULT paper
        ├── 1-sample1__vault_plus_SR_filtered.vcf.gz #VAULT criteria + simple repeat filter (relevant if whole genome data, otherwise almost no difference)
        ├── 2-..... #various VAF filtered VCFs
        ├── 2-..... #various VAF filtered VCFs
    ├── microsec_input # sample info and mutation info files for microsec
    ├── microsec_output #microsec tsv output files for samples and tiers
    ├── ffpe_sig #QC plots and tables using FFPEsig repaired and unrepaired signatures
    ├── temp # intermediate files. Most are automatically deleted after workflow completion. The remaining ones may be useful or can be deleted manually.
    └── PON # custom RepSeq panel of normals annotation and beta distribution testing results. Kind of intermediate, can be deleted if not needed.

Linting and formatting

Linting results

ModuleNotFoundError in file "/tmp/tmprklvm_hw/rabiafidan-pure-faf-b3e3e46/workflow/rules/common.smk", line 5:
No module named 'cyvcf2'
  File "/tmp/tmprklvm_hw/rabiafidan-pure-faf-b3e3e46/workflow/rules/common.smk", line 5, in <module>

Formatting results

[DEBUG] 
[DEBUG] In file "/tmp/tmprklvm_hw/rabiafidan-pure-faf-b3e3e46/workflow/Snakefile":  Formatted content is different from original
[DEBUG] 
[DEBUG] In file "/tmp/tmprklvm_hw/rabiafidan-pure-faf-b3e3e46/workflow/rules/common.smk":  Formatted content is different from original
[INFO] 2 file(s) would be changed 😬

snakefmt version: 0.11.2