rabiafidan/pure-faf
Formalin artefact filtering of tumour sample VCF files
Overview
Latest release: v1.1, Last update: 2025-12-10
Linting: failed, Formatting: failed
Topics: cancer-genomics ffpe variant-filtration
Deployment
Step 1: Install Snakemake and Snakedeploy
Snakemake and Snakedeploy are best installed via Conda. It is recommended to install Conda via Miniforge. Run
conda create -c conda-forge -c bioconda -c nodefaults --name snakemake snakemake snakedeploy
to install both Snakemake and Snakedeploy in an isolated environment. For all following commands ensure that this environment is activated via
conda activate snakemake
For other installation methods, refer to the Snakemake and Snakedeploy documentation.
Step 2: Deploy workflow
With Snakemake and Snakedeploy installed, the workflow can be deployed as follows. First, create an appropriate project working directory on your system and enter it:
mkdir -p path/to/project-workdir
cd path/to/project-workdir
In all following steps, we will assume that you are inside of that directory. Then run
snakedeploy deploy-workflow https://github.com/rabiafidan/pure-faf . --tag v1.1
Snakedeploy will create two folders, workflow and config. The former contains the deployment of the chosen workflow as a Snakemake module, the latter contains configuration files which will be modified in the next step in order to configure the workflow to your needs.
Step 3: Configure workflow
To configure the workflow, adapt config/config.yml to your needs following the instructions below.
Step 4: Run workflow
The deployment method is controlled using the --software-deployment-method (short --sdm) argument.
To run the workflow using a combination of conda and apptainer/singularity for software deployment, use
snakemake --cores all --sdm conda apptainer
Snakemake will automatically detect the main Snakefile in the workflow subfolder and execute the workflow module that has been defined by the deployment in step 2.
For further options such as cluster and cloud execution, see the docs.
Step 5: Generate report
After finalizing your data analysis, you can automatically generate an interactive visual HTML report for inspection of results together with parameters and code inside of the browser using
snakemake --report report.zip
Configuration
The following section is imported from the workflow’s config/README.md.
Workflow overview
This workflow is a best-practice workflow for variant filtering of formalin-fixed tumour samples. The workflow is built using Snakemake and consists of the following steps:
Annotating the VCF with a custom panel of normals (PON)
Applying a general quality filter with criteria including PON, AD and ROQ
Splitting variants into confidence tiers using an AD threshold
Intersecting low-confidence indels with a second tool’s indel calls. Intersection is indel type-specific (insertion or deletion), and only position-based (alt allele doesn’t matter).
Running microsec twice with different parameters for low-confidence and high-confidence tiers
Merging the results back and annotating the VCF with the microsec results
Applying different filtering methods on the annotated VCF
Applying a post-filter VAF filter
Quality control using FFPEsig repaired and unrepaired signatures
See the workflow diagram.
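The indel intersection step can be illustrated with a minimal Python sketch. It is not the workflow's actual code: the `Indel` tuple and the functions `indel_type` and `intersect_indels` are hypothetical names introduced here, and the sketch assumes both callers report indels at comparable (normalised) positions. It shows only the matching rule described above: a low-confidence indel is kept when the second caller reports an indel of the same type at the same position, whatever its alt allele.

```python
from typing import Iterable, NamedTuple


class Indel(NamedTuple):
    # Hypothetical minimal record; real VCF records carry many more fields.
    chrom: str
    pos: int
    ref: str
    alt: str


def indel_type(v: Indel) -> str:
    """Classify by allele length: longer ALT is an insertion, shorter ALT a deletion."""
    if len(v.alt) > len(v.ref):
        return "ins"
    if len(v.alt) < len(v.ref):
        return "del"
    raise ValueError(f"not an indel: {v}")


def intersect_indels(low_conf: Iterable[Indel], second_caller: Iterable[Indel]) -> list[Indel]:
    """Keep low-confidence indels supported by the second caller.

    The match key is (chrom, pos, type); the ALT sequence is deliberately ignored.
    """
    supported = {(v.chrom, v.pos, indel_type(v)) for v in second_caller}
    return [v for v in low_conf if (v.chrom, v.pos, indel_type(v)) in supported]


low = [
    Indel("chr1", 100, "A", "AT"),    # insertion
    Indel("chr1", 200, "ACG", "A"),   # deletion
    Indel("chr2", 50, "G", "GCC"),    # insertion, absent from the second caller
]
second = [
    Indel("chr1", 100, "A", "ATT"),   # insertion, same position, different ALT: match
    Indel("chr1", 200, "A", "AC"),    # insertion at the deletion's position: no match
]
kept = intersect_indels(low, second)
print(kept)
```

Only the first indel survives in this toy input, because the second caller reports the same type at the same position even though the inserted sequence differs.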
Running the workflow
Input data
You need the following files:
Mutect2 VCF files (with AD, DP and ROQ annotations)
Strelka2 indel VCF (any other secondary variant caller should theoretically work)
Sample BAM/CRAM file
Human reference genome fasta you used for alignment and variant calling
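Before running the workflow it can be useful to confirm that a Mutect2 VCF declares the AD, DP and ROQ annotations. The sketch below is an illustration under stated assumptions and not part of the workflow: `declared_annotations` and `missing_annotations` are hypothetical helpers, and they inspect only the header declarations (`##INFO` and `##FORMAT` lines), not the individual records.

```python
import gzip
import re

# Matches VCF meta-information lines such as ##FORMAT=<ID=AD,Number=R,...>
HEADER_ID = re.compile(r"^##(INFO|FORMAT)=<ID=([^,>]+)")


def declared_annotations(lines):
    """Collect INFO/FORMAT IDs declared in the VCF meta-information lines."""
    found = set()
    for line in lines:
        if not line.startswith("##"):
            break  # meta-information ends at the #CHROM column header line
        match = HEADER_ID.match(line)
        if match:
            found.add(match.group(2))
    return found


def missing_annotations(vcf_path, required=("AD", "DP", "ROQ")):
    """Return the required annotation IDs that the VCF header does not declare."""
    opener = gzip.open if str(vcf_path).endswith(".gz") else open
    with opener(vcf_path, "rt") as handle:
        found = declared_annotations(handle)
    return [name for name in required if name not in found]


# Toy header used only for illustration; ROQ is intentionally not declared.
header = [
    "##fileformat=VCFv4.2",
    '##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths">',
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tTUMOUR",
]
found = declared_annotations(header)
print(sorted(found))
```

For a real file, `missing_annotations("path/to/sample.vcf.gz")` would return an empty list when all three annotations are declared; the path here is a placeholder.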
You need the following information:
tumour and normal sample names
Sequencing read length
Sequencing adapters
Using the input files and information, you need to create a sample sheet as follows:
| tumour_name | normal_name | Mutect2_vcf | Strelka2_indel_vcf | tumour_bam_cram | sequencing_read_length | sequencing_adapter_1 | sequencing_adapter_2 |
|---|---|---|---|---|---|---|---|
| K1058_K1058_HB005_D02 | K1058_K1058_BC001_D01 | path/to/ROQ_flagged_K1058_HB005_D02.vcf.gz | path/to/K1058_HB005_D02_vs_K1058_BC001_D01/K1058_HB005_D02_vs_K1058_BC001_D01.strelka.somatic_indels.vcf.gz | path/to/K1058_HB005_D02/K1058_HB005_D02.recal.cram | 300 | AGCGAGAT-CCAGTATC | AGCGAGAT-CCAGTATC |
| K1058_K1058_HB003_D02 | K1058_K1058_BC001_D01 | path/to/ROQ_flagged_K1058_HB003_D02.vcf.gz | path/to/K1058_HB003_D02_vs_K1058_BC001_D01/K1058_HB003_D02_vs_K1058_BC001_D01.strelka.somatic_indels.vcf.gz | path/to/K1058_HB003_D02/K1058_HB003_D02.recal.cram | 300 | CTTGCTAG-TCATCTCC | CTTGCTAG-TCATCTCC |
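A sample sheet with the columns above can be sanity-checked before launching the workflow. The following is a minimal sketch and not something the workflow ships: `check_samplesheet` is a hypothetical helper, and the tab delimiter is an assumption, since the on-disk format of the sample sheet is not stated here. It checks that all eight columns are present, that no cell is empty, and that the read length is an integer.

```python
import csv
import io

REQUIRED_COLUMNS = [
    "tumour_name", "normal_name", "Mutect2_vcf", "Strelka2_indel_vcf",
    "tumour_bam_cram", "sequencing_read_length",
    "sequencing_adapter_1", "sequencing_adapter_2",
]


def check_samplesheet(handle, delimiter="\t"):
    """Return a list of problems found in a delimited sample sheet (empty list = OK).

    The tab delimiter is an assumption; pass another one if the file differs.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    header = reader.fieldnames or []
    problems = [f"missing column: {c}" for c in REQUIRED_COLUMNS if c not in header]
    if problems:
        return problems
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        for col in REQUIRED_COLUMNS:
            if not (row[col] or "").strip():
                problems.append(f"line {line_no}: empty {col}")
        length = (row["sequencing_read_length"] or "").strip()
        if length and not length.isdigit():
            problems.append(f"line {line_no}: read length is not an integer: {length}")
    return problems


# Toy sheet: the second sample has an empty indel VCF path and a malformed read length.
rows = [
    REQUIRED_COLUMNS,
    ["T1", "N1", "t1.vcf.gz", "t1.indels.vcf.gz", "t1.cram", "300",
     "AGCGAGAT-CCAGTATC", "AGCGAGAT-CCAGTATC"],
    ["T2", "N1", "t2.vcf.gz", "", "t2.cram", "2x150",
     "CTTGCTAG-TCATCTCC", "CTTGCTAG-TCATCTCC"],
]
example = io.StringIO("".join("\t".join(r) + "\n" for r in rows))
problems = check_samplesheet(example)
print(problems)
```

The function accepts any open text handle, so a real file can be checked with `check_samplesheet(open(path, newline=""))`.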
Parameters
You then need to configure the workflow by updating config/config.yaml. The table below lists all parameters that can be used to run the workflow. All parameters except exclude_samples are mandatory.
| parameter | type | details | VAULT setup |
|---|---|---|---|
| samplesheet | | | |
| path | str | path to samplesheet. | |
| ref_genome | | | |
| ref_genome_version | str | choose one of GRCh38 or GRCh37, case-insensitive. | |
| ref_genome_fasta | str | path to the human reference genome. | |
| VCF quality filter thresholds | | | |
| AD | num | minimum alternative allele count | 4 |
| ROQ | num | ROQ threshold for read orientation quality | 20 |
| DP | num | minimum read depth | 0 - no separate DP filtering |
| PON parameters | | | |
| PON_alpha | num | Beta test significance level | 0.05 |
| Microsec parameters | | | |
| AD_tier_threshold | num | maximum AD for low-confidence tier | 10 |
| pval_threshold_high_AD | num | pvalue threshold for | 1e-4 |
| pval_threshold_low_AD | num | pvalue threshold for | 1e-6 |
| Post-filter thresholds | | | |
| VAF | array | a list of VAF values you want to apply after formalin filtering. All will appear in the QC plot. | |
| Other parameters | | | |
| odir | str | output directory. This will appear under results/ | |
| exclude_samples | array | a list of sample names you want to exclude from the analysis. Default is empty list. | |
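To make the AD, ROQ, DP and AD_tier_threshold parameters concrete, here is a minimal Python sketch using the VAULT setup values from the table. It is an illustration, not the workflow's implementation: the `Thresholds` class and both functions are hypothetical, the PON part of the quality filter is omitted, and treating every threshold as an inclusive bound is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Defaults follow the "VAULT setup" column of the parameter table.
    ad: int = 4                  # minimum alternative allele count
    roq: float = 20              # ROQ threshold for read orientation quality
    dp: int = 0                  # minimum read depth; 0 means no separate DP filtering
    ad_tier_threshold: int = 10  # maximum AD for the low-confidence tier


def passes_quality(ad: int, roq: float, dp: int, t: Thresholds) -> bool:
    # Assumption: each threshold is an inclusive lower bound.
    return ad >= t.ad and roq >= t.roq and dp >= t.dp


def confidence_tier(ad: int, t: Thresholds) -> str:
    # Assumption: AD up to and including the threshold falls in the low-confidence tier.
    return "low" if ad <= t.ad_tier_threshold else "high"


t = Thresholds()
# Toy variants as (name, AD, ROQ, DP); the values are made up for illustration.
variants = [
    ("v1", 3, 35.0, 40),   # fails the AD filter
    ("v2", 6, 25.0, 50),   # passes, low-confidence tier
    ("v3", 12, 18.0, 80),  # fails the ROQ filter
    ("v4", 15, 40.0, 90),  # passes, high-confidence tier
]
tiers = {
    name: confidence_tier(ad, t)
    for name, ad, roq, dp in variants
    if passes_quality(ad, roq, dp, t)
}
print(tiers)
```

With these settings, variants with AD between 4 and 10 that pass the quality filter go to the low-confidence tier, and those with AD above 10 go to the high-confidence tier.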
Output
results/odir/
├── logs
│   ├── 1-step1_sample1.log
│   ├── 2-step2_sample1.log
│   └── ...
├── filtered_vcf #MAIN OUTPUT FOLDER
│   ├── 1-sample1_msec_all_filtered.vcf.gz #passing all microsec filters
│   ├── 1-sample2_vault_filtered.vcf.gz #filtering criteria used in VAULT paper
│   ├── 1-sample1__vault_plus_SR_filtered.vcf.gz #VAULT criteria + simple repeat filter (relevant for whole-genome data, otherwise almost no difference)
│   ├── 2-..... #various VAF filtered VCFs
│   └── 2-..... #various VAF filtered VCFs
├── microsec_input # sample info and mutation info files for microsec
├── microsec_output #microsec tsv output files for samples and tiers
├── ffpe_sig #QC plots and tables using FFPEsig repaired and unrepaired signatures
├── temp # intermediate files. Most are automatically deleted after workflow completion. Remaining ones may be useful or can be deleted manually.
└── PON # custom RepSeq panel of normals annotation and beta distribution testing results. Kind of intermediate, can be deleted if not needed.
Linting and formatting
Linting results
ModuleNotFoundError in file "/tmp/tmprklvm_hw/rabiafidan-pure-faf-b3e3e46/workflow/rules/common.smk", line 5:
No module named 'cyvcf2'
  File "/tmp/tmprklvm_hw/rabiafidan-pure-faf-b3e3e46/workflow/rules/common.smk", line 5, in <module>
Formatting results
[DEBUG]
[DEBUG] In file "/tmp/tmprklvm_hw/rabiafidan-pure-faf-b3e3e46/workflow/Snakefile": Formatted content is different from original
[DEBUG]
[DEBUG] In file "/tmp/tmprklvm_hw/rabiafidan-pure-faf-b3e3e46/workflow/rules/common.smk": Formatted content is different from original
[INFO] 2 file(s) would be changed 😬

snakefmt version: 0.11.2