Crollo95/survhive-workflow

Standardized Snakemake workflow for survival analysis using SurvSet datasets and SurvHive models. Supports preprocessing, cross-validation, and model evaluation.

Overview

Topics: benchmarking machine-learning snakemake survival-analysis survhive

Latest release: v1.0.0, Last update: 2025-03-20

Linting: failed, Formatting: failed

Deployment

Step 1: Install Snakemake and Snakedeploy

Snakemake and Snakedeploy are best installed via the Mamba package manager (a drop-in replacement for Conda). If you have neither Conda nor Mamba, it is recommended to install Miniforge. More details regarding Mamba can be found in its documentation.

When using Mamba, run

mamba create -c conda-forge -c bioconda --name snakemake snakemake snakedeploy

to install both Snakemake and Snakedeploy in an isolated environment. For all of the following commands, ensure that this environment is activated via

conda activate snakemake

Step 2: Deploy workflow

With Snakemake and Snakedeploy installed, the workflow can be deployed as follows. First, create an appropriate project working directory on your system and enter it:

mkdir -p path/to/project-workdir
cd path/to/project-workdir

In all following steps, we will assume that you are inside that directory. Then run

snakedeploy deploy-workflow https://github.com/Crollo95/survhive-workflow . --tag v1.0.0

Snakedeploy will create two folders, workflow and config. The former contains the deployment of the chosen workflow as a Snakemake module, the latter contains configuration files which will be modified in the next step in order to configure the workflow to your needs.

Step 3: Configure workflow

To configure the workflow, adapt config/config.yaml to your needs following the instructions below.
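As a sketch, a minimal config.yaml might look as follows, combining the keys described in the configuration guide further below (datasets, models, n_trials); the dataset name lung_cancer is illustrative:

```yaml
# Minimal illustrative config.yaml (keys per the configuration guide below)
datasets:
  lung_cancer:
    source: survset   # fetched from the SurvSet repository
models:
  - CoxPH
n_trials: 10          # cross-validation hyperparameter-optimization trials
```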

Step 4: Run workflow

The deployment method is controlled using the --software-deployment-method (short --sdm) argument.

To run the workflow with automatic deployment of all required software via conda/mamba, use

snakemake --cores all --sdm conda

Snakemake will automatically detect the main Snakefile in the workflow subfolder and execute the workflow module that has been defined by the deployment in step 2.

For further options such as cluster and cloud execution, see the docs.

Step 5: Generate report

After finalizing your data analysis, you can automatically generate an interactive HTML report that bundles results, parameters, and code for inspection in the browser using

snakemake --report report.zip

Configuration

The following section is imported from the workflow’s config/README.md.

Configuration Guide for SurvHive Workflow

This document explains how to set up and configure the config.yaml file for the SurvHive Snakemake workflow.

1️⃣ Creating a Configuration File

Before running the workflow, copy the example configuration file and modify it as needed:

cp config/config.yaml.example config/config.yaml

2️⃣ Dataset Configuration

The datasets section defines datasets that will be processed. Each dataset requires a source and, if necessary, a file path.

Dataset Entry Format

datasets:
  dataset_name:
    source: survset  # Must be either 'survset' or 'external'
    file_path: "path/to/dataset.csv"  # Only required for 'external' datasets

  • source must be survset or external (case-insensitive):

    • survset: The dataset is from the SurvSet repository.

    • external: The dataset is a CSV file (requires file_path).

  • For external datasets, file_path must be specified.

  • For survset datasets, file_path must NOT be provided.

  • Any other value for source will trigger an error.
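The rules above can be sketched as a small validation helper. This is a hypothetical illustration, not code from the workflow; the function name `validate_dataset_entry` is made up:

```python
def validate_dataset_entry(name, entry):
    """Validate one dataset entry per the rules above (illustrative sketch).

    `entry` is the dict parsed from a dataset block in config.yaml.
    Returns the normalized source string, or raises ValueError.
    """
    source = str(entry.get("source", "")).lower()  # source is case-insensitive
    if source not in ("survset", "external"):
        raise ValueError(f"{name}: source must be 'survset' or 'external'")
    if source == "external" and "file_path" not in entry:
        raise ValueError(f"{name}: external datasets require file_path")
    if source == "survset" and "file_path" in entry:
        raise ValueError(f"{name}: survset datasets must not set file_path")
    return source
```

For example, `validate_dataset_entry("lung_cancer", {"source": "SurvSet"})` is accepted thanks to the case-insensitive comparison, while an external entry without a file_path raises an error.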

Example Configuration

datasets:
  lung_cancer:
    source: survset
  clinical_study:
    source: external
    file_path: "data/clinical_study.csv"

3️⃣ Model Selection

Define which survival models should be used for training and evaluation.

models:
  - CoxPH
  - RSF
  - DeepHitSingle

You can comment out models you do not want to use.
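For example, to skip DeepHitSingle while keeping it listed for later, leave it commented:

```yaml
models:
  - CoxPH
  - RSF
  # - DeepHitSingle   # commented out: this model will not be trained
```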

4️⃣ Feature Naming Conventions

To ensure consistency, datasets should follow these column conventions:

  • Mandatory Columns:

    • pid: Unique identifier for patients or samples

    • event: Binary event indicator (1 = event occurred, 0 = censored)

    • time: Time-to-event or censoring

  • Feature Columns:

    • num_: Prefix for numerical features

    • fac_: Prefix for categorical features
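A quick way to enforce these conventions is a column check before preprocessing. This is a hypothetical sketch (the function `check_columns` is not part of the workflow):

```python
MANDATORY = {"pid", "event", "time"}

def check_columns(columns):
    """Raise ValueError if columns violate the naming conventions above."""
    cols = set(columns)
    missing = MANDATORY - cols
    if missing:
        raise ValueError(f"missing mandatory columns: {sorted(missing)}")
    # Every non-mandatory column must be a prefixed feature column.
    bad = [c for c in cols - MANDATORY
           if not (c.startswith("num_") or c.startswith("fac_"))]
    if bad:
        raise ValueError(f"columns lack num_/fac_ prefix: {sorted(bad)}")
```

For instance, `check_columns(["pid", "event", "time", "num_age", "fac_sex"])` passes, whereas a frame missing the event column or carrying an unprefixed feature column raises an error.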

5️⃣ Cross-Validation Trials

The number of trials used for hyperparameter optimization in cross-validation can be configured using the n_trials setting in config.yaml. This allows users to control the trade-off between optimization quality and computation time.

n_trials: 10  # Default is 10. Increase for better tuning, decrease for speed.

6️⃣ Running the Workflow

Once config.yaml is set up, execute Snakemake:

snakemake --use-conda --cores <n>

Where <n> is the number of CPU cores to allocate.

7️⃣ Troubleshooting

  • Ensure source is correctly set (survset or external).

  • If using external, verify that file_path is correct.

  • Use snakemake --use-conda --dry-run to test the configuration before execution.

For further details, refer to the main README.md. 🚀


Linting and formatting

Linting results

WorkflowError in file /tmp/tmpst0ztrt2/Crollo95-survhive-workflow-180afd5/workflow/Snakefile, line 18:
Workflow defines configfile /tmp/tmpst0ztrt2/Crollo95-survhive-workflow-180afd5/config/config.yaml but it is not present or accessible (full checked path: /tmp/tmpst0ztrt2/Crollo95-survhive-workflow-180afd5/config/config.yaml).

Formatting results

[DEBUG] 
[DEBUG] In file "/tmp/tmpst0ztrt2/Crollo95-survhive-workflow-180afd5/workflow/Snakefile":  Formatted content is different from original
[DEBUG] 
[DEBUG] In file "/tmp/tmpst0ztrt2/Crollo95-survhive-workflow-180afd5/workflow/rules/validate_datasets.smk":  Formatted content is different from original
[DEBUG] 
[DEBUG] In file "/tmp/tmpst0ztrt2/Crollo95-survhive-workflow-180afd5/workflow/rules/load_data.smk":  Formatted content is different from original
[DEBUG] 
[DEBUG] In file "/tmp/tmpst0ztrt2/Crollo95-survhive-workflow-180afd5/workflow/rules/evaluate_model.smk":  Formatted content is different from original
[DEBUG] 
[DEBUG] In file "/tmp/tmpst0ztrt2/Crollo95-survhive-workflow-180afd5/workflow/rules/cross_validation.smk":  Formatted content is different from original
[DEBUG] 
[DEBUG] In file "/tmp/tmpst0ztrt2/Crollo95-survhive-workflow-180afd5/workflow/rules/preprocess_and_split.smk":  Formatted content is different from original
[INFO] 6 file(s) would be changed 😬

snakefmt version: 0.10.2