MPUSP/snakemake-ms-proteomics
Pipeline for automatic processing and quality control of mass spectrometry data
Overview
Topics: bioinformatics conda mass-spectrometry pipeline proteomics snakemake
Latest release: v1.0.0, Last update: 2025-01-30
Linting: passed, Formatting: passed
Deployment
Step 1: Install Snakemake and Snakedeploy
Snakemake and Snakedeploy are best installed via the Mamba package manager (a drop-in replacement for conda). If you have neither Conda nor Mamba, it is recommended to install Miniforge. More details regarding Mamba can be found here.
When using Mamba, run
mamba create -c conda-forge -c bioconda --name snakemake snakemake snakedeploy
to install both Snakemake and Snakedeploy in an isolated environment. For all following commands ensure that this environment is activated via
conda activate snakemake
Step 2: Deploy workflow
With Snakemake and Snakedeploy installed, the workflow can be deployed as follows. First, create an appropriate project working directory on your system and enter it:
mkdir -p path/to/project-workdir
cd path/to/project-workdir
In all following steps, we will assume that you are inside that directory. Then run
snakedeploy deploy-workflow https://github.com/MPUSP/snakemake-ms-proteomics . --tag v1.0.0
Snakedeploy will create two folders, `workflow` and `config`. The former contains the deployment of the chosen workflow as a Snakemake module, the latter contains configuration files which will be modified in the next step in order to configure the workflow to your needs.
Step 3: Configure workflow
To configure the workflow, adapt `config/config.yml` to your needs following the instructions below.
Step 4: Run workflow
The deployment method is controlled using the `--software-deployment-method` (short `--sdm`) argument.
To run the workflow with automatic deployment of all required software via `conda`/`mamba`, use
snakemake --cores all --sdm conda
Snakemake will automatically detect the main `Snakefile` in the `workflow` subfolder and execute the workflow module that has been defined by the deployment in step 2.
For further options such as cluster and cloud execution, see the docs.
Step 5: Generate report
After finalizing your data analysis, you can automatically generate an interactive visual HTML report for inspection of results together with parameters and code inside of the browser using
snakemake --report report.zip
Configuration
The following section is imported from the workflow's `config/README.md`.
This workflow is a best-practice workflow for the automated analysis of mass spectrometry proteomics data. It currently supports automated analysis of data-dependent acquisition (DDA) data with label-free quantification. An extension to other workflows (DIA, isotope labeling) is planned for the future. The workflow is mainly a wrapper for the excellent tools fragpipe and MSstats, with additional modules that supply and check the required input files and generate reports. The workflow is built using snakemake and processes MS data using the following steps:
- Prepare `workflow` file (`python` script)
- Check user-supplied sample sheet (`python` script)
- Fetch protein database from NCBI or use user-supplied fasta file (`python`, NCBI Datasets)
- Generate decoy proteins (DecoyPyrat)
- Import raw files, search protein database (fragpipe)
- Align feature maps using IonQuant (fragpipe)
- Import quantified features, infer and quantify proteins (R MSstats)
- Compare different biological conditions, export results (R MSstats)
- Generate HTML report with embedded QC plots (R markdown)
- Generate PDF report from HTML (weasyprint)
- Send out report by email (`python` script)
- Clean up temporary files after workflow execution (`bash` script)
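As a concrete illustration of the decoy generation step, reversed-sequence decoys (as produced by tools like DecoyPyrat) can be sketched as follows. The function name and FASTA entries here are hypothetical, not the workflow's actual code:

```python
# Illustration only: build a decoy FASTA entry by reversing the target
# sequence and prepending a decoy prefix to the header.

def make_decoy(header: str, sequence: str, prefix: str = "rev_"):
    """Return a decoy FASTA header/sequence pair with a reversed sequence."""
    decoy_header = f">{prefix}{header.lstrip('>')}"
    return decoy_header, sequence[::-1]

header, seq = make_decoy(">sp|P12345|EXAMPLE", "MKTAYIAKQR")
print(header)  # >rev_sp|P12345|EXAMPLE
print(seq)     # RQKAIYATKM
```

Search engines then use hits against these reversed decoys to estimate the false discovery rate of the target matches.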
If you want to contribute, report issues, or suggest features, please get in touch on github.
Step 1: Install snakemake with `conda`, `mamba`, or `micromamba` (or any other `conda` flavor). This step generates a new conda environment called `snakemake-ms-proteomics`, which will be used for all further installations.
conda create -c conda-forge -c bioconda -n snakemake-ms-proteomics snakemake
Step 2: Activate conda environment with snakemake
source /path/to/conda/bin/activate
conda activate snakemake-ms-proteomics
Alternatively, install `snakemake` using pip:
pip install snakemake
Or install `snakemake` globally from Linux archives:
sudo apt install snakemake
Fragpipe is not available on `conda` or other package archives. However, to make the workflow as user-friendly as possible, the latest fragpipe release from GitHub (currently v22.0) is automatically installed into the respective `conda` environment when the workflow is used for the first time. After installation, the GUI (graphical user interface) will pop up and ask you to finish the installation by downloading the missing modules MSFragger, IonQuant, and Philosopher. This step is necessary to abide by license restrictions. From then on, fragpipe will run in `headless` mode through the command line only.
All other dependencies for the workflow are automatically pulled as `conda` environments by snakemake.
The workflow requires the following input files:

- mass spectrometry data, such as Thermo `*.raw` or `*.mzML` files
- an (organism) database in `*.fasta` format OR an NCBI RefSeq ID; decoys (`rev_` prefix) will be added if necessary
- a sample sheet in tab-separated format (aka `manifest` file)
- a `workflow` file for fragpipe (see the `resources` dir)
The sample sheet file has the following structure with five mandatory columns and no header (example file: `test/input/samplesheet/samplesheet.tsv`).

- `sample`: names/paths to raw files
- `condition`: experimental group, treatments
- `replicate`: replicate number, consecutively numbered. Repeating numbers (e.g. 1, 2, 1, 2) will be treated as paired samples!
- `type`: the type of MS data, will be used to determine the workflow
- `control`: reference condition for testing differential abundance
| sample | condition | replicate | type | control |
| --- | --- | --- | --- | --- |
| sample_1 | condition_1 | 1 | DDA | condition_1 |
| sample_2 | condition_1 | 2 | DDA | condition_1 |
| sample_3 | condition_2 | 3 | DDA | condition_1 |
| sample_4 | condition_2 | 4 | DDA | condition_1 |
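The header-less, tab-separated layout described above can be parsed with a few lines of standard-library code. This is a minimal sketch for illustration; the column names come from this README, while the helper function and example rows are hypothetical:

```python
# Parse a header-less, tab-separated sample sheet into one dict per row.
import csv
import io

COLUMNS = ["sample", "condition", "replicate", "type", "control"]

tsv = (
    "sample_1\tcondition_1\t1\tDDA\tcondition_1\n"
    "sample_2\tcondition_1\t2\tDDA\tcondition_1\n"
)

def read_samplesheet(text):
    """Map each tab-separated row onto the fixed column names."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [dict(zip(COLUMNS, row)) for row in reader]

rows = read_samplesheet(tsv)
print(rows[0]["condition"])  # condition_1
```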
To run the workflow from the command line, change the working directory.
cd /path/to/snakemake-ms-proteomics
Adjust options in the default config file config/config.yml
.
Before running the entire workflow, you can perform a dry run using:
snakemake --dry-run
To run the complete workflow with test files using `conda`, execute the following command. Specifying the number of compute cores is mandatory.
snakemake --cores 10 --sdm conda --directory .test
To supply options that override the defaults, run the workflow like this:
snakemake --cores 10 --sdm conda --directory .test \
--configfile 'config/config.yml' \
--config \
samplesheet='my/sample_sheet.tsv'
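The override mechanism shown above follows standard Snakemake precedence: keys passed via `--config` replace the same keys loaded from the `--configfile`. A minimal sketch of that merge, with illustrative values rather than the workflow's real defaults:

```python
# Sketch of config precedence: command-line --config values win over
# the same keys from the config file.
config_file = {
    "samplesheet": "test/input/config/samplesheet.tsv",
    "workflow": "from_samplesheet",
}
cli_overrides = {"samplesheet": "my/sample_sheet.tsv"}

config = {**config_file, **cli_overrides}  # later dict wins on collisions
print(config["samplesheet"])  # my/sample_sheet.tsv
```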
This table lists all global parameters to the workflow.
| parameter | type | details | example |
| --- | --- | --- | --- |
| samplesheet | `*.tsv` | tab-separated file | `test/input/config/samplesheet.tsv` |
| database | `*.fasta` OR RefSeq ID | plain text | `test/input/database/database.fasta`, `GCF_000009045.1` |
| workflow | `*.workflow` OR string | a fragpipe workflow | `workflows/LFQ-MBR.workflow`, `from_samplesheet` |
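The `database` parameter accepts two forms, a `*.fasta` file path or an NCBI RefSeq assembly accession such as `GCF_000009045.1`. A hypothetical helper (not part of the workflow) that distinguishes the two could look like this:

```python
# Distinguish the two accepted `database` forms: a fasta path or an
# NCBI assembly accession (GCF_/GCA_ followed by digits and a version).
import re

def classify_database(value: str) -> str:
    if value.endswith(".fasta"):
        return "fasta"
    if re.fullmatch(r"GC[AF]_\d+\.\d+", value):
        return "refseq"
    raise ValueError(f"unrecognized database value: {value}")

print(classify_database("test/input/database/database.fasta"))  # fasta
print(classify_database("GCF_000009045.1"))  # refseq
```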
This table lists all module-specific parameters and their default values, as included in the `config.yml` file.
| module | parameter | default | details |
| --- | --- | --- | --- |
| decoypyrat | `cleavage_sites` | `KR` | amino acid residues used for decoy peptide generation |
| decoypyrat | `decoy_prefix` | `rev` | decoy prefix appended to protein names |
| fragpipe | `target_dir` | `share` | default path in conda env to store fragpipe |
| fragpipe | `executable` | `fragpipe/bin/fragpipe` | path to fragpipe executable |
| fragpipe | `download` | FragPipe-22.0 (see config) | download link to the FragPipe GitHub repo |
| msstats | `logTrans` | `2` | base for log fold change transformation |
| msstats | `normalization` | `equalizeMedians` | normalization strategy for feature intensity, see MSstats manual |
| msstats | `featureSubset` | `all` | which features to use for quantification |
| msstats | `summaryMethod` | `TMP` | how to calculate protein intensity from feature intensity |
| msstats | `MBimpute` | `True` | impute missing values with an accelerated failure time model |
| report | `html` | `True` | generate HTML report |
| report | `pdf` | `True` | generate PDF report |
| report | `send` | `False` | whether reports should be sent out by email |
| report | `port` | `0` | default port for email server |
| report | `smtp_server` | `smtp.example.com` | SMTP server address |
| report | `smtp_user` | `user` | SMTP server user name |
| report | `smtp_pw` | `password` | SMTP server user password |
| report | `from` | `sender@email.com` | sender's email address |
| report | `to` | `["receiver@email.com"]` | receiver's email address(es), a list |
| report | `subject` | `"Results MS proteomics workflow"` | subject line for email |
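Two of the MSstats options above can be illustrated numerically: `logTrans: 2` means fold changes are reported on a log2 scale, and `equalizeMedians` shifts each run's intensities so all runs share the same median. The code below is a simplified sketch of these ideas, not MSstats itself:

```python
# Simplified illustration of log2 fold change and median equalization.
import math

def log_fc(intensity_a, intensity_b, base=2):
    """Fold change between two intensities on a log scale."""
    return math.log(intensity_a / intensity_b, base)

def equalize_medians(runs):
    """Shift each run so its median matches the overall median."""
    def median(xs):
        s = sorted(xs)
        mid = len(s) // 2
        return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2
    overall = median([x for run in runs for x in run])
    return [[x - median(run) + overall for x in run] for run in runs]

print(log_fc(400, 100))  # 2.0 (a four-fold change is log2 FC = 2)
```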