snakemake-workflows/cellranger-multi

A Snakemake workflow for preprocessing single-cell RNA-seq (scRNA-seq) data with cellranger multi (Cell Ranger licensing requires a manual download of the software).

Overview

Latest release: v2.0.0, Last update: 2025-11-05

Linting: passed, Formatting: passed

Deployment

Step 1: Install Snakemake and Snakedeploy

Snakemake and Snakedeploy are best installed via Conda, which we recommend installing via Miniforge. Run

conda create -c conda-forge -c bioconda -c nodefaults --name snakemake snakemake snakedeploy

to install both Snakemake and Snakedeploy in an isolated environment. For all following commands ensure that this environment is activated via

conda activate snakemake

For other installation methods, refer to the Snakemake and Snakedeploy documentation.

Step 2: Deploy workflow

With Snakemake and Snakedeploy installed, the workflow can be deployed as follows. First, create an appropriate project working directory on your system and enter it:

mkdir -p path/to/project-workdir
cd path/to/project-workdir

In all following steps, we will assume that you are inside of that directory. Then run

snakedeploy deploy-workflow https://github.com/snakemake-workflows/cellranger-multi . --tag v2.0.0

Snakedeploy will create two folders, workflow and config. The former contains the deployment of the chosen workflow as a Snakemake module, the latter contains configuration files which will be modified in the next step in order to configure the workflow to your needs.
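As a rough orientation, the deployed project then looks approximately like this (exact file names depend on the workflow's config directory, so treat this as a sketch):

path/to/project-workdir/
  config/
    config.yaml        # global configuration, edited in step 3
    ...                # e.g. the pool sheet and multiplexing TSVs described below
  workflow/
    Snakefile          # declares the deployed workflow as a Snakemake module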

Step 3: Configure workflow

To configure the workflow, adapt config/config.yaml to your needs following the instructions below.

Step 4: Run workflow

The software deployment method is controlled using the --software-deployment-method (short: --sdm) argument.

To run the workflow using apptainer/singularity, use

snakemake --cores all --sdm apptainer

To run the workflow using a combination of conda and apptainer/singularity for software deployment, use

snakemake --cores all --sdm conda apptainer

To run the workflow with automatic deployment of all required software via conda/mamba, use

snakemake --cores all --sdm conda

Snakemake will automatically detect the main Snakefile in the workflow subfolder and execute the workflow module that has been defined by the deployment in step 2.

For further options such as cluster and cloud execution, see the docs.
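For example, a cluster run could be started as sketched below, assuming you have additionally installed the SLURM executor plugin (snakemake-executor-plugin-slurm) into the snakemake environment:

snakemake --executor slurm --jobs 100 --sdm conda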

Step 5: Generate report

After finalizing your data analysis, you can automatically generate an interactive visual HTML report for inspection of results, together with parameters and code, inside the browser using

snakemake --report report.zip

Configuration

The following section is imported from the workflow’s config/README.md.

Workflow overview

This is a best-practice Snakemake workflow for systematically running cellranger multi on one or more samples. Consult the 10X documentation on choosing a pipeline to check whether this is the preprocessing you need. If your assay setup suggests cellranger count, have a look at the standardised workflow for cellranger count instead.

The workflow is built using snakemake and consists of the following steps:

  1. Link in files to a new file name that follows cellranger requirements.

  2. Create a per-sample cellranger multi config CSV sheet.

  3. Run cellranger multi, parallelizing over biological samples.

  4. Create a snakemake report with a Web Summary per biological sample.

Running the workflow

cellranger download

As a prerequisite for running the workflow, you need to download the *.tar.gz file containing the Cell Ranger executable from the Cell Ranger download center: https://www.10xgenomics.com/support/software/cell-ranger/downloads

Afterwards, set the environment variable CELLRANGER_TARBALL to the full path of this tarball, for example:
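(The path and version in the line below are placeholders; adapt them to wherever you saved your download, assuming a bash-compatible shell.)

export CELLRANGER_TARBALL=/absolute/path/to/cellranger-x.y.z.tar.gz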

To set this environment variable permanently for your user on the respective system, add the (adapted) line from above to your ~/.bashrc file and make sure that file is always loaded.

With this environment variable set, the workflow will automatically install cellranger into a conda environment that is then used for all cellranger steps. Once your specific analysis has created this conda environment, the cellranger version stays pinned to the version specified at that time.

Should you ever want to update the cellranger version for an analysis, you will have to update the CELLRANGER_TARBALL environment variable and delete the conda environment, so that it gets re-generated. The conda environments are stored in the hidden .snakemake/conda/ folder. You can usually identify the exact conda environment used by a rule from the .snakemake/logs/ files or the respective cluster system log files: search for the execution of the respective rule (cellranger_multi) and look for Activating conda environment: right below it. You can then delete the respective file and directory under .snakemake/conda/ and rerun the workflow.
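A minimal sketch of that lookup from the main workflow directory, assuming a bash-like shell and the default log location (the environment hash reported will differ on your system):

# locate the conda environment used by the cellranger_multi rule
grep -A 30 "rule cellranger_multi:" .snakemake/log*/*.log | grep "Activating conda environment"
# delete the reported environment directory and its accompanying .yaml file, then rerun
rm -rf .snakemake/conda/<hash-reported-above>*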

Input data

pool sheet TSV

The pool sheet TSV configures all the possible columns for the [libraries] section of the multi config CSV file:

| id | feature_types | read1 | read2 | lane_number |
|---|---|---|---|---|
| pool_1 | Gene Expression | ../data/pool1_gex/sample1_gex.bwa.L001.read1.fastq.gz | ../data/pool1_gex/sample1_gex.bwa.L001.read2.fastq.gz | 1 |
| pool_1 | Gene Expression | ../data/pool1_gex/pool1_gex.bwa.L002.read1.fastq.gz | ../data/pool1_gex/pool1_gex.bwa.L002.read2.fastq.gz | 2 |
| pool_1 | VDJ-T | ../data/pool1_vdjt/pool1_vdjt.bwa.L003.read1.fastq.gz | ../data/pool1_vdjt/pool1_vdjt.bwa.L003.read2.fastq.gz | 1 |

For more details on these columns, refer to the 10X documentation for the [libraries] section of the multi config CSV file. We also provide specific subsection links wherever available.

These are required columns:

  • id is an arbitrary name assigned to represent one pool of samples. One multi config CSV file per id value will be created by the workflow, and each id will be processed separately, so that the workflow can parallelize as much as possible. If you used just one sample to create your assay libraries (without any multiplexing), then the id value used here can just be a sample name. If you used multiplexing barcodes to pool multiple samples, this id groups all of the samples that were pooled before preparing assay libraries. For multiplexed experiments, you have to provide a multiplexing TSV file (see below) that provides a barcode or id for each sample in a multiplexed pool.

  • feature_types can be any of the values listed in the cellranger multi documentation on multi config CSVs. The only disallowed feature type is VDJ, as this triggers auto-detection of the actual VDJ feature type (VDJ-T or VDJ-B) and thus leads to unpredictable folder names in the output.

  • read1 and read2 require file names with paths relative to the main workflow directory (the directory where you run the snakemake command). From these (and the optional lane_number column), the raw read data files are linked into the folder and file name structure that cellranger expects, and the fastq_id and fastqs columns of the multi config CSV file are set up accordingly.

These are optional columns:

  • lane_number is only necessary if a single sample is sequenced across multiple lanes. Lanes are usually numbered starting from 1, and there is typically only a single-digit number of them. As each row specifies one pair of fastq files, the lane_number column contains only a single lane number per row (one pair of files per lane). For the lanes column in the final multi config CSV file, multiple lane numbers get parsed into the format 1|2|3 etc. (see the sketch after this list).

  • physical_library_id is usually auto-detected, so just omit it if in doubt.

  • subsample_rate is not usually needed.

  • chemistry is auto by default and only applicable for Flex assays. If you think this applies to your setup, see the chemistry options in the 10X documentation.
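To illustrate how the pool sheet translates into the generated multi config CSV, the [libraries] section for pool_1 from the example above might look roughly like the following. The fastq_id and fastqs values here are assumptions, as they depend on how the workflow links and renames the raw files:

[libraries]
fastq_id,fastqs,lanes,feature_types
pool1_gex,/path/to/linked/fastqs/pool_1/gex,1|2,Gene Expression
pool1_vdjt,/path/to/linked/fastqs/pool_1/vdjt,1,VDJ-T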

multiplexing TSV

This file is only necessary if you used multiplexing to pool multiple samples for sequencing. You provide its relative path in the global config/config.yaml file under multi_config_csv_sections: multiplexing:. It specifies which sample used which ocm_barcode_ids (or hashtag_ids or cmo_ids or probe_barcode_ids).

| id | sample_id | ocm_barcode_ids |
|---|---|---|
| pool_1 | sample_1 | OB1\|OB2 |
| pool_1 | sample_2 | OB3\|OB4 |

These are required columns:

  • sample_id is an arbitrary name assigned to represent one biological sample that was sequenced with a known set of barcodes.

  • id is an arbitrary name assigned to represent one pool of samples. It needs to match an id from the pool sheet TSV file. A pool can contain one or more samples.

  • One of the following columns, depending on the barcode type used for the multiplexing setup:

    • ocm_barcode_ids

    • hashtag_ids

    • cmo_ids

    • probe_barcode_ids

You can read the cellranger documentation on multiplexing setups to determine which column to use. In addition, the samples section of the multi config CSV file documentation explains which values you can use there and which additional columns may be available.
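For the multiplexing TSV above, the resulting [samples] section of the generated multi config CSV would look roughly like this (a sketch following the cellranger multi format, not necessarily the workflow's literal output):

[samples]
sample_id,ocm_barcode_ids
sample_1,OB1|OB2
sample_2,OB3|OB4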

Global analysis-level configuration

All global configuration settings for the whole analysis are specified in the config/config.yaml file. This file is extensively commented to explain how to set which options. You can delete any options you don't need, or set them to an empty string (""). The only required sections are those for the feature types present in the feature_types column of the pool sheet, plus the flag indicating whether multiplexing should be activated. The only required entry within such a section is usually the reference: path or file specification.
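A heavily abbreviated, illustrative fragment is shown below. Apart from multi_config_csv_sections, multiplexing and reference, which are quoted in this README, the key names and values are assumptions, so always start from the commented config/config.yaml shipped with the workflow:

# illustrative sketch only -- the shipped config/config.yaml is the authoritative template
multi_config_csv_sections:
  gene-expression:
    reference: /path/to/reference         # required because "Gene Expression" appears in feature_types
  multiplexing: config/multiplexing.tsv   # only needed when multiplexing is activated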

Linting and formatting

Linting results

All tests passed!

Formatting results

All tests passed!