durr1602/gyoza
Snakemake pipeline to analyze demultiplexed sequencing data of a DMS experiment, produce diagnostic plots and generate a dataframe of selection coefficients
Overview
Latest release: v1.0.0, Last update: 2025-02-20
Linting: passed, Formatting: failed
Deployment
Step 1: Install Snakemake and Snakedeploy
Snakemake and Snakedeploy are best installed via the Mamba package manager (a drop-in replacement for conda). If you have neither Conda nor Mamba, it is recommended to install Miniforge. More details regarding Mamba can be found here.
When using Mamba, run
mamba create -c conda-forge -c bioconda --name snakemake snakemake snakedeploy
to install both Snakemake and Snakedeploy in an isolated environment. For all following commands ensure that this environment is activated via
conda activate snakemake
Step 2: Deploy workflow
With Snakemake and Snakedeploy installed, the workflow can be deployed as follows. First, create an appropriate project working directory on your system and enter it:
mkdir -p path/to/project-workdir
cd path/to/project-workdir
In all following steps, we will assume that you are inside of that directory. Then run
snakedeploy deploy-workflow https://github.com/durr1602/gyoza . --tag v1.0.0
Snakedeploy will create two folders, workflow and config. The former contains the deployment of the chosen workflow as a Snakemake module, the latter contains configuration files which will be modified in the next step in order to configure the workflow to your needs.
Step 3: Configure workflow
To configure the workflow, adapt config/config.yml to your needs following the instructions below.
Step 4: Run workflow
The deployment method is controlled using the --software-deployment-method (short --sdm) argument.
To run the workflow with automatic deployment of all required software via conda/mamba, use
snakemake --cores all --sdm conda
Snakemake will automatically detect the main Snakefile in the workflow subfolder and execute the workflow module that has been defined by the deployment in step 2.
For further options such as cluster and cloud execution, see the docs.
Step 5: Generate report
After finalizing your data analysis, you can automatically generate an interactive visual HTML report for inspection of results together with parameters and code inside of the browser using
snakemake --report report.zip
Configuration
The following section is imported from the workflow’s config/README.md.
A few files must be provided to properly analyze your data. What follows is the general procedure; however, a toy dataset is provided to test the workflow. If you simply want to run the workflow with the toy dataset, enter the following:
snakemake --use-conda
Please provide the raw reads (forward and reverse) of your DMS sequencing data in the config/reads folder (or specify a different path in the main config file). The file names should be featured in the layout (see section below).
Please provide a csv-formatted layout of your samples. The file should be named layout.csv and be located in the config/project_files folder. Here is an example. The file should contain the following columns:
- Sample_name: the unique identifier for each of your samples. The sample name does not need to contain information about the timepoint or replicate, since these correspond to other columns
- R1: base name of the fastq file for forward (R1) reads (can be gzipped), including extension
- R2: base name of the fastq file for reverse (R2) reads (can be gzipped), including extension
- N_forward: the 5'-3' DNA sequence corresponding to the fixed region upstream of the mutated sequence
- N_reverse: the 5'-3' DNA sequence corresponding to the fixed region 5' of the mutated sequence on the reverse strand
- Mutated_seq: the unique identifier for the mutated DNA sequence, should be the same for all samples in which the same sequence was mutated
- Pos_start: starting position in the protein sequence. If you've mutated several regions/fragments in a coding gene, this position should refer to the full-length protein sequence
- Replicate: e.g. "R1"
- Timepoint: "T0", "T1", etc. Intermediate timepoints are optional.
Finally, additional columns can be added by the user to specify what makes this sample unique. These are referred to as "sample attributes" and could correspond to the genetic background, the fragment/region of the gene if it applies, the drug used for selection, etc. In summary, a "sample" is any unique combination of sample attributes + Replicate + Timepoint and should be associated to 2 fastq files, for the forward and reverse reads, respectively. Sample attributes = attributes related to Mutated_seq + optional attributes.
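The layout rules above can be checked before launching the workflow. The following is a minimal sketch, not the workflow's own validation: the function name check_layout, the optional "Drug" attribute and the example rows are hypothetical, and the check only covers the mandatory column names and the uniqueness of each sample (sample attributes + Replicate + Timepoint).

```python
import csv
import io

# Mandatory columns named in the list above.
REQUIRED = ["Sample_name", "R1", "R2", "N_forward", "N_reverse",
            "Mutated_seq", "Pos_start", "Replicate", "Timepoint"]


def check_layout(handle, sample_attributes):
    """Return the layout rows after checking columns and sample uniqueness."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED + list(sample_attributes) if c not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = list(reader)
    seen = set()
    for row in rows:
        # A sample is one unique combination of attributes + Replicate + Timepoint.
        key = tuple(row[c] for c in list(sample_attributes) + ["Replicate", "Timepoint"])
        if key in seen:
            raise ValueError(f"duplicate sample: {key}")
        seen.add(key)
    return rows


# Hypothetical two-row layout with one optional attribute, "Drug".
HEADER = ("Sample_name,R1,R2,N_forward,N_reverse,"
          "Mutated_seq,Pos_start,Replicate,Timepoint,Drug\n")
example = io.StringIO(
    HEADER
    + "s1,s1_R1.fastq.gz,s1_R2.fastq.gz,ACGT,TTGA,F1,1,R1,T0,none\n"
    + "s2,s2_R1.fastq.gz,s2_R2.fastq.gz,ACGT,TTGA,F1,1,R1,T1,none\n"
)
rows = check_layout(example, sample_attributes=["Mutated_seq", "Drug"])
print(len(rows))
```

Two rows that share the same attributes, replicate and timepoint would raise a ValueError, which is usually a sign of a copy-paste error in the layout.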
Please provide a csv-formatted list of WT DNA sequences. The file should be named wt_seq.csv and be located in the config/project_files folder. Here is an example. The file should contain exactly the two following columns:
- Mutated_seq: all possible values for the Mutated_seq flag from the layout
- WT_seq: corresponding WT DNA sequence, assuming the first three bases constitute the first mutated codon
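To illustrate the reading-frame convention for WT_seq, here is a small sketch that splits a WT sequence into codons and numbers them from Pos_start. The function name wt_codons and the example sequence are hypothetical, and the requirement that the length be a multiple of 3 is an assumption made for this example rather than a rule stated by the workflow.

```python
def wt_codons(wt_seq, pos_start=1):
    """Map protein positions to WT codons, reading in frame from the first base."""
    # Assumption for this sketch: the sequence covers whole codons only.
    if len(wt_seq) % 3 != 0:
        raise ValueError("WT sequence length is not a multiple of 3")
    return {
        pos_start + i // 3: wt_seq[i:i + 3].upper()
        for i in range(0, len(wt_seq), 3)
    }


print(wt_codons("ATGGCTAAA", pos_start=10))
# {10: 'ATG', 11: 'GCT', 12: 'AAA'}
```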
To prevent any typing mistake, the genetic code is imported from a CoCoPUTs table (which also features codon frequencies, although the workflow does not make use of this). The one provided corresponds to Saccharomyces cerevisiae TAXID 559292. Please edit the main config file if you ever need to specify a different genetic code. Any csv-formatted file with at least two columns ("codon" and "aminoacid") should do.
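As a rough illustration of how such a table can be used, the sketch below loads a csv file with "codon" and "aminoacid" columns into a dictionary and translates a short sequence. The function names, the four-row table, the extra "note" column and the one-letter amino acid encoding (with "*" for stop) are assumptions made for this example; the actual table may encode amino acids differently.

```python
import csv
import io


def load_genetic_code(handle):
    """Build a codon -> amino acid mapping; extra columns are ignored."""
    reader = csv.DictReader(handle)
    return {row["codon"].upper(): row["aminoacid"] for row in reader}


def translate(dna, code):
    """Translate in frame from the first base, dropping any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(code[dna[i:i + 3].upper()] for i in range(0, usable, 3))


# Hypothetical excerpt; a real table lists all 64 codons.
table = io.StringIO(
    "codon,aminoacid,note\n"
    "ATG,M,x\n"
    "GCT,A,x\n"
    "AAA,K,x\n"
    "TAA,*,x\n"
)
code = load_genetic_code(table)
print(translate("ATGGCTAAA", code))  # MAK
```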
Normalization with the number of cellular generations is optional. Please set the corresponding parameter to True or False in the main config file (see section below). In either case, a csv-formatted file is generated automatically the first time the workflow is run (even if it is a dry run). If normalization is set to True in the config, you will be prompted to edit this file and add the number of cellular generations for each condition in the 'Nb_gen' column. Once the file is edited, re-run the workflow.
Please specify the codon mode, meaning the type of degenerate codons you introduced at each position in the specified loci. Currently supported are: "NNN" (default value) or "NNK". Make sure you adapt the main config file if necessary.
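The difference between the two codon modes can be made concrete by enumerating the codons each one allows. In IUPAC notation, N stands for any base and K for G or T, so "NNN" yields 64 codons and "NNK" yields 32. The helper name expand is hypothetical; this is a sketch of the notation, not the workflow's own code.

```python
from itertools import product

# IUPAC degenerate symbols used by the two supported codon modes.
ALPHABET = {"N": "ACGT", "K": "GT"}


def expand(codon_mode):
    """List every codon matching a degenerate pattern such as 'NNN' or 'NNK'."""
    return ["".join(bases) for bases in product(*(ALPHABET[s] for s in codon_mode))]


print(len(expand("NNN")), len(expand("NNK")))  # 64 32
```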
The main config file is located here. Please make sure to:
- select the samples to be processed (or leave 'all' if you want to process all samples)
- list your sample attributes
- replace all parameter values with the ones adapted for your project. Note: a first pass might be necessary to establish what would be a good read count threshold. Feel free to adjust it and re-run the workflow (if nothing else has changed, only the last steps should run again). This parameter is important because the "avg_scores" dataframe is built only upon "high confidence" variants, i.e. variants with a read count above the set threshold in all T0 replicates.
- set the "perform qc" parameter to True if you want to analyze your raw FASTQ with FastQC (and generate a MultiQC report)
- set the "normalize with gen" parameter to True if you want to normalize with the number of cellular generations
- set the "generate report" parameter to True if you want the HTML report to be automatically generated upon full completion of the workflow
- edit all directory/file paths if necessary
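The "high confidence" rule mentioned for the read count threshold can be sketched as follows. The data structure, the variant labels and the function name high_confidence are hypothetical, and "above the threshold" is read here as strictly greater than; the workflow itself may apply the comparison differently.

```python
def high_confidence(counts, threshold):
    """Keep variants whose read count exceeds the threshold in all T0 replicates.

    counts maps each variant to a {T0 replicate: read count} dictionary.
    """
    return sorted(
        variant
        for variant, reps in counts.items()
        if reps and all(n > threshold for n in reps.values())
    )


# Hypothetical T0 read counts for three variants in two replicates.
t0_counts = {
    "A10V": {"R1": 120, "R2": 95},
    "A10G": {"R1": 300, "R2": 4},
    "K12R": {"R1": 11, "R2": 10},
}
print(high_confidence(t0_counts, threshold=10))  # ['A10V']
```

A single low-count replicate is enough to exclude a variant, which is why a first pass over the data can help when choosing the threshold.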
Currently, all the following files are validated against a YAML schema to help spot formatting issues (misspelled column headers, missing mandatory properties, improper format, etc.): main config file, sample layout, file with WT DNA sequences, codon table, file with the number of cellular generations.
The file containing technical config parameters to run the snakemake pipeline on HPC is here. Apart from your email address (please replace <...>), this file needs little modification, and flags added to the snakemake command line will supersede the default values specified in the file. Be careful: by default, an email is sent every time a job fails. This is useful to catch TIMEOUT and MEM_OUT errors, but we recommend automatically redirecting emails to prevent inbox overflow.
Linting and formatting
Linting results
None
Formatting results
[DEBUG]
[DEBUG] In file "/tmp/tmpa14lmpct/durr1602-gyoza-912115b/workflow/rules/stats.smk": Formatted content is different from original
[DEBUG]
[DEBUG] In file "/tmp/tmpa14lmpct/durr1602-gyoza-912115b/workflow/rules/vsearch.smk": Formatted content is different from original
[DEBUG]
[DEBUG] In file "/tmp/tmpa14lmpct/durr1602-gyoza-912115b/workflow/Snakefile": Formatted content is different from original
[DEBUG]
[DEBUG] In file "/tmp/tmpa14lmpct/durr1602-gyoza-912115b/workflow/rules/cutadapt.smk": Formatted content is different from original
[DEBUG]
[DEBUG] In file "/tmp/tmpa14lmpct/durr1602-gyoza-912115b/workflow/rules/generate_mutants.smk": Formatted content is different from original
[DEBUG]
[DEBUG] In file "/tmp/tmpa14lmpct/durr1602-gyoza-912115b/workflow/rules/qc.smk": Formatted content is different from original
[DEBUG]
[DEBUG] In file "/tmp/tmpa14lmpct/durr1602-gyoza-912115b/workflow/rules/parse_fasta.smk": Formatted content is different from original
[DEBUG]
[DEBUG] In file "/tmp/tmpa14lmpct/durr1602-gyoza-912115b/workflow/rules/process_read_counts.smk": Formatted content is different from original
[DEBUG]
[DEBUG] In file "/tmp/tmpa14lmpct/durr1602-gyoza-912115b/workflow/rules/pandaseq.smk": Formatted content is different from original
[DEBUG]
[DEBUG] In file "/tmp/tmpa14lmpct/durr1602-gyoza-912115b/workflow/rules/common.smk": Formatted content is different from original
... (truncated)