lacklab/MuSTARRd

(Analysis of Saturation) Mu(tagenesis) STARR(seq) d(ata)

Overview

Topics: bioinformatics sequencing starrseq ngs snakemake

Latest release: None, Last update: 2023-02-15

Linting: failed, Formatting: passed

Deployment

Step 1: Install Snakemake and Snakedeploy

Snakemake and Snakedeploy are best installed via the Mamba package manager (a drop-in replacement for Conda). If you have neither Conda nor Mamba, installing Miniforge is recommended. More details are available in the Mamba documentation.

When using Mamba, run

mamba create -c conda-forge -c bioconda --name snakemake snakemake snakedeploy

to install both Snakemake and Snakedeploy in an isolated environment. For all following commands ensure that this environment is activated via

conda activate snakemake

Step 2: Deploy workflow

With Snakemake and Snakedeploy installed, the workflow can be deployed as follows. First, create an appropriate project working directory on your system and enter it:

mkdir -p path/to/project-workdir
cd path/to/project-workdir

In all following steps, we will assume that you are inside of that directory. Then run

snakedeploy deploy-workflow https://github.com/lacklab/MuSTARRd . --tag None

Snakedeploy will create two folders, workflow and config. The former contains the deployment of the chosen workflow as a Snakemake module, the latter contains configuration files which will be modified in the next step in order to configure the workflow to your needs.

Step 3: Configure workflow

To configure the workflow, adapt config/config.yml to your needs following the instructions below.

Step 4: Run workflow

The deployment method is controlled using the --software-deployment-method (short --sdm) argument.

To run the workflow with automatic deployment of all required software via conda/mamba, use

snakemake --cores all --sdm conda

Snakemake will automatically detect the main Snakefile in the workflow subfolder and execute the workflow module that has been defined by the deployment in step 2.

For further options such as cluster and cloud execution, see the Snakemake documentation.

Step 5: Generate report

After finalizing your data analysis, you can automatically generate an interactive visual HTML report for inspection of results together with parameters and code inside of the browser using

snakemake --report report.zip

Configuration

The following section is imported from the workflow’s config/README.md.

In order to configure your analysis, make changes to config.yaml.

  1. SAMPLES: A tab-separated file specifying the samples, as in the following example:
sample   condition  type
ETOH.R1  EtOH       RNA
ETOH.R2  EtOH       RNA
ETOH.R3  EtOH       RNA
DHT.R1   DHT        RNA
DHT.R2   DHT        RNA
DHT.R3   DHT        RNA
INPUT               DNA

sample: Sample name

condition: Treatment condition (blank if none; blank for DNA)

type: Sample type (RNA or DNA)
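The column rules above can be checked before starting a run. The following is a minimal sketch, not part of the workflow: the function name `read_samples` is hypothetical, and the inline table is a shortened copy of the example above with real tab characters.

```python
import csv
import io

# Shortened copy of the example table; the DNA input row has an empty condition.
SAMPLES_TSV = (
    "sample\tcondition\ttype\n"
    "ETOH.R1\tEtOH\tRNA\n"
    "DHT.R1\tDHT\tRNA\n"
    "INPUT\t\tDNA\n"
)


def read_samples(handle):
    """Parse a samples table and enforce the documented column rules."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    for row in rows:
        if row["type"] not in {"RNA", "DNA"}:
            raise ValueError(f"{row['sample']}: type must be RNA or DNA")
        if row["type"] == "DNA" and row["condition"]:
            raise ValueError(f"{row['sample']}: DNA samples need a blank condition")
    return rows


samples = read_samples(io.StringIO(SAMPLES_TSV))
for row in samples:
    print(row["sample"], repr(row["condition"]), row["type"])
```

To check a real file, pass an open file handle instead of the `io.StringIO` wrapper.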

  2. UNITS: A tab-separated file specifying the units (replicates or lanes), as in the following example:
sample unit fq1 fq2
ETOH.R1 1 reads/EM_LNCaP_R1_EtOH_L1_1.fq.gz reads/EM_LNCaP_R1_EtOH_L1_2.fq.gz
ETOH.R2 1 reads/EM_LNCaP_R2_EtOH_L1_1.fq.gz reads/EM_LNCaP_R2_EtOH_L1_2.fq.gz
ETOH.R3 1 reads/EM_LNCaP_R3_EtOH_L1_1.fq.gz reads/EM_LNCaP_R3_EtOH_L1_2.fq.gz
ETOH.R3 2 reads/EM_LNCaP_R3_EtOH_L2_1.fq.gz reads/EM_LNCaP_R3_EtOH_L2_2.fq.gz
DHT.R1 1 reads/EM_LNCaP_R1_DHT_L1_1.fq.gz reads/EM_LNCaP_R1_DHT_L1_2.fq.gz
DHT.R2 1 reads/EM_LNCaP_R2_DHT_L1_1.fq.gz reads/EM_LNCaP_R2_DHT_L1_2.fq.gz
DHT.R3 1 reads/EM_LNCaP_R3_DHT_L1_1.fq.gz reads/EM_LNCaP_R3_DHT_L1_2.fq.gz
INPUT 1 reads/NL18_L2_1.fq.gz reads/NL18_L2_2.fq.gz

sample: Sample name (same as in samples.tsv)

unit: Unit number

fq1: Path to the 1st FASTQ file

fq2: Path to the 2nd FASTQ file
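A sample may span several units, as ETOH.R3 does in the example. The sketch below groups units by sample and rejects rows whose sample is unknown or whose unit is duplicated. The helper `group_units` is hypothetical and only illustrates the table layout; it is not how the workflow itself reads the file.

```python
import csv
import io
from collections import defaultdict

# Three rows taken from the example table, with real tab separators.
UNITS_TSV = (
    "sample\tunit\tfq1\tfq2\n"
    "ETOH.R3\t1\treads/EM_LNCaP_R3_EtOH_L1_1.fq.gz\treads/EM_LNCaP_R3_EtOH_L1_2.fq.gz\n"
    "ETOH.R3\t2\treads/EM_LNCaP_R3_EtOH_L2_1.fq.gz\treads/EM_LNCaP_R3_EtOH_L2_2.fq.gz\n"
    "INPUT\t1\treads/NL18_L2_1.fq.gz\treads/NL18_L2_2.fq.gz\n"
)


def group_units(handle, known_samples):
    """Map each sample to {unit: (fq1, fq2)}, checking names against the samples table."""
    units = defaultdict(dict)
    for row in csv.DictReader(handle, delimiter="\t"):
        sample, unit = row["sample"], row["unit"]
        if sample not in known_samples:
            raise ValueError(f"unit row refers to unknown sample {sample!r}")
        if unit in units[sample]:
            raise ValueError(f"duplicate unit {unit!r} for sample {sample!r}")
        units[sample][unit] = (row["fq1"], row["fq2"])
    return dict(units)


units = group_units(io.StringIO(UNITS_TSV), known_samples={"ETOH.R3", "INPUT"})
for sample, by_unit in units.items():
    print(sample, sorted(by_unit))
```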

  3. PRIMERS: A tab-separated file specifying the regions of analysis, with the following columns (and example row):
Region name Forward primer Reverse primer Chromosome Start End
overlapped_read_114 AGCGCGGCTTAGTGA TACCAGGAGACTATTTCCAACA chr8 6456903 6457213

The file must not contain a header row. Primers are given 5' to 3', and positions are BED-like (0-based).
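As an illustration of the coordinate convention, the sketch below parses the example row and computes the region length. It assumes the usual BED interpretation (0-based start, end not included), which the text above only describes as "BED-like"; the `Region` type and `read_primers` helper are hypothetical.

```python
import csv
import io
from typing import NamedTuple


class Region(NamedTuple):
    name: str
    fwd_primer: str
    rev_primer: str
    chrom: str
    start: int  # 0-based
    end: int    # assumed exclusive, as in BED

    @property
    def length(self):
        # With a half-open interval the length is simply end - start.
        return self.end - self.start


# The example row from above, without a header line.
PRIMERS_TSV = (
    "overlapped_read_114\tAGCGCGGCTTAGTGA\tTACCAGGAGACTATTTCCAACA\tchr8\t6456903\t6457213\n"
)


def read_primers(handle):
    """Parse a headerless primers table into Region records."""
    regions = []
    for name, fwd, rev, chrom, start, end in csv.reader(handle, delimiter="\t"):
        region = Region(name, fwd, rev, chrom, int(start), int(end))
        if not 0 <= region.start < region.end:
            raise ValueError(f"{name}: start must be >= 0 and smaller than end")
        regions.append(region)
    return regions


regions = read_primers(io.StringIO(PRIMERS_TSV))
print(regions[0].name, regions[0].length)  # overlapped_read_114 310
```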

  4. REF

    1. FA: Path to the FASTA file of the reference genome
    2. BWA_IDX: Path to the BWA index files (prefix)
    3. FIXED: Path to the FASTA file matching the wild type plasmids (or the same as FA)
  5. SEQ

    1. UMI1_LEN: Length of the 5' UMI
    2. UMI2_LEN: Length of the 3' UMI
    3. EXPECTED_TLEN: Template (insert) length of the plasmids (excluding UMIs but including primers)
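The settings listed above can be checked for completeness once the configuration file has been loaded into a dictionary. This is a rough sketch under two assumptions: REF and SEQ are nested mappings (as the list layout suggests), and SAMPLES, UNITS and PRIMERS hold file paths. All paths and numbers in `example_config` are placeholders, not values from the workflow.

```python
# Required keys, following the nesting suggested by the list above.
REQUIRED_KEYS = {
    "SAMPLES": (),
    "UNITS": (),
    "PRIMERS": (),
    "REF": ("FA", "BWA_IDX", "FIXED"),
    "SEQ": ("UMI1_LEN", "UMI2_LEN", "EXPECTED_TLEN"),
}


def missing_keys(config):
    """Return dotted names of required keys that are absent from config."""
    missing = []
    for top, children in REQUIRED_KEYS.items():
        if top not in config:
            missing.append(top)
            continue
        for child in children:
            if child not in config[top]:
                missing.append(f"{top}.{child}")
    return missing


def template_length_with_umis(seq):
    """EXPECTED_TLEN excludes the UMIs, so add both UMI lengths back."""
    return seq["UMI1_LEN"] + seq["EXPECTED_TLEN"] + seq["UMI2_LEN"]


# Placeholder values for illustration only.
example_config = {
    "SAMPLES": "config/samples.tsv",
    "UNITS": "config/units.tsv",
    "PRIMERS": "config/primers.tsv",
    "REF": {"FA": "ref/genome.fa", "BWA_IDX": "ref/genome.fa", "FIXED": "ref/genome.fa"},
    "SEQ": {"UMI1_LEN": 10, "UMI2_LEN": 10, "EXPECTED_TLEN": 310},
}

print(missing_keys(example_config))
print(template_length_with_umis(example_config["SEQ"]))
```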

Linting and formatting

Linting results

Lints for rule merge_DNA_reads (line 1, /tmp/tmpew7ma_mb/workflow/rules/association.smk):
    * Do not access input and output files individually by index in shell commands:
      When individual access to input or output files is needed (i.e., just
      writing '{input}' is impossible), use names ('{input.somename}') instead
      of index based access.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#rules
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule map_DNA_reads (line 32, /tmp/tmpew7ma_mb/workflow/rules/association.smk):
    * Do not access input and output files individually by index in shell commands:
      When individual access to input or output files is needed (i.e., just
      writing '{input}' is impossible), use names ('{input.somename}') instead
      of index based access.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#rules
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule merge_DNA_treplicates (line 75, /tmp/tmpew7ma_mb/workflow/rules/association.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule remove_DNA_indels (line 118, /tmp/tmpew7ma_mb/workflow/rules/association.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule filter_DNA_reads (line 161, /tmp/tmpew7ma_mb/workflow/rules/association.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule reformat_DNA_reads_to_fastq (line 204, /tmp/tmpew7ma_mb/workflow/rules/association.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule move_UMIs_seq2qname (line 235, /tmp/tmpew7ma_mb/workflow/rules/association.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule map_qname_UMI (line 266, /tmp/tmpew7ma_mb/workflow/rules/association.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule get_UMIs (line 306, /tmp/tmpew7ma_mb/workflow/rules/association.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule mask_anchors (line 353, /tmp/tmpew7ma_mb/workflow/rules/association.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule count_UMIs (line 392, /tmp/tmpew7ma_mb/workflow/rules/association.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule count_UMIs_per_region (line 427, /tmp/tmpew7ma_mb/workflow/rules/association.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule get_read_per_BC (line 462, /tmp/tmpew7ma_mb/workflow/rules/association.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule get_read_per_mut (line 494, /tmp/tmpew7ma_mb/workflow/rules/association.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule apply_UMI_filters (line 536, /tmp/tmpew7ma_mb/workflow/rules/association.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule parse_mutations (line 563, /tmp/tmpew7ma_mb/workflow/rules/association.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule map_RNA_reads (line 1, /tmp/tmpew7ma_mb/workflow/rules/count.smk):
    * Do not access input and output files individually by index in shell commands:
      When individual access to input or output files is needed (i.e., just
      writing '{input}' is impossible), use names ('{input.somename}') instead
      of index based access.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#rules
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule merge_RNA_treplicates (line 44, /tmp/tmpew7ma_mb/workflow/rules/count.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule remove_RNA_indels (line 87, /tmp/tmpew7ma_mb/workflow/rules/count.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule filter_RNA_reads (line 134, /tmp/tmpew7ma_mb/workflow/rules/count.smk):
    * No log directive defined:
      Without a log directive, all output will be printed to the terminal. In
      distributed environments, this means that errors are harder to discover.
      In local environments, output of concurrent jobs will be mixed and become
      unreadable.
      Also see:
      https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files

Lints for rule fix_and_sort_RNA_reads (line 177, /tmp/tmpew7ma_mb/workflow/rules/count.smk):
    * No log directive defined:

... (truncated)
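Both lint messages point at the same two habits: referring to files by name rather than by index, and giving each rule a log directive. The fragment below is a hypothetical rule written in Snakemake's rule syntax, not a rule from this workflow. The rule name, file paths and the `cat` command are placeholders chosen only to show the pattern.

```python
# Snakefile fragment (run with snakemake, not with plain python).
# Hypothetical rule for illustration; not taken from the MuSTARRd workflow.
rule example_merge_reads:
    input:
        # Named inputs replace index access such as {input[0]} and {input[1]}.
        r1="reads/{sample}_1.fq.gz",
        r2="reads/{sample}_2.fq.gz",
    output:
        merged="results/{sample}.merged.fq.gz",
    log:
        # A per-job log file keeps stderr of concurrent jobs apart.
        "logs/example_merge_reads/{sample}.log",
    shell:
        "cat {input.r1} {input.r2} > {output.merged} 2> {log}"
```

With a rule written this way, neither of the two lints above should be reported for it.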

Formatting results

All tests passed!