ChaissonLab/SegDupAnnotation2

Count gene duplications.

Overview

Latest release: v1.3.2, Last update: 2024-11-25

Linting: failed, Formatting: failed

Deployment

Step 1: Install Snakemake and Snakedeploy

Snakemake and Snakedeploy are best installed via the Mamba package manager (a drop-in replacement for Conda). If you have neither Conda nor Mamba, installing Miniforge is recommended. More details regarding Mamba can be found in the Mamba documentation.

When using Mamba, run

mamba create -c conda-forge -c bioconda --name snakemake snakemake snakedeploy

to install both Snakemake and Snakedeploy in an isolated environment. For all following commands ensure that this environment is activated via

conda activate snakemake

Step 2: Deploy workflow

With Snakemake and Snakedeploy installed, the workflow can be deployed as follows. First, create an appropriate project working directory on your system and enter it:

mkdir -p path/to/project-workdir
cd path/to/project-workdir

In all following steps, we will assume that you are inside of that directory. Then run

snakedeploy deploy-workflow https://github.com/ChaissonLab/SegDupAnnotation2 . --tag v1.3.2

Snakedeploy will create two folders, workflow and config. The former contains the deployment of the chosen workflow as a Snakemake module; the latter contains configuration files, which will be modified in the next step to configure the workflow to your needs.

Step 3: Configure workflow

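To configure the workflow, adapt the configuration file in config/ to your needs following the instructions in the Configuration section below. For this workflow, that section names the file config/sd_analysis.json.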

Step 4: Run workflow

The deployment method is controlled using the --software-deployment-method (short --sdm) argument.

To run the workflow with automatic deployment of all required software via conda/mamba, use

snakemake --cores all --sdm conda

To run the workflow using a combination of conda and apptainer/singularity for software deployment, use

snakemake --cores all --sdm conda apptainer

Snakemake will automatically detect the main Snakefile in the workflow subfolder and execute the workflow module that has been defined by the deployment in step 2.

For further options such as cluster and cloud execution, see the Snakemake documentation.

Step 5: Generate report

After finalizing your data analysis, you can automatically generate an interactive visual HTML report for inspection of results together with parameters and code inside of the browser using

snakemake --report report.zip

Configuration

The following section is imported from the workflow’s config/README.md.

SegDupAnnotation2 General Configuration

This workflow requires a config/sd_analysis.json configuration file. Please modify it to your needs. It is validated against workflow/schemas/config.schema.json, which also contains example parameters.

Required Parameters

The key files required by this pipeline are an assembly file, a gene model (for instance, a RefSeq annotation file), and the PacBio reads BAM.

Parameter	Type	Description
species	String	The name, code, or identifier for the species or individual being processed.
reads	Array of Strings	List of paths to read files. Acceptable formats and extensions: bam, fastq.gz, fastq, fasta, and fa. (If none are provided, the workflow will run in 'bamless' mode and assume a read depth of 100 bp across the assembly. Output files will reflect this depth, and thus results may be confusing. For this reason, running this workflow without PacBio read BAMs is not advised.)
asm	String	File path to assembly of this species/individual's genome in fasta format. If using haplotype-aware mode, place haplotype 1 here.
genemodel	String	Fasta filepath of chosen gene model. Assumes fasta headers have no space characters.
temp	String	File path to an existing directory for temporary files.
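
As an illustration of the required parameters above, the following minimal Python sketch builds a configuration with only the five required keys, checks that none is missing, and writes it out as JSON. The helper name missing_required and all paths and values are hypothetical placeholders, and the key check is only a simplified stand-in for the workflow's own validation against workflow/schemas/config.schema.json.

```python
import json
import tempfile
from pathlib import Path

# Required keys, as listed in the Required Parameters table.
REQUIRED_KEYS = ("species", "reads", "asm", "genemodel", "temp")


def missing_required(config: dict) -> list[str]:
    """Return the required keys that are absent from the config (hypothetical helper)."""
    return [key for key in REQUIRED_KEYS if key not in config]


# All values below are hypothetical placeholders; replace them with your own.
config = {
    "species": "example_species",
    "reads": ["/path/to/reads_1.bam", "/path/to/reads_2.bam"],
    "asm": "/path/to/assembly.fasta",
    "genemodel": "/path/to/gene_model.fasta",
    "temp": "/path/to/existing/tmp_dir",  # must already exist
}

missing = missing_required(config)
if missing:
    raise SystemExit(f"Missing required parameters: {missing}")

# Written to a scratch directory here; in a real deployment the target
# would be config/sd_analysis.json inside the project working directory.
out_path = Path(tempfile.mkdtemp()) / "sd_analysis.json"
out_path.write_text(json.dumps(config, indent=2) + "\n")
print(out_path.read_text())
```

Because temp must point to an existing directory, it is worth creating that directory before launching the workflow.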

Optional Parameters

Parameter	Type	Default	Description
hap2	String	N/A	File path to haplotype 2 of this species/individual's genome in fasta format. If provided, will activate haplotype-aware mode. The contig/scaffold/chromosome names in the hap2 file must be unique relative to the provided asm/hap1 file.
read_type	String	N/A	Note PacBio read technology type (CLR vs CCS). Currently used for metadata purposes only.
haploid_chrs	Array of Strings	N/A	List of haploid/sex chromosome names in given assembly. Will only be used if flag_autodetect_haploid_chrs is not explicitly set to 'true'. Only helpful in haplotype-unaware mode to ensure collapsed duplications on haploid chromosomes are detected appropriately.
flag_autodetect_haploid_chrs	Boolean	false	When true, autodetects whether a chromosome is haploid or diploid to inform correct gene duplication accounting in rule F05_FindDups. Rule B05 detects haploid chromosomes by looking for chromosomes with half the mean depth of the assembly. Chromosomes smaller than 10 Mb will not be selected as haploid chromosomes. Only helpful in haplotype-unaware mode to ensure collapsed duplications on haploid chromosomes are detected appropriately.
bed_for_filtering_results	String	N/A	Bed file path of assembly coordinates to filter out of results.
flagger_components_to_filter	Array of Strings	N/A	Provide flagger bed file to bed_for_filtering_results option. Here list which flagger components to exclude from analysis. Reference https://github.com/mobinasri/flagger for component descriptions.
min_copy_identity	Number	0.90	Minimum gene copy identity to keep when the gene copy is compared to the original copy.
min_hit_length	Int	5000	Minimum hit length to keep in bases.
max_length_margin	Number	0.10	Keep gene copies with length within <max_length_margin> of the original gene's length.
min_gene_model_alignment	Number	0.75	Minimum percent alignment of gene model to gene copy as a decimal.
min_depth	Number	0.05	Minimum mean copy depth to keep as percentage of mean assembly depth.
flag_filt_single_exon_genes	Boolean	true	When true, keeps only genes with multiple exons.
flag_force_hmm	Boolean	false	When true, uses the hidden Markov copy number caller instead of samtools' mpileup to determine depth in 100 bp reads. This is not a recommended option, and may fail if no reads map to a single chromosome/scaffold.
flag_assume_clear_and_unique_gene_codes	Boolean	true	When false, assumes gene model fasta headers are in default RefSeq format, and thus renames all headers based on the gene symbol in parentheses at the end of the header line.
flag_filt_uncharacterized_genes	Boolean	true	When true, filters out genes in gene model with gene names beginning with `LOC`.
uncharacterized_gene_name_prefix	String	N/A	Gene model gene name prefix to filter out when flag_filt_uncharacterized_genes is true. Also used by Network Filter to deprioritize uncharacterized genes when picking a representative gene for a gene family.
cluster_mem_mb_baby	Int	1000	The memory in MB a cluster node or cpu must provide for a computationally simple job. In practice this parameter is combined with a cluster_cpus_per_task_<size> parameter by some rules to create a SLURM or other cluster job.
cluster_mem_mb_small	Int	6000	The memory in MB a cluster node or cpu must provide for a computationally simple job. In practice this parameter is combined with a cluster_cpus_per_task_<size> parameter by some rules to create a SLURM or other cluster job.
cluster_mem_mb_medium	Int	6000	The memory in MB a cluster node or cpu must provide for a computationally mild job. In practice this parameter is combined with a cluster_cpus_per_task_<size> parameter by some rules to create a SLURM or other cluster job.
cluster_mem_mb_large	Int	6000	The memory in MB a cluster node or cpu must provide for a computationally intense job. In practice this parameter is combined with a cluster_cpus_per_task_<size> parameter by some rules to create a SLURM or other cluster job.
cluster_mem_mb_xlarge	Int	6000	The memory in MB a cluster node or cpu must provide for a computationally intense job. In practice this parameter is combined with a cluster_cpus_per_task_<size> parameter by some rules to create a SLURM or other cluster job.
cluster_cpus_per_task_baby	Int	1	The number of cpus per task for a computationally simple rule. In practice this parameter is combined with a cluster_mem_mb_<size> parameter by some rules to create a SLURM or other cluster job.
cluster_cpus_per_task_small	Int	3	The number of cpus per task for a computationally mild rule. In practice this parameter is combined with a cluster_mem_mb_<size> parameter by some rules to create a SLURM or other cluster job.
cluster_cpus_per_task_medium	Int	3	The number of cpus per task for a computationally intense rule. In practice this parameter is combined with a cluster_mem_mb_<size> parameter by some rules to create a SLURM or other cluster job.
cluster_cpus_per_task_large	Int	3	The number of cpus per task for a computationally intense rule. In practice this parameter is combined with a cluster_mem_mb_<size> parameter by some rules to create a SLURM or other cluster job.
cluster_runtime_short	Int	240	The walltime in minutes allocated for rules expected to take a relatively short amount of time (like 4 hrs). This parameter is only used if called by snakemake's --slurm command line parameter.
cluster_runtime_long	Int	240	The walltime in minutes allocated for rules expected to take a relatively long amount of time (like 24 hrs). This parameter is only used if called by snakemake's --slurm command line parameter.
override_mem	Int	-1	Override the memory available in MB otherwise defined by the cluster_mem_mb_<size> parameters. If set to -1, the cluster_mem_mb_<size> parameters will not be overridden.
override_num_cores	Int	-1	Override the number of allocated cores otherwise defined by the cluster_cpus_per_task_<size> parameters. If set to -1, the cluster_cpus_per_task_<size> parameters will not be overridden.
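
To make the filtering thresholds in the table above more concrete, here is a minimal Python sketch of how min_copy_identity, min_hit_length, and max_length_margin could combine into a keep/discard decision for a single gene copy. It is not the workflow's actual code: the GeneCopy class and passes_filters function are hypothetical, the comparisons are assumed to be inclusive, and max_length_margin is assumed to be a fraction of the original gene's length.

```python
from dataclasses import dataclass


@dataclass
class GeneCopy:
    """Hypothetical record for one candidate gene copy."""
    identity: float       # identity to the original copy, as a fraction in [0, 1]
    length: int           # length of this copy in bases
    original_length: int  # length of the original gene in bases


def passes_filters(
    copy: GeneCopy,
    min_copy_identity: float = 0.90,
    min_hit_length: int = 5000,
    max_length_margin: float = 0.10,
) -> bool:
    """Return True if the copy satisfies all three thresholds (assumed semantics)."""
    if copy.identity < min_copy_identity:
        return False
    if copy.length < min_hit_length:
        return False
    # Assumption: the margin is relative to the original gene's length.
    allowed_difference = max_length_margin * copy.original_length
    return abs(copy.length - copy.original_length) <= allowed_difference


candidates = [
    GeneCopy(identity=0.95, length=10_000, original_length=10_500),  # within all thresholds
    GeneCopy(identity=0.85, length=10_000, original_length=10_000),  # identity too low
    GeneCopy(identity=0.95, length=4_000, original_length=4_000),    # hit too short
    GeneCopy(identity=0.95, length=8_000, original_length=10_000),   # length differs by 20%
]

for candidate in candidates:
    print(candidate, "->", "keep" if passes_filters(candidate) else "discard")
```

The defaults in the function signature mirror the defaults listed in the table, so passing values read from the configuration file would override them in the same way.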

Linting and formatting

Linting results

FileNotFoundError in file /tmp/tmpdo4bac8g/ChaissonLab-SegDupAnnotation2-98f12df/workflow/Snakefile, line 64:
[Errno 2] No such file or directory: '/scratch1/krabbani/sda_tmp/mSunEtr1_38m59m9f_tmp'
  File "/tmp/tmpdo4bac8g/ChaissonLab-SegDupAnnotation2-98f12df/workflow/Snakefile", line 64, in <module>
  File "/home/runner/micromamba/envs/snakemake-workflow-catalog/lib/python3.12/tempfile.py", line 384, in mkdtemp

Formatting results

[DEBUG] 
[DEBUG] In file "/tmp/tmpdo4bac8g/ChaissonLab-SegDupAnnotation2-98f12df/workflow/rules/G_summarize_results.smk":  Formatted content is different from original
[DEBUG] 
[DEBUG] In file "/tmp/tmpdo4bac8g/ChaissonLab-SegDupAnnotation2-98f12df/workflow/rules/F_combine_genes_and_depth.smk":  Formatted content is different from original
[DEBUG] 
[DEBUG] In file "/tmp/tmpdo4bac8g/ChaissonLab-SegDupAnnotation2-98f12df/workflow/rules/E_locate_resolved_gene_copies.smk":  Formatted content is different from original
[DEBUG] 
[DEBUG] In file "/tmp/tmpdo4bac8g/ChaissonLab-SegDupAnnotation2-98f12df/workflow/rules/B_bamless.smk":  Formatted content is different from original
[DEBUG] 
[DEBUG] In file "/tmp/tmpdo4bac8g/ChaissonLab-SegDupAnnotation2-98f12df/workflow/rules/A_process_reads.smk":  Formatted content is different from original
[DEBUG] 
[DEBUG] In file "/tmp/tmpdo4bac8g/ChaissonLab-SegDupAnnotation2-98f12df/workflow/rules/A_hapAware_process_reads.smk":  Formatted content is different from original
[DEBUG] 
[DEBUG] In file "/tmp/tmpdo4bac8g/ChaissonLab-SegDupAnnotation2-98f12df/workflow/rules/D_locate_resolved_gene_originals.smk":  Formatted content is different from original
[DEBUG] 
[DEBUG] In file "/tmp/tmpdo4bac8g/ChaissonLab-SegDupAnnotation2-98f12df/workflow/rules/C_process_gene_model.smk":  Formatted content is different from original
[DEBUG] 
[WARNING] In file "/tmp/tmpdo4bac8g/ChaissonLab-SegDupAnnotation2-98f12df/workflow/Snakefile":  Keyword "input" at line 88 has comments under a value.
	PEP8 recommends block comments appear before what they describe
(see https://www.python.org/dev/peps/pep-0008/#id30)

... (truncated)