tahiri-lab/aPhyloGeo-pipeline

Phylogeographic workflow using sliding-windows, RAxML-NG and FastTree

Overview

Topics: fasttree neighbor-joining phylogenetics phylogeography raxml robinson-foulds sliding-windows workflow

Latest release: v1.0, Last update: 2023-06-01

Linting: passed, Formatting: failed

Deployment

Step 1: Install Snakemake and Snakedeploy

Snakemake and Snakedeploy are best installed via the Mamba package manager (a drop-in replacement for conda). If you have neither Conda nor Mamba, it is recommended to install Miniforge. More details can be found in the Mamba documentation.

When using Mamba, run

mamba create -c conda-forge -c bioconda --name snakemake snakemake snakedeploy

to install both Snakemake and Snakedeploy in an isolated environment. For all following commands ensure that this environment is activated via

conda activate snakemake

Step 2: Deploy workflow

With Snakemake and Snakedeploy installed, the workflow can be deployed as follows. First, create an appropriate project working directory on your system and enter it:

mkdir -p path/to/project-workdir
cd path/to/project-workdir

In all following steps, we will assume that you are inside of that directory. Then run

snakedeploy deploy-workflow https://github.com/tahiri-lab/aPhyloGeo-pipeline . --tag v1.0

Snakedeploy will create two folders, workflow and config. The former contains the deployment of the chosen workflow as a Snakemake module, the latter contains configuration files which will be modified in the next step in order to configure the workflow to your needs.

Step 3: Configure workflow

To configure the workflow, adapt config/config.yml to your needs following the instructions below.

Step 4: Run workflow

The deployment method is controlled using the --software-deployment-method (short --sdm) argument.

To run the workflow with automatic deployment of all required software via conda/mamba, use

snakemake --cores all --sdm conda

Snakemake will automatically detect the main Snakefile in the workflow subfolder and execute the workflow module that has been defined by the deployment in step 2.

For further options such as cluster and cloud execution, see the Snakemake documentation.

Step 5: Generate report

After finalizing your data analysis, you can automatically generate an interactive visual HTML report for inspection of results together with parameters and code inside of the browser using

snakemake --report report.zip

Configuration

The following section is imported from the workflow’s config/README.md.

Snakemake workflow: aPhyloGeo

A Snakemake workflow for phylogeographic analysis.

aPhyloGeo is a user-friendly, scalable, reproducible, and comprehensive workflow that can explore the correlation between specific genes (or gene segments) and environmental factors.

Dependencies

Python
Conda - package/environment management system
Snakemake - workflow management system

The workflow includes the following Python packages:

The workflow includes the following bioinformatics tools:

The software dependencies can be found in the conda environment files: [1] and [2].

Usage

1. Clone this repo.

git clone https://github.com/tahiri-lab/aPhyloGeo-pipeline.git
cd aPhyloGeo-pipeline

2. Install dependencies.

2.1 If you do not have Conda installed, install it as follows. If Conda is already installed, skip to step 2.2.

# download the Miniconda3 installer
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
# install Conda (answer 'yes' when prompted)
bash miniconda.sh
# update Conda
conda update -y conda

2.2 Create a conda environment named aPhyloGeo and install all the dependencies in that environment.

# create a new environment with dependencies 
conda env create -n aPhyloGeo -f environment.yaml

2.3 Activate the environment

conda activate aPhyloGeo

3. Configure the workflow.

config file:
- config.yaml - analysis-specific settings (e.g., bootstrap_threshold, rf_threshold, step_size, window_size, data_type etc.)
  Note: Set the parameters and thresholds in config.yaml according to your research needs by modifying the corresponding values. Do not change the parameter names or file names.
- Thresholds in config.yaml:
  - bootstrap_threshold: Only sliding windows with bootstrap values greater than the user-set bootstrap_threshold will be written to the output file.
  - rf_threshold: The tree distance between each combination of sliding windows and environmental features will be calculated. Only sliding windows with a Robinson–Foulds (RF) distance below the user-set rf_threshold will be written to the output file.
- params in config.yaml:
  - data_type: aa for an amino acid dataset (case-insensitive); any other value is treated as a nucleotide dataset (default).
  - step_size: the step size of the sliding window (bp)
  - window_size: the size of the sliding window (bp)
  - strategy: two alternative algorithms are provided for constructing the phylogenetic trees, RAxML-NG and FastTree. Use fasttree for the FastTree strategy (case-insensitive); any other value is treated as the RAxML-NG strategy (default).
  - geo_file: the path of the input file containing the environmental data (.csv)
  - seq_file: the path of the input file containing the multiple sequence alignment (.fasta)
    Note: If you use a relative path for an input file, it must be relative to the aPhyloGeo-pipeline directory (i.e., the working directory is assumed to be that of the workflow).
  - specimen_id: the name of the column containing the sample id in geo_file
  - feature_names: the names of the columns in geo_file corresponding to the environmental factors to include in the analysis
    Note: Each column name goes on a separate line; keep the "-" in front of it.
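
To make the parameter semantics above concrete, here is a minimal Python sketch. It is illustrative only and is not taken from the workflow's source: the helper names resolve_options and window_ranges and the internal label "nt" are hypothetical, and it assumes that a trailing window shorter than window_size is dropped, which the workflow may handle differently.

```python
def resolve_options(config):
    """Apply the documented defaults to data_type and strategy.

    "nt" is a hypothetical label used here for the nucleotide default.
    """
    data_type = "aa" if str(config.get("data_type", "")).lower() == "aa" else "nt"
    is_fasttree = str(config.get("strategy", "")).lower() == "fasttree"
    strategy = "fasttree" if is_fasttree else "raxml-ng"
    return data_type, strategy


def window_ranges(alignment_length, window_size, step_size):
    """Return 0-based, end-exclusive (start, end) positions of each window.

    Assumption: a trailing window shorter than window_size is dropped.
    """
    if window_size <= 0 or step_size <= 0:
        raise ValueError("window_size and step_size must be positive")
    last_start = alignment_length - window_size
    return [(s, s + window_size) for s in range(0, last_start + 1, step_size)]


# Hypothetical configuration values, for illustration only.
config = {"data_type": "AA", "strategy": "FastTree", "window_size": 10, "step_size": 5}
print(resolve_options(config))   # ('aa', 'fasttree')
print(window_ranges(25, config["window_size"], config["step_size"]))
# [(0, 10), (5, 15), (10, 20), (15, 25)]
```

With a 25-column alignment, a window_size of 10 and a step_size of 5, four overlapping windows are produced under these assumptions; a smaller step_size yields more windows and therefore more trees to build.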
input files:
- example data files for protein analysis:
  - align_p.fa - Multiple Sequence Alignment for protein sequences in FASTA format (5 samples).
  - geo_p.csv - Environmental data corresponding to sequencing samples (5 samples).
- example data files for nucleotide analysis:
  - align.fa - Multiple Sequence Alignment for nucleotide sequences in FASTA format (5 samples).
  - geo.csv - Environmental data corresponding to sequencing samples (5 samples).
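
Before running the workflow on your own data, it can help to confirm that the alignment and the environmental table describe the same samples. The following is a minimal sketch using only the Python standard library. The function names, the column names (id, temperature, humidity) and the inline data are hypothetical and are not taken from the example files, and it assumes that the FASTA headers carry the same identifiers as the specimen_id column.

```python
import csv
import io


def fasta_ids(handle):
    """Collect sequence identifiers (first word after '>') from FASTA text."""
    return [
        line[1:].split()[0]
        for line in handle
        if line.startswith(">") and line[1:].strip()
    ]


def check_inputs(fasta_handle, csv_handle, specimen_id, feature_names):
    """Return a list of problems; an empty list means none were found."""
    problems = []
    reader = csv.DictReader(csv_handle)
    columns = reader.fieldnames or []
    for name in [specimen_id, *feature_names]:
        if name not in columns:
            problems.append(f"column '{name}' is missing from the geo file")
    if specimen_id in columns:
        geo_ids = {row[specimen_id] for row in reader}
        seq_ids = set(fasta_ids(fasta_handle))
        # Assumption: FASTA headers match the values of the specimen_id column.
        for sample in sorted(seq_ids ^ geo_ids):
            problems.append(f"sample '{sample}' is not present in both files")
    return problems


# Hypothetical inline data standing in for seq_file and geo_file.
fasta = io.StringIO(">s1\nACGT\n>s2\nACGA\n")
geo = io.StringIO("id,temperature\ns1,12.5\ns3,9.1\n")
for problem in check_inputs(fasta, geo, "id", ["temperature", "humidity"]):
    print(problem)
```

For real files, the two io.StringIO objects would be replaced by open file handles for the paths given as seq_file and geo_file.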
output files:
- (filtered) sliding windows with Robinson–Foulds (RF) distance values below the user-set threshold and bootstrap values greater than the user-set threshold in .csv (comma-separated values files).
- .csv and related metadata will be stored in the 'results' directory.
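
The two thresholds act as a joint filter on the sliding windows. The sketch below illustrates this with a small unrooted Robinson–Foulds computation on trees written as nested tuples, followed by a filter that keeps a window only if its bootstrap value is strictly greater than bootstrap_threshold and its RF distance is strictly below rf_threshold. It is an illustration under stated assumptions, not the workflow's implementation: the function names, the tuple representation, the use of an unnormalised RF distance, and all values are hypothetical.

```python
def bipartitions(tree):
    """Return the non-trivial leaf bipartitions of a tree given as nested tuples."""
    splits = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaves(child) for child in node))

    all_leaves = leaves(tree)

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        other = all_leaves - below
        # Skip trivial splits (a single leaf, or everything, on one side).
        if len(below) > 1 and len(other) > 1:
            splits.add(frozenset([below, other]))
        return below

    visit(tree)
    return splits


def rf_distance(tree_a, tree_b):
    """Unnormalised RF distance: number of splits found in exactly one tree."""
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


def keep_window(bootstrap, rf, bootstrap_threshold, rf_threshold):
    """Apply both thresholds with strict inequalities."""
    return bootstrap > bootstrap_threshold and rf < rf_threshold


# Hypothetical trees: one from an environmental feature, two from sliding windows.
reference = (("A", "B"), ("C", ("D", "E")))
windows = {
    "0_10": ((("A", "B"), ("C", ("D", "E"))), 92.0),
    "5_15": ((("A", "C"), ("B", ("D", "E"))), 95.0),
}
for name, (tree, bootstrap) in windows.items():
    rf = rf_distance(reference, tree)
    print(name, rf, keep_window(bootstrap, rf, bootstrap_threshold=90, rf_threshold=1))
# 0_10 0 True
# 5_15 2 False
```

In this example the second window has the higher bootstrap value but is still discarded, because its topology differs from the reference tree by two splits.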

4. Execute the workflow.

Locally

run workflow

# If you are in a conda environment where all dependencies are already installed,
# you need to specify the maximum number of CPU cores to be used at the same time.
# If you want to use N cores, say --cores N or -cN.
snakemake --cores all

Even if you have not created and activated the conda environment as described in 2.2 and 2.3, you can still run the workflow successfully with '--use-conda'. Snakemake will create the required conda environment for you.

# You need to specify the maximum number of CPU cores to be used at the same time.
# If you want to use N cores, say --cores N or -cN.
# For all cores on your system (be sure that this is appropriate) use --cores all.
snakemake --use-conda --cores all

Other features available

# dry run: only checks input/output files
snakemake -n
# dry run that also prints the shell commands
snakemake -np
# force Snakemake to run the jobs; by default, Snakemake will not run if it thinks the pipeline does not need updating
snakemake -F

Citation

A manuscript for aPhyloGeo-pipeline is in preparation.

Contact

Please email us at Nadia.Tahiri@USherbrooke.ca with any questions or feedback.

Linting and formatting

Linting results

None

Formatting results

[DEBUG] 
[DEBUG] In file "/tmp/tmp89gaejl8/tahiri-lab-aPhyloGeo-pipeline-1a5dc29/workflow/rules/phyloGeo.smk":  Formatted content is different from original
[DEBUG] 
[DEBUG] In file "/tmp/tmp89gaejl8/tahiri-lab-aPhyloGeo-pipeline-1a5dc29/workflow/rules/slidingWindows1a.smk":  Formatted content is different from original
[DEBUG] 
[DEBUG] In file "/tmp/tmp89gaejl8/tahiri-lab-aPhyloGeo-pipeline-1a5dc29/workflow/rules/rf_phyML.smk":  Formatted content is different from original
[DEBUG] 
[DEBUG] In file "/tmp/tmp89gaejl8/tahiri-lab-aPhyloGeo-pipeline-1a5dc29/workflow/rules/phyML2.smk":  Formatted content is different from original
[DEBUG] 
[DEBUG] In file "/tmp/tmp89gaejl8/tahiri-lab-aPhyloGeo-pipeline-1a5dc29/workflow/Snakefile":  Formatted content is different from original
[DEBUG] 
[DEBUG] In file "/tmp/tmp89gaejl8/tahiri-lab-aPhyloGeo-pipeline-1a5dc29/workflow/rules/refTrees1b.smk":  Formatted content is different from original
[DEBUG] 
[DEBUG] In file "/tmp/tmp89gaejl8/tahiri-lab-aPhyloGeo-pipeline-1a5dc29/workflow/rules/common.smk":  Formatted content is different from original
[INFO] 7 file(s) would be changed 😬

snakefmt version: 0.8.4