tahiri-lab/aPhyloGeo-pipeline
Phylogeographic workflow using sliding-windows, RAxML-NG and FastTree
Overview
Topics: fasttree neighbor-joining phylogenetics phylogeography raxml robinson-foulds sliding-windows workflow
Latest release: v1.0, Last update: 2023-06-01
Linting: passed, Formatting: failed
Deployment
Step 1: Install Snakemake and Snakedeploy
Snakemake and Snakedeploy are best installed via the Mamba package manager (a drop-in replacement for conda). If you have neither Conda nor Mamba, it is recommended to install Miniforge. More details regarding Mamba can be found here.
When using Mamba, run
mamba create -c conda-forge -c bioconda --name snakemake snakemake snakedeploy
to install both Snakemake and Snakedeploy in an isolated environment. For all following commands ensure that this environment is activated via
conda activate snakemake
Step 2: Deploy workflow
With Snakemake and Snakedeploy installed, the workflow can be deployed as follows. First, create an appropriate project working directory on your system and enter it:
mkdir -p path/to/project-workdir
cd path/to/project-workdir
In all following steps, we will assume that you are inside of that directory. Then run
snakedeploy deploy-workflow https://github.com/tahiri-lab/aPhyloGeo-pipeline . --tag v1.0
Snakedeploy will create two folders, workflow and config. The former contains the deployment of the chosen workflow as a Snakemake module, the latter contains configuration files which will be modified in the next step in order to configure the workflow to your needs.
Step 3: Configure workflow
To configure the workflow, adapt config/config.yml to your needs following the instructions below.
Step 4: Run workflow
The deployment method is controlled using the --software-deployment-method (short --sdm) argument.
To run the workflow with automatic deployment of all required software via conda/mamba, use
snakemake --cores all --sdm conda
Snakemake will automatically detect the main Snakefile in the workflow subfolder and execute the workflow module that has been defined by the deployment in step 2.
For further options such as cluster and cloud execution, see the docs.
Step 5: Generate report
After finalizing your data analysis, you can automatically generate an interactive visual HTML report for inspection of results together with parameters and code inside of the browser using
snakemake --report report.zip
Configuration
The following section is imported from the workflow's config/README.md.
Snakemake workflow: aPhyloGeo
A Snakemake workflow for phylogeographic analysis.
aPhyloGeo is a user-friendly, scalable, reproducible, and comprehensive workflow that can explore the correlation between specific genes (or gene segments) and environmental factors.
Dependencies
The workflow includes the following Python packages:
The workflow includes the following bioinformatics tools:
The software dependencies can be found in the conda environment files: [1] and [2].
Usage
1. Clone this repo.
git clone https://github.com/tahiri-lab/aPhyloGeo-pipeline.git
cd aPhyloGeo-pipeline
2. Install dependencies.
2.1 If you do not have Conda installed, then use the following method to install it. If you already have Conda installed, then refer directly to the next step (2.2).
# download Miniconda3 installer
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
# install Conda (respond with 'yes')
bash miniconda.sh
# update Conda
conda update -y conda
2.2 Create a conda environment named aPhyloGeo and install all the dependencies in that environment.
# create a new environment with dependencies
conda env create -n aPhyloGeo -f environment.yaml
2.3 Activate the environment
conda activate aPhyloGeo
3. Configure the workflow.
- config file:
  - config.yaml - analysis-specific settings (e.g., bootstrap_threshold, rf_threshold, step_size, window_size, data_type, etc.)
  Note: Set the parameters and thresholds in the config.yaml file according to your research needs by modifying the corresponding values. Do not change the parameter names or file names.
  - Thresholds in config.yaml:
    - bootstrap_threshold: Only sliding windows with bootstrap values greater than the user-set bootstrap_threshold will be written to the output file.
    - rf_threshold: The tree distance between each combination of sliding windows and environmental features will be calculated. Only sliding windows with a Robinson–Foulds (RF) distance below the user-set rf_threshold will be written to the output file.
  - params in config.yaml:
    - data_type: aa for the amino acid dataset (case insensitive); any other value set by the user will be treated as a nucleotide dataset (default).
    - step_size: the size of the sliding window movement step (bp)
    - window_size: the size of the sliding window (bp)
    - strategy: Two alternative algorithms are provided for constructing the phylogenetic tree, RAxML-NG and FastTree. Use fasttree for the FastTree strategy (case insensitive); any other value set by the user will be treated as the RAxML-NG strategy (default).
    - geo_file: the path of the input file (the environmental data, .csv)
    - seq_file: the path of the input file (the multiple sequence alignment data, .fasta)
      Note: If you want to use a relative path to describe the input file, use the path relative to the aPhyloGeo-pipeline directory (i.e., the default present working directory should be the workflow).
    - specimen_id: the name of the column containing the sample ID in geo_file
    - feature_names: the names of the columns corresponding to the environmental factors that will be involved in the analysis (in geo_file)
      Note: Each column name is on a separate line; don't forget to keep the "-" in front of it.
- input files:
  - example data files for protein analysis:
    - align_p.fa - Multiple sequence alignment for protein sequences in FASTA format (5 samples).
    - geo_p.csv - Environmental data corresponding to the sequencing samples (5 samples).
  - example data files for nucleotide analysis:
- output files:
  - (filtered) sliding windows with Robinson–Foulds (RF) distance values below the user-set threshold and bootstrap values greater than the user-set threshold in .csv (comma-separated values files).
  - The .csv files and related metadata will be stored in the 'results' directory.
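To make the interplay of the parameters and thresholds described above more concrete, here is a minimal Python sketch. It is not the workflow's actual code: the function names, the flat config dictionary, the labels "nt" and "raxml-ng" for the default choices, and the choice to drop a trailing partial window are all assumptions made for illustration. Only the key names and the comparison directions (bootstrap strictly greater than, RF distance strictly below) are taken from the descriptions above.

```python
# Illustrative sketch only; helper names and default labels are hypothetical.

def normalize_choice(value, special, default):
    """Return `special` if `value` matches it case-insensitively, else `default`."""
    return special if str(value).strip().lower() == special else default


def sliding_windows(alignment_length, window_size, step_size):
    """Return (start, end) pairs, 0-based and end-exclusive, for every full window.

    Assumption: a trailing window shorter than window_size is dropped.
    """
    if window_size <= 0 or step_size <= 0:
        raise ValueError("window_size and step_size must be positive")
    last_start = alignment_length - window_size
    return [(s, s + window_size) for s in range(0, last_start + 1, step_size)]


def keep_window(bootstrap, rf, bootstrap_threshold, rf_threshold):
    """Keep a window only if its bootstrap value is above bootstrap_threshold
    and its RF distance is below rf_threshold."""
    return bootstrap > bootstrap_threshold and rf < rf_threshold


# Example values; real settings live in config.yaml (layout assumed flat here).
config = {
    "bootstrap_threshold": 50,
    "rf_threshold": 80,
    "window_size": 100,
    "step_size": 50,
    "data_type": "AA",
    "strategy": "FastTree",
}

data_type = normalize_choice(config["data_type"], "aa", "nt")
strategy = normalize_choice(config["strategy"], "fasttree", "raxml-ng")
windows = sliding_windows(250, config["window_size"], config["step_size"])
```

With these example values, an alignment of 250 columns yields four overlapping windows, and a window is reported only when both threshold conditions hold at the same time.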
4. Execute the workflow.
Locally
run workflow
# If you are in a conda environment where all dependencies are already installed
## you need to specify the maximum number of CPU cores to be used at the same time.
## If you want to use N cores, say --cores N or -cN.
snakemake --cores all
Even if you have not created and activated the conda environment as required in 2.2 and 2.3, you can still run the workflow successfully with '--use-conda'. Snakemake will create a temporary conda environment for you.
#you need to specify the maximum number of CPU cores to be used at the same time.
#If you want to use N cores, say --cores N or -cN.
#For all cores on your system (be sure that this is appropriate) use --cores all.
snakemake --use-conda --cores all
Other features available
# dry run: only checks I/O files
snakemake -n
# dry run that also prints the shell commands
snakemake -np
# force Snakemake to run the job. By default, if Snakemake thinks the pipeline doesn't need updating, it will not run.
snakemake -F
Citation
A manuscript for aPhyloGeo-pipeline is in preparation.
Contact
Please email us at Nadia.Tahiri@USherbrooke.ca with any questions or feedback.
Linting and formatting
Linting results
None
Formatting results
[DEBUG]
[DEBUG] In file "/tmp/tmp89gaejl8/tahiri-lab-aPhyloGeo-pipeline-1a5dc29/workflow/rules/phyloGeo.smk": Formatted content is different from original
[DEBUG]
[DEBUG] In file "/tmp/tmp89gaejl8/tahiri-lab-aPhyloGeo-pipeline-1a5dc29/workflow/rules/slidingWindows1a.smk": Formatted content is different from original
[DEBUG]
[DEBUG] In file "/tmp/tmp89gaejl8/tahiri-lab-aPhyloGeo-pipeline-1a5dc29/workflow/rules/rf_phyML.smk": Formatted content is different from original
[DEBUG]
[DEBUG] In file "/tmp/tmp89gaejl8/tahiri-lab-aPhyloGeo-pipeline-1a5dc29/workflow/rules/phyML2.smk": Formatted content is different from original
[DEBUG]
[DEBUG] In file "/tmp/tmp89gaejl8/tahiri-lab-aPhyloGeo-pipeline-1a5dc29/workflow/Snakefile": Formatted content is different from original
[DEBUG]
[DEBUG] In file "/tmp/tmp89gaejl8/tahiri-lab-aPhyloGeo-pipeline-1a5dc29/workflow/rules/refTrees1b.smk": Formatted content is different from original
[DEBUG]
[DEBUG] In file "/tmp/tmp89gaejl8/tahiri-lab-aPhyloGeo-pipeline-1a5dc29/workflow/rules/common.smk": Formatted content is different from original
[INFO] 7 file(s) would be changed 😬
snakefmt version: 0.8.4