connor122721/nf-GeneFamilyEvolution

Gene Family Evolution in Daphnia

Nextflow | License: MIT | Profile: Slurm

This repository analyzes gene family dynamics in Daphnia species, focusing on gene family expansion and contraction related to ecological adaptation and evolutionary innovation. The pipeline can be applied to any combination of species that have well-annotated genomes in the NCBI genome catalog, or to species with custom gene models.

Pipeline Overview

graph TD
    AA[Download NCBI Genomes/Proteomes] --> A
    AA --> C
    A[Extract Longest Transcript] --> B[Annotate GO Terms]
    C[Run BUSCO on Proteomes] --> G[Extract/Trim/Align/Estimate BUSCO Gene Trees & Consensus Species-Tree]
    A --> I[Run OrthoFinder]
    I --> J[Filter Gene Families]
    J --> K[Estimate Gene Family Evolution CAFE5]
    K --> L[GO Term Enrichment]
    I --> M[Run Selection Tests]
    G --> P[Estimate Time-Calibrated Tree MCMCtree]
    P --> K

Set-Up

  • Download up-to-date protein-coding GTFs and genomes from NCBI, and choose in-group and out-group species that suit the biological hypothesis you want to test.

BUSCO

  • BUSCO assesses the completeness of Daphnia genome assemblies and annotations by searching for Benchmarking Universal Single-Copy Orthologs (BUSCOs), conserved genes expected to be present as a single copy.

Orthologs

  • OrthoFinder infers orthogroups and orthologs across Daphnia species (and the chosen out-groups) to clarify the evolutionary relationships among their genes.
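
As a rough illustration of the "Extract Longest Transcript" step that feeds OrthoFinder in the overview, the sketch below keeps the longest protein per gene from a FASTA file. It is not the pipeline's actual code: the function names and the `gene=` header tag are hypothetical, so adjust the header parsing to match your own proteome files.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_isoform_per_gene(fasta_text):
    """Return {gene: (protein_id, sequence)} keeping the longest protein per gene.

    Assumes (hypothetically) that each header looks like '>PROTEIN_ID gene=GENE_ID'.
    """
    best = {}
    for header, seq in parse_fasta(fasta_text):
        protein_id = header.split()[0]
        gene = header.split("gene=")[1].split()[0]
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (protein_id, seq)
    return best


example = ">P1 gene=G1\nMKT\n>P2 gene=G1\nMKTAYL\n>P3 gene=G2\nMA\n"
for gene, (protein_id, seq) in longest_isoform_per_gene(example).items():
    print(gene, protein_id, len(seq))
```

Keeping one protein per gene stops alternative isoforms from being counted as extra gene copies in the later gene family steps.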

Phylogenomics

  • Phylogenomic analysis infers species relationships from BUSCO gene trees and estimates a time-calibrated species tree with MCMCtree.

Gene Family Evolution

  • This section explores gene family expansion and contraction across Daphnia species with CAFE5 and clusterProfiler, focusing on genes related to spermatogenesis and stress responses.
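
The "Filter Gene Families" step that precedes CAFE5 in the overview could look something like the sketch below. It assumes an OrthoFinder-style count table (an `Orthogroup` column, one column per species, and a `Total` column) and writes the tab-delimited layout that CAFE5 reads (`Desc`, `Family ID`, then per-species counts). The function name and both thresholds (`max_copies`, `min_species`) are illustrative assumptions, not the settings used by this pipeline.

```python
import csv
import io


def filter_gene_families(count_tsv, max_copies=100, min_species=2):
    """Convert an orthogroup count table to CAFE5 input, dropping problem families.

    A family is kept only if every species has fewer than `max_copies` genes
    and at least `min_species` species have one or more genes.
    """
    reader = csv.DictReader(io.StringIO(count_tsv), delimiter="\t")
    species = [c for c in reader.fieldnames if c not in ("Orthogroup", "Total")]

    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["Desc", "Family ID"] + species)

    for row in reader:
        counts = [int(row[s]) for s in species]
        n_present = sum(c > 0 for c in counts)
        if max(counts) < max_copies and n_present >= min_species:
            # "(null)" is a placeholder description for the Desc column
            writer.writerow(["(null)", row["Orthogroup"]] + counts)
    return out.getvalue()


table = (
    "Orthogroup\tmagna\tpulex\tmelanogaster\tTotal\n"
    "OG0000000\t150\t3\t2\t155\n"   # dropped: one species has >= 100 copies
    "OG0000001\t4\t2\t1\t7\n"       # kept
    "OG0000002\t0\t0\t5\t5\n"       # dropped: present in a single species
)
print(filter_gene_families(table), end="")
```

Very large families tend to make rate estimation unstable, and families found in a single species carry little information about gains and losses along the tree, which is why both are removed here.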

Selection

  • Selection analysis investigates evolutionary pressures on specific gene families, particularly those undergoing expansion, using codon-based models implemented in PAML and HyPhy.

Building the Apptainer Image

To build the Apptainer image, use the following command:

apptainer build gene_family_evolution.sif definition.def

Running the Nextflow Pipeline

To run the Nextflow pipeline, use the following command:

nextflow run main.nf -profile standard

Ensure that the nextflow.config file is in the same directory as main.nf or specify its path using the -c option.

Notes

  • Most processes currently use apptainer run latest_image.sif. You could instead modify the code to call apptainer exec docker://image:latest so that images do not have to be pulled manually beforehand. I am still working on this feature to make it more user-friendly.
  • This project is a work in progress, and I am still learning Nextflow best practices. Any help or tips would be appreciated!

TimeTree Constraints

  • TimeTree constraints are used to calibrate the phylogenetic tree with divergence times obtained from the TimeTree database. These constraints help in estimating the divergence times between species accurately.
  • In the pipeline, the makeConsensusMCMC process includes the application of these constraints using the mcmctree_prep.py script. This script adds time constraints to the species tree based on known divergence times.

Example: Time between Daphnia magna and Drosophila melanogaster

  • The divergence time between Daphnia magna and Drosophila melanogaster is constrained to between 474.8 and 530 million years ago.
  • This constraint is applied in the makeConsensusMCMC process as follows:
    python ${params.scripts_dir}/mcmctree_prep.py \
        --left_species magna \
        --right_species melanogaster \
        --lower_bound 474.8 \
        --upper_bound 530 \
        --tree - \
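
To show what such a constraint looks like in the tree file, here is a minimal sketch of adding a calibration to a Newick tree. MCMCtree reads calibrations as node labels, with soft bounds written as 'B(min,max)'. This is not the code of mcmctree_prep.py: the function names are hypothetical, and the sketch assumes a time unit of 100 million years (a common MCMCtree convention), exact tip-name matching, and a tree without existing internal node labels.

```python
import re


def calibration_label(lower_myr, upper_myr, unit_myr=100.0):
    """Format bounds given in Myr as an MCMCtree 'B(min,max)' node label."""
    return f"'B({lower_myr / unit_myr:g},{upper_myr / unit_myr:g})'"


def add_calibration(newick, left, right, label):
    """Insert `label` after the smallest clade that contains both tips."""
    stack = []
    for i, ch in enumerate(newick):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            start = stack.pop()
            # Tip names follow "(" or "," and end at ":", ",", ")" or ";"
            tips = set(re.findall(r"[(,]\s*([^():,;\s]+)", newick[start:i + 1]))
            if left in tips and right in tips:
                # Inner clades close first, so this is the most recent common ancestor
                return newick[:i + 1] + label + newick[i + 1:]
    raise ValueError(f"no clade contains both {left!r} and {right!r}")


tree = "((magna,pulex),melanogaster);"   # toy species tree (assumed tip names)
label = calibration_label(474.8, 530)
print(label)                              # 'B(4.748,5.3)'
print(add_calibration(tree, "magna", "melanogaster", label))
```

In the toy tree the label lands on the root, because that is the most recent common ancestor of magna and melanogaster.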

License

This project is licensed under the MIT License.