🧬 Orthology and Paralogy at Transcript Level 🧬

👥 Authors

Wend Yam Donald Davy Ouedraogo & Aida Ouangraoua, CoBIUS LAB, Department of Computer Science, Faculty of Science, Université de Sherbrooke, Sherbrooke, Canada

💡 If you are using our algorithm in your research, please cite our recent paper: Ouedraogo, W. Y. D. D., & Ouangraoua, A. (2023, April). Inferring Clusters of Orthologous and Paralogous Transcripts. In RECOMB International Workshop on Comparative Genomics (pp. 19-34).

📧 Contact: wend.yam.donald.davy.ouedraogo@usherbrooke.ca

📖 Table of Contents

➤ About the project
➤ Inferring clusters of orthologous and paralogous transcripts

📝 About The Project

☁️ Overview

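We present an algorithm for inferring clusters of orthologous and paralogous transcripts. Given a multiple sequence alignment of transcripts, a mapping of transcripts to their genes and a gene tree in NHX format, the program computes a similarity score for each pair of homologous transcripts, searches for recent paralogs and reciprocal best hits (RBHs), builds an orthology graph and extracts its connected components.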

👨‍💻 Operating System

The program was developed and tested on Ubuntu 18.04.6 LTS.

⚒️ Requirements

python3 (at least Python 3.6)
NetworkX
Pandas
Numpy
ETE toolkit
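
A quick way to confirm that the requirements above are met is to test whether each module can be found. The sketch below is illustrative only: the import names are assumptions (in particular `ete3` for the ETE toolkit), so adjust them to the versions you installed.

```python
import importlib.util
import sys

# Assumed import names for the listed requirements; "ete3" is the usual
# module name of the ETE toolkit, but check your installed version.
REQUIRED_MODULES = ("networkx", "pandas", "numpy", "ete3")


def check_environment(modules=REQUIRED_MODULES, min_python=(3, 6)):
    """Return (python_ok, missing): a version check plus the modules not found."""
    python_ok = sys.version_info >= min_python
    missing = [name for name in modules if importlib.util.find_spec(name) is None]
    return python_ok, missing


if __name__ == "__main__":
    python_ok, missing = check_environment()
    print("Python version OK:", python_ok)
    print("Missing modules:", ", ".join(missing) if missing else "none")
```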

Inferring clusters of orthologous and paralogous transcripts

📦 About the package

Install the package:

pip3 install transcriptorthology

Import the package and call the main function:

from transcriptorthology.transcriptOrthology import inferring_transcripts_isoorthology

if __name__ == '__main__':
  # Input files for one gene family
  transcripts_msa_path = './execution/transcripts_alignments/ENSGT00390000000080.alg'
  gtot_path = './execution/mapping_gene_to_transcripts/ENSGT00390000000080.fasta'
  gt_path = './execution/NHX_trees/ENSGT00390000000080.nwk'
  # Parameters (see the Details table under Getting Started)
  lower_bound = 0.7   # threshold for the selection of transcript RBHs
  tsm_conditions = 2  # transcript similarity measure, an integer from 1 to 6
  constraint = 1      # selection of recent paralogs: 0 (not reciprocal) or 1 (reciprocal)
  output_folder = './execution/output_folder'

  inferring_transcripts_isoorthology(transcripts_msa_path, gtot_path, gt_path, tsm_conditions, lower_bound, constraint, output_folder)

🚀 Getting Started

Command

usage: transcriptOrthology.py [-h] -talg TRALIGNMENT
                              -gtot GENETOTRANSCRIPTS -nhxt NHXGENETREE
                              [-lowb LOWERBOUND] [-tsm TSMVALUE]
                              [-outf OUTPUTFOLDER]

program parameters

options:
  -h, --help            show this help message and exit
  -talg TRALIGNMENT, --tralignment TRALIGNMENT
                        Multiple Sequences Alignment of transcripts in FASTA
                        format
  -gtot GENETOTRANSCRIPTS, --genetotranscripts GENETOTRANSCRIPTS
                        mappings transcripts to corresponding genes
  -nhxt NHXGENETREE, --nhxgenetree NHXGENETREE
                        NHX gene tree
  -lowb LOWERBOUND, --lowerbound LOWERBOUND
                        a threshold for the selection of transcripts RBHs
  -tsm TSMVALUE, --tsmvalue TSMVALUE
                        an integer(1|2|3|4|5|6) that refers to the transcript
                        similarity measure
  -const CONSTRAINT, --constraint CONSTRAINT
                        an integer(0|1), constraint for the selection of recent paralogs
  -outf OUTPUTFOLDER, --outputfolder OUTPUTFOLDER
                        the output folder to store the results
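
The value ranges stated for these options can be checked before launching a run. The following is a minimal sketch, not part of the package: `validate_parameters` and `TSM_MEASURES` are hypothetical names, and the labels are copied from the parameter table in this README.

```python
# Hypothetical helper, not provided by transcriptorthology.
# Labels follow the -tsm values listed in this README.
TSM_MEASURES = {
    1: "tsm+unitary",
    2: "tsm+length",
    3: "tsm+mean",
    4: "tsm++unitary",
    5: "tsm++length",
    6: "tsm++mean",
}


def validate_parameters(lower_bound, tsm_value, constraint):
    """Raise ValueError if a value is outside its documented range.

    Returns the label of the selected similarity measure.
    """
    if not 0.0 <= lower_bound <= 1.0:
        raise ValueError("lower bound must be a float between 0 and 1")
    if tsm_value not in TSM_MEASURES:
        raise ValueError("tsm value must be an integer from 1 to 6")
    if constraint not in (0, 1):
        raise ValueError("constraint must be 0 (not reciprocal) or 1 (reciprocal)")
    return TSM_MEASURES[tsm_value]


if __name__ == "__main__":
    print(validate_parameters(lower_bound=0.7, tsm_value=2, constraint=1))
```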

Details

| Parameter | Definition | Value format |
| --- | --- | --- |
| -talg --tralignment | Multiple sequence alignment (MSA) of transcripts | FASTA format: >{id_transcript}\n{sequence} |
| -gtot --genetotranscripts | Mapping g(t) of each transcript to its gene | FASTA format: >{id_transcript}:{id_gene}\n |
| -nhxt --nhxgenetree | Gene tree | NHX format |
| -lowb --lowerbound | Lower bound for selecting transcript RBHs. Defaults to 0.5. | Float between 0 and 1 |
| -tsm --tsmvalue | Transcript similarity measure (unitary, length or mean) | Integer: 1 (tsm+unitary) \| 2 (tsm+length) \| 3 (tsm+mean) \| 4 (tsm++unitary) \| 5 (tsm++length) \| 6 (tsm++mean) |
| -const --constraint | Constraint for the selection of recent paralogs | 0 (not reciprocal) \| 1 (reciprocal) |
| -outf --outputfolder | Folder in which to save the results. Defaults to the current program folder. | String |
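
To make the two FASTA-style input formats concrete, here is a small sketch that parses them and runs basic sanity checks. It is an illustration under assumptions, not the parser used by the program: the function names are hypothetical, the IDs in the example are invented, and it assumes that alignment headers carry exactly the transcript IDs used in the mapping file.

```python
def parse_gene_to_transcripts(lines):
    """Map each transcript ID to its gene ID from '>{id_transcript}:{id_gene}' headers."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line.startswith(">"):
            continue  # skip blank or non-header lines
        # Split at the first ':' only.
        transcript_id, sep, gene_id = line[1:].partition(":")
        if not sep or not transcript_id or not gene_id:
            raise ValueError(f"Malformed header: {line!r}")
        mapping[transcript_id] = gene_id
    return mapping


def parse_alignment(lines):
    """Read a FASTA alignment into {transcript_id: aligned_sequence}."""
    chunks = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:]
            chunks[current] = []
        elif current is None:
            raise ValueError("Sequence data found before the first header")
        else:
            chunks[current].append(line)  # sequences may span several lines
    return {name: "".join(parts) for name, parts in chunks.items()}


def check_inputs(mapping, alignment):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    if len({len(seq) for seq in alignment.values()}) > 1:
        problems.append("aligned sequences differ in length")
    for transcript_id in alignment:
        if transcript_id not in mapping:
            problems.append(f"no gene given for transcript {transcript_id}")
    return problems


if __name__ == "__main__":
    # Invented IDs, for illustration only.
    mapping = parse_gene_to_transcripts([">T1:G1", ">T2:G2"])
    alignment = parse_alignment([">T1", "ATG-CC", ">T2", "ATGACC"])
    print(mapping, alignment, check_inputs(mapping, alignment))
```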

Usage example

python3 ./scripts/transcriptOrthology.py -talg ./execution/inputs/transcripts_alignments/ENSGT00390000003967.alg -gtot ./execution/inputs/mapping_gene_to_transcripts/ENSGT00390000003967.fasta -nhxt ./execution/inputs/NHX_trees/ENSGT00390000003967.nhx -lowb 0.7 -outf ./execution/outputs/ -tsm 1 -const 1

OR

sh ./execution_inferring_clusters.sh
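
When several gene families have to be processed, the command line above can be assembled programmatically. The sketch below assumes the directory layout of the usage example; `build_command` is a hypothetical helper and the default values simply repeat those of the example.

```python
import sys
from pathlib import Path


def build_command(family_id, inputs_dir="./execution/inputs",
                  output_dir="./execution/outputs/",
                  lower_bound=0.7, tsm_value=1, constraint=1,
                  script="./scripts/transcriptOrthology.py"):
    """Return the argument list for one gene family, mirroring the usage example."""
    inputs = Path(inputs_dir)
    return [
        sys.executable, script,
        "-talg", str(inputs / "transcripts_alignments" / (family_id + ".alg")),
        "-gtot", str(inputs / "mapping_gene_to_transcripts" / (family_id + ".fasta")),
        "-nhxt", str(inputs / "NHX_trees" / (family_id + ".nhx")),
        "-lowb", str(lower_bound),
        "-tsm", str(tsm_value),
        "-const", str(constraint),
        "-outf", output_dir,
    ]


if __name__ == "__main__":
    # Print the command rather than running it, so nothing is executed by accident.
    print(" ".join(build_command("ENSGT00390000003967")))
```

To launch a run, the returned list can be passed to `subprocess.run(command, check=True)`.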

Output expected

++++++++++++++++Starting ....
+++++++ All data were retrieved & the representation of subtranscribed sequences of genes into blocks are available.
+++++ Computing matrix ...       in progress
+++++ Computing matrix ...       status: Finished without errors in 0.42296433448791504 seconds
+++++ Searching for recent-paralogs ...         status: processing
+++++ Searching for recent-paralogs ...         status: finished in 0.11350250244140625 seconds
+++++ Searching for RBHs ...    status: processing
+++++ Searching for RBHs ...    status: finished in 0.09129834175109863 seconds
+++++ Construction of the orthology graph (Adding nodes ...) ...        status: processing
+++++ Construction of the orthology graph (Adding nodes ...) ...        status: finished in 0.524106502532959 seconds
+++++ Searching for connected components ...    status: processing
+++++ Searching for connected components ...    status: finished in 0.06076645851135254 seconds
++++++++++++++++Finished
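
The timings printed in this log can be collected for benchmarking. The sketch below is an assumption-laden convenience, not a feature of the program: it relies on the exact wording of the lines shown above ("status: ... in N seconds") and would need adjusting if the log format differs.

```python
import re

# Matches lines such as
# "+++++ Searching for RBHs ...    status: finished in 0.09 seconds"
TIMING_LINE = re.compile(
    r"^\++\s*(?P<step>.+?)\s*\.\.\.\s+status:\s.*?\bin\s+"
    r"(?P<seconds>[0-9]+(?:\.[0-9]+)?(?:[eE][-+]?[0-9]+)?)\s+seconds"
)


def parse_step_timings(log_text):
    """Return a list of (step name, seconds) for every line that reports a duration."""
    timings = []
    for line in log_text.splitlines():
        match = TIMING_LINE.match(line.strip())
        if match:
            timings.append((match.group("step"), float(match.group("seconds"))))
    return timings


if __name__ == "__main__":
    sample = (
        "+++++ Searching for RBHs ...    status: processing\n"
        "+++++ Searching for RBHs ...    status: finished in 0.09129834175109863 seconds\n"
    )
    for step, seconds in parse_step_timings(sample):
        print(f"{step}: {seconds:.3f} s")
```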

📁 Project Files Description

⌨️ Inputs description

Inputs files

1️⃣ tsmcomputing() ➡️ returns the similarity matrix (tsm+ | tsm) scores depending on the `tsmvalue` for all pairs of homologous transcripts.

usage: tsmComputing.py [-h] [-talg TRALIGNMENT] [-gtot GENETOTRANSCRIPTS]
                       [-tsm TSMVALUE] [-outf OUTPUTFOLDER]
parsor program parameter
optional arguments:
-h, --help            show this help message and exit
-talg TRALIGNMENT, --tralignment TRALIGNMENT
-gtot GENETOTRANSCRIPTS, --genetotranscripts GENETOTRANSCRIPTS
-tsm TSMVALUE, --tsmvalue TSMVALUE
-outf OUTPUTFOLDER, --outputfolder OUTPUTFOLDER

2️⃣ Tclustering() ➡️ returns the orthology graph of transcripts.

usage: Tclustering.py [-h] [-m MATRIX] [-gtot GENETOTRANSCRIPTS]
                      [-nhxt NHXGENETREE] [-lowb LOWERBOUND]
                      [-outf OUTPUTFOLDER]
parsor program parameter
optional arguments:
-h, --help            show this help message and exit
-m MATRIX, --matrix MATRIX
-gtot GENETOTRANSCRIPTS, --genetotranscripts GENETOTRANSCRIPTS
-nhxt NHXGENETREE, --nhxgenetree NHXGENETREE
-lowb LOWERBOUND, --lowerbound LOWERBOUND
-const CONSTRAINT,  --constraint CONSTRAINT
-outf OUTPUTFOLDER, --outputfolder OUTPUTFOLDER
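
To illustrate two of the steps named in the program log (searching for RBHs, then searching for connected components), here is a deliberately simplified sketch in plain Python. It is not the authors' implementation: it ignores the gene tree, the recent-paralog step and the `-const` option, breaks ties arbitrarily, and uses invented transcript IDs and scores. The function names are hypothetical.

```python
from collections import defaultdict


def reciprocal_best_hits(similarity, gene_of, lower_bound=0.5):
    """Return the set of transcript pairs that are reciprocal best hits.

    similarity: {(t1, t2): score}, looked up in either order.
    gene_of: {transcript_id: gene_id}.
    The best hit of a transcript in another gene is the highest-scoring
    transcript of that gene, kept only if its score reaches lower_bound.
    """
    def score(a, b):
        return similarity.get((a, b), similarity.get((b, a), 0.0))

    by_gene = defaultdict(list)
    for transcript, gene in gene_of.items():
        by_gene[gene].append(transcript)

    best = {}  # (transcript, other gene) -> best-scoring transcript of that gene
    for transcript, gene in gene_of.items():
        for other_gene, candidates in by_gene.items():
            if other_gene == gene:
                continue
            top = max(candidates, key=lambda c: score(transcript, c))
            if score(transcript, top) >= lower_bound:
                best[(transcript, other_gene)] = top

    pairs = set()
    for (transcript, _), hit in best.items():
        # Keep the pair only if the hit points back to the same transcript.
        if best.get((hit, gene_of[transcript])) == transcript:
            pairs.add(frozenset((transcript, hit)))
    return pairs


def connected_components(nodes, edges):
    """Group nodes into clusters linked by edges (iterative depth-first search)."""
    neighbours = {node: set() for node in nodes}
    for edge in edges:
        a, b = tuple(edge)
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, clusters = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        clusters.append(component)
    return clusters


if __name__ == "__main__":
    # Invented data: two genes with two transcripts each.
    gene_of = {"a1": "G1", "a2": "G1", "b1": "G2", "b2": "G2"}
    similarity = {("a1", "b1"): 0.9, ("a1", "b2"): 0.3,
                  ("a2", "b1"): 0.6, ("a2", "b2"): 0.8}
    rbhs = reciprocal_best_hits(similarity, gene_of, lower_bound=0.7)
    print(connected_components(list(gene_of), rbhs))
```

Since NetworkX is listed in the requirements, `networkx.connected_components` on a graph built from the same edges would be the library equivalent of the second function.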

3️⃣ transcriptOrthology() ➡️ returns, for each pair of homologous transcripts, their homology relationship type (recent-paralogs, ortho-paralogs or ortho-orthologs).

💽 Outputs description

Output files

1️⃣ matrix.csv : similarity matrix giving the tsm+ score for each pair of homologous transcripts.
2️⃣ blocks_transcripts.csv|blocks_genes : CSV file describing the block representation of each transcript (resp. gene).
3️⃣ start_orthology_graph.pdf|end_orthology_graph.pdf : orthology graph at the start (resp. at the end) of the algorithm, showing only the pairwise relationships between recent paralogs (resp. all the orthologous clusters). ⚠️ Only produced if the number of transcripts is not greater than 20.
4️⃣ orthologs.csv : CSV file summarizing the results of the isoorthology clustering.
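
A similarity matrix saved as CSV can be inspected with a few lines of Python. The sketch below is only a guess at the layout: it assumes that the first row of matrix.csv holds column IDs and the first column holds row IDs, which this README does not confirm, so check it against your own output first. The function names and sample values are invented.

```python
import csv
import io


def read_similarity_matrix(handle):
    """Read a labelled square matrix into {(row_id, col_id): score}.

    Assumes the first row holds column IDs and the first column holds
    row IDs; verify this against your own matrix.csv.
    """
    reader = csv.reader(handle)
    column_ids = next(reader)[1:]
    scores = {}
    for row in reader:
        if not row:
            continue  # skip empty lines
        row_id, values = row[0], row[1:]
        for col_id, value in zip(column_ids, values):
            scores[(row_id, col_id)] = float(value)
    return scores


def pairs_above(scores, lower_bound):
    """List unordered pairs of distinct transcripts scoring at least lower_bound."""
    kept = {}
    for (a, b), value in scores.items():
        if a != b and value >= lower_bound:
            kept[tuple(sorted((a, b)))] = value
    return sorted(kept.items(), key=lambda item: -item[1])


if __name__ == "__main__":
    # Invented sample in the assumed layout.
    sample = io.StringIO(",t1,t2\nt1,1.0,0.8\nt2,0.8,1.0\n")
    print(pairs_above(read_similarity_matrix(sample), 0.5))
```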

✔️ Dataset

The `data` folder contains the datasets used in the study, together with the results obtained.

