
openproblems-bio/task_grn_inference


A dynamic benchmark for gene regulatory network (GRN) inference

Benchmarking GRN inference methods. The full documentation is hosted on ReadTheDocs.

Path to source: src


Installation

You need to have Docker, Java, and Viash installed. Follow these instructions to install the required dependencies.

Download resources

git clone git@github.com:openproblems-bio/task_grn_inference.git

cd task_grn_inference

# download resources
scripts/download_resources.sh

Infer a GRN

viash run src/methods/dummy/config.vsh.yaml -- --multiomics_rna resources/grn-benchmark/multiomics_rna.h5ad --multiomics_atac resources/grn-benchmark/multiomics_atac.h5ad --prediction output/dummy.csv

Similarly, run the command for other methods.
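
As a minimal sketch of how the command above generalizes, the helper below assembles the same `viash run` invocation for a given method name. It assumes that every method keeps its config at `src/methods/<method>/config.vsh.yaml`, as the dummy method does; `build_viash_command` is a hypothetical helper, not part of the repository.

```python
import shlex


def build_viash_command(method, rna, atac, prediction):
    """Return the argument list for running one method through Viash."""
    # Assumption: the config path follows the layout of the dummy method.
    config = f"src/methods/{method}/config.vsh.yaml"
    return [
        "viash", "run", config, "--",
        "--multiomics_rna", rna,
        "--multiomics_atac", atac,
        "--prediction", prediction,
    ]


cmd = build_viash_command(
    "dummy",
    "resources/grn-benchmark/multiomics_rna.h5ad",
    "resources/grn-benchmark/multiomics_atac.h5ad",
    "output/dummy.csv",
)
# Print a shell-ready command line; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
```

Building the command as a list keeps paths with spaces safe when the list is later handed to `subprocess.run`.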

Evaluate a GRN

scripts/benchmark_grn.sh --grn resources/grn-benchmark/models/collectri.csv 

Similarly, run the command for other GRN models.

Add a method

To add a method to the repository, follow the instructions in the scripts/add_a_method.sh script.

Motivation

GRNs are essential for understanding cellular identity and behavior. They are simplified models of gene expression regulated by complex processes involving multiple layers of control, from transcription to post-transcriptional modifications, incorporating various regulatory elements and non-coding RNAs. Gene transcription is controlled by a regulatory complex that includes transcription factors (TFs), cis-regulatory elements (CREs) like promoters and enhancers, and essential co-factors. High-throughput datasets, covering thousands of genes, facilitate the use of machine learning approaches to decipher GRNs. The advent of single-cell sequencing technologies, such as scRNA-seq, has made it possible to infer GRNs from a single experiment due to the abundance of samples. This allows researchers to infer condition-specific GRNs, such as for different cell types or diseases, and study potential regulatory factors associated with these conditions. Combining chromatin accessibility data with gene expression measurements has led to the development of enhancer-driven GRN (eGRN) inference pipelines, which offer significantly improved accuracy over single-modality methods.

Description

Here, we present a dynamic benchmark platform for GRN inference. This platform provides curated datasets for GRN inference and evaluation, standardized evaluation protocols and metrics, computational infrastructure, and a dynamically updated leaderboard to track state-of-the-art methods. It runs novel GRNs in the cloud, offers competition scores, and stores them for future comparisons, reflecting new developments over time.

The platform supports the integration of new datasets and protocols. When a new feature is added, previously evaluated GRNs are re-assessed, and the leaderboard is updated accordingly. The aim is to evaluate both the accuracy and completeness of inferred GRNs. It is designed for both single-modality and multi-omics GRN inference. Ultimately, it is a community-driven platform. So far, six eGRN inference methods have been integrated: Scenic+, CellOracle, FigR, scGLUE, GRaNIE, and ANANSE.

Due to its flexible nature, the platform can incorporate various benchmark datasets and evaluation methods, using either prior knowledge or feature-based approaches. In the current version, due to the absence of standardized prior knowledge, we use a feature-based approach to benchmark GRNs. Our evaluation utilizes standardized datasets for GRN inference and evaluation, employing multiple regression analysis approaches to assess both accuracy and comprehensiveness.

Authors & contributors

| name | roles |
| --- | --- |
| Jalil Nourisa | author |
| Robrecht Cannoodt | author |
| Antoine Passimier | contributor |
| Christian Arnold | contributor |
| Marco Stock | contributor |

API

flowchart LR
  file_multiomics_rna_h5ad("multiomics rna")
  comp_method[/"Method"/]
  file_prediction("GRN")
  comp_metric[/"Label"/]
  file_score("Score")
  file_multiomics_atac_h5ad("multiomics atac")
  file_perturbation_h5ad("perturbation")
  comp_control_method[/"Control Method"/]
  comp_method_r[/"Method r"/]
  file_multiomics_rna_h5ad---comp_method
  comp_method-->file_prediction
  file_prediction---comp_metric
  comp_metric-->file_score
  file_multiomics_atac_h5ad---comp_method
  file_perturbation_h5ad---comp_metric
  comp_control_method-->file_prediction
  comp_method_r-->file_prediction

File format: multiomics rna

RNA expression for multiomics data.

Example file: resources_test/grn-benchmark/multiomics_rna.h5ad

Format:

AnnData object
 obs: 'cell_type', 'donor_id'

Slot description:

| Slot | Type | Description |
| --- | --- | --- |
| obs["cell_type"] | string | The annotated cell type of each cell based on RNA expression. |
| obs["donor_id"] | string | Donor id. |

Component type: Method

Path: src/methods

A GRN inference method

Arguments:

| Name | Type | Description |
| --- | --- | --- |
| --multiomics_rna | file | (Optional) RNA expression for multiomics data. Default: resources/grn-benchmark/multiomics_rna.h5ad. |
| --multiomics_atac | file | (Optional) Peak data for multiomics data. Default: resources/grn-benchmark/multiomics_atac.h5ad. |
| --prediction | file | (Optional, Output) GRN prediction. Default: output/prediction.csv. |
| --temp_dir | string | (Optional) NA. Default: output/temdir. |
| --num_workers | integer | (Optional) NA. Default: 4. |
| --tf_all | file | (Optional) NA. Default: resources/prior/tf_all.csv. |
| --max_n_links | integer | (Optional) NA. Default: 50000. |
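
The `--max_n_links` argument has no description here. One plausible reading, shown below purely as an assumption, is that it caps the number of edges in a prediction by keeping those with the largest absolute weight. `keep_top_links` is a hypothetical helper and the edges are made-up values.

```python
def keep_top_links(edges, max_n_links=50000):
    """Keep at most `max_n_links` edges, ranked by absolute weight.

    Assumption: this is one way a link cap could work; the repository
    may rank or filter edges differently.
    """
    ranked = sorted(edges, key=lambda edge: abs(edge[2]), reverse=True)
    return ranked[:max_n_links]


# Made-up (source, target, weight) edges for illustration.
edges = [
    ("TF1", "GENE_A", 0.8),
    ("TF1", "GENE_B", -0.9),
    ("TF2", "GENE_A", 0.1),
]
top_two = keep_top_links(edges, max_n_links=2)
```

Ranking by absolute value treats strong repression and strong activation alike, so negative weights are not discarded simply for being negative.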

File format: GRN

GRN prediction

Example file: resources_test/grn_models/collectri.csv

Format:

Tabular data
 'source', 'target', 'weight'

Slot description:

| Column | Type | Description |
| --- | --- | --- |
| source | string | Source of regulation. |
| target | string | Target of regulation. |
| weight | float | Weight of regulation. |
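
A minimal sketch of reading and validating a file in this format is shown below. It assumes a comma-separated file with a header row; `read_grn` is a hypothetical helper and the rows are made-up values, not taken from the benchmark resources.

```python
import csv
import io

REQUIRED_COLUMNS = ("source", "target", "weight")


def read_grn(handle):
    """Parse a GRN table into (source, target, weight) tuples."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in REQUIRED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    # float() raises ValueError if a weight is not numeric.
    return [
        (row["source"], row["target"], float(row["weight"]))
        for row in reader
    ]


# Made-up example content standing in for a prediction file.
example = io.StringIO(
    "source,target,weight\n"
    "TF1,GENE_A,0.8\n"
    "TF1,GENE_B,-0.3\n"
    "TF2,GENE_A,0.1\n"
)
grn_edges = read_grn(example)
```

To check a real file, open it with `open(path, newline="")` and pass the handle to `read_grn`.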

Component type: Label

Path: src/metrics

A metric to evaluate the performance of the inferred GRN

Arguments:

| Name | Type | Description |
| --- | --- | --- |
| --perturbation_data | file | (Optional) Perturbation dataset for benchmarking. Default: resources/grn-benchmark/perturbation_data.h5ad. |
| --prediction | file | GRN prediction. |
| --score | file | (Optional, Output) File indicating the score of a metric. Default: output/score.h5ad. |
| --reg_type | string | (Optional) Name of the regression to use. Default: ridge. |
| --subsample | integer | (Optional) Number of samples randomly drawn from the perturbation data. Default: -2. |
| --max_workers | integer | (Optional) NA. Default: 4. |
| --method_id | string | (Optional) NA. |
| --tf_all | file | (Optional) NA. Default: resources/prior/tf_all.csv. |
| --apply_tf | boolean | (Optional) NA. Default: TRUE. |
| --clip_scores | boolean | (Optional) Clips the R2 scores for each gene so that they lie within [0, 1]. Default: TRUE. |
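
To illustrate what `--clip_scores` describes, the sketch below computes a per-gene R2 and clips it to [0, 1]. The function names and the toy values are assumptions for illustration; the actual metric code fits a regression model (ridge by default) and may aggregate scores differently.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


def clip_scores(scores, low=0.0, high=1.0):
    """Clamp every score into the closed interval [low, high]."""
    return [min(max(score, low), high) for score in scores]


# Toy per-gene observations and predictions (made-up values).
observed = [1.0, 2.0, 3.0]
per_gene_r2 = [
    r2_score(observed, [1.0, 2.0, 3.0]),  # perfect prediction
    r2_score(observed, [3.0, 2.0, 1.0]),  # worse than predicting the mean
]
clipped = clip_scores(per_gene_r2)
```

R2 is unbounded below, so a single badly predicted gene could otherwise dominate an average; clipping bounds each gene's contribution.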

File format: Score

File indicating the score of a metric.

Example file: resources_test/scores/score.h5ad

Format:

AnnData object
 uns: 'dataset_id', 'method_id', 'metric_ids', 'metric_values'

Slot description:

| Slot | Type | Description |
| --- | --- | --- |
| uns["dataset_id"] | string | A unique identifier for the dataset. |
| uns["method_id"] | string | A unique identifier for the method. |
| uns["metric_ids"] | string | One or more unique metric identifiers. |
| uns["metric_values"] | double | The metric values obtained for the given prediction. Must be of same length as ‘metric_ids’. |
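
The constraint that `metric_values` must match `metric_ids` in length can be checked as in the sketch below. A plain dictionary stands in for the `uns` slot of the AnnData object, and `check_score_uns` and the identifiers in the example are hypothetical.

```python
REQUIRED_KEYS = ("dataset_id", "method_id", "metric_ids", "metric_values")


def check_score_uns(uns):
    """Validate a score `uns` mapping and return {metric_id: value}."""
    missing = [key for key in REQUIRED_KEYS if key not in uns]
    if missing:
        raise KeyError(f"missing uns keys: {missing}")
    ids = uns["metric_ids"]
    values = uns["metric_values"]
    # "One or more" identifiers: accept a single id/value pair as scalars.
    if isinstance(ids, str):
        ids = [ids]
    if isinstance(values, (int, float)):
        values = [values]
    if len(ids) != len(values):
        raise ValueError("metric_ids and metric_values differ in length")
    return {metric_id: float(value) for metric_id, value in zip(ids, values)}


# Made-up identifiers and values for illustration.
scores = check_score_uns({
    "dataset_id": "example_dataset",
    "method_id": "example_method",
    "metric_ids": ["metric_a", "metric_b"],
    "metric_values": [0.25, 0.5],
})
```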

File format: multiomics atac

Peak data for multiomics data.

Example file: resources_test/grn-benchmark/multiomics_atac.h5ad

Format:

AnnData object
 obs: 'cell_type', 'donor_id'

Slot description:

| Slot | Type | Description |
| --- | --- | --- |
| obs["cell_type"] | string | The annotated cell type of each cell based on RNA expression. |
| obs["donor_id"] | string | Donor id. |

File format: perturbation

Perturbation dataset for benchmarking.

Example file: resources_test/grn-benchmark/perturbation_data.h5ad

Format:

AnnData object
 obs: 'cell_type', 'sm_name', 'donor_id', 'plate_name', 'row', 'well', 'cell_count'
 layers: 'n_counts', 'pearson', 'lognorm'

Slot description:

| Slot | Type | Description |
| --- | --- | --- |
| obs["cell_type"] | string | The annotated cell type of each cell based on RNA expression. |
| obs["sm_name"] | string | The primary name for the (parent) compound (in a standardized representation) as chosen by LINCS. This is provided to map the data in this experiment to the LINCS Connectivity Map data. |
| obs["donor_id"] | string | Donor id. |
| obs["plate_name"] | string | Plate name 6 levels. |
| obs["row"] | string | Row name on the plate. |
| obs["well"] | string | Well name on the plate. |
| obs["cell_count"] | string | Number of single cells pseudobulked. |
| layers["n_counts"] | double | Pseudobulked values using mean approach. |
| layers["pearson"] | double | (Optional) Normalized values using pearson residuals. |
| layers["lognorm"] | double | (Optional) Normalized values using shifted logarithm. |
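
The `n_counts` and `lognorm` layers can be pictured with the sketch below, which averages counts across cells per gene and then applies a shifted logarithm, log(x + 1). This is only an illustration under assumptions: the function names are hypothetical, the values are made up, and the real pipeline may apply size-factor scaling or other steps not shown here.

```python
import math


def pseudobulk_mean(cells):
    """Average a list of per-cell count vectors into one per-gene vector."""
    n_cells = len(cells)
    return [sum(gene_counts) / n_cells for gene_counts in zip(*cells)]


def shifted_log(values, shift=1.0):
    """Apply log(x + shift) element-wise; shift=1 keeps zeros at zero."""
    return [math.log(value + shift) for value in values]


# Two made-up cells with counts for three genes.
cells = [
    [0, 2, 4],
    [2, 2, 0],
]
bulk = pseudobulk_mean(cells)
lognorm = shifted_log(bulk)
```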

Component type: Control Method

Path: src/control_methods

A control method.

Arguments:

| Name | Type | Description |
| --- | --- | --- |
| --layer | string | (Optional) Which layer of the perturbation data to use to find TF-gene relationships. Default: scgen_pearson. |
| --prediction | file | (Optional, Output) GRN prediction. |
| --tf_all | file | NA. |

Component type: Method r

Path: src/methods_r

A GRN inference method

Arguments:

| Name | Type | Description |
| --- | --- | --- |
| --multiomics_rna_r | file | (Optional) NA. |
| --multiomics_atac_r | file | (Optional) NA. |
| --prediction | file | (Optional, Output) GRN prediction. |
| --temp_dir | string | (Optional) NA. Default: output/temdir. |
| --num_workers | integer | (Optional) NA. Default: 4. |