Identification and validation of microbial biomarkers from cross-cohort datasets using xMarkerFinder
xMarkerFinder is a four-stage workflow for microbiome research including differential signature identification, model construction, model validation, and biomarker interpretation. Detailed scripts, example files, and a ready-to-use docker image are provided. We also provide a user-friendly web server for easier implementation. Feel free to explore the web server and discover more about xMarkerFinder! Manuscript is available at https://doi.org/10.21203/rs.3.pex-1984/v1.
Please cite: Wenxing Gao, Wanning Chen, Wenjing Yin et al. Identification and validation of microbial biomarkers from cross-cohort datasets using xMarkerFinder, 25 August 2022, PROTOCOL (Version 1) available at Protocol Exchange [https://doi.org/10.21203/rs.3.pex-1984/v1]
The protocol can be executed on standard computational hardware, and greater computational resources would allow for faster execution. The development and test of this protocol have been conducted on a MacBook Pro equipped with a 2.4-GHz quad-core eighth-generation Intel Core i5 processor and 16-GB 2133-MHz LPDDR3 memory.
- R v.3.6.1 or newer (https://www.r-project.org)
- Python3 v3.7 or newer (https://www.python.org)
- HAllA (https://huttenhower.sph.harvard.edu/halla)
- FastSpar (https://github.com/scwatts/FastSpar)
- Gephi (https://gephi.org)
- BiocManager (https://github.com/Bioconductor/BiocManager) to ensure the installation of the following packages and their dependencies.
- MMUPHin (https://huttenhower.sph.harvard.edu/mmuphin)
- dplyr (https://dplyr.tidyverse.org/)
- vegan (https://github.com/vegandevs/vegan)
- XICOR (https://github.com/cran/XICOR)
- eva (https://github.com/brianbader/eva_package)
- labdsv (https://github.com/cran/labdsv)
- Boruta (https://gitlab.com/mbq/Boruta, optional)
- pandas (https://pandas.pydata.org)
- NumPy (https://numpy.org/)
- scikit-learn (https://scikit-learn.org)
- bioinfokit (https://github.com/reneshbedre/bioinfokit)
- Bayesian Optimization (https://github.com/fmfn/BayesianOptimization)
- Matplotlib (https://matplotlib.org/)
- seaborn (https://seaborn.pydata.org/)
Above software list provides the minimal requirements for the complete execution of xMarkerFinder locally. Alternatively, we provide a ready-to-use Docker image, enabling users to skip the software installation and environment setup (https://hub.docker.com/r/tjcadd2022/xmarkerfinder). Additionally, an interactive JupyterHub server (https://mybinder.org/v2/gh/tjcadd2020/xMarkerFinder/HEAD) is also available.
Installation of R on different platforms can be conducted following the instructions on the official website (https://www.r-project.org/). All R packages used in this protocol can be installed following given commands.
> install.packages(Package_Name)
or
> if (!requireNamespace(“BiocManager”, quietly = TRUE))
> install.packages(“BiocManager”)
> BiocManager::install(Package_Name)
Python can be downloaded and installed from its official website (https://www.python.org/), and all python packages could be installed using pip.
$ pip install Package_Name
HAllA can be installed according to its website (https://huttenhower.sph.harvard.edu/halla/) with the following command.
$ pip install halla
FastSpar can be installed following its GitHub repository (https://github.com/scwatts/fastspar). Installation through conda:
$ conda install -c bioconda -c conda-forge fastspar
Or compiling from source code:
$ git clone https://github.com/scwatts/fastspar.git
$ cd fastspar
$./autogen.sh
$./configure --prefix=/usr/
$ make
$ make install
Gephi could be freely downloaded and installed from its website (https://gephi.org/).
To provide easier implementation, we provide a Docker image to replace above Equipment setup steps excluding Gephi. Firstly, users should download and install Docker (https://docs.docker.com/engine/install/) and then setup the xMarkerFinder computational environment. All scripts in the Procedure part below should be executed within the Docker container created from the xMarkerFinder Docker image.
$ docker pull tjcadd2022/xmarkerfinder:1.0.16
$ docker run -it -v $(pwd):/work tjcadd2022/xmarkerfinder:1.0.16 /bin/bash
-it Run containers in an interactive mode, allowing users to execute commands and access files within the docker container.
-v Mounts a volume between present working directory in your local machine to the /work directory in the docker container.
To mitigate challenges induced by different number of sequencing (e.g., library size), microbial count matrices are often normalized by various computational strategies prior to downstream analyses. Here, xMarkerFinder takes the proportional normalization as its default algorithm for determining relative abundances (REL), other normalization methods are also available, including AST, CLR, and TMM.
$ Rscript 1_Normalization.R -W /workplace/ -p abundance.txt -o TEST
Users should specify these parameters or enter the default values, subsequent repetitions of which are not listed.
-W the Workplace of this whole protocol
-p the input microbial count profile
-m the normalization method (REL, AST, CLR, TMM)
-o prefix of output files
- Input files:
abundance.txt: merged microbial count profile of all datasets. - Output files:
normalized_abundance.txt: normalized abundance profile of input dataset. Normalized abundance profiles are used as input files for all subsequent analyses, except for Step 11, which requires raw count file.
Rare signatures, those with low occurrence rates across cohorts are discarded (default: prevalence below 20% of samples) to ensure that identified biomarkers are reproducible and could be applied to prospective cohorts.
$ Rscript 2_Filtering.R -W /workplace/ -m train_metadata.txt -p normalized_abundance.txt -b Cohort -t 2 -o TEST
-m the input metadata file
-p the input microbial normalized abundance file (output file of Step 1)
-b the column name of batch(cohort) in metadata (default: Cohort)
-t the minimum number of cohorts where features have to occur (default: 2)
-O prefix of output files
- Input files:
train_metadata.txt: the clinical metadata of the training dataset.
normalized_abundance.txt: normalized abundance profile of the training dataset. - Output files:
filtered_abundance.txt: filtered normalized abundance profile of the training dataset, used as the input file for following steps.
Inter-cohort heterogeneity caused by variance in confounders is inevitable in meta-analyses, strongly affecting downstream differential signature identification. Permutational multivariate analysis of variance (PERMANOVA) test, one of the most widely used nonparametric methods to fit multivariate models based on dissimilarity metric in microbial studies, quantifies microbial variations attributable to each metadata variable, thus assigning a delegate to evaluate confounding effects. PERMANOVA test here is performed on Bray-Curtis (recommended for REL and TMM normalized data) or Eucledian (recommended for AST and CLR normalized data) dissimilarity matrices. For each metadata variable, coefficient of determination (R2) value and p value are calculated to explain how variation is attributed. The variable with the most predominant impact on microbial profiles is treated as major batch, and other confounders are subsequently used as covariates in Step 4. Principal coordinate analysis (PCoA) plot is also provided.
$ Rscript 3_Confounder_analysis.R -W /workplace/ -m train_metadata.txt -p filtered_abundance.txt -d bc -c 999 -g Group -o TEST
-m input metadata file
-p input filtered microbial abundance file
-d distance matrix (bc, euclidean)
-c permutation count (default: 999)
-g the column name of experimental interest(group) in metadata (default: Group)
- Input files:
train_metadata.txt: the clinical metadata of the training dataset.
filtered_abundance.txt: filtered abundance profile after preprocessing. - Output files:
metadata_microbiota.txt: the confounding effects caused by clinical information, used to determine the major batch and covariates.
pcoa_plot.pdf: the PCoA plot with Bray-Curtis dissimilarity between groups.
To identify disease or trait-associated microbial signatures across cohorts, MMUPHin is employed. Regression analyses in individual cohorts are performed using the well-validated Microbiome Multivariable Association with Linear Models (MaAsLin2) package, where multivariable associations between phenotypes, experimental groups or other metadata factors and microbial profiles are determined. These results are then aggregated with established fixed effects models to test for consistently differential signatures between groups with the major confounder (determined in Step 3) set as the main batch and other minor confounders (e.g., demographic indices, technical differences) as covariates. Signatures with consistently significant differences in meta-analysis are identified as cross-cohort differential signatures and used for further feature selection in subsequent stages. Users can choose from using p values or adjusted p values. Volcano plot of differential signatures is provided.
$ Rscript 4_Differential_analysis.R -W /workplace/ -m train_metadata.txt -p filtered_abundance.txt -g Group -b Cohort -c covariates.txt -d p -t 0.05 -o TEST
-g the column name of experimental interest(group) in metadata (default: Group)
-b the column name of major confounder in metadata (default: Cohort)
-c input covariates file (tab-delimited format containing all covariates)
-d input choice indicating whether to use the adjusted p values rather than the raw p values and the adjusting values (F,bonf,fdr)
-t the threshold of p or q value for plotting (default: 0.05)
- Input files:
train_metadata.txt: the clinical metadata of the training dataset.
filtered_abundance.txt: filtered abundance profile after preprocessing.
covariates.txt: covariates identified in Step 3 (newly generated tab-delimited file where each row is a covariate, example file is provided). - Output files:
differential_significance_single_cohort.txt: the differential significance result in individual cohorts.
differential_significance.txt: meta-analytic testing results aggregating differential testing results in individual cohorts, used for next-step visualization.
differential_signature.txt: significantly differential signatures between groups derived from input filtered profiles, used as input files for feature selection.
differential_volcano.pdf: the volcano plot of input differential significance file.
This step provides optional classifier selection for subsequent steps where the performances of every ML algorithm are generally assessed using all differential signatures. The output file contains the cross-validation AUC, AUPR, MCC, specificity, sensitivity, accuracy, precision, and F1 score of all classification models built with these various algorithms. Users should specify the selected classifier in all the following steps.
$ python 5_Classifier_selection.py -W /workplace/ -m train_metadata.txt -p differential_signature.txt -g Group -e exposure -s 0 -o TEST
-p input differential signature file (output file of Step 4)
-g the column name of experimental interest(group) in metadata (default: Group)
-e the experiment group(exposure) of interest (in example data: CRC)
-s random seed (default:0)
- Input files:
train_metadata.txt: the clinical metadata of the training dataset.
differential_signature.txt: significantly differential signatures between groups. - Output files:
classifier_selection.txt: the overall cross-validation performance of all classifiers using differential signatures, used to determine the most suitable classifier.
The first step of Triple-E feature selection procedure evaluates the predictive capability of every feature via constructing individual classification models respectively. Users should specify an ML algorithm here and in every future step as the overall classifier for the whole protocol from the following options: LRl1, LRl2, KNN, SVC, DT, RF, and GB. Features with cross-validation AUC above the threshold (default:0.5) are defined as effective features and are returned in the output file.
$ python 6_Feature_effectiveness_evaluation.py -W /workplace/ -m train_metadata.txt -p differential_signature.txt -g Group -e exposure -b Cohort -c classifier -s 0 -t 0.5 -o TEST
-p input differential signature file (output file of Step 4)
-b the column name of batch(cohort) in metadata (default: Cohort)
-c selected classifier
-t AUC threshold for defining if a feature is capable of prediction (default:0.5)
- Input files:
train_metadata.txt: the clinical metadata of the training dataset.
differential_signature.txt: significantly differential signatures between groups. - Output files:
feature_auc.txt: cross-validation AUC values of individual features.
effective_feature.txt: features derived from differential signatures that are capable of predicting disease states, used as input file of the following step.
The second step of feature selection aims to exclude collinear issue caused by highly correlated features based on the result of Step 7 and returns the uncorrelated-effective features.
$ python 7_Collinear_feature_exclusion.py -W /workplace/ -p effective_feature.txt -t 0.7 -o TEST
-p input effective feature file (output file of Step 6)
-t correlation threshold for collinear feature exclusion (default:0.7)
- Input files:
metadata.txt: the clinical metadata of the training dataset.
effective_feature.txt: features with classification capability. - Output files:
feature_correlation.txt: spearman correlation coefficients of every feature pair.
uncorrelated_effective_feature.txt: features derived from input effective features excluding highly collinear features, used as input file of the following step.
The last step of feature selection recursively eliminates the weakest feature per loop to sort out the minimal panel of candidate biomarkers.
$ python 8_Recursive_feature_elimination.py -W /workplace/ -m train_metadata.txt -p uncorrelated_effective_feature.txt -g Group -e exposure -c classifier -s 0 -o TEST
-p input uncorrelated-effective feature file (output file of Step 7)
- Input files:
train_metadata.txt: the clinical metadata of the training dataset.
uncorrelated_effective_feature.txt: independent features derived from effective features. - Output files:
candidate_biomarker.txt: identified optimal panel of candidate biomarkers, used as model input for all subsequent steps.
Besides Triple-E feature selection procedure, we provide an alternative method, feature selection with the Boruta algorithm.
$ Rscript Boruta_feature_selection.R -W /workplace/ -m metadata.txt -p differential_signature.txt -g Group -s 0 -o TEST
-p input differential signature profile (output file of Step 4) or uncorrelated-effective feature file (output file of Step 7)
- Input files:
metadata.txt: the clinical metadata of the training dataset.
differential_signature.txt: differential signatures used for feature selection (could also be uncorrelated-effective features from Step 7). - Output files:
boruta_feature_imp.txt: confirmed feature importances via Boruta algorithm.
boruta_selected_feature.txt: selected feature profile, used as input candidate biomarkers for subsequent steps.
Based on the selected classifier and candidate biomarkers, the hyperparameters of the classification model are adjusted via bayesian optimization method based on cross-validation AUC. The output files contain the tuned hyperparameters and the multiple performance metric values of the constructed best-performing model.
$ python 9_Hyperparameter_tuning.py -W /workplace/ -m train_metadata.txt -p candidate_biomarker.txt -g Group -e exposure -c classifier -s 0 -o TEST
-p input candidate marker profile (output file of Step 8)
- Input files:
train_metadata.txt: the clinical metadata of the training dataset.
candidate_biomarker.txt: the optimal panel of candidate biomarkers (or boruta_selected_feature.txt for all subsequent steps). - Output files:
best_param.txt: the best hyperparameter combination of classification model.
optimal_cross_validation.txt: the overall cross-validation performance of the best-performing model.
cross_validation_auc.pdf: the visualization of the cross-validation AUC of the best-performing model.
As stated above, this step provides extensive internal validations to ensure the robustness and reproducibility of identified candidate biomarkers in different cohorts via intra-cohort validation, cohort-to-cohort transfer, and LOCO validation. Output files contain multiple performance metrics used to assess the markers internally, including AUC, specificity, sensitivity, accuracy, precision and F1 score.
$ python 10_Validation.py -W /workplace/ -m metadata.txt -p candidate_biomarker.txt -g Group -e exposure -b Cohort -c classifier -s 0 -o TEST
-p input optimal candidate marker file (output file of Step 8)
- Input files:
metadata.txt: the clinical metadata of the training dataset.
candidate_biomarker.txt: the optimal panel of candidate markers. - Output files:
validation_metric.txt: the overall performance of candidate biomarkers in internal validations. validation_metric.pdf: the visualization of input file.
As the best-performing candidate biomarkers and classification model are established, the test dataset is used to externally validate their generalizability. The input external metadata and microbial relative profiles need to be in the same format as initial input files for the training dataset. This step returns the overall performance of the model and its AUC plot.
$ python 11_Test.py -W /workplace/ -m train_metadata.txt -p candidate_biomarker.txt -a external_metadata.txt -x external_profile.txt -g Group -e exposure -c classifier -r hyperparamter.txt -s 0 -o TEST
-a input external metadata file for the test dataset
-x input external microbial relative abundance file as the test dataset
-r input optimal hyperparameter file (output file of Step 9)
- Input files:
train_metadata.txt: the clinical metadata of the training dataset.
candidate_biomarker.txt: the optimal panel of candidate biomarkers.
test_metadata.txt: the clinical metadata of the external test dataset.
test_profile.txt: the relative abundance matrix of the external test dataset. - Output files:
test_result.txt: the overall performance of model in external test dataset.
test_auc.pdf: the visualization of the AUC value in test_result.txt.
To further assess markers’ specificity for the experimental group of interest, they are used to construct classification models to discriminate between other related diseases and corresponding controls. Cross-validation AUC values of other classification models and visualization are returned.
$ python 12_Biomarker_specificity.py -W /workplace/ -p candidate_biomarker.txt -q test_metadata.txt -l test_relative_abundance.txt -a other_metadata.txt -x other_relative_abundance.txt -g Group -e CTR -b Cohort -c classifier -r best_param.txt -s 0 -o TEST
-q input external test metadata file for the test dataset
-l input external microbial relative abundance file as the test dataset
-a input metadata file of samples from other non-target diseases
-x input microbial relative abundance file of samples from other non-target diseases
-e the control group name (in example file: CTR)
-b the column name of cohort(in example file: Cohort)
- Input files:
candidate_biomarker.txt: the optimal panel of candidate biomarkers.
other_metadata.txt: the clinical metadata of samples for other diseases.
other_profile.txt: the relative abundance matrix of other diseases. - Output files:
specificity_result.txt: AUC values of models constructed with candidate biomarkers in other related diseases.
specificity_auc.pdf: the visualization of the specificity_result.txt.
Random samples of case and control class of other diseases are added into the classification model, respectively, both labeled as “control”, the variations of corresponding AUCs of which are calculated and used for visualization.
$ python 13_Model_specificity_add.py -W /workplace/ -m train_metadata.txt -p candidate_biomarker.txt -q test_metadata.txt -l test_profile.txt -a other_metadata.txt -x other_profile.txt -g Group -e exposure -b Cohort -c classifier -r hyperparamter.txt -n 5 -s 0 -o TEST
-q input external metadata file for the test dataset
-l input external microbial relative abundance file as the test dataset
-a input metadata file of samples from other diseases
-x input microbial relative abundance file of samples from other non-target diseases
-e the control group name (in example file: CTR)
-b the column name of cohort(dataset)
-n the number of samples to add into the model each time
- Input files:
train_metadata.txt: the clinical metadata of the training dataset.
candidate_biomarker.txt: the optimal panel of candidate markers.
test_metadata.txt: the clinical metadata of the external test dataset.
test_profile.txt: the relative abundance matrix of the external test dataset.
other_metadata.txt: the clinical metadata of samples for other non-target diseases.
other_profile.txt: the relative abundance matrix of other non-target diseases. - Output files:
specificity_add_result.txt: AUC values of models constructed with candidate markers in other non-target diseases.
specificity_add_auc.pdf: the visualization of the specificity_result.txt.
Permutation feature importance is employed here to evaluate biomarkers’ contributions in the best-performing classification model.
$ python 14_Biomarker_importance.py -W /workplace/ -m train_metadata.txt -p candidate_biomarker.txt -g Group -e exposure -c classifier -r best_param.txt -s 0 -o TEST
-p input candidate biomarkers (output file of Step 8)
-r input optimal hyperparameter file (output file of Step 9)
-n input number for biomarker abundance visualization
- Input files:
train_metadata.txt: the clinical metadata of the training dataset.
candidate_biomarker.txt: the optimal panel of candidate biomarkers.
best_param.txt: the best hyperparameter combination of classification model. - Output files:
biomarker_importance.txt: permutation feature importance of candidate biomarkers via ten permutations.
biomarker_importance.pdf: the visualization of feature importance file. Biomarker_distribution.pdf: the visualization for the abundances of the top n (set by users) important biomarkers in the discovery dataset.
As the input file for Step 16 needs to be microbial count profile in .tsv format where each row describes a microbial signature and each column represents a sample (could be converted profiles of all features, differential signatures, or candidate biomarkers according to users’ need, and null values needed to be set as 0) and header needs to start with “#OTU ID”, an additional file conversion script is provided.
$ python 15_Convert.py -W /workplace/ -p abundance.txt -s selected_feature.txt -o TEST
-p input feature raw count file before normalization.
-s selected features for calculating microbial correlation (could be differential signatures or candidate markers, output file of Step 4 or 8).
- Input files:
abundance.txt: microbial raw count profile before normalization.
selected_feature.txt: selected features for calculating microbial co-occurrence network (output file of Step 4 or 8) - Output files:
convert.tsv: the converted file appropriate for constructing microbial co-occurrence network.
$ ./16_Microbial_network.sh -W /workplace/ –i feature_abundance.tsv -o TEST -t 4
-i input feature abundance file
-t threads of available computational source
- Input files:
convert.tsv: microbial count profile in .tsv format where each row describes a microbial signature and each column represents a sample and the header needs to start with “#OTU ID”. Example input file is provided and users are recommended to user Step 15 to convert files into appropriate formats.
-t the threads of available computational source when running - Output files:
median_correlation.tsv: the correlation coefficients between every input signature pair.
pvalues.tsv: the statistical significance of median_correlation.tsv.
The visualization of Step 16 is performed using Gephi.
Preprocess of the results of Step 16 to ensure that Step 17 only draws significant correlations (pvalues<0.05) with absolute correlation coefficients above 0.5 (default).
$ python 17_Microbial_network_plot.py -W /workplace/ –c median_correlation.tsv -p pvalues.tsv -t 0.5 -o TEST
-c input network profile (output file of Step 16)
-p input pvalue profile (output file of Step 16)
-t input correlation threshold (default: 0.5)
- Input files: median_correlation.tsv: the correlation coefficients profile (output file of Step 16). pvalues.tsv: the statistical significance of median_correlation.tsv (output file of Step 16).
- Output files: microbial_network.csv: adjusted network profile for Gephi input, only significant correlations reserved.
18. Open Gephi and click "File" – "Import spreadsheet", and then choose the adjusted network profile.
20. Choose a preferable layout type to form the basic network and press the “stop” button when the network becomes stable (Fruchterman Reingold style is recommended).
21. For further optimization of the network, the appearances of nodes and edges should be adjusted according to users’ needs, as well as the labels of nodes.
If users have multi-omics or multidimensional microbial profiles of the same dataset, the correlation between different omics or dimensions is calculated via HAllA.
$ ./22_Multi_omics_correlation.sh -W /workplace/ -i microbial_abundance_1.txt -d microbial_abundance_2.txt -o TEST
-i input microbial abundance file 1
-d input microbial abundance file 2
- Input files:
microbial_abundance_1.txt: microbial abundance profile 1.
microbial_abundance_2.txt: microbial abundance profile 2. These two input files should have the same samples (columns) but different features (rows). - Output files:
results/all_associations.txt: associations between different omics or dimensions.
results/hallagram.png: the visualization of all_associations.txt with only significant associations highlighted.
It’s worth highlighting that xMarkerFinder is designed as a standard protocol with a high level of impartiality regarding data type and microbial habitat. In other words, xMarkerFinder’s versatility goes beyond its initial purpose in gut microbiome research, making it suitable for diverse microbial biomes. To provide further clarity, we present three examples showcasing the application of xMarkerFinder across various contexts.
Firstly, we used datasets from previous publications containing 16S rRNA gene sequencing data of the oral microbiome of patients with oral squamous cell carcinoma (OSCC) and controls. We applied xMarkerFinder to these oral microbiome datasets and successfully identified consistent microbial signatures associated with OCSS with great diagnostic capabilities.
Secondly, we employed metagenomic datasets from the Tara Ocean project to characterize important microbiota within the oceanic environment, capable of distinguishing between deep and surface regions.
To demonstrate its generalizability in different omics data, we further applied xMarkerFinder to transcriptomic datasets of non-alcoholic steatohepatitis (NASH) patients, using three publicly available NASH cohorts. The resulting classification model reached an impressive AUC value of 0.99, highlighting the robustness and applicability of xMarkerFinder.
These examples collectively serve as compelling evidence of the extensive scope of applicability inherent in xMarkerFinder.
xMarkerFinder is suitable for microbial biomarker identification from cross-cohort datasets. Our previous studies demonstrated its applicability in identifying global microbial diagnostic biomarkers for adenoma and colorectal cancer. Moreover, xMarkerFinder could also be applied to biomarker determination in disease prognosis, treatment stratification, metastasis surveillance, adverse reactions anticipation, etc. Any research dedicated to biomarker identification from multi-population microbial datasets is welcome.
We provide detailed instructions on software installation for users to run the whole xMarkerFinder workflow locally. However, we strongly encourage the usage of the provided docker image as it would significantly reduce potential errors in the entire installation and setup process. (https://hub.docker.com/r/tjcadd2022/xmarkerfinder)
Yes. The codes used in xMarkerFinder are deposited in our GitHub repository and can be freely downloaded and modified according to users’ specific needs. However, the modification might cause unprecedented errors and we encourage users to try different parameters first, and then modify the codes.
Yes. The whole xMarkerFinder workflow contains four stages (12 steps) and every stage/step can be conducted independently and users could skip any one of them according to specific study designs.
Yes. Although xMarkerFinder is developed for human microbiome studies, it is also generalizable to other microbial habitats.
The time needed for the whole workflow depends on the dataset size, selected algorithm, and computational resources available. The following time estimates are based on execution of our protocol on provided example datasets with all classifiers (Logistic Regression (LR, L1 and L2 regularization), K-nearest Neighbors (KNN) classifier, Support Vector classifier (SVC) with the Radial Basis Function kernel), Decision Tree (DT) classifier, Random Forest(RF) classifier, and Gradient Boosting (GB) classifier using the xMarkerFinder docker image on a MacBook Pro (2.4-GHz quad-core eighth-generation Intel Core i5 processor, 16-GB 2133-MHz LPDDR3 memory).
Stage | Step | LRl1 | LRl2 | SVC | KNN | DT | RF | GB |
---|---|---|---|---|---|---|---|---|
Stage1: Differential signature identification | 1 | 0m20.600s | 0m20.600s | 0m20.600s | 0m20.600s | 0m20.600s | 0m20.600s | 0m20.600s |
2 | 0m11.372s | 0m11.372s | 0m11.372s | 0m11.372s | 0m11.372s | 0m11.372s | 0m11.372s | |
3 | 1m21.356s | 1m21.356s | 1m21.356s | 1m21.356s | 1m21.356s | 1m21.356s | 1m21.356s | |
4 | 0m24.858s | 0m24.858s | 0m24.858s | 0m24.858s | 0m24.858s | 0m24.858s | 0m24.858s | |
Total | 2m18.186s | 2m18.186s | 2m18.186s | 2m18.186s | 2m18.186s | 2m18.186s | 2m18.186s | |
Stage2: Model construction | 5 | 0m12.464s | 0m12.464s | 0m12.464s | 0m12.464s | 0m12.464s | 0m12.464s | 0m12.464s |
6 | 0m2.733s | 0m3.032s | 0m50.913s | 0m3.105s | 0m3.252s | 1m43.332s | 0m49.196s | |
7 | 0m0.846s | 0m1.150s | 0m1.102s | 0m1.178s | 0m1.015s | 0m0.863s | 0m1.216s | |
8 | 0m2.447s | 0m18.449s | 10m32.261s | 0m21.103s | 0m53.413s | 18m37.552s | 47m59.647s | |
9 | 0m30.420s | 0m24.735s | 0m35.112s | 0m42.348s | 0m34.801s | 8m57.417s | 8m12.045s | |
Total | 0m48.91s | 0m59.83s | 12m11.852s | 1m20.198s | 1m44.945s | 29m31.628s | 57m14.568s | |
Stage3: Model validation | 10 | 4m30.737s | 4m42.105s | 10m15.050s | 6m10.515s | 4m31.044s | 91m52.940s | 65m47.511s |
11 | 0m3.896s | 0m3.776s | 0m3.150s | 0m3.761s | 0m4.002 | 0m7.120s | 0m4.266s | |
12 | 0m4.877s | 0m4.764s | 0m4.426s | 0m5.287s | 0m5.315s | 2m25.064s | 0m36.946s | |
13 | 0m5.941s | 0m5.982 | 0m22.211s | 0m7.342s | 0m6.646s | 2m21.262s | 0m39.554s | |
Total | 4m45.451s | 4m56.627 | 10m44.837s | 6m26.905s | 4m47.007s | 96m46.386s | 67m8.277s | |
Stage4: Biomarker interpretation | 14 | 0m3.270s | 0m3.599s | 0m16.746s | 0m21.809s | 0m4.041s | 0m46.265s | 0m5.028s |
15-21 | 6m32.696s | 6m32.696s | 6m32.696s | 6m32.696s | 6m32.696s | 6m32.696s | 6m32.696s | |
22 | 7m57.119s | 7m57.119s | 7m57.119s | 7m57.119s | 7m57.119s | 7m57.119s | 7m57.119s | |
Total | 14m33.085s | 14m33.414s | 14m46.561s | 14m51.624s | 14m33.856s | 15m16.080s | 14m34.843s | |
Total | / | 22m25.632s | 22m48.057s | 40m1.436s | 24m56.913s | 23m23.994s | 143m52.280s | 141m15.874s |
A preliminary understanding of shell scripts would allow users to complete the whole workflow. Intermediate experience in R and Python would facilitate users to interpret and modify the codes.
Yes. xMarkerFinder aims to integrate different datasets and establish replicable biomarkers. However, xMarkerFinder differs from systematic review as it integrates original datasets instead of the respective results.
Processed microbial count matrices and corresponding metadata are required. For cross-cohort analysis, we require merged datasets from at least three cohorts in the discovery set to accomplish the full protocol with internal validations. xMarkerFinder is well adapted to microbial taxonomic and functional profiles derived from both amplicon and whole metagenomics sequencing data, as well as other omics layers, including but not limited to metatranscriptomics, metaproteomics, and metabolomics.
To perform meta-analysis, corresponding sample groups are required. Other metadata indices, such as body mass index, age, and gender are recommended but unnecessary. However, it is worth noticing that the absence of metadata information might compromise the correction for confounding effects and the identification of microbial biomarkers.
To mitigate challenges induced by different numbers of sequencing (e.g. library sizes), microbial count profiles are converted to relative abundances for subsequent analysis in xMarkerFinder.
To identify a replicable panel of microbial biomarkers, we need to exclude rare microbial features, those with low occurrence rates across cohorts as they are not ideal candidates as global biomarkers.
To ensure models’ reliability, datasets are split into training/discovery and test sets. The training set is used to train and have the model learn the hidden pattern. The test set is used to test the model after completing the training process and provides unbiased final model performance results.
Potential installation problems and solutions are provided in our manuscript, and most problems would be avoided by simply using the docker image we provided instead of running all scripts locally (https://hub.docker.com/r/tjcadd2022/xmarkerfinder).
Step 5 provides the evaluation of multiple commonly used algorithms in machine learning, and users could choose the most suitable algorithm based on these results. However, due to its robustness and interpretability, Random Forest classifiers are considered suitable for most microbiome datasets. Therefore, step 5 is not compulsory and we recommend users to build Random Forest models first, and move to other classifiers if they underperform.
For most scenarios, the default parameters would work. For further exploration, users are encouraged to try different parameters to get better results.
AUC is the area under the ROC curve (the plot of the Sensitivity as y-axis versus 1-Specificity as x-axis). A perfect classifier gives an AUC of 1 while a simple classifier that makes completely random guesses gives an AUC of 0.5.