Identification and validation of microbial biomarkers from cross-cohort datasets using xMarkerFinder

xMarkerFinder is a four-stage workflow for microbiome research including differential signature identification, model construction, model validation, and biomarker interpretation. Detailed scripts, example files, and a ready-to-use docker image are provided. We also provide a user-friendly web server for easier implementation. Feel free to explore the web server and discover more about xMarkerFinder! Manuscript is available at https://doi.org/10.21203/rs.3.pex-1984/v1.

Citation

Please cite: Wenxing Gao, Wanning Chen, Wenjing Yin et al. Identification and validation of microbial biomarkers from cross-cohort datasets using xMarkerFinder, 25 August 2022, PROTOCOL (Version 1) available at Protocol Exchange [https://doi.org/10.21203/rs.3.pex-1984/v1]

Installation

Hardware

The protocol can be executed on standard computational hardware; greater computational resources allow faster execution. This protocol was developed and tested on a MacBook Pro equipped with a 2.4-GHz quad-core eighth-generation Intel Core i5 processor and 16-GB 2133-MHz LPDDR3 memory.

Docker image

The software described under Software setup is the minimum required to run xMarkerFinder completely on a local machine. Alternatively, we provide a ready-to-use Docker image that lets users skip software installation and environment setup (https://hub.docker.com/r/tjcadd2022/xmarkerfinder). An interactive JupyterHub server (https://mybinder.org/v2/gh/tjcadd2020/xMarkerFinder/HEAD) is also available.

Software setup

Installation of R and R packages

R can be installed on different platforms by following the instructions on the official website (https://www.r-project.org/). All R packages used in this protocol can be installed with the following commands.

> install.packages(Package_Name)

or

> if (!requireNamespace("BiocManager", quietly = TRUE))
> install.packages("BiocManager")
> BiocManager::install(Package_Name)

Installation of Python and Python packages

Python can be downloaded and installed from its official website (https://www.python.org/), and all Python packages can be installed using pip.

$ pip install Package_Name

Installation of HAllA

HAllA can be installed according to its website (https://huttenhower.sph.harvard.edu/halla/) with the following command.

$ pip install halla

Installation of FastSpar

FastSpar can be installed following its GitHub repository (https://github.com/scwatts/fastspar). Installation through conda:

$ conda install -c bioconda -c conda-forge fastspar

Or compiling from source code:

$ git clone https://github.com/scwatts/fastspar.git
$ cd fastspar
$ ./autogen.sh
$ ./configure --prefix=/usr/
$ make
$ make install

Installation of Gephi

Gephi could be freely downloaded and installed from its website (https://gephi.org/).

Docker image setup

For easier implementation, we provide a Docker image that replaces the Software setup steps above, excluding Gephi. Users should first download and install Docker (https://docs.docker.com/engine/install/) and then set up the xMarkerFinder computational environment. All scripts in the User tutorial below should be executed within the Docker container created from the xMarkerFinder Docker image.

$ docker pull tjcadd2022/xmarkerfinder:1.0.16
$ docker run -it -v $(pwd):/work tjcadd2022/xmarkerfinder:1.0.16 /bin/bash

-it Runs the container in interactive mode, allowing users to execute commands and access files within the docker container.  
-v Mounts the present working directory on your local machine to the /work directory in the docker container.

User tutorial

Stage 1 Differential signature identification

1. Data normalization.

To mitigate challenges induced by differences in sequencing depth (e.g., library size), microbial count matrices are often normalized by various computational strategies prior to downstream analyses. xMarkerFinder uses proportional normalization to relative abundances (REL) as its default algorithm; other normalization methods are also available, including AST, CLR, and TMM.

$ Rscript 1_Normalization.R -W /workplace/ -p abundance.txt -o TEST

Users should specify these parameters or accept the default values. Parameters that recur in later steps are described only once.

-W the Workplace of this whole protocol  
-p the input microbial count profile
-m the normalization method (REL, AST, CLR, TMM)
-o prefix of output files

Input files:
abundance.txt: merged microbial count profile of all datasets.
Output files:
normalized_abundance.txt: normalized abundance profile of input dataset. Normalized abundance profiles are used as input files for all subsequent analyses, except for Step 11, which requires raw count file.
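
As a rough illustration of what the REL, AST, and CLR options compute for a single sample, here is a minimal Python sketch. It is not the code of 1_Normalization.R: the function names are hypothetical, the CLR pseudocount is an assumption made here to avoid log(0), and TMM is omitted because it needs a reference sample and trimming rules.

```python
import math


def rel_normalize(counts):
    """Proportional (REL) normalization: each count divided by the sample total."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [c / total for c in counts]


def ast_normalize(counts):
    """Arcsine square-root transform (AST) applied to relative abundances."""
    return [math.asin(math.sqrt(p)) for p in rel_normalize(counts)]


def clr_normalize(counts, pseudocount=1.0):
    """Centered log-ratio (CLR): log value minus the mean log value of the sample.

    The pseudocount is an assumption of this sketch, not a documented default.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


# One toy sample with four features (raw counts).
sample = [10, 30, 60, 0]
rel = rel_normalize(sample)
ast = ast_normalize(sample)
clr = clr_normalize(sample)
print("REL:", rel)
print("AST:", ast)
print("CLR:", clr)
```

REL values sum to 1 within a sample, while CLR values sum to 0, which is one reason the two families are paired with different dissimilarity measures in Step 3.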

2. Data filtering.

Rare signatures, those with low occurrence rates across cohorts, are discarded (default: prevalence below 20% of samples) to ensure that identified biomarkers are reproducible and can be applied to prospective cohorts.

$ Rscript 2_Filtering.R -W /workplace/ -m train_metadata.txt -p normalized_abundance.txt -b Cohort -t 2 -o TEST

-m the input metadata file  
-p the input microbial normalized abundance file (output file of Step 1)  
-b the column name of batch(cohort) in metadata (default: Cohort)  
-t the minimum number of cohorts where features have to occur (default: 2)  
-o prefix of output files

Input files:
train_metadata.txt: the clinical metadata of the training dataset.
normalized_abundance.txt: normalized abundance profile of the training dataset.
Output files:
filtered_abundance.txt: filtered normalized abundance profile of the training dataset, used as the input file for following steps.
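
The filtering rule can be sketched in a few lines of Python. This is one plausible reading of the description above, not the logic of 2_Filtering.R: here a feature is kept when its prevalence reaches the threshold in at least `min_cohorts` cohorts. The function name, the data layout, and the exact way the two thresholds combine are assumptions.

```python
def filter_features(abundance, cohorts, min_prevalence=0.2, min_cohorts=2):
    """Keep features that are prevalent enough in enough cohorts.

    abundance: dict mapping feature name -> list of values, one per sample.
    cohorts:   list of cohort labels, aligned with the sample order.
    """
    cohort_names = sorted(set(cohorts))
    kept = {}
    for feature, values in abundance.items():
        passing = 0
        for name in cohort_names:
            in_cohort = [v for v, c in zip(values, cohorts) if c == name]
            # Prevalence: fraction of samples in this cohort with non-zero abundance.
            prevalence = sum(1 for v in in_cohort if v > 0) / len(in_cohort)
            if prevalence >= min_prevalence:
                passing += 1
        if passing >= min_cohorts:
            kept[feature] = values
    return kept


cohorts = ["A", "A", "B", "B", "C", "C"]
abundance = {
    "taxon_common": [0.1, 0.2, 0.0, 0.3, 0.1, 0.0],  # present in all three cohorts
    "taxon_rare": [0.0, 0.0, 0.0, 0.0, 0.4, 0.0],    # present in one cohort only
}
filtered = filter_features(abundance, cohorts)
print(sorted(filtered))
```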

3. Confounder analysis.

Inter-cohort heterogeneity caused by variance in confounders is inevitable in meta-analyses and strongly affects downstream differential signature identification. The permutational multivariate analysis of variance (PERMANOVA) test, one of the most widely used nonparametric methods for fitting multivariate models to dissimilarity metrics in microbial studies, quantifies the microbial variation attributable to each metadata variable and thereby provides a measure of its confounding effect. Here, PERMANOVA is performed on Bray-Curtis (recommended for REL and TMM normalized data) or Euclidean (recommended for AST and CLR normalized data) dissimilarity matrices. For each metadata variable, the coefficient of determination (R2) and p value are calculated to show how much variation it explains. The variable with the largest impact on microbial profiles is treated as the major batch, and other confounders are subsequently used as covariates in Step 4. A principal coordinate analysis (PCoA) plot is also provided.

$ Rscript 3_Confounder_analysis.R -W /workplace/ -m train_metadata.txt -p filtered_abundance.txt -d bc -c 999 -g Group -o TEST

-m input metadata file  
-p input filtered microbial abundance file
-d distance matrix (bc, euclidean)
-c permutation count (default: 999)
-g the column name of experimental interest(group) in metadata (default: Group)

Input files:
train_metadata.txt: the clinical metadata of the training dataset.
filtered_abundance.txt: filtered abundance profile after preprocessing.
Output files:
metadata_microbiota.txt: the confounding effects caused by clinical information, used to determine the major batch and covariates.
pcoa_plot.pdf: the PCoA plot with Bray-Curtis dissimilarity between groups.
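
To make the R2 and p value reported in metadata_microbiota.txt more concrete, the following Python sketch computes a one-factor PERMANOVA on Bray-Curtis dissimilarities using the standard pseudo-F statistic. It is a textbook illustration under stated assumptions, not the implementation used by 3_Confounder_analysis.R; all function names and the toy data are hypothetical.

```python
import random


def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    num = sum(abs(a - b) for a, b in zip(x, y))
    den = sum(a + b for a, b in zip(x, y))
    return num / den if den > 0 else 0.0


def distance_matrix(samples):
    n = len(samples)
    return [[bray_curtis(samples[i], samples[j]) for j in range(n)] for i in range(n)]


def pseudo_f(dist, labels):
    """Return (pseudo-F, R2) for one grouping variable."""
    n = len(labels)
    groups = sorted(set(labels))
    ss_total = sum(dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)) / n
    ss_within = 0.0
    for g in groups:
        idx = [i for i, lab in enumerate(labels) if lab == g]
        ss_within += sum(dist[i][j] ** 2 for k, i in enumerate(idx) for j in idx[k + 1:]) / len(idx)
    ss_among = ss_total - ss_within
    f_stat = (ss_among / (len(groups) - 1)) / (ss_within / (n - len(groups)))
    return f_stat, ss_among / ss_total


def permanova(dist, labels, permutations=999, seed=0):
    """Permutation p value: share of label shuffles with pseudo-F >= observed."""
    rng = random.Random(seed)
    f_obs, r2 = pseudo_f(dist, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        f_perm, _r2 = pseudo_f(dist, shuffled)
        if f_perm >= f_obs:
            hits += 1
    return f_obs, r2, (hits + 1) / (permutations + 1)


# Toy relative abundance profiles: three samples per group.
samples = [
    [0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6], [0.2, 0.2, 0.6], [0.1, 0.4, 0.5],
]
labels = ["CRC", "CRC", "CRC", "CTR", "CTR", "CTR"]
dist = distance_matrix(samples)
f_obs, r2, p_value = permanova(dist, labels, permutations=199)
print(f"pseudo-F={f_obs:.2f} R2={r2:.2f} p={p_value:.3f}")
```

Running the same function once per metadata column and ranking the columns by R2 mirrors how the major batch is chosen in this step. With only six samples the permutation p value cannot become very small, so the toy output should not be read as a significance claim.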

4. Differential analysis.

To identify disease or trait-associated microbial signatures across cohorts, MMUPHin is employed. Regression analyses in individual cohorts are performed using the well-validated Microbiome Multivariable Association with Linear Models (MaAsLin2) package, where multivariable associations between phenotypes, experimental groups or other metadata factors and microbial profiles are determined. These results are then aggregated with established fixed effects models to test for consistently differential signatures between groups with the major confounder (determined in Step 3) set as the main batch and other minor confounders (e.g., demographic indices, technical differences) as covariates. Signatures with consistently significant differences in meta-analysis are identified as cross-cohort differential signatures and used for further feature selection in subsequent stages. Users can choose from using p values or adjusted p values. Volcano plot of differential signatures is provided.

$ Rscript 4_Differential_analysis.R -W /workplace/ -m train_metadata.txt -p filtered_abundance.txt -g Group -b Cohort -c covariates.txt -d p -t 0.05 -o TEST

-g the column name of experimental interest(group) in metadata (default: Group)  
-b the column name of major confounder in metadata (default: Cohort)  
-c input covariates file (tab-delimited format containing all covariates)
-d input choice indicating whether to use the adjusted p values rather than the raw p values and the adjusting values (F,bonf,fdr)
-t the threshold of p or q value for plotting (default: 0.05)

Input files:
train_metadata.txt: the clinical metadata of the training dataset.
filtered_abundance.txt: filtered abundance profile after preprocessing.
covariates.txt: covariates identified in Step 3 (newly generated tab-delimited file where each row is a covariate, example file is provided).
Output files:
differential_significance_single_cohort.txt: the differential significance result in individual cohorts.
differential_significance.txt: meta-analytic testing results aggregating differential testing results in individual cohorts, used for next-step visualization.
differential_signature.txt: significantly differential signatures between groups derived from input filtered profiles, used as input files for feature selection.
differential_volcano.pdf: the volcano plot of input differential significance file.
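
The aggregation of per-cohort regression results with a fixed effects model can be illustrated with the standard inverse-variance estimator. The sketch below is a generic textbook version, assuming each cohort contributes one coefficient and its standard error for a given signature; it is not MMUPHin's code, and the function name is hypothetical.

```python
import math


def fixed_effects_meta(betas, standard_errors):
    """Inverse-variance weighted fixed-effects pooling of per-cohort effects.

    Returns (pooled effect, pooled standard error, two-sided p value).
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    z = pooled_beta / pooled_se
    # Two-sided p value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled_beta, pooled_se, p_value


# Toy per-cohort coefficients for one signature (three cohorts).
betas = [0.42, 0.35, 0.51]
standard_errors = [0.15, 0.20, 0.18]
pooled_beta, pooled_se, p_value = fixed_effects_meta(betas, standard_errors)
print(f"pooled beta={pooled_beta:.3f} se={pooled_se:.3f} p={p_value:.2e}")
```

Cohorts with smaller standard errors receive larger weights, so a signature is reported as consistently differential only when the cohorts agree in direction and at least some of them estimate the effect precisely.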

Stage 2 Model construction

5. Classifier selection.

This step provides optional classifier selection for subsequent steps where the performances of every ML algorithm are generally assessed using all differential signatures. The output file contains the cross-validation AUC, AUPR, MCC, specificity, sensitivity, accuracy, precision, and F1 score of all classification models built with these various algorithms. Users should specify the selected classifier in all the following steps.

$ python 5_Classifier_selection.py -W /workplace/ -m train_metadata.txt -p differential_signature.txt -g Group -e exposure -s 0 -o TEST

-p input differential signature file (output file of Step 4)
-g the column name of experimental interest(group) in metadata (default: Group)
-e the experiment group(exposure) of interest (in example data: CRC)
-s random seed (default:0)

Input files:
train_metadata.txt: the clinical metadata of the training dataset.
differential_signature.txt: significantly differential signatures between groups.
Output files:
classifier_selection.txt: the overall cross-validation performance of all classifiers using differential signatures, used to determine the most suitable classifier.

6. Feature effectiveness evaluation

The first step of Triple-E feature selection procedure evaluates the predictive capability of every feature via constructing individual classification models respectively. Users should specify an ML algorithm here and in every future step as the overall classifier for the whole protocol from the following options: LRl1, LRl2, KNN, SVC, DT, RF, and GB. Features with cross-validation AUC above the threshold (default:0.5) are defined as effective features and are returned in the output file.

$ python 6_Feature_effectiveness_evaluation.py -W /workplace/ -m train_metadata.txt -p differential_signature.txt -g Group -e exposure -b Cohort -c classifier -s 0 -t 0.5 -o TEST

-p input differential signature file (output file of Step 4)
-b the column name of batch(cohort) in metadata (default: Cohort)
-c selected classifier
-t AUC threshold for defining if a feature is capable of prediction (default:0.5)

Input files:
train_metadata.txt: the clinical metadata of the training dataset.
differential_signature.txt: significantly differential signatures between groups.
Output files:
feature_auc.txt: cross-validation AUC values of individual features.
effective_feature.txt: features derived from differential signatures that are capable of predicting disease states, used as input file of the following step.
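
The idea of scoring one feature at a time can be sketched without any classifier by computing the AUC of the raw feature values. This is a deliberate simplification: the actual step builds a cross-validated classification model per feature, whereas the sketch below ranks samples by abundance directly. The function names are hypothetical, and treating depleted features as informative (by taking the larger of AUC and 1 - AUC) is an assumption of this sketch.

```python
def auc_from_scores(scores, labels):
    """AUC as the probability that a random case outranks a random control.

    labels: 1 for the exposure group, 0 for controls. Ties count as 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def effective_features(features, labels, threshold=0.5):
    """Return {feature: AUC} for features whose AUC exceeds the threshold."""
    kept = {}
    for name, values in features.items():
        auc = auc_from_scores(values, labels)
        auc = max(auc, 1.0 - auc)  # assumption: direction of change does not matter
        if auc > threshold:
            kept[name] = auc
    return kept


labels = [1, 1, 1, 0, 0, 0]
features = {
    "enriched": [0.9, 0.8, 0.7, 0.1, 0.2, 0.3],
    "depleted": [0.1, 0.2, 0.3, 0.9, 0.8, 0.7],
    "flat": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
}
effective = effective_features(features, labels)
print(effective)
```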

7. Collinear feature exclusion.

The second step of feature selection addresses the collinearity caused by highly correlated features: based on the result of Step 6, it excludes such features and returns the uncorrelated-effective features.

$ python 7_Collinear_feature_exclusion.py -W /workplace/ -p effective_feature.txt -t 0.7 -o TEST

-p input effective feature file (output file of Step 6)
-t correlation threshold for collinear feature exclusion (default:0.7)

Input files:
metadata.txt: the clinical metadata of the training dataset.
effective_feature.txt: features with classification capability.
Output files:
feature_correlation.txt: Spearman correlation coefficients of every feature pair.
uncorrelated_effective_feature.txt: features derived from input effective features excluding highly collinear features, used as input file of the following step.
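
A minimal Python sketch of correlation-based exclusion is shown below. It computes Spearman coefficients with average ranks for ties and then walks through the features greedily. Which member of a correlated pair is kept is not stated above, so keeping the feature that comes first in the input order is an assumption of this sketch, as are all function names.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def drop_collinear(features, threshold=0.7):
    """Greedy exclusion: keep a feature only if it is not highly correlated
    with any feature already kept. Input order sets the priority."""
    kept = []
    for name in features:
        if all(abs(spearman(features[name], features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


features = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],  # monotone in "a", so rho = 1
    "c": [3, 1, 4, 2, 5],   # rho = 0.5 with "a"
}
uncorrelated = drop_collinear(features)
print(uncorrelated)
```

Ordering the input by the per-feature AUC from Step 6 would make the greedy pass prefer the more predictive member of each correlated pair.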

8. Recursive feature elimination.

The last step of feature selection recursively eliminates the weakest feature per loop to sort out the minimal panel of candidate biomarkers.

$ python 8_Recursive_feature_elimination.py -W /workplace/ -m train_metadata.txt -p uncorrelated_effective_feature.txt -g Group -e exposure -c classifier -s 0 -o TEST

-p input uncorrelated-effective feature file (output file of Step 7)

Input files:
train_metadata.txt: the clinical metadata of the training dataset.
uncorrelated_effective_feature.txt: independent features derived from effective features.
Output files:
candidate_biomarker.txt: identified optimal panel of candidate biomarkers, used as model input for all subsequent steps.
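
The elimination loop can be sketched independently of any particular classifier. In the hedged example below, `score_fn` and `importance_fn` are hypothetical callables standing in for a cross-validated performance estimate and a model-derived feature importance; the toy versions exist only to make the sketch runnable and do not reflect how 8_Recursive_feature_elimination.py scores panels.

```python
def recursive_feature_elimination(features, score_fn, importance_fn):
    """Drop the weakest feature per loop and remember the best-scoring panel.

    score_fn(panel)      -> float, higher is better (e.g. cross-validation AUC).
    importance_fn(panel) -> dict mapping each feature in the panel to its importance.
    """
    current = list(features)
    best_panel, best_score = list(current), score_fn(current)
    while len(current) > 1:
        importances = importance_fn(current)
        weakest = min(current, key=lambda f: importances[f])
        current.remove(weakest)
        score = score_fn(current)
        if score >= best_score:  # on ties, prefer the smaller panel
            best_panel, best_score = list(current), score
    return best_panel, best_score


# Toy stand-ins: two informative features and two noise features.
SIGNAL = {"a", "b"}


def toy_score(panel):
    hits = sum(1 for f in panel if f in SIGNAL)
    noise = len(panel) - hits
    return 0.5 + 0.2 * hits - 0.01 * noise


def toy_importance(panel):
    table = {"a": 0.5, "b": 0.4, "n1": 0.01, "n2": 0.02}
    return {f: table[f] for f in panel}


panel, panel_score = recursive_feature_elimination(
    ["a", "b", "n1", "n2"], toy_score, toy_importance
)
print(panel, round(panel_score, 2))
```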

Boruta feature selection.

Besides Triple-E feature selection procedure, we provide an alternative method, feature selection with the Boruta algorithm.

$ Rscript Boruta_feature_selection.R -W /workplace/ -m metadata.txt -p differential_signature.txt -g Group -s 0 -o TEST

-p input differential signature profile (output file of Step 4) or uncorrelated-effective feature file (output file of Step 7)

Input files:
metadata.txt: the clinical metadata of the training dataset.
differential_signature.txt: differential signatures used for feature selection (could also be uncorrelated-effective features from Step 7).
Output files:
boruta_feature_imp.txt: confirmed feature importances via Boruta algorithm.
boruta_selected_feature.txt: selected feature profile, used as input candidate biomarkers for subsequent steps.

9. Hyperparameter tuning.

Based on the selected classifier and candidate biomarkers, the hyperparameters of the classification model are tuned via Bayesian optimization based on cross-validation AUC. The output files contain the tuned hyperparameters and multiple performance metrics of the best-performing model.

$ python 9_Hyperparameter_tuning.py -W /workplace/ -m train_metadata.txt -p candidate_biomarker.txt -g Group -e exposure -c classifier -s 0 -o TEST

-p input candidate marker profile (output file of Step 8)

Input files:
train_metadata.txt: the clinical metadata of the training dataset.
candidate_biomarker.txt: the optimal panel of candidate biomarkers (or boruta_selected_feature.txt for all subsequent steps).
Output files:
best_param.txt: the best hyperparameter combination of classification model.
optimal_cross_validation.txt: the overall cross-validation performance of the best-performing model.
cross_validation_auc.pdf: the visualization of the cross-validation AUC of the best-performing model.

Stage 3 Model validation

10. Internal validations (intra-cohort, cohort-to-cohort, and LOCO validation).

As stated above, this step provides extensive internal validations to ensure the robustness and reproducibility of identified candidate biomarkers in different cohorts via intra-cohort validation, cohort-to-cohort transfer, and LOCO validation. Output files contain multiple performance metrics used to assess the markers internally, including AUC, specificity, sensitivity, accuracy, precision and F1 score.

$ python 10_Validation.py -W /workplace/ -m metadata.txt -p candidate_biomarker.txt -g Group -e exposure -b Cohort -c classifier -s 0 -o TEST

-p input optimal candidate marker file (output file of Step 8)

Input files:
metadata.txt: the clinical metadata of the training dataset.
candidate_biomarker.txt: the optimal panel of candidate markers.
Output files:
validation_metric.txt: the overall performance of candidate biomarkers in internal validations.
validation_metric.pdf: the visualization of input file.
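
The cohort-aware splits behind these validations can be sketched as index lists. The example assumes LOCO stands for leave-one-cohort-out, which the text above does not spell out, and uses hypothetical function names; model fitting and metric calculation are left out.

```python
def loco_splits(cohorts):
    """Yield (held_out_cohort, train_indices, test_indices), one split per cohort.

    Each cohort is held out once while all remaining cohorts form the training set.
    """
    for held_out in sorted(set(cohorts)):
        train = [i for i, c in enumerate(cohorts) if c != held_out]
        test = [i for i, c in enumerate(cohorts) if c == held_out]
        yield held_out, train, test


def cohort_to_cohort_pairs(cohorts):
    """All ordered (train_cohort, test_cohort) pairs for cohort-to-cohort transfer."""
    names = sorted(set(cohorts))
    return [(a, b) for a in names for b in names if a != b]


# Cohort label of each sample, in the order of the metadata rows.
cohorts = ["A", "A", "B", "C", "C", "B"]
splits = list(loco_splits(cohorts))
pairs = cohort_to_cohort_pairs(cohorts)
for held_out, train, test in splits:
    print(held_out, "train:", train, "test:", test)
print(pairs)
```

With three cohorts this gives three LOCO splits and six transfer pairs, which is consistent with the requirement of at least three discovery cohorts for the full set of internal validations.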

11. Independent test.

As the best-performing candidate biomarkers and classification model are established, the test dataset is used to externally validate their generalizability. The input external metadata and microbial relative profiles need to be in the same format as initial input files for the training dataset. This step returns the overall performance of the model and its AUC plot.

$ python 11_Test.py -W /workplace/ -m train_metadata.txt -p candidate_biomarker.txt -a test_metadata.txt -x test_profile.txt -g Group -e exposure -c classifier -r best_param.txt -s 0 -o TEST

-a input external metadata file for the test dataset
-x input external microbial relative abundance file as the test dataset
-r input optimal hyperparameter file (output file of Step 9)

Input files:
train_metadata.txt: the clinical metadata of the training dataset.
candidate_biomarker.txt: the optimal panel of candidate biomarkers.
test_metadata.txt: the clinical metadata of the external test dataset.
test_profile.txt: the relative abundance matrix of the external test dataset.
Output files:
test_result.txt: the overall performance of model in external test dataset.
test_auc.pdf: the visualization of the AUC value in test_result.txt.

12. Biomarker specificity assessment.

To further assess markers’ specificity for the experimental group of interest, they are used to construct classification models to discriminate between other related diseases and corresponding controls. Cross-validation AUC values of other classification models and visualization are returned.

$ python 12_Biomarker_specificity.py -W /workplace/ -p candidate_biomarker.txt -q test_metadata.txt -l test_relative_abundance.txt -a other_metadata.txt -x other_relative_abundance.txt -g Group -e CTR -b Cohort -c classifier -r best_param.txt -s 0 -o TEST

-q input external test metadata file for the test dataset
-l input external microbial relative abundance file as the test dataset
-a input metadata file of samples from other non-target diseases
-x input microbial relative abundance file of samples from other non-target diseases
-e the control group name (in example file: CTR)
-b the column name of cohort(in example file: Cohort)

Input files:
candidate_biomarker.txt: the optimal panel of candidate biomarkers.
other_metadata.txt: the clinical metadata of samples for other diseases.
other_profile.txt: the relative abundance matrix of other diseases.
Output files:
specificity_result.txt: AUC values of models constructed with candidate biomarkers in other related diseases.
specificity_auc.pdf: the visualization of the specificity_result.txt.

13. Model specificity assessment.

Random samples from the case and control classes of other diseases are added to the classification model, both labeled as “control”, and the resulting changes in AUC are calculated and visualized.

$ python 13_Model_specificity_add.py -W /workplace/ -m train_metadata.txt -p candidate_biomarker.txt -q test_metadata.txt -l test_profile.txt -a other_metadata.txt -x other_profile.txt -g Group -e exposure -b Cohort -c classifier -r best_param.txt -n 5 -s 0 -o TEST

-q input external metadata file for the test dataset
-l input external microbial relative abundance file as the test dataset
-a input metadata file of samples from other diseases
-x input microbial relative abundance file of samples from other non-target diseases
-e the control group name (in example file: CTR)
-b the column name of cohort(dataset)
-n the number of samples to add into the model each time

Input files:
train_metadata.txt: the clinical metadata of the training dataset.
candidate_biomarker.txt: the optimal panel of candidate markers.
test_metadata.txt: the clinical metadata of the external test dataset.
test_profile.txt: the relative abundance matrix of the external test dataset.
other_metadata.txt: the clinical metadata of samples for other non-target diseases.
other_profile.txt: the relative abundance matrix of other non-target diseases.
Output files:
specificity_add_result.txt: AUC values of models constructed with candidate markers in other non-target diseases.
specificity_add_auc.pdf: the visualization of specificity_add_result.txt.

Stage 4 Biomarker interpretation

14. Biomarker importance.

Permutation feature importance is employed here to evaluate biomarkers’ contributions in the best-performing classification model.

$ python 14_Biomarker_importance.py -W /workplace/ -m train_metadata.txt -p candidate_biomarker.txt -g Group -e exposure -c classifier -r best_param.txt -s 0 -o TEST

-p input candidate biomarkers (output file of Step 8)
-r input optimal hyperparameter file (output file of Step 9)
-n input number for biomarker abundance visualization

Input files:
train_metadata.txt: the clinical metadata of the training dataset.
candidate_biomarker.txt: the optimal panel of candidate biomarkers.
best_param.txt: the best hyperparameter combination of classification model.
Output files:
biomarker_importance.txt: permutation feature importance of candidate biomarkers via ten permutations.
biomarker_importance.pdf: the visualization of feature importance file.
Biomarker_distribution.pdf: the visualization for the abundances of the top n (set by users) important biomarkers in the discovery dataset.
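
Permutation feature importance can be sketched as the average drop in AUC after shuffling one feature column at a time. The example below uses ten repeats to match the description of biomarker_importance.txt, but everything else is an assumption: the `predict` callable stands in for a fitted classifier, the data are toy values, and the function names are hypothetical.

```python
import random


def auc(scores, labels):
    """AUC via pairwise comparison of cases (1) and controls (0)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_importance(predict, rows, labels, n_repeats=10, seed=0):
    """Mean drop in AUC when each feature column is shuffled across samples."""
    rng = random.Random(seed)
    baseline = auc([predict(r) for r in rows], labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[col] for r in rows]
            rng.shuffle(column)
            # Rebuild the rows with only this column permuted.
            permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - auc([predict(r) for r in permuted], labels))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_predict(row):
    """Stand-in for a fitted model: the score depends on the first feature only."""
    return row[0]


rows = [
    [0.9, 0.1], [0.8, 0.5], [0.7, 0.3], [0.6, 0.9],
    [0.4, 0.2], [0.3, 0.8], [0.2, 0.4], [0.1, 0.6],
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
importances = permutation_importance(toy_predict, rows, labels)
print(importances)
```

Because the toy model ignores the second feature, shuffling it cannot change the predictions and its importance is exactly zero, which is the behavior expected of an uninformative biomarker.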

15. Convert microbial abundance files to meet the criterion for network construction.

The input file for Step 16 must be a microbial count profile in .tsv format in which each row describes a microbial signature, each column represents a sample, and the header starts with “#OTU ID”. Null values need to be set to 0. The profile can contain all features, differential signatures, or candidate biomarkers, according to users’ needs. An additional file conversion script is provided for this purpose.

$ python 15_Convert.py -W /workplace/ -p abundance.txt -s selected_feature.txt -o TEST

-p input feature raw count file before normalization.
-s selected features for calculating microbial correlation (could be differential signatures or candidate markers, output file of Step 4 or 8).

Input files:
abundance.txt: microbial raw count profile before normalization.
selected_feature.txt: selected features for calculating microbial co-occurrence network (output file of Step 4 or 8)
Output files:
convert.tsv: the converted file appropriate for constructing microbial co-occurrence network.
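
The target layout can be illustrated with a short Python sketch that writes features as rows, samples as columns, a header starting with "#OTU ID", and zeros in place of missing values. The input layout (a dict of per-sample count lists) and the function name are assumptions made for illustration; 15_Convert.py reads the files listed above instead.

```python
def to_otu_table(samples, feature_names, selected):
    """Build a tab-separated table with features as rows and samples as columns.

    samples:       dict mapping sample ID -> list of counts aligned with feature_names.
    feature_names: list of all feature names.
    selected:      set of feature names to keep.
    Missing values (None) are written as 0.
    """
    sample_ids = list(samples)
    lines = ["\t".join(["#OTU ID"] + sample_ids)]
    for j, feature in enumerate(feature_names):
        if feature not in selected:
            continue
        row = [samples[s][j] if samples[s][j] is not None else 0 for s in sample_ids]
        lines.append("\t".join([feature] + [str(v) for v in row]))
    return "\n".join(lines) + "\n"


samples = {"S1": [5, 12, 0], "S2": [None, 7, 3]}
feature_names = ["Fuso", "Bacteroides", "Prevotella"]
table = to_otu_table(samples, feature_names, selected={"Fuso", "Prevotella"})
print(table, end="")
```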

16. Microbial co-occurrence network.

$ ./16_Microbial_network.sh -W /workplace/ -i feature_abundance.tsv -o TEST -t 4

-i input feature abundance file  
-t number of threads to use

Input files:
convert.tsv: microbial count profile in .tsv format where each row describes a microbial signature, each column represents a sample, and the header starts with “#OTU ID”. An example input file is provided, and users are recommended to use Step 15 to convert files into the appropriate format.
Output files:
median_correlation.tsv: the correlation coefficients between every input signature pair.
pvalues.tsv: the statistical significance of median_correlation.tsv.

17. Microbial co-occurrence network plot.

The visualization of Step 16 is performed using Gephi.
The results of Step 16 are first preprocessed so that only significant correlations (p values < 0.05) with absolute correlation coefficients above 0.5 (default) are drawn.

$ python 17_Microbial_network_plot.py -W /workplace/ -c median_correlation.tsv -p pvalues.tsv -t 0.5 -o TEST

-c input network profile (output file of Step 16)
-p input pvalue profile (output file of Step 16)
-t input correlation threshold (default: 0.5)

Input files:
median_correlation.tsv: the correlation coefficients profile (output file of Step 16).
pvalues.tsv: the statistical significance of median_correlation.tsv (output file of Step 16).
Output files:
microbial_network.csv: adjusted network profile for Gephi input, with only significant correlations retained.
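
The preprocessing amounts to turning two square matrices into an edge list. The sketch below applies the thresholds described above and writes the Source, Target, and Weight columns that Gephi recognizes when importing an edge table. The exact column layout of microbial_network.csv is not documented here, so that layout and the function names are assumptions.

```python
def significant_edges(names, corr, pvals, corr_threshold=0.5, p_threshold=0.05):
    """Collect feature pairs with p < p_threshold and |correlation| > corr_threshold.

    corr and pvals are square matrices (lists of lists) ordered like names.
    Only the upper triangle is scanned, so each pair appears once.
    """
    edges = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if pvals[i][j] < p_threshold and abs(corr[i][j]) > corr_threshold:
                edges.append((names[i], names[j], corr[i][j]))
    return edges


def edges_to_csv(edges):
    """Serialize edges with a Source,Target,Weight header."""
    lines = ["Source,Target,Weight"]
    lines += [f"{a},{b},{w}" for a, b, w in edges]
    return "\n".join(lines) + "\n"


names = ["A", "B", "C"]
corr = [[1.0, 0.8, 0.2], [0.8, 1.0, -0.6], [0.2, -0.6, 1.0]]
pvals = [[0.0, 0.01, 0.5], [0.01, 0.0, 0.2], [0.5, 0.2, 0.0]]
edges = significant_edges(names, corr, pvals)
print(edges_to_csv(edges), end="")
```

Keeping the signed coefficient as the weight preserves the distinction between co-occurrence and mutual exclusion, which can then be mapped to edge color in Gephi.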

18. Open Gephi and click "File" – "Import spreadsheet", and then choose the adjusted network profile.

19. Import the network file.

20. Choose a preferable layout type to form the basic network and press the “stop” button when the network becomes stable (Fruchterman Reingold style is recommended).

21. For further optimization of the network, the appearances of nodes and edges should be adjusted according to users’ needs, as well as the labels of nodes.

22. Multi-omics correlation.

If users have multi-omics or multidimensional microbial profiles of the same dataset, the correlation between different omics or dimensions is calculated via HAllA.

$ ./22_Multi_omics_correlation.sh -W /workplace/ -i microbial_abundance_1.txt -d microbial_abundance_2.txt -o TEST

-i input microbial abundance file 1
-d input microbial abundance file 2

Input files:
microbial_abundance_1.txt: microbial abundance profile 1.
microbial_abundance_2.txt: microbial abundance profile 2. These two input files should have the same samples (columns) but different features (rows).
Output files:
results/all_associations.txt: associations between different omics or dimensions.
results/hallagram.png: the visualization of all_associations.txt with only significant associations highlighted.

Case study

It’s worth highlighting that xMarkerFinder is designed as a standard protocol with a high level of impartiality regarding data type and microbial habitat. In other words, xMarkerFinder’s versatility goes beyond its initial purpose in gut microbiome research, making it suitable for diverse microbial biomes. To provide further clarity, we present three examples showcasing the application of xMarkerFinder across various contexts.

Human microbiome

Firstly, we used datasets from previous publications containing 16S rRNA gene sequencing data of the oral microbiome of patients with oral squamous cell carcinoma (OSCC) and controls. We applied xMarkerFinder to these oral microbiome datasets and successfully identified consistent microbial signatures associated with OSCC with great diagnostic capabilities.

Ocean microbiome

Secondly, we employed metagenomic datasets from the Tara Oceans project to characterize important microbiota within the oceanic environment, capable of distinguishing between deep and surface regions.

Human transcriptome

To demonstrate its generalizability in different omics data, we further applied xMarkerFinder to transcriptomic datasets of non-alcoholic steatohepatitis (NASH) patients, using three publicly available NASH cohorts. The resulting classification model reached an impressive AUC value of 0.99, highlighting the robustness and applicability of xMarkerFinder.

These examples collectively serve as compelling evidence of the extensive scope of applicability inherent in xMarkerFinder.

FAQs

Part I General questions

1. When should I use xMarkerFinder?

xMarkerFinder is suitable for microbial biomarker identification from cross-cohort datasets. Our previous studies demonstrated its applicability in identifying global microbial diagnostic biomarkers for adenoma and colorectal cancer. Moreover, xMarkerFinder could also be applied to biomarker determination in disease prognosis, treatment stratification, metastasis surveillance, adverse reactions anticipation, etc. Any research dedicated to biomarker identification from multi-population microbial datasets is welcome.

2. How should I set up the required computational environment for xMarkerFinder?

We provide detailed instructions on software installation for users to run the whole xMarkerFinder workflow locally. However, we strongly encourage the usage of the provided docker image as it would significantly reduce potential errors in the entire installation and setup process. (https://hub.docker.com/r/tjcadd2022/xmarkerfinder)

3. Can I access and modify the codes used in xMarkerFinder?

Yes. The code used in xMarkerFinder is deposited in our GitHub repository and can be freely downloaded and modified according to users’ specific needs. However, modifications might cause unexpected errors, so we encourage users to try different parameters first before modifying the code.

4. Can I use only certain steps of xMarkerFinder and skip other parts?

Yes. The whole xMarkerFinder workflow contains four stages (22 steps). Every stage or step can be conducted independently, and users can skip any of them according to their specific study design.

5. Can I use xMarkerFinder for environmental microbiome research?

Yes. Although xMarkerFinder is developed for human microbiome studies, it is also generalizable to other microbial habitats.

6. How long does it take to run xMarkerFinder?

The time needed for the whole workflow depends on the dataset size, selected algorithm, and computational resources available. The following time estimates are based on execution of our protocol on the provided example datasets with all classifiers (Logistic Regression (LR) with L1 and L2 regularization, K-nearest Neighbors (KNN) classifier, Support Vector classifier (SVC) with the Radial Basis Function kernel, Decision Tree (DT) classifier, Random Forest (RF) classifier, and Gradient Boosting (GB) classifier) using the xMarkerFinder docker image on a MacBook Pro (2.4-GHz quad-core eighth-generation Intel Core i5 processor, 16-GB 2133-MHz LPDDR3 memory).

Stage	Step	LRl1	LRl2	SVC	KNN	DT	RF	GB
Stage 1: Differential signature identification	1	0m20.600s	0m20.600s	0m20.600s	0m20.600s	0m20.600s	0m20.600s	0m20.600s
	2	0m11.372s	0m11.372s	0m11.372s	0m11.372s	0m11.372s	0m11.372s	0m11.372s
	3	1m21.356s	1m21.356s	1m21.356s	1m21.356s	1m21.356s	1m21.356s	1m21.356s
	4	0m24.858s	0m24.858s	0m24.858s	0m24.858s	0m24.858s	0m24.858s	0m24.858s
	Total	2m18.186s	2m18.186s	2m18.186s	2m18.186s	2m18.186s	2m18.186s	2m18.186s
Stage 2: Model construction	5	0m12.464s	0m12.464s	0m12.464s	0m12.464s	0m12.464s	0m12.464s	0m12.464s
	6	0m2.733s	0m3.032s	0m50.913s	0m3.105s	0m3.252s	1m43.332s	0m49.196s
	7	0m0.846s	0m1.150s	0m1.102s	0m1.178s	0m1.015s	0m0.863s	0m1.216s
	8	0m2.447s	0m18.449s	10m32.261s	0m21.103s	0m53.413s	18m37.552s	47m59.647s
	9	0m30.420s	0m24.735s	0m35.112s	0m42.348s	0m34.801s	8m57.417s	8m12.045s
	Total	0m48.91s	0m59.83s	12m11.852s	1m20.198s	1m44.945s	29m31.628s	57m14.568s
Stage 3: Model validation	10	4m30.737s	4m42.105s	10m15.050s	6m10.515s	4m31.044s	91m52.940s	65m47.511s
	11	0m3.896s	0m3.776s	0m3.150s	0m3.761s	0m4.002s	0m7.120s	0m4.266s
	12	0m4.877s	0m4.764s	0m4.426s	0m5.287s	0m5.315s	2m25.064s	0m36.946s
	13	0m5.941s	0m5.982s	0m22.211s	0m7.342s	0m6.646s	2m21.262s	0m39.554s
	Total	4m45.451s	4m56.627s	10m44.837s	6m26.905s	4m47.007s	96m46.386s	67m8.277s
Stage 4: Biomarker interpretation	14	0m3.270s	0m3.599s	0m16.746s	0m21.809s	0m4.041s	0m46.265s	0m5.028s
	15-21	6m32.696s	6m32.696s	6m32.696s	6m32.696s	6m32.696s	6m32.696s	6m32.696s
	22	7m57.119s	7m57.119s	7m57.119s	7m57.119s	7m57.119s	7m57.119s	7m57.119s
	Total	14m33.085s	14m33.414s	14m46.561s	14m51.624s	14m33.856s	15m16.080s	14m34.843s
Total	/	22m25.632s	22m48.057s	40m1.436s	24m56.913s	23m23.994s	143m52.280s	141m15.874s

7. What skills are required to run xMarkerFinder?

A basic understanding of shell scripting is enough to complete the whole workflow. Intermediate experience in R and Python would help users interpret and modify the code.

8. Is xMarkerFinder a pipeline for meta-analysis?

Yes. xMarkerFinder aims to integrate different datasets and establish replicable biomarkers. However, xMarkerFinder differs from a systematic review in that it integrates the original datasets rather than their respective results.

Part II Data processing

1. What kind of data should I use for xMarkerFinder?

Processed microbial count matrices and corresponding metadata are required. For cross-cohort analysis, we require merged datasets from at least three cohorts in the discovery set to complete the full protocol with internal validations. xMarkerFinder is well adapted to microbial taxonomic and functional profiles derived from both amplicon and whole-metagenome sequencing data, as well as other omics layers, including but not limited to metatranscriptomics, metaproteomics, and metabolomics.

2. If I don’t have the corresponding metadata, can I still use xMarkerFinder?

To perform meta-analysis, the corresponding sample groups are required. Other metadata indices, such as body mass index, age, and gender, are recommended but not required. However, it is worth noting that the absence of metadata information might compromise the correction for confounding effects and the identification of microbial biomarkers.

3. Why should I normalize my data?

To mitigate biases introduced by differences in sequencing depth (e.g., library size), microbial count profiles are converted to relative abundances for subsequent analysis in xMarkerFinder.
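
As an illustration only, the conversion to relative abundances can be sketched in plain Python. The function name `to_relative_abundance` and the toy sample and taxon names are assumptions made for this example; this is not the script shipped with xMarkerFinder.

```python
def to_relative_abundance(counts):
    """Convert raw counts per sample into relative abundances.

    `counts` maps sample name -> {feature name -> raw count}.
    After conversion, the features of each non-empty sample sum to 1.
    """
    normalized = {}
    for sample, features in counts.items():
        total = sum(features.values())
        if total == 0:
            # An empty sample cannot be normalized; keep zeros instead of dividing by zero.
            normalized[sample] = {name: 0.0 for name in features}
        else:
            normalized[sample] = {name: value / total for name, value in features.items()}
    return normalized


# Hypothetical toy data: sample_B was sequenced ten times deeper than sample_A.
counts = {
    "sample_A": {"taxon_1": 50, "taxon_2": 150},
    "sample_B": {"taxon_1": 500, "taxon_2": 1500},
}
rel = to_relative_abundance(counts)
print(rel)
```

Both toy samples end up with the same profile even though their library sizes differ tenfold, which is the effect normalization is meant to achieve.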

4. Why should I perform data filtering?

To identify a replicable panel of microbial biomarkers, we need to exclude rare microbial features, that is, those with low occurrence rates across cohorts, because they are not ideal candidates for global biomarkers.
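
A minimal sketch of such a prevalence filter is shown below. The thresholds `min_prevalence` and `min_cohorts`, the function names, and the toy data are all assumptions chosen for illustration; the criteria actually applied by xMarkerFinder may differ.

```python
def prevalence(samples, feature):
    """Fraction of samples in which the feature has a non-zero abundance."""
    if not samples:
        return 0.0
    present = sum(1 for sample in samples if sample.get(feature, 0) > 0)
    return present / len(samples)


def filter_features(cohorts, min_prevalence=0.2, min_cohorts=2):
    """Keep features that reach `min_prevalence` in at least `min_cohorts` cohorts.

    `cohorts` maps cohort name -> list of samples, each sample being
    a dict of feature name -> abundance. Both thresholds are hypothetical.
    """
    features = sorted({f for samples in cohorts.values() for s in samples for f in s})
    kept = []
    for feature in features:
        n_cohorts_ok = sum(
            1 for samples in cohorts.values()
            if prevalence(samples, feature) >= min_prevalence
        )
        if n_cohorts_ok >= min_cohorts:
            kept.append(feature)
    return kept


# Hypothetical toy data: "rare" is detected in a single sample of a single cohort.
cohorts = {
    "cohort_1": [{"common": 0.6, "rare": 0.0}, {"common": 0.4, "rare": 0.0}],
    "cohort_2": [{"common": 0.5, "rare": 0.1}, {"common": 0.7, "rare": 0.0}],
    "cohort_3": [{"common": 0.3, "rare": 0.0}, {"common": 0.0, "rare": 0.0}],
}
kept = filter_features(cohorts)
print(kept)
```

Requiring a feature to be prevalent in several cohorts, rather than in the pooled data, prevents a feature that is abundant in only one cohort from passing the filter.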

5. What do the training and test sets do and why should I separate them?

To ensure the models’ reliability, datasets are split into training/discovery and test sets. The training set is used to fit the model so that it learns the underlying patterns. The test set is used to evaluate the model after training is complete and provides an unbiased estimate of final model performance.
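
The idea of holding out data can be sketched as a simple random split of sample identifiers. This is an illustrative assumption: the function `split_train_test`, the 30% test fraction, and the sample names are hypothetical, and in a cross-cohort design the test set may instead consist of entire independent cohorts.

```python
import random


def split_train_test(sample_ids, test_fraction=0.3, seed=0):
    """Randomly split sample identifiers into disjoint training and test lists."""
    ids = list(sample_ids)
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(ids)
    n_test = int(round(len(ids) * test_fraction))
    return ids[n_test:], ids[:n_test]


# Hypothetical sample identifiers.
sample_ids = [f"sample_{i}" for i in range(10)]
train_ids, test_ids = split_train_test(sample_ids)
print(len(train_ids), len(test_ids))
```

Because no test sample is ever seen during training, performance measured on the test set is not inflated by the model having memorized those samples.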

Part III Using xMarkerFinder

1. How to solve installation errors?

Potential installation problems and their solutions are described in our manuscript, and most problems can be avoided by using the docker image we provide instead of running all scripts locally (https://hub.docker.com/r/tjcadd2022/xmarkerfinder).

2. What machine learning classifier should I choose?

Step 5 evaluates multiple commonly used machine learning algorithms, and users can choose the most suitable algorithm based on these results. However, owing to their robustness and interpretability, Random Forest classifiers are considered suitable for most microbiome datasets. Step 5 is therefore optional: we recommend that users build Random Forest models first and move to other classifiers if these underperform.

3. How to choose suitable parameters when running xMarkerFinder?

For most scenarios, the default parameters would work. For further exploration, users are encouraged to try different parameters to get better results.

4. What is an AUC and how to interpret it?

AUC is the area under the ROC curve (the plot of sensitivity on the y-axis versus 1-specificity on the x-axis). A perfect classifier gives an AUC of 1, while a classifier that makes completely random guesses gives an AUC of 0.5.
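
For intuition, the AUC can also be computed directly as the probability that a randomly chosen case receives a higher score than a randomly chosen control, with ties counted as one half. The sketch below uses this pairwise formulation in plain Python; the function name `auc` and the toy labels and scores are assumptions for illustration, not part of xMarkerFinder.

```python
def auc(labels, scores):
    """AUC as the fraction of (case, control) pairs ranked correctly.

    `labels` holds 1 for cases and 0 for controls; `scores` holds the
    predicted probability (or any ranking score) for each sample.
    """
    case_scores = [s for label, s in zip(labels, scores) if label == 1]
    control_scores = [s for label, s in zip(labels, scores) if label == 0]
    if not case_scores or not control_scores:
        raise ValueError("AUC needs at least one case and one control")
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical predictions for two controls (0) and two cases (1).
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores))
```

In this toy example three of the four case-control pairs are ranked correctly, so the AUC lies between the random (0.5) and perfect (1) extremes.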
