ProteinGraphML: Comparing Results

This Python package was based on R code developed by Oleg Ursu, mostly in 2018. Comparing its output against the R-code results is therefore important for validation and for continuity as development proceeds.

CreateFeatureFilesFromRDS.py takes an RDS file as input and creates two output files containing features: one for the training set and one for the prediction set.

Command-line parameters:

  • --rdsfile : full path to the RDS file created using R code.
  • --trainfile : full path to the output pickle file for the training set.
  • --predictfile : full path to the output pickle file for the prediction set.

Example command:

CreateFeatureFilesFromRDS.py --rdsfile /home/oleg/workspace/metap/data/input/PS118220.rds --trainfile results/PS118220/PS118220_train.pkl --predictfile results/PS118220/PS118220_predict.pkl

CompareRandPythonFeatureSet.py takes the feature pickle files created by CreateFeatureFilesFromRDS.py and GenTrainingAndTestFeatures.py and finds the features whose values differ.

Command-line parameters:

  • --pythonfile : full path to the features pickle file created by GenTrainingAndTestFeatures.py.
  • --rfile : full path to the features pickle file created by CreateFeatureFilesFromRDS.py.
  • --decimalplace : Feature values generated by R and Python may have different numbers of digits after the decimal point. This parameter sets the number of decimal places to which the program rounds floating-point values before comparing them. The default value is 2.

Example command:

CompareRandPythonFeatureSet.py --pythonfile results/ATG_NEG_NO_LINCS/atg_no_lincs_TrainingData.pkl --rfile results/ATG_NEG_NO_LINCS/ATG_NEG_NO_LINCS_train.pkl --decimalplace 3
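
The rounding-based comparison described above can be sketched as follows. This is a minimal illustration, not the script's actual code: it assumes each feature set is a mapping from protein id to a mapping of feature name to value (the real pickle layout is not documented here), and the function name `differing_features` is hypothetical.

```python
def differing_features(python_feats, r_feats, decimalplace=2):
    """Return (protein_id, feature, python_value, r_value) tuples that differ after rounding.

    Assumed (hypothetical) layout of both inputs: {protein_id: {feature_name: float}}.
    """
    diffs = []
    # Only proteins and features present in both sets are compared.
    for pid in sorted(set(python_feats) & set(r_feats)):
        shared = set(python_feats[pid]) & set(r_feats[pid])
        for feat in sorted(shared):
            py_val = round(python_feats[pid][feat], decimalplace)
            r_val = round(r_feats[pid][feat], decimalplace)
            if py_val != r_val:
                diffs.append((pid, feat, py_val, r_val))
    return diffs


# Toy data, invented for illustration only.
python_feats = {101: {"f1": 0.1234, "f2": 2.0}, 102: {"f1": 0.5}}
r_feats = {101: {"f1": 0.1239, "f2": 2.0}, 102: {"f1": 0.5004}}

print(differing_features(python_feats, r_feats, decimalplace=2))
print(differing_features(python_feats, r_feats, decimalplace=3))
```

With the toy data, rounding to 2 decimal places reports no differences, while rounding to 3 flags feature `f1` of protein 101. A larger --decimalplace value therefore makes the comparison stricter.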

FindCommonPid.py uses classification results generated by R and Python for the training and prediction sets and finds the protein ids common to the top 'N' (100, 200, ..., 1000) proteins of each. Proteins are first sorted by predicted probability in descending order, and the common proteins are then identified.

Command-line parameters:

  • --pythonTr : full path to the classification result file for training set created by the Python script TrainModelML.py.
  • --pythonPr : full path to the classification result file for prediction set created by the Python script PredictML.py.
  • --rTr : full path to the classification result file for training set created by R code.
  • --rPr : full path to the classification result file for prediction set created by R code.
  • --imgfile : full path to the output file to save the plot.
  • --maxlimit : Maximum number of proteins to compare. Default value is 1000.

Example command:

FindCommonPid.py --pythonTr results/ATG_NEG/classificationResults_XGBCrossValPred.tsv --pythonPr results/ATG_NEG/classificationResults_XGBPredict.tsv --rTr /home/oleg/workspace/metap/data/output/ATG_NEG/train.pred.tsv --rPr /home/oleg/workspace/metap/data/output/ATG_NEG/blind.pred.tsv --imgfile results/ATG_NEG/common_pid.png --maxlimit 1000
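
The top-'N' overlap computation can be sketched roughly as below. This is an illustration under stated assumptions rather than the script's implementation: results are represented as lists of (protein id, predicted probability) pairs, the helper names `top_ids` and `common_counts` are hypothetical, and reading the TSV files and drawing the plot are omitted.

```python
def top_ids(results, n):
    """Ids of the n proteins with the highest predicted probability."""
    ranked = sorted(results, key=lambda row: row[1], reverse=True)
    return {pid for pid, _ in ranked[:n]}


def common_counts(python_results, r_results, maxlimit=1000, step=100):
    """Map each cutoff N (step, 2*step, ..., maxlimit) to the number of shared ids."""
    counts = {}
    for n in range(step, maxlimit + 1, step):
        counts[n] = len(top_ids(python_results, n) & top_ids(r_results, n))
    return counts


# Toy data, invented for illustration; a small step keeps the example readable.
python_results = [("P1", 0.9), ("P2", 0.8), ("P3", 0.4), ("P4", 0.1)]
r_results = [("P1", 0.95), ("P3", 0.7), ("P2", 0.6), ("P4", 0.2)]

print(common_counts(python_results, r_results, maxlimit=4, step=2))
```

Here the two rankings share one protein in their top 2 and all four in their top 4. The step of 100 in the defaults mirrors the 100, 200, ..., 1000 cutoffs mentioned above.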

FindCorrelation.py uses classification results generated by R and Python for the training or prediction set and computes the Pearson correlation coefficient from the ML-predicted probabilities of the proteins. Proteins are first sorted by predicted probability in descending order, and the top 'N' proteins are then selected to determine the correlation between the R and Python results.

Command-line parameters:

  • --pythonfile : full path to the classification result file created by the Python script TrainModelML.py.
  • --rfile : full path to the classification result file created by R code.
  • --tsvfile : full path to the output file where common proteins will be saved.
  • --maxlimit : Maximum number of proteins to compare. Default value is 100.

Example command:

FindCorrelation.py --pythonfile results/ATG_NEG/classificationResults_XGBPredict.tsv --rfile /home/oleg/workspace/metap/data/output/ATG_NEG/blind.pred.tsv --tsvfile results/ATG_NEG/common_pid_top100.tsv --maxlimit 1000
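
One way to picture the correlation step is the sketch below. It is not the script's code: it assumes each result set is a dict from protein id to predicted probability, that the correlation is taken over proteins appearing in the top 'N' of both rankings, and the helper names `pearson` and `top_n_correlation` are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Raises ZeroDivisionError if either sequence is empty or constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def top_n_correlation(python_probs, r_probs, maxlimit=100):
    """Correlate probabilities of proteins found in the top `maxlimit` of both rankings."""
    def top(probs):
        # Sort ids by predicted probability, highest first, and keep the top N.
        ranked = sorted(probs, key=probs.get, reverse=True)
        return set(ranked[:maxlimit])

    common = sorted(top(python_probs) & top(r_probs))
    xs = [python_probs[pid] for pid in common]
    ys = [r_probs[pid] for pid in common]
    return common, pearson(xs, ys)


# Toy data, invented for illustration only.
python_probs = {"P1": 0.9, "P2": 0.6, "P3": 0.3, "P4": 0.1}
r_probs = {"P1": 0.8, "P2": 0.5, "P3": 0.2, "P5": 0.05}

common, r = top_n_correlation(python_probs, r_probs, maxlimit=3)
print(common, round(r, 3))
```

In this toy case the three shared proteins have probabilities that differ by a constant offset, so the coefficient is 1 up to floating-point error. The list of common proteins corresponds to what --tsvfile would store.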