Breast cancer detection using machine learning models.
We used the Breast Cancer Wisconsin (Diagnostic) dataset from the UCI Machine Learning Repository.
Link: http://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28diagnostic%29
The dataset was created by Dr. William H. Wolberg, a physician at the University of Wisconsin Hospital in Madison, Wisconsin, USA.
Programming Language: Python 3
Libraries: pandas, numpy, seaborn, and sklearn
IDE: Jupyter Notebook
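As a minimal sketch of getting the data into a notebook, the snippet below loads the copy of the Wisconsin Diagnostic dataset that ships with scikit-learn. This is an assumption made for convenience: the original work may have read the file from the UCI link with pandas instead, and the variable names `data` and `df` are illustrative.

```python
from sklearn.datasets import load_breast_cancer

# Assumption: use scikit-learn's bundled copy of the dataset rather than
# downloading the file from the UCI repository.
data = load_breast_cancer(as_frame=True)
df = data.frame  # pandas DataFrame: 30 feature columns plus a "target" column

print(df.shape)                     # (rows, columns)
print(data.target_names)            # class names, indexed by target value
print(df["target"].value_counts())  # number of samples per class
```

In this bundled copy the target value 0 means malignant and 1 means benign, which is worth checking before choosing a positive class for the metrics.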
The mean is the average of a set of numbers: their sum divided by how many there are.
Mean of a random variable X: $\mu = \frac{1}{n}\sum_{i=1}^{n} X_i$
Standard deviation is a measure of how dispersed the data is in relation to the mean.
Standard deviation of a population X: $\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (X_i - \mu)^2}$
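The two definitions above can be checked by hand with a few lines of plain Python. This is only a sketch: the helper names `mean` and `population_std` and the sample numbers are made up for illustration and do not come from the dataset.

```python
import math

def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)

def population_std(values):
    """Population standard deviation (divides by n, not n - 1)."""
    mu = mean(values)
    return math.sqrt(sum((x - mu) ** 2 for x in values) / len(values))

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up numbers, not from the dataset
print(mean(sample))            # 5.0
print(population_std(sample))  # 2.0
```

With NumPy, `np.mean` and `np.std` (whose default is `ddof=0`) compute the same two quantities.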
Correlation describes the strength of association between two variables.
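The Pearson correlation coefficient between two random variables X and Y is the covariance of X and Y divided by the product of their standard deviations:
$\rho_{X,Y} = \frac{\sum_{i=1}^{n} (X_i - \mu_X)(Y_i - \mu_Y)}{n\,\sigma_X\,\sigma_Y}$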
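A short sketch of the Pearson coefficient, computed as the covariance of the two variables divided by the product of their population standard deviations. The function name `pearson` and the toy lists are hypothetical and only serve to show the two extreme cases.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equally long sequences of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return covariance / (std_x * std_y)

xs = [1, 2, 3, 4, 5]             # made-up values
rising = [2, 4, 6, 8, 10]        # grows with xs
falling = [10, 8, 6, 4, 2]       # shrinks as xs grows

print(pearson(xs, rising))   # close to +1: perfect positive association
print(pearson(xs, falling))  # close to -1: perfect negative association
```

For a whole table of features, pandas `DataFrame.corr()` returns the same coefficient for every pair of columns, and a seaborn heatmap is a common way to view that matrix.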
Standardization scales each input variable separately by subtracting its mean and dividing by its standard deviation, so that the scaled variable has a mean of zero and a standard deviation of one.
Formula for standardization: $x_{\text{new}} = (x_{\text{old}} - \mu)/\sigma$
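To make the standardization formula concrete, here is a small sketch that applies it to a single column of numbers. The function name `standardize` and the values in `column` are illustrative assumptions, not data from the project.

```python
import math

def standardize(column):
    """Return (x - mean) / std for every x, using the population std."""
    n = len(column)
    mu = sum(column) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in column) / n)
    return [(x - mu) / sigma for x in column]

column = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up feature values
scaled = standardize(column)
print(scaled)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

scikit-learn's `StandardScaler` performs the same per-column operation. When it is used, it is usually fitted on the training split only and then applied to the test split, so that no test information leaks into the scaling.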
Classifiers:
- Logistic Regression Classifier
- Nearest Neighbor Classifier
- Support Vector Machines Classifier
- Kernel SVM Classifier
- Naive Bayes Classifier
- Decision Tree Classifier
- Random Forest Classifier

Evaluation metrics:
- F1 Score
- Accuracy Score
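The sketch below shows one way the seven classifiers could be trained and scored with scikit-learn. Everything specific in it is an assumption rather than the original setup: the 80/20 split, `random_state=0`, the default hyperparameters, the linear kernel for the plain SVM and the RBF kernel for the kernel SVM, and the choice of malignant as the positive class. It is therefore not expected to reproduce the exact scores reported in this document.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
y = 1 - y  # assumption: malignant is the positive class (sklearn encodes it as 0)

# Assumption: 80/20 stratified split with a fixed seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Assumption: mostly default hyperparameters.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Nearest Neighbor": KNeighborsClassifier(n_neighbors=5),
    "Support Vector Machines": SVC(kernel="linear"),
    "Kernel SVM": SVC(kernel="rbf"),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    scores[name] = (accuracy_score(y_test, y_pred), f1_score(y_test, y_pred))
    print(f"{name}: accuracy={scores[name][0]:.4f}, F1={scores[name][1]:.4f}")
```

Keeping the models in a dictionary means each one sees the same split and the same scaling, so the scores are comparable with one another.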
Precision = TP/(TP + FP)
Recall = TP/(TP + FN)
F1 Score = 2*(Precision * Recall)/(Precision + Recall)
Accuracy Score = (TP + TN)/(TP + FP + FN + TN)
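Here TP, FP, FN and TN are the counts of true positives, false positives, false negatives and true negatives. The four formulas can be sketched as a single function of those counts; the function name `classification_metrics` and the counts used below are invented for illustration and are not results from this project.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (precision, recall, f1, accuracy) from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (precision * recall) / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Hypothetical counts: 8 true positives, 2 false positives,
# 4 false negatives, 86 true negatives.
precision, recall, f1, accuracy = classification_metrics(tp=8, fp=2, fn=4, tn=86)

print(round(precision, 3))  # 0.8
print(round(recall, 3))     # 0.667
print(round(f1, 3))         # 0.727
print(round(accuracy, 3))   # 0.94
```

In this made-up case accuracy is high because true negatives dominate, while the F1 score is noticeably lower because it ignores true negatives and penalizes the missed positives.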
Accuracy Score:
- Logistic Regression: 97.36%
- Nearest Neighbor: 94.73%
- Support Vector Machines: 95.61%
- Kernel SVM: 98.24%
- Naive Bayes: 96.49%
- Decision Tree: 95.61%
- Random Forest: 97.36%

F1 Score:
- Logistic Regression: 96.47%
- Nearest Neighbor: 93.02%
- Support Vector Machines: 94.25%
- Kernel SVM: 97.61%
- Naive Bayes: 95.23%
- Decision Tree: 93.97%
- Random Forest: 96.38%