---
title: "Module 3 Assignment on Classification"
author: "Yangxin Fan, Graduate Student"
date: "02/28/2021"
#output: pdf_document
output:
  pdf_document: default
df_print: paged
#html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, tidy=TRUE, tidy.opts=list(width.cutoff=80))
```
***
## Module Assignment Questions
## Q1) (*Bayes Classifier*)
`Bayes classifier` classifies an observation $x_0$ to the class $k$ for which $p_k(x_0)$ is largest, where $\pi_k$ is the prior (the proportion of class $k$ among all classes $j$):
$$
p_k(x_0) = P(y=k \mid X=x_0) = \frac {\pi_k \cdot f_k(x_0)} {\sum_j \pi_j \cdot f_j(x_0)}.
$$
Assume the univariate ($p=1$) observations $x$ in class $k$ are iid from $N(\mu_k, \sigma_k^2)$; $f_k(x)$ is the density function of $x$ with parameters $\mu_k,\sigma_k$.
a. Show that the Bayes classifier in the 2-class problem (so $k=0,1$) assigns the observation $x_0$ to the class $k$ for which the discriminant score $\delta_k$ is largest when $\sigma_0=\sigma_1$:
$$
\delta_k(x_0) = x_0 \frac {\mu_k} {\sigma^2} - \frac {\mu_k^2} {2 \sigma^2} + \log(\pi_k)
$$
b. (Empirical Work) Verify `part a` with a simple empirical demonstration using normal densities in R with `dnorm()`, or normal variables generated from `rnorm()`, with $\mu_0 = 10, \mu_1=15, \sigma_0=\sigma_1=2, \pi_0 = 0.3, \pi_1=0.7$. Plot the class densities or histograms in color, show the intersection between the two class distributions (where the classification boundary starts), and check one random value from each class by calculating the discriminant score to confirm the class it belongs to. How would you describe the misclassified values or regions? Calculate the error rate. What is the Bayes error rate?
c. Under `part a`, assume $\sigma_0 \neq \sigma_1$. Derive the Bayes classifier. Show work.
d. (BONUS) For $p>1$, derive the Bayes classifier. Show work.
***
## Q2) (*Four Models as Classifiers*)
The `Boston` data from `MASS` has 506 rows and 14 columns with the target variable `crim`, which is the per capita crime rate by town. You will fit classification models (`KNN`, `logistic regression`, `LDA`, and `QDA`) in order to predict whether a given suburb has a crime rate above or below .3 per capita. Towns above .3 may be labeled `not really safe` to raise a family. Use the `80%-20% split validation test` approach.
a. Fit the `KNN`, `logistic regression`, `LDA`, and `QDA` models separately using all the predictors. Report `error rate` for each `train` and `test` data set. Use `error rate` = `1-accuracy`. Based on the test error rates, decide which model is best/better. Why?
b. Using the test data set, obtain confusion matrices and report only `recall`, `precision`, `f1` and `accuracy` metrics in a table. Comment on the findings. Based on this table, decide which model is best/better. Explain why some models do better than others. Which metric would be most important in this context? Why? Is this decision different from that of `part a`? Explain.
c. Obtain the ROC curve for `logistic regression` based on train data set (plot of FPR vs TPR with classification threshold change). Plot it. Calculate the area under the curve. Explain what the curve and area tell about the model.
d. How did you find the optimal $k$ in the `KNN` classifier? Did you use `grid search` or `CV`? If not, use it and revise the results in part a and b. Did the results improve?
e. What are the assumptions in each model? Do your best to describe each. Do your best to check these based on the fit. When you see assumption violation, what would you do to validate the fit?
***
## Q3) (*Concepts*)
a. What would change if you performed the `$k$-fold` approach instead of the `validation set` approach for the model fits in Question 2? Just discuss conceptually.
b. To improve the test error rates in `part a` of Q2, what strategies can be applied? List as many ideas as possible. Try one of them and report the improved test error rate.
c. Explain in less technical terms an estimation method employed in `binary logistic regression`. `MLE` and `gradient descent` are two of them.
d. (BONUS) Demonstrate with technical terms and numerically how `MLE` works as the estimation method employed in `binary logistic regression`. Explain.
***
Review the R lab code. Also, some useful code to get started is here:
```{r eval=FALSE}
#Some useful codes
library(MASS)
summary(Boston)
rm(Boston)
detach(Boston)
attach(Boston)
str(Boston)
dim(Boston)
n = nrow(Boston)
n
hist(crim)
summary(crim)
crime_dummy = rep(0, length(crim))
# Q3 of crim shown for reference; here towns with crim > 1 are labeled 1
quantile(crim, .75)
crime_dummy[crim>1] = 1
Boston = data.frame(Boston, crime_dummy)
View(Boston)
#rate in crime_dummy is 0.2509881 (P)
sum(crime_dummy)/length(crime_dummy)
# choose randomly 80%
set.seed(99)
train=sample(c(TRUE,FALSE), size=n, prob=c(.80, .20),
             rep=TRUE) # randomly mark each row as train or not
train
test=(!train)
test
Boston.train = Boston[train,]
dim(Boston.train)
Boston.test = Boston[test,]
dim(Boston.test)
crime_dummy.train = crime_dummy[train]
sum(crime_dummy.train)/length(crime_dummy.train)
crime_dummy.test = crime_dummy[test]
sum(crime_dummy.test)/length(crime_dummy.test)
# (this is another option to split the data into train and test)
n_train = ceiling(.80 * n)
n_train
```
\newpage
***
## Your Solutions
Q1)
Part a:
![Part a](/Users/fanyangxin/Desktop/Statistical Methods/Part a.jpeg)
***
Part b:
```{r eval=TRUE}
library(bayestestR)
# weighted class densities pi_k * f_k(x)
x = seq(0,25,by=0.001)
dvalues_0 = 0.3 * dnorm(x, mean = 10, sd = 2)
dvalues_1 = 0.7 * dnorm(x, mean = 15, sd = 2)
plot(x, dvalues_0, type = "l", col='blue', ann=FALSE, xlim=c(0, 25), ylim=c(0,0.2))
par(new=TRUE)
plot(x, dvalues_1, type = "l", col='red', ann=FALSE, xlim=c(0, 25), ylim=c(0,0.2))
f1 <- function(x) 0.3 * dnorm(x, mean = 10, sd = 2)
f2 <- function(x) 0.7 * dnorm(x, mean = 15, sd = 2)
# decision boundary: where the two weighted densities intersect
boundary = uniroot(function(x) f1(x)-f2(x), c(10,15))$root
abline(v = boundary, col = "black", lty = "dashed", lwd = 2)
boundary
# error rate = area under min(dvalues_0, dvalues_1):
# class-1 mass below the boundary plus class-0 mass above it
x1 = seq(-1000,11.82216,by=0.0001)
area1 = area_under_curve(x1, f2(x1), method = "trapezoid")
x2 = seq(11.82216,1000,by=0.0001)
area2 = area_under_curve(x2, f1(x2), method = "trapezoid")
error_rate = area1 + area2
error_rate
```
The intersection is at x = 11.82216, which is the decision boundary for classification: if x < 11.82216, x is classified to class 0; otherwise, x is classified to class 1.

Let's check x = 10: the discriminant score for class 0 is 11.2960272 and for class 1 is 9.018325096. Since the discriminant score is higher for class 0, x = 10 is classified to class 0. Let's check x = 15: the discriminant score for class 1 is 27.76832506 and for class 0 is 23.7960272, so x = 15 is classified to class 1. The misclassified regions are the class-1 values that fall below the boundary and the class-0 values that fall above it. The error rate is the area under the curve min(dvalues_0, dvalues_1), approximately 0.09356307. The Bayes error rate is the lowest possible error rate for any classifier of a random outcome (into, for example, one of two categories) and is analogous to the irreducible error; it equals $1 - E[\max_j \Pr(Y=j \mid X)]$, where the expectation averages the probability over all values of $X$. Since the boundary above is exactly the Bayes decision boundary, the 0.09356307 computed here is also the Bayes error rate.
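For reference, the discriminant scores quoted above can be reproduced with a small helper implementing the part-a formula (a minimal sketch; `delta` is just an illustrative name):
```{r eval=TRUE}
# discriminant score from part a: delta_k(x) = x*mu_k/sigma^2 - mu_k^2/(2*sigma^2) + log(pi_k)
delta <- function(x, mu, sigma, prior) x * mu / sigma^2 - mu^2 / (2 * sigma^2) + log(prior)
delta(10, mu = 10, sigma = 2, prior = 0.3) # class 0 score at x = 10
delta(10, mu = 15, sigma = 2, prior = 0.7) # class 1 score at x = 10
delta(15, mu = 10, sigma = 2, prior = 0.3) # class 0 score at x = 15
delta(15, mu = 15, sigma = 2, prior = 0.7) # class 1 score at x = 15
```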
***
Part c:
![Part c: variance not equal](/Users/fanyangxin/Desktop/Statistical Methods/Part c.jpeg)
***
Part d (BONUS):
![Part d (BONUS): P>1](/Users/fanyangxin/Desktop/Statistical Methods/Part d.jpeg)
***
\newpage
Q2)
Part a:
```{r eval=TRUE}
library(MASS)
library(caTools)
set.seed(1)
Boston.backup = Boston
# let crim > 0.3 be "not safe"; recode in one step so the numeric column
# is never compared against a string after coercion to character
Boston$crim = factor(ifelse(Boston$crim > 0.3, "not safe", "safe"))
# 80% - 20% train test split
train_rows=sample.split(Boston$crim,SplitRatio=0.8)
Boston.train=Boston[train_rows,]
Boston.test=Boston[!train_rows,]
# KNN
library(class)
train.X=subset(Boston.train, select=-c(crim))
train.y=Boston.train$crim
test.X=subset(Boston.test, select=-c(crim))
knn.pred=knn(train.X,test.X,train.y,k=3)
1-mean(knn.pred==Boston.test$crim)
# Logistic regression
glm.fits=glm(crim~., data=Boston.train, family=binomial)
glm.probs=predict(glm.fits,test.X,type="response")
contrasts(Boston.train$crim) # "safe" is coded 1, so glm.probs = P(safe)
glm.pred=rep("not safe", nrow(Boston.test))
glm.pred[glm.probs>0.5]="safe"
1-mean(glm.pred==Boston.test$crim)
# LDA
lda.fit=lda(crim~., data=Boston.train)
lda.pred=predict(lda.fit,Boston.test)
lda.class=lda.pred$class
1-mean(lda.class==Boston.test$crim)
# QDA
qda.fit=qda(crim~., data=Boston.train)
qda.pred=predict(qda.fit,Boston.test)
qda.class=qda.pred$class
1-mean(qda.class==Boston.test$crim)
```
Summary of error rates on the test data:

1. KNN: with k=8 the error rate is 0.1188119; with k=3 it is 0.0990099.
2. Logistic regression: error rate is 0.07920792.
3. LDA: error rate is 0.08910891.
4. QDA: error rate is 0.02970297.

Based on the test error rates, QDA is the best model choice since it has the lowest error rate on the test dataset.
***
Part b:
```{r eval=TRUE}
# confusion matrices: rows = actual, columns = predicted
knn.table = table(Boston.test$crim,knn.pred)
glm.table = table(Boston.test$crim,glm.pred)
lda.table = table(Boston.test$crim,lda.class)
qda.table = table(Boston.test$crim,qda.class)
knn.table
# with levels ("not safe", "safe") and "safe" as the positive class:
# ct[1] = TN, ct[2] = FN, ct[3] = FP, ct[4] = TP
perfcheck <- function(ct) {
  Accuracy <- (ct[1]+ct[4])/sum(ct)
  Recall <- ct[4]/(ct[2]+ct[4]) #TP/P or Power, Sensitivity, TPR
  Precision <- ct[4]/(ct[3]+ct[4]) #TP/P*
  F1 <- 2/(1/Recall+1/Precision)
  Values <- as.vector(round(c(Accuracy, Recall, Precision, F1),4))
  Metrics = c("Accuracy", "Recall", "Precision", "F1")
  cbind(Metrics, Values)
}
#Note: these are the KNN results with k=3
perfcheck(knn.table)
perfcheck(glm.table)
perfcheck(lda.table)
perfcheck(qda.table)
```
Important notice: all results are derived under the assumption that "safe" is the positive class and "not safe" is the negative class.

Some findings: although KNN was the worst in accuracy, it has pretty good precision; however, it has the worst recall. QDA performs better than the other models since it ranks highest in all metrics: accuracy, precision, recall, and F1. The most important metric depends on the use case. For example, if this is for real estate pricing, overall accuracy may matter most, since we simply want to predict as many labels correctly as possible. However, for a person who wants to purchase a house in a safe town, the F1 score is more important, since it balances precision and recall for the "safe" class. Regardless of which metric we consider, QDA always performs best, so the decision is the same as in part a.
***
Part c:
```{r eval=TRUE}
library(pROC)
model_glm = glm(crim ~ ., data = Boston.train, family = "binomial")
train_prob = predict(model_glm, newdata = Boston.train, type = "response")
train_roc = roc(Boston.train$crim ~ train_prob, plot = TRUE, print.auc = TRUE)
as.numeric(train_roc$auc)
```
The ROC curve is a plot of the true positive rate (sensitivity, recall, power) vs. the false positive rate (type I error, 1 - specificity) as the classification threshold varies. Since AUC = 0.9764151, very close to 1, logistic regression performs well on the training dataset.
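To make the threshold sweep concrete, here is a minimal sketch (left unevaluated; it reuses `train_prob` from the chunk above and assumes "safe" is the positive class, since "safe" is the factor level coded 1) that rebuilds the ROC points by hand:
```{r eval=FALSE}
# sweep the classification threshold and record (FPR, TPR) at each value
thresholds = seq(0, 1, by = 0.01)
actual_safe = Boston.train$crim == "safe"
tpr = sapply(thresholds, function(t) mean(train_prob[actual_safe] > t))  # TP / P
fpr = sapply(thresholds, function(t) mean(train_prob[!actual_safe] > t)) # FP / N
plot(fpr, tpr, type = "l", xlab = "FPR", ylab = "TPR", main = "ROC by threshold sweep")
```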
***
Part d:
```{r eval=TRUE}
library(caret)
set.seed(1)
ctrl = trainControl(method="repeatedcv", number = 5, repeats = 3)
knn_grid <- expand.grid(k = 1:15)
knnFit = train(crim ~ ., data = Boston, method = "knn", trControl = ctrl,
               preProcess = c("center","scale"), tuneGrid = knn_grid)
plot(knnFit)
knnFit
```
I used repeated 5-fold cross-validation (3 repeats) on the whole Boston dataset with a grid search to find the optimal k. It turns out that k=3 has the highest accuracy; k=3 gives better results in all metrics (accuracy, precision, recall, and F1 score) than the previously used k=8.
***
Part e:
Assumptions in each model:

1. KNN: it is a distance-based classifier; it implicitly assumes that the smaller the distance between two points, the more similar they are.
2. Logistic regression: binary logistic regression requires the dependent variable to be binary (ordinal logistic regression requires it to be ordinal). It requires the observations to be independent of each other, little or no multicollinearity among the independent variables, and linearity between the independent variables and the log odds.
3. LDA: each variable has a normal distribution given any class label, and all classes share the same variance.
4. QDA: each variable has a normal distribution given any class label, but each class may have its own variance.

When I see an assumption violation, I would either standardize or normalize the violated features in the original dataset to validate the fit. A minimal sketch of checking these assumptions follows.
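This sketch is left unevaluated and assumes `Boston.train` from Q2 and the `car` package:
```{r eval=FALSE}
library(car)
# multicollinearity among predictors (logistic regression assumption);
# VIF values much larger than ~5-10 are a red flag
vif(glm(crim ~ ., data = Boston.train, family = binomial))
# per-class normality of one predictor (LDA/QDA assumption), e.g. nox
by(Boston.train$nox, Boston.train$crim, shapiro.test)
# class-specific variances: LDA assumes they are equal, QDA does not
by(Boston.train$nox, Boston.train$crim, var)
```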
***
\newpage
Q3)
Part a:
K-fold cross-validation gives a better estimate of the test error: it randomly divides the dataset into k folds, the model is trained on k-1 folds and tested on the remaining fold, and the performance is then averaged over the k held-out folds. The validation set approach randomly divides the data into just two parts, a train set and a test set; it does not average test error rates the way k-fold cross-validation does, so its estimate is more variable and depends on the particular split. In summary, k-fold cross-validation should be the better method to evaluate model performance. A conceptual sketch follows.
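Here is a minimal, unevaluated sketch of the idea for the logistic model (it assumes the factor-coded `Boston` from Q2):
```{r eval=FALSE}
set.seed(1)
k = 5
folds = sample(rep(1:k, length.out = nrow(Boston))) # random fold assignment
cv_err = sapply(1:k, function(i) {
  fit = glm(crim ~ ., data = Boston[folds != i, ], family = binomial)
  prob = predict(fit, Boston[folds == i, ], type = "response")
  pred = ifelse(prob > 0.5, "safe", "not safe") # "safe" is the level coded 1
  mean(pred != Boston$crim[folds == i])
})
mean(cv_err) # test error averaged over the k held-out folds
```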
***
Part b:
```{r eval=TRUE}
library(glmnet)
set.seed(1)
x=model.matrix(crim~.,data=Boston.backup)[,-1]
y=Boston.backup$crim
# cross-validated lasso (alpha=1) on the original continuous crim
cv.out=cv.glmnet(x,y,alpha=1)
plot(cv.out)
bestlam=cv.out$lambda.min
bestlam
cv.out
lasso.mod=glmnet(x,y,alpha=1,lambda=bestlam)
coef(lasso.mod)
# KNN on the reduced feature set
library(class)
Boston.train=Boston.train[, setdiff(names(Boston.train),c('age','tax','black'))]
Boston.test=Boston.test[, setdiff(names(Boston.test),c('age','tax','black'))]
train.X=Boston.train[, setdiff(names(Boston.train),c('crim'))]
train.y=Boston.train$crim
test.X=Boston.test[, setdiff(names(Boston.test),c('crim'))]
knn.pred=knn(train.X,test.X,train.y,k=3)
1-mean(knn.pred==Boston.test$crim)
```
Basic idea: since crim is originally a continuous variable, we can use feature selection methods from the regression setting.
Possible methods to improve the test error include:

1. Using lasso regression to do feature selection first, then training on the newly selected features.
2. Using forward or backward subset selection to choose the best combination of variables.
3. Scaling/standardizing the numeric variables during data preprocessing.

Here I used method 1 for illustration (a sketch of method 3 follows below). I kept the predictors with absolute lasso coefficient > 0.01, so the features "black", "age", and "tax" are dropped. Now, for the KNN method with k=3, the test error improves to 0.07920792 from the previous 0.0990099 in Q2.
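For completeness, here is a minimal, unevaluated sketch of method 3: standardizing predictors before KNN, which is distance-based and therefore scale-sensitive. It reuses `train.X`, `test.X`, and `train.y` from the chunk above:
```{r eval=FALSE}
# scale the training predictors, then apply the same centering/scaling to the test set
train.X.sc = scale(train.X)
test.X.sc = scale(test.X,
                  center = attr(train.X.sc, "scaled:center"),
                  scale = attr(train.X.sc, "scaled:scale"))
knn.pred.sc = knn(train.X.sc, test.X.sc, train.y, k = 3)
1 - mean(knn.pred.sc == Boston.test$crim)
```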
***
Part c:
In binary logistic regression, MLE (maximum likelihood estimation) works as follows: we define the loss function to be the negative log-likelihood, $-\sum_n [y_n \log(\hat{y}_n) + (1-y_n)\log(1-\hat{y}_n)]$, also called the cross-entropy loss. Minimizing this loss is equivalent to maximizing the likelihood of the observed labels. We eventually find the estimated probabilities $\hat{y}_n$ (through the fitted coefficients) that make the observed data most likely, which minimizes the error coming from misclassification and pushes the error rate toward the Bayes error rate.
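A minimal numeric sketch of this idea (left unevaluated; it assumes `Boston.backup` and uses `nox` as a single illustrative predictor):
```{r eval=FALSE}
# negative log-likelihood (cross-entropy) of a logistic model, minimized over beta
neg_loglik <- function(beta, X, y) {
  p <- 1 / (1 + exp(-X %*% beta)) # predicted probabilities y_hat
  -sum(y * log(p) + (1 - y) * log(1 - p))
}
X <- cbind(1, Boston.backup$nox) # intercept + one predictor
y <- as.numeric(Boston.backup$crim > 0.3) # 1 = "not safe"
optim(c(0, 0), neg_loglik, X = X, y = y)$par # numeric MLE of the coefficients
coef(glm(y ~ Boston.backup$nox, family = binomial)) # glm's MLE, for comparison
```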
***
Part d (BONUS):
***
![Part d bonus: MLE in binary logistic regression](/Users/fanyangxin/Desktop/Statistical Methods/Q3 part d.jpeg)
\newpage
### Write comments, questions: ...
***
I hereby write and submit my solutions without violating academic honesty and integrity. If not, I accept the consequences. Yangxin Fan
### List the friends you worked with (name, last name): Myself
### Disclose the resources or persons if you get any help: ISLR
### How long did the assignment solutions take?: 6 hours
***
## References
...