---
title: "Midterm-1 Project Portion - Version 1"
author: "First and last name: Yangxin Fan //
Pair's first and last name: Ziyu Xiong"
date: "Submission Date: 03/09/2021"
#output: pdf_document
output:
  pdf_document: default
  df_print: paged
#html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, tidy=TRUE, tidy.opts=list(width.cutoff=80))
```
## Midterm-1 Project Instruction
Midterm-1 has test and project portions. This is the project portion. Based on what we covered in modules 1, 2 and 3, you will apply statistical methods by analyzing data and building predictive models using train and test data sets. The data sets are about college students, their academic performance and their retention status, and include categorical and numerical variables.
Throughout the data analysis, we will consider only two response variables: 1) the current GPA of students, a numerical response variable, call it \textbf{y1}=\textbf{Term.GPA}, and 2) the persistence of students into the following year, a binary response variable (0: not persistent on the next term, 1: persistent on the next term), call it \textbf{y2}=\textbf{Persistence.NextYear}.
Briefly, you will fit regression models on $y1$ and classification models on $y2$ using the subset of predictors in the data set. Don't use all predictors in any model.
***
\section{A. Touch and Feel the Data - 5 pts}
- Import Data Set and Set Up:
Open the data set \textbf{StudentDataTrain.csv}. Be familiar with the data and variables. Start exploring it. Practice the code at the bottom and do the set-up.
- Do Exploratory Data Analysis:
Start with Exploratory Data Analysis (EDA) before running models. You can include visual or aggregate descriptions and summaries of the variables (univariate and some bivariate analyses). It is ok to keep this part very simple.
***
\section{B. Build Regression Models - 20 pts - each model 5 pts}
Build the four linear regression models listed below to predict $y1$ with a small set of useful predictors. Fit all of them and justify your choices (I expect well-grounded justifications that use technical terms). Report the performance indicators $MSE_{train}$, $MSE_{test}$, $R_{adj, train}^2$ and $R_{adj, test}^2$ in a comparative table, using train and test data sets. The regression models you will fit:
\begin{enumerate}
\item Best OLS SLR
\item Best OLS MLR using any best small subset of predictors (using any selection methods)
\item Best MLR Ridge with any best small subset of predictors
\item Best MLR Lasso with any best small subset of predictors
\end{enumerate}
For each tuning parameter, justify your choice with statistical methods or computations.
***
\section{C. Build Classification Models - 20 pts - each model 5pts}
Build the four classification models below. Fit all of them and include performance indicators for train and test data sets, separately. Include a confusion matrix for each. For each `train` and `test` data set, report `accuracy`, `recall`, `precision`, and `f1` in a comparative table. For LR or LDA, include the ROC curve, the area under it, and an interpretation. The classification models you will fit:
\begin{enumerate}
\item Logistic Regression (LR) with any best small subset of predictors
\item KNN Classification with any best small subset of predictors
\item Linear Discriminant Analysis (LDA) with any best small subset of predictors
\item Quadratic Discriminant Analysis (QDA) with any best small subset of predictors
\end{enumerate}
Justify your choice of K in KNN with a grid search or CV method.
***
\section{D. Overall Evaluations and Conclusion - 5 pts}
Briefly critique the fitted models and write the conclusion (one sentence for each model, one sentence for each problem: the regression and classification problems we have here). Also, address just one of these: diagnostics, violations, assumption checks, overall quality evaluations of the models, importance analyses (which predictors are most important, or their effects on the response), outlier analyses. You don't need to address all issues. Just show the reflection of our course materials.
***
\newpage{}
\section{Project Evaluation}
The submitted project report will be evaluated according to the following criteria:
\begin{enumerate}
\item All models in the instruction used correctly
\item Completeness and novelty of the model fitting
\item Techniques and theorems of the methods used accurately
\item Reflection of in-class lectures and discussions
\item Achieved reasonable/high performances; insights obtained (patterns of variables)
\item Clear write-ups
\end{enumerate}
If a response is incomplete or does not reflect the expected answer, you may still earn partial points. For each part or model, `partial points` are assigned as follows:
- 25% of pts: little progress with some minor solutions;
- 50% of pts: major calculation mistake(s), but good work;
- 75% of pts: correct method used, but minor mistake(s).
Additionally, the student with the best performance on both problems in the class (`minimum test MSE` from the regression model and `highest precision rate` from the classification model) will get a BONUS.
\section{Tips and Clarifications}
- You will use the test data set to assess the performance of the models fitted on the train data set.
- Implementing 5-fold cross validation while fitting with the train data set is suggested.
- You can use any packages as long as you are 100% sure what they do and their use is clear to the grader.
- Include other useful measurements and plots, kept compact. Not too many! Report useful results in comparative tables.
- Include helpful compact plots with titles.
- Keep at most 4 decimals to present numbers and the performance scores.
- What other models could be used to get better results? This is an extra if you like to discuss.
***
\section{Setup and Useful Codes}
Data handling:
```{r eval=FALSE}
getwd() #gets what working directory is
# Create a RStudio Project and work under it.
#Download, Import and Assign
train <- read.csv("StudentDataTrain.csv")
test <- read.csv("StudentDataTest.csv")
#Summarize univariately
summary(train)
summary(test)
#Dims
dim(train) #5961x18
dim(test) #1474x18
#Without NA's
dim(na.omit(train)) #5757x18
dim(na.omit(test)) #1445x18
#Perc of complete cases
sum(complete.cases(train))/nrow(train)
sum(complete.cases(test))/nrow(test)
#Delete or not? Don't delete!! Use Imputation method to fill na's
train <- na.omit(train)
test <- na.omit(test)
dim(train)
#Missing columns as percent
san = function(x) sum(is.na(x))
round(apply(train,2,FUN=san)/nrow(train),4) #perc of na's in columns
round(apply(train,1,FUN=san)/ncol(train),4) #perc of na's in rows (divide by the number of columns)
#you can create new columns based on features
#Variable/Column names
colnames(test)
#Response variables
#(Do this for train after processing the data AND for test data sets)
y1=train$Term.GPA #numerical
y2=train$Persistence.NextYear #categorical
##Summarize
#y1
hist(y1)
boxplot(y1)
#y2: 0 - not persistent (drop), 1 - persistent (stay)
table(y2)
#Persistence
aa=table(test$Persistence.NextYear, test$Gender)
addmargins(aa)
prop.table(aa,2)
barplot(aa,beside=TRUE,legend=TRUE) #counts
barplot(t(aa),beside=TRUE,legend=TRUE)
```
First fits:
```{r eval=FALSE}
##A lm modeling on y1
summary(model_lm <- lm(y1~HSGPA, data=train))$adj.r.squared #slr model
summary(model_lm)
##A Logistic Regression (with glm) modeling on y2
model_glm <- glm(factor(y2)~HSGPA, data=train, family=binomial)
# model
summary(model_glm)
##checking the classification performance on Y2 with training data
glm.prob.train = predict(model_glm, train, type="response")
# label in one step; assigning strings in place would coerce the vector to character before the second comparison
glm.predict.train = ifelse(glm.prob.train>.5, "Persistent", "Dropped") #1 / 0
##Confusion matrix (report the proportions)
table(glm.predict.train, train$Persistence.NextYear)
```
How to create new columns and make dummy:
```{r eval=FALSE}
##use ifelse, create dummy
Combined$FullTime <- ifelse((Combined$N.RCourse - Combined$N.Ws)>2, 1, 0)
# If registered for full time (12 to 18 hours) students will be assessed the full time undergraduate rate. FOR GRAD, IT IS 9 OR MORE.
##gender dummy
Combined$genderD <- ifelse(Combined$gender=="Male", 1, 0)
##dummy for a D or F grade
EnrollGrades$GradeDF <- ifelse(EnrollGrades$Grade=="F" | EnrollGrades$Grade=="D", 1, 0)
##only numerical from combined and combinedD
numvC <- sapply(Combined, class) == "numeric" | sapply(Combined, class) == "integer"
CombinedN <- Combined[, numvC]
```
***
\newpage
## Your Solutions
\subsection{Section A.}
Exploratory Data Analysis (EDA)
```{r eval=TRUE}
train <- read.csv("StudentDataTrain.csv")
test <- read.csv("StudentDataTest.csv")
#Dims
dim(train)
dim(test)
#Column names
names(train)
#Summarize univariately
summary(train)
summary(test)
#Summarize bivariately
cor(train[, c('Term.GPA','Persistence.NextYear','N.RegisteredCourse','N.Ws')])
#MICE imputations to fill in all the missing values (NAs):
library(mice)
train = mice(train,m=5,maxit=50,meth='pmm',seed=500)
train = complete(train,1)
summary(train)
test = mice(test,m=5,maxit=50,meth='pmm',seed=500)
test = complete(test,1)
summary(test)
#dummies for categorical variables
library(dummies)
total = rbind(train, test)
train = dummy.data.frame(train, sep = ".")
test = dummy.data.frame(test, sep = ".")
```
Summary:
1. The training dataset consists of 5961 observations and 18 features, and the test dataset consists of 1474 observations and 18 features.
2. Among the 18 features, Race_Ethc_Visa, Gender, and Entry_Term are categorical. Persistence and FullTimeStudent are binary. All others are numerical. For Race_Ethc_Visa and Gender separately, the categories are roughly balanced. For HSGPA, SAT_Total, and Term.GPA, the mean and median are very close, which suggests the distributions are not strongly skewed.
3. In the training dataset, there are missing values in the features Gender (2), HSGPA (17), SAT_Total (12), and Perc.Pass (186). In the test dataset, there are missing values in the features Gender (1) and Perc.Pass (28).
4. For bivariate analysis, a correlation matrix shows the correlation between any pair of numeric attributes. For convenience, I only analyze correlations among Term.GPA, Persistence.NextYear, N.RegisteredCourse, and N.Ws. Among these, the correlation between Term.GPA and Persistence.NextYear is the largest, at 0.4766.
5. MICE imputation (predictive mean matching) fills in all NAs in the training and test datasets.
6. Categorical variables are one-hot encoded.
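The missing-value counts in item 3 and the one-hot encoding in item 6 can be checked with base R alone. The sketch below is only an illustration on a small made-up data frame (`toy` and its values are hypothetical, not the student data), and it uses `model.matrix` as a swapped-in alternative to `dummy.data.frame`.

```r
# Toy data frame with hypothetical values (not the student data)
toy <- data.frame(
  Gender = factor(c("Female", "Male", NA, "Female")),
  HSGPA  = c(3.2, NA, 2.8, 3.9),
  N.As   = c(1, 0, 2, 3)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))

# Share of rows without any missing value
complete_share <- mean(complete.cases(toy))

# One-hot encode a factor: "- 1" drops the intercept so every level gets a column
toy_complete <- toy[complete.cases(toy), ]
gender_dummies <- model.matrix(~ Gender - 1, data = toy_complete)

print(na_counts)
print(complete_share)
print(colnames(gender_dummies))
```

Encoding train and test from one shared set of factor levels keeps the dummy columns identical in both data sets, which matters when a level is rare.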
***
\newpage
\subsection{Section B.}
Subset selection: forward stepwise selection with `regsubsets`, choosing the model size with the highest adjusted $R^2$.
```{r eval=TRUE}
library(leaps)
reg.full = regsubsets(Term.GPA~.-Persistence.NextYear, data=train, nvmax=21, method="forward")
reg.summary = summary(reg.full)
reg.summary
which.max(reg.summary$adjr2)
```
- Model 1 (OLS SLR)
```{r eval=TRUE}
y1 = train$Term.GPA
model_lm = lm(y1~HSGPA, data = train)
adj_r2_train = summary(model_lm)$adj.r.squared
SSE = sum(model_lm$residuals**2)
train_MSE = SSE / 5961
pred = predict(model_lm,test)
test_MSE = mean((test$Term.GPA-pred)^2)
SSE = test_MSE * 1474
SSTO = sum((test$Term.GPA-mean(test$Term.GPA))^2)
r2 = 1- SSE/SSTO
adj_r2_test = 1- (1-r2)*(1474-1)/(1474-1-1)
tab = matrix(c(round(train_MSE,4), round(test_MSE,4), round(adj_r2_train,4), round(adj_r2_test,4)), ncol=2, byrow=TRUE)
colnames(tab) = c('Train', 'Test')
rownames(tab) = c('MSE', 'Adjusted R square')
#convert matrix to table
tab = as.table(tab)
tab
```
The best OLS SLR regresses Term.GPA on HSGPA. The training MSE is 1.0279, the adjusted training R square is 0.0038, the test MSE is 0.9856, and the adjusted test R square is 0.0021.
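The test-set adjusted R square in the chunk above is computed by hand as $1-(1-R^2)(n-1)/(n-p-1)$ with the sample sizes typed in. A minimal sketch of the same computation as a reusable function follows; `fit_metrics` is a hypothetical helper name, and the data are simulated, not the student data.

```r
# Hypothetical helper: MSE and adjusted R^2 for predictions `pred` of `y`,
# where p is the number of predictors in the model
fit_metrics <- function(y, pred, p) {
  n    <- length(y)
  sse  <- sum((y - pred)^2)
  ssto <- sum((y - mean(y))^2)
  r2   <- 1 - sse / ssto
  c(MSE = sse / n, adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1))
}

# Simulated data (made-up values) split into train and test
set.seed(1)
n <- 200
d <- data.frame(x = rnorm(n))
d$y <- 1 + 0.5 * d$x + rnorm(n)
tr <- d[1:150, ]
te <- d[151:200, ]

fit <- lm(y ~ x, data = tr)
train_m <- fit_metrics(tr$y, fitted(fit), p = 1)
test_m  <- fit_metrics(te$y, predict(fit, te), p = 1)
print(round(rbind(Train = train_m, Test = test_m), 4))
```

On the training data this reproduces `summary(fit)$adj.r.squared`. On a test set the same formula can return a negative value, which happens when the predictions do worse than the test-set mean.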
***
- Model 2 (OLS MLR)
```{r eval=TRUE}
model_lm = lm(y1~Race_Ethc_Visa.Afram+Gender.Female+HSGPA+SAT_Total+N.As+FullTimeStudent+N.PassedCourse,data=train)
adj_r2_train = summary(model_lm)$adj.r.squared
SSE = sum(model_lm$residuals**2)
train_MSE = SSE / 5961
pred = predict(model_lm,test)
test_MSE = mean((test$Term.GPA-pred)^2)
SSE = test_MSE * 1474
SSTO = sum((test$Term.GPA-mean(test$Term.GPA))^2)
r2 = 1- SSE/SSTO
adj_r2_test = 1- (1-r2)*(1474-1)/(1474-7-1)
tab = matrix(c(round(train_MSE,4), round(test_MSE,4), round(adj_r2_train,4), round(adj_r2_test,4)), ncol=2, byrow=TRUE)
colnames(tab) = c('Train', 'Test')
rownames(tab) = c('MSE', 'Adjusted R square')
#convert matrix to table
tab = as.table(tab)
tab
```
The subset is the one with the highest adjusted $R^2$ in the subset selection step above. The selected variables are Race_Ethc_Visa.Afram, Gender.Female, HSGPA, SAT_Total, N.As, FullTimeStudent, and N.PassedCourse. The training MSE is 1.0257, the adjusted training R square is 0.0050, the test MSE is 0.9907, and the adjusted test R square is -0.0072.
***
- Model 3 (MLR Ridge)
```{r eval=TRUE}
library(glmnet)
set.seed(1)
x=model.matrix(Term.GPA~Race_Ethc_Visa.Afram+Gender.Female+HSGPA+SAT_Total+N.As+FullTimeStudent+N.PassedCourse,data=train)[,-1]
x_test=model.matrix(Term.GPA~Race_Ethc_Visa.Afram+Gender.Female+HSGPA+SAT_Total+N.As+FullTimeStudent+N.PassedCourse,data=test)[,-1]
y=y1
y_test=test$Term.GPA
cv.out=cv.glmnet(x,y,alpha=0,nfolds=5)
plot(cv.out)
bestlam=cv.out$lambda.min
bestlam
ridge.fit = glmnet(x,y,alpha=0,lambda=bestlam) #refit at the CV-selected lambda (0.2004251 with this seed)
pred_1 = predict(ridge.fit,x)
pred_2 = predict(ridge.fit,x_test)
train_MSE = mean((train$Term.GPA-pred_1)^2)
SSE = train_MSE * 5961
SSTO = sum((train$Term.GPA-mean(train$Term.GPA))^2)
r2 = 1- SSE/SSTO
adj_r2_train = 1- (1-r2)*(5961-1)/(5961-7-1)
test_MSE = mean((test$Term.GPA-pred_2)^2)
SSE = test_MSE * 1474
SSTO = sum((test$Term.GPA-mean(test$Term.GPA))^2)
r2 = 1- SSE/SSTO
adj_r2_test = 1- (1-r2)*(1474-1)/(1474-7-1)
tab = matrix(c(round(train_MSE,4), round(test_MSE,4), round(adj_r2_train,4), round(adj_r2_test,4)), ncol=2, byrow=TRUE)
colnames(tab) = c('Train', 'Test')
rownames(tab) = c('MSE', 'Adjusted R square')
#convert matrix to table
tab = as.table(tab)
tab
```
Best lambda is 0.2004251. The training_MSE is 1.0259, the adjusted training R square is 0.0047, the test_MSE is 0.9888, and adjusted test R square is -0.0053.
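To make the tuning step concrete, here is a minimal base R sketch of choosing a ridge penalty by five-fold cross-validation. It is an illustration under stated assumptions, not what `cv.glmnet` does internally: `ridge_coef` and `cv_ridge` are hypothetical helpers, the data are simulated, and the closed-form penalty is scaled differently from glmnet's, so the lambda values are not comparable with the one reported above.

```r
# Closed-form ridge coefficients for centered/scaled predictors
# (sketch only; glmnet scales its penalty differently)
ridge_coef <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y - mean(y)))
}

# Mean held-out MSE for each lambda over k folds
cv_ridge <- function(X, y, lambdas, k = 5) {
  folds <- sample(rep(seq_len(k), length.out = nrow(X)))
  sapply(lambdas, function(lam) {
    fold_mse <- sapply(seq_len(k), function(f) {
      hold <- folds == f
      # Standardize with training-fold statistics only
      mu  <- colMeans(X[!hold, , drop = FALSE])
      s   <- apply(X[!hold, , drop = FALSE], 2, sd)
      Xtr <- scale(X[!hold, , drop = FALSE], center = mu, scale = s)
      Xho <- scale(X[hold, , drop = FALSE], center = mu, scale = s)
      b    <- ridge_coef(Xtr, y[!hold], lam)
      pred <- mean(y[!hold]) + Xho %*% b
      mean((y[hold] - pred)^2)
    })
    mean(fold_mse)
  })
}

# Simulated data (made-up values)
set.seed(1)
n <- 120
X <- matrix(rnorm(n * 4), n, 4)
y <- drop(X %*% c(1, 0.5, 0, 0)) + rnorm(n)
lambdas <- 10^seq(-3, 3, length.out = 25)
cv_mse <- cv_ridge(X, y, lambdas)
best_lambda <- lambdas[which.min(cv_mse)]
print(best_lambda)
```

The lambda with the smallest cross-validated MSE plays the role of `cv.out$lambda.min` in the chunk above.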
***
- Model 4 (MLR Lasso)
```{r eval=TRUE}
library(glmnet)
set.seed(1)
x=model.matrix(Term.GPA~Race_Ethc_Visa.Afram+Gender.Female+HSGPA+SAT_Total+N.As+FullTimeStudent+N.PassedCourse,data=train)[,-1]
x_test=model.matrix(Term.GPA~Race_Ethc_Visa.Afram+Gender.Female+HSGPA+SAT_Total+N.As+FullTimeStudent+N.PassedCourse,data=test)[,-1]
y=y1
y_test=test$Term.GPA
cv.out=cv.glmnet(x,y,alpha=1,nfolds=5)
plot(cv.out)
bestlam=cv.out$lambda.min
bestlam
lasso.fit = glmnet(x,y,alpha=1,lambda=bestlam) #refit at the CV-selected lambda (0.0001381452 with this seed)
pred_1 = predict(lasso.fit,x)
pred_2 = predict(lasso.fit,x_test)
train_MSE = mean((train$Term.GPA-pred_1)^2)
SSE = train_MSE * 5961
SSTO = sum((train$Term.GPA-mean(train$Term.GPA))^2)
r2 = 1- SSE/SSTO
adj_r2_train = 1- (1-r2)*(5961-1)/(5961-7-1)
test_MSE = mean((test$Term.GPA-pred_2)^2)
SSE = test_MSE * 1474
SSTO = sum((test$Term.GPA-mean(test$Term.GPA))^2)
r2 = 1- SSE/SSTO
adj_r2_test = 1- (1-r2)*(1474-1)/(1474-7-1)
tab = matrix(c(round(train_MSE,4), round(test_MSE,4), round(adj_r2_train,4), round(adj_r2_test,4)), ncol=2, byrow=TRUE)
colnames(tab) = c('Train', 'Test')
rownames(tab) = c('MSE', 'Adjusted R square')
#convert matrix to table
tab = as.table(tab)
tab
```
Best lambda is 0.0001381452. The training_MSE is 1.0257, the adjusted training R square is 0.0050, the test_MSE is 0.9906, and adjusted test R square is -0.0071.
***
\newpage
\subsection{Section C.}
- Model 1 (Logistic Regression)
```{r eval=TRUE}
library(pROC)
glm.fits=glm(as.factor(Persistence.NextYear)~Term.GPA+Perc.PassedEnrolledCourse, data=train, family=binomial)
glm.probs=predict(glm.fits,train,type="response")
glm.pred=rep(0, nrow(train))
glm.pred[glm.probs>0.5]=1
ct = table(train$Persistence.NextYear,glm.pred)
ct
Accuracy_train = (ct[1]+ct[4])/sum(ct)
Recall_train = ct[4]/sum((ct[2]+ct[4]))
Precision_train = ct[4]/sum((ct[3]+ct[4]))
F1_train = 2/(1/Recall_train+1/Precision_train)
train_roc=roc(train$Persistence.NextYear,glm.probs,plot=TRUE, legacy.axes = TRUE,
print.auc = TRUE, xlab="False Positive Rate", ylab="True Positive Rate")
AUC_train = as.numeric(train_roc$auc)
glm.probs=predict(glm.fits,test,type="response")
glm.pred=rep(0, nrow(test))
glm.pred[glm.probs>0.5]=1
ct = table(test$Persistence.NextYear,glm.pred)
ct
Accuracy_test = (ct[1]+ct[4])/sum(ct)
Recall_test = ct[4]/sum((ct[2]+ct[4]))
Precision_test = ct[4]/sum((ct[3]+ct[4]))
F1_test = 2/(1/Recall_test+1/Precision_test)
test_roc=roc(test$Persistence.NextYear,glm.probs,plot=TRUE, legacy.axes = TRUE,
print.auc = TRUE, xlab="False Positive Rate", ylab="True Positive Rate")
AUC_test = as.numeric(test_roc$auc)
tab = matrix(c(round(Accuracy_train,4), round(Accuracy_test,4), round(Recall_train,4), round(Recall_test,4), round(Precision_train,4),round(Precision_test,4),
round(F1_train,4),round(F1_test,4),round(AUC_train,4),round(AUC_test,4)),ncol=2, byrow=TRUE)
colnames(tab) = c('Train', 'Test')
rownames(tab) = c('Accuracy', 'Recall','Precision','F1','AUC')
tab = as.table(tab)
tab
```
Results of train and test accuracy, recall, precision, F1, ROC curve, and AUC are shown above.
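The metrics above are read from the confusion table by position (`ct[1]`, `ct[4]`), which relies on the table being 2 x 2 with rows as actual classes and columns as predicted classes. A hedged alternative is to index by name, as in the sketch below; `class_metrics` is a hypothetical helper and the labels are made up for illustration.

```r
# Hypothetical helper: metrics from actual and predicted 0/1 labels,
# treating 1 (persistent) as the positive class
class_metrics <- function(actual, predicted) {
  # Fix the levels so the table is always 2 x 2, even if a class is never predicted
  ct <- table(Actual = factor(actual, levels = c(0, 1)),
              Predicted = factor(predicted, levels = c(0, 1)))
  tp <- ct["1", "1"]; tn <- ct["0", "0"]
  fp <- ct["0", "1"]; fn <- ct["1", "0"]
  recall    <- tp / (tp + fn)
  precision <- tp / (tp + fp)
  c(Accuracy = (tp + tn) / sum(ct), Recall = recall, Precision = precision,
    F1 = 2 * precision * recall / (precision + recall))
}

# Made-up labels for illustration
actual    <- c(1, 1, 1, 1, 0, 0, 1, 0, 1, 1)
predicted <- c(1, 1, 1, 0, 0, 1, 1, 0, 1, 1)
m <- class_metrics(actual, predicted)
print(round(m, 4))
```

With these labels there are 6 true positives, 2 true negatives, 1 false positive and 1 false negative, so accuracy is 0.8 and recall, precision and F1 are all 6/7.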
***
- Model 2 (KNN)
```{r eval=TRUE}
library(pROC)
library(caret)
library(class)
set.seed(1)
ctrl = trainControl(method="repeatedcv", number = 5, repeats = 3)
knn_grid <- expand.grid(k=1:40)
knnFit = train(as.factor(Persistence.NextYear)~Term.GPA+Perc.PassedEnrolledCourse, data = train, method = "knn", trControl = ctrl, preProcess = c("center","scale"), tuneGrid = knn_grid)
plot(knnFit)
knnFit
train.X=subset(train, select=c(Term.GPA,Perc.PassedEnrolledCourse))
train.y=train$Persistence.NextYear
test.X=subset(test, select=c(Term.GPA,Perc.PassedEnrolledCourse))
knn.pred=knn(train.X,test.X,train.y,k=35)
ct = table(test$Persistence.NextYear,knn.pred)
Accuracy_test = (ct[1]+ct[4])/sum(ct)
Recall_test = ct[4]/sum((ct[2]+ct[4]))
Precision_test = ct[4]/sum((ct[3]+ct[4]))
F1_test = 2/(1/Recall_test+1/Precision_test)
knn.pred=knn(train.X,train.X,train.y,k=35)
ct = table(train$Persistence.NextYear,knn.pred)
Accuracy_train = (ct[1]+ct[4])/sum(ct)
Recall_train = ct[4]/sum((ct[2]+ct[4]))
Precision_train = ct[4]/sum((ct[3]+ct[4]))
F1_train = 2/(1/Recall_train+1/Precision_train)
tab = matrix(c(round(Accuracy_train,4), round(Accuracy_test,4), round(Recall_train,4), round(Recall_test,4), round(Precision_train,4),round(Precision_test,4),
round(F1_train,4),round(F1_test,4)),ncol=2, byrow=TRUE)
colnames(tab) = c('Train', 'Test')
rownames(tab) = c('Accuracy', 'Recall','Precision','F1')
tab = as.table(tab)
tab
```
Results of train and test accuracy, recall, precision, and F1 are shown above. Using a grid search over k=1 to k=40 with five-fold CV (repeated three times) on the training data, we find the optimal k=35.
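The grid search behind the choice of k can be sketched in base R. This is an illustration only, not a reimplementation of `caret::train`: `knn_predict` and `cv_knn` are hypothetical helpers, the data are simulated, and the predictors are centered and scaled inside each fold because the caret call above tunes k on centered and scaled predictors while the later `knn()` call uses the raw ones.

```r
# Minimal k-nearest-neighbour classifier for 0/1 labels (base R only)
knn_predict <- function(train_x, train_y, new_x, k) {
  apply(new_x, 1, function(pt) {
    d <- colSums((t(train_x) - pt)^2)         # squared Euclidean distances
    nearest <- order(d)[seq_len(k)]
    as.integer(mean(train_y[nearest]) > 0.5)  # majority vote (odd k avoids ties)
  })
}

# Mean held-out accuracy for each k over the folds
cv_knn <- function(x, y, ks, folds = 5) {
  fold_id <- sample(rep(seq_len(folds), length.out = nrow(x)))
  sapply(ks, function(k) {
    mean(sapply(seq_len(folds), function(f) {
      hold <- fold_id == f
      # Scale with training-fold statistics, then apply to the held-out fold
      mu  <- colMeans(x[!hold, , drop = FALSE])
      s   <- apply(x[!hold, , drop = FALSE], 2, sd)
      xtr <- scale(x[!hold, , drop = FALSE], mu, s)
      xho <- scale(x[hold, , drop = FALSE], mu, s)
      mean(knn_predict(xtr, y[!hold], xho, k) == y[hold])
    }))
  })
}

# Simulated two-predictor data (made-up values)
set.seed(1)
n <- 200
x <- cbind(gpa = rnorm(n, 3, 0.6), pass = runif(n))
y <- as.integer(x[, "gpa"] + 2 * x[, "pass"] + rnorm(n, sd = 0.5) > 4)
ks <- seq(1, 25, by = 2)
cv_acc <- cv_knn(x, y, ks)
best_k <- ks[which.max(cv_acc)]
print(best_k)
```

Plotting `cv_acc` against `ks` gives the same kind of accuracy curve as `plot(knnFit)`.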
***
- Model 3 (LDA)
```{r eval=TRUE}
library(MASS)
lda.fit=lda(as.factor(Persistence.NextYear)~Term.GPA+Perc.PassedEnrolledCourse, data=train)
lda.pred=predict(lda.fit,train)
lda.class=lda.pred$class
ct = table(train$Persistence.NextYear,lda.class)
ct
Accuracy_train = (ct[1]+ct[4])/sum(ct)
Recall_train = ct[4]/sum((ct[2]+ct[4]))
Precision_train = ct[4]/sum((ct[3]+ct[4]))
F1_train = 2/(1/Recall_train+1/Precision_train)
par(pty = "s")
train_roc = roc(train$Persistence.NextYear,lda.pred$posterior[,2],plot=TRUE, legacy.axes = TRUE,
print.auc = TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage")
AUC_train = as.numeric(train_roc$auc)
lda.pred=predict(lda.fit,test)
lda.class=lda.pred$class
ct = table(test$Persistence.NextYear,lda.class)
ct
Accuracy_test = (ct[1]+ct[4])/sum(ct)
Recall_test = ct[4]/sum((ct[2]+ct[4]))
Precision_test = ct[4]/sum((ct[3]+ct[4]))
F1_test = 2/(1/Recall_test+1/Precision_test)
par(pty = "s")
test_roc = roc(test$Persistence.NextYear,lda.pred$posterior[,2],plot=TRUE, legacy.axes = TRUE,
print.auc = TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage")
AUC_test = as.numeric(test_roc$auc)
tab = matrix(c(round(Accuracy_train,4), round(Accuracy_test,4), round(Recall_train,4), round(Recall_test,4), round(Precision_train,4),round(Precision_test,4),
round(F1_train,4),round(F1_test,4),round(AUC_train,4),round(AUC_test,4)),ncol=2, byrow=TRUE)
colnames(tab) = c('Train', 'Test')
rownames(tab) = c('Accuracy', 'Recall','Precision','F1','AUC')
tab = as.table(tab)
tab
```
Results of train and test accuracy, recall, precision, F1, ROC curve, and AUC are shown above.
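For interpreting the area under the ROC curve: AUC equals the probability that a randomly chosen persistent student receives a higher predicted probability than a randomly chosen non-persistent student, with ties counted as one half. The sketch below computes it from ranks on made-up scores; `auc_rank` is a hypothetical helper, not part of pROC.

```r
# AUC as the probability that a randomly chosen positive case gets a higher
# score than a randomly chosen negative case (ties count one half)
auc_rank <- function(labels, scores) {
  r  <- rank(scores)                 # average ranks handle ties
  n1 <- sum(labels == 1)
  n0 <- sum(labels == 0)
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Made-up labels and scores for illustration
labels <- c(0, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8)
print(auc_rank(labels, scores))
```

Here three of the four positive/negative pairs are ordered correctly, so the AUC is 0.75. A value of 0.5 corresponds to random ranking and 1 to perfect separation.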
***
- Model 4 (QDA)
```{r eval=TRUE}
library(MASS)
qda.fit=qda(as.factor(Persistence.NextYear)~Term.GPA+Perc.PassedEnrolledCourse, data=train)
qda.pred=predict(qda.fit,train) #predict with the QDA fit, not the LDA fit
qda.class=qda.pred$class
ct = table(train$Persistence.NextYear,qda.class)
ct
Accuracy_train = (ct[1]+ct[4])/sum(ct)
Recall_train = ct[4]/sum((ct[2]+ct[4]))
Precision_train = ct[4]/sum((ct[3]+ct[4]))
F1_train = 2/(1/Recall_train+1/Precision_train)
qda.pred=predict(qda.fit,test)
qda.class=qda.pred$class
ct = table(test$Persistence.NextYear,qda.class)
ct
Accuracy_test = (ct[1]+ct[4])/sum(ct)
Recall_test = ct[4]/sum((ct[2]+ct[4]))
Precision_test = ct[4]/sum((ct[3]+ct[4]))
F1_test = 2/(1/Recall_test+1/Precision_test)
tab = matrix(c(round(Accuracy_train,4), round(Accuracy_test,4), round(Recall_train,4), round(Recall_test,4), round(Precision_train,4),round(Precision_test,4),
round(F1_train,4),round(F1_test,4)),ncol=2, byrow=TRUE)
colnames(tab) = c('Train', 'Test')
rownames(tab) = c('Accuracy', 'Recall','Precision','F1')
tab = as.table(tab)
tab
```
Results of train and test accuracy, recall, precision, and F1 are shown above.
***
\newpage
\subsection{Section D.}
Summary:
1. Regression models (variable Persistence.NextYear is excluded)
1.1: Simple linear regression: Forward stepwise selection picks HSGPA as the best single predictor. Test MSE is 0.9856.
1.2: Multiple linear regression: Forward stepwise selection picks the combination of predictors with the highest adjusted $R^2$: Race_Ethc_Visa.Afram, Gender.Female, HSGPA, SAT_Total, N.As, FullTimeStudent, and N.PassedCourse. Test MSE is 0.9907.
1.3: MLR Ridge: Five-fold CV on the training data gives a best lambda of 0.200425126. Test MSE is 0.9888.
1.4: MLR Lasso: Five-fold CV on the training data gives a best lambda of 0.0001381452. Test MSE is 0.9906.
2. Classification models:
The best combination of variables we found is Term.GPA and Perc.PassedEnrolledCourse.
2.1: Logistic regression: Test precision is 0.9741.
2.2: KNN: In the plot of five-fold CV accuracy for k=1 to k=40, the accuracy keeps increasing as k increases and seems to level off around k=35. Test precision is 0.9743.
2.3: LDA: Test precision is 0.9741.
2.4: QDA: Test precision is 0.9746.
***
- BONUS.
***
\newpage
***
I hereby write and submit my solutions without violating the academic honesty and integrity. If not, I accept the consequences. Yangxin Fan
### Write the name of the pair you worked with at the top of the page. If you had no pair, that is ok. List other friends you worked with (first name, last name): Ziyu Xiong
### Disclose the resources or persons if you get any help: ISLR and previous homeworks
### How long did the assignment solutions take?: ...
***
## References
...