-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathModule0_Review and Hypo Tests_HWSubmission_Yangxin_Fan.Rmd
447 lines (338 loc) · 15.6 KB
/
Module0_Review and Hypo Tests_HWSubmission_Yangxin_Fan.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
---
title: "Module 0 Assignment"
author: "Yangxin Fan, Graduate Student"
date: "02/05/2021"
output:
pdf_document: default
html_document:
df_print: paged
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
***
\newpage{}
## Instruction
In Module 0, we briefly reviewed seven **statistical tests methods** in data analysis:
1. Conventional (z- or t-)
2. Permutation (randomization-based, all permutations)
3. Randomization or Simulated Permutation (simulation-based, some permutations, without replacement)
4. Bootstrapping (resampling-based, with replacement)
5. Linear Regression (theory-based, LS-based)
6. and 7. Nonparametric Approaches: Wilcoxon and Rank-based (median test would be used as well)
In the assignment, you will run these tests with a new data set for two-sample independent mean problem below.
An experiment was administered whether or not extra nitrogen affects the stem weight on seedlings. One group (control, n=8) was controlled with **standard nitrogen**, the other group (treatment, n=8) was given **extra nitrogen**. After two weeks, the stem weights were measured in gr. Assume the populations of the samples are normally distributed with unknown equal variances.
The raw data is as follows:
```{r echo=TRUE, results='hold'}
control_group <- c(.40, .45, .35, .27, .46, .33, .30, .43)
trtmnt_group <- c(.49, .45, .35, .38, .48, .55, .47, .65)
boxplot(control_group, trtmnt_group)
round(sd(control_group),2)
round(sd(trtmnt_group),2)
```
```{r echo=TRUE}
# Or, use this way, or make data frame so you can run directly the Module0 R Codes
datam = cbind(c(control_group, trtmnt_group), c(rep(0, 8), rep(1, 8)))
y <- datam[,1]
group <- datam[,2]
colnames(datam)=c("y","group")
#View(datam) #Check the data
```
## Module Assignment Questions
1) (*Descriptive*) Do descriptive analysis.
a. Is the study design an experimental or observational study? Justify.
b. Obtain a side-by-side boxplot on measurements of the two groups. Include the graph. Add title and group names.
c. Obtain summary statistics that show central (mean, median etc) and spread (sd, IQR, range etc) measurements of the data distribution for each group. Make a table and include the statistics.
d. Compare the centers and spreads. Do you see a difference in centers? Do you see a difference in spreads? Comment.
e. Do you see any potential outliers? Find and comment if exists. Explain your criterion to find outliers.
***
2) (*Test*) Using the seven methods above, test at a 5% significance level if the difference in the mean stem weight between seedlings that receive regular nitrogen and those that receive extra nitrogen is not equal.
a. Write the hypotheses.
b. What is a hypothesis testing? Write what you know about hot to test in general.
c. Run the seven tests and make a comparative table, showing the `p-values`. (You need to make the table showing all p-values. Modify the lab codes for this data set. Make sure you run line by line)
d. Conclude each result with a short comment at the specified significance level (answer part c and d together, put all the p-value calculations and the comments in a table)
e. Now, write an `overall comment` on the results that communicate with the goal of the problem. Use the context.
***
3) (*Concepts*) Analyze the validity of some tests.
a. List all the assumptions on `t-test`.
b. Do the assumptions meet? Check each.
c. List all the assumptions on `Wilcoxon test`. Do the assumptions meet?
d. Do you know the assumptions on `linear regression` with LS? (a simple answer works: yes/no. if you know, write all. if not, it is fine we will learn)
e. The three methods are based on randomization or resampling. What are the merits of doing randomization or resampling? Which of the three methods would work well for large data situation? Why?
***
4) (*Extension*) Deep analysis and pitfalls.
a. List the `type of errors` (either Type I or II) committed in the decision made for each test. Make a table that shows all. Describe what Type I and Type II error rates are.
b. Build a 95% `confidence interval` on mean difference between treatment and control groups using t-critical value, df=n1+n2-1, and t-test's standard error formula. Interpret what it says. Does it confirm the p-value result from t-test?
c. Obtain a percentile confidence interval (2.5th to 97.5th) on mean difference for the permutation test (using the permutation sampling distribution of differences on mean). Include the confidence interval. Interpret.
d. Compare the confidence intervals calculated in part b and c. Which one is more precise? Which method is more efficient? Comment.
e. Corrupt the data value .40 as 40 in the control group - as if you make a typo so this is a `bad outlier`. Recalculate the p-values of all tests. What changed? Which tests didn't change dramatically?
f. (BONUS) `Standard error` is basically defined as the standard deviation of the sampling distribution of differences in mean. Either use the R packs or use the std of test statistics simulated for data. Can you find the standard error on mean difference estimate for each test? Do your best to find each. Which one(s) are most efficient test(s)? Why?
g. (BONUS) Ask a challenging question and answer (under the assignment context).
\newpage{}
***
## Your Solutions
Q1)
Part a
The study design is an experimental study since it has a control group and a treatment group, which aims to research
whether or not extra nitrogen affects stem weights on seedlings
***
Part b:
```{r echo=TRUE}
boxplot(y~group, data=datam, xlab = 'Group', ylab='stem weight on seedlings', main="Drop in stem weight on seedlings: Control (0) and Treatment (1) groups")
```
***
Part c
```{r echo=TRUE}
range1 = max(y) - min(y)
range2 = max(y[group==0]) - min(y[group==0])
range3 = max(y[group==1]) - min(y[group==1])
summ=matrix(c(mean(y), mean(y[group==0]), mean(y[group==1]),sd(y), sd(y[group==0]), sd(y[group==1]),median(y), median(y[group==0]), median(y[group==1]),IQR(y), IQR(y[group==0]), IQR(y[group==1]),range1,range2,range3), 3, 5)
colnames(summ)=c("mean","sd","median","IQR","range")
rownames(summ)=c("all","control","treatment")
table1=round(summ,2)
table1
```
***
Part d:
In terms of center, treatment group has a larger mean and median than control group. Treatment group also has a wider spread but a smaller IQR compared to control group.
***
Part e:
If a number is less than Q1 – 1.5×IQR or greater than Q3 + 1.5×IQR, then it is an outlier.
```{r echo=TRUE}
OutVals_control = boxplot(control_group)$out
which(control_group %in% OutVals_control)
OutVals_trtmnt = boxplot(trtmnt_group)$out
which(trtmnt_group %in% OutVals_trtmnt )
```
So there are no outliers in both control group and treatment group
***
\newpage
Q2)
Part a
Mean weight between seedlings that receive regular nitrogen and those that
receive extra nitrogen is equal.
***
Part b:
Hypothesis testing in general is to test an assumption regarding a population
parameter and provide evidence on plausibility of the null hypothesis. The null hypothesis
is usually a hypothesis of equality between population parameters; e.g., a null hypothesis
may state that the population mean return is equal to zero. The alternative hypothesis is
effectively the opposite of a null hypothesis.
***
Part c
```{r echo=TRUE}
# 1. T-test
t2w=t.test(y~group, var.equal = TRUE)
t2p=t2w$p.value
cat(t2p, 'is p-value of T-test')
```
```{r echo=TRUE}
# 2. Permutation Test
diff=mean(trtmnt_group)-mean(control_group)
library(PASWR) # using allocation functions
set.seed(99)
length(y)
alloc=SRS(y,8) # see alloc[1,]
dim(alloc) #factorial(n)/(factorial(n-r)*factorial(r))
alloc[1,] #see an sample result
# how many all combinations
reps=choose(16,8) #sample sizes vary
Comb <- matrix(rep(y,12870),ncol=16, byrow=TRUE)
pdT6<-SRS(1:16,8)
Theta=array(0,12870)
for(i in 1:reps){
Theta[i]=mean(Comb[i,pdT6[i,]])-mean(Comb[i,-pdT6[i,]])
}
# p-value calculation
num_out= sum(Theta>=diff | Theta<=-diff) #how many are out of threshold
pep=num_out/reps
cat(reps,',', num_out, ': number of all combinations and number beyond threshold')
cat(pep, 'is p-value of permutation test')
```
```{r echo=TRUE}
# 3. Randomization Test
reps2 <- 10000 #B=10000
results <- numeric(reps2)
set.seed(99)
for (i in 1:reps2) {
temp <- sample(y)
results[i] <- mean(temp[1:8])-mean(temp[9:16])
}
# p-value
psp=sum(results>=diff | results<=-diff)/(reps2+1)
cat(psp, 'is p-value of randomization test')
```
```{r echo=TRUE}
# 4. Bootstrapping
reps2 <- 10000 #B=10000
results2 <- numeric(reps2)
set.seed(99)
for (i in 1:reps2) {
temp <- sample(y, replace=TRUE)
results2[i] <- mean(temp[1:8])-mean(temp[9:16])
}
bp=sum(results2>=diff | results2<=-diff)/(reps2+1)
cat(bp, 'is p-value of bootstrapping')
```
```{r echo=TRUE}
# 5. Linear regression
lw=summary(lm(y~group))
le=lw$coefficients[2]
lp=coef(lw)[2,4]
cat(lp, 'is p-value of linear regression')
```
```{r echo=TRUE}
# 6. Wilcoxon test
g1=y[group==1]
g0=y[group==0]
fitw=wilcox.test(g0,g1, exact=FALSE)
wp=fitw$p.value
cat(wp, 'is p-value of wilcoxon test')
```
```{r echo=TRUE}
# 7. rank-based test
library(quantreg)
library(Rfit) #rank-based
fit.r=summary(rfit(y~group))
mp=fit.r$coefficients[2,4]
cat(mp, ' is p-value of rank-based test')
```
```{r echo=TRUE}
# result table
A=round(as.matrix(c(t2p,pep,psp,bp,lp,wp,mp)),4)
ftr="Fail to reject the null"
rn="Reject the null in favor of the alt"
B=c(rn,rn,rn,rn,rn,rn,ftr)
C=cbind(A,B)
colnames(C)=c("P-value","Decision at 5% level")
rownames(C)=c("t-test","perm","perm-sim","bootst","lin reg","wilcoxon", "rank-based")
C
```
***
Part d:
Already answered in Part C
***
Part e:
In most our tests, we reject the null hypothesis that Mean weight between seedlings that receive regular nitrogen and those that receive extra nitrogen is equal at 5% statistically significant level. This tells us that extra nitrogen very likely affects the stem weight on seedlings.
***
\newpage
Q3)
Part a
t-test assumptions include:
(1): Independence of the obeservations. Each subject should belong to only one group. There is
no relationship between the observations in each group.
(2): No siginificant outliers in the two groups
(3): Normality. the data for each group should be approximately normally distributed. And the means of two populations being compared should follow normal distribution
(4): Homogeneity of variances. the variance of outcome variable should be equal in each group.
***
Part b:
For (1), since observation belong to either control or treatment group and each one is measurement weight
for a different stem weight, it satisfies assumption (1).
For (2), there are no outliers proved by Part e in Q1, it satisfies assumption (2).
For (3), we test it by drawing Q-Q plot.
```{r echo=TRUE}
# Q-Q plot for control group
qqnorm(y[group==0])
# Q-Q plot for treatment group
qqnorm(y[group==1])
```
We can see that points in each plot roughly form a straight diagonal line, it satisfies assumption (3).
For (4), we can implement Levene’s test.
```{r echo=TRUE}
# Q-Q plot for control group
library(car)
leveneTest(y,as.factor(group))
```
Since p value > .05, this incidicates there is no significant difference between the group variances in
the two groups, it satisfies assumption (4).
***
Part c
(1). Data are paired and come from the same population.
(2). Each pair is chosen randomly and independently.
(3). The data are measured on at least an interval scale when, as is usual, within-pair differences are calculated to perform the test (though it does suffice that within-pair comparisons are on an ordinal scale).
It satisfies all these assumptions.
***
Part d:
Yes, assumptions on linear regression with LS include:
1. The regression model is linear in the coefficients and the error term
2. The error term has a population mean of zero
3. All independent variables are uncorrelated with the error term
4. Observations of the error term are uncorrelated with each other
5. The error term has a constant variance (no heteroscedasticity)
6. No independent variable is a perfect linear function of other explanatory variables
***
Part e:
The merits of doing randomization or resampling include:
1. We rarely think in terms of the populations from which the data came, and there is no need to assume anything about normality or homoscedasticity
2. Null hypothesis has nothing to do with parameters, but is phrased rather vaguely, as, for example, the hypothesis that the treatment has no effect on the how participants perform
3. We do calculate some sort of test statistic, however we do not compare that statistic to tabled distributions.
4. we are not concerned with populations, we are not concerned with estimating (or even testing) characteristics of those populations.
Among these three methods, randomization test works well for a large trial since it requires the less computational resoruces than permutation test and bootstrapping when dealing with large datasets.
***
\newpage
Q4)
Part a
Type I error is the rejection of a true null hypothesis, also known as false positive rate.
Type II error is the non-rejection of a false null hypothesis, also known as false negative rate.
```{r echo=TRUE}
A=c(0.05,0.05,0.05,0.05,0.05,0.05,0.05)
A=cbind(A)
colnames(A)=c("Type I error")
rownames(A)=c("t-test","perm","perm-sim","bootst","lin reg","wilcoxon", "rank-based")
A
```
since we choose alpha = 0.05, the type I error is 0.05 for all tests.
***
Part b:
Here, we use t-test to build a 95% confidence interval.
```{r echo=TRUE}
t2w$conf.int
```
The 95% CI is [-0.1932609, -0.0142391]. This confirms our result from p value. Since 0 is not included in the interval, we can tell the null hypothesis is rejected at (1-95%)=5% statistically significant level.
***
Part c
```{r echo=TRUE}
quantile(Theta,probs=seq(0,1,by=0.025))
```
From the above table, we find that 95% confidence interval is [-0.09375, 0.09375]. Since the obeserved statistic is
0.48-0.37=0.11, which is outside of the CI, we reject the null hypothesis at 5% statistically significant level.
***
Part d:
Permutation test is more precise but the traditional t-test is more efficient since permutation method is too expensive for larger samples.
***
Part e:
```{r echo=TRUE}
A=c(0.3445,1,0.9999,0.9999,0.3445,0.1031,0.2465)
B=c(t2p,pep,psp,bp,lp,wp,mp)
C=cbind(A,B)
colnames(C)=c("p value after corrupted value","p value beofre corrupted value")
rownames(C)=c("t-test","perm","perm-sim","bootst","lin reg","wilcoxon", "rank-based")
C
```
After corrupting the data value .40 as 40 in the control group, p value of all tests go up. P value of Wilcoxon and rank-based tests go up relatively less dramatically.
***
***
Part f:
```{r echo=TRUE}
t2w$stderr
sd(Theta)
sd(results)
sd(results2)
lw$coefficients
```
Therefore, standard deviation for difference between two means for t-test is 0.04173417, for permutation test is 0.0484087, for Randomization Test is 0.04893261, for Bootstrapping is 0.04688374, for linear regression is 0.04173417. T-test and linear regression are both efficient since they have smaller standard deviation for means difference of these two groups.
***
***
Part g (Ask a challenging question and answer (under the assignment context).
Why don't we just t-test for testing whether the mean's difference in any cases?
The limitations of t-test appear when the population distribution is extremely skewednot and not randomly sampled.
***
\newpage
### Write comments, questions: ...
***
I hereby write and submit my solutions without violating the academic honesty and integrity. If so, I accept the consequences.
### List the fiends you worked with (name, last name): I completed my homework indepedently
### Disclose the resourcrs or persons if you get any help: I completed my homework indepedently and I used Module0_Lab codes to guide my codings in this homework.
***
## References
...