-
Notifications
You must be signed in to change notification settings - Fork 9
/
Copy pathExample_9_MissingData.Rmd
319 lines (235 loc) · 11.2 KB
/
Example_9_MissingData.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
---
title: "Missing Data Imputation using Bayesian Models"
author: "coreysparks"
date: "October 29, 2014"
output:
html_document:
highlight: tango
---
In this example, we will use Bayesian hierarchical models to do some imputation of missing data. I will first do a simple case where we use the posterior predictive distribution to impute a missing value.
The second case will impute county poverty rates in Texas.
The final example will impute student math scores from a longitudinal data set (the [ECLS-K ](http://nces.ed.gov/ecls/kinderdatainformation.asp) ).
We will use JAGS to illustrate the use of this type of imputation
## Example 1
In this example we will show how to impute missing values from the posterior predictive distribution using JAGS. This is actally very easy to do, as JAGS/BUGS will treat all missing values as stochastic nodes in the model and estimate their posterior directly.
```{r simple_ex}
library(rjags)
model1<-"
#continous outcome
model{
for (i in 1:n){
y[i]~dnorm(mu[i], tau)
mu[i]<-alpha+u[i]
u[i]~dnorm(0, tauu)
}
alpha~dnorm(0,.001)
tau~dgamma(.001, .001)
tauu~dgamma(.001, .001)
}
"
```
First we load our data, Here I use the first 100 people from the 2011 BRFSS data's BMI variable
```{r jags_simp_ex}
load("/Users/ozd504/Google Drive/dem7903_App_Hier/data/brfss_11.Rdata")
set.seed(seed = 1234)
samps<-sample(1:dim(brfss_11)[1], size = 100, replace = F)
dat<-list(y=brfss_11$"_BMI5"[samps]/100, n=100)
init.rng1<-list(".RNG.seed" = 1234, ".RNG.name" = "base::Mersenne-Twister")
init.rng2<-list(".RNG.seed" = 5678, ".RNG.name" = "base::Mersenne-Twister")
mod<-jags.model(file=textConnection(model1), data=dat,inits =list(init.rng1, init.rng2) , n.chains=2)
#next, we update the model, this is the "burn in" period
update(mod, 1000)
#sample 2000 samples from each chain (5,000/5 = 1000 * 2 chains = 2000)
samps<-coda.samples(mod, variable.names=c( "mu"), n.iter=1000)
#Numerical summary of the missing values
sums<-summary(samps)
sums$statistics[is.na(dat$y)==T,]
mean(dat$y, na.rm=T)
sd(dat$y, na.rm=T)
```
So we see that we have 5 missing cases, and their values have been imputed
## Example 2
This case involves us estimating the missing data for poverty rates at the county level in Texas from the American Community Survey. In this case, we have missing values both in the outcome (poverty rate) and a predictor (unemployment rate). We build a model to jointly estimate the missing values of unemployment, then use those values to help estimate the values of poverty. Since all the data are observed, we artifically knock out a sample of cases then compare our imputed estimates to the observed values that we knocked out. This is called a cross-validation experiment.
```{r tx_examp}
library(maptools)
library(car)
dat<-readShapePoly("/Users/ozd504/Google Drive/dem7903_App_Hier/Bayes/TXsda.shp")
dat$ppoor2k8<-ifelse(dat$ppoor2k8==0,NA, dat$ppoor2k8)
dat$punem2k8<-ifelse(dat$punem2k8==0,NA, dat$punem2k8)
dat2<-dat@data
dat2$mispov<-dat2$ppoor2k
dat2$misunem<-dat2$punemp2k
set.seed(1234)
#Here we artifically put in missing cases for 75 of the counties.
dat2$mispov[sample(1:length(dat2$STATE), size=75, replace=F)]<-NA
dat2$misunem[sample(1:length(dat2$STATE), size=75, replace=F)]<-NA
summary(dat2$ppoor2k)
summary(dat2$mispov)
missings<-which(is.na(dat2$mispov)==T)
N<-length(dat$STATE)
model2<-"
model{
for(i in 1:N)
{
pov[i]~dnorm(mu[i], taup)
unem[i]~dnorm(muun[i],tauu)
mu[i] <- alpha + beta*muun[i]+v[i]
muun[i]<-alpha2+vu[i]
v[i]~dnorm(0, taup)
vu[i]~dnorm(0, tauu)
}
alpha~dnorm(0, .01)
alpha2~dnorm(0, .01)
beta~dnorm(0,.01)
taup<-pow(sd1,-2)
sd1~dunif(0,100)
tauu<-pow(sd2,-2)
sd2~dunif(0,100)
}
"
#generate the data
#I use a logit transform of the rate to avoid prediction out of the 0:1 interval
mdat<-list(pov=car::logit(dat2$mispov/100, adjust=.01), unem=car::logit(dat2$misunem/100, adjust=.01), N= N)
init.rng1<-list(".RNG.seed" = 1234, ".RNG.name" = "base::Mersenne-Twister")
init.rng2<-list(".RNG.seed" = 5678, ".RNG.name" = "base::Mersenne-Twister")
mod<-jags.model(file=textConnection(model2), data=mdat,inits =list(init.rng1, init.rng2) , n.chains=2)
#next, we update the model, this is the "burn in" period
update(mod, 1000)
#sample 2000 samples from each chain (5,000/5 = 1000 * 2 chains = 2000)
samps<-coda.samples(mod, variable.names=c( "mu", "muun"), n.iter=1000)
#Numerical summary of the missing values
sums<-summary(samps)
library(boot)
dat$imppov<-inv.logit(sums$statistics[1:254,1])
dat$impunemp<-inv.logit(sums$statistics[255:508,1])
cbind(head(inv.logit(mdat$pov), n=15), head(mdat$imppov, n=15))
```
So we have our missing values, Let's compare them to the actual known values and see how our estimation did
```{r tx_comparison}
dat$BayesianEst<-inv.logit(sums$statistics[1:254,1])
dat$unem.est<-inv.logit(sums$statistics[255:508,1])
dat$MissingCounties<-dat2$mispov
dat$misunem<-dat2$misunem
dat$PovertyRate<-dat$ppoor2k
dat$Imputed<-ifelse(is.na(dat$MissingCounties)==T, dat$BayesianEst,dat$ppoor2k)
errs1<-(dat$BayesianEst[missings]-dat$ppoor2k[missings])/dat$ppoor2k[missings]
mpe<-sum(errs1[is.finite(errs1)])*1/N
mpe
errs3<-abs((dat$BayesianEst[missings]-dat$ppoor2k[missings])/dat$ppoor2k[missings])
mape<-100*sum(errs3[is.finite(errs3)])*1/N
mape
```
So our model didn't estimate the rate very well, we were 29.2% off on average. This model could be improved by incorporating spatial structure into the model.
## Example 3
The last example will use the ECLS-K data. Here, we will impute student math scores for panels of the survey where the student was not observed.
```{r ecls_ex}
load("~/Google Drive/dem7903_App_Hier/data/eclsk.Rdata")
names(eclsk)<-tolower(names(eclsk))
library (car)
library(rjags)
#get out only the variables I'm going to use for this example
myvars<-c( "childid","gender", "race", "r1_kage","r2_kage", "r4age", "r5age", "r6age", "r7age","c1r4mtsc", "c4r4mtsc", "c5r4mtsc", "c6r4mtsc", "c7r4mtsc", "w1povrty","w1povrty","w3povrty", "w5povrty", "w8povrty","wkmomed", "s2_id", "p2fsstat","p5fsstat","p6fsstat","p7fsstat", "cregion")
eclsk<-eclsk[,myvars]
#recode outcome, math score at each of the 4 waves
eclsk$math1<-ifelse(eclsk$c1r4mtsc==-9,NA, eclsk$c1r4mtsc)
eclsk$math2<-ifelse(eclsk$c4r4mtsc==-9,NA, eclsk$c4r4mtsc)
eclsk$math3<-ifelse(eclsk$c5r4mtsc==-9,NA, eclsk$c5r4mtsc)
eclsk$math4<-ifelse(eclsk$c6r4mtsc==-9,NA, eclsk$c6r4mtsc)
eclsk$math5<-ifelse(eclsk$c7r4mtsc==-9,NA, eclsk$c7r4mtsc)
eclsk$age1<-ifelse(eclsk$r1_kage==-9, NA, eclsk$r1_kage/12)
eclsk$age2<-ifelse(eclsk$r4age==-9, NA, eclsk$r4age/12)
#for the later waves, the NCES group the ages into ranges of months, so 1= <105 months, 2=105 to 108 months. So, I fix the age at the midpoint of the interval they give, and make it into years by dividing by 12
eclsk$age3<-recode(eclsk$r5age,recodes="1=105; 2=107; 3=109; 4=112; 5=115; 6=117; -9=NA")/12
eclsk$age4<-recode(eclsk$r6age,recodes="1=118; 2=129; 3=135; 4=141; 5=155; -9=NA")/12
eclsk$age5<-recode(eclsk$r7age,recodes="1=155; 2=166; 3=172; 4=178; 5=192; -9=NA")/12
eclsk$pov1<-ifelse(eclsk$w1povrty==1,1,0)
eclsk$pov2<-ifelse(eclsk$w1povrty==1,1,0)
eclsk$pov3<-ifelse(eclsk$w3povrty==1,1,0)
eclsk$pov4<-ifelse(eclsk$w5povrty==1,1,0)
eclsk$pov5<-ifelse(eclsk$w5povrty==1,1,0)
#Recode race with white, non Hispanic as reference using dummy vars
eclsk$hisp<-recode (eclsk$race, recodes="3:4=1;-9=NA; else=0")
eclsk$black<-recode (eclsk$race, recodes="2=1;-9=NA; else=0")
eclsk$asian<-recode (eclsk$race, recodes="5=1;-9=NA; else=0")
eclsk$nahn<-recode (eclsk$race, recodes="6:7=1;-9=NA; else=0")
eclsk$other<-recode (eclsk$race, recodes="8=1;-9=NA; else=0")
eclsk$male<-recode(eclsk$gender, recodes="1=1; 2=0; -9=NA")
eclsk$mlths<-recode(eclsk$wkmomed, recodes = "1:2=1; 3:9=0; else = NA")
eclsk$mgths<-recode(eclsk$wkmomed, recodes = "1:3=0; 4:9=1; else =NA")
```
To analyze data longitudinally, we need to reshape the data from the current "wide" format (repeated measures in columns) to a "long" format (repeated observations in rows). The `reshape()` function allows us to do this easily. It allows us to specify our repeated measures, time varying covariates as well as time-constant covariates.
This takes a long time to run with the full sample, so I just subset to children from the south census region (cregion==3)
```{r}
eclsk<-subset(eclsk, cregion==3)
e.long<-reshape(eclsk, idvar="childid", varying=list(math = c("math1", "math2", "math3", "math4","math5"),
age = c("age1", "age2", "age3", "age4", "age5"),
pov= c("pov1", "pov2", "pov3", "pov4", "pov5")),
times=1:5,direction="long",
drop = names(eclsk)[c(4:19,22:25) ],)
e.long<-e.long[order(e.long$childid, e.long$time),]
```
###Models
The first model is a simple random intercept model for each child's mean for the math score, with a population trajectory for time, not child specific
```{r}
model1<-"
model{
#Likelihood
for( i in 1:n)
{
math[i]~dnorm(mu[i], tau)
mu[i]<-b0+b1*time[i]+u[childnum[i]]
mathpred[i]~dnorm(mu[i], tau)
}
#priors
#Prior for random intercept
for (j in 1:nkids)
{
u[j] ~ dnorm(0, tauu)
}
#regression effects
b0~dnorm(0, .001)
b1~dnorm(0, .001)
tau<-pow(sd, -2)
sd~dunif(0,100)
tauu<-pow(sdu, -2)
sdu~dunif(0,100)
}
"
```
I only use the first 200 children in the data (5 times *200 kids = 1000 lines of data)
```{r}
eshort<-e.long[1:1000,]
nkids<-table(eshort$childid)
head(nkids) #Number of people within counties
eshort$childnum<-rep(1:length(unique(eshort$childid)), nkids)
dat<-list(math=as.numeric(scale(eshort$math1, center=T, scale=T)), sex=eshort$male, black=eshort$black, hisp=eshort$hisp, other=eshort$other, asian=eshort$asian,pov=eshort$pov1,lths=eshort$mlths,gths=eshort$mgths,age=eshort$age1, time=eshort$time,childnum=eshort$childnum,n=length(eshort$math1), nkids=length(unique(eshort$childid)))
lapply(dat , summary)
```
```{r}
#Set some initial values
init.rng1<-list(".RNG.seed" = 1234, ".RNG.name" = "base::Mersenne-Twister", sd=.1, sdu=.1)
init.rng2<-list(".RNG.seed" = 5678, ".RNG.name" = "base::Mersenne-Twister", sd=.5, sdu=.5)
#Initialize the model
mod1<-jags.model(file=textConnection(model1), data=dat, n.chains=2,inits =list(init.rng1, init.rng2) )
#burn in
update(mod1, 5000)
#collect samples of the parameters
samps1<-coda.samples(mod1, variable.names=c("math"), n.iter=5000, n.thin=1)
samps2<-coda.samples(mod1, variable.names=c("mu"), n.iter=5000, n.thin=1)
#Numerical summary of each parameter:
sums<-summary(samps1)$statistics[,1]
sums2<-summary(samps2)$statistics[,1]
#Insert imputed math scores into data frame, convert the z scores back to ecls t-scores
eshort$math.imp<-(sums*10+50)
```
We can plot the children's scores over time to see trends within child. Some kids are missing, but the second plot shows the imputed values and the recovered trend lines. I only plot the first 20 kids
```{r ,fig.height=6, fig.width=6}
xyplot(math1~time|childid, data=eshort[1:100,],main = "Original Math Scores",
panel=function(x,y){
panel.xyplot(x,y)
panel.lmline(x,y)})
xyplot(math.imp~time|childid, data=eshort[1:100,],main="Imputed Math Scores",
panel=function(x,y){
panel.xyplot(x,y)
panel.lmline(x,y)})
```