-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathrstanarm.Rmd
258 lines (171 loc) · 9 KB
/
rstanarm.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
# (PART\*) Part II: rstanarm {-}
# Getting Started with rstanarm
- <span class="pack">rstan</span> installation required.
- Installed as any other R package
# Basic GLM
Attendance behavior of high school juniors at two schools
Target:
- Number of days of absence
Predictors:
- Type of program in which the student is enrolled:
- General, Vocational, Academic
- Standardized test in math
- Gender
## Traditional GLM
```{r basic_poisson_data_prep, eval=FALSE, echo=-10}
library(tidyverse)
attendance = haven::read_dta("https://stats.idre.ucla.edu/stat/stata/dae/nb_data.dta")
attendance <- attendance %>%
mutate(
prog = factor(prog, levels = 1:3, labels = c("General", "Academic", "Vocational")),
prog = fct_relevel(prog, c('Vocational', 'General', 'Academic')),
gender = factor(gender, labels = c('Female', 'Male')),
id = factor(id)
)
saveRDS(attendance, 'data/attendance.Rds')
```
We'll use Poisson regression[^poisson] to model the count of the number of days absent
```{r basic_poisson, echo=-1, eval=-3}
attendance = readRDS('data/attendance.Rds')
attendance_glm <- glm(daysabs ~ math + gender + prog,
data = attendance,
family = poisson)
summary(attendance_glm)
```
```{r basic_poisson_result, echo=FALSE, eval=TRUE}
tidy(attendance_glm) %>%
kable_df()
```
## rstanarm: GLM
<span class="pack">rstanarm</span> uses the same nomenclature and general approach as base R
```{r bayesian_poisson0, echo=FALSE}
attendance_bglm <- stan_glm(daysabs ~ math + gender + prog,
data = attendance,
family = poisson)
attendance_brms <- brm(daysabs ~ math + gender + prog,
data = attendance,
family = poisson, cores=4, thin=10)
save(attendance, attendance_bglm, attendance_brms, attendance_brms_add_re, attendance_brms_inter, file = 'data/attendance_bayes.RData')
```
```{r bayesian_poisson}
library(rstanarm)
attendance_bglm <- stan_glm(daysabs ~ math + gender + prog,
data = attendance,
family = poisson)
summary(attendance_bglm, digits = 2, prob=c(.025, .5, .975))
```
```{r bayesian_poisson_pretty, eval=TRUE, echo=FALSE, cache.rebuild=F}
load('data/attendance_bayes.RData')
summary(attendance_bglm, digits = 2, prob=c(.5, .025, .975))
```
Summary Info:
This is the same as you see in every other regression model:
- mean: the point estimate for the parameter
- sd: standard error for the point estimate
- quantiles: are whatever you want, but here represent the median and 95% <span class="emph">uncertainty inteval</span>
Additional:
- mean_PPD: mean of the posterior predictive distribution (hopefully on par with the mean of the target variable (`daysabs`))
- log-posterior: similar to the log-likelihood from maximum likelihood, but for the Bayesian case
Diagnostics for quick eyeball inspection:
- <span class="emph">Monte Carlo Standard Error</span>: The standard error of the *mean* of the posterior draws. Want mcse than 10% of the posterior standard deviation.
- <span class="emph">$n_{eff}$</span>: is an estimate of the effective number of independent draws from the posterior distribution of the estimand of interest. Because the draws within a chain are not independent if there is autocorrelation, the effective sample size will be smaller than the total number of iterations. Should be greater than 10% of max.
- <span class="emph">$\hat{R}$</span>: measures the ratio of the average variance of samples within each chain to the variance of the pooled samples across chains; if all chains are at equilibrium, these will be the same and R̂ will be one. Desire less than 1.1.
### Adding more options
Typical configuration would involve setting priors, as well as MCMC options such as iterations, warm-up, etc.
```{r bayesian_poisson_options, echo=T}
attendance_bglm <- stan_glm(daysabs ~ math + gender + prog,
data = attendance,
family = poisson,
prior = student_t(df = 7),
prior_intercept = student_t(df = 7),
iter = 5000,
warmup = 2000,
thin = 10,
cores = 4,
seed = 1234)
```
## rstanarm: Mixed Model
Let's look at a mixed model for another demonstration
- The average reaction time per day for subjects in a sleep deprivation study
- On day 0 the subjects had their normal amount of sleep
- Subsequently restricted to 3 hours of sleep per night
- The observations represent the average reaction time on a series of tests
We'll have a random intercept and random coefficient for Days
```{r mixed, eval=TRUE}
library(lme4)
sleepstudy_lmer <- lmer(Reaction ~ Days + (1 + Days|Subject),
data = sleepstudy)
summary(sleepstudy_lmer)
```
Again, <span class="pack">rstanarm</span> sticks with the same style
```{r bayesian_mixed, echo=1:3}
sleepstudy_blmer <- stan_lmer(Reaction ~ Days + (1 + Days|Subject),
data = sleepstudy)
summary(sleepstudy_blmer, digits = 3)
save(sleepstudy_lmer, sleepstudy_blmer, sleepstudy_brms, file = 'data/sleepstudy_bayes.RData')
```
```{r load_mixed_models, echo=FALSE, eval=TRUE}
load('data/sleepstudy_bayes.RData')
print(sleepstudy_blmer, digits = 3)
```
In the Bayesian model, the random effects are not BLUPS, but are parameters estimates in the model
In this case, we see a little more shrinkage relative to the standard approach
The following are obtained from the same <span class="func">ranef</span> function used in <span class="pack">lme4</span>
```{r ranef_compare, eval=TRUE, echo=FALSE}
data_frame(lme4 = ranef(sleepstudy_lmer)[[1]][['(Intercept)']],
bayesian = ranef(sleepstudy_blmer)[[1]][['(Intercept)']]) %>%
kable_df(digits=1)
```
## rstanarm: Other Models
- ANOVA
- Beta regression
- Conditional logistic
- GLM including negative binomial models
- Generalized additive models
- Nonlinear and Generalized mixed models
- 'Joint' models for longitudinal and time-to-event (e.g. survival)
- Multivariate
- Ordinal models
# Priors
## Default priors
Depends on the model
For most models:
- intercepts are treated differently
- regression coefficients have mean zero with some specific variance
- scale parameters (e.g. residual variance) will have appropriate priors
Basically, <span class="pack">rstanarm</span> is going to be okay for you to use defaults
If you want to change, there are plenty of resources about priors:
`?prior_summary`
`?priors`
http://mc-stan.org/rstanarm/articles/priors.html
https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations
## Getting priors
```{r getting_priors, eval=TRUE}
prior_summary(attendance_bglm)
```
## Setting priors
One can set priors with the appropriate arguments to the model function
| Argument | Used in | Applies to |
| ------------- | ------------- | ------------- |
| `prior_intercept` | All modeling functions except `stan_polr` and `stan_nlmer`| Model intercept, after centering predictors.|
| `prior` | All modeling functions| Regression coefficients. Does _not_ include coefficients that vary by group in a multilevel model (see `prior_covariance`).|
| `prior_aux` | `stan_glm`\*, `stan_glmer`\*, `stan_gamm4`, `stan_nlmer`| Auxiliary parameter, e.g. error SD (interpretation depends on the GLM).|
| `prior_covariance` | `stan_glmer`\*, `stan_gamm4`, `stan_nlmer`| Covariance matrices in multilevel models with varying slopes and intercepts. See the [`stan_glmer` vignette](http://mc-stan.org/rstanarm/articles/glmer.html) for details on this prior.|
<br>
The `stan_polr`, `stan_betareg`, and `stan_gamm4` functions also provide
additional arguments specific only to those models:
| Argument | Used only in | Applies to |
| ------------- | ------------- | ------------- |
| `prior_smooth` | `stan_gamm4` | Prior for hyper-parameters in GAMs (lower values yield less flexible smooth functions). |
| `prior_counts` | `stan_polr` | Prior counts of an _ordinal_ outcome (when predictors at sample means). |
| `prior_z` | `stan_betareg`| Coefficients in the model for `phi`.|
| `prior_intercept_z` | `stan_betareg`| Intercept in the model for `phi`. |
| `prior_phi` | `stan_betareg`| `phi`, if not modeled as function of predictors. |
### Example
```{r setting_priors}
attendance_bglm <- stan_glm(daysabs ~ math + gender + prog,
data = attendance,
family = poisson,
prior = student_t(df = 7))
```
[^poisson]: If not familiar with Poisson regression, we are modeling the log counts as a function of the covariates. Often the exponentiated coefficients are reported. For example, `exp(coef(attendance_glm)['genderMale'])` is `r round(exp(coef(attendance_glm)['genderMale']),3)`. Subtracting 1 tells us there is a `r round(exp(coef(attendance_glm)['genderMale']) -1, 3)*100`% decrease in the incident rate of days absent as we switch from female to male.