# DATA-3100
# Adefoluke Shemsu
setwd("~/Documents/Education/Penn/Classes/DATA 310/Week 6")
library(tidyverse)
# PROBLEM 1
# We are going to work again with the ACS County Data to investigate the relationship between median household income
# and the percent of children living in poverty in counties. Load in the “ACSCountyData.Rdata” dataframe.
load("~/Documents/Education/Penn/Classes/DATA 310/Week 6/ACSCountyData.Rdata")
# 1. First, to make things more readable, recode the median.income variable to be expressed in thousands of dollars.
acs$median.income <- acs$median.income/1000
# 2. Plot the relationship between median income on the x axis and percent child poverty on the y axis and describe what you see.
plot(acs$median.income,
     acs$percent.child.poverty,
     xlab = "Median Income",
     ylab = "Percent Child Poverty",
     pch = 16)
# Child poverty rates and median income appear to be negatively correlated: counties with higher median incomes
# typically have lower child poverty rates.
# The counties with the highest child poverty rates (greater than 40%) almost all fall at the low end of the
# income distribution. The maximum median income is nearly $140,000, and a large share of the most impoverished
# counties have median incomes below roughly $35,000. That is the bottom quarter of the observed income range,
# which is not necessarily the 25th percentile of counties. This pattern points to a strong negative relationship
# between child poverty and median income.
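# A quick numerical check of the description above. This is only a sketch: income_summary is a
# hypothetical helper (not part of the assignment), the 40% cutoff is the one used in the text,
# and median.income is assumed to already be in thousands of dollars (step 1).
income_summary <- function(income, poverty, cutoff = 40) {
  high <- !is.na(poverty) & poverty > cutoff  # counties above the poverty cutoff
  list(
    income.quartiles = quantile(income, c(0.25, 0.5, 0.75), na.rm = TRUE),
    median.income.high.poverty = median(income[high], na.rm = TRUE),
    correlation = cor(income, poverty, use = "complete.obs")
  )
}
# Only runs when the acs data frame from this script is loaded.
if (exists("acs")) print(income_summary(acs$median.income, acs$percent.child.poverty))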
# 3. Run a bi-variate linear regression on this relationship, discuss what the coefficients (including the intercept)
# mean, and visualize the result on top of the scatterplot you produced above. Just to make things easier for the
# next step, you may want to use code similar to this to plot the result:
child.pov.lm <- lm(percent.child.poverty ~ median.income, data = acs)
summary(child.pov.lm)
ggplot(acs, aes(x = median.income, y = percent.child.poverty)) +
  geom_point() +
  ylim(0, 100) +
  labs(x = "Median Income (Thousands)", y = "Percent Children in Poverty") +
  geom_smooth(method = lm,
              formula = y ~ poly(x, 1),
              se = FALSE)
# The intercept means the predicted value of percent.child.poverty when median.income equals zero is 53.58%.
# No county has a median income of zero, so this is an extrapolation rather than a realistic prediction.
# The coefficient for median.income means that for each additional $1,000 in median income, percent.child.poverty
# decreases by 0.6132 percentage points on average, which supports our initial description in question #2.
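# To make the intercept and slope concrete, fitted values can be computed at a few income levels.
# A sketch only: fitted_line is a hypothetical helper, and the example incomes (in thousands of
# dollars) are arbitrary choices, not values from the assignment.
fitted_line <- function(model, incomes) {
  predict(model, newdata = data.frame(median.income = incomes))
}
# Each fitted value equals intercept + slope * income. Only runs when the model above exists.
if (exists("child.pov.lm")) print(fitted_line(child.pov.lm, c(30, 60, 100)))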
# 4. Looking at this relationship visually, why doesn’t this regression satisfy Gauss-Markov Assumption 2
# (functional form)? Add the square of median.income to your model and determine whether this improves model fit,
# making reference to both the visual change in the regression line and to the R2 of each model.
acs$median.income.2 <- acs$median.income^2
child.pov.lm2 <- lm(percent.child.poverty ~ median.income + median.income.2, data = acs)
summary(child.pov.lm2)
ggplot(acs, aes(x = median.income, y = percent.child.poverty)) +
  geom_point() +
  ylim(0, 100) +
  labs(x = "Median Income (Thousands)", y = "Percent Children in Poverty") +
  geom_smooth(method = lm, formula = y ~ poly(x, 2), se = FALSE)
# Gauss-Markov Assumption 2 requires that we correctly specify the functional form of the relationship between the
# explanatory variables and the dependent variable. The first plot shows that the linear model is not an accurate
# representation of the actual relationship, because the points follow a curve rather than a straight line.
# Once we add a squared term for median.income, the regression line more closely follows the distribution of the points.
# The R2 of a model tells us how much of the variability in the outcome the model explains, and a higher R2
# generally indicates better fit. The R2 of the original model was 0.5547, and the R2 of the
# model with the squared term is 0.7037. This is an improvement from ~55% of the variability in
# percent.child.poverty explained by the first model to ~70% of the variability in percent.child.poverty
# explained by the second model.
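# The R2 comparison above can be pulled directly from the fitted models. A sketch only:
# compare_fit is a hypothetical helper. Adjusted R2 is reported as well because plain R2
# cannot decrease when a term is added to a model.
compare_fit <- function(...) {
  models <- list(...)
  data.frame(
    r.squared = sapply(models, function(m) summary(m)$r.squared),
    adj.r.squared = sapply(models, function(m) summary(m)$adj.r.squared)
  )
}
# Only runs when both models from this script exist.
if (exists("child.pov.lm") && exists("child.pov.lm2")) {
  print(compare_fit(child.pov.lm, child.pov.lm2))
  print(anova(child.pov.lm, child.pov.lm2))  # F-test for the added squared term
}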
# 5. In this new regression with a second order polynomial term, what is the effect of an additional 1000 dollars
# in median income when median income is at 30k? What is the effect of an additional 1000 dollars in median income
# when median income is at 100k? Does this make theoretical sense?
coef(child.pov.lm2)
coef(child.pov.lm2)[2] + 2*coef(child.pov.lm2)[3]*30 # Testing at 30k
coef(child.pov.lm2)[2] + 2*coef(child.pov.lm2)[3]*100 # Testing at 100k
# When median.income is at 30k, an additional 1000 dollars is associated with a decrease in percent.child.poverty of
# 1.23 percentage points. When median.income is at 100k, an additional 1000 dollars is associated with an increase of
# 0.33 percentage points.
# This shows diminishing returns to median.income: the extra 1000 dollars has a much larger association with lower
# child poverty when median.income is around $30k than when it is closer to $100k, which makes theoretical sense.
# The positive sign at 100k is harder to justify theoretically. It likely reflects the quadratic form, which must
# turn upward past its minimum, rather than higher income actually raising child poverty.
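# For a model y = b0 + b1*x + b2*x^2, the effect of one more unit of x is the derivative b1 + 2*b2*x,
# which is what the two coef() lines above compute. A sketch with hypothetical helper names; the
# turning point -b1/(2*b2) is the income at which the fitted curve stops falling and starts rising.
marginal_effect <- function(b1, b2, x) b1 + 2 * b2 * x
turning_point <- function(b1, b2) -b1 / (2 * b2)
# Only runs when the quadratic model from this script exists.
if (exists("child.pov.lm2")) {
  b <- unname(coef(child.pov.lm2))
  print(marginal_effect(b[2], b[3], c(30, 100)))  # slope at 30k and at 100k
  print(turning_point(b[2], b[3]))                # income (in thousands) where the slope is 0
}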
# 6. A possible confounding variable to this relationship is the unemployment rate, which may affect both the
# median income of a county and the percent of children living in poverty. Use the cor() function to investigate
# the relationships between median income, unemployment, and child poverty. Based on the pattern of correlations,
# what is likely to happen to the coefficient on median.income if you add unemployment rate to the first regression
# model (the one without the polynomial terms)?
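# Note: cor() returns NA when any value is missing unless the use argument is set (for example,
# use = "complete.obs"). cor.test() drops incomplete pairs automatically, so it is used here instead.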
cor.test(acs$median.income, acs$percent.child.poverty, method = "pearson")
cor.test(acs$median.income, acs$unemployment.rate, method = "pearson")
cor.test(acs$percent.child.poverty, acs$unemployment.rate, method = "pearson")
# Based on these tests, median.income and percent.child.poverty are significantly correlated: the confidence
# interval does not contain zero and the p-value (< 2.2e-16) indicates a statistically significant relationship.
# Significance only tells us the correlation is distinguishable from zero. The strength and direction of the
# relationship come from the estimated correlation coefficient (cor) in the output.
# median.income and unemployment.rate are significantly correlated for the same reasons (confidence interval,
# p-value), and so are percent.child.poverty and unemployment.rate.
# Because unemployment.rate co-varies with both the other explanatory variable (median.income) and the outcome
# variable (percent.child.poverty), the first regression attributes part of the unemployment effect to median.income.
# If unemployment is negatively related to income and positively related to child poverty, as we would expect,
# adding unemployment.rate to the first regression model should make the coefficient for median.income smaller
# in magnitude (closer to zero).
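# cor() can produce the full correlation matrix once missing values are handled. A sketch only:
# cor_matrix is a hypothetical helper, and use = "pairwise.complete.obs" computes each correlation
# from the rows where both variables are observed.
cor_matrix <- function(data, vars) {
  cor(data[, vars], use = "pairwise.complete.obs")
}
# Only runs when the acs data frame from this script is loaded.
if (exists("acs")) {
  print(cor_matrix(acs, c("median.income", "unemployment.rate", "percent.child.poverty")))
}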
# 7. Run this regression with unemployment rate and median income (no polynomial terms), and determine the degree to
# which the coefficient on median.income changes. Interpret the other coefficients in the model as well, being sure
# to adjust your language to the fact that there are now multiple independent variables.
child.pov.lm3 <- lm(percent.child.poverty ~ median.income + unemployment.rate, data = acs)
summary(child.pov.lm3)
# The intercept for this regression tells us that the predicted value of percent.child.poverty is 36.66%
# when median.income and unemployment.rate are both 0.
# The coefficients for this regression tell us that when median.income increases by $1000, percent.child.poverty
# decreases by 0.44 percentage points on average, holding unemployment.rate constant. They also tell us that when
# unemployment.rate increases by one percentage point, percent.child.poverty increases by 1.37 percentage points
# on average, holding median.income constant.
# The coefficient for median.income is -0.44 here, compared with -0.61 in the bivariate model. Including
# unemployment.rate in the model, and thus controlling for it when evaluating the effect of median.income,
# reduced the estimated effect of median.income by about 0.17, or roughly 28%. This confirms the expectation
# from Question 6, where unemployment.rate co-varies with both median.income and percent.child.poverty.
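# The change in the median.income coefficient can be computed rather than read off two summaries.
# A sketch only: coef_change is a hypothetical helper that reports the coefficient in each model
# and the percent change in its magnitude (negative means it moved toward zero).
coef_change <- function(base, extended, term) {
  b0 <- unname(coef(base)[term])
  b1 <- unname(coef(extended)[term])
  c(base = b0, extended = b1, pct.change = 100 * (abs(b1) - abs(b0)) / abs(b0))
}
# Only runs when both models from this script exist.
if (exists("child.pov.lm") && exists("child.pov.lm3")) {
  print(coef_change(child.pov.lm, child.pov.lm3, "median.income"))
}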
# 8. Another possible confounding variable is the census region people are living in. For example, living in the
# south could be associated with both lower average incomes and more child poverty. Create an indicator
# variable for the 4 census regions (or change the variable into a factor variable) and then re-estimate the
# regression with median income and unemployment to take into account which region each county is in. Interpret the
# coefficients from this regression.
acs$census.region <- as.factor(acs$census.region) # Recoding to make usable
child.pov.lm4 <- lm(percent.child.poverty ~ median.income + unemployment.rate + census.region,
                    data = acs)
summary(child.pov.lm4)
# The intercept for this regression tells us the predicted value of percent.child.poverty is 33.28% in our
# census region reference group (the Midwest) when median.income and unemployment.rate are 0.
# The coefficients for this regression tell us that when median.income increases by $1000, percent.child.poverty
# decreases by 0.40 percentage points, holding all other variables constant, and when unemployment.rate increases
# by one percentage point, percent.child.poverty increases by 1.19 percentage points, holding all other variables constant.
# The coefficients also tell us that percent.child.poverty is 1.83 points higher in the Northeast, 3.55 points
# higher in the South, and 0.62 points higher in the West than in the Midwest, holding median.income and
# unemployment.rate constant.
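# Region coefficients are always relative to the factor's reference level, which R takes to be the
# first level. A sketch for checking or changing it: set_reference is a hypothetical helper, and
# "south" is only a placeholder assumption about how the regions are labeled in acs, so compare it
# against the output of levels() before relying on it.
set_reference <- function(f, ref) {
  f <- as.factor(f)
  if (ref %in% levels(f)) relevel(f, ref = ref) else f  # leave unchanged if the level is absent
}
# Only runs when the acs data frame from this script is loaded.
if (exists("acs")) {
  print(levels(acs$census.region))                          # first level = reference group
  print(levels(set_reference(acs$census.region, "south")))  # reference moved, if "south" exists
}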
# 9. It’s possible that the effect of median income is different conditional on whether a county is urban or not.
# Create an indicator variable for whether a county is urban (population density greater or equal to 1000) or not.
# Interact this variable with median income in the regression with unemployment rate and census region indicators.
# Interpret the coefficients on median income, the urban indicator, and the interaction term.
acs$urban <- NA # Creating the indicator
min(acs$population.density) # Using pop density to determine what parameters might give us the cleanest outcome
max(acs$population.density)
acs$urban[acs$population.density >= 1000] <- 1 # Building parameters to make urban classification
acs$urban[acs$population.density < 1000] <- 0
table(acs$urban)
child.pov.lm5 <- lm(percent.child.poverty ~ median.income*urban + unemployment.rate +
                      census.region, data = acs)
summary(child.pov.lm5)
coef(child.pov.lm5)["median.income"] + coef(child.pov.lm5)["median.income:urban"]*0
coef(child.pov.lm5)["median.income"] + coef(child.pov.lm5)["median.income:urban"]*1
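# This intercept tells us the predicted value of percent.child.poverty is 35.86% for a non-urban county in our
# census region reference group (Midwest) when median.income and unemployment.rate are 0.
# The median.income coefficient tells us that when median.income increases by $1000 in a non-urban county,
# percent.child.poverty decreases by 0.44 percentage points, holding the other variables constant.
# The urban coefficient tells us that percent.child.poverty is 6.34 percentage points lower in an urban county
# than in a non-urban county when median.income is 0 and the other variables are held constant. Because no county
# has a median income of 0, this coefficient is not meaningful on its own.
# The interaction term tells us that the slope on median.income differs by 0.16 between urban and non-urban
# counties. The slope for urban counties is the sum of the median.income and interaction coefficients,
# which is the second coef() line above.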
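# With an interaction, the urban coefficient is the urban vs. non-urban gap only at median.income = 0.
# At any other income the gap is urban coefficient + interaction coefficient * income. A sketch only:
# urban_gap is a hypothetical helper, and the example incomes (in thousands) are arbitrary choices.
urban_gap <- function(b_urban, b_interaction, income) b_urban + b_interaction * income
# Only runs when the interaction model from this script exists.
if (exists("child.pov.lm5")) {
  b <- coef(child.pov.lm5)
  print(urban_gap(b[["urban"]], b[["median.income:urban"]], c(0, 30, 60, 100)))
}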