---
title: "Financial Inclusion in Africa"
author: "Molo Muli"
date: "`r Sys.Date()`"
output:
  html_document:
    toc: true
    toc_float:
      collapsed: false
      smooth_scroll: false
    toc_depth: 4
    number_sections: false
    highlight: textmate
    theme: journal
    includes:
      after_body: footer.html
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r R-Scripts, message=FALSE, warning=FALSE, include=FALSE, paged.print=FALSE}
# Libraries
library(easypackages)
libraries("dlookr","knitr","kableExtra")
## Script Files
source("Scripts/FIIA.DataPreparation.R", local = knitr::knit_global())
sys.source("Scripts/Analysis.R", envir = knitr::knit_global())
sys.source("Scripts/MachineLearning.R", envir = knitr::knit_global())
```
# Research question and the data science problem
* The research question is anchored on the main objective of this project, which is to create a machine learning model that can predict which individuals are most likely to have or use a bank account.
* The data science problem will be a classification problem.
* My personal objective in this project is learning how to handle an imbalanced dataset in a classification problem.
# Data Health and Preparation
The training data contains 23,524 observations and 13 variables. Of these, 10 are nominal (including the response column) and 3 are numeric. The dependent (response) variable is **bank_account**, whose observations are yes or no. Part of the data wrangling will entail encoding the response variable as 1 for yes and 0 for no. A quick glimpse at the response column indicates a high class imbalance, with 86 percent of all respondents not having a bank account.
No missing values are present in the dataset.
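One quick way to verify the variable types and the absence of missing values is dlookr's `diagnose()` (the package is already loaded above); a minimal sketch:

```{r diagnose-sketch, eval=FALSE}
# Column-level diagnosis: variable type, missing count/percent, unique count.
diagnose(training.set)
```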
Half of all respondents are 35 years of age or younger, and 75% are 49 or younger. Household-wise, the largest household has 21 members, against an average of 4. On both variables, measures of spread indicated the presence of outliers. Using Tukey's rule for outlier detection, 2.77% of all observations were flagged as outliers; given this marginal proportion, they were dropped.
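For reference, a minimal base-R sketch of Tukey's rule applied to the age column (illustrative only; the real cleaning happens in the sourced preparation script):

```{r tukey-outlier-sketch, eval=FALSE}
# Tukey's rule: flag values beyond 1.5 * IQR below Q1 or above Q3.
q <- quantile(training.set$Age, probs = c(0.25, 0.75), na.rm = TRUE)
fence <- 1.5 * diff(q)
outlier <- training.set$Age < q[1] - fence | training.set$Age > q[2] + fence
mean(outlier) * 100  # percentage of observations flagged
```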
# Data Analysis
## Descriptive {.tabset}
On both variables, the distribution is light-tailed (platykurtic). Outliers, which were marginal relative to the total sample, were removed during the data-wrangling process.
Age has a high standard deviation, indicating that its values are spread out around the mean. As a measure of dispersion, the standard deviation (σ) indicates how accurately the mean (μ) represents the sample data.
```{r Numeric Descriptives, echo=FALSE, message=FALSE, warning=FALSE, paged.print=FALSE}
kbl(numeric.descriptives.training.set, booktabs = T) %>%
  kable_classic(full_width = F, html_font = "Cambria", position = "float_right") %>%
  footnote(number = "**Table 1: Descriptive statistics of age and household size**")
```
The standard error of the mean (SEM), on the other hand, measures how far the sample mean is likely to be from the true population mean. SEM is inversely proportional to the square root of the sample size: as the sample size increases, the sample mean becomes more representative of the population and the SEM shrinks towards zero.
From *table 1*, we can see that the SEM is close to zero, indicating that the sample mean is likely close to the population mean.
A primary assumption behind all of the statistics above is that the observations in the sample are statistically independent.
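Concretely, the SEM follows from σ/√n; a minimal sketch for the age column (table 1 already reports the computed value):

```{r sem-sketch, eval=FALSE}
# SEM = sd / sqrt(n); it shrinks towards zero as the sample grows.
sem <- function(x) sd(x, na.rm = TRUE) / sqrt(sum(!is.na(x)))
sem(training.set$Age)
```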
### Normal Q-Q Plot for Age
```{r Q-Q Plot for Age, echo=FALSE, message=FALSE, warning=FALSE, paged.print=FALSE, fig.show="hold"}
qqnorm(training.set$Age, main = "Normal Q-Q Plot for Age")
qqline(training.set$Age, col = "red")  # reference line for a perfectly normal sample
```
### Normal Q-Q Plot for Household Size
```{r Q-Q Plot for Household Size, echo=FALSE, message=FALSE, warning=FALSE, paged.print=FALSE, fig.show="hold"}
qqnorm(training.set$`Household Size`, main = "Normal Q-Q Plot for Household Size")
qqline(training.set$`Household Size`, col = "red")  # reference line for a perfectly normal sample
```
## Exploratory
<div class = "row">
<hr>
<div class = "col-md-6">
```{r Mean Age of who own a bank account, echo=FALSE, message=FALSE, warning=FALSE, paged.print=FALSE, fig.height=6}
MeanAgeBankAccountPlot
```
</div>
<div class = "col-md-6">
```{r Bank account holders per gender, echo=FALSE, message=FALSE, warning=FALSE, paged.print=FALSE, fig.height=6}
GenderBA
```
</div>
* Fig I: The mean age of individuals who have a bank account is 38, versus 39 for those who don't.
* Fig II: Almost 90% of all female respondents do not have a bank account.
<hr>
</div>
<!--Relationship of the individual to the Head of household-->
<!--Third chart takes the full-->
<div class = "row">
<div class = "col-md-12">
```{r Relationship of the respondent to the HH Head-BA, echo=FALSE, message=FALSE, warning=FALSE, paged.print=FALSE, fig.width=12, fig.height=6}
GenderHHBA
```
</div>
* Among bank-account holders who head their household, 72% are male.
* Where the respondent is a spouse of the household head, 9 in 10 bank-account holders are female.
<hr>
</div>
<!--Fourth chart takes the full-->
<div class = "row">
<div class = "col-md-12">
```{r Relationship of the respondent to the HH Head-NoBA, echo=FALSE, message=FALSE, warning=FALSE, paged.print=FALSE, fig.width=12, fig.height=6}
GenderHHNoBA
```
</div>
* Among household heads without a bank account, 57% are male, compared to 43% female.
<hr>
</div>
<!--Education levels of the individuals with respect to the of the individual to the Head of household-->
<!--Fifth Chart-->
<div class = "row">
<div class = "col-md-12">
```{r Education Levels -BA, echo=FALSE, message=FALSE, warning=FALSE, paged.print=FALSE, fig.width=12, fig.height=6}
EducationGenderBA
```
</div>
* Of bank-account holders with no formal education, 56% are female, compared to 43.8% male.
<hr>
</div>
<!--Sixth Chart-->
<div class = "row">
<div class = "col-md-12">
```{r Education Levels-NoBA, echo=FALSE, message=FALSE, warning=FALSE, paged.print=FALSE, fig.width=12, fig.height=6}
EducationGenderNoBA
```
</div>
* Of respondents without a bank account who have no formal education, 71% are female, in contrast to 29% male.
* Finally, regardless of whether a bank account is present or absent, female respondents make up the largest share of those with no formal education.
<hr>
</div>
# Machine Learning
## Random Forest Classifier
```{r Random Forest Classifier, echo=TRUE, message=FALSE, warning=FALSE, paged.print=FALSE}
# Modelling libraries (assumed to be loaded by the sourced scripts; made explicit here)
library(caret)          # createDataPartition(), confusionMatrix()
library(randomForest)   # randomForest(), tuneRF()
# Split the data into training and validation sets
set.seed(3456)
splitting.index <- createDataPartition(training$BankAccount, p=.68,
list = F,
times = 1)
# training and validation set
training.data <- training[splitting.index,]
validation.data <- training[-splitting.index,]
# Fit the classifier
model <- randomForest(formula = BankAccount ~ ., data = training.data)
# Print the model summary (OOB error estimate, confusion matrix, mtry)
print(model)
```
From the model:
* The number of variables randomly selected at each split (`mtry`) is 3.
* The out-of-bag (OOB) error, the validation technique used in this model, is 11%, so the model's accuracy is around 89%.
* Finally, below is a plot showing variable importance. Job type has the most predictive power, closely followed by education level and age.
```{r Variable importance, echo=FALSE, message=FALSE, warning=FALSE, paged.print=FALSE, fig.width=10, fig.height=5}
varImportancePlot
```
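For reference, a similar chart can be produced directly from the fitted forest with randomForest's own helpers (the plot above comes from the sourced analysis script); a minimal sketch:

```{r varimp-sketch, eval=FALSE}
# Mean decrease in Gini impurity per predictor; larger means more predictive power.
importance(model)
varImpPlot(model, main = "Variable importance")
```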
## Partial Dependence (PDP) Plots {.tabset}
Partial dependence plots (PDPs) illustrate the relationship between an input variable and the response variable: they show how the predictions partially depend on values of that input. They are useful for inferring relationships between inputs and the response in complex non-parametric models.
The relationship can be causal, (non)linear, monotonic, curvilinear, a step function, and so on. Below are the PD plots for the three most important inputs (job type, education level, and age) against the response variable (presence of a bank account).
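The plots in the tabs below are built in the sourced machine-learning script; as a hedged sketch, a single partial dependence curve could also be drawn with randomForest's `partialPlot()` (the column name `JobType` is an assumption for illustration; substitute the actual predictor name):

```{r pdp-sketch, eval=FALSE}
# Partial dependence of the prediction on one input, marginalizing over the rest.
# "JobType" is an assumed column name, not necessarily the one in this dataset.
partialPlot(model, pred.data = as.data.frame(training.data),
            x.var = "JobType", main = "Partial dependence on job type")
```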
### PD plot of job title
```{r PDP Job Type, echo=FALSE, message=FALSE, warning=FALSE, paged.print=FALSE, fig.width=10, fig.height=5}
jobTypePDP
```
### PD plot of education level
```{r PDP Education Level, echo=FALSE, message=FALSE, warning=FALSE, paged.print=FALSE, fig.width=10, fig.height=5}
EducationLevelPDP
```
### PD plot of age
```{r PDP Age, echo=FALSE, message=FALSE, warning=FALSE, paged.print=FALSE, fig.width=10, fig.height=5}
AgePDP
```
Among the nominal inputs, the categories **no formal employment** (job type) and **primary education** (education level) have the strongest effect on the response variable. Age, on the other hand, shows a monotonic relationship: as age decreases, so does the likelihood of having a bank account.
## Tuning the classifier
Hyperparameter tuning is usually treated as an optimization problem. Even though random forest works well out of the box, three important parameters affect the accuracy of the model: `ntree`, `mtry`, and `sampsize`. `ntree` is the number of trees in the forest, 500 by default; `mtry` denotes the number of predictor variables randomly selected at each split; and `sampsize` defaults to 63.2% of the training examples. The reason for this particular proportion is that 63.2% (that is, 1 − 1/e) is the expected fraction of unique observations in a bootstrap sample. Other hyperparameters that affect the performance of a model include (see the sketch after this list):
* `nodesize` - when the node size is small, deeper and more complex trees can be grown.
* `maxnodes` - together with `ntree`, it limits tree growth to avoid overfitting.
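As a sketch of where these knobs live, they are all arguments to `randomForest()`; the values below are illustrative, not the tuned ones:

```{r rf-hyperparameters-sketch, eval=FALSE}
# Illustrative hyperparameter settings (not tuned values).
model_tuned <- randomForest(
  BankAccount ~ ., data = training.data,
  ntree = 500,                                      # trees in the forest
  mtry = 3,                                         # predictors tried at each split
  sampsize = ceiling(0.632 * nrow(training.data)),  # bootstrap sample size
  nodesize = 1,                                     # minimum terminal-node size
  maxnodes = NULL                                   # NULL = grow trees fully
)
```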
```{r tunemtry, echo=TRUE, message=FALSE, warning=FALSE, paged.print=FALSE}
# Tuning mtry using out of bag (OOB) error
tune_mtry <- tuneRF(training.data[,1:11],
training.data$BankAccount,
ntreeTry = 500,
stepFactor = 2,
plot = F,
trace = T,
doBest = T,
improve = 0.01)
tune_mtry
```
* `mtry` of 2 gives the lowest OOB error, though the difference from 3 is marginal.
* Pruning is already taken care of in a random forest: the model builds a large collection of de-correlated trees, and averaging over them keeps overfitting in check.
After hyperparameter tuning, the models with `mtry` of 2 and 3 both give an out-of-bag error of about 11%.
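One quick, hedged check that 500 trees is enough is to plot the OOB error against the number of trees; the curve flattening out suggests that adding trees would not help:

```{r oob-curve-sketch, eval=FALSE}
# plot.randomForest draws the OOB and per-class error rates per number of trees.
plot(model, main = "OOB error vs. number of trees")
legend("topright", legend = colnames(model$err.rate), lty = 1:3, col = 1:3)
```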
## Evaluating the Random Forest Classifier
### Confusion Matrix
```{r ConfusionMatrix, echo=TRUE, message=FALSE, warning=FALSE, paged.print=FALSE}
# Predict using validation set
predict_BA <- predict(object = model,
newdata = validation.data,
type = "class")
# Create a confusion matrix
cM <- confusionMatrix(data = predict_BA,
reference = validation.data$BankAccount)
cM
```
From the matrix, the model predicted 6,120 true positives and 361 true negatives.
* The type I error, where the model predicted that individuals have a bank account when they actually don't, was 677.
* The type II error, where the model predicted that individuals don't have a bank account when they actually do, was 361.
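Because the dataset is highly imbalanced, accuracy alone can flatter the model; the class-wise metrics that `confusionMatrix()` already computes, such as sensitivity, specificity, and balanced accuracy, are more informative. A minimal sketch of extracting them from `cM`:

```{r class-metrics-sketch, eval=FALSE}
# Per-class metrics and overall accuracy from the caret confusion matrix object.
cM$byClass[c("Sensitivity", "Specificity", "Balanced Accuracy")]
cM$overall["Accuracy"]
```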