---
title: "Classification in R"
author: "Stef Lhermitte"
date: "2023-03-14"
output:
  pdf_document: default
  html_document: default
editor_options:
  chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Classification in R
## Introduction
In this practical session we will learn how to:
- download (remote sensing) data from Google Earth Engine
- visualize and preliminarily analyse the data
- apply different machine learning techniques
In order to run this notebook you will need a Google Earth Engine (GEE) account. If you do not have one yet, create one at the [GEE webpage](https://signup.earthengine.google.com/). GEE combines a multi-petabyte catalog of satellite imagery and geospatial data sets with planetary-scale analysis capabilities and makes it available for scientists, researchers, and developers to detect changes, map trends, and quantify differences on the Earth's surface. The [public data archive](https://developers.google.com/earth-engine/datasets) includes more than thirty years of historical imagery and scientific data sets, updated and expanded daily. It contains over twenty petabytes of geospatial data instantly available for analysis; if you are looking for a specific data set, I recommend checking the [data catalog](https://developers.google.com/earth-engine/datasets).
In this notebook, we will limit the use of GEE to downloading different data sources and subsequently work with the data on your local computer. GEE offers many more opportunities and functions to process, analyze and visualize data in the cloud, but covering these is beyond the scope of this session.
Once the data are downloaded, we will apply different types of classifiers/regressions to the data so that you understand the different concepts of machine-learning-based classification in practice.
## Before you start
Before starting this notebook, make sure that you install the necessary packages. In the next chunk, we load the necessary packages. If you encounter an error message when running the `library()` function in R, it may be because the required package is not installed on your computer. To install the package, you can use the `install.packages()` function. For example, if you receive an error when running `library(raster)`, you can install the package by typing `install.packages("raster")` in the console. After installation, you can load the package by running `library(raster)` again. Ensuring that all necessary packages are installed before running your R code avoids errors and keeps your code running smoothly.
Additionally, you can also install packages using the R GUI by selecting the `Packages` menu and clicking on `Install`. From there, you can search for and install the desired package.
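If you prefer to automate this, the following minimal sketch installs whatever is still missing; the package list simply mirrors the `library()` calls in the next chunk.
```{r install-missing, eval=FALSE}
# Install any required packages that are not yet available on this machine
pkgs <- c("raster", "terra", "dplyr", "reshape2", "data.table", "leafsync",
          "glcm", "RStoolbox", "mapview", "tmap", "ggplot2", "ggridges",
          "plotly", "caret")
missing <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing) > 0) install.packages(missing)
```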
```{r load-packages, message=FALSE, warning=FALSE}
# Packages for spatial data processing
library(raster)
library(terra)
library(dplyr)
library(reshape2)
library(data.table)
library(leafsync)
library(glcm)
library(RStoolbox)
# Packages for visualization
library(mapview)
library(tmap)
library(ggplot2)
library(ggridges)
library(plotly)
# Machine learning packages
library(caret)
# set the temporary folder for raster package operations
rasterOptions(tmpdir = "./cache/temp")
```
Finally, before starting our notebook we want to run it from the correct location on our computer by changing the working directory with `setwd`. The notebook will work optimally if you run it from the location of the unzipped zip file of this practical session.
```{r}
# Change the working directory with setwd
# setwd(location/of/your/unzipped/labZipfile/)
setwd('C:/Users/u0132684/Documents/Lab/Lab')
# Check the working directory by printing it
getwd()
```
## Step 1: Pre-process remote sensing data on GEE
We will do Step 1 and Step 2 in the Google Earth Engine (GEE) code editor (<https://code.earthengine.google.com/>). First, we explore GEE and then export an image to your Google Drive. We can then download the image and import it into R to use it from Step 3 onwards in this notebook.
Link to GEE script: <https://code.earthengine.google.com/2b95fd1bcbe1fe53b0844661c7921bc3>
## Step 2: Download data from GEE
→ GEE code editor
## Step 3: Read in downloaded data
Now we can import the .tif file.
```{r load-raster}
s2_rl = raster::brick('s2_r.tif')
s2_rl
```
## Step 4: Visualize the data
#### Interactive visualization
Subsequently, you can visualize the data on an interactive map using the [mapview](https://r-spatial.github.io/mapview/) package
```{r plot-interactive}
# use viewRGB to view RGB images interactively
mapview::viewRGB(s2_rl, b=2, g=3, r=4, maxpixels=1242110)
# use mapview to view single-band images interactively
mapview::mapview(s2_rl[[1]])
# Combine maps by using the + symbol
mapview::viewRGB(s2_rl, b=2, g=3, r=4, maxpixels=1242110) +
mapview(s2_rl[[8]], maxpixels=1242110)
```
#### Static visualization
Or a static map using the `tmap` [package](https://r.geocompx.org/adv-map.html#static-maps):
```{r plot-static-tm}
# Create a tmap object (map_1) with a raster layer (s2_rl) colored using a custom RGB color scheme based on the layer's values
map_1 = tmap::tm_shape((s2_rl/5500*255)) +
# Set the raster layer as the main layer and scale its colors according to the values in the layer divided by 5500 and multiplied by 255
tmap::tm_rgb(r = 4, g = 3, b = 2) +
# Use band 4 as red, band 3 as green and band 2 as blue
tmap::tm_scale_bar(position = c("left", "bottom"))
# Add a scale bar to the map in the bottom-left corner
# Create another tmap object (map_2) with a raster layer (s2_rl[[4]]) overlaid
map_2 = tmap::tm_shape(s2_rl[[4]]) +
# Set the raster layer as the main layer
tmap::tm_raster(alpha = 0.7, # Set the opacity of the raster layer to 0.7
palette = "YlGn", # Use a color palette called "YlGn"
legend.show = TRUE) # Display a legend for the raster layer
# Combine the two tmap objects (map_1 and map_2) into a single layout
tmap_arrange(map_1, map_2)
```
Or alternatively use the functions of the raster package such as [`plotRGB`](https://www.rdocumentation.org/packages/raster/versions/3.6-14/topics/plotRGB)
```{r plot-static-raster}
raster::plotRGB(s2_rl,2 , 3, 4, stretch="lin", scale=5000)
```
Now you know how to download, read in and visualize data. In the next part of the notebook, we will see how to further explore this data and finally classify this data into information.
## Step 5: Explore the data
When using the `raster::click` function we can explore some of the pixel values. However, you have to make sure the chunk output is plotted in the console and not inline in the notebook. This implies clicking on the preferences wheel above and setting the Chunk Output in Console:
*(screenshot of the RStudio preferences menu with "Chunk Output in Console" selected)*
Once this is done you can visualize the image and use an interactive clicking function to identify the pixels that we want to extract spectral signatures for.
```{r}
# Plot the data
dev.new() # In RStudio, the raster::click()/terra::click() functions may result in an offset between the clicked point and the recorded one.
raster::plotRGB(s2_rl, 2, 3, 4, stretch="lin", scale=5000)
# Change plotting parameters to better see the points and numbers generated from clicking
par(col="red", cex=3)
# Use the 'click' function
ct = raster::click(s2_rl, id=T, xy=T, cell=T, type="p", pch=16, col="magenta", col.lab="red")
```
Once you have clicked your five points, press the `ESC` key to save your clicked points and close the function before moving on to the next step. If you make a mistake in this step, run the chunk again to start over.
Here we see the output of our clicking:
```{r show-spectra-data}
head(ct)
```
and subsequently we can subset it to get only the pixel values. From here on it might be better to switch back to `Chunk Output Inline`.
```{r}
# Select the spectral data (columns 3 to 15) from the data frame called ct and assign it to a new data frame called dat
dat = ct[,3:15]
# Create a vector of wavelengths that correspond to the columns of dat and include the id column
wavelengths=c("id",443,490,560,665,705,740,783,835,865,945,1610,2190)
# Rename the columns of dat with the values of the wavelengths vector
colnames(dat)=wavelengths
# Print the structure of dat, which includes information on the data types and column names
str(dat)
# Reshape the data frame from wide to long format using melt()
# Use "id" column as the identifier variable and create a new column called "Reflectance" for the values
Pixel.melt <- reshape2::melt(dat, id.vars = "id", value.name ="Reflectance")
# Convert the "variable" column (which contains the wavelengths) to numeric and assign it to a new column called "wl"
Pixel.melt$wl <- as.numeric(as.character(Pixel.melt$variable))
# Convert the "id" column to a factor and assign it to a new column called "idf"
Pixel.melt$idf <- as.factor(Pixel.melt$id)
# Print the structure of Pixel.melt, which includes information on the data types and column names
str(Pixel.melt)
# Create a ggplot object with no data or mappings specified
ggplot() +
# Add a line layer using the data from Pixel.melt, mapping wl to x, Reflectance to y, and idf to color
# Set the line width to 1
geom_line(data = Pixel.melt,
mapping = aes(x=wl, y=Reflectance, color=idf),
lwd=1)+
# Add a point layer using the data from Pixel.melt, mapping wl to x, Reflectance to y, and idf to color
# Set the point shape to a filled circle (pch=20) and the size to 5
geom_point(data = Pixel.melt,
mapping = aes(x=wl, y=Reflectance, color=idf),
pch=20,
size=5)+
# Add a legend for the color mapping with the label "Pixel id"
labs(color = "Pixel id")+
# Add a plot title "Land cover spectral signatures"
ggtitle("Land cover spectral signatures")+
# Center the plot title and set the font size to 20
theme(plot.title = element_text(hjust = 0.5, size=20))+
# Add an x-axis label "Wavelength"
xlab("Wavelength")
```
Alternatively, we can translate the entire Sentinel-2 image, which is a `raster` data set, into a `data.table` to plot the different band histograms.
```{r}
# Convert the Sentinel-2 raster image s2_rl to a data table with columns: x, y, and the pixel values (one column for each band)
dat_all <- data.table::as.data.table(rasterToPoints(s2_rl))
# Add a column with the pixel id
dat_all <- dat_all %>% mutate(id = row_number())
# Select the spectral data (columns 3 to 14) and the id (column 15) from dat_all
dat_all = dat_all[,c(3:15)]
# Create a vector of wavelengths that correspond to the columns of dat_all and include the id column
wavelengths=c(443,490,560,665,705,740,783,835,865,945,1610,2190,"id")
# Rename the columns of dat_all with the values of the wavelengths vector
colnames(dat_all)=wavelengths
# Remove any rows with missing values from dat_all
dat_all <- na.omit(dat_all)
```
Now we are going to make a ridgeline plot, which visualizes the density of the different band distributions along a common axis using a series of vertically stacked curves. This allows easy comparison of the distributions as well as of their shapes. Compared to histograms, ridgeline plots provide a more detailed view of each variable's distribution by showing the density of the values in addition to their frequency. This can be especially useful when comparing multiple distributions, as ridgeline plots can help identify differences in distributional shape that might not be apparent in histograms. Additionally, the stacked nature of ridgeline plots makes it easy to compare the distribution of a variable across different groups, such as the different bands here.
```{r}
# Convert the wide format of 'dat_all' into a long format using the 'melt()' function
# 'id.vars = "id"' specifies that the 'id' column should be used as the identifier variable,
# 'value.name = "Reflectance"' specifies that the melted column should be named 'Reflectance',
dat_mlt = reshape2::melt(dat_all, id.vars = "id", value.name ="Reflectance")
# Create a ridgeline plot of the 'Reflectance' values for each wavelength in the 'dat_mlt' data frame
# using the 'geom_density_ridges()' function from the 'ggridges' package.
# The 'Reflectance' values are plotted on the x-axis, and the wavelength groups are plotted on the y-axis.
# The resulting plot is displayed in the output.
ggplot(dat_mlt, aes(x=Reflectance, y=variable)) + geom_density_ridges()
```
## Step 6: Prepare data for classification
### Download training data
→ GEE code editor
Now save it to our hard disk and re-read it.
```{r}
# Read raster from computer
lc_r = raster::raster('lc_r.tif')
```
### Sample training data
So now, we have both our training data `X` (i.e., spectral values per pixel) and `Y` (i.e., class per pixel) downloaded to our computer and from that we need to create a training and testing set for our supervised classifier. To do so, we are first going to sample our `Y` into a random set of samples and subsequently get the corresponding `X` for each pixel.
```{r}
set.seed(321) # set the seed for reproducibility
# Convert the raster object 'lc_r' into a spatial points data frame using the 'rasterToPoints()' function,
# and save the resulting data table to the variable 'poly_dt' as a data table using 'as.data.table()'.
poly_dt <- data.table::as.data.table(rasterToPoints(lc_r))
# Select a random sample of 10,000 points from the 'poly_dt' data table using the 'sample()' function,
# and save the resulting data table back to the variable 'poly_dt'.
poly_dt = poly_dt[sample(1:nrow(poly_dt), 10000), ]
# Rename the column 'X2021_Map' (the WorldCover band name) to 'id_cls' in the 'poly_dt' data table using the 'setnames()' function.
data.table::setnames(poly_dt, old = "X2021_Map", new = "id_cls")
# Create a new spatial points data frame 'points' using the 'SpatialPointsDataFrame()' function,
# with the x and y coordinates taken from the 'poly_dt' data table and the attribute data and projection taken from the same table.
points <- sp::SpatialPointsDataFrame(coords = poly_dt[, .(x, y)],
data = poly_dt,
proj4string = lc_r@crs)
# Print the first few rows of the 'points' data frame to the console using the 'head()' function.
# The output actually shows for every point the x,y coordinates and corresponding class.
head(points)
```
Once we have the sample of `Y` points, we can start to sample the data from `X` at the location of the `Y` points
```{r}
# Extract the attribute data from the spatial points data frame 'points' in the raster object 's2_rl' using the 'extract()' function,
# and save the resulting data as a data table using the 'as.data.table()' function.
dt <- s2_rl %>%
extract(y = points) %>%
as.data.table()
# Add a new column to the 'dt' data table called 'class', which is based on the 'id_cls' attribute of the original spatial points data frame.
# The 'make.names()' function is used to ensure that the factor levels are valid column names.
dt[, class := make.names(factor(points@data$id_cls))]
# Print the first few rows of the 'dt' data table to the console using the 'head()' function.
head(dt)
```
So this resulting table `dt` gives us for every sample the spectral reflectance and corresponding class value. Since some pixels might still contain `NA` values that might mess up our classifiers, we need to remove the rows that contain `NA`'s.
```{r}
# Print the dimension before removing NA
dim(dt)
# Remove NA rows
dt <- na.omit(dt)
# Print the dimension after removing NA
print(dim(dt))
table(dt$class)
```
We can now calculate the mean reflectance per class
```{r}
# calculate the mean for each group and each variable
df_mean <- dt %>%
group_by(class) %>%
summarize_all(mean)
# view the resulting data frame
df_mean
```
#### Q1: Reflectance per class
Your task now is to plot the reflectance as a function of the wavelength for every class and to compare the spectral profiles. A possible starting point is sketched below.
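A minimal sketch of one possible approach, assuming the band columns of `df_mean` appear in the same wavelength order as used earlier (verify this for your own export before trusting the plot):
```{r q1-sketch, eval=FALSE}
# Reshape the class means to long format and attach the wavelengths
wl <- c(443, 490, 560, 665, 705, 740, 783, 835, 865, 945, 1610, 2190)
mean_mlt <- reshape2::melt(df_mean, id.vars = "class", value.name = "Reflectance")
# melt() stacks band by band, so each wavelength repeats once per class
mean_mlt$wl <- rep(wl, each = nrow(df_mean))
ggplot(mean_mlt, aes(x = wl, y = Reflectance, color = class)) +
  geom_line(lwd = 1) +
  geom_point(pch = 20, size = 3) +
  labs(color = "Class") +
  ggtitle("Mean spectral signature per class") +
  xlab("Wavelength")
```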
## Step 7: Classification introduction
After downloading and preparing the data we can start with the actual classification. Therefore, we are first performing a stratified random split of the data into training and testing sets. Splitting the data into training and testing sets is a crucial step in building and evaluating machine learning models. The purpose of splitting the data is to evaluate the performance of the model on unseen data. The model is trained and fine-tuned on the training set and then evaluated on the testing set. This test set provides an estimate of how well the model can generalize to new, unseen data. If the model performs well on the testing set, it suggests that it has learned meaningful patterns in the data and can generalize to new data. On the other hand, if the model performs poorly on the testing set, it suggests that it has overfit to the training data and is not able to generalize to new data.
```{r}
set.seed(321) # set the seed for reproducibility
# create a stratified random split of the data into training and testing sets
idx_train <- caret::createDataPartition(dt$class, # variable to stratify by
p = 0.7, # proportion of data to use for training
list = FALSE) # output as vector, not list
# subset the original data to create the training and testing sets
dt_train <- dt[idx_train] # subset the training set
dt_test <- dt[-idx_train] # subset the testing set
# create a frequency table of the class variable in the training set
print(table(dt_train$class))
print(table(dt_test$class))
# remove classes with too few samples (only needed when the Random Forest model gives a lot of warnings, see below)
# e.g. dt_train <- dt_train[!dt_train$class == 'X80', ]
```
The training dataset is used for carrying out cross-validation (CV) and a grid search for model tuning. Once the optimal parameters are found, a final model is fit to the entire training dataset using those settings. We can then check how these final models behave on unseen data (the testing dataset). For a fair comparison between multiple models, the CV indices need to match, so the same folds will be passed to the trainControl argument for each type of model.
```{r}
# create cross-validation folds (splits the data into n random groups)
n_folds <- 10 # number of folds
folds <- caret::createFolds(1:nrow(dt_train), k = n_folds) # create n_folds folds of the data
```
The `n_folds` variable determines the number of folds to use in the cross-validation process. The `folds <- createFolds(1:nrow(dt_train), k = n_folds)` line creates `n_folds` random groups (or "folds") of the data for cross-validation purposes.
Subsequently, we can set the seed for each iteration of the folds in a `seeds` variable that will store the random number seeds for each resampling iteration. This ensures that each resampling iteration has a unique random seed.
```{r}
# Set the seed at each resampling iteration. Useful when running CV in parallel.
seeds <- vector(mode = "list", length = n_folds + 1) # create empty list for seeds, +1 for the final model
for(i in 1:n_folds) seeds[[i]] <- sample.int(1000, n_folds) # set seeds for each fold
seeds[n_folds + 1] <- sample.int(1000, 1) # set seed for final model
```
### Control objects
Subsequently, we are going to use the `caret` package to train our classifiers. The `caret` (Classification And REgression Training) package in R is a powerful tool for building and evaluating machine learning models. Some advantages of using the caret package are:
- Streamlined workflow: The caret package provides a consistent, streamlined workflow for building and evaluating machine learning models, with a consistent interface for many different modeling algorithms. As such, it allows you to test completely different classification algorithms (each originally designed with its own syntax) using a very similar syntax.
- Cross-validation: The caret package makes it easy to perform cross-validation, a key step in model evaluation and hyper-parameter tuning, with support for different types of cross-validation such as k-fold, repeated k-fold, and leave-one-out cross-validation.
- Hyperparameter tuning: The caret package provides tools for tuning hyper-parameters of machine learning models, which can greatly improve model performance. The train() function in caret can automatically search through a grid of hyper-parameter values and select the best-performing combination.
Here we will start by setting up a control object for a machine learning model using cross-validation. Such a control object is important for being able to compare the different types of models. The `trainControl()` function is used to create the control object, and it includes several arguments:
- `summaryFunction`: specifies a summary function to be used in evaluating the model's performance. Here, multiClassSummary is used for multi-class classification problems.
- `method`: specifies the resampling method to be used. Here, cv indicates that k-fold cross-validation will be used.
- `number`: specifies the number of folds to be used in cross-validation. The value of n_folds is passed to this argument.
- `search`: specifies the search method to be used for hyper-parameter tuning. Here, grid indicates that a grid search will be used.
- `classProbs`: specifies whether to include class probabilities in the model output. It is set to TRUE, but a warning will be issued because it is not implemented for SVM models.
- `savePredictions`: specifies whether to save the predicted values for each fold. It is set to TRUE.
- `index`: specifies the resampling indexes to be used. The folds variable is passed to this argument.
- `seeds`: specifies the random seed values to be used. The seeds variable is passed to this argument.
```{r set-trainControl}
ctrl <- trainControl(summaryFunction = multiClassSummary,
method = "cv",
number = n_folds,
search = "grid",
classProbs = TRUE, # not implemented for SVM; will just get a warning
savePredictions = TRUE,
index = folds,
seeds = seeds)
```
Once we have set up this control object for cross-validation we can start training the model.
## Step 8 : Random Forest classifier
We will start by training a Random Forest classifier using the `caret` package.
```{r}
# train a random forest model using caret::train()
model_rf <- caret::train(class ~ . , # class ~ .: specifies the formula for the model, where class is the target variable and . indicates that all other variables in the dt_train dataset should be used as predictors.
method = "rf", # method: specifies the machine learning method to be used. Here, rf indicates that a random forest model will be used.
data = dt_train, # data: specifies the data used for training the model.
importance = TRUE, # specifies whether to calculate variable importance.
tuneGrid = data.frame(mtry = c(2, 3, 4, 5, 8)), # specifies the hyperparameters to be tuned during cross-validation. Here, mtry (the number of variables to consider at each split) is varied across a range of values.
trControl = ctrl) # specifies the control object to be used for cross-validation.
# save the trained model as an RDS file
saveRDS(model_rf, file = "./cache/model_rf.rds")
# Load the model again
# model_rf = readRDS('./cache/model_rf.rds')
# Getting a lot of warnings or NA values? Remove classes with too few samples:
# dt_train <- dt_train[!dt_train$class == 'X80', ]
# dt_test <- dt_test[!dt_test$class == 'X80', ]
```
Tuning here was done via the `mtry` argument, which can vary from 2 up to the total number of predictors (bands) used (here, 12). So the optimization was done via cross-validation and grid search (by grid I refer to `tuneGrid = data.frame(mtry = c(2, 3, 4, 5, 8))`).
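The selected value and the full grid of cross-validated results are stored in the fitted `train` object:
```{r}
model_rf$bestTune  # the mtry value selected via cross-validation
model_rf$results   # cross-validated performance for every candidate mtry
```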
```{r}
model_rf$times$everything # Print the total computation time
```
```{r}
plot(model_rf) # Plot the tuning results
```
#### Confusion matrix
Now we can compute the confusion matrix and associated statistics using the test data. A confusion matrix indicates how "confused" the model is between the given classes and highlights instances in which one class is confused for another. The main (first) diagonal of the matrix shows the cases when the model is "correct". The next cell creates a confusion matrix `cm_rf` using the `confusionMatrix()` function from the caret package. The function takes as input two arguments:
- `data`: a vector of predicted class labels, obtained by applying the trained `model_rf` on the test dataset `dt_test` using the `predict()` function.
- `factor(dt_test$class,levels=...)`: the true class labels from the `dt_test` dataset, converted to a factor variable, where we can either hardcode the potential levels (e.g., `c('X10','X30','X40','X50','X60','X80')`) or get them from the training data (e.g., `levels(as.factor(dt_train$class))`)
```{r}
# Option 1: hardcode the potential class levels
cm_rf <- caret::confusionMatrix(data = predict(model_rf, newdata = dt_test),
                                factor(dt_test$class, levels = c('X10','X30','X40','X50','X60','X80')))
# Option 2 (equivalent here): take the levels from the training data
cm_rf <- caret::confusionMatrix(data = predict(model_rf, newdata = dt_test),
                                factor(dt_test$class, levels = levels(as.factor(dt_train$class))))
cm_rf
```
The output is a confusion matrix object that summarizes the number of true positives, true negatives, false positives, and false negatives for each class label. Moreover it gives:
- Accuracy: the overall proportion of correctly classified samples;
- Sensitivity (Recall): the true positive rate (the model correctly detects the class);
- Specificity: the true negative rate (the model correctly rejects the class).
See also [Wikipedia](https://en.wikipedia.org/wiki/Confusion_matrix) or `help(confusionMatrix)` for more details on confusion matrix terminology.
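As a quick sanity check, the overall accuracy can be recomputed by hand from the counts stored in the confusion matrix object (the diagonal divided by the total number of test samples):
```{r}
# Overall accuracy recomputed from the confusion matrix counts
sum(diag(cm_rf$table)) / sum(cm_rf$table)
# Per-class statistics (sensitivity, specificity, ...) are stored in:
cm_rf$byClass
```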
#### Predictor importance
Finally, we can determine the predictor importance of each `X` feature on the classification. Determining the importance of predictors when fitting a machine learning model provides valuable insights into the factors that are most relevant for predicting the target variable. This information can be used to improve the accuracy of the model, reduce the number of features needed for prediction, and gain a better understanding of the underlying data.
There are several ways to get importance values but in this notebook we will use the `caret::varImp()` method as it is a generic method, so will work also for other models. The simple rule with importance values is that higher values mean the variables are more important.
```{r}
# Compute variable importance scores for model_rf using varImp() function
caret::varImp(model_rf)$importance %>%
# Convert data frame to matrix
as.matrix() %>%
# Create heatmap visualization using plot_ly() function
plotly::plot_ly(x = colnames(.), y = rownames(.), z = ., type = "heatmap", width = 350, height = 300)
```
The predictor importance shows that some bands are more important than other variables for the classification.
#### Q2: Predictor importance
Compare the predictor importance with your earlier plot of the spectral profiles per class. Can you visually explain why these predictors are important for separating the different classes? A sketch to get you started follows below.
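One option among many for this comparison is to reshape the per-class importance matrix into long format and plot it against wavelength; this sketch assumes the rows of the importance matrix follow the same band order as the wavelength vector used earlier.
```{r q2-sketch, eval=FALSE}
# Reshape the per-class importance matrix to long format and plot it by wavelength
imp <- as.data.frame(caret::varImp(model_rf)$importance)
imp$wl <- c(443, 490, 560, 665, 705, 740, 783, 835, 865, 945, 1610, 2190)
imp_mlt <- reshape2::melt(imp, id.vars = "wl",
                          variable.name = "class", value.name = "importance")
ggplot(imp_mlt, aes(x = wl, y = importance, color = class)) +
  geom_line(lwd = 1) +
  xlab("Wavelength") +
  ggtitle("RF predictor importance per band and class")
```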
## Step 9: SVM
OK, so we have trained one RF model, but of course we can also train some completely different models. On the [caret help page](https://topepo.github.io/caret/available-models.html) you can find a complete list of model types that can be used in caret, together with their relevant characteristics (e.g., their tuning parameters).
For example, we can train a [support vector machine model](https://en.wikipedia.org/wiki/Support_vector_machine) (SVM). In this case we are going to fine tune two different SVM models with different values of the cost parameter and the Loss function to be used in the SVM model. The cost parameter in SVM determines the penalty for misclassification of training samples. It controls the balance between achieving a low training error and a low testing error. The loss function in SVM determines the type of error measure that the SVM model will optimize during training.
```{r}
# Grid of tuning parameters for SVM model
svm_grid <- expand.grid(cost = c(0.2, 0.5, 1), # Different values of cost parameter
Loss = c("L1", "L2")) # Different values of loss function
# Train an SVM model using caret package
model_svm <- caret::train(class ~ . , method = "svmLinear3", data = dt_train,
importance = TRUE, # specifies whether to calculate variable importance.
tuneGrid = svm_grid,
trControl = ctrl)
# Note that we did not create a separate ctrl object for SVM; recycling the one used for the
# random forest model works if you ignore the warning: "Class probabilities were requested
# for a model that does not implement them."
# Save the trained SVM model object to a file using saveRDS() function
saveRDS(model_svm, file = "./cache/model_svm.rds")
# Load the model again
# model_svm = readRDS('./cache/model_svm.rds')
```
Now we can again see which model performs best.
```{r}
plot(model_svm) # Plot the tuning results
```
Or get the confusion matrix
```{r}
# The confusion matrix using the test dataset
cm_svm <- confusionMatrix(data = predict(model_svm, newdata = dt_test), factor(dt_test$class,levels=levels(as.factor(dt_train$class))))
cm_svm
```
Here you notice that the SVM model performs less well than the RF model.
## Step 10: Neural Network
Finally, we can test a third type of ML classifier, namely a neural network with one hidden layer, where we will test which number of hidden units and which weight-decay value fit best.
```{r}
# define a grid of tuning parameters for the nnet model
nnet_grid <- expand.grid(size = c(5, 10, 15),
decay = c(0.001, 0.01, 0.1))
# train the nnet model using the defined grid of tuning parameters
model_nnet <- caret::train(class ~ ., method = 'nnet', data = dt_train,
maxit = 1000, # set high enough to be sure that the optimization converges
tuneGrid = nnet_grid,
trControl = ctrl)
# save the trained nnet model as an RDS file
saveRDS(model_nnet, file = "./cache/model_nnet.rds")
# Load the model again
# model_nnet = readRDS('./cache/model_nnet.rds')
```
Now we can again see which model performs best.
```{r}
plot(model_nnet) # Plot the tuning results
```
Or get the confusion matrix
```{r}
# The confusion matrix using the test dataset
cm_nnet <- confusionMatrix(data = predict(model_nnet, newdata = dt_test), factor(dt_test$class,levels=levels(as.factor(dt_train$class))))
cm_nnet
```
Here you notice that the NN model performs slightly worse than the RF model, but better than the SVM model.
## Step 11: Compare models performance
Once we have trained and fine-tuned the different models, we can compare them using the `resamples()` function. We can only do this as long as the resampling indices of the observations match (which we ensured by passing the same folds and seeds to each model). Here we compare the results obtained via cross-validation on the training data set when we tuned the models.
```{r}
# Create model_list
model_list <- list(rf = model_rf, svm = model_svm, nnet = model_nnet)
# Pass model_list to resamples()
resamples <- caret::resamples(model_list)
```
In general, the model with the higher median accuracy is the "winner"; a smaller range between minimum and maximum accuracy is also preferable.
```{r}
# All metrics with boxplots
bwplot(resamples)
```
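Besides the boxplots, you can summarize the resampling distributions numerically and, since the folds match across models, test the pairwise differences for statistical significance:
```{r}
summary(resamples)           # numeric summary of each metric per model
diff_res <- diff(resamples)  # paired differences between models on matched folds
summary(diff_res)            # includes p-values for the pairwise comparisons
```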
Here we see that in general the RF model performs best for our data set and tuning, so we will preferably use that model for later purposes if we want to continue with this. Remember however that there is typically not one best method for all data sets. So when you are faced with a new data set, it might be important to again compare different models and different hyper-parameter settings.
Finally, we can also plot the output of the different models. To do so, we first need to apply the trained models to every pixel in the image.
```{r}
predict_rf <- raster::predict(object = s2_rl, model = model_rf, type = 'raw')
predict_svm <- raster::predict(object = s2_rl, model = model_svm, type = 'raw')
predict_nnet <- raster::predict(object = s2_rl, model = model_nnet, type = 'raw')
```
These lines use the trained machine learning models to predict the class of each pixel in a raster object. Specifically, the `raster::predict()` function is called for each model, passing the trained model and the raster object `s2_rl` as input. The type = 'raw' argument indicates that we want to get the raw predicted values (not the class probabilities). The predicted values for each model are stored in separate variables `predict_rf`, `predict_svm`, and `predict_nnet`.
```{r}
leafsync::sync(viewRGB(s2_rl, r = 4, g = 3, b = 2)+mapView(lc_r, maxpixels = 1242110),
mapView(predict_rf, maxpixels = 1242110),
mapView(predict_svm, maxpixels = 1242110),
mapView(predict_nnet, maxpixels = 1242110))
```
Now you can compare the original WorldCover map with our own predictions.
#### Q3: Model comparison
Compare the different models and discuss their performance in terms of overall accuracy and how well they spatially capture the land cover patterns.
## Step 12: GLCM
Up to now we have been classifying our data based on per-pixel values only without taking the spatial context into account.
In the next example we are going to calculate the Gray Level Co-occurrence Matrix (GLCM; Haralick et al. 1973) to quantify the texture of the image. GLCM is a powerful image feature for image analysis. The `glcm` package provides an easy-to-use function to calculate such textural features for RasterLayer objects in R. More information on GLCM can be found [here](https://prism.ucalgary.ca/server/api/core/bitstreams/8f9de234-cc94-401d-b701-f08ceee6cfdf/content).
We can calculate the GLCM textures in one direction:
```{r}
# Calculate gray level co-occurrence matrix (GLCM) for the fourth band of the image
# represented by the variable "s2_rl"
rglcm <- glcm(s2_rl[[4]],
window = c(9,9), # The GLCM is calculated using a 9x9 moving window
shift = c(1,1), # The window is shifted one pixel in both x and y directions
statistics = c("mean", "variance", "homogeneity", "contrast",
"dissimilarity", "entropy", "second_moment")) # The GLCM is calculated for several statistical measures including mean, variance, homogeneity, contrast, dissimilarity, entropy, and second moment.
# Plot the GLCM
plot(rglcm)
```
or we can calculate rotation-invariant texture features. This means that the textural features are calculated in all 4 directions (0°, 45°, 90° and 135°) and then combined into one rotation-invariant texture. The key for this is the shift parameter.
```{r}
# Calculate gray level co-occurrence matrix (GLCM) for the fourth band of the image
# represented by the variable "s2_rl". This GLCM is calculated using a 9x9 moving window
# with a shift of 1 pixel in four different directions (right, diagonal-upright,
# upward, diagonal-upleft).
rglcm1 <- glcm(s2_rl[[4]],
window = c(9,9),
shift = list(c(0,1), c(1,1), c(1,0), c(1,-1)),
statistics = c("mean", "variance", "homogeneity", "contrast",
"dissimilarity", "entropy", "second_moment")
)
# Plot the GLCM
plot(rglcm1)
```
#### Q4: GLCM in model
Add the GLCM output as additional input to the classifier to check if the spatial context improves the classification performance.
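A possible starting point (a sketch, not a full solution; it relies on `rglcm1` sharing the grid of `s2_rl`, which it does since it was derived from band 4 above):
```{r q4-sketch, eval=FALSE}
# Stack the spectral bands and the rotation-invariant texture layers
s2_tex <- raster::stack(s2_rl, rglcm1)
# Re-extract the training samples at the same points as before
dt_tex <- s2_tex %>%
  extract(y = points) %>%
  as.data.table()
dt_tex[, class := make.names(factor(points@data$id_cls))]
dt_tex <- na.omit(dt_tex)
# From here, repeat the train/test split (Step 7) and caret::train() (Step 8)
# on dt_tex and compare the confusion matrices
```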
## Step 13: Unmixing
In the final step we are also going to implement an unmixing example.
The "mesma" unmixing algorithm is a spectral unmixing method that decomposes a multispectral image into its constituent spectral endmembers and their corresponding fractional abundances in each pixel. This algorithm is implemented in the R package `RStoolbox`.
The "mesma" algorithm is based on the assumption that the spectral response of a pixel can be modeled as a linear combination of the spectral responses of the endmembers present in the pixel. In other words, the algorithm assumes that each pixel can be represented as a weighted sum of the spectra of the endmembers, where the weights correspond to the fractional abundances of the endmembers in the pixel.
The "mesma" algorithm uses a constrained least squares approach to estimate the fractional abundances of the endmembers in each pixel.
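In standard linear-mixing notation (a generic formulation, not tied to the exact `mesma` implementation), the model for a pixel spectrum $\mathbf{x}$ reads
$$\mathbf{x} = \sum_{i=1}^{n} f_i\,\mathbf{e}_i + \boldsymbol{\varepsilon}, \qquad f_i \ge 0, \qquad \sum_{i=1}^{n} f_i = 1,$$
where $\mathbf{e}_i$ are the $n$ endmember spectra, $f_i$ their fractional abundances, and $\boldsymbol{\varepsilon}$ the residual; the NNLS option used below minimizes $\lVert \mathbf{x} - \sum_i f_i \mathbf{e}_i \rVert^2$ subject to the non-negativity constraint.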
Therefore, we first have to estimate the endmember spectra by selecting a subset of the pixels in the image that have pure spectra (i.e., spectra that are representative of a single endmember). In this example, we are going to do that by selecting three pure pixels in the image using the click function explained earlier.
```{r}
# Plot the data
dev.new() # To avoid offsets between clicked and recorded points
plotRGB(s2_rl,2,3,4,stretch="lin",scale=5000)
# change plotting parameters to better see the points and numbers generated from clicking
par(col="red", cex=3)
# use the 'click' function
ct = click(s2_rl, id=T, xy=T, cell=T, type="p", pch=16, col="magenta", col.lab="red")
```
In this case I selected a `forest`, `grass` and `soil` pixel. The algorithm then uses these endmember spectra to estimate the fractional abundances of the endmembers in each pixel, subject to the constraint that the fractional abundances are non-negative and sum to one.
```{r}
# Create the endmember spectra
em = ct[4:15]
rownames(em) = c('forest','soil','grass')
# Do the unmixing
probs <- RStoolbox::mesma(s2_rl, em, method = "NNLS", verbose=TRUE)
```
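As a quick sanity check (assuming, as the chunks above and below do, that `probs` contains one fraction layer per endmember plus an `RMSE` layer), the estimated fractions should sum to approximately one in well-modeled pixels:
```{r}
# Sum the per-endmember fraction layers; values near 1 indicate a good fit
frac_sum <- probs$forest + probs$soil + probs$grass
summary(raster::values(frac_sum))
```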
Subsequently, we can visualize the output
```{r}
sync(viewRGB(s2_rl, r = 4, g = 3, b = 2) +
mapView(lc_r, maxpixels = 1242110), mapView(probs$forest, maxpixels = 1242110) +
mapView(probs$soil, maxpixels = 1242110) +
mapView(probs$grass, maxpixels = 1242110), mapView(probs$RMSE, maxpixels = 1242110))
```
## Step 14: Do it yourself
- Download a Sentinel-2 image of roughly 5×5 km centered on your home location
- Download the corresponding WorldCover map over your home location
- Visualize both the satellite image and land cover map
- Optimize three different land cover classifiers over your home location and assess their accuracy
- Assess the feature importance of the input data for accurate classification
- Repeat the classification but by using only the red, green and blue band
- Compare the full models with the simplified RGB model
- Perform an unmixing over your home area and compare that with the land cover classes
## Done