-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathrChapter3-2.Rmd
298 lines (215 loc) · 9.37 KB
/
rChapter3-2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
---
title: "Create example sequences & calculate the pairwise dissimilarity matrix using different aligment techniques"
description: |
Chapter 3.2 Alignment techniques
output: distill::distill_article
---
```{r setup, include=FALSE}
# Load required packages
library(here)
source(here("source", "load_libraries.R"))
# Output options
knitr::opts_chunk$set(eval=TRUE, echo=TRUE)
options("kableExtra.html.bsTable" = T)
# load data for Chapter 3
load(here("data", "3-0_ChapterSetup.RData"))
```
```{r, xaringanExtra-clipboard, echo=FALSE}
htmltools::tagList(
xaringanExtra::use_clipboard(
button_text = "<i class=\"fa fa-clone fa-2x\" style=\"color: #301e64\"></i>",
success_text = "<i class=\"fa fa-check fa-2x\" style=\"color: #90BE6D\"></i>",
error_text = "<i class=\"fa fa-times fa-2x\" style=\"color: #F94144\"></i>"
),
rmarkdown::html_dependency_font_awesome()
)
```
<details><summary>**Click here to get instructions...**</summary>
- Please download and unzip the replication files for Chapter 3
([`r fontawesome::fa("far fa-file-zipper")` Chapter03.zip](source/Chapter03.zip)).
- Read `readme.html` and run `3-0_ChapterSetup.R`. This will create `3-0_ChapterSetup.RData` in the sub folder `data/R`. This file contains the data required to produce the plots shown below.
- You also have to add the function `legend_large_box` to your environment in order to render the tweaked version of the legend described below. You find this file in the `source` folder of the unzipped Chapter 3 archive.
- We also recommend to load the libraries listed in Chapter 3's `LoadInstallPackages.R`
```{r, eval=FALSE}
# assuming you are working within .Rproj environment
library(here)
# install (if necessary) and load other required packages
source(here("source", "load_libraries.R"))
# load environment generated in "3-0_ChapterSetup.R"
load(here("data", "R", "3-0_ChapterSetup.RData"))
```
</details>
\
In chapter 3.2, we introduces techniques to compare whole sequences.
## Create example sequences
We first create two vectors 6-element-long (of equal length). Each element is coded with a letter: A, B, or C.
```{r, eval=TRUE, echo=TRUE}
ch3.ex1 <- c("A-B-B-C-C-C", "A-B-B-B-B-B")
```
We then display the two vectors:
```{r, eval=TRUE, echo=FALSE}
ch3.ex1
```
We create sequence objects from the three vectors:
```{r, eval=TRUE, echo=TRUE}
ch3.ex1.seq <- seqdef(ch3.ex1)
```
The three resulting sequences are then displayed:
```{r, eval=TRUE, echo=TRUE}
ch3.ex1.seq
```
## Optimal matching (OM)
In this case sequences are short and few in number, so that we immediately recognize that they are not the identical. In case of longer and more complex sequences, one might want to know if two sequences are the identical but doing that visually might not be feasible. The following code allows you to do that as it returns [TRUE] if the sequences are the same and [FALSE] if they are not. In this case we compare sequence 1 and 2 above:
```{r, eval=TRUE, echo=TRUE}
seqcomp(ch3.ex1.seq[1,],ch3.ex1.seq[2,])
```
However, the two sequences might have some similarity even if they are not exactly the same. To obtain the number of matching positions between the two sequences, we can use the code:
```{r, eval=TRUE, echo=TRUE}
seqmpos(ch3.ex1.seq[1,],ch3.ex1.seq[2,])
```
We now know that sequence 1 and sequence 2 share 3 matching positions.
Alternatively, we can compute the length of the longest common subsequence (elements composed of tokens - states and combinations of subsequent states - that occur in the same order along the sequence) of two sequences:
```{r, eval=TRUE, echo=TRUE}
seqLLCS(ch3.ex1.seq[1,],ch3.ex1.seq[2,])
```
We now know that the longest common subsequence between sequence 1 and sequence 2 is 3-element-long.
Aligning them pairwise to make them the same. Optimal Matching (OM) does so by changing the order of the states or the length of a spell.
<!--
In classical OM applications, alignment can be achieved by conducting two basic operations: insertion/deletion of a state (indel) and substitution of a state with another one. Different combinations of operations are possible to "match" the sequences pairwise. Each operation is assigned a “cost”: the sum of the costs of all operations will be regarded as the degree of dissimilarity between the two sequences. The matching is "optimal" as the algorithm finds the "cheapest" solution to align the sequences given a set of costs for the indel and substitution operations.
-->
In the simplest example possible, both indel and substitution operations can be assigned the same cost of 1 (Levenshtein I).
We first generate a substitution costs matrix to be used in the next steps:
```{r, eval=TRUE, echo=TRUE}
costs1 <- matrix(
c(0,1,1,
1,0,1,
1,1,0
),
nrow = 3, ncol = 3, byrow = TRUE)
```
We can display it:
```{r, eval=TRUE, echo=FALSE}
costs1
```
We then use the following code to obtain a summary of the optimal number and costs of operations the OM identifies as the cheapest to align the two sequences. `indel` and `sm`) require the indel cost and the substitution cost matrix respectively to be specified. In what follows, we use `?seqalign` command:
```{r, eval=TRUE, echo=TRUE}
al.ch3.1.2 <- seqalign(ch3.ex1.seq, 1:2, indel=1, sm=costs1)
print(al.ch3.1.2)
```
## Assigning costs to the alignment operations
Let's consider 3 example sequences
```{r, eval=TRUE, echo=TRUE}
ch3.ex2 <- c("A-B-B-C-C-C", "A-B-B-B-B-B", "B-C-C-C-B-B")
```
...define them as sequences objects...
```{r, eval=TRUE, echo=TRUE}
ch3.ex2.seq <- seqdef(ch3.ex2)
```
...and display them:
```{r, eval=TRUE, echo=TRUE}
ch3.ex2.seq
```
We now generate a substitution costs matrix with the following costs to be used in the next step: `indel=1`, `sm=1` - which corresponds to Levenshtein I. Note that in this case the pairwise dissimilarity equals to the number of non-matched positions
```{r, eval=TRUE, echo=TRUE}
costsL1 <- matrix(
c(0,1,1,
1,0,1,
1,1,0
),
nrow = 3, ncol = 3, byrow = TRUE)
```
One can display the newly-generated matrix:
```{r, eval=TRUE, echo=TRUE}
costsL1
```
Using the `?seqalign` command we obtain a summary of the optimal number and costs of operations. The OM identifies the following as the cheapest to align the three sequences pairwise:
```{r, eval=TRUE, echo=TRUE}
al1.L1 <- seqalign(ch3.ex2.seq, 1:2, indel=1, sm=costsL1)
print(al1.L1)
```
```{r, eval=TRUE, echo=TRUE}
al2.L1 <- seqalign(ch3.ex2.seq, c(1,3), indel=1, sm=costsL1)
print(al2.L1)
```
```{r, eval=TRUE, echo=TRUE}
al3.L1 <- seqalign(ch3.ex2.seq, 2:3, indel=1, sm=costsL1)
print(al3.L1)
```
We now generate an alternative substitution costs matrix with `sm=4`. Combined with `indel=1` this corresponds to Levenshtein II. Note that indel costs are lower than half of the minimum substitution cost, so that only indel operations will be used by the OM.
```{r, eval=TRUE, echo=TRUE}
costsL2 <- matrix(
c(0,4,4,
4,0,4,
4,4,0
),
nrow = 3, ncol = 3, byrow = TRUE)
```
One can display the newly-generated matrix:
```{r, eval=TRUE, echo=TRUE}
costsL2
```
Using the `?seqalign` command, we obtain a summary of the optimal number and costs of operations. The OM identifies the following as the cheapest to align the three sequences pairwise:
```{r, eval=TRUE, echo=TRUE}
al1.L2 <- seqalign(ch3.ex2.seq, 1:2, indel=1, sm=costsL2)
print(al1.L2)
```
```{r, eval=TRUE, echo=TRUE}
al2.L2 <- seqalign(ch3.ex2.seq, c(1,3), indel=1, sm=costsL2)
print(al2.L2)
```
```{r, eval=TRUE, echo=TRUE}
al3.L2 <- seqalign(ch3.ex2.seq, 2:3, indel=1, sm=costsL2)
print(al3.L2)
```
The next test corresponds to the Hamming distance setting, that is, only substitution costs are used. The costs setting will be: `indel=NA`, `sm=1`. As above, we generate a substitution costs matrix:
```{r, eval=TRUE, echo=TRUE}
costsHAM <- matrix(
c(0,1,1,
1,0,1,
1,1,0
),
nrow = 3, ncol = 3, byrow = TRUE)
```
...and display it:
```{r, eval=TRUE, echo=TRUE}
costsHAM
```
Using the `?seqalign` command, we obtain a summary of the optimal number and costs of operations: the OM identifies the following as the cheapest to align the three sequences pairwise. Note that here we have to specify `indel=4` (very high) just to make sure that the algorithm does not use them as with `seqalign` is not possible to specify directly `dist=HAM`
```{r, eval=TRUE, echo=TRUE}
al1.HAM <- seqalign(ch3.ex2.seq, 1:2, indel=4, sm=costsHAM)
print(al1.HAM)
```
```{r, eval=TRUE, echo=TRUE}
al2.HAM <- seqalign(ch3.ex2.seq, c(1,3), indel=4, sm=costsHAM)
print(al2.HAM)
```
```{r, eval=TRUE, echo=TRUE}
al3.HAM <- seqalign(ch3.ex2.seq, 2:3, indel=4, sm=costsHAM)
print(al3.HAM)
```
We now try the option with the following costs: `indel=1` , `substitution=2`, generating the substitution costs matrix first:
```{r, eval=TRUE, echo=TRUE}
costsOMcl <- matrix(
c(0,2,2,
2,0,2,
2,2,0
),
nrow = 3, ncol = 3, byrow = TRUE)
```
We display the resulting matrix of substitution costs:
```{r, eval=TRUE, echo=TRUE}
costsOMcl
```
Using the `?seqalign` command, we obtain a summary of the optimal number and costs of operation. The OM identifies the following as the cheapest to align the three sequences pairwise.
```{r, eval=TRUE, echo=TRUE}
al1.OMcl <- seqalign(ch3.ex2.seq, 1:2, indel=1, sm=costsOMcl)
print(al1.OMcl)
```
```{r, eval=TRUE, echo=TRUE}
al2.OMcl <- seqalign(ch3.ex2.seq, c(1,3), indel=1, sm=costsOMcl)
print(al2.OMcl)
```
```{r, eval=TRUE, echo=TRUE}
al3.OMcl <- seqalign(ch3.ex2.seq, 2:3, indel=1, sm=costsOMcl)
print(al3.OMcl)
```