-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathoverview.qmd
476 lines (357 loc) · 17.7 KB
/
overview.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
# Overview of `moiraine` {#sec-overview}
```{r}
#| child: "_setup.qmd"
```
```{r loading-packages}
#| include: false
library(targets)
library(moiraine)
library(ggplot2)
library(circlize)
library(purrr)
library(OmicsPLS)
```
```{r loading-data}
#| echo: false
mo_set <- tar_read(mo_set_de)
tar_load(interesting_features)
tar_load(pca_runs_list)
tar_load(mo_presel_supervised)
tar_load(so2pls_final_run)
tar_load(diablo_final_run)
mofa_output <- tar_read(mofa_output) |>
moiraine:::.filter_output_dimensions(paste("Factor", 1:4))
tar_load(diablo_output)
tar_load(output_list)
```
In this chapter, we provide an overview of the capabilities of `moiraine`. Some code is presented for illustration, but for more details readers will be directed towards the corresponding chapter in the manual.
```{r setup-visible}
#| code-fold: true
#| code-summary: "Loading packages"
#| eval: false
library(moiraine)
## For custom colour palettes
library(ggplot2)
library(circlize)
## For working with lists
library(purrr)
## For visualising sO2PLS summary
library(OmicsPLS)
```
## Input data
The first step is to import the omics datasets and associated information into R. For each omics dataset, the `moiraine` package expects three pieces of information, which are read from csv files (or other specific formats when possible such as gtf or gff3):
* the measurements of omics features across the samples (the dataset)
* information about the omics features measured (the features metadata)
* information about the samples measured (the samples metadata)
An example of input files is shown below:

`moiraine` uses the [`MultiDataSet` package](https://bioconductor.org/packages/release/bioc/html/MultiDataSet.html) to store the omics datasets and associated metadata into a single R object, which can be used as input for the different `moiraine` functions (see @sec-importing-data for details about data import). This ensures that these functions can be used regardless of the number or type of omics datasets to analyse. At any point, the datasets, features and samples metadata can easily be extracted from the `MultiDataSet` object through `get_datasets()`, `get_features_metadata()` and `get_samples_metadata()`.
Here is an example of a `MultiDataSet` object:
```{r show-multidataset}
#| echo: false
mo_set
```
Importantly, this means that all the information that we have about the omics features and samples are available to `moiraine` for customising plots or outputs. For example, it is possible to quickly generate a heatmap displaying the measurements for specific features of interest, and add information about the features and samples to facilitate the interpretation of the plot:
```{r plot-data-heatmap}
#| code-fold: true
#| fig.width: 10
#| fig.height: 6
head(interesting_features) ## vector of feature IDs
colours_list <- list(
"status" = c("Control" = "gold", "BRD" = "lightblue"),
"day_on_feed" = colorRamp2(c(5, 70), c("white", "pink3")),
"de_status" = c("downregulated" = "deepskyblue",
"upregulated" = "chartreuse3")
)
plot_data_heatmap(
mo_set, # the MultiDataSet object
interesting_features, # vector of feature IDs of interest
center = TRUE, # centering and scaling data for
scale = TRUE, # easier visualisation
show_column_names = FALSE, # hide sample IDs
only_common_samples = TRUE, # only samples present in all omics
samples_info = c("status", "day_on_feed"), # add info about samples
features_info = c("de_status"), # add info about features
colours_list = colours_list, # customise colours
label_cols = list( # specify features label
"rnaseq" = "Name", # from features metadata
"metabolome" = "name"
)
)
```
Similarly, a number of functions allow to quickly summarise different aspects of the multi-omics dataset, such as creating an upset plot to compare the samples present in each omics dataset (`plot_samples_upset()`), or generating a density plot for each omics dataset (`plot_density_data()`). See @sec-inspecting-multidataset for more details about the different visualisations and summary functions implemented.
## Data pre-processing
Target factories have been implemented to facilitate the application of similar tasks across the different omics datasets. For example, the `transformation_datasets_factory()` function generates a sequence of targets to apply one of many possible transformations (from the [`vsn`](https://bioconductor.org/packages/release/bioc/html/vsn.html), [`DESeq2`](https://bioconductor.org/packages/release/bioc/html/DESeq2.html), or [`bestNormalize`](https://cran.r-project.org/web/packages/bestNormalize/index.html) packages, for example) on each omics dataset, store information about each transformation performed, and generate a new `MultiDataSet` object in which the omics measurements have been transformed:
<details>
<summary>Code</summary>
::: {.targets-chunk}
```{targets transformation-datasets-factory}
#| eval: false
transformation_datasets_factory(
mo_set, # MultiDataSet object
c("rnaseq" = "vst-deseq2", # VST through DESeq2 for RNAseq
"metabolome" = "logx"), # log2-transf. for NMR dataset
log_bases = 2, # Base for log transformation
pre_log_functions = zero_to_half_min, # Handling 0s in log2-transf.
transformed_data_name = "mo_set_transformed" # New MultiDataSet object
)
```
:::
</details>

Note that there is also the option for users to apply their own custom transformations to the datasets (see @sec-modifying-multidataset).
Similarly, the `pca_complete_data_factory` generates a list of targets to run a PCA on each omics dataset via the [`pcaMethods` package](https://bioconductor.org/packages/release/bioc/html/pcaMethods.html), and if necessary imputes missing values through NIPALS-PCA. The PCA results can be easily visualised for all or specific omics datasets:
```{r pca-screeplot}
#| code-fold: true
#| fig.width: 8
#| fig.height: 8
plot_screeplot_pca(pca_runs_list)
```
```{r show-pca-res-snps}
#| code-fold: true
#| fig.width: 8
#| fig.height: 8
#| warning: false
plot_samples_coordinates_pca(
pca_runs_list, # List of PCA results
datasets = "snps", # Dataset to plot
pcs = 1:3, # Principal components to display
mo_data = mo_set, # MultiDataSet object
colour_upper = "geno_comp_cluster", # Samples covariate
shape_upper = "status", # Samples covariate
colour_lower = "feedlot", # Samples covariate
scale_colour_lower = scale_colour_brewer(palette = "Set1") # Custom palette
) +
theme(legend.box = "vertical") # Plot legend vertically
```
More information about data pre-processing can be found in @sec-preprocessing.
## Data pre-filtering
The created `MultiDataSet` object can be filtered, both in terms of samples and features, by passing a list of sample or feature IDs to retain, or by using logical tests on samples or features metadata. In addition, we implement target factories to retain only the most variable features in each omics dataset --unsupervised filtering--, or to retain the features most associated with an outcome of interest, via sPLS-DA from [`mixOmics`](http://mixomics.org/) --supervised filtering-- (see @sec-prefiltering). This pre-filtering step is essential to reduce the size of the datasets prior to multi-omics integration.
<details>
<summary>Code</summary>
::: {.targets-chunk}
```{targets feature-preselection-splsda-factory}
#| eval: false
feature_preselection_splsda_factory(
mo_set_complete, # A MultiDataSet object
group = "status", # Sample covariate to use for supervised filtering
to_keep_ns = c( # Number of features to retain per dataset
"snps" = 1000,
"rnaseq" = 1000
),
filtered_set_target_name = "mo_presel_supervised" # Name of filtered object
)
```
:::
</details>
```{r show-mo-presel-supervised}
#| echo: false
mo_presel_supervised
```
## Multi-omics data integration
Currently, `moiraine` provides functions and target factories to facilitate the use of five integration methods: sPLS and DIABLO from the `mixOmics` package, sO2PLS from [`OmicsPLS`](https://cran.r-project.org/web/packages/OmicsPLS/index.html), as well as `MOFA` and `MEFISTO` from [`MOFA2`](https://biofam.github.io/MOFA2/).
This includes functions that transform a `MultiDataSet` object into the required input format for each integration method; for example for sPLS (only top of the matrices shown):
```{r get-input-spls}
#| code-fold: true
#| eval: false
get_input_spls(
mo_presel_supervised,
mode = "canonical",
datasets = c("rnaseq", "metabolome")
)
```
```{r show-input-spls}
#| echo: false
map(tar_read(spls_input), \(x) x[1:5, 1:5])
```
`moiraine` also offers helper functions and target factories to facilitate the application of these integration tools. For example, the `diablo_predefined_design_matrix()` function generates, for a given DIABLO input object, one of the three recommended design matrices for DIABLO (null, full or weighted full), while the `diablo_pairwise_pls_factory()` factory creates a list of targets to estimate the optimal design matrix to use for DIABLO based on datasets pairwise correlations estimated using PLS:
<details>
<summary>Code</summary>
::: {.targets-chunk}
```{targets diablo-design-matrix-factory}
#| eval: false
list(
tar_target(
diablo_input, # DIABLO input object
get_input_mixomics_supervised(
mo_presel_supervised, # MultiDataSet object (prefiltered)
group = "status" # Samples covariate of interest
)
),
diablo_pairwise_pls_factory(diablo_input) # Target factory for design matrix
# estimation
)
```
:::
</details>

In addition, a number of plotting functions have been implemented to visualise different aspects of the integration process: e.g. `diablo_plot_tune()` to show the results of model tuning in DIABLO or `so2pls_plot_summary()` (shown below) to display the percentage of variance explained by each latent component constructed by sO2PLS:
```{r so2pls-plot-summary}
#| code-fold: true
#| fig.height: 6
so2pls_plot_summary(so2pls_final_run)
```
::: {.callout-note}
Wouldn't it be nice to have informative labels for the features in DIABLO's circos plots? With `moiraine`, it is possible to use information from the features metadata provided as labels for the features in the plots. So, we can go from:
```{r diablo-circos-plot}
#| code-fold: true
#| fig.width: 6
#| fig.height: 6
mixOmics::circosPlot(
diablo_final_run,
cutoff = 0.7,
size.variables = 0.5,
comp = 1
)
```
to:
```{r moiraine-circos-plot}
#| code-fold: true
#| fig.width: 6
#| fig.height: 6
diablo_plot_circos(
diablo_final_run,
mo_set,
label_cols = list(
"rnaseq" = "Name",
"metabolome" = "name"
),
cutoff = 0.7,
size.variables = 0.5,
comp = 1
)
```
:::
More details about how to use these integration tools through `moiraine` can be found in [Chapters @sec-spls] to [-@sec-diablo].
## Interpreting the integration results
One of the main goals of `moiraine` is to facilitate the interpretation of the omics integration results. To this end, the outcome of any of the supported integration methods can be converted to a standardised integration output format, e.g.:
```{r get-output-mofa}
#| code-fold: true
#| eval: false
get_output(mofa_trained)
```
```{r show-mofa-output}
#| echo: false
mofa_output
```
This object can then be used to visualise the integration results in a number of ways, including:
* percentage of variance explained:
```{r mofa-variance-explained}
#| code-fold: true
plot_variance_explained(mofa_output)
```
* Sample scores as pairwise scatterplots:
```{r mofa-samples-score-ggpairs}
#| code-fold: true
#| fig.height: 8
plot_samples_score(
mofa_output, # MOFA standardised output
latent_dimensions = paste("Factor", 1:3), # MOFA factors to display
mo_data = mo_set, # MultiDataSet object
colour_upper = "status", # Sample covariate
scale_colour_upper = scale_colour_brewer(palette = "Set1"), # Custom palette
shape_upper = "gender", # Sample covariate
colour_lower = "geno_comp_cluster" # Sample covariate
) +
theme(legend.box = "vertical")
```
* Sample scores against a sample covariate of interest (either categorical or continuous):
```{r mofa-samples-score-covariate}
#| code-fold: true
#| fig.height: 5
#| warning: false
plot_samples_score_covariate(
mofa_output, # MOFA standardised output
mo_set, # MultiDataSet object
"status", # Sample covariate of interest
colour_by = "status", # Other sample covariate
shape_by = "geno_comp_cluster", # Other sample covariate
latent_dimensions = paste("Factor", 1:2) # MOFA factors to display
)
```
* Top contributing features with their importance:
```{r mofa-top-features}
#| code-fold: true
#| fig.width: 10
#| fig.height: 8
plot_top_features(
mofa_output, # MOFA standardised output
mo_data = mo_set, # MultiDataSet object
label_cols = list( # Custom labels for features from
"rnaseq" = "Name", # features metadata
"metabolome" = "name"
),
truncate = 25, # truncate long feature labels
latent_dimensions = paste("Factor", 1:2) # MOFA factors to display
)
```
More details can be found in @sec-interpretation.
## Evaluating the integration results
With `moiraine`, it is possible to evaluate the results of a given integration tool against prior information that we have about the features (e.g. knowledge about biological functions or results from a single-omics analysis) or samples (e.g. to assess the success of samples clustering into meaningful groups). For example, we can compare the feature selection performed by DIABLO to results from differential expression analyses performed on the omics datasets:
```{r evaluate-feature-selection-table}
#| code-fold: true
#| collapse: false
evaluate_feature_selection_table(
diablo_output, # DIABLO standardised output
mo_data = mo_set, # MultiDataSet object
col_names = list( # Columns from features metadata
"rnaseq" = "de_signif", # containing DE outcome
"metabolome" = "de_signif"
),
latent_dimensions = "Component 1" # Latent component to focus on
)
```
In addition, a number of functions are provided to help with features set enrichment (such as over-representation analysis or gene set enrichment analysis), e.g. by generating feature sets from features metadata through `make_feature_sets_from_fm()`, or by ensuring that a proper background set is used in the enrichment analysis, with `reduce_feature_sets_data()`. Information about evaluation of integration results is presented in @sec-evaluation.
## Comparison different integration results
Lastly, `moiraine` facilitates the comparison of the results from different integration methods, or from a same integration method but with different pre-processing options or parameter choices. It is possible to visualise the correlation between the results of different methods, in terms of the latent dimensions they constructed:
```{r comparison-heatmap-corr}
#| code-fold: true
#| eval: false
comparison_heatmap_corr(
output_list, # List of integration results (standardised format)
)
```
```{r show-comp-heatmap}
#| echo: false
#| fig.width: 8
#| fig.height: 7.5
comparison_heatmap_corr(
output_list,
latent_dimensions = list(
"sO2PLS" = "joint component 1",
"MOFA" = paste("Factor", 1:4)
)
)
```
Or compare the samples score or features weight from two latent dimensions created by two integration methods:
```{r}
#| code-fold: true
#| eval: false
plot_features_weight_pair(
list(mofa_output, diablo_output), # Integration results (stand. format)
list( # Latent dimensions to compare
"MOFA" = "Factor 1",
"DIABLO" = "Component 1"
),
mo_data = mo_set, # MultiDataSet object
features_metric = "importance", # Plot absolute or signed importance score
label_cols = list( # Columns from features metadata to use
"rnaseq" = "Name", # as labels in plot
"metabolome" = "name"
)
)
```
```{r show-features-weight-comp}
#| echo: false
plot_features_weight_pair(
output_list[3:4],
list("MOFA" = "Factor 1", "DIABLO" = "Component 1"),
mo_data = mo_set,
features_metric = "importance",
label_cols = list(
"rnaseq" = "Name",
"metabolome" = "name"
)
)
```
It is also possible to compute a consensus importance score for each feature, which summarises the contribution of the feature to different latent dimensions, and thus assess which features are highlighted across several integration methods. More details are provided in @sec-comparison.