overview.qmd

# Overview of `moiraine` {#sec-overview}

```{r}
#| child: "_setup.qmd"
```

```{r loading-packages}
#| include: false

library(targets)
library(moiraine)
library(ggplot2)
library(circlize)
library(purrr)
library(OmicsPLS)
```

```{r loading-data}
#| echo: false

mo_set <- tar_read(mo_set_de)
tar_load(interesting_features)
tar_load(pca_runs_list)
tar_load(mo_presel_supervised)
tar_load(so2pls_final_run)
tar_load(diablo_final_run)
mofa_output <- tar_read(mofa_output) |> 
  moiraine:::.filter_output_dimensions(paste("Factor", 1:4))
tar_load(diablo_output)
tar_load(output_list)
```

In this chapter, we provide an overview of the capabilities of `moiraine`. Some code is presented for illustration, but for more details readers will be directed towards the corresponding chapter in the manual.


```{r setup-visible}
#| code-fold: true
#| code-summary: "Loading packages"
#| eval: false

library(moiraine)

## For custom colour palettes
library(ggplot2)
library(circlize)

## For working with lists
library(purrr)

## For visualising sO2PLS summary
library(OmicsPLS)
```


## Input data

The first step is to import the omics datasets and associated information into R. For each omics dataset, the `moiraine` package expects three pieces of information, which are read from csv files (or other specific formats when possible such as gtf or gff3):

* the measurements of omics features across the samples (the dataset)

* information about the omics features measured (the features metadata)

* information about the samples measured (the samples metadata)

An example of input files is shown below:

![An example of input files for `moiraine` (first 10 rows and 4 columns only): A) RNAseq read counts across genes (rows) and samples (columns). B) Table of features metadata, providing information about the genes measured. c) Table of samples metadata, i.e. information about the animals that were observed.](images/input_data_example.png)


`moiraine` uses the [`MultiDataSet` package](https://bioconductor.org/packages/release/bioc/html/MultiDataSet.html) to store the omics datasets and associated metadata into a single R object, which can be used as input for the different `moiraine` functions (see @sec-importing-data for details about data import). This ensures that these functions can be used regardless of the number or type of omics datasets to analyse. At any point, the datasets, features and samples metadata can easily be extracted from the `MultiDataSet` object through `get_datasets()`, `get_features_metadata()` and `get_samples_metadata()`. 

Here is an example of a `MultiDataSet` object:

```{r show-multidataset}
#| echo: false
mo_set
```

Importantly, this means that all the information that we have about the omics features and samples are available to `moiraine` for customising plots or outputs. For example, it is possible to quickly generate a heatmap displaying the measurements for specific features of interest, and add information about the features and samples to facilitate the interpretation of the plot:

```{r plot-data-heatmap}
#| code-fold: true
#| fig.width: 10
#| fig.height: 6

head(interesting_features) ## vector of feature IDs

colours_list <- list(                       
  "status" = c("Control" = "gold", "BRD" = "lightblue"),
  "day_on_feed" = colorRamp2(c(5, 70), c("white", "pink3")),
  "de_status" = c("downregulated" = "deepskyblue", 
                  "upregulated" = "chartreuse3")
)

plot_data_heatmap(
  mo_set,                                    # the MultiDataSet object
  interesting_features,                      # vector of feature IDs of interest
  center = TRUE,                             # centering and scaling data for
  scale = TRUE,                              #   easier visualisation
  show_column_names = FALSE,                 # hide sample IDs
  only_common_samples = TRUE,                # only samples present in all omics
  samples_info = c("status", "day_on_feed"), # add info about samples
  features_info = c("de_status"),            # add info about features
  colours_list = colours_list,               # customise colours
  label_cols = list(                         # specify features label
    "rnaseq" = "Name",                       #   from features metadata
    "metabolome" = "name"
  )
)
```

Similarly, a number of functions allow to quickly summarise different aspects of the multi-omics dataset, such as creating an upset plot to compare the samples present in each omics dataset (`plot_samples_upset()`), or generating a density plot for each omics dataset (`plot_density_data()`). See @sec-inspecting-multidataset for more details about the different visualisations and summary functions implemented.

## Data pre-processing

Target factories have been implemented to facilitate the application of similar tasks across the different omics datasets. For example, the `transformation_datasets_factory()` function generates a sequence of targets to apply one of many possible transformations (from the [`vsn`](https://bioconductor.org/packages/release/bioc/html/vsn.html), [`DESeq2`](https://bioconductor.org/packages/release/bioc/html/DESeq2.html), or [`bestNormalize`](https://cran.r-project.org/web/packages/bestNormalize/index.html) packages, for example) on each omics dataset, store information about each transformation performed, and generate a new `MultiDataSet` object in which the omics measurements have been transformed:

<details>

<summary>Code</summary>

::: {.targets-chunk}
```{targets transformation-datasets-factory}
#| eval: false

transformation_datasets_factory(
  mo_set,                                      # MultiDataSet object
  c("rnaseq" = "vst-deseq2",                   # VST through DESeq2 for RNAseq
    "metabolome" = "logx"),                    # log2-transf. for NMR dataset
  log_bases = 2,                               # Base for log transformation
  pre_log_functions = zero_to_half_min,        # Handling 0s in log2-transf.
  transformed_data_name = "mo_set_transformed" # New MultiDataSet object
)
```
:::

</details>

![](images/dag_transformation.png)

Note that there is also the option for users to apply their own custom transformations to the datasets (see @sec-modifying-multidataset).

Similarly, the `pca_complete_data_factory` generates a list of targets to run a PCA on each omics dataset via the [`pcaMethods` package](https://bioconductor.org/packages/release/bioc/html/pcaMethods.html), and if necessary imputes missing values through NIPALS-PCA. The PCA results can be easily visualised for all or specific omics datasets:

```{r pca-screeplot}
#| code-fold: true
#| fig.width: 8
#| fig.height: 8

plot_screeplot_pca(pca_runs_list)
```

```{r show-pca-res-snps}
#| code-fold: true
#| fig.width: 8
#| fig.height: 8
#| warning: false

plot_samples_coordinates_pca(
  pca_runs_list,                              # List of PCA results
  datasets = "snps",                          # Dataset to plot
  pcs = 1:3,                                  # Principal components to display
  mo_data = mo_set,                           # MultiDataSet object
  colour_upper = "geno_comp_cluster",         # Samples covariate
  shape_upper = "status",                     # Samples covariate
  colour_lower = "feedlot",                   # Samples covariate
  scale_colour_lower = scale_colour_brewer(palette = "Set1") # Custom palette
) +
  theme(legend.box = "vertical")              # Plot legend vertically
```

More information about data pre-processing can be found in @sec-preprocessing.

## Data pre-filtering

The created `MultiDataSet` object can be filtered, both in terms of samples and features, by passing a list of sample or feature IDs to retain, or by using logical tests on samples or features metadata. In addition, we implement target factories to retain only the most variable features in each omics dataset --unsupervised filtering--, or to retain the features most associated with an outcome of interest, via sPLS-DA from [`mixOmics`](http://mixomics.org/) --supervised filtering-- (see @sec-prefiltering). This pre-filtering step is essential to reduce the size of the datasets prior to multi-omics integration.

<details>

<summary>Code</summary>

::: {.targets-chunk}
```{targets feature-preselection-splsda-factory}
#| eval: false

feature_preselection_splsda_factory(
  mo_set_complete,            # A MultiDataSet object
  group = "status",           # Sample covariate to use for supervised filtering
  to_keep_ns = c(             # Number of features to retain per dataset
    "snps" = 1000, 
    "rnaseq" = 1000
  ), 
  filtered_set_target_name = "mo_presel_supervised" # Name of filtered object
)
```
:::

</details>

```{r show-mo-presel-supervised}
#| echo: false

mo_presel_supervised
```

## Multi-omics data integration

Currently, `moiraine` provides functions and target factories to facilitate the use of five integration methods: sPLS and DIABLO from the `mixOmics` package, sO2PLS from [`OmicsPLS`](https://cran.r-project.org/web/packages/OmicsPLS/index.html), as well as `MOFA` and `MEFISTO` from [`MOFA2`](https://biofam.github.io/MOFA2/). 

This includes functions that transform a `MultiDataSet` object into the required input format for each integration method; for example for sPLS (only top of the matrices shown):

```{r get-input-spls}
#| code-fold: true
#| eval: false

get_input_spls(
  mo_presel_supervised,
  mode = "canonical",
  datasets = c("rnaseq", "metabolome")
)
```

```{r show-input-spls}
#| echo: false

map(tar_read(spls_input), \(x) x[1:5, 1:5])
```


`moiraine` also offers helper functions and target factories to facilitate the application of these integration tools. For example, the `diablo_predefined_design_matrix()` function generates, for a given DIABLO input object, one of the three recommended design matrices for DIABLO (null, full or weighted full), while the `diablo_pairwise_pls_factory()` factory creates a list of targets to estimate the optimal design matrix to use for DIABLO based on datasets pairwise correlations estimated using PLS:

<details>

<summary>Code</summary>

::: {.targets-chunk}
```{targets diablo-design-matrix-factory}
#| eval: false

list(
  tar_target(
    diablo_input,                            # DIABLO input object
    get_input_mixomics_supervised(
      mo_presel_supervised,                  # MultiDataSet object (prefiltered)
      group = "status"                       # Samples covariate of interest
    )
  ),
  diablo_pairwise_pls_factory(diablo_input)  # Target factory for design matrix
                                             #   estimation
)
```
:::

</details>

![](images/dag_diablo_design_matrix.png)

In addition, a number of plotting functions have been implemented to visualise different aspects of the integration process: e.g. `diablo_plot_tune()` to show the results of model tuning in DIABLO or `so2pls_plot_summary()` (shown below) to display the percentage of variance explained by each latent component constructed by sO2PLS:

```{r so2pls-plot-summary}
#| code-fold: true
#| fig.height: 6

so2pls_plot_summary(so2pls_final_run)
```

::: {.callout-note}
Wouldn't it be nice to have informative labels for the features in DIABLO's circos plots? With `moiraine`, it is possible to use information from the features metadata provided as labels for the features in the plots. So, we can go from:

```{r diablo-circos-plot}
#| code-fold: true
#| fig.width: 6
#| fig.height: 6

mixOmics::circosPlot(
  diablo_final_run,
  cutoff = 0.7,
  size.variables = 0.5,
  comp = 1
)
```

to:

```{r moiraine-circos-plot}
#| code-fold: true
#| fig.width: 6
#| fig.height: 6

diablo_plot_circos(
  diablo_final_run,
  mo_set,
  label_cols = list(
    "rnaseq" = "Name",
    "metabolome" = "name"
  ),
  cutoff = 0.7,
  size.variables = 0.5,
  comp = 1
)
```

:::

More details about how to use these integration tools through `moiraine` can be found in [Chapters @sec-spls] to [-@sec-diablo].

## Interpreting the integration results

One of the main goals of `moiraine` is to facilitate the interpretation of the omics integration results. To this end, the outcome of any of the supported integration methods can be converted to a standardised integration output format, e.g.:

```{r get-output-mofa}
#| code-fold: true
#| eval: false

get_output(mofa_trained)
```

```{r show-mofa-output}
#| echo: false

mofa_output
```

This object can then be used to visualise the integration results in a number of ways, including:

* percentage of variance explained:

```{r mofa-variance-explained}
#| code-fold: true

plot_variance_explained(mofa_output)
```

* Sample scores as pairwise scatterplots:

```{r mofa-samples-score-ggpairs}
#| code-fold: true
#| fig.height: 8

plot_samples_score(
  mofa_output,                                        # MOFA standardised output
  latent_dimensions = paste("Factor", 1:3),           # MOFA factors to display
  mo_data = mo_set,                                   # MultiDataSet object
  colour_upper = "status",                            # Sample covariate
  scale_colour_upper = scale_colour_brewer(palette = "Set1"), # Custom palette
  shape_upper = "gender",                             # Sample covariate
  colour_lower = "geno_comp_cluster"                  # Sample covariate
) +
  theme(legend.box = "vertical")
```

* Sample scores against a sample covariate of interest (either categorical or continuous):

```{r mofa-samples-score-covariate}
#| code-fold: true
#| fig.height: 5
#| warning: false

plot_samples_score_covariate(
  mofa_output,                             # MOFA standardised output
  mo_set,                                  # MultiDataSet object
  "status",                                # Sample covariate of interest
  colour_by = "status",                    # Other sample covariate
  shape_by = "geno_comp_cluster",          # Other sample covariate
  latent_dimensions = paste("Factor", 1:2) # MOFA factors to display
)
```

* Top contributing features with their importance:

```{r mofa-top-features}
#| code-fold: true
#| fig.width: 10
#| fig.height: 8

plot_top_features(
  mofa_output,                             # MOFA standardised output
  mo_data = mo_set,                        # MultiDataSet object
  label_cols = list(                       # Custom labels for features from
    "rnaseq" = "Name",                     #   features metadata
    "metabolome" = "name"
  ),
  truncate = 25,                           # truncate long feature labels
  latent_dimensions = paste("Factor", 1:2) # MOFA factors to display
)
```

More details can be found in @sec-interpretation.

## Evaluating the integration results

With `moiraine`, it is possible to evaluate the results of a given integration tool against prior information that we have about the features (e.g. knowledge about biological functions or results from a single-omics analysis) or samples (e.g. to assess the success of samples clustering into meaningful groups). For example, we can compare the feature selection performed by DIABLO to results from differential expression analyses performed on the omics datasets:

```{r evaluate-feature-selection-table}
#| code-fold: true
#| collapse: false

evaluate_feature_selection_table(
  diablo_output,                                # DIABLO standardised output
  mo_data = mo_set,                             # MultiDataSet object
  col_names = list(                             # Columns from features metadata
    "rnaseq" = "de_signif",                     #   containing DE outcome
    "metabolome" = "de_signif" 
  ),
  latent_dimensions = "Component 1"             # Latent component to focus on
)
```

In addition, a number of functions are provided to help with features set enrichment (such as over-representation analysis or gene set enrichment analysis), e.g. by generating feature sets from features metadata through `make_feature_sets_from_fm()`, or by ensuring that a proper background set is used in the enrichment analysis, with `reduce_feature_sets_data()`. Information about evaluation of integration results is presented in @sec-evaluation.

## Comparison different integration results

Lastly, `moiraine` facilitates the comparison of the results from different integration methods, or from a same integration method but with different pre-processing options or parameter choices. It is possible to visualise the correlation between the results of different methods, in terms of the latent dimensions they constructed:

```{r comparison-heatmap-corr}
#| code-fold: true
#| eval: false

comparison_heatmap_corr(
  output_list,              # List of integration results (standardised format)
)
```

```{r show-comp-heatmap}
#| echo: false
#| fig.width: 8
#| fig.height: 7.5

comparison_heatmap_corr(
  output_list, 
  latent_dimensions = list(
    "sO2PLS" = "joint component 1",
    "MOFA" = paste("Factor", 1:4)
  )
)
```

Or compare the samples score or features weight from two latent dimensions created by two integration methods:

```{r}
#| code-fold: true
#| eval: false

plot_features_weight_pair(
  list(mofa_output, diablo_output),  # Integration results (stand. format)
  list(                              # Latent dimensions to compare
    "MOFA" = "Factor 1", 
    "DIABLO" = "Component 1"
  ),
  mo_data = mo_set,                 # MultiDataSet object
  features_metric = "importance",   # Plot absolute or signed importance score
  label_cols = list(                # Columns from features metadata to use
    "rnaseq" = "Name",              #  as labels in plot
    "metabolome" = "name"
  )
)
```


```{r show-features-weight-comp}
#| echo: false

plot_features_weight_pair(
  output_list[3:4],
  list("MOFA" = "Factor 1", "DIABLO" = "Component 1"),
  mo_data = mo_set,
  features_metric = "importance",
  label_cols = list(
    "rnaseq" = "Name",
    "metabolome" = "name"
  )
)
```

It is also possible to compute a consensus importance score for each feature, which summarises the contribution of the feature to different latent dimensions, and thus assess which features are highlighted across several integration methods. More details are provided in @sec-comparison.