---
title: 'fcoex: co-expression for single-cell data'
author:
- name: Tiago Lubiana
  affiliation: Computational Systems Biology Laboratory, University of São Paulo, Brazil
output:
  html_document:
    toc: yes
  pdf_document:
    toc: yes
  prettydoc::html_pretty:
    highlight: github
    theme: cayman
package: fcoex
vignette: > 
  %\VignetteIndexEntry{fcoex: co-expression for single-cell data}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

## Introduction and basic pipeline

The goal of fcoex is to provide a simple and intuitive way to generate co-expression modules for single cell data.
It's targeted as a tool for exploratory data analysis, and as a stepping stone for obtaining insights from single cell data. 

It is based in 3 steps:

- Pre-processing and label assignement (prior to fcoex)
- Discretization of gene expression
- Correlation and module detection via the FCBF algorithm (Fast Correlation-Based Filter)

First of all, we will load an a already preprocessed single cell dataset from 10XGenomics.
It was preprocessed according to the [OSCA pipeline](https://osca.bioconductor.org/a-basic-analysis.html#preprocessing-import-to-r), 14/08/2019).
It contains peripheral blood mononuclear cells and the most variable genes.

```{r  Loading datasets, message=FALSE  }
library(fcoex, quietly = TRUE)
library(SingleCellExperiment, quietly = TRUE)
data("mini_pbmc3k")
```

The `mini_pbmc3k` object is an object of the class SingleCellExperiment that we will explores in the vignette.
For more information on the class, check this [Introduction to the SingleCellExperiment Class](https://bioconductor.org/packages/release/bioc/vignettes/SingleCellExperiment/inst/doc/intro.html). 

### Creating the fcoex object

The fcoex object is created from 2 different pieces: a previously normalized expression table (genes in rows) and a target factor with classes for the cells. (Note: fcoex does not need _names_, but only _clusters_, so you can run it even if you do not have names for your clusters.)


```{r Creating fcoex object, message=FALSE }
# Get clusters from the pre-processing
target <- colData(mini_pbmc3k)
target <- target$clusters

# Get normalized table from the pre-processing
exprs <- as.data.frame(assay(mini_pbmc3k, 'logcounts'))

# Create fcoex object
fc <- new_fcoex(data.frame(exprs),target)
```

Once you have set up your fcoex object, the first step is to convert a count matrix into a binarized dataframe.
The binarization is needed for the fcoex algorithm and empirically we notice that significant biological signal is preserved with the default parameters. 

The standard of fcoex works as follows:

For each gene, the maximum and minimum values are stored. 
This range is divided in n bins of equal width (parameter to be set).
The first bin is assigned to the class "low" and all the others to the class "high".

```{r Discretizing dataset, message=FALSE }
fc <- discretize(fc, number_of_bins = 8)
```

### Getting the modules

Other discretization methods are avaible too, and this step affects the final results in many ways. If you have the time, you can try different discretization parameters and observe the impact in the data. 

After the discretization, fcoex builds a co-expression network and extracts modules. 
Correlations are calculated via a information-theory method called Symmetrical Uncertainty. Three steps are present:

1 - Selection of n genes to be considered, ranked by correlation to the target variable. 

2 - Detection of predominantly correlated genes, a feature selection approach defined in the FCBF algorithm

3 - Building of modules around selected genes. Correlations between two genes are kept if they are more correlated to each other than to the target.

You can choose either to have a non-parallel processing, with a progress bar, or a faster parallel processing without progress bar. Up to you. 

```{r Finding cbf modules, message=FALSE }
fc <- find_cbf_modules(fc,n_genes_selected_in_first_step = 200, verbose = FALSE, is_parallel = FALSE)
```

The last step created an adjacency matrix and extracted some co-expression modules. 
We can present that information in a visual way by using the `get_nets` function, that generates network plots for each module. The plots are created in the `ggplot2` format and stored inside the fcoex object. 

The visualizations it generates were heavily inspired by the CEMiTool package, as much of the code in fcoex was.

We will take a look at the first two networks using the `show_net` function.

```{r Plotting module networks, message=FALSE }
fc <- get_nets(fc)

# Taking a look at the first two networks: 
network_plots <- show_net(fc)

network_plots[["CD79A"]]
network_plots[["HLA-DRB1"]]
```
Depending on your graphical device, the network might not appear in full. 
Saving the plots to a vector format, such as pdf, fixes some graphical device issues.

To save the plots, you can run the save plots function, which will create a "./Plots" directory and store plots there in a pdf format.

```{r Saving plots, eval= FALSE, message=FALSE, results='hide'}
save_plots(name = "fcoex_vignette", fc,force = TRUE, directory = "./Plots")
```

Additionaly, you can use the [ggsave](https://ggplot2.tidyverse.org/reference/ggsave.html) funcion of the `ggplot2` package. 

### Running an enrichment analysis

You can also run an over-representation analysis to see if the modules correspond to any known biological pathway.
In this example we will use the Reactome groups available in the CEMiTool package, but you can use any gene set of interest. 
Be sure, though, that the labels of the genes in the gene sets match the ones in the dataset.

It is likely that some modules will not have any enrichment, leading to messages of the type "no gene can be mapped.". That is not a problem.

```{r Running ORA analysis, warning=FALSE}

# You'll need CEMiTool, if you do not have it installed, just run BiocManager::install("CEMiTool")
gmt_filename <- system.file("extdata", "pathways.gmt", package = "CEMiTool")

if (gmt_filename == "")
  {
      print("You likely need to install CEMiTool")
} else {
      gmt_in <- pathwayPCA::read_gmt(gmt_filename,  description = TRUE)

}
fc <- mod_ora(fc, gmt_in)
fc <- plot_ora(fc)
```

Now we can save the plots again. 
Note that we have to set the force parameter equal to TRUE now, as the "./Plots" directory was already created in the previous step. 

```{r Saving plots again,  eval= FALSE, message=FALSE, results='hide'}
save_plots(name = "fcoex_vignette", fc, force = TRUE, directory = "./Plots")
```

### Reclustering the cells to find module-based populations.

We will use the module assignments to subdivide the cells in populations of interest. 
This is a way to explore the data and look for possible novel groupings ignored in the pre-processing.

```{r Reclustering , message=FALSE}
fc <- recluster(fc)
```

Why reclustering? Well, the algorithm behind fcoex considers at the same time inverse and direct correlations. 
Thus, it is nonsense to obtain a plot of the average expression of the genes in a module, for example.  

The reclustering allows us to gather all the signal in the dataset, positive and negative, to see how cells in the dataset behave in relation to that subpart of the transcriptomic space.

### Plotting clusters with schex

After obtaining the new labels, we can visualize them them using UMAP. 
Let's see the population represented in the modules CD79A and HLA-DRB1. 

For different datasets, different module headers will appear. 
It is up to the researcher to select which of those are interesting in their research settings.
The modules are nevertheless ordered based on correlation with the labels, so the first modules tend to be more interesting.

Notably, the clustering patterns are largely influenced by the expression patterns of header genes. 
It is interesting to see that two groups are present, header-positive (HP) and header negative (HN) clusters.

The stratification and exploration of different clustering points of view is 
one of the core features of fcoex.

We will use the package [schex](https://bioconductor.org/packages/release/bioc/html/schex.html) for visualizing the data, but you can use your package of preference.

```{r Visualizing}
identities_based_on_the_HLA_DRB1_module <- idents(fc)$`HLA-DRB1` 
colData(mini_pbmc3k) <- cbind(colData(mini_pbmc3k), `mod_HLA_DRB1` = identities_based_on_the_HLA_DRB1_module )

identities_based_on_the__CD79A_module <- idents(fc)$`HLA-DRB1` 
colData(mini_pbmc3k) <- cbind(colData(mini_pbmc3k), mod_CD79A = idents(fc)$CD79A)

# Let's see the original clusters
library(schex)

mini_pbmc3k <- make_hexbin(mini_pbmc3k, nbins = 40, 
    dimension_reduction = "UMAP", use_dims=c(1,2))

plot_hexbin_meta(mini_pbmc3k, col="clusters", action="majority")

library(gridExtra)
p1 = plot_hexbin_feature_plus(mini_pbmc3k,
    col="clusters", type="logcounts",
    feature="CD79A", action="mean") +
  ggtitle("original clusters (CD79A expression)") +
  theme_void()

p2 =plot_hexbin_feature_plus(mini_pbmc3k,
    col="clusters", type="logcounts",
    feature="HLA-DRB1", action="mean") +
  ggtitle("original clusters (HLA-DRB1 expression)") +
  theme_void()

p3 =  plot_hexbin_feature_plus(mini_pbmc3k,
    col="mod_CD79A", type="logcounts",
    feature="CD79A", action="mean") +
  ggtitle("fcoex CD79A clusters (CD79A expression)") +
  theme_void()

p4 =  plot_hexbin_feature_plus(mini_pbmc3k,
    col="mod_HLA_DRB1", type="logcounts",
    feature="HLA-DRB1", action="mean")+
  ggtitle("fcoex HLA cluster (HLA-DRB1 expression)") +
  theme_void()

grid.arrange(p1, p2, p3, p4, nrow=2)
```