---
title: "OPPAR: Outlier Profile and Pathway Analysis in R"
author: "Soroor Hediyehzadeh"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{OPPAR: Outlier Profile and Pathway Analysis in R}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

Cancer Outlier Profile Analysis (COPA) is a common analysis to identify genes that might be down-regulated or up-regulated only in a proportion of samples with the codition of interest. OPPAR is the R implementation of modified COPA [(mCOPA)](http://jclinbioinformatics.biomedcentral.com/articles/10.1186/2043-9113-2-22) method, originally published by Chenwei Wang et al. in 2012. The aim is to identify genes that are outliers in samples with condition of interest, compared to normal samples. The methods implemented in OPPAR enable the users to perform the analysis in various ways, namely detecting outlier features in control versus condition samples (whether or not there is a information on subtypes), and detecting genes that are outlier in one subtype compared to the other samples, if the subtypes are known. 

OPPAR can also be used for Gene Set Enrichment Analysis (GSEA). Here, a modified version of [GSVA](http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-7) method is implemented. GSVA can be used to determine which samples in the study are enriched for gene expression signatures that are of interest. The `gsva` function in GSVA package returns an enrichment score for each sample, for the given signatures/gene sets. With the current implementation of the method, samples that strongly show enrichment for down(-regulated) gene expression signatures will receive negative scores.
However, Often it is in the interest of the biologists and researchers to get positive scores for samples that are enriched in both up and down signatures. Therefore, the `gsva` function has been modified to assign positive scores to samples that are enriched for the up-regulated and down-regulated gene expression signatures.

OPPAR comes with four functions: 

- **opa()** generates the outlier profile using the method described in    [mcopa](http://jclinbioinformatics.biomedcentral.com/articles/10.1186/2043-9113-2-22)
- **getSampleOutlier()** is used to extract the outliers detected for a given sample(s)
- **getSubtypeProbes()** is used to extract the outliers for a group of related samples e.g subtypes
- **gsva()** A modified version of gsva function in GSVA package is presented here.
    The original function returns negative enrichment scores (es) for samples 
    that are enriched for a gene list of down-regulated gene signature.
    However, it is often of interest of researchers to obtain positive scores 
    for samples that are enriched in both up gene signatures and down gene signatures.
    In the modified version the ranking is reversed for the genes in down gene signature,
    such that they receive high ranks and, therefore, high es scores. The es scores for 
    samples in down gene signature is then added to the es scores in up gene signature
    resulting in large positive es scores for samples displaying enrichment in both up-
    regulated genes and down-regulated genes in a given gene signature.


This vignette illustrates a possible workflow for OPPAR, using [Tomlins et al.](http://www.nature.com/ng/journal/v39/n1/full/ng1935.html) prostate cancer [data](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE6099). In addition, 
[Maupin](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE23952)'s TGFb data have been analyzed for enrichement of a TGFb gene signature in the samples measured in this study.

**Please note although the analysis presented here have been done on microarray studies,
one can apply oppar tools to RPKM values of gene expression measurements in NGS studies.**


-----

# Analysis of Tomlins et al. prostate cancer dataset

Data was retrieved from GEO database, checked for normalization and subsetted according to procedure outlined
in the mCOPA paper. In addition, probes with no annotation were removed. The `impute` package was used to impute the missing values using K-nearest neighbours method (k = 10). The subsetted dataset is available in the package as a sample data, and contains an ExpressionSet object, storing information on samples, genes and gene expressions. We apply `opa` on Tomlins et al. data, then use `getSubtypeProbes` to get all down- regulated and all up-regulated outliers.

`opa` returns the outlier profile matrix, which is a matrix of -1 ( for down-regulated outliers), 0 ( not an outlier) and 1 (up-regulated outlier). For more information see `?opa`

For a brief overview of `oppar` package and functions, please see `?oppar`

```{r, echo=FALSE}
require(knitr)
knitr::opts_chunk$set(message = FALSE, 
											library(oppar), 
											suppressPackageStartupMessages(library(Biobase)),
											suppressPackageStartupMessages(library(limma)),
											suppressPackageStartupMessages(library(GO.db)),
											suppressPackageStartupMessages(library(org.Hs.eg.db))
											)

```

```{r}
data(Tomlins) # loads processed Tomlins data

# the first 21 samples are Normal samples, and the rest of 
# the samples are our cases (metastatic). We, thus, generate a group
# variable for the samples based on this knowledge.

g <- factor(c(rep(0,21),rep(1,ncol(exprs(eset)) - 21)))
g


# Apply opa on Tomlins data, to detect outliers relative to the
# lower 10% (lower.quantile = 0.1) and upper 5% (upper.quantile = 0.95 -- Default) of 
# gene expressions.
tomlins.opa <- opa(eset, group = g, lower.quantile = 0.1)
tomlins.opa

```

The matrix containing the outlier profiles is called `profileMatrix` and can be accessed using the $ operator.
The `upper.quantile` and `lower.quantile` parameters used to run the function can also be retrieved using this
operator.

```{r}
tomlins.opa$profileMatrix[1:6,1:5]
tomlins.opa$upper.quantile
tomlins.opa$lower.quantile
```

We can extract outlier profiles for any individual samples in the `profileMatrix`, using `getSampleOutlier`. see `?getSampleOutlier` for more detailed information

```{r}
getSampleOutlier(tomlins.opa, c(1,5))

```


Extracting down-regulated and up-regulated outliers in all samples using `getSubtypeProbes`:
```{r}

outlier.list <- getSubtypeProbes(tomlins.opa, 1:ncol(tomlins.opa$profileMatrix))
```


We can then obtain a list of GO terms from `org.Hs.eg.db`. Each element of the list will be a GO terms with the Entrez gene IDs corresponding to that term. We can the apply `mroast` from `limma` package for multiple gene set enrichment testing.

```{r}
# gene set testing with limma::mroast
#BiocManager::install(org.Hs.eg.db)
library(org.Hs.eg.db)
library(limma)
org.Hs.egGO2EG
go2eg <- as.list(org.Hs.egGO2EG)
head(go2eg)

# Gene Set analysis using rost from limma

# need to subset gene express data based on up outliers
up.mtrx <- exprs(eset)[fData(eset)$ID %in% outlier.list[["up"]], ]
# get Entrez gene IDs for genes in up.mtrx

entrez.ids.up.mtrx <- fData(eset)$Gene.ID[fData(eset)$ID %in% rownames(up.mtrx)]

# find the index of genes in GO gene set in the gene expression matrix
gset.idx <- lapply(go2eg, function(x){
	match(x, entrez.ids.up.mtrx)
})

# remove missing values
gset.idx <- lapply(gset.idx, function(x){
	x[!is.na(x)]
})

# removing gene sets with less than 10 elements
gset.ls <- unlist(lapply(gset.idx, length))
gset.idx <- gset.idx[which(gset.ls > 10)]

# need to define a model.matrix for mroast
design <- model.matrix(~ g)
up.mroast <- mroast(up.mtrx, index = gset.idx, design = design) 
head(up.mroast, n=5)
```

The GO terms for the first 10 GO Ids detected by `mroast` can be retrieved in the following way.

```{r}
go.terms <- rownames(up.mroast[1:10,])
#BiocManager::install(GO.db)
library(GO.db)
columns(GO.db)
keytypes(GO.db)

r2tab <- select(GO.db, keys=go.terms,
                columns=c("GOID","TERM"), 
                keytype="GOID")
r2tab
```

We repeating the above steps for down-regulated outliers, to see what GO terms they are enriched for.

```{r}
library(org.Hs.eg.db)
library(limma)
org.Hs.egGO2EG
go2eg <- as.list(org.Hs.egGO2EG)
head(go2eg)
# subsetting gene expression matrix based on down outliers
down_mtrx <- exprs(eset)[fData(eset)$ID %in% outlier.list[["down"]], ]
entrez_ids_down_mtrx <- fData(eset)$Gene.ID[fData(eset)$ID %in% rownames(down_mtrx)]

gset_idx_down <- lapply(go2eg, function(x){
	match(x, entrez_ids_down_mtrx)
})

# remove missing values
gset_idx_down <- lapply(gset_idx_down, function(x){
	x[!is.na(x)]
})

# removing gene sets with less than 10 elements
gset_ls_down <- unlist(lapply(gset_idx_down, length))
gset_idx_down <- gset_idx_down[which(gset_ls_down > 10)]

# apply mroast
down_mroast <- mroast(down_mtrx, gset_idx_down, design) 
head(down_mroast, n=5)
```

And extract GO terms for the top 10 results:
```{r}

go_terms_down <- rownames(down_mroast[1:10,])

dr2tab <- select(GO.db, keys=go_terms_down,
                columns=c("GOID","TERM"), 
                keytype="GOID")
dr2tab

```


-----
# Gene Set Enrichment Analysis 

We are now going to perform enrichment analysis for on Maupin's TGFb data (see `?maupin`), given a gene signature. The `maupin` data object contains a matrix containing gene expression measurements on 3 control samples and 3 TGFb induced samples.
We run the modified gsva function introduced in this package to get one large positive scores for samples enriched in the given gene signature, both for down gene signature and up gene signature. This is while the original gsva function
returns negative scores for samples that are enriched in down gene signature, and positive scores for samples enriched in up gene signature. Therefore, the scores returned by the gsva function in this package are the sum of the scores for up gene signature and down gene signature. Note that in order for the modified version of the gsva function to work properly, the `gset.idx.list` has to be a named list, with the up signature gene list being named 'up' and down gene signature gene list being names 'down' (see example code below). Also note that the `is.gset.list.up.down` argument has to be set to TRUE if the user wishes to use the modified version (i.e to get the sum of es scores for up and down gene signatures). See `?gsva` for more details.

```{r}
data("Maupin")
names(maupin)
head(maupin$data)
head(maupin$sig)

geneSet<- maupin$sig$EntrezID    #Symbol  ##EntrezID # both up and down genes:

up_sig<- maupin$sig[maupin$sig$upDown == "up",]
d_sig<- maupin$sig[maupin$sig$upDown == "down",]
u_geneSet<- up_sig$EntrezID   #Symbol   # up_sig$Symbol  ## EntrezID
d_geneSet<- d_sig$EntrezID

enrichment_scores <- gsva(maupin$data, list(up = u_geneSet, down= d_geneSet), mx.diff=1,
               verbose=TRUE, abs.ranking=FALSE, is.gset.list.up.down=TRUE, parallel.sz = 1 )$es.obs
			

head(enrichment_scores)

## calculating enrichment scores using ssgsea method
# es.dif.ssg <- gsva(maupin, list(up = u_geneSet, down= d_geneSet),
# 														 verbose=TRUE, abs.ranking=FALSE, is.gset.list.up.down=TRUE,
# 														 method = "ssgsea")
```

A histogram of enrichment scores is plotted below and the density of es scores for TGFb samples is shown in red. The distribution of es scores of control samples is shown in blue. As it can be seen from the plot below, TGFb induced samples that are expected to be enriched in the given TGFb signature have received positive scores and are on the right side of the histogram, whereas the control samples are on the left side of the histogram. In addition, TGFb induced samples and contorl samples have been nicely separated from each other.

```{r, fig.width= 6, fig.height= 5, fig.align= "center"}
hist(enrichment_scores, main = "enrichment scores", xlab="es")
lines(density(enrichment_scores[,1:3]), col = "blue") # control samples
lines(density(enrichment_scores[,4:6]), col = "red") # TGFb samples
legend("topleft", c("Control","TGFb"), lty = 1, col=c("blue","red"), cex = 0.6)


```