---
title: "Oncomix Vignette"
author: "Daniel G. Piqué"
date: "`r Sys.Date()`"
output:
    html_document:
        toc: true
        toc_float:
            collapsed: false
            smooth_scroll: false
vignette: >
    %\VignetteIndexEntry{OncoMix Vignette}
    %\VignetteEngine{knitr::rmarkdown}
    \usepackage[utf8]{inputenc}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo=TRUE, warning=FALSE, message=FALSE)
```

## 1. Introduction

### 1.1 Motivation for Developing Oncomix

The advent of large, well-curated databases, such as the [genomic data
commons](https://gdc.cancer.gov/), that contain RNA sequencing data from
hundreds of patient tumors has made it possible to identify oncogene candidates
solely based off of patterns present in mRNA expression data. Oncomix is the
first method developed to identify oncogenes in a visually-interpretable manner
from RNA-sequencing data in large cohorts of patients.

Oncomix is an R package for identifying oncogene candidates based off of
2-component Gaussian mixture models. It estimates parameters using the
expectation maximization procedure as implemented in the R package mclust. This
tutorial will demonstrate how to identify oncogene candidates from a set of mRNA
sequencing data. We start by loading the package:

```{r}
#devtools::install_github("dpique/oncomix", build_vignettes=T)
library(oncomix)
```


### 1.2 Distribution of Oncogene mRNA Expression
We first explore the idea of what the distribution of gene expression values for
a oncogene should look like. It is known that oncogenes such as *ERBB2* are
overexpressed in 15-20% of all breast cancer patients. In addition, oncogenes
should not be expressed in normal tissue. Based on this line of reasoning, we
formulate a model for the distribution of oncogene mRNA expression values in a
population of both tumor (teal curves) and normal (red-orange curves) tissue:

```{r}
library(ggplot2)
oncoMixIdeal()
```

The x-axis represents mRNA expression values, with lower values toward the left
and larger values (i.e. higher expression) toward the right. The y axis
represents density. The teal curves represent the best-fitting Gaussian
probability distribution (PD) over expression values from a single gene obtained
from multiple tumor samples. The red-orange curves represent the PD over
expression values from the same gene obtained from multiple adjacent normal
tissue samples. This mixture model is applied once to the tumor data and again
(separately) to the adjacent normal data, hence the 4 curves.

The advantage of applying a 2-component mixture model is that we are able to
capture biologically-relevant clusters of gene expression that may naturally
exist in the data. Otherwise, we might represent our data with just a single
curve sitting in the middle of what really are 2 distinct clusters. Visually, we
see that for a theoretical oncogene, there is a *subgroup* of tumors that
overexpresses this gene relative to normal tissue.

### 1.3 Comparison of Oncomix to Existing Differential Expression Methods

We now conceptually compare oncomix to the techniques employed by traditional 
differential expression analysis (e.g. Student's t-test, as employed by 
[limma](https://www.ncbi.nlm.nih.gov/pubmed/25605792), or 
[DESeq2](https://www.ncbi.nlm.nih.gov/pubmed/25516281)). These existing 
approaches make strong assumptions -- namely, that the data from a particular 
group are well-described by distributions with mass concentrated around a 
central value (such as a 'mean'). If we were to use one of these approaches on a
large dataset, our assumption would be that oncogenes are overexpressed in 
*every* tumor sample compared to normal tissue. This assumption can be 
visualized below:

```{r}
oncoMixTraditionalDE()
```

The red-orange curve represents the gene expression values from the adjacent 
normal data, and the teal curve represents the gene expression values from the 
tumor data. Note, however, that the goal for a DE analysis using existing
methods (such as limma) would be to find genes that maximize the difference 
between these two curves. This approach does not represent our knowledge of how 
oncogenes are expressed in a population of individuals -- that is, highly 
expressed in a subset of patient tumors, and lowly expressed in adjacent normal 
tissue.

## 2. Identifying Oncogene Candidates

### 2.1 Loading Example Data and Exploring the `mixModelParams` Object

Now, we will load an example dataset that contains expression values for 700 
mRNA isoforms obtained from paired samples of breast tumor (`exprTumIsof`) and 
adjacent normal(`exprNmlIsof`) breast tissue from 113 patients in the [Genomic
Data Commons](https://gdc.cancer.gov/).  We will fit the mixture model using the
oncomix function `mixModelParams`, which takes dataframes that contain patients
as rows and mRNA isoforms/genes as columns. The number of columns (genes) should
be the same between both dataframes, though the number of rows can vary.

```{r}
data(exprNmlIsof, exprTumIsof, package="oncomix")

##look at the matrix of mRNA expression data from adjacent normal samples
dim(exprNmlIsof)
exprNmlIsof[1:5, 1:5] 

##look at the matrix of mRNA expression data from tumors
dim(exprTumIsof)
exprTumIsof[1:5, 1:5] 

##fits the mixture models, will take a few minutes
mmParams <- mixModelParams(exprNmlIsof, exprTumIsof) 
head(mmParams)
```

The object returned by `mixModelParams` is a dataframe with rows corresponding
to genes and 12 columns containing mixture model parameters. The rows are sorted
according to the `score` column, with the first row containing the highest
oncomix score (defined below). The meaning of the dataframe columns are
described below:

- `nMu1` = the mean ($\mu$) of the Gaussian curve with the smaller mean
fit to the adjacent normal expression data (referred to as Mode 1).
- `nMu2` = the mean ($\mu$) of the Gaussian curve with the larger mean fit
to the adjacent normal expression data (referred to as Mode 2).
- `nVar` = the variance ($\sigma$) of the two Gaussian curves fit to the
adjacent normal expression data (fixed to be equal between the two curves)
- `nPi1` = the proportion of adjacent normal samples assigned to the
Gaussian curve with mean `nMu1`

- `tMu1` = the mean ($\mu$) of the Gaussian curve with the smaller mean
fit to the tumor expression data (referred to as Mode 1).
- `tMu2` = the mean ($\mu$) of the Gaussian curve with the larger mean fit
to the tumor expression data (referred to as Mode 2).
- `tVar` = the variance ($\sigma$) of the two Gaussian curves fit to the
tumor expression data (fixed to be equal between the two curves).
- `tPi1` = the proportion of tumor samples assigned to the Gaussian curve
with mean `tMu1`.

- `deltaMu2` = the difference between the means of the two curves between
groups. `tMu2` - `nMu2`. May be negative or positive.
- `deltaMu1` = the difference between the means of the two curves between
groups. `tMu1` - `nMu1`. May be negative or positive.
- `SI` = the selectivity index, or the proportion of adjacent normal samples
with expression values less than the boundary defined by `tMu2` -
`tMu1`. The selectivity index for the ith gene is computed as:

$$SI_i  = \frac{1}{N}\sum_{j=1}^N \Bigg\{ \begin{array}{ll} 1,~ if~x_{ij} <
\frac{\mu_{iLT}+\mu_{iHT}}{2}  \\ 
0, ~ otherwise 
\end{array},
$$

- where $N$ is the number of adjacent normal samples, and $x_{ij}$ is the
expression value of the ith gene in the jth adjacent normal sample.
$\mu_{iHT}$ is the mean of higher/larger Gaussian from the ith gene in tumor
samples, and $\mu_{iLT}$ is the mean of the smaller/lower Gaussian from the
ith gene in the tumor samples.

- `score` = The score for the $i^{th}$ gene is calculated as follows: 
$$score_i  = SI * [(\Delta\mu_{2i} - \Delta\mu_{1i}) - (var_{Ni} -
var_{Ti})] ~~~~~, $$

- where $SI$ is the selectivity index described above, $\Delta\mu_{2i}$ and
$\Delta\mu_{1i}$ are as described above (equivalent to `deltaMu2` and
`deltaMu1`), $var_{Ni}$ is the common variance across the adjacent normal
samples (equivalent to `n.var`), and $var_{Ti}$ is the common variance
across the tumor samples (equivalent to `tVar`).

### 2.2 Selecting Genes that Appear Most Like the Idealized Oncogene

We can now use the `mmParams` dataframe to figure out which of our isoforms in
our gene set are most similar in terms of their distribution to our ideal
oncogene candidate.

For example, lets say that we wanted to select a subset of gene isoforms that
most resembled the theoretically ideal oncogene. We can capture all of the genes
meeting or exceeding specified thresholds using the `topGeneQuant` function.
Here, we want to maximize `deltMu2`, minimize `deltMu1`, and maximize the `SI`.

```{r}
topGeneQuant <- topGeneQuants(mmParams, deltMu2Thr=99, 
    deltMu1Thr=10, siThr=.99)
print(topGeneQuant)
```

The results of the `topGeneQuants` function is to return a subsetted dataframe
containing only those isoforms that met or exceeded the specified threshold.
Here, there were `r nrow(topGeneQuant)` isoforms.

We can also select the top $N$ genes that most closely resemble the oncogene
candidate by simply selecting the first $N$ rows from the `mmParams` object
(e.g. `mmParams[1:N,]`). This is because the `mixModelParams` function returns a
dataframe sorted by the score, with the 1st row containing the isoform with the
highest score. There is an explanation of how the score is calculated above.

```{r}
mmParamsTop10 <- mmParams[1:10,]
print(mmParamsTop10)
```

## 3. Visualize the Output

### 3.1 Visualize Isoforms with a High SI & Oncomix Score

Now, we will visualize the distribution of gene expression values for a 
particular isoform. Specifically, we will create a overlapping histogram of an 
single isoform's expression values across both tumor (teal) and adjacent normal 
(red) samples with the best-fitting Gaussian curves superimposed. The isoform 
that we want to visualize here, uc002jxc.2, is one of the isoforms in the output
from `topGeneQuant` function. It also ranks highly in terms of its oncomix 
score, so it should have a distributional profile that is more similar to our 
theoretical oncogene that most other isoforms in this dataset.

```{r}
isof = "uc002jxc.2"
plotGeneHist(mmParams, exprNmlIsof, exprTumIsof, isof)
```

Next, we will create a scatterplot with the axes corresponding to the 
differences between component means. Our oncogene candidates will be those genes
that appear in the upper right quadrant of this scatterplot. The x axis 
corresponds to the difference between the means of the curves with the larger 
Gaussians (`deltaMu2`), and the y axis corresponds to the difference between the
means of the curves with the smaller Gaussians (`deltaMu1`) between the two 
treatments.

Here, $\alpha$ (in the title) is a term that is present in the denominator of 
the value of the y-axis and functions as an automatic scaling parameter to set 
the range of the y-axis to be approximately equal to the range of the x-axis.

```{r, fig.width=7, fig.height=6.5}
scatterMixPlot(mmParams)
```

We would expect isoforms that maximize `deltaMu1` and minimize `deltaMu2` to be
most visually similar to the theoretical oncogene candidate, and thus to be
present within the upper right quadrant of this histogram. However, due to the
large variance displayed by some of these isoforms, not all isoforms in the
upper right quadrant appear like the theoreticaly ideal oncogene. We developed
an index, termed the selectivity index (SI), that helps highlight genes that
follow our ideal profile. The SI ranges from 0 to 1, and genes with a larger 
selectivity index will follow more closely the ideal oncogene. Now, we will 
highlight the isoforms with a selectivity index greater than .99 using the 
`selIndThresh` argument to narrow our search.

```{r, fig.width=7, fig.height=6.5}
scatterMixPlot(mmParams, selIndThresh=.99)
```

We can also highlight where the top 10 isoforms with the highest score fall in
the scatterplot.

```{r}
scatterMixPlot(mmParams=mmParams, geneLabels=rownames(mmParamsTop10))
```

### 3.2 Visualize the Distribution of Isoforms that Map to Known Oncogenes

We can check the distribution of the isoforms that map to known human oncogenes.
To do this, we will use genes classified as oncogenes from the [ONGene 
database](http://ongene.bioinfo-minzhao.org/), which is first literature-curated
database of oncogenes. The [paper](https://www.ncbi.nlm.nih.gov/pubmed/28162959)
associated with the ONGene database was published in 2017.

We mapped the oncogenes in the ONGene database (which are in gene symbols) to 
ucsc isoforms, which is the gene format that we have in our original datasets, 
using an R interface to the public UCSC MySQL database (queried in Sept 2017). 
The ouput from this mapping is stored as `queryRes` object and is available as
part of this package. We'll show where the top 5 isoforms mapping to oncogenes
in the ONGene database land on the scatterplot shown above.

```{r}
##The code that follows was used to generate the `queryRes` object
##in September 2017.
##
##install.packages("RMySQL")
##library(RMySQL)
##
##read in a table of known human oncogenes from the ONGene database
##ongene <- read.table("http://ongene.bioinfo-minzhao.org/ongene_human.txt",
##    header=TRUE, sep="\t", quote="", stringsAsFactors=FALSE, row.names=NULL)
##
##send a sql query to UCSC to map the human oncogenes to ucsc isoform ids
##ucscGenome <- dbConnect(MySQL(), user="genome", 
##    host="genome-mysql.cse.ucsc.edu", db='hg19')
##createGeneQuery <- function(name){ #name is a character vector
##    p1 <- paste(name, collapse='\',\'')
##    p2 <- paste('(\'',p1, '\')',sep="")
##    return(p2)
##}
##geneQ <- createGeneQuery(ongene$OncogeneName)
##queryRes <- dbGetQuery(ucscGenome, 
##    paste0("SELECT kgID, geneSymbol FROM kgXref WHERE geneSymbol IN ",
##        geneQ, " ;"))
##dbDisconnect(ucscGenome)

##The database mapping ucsc symbols to gene symbols is loaded below,
##without needing to access the internet.
data(queryRes, package="oncomix")

##Merge the queryRes & mmParams dataframes
queryRes$kgIDs <- substr(queryRes$kgID, 1, 8)
mmParams$kgIDs <- substr(rownames(mmParams), 1, 8)
mmParams$kgID <- rownames(mmParams)
mmParamsMg <- merge(mmParams, queryRes, by="kgIDs", all.x=TRUE)
rownames(mmParamsMg) <- mmParamsMg$kgID.x

## Show the top 5 isoforms with the highest score 
## in our dataset that map to known oncogenes
mmParamsMg <- mmParamsMg[with(mmParamsMg, order(-score)), ]
mmParamsMgSbst <- subset(mmParamsMg, !is.na(geneSymbol))[1:5,]
print(mmParamsMgSbst)

scatterMixPlot(mmParams=mmParams, geneLabels=rownames(mmParamsMgSbst))
```

If you are interested in a particular oncogene, then you can plug the name of
that gene (as long as it is in your original dataset) into the `geneLabels`
argument of the `scatterMixPlot` function, which will highlight those genes
entered into the `geneLabels` argument on the scatterplot.

Note: Not all well-characterized oncogenes fall into or near the upper right
quadrant of this scatterplot (explored more in the analysis below). This is
expected because oncogenes can arise via a variety of mechanisms, such as via
mutational activation (eg BRAF V600E), which may not be associated with
increased expression of the gene. mRNA overexpresion is one of several ways that
an oncogene can drive tumor behavior, and it is this class of oncogenes that our
method seeks to detect.

### 3.3 Visualize the Distribution of the Oncomix Score

We will now check the distribution of the oncomix `score` across all gene 
isoforms. The histograms are superimposed and are colored by whether the isoform
maps to a gene in the ONGene database (orange, n = `r 
table(is.na(mmParamsMg$geneSymbol))[1]`) or not (purple, n = `r 
table(is.na(mmParamsMg$geneSymbol))[2]`).

```{r}
library(RColorBrewer)
col <- brewer.pal(3, "Dark2")
ggplot(mmParamsMg, aes(x=score, y=..density.., fill=is.na(geneSymbol))) +
    geom_histogram(data=subset(mmParamsMg, is.na(geneSymbol)), 
        fill=col[2], alpha=0.5) + 
    geom_histogram(data=subset(mmParamsMg, !is.na(geneSymbol)), 
        fill=col[3], alpha=0.5) +   
    theme_classic() + xlab("OncoMix Score") + theme_classic() 
```

The distribution of the oncomix scores looks equivalent between isoforms in the 
ONGene database versus those not in the ONGene database. However, the short 
right tail of this distribution, which contains isoforms with high scores, also 
preferentially consists of isoforms that map to genes in the ONGene database.
The few isoforms that are not in the ONGene database and that have high scores
may represent oncogenes that are yet to be discovered.

## 4. Session Info

```{r}
sessionInfo()
```

Please email me at daniel.pique@med.einstein.yu.edu with any suggestions,
questions, or comments. Thank you!