---
title: "debCAM User Manual"
author: "Lulu Chen"
date: "`r Sys.Date()`"
bibliography: debcam.bib
output: 
  BiocStyle::html_document:
    toc: true
vignette: >
  %\VignetteIndexEntry{debCAM User Manual}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Introduction

Convex Analysis of Mixtures (CAM) is a fully unsupervised computational method to analyze tissue samples composed of unknown numbers and varying proportions of distinct subpopulations [@CAM2016]. CAM assumes that the measured expression level is the weighted sum of each subpopulation’s expression, where the contribution from a single subpopulation is proportional to the abundance and specific expression of that subpopulation. This linear mixing model can be formulated as $\mathbf{X'=AS'}$. CAM can identify molecular markers directly from the original mixed expression matrix, $\mathbf{X}$, and further estimate the constituent proportion matrix, $\mathbf{A}$, as well as subpopulation-specific expression profile matrix, $\mathbf{S}$.

`debCAM` is an R package developed for tissue heterogeneity characterization by CAM algorithm. It provides basic functions to perform unsupervised deconvolution on mixture expression profiles by CAM and some auxiliary functions to help understand the subpopulation-specific results. `debCAM` also implements functions to perform supervised deconvolution based on prior knowledge of molecular markers, S matrix or A matrix. Semi-supervised deconvolution can also be achieved by combining molecular markers from CAM and from prior knowledge to analyze mixture expressions.

# Quick Start

The function `CAM()` includes all necessary steps to decompose a matrix of mixture expression profiles. There are some optional steps upstream of `CAM()` that downsample the matrix for reducing running time. Each step in `CAM()` can also be performed separately if you prefer a more flexible workflow. More details will be introduced in sections below.

Starting your analysis by `CAM()`, you need to specify the range of possible subpopulation numbers and the percentage of low/high-expressed molecules to be removed. Typically, 30% ~ 50% low-expressed genes can be removed from gene expression data. Much less low-expressed proteins are removed, e.g. 0% ~ 10%, due to a limited number of proteins in proteomics data. The removal of high-expressed molecules has much less impact on results, usually set to be 0% ~ 10%.

```{r, eval=FALSE}
rCAM <- CAM(data, K = 2:5, thres.low = 0.30, thres.high = 0.95)
```

# Unsupervised Deconvolution

## Datatypes and Input Format

Theoretically, `debCAM` accepts any molecular expression data types as long as the expressions follow the linear mixing model. We have validated the feasibility of CAM in gene expression data (microarray, RNAseq), proteomics data and DNA methylation data. Requirements for the input expression data:

- be in non-log linear space with non-negative numerical values (i.e.>=0).
- be already processed by normalization and batch effect removal.
- no missing values; the molecules containing missing values should be removed prior to CAM.
- no all-zero expressed molecules; otherwise, all-zero expressed molecules will be removed internally.

The input expression data should be stored in a matrix. Data frame, SummarizedExperiment or ExpressionSet object is also accepted and will be internally coerced into a matrix format before analysis. Each column in the matrix should be a tissue sample. Each row should be a probe/gene/protein/etc. Row names should be provided so that CAM can return the names of detected markers. Otherwise, rows will be automatically named by 1,2,3,...

## CAM Workflow

We use a data set downsampled from [GSE19830](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE19830) as an example to show CAM workflow. This data set provides gene expression profiles of pure tissues (brain, liver, lung) and their biological mixtures with different proportions.  

```{r, echo=TRUE}
library(debCAM)

data(ratMix3) 
#ratMix3$X: X matrix containing mixture expression profiles to be analyzed
#ratMix3$A: ground truth A matrix containing proportions
#ratMix3$S: ground truth S matrix containing subpopulation-specific expression profiles

data <- ratMix3$X
#10000 genes * 21 tissues
#meet the input data requirements
```

### Analysis by CAM

Unsupervised deconvolution can be achieved by the function `CAM()` with a simple setting as Section \@ref(quick-start) introduced. Other critical parameters are `dim.rdc` (reduced data dimension) and `cluster.num` (number of clusters). Increasing them will bring more time complexity. We can also specify `cores` for parallel computing configured by `r Biocpkg("BiocParallel")`. `cores = 0` will disable parallel computing. No `cores` argument will invoke one core for each element in `K`. Setting the seed of the random number generator prior to CAM can generate reproducible results. 

```{r, echo=TRUE, message=FALSE}
set.seed(111) # set seed for internal clustering to generate reproducible results
rCAM <- CAM(data, K = 2:5, thres.low = 0.30, thres.high = 0.95)
#CAM return three sub results: 
#rCAM@PrepResult contains details corresponding to data preprocessing.
#rCAM@MGResult contains details corresponding to marker gene clusters detection.
#rCAM@ASestResult contains details corresponding to A and S matrix estimation.
```

### Model Selection by MDL Curves

We used MDL, a widely-adopted and consistent information theoretic criterion, to guide model selection. The underlying subpopulation number can be decided by minimizing the total description code length:

```{r, fig.height=6, fig.width=6}
plot(MDL(rCAM), data.term = TRUE)
```

### Estimated A and S matrix

The A and S matrix estimated by CAM with a fixed subpopulation number, e.g. K = 3, can be obtained by

```{r, echo=TRUE}
Aest <- Amat(rCAM, 3)
Sest <- Smat(rCAM, 3)
```

### Detected Marker Genes

The marker genes detected by CAM and used for A matrix estimation can be obtained by

```{r, echo=TRUE}
MGlist <- MGsforA(rCAM, K = 3) #for three subpopulations
```

Data preprocessing has filtered many genes, among which there are also some biologically-meaningful marker genes. So we need to check each gene again to find all the possible markers. Two statistics based on the subpopulation-specific expressions are exploited to identify marker genes with certain thresholds. The first is OVE-FC (one versus everyone - fold change) (Yu et al. 2010). The second is the lower confidence bound of bootstrapped OVE-FC at $\alpha$ level. 

```{r, echo=TRUE}
MGstat <- MGstatistic(data, Aest, boot.alpha = 0.05, nboot = 1000)
MGlist.FC <- lapply(seq_len(3), function(x) 
    rownames(MGstat)[MGstat$idx == x & MGstat$OVE.FC > 10])
MGlist.FCboot <- lapply(seq_len(3), function(x) 
    rownames(MGstat)[MGstat$idx == x & MGstat$OVE.FC.alpha > 10])
```

The thresholds above are set arbitrary and will impact the number of resulted markers significantly. Each subpopulation can also have different threshold. To make threshold setting more easier, we can also set thresholds to be quantile of fold changes and margin-of-errors of one subpopulation's input markers. Margin-of-error is the distance between one gene and the simplex defined by column vectors of A matrix [@CAM2016]. Margin-of-error threshold can be relaxed to a value larger than the maximum. 

```{r, echo=TRUE}
MGlist.re <- reselectMG(data, MGlist, fc.thres='q0.2', err.thres='q1.2')
#q0.2: 0.2-quantile
#q1.2: 1-quantile (maximum) times 1.2
```

### (Optional) Re-estimation

It is optional to re-estimate A and S matrix based on the new marker list and/or apply Alternating Least Square (ALS) method to further reduce mean squared error. Note that allowing too many iterations of ALS may bring the risk of a significant deviation from initial values. The constraint for methylation data, $\mathbf{S\in[0,1]}$, will be imposed during re-estimation.

```{r, echo=TRUE}
rre <- redoASest(data, MGlist.re, maxIter = 2, methy = FALSE)
#rre$Aest: re-estimated A matrix
#rre$Sest: re-estimated S matrix
```

### Simplex Plot - 2D

Fundamental to the success of CAM is that the scatter simplex of mixed expressions is a rotated and compressed version of the scatter simplex of pure expressions, where the marker genes are located at each vertex. `simplexplot()` function can show the scatter simplex and detected marker genes in a 2D plot. The vertices in the high-dimensional simplex will still locate at extreme points of the low-dimensional simplex. 

```{r, fig.height=6, fig.width=6}
layout(matrix(c(1,2,3,4), 2, 2, byrow = TRUE))
simplexplot(data, Aest, MGlist, main=c('Initially detected markers'))
simplexplot(data, Aest, MGlist.FC, main=c('fc > 10'))
simplexplot(data, Aest, MGlist.FCboot, 
            main=c(expression(bold(paste('fc(bootstrap,',alpha,'=0.05) > 10')))))
simplexplot(data, Aest, MGlist.re, main=c("fc >= 'q0.2', error <= 'q1.2'"))
```

The colors and the vertex order displayed in 2D plot can be changed as

```{r, fig.height=6, fig.width=6}
simplexplot(data, Aest, MGlist.FCboot, 
            data.extra=rbind(t(ratMix3$A),t(Aest)),
            corner.order = c(2,1,3), col = "blue", 
            mg.col = c("red","orange","green"),
            ex.col = "black", ex.pch = c(19,19,19,17,17,17))
legend("bottomright", cex=1.2, inset=.01, c("Ground Truth","CAM Estimation"),
       pch=c(19,17), col="black")
```

### Simplex Plot - 3D

We can also observe the convex cone and simplex of mixture expressions in 3D space by PCA. Note that PCA cannot guarantee that vertices are still preserved as extreme points of the dimension-reduced simplex.

Code to show convex cone:

```{r, eval=FALSE}
library(rgl)
Xp <- data %*% t(PCAmat(rCAM))
plot3d(Xp[, 1:3], col='gray', size=3, 
       xlim=range(Xp[,1]), ylim=range(Xp[,2:3]), zlim=range(Xp[,2:3]))
abclines3d(0,0,0, a=diag(3), col="black")
for(i in seq_along(MGlist)){
    points3d(Xp[MGlist[[i]], 1:3], col= rainbow(3)[i], size = 8)
}
```

Code to show simplex:

```{r, eval=FALSE}
library(rgl)
clear3d()
Xproj <- XWProj(data, PCAmat(rCAM))
Xp <- Xproj[,-1]
plot3d(Xp[, 1:3], col='gray', size=3, 
       xlim=range(Xp[,1:3]), ylim=range(Xp[,1:3]), zlim=range(Xp[,1:3]))
abclines3d(0,0,0, a=diag(3), col="black")
for(i in seq_along(MGlist)){
    points3d(Xp[MGlist[[i]], 1:3], col= rainbow(3)[i], size = 8)
}
```

### Validation with Ground Truth

If the ground truth A and S matrix are available, the estimation from CAM can be evaluated:

```{r, echo=TRUE}
cor(Aest, ratMix3$A)
cor(Sest, ratMix3$S)
```

Considering the presence of many co-expressed genes (housekeeping genes) may dominate the correlation coefficients between ground truth and estimated expression profiles, it is better to assess correlation coefficients over marker genes only.

```{r, echo=TRUE}
unlist(lapply(seq_len(3), function(k) {
    k.match <- which.max(cor(Aest[,k], ratMix3$A));
    mgk <- MGlist.FCboot[[k]];
    corr <- cor(Sest[mgk, k], ratMix3$S[mgk, k.match]);
    names(corr) <- colnames(ratMix3$A)[k.match];
    corr}))

```

## Alternative CAM Workflow

There major steps in CAM (data preprocessing, marker gene cluster detection, and matrix decomposition) can also be performed separately as a more flexible choice.

```{r, echo=TRUE}
set.seed(111)

#Data preprocession
rPrep <- CAMPrep(data, thres.low = 0.30, thres.high = 0.95)

#Marker gene cluster detection with a fixed K
rMGC <- CAMMGCluster(3, rPrep)

#A and S matrix estimation
rASest <- CAMASest(rMGC, rPrep, data)

#Obtain A and S matrix
Aest <- Amat(rASest)
Sest <- Smat(rASest)

#obtain marker gene list detected by CAM and used for A estimation
MGlist <- MGsforA(PrepResult = rPrep, MGResult = rMGC)

#obtain a full list of marker genes 
MGstat <- MGstatistic(data, Aest, boot.alpha = 0.05, nboot = 1000)
MGlist.FC <- lapply(seq_len(3), function(x) 
    rownames(MGstat)[MGstat$idx == x & MGstat$OVE.FC > 10])
MGlist.FCboot <- lapply(seq_len(3), function(x) 
    rownames(MGstat)[MGstat$idx == x & MGstat$OVE.FC.alpha > 10])
MGlist.re <- reselectMG(data, MGlist, fc.thres='q0.2', err.thres='q1.2')
```

## Optional Sample Clustering

We have implemented PCA in `CAM()`/`CAMPrep()` to reduce data dimensions. Sample clustering, as another data dimension reduction method, is optional prior to `CAM()`/`CAMPrep()`. 

```{r, echo=TRUE, message=FALSE}
#clustering
library(apcluster)
apres <- apclusterK(negDistMat(r=2), t(data),  K = 10) 
#You can also use apcluster(), but need to make sure the number of clusters is large than potential subpopulation number.

data.clusterMean <- lapply(slot(apres,"clusters"), 
                           function(x) rowMeans(data[, x, drop = FALSE]))
data.clusterMean <- do.call(cbind, data.clusterMean)

set.seed(111)
rCAM <- CAM(data.clusterMean, K = 2:5, thres.low = 0.30, thres.high = 0.95)
# or rPrep <- CAMPrep(data.clusterMean, thres.low = 0.30, thres.high = 0.95)
```

We can still follow the workflow in \@ref(cam-workflow) or \@ref(alternative-cam-workflow) to obtain marker gene list and estimated S matrix. However, the estimated A matrix is for the new data composed of cluster centers. The A matrix for the original data can be obtained by

```{r, echo=TRUE}
Sest <- Smat(rCAM,3)
MGlist <- MGsforA(rCAM, K = 3)
Aest <- AfromMarkers(data, MGlist)
#(Optional) alternative re-estimation of A and S matrix
rre <- redoASest(data, MGlist, maxIter = 10) 
```

## Case Study
### GSE11058

[GSE11058](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE11058) run four immune cell lines on microarrays as well as their mixtures of various relative proportions. The code chunk below shows how to use CAM to blindly separate the four mixtures into four pure cell lines. Note that this data set contains 54613 probe/probesets. We can reduce running time by downsampling probe/probesets or remove more low-expressed genes (e.g. 70%).

```{r, echo=TRUE, message=FALSE}
#download data and phenotypes
library(GEOquery)
gsm<- getGEO('GSE11058')
pheno <- pData(phenoData(gsm[[1]]))$characteristics_ch1
mat <- exprs(gsm[[1]])
mat <- mat[-grep("^AFFX", rownames(mat)),]
mat.aggre <- sapply(unique(pheno), function(x) rowMeans(mat[,pheno == x]))
data <- mat.aggre[,5:8]

#running CAM
set.seed(111)
rCAM <- CAM(data, K = 4, thres.low = 0.70, thres.high = 0.95)
Aest <- Amat(rCAM, 4)
Aest

#Use ground truth A to validate CAM-estimated A matrix
Atrue <- matrix(c(2.50, 0.50, 0.10, 0.02,
              1.25, 3.17, 4.95, 3.33,
              2.50, 4.75, 1.65, 3.33,
              3.75, 1.58, 3.30, 3.33), 4,4,
              dimnames = list(c("MixA", "MixB", "MixC","MixD"),
                               c("Jurkat", "IM-9", "Raji", "THP-1")))
Atrue <- Atrue / rowSums(Atrue)
Atrue
cor(Aest, Atrue)
```

### GSE41826 (methylation)

[GSE41826](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE41826) conducted a reconstitution experiment by mixing neuron and glial derived DNA from a single individual from 10% ~ 90%. The code chunk below shows how to use CAM to blindly separate 9 mixtures into neuron and glial specific CpG methylation quantification. Additional requirements of running CAM on methylation data: 

- Use beta-value, not M-value, as methylation quantification.
- CpG sites in X or Y chromosomes must be removed if mixture tissues are from both males and females. 
- CpG sites needs to be downsampled since the huge number of CpG sites will slow the clustering process. 

```{r, echo=TRUE, message=FALSE, eval=FALSE}
#download data
library(GEOquery)
gsm <- getGEO('GSE41826')
mixtureId <- unlist(lapply(paste0('Mix',seq_len(9)), 
    function(x) gsm[[1]]$geo_accession[gsm[[1]]$title==x]))
data <- gsm[[1]][,mixtureId]

#Remove CpG sites in sex chromosomes if tissues are from both males and females
#Not necessary in this example as mixtures are from the same subject
#gpl<- getGEO('GPL13534')
#annot<-dataTable(gpl)@table[,c('Name','CHR')]
#rownames(annot) <- annot$Name
#annot <- annot[rownames(data),]
#data <- data[annot$CHR != 'X' & annot$CHR != 'Y',]

#downsample CpG sites
featureId <- sample(seq_len(nrow(data)), 50000)

#Running CAM
#When input data has too many probes, lof has to be disabled due to space limit.
#When cluster.num is large, quick.select can be enable to increase the speed
rCAM <- CAM(data[featureId,], K = 2, thres.low = 0.10, thres.high = 0.60,
            cluster.num = 100, MG.num.thres = 10, 
            lof.thres = 0, quick.select = 20)
MGlist <- MGsforA(rCAM, K = 2)

#Identify markers from all CpG sites
MGlist.re <- reselectMG(data, MGlist, fc.thres='q0.2', err.thres='q1.2')

#re-esitmation with methylation constraint
rre <- redoASest(data, MGlist.re, maxIter = 20, methy = TRUE)

#Validation using ground truth A matrix
Atrue <- cbind(seq(0.1, 0.9, 0.1), seq(0.9, 0.1, -0.1))
cor(rre$Aest, Atrue)
```

We can also run CAM on unmethylated quantification, 1 - beta, to obtain quite similar results since the underlying linear mixing model is also applicable for unmethylated probe intensity.

```{r, echo=TRUE, message=FALSE, eval=FALSE}
rCAM <- CAM(1 - exprs(data[featureId,]),
            K = 2, thres.low = 0.10, thres.high = 0.60,
            cluster.num = 100, MG.num.thres = 10, 
            lof.thres = 0, quick.select = 20)
```

# Supervised/Semi-supervised Deconvolution
## Molecular markers are known

CAM algorithm can estimate A and S matrix based on blindly detected markers. Thus, we can also use part of CAM algorithm to estimate A and S matrix based on known markers.

The package provides `AfromMarkers()` to estimate A matrix from marker list.

```{r, echo=TRUE, eval=FALSE}
Aest <- AfromMarkers(data, MGlist)
#MGlist is a list of vectors, each of which contains known markers for one subpopulation
```

Or we can use `redoASest()` to estimate A and S matrix from marker list and alternatively re-estimate two matrices by ALS to further reduce mean squared error. Note that allowing too many iterations of ALS may bring the risk of a significant deviation from initial values. The constraint for methylation data, $\mathbf{S\in[0,1]}$, can  be imposed during re-estimation. 

```{r, echo=TRUE, eval=FALSE}
rre <- redoASest(data, MGlist, maxIter = 10)
#MGlist is a list of vectors, each of which contains known markers for one subpopulation
#maxIter = 0: No re-estimation by ALS
#rre$Aest: estimated A matrix
#rre$Sest: estimated S matrix
```

## S matrix is known

Many datasets provide expression profiles for purified cell lines or even every single cell, which can be treated as references of S matrix. Some methods use least squares techniques or support vector regression [@CIBERSORT] to estimate A matrix based on known S matrix. `debCAM` will estimate A matrix by identified markers from known S matrix, which has better performance in terms of correlation coefficient with ground truth A matrix.  

```{r, echo=TRUE}
data <- ratMix3$X
S <- ratMix3$S #known S matrix

#Identify markers
pMGstat <- MGstatistic(S, c("Liver","Brain","Lung"))
pMGlist.FC <- lapply(c("Liver","Brain","Lung"), function(x) 
    rownames(pMGstat)[pMGstat$idx == x & pMGstat$OVE.FC > 10])

#Estimate A matrix
Aest <- AfromMarkers(data, pMGlist.FC)

#(Optional) alternative re-estimation
rre <- redoASest(data, pMGlist.FC, maxIter = 10) 
```

`debCAM` also supports estimating A matrix directly from known S matrix using least squares method. Marker identification from S matrix is still needed because only markers' squared errors will be counted which facilitates (1) faster computational running time and (2) a greater signal-to-noise ratio owing to markers' discriminatory power [@CIBERSORT].

```{r, echo=TRUE, eval=FALSE}
Aest <- redoASest(data, pMGlist.FC, S=S, maxIter = 0)$Aest
```

## A matrix is known

With known A matrix, `debCAM` estimates S matrix using non-negative least squares (NNLS) from `r CRANpkg("NMF")`, and further identify markers.

```{r, echo=TRUE}
data <- ratMix3$X
A <- ratMix3$A #known A matrix

#Estimate S matrix
Sest <- t(NMF::.fcnnls(A, t(data))$coef)

#Identify markers
pMGstat <- MGstatistic(data, A)
pMGlist.FC <- lapply(unique(pMGstat$idx), function(x) 
    rownames(pMGstat)[pMGstat$idx == x & pMGstat$OVE.FC > 10])

#(Optional) alternative re-estimation
rre <- redoASest(data, pMGlist.FC, A=A,  maxIter = 10) 
```

## Semi-supervised Deconvolution

When prior information of markers, S matrix and/or A matrix is available, semi-supervised deconvolution can also be performed by combining markers from prior information and markers identified by CAM. While supervised deconvolution cannot handle the underlying subpopulations without prior information, unsupervised deconvolution may miss the subpopulation without enough discrimination power. Therefore, semi-supervised deconvolution can take advantage of both methods.   

```{r, echo=TRUE, eval=FALSE}
Aest <- AfromMarkers(data, MGlist)
#MGlist is a list of vectors, each of which contains known markers and/or CAM-detected markers for one subpopulation
```

Or 

```{r, echo=TRUE, eval=FALSE}
rre <- redoASest(data, MGlist, maxIter = 10)
#MGlist is a list of vectors, each of which contains known markers and/or CAM-detected markers for one subpopulation
#maxIter = 0: No re-estimation by ALS
#rre$Aest: estimated A matrix
#rre$Sest: estimated S matrix
```

# References