---
title: "DEqMS R Markdown vignettes"
author: 
- name: Yafeng Zhu
  affiliation: Karolinska Institute, Stockholm, Sweden
- name: Lukas Orre
  affiliation: Karolinska Institute, Stockholm, Sweden
- name: Georgios Mermelekas
  affiliation: Karolinska Institute, Stockholm, Sweden
- name: Henrik Johansson
  affiliation: Karolinska Institute, Stockholm, Sweden
- name: Alina Malyutina
  affiliation: University of Helsinki, Helsinki, Finland
- name: Simon Anders
- affiliation: Heidelberg University (ZMBH), Heidelberg, Germany
- name: Janne Lehtiö
date: "`r Sys.Date()`"
output: 
  BiocStyle::html_document:
    toc_fload: true
  BiocStyle::pdf_document: default   
package: DEqMS
abstract:
  Instructions to perform differential protein expression analysis using DEqMS
vignette: >
    %\VignetteIndexEntry{DEqMS R Markdown vignettes}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---

# Overview of DEqMS

`DEqMS` builds on top of `Limma`, a widely-used R package for microarray data 
analysis (Smyth G. et al 2004), and improves it with mass spectrometry specific 
data properties, accounting for variance dependence on the number of quantified 
peptides or PSMs for statistical testing of differential protein expression. 

# Load the package

```{r Loadpackage}
library(DEqMS)
```

# Differential protein expression analysis
## Example Dataset 1
We analyzed a protemoics dataset (TMT10plex labelled) in which A431 cells 
(human epidermoid carcinoma cell line) were treated with three different miRNA 
mimics (Yan Z. et al. Oncogene 2016). The raw MS data was searched with MS-GF+ 
(Kim et al Nat Communications 2016) and post processed with Percolator (Kall L. 
et al Nat Method 2007). A tabular text output of PSM intensity table filtered 
at 1% peptide level FDR is used as input for downstream analysis.

## Loading the data and preprocess
Read a tabular input in which PSMs are in rows and samples are in columns.
Intenisty values were log transformed since systematic effects and variance 
components are usually assumed to be additive on log scale (Oberg AL. et al 
JPR 2008; Hill EG. et al JPR 2008).
```{r retrieveExampleData}
### retrieve example dataset
library(ExperimentHub)
eh = ExperimentHub()
query(eh, "DEqMS")
dat.psm = eh[["EH1663"]]
```

```{r log2transform}
dat.psm.log = dat.psm
dat.psm.log[,3:12] =  log2(dat.psm[,3:12])
head(dat.psm.log)
```
Generate sample annotation table.
```{r,design}
cond = c("ctrl","miR191","miR372","miR519","ctrl",
"miR372","miR519","ctrl","miR191","miR372")

sampleTable <- data.frame(
row.names = colnames(dat.psm)[3:12],
cond = as.factor(cond))
sampleTable
```

## Generate a summary of the data
Since each row is one PSM, to count how many PSMs per protein, we simply use 
function `table` in R to count the number of occurance that each protein ID 
appears.
```{r PSMdistribution}
psm.count.table = as.data.frame(table(dat.psm$gene))
rownames(psm.count.table)=psm.count.table$Var1

plot(sort(log2(psm.count.table$Freq)),pch=".",
    xlab="Proteins ordered by PSM count",ylab="log2(PSM count)")
```

## Summarization and Normalization
Here, median sweeping is used to summarize PSMs intensities to protein log2 
ratios. In this procedure, we substract the spectrum log2 intensity from the 
median log2 intensities of all samples. The relative abundance estimate for 
each protein is calculated as the median over all PSMs belonging to this 
protein. 
Assume the log2 intensity of PSM `i` in sample `j` is $y_{i,j}$, its relative 
log2 intensity of PSM `i` in sample `j` is $y'_{i,j}$: 
$$y'_{i,j} = y_{i,j} - median_{j'\in ctrl}\ y_{i,j'} $$
Relative abundance of protein `k` in sample `j` $Y_{k,j}$ is calculated as:
$$Y_{k,j} = median_{i\in protein\ k}\ y'_{i,j} $$

Correction for differences in amounts of material loaded in the channels 
is then done by subtracting the channel median from the relative abundance 
(log2 ratio), normalizing all channels to have median zero.
```{r boxplot}
dat.gene.nm = medianSweeping(dat.psm.log,group_col = 2)

boxplot(dat.gene.nm,xlab="TMT 10 channels",
        ylab="log2 relative protein abundance")
```

## Ordinary T test
We first apply t.test to detect significant protein changes between ctrl 
samples and miR372 treated samples, both have three replicates.

```{r t-test}

pval.372 = apply(dat.gene.nm, 1, function(x) 
t.test(as.numeric(x[c(1,5,8)]), as.numeric(x[c(3,6,10)]))$p.value)

logFC.372 = rowMeans(dat.gene.nm[,c(3,6,10)])-rowMeans(dat.gene.nm[,c(1,5,8)])
```

Generate a data.frame of t.test results, add PSM count values and order the 
table by p-value.
```{r,echo=TRUE}
ttest.results = data.frame(gene=rownames(dat.gene.nm),
                    logFC=logFC.372,P.Value = pval.372, 
                    adj.pval = p.adjust(pval.372,method = "BH")) 

ttest.results$PSMcount = psm.count.table[ttest.results$gene,]$Freq
ttest.results = ttest.results[with(ttest.results, order(P.Value)), ]
head(ttest.results)
```

## Anova analysis
Anova analysis is equivalent to linear model analysis. The difference to 
Limma analysis is that estimated variance is not moderated using empirical 
bayesian approach as it is done in Limma. The purpose here is to compare 
the statistical power to that when Limma is applied. 

We first make a design matrix and then use `lmFit` function in Limma for 
linear model analysis. The input for `lmFit` function is a matrix of relative 
protein abundance (log2 ratios) and a design matrix containing sample 
annotation. Use `head(fit$coefficients)` to see effect size(fold change)  of 
genes for different conditions.

```{r Anova}
library(limma)
gene.matrix = as.matrix(dat.gene.nm)
colnames(gene.matrix) = as.character(sampleTable$cond)
design = model.matrix(~cond,sampleTable)

fit1 <- lmFit(gene.matrix,design)

ord.t = fit1$coefficients[, 3]/fit1$sigma/fit1$stdev.unscaled[, 3]
ord.p = 2*pt(-abs(ord.t), fit1$df.residual)
ord.q = p.adjust(ord.p,method = "BH")
anova.results = data.frame(gene=names(fit1$sigma),
                            logFC=fit1$coefficients[,3],
                            t=ord.t, 
                            P.Value=ord.p, 
                            adj.P.Val = ord.q)

anova.results$PSMcount = psm.count.table[anova.results$gene,]$Freq
anova.results = anova.results[with(anova.results,order(P.Value)),]

head(anova.results)
```

## Limma analysis
Limma is essentially a combination of linear model analysis and empirical 
bayeisan estimation of variance, the latter is applied to increase the 
statistical power by borrowing informations across genes to shrink the
variance. 

```{r Limma}
fit2 <- eBayes(lmFit(gene.matrix,design))
head(fit2$coefficients)
```

Extract limma results using `topTable` function,  `coef = 3` allows you to 
extract the contrast of specific condition (miR372), option `n= Inf` output 
all rows.
Add PSM count values in the table.
```{r,echo=TRUE}
limma.results = topTable(fit2,coef = 3,n= Inf)
limma.results$gene = rownames(limma.results)
limma.results$PSMcount = psm.count.table[limma.results$gene,]$Freq

head(limma.results)
```


## Variance dependence on spectra count
We observed that the variance of gene across samples gradually decreases 
as the number of spectra count increases.
```{r Variance, echo=TRUE, fig.height=5, fig.width=10}
dat.temp = data.frame(var = fit2$sigma^2,
                        PSMcount = psm.count.table[names(fit2$sigma),]$Freq)
dat.temp.filter = dat.temp[dat.temp$PSMcount<21,]

op <- par(mfrow=c(1,2), mar=c(4,4,4,1), oma=c(0.5,0.5,0.5,0))
plot(log2(psm.count.table[rownames(dat.gene.nm),2]),
    dat.gene.nm[,1]-dat.gene.nm[,5],pch=20,cex=0.5,
    xlab="log2(PSM count)",ylab="log2 ratio between two ctrl replicates")

boxplot(log2(var)~PSMcount,dat.temp.filter,xlab="PSMcount",
        ylab="log2(Variance)")

```

## DEqMS analysis
Limma assumes equal prior variance for all genes, the function 
`spectraCounteBayes` in DEqMS package is able to correct the biase of prior 
variance estimate for proteins quantified by different number of PSMs. It works 
in a similar way to the intensity-based hierarchical Bayes method
(Maureen A. Sartor et al BMC Bioinformatics 2006).
Outputs of `spectraCounteBayes`:  
object is augmented form of "fit" object from `eBayes` in Limma, with the 
additions being:
`sca.t` - Spectra Count Adjusted posterior t-value  
`sca.p` - Spectra Count Adjusted posterior p-value  
`sca.dfprior` - estimated prior degrees of freedom  
`sca.priorvar`- estimated prior variance
`sca.postvar` - estimated posterior variance
`loess.model` - fitted loess model

```{r DEqMS}
fit2$count = psm.count.table[rownames(fit2$coefficients),]$Freq

fit3 = spectraCounteBayes(fit2)
DEqMS.results = outputResult(fit3,coef_col = 3)
head(DEqMS.results)
```

Check if the fitted variance ~ PMS count model is correct.
```{r PriorVarianceTrend}
plotFitCurve(fit3,type = "boxplot",xlab="PSM count")
```


# Comparing Limma and DEqMS
## Compare prior variance and posterior variance in Limma and DEqMS

Visualize the prior variance fitted by Limma and DEqMS
```{r PriorVar}
x = fit3$count
y = fit3$sigma^2

limma.prior = fit3$s2.prior
DEqMS.prior = fit3$sca.priorvar

plot(log2(x),log(y),pch=20, cex=0.5,xlab = "log2(PSMcount)",
     ylab="log(Variance)")
abline(h = log(limma.prior),col="green",lwd=3 )
k=order(x)
lines(log2(x[k]),log(DEqMS.prior[k]),col="red",lwd=3 )
legend("topright",legend=c("Limma prior variance","DEqMS prior variance"),
        col=c("green","red"),lwd=3)
```


Visualize the change of posterior variance after PSM count adjustment. 
The plot here shows posterior variance of proteins "shrink" toward the 
fitted value to different extent depending on PSM number. 
```{r PostVar, echo=TRUE, fig.height=5, fig.width=10}
x = fit3$count
y = fit3$s2.post

op <- par(mfrow=c(1,2), mar=c(4,4,4,1), oma=c(0.5,0.5,0.5,0))
plot(log2(x),log(y),pch=20, cex=0.5, xlab = "log2(PSMcount)", 
    ylab="log(Variance)",
    main="Posterior Variance in Limma")

y = fit3$sca.postvar
plot(log2(x),log(y),pch=20,cex=0.5, xlab = "log2(PSMcount)",
    ylab="log(Variance)", 
    main="Posterior Variance in DEqMS")
```


## Comparing p-values by different analysis.
plotting top 500 genes ranked by p-values.
```{r pvalue}
plot(sort(-log10(limma.results$P.Value),decreasing = TRUE)[1:500], 
    type="l",lty=2,lwd=2, ylab="-log10(p-value)",
    xlab="Proteins ranked by p-values",
    col="purple")
lines(sort(-log10(DEqMS.results$sca.P.Value),decreasing = TRUE)[1:500], 
        lty=1,lwd=2,col="red")
lines(sort(-log10(anova.results$P.Value),decreasing = TRUE)[1:500], 
        lty=2,lwd=2,col="blue")
lines(sort(-log10(ttest.results$P.Value),decreasing = TRUE)[1:500], 
        lty=2,lwd=2,col="orange")
legend("topright",legend = c("Limma","DEqMS","Anova","t.test"),
        col = c("purple","red","blue","orange"),lty=c(2,1,2,2),lwd=2)
```


# Visualization of the results

## PSM/Peptide profile plot
`peptideProfilePlot` function will plot log2 intensity of each PSM/peptide of 
the protein in the input table.
```{r PSMintensity}
library(ggplot2)
dat.psm.log.ordered = dat.psm.log[,c("Peptide","gene",
                                    sort(colnames(dat.psm.log)[3:12]))]
peptideProfilePlot(dat=dat.psm.log.ordered,col=2,gene="TGFBR2")
```

## Volcano plot
```{r volcanoplot1}
library(ggrepel)

# to plot adjusted p-values by BH method on y-axis
DEqMS.results$log.adj.pval = -log10(DEqMS.results$sca.adj.pval) 
ggplot(DEqMS.results, aes(x = logFC, y = log.adj.pval)) + 
    geom_point(size=0.5 )+
    theme_bw(base_size = 16) + # change theme
    xlab(expression("log2 miR372/ctrl")) + # x-axis label
    ylab(expression(" -log10 adj.pval")) + # y-axis label
    geom_vline(xintercept = c(-1,1), colour = "red") + # Add fold change cutoffs
    geom_hline(yintercept = 2, colour = "red") + # Add significance cutoffs
    geom_vline(xintercept = 0, colour = "black") + # Add 0 lines
    scale_colour_gradient(low = "black", high = "black", guide = FALSE)+
    geom_text_repel(data=subset(DEqMS.results, abs(logFC)>1&log.adj.pval> 2),
                    aes( logFC, log.adj.pval ,label=gene)) # add gene label

```

you can also use `volcanoplot` function from Limma. However, it uses `p.value`
from Limma. If you want to plot `sca.pvalue` from `DEqMS`, you need to modify
the fit object.

```{r volcanoplot2}
fit3$p.value = fit3$sca.p
volcanoplot(fit3,coef=3, style = "p-value", highlight = 20,
            names=rownames(fit3$coefficients))
```

## PCA plot
```{r PCAplot}
library( RColorBrewer )
pr <- prcomp(t(gene.matrix)) 
plot( pr$x[,1:2], asp=1, col=brewer.pal(4,"Set1")[sampleTable$cond], pch=17)
text( pr$x[,1], pr$x[,2]-1, label=as.character(sampleTable$cond))
legend( "bottomright", legend = levels( sampleTable$cond ), 
    col = brewer.pal(4,"Set1"), pch=17 )
```

## Sample correlation heatmaps
plot sample correlation heatmap
```{r heatmap1}
library( pheatmap )
cm <- cor( gene.matrix )
# rearrange columns so that same sample types are together
cm.order = cm[order(colnames(cm)),order(colnames(cm))]

pheatmap(cm.order,
    cluster_rows=FALSE, cluster_cols=FALSE,
    color = colorRampPalette(c("blue", "white", "red"))(100))
```


or plot Eucldiean distance heatmap
```{r heatmap2}
dm <- as.matrix( dist( t( gene.matrix ) ))
dm.order = dm[order(colnames(cm)),order(colnames(cm))]
pheatmap(dm.order,
    cluster_rows=FALSE, cluster_cols=FALSE,
    color = colorRampPalette(c("red", "white"))(100))

```