--- title: "DEqMS R Markdown vignettes" author: - name: Yafeng Zhu affiliation: Karolinska Institute, Stockholm, Sweden - name: Lukas Orre affiliation: Karolinska Institute, Stockholm, Sweden - name: Georgios Mermelekas affiliation: Karolinska Institute, Stockholm, Sweden - name: Henrik Johansson affiliation: Karolinska Institute, Stockholm, Sweden - name: Alina Malyutina affiliation: University of Helsinki, Helsinki, Finland - name: Simon Anders - affiliation: Heidelberg University (ZMBH), Heidelberg, Germany - name: Janne Lehtiƶ date: "`r Sys.Date()`" output: BiocStyle::html_document: toc_fload: true BiocStyle::pdf_document: default package: DEqMS abstract: Instructions to perform differential protein expression analysis using DEqMS vignette: > %\VignetteIndexEntry{DEqMS R Markdown vignettes} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Overview of DEqMS `DEqMS` builds on top of `Limma`, a widely-used R package for microarray data analysis (Smyth G. et al 2004), and improves it with mass spectrometry specific data properties, accounting for variance dependence on the number of quantified peptides or PSMs for statistical testing of differential protein expression. # Load the package ```{r Loadpackage} library(DEqMS) ``` # Differential protein expression analysis ## Example Dataset 1 We analyzed a protemoics dataset (TMT10plex labelled) in which A431 cells (human epidermoid carcinoma cell line) were treated with three different miRNA mimics (Yan Z. et al. Oncogene 2016). The raw MS data was searched with MS-GF+ (Kim et al Nat Communications 2016) and post processed with Percolator (Kall L. et al Nat Method 2007). A tabular text output of PSM intensity table filtered at 1% peptide level FDR is used as input for downstream analysis. ## Loading the data and preprocess Read a tabular input in which PSMs are in rows and samples are in columns. Intenisty values were log transformed since systematic effects and variance components are usually assumed to be additive on log scale (Oberg AL. et al JPR 2008; Hill EG. et al JPR 2008). ```{r retrieveExampleData} ### retrieve example dataset library(ExperimentHub) eh = ExperimentHub() query(eh, "DEqMS") dat.psm = eh[["EH1663"]] ``` ```{r log2transform} dat.psm.log = dat.psm dat.psm.log[,3:12] = log2(dat.psm[,3:12]) head(dat.psm.log) ``` Generate sample annotation table. ```{r,design} cond = c("ctrl","miR191","miR372","miR519","ctrl", "miR372","miR519","ctrl","miR191","miR372") sampleTable <- data.frame( row.names = colnames(dat.psm)[3:12], cond = as.factor(cond)) sampleTable ``` ## Generate a summary of the data Since each row is one PSM, to count how many PSMs per protein, we simply use function `table` in R to count the number of occurance that each protein ID appears. ```{r PSMdistribution} psm.count.table = as.data.frame(table(dat.psm$gene)) rownames(psm.count.table)=psm.count.table$Var1 plot(sort(log2(psm.count.table$Freq)),pch=".", xlab="Proteins ordered by PSM count",ylab="log2(PSM count)") ``` ## Summarization and Normalization Here, median sweeping is used to summarize PSMs intensities to protein log2 ratios. In this procedure, we substract the spectrum log2 intensity from the median log2 intensities of all samples. The relative abundance estimate for each protein is calculated as the median over all PSMs belonging to this protein. Assume the log2 intensity of PSM `i` in sample `j` is $y_{i,j}$, its relative log2 intensity of PSM `i` in sample `j` is $y'_{i,j}$: $$y'_{i,j} = y_{i,j} - median_{j'\in ctrl}\ y_{i,j'} $$ Relative abundance of protein `k` in sample `j` $Y_{k,j}$ is calculated as: $$Y_{k,j} = median_{i\in protein\ k}\ y'_{i,j} $$ Correction for differences in amounts of material loaded in the channels is then done by subtracting the channel median from the relative abundance (log2 ratio), normalizing all channels to have median zero. ```{r boxplot} dat.gene.nm = medianSweeping(dat.psm.log,group_col = 2) boxplot(dat.gene.nm,xlab="TMT 10 channels", ylab="log2 relative protein abundance") ``` ## Ordinary T test We first apply t.test to detect significant protein changes between ctrl samples and miR372 treated samples, both have three replicates. ```{r t-test} pval.372 = apply(dat.gene.nm, 1, function(x) t.test(as.numeric(x[c(1,5,8)]), as.numeric(x[c(3,6,10)]))$p.value) logFC.372 = rowMeans(dat.gene.nm[,c(3,6,10)])-rowMeans(dat.gene.nm[,c(1,5,8)]) ``` Generate a data.frame of t.test results, add PSM count values and order the table by p-value. ```{r,echo=TRUE} ttest.results = data.frame(gene=rownames(dat.gene.nm), logFC=logFC.372,P.Value = pval.372, adj.pval = p.adjust(pval.372,method = "BH")) ttest.results$PSMcount = psm.count.table[ttest.results$gene,]$Freq ttest.results = ttest.results[with(ttest.results, order(P.Value)), ] head(ttest.results) ``` ## Anova analysis Anova analysis is equivalent to linear model analysis. The difference to Limma analysis is that estimated variance is not moderated using empirical bayesian approach as it is done in Limma. The purpose here is to compare the statistical power to that when Limma is applied. We first make a design matrix and then use `lmFit` function in Limma for linear model analysis. The input for `lmFit` function is a matrix of relative protein abundance (log2 ratios) and a design matrix containing sample annotation. Use `head(fit$coefficients)` to see effect size(fold change) of genes for different conditions. ```{r Anova} library(limma) gene.matrix = as.matrix(dat.gene.nm) colnames(gene.matrix) = as.character(sampleTable$cond) design = model.matrix(~cond,sampleTable) fit1 <- lmFit(gene.matrix,design) ord.t = fit1$coefficients[, 3]/fit1$sigma/fit1$stdev.unscaled[, 3] ord.p = 2*pt(-abs(ord.t), fit1$df.residual) ord.q = p.adjust(ord.p,method = "BH") anova.results = data.frame(gene=names(fit1$sigma), logFC=fit1$coefficients[,3], t=ord.t, P.Value=ord.p, adj.P.Val = ord.q) anova.results$PSMcount = psm.count.table[anova.results$gene,]$Freq anova.results = anova.results[with(anova.results,order(P.Value)),] head(anova.results) ``` ## Limma analysis Limma is essentially a combination of linear model analysis and empirical bayeisan estimation of variance, the latter is applied to increase the statistical power by borrowing informations across genes to shrink the variance. ```{r Limma} fit2 <- eBayes(lmFit(gene.matrix,design)) head(fit2$coefficients) ``` Extract limma results using `topTable` function, `coef = 3` allows you to extract the contrast of specific condition (miR372), option `n= Inf` output all rows. Add PSM count values in the table. ```{r,echo=TRUE} limma.results = topTable(fit2,coef = 3,n= Inf) limma.results$gene = rownames(limma.results) limma.results$PSMcount = psm.count.table[limma.results$gene,]$Freq head(limma.results) ``` ## Variance dependence on spectra count We observed that the variance of gene across samples gradually decreases as the number of spectra count increases. ```{r Variance, echo=TRUE, fig.height=5, fig.width=10} dat.temp = data.frame(var = fit2$sigma^2, PSMcount = psm.count.table[names(fit2$sigma),]$Freq) dat.temp.filter = dat.temp[dat.temp$PSMcount<21,] op <- par(mfrow=c(1,2), mar=c(4,4,4,1), oma=c(0.5,0.5,0.5,0)) plot(log2(psm.count.table[rownames(dat.gene.nm),2]), dat.gene.nm[,1]-dat.gene.nm[,5],pch=20,cex=0.5, xlab="log2(PSM count)",ylab="log2 ratio between two ctrl replicates") boxplot(log2(var)~PSMcount,dat.temp.filter,xlab="PSMcount", ylab="log2(Variance)") ``` ## DEqMS analysis Limma assumes equal prior variance for all genes, the function `spectraCounteBayes` in DEqMS package is able to correct the biase of prior variance estimate for proteins quantified by different number of PSMs. It works in a similar way to the intensity-based hierarchical Bayes method (Maureen A. Sartor et al BMC Bioinformatics 2006). Outputs of `spectraCounteBayes`: object is augmented form of "fit" object from `eBayes` in Limma, with the additions being: `sca.t` - Spectra Count Adjusted posterior t-value `sca.p` - Spectra Count Adjusted posterior p-value `sca.dfprior` - estimated prior degrees of freedom `sca.priorvar`- estimated prior variance `sca.postvar` - estimated posterior variance `loess.model` - fitted loess model ```{r DEqMS} fit2$count = psm.count.table[rownames(fit2$coefficients),]$Freq fit3 = spectraCounteBayes(fit2) DEqMS.results = outputResult(fit3,coef_col = 3) head(DEqMS.results) ``` Check if the fitted variance ~ PMS count model is correct. ```{r PriorVarianceTrend} plotFitCurve(fit3,type = "boxplot",xlab="PSM count") ``` # Comparing Limma and DEqMS ## Compare prior variance and posterior variance in Limma and DEqMS Visualize the prior variance fitted by Limma and DEqMS ```{r PriorVar} x = fit3$count y = fit3$sigma^2 limma.prior = fit3$s2.prior DEqMS.prior = fit3$sca.priorvar plot(log2(x),log(y),pch=20, cex=0.5,xlab = "log2(PSMcount)", ylab="log(Variance)") abline(h = log(limma.prior),col="green",lwd=3 ) k=order(x) lines(log2(x[k]),log(DEqMS.prior[k]),col="red",lwd=3 ) legend("topright",legend=c("Limma prior variance","DEqMS prior variance"), col=c("green","red"),lwd=3) ``` Visualize the change of posterior variance after PSM count adjustment. The plot here shows posterior variance of proteins "shrink" toward the fitted value to different extent depending on PSM number. ```{r PostVar, echo=TRUE, fig.height=5, fig.width=10} x = fit3$count y = fit3$s2.post op <- par(mfrow=c(1,2), mar=c(4,4,4,1), oma=c(0.5,0.5,0.5,0)) plot(log2(x),log(y),pch=20, cex=0.5, xlab = "log2(PSMcount)", ylab="log(Variance)", main="Posterior Variance in Limma") y = fit3$sca.postvar plot(log2(x),log(y),pch=20,cex=0.5, xlab = "log2(PSMcount)", ylab="log(Variance)", main="Posterior Variance in DEqMS") ``` ## Comparing p-values by different analysis. plotting top 500 genes ranked by p-values. ```{r pvalue} plot(sort(-log10(limma.results$P.Value),decreasing = TRUE)[1:500], type="l",lty=2,lwd=2, ylab="-log10(p-value)", xlab="Proteins ranked by p-values", col="purple") lines(sort(-log10(DEqMS.results$sca.P.Value),decreasing = TRUE)[1:500], lty=1,lwd=2,col="red") lines(sort(-log10(anova.results$P.Value),decreasing = TRUE)[1:500], lty=2,lwd=2,col="blue") lines(sort(-log10(ttest.results$P.Value),decreasing = TRUE)[1:500], lty=2,lwd=2,col="orange") legend("topright",legend = c("Limma","DEqMS","Anova","t.test"), col = c("purple","red","blue","orange"),lty=c(2,1,2,2),lwd=2) ``` # Visualization of the results ## PSM/Peptide profile plot `peptideProfilePlot` function will plot log2 intensity of each PSM/peptide of the protein in the input table. ```{r PSMintensity} library(ggplot2) dat.psm.log.ordered = dat.psm.log[,c("Peptide","gene", sort(colnames(dat.psm.log)[3:12]))] peptideProfilePlot(dat=dat.psm.log.ordered,col=2,gene="TGFBR2") ``` ## Volcano plot ```{r volcanoplot1} library(ggrepel) # to plot adjusted p-values by BH method on y-axis DEqMS.results$log.adj.pval = -log10(DEqMS.results$sca.adj.pval) ggplot(DEqMS.results, aes(x = logFC, y = log.adj.pval)) + geom_point(size=0.5 )+ theme_bw(base_size = 16) + # change theme xlab(expression("log2 miR372/ctrl")) + # x-axis label ylab(expression(" -log10 adj.pval")) + # y-axis label geom_vline(xintercept = c(-1,1), colour = "red") + # Add fold change cutoffs geom_hline(yintercept = 2, colour = "red") + # Add significance cutoffs geom_vline(xintercept = 0, colour = "black") + # Add 0 lines scale_colour_gradient(low = "black", high = "black", guide = FALSE)+ geom_text_repel(data=subset(DEqMS.results, abs(logFC)>1&log.adj.pval> 2), aes( logFC, log.adj.pval ,label=gene)) # add gene label ``` you can also use `volcanoplot` function from Limma. However, it uses `p.value` from Limma. If you want to plot `sca.pvalue` from `DEqMS`, you need to modify the fit object. ```{r volcanoplot2} fit3$p.value = fit3$sca.p volcanoplot(fit3,coef=3, style = "p-value", highlight = 20, names=rownames(fit3$coefficients)) ``` ## PCA plot ```{r PCAplot} library( RColorBrewer ) pr <- prcomp(t(gene.matrix)) plot( pr$x[,1:2], asp=1, col=brewer.pal(4,"Set1")[sampleTable$cond], pch=17) text( pr$x[,1], pr$x[,2]-1, label=as.character(sampleTable$cond)) legend( "bottomright", legend = levels( sampleTable$cond ), col = brewer.pal(4,"Set1"), pch=17 ) ``` ## Sample correlation heatmaps plot sample correlation heatmap ```{r heatmap1} library( pheatmap ) cm <- cor( gene.matrix ) # rearrange columns so that same sample types are together cm.order = cm[order(colnames(cm)),order(colnames(cm))] pheatmap(cm.order, cluster_rows=FALSE, cluster_cols=FALSE, color = colorRampPalette(c("blue", "white", "red"))(100)) ``` or plot Eucldiean distance heatmap ```{r heatmap2} dm <- as.matrix( dist( t( gene.matrix ) )) dm.order = dm[order(colnames(cm)),order(colnames(cm))] pheatmap(dm.order, cluster_rows=FALSE, cluster_cols=FALSE, color = colorRampPalette(c("red", "white"))(100)) ```