%\VignetteIndexEntry{Roleswitch} \documentclass[12pt]{article} \usepackage[left=1in,top=1in,right=1in, bottom=1in]{geometry} \usepackage{Sweave} \usepackage{times} \usepackage{hyperref} \usepackage{subfig} \usepackage{natbib} \usepackage{graphicx} \hypersetup{ colorlinks, citecolor=black, filecolor=black, linkcolor=black, urlcolor=black } \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rcode}[1]{{\texttt{#1}}} \newcommand{\software}[1]{\textsf{#1}} \newcommand{\R}{\textsf{R}} \newcommand{\TopHat}{\software{TopHat}} \newcommand{\Bowtie}{\software{Bowtie}} \setkeys{Gin}{width=1.0\textwidth} \bibliographystyle{plainnat} \title{Infer miRNA-mRNA interactions using paired expression data from a single sample} \author{Yue Li \\ \texttt{yueli@cs.toronto.edu}} \date{\today} \begin{document} \SweaveOpts{concordance=TRUE} \maketitle \section{Introduction} MicroRNAs (miRNAs) are small ($\sim 22$ nucleotides) RNA molecules that base-pair with mRNA primarily at the 3$'$ untranslated region (UTR) to cause mRNA degradation or translational repression (\cite{Bartel:2009fh}). The expression levels of miRNA and mRNA are usually measured by microarray or RNA-seq. Paried expression profiling of both miRNA and mRNA enables identifying miRNA-mRNA interactions within an individual and calls for a new computational method. We develop \Rpackage{Roleswitch} to infer Probabilities of MiRNA-mRNA Interaction Signature (ProMISe) using paired expression data from a single sample (paper in preparation). Roleswitch takes as inputs two expression vectors of $N$ mRNAs and $M$ miRNAs and a $N\times M$ seed match matrix containing the number of target sites for each mRNA $i$ and miRNA $k$. The program then outputs a probability matrix that mRNA $i$ (row) being a target of miRNA $k$ (column). User can provide \Rfunction{roleswitch()} with a seed-match matrix. Otherwise, \Rfunction{getSeedMatrix} will be invoked to retrieve seed-match matrix from existing (online) database. Briefly, Roleswitch operates in two phases by inferring the probabilities of mRNA (miRNA) being the targets ("targets") of miRNA (mRNA), taking into account the expression of all of the mRNAs (miRNAs) due to their potential competition for the same miRNA (mRNA). Due to mRNA transcription and miRNA repression events simutaneously happening in the cell, Roleswitch assumes that the total transcribed mRNA levels are higher than the observed (equilibrium) mRNA levels and iteratively updates the total transcription of each mRNA targets based on the above inference. Based on our extensive tests on cancer data from TCGA, Roleswitch identifies more validated targets comparing with existing methods (paper in preparation). Additionally, the inferred ProMISe rivals expression profiles in cancer diagnosis yet provides unique opportunities to explore oncogenic mRNA-miRNA interactions (paper in preparation). The algorithm is outlined as follow: \begin{enumerate} \item Infer mRNA i targeted by miRNA k taking into account the hidden total expression of 1 \ldots N mRNA and miRNA k \item Estimate total transcription level of mRNA i \item Infer miRNA k ``targeted" by mRNA i taking into account 1 \ldots M miRNA and mRNA i expression \item Repeat 1-3 until convergence \end{enumerate} \section{Simulation} To help appreciate the model, this section demonstrates a toy example using simulated data of 10 mRNAs and 4 miRNAs. Specifically, we generated expression of 10 mRNAs and 4 miRNAs from Gaussian distribution using \texttt{rnorm} with mean and standard deviation set to 3 and 1, respectively. The $10\times 4$ seed matrix were generated (using \texttt{rpois}) from a Poisson distribution with $\lambda = 0.2$. <>= library(Roleswitch) # simulated example N <- 10 M <- 4 x.o <- matrix(abs(rnorm(N, mean=3))) rownames(x.o) <- paste("mRNA", 1:nrow(x.o)) colnames(x.o) <- "mRNA expression" # miRNA expression z.o <- matrix(abs(rnorm(M, mean=3))) rownames(z.o) <- paste("miRNA", 1:nrow(z.o)) colnames(z.o) <- "miRNA expression" # simulate target sites c <- matrix(rpois(nrow(z.o)*nrow(x.o), 0.2), nrow=nrow(x.o)) # ensure each miRNA (mRNA) has at least one # seed (seed match) to a mRNA (miRNA) c[apply(c,1,sum)==0, sample(1:ncol(c),1)] <- 1 c[sample(1:nrow(c),1),apply(c,2,sum)==0] <- 1 dimnames(c) <- list(rownames(x.o), rownames(z.o)) # simulate true labels rs.pred <- roleswitch(x.o, z.o, c) @ \begin{figure}[htbp] \begin{center} <>= diagnosticPlot(rs.pred) @ \caption{From left to right, the top panel displays (A) the 10 simulated mRNA expression, (B) 4 miRNA expression, (C) the $10\times 4$ seed-match matrix, and (D) the inferred total mRNA expression by the proposed model; the bottom panel displays (E) the inferred probabilities of the 4 miRNAs targeting the 10 mRNA (miRNA-mRNA), (F) the probabilities of the 10 mRNA ``targeting'' the 4 miRNAs (mRNA-miRNA), (G) the dot product of the above two matrices (which is defined as ProMISe), and (H) the convergence rate. The coloured rectangles on the particular interaction scores help explain the properties of model in the main text.} \label{fig:toy} \end{center} \end{figure} As shown in Fig \ref{fig:toy}, the top panels (left to right) display the observed mRNA and miRNA expression, seed-match matrix and inferred total mRNA expression (A-D); the bottom panels display the probability matrix of miRNA-mRNA (i.e. miRNA targeting mRNA), mRNA-miRNA (i.e. mRNA ``targeting" miRNA), the dot product of the above two matrices, and the convergence rate (E-H). Such simple example is sufficient to highlight several important features of the proposed model. First and most obviously, mRNA that does not carry a seed match for miRNA has zero probability of being a target of that miRNA, regardless the expression levels, and vice versa (Fig \ref{fig:toy}C,E-G). Second, $p(t^{(x)}_{i,k} | \mathbf{x}^{(t)}, z_k, \mathbf{c}_{.,k})$ (miRNA-mRNA) (Fig \ref{fig:toy}E) and $p(t^{(z)}_{i,k} | x_i^{(t)}, \mathbf{z}, \mathbf{c}_{i,.})$ (mRNA-miRNA) (Fig \ref{fig:toy}F) differ in many cases for the same pair of miRNA and mRNA. Third, the joint probabilities (ProMISe, Fig \ref{fig:toy}G) reflect both aspects of the targeting mechanisms and can differentiate many cases, where the probabilities are equal in either $p(t^{(x)}_{i,k}|.)$ (Fig \ref{fig:toy}E) or $p(t^{(z)}_{i,k}|.)$ (Fig \ref{fig:toy}F). Finally, $p(t^{(x)}|.)$ (or $p(t^{(z)}|.)$) converges quickly in only a few iterations (Fig \ref{fig:toy}H). The same holds true for practically large number of mRNAs and miRNAs: the model converges within 10 iterations at $tol=10^{-5}$. \section{Real test} In this section, we demonstrate the real utility of \Rpackage{Roleswitch} in construct ProMISe from a single sample. The test data of miRNA and mRNA expression for the same individual (barcode ID: TCGA-02-0001-01) were downloaded from TCGA GBM. We linear transformed the data to to the non-negative scale since negative expression will produce unexpected results. <>= data(tcga_gbm_testdata) # rescale to non-negative values (if any) if(any(x<0)) x <- rescale(as.matrix(x), to=c(0, max(x))) if(any(z<0)) z <- rescale(as.matrix(z), to=c(0, max(z))) @ Next, we obtain seed-match matrix using \Rfunction{getSeedMatrix} from pre-compiled and processed human target site information \Robject{hsTargets} saved in the \Rpackage{microRNA} package, originally downloaded from Microcosm (\url{http://www.ebi.ac.uk/enright-srv/microcosm/htdocs/targets/v5/}) \cite{GriffithsJones:2008kb}. For each mRNA-miRNA pair, we calculated the number of corresponding target sites. For multiple transcripts of the same gene, we used transcripts with the longest 3$'$UTR. The end result is a $N \times M$ seed-match matrix of $N$ distinct mRNAs each corresponding to a distinct gene and $M$ distinct miRNAs. The execution of the follwing code requires \Rpackage{microRNA} package. <>= seedMatrix <- getSeedMatrix(species="human") seedMatrix <- seedMatrix[match(rownames(x), rownames(seedMatrix), nomatch=F), match(rownames(z), colnames(seedMatrix), nomatch=F)] x <- x[match(rownames(seedMatrix),rownames(x), nomatch=F),,drop=F] z <- z[match(colnames(seedMatrix),rownames(z), nomatch=F),,drop=F] @ We now apply Roleswitch to the test data: <>= rs.pred <- roleswitch(x,z,seedMatrix) @ To demonstrate the quality of the prediction, we compare the Roleswitch predicted ProMISe with using seed-match matrix alone. Among the 100-1000 rank with 100-interval from each method, we counted validated targets downloaded from mirTarBase (\cite{Hsu:2011ed}) (http://mirtarbase.mbc.nctu.edu.tw). For Roleswitch, we use the joint probability matrix as its final inference of ProMISe. For seed-match matrix, which does not consider expression data, the mRNA-miRNA interaction was simply ranked by the corresponding total number of target sites - the more target sites a mRNA $i$ has for miRNA $k$, the more likely it is the miRNA target. To ascertain functional interaction, we restrict the valdiated miRNAs to validated miRNAs that has nonzero expression values in the sample since miRNAs that do not express at all will not have any target in that sample (evan though they may be validated in some other cell-line or tissues). We first need to process the validated targets to make it having the same order of \texttt{dimnames} as the \texttt{seedMatrix}: <>= # reorder valdiated taregts to match with seedMatrix validated <- lapply(colnames(seedMatrix), function(j) { as.matrix(rownames(seedMatrix)) %in% as.matrix(subset(mirtarbase, miRNA==j)$`Target Gene`) }) validated <- do.call("cbind", validated) dimnames(validated) <- dimnames(seedMatrix) @ We now count the validated targets from the top rank targets from Roleswitch and Seed-match matrix and plotted the results in barplot. As shown in Fig \ref{fig:toprank}, Roleswitch predicted more validated targets of the expression miRNAs than using Seed-match alone. For more extensive comparison with other exisitng methods, please refer to our paper (once it is published). <>= toprank <- seq(from=100,to=1000,by=100) toprank_eval <- function(pred, decreasing=T, mirna.expr) { expressed.miRNA <- rownames(mirna.expr)[mirna.expr > 0] tmp <- validated tmp[, !colnames(tmp) %in% expressed.miRNA] <- FALSE valid <- which(as.numeric(tmp)==1) tp <- sapply(toprank, function(n) sum(head(order(pred, decreasing=decreasing), n) %in% valid)) data.frame(rank=toprank, validated=tp) } rs.toprank <- data.frame(toprank_eval(as.numeric(rs.pred$p.xz), mirna.expr=z), type="GBM", method="Roleswitch") seed.toprank <- data.frame(toprank_eval(as.numeric(seedMatrix), mirna.expr=z), type="GBM", method="Seed Matrix") @ <>= require(ggplot2) df <- rbind(rs.toprank, seed.toprank) gg <- ggplot(data=df, aes(x=factor(rank), y=validated, fill=method)) + theme_bw() + geom_bar(stat="identity", position="dodge") + scale_x_discrete("Top rank") + scale_y_continuous("Validated targets of expressed miRNAs") @ \begin{figure}[htbp] \begin{center} <>= print(gg) @ \caption{The number of validated targets selected by Roleswitch and seed match matrix among their top rankings.} \label{fig:toprank} \end{center} \end{figure} \section{Working with \Robject{eSet/ExpressionSet}}\label{sec:eset} Roleswitch supports \Robject{eSet} or \Robject{ExpressionSet} from \Rpackage{Biobase} as input for mRNA expression.\footnote{For miRNA expression, user still need to provide a $1\times M$ matrix containing expression of $M$ miRNAs} For an eSet containing multiple samples, Roleswitch will only take the first sample. Thus, user needs to run Roleswitch multiple times on distinct samples or average the probe values across multiple samples (replicates). In the following example, Roleswitch converts the probe ID to gene symbol using packages dedicated for the chip (i.e., hgu95av2 in the example below), avearges multiple probe values for the same gene or miRNA, and constructs seed match matrix for human, automatically. Fig \ref{promise} depicts the $36\times 7$ probabilities matrix generated by Roleswitch for the distinct 36 mRNAs being the targets of 7 distinct miRNAs. <>= # mRNA expression from eSet dataDirectory <- system.file("extdata", package="Biobase") exprsFile <- file.path(dataDirectory, "exprsData.txt") exprs <- as.matrix(read.table(exprsFile, header=TRUE, sep="\t", row.names=1, as.is=TRUE)) eset <- ExpressionSet(assayData=exprs[,1,drop=F], annotation="hgu95av2") annotation.db <- sprintf("%s.db", annotation(eset)) # miRNA expression mirna.expr <- matrix( c(1.23, 3.52, 2.42, 5.2, 2.2, 1.42, 1.23, 1.20, 1.37), dimnames=list( c("hsa-miR-148b", "hsa-miR-27b", "hsa-miR-25", "hsa-miR-181a", "hsa-miR-27a", "hsa-miR-7", "hsa-miR-32", "hsa-miR-32", "hsa-miR-7"), "miRNA Expression") ) rs <- roleswitch(eset, mirna.expr) promise <- rs$p.xz[apply(rs$p.xz,1,sum)>0, apply(rs$p.xz,2,sum)>0] @ \begin{figure}[htbp] \begin{center} <>= color2D.matplot(promise, extremes=c("white", "red"), main=sprintf("ProMISe"), axes=FALSE, xlab="", ylab="", show.values=T) axis(1,at=0.5:(ncol(promise)-0.5),las=3,labels=sub("hsa-","", colnames(promise))) axis(2,at=0.5:(nrow(promise)-0.5),las=2,labels=rownames(promise)) @ \caption{ProMISe generated from Section \ref{sec:eset}. Probabitlies are displayed in the cells. The intensity of the red color corresponds to the magnitude of the probabilisties (i.e. the higher the probability the more red).} \label{fig:promise} \end{center} \end{figure} \section{Session Info} <>= sessionInfo() @ \bibliography{Roleswitch} \end{document}