% Benjamin Haibe-Kains % % % October 5, 2011 \documentclass[a4paper,11pt]{article} %\VignetteIndexEntry{predictionet} %\VignetteDepends{predictionet} %\VignetteKeywords{predictionet} %\VignettePackage{predictionet} \usepackage[american]{babel} \usepackage{url} \usepackage{hyperref} \usepackage{a4wide} \usepackage{natbib} \usepackage{graphicx} \usepackage{authblk} \usepackage{wasysym} \usepackage{color} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \title{\Rpackage{predictionet}: a package for inferring predictive networks from high-dimensional genomic data} \author[1,2]{Benjamin Haibe-Kains} \author[3]{Catharina Olsen} \author[3]{Gianluca Bontempi} \author[1,2]{John Quackenbush} \affil[1]{Computational Biology and Functional Genomics Laboratory, Dana-Farber Cancer Institute, Harvard School of Public Health} \affil[2]{Center for Cancer Computational Biology, Dana-Farber Cancer Institute} \affil[3]{Machine Learning Group, Universit\'{e} Libre de Bruxelles} \begin{document} <>= ## initialize seed set.seed(123) @ \maketitle %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \tableofcontents \clearpage \section{Introduction} DNA microarrays and other high-throughput omics technologies provide large datasets that often include patterns of correlation between genes reflecting the complex processes that underlie cellular processes. The challenge in analyzing large-scale expression data has been to extract biologically meaningful inferences regarding these processes \textendash \ often represented as networks \textendash \ in an environment where the datasets are complex and noisy. Although many techniques have been developed in an attempt to address these issues, to date their ability to extract meaningful and predictive network relationships has been limited. In this vignette we introduce a platform developed in John Quackenbush's lab, which enables inference of predictive gene interaction networks from prior biological knowledge, in the form of biomedical literature and structured databases, and from gene expression data. Using real data, we will show the benefit of using prior biological knowledge to infer networks and how to quantitatively assess the quality of such networks. \subsection*{Getting started} After starting R, the package should be loaded using the following command: <>= library(predictionet) @ This will load \Rpackage{predictionet} as well as its dependencies. In this vignette we will use two example datasets included in the \Rpackage{predictionet} package, namely \Robject{expO.colon.ras} and \Robject{jorissen.colon.ras}. The vignette describes in detail all the necessary steps to infer a gene interaction network combining prior biological knowledge and gene expression data. \medskip \noindent Our approach is the following: \begin{enumerate} \item Select a gene expression dataset and a list of genes of interest. \item Extract priors from the biomedical literature and public structured databases using the \textit{Predictive Networks} web application. \item Use the \Rpackage{predictionet} R package to infer a gene interaction network from priors and gene expression data. \begin{enumerate} \item Infer a network using the main function \Rfunction{netinf}. \item Explore and display the topology of the resulting network. \item Assess the stability of the network inference in cross-validation (function \Rfunction{netinf.cv}). \item Assess quantitatively the predictive ability of the network in cross-validation (function \Rfunction{netinf.cv}). \item Assess quantitatively the predictive ability of the network in a fully independent dataset (functions \Rfunction{netinf.predict} and \Rfunction{pred.score}). \end{enumerate} \item Use \Rpackage{predictionet} to statistically compare multiple gene interaction networks. \end{enumerate} Using two colon cancer gene expression datasets as examples (see Section~\ref{sec:dataset}) we go step by step and provide all the R code necessary to perform the entire analysis. \section{Biology and Data}\label{sec:dataset} Let's focus on colon cancer and more specifically the RAS signaling pathway. \subsection{RAS signaling pathway} A series of alterations in the cellular genome affecting the expression or function of genes controlling cell growth and differentiation is considered to be the main cause of cancer. These mutational events include activation of oncogenes and inactivation of tumor suppressor genes. The elucidation of human cancer at the molecular level allows the design of rational, mechanism-based therapeutic agents that antagonize the specific activity of biochemical processes that are essential to the malignant phenotype of cancer cells. Because the frequency of RAS mutations is among the highest for any gene in human cancers, development of inhibitors of the Ras-mitogen-activated protein kinase (RAS/MAPK) pathway as potential anticancer agents is a very promising pharmacologic strategy~\citep{reuter2000targeting}. \citeauthor{bild2006oncogenic} identified a list of genes being differentially expressed between colorectal cancer cell lines carrying the RAS mutation and those with the wild-type RAS gene \citep{bild2006oncogenic}. This gene list is provided in \texttt{files/bild2006\_ras\_signature\_348.csv} and will serve as the core set of genes involved in the RAS pathway. \subsection{Colon cancer gene expression data} We use two large gene expression datasets of primary colon tumors collected before any adjuvant therapy. The first dataset, provided in the \Rpackage{predictionet} package (see \Robject{?expO.colon.ras}), was published by the expO project\footnote{\url{http://www.intgen.org/expo/}} and consists of 292 colon tumors hybridized on the Affymetrix GeneChip HG-U133PLUS2, composed of 54,675 probesets. The second dataset, also provided in the \Rpackage{predictionet} package (see \Robject{?jorissen.colon.ras}) was published in \citep{jorissen2009metastasisassociated} and consists of 290 colon tumors hybridized on the same Affymetrix GeneChip (HG-U133PLUS2). The raw data have been collected from GEO\footnote{\url{http://www.ncbi.nlm.nih.gov/geo/}}, series accession numbers GSE2109 and GSE14333 for the first and the second dataset respectively \paragraph{Data preprocessing} The raw files (\texttt{*.CEL}) have been normalized using \Rfunction{frma} \citep{mccall2010frozen}. Then only a subset of the gene expressions has been kept for further analysis of the RAS signaling pathway by selecting the probesets in Bild's RAS signature (see \texttt{files/bild2006\_ras\_signature\_348.csv}) which represent a unique gene. When multiple probesets represented the same gene, the most variant has ben selected. The final datasets contain 259 genes and are stored in the \Rpackage{predictionet} package (see \Robject{?expO.colon.ras} and \Robject{?jorissen.colon.ras}). \subsection{Known gene interactions extracted from the biomedical literature and public structured databases}\label{sec:priors} In order to extract previously published gene interactions, Prof John Quackenbush initiated the development of the \textit{Predictive Networks} web application [\url{https://compbio.dfci.harvard.edu/predictivenetworks/}] implemented by Dr Christopher Bouton and his team at \textsc{Entagen}\footnote{http://www.entagen.com}. This tool enables users to easily retrieve a large number of high quality gene interactions reported in the biomedical literature (full-text open-access PubMed articles and/or MEDLINE abstracts) and/or structured biological databases (e.g., Pathway Commons) by focussing on a core set of genes (referred to as gene list). One can use the \textit{Predictive Networks} web application (Figure~\ref{fig:pnwebappscreen}) to retrieve a list of interactions involving at least one gene included in our list of RAS-related genes. To do so, one has to first create an account to login to the webapp. Then go to "My Page" and create a new gene list by cutting-and-pasting the gene symbols\footnote{Be extremely careful if you use Microsoft Excel since it will automatically interpret SEPT6, that is gene "SEPTIN 6", as the 6th of September!} included in \texttt{files/bild2006\_ras\_signature\_348.csv}. Lastly, we have to export all the resulting triples (e.g., "$gene_a$ regulates $gene_b$ "represented by the interaction $gene_a \rightarrow gene_b$) in a CSV file, \texttt{priors\_ras\_bild2006\_pnwebapp.csv}, by clicking on "View Triples" and then "Download w/ Sentences As". One can use R to read this file and count how many times a gene interaction has been observed in the biomedical literature and reported structured biological databases. %Actually the results of such manipulations have been incorporated into the \Rpackage{predictionet} package as described below. \begin{figure}[!th] \centering \includegraphics[width=1\textwidth]{predictionet-pn_webapp_ras} \caption{Screenshot of the \textit{Predictive Networks} web application where one searches known interactions for HRAS. } \label{fig:pnwebappscreen} \end{figure} Thanks to PN a directed graph can be efficiently constructed from known gene interactions which is represented by the matrix $P_{m \times m}$ where $m$ is the number of genes in the network and $P_{ij}$ contains the number of times the interaction $X_i \rightarrow X_j$ between the two genes $X_i$ and $X_j$ has been reported in the biomedical literature and structured biological databases. The prior counts in $P$ are then rescaled into $\{-1,1\}$ where $P_{ij}=0$ represents no prior information about interaction $ij$, $P_{ij}=1$ represents strong evidence for interaction $ij$ and $P_{ij}=-1$ represents strong evidence for the absence of interaction between genes $i$ and $j$. Such a directed graph constructed from priors will be later combined with a directed graph inferred from genomic data (Figure~\ref{fig:regrnetdesign}; see Section~\ref{sec:datanet}). Let's read the newly generated priors, \texttt{priors\_ras\_bild2006\_pnwebapp.csv}, into a R session. <>= ## RAS-related genes genes.ras <- colnames(data.ras) ## read priors generated by the Predictive Networks web application pn.priors <- read.csv(system.file(file.path("extdata", "priors_ras_bild2006_pnwebapp.csv"), package="predictionet"), stringsAsFactors=FALSE) ## the column names should be: subject, predicate, object, direction, evidence, sentence, article, network ## remove special characters in the gene symbols pn.priors[ ,"subject"] <- gsub(pattern="[-]|[+]|[*]|[%]|[$]|[#]|[{]|[}]|[[]|[]]|[|]|[\\^]", replacement=".", x=pn.priors[ ,"subject"]) pn.priors[ ,"object"] <- gsub(pattern="[-]|[+]|[*]|[%]|[$]|[#]|[{]|[}]|[[]|[]]|[|]|[\\^]", replacement=".", x=pn.priors[ ,"object"]) genes.ras <- gsub(pattern="[-]|[+]|[*]|[%]|[$]|[#]|[{]|[}]|[[]|[]]|[|]|[\\^]", replacement=".", x=genes.ras) ## missing values pn.priors[!is.na(pn.priors) & (pn.priors == "" | pn.priors == " " | pn.priors == "N/A")] <- NA ## select only the interactions in which the genes are comprised in our gene expression dataset myx <- is.element(pn.priors[ , "subject"], genes.ras) & is.element(pn.priors[ , "object"], genes.ras) pn.priors <- pn.priors[myx, , drop=FALSE] @ \noindent The R object \Robject{pn.priors} is a matrix that contains the triples $gene_a \rightarrow gene_b$ that have been reported in the literature and the structured biological databases. <>= pn.priors <- pn.priors[sample(1:nrow(pn.priors)), ] @ <>= print(head(pn.priors)) @ \noindent As we will see later it is convenient to transform this long list of triples in a square matrix of counts containing the number of times an interaction $gene_a \rightarrow gene_b$ has been reported in the literature and the structured biological databases. <>= ## build prior counts pn.priors.counts <- matrix(0, nrow=length(genes.ras), ncol=length(genes.ras), dimnames=list(genes.ras, genes.ras)) for(i in 1:nrow(pn.priors)) { switch(tolower(pn.priors[i, "direction"]), "right"={ pn.priors.counts[pn.priors[i, "subject"], pn.priors[i, "object"]] <- pn.priors.counts[pn.priors[i, "subject"], pn.priors[i, "object"]] + ifelse(!is.na(pn.priors[i, "evidence"]), ifelse(tolower(pn.priors[i, "evidence"]) == "positive", +1, -1), 0) }, "left"={ pn.priors.counts[pn.priors[i, "object"], pn.priors[i, "subject"]] <- pn.priors.counts[pn.priors[i, "object"], pn.priors[i, "subject"]] + ifelse(!is.na(pn.priors[i, "evidence"]), ifelse(tolower(pn.priors[i, "evidence"]) == "positive", +1, -1), 0) }, { pn.priors.counts[pn.priors[i, "subject"], pn.priors[i, "object"]] <- pn.priors.counts[pn.priors[i, "subject"], pn.priors[i, "object"]] + ifelse(!is.na(pn.priors[i, "evidence"]), ifelse(tolower(pn.priors[i, "evidence"]) == "positive", +1, -1), 0) if(pn.priors[i, "object"] != pn.priors[i, "subject"]) { pn.priors.counts[pn.priors[i, "object"], pn.priors[i, "subject"]] <- pn.priors.counts[pn.priors[i, "object"], pn.priors[i, "subject"]] + ifelse(!is.na(pn.priors[i, "evidence"]), ifelse(tolower(pn.priors[i, "evidence"]) == "positive", +1, -1), 0) } }) } ## negative count represent evidence for ABSENCE of an interaction, positive otherwise @ In this example, few interactions have been previously reported. <>= print(table(pn.priors.counts)) @ \subsection{Predictionet package} All these function implemented in \Rpackage{predictionet} are listed in the documentation of the package itself. <>= library(help=predictionet) @ Looking at the help page of \Robject{expO.colon.ras} and \Robject{jorissen.colon.ras} will give the details about the data that we will use during this course. <>= help(expO.colon.ras) help(jorissen.colon.ras) @ \noindent \Robject{expO.colon.ras} and \Robject{jorissen.colon.ras} contain three R objects:\\ \begin{description} \item[\Robject{data*.ras} ] matrix of gene expression data; tumors in rows, probes in columns. \item[\Robject{annot*.ras} ] data frame of probe annotations; probes in rows, annotations in columns. \item[\Robject{demo*.ras} ] data frame of clinical information of the colon cancer patients; patients in rows, clinical variables in columns. \item[\Robject{priors*.ras} ] matrix of prior information about the gene interactions; parents/sources in rows, children:targets in columns. \end{description} Although the objects \Robject{data*.ras}, and \Robject{demo*.ras} are specific to each dataset, the objects \Robject{data*.ras} and \Robject{data*.ras} are common in the two datasets under study because the same microarray technology has been used to generate them and we extracted the same genes, and subsequently the same known gene interactions (priors). In this tutorial, we will not used the object \Robject{priors.ras} but we will use \Robject{pn.priors.counts} instead. \section{Network inference from priors and gene expression data\label{sec:datanet}} Many methods have been developed for network inference: Bayesian networks, nested $q$-partial graphs, information-theoretic networks, regression-based networks,\dots These methods differ greatly in their way of dealing with the potentially high dimensionality of the problem, handling missing values, learning from observations and/or interventions, combining genomic data and prior knowledge, assessing the quality of fitted networks, and their ability to make predictions. In this course, I present a regression-based network inference approach we specifically developed \begin{itemize} \item to deal efficiently with large number of genes (potentially $\geq 100$), \item to avoid discretization of the input values, \item to combine prior knowledge and gene expression data, \item to infer stable gene interaction networks, \item to make predictions about the expression of genes of interest in independent data. \end{itemize} \subsection{Methodology} The design of our regression-based approach for network inference is illustrated in Figure~\ref{fig:regrnetdesign}. The extraction of prior knowledge about gene interactions has been explained in Section~\ref{sec:priors}. In order to infer a directed graph from gene expression data we first build an undirected graph using the \textit{maximum relevance minimum redundancy} (MRMR) feature selection technique \citep{ding2005minimum,meyer2007information} and we then infer causality \citep{cheng2002learning,olsen2009inferring}. The network topology is determined by combining the directed graphs inferred from priors and gene expression data. Using the resulting topology, we can fit regression models in order to assess the predictive ability of the network model. \begin{figure}[htbp] \centering \includegraphics[keepaspectratio=true,height=10cm]{predictionet-regrnet_design} \caption{Design of the regression-based network inference approach.} \label{fig:regrnetdesign} \end{figure} \subsubsection*{MRMR} To infer an undirected graph from gene expression data, we use the maximum relevance minimum redundancy (MRMR) feature selection technique \citep{ding2005minimum,meyer2007information}. For this iterative forward selection technique, at each step a different gene is used as the target gene $X_T$. The algorithm then selects the genes that exhibit the highest difference between mutual information with the target gene $X_T$ and redundancy with the previously selected genes (referred to as $\mathbf{X}_\mathcal{S}$). This difference acts as a score for the possible interactions. In the first step of the feature selection procedure, the gene $X_j$ which has the highest mutual information with the target gene $X_T$ is selected, so at this stage $\mathbf{X}_\mathcal{S} = \{X_j\}$. In the next steps, given a set $\mathbf{X}_\mathcal{S}$ of selected variables, the criterion updates $\mathbf{X}_\mathcal{S}$ by adding the gene $X_k$ that maximizes the score $$ s_k = I(X_k,X_T) - \frac{1}{| \mathbf{X}_\mathcal{S} |} \sum_{X_i \in \mathbf{X}_\mathcal{S}}{I(X_k,X_i)} $$ where $| \mathbf{X}_\mathcal{S} |$ is the number of variables previously selected in set $\mathbf{X}_\mathcal{S}$. The feature selection procedure stops when the size of $\mathbf{X}_\mathcal{S}$, the number of genes connected to the target $X_T$, is equal to the maximum number of parents allowed by the user. %The second selected variable $X_j$ will be the one with a high information $I(X_j,X_T)$ to the target and at the same time a low information $I(X_j,X_i)$ to the previously selected variable $X_i$ thus trying to find the gene $X_j$ maximizing the score $I(X_j,X_T)-I(X_j,X_i)$, so at this stage $\mathbf{X}_\mathcal{S} = \{X_i, X_j\}$. Note that the resulting undirected graph is locally acyclic since the target gene cannot be selected. \subsubsection*{Causality inference} Once the undirected graph is built with the genes included in $\mathbf{X}_\mathcal{S}$, we have to perform the arc orientation. Based on previous works on causality (see \citep{neopalitan2003learning} for a review), we can actually infer causality by using the "explaining away effect" which states that a common effect creates a dependency between two variables (this can be graphically represented by the $v$-structure, where the value of the collider is conditioned upon). In \citep{ding2005minimum,meyer2008information} this effect has been related to the property \begin{equation} I(X_i,X_k|X_j) > I(X_i,X_k) \Leftrightarrow C(X_i,X_j,X_k) < 0 \label{eq:cneg} \end{equation} where $X_i$ and $X_k$ are potential causes of $X_j$, $I(X_i,X_k)$ is the mutual information of $X_i$ and $X_k$, $I(X_i,X_k|X_j)$ is the mutual information of $X_i$ and $X_k$ given $X_j$, and $C(X_i,X_j,X_k)$ is the interaction information of $X_i$, $X_j$ and $X_k$ such that \begin{eqnarray} C(X_i,X_j,X_k) & = & I(X_i,X_k) - I(X_i,X_k|X_j) \nonumber \end{eqnarray} The interaction information has been introduced in \citep{mcgill1954multivariate} as an extension of mutual information taking into account sets of variables instead of only pairwise relations. If the undirected (acyclic) graph is given, then for every triple of variables $X_i \relbar X_j \relbar X_k$ the property (\ref{eq:cneg}) corresponds to a v-structure $X_i \rightarrow X_j \leftarrow X_k$. On the contrary, knowing only that the interaction information is less than zero does not help when inferring the network because the three possible colliders exhibit all the same interaction information, that is the difference between mutual information and conditional mutual information. Thus, it is not evident which variable should be the collider. %Therefore, the only possibility for orienting the arcs using the interaction information is to use the following property: %\begin{quote} %Given three variables $X$, $Y$ and $Z$ and a structure $X \relbar Y \relbar Z$, then if $I(X,Z|Y) > I(X,Z)$, that is $C(X,Y,Z) < 0$, then $X \rightarrow Y \leftarrow Z$ %\end{quote} In this work we use property (\ref{eq:cneg}) to rank the genes selected in $\mathbf{X}_\mathcal{S}$ based on their degree of causality for the target gene $X_T$. More specifically the genes selected in $\mathbf{X}_\mathcal{S}$ are ranked by a causality score $$q_k = \max_{X_i\in \mathbf{X}_\mathcal{S}}{\{-C(X_i,X_T,X_k)\}},$$ where for each gene $X_k$, the score $q_k$ is determined by the maximal negative interaction information between the target gene $X_T$, $X_k$ and any other variable $X_i\in \mathbf{X}_\mathcal{S}$ forming a possible v-structure $X_i \rightarrow X_T \leftarrow X_k$. The causality score $q_k \mathrm{\ for\ } X_k \in \mathbf{X}_\mathcal{S}$ lays in $\{-1,1\}$. For each target gene $X_T$, the selected genes $X_k \in \mathbf{X}_\mathcal{S}$ are ranked by their causality score $q_k$ while the genes which are not selected, $X_j \notin \mathbf{X}_\mathcal{S}$, are assigned the minimum causality score of $-1$. The matrix $Q_{m \times m}$ is then populated such that $Q_{ij}$ is the causality score of the interaction between genes $X_i \rightarrow X_j$. \noindent Depending on the data, we can now identify some of the genes as parents. \subsubsection*{Combination with priors} To infer the final network topology, we combine for each interaction between gene $X_i$ and gene $X_j$, the score based on priors ($P_{ij}$) and MRMR+causality~inference ($Q_{ij}$). We let users control the weight put on the priors, which represents their confidence on their prior knowledge (either gathered with the PN web application or another source), using the following combination schema \begin{equation} g_{ij} = w P_{ij} + (1-w) Q_{ij} \label{eq:combiscore} \end{equation} where $g_{ij}$ is the combined topological score for the interaction $X_i \rightarrow X_j$ and $w \in [0,1]$. To ensure sparsity of the inferred network only the $n$ interactions with the largest $g_{ij} > 0$ are considered in the network topology. In this case $n$ is less or equal to the maximum number of parents allowed by the user. The choice of $0$ as cutoff for the interaction score enables to focus on the genes for which causality is supported by the data and/or the priors. Note that our schema for combining scores computed from priors and genomic data (Equation~(\ref{eq:combiscore})) allows users to easily infer networks using prior knowledge only ($w = 1$) or using genomic data only ($w = 0$), or a combination of priors and genomic data. \section{Predictive ability of the network model\label{sec:predabmodel}} In the last step of our network inference approach we use the network topology to build a predictive model. Such a predictive network model should be able to make one- and multi-step predictions. One-step prediction refers (i) to prediction the expression of a gene of interest from the expression of its parents only or (ii) to prediction of the effect of a gene perturbation on its direct children. Multi-step prediction refers to prediction of the state of some missing genes in the network or the direct and indirect effect of a perturbation. With the inferred topology and genomic data in hands we can use different methods to make our network model predictive. If the data can be discretized in a biological meaningful way, we can compute conditional probability tables for all genes and use then to make one-step prediction, and subsequently compute the marginal probabilities to make multi-step predictions. If the data are continuous, we can fit local (linear) regression models for all genes, using parent genes as predictors, and then use these models to make one- and multi-step predictions. In this work we fitted traditional linear regression model for each gene (target gene $X_T$) in the network by using only the parent genes ($\mathbf{X}_\mathcal{P}$) as predictors \begin{equation} X_T = \beta_0 + \sum_{X_i \in \mathbf{X}_\mathcal{P}}{\beta_i X_i} \end{equation} where $\beta$ are the coefficients estimated via the ordinary least squares method. \bigskip This novel approach is implemented in the \Rpackage{predictionet} package and is also integrated in the \textit{Predictive Networks} web application (see "Analysis" panel). Even if the package is not yet available from Bioconductor\footnote{\url{http://www.bioconductor.org}}, a public Git repository is accessible from \url{https://github.com/bhaibeka/predictionet}; any non-violent and constructive feedback is welcome. \smiley \subsection{Network inference with predictionet} Let's infer our first network and go through the main options of the package. The main function of the \Rpackage{predictionet} package is \Rfunction{netinf} with the following key parameters: . \begin{description} \item[data ] Matrix of continuous or categorical values with observations in rows and features in columns. \item[categories ] If this parameter missing, \Robject{data} should be already discretized; otherwise either a single integer or a vector of integers specifying the number of categories used to discretize each variable (data are then discretized using equal-frequency bins) or a list of cutoffs to use to discretize each of the variables in \Robject{data} matrix. If method='bayesnet' and \Robject{categories} is missing, \Robject{data} should contain categorical values and the number of categories will determine from the data. \item[perturbations ] Matrix of \{0, 1\} specifying whether a gene has been perturbed (e.g., knockdown, over-expression) in some experiments. Dimensions should be the same than \Robject{data}. \item[priors ] Matrix of prior information available for gene-gene interaction (parents in rows, children in columns). Values may be probabilities or any other weights (citations count for instance). if priors counts are used the parameter \Robject{priors.count} should be TRUE so the priors are scaled accordingly. \item[predn ] Indices or names of variables to fit during network inference. If missing, all the variables will be used for network inference. Note that for bayesian network inference (method='bayesnet') this parameter is ignored and a network will be generated using all the variables. \item[priors.count ] \Robject{TRUE} if priors specified by the user are number of citations (count) for each interaction, \Robject{FALSE} if probabilities or any other weight in [0,1] are reported instead. \item[priors.weight ] Real value in [0,1] specifying the weight to put on the priors (0=only the data are used, 1=only the priors are used to infer the topology of the network). \item[maxparents ] Maximum number of parents allowed for each gene. \item[subset ] Vector of indices to select only subset of the observations. \item[method ] 'bayesnet' for Bayesian network inference with the \Robject{catnet} package (not implemented yet), 'regrnet' for regression-based network inference. \item[causal ] \Robject{TRUE} if the causality should be inferred from the data, \Robject{FALSE} otherwise \item[seed ] Set the seed to make the network inference deterministic. \end{description} \noindent You can easily access this description by consulting the help page of the \Rfunction{netinf} <>= help(netinf) @ The \Rfunction{netinf} function returns a list containing the names of the method used for network inference, the network topology and a list of local regression models. \noindent You can infer a gene interaction network using the expO dataset \Robject{data.ras}, by equally balancing the importance of priors and data in the network inference process (\Robject{priors.weight} = 0.5). To reduce the computational time, we will focus on the top 15 genes which are the most differentially expressed between RAS mutated and wild type cell lines: <>= ## number of genes to select for the analysis genen <- 30 ## select only the top genes goi <- dimnames(annot.ras)[[1]][order(abs(log2(annot.ras[ ,"fold.change"])), decreasing=TRUE)[1:genen]] @ Now run the network inference on the reduced \Robject{expO.colon.ras} dataset: <>= mynet <- netinf(data=data.ras[ ,goi], priors=pn.priors.counts[goi,goi], priors.count=TRUE, priors.weight=0.5, maxparents=4, method="regrnet", seed=54321) @ \medskip \noindent Now let's display the topology of the network we just inferred (see Figure~\ref{fig:topoglobal}) using the \Rfunction{plot} function of the \Rpackage{network} package. \begin{figure}[!th] \centering <>= ## network topology mytopoglobal <- mynet$topology library(network) xnet <- network(x=mytopoglobal, matrix.type="adjacency", directed=TRUE, loops=FALSE, vertex.attrnames=dimnames(mytopoglobal)[[1]]) mynetlayout <- plot.network(x=xnet, displayisolates=TRUE, displaylabels=TRUE, boxed.labels=FALSE, label.pos=0, arrowhead.cex=1.5, vertex.cex=2, vertex.col="royalblue", jitter=FALSE, pad=0.6) @ \caption{Directed graph representing the topology of the network inferred from priors and gene expression data.} \label{fig:topoglobal} \end{figure} \noindent Unfortunately, such kind of plots are not interactive and we cannot change the position of the genes as we would like. There exist other packages to display networks in R, some of them are interactive: \Rpackage{igraph}, \Rpackage{Graphviz},\dots To enable interactive manipulation of the network topology we choose another approach that is to export the network object in \texttt{.gml} file which could be imported in external software such as Cytoscape\footnote{\url{http://www.cytoscape.org/}} \citep{smoot2011cytoscape}. We will see later how to use the function \Rfunction{netinf2gml} from the \Rpackage{predictionet} package to export a \texttt{.gml} file and import it to Cytoscape (see Section~\ref{sec:cytoscape}). \medskip \noindent A quick look at the topology in Figure~\ref{fig:topoglobal} allows us to identify IL13RA2 as a highly connected gene. By searching for these two genes in GeneSigDB\footnote{\url{http://compbio.dfci.harvard.edu/genesigdb/}}, a manually curated database of published gene signatures developed by Dr Aedin Culhane, we can see that IL13RA2, in addition to be included in the RAS signature published in \citep{bild2006oncogenic}, is also part of other signatures published in colon, leukemia, ovarian and stomach cancers. Receptors for interleukin-13 (IL-13R) are over-expressed on several types of solid cancers; it is not only associated with colon cancer, but also with many other cancers including colon cancer \citep{niranjan2008study}. We could also highlight the interactions that were already known (\textbf{\textcolor{blue}{blue}}), the new interactions supported by the gene expression (\textbf{\textcolor{green}{green}}) and the interactions both known and supported by the data (\textbf{\textcolor{red}{red}}), see Figure~\ref{fig:topocol}. <>= ## preparing colors mycols <- c("blue", "green", "red") names(mycols) <- c("prior", "data", "both") myedgecols <- matrix("white", nrow=nrow(mytopoglobal), ncol=ncol(mytopoglobal), dimnames=dimnames(mytopoglobal)) ## prior only myedgecols[mytopoglobal == 0 & pn.priors.counts[goi,goi] >= 1] <- mycols["prior"] ## data only myedgecols[mytopoglobal == 1 & pn.priors.counts[goi,goi] < 1] <- mycols["data"] ## both in priors and data myedgecols[mytopoglobal == 1 & pn.priors.counts[goi,goi] >= 1] <- mycols["both"] @ \begin{figure}[!th] \centering <>= mytopoglobal2 <- (myedgecols != "white") + 0 ## network topology xnet2 <- network(x=mytopoglobal2, matrix.type="adjacency", directed=TRUE, loops=TRUE, vertex.attrnames=dimnames(mytopoglobal2)[[1]]) plot.network(x=xnet2, displayisolates=TRUE, displaylabels=TRUE, boxed.labels=FALSE, label.pos=0, arrowhead.cex=1.5, vertex.cex=2, vertex.col="royalblue", jitter=FALSE, pad=0.5, edge.col=myedgecols, coord=mynetlayout) @ \caption{Network topology where the edges are colored according to their source of evidence: \textbf{\textcolor{blue}{blue}} for edges supported by the priors only, \textbf{\textcolor{green}{green}} for edges supported by the data only, and \textbf{\textcolor{red}{red}} for edges supported by both the priors and the data.} \label{fig:topocol} \end{figure} \noindent In Figure~\ref{fig:topocol} we can see that all the interactions included in the priors have been also supported by the data (\textbf{\textcolor{red}{red}}) and have been inferred in the final network; only the self loops (\textbf{\textcolor{blue}{blue}}), which are not allowed by our regression-based network inference method, have been discarded. Many new interactions (\textbf{\textcolor{green}{green}}) have been inferred from the gene expression data. \medskip \noindent Although of interest, the topology does not tell us much about the quality of the network inference. Therefore we implemented two statistics to help us focus on the gene interactions that are well supported by the data: \begin{itemize} \item edge-specific stability, \item gene-specific prediction score. \end{itemize} \noindent The idea is to use a cross-validation procedure\footnote{\url{http://en.wikipedia.org/wiki/Cross-validation_(statistics)}} to infer multiple networks from different training datasets to assess both the stability of the inference and its predictive ability. The function \Rfunction{netinf.cv}, although computationally intensive, is doing all the work for us. The vast majority of the parameters are the same than for the \Rfunction{netinf} function, with some additions such as \Robject{nfold} that is the number of folds for the cross-validation procedure. <>= myres.cv <- netinf.cv(data=data.ras[ ,goi], categories=3, priors=pn.priors.counts[goi,goi], maxparents=4, priors.weight=0.5, method="regrnet", nfold=10, seed=54321) @ \noindent The object \Robject{myres.cv} contains a lot of information related to the network inference process: <>= print(str(myres.cv, 1)) @ where \begin{description} \item[net ] Model object of the network inferred using the entire dataset. \item[net.cv ] List of models network models fitted at each fold of the cross-validation. \item[topology ] Topology of the model inferred using the entire dataset. \item[topology.coeff ] coefficients of each local linear regression model fitted using the entire dataset. \item[topology.cv ] Topology of the networks inferred at each fold of the cross-validation. \item[topology.coeff ] coefficients of each local linear regression model fitted at each fold of the cross-validation. \item[prediction.score.cv ] List of prediction scores ($R^2$, $NRMSE$, $MCC$) computed at each fold of the cross-validation. \item[edge.stability ] Stability of the edges inferred during cross-validation; only the stability of the edges present in the network inferred using the entire dataset is reported. \item[edge.stability.cv ] Stability of the edges inferred during cross-validation. \end{description} \noindent We can now extract these statistics to better quantify the robustness of the network inference. \subsection{Edge-specific stability} At each fold of the cross-validation, a network is inferred using a slightly different dataset. This variation in the set of observations used to fit the network model induces some variation at the level of the inference. Some edges may be poorly supported by the data and therefore their inference strongly depends on the training dataset and is not generalizable. Since we performed a 10-fold cross-validation, we can calculate for each edge the frequency of its presence in the inferred networks, the most robust edge being inferred 10 times, the less robust ones only once or twice. \medskip \noindent Let's display the topology of the network inferred from the entire dataset where the edges are colored according to their stability (Figure~\ref{fig:topostab}). <>= ## preparing colors ii <- 0:10 mycols <- c("#BEBEBE", rev(rainbow(10, v=0.8, alpha=0.5))) names(mycols) <- as.character(ii/10) myedgecols <- matrix("#00000000", nrow=nrow(mytopoglobal), ncol=ncol(mytopoglobal), dimnames=dimnames(mytopoglobal)) for(k in 1:length(mycols)) { myedgecols[myres.cv$edge.stability == names(mycols)[k]] <- mycols[k] } myedgecols[!mytopoglobal] <- "#00000000" @ <>= def.par <- par(no.readonly=TRUE) layout(rbind(1,2), heights=rbind(8,1)) par(mar=c(1,1,1,1)) ## network topology xnet3 <- network(x=mytopoglobal, matrix.type="adjacency", directed=TRUE, loops=FALSE, vertex.attrnames=dimnames(mytopoglobal)[[1]]) plot.network(x=xnet3, displayisolates=TRUE, displaylabels=TRUE, boxed.labels=FALSE, label.pos=0, arrowhead.cex=1.5, vertex.cex=2, vertex.col="royalblue", jitter=FALSE, pad=0.5, edge.col=myedgecols, coord=mynetlayout) par(mar=c(0,3,1,3)) plot(ii+1, ii+10/6+1, bty="n", type="n", yaxt="n", xaxt="n", ylab="", xlab ="", main="Stability scale", cex.main=1) rect(xleft=ii+0.5, ybottom=3, xright=ii+1.4, ytop=10+3, col=mycols, border="grey") text(ii+1, y=2.4, labels=names(mycols), pos=3, cex=1) par(def.par) @ \begin{figure}[!th] \centering <>= def.par <- par(no.readonly=TRUE) layout(rbind(1,2), heights=rbind(8,1)) par(mar=c(1,1,1,1)) ## network topology xnet3 <- network(x=mytopoglobal, matrix.type="adjacency", directed=TRUE, loops=FALSE, vertex.attrnames=dimnames(mytopoglobal)[[1]]) plot.network(x=xnet3, displayisolates=TRUE, displaylabels=TRUE, boxed.labels=FALSE, label.pos=0, arrowhead.cex=1.5, vertex.cex=2, vertex.col="royalblue", jitter=FALSE, pad=0.5, edge.col=myedgecols, coord=mynetlayout) par(mar=c(0,3,1,3)) plot(ii+1, ii+10/6+1, bty="n", type="n", yaxt="n", xaxt="n", ylab="", xlab ="", main="Stability scale", cex.main=1) rect(xleft=ii+0.5, ybottom=3, xright=ii+1.4, ytop=10+3, col=mycols, border="grey") text(ii+1, y=2.4, labels=names(mycols), pos=3, cex=1) par(def.par) @ \caption{Network topology where the edges are colored according to their stability.} \label{fig:topostab} \end{figure} \noindent As can be seen, the inference of more than half of the edges is very stable, especially around the highly connected genes, that are IL13RA2, PLAUR, PTGS2, FOS. However the inference of the interactions with PPBP is unstable. \subsection{Gene-specific prediction score} Since our regression-based network inference approach actually fits local regression models to assess the dependence between the parent genes and the target/child gene, we can actually quantify the strength of this dependence by assessing the predictive ability of the network model for each individual gene. Several performance criteria have been implemented so far: \begin{itemize} \item $R^2$: proportion of variance of the target/child gene explained by the regression model. The value lies in $ \{0, 1\}$, the larger, the better. \item $NRMSE$: normalized root mean squared error of the regression model. The values are $ > 0$, the lower, the better. %The value lies approximately in $ \{0, 1\}$, the lower, the better. \item $MCC$: Matthews correlation coefficient (also called multi-class correlation). This is a performance criterion widely used in the classification framework so it requires first a discretization of the gene expression data in classes (this is done implicitly by the functions \Rfunction{netinf.cv} and \Rfunction{pred.score}). The value lies in $ \{0, 1\}$, the larger, the better. \end{itemize} \noindent Let's focus on $R^2$ and $MCC$. We can easily display the topology by coloring nodes according to the predictive ability of the network estimated by $R^2$ (Figure~\ref{fig:topopredabr2}) and $MCC$ (Figure~\ref{fig:topopredabmcc}). Here is the code for generating the plot reporting the $R^2$ estimations: <>= myr2 <- apply(myres.cv$prediction.score.cv$r2, 2, mean, na.rm=TRUE) myr2 <- round(myr2*10)/10 ## preparing colors ii <- 0:10 mycols <- c("#BEBEBE", rev(rainbow(10, v=0.8, alpha=0.5))) names(mycols) <- as.character(ii/10) myvertexcols <- mycols[match(myr2, names(mycols))] names(myvertexcols) <- names(myr2) @ <>= def.par <- par(no.readonly=TRUE) layout(rbind(1,2), heights=rbind(8,1)) par(mar=c(1,1,1,1)) ## network topology xnet3 <- network(x=mytopoglobal, matrix.type="adjacency", directed=TRUE, loops=FALSE, vertex.attrnames=dimnames(mytopoglobal)[[1]]) plot.network(x=xnet3, displayisolates=TRUE, displaylabels=TRUE, boxed.labels=FALSE, label.pos=0, arrowhead.cex=1.5, vertex.cex=2, vertex.col=myvertexcols, jitter=FALSE, pad=0.5, edge.col="black", coord=mynetlayout) par(mar=c(0,3,1,3)) plot(ii+1, ii+10/6+1, bty="n", type="n", yaxt="n", xaxt="n", ylab="", xlab ="", main="$R^2$", cex.main=1) rect(xleft=ii+0.5, ybottom=3, xright=ii+1.4, ytop=10+3, col=mycols, border="grey") text(ii+1, y=2.4, labels=names(mycols), pos=3, cex=1) par(def.par) @ \begin{figure}[!th] \centering <>= def.par <- par(no.readonly=TRUE) layout(rbind(1,2), heights=rbind(8,1)) par(mar=c(1,1,1,1)) ## network topology xnet3 <- network(x=mytopoglobal, matrix.type="adjacency", directed=TRUE, loops=FALSE, vertex.attrnames=dimnames(mytopoglobal)[[1]]) plot.network(x=xnet3, displayisolates=TRUE, displaylabels=TRUE, boxed.labels=FALSE, label.pos=0, arrowhead.cex=1.5, vertex.cex=2, vertex.col=myvertexcols, jitter=FALSE, pad=0.5, edge.col="black", coord=mynetlayout) par(mar=c(0,3,1,3)) plot(ii+1, ii+10/6+1, bty="n", type="n", yaxt="n", xaxt="n", ylab="", xlab ="", main="$R^2$", cex.main=1) rect(xleft=ii+0.5, ybottom=3, xright=ii+1.4, ytop=10+3, col=mycols, border="grey") text(ii+1, y=2.4, labels=names(mycols), pos=3, cex=1) par(def.par) @ \caption{Network topology where the genes are colored according to the predictive performance of the network measure by $R^2$ in cross-validation.} \label{fig:topopredabr2} \end{figure} \noindent A similar piece of code could be used to generate the plot reporting the $MCC$ estimations. <>= mymcc <- apply(myres.cv$prediction.score.cv$mcc, 2, mean, na.rm=TRUE) mymcc <- round(mymcc*20)/20 mymcc[mymcc < 0.5] <- 0.5 ## preparing colors ii <- 0:10 mycols <- c("#BEBEBE", rev(rainbow(10, v=0.8, alpha=0.5))) names(mycols) <- seq(0.5, 1, 0.05) myvertexcols <- mycols[match(mymcc, names(mycols))] names(myvertexcols) <- names(mymcc) @ <>= def.par <- par(no.readonly=TRUE) layout(rbind(1,2), heights=rbind(8,1)) par(mar=c(1,1,1,1)) ## network topology xnet3 <- network(x=mytopoglobal, matrix.type="adjacency", directed=TRUE, loops=FALSE, vertex.attrnames=dimnames(mytopoglobal)[[1]]) plot.network(x=xnet3, displayisolates=TRUE, displaylabels=TRUE, boxed.labels=FALSE, label.pos=0, arrowhead.cex=1.5, vertex.cex=2, vertex.col=myvertexcols, jitter=FALSE, pad=0.5, edge.col="black", coord=mynetlayout) par(mar=c(0,3,1,3)) plot(ii+1, ii+10/6+1, bty="n", type="n", yaxt="n", xaxt="n", ylab="", xlab ="", main="$MCC$", cex.main=1) rect(xleft=ii+0.5, ybottom=3, xright=ii+1.4, ytop=10+3, col=mycols, border="grey") text(ii+1, y=2.4, labels=names(mycols), pos=3, cex=1) par(def.par) @ \begin{figure}[!th] \centering <>= def.par <- par(no.readonly=TRUE) layout(rbind(1,2), heights=rbind(8,1)) par(mar=c(1,1,1,1)) ## network topology xnet3 <- network(x=mytopoglobal, matrix.type="adjacency", directed=TRUE, loops=FALSE, vertex.attrnames=dimnames(mytopoglobal)[[1]]) plot.network(x=xnet3, displayisolates=TRUE, displaylabels=TRUE, boxed.labels=FALSE, label.pos=0, arrowhead.cex=1.5, vertex.cex=2, vertex.col=myvertexcols, jitter=FALSE, pad=0.5, edge.col="black", coord=mynetlayout) par(mar=c(0,3,1,3)) plot(ii+1, ii+10/6+1, bty="n", type="n", yaxt="n", xaxt="n", ylab="", xlab ="", main="$MCC$", cex.main=1) rect(xleft=ii+0.5, ybottom=3, xright=ii+1.4, ytop=10+3, col=mycols, border="grey") text(ii+1, y=2.4, labels=names(mycols), pos=3, cex=1) par(def.par) @ \caption{Network topology where the genes are colored according to the predictive performance of the network measured by $MCC$ in cross-validation.} \label{fig:topopredabmcc} \end{figure} \medskip \noindent As can be seen in Figures~\ref{fig:topopredabr2} and ~\ref{fig:topopredabmcc}, the local predictive ability we get with the inferred network is quite low. This may be due to the small number of genes we consider in the RAS pathway what provide us with little information about causality of many genes (a majority of them are unconnected or have no parents). If we look at the genes with the best prediction we can see that the assessment of the network predictive ability is concordant with its stability analysis: the genes around PLAUR,, IL13RA and PTGS2 are quite well predicted while the genes connected with unstable edges are poorly predicted. \paragraph{Validation} Even though we used a proper cross-validation procedure to get an unbiased estimate of the predictive ability of the model, it is of interest to validate our network model in a fully independent dataset. The task is challenging because of the various biases that are inevitably present in a dataset generated from a different population of colon cancer patients and in a different lab. Although the resulting hidden batch effects might dramatically drive down the performance of a model, this additional validation step is necessary to assess the quality of the model in a real word situation where we cannot control for all the batch effects. In this course, we use our second dataset of colon cancer patients to validate the performance of the network model inferred from the first dataset. we can actually compute the predictions and the corresponding $R^2$ and $MCC$ estimates using the following piece of code and then display the network using the same code than before (see Figures~\ref{fig:topopredabr2test} and \ref{fig:topopredabmcctest} for the $R^2$ and $MCC$ respectively). <>= ## make the network model predictive mynet <- net2pred(net=mynet, data=data.ras[ ,goi], method="linear") ## compute predictions mynet.test.pred <- netinf.predict(net=mynet, data=data2.ras[ ,goi]) ## performance estimation: R2 mynet.test.r2 <- pred.score(data=data2.ras[ ,goi], pred=mynet.test.pred, method="r2") ## performance estimation: MCC mynet.test.mcc <- pred.score(data=data2.ras[ ,goi], categories=3, pred=mynet.test.pred, method="mcc") @ <>= mynet.test.r2 <- round(mynet.test.r2*10)/10 ## preparing colors ii <- 0:10 mycols <- c("#BEBEBE", rev(rainbow(10, v=0.8, alpha=0.5))) names(mycols) <- as.character(ii/10) myvertexcols <- mycols[match(mynet.test.r2, names(mycols))] names(myvertexcols) <- names(mynet.test.r2) @ <>= def.par <- par(no.readonly=TRUE) layout(rbind(1,2), heights=rbind(8,1)) par(mar=c(1,1,1,1)) ## network topology xnet3 <- network(x=mytopoglobal, matrix.type="adjacency", directed=TRUE, loops=FALSE, vertex.attrnames=dimnames(mytopoglobal)[[1]]) plot.network(x=xnet3, displayisolates=TRUE, displaylabels=TRUE, boxed.labels=FALSE, label.pos=0, arrowhead.cex=1.5, vertex.cex=2, vertex.col=myvertexcols, jitter=FALSE, pad=0.5, edge.col="black", coord=mynetlayout) par(mar=c(0,3,1,3)) plot(ii+1, ii+10/6+1, bty="n", type="n", yaxt="n", xaxt="n", ylab="", xlab ="", main="$R^2$", cex.main=1) rect(xleft=ii+0.5, ybottom=3, xright=ii+1.4, ytop=10+3, col=mycols, border="grey") text(ii+1, y=2.4, labels=names(mycols), pos=3, cex=1) par(def.par) @ \begin{figure}[!th] \centering <>= def.par <- par(no.readonly=TRUE) layout(rbind(1,2), heights=rbind(8,1)) par(mar=c(1,1,1,1)) ## network topology xnet3 <- network(x=mytopoglobal, matrix.type="adjacency", directed=TRUE, loops=FALSE, vertex.attrnames=dimnames(mytopoglobal)[[1]]) plot.network(x=xnet3, displayisolates=TRUE, displaylabels=TRUE, boxed.labels=FALSE, label.pos=0, arrowhead.cex=1.5, vertex.cex=2, vertex.col=myvertexcols, jitter=FALSE, pad=0.5, edge.col="black", coord=mynetlayout) par(mar=c(0,3,1,3)) plot(ii+1, ii+10/6+1, bty="n", type="n", yaxt="n", xaxt="n", ylab="", xlab ="", main="$R^2$", cex.main=1) rect(xleft=ii+0.5, ybottom=3, xright=ii+1.4, ytop=10+3, col=mycols, border="grey") text(ii+1, y=2.4, labels=names(mycols), pos=3, cex=1) par(def.par) @ \caption{Network topology where the genes are colored according to the predictive performance of the network measured by $R^2$ in a fully independent dataset.} \label{fig:topopredabr2test} \end{figure} <>= mynet.test.mcc <- round(mynet.test.mcc*20)/20 mynet.test.mcc[mynet.test.mcc < 0.5] <- 0.5 ## preparing colors ii <- 0:10 mycols <- c("#BEBEBE", rev(rainbow(10, v=0.8, alpha=0.5))) names(mycols) <- seq(0.5, 1, 0.05) myvertexcols <- mycols[match(mynet.test.mcc, names(mycols))] names(myvertexcols) <- names(mynet.test.mcc) @ <>= def.par <- par(no.readonly=TRUE) layout(rbind(1,2), heights=rbind(8,1)) par(mar=c(1,1,1,1)) ## network topology xnet3 <- network(x=mytopoglobal, matrix.type="adjacency", directed=TRUE, loops=FALSE, vertex.attrnames=dimnames(mytopoglobal)[[1]]) plot.network(x=xnet3, displayisolates=TRUE, displaylabels=TRUE, boxed.labels=FALSE, label.pos=0, arrowhead.cex=1.5, vertex.cex=2, vertex.col=myvertexcols, jitter=FALSE, pad=0.5, edge.col="black", coord=mynetlayout) par(mar=c(0,3,1,3)) plot(ii+1, ii+10/6+1, bty="n", type="n", yaxt="n", xaxt="n", ylab="", xlab ="", main="$MCC$", cex.main=1) rect(xleft=ii+0.5, ybottom=3, xright=ii+1.4, ytop=10+3, col=mycols, border="grey") text(ii+1, y=2.4, labels=names(mycols), pos=3, cex=1) par(def.par) @ \begin{figure}[!th] \centering <>= def.par <- par(no.readonly=TRUE) layout(rbind(1,2), heights=rbind(8,1)) par(mar=c(1,1,1,1)) ## network topology xnet3 <- network(x=mytopoglobal, matrix.type="adjacency", directed=TRUE, loops=FALSE, vertex.attrnames=dimnames(mytopoglobal)[[1]]) plot.network(x=xnet3, displayisolates=TRUE, displaylabels=TRUE, boxed.labels=FALSE, label.pos=0, arrowhead.cex=1.5, vertex.cex=2, vertex.col=myvertexcols, jitter=FALSE, pad=0.5, edge.col="black", coord=mynetlayout) par(mar=c(0,3,1,3)) plot(ii+1, ii+10/6+1, bty="n", type="n", yaxt="n", xaxt="n", ylab="", xlab ="", main="$MCC$", cex.main=1) rect(xleft=ii+0.5, ybottom=3, xright=ii+1.4, ytop=10+3, col=mycols, border="grey") text(ii+1, y=2.4, labels=names(mycols), pos=3, cex=1) par(def.par) @ \caption{Network topology where the genes are colored according to the predictive performance of the network measured by $MCC$ in a fully independent dataset.} \label{fig:topopredabmcctest} \end{figure} \noindent As expected, the performance of our model slightly decreased but the vast majority of our observations remain true. \subsection{Predictionet and Cytoscape\label{sec:cytoscape}} When large network are inferred ($>50$ variables), any plot will be extremely busy and complicated to interpret, This is the reason why the function \Rfunction{netinf2gml} has been implemented in the \Rpackage{predictionet} package in order to allow to export any networks inferred from the \Rfunction{netinf} or \Rfunction{netinf.cv} functions in a file that is readable in many third-party software. Let's export a \texttt{gml} file containing all the information about the network model we just inferred from the \Rfunction{netinf.cv} function. <>= netinf2gml(object=myres.cv, file="regrnet_expO_cv") @ A GML file called \texttt{regrnet\_expO\_cv.gml} and a Vizmap Property file \texttt{egrnet\_expO\_cv.props} are then created in the working directory, You can import the GML into Cytoscape and load the corresponding Vizmap property file. You can move vertices, filter them based on prediction scores, color the edges based on stability, and so on. \begin{figure}[!th] \centering \includegraphics[width=1\textwidth]{predictionet-cytoscape} \caption{Screenshot of a network inferred by \Rpackage{predictionet} and imported into Cytoscape with the Vizmap Property file \textit{predictionet\_vizmap2}. Size of each vertex (gene) is proportional to their prediction score; color of the edges report the edge-specific stability where gray $\rightarrow$ green $\rightarrow$ orange $\rightarrow$ red colors represent edges with low to high stability.} \label{fig:cytoscapescreen} \end{figure} \section{Comparison of network inference with respect to the priors} Based on our approach to quantify the stability and predictive ability of a gene interaction network, we are now able to statistically compare the performance of two or more networks. This network model selection could help us optimize a parameter such as the weight of the priors during the network inference (\Robject{priors.weight}) or identify the best method of network inference (Bayesian vs regression-based network inference for instance). \medskip \noindent Let's infer and compare the prediction scores (estimated in cross-validation) of five networks with \Robject{priors.weight} = 0, 0.25, 0.5, 0.75 and 1. <>= ## priors.weight 0 myres.cv.pw0 <- netinf.cv(data=data.ras[ ,goi], categories=3, priors=pn.priors.counts[goi,goi], maxparents=4, priors.weight=0, method="regrnet", nfold=10, seed=54321) ## priors.weight 0.25 myres.cv.pw025 <- netinf.cv(data=data.ras[ ,goi], categories=3, priors=pn.priors.counts[goi,goi], maxparents=4, priors.weight=0.25, method="regrnet", nfold=10, seed=54321) ## priors.weight 0.5 myres.cv.pw050 <- myres.cv ## priors.weight 0.75 myres.cv.pw075 <- netinf.cv(data=data.ras[ ,goi], categories=3, priors=pn.priors.counts[goi,goi], maxparents=4, priors.weight=0.75, method="regrnet", nfold=10, seed=54321) ## priors.weight 0 myres.cv.pw1 <- netinf.cv(data=data.ras[ ,goi], categories=3, priors=pn.priors.counts[goi,goi], maxparents=4, priors.weight=1, method="regrnet", nfold=10, seed=54321) @ %\medskip \noindent Now let's display the topology of the networks we just inferred by varying the weight we put on the priors (see Figure~\ref{fig:topoglobals}). <>= def.par <- par(no.readonly=TRUE) layout(mat=matrix(1:4, nrow=2, ncol=2, byrow=TRUE)) ## priors.weight 0 mytopot <- myres.cv.pw0$topology xnett <- network(x=mytopot, matrix.type="adjacency", directed=TRUE, loops=FALSE, vertex.attrnames=dimnames(mytopot)[[1]]) plot.network(x=xnett, displayisolates=TRUE, displaylabels=FALSE, boxed.labels=FALSE, label.pos=0, arrowhead.cex=1.5, vertex.cex=2, vertex.col="royalblue", jitter=FALSE, pad=0.5, edge.col="black", coord=mynetlayout, main="Priors weight = 0 (data only)") ## priors.weight 0.25 mytopot <- myres.cv.pw025$topology xnett <- network(x=mytopot, matrix.type="adjacency", directed=TRUE, loops=FALSE, vertex.attrnames=dimnames(mytopot)[[1]]) plot.network(x=xnett, displayisolates=TRUE, displaylabels=FALSE, boxed.labels=FALSE, label.pos=0, arrowhead.cex=1.5, vertex.cex=2, vertex.col="royalblue", jitter=FALSE, pad=0.5, edge.col="black", coord=mynetlayout, main="Priors weight = 0.25") ## priors.weight 0.75 mytopot <- myres.cv.pw075$topology xnett <- network(x=mytopot, matrix.type="adjacency", directed=TRUE, loops=FALSE, vertex.attrnames=dimnames(mytopot)[[1]]) plot.network(x=xnett, displayisolates=TRUE, displaylabels=FALSE, boxed.labels=FALSE, label.pos=0, arrowhead.cex=1.5, vertex.cex=2, vertex.col="royalblue", jitter=FALSE, pad=0.5, edge.col="black", coord=mynetlayout, main="Priors weight = 0.75") ## priors.weight 1 mytopot <- myres.cv.pw1$topology xnett <- network(x=mytopot, matrix.type="adjacency", directed=TRUE, loops=FALSE, vertex.attrnames=dimnames(mytopot)[[1]]) plot.network(x=xnett, displayisolates=TRUE, displaylabels=FALSE, boxed.labels=FALSE, label.pos=0, arrowhead.cex=1.5, vertex.cex=2, vertex.col="royalblue", jitter=FALSE, pad=0.5, edge.col="black", coord=mynetlayout, main="Priors weight = 1 (priors only)") par(def.par) @ \begin{figure}[!th] \centering <>= def.par <- par(no.readonly=TRUE) layout(mat=matrix(1:4, nrow=2, ncol=2, byrow=TRUE)) ## priors.weight 0 mytopot <- myres.cv.pw0$topology xnett <- network(x=mytopot, matrix.type="adjacency", directed=TRUE, loops=FALSE, vertex.attrnames=dimnames(mytopot)[[1]]) plot.network(x=xnett, displayisolates=TRUE, displaylabels=FALSE, boxed.labels=FALSE, label.pos=0, arrowhead.cex=1.5, vertex.cex=2, vertex.col="royalblue", jitter=FALSE, pad=0.5, edge.col="black", coord=mynetlayout, main="Priors weight = 0 (data only)") ## priors.weight 0.25 mytopot <- myres.cv.pw025$topology xnett <- network(x=mytopot, matrix.type="adjacency", directed=TRUE, loops=FALSE, vertex.attrnames=dimnames(mytopot)[[1]]) plot.network(x=xnett, displayisolates=TRUE, displaylabels=FALSE, boxed.labels=FALSE, label.pos=0, arrowhead.cex=1.5, vertex.cex=2, vertex.col="royalblue", jitter=FALSE, pad=0.5, edge.col="black", coord=mynetlayout, main="Priors weight = 0.25") ## priors.weight 0.75 mytopot <- myres.cv.pw075$topology xnett <- network(x=mytopot, matrix.type="adjacency", directed=TRUE, loops=FALSE, vertex.attrnames=dimnames(mytopot)[[1]]) plot.network(x=xnett, displayisolates=TRUE, displaylabels=FALSE, boxed.labels=FALSE, label.pos=0, arrowhead.cex=1.5, vertex.cex=2, vertex.col="royalblue", jitter=FALSE, pad=0.5, edge.col="black", coord=mynetlayout, main="Priors weight = 0.75") ## priors.weight 1 mytopot <- myres.cv.pw1$topology xnett <- network(x=mytopot, matrix.type="adjacency", directed=TRUE, loops=FALSE, vertex.attrnames=dimnames(mytopot)[[1]]) plot.network(x=xnett, displayisolates=TRUE, displaylabels=FALSE, boxed.labels=FALSE, label.pos=0, arrowhead.cex=1.5, vertex.cex=2, vertex.col="royalblue", jitter=FALSE, pad=0.5, edge.col="black", coord=mynetlayout, main="Priors weight = 1 (priors only)") par(def.par) @ \caption{Directed graph representing the topology of the networks inferred from priors and gene expression data with varying prior weight.} \label{fig:topoglobals} \end{figure} \noindent As can be seen in Figure~\ref{fig:topoglobals}, the networks generated from data only (\Robject{priors.weight} = 0) and from priors only (\Robject{priors.weight} = 1) are very sparse, and the combination of priors and data enables the identification of more gene interactions. But a denser topology does not imply that all the inferred interactions are well supported by the data. In order to select the best network(s) we can statistically compare their stability and predictive ability. \subsection{Comparison of edge-specific prediction scores} In order to compare the stability of the network inference, we can draw a boxplot representing the edge-specific stability of the network inference with respect to the weight put on the priors (Figure~\ref{fig:boxplotstabpw}). As expected, the network inferred from priors only is perfectly stable (the inference does not depend on the data anymore) but what is interesting is the gain in stability when the priors are used as compared to networks inferred from data alone. \begin{figure}[!th] \centering <>= gg <- c("0", "0.25", "0.50", "0.75", "1") mystab.cv.pw <- list(myres.cv.pw0$edge.stability[myres.cv.pw0$topology == 1], myres.cv.pw025$edge.stability[myres.cv.pw025$topology == 1], myres.cv.pw050$edge.stability[myres.cv.pw050$topology == 1], myres.cv.pw075$edge.stability[myres.cv.pw075$topology == 1], myres.cv.pw1$edge.stability[myres.cv.pw1$topology == 1]) names(mystab.cv.pw) <- gg boxplot(mystab.cv.pw, xlab="priors.weight", ylim=c(0, 1), ylab="Edge stability", border="grey", col="lightgrey", outline=FALSE) points(x=jitter(x=rep(1:length(mystab.cv.pw), times=sapply(mystab.cv.pw, length)), amount=0.15), y=unlist(mystab.cv.pw), cex=0.55, pch=16, col="royalblue") @ \caption{Boxplot of the edge-specific stability computed in cross-validation, for network models with different prior weight.} \label{fig:boxplotstabpw} \end{figure} \subsection{Comparison of gene-specific prediction score} In terms of predictive ability, the boxplot representing the gene-specific $R^2$ scores computed in cross-validation, for network models with different prior weights (Figure~\ref{fig:boxplotpw}), suggests that a combination of both data and priors yield better predictive ability. \begin{figure}[!th] \centering <>= gg <- c("0", "0.25", "0.50", "0.75", "1") myr2.cv.pw <- cbind(apply(myres.cv.pw0$prediction.score.cv$r2, 2, mean, na.rm=TRUE), apply(myres.cv.pw025$prediction.score.cv$r2, 2, mean, na.rm=TRUE), apply(myres.cv.pw050$prediction.score.cv$r2, 2, mean, na.rm=TRUE), apply(myres.cv.pw075$prediction.score.cv$r2, 2, mean, na.rm=TRUE), apply(myres.cv.pw1$prediction.score.cv$r2, 2, mean, na.rm=TRUE)) colnames(myr2.cv.pw) <- gg gg <- factor(rep(gg, each=genen), levels=gg) boxplot(myr2.cv.pw, xlab="priors.weight", ylim=c(0, 1), ylab="$R^2$", border="grey", col="lightgrey", outline=FALSE) points(x=jitter(x=rep(1:ncol(myr2.cv.pw), times=nrow(myr2.cv.pw)), amount=0.15), y=as.vector(t(myr2.cv.pw)), cex=0.55, pch=16, col="royalblue") @ \caption{Boxplot of the gene-specific $R^2$ scores computed in cross-validation, for network models with different prior weight.} \label{fig:boxplotpw} \end{figure} \noindent A rigorous statistical comparison does not allow us to claim that a network model inferred from data+priors yields always significantly better predictive ability than networks inferred from data or priors only. The Friedman test tells us that at least one of the network models yielded significantly different predictive ability; the pairwise Wilcoxon Rank Sum tests suggest that the model using the priors only (\Robject{priors.weight} = 1) is significantly less predictive than all the other network models; however the model inferred from data only does not yield statistically different predictive ability (although the p-value is close to significance)\footnote{An analysis involving more genes and better priors could show that network inferred from a combination of priors and data always lead to significantly more predictive models.}. <>= ## Friedman test to test whether at least one of the networks gives statstically different predictive ability print(friedman.test(y=myr2.cv.pw)) ## Pairwise Wilcoxon Rank Sum tests print(pairwise.wilcox.test(x=as.vector(myr2.cv.pw), g=gg, paired=TRUE, exact=FALSE, alternative="two.sided", p.adjust.method="holm")) @ \noindent Similar conclusions can be drawn by computing the $MCC$. \section{Comparison of network inference with respect to the training data} To identify the parts of the networks where interactions are well supported by the data we estimated in cross-validation edge-specific stabilities and gene-specific prediction scores of a given network model. Ultimately these statistics should help us identify the parts of the network that are generalizable, whose inference does not strongly depend on the dataset used as training set. Inference of generalizable network models is a challenging task because of the various biases that are inevitably present in datasets generated from different populations of cancer patients and in different labs. Therefore the resulting hidden batch effects might dramatically affect the inference of a network model. Let's use \Robject{expO.colon.ras} dataset as training set to infer our network model (use data only for inference, \Robject{priors.weight}=0). Now we will use our second dataset, \Robject{jorissen.colon.ras}, to infer another network model (use data only for inference, \Robject{priors.weight}=0) and compare them to test whether the network inference strongly depend on the training set or not. First infer a network from \Robject{expO.colon.ras} and \Robject{jerissen.colon.ras} datasets separately. <>= ## expO myres21.cv <- netinf.cv(data=data.ras[ ,goi], categories=3, priors=priors2.ras[goi,goi], maxparents=4, priors.weight=0, method="regrnet", nfold=10, seed=54321) ## jorissen myres22.cv <- netinf.cv(data=data2.ras[ ,goi], categories=3, priors=priors2.ras[goi,goi], maxparents=4, priors.weight=0, method="regrnet", nfold=10, seed=54321) @ Let's display the topologies of the networks inferred from these two datasets. As can be seen in Figure~\ref{fig:topo1topo2}, approximately half of the interactions are inferred in both datasets. <>= topo1 <- myres21.cv$topology topo2 <- myres22.cv$topology ## preparing colors myedgecols <- matrix("white", nrow=nrow(topo1), ncol=ncol(topo1), dimnames=dimnames(topo1)) myedgecols[topo1== 1 & topo2 == 1 & pn.priors.counts[rownames(topo1), colnames(topo1)] > 0] <- "orange" myedgecols[topo1== 1 & topo2 == 1 & pn.priors.counts[rownames(topo1), colnames(topo1)] <= 0] <- "red" def.par <- par(no.readonly=TRUE) layout(mat=matrix(1:2, nrow=1, ncol=2, byrow=TRUE)) ## expO mycolt <- myedgecols mycolt[myedgecols == "white" & topo1 == 1 ] <- "gray" xnett <- network(x=topo1, matrix.type="adjacency", directed=TRUE, loops=FALSE, vertex.attrnames=dimnames(mytopot)[[1]]) plot.network(x=xnett, displayisolates=TRUE, displaylabels=FALSE, boxed.labels=FALSE, label.pos=0, arrowhead.cex=1.5, vertex.cex=2, vertex.col="royalblue", jitter=FALSE, pad=0.5, edge.col=mycolt, coord=mynetlayout, main="Dataset: expO.colon.ras") ## jorissen mycolt <- myedgecols mycolt[myedgecols == "white" & topo2 == 1 ] <- "gray" xnett <- network(x=topo2, matrix.type="adjacency", directed=TRUE, loops=FALSE, vertex.attrnames=dimnames(mytopot)[[1]]) plot.network(x=xnett, displayisolates=TRUE, displaylabels=FALSE, boxed.labels=FALSE, label.pos=0, arrowhead.cex=1.5, vertex.cex=2, vertex.col="royalblue", jitter=FALSE, pad=0.5, edge.col=mycolt, coord=mynetlayout, main="Dataset: jorissen.colon.ras") par(def.par) @ \begin{figure}[!th] \centering <>= topo1 <- myres21.cv$topology topo2 <- myres22.cv$topology ## preparing colors myedgecols <- matrix("white", nrow=nrow(topo1), ncol=ncol(topo1), dimnames=dimnames(topo1)) myedgecols[topo1== 1 & topo2 == 1 & pn.priors.counts[rownames(topo1), colnames(topo1)] > 0] <- "orange" myedgecols[topo1== 1 & topo2 == 1 & pn.priors.counts[rownames(topo1), colnames(topo1)] <= 0] <- "red" def.par <- par(no.readonly=TRUE) layout(mat=matrix(1:2, nrow=1, ncol=2, byrow=TRUE)) ## expO mycolt <- myedgecols mycolt[myedgecols == "white" & topo1 == 1 ] <- "gray" xnett <- network(x=topo1, matrix.type="adjacency", directed=TRUE, loops=FALSE, vertex.attrnames=dimnames(mytopot)[[1]]) plot.network(x=xnett, displayisolates=TRUE, displaylabels=FALSE, boxed.labels=FALSE, label.pos=0, arrowhead.cex=1.5, vertex.cex=2, vertex.col="royalblue", jitter=FALSE, pad=0.5, edge.col=mycolt, coord=mynetlayout, main="Dataset: expO.colon.ras") ## jorissen mycolt <- myedgecols mycolt[myedgecols == "white" & topo2 == 1 ] <- "gray" xnett <- network(x=topo2, matrix.type="adjacency", directed=TRUE, loops=FALSE, vertex.attrnames=dimnames(mytopot)[[1]]) plot.network(x=xnett, displayisolates=TRUE, displaylabels=FALSE, boxed.labels=FALSE, label.pos=0, arrowhead.cex=1.5, vertex.cex=2, vertex.col="royalblue", jitter=FALSE, pad=0.5, edge.col=mycolt, coord=mynetlayout, main="Dataset: jorissen.colon.ras") par(def.par) @ \caption{Directed graph representing the topology of the networks inferred from two different datasets, \Robject{expO.colon.ras} and \Robject{jerissen.colon.ras}. The %\textbf{\textcolor{orange}{orange}} orange edges represent interactions that are both supported by the data and the priors during network inference; the \textbf{\textcolor{red}{red}} edges represent the interactions that are supported only by the gene expression data and inferred in both networks; the %\textbf{\textcolor{gray}{gray}} gray edges represent interactions identified only in one dataset but not the other. } \label{fig:topo1topo2} \end{figure} We could now ask the question whether the interactions inferred from both datasets are also the ones with high stability and/or involve genes with high prediction scores. We first compared the stability of interactions which are inferred only in one dataset and those which are inferred in both datasets (Figure~\ref{fig:topo1topo2stab}). As can be seen, the interactions common to both datasets tend to have a higher stability. <>= def.par <- par(no.readonly=TRUE) layout(mat=matrix(1:2, nrow=1, ncol=2, byrow=TRUE)) ## expO stab2 <- list("specific"=myres21.cv$edge.stability[(topo1 == 1 & topo2 == 0)], "common"=myres21.cv$edge.stability[topo1 == 1 & topo2 == 1]) wt <- wilcox.test(x=stab2$specific, y=stab2$common) boxplot(stab2, ylab="Stability", ylim=c(0, 1), xlab="", border="grey", col="lightgrey", outline=FALSE, sub=sprintf("Wilcoxon test p-value = %.1E", wt$p.value), main="Dataset: expO.colon.ras") points(x=jitter(x=rep(1:length(stab2), times=sapply(stab2, length)), amount=0.15), y=unlist(stab2), cex=0.55, pch=16, col="royalblue") ## jorissen stab2 <- list("specific"=myres22.cv$edge.stability[(topo1 == 0 & topo2 == 1)], "common"=myres22.cv$edge.stability[topo1 == 1 & topo2 == 1]) wt <- wilcox.test(x=stab2$specific, y=stab2$common) boxplot(stab2, ylab="Stability", ylim=c(0, 1), xlab="", border="grey", col="lightgrey", outline=FALSE, sub=sprintf("Wilcoxon test p-value = %.1E", wt$p.value), main="Dataset: jorissen.colon.ras") points(x=jitter(x=rep(1:length(stab2), times=sapply(stab2, length)), amount=0.15), y=unlist(stab2), cex=0.55, pch=16, col="royalblue") par(def.par) @ \begin{figure}[!th] \centering <>= def.par <- par(no.readonly=TRUE) layout(mat=matrix(1:2, nrow=1, ncol=2, byrow=TRUE)) ## expO stab2 <- list("specific"=myres21.cv$edge.stability[(topo1 == 1 & topo2 == 0)], "common"=myres21.cv$edge.stability[topo1 == 1 & topo2 == 1]) wt <- wilcox.test(x=stab2$specific, y=stab2$common) boxplot(stab2, ylab="Stability", ylim=c(0, 1), xlab="", border="grey", col="lightgrey", outline=FALSE, sub=sprintf("Wilcoxon test p-value = %.1E", wt$p.value), main="Dataset: expO.colon.ras") points(x=jitter(x=rep(1:length(stab2), times=sapply(stab2, length)), amount=0.15), y=unlist(stab2), cex=0.55, pch=16, col="royalblue") ## jorissen stab2 <- list("specific"=myres22.cv$edge.stability[(topo1 == 0 & topo2 == 1)], "common"=myres22.cv$edge.stability[topo1 == 1 & topo2 == 1]) wt <- wilcox.test(x=stab2$specific, y=stab2$common) boxplot(stab2, ylab="Stability", ylim=c(0, 1), xlab="", border="grey", col="lightgrey", outline=FALSE, sub=sprintf("Wilcoxon test p-value = %.1E", wt$p.value), main="Dataset: jorissen.colon.ras") points(x=jitter(x=rep(1:length(stab2), times=sapply(stab2, length)), amount=0.15), y=unlist(stab2), cex=0.55, pch=16, col="royalblue") par(def.par) @ \caption{Boxplots reporting the stability of edges which are inferred only in one dataset and those which are inferred in both datasets.} \label{fig:topo1topo2stab} \end{figure} We can also compare the prediction scores in the networks inferred from the two different datasets. As illustrated in Figure~\ref{fig:topo1topo2pred}, there is a weak correlation between the prediction scores ($R^2$) computed in the two datasets. So a gene with high prediction score in one dataset may have a low score in another dataset, what makes this statistic not a good surrogate for generalization in network inference. This observation holds true for $MCC$. \begin{figure}[!th] \centering <>= pred2 <- list("expO"=apply(myres21.cv$prediction.score.cv$r2, 2, mean, na.rm=TRUE), "jorissen"=apply(myres22.cv$prediction.score.cv$r2, 2, mean, na.rm=TRUE)) plot(x=pred2$expO, y=pred2$jorissen, xlim=c(0, 0.6), ylim=c(0, 0.6), pch=16, col="royalblue", xlab="Prediction scores in expO.colon.ras", ylab="Prediction scores in jorissen.colon.ras") legend(x="bottomright", legend=sprintf("cor = %.2g", cor(pred2$expO, pred2$jorissen, method="spearman", use="complete.obs")), bty="n") @ \caption{Prediction scores estimated in \Robject{expO.colon.ras} and \Robject{jorissen.colon.ras} and their corresponding Spearman correlation coefficient.} \label{fig:topo1topo2pred} \end{figure} \clearpage \section*{Session Info} <>= toLatex(sessionInfo(), locale=FALSE) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \clearpage \bibliographystyle{plainnat} \bibliography{biblio} \end{document}