%\textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in %\headheight=-.3in %\VignetteIndexEntry{Case study: predicting protein function} %\VignettePackage{BiocStyle} %\VignetteEngine{utils::Sweave} \documentclass[12pt]{article} <>= BiocStyle::latex() @ \usepackage{dsfont} \usepackage{booktabs} \usepackage{fixltx2e} %subscripts %\usepackage{hyperref} \newcommand{\textq}[1]{``#1''} % \newcommand{\Rcode}[1]{{\texttt{#1}}} % \newcommand{\Robject}[1]{{\texttt{#1}}} % \newcommand{\Rfunction}[1]{{\texttt{#1}}} % \newcommand{\Rpackage}[1]{{\textit{#1}}} % \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\textit{#1}}} \newcommand{\Rfunarg}[1]{{\textit{#1}}} \begin{document} \SweaveOpts{concordance=TRUE, include = TRUE, echo=TRUE} %\SweaveOpts{fig=FALSE, prefix.string=yeast_latexfiles/} %\setkeys{Gin}{width=1\textwidth} \bioctitle[The diffuStats R package]{An R package to compute diffusion-based scores on biological networks: diffuStats} \author[1,2]{Sergio Picart-Armada\thanks{\email{sergi.picart@upc.edu}}} \author[3,4]{Wesley K. Thompson} \author[3]{Alfonso Buil} \author[1,2]{Alexandre Perera-Lluna} \affil[1]{B2SLab, Departament d'Enginyeria de Sistemes, Autom\`atica i Inform\`atica Industrial, Universitat Polit\`ecnica de Catalunya, CIBER-BBN, Barcelona, 08028, Spain} \affil[2]{Institut de Recerca Pedi\`atrica Hospital Sant Joan de D\'eu, Esplugues de Llobregat, Barcelona, 08950, Spain} \affil[3]{Mental Health Center Sct. Hans, 4000 Roskilde, Denmark} \affil[4]{Department of Family Medicine and Public Health, University of California, San Diego, La Jolla, California, USA} %\small{\author{Sergio Picart-Armada, %Wesley K. Thompson, \\ %Alfonso Buil and Alexandre Perera-Lluna}} \maketitle \section{Abstract} Label propagation approaches are a standard and ubiquitous procedure in computational biology for giving context to molecular entities. Node labels, which can derive from gene expression, genome-wide association studies, protein domains or metabolomics profiling, are propagated to their neighbours, effectively smoothing the scores through prior annotated knowledge and prioritising novel candidates. However, there are several settings to tune when defining the diffusion process, including the diffusion kernel, the numeric codification of the labels and a choice of statistical normalisation of the scores. These settings can have a large impact on results, and there is currently no software implementing many of them in one place to screen their performance in the application of interest. This vignette presents \Rpackage{diffuStats}, an R package with a collection of diffusion kernels and scores, as well as a parallel permutation analysis for the normalised scores, that eases the analysis of several sets of molecular entities at once. \section{Introduction} The application of label propagation algorithms \cite{labelpropagation} is based on the guilt by association principle \cite{gba}, which can be rephrased in the protein-protein interaction context as \textq{proteins that interact are more likely to share biological functions}. However, this principle is extremely general and has experienced success in numerous applications in bioinformatics. HotNet \cite{hotnet} uses a diffusion process with mutated genes as seed nodes to find modules with a statistically high number of mutated genes in cancer. Another attempt to find relevant modules from gene expression and mutation data can be found in \cite{mosca}, where the authors propose a diffusion process followed by a statistical normalisation and an automatic process to extract a subnetwork. TieDIE \cite{tiedie} runs two diffusion processes to link perturbation in the genome with changes in the transcriptome, effectively linking two sets of genes. GeneMANIA \cite{genemania} is a web server that predicts gene function using label propagation with a bias on the unlabelled nodes. Network-based learning sharing a background with diffusion has been also applied to protein classification using multiple networks \cite{fastprotein}. Label propagation using graph kernels has been proven successful in gene-disease association \cite{valentini, diffusion_gwas}. Equivalent formulations can be found under different terminology, like the electrical model applied to prioritise candidate genes in eQTL in \cite{electricalmodel}. The heterogeneity of applications hinders comparisons among approaches, therefore tools gathering the state of the art are highly needed. An existing solution is \Rpackage{RANKS}, an R package that contains a variety of diffusion kernels and kernelised scores for label propagation using a binary input vector. \Rpackage{RANKS} eases kernelised scores benchmarking and models it as a \textq{one-class} classification semi-supervised learning problem, in which only some members of the class (positives) are known. Another possibility is to divide nodes into labelled positive, negative and unlabelled, like in \cite{genemania}, which poses questions like the effect of unlabelled nodes and possible numeric codifications of the labels, or the option to include quantitative data in the labels. In addition, statistical normalisations such as in \cite{mosca} remove the effect of network structures like hubs and should be taken into account when choosing a diffusion scoring method. This motivates the introduction of our \Rpackage{diffuStats} R package, which collects widely adopted input codifications and explicitly accounts for unlabelled nodes. It also includes three statistically normalised scores, which can be obtained through Monte Carlo trials or a parametric alternative. The \Rpackage{diffuStats} package uses existent classes and provides high-level functions to screen the performance of several diffusion scores, in order to facilitate their integration in any computational biology study. \section{Methodology} One of the main purposes of \Rpackage{diffuStats} is to offer a battery of approaches to compute and compare diffusion scores. The diffusion scores $f$ using an input vector $y$ and a diffusion kernel $K$ are generally computed as $$ f = K\cdot y$$ possibly followed by further adjustments or a statistical normalisation. The decisions taken in the definition of $K$, $y$ and the posterior normalisation generally give rise to different priorisations due to a different treatment of the balance between positive and negative examples, the unlabelled data and the network structure. The following sections cover the implemented choices for the kernel $K$ and the initial labels $y$. \subsection{Diffusion kernels and regularisation} The representation of any kind of data in a network model allows the definition of notions like distance or similarity based on the links in the network. This section will follow the notation in \cite{smola} and summarise the kernels proposed by the authors. In general, an undirected graph $G = (V, E)$ consists of a set of $n$ nodes $V$ and a set of edges $E$ of unordered pairs of nodes. This can be extended to weighted, undirected graphs, where each edge $i \sim j$ has a weight attribute $W_{ij} \in [0, \infty)$. The degree matrix of $G$ is defined as the $n \times n$ diagonal matrix so that $D_{ii} = \sum_{j=1}^n W_{ij}$. The (unnormalised) Laplacian of $G$ is defined as the $n \times n$ matrix $L = D - W$, whereas its normalised version is $\tilde{L} = D^{-\frac{1}{2}}\cdot L \cdot D^{-\frac{1}{2}}$. The graph Laplacian is diagonalisable and can be written in terms of its eigenvalues $v_j$ and eigenvectors $\lambda_j$, as $L = \sum_{j=1}^n \lambda_j v_j v_j^T$. The proposed kernels stem from a family of regularisation functions $r(\lambda)$ on the spectrum of the graph Laplacian: $$K = \sum_{j=1}^n r^{-1}(\lambda_j) v_j v_j^T$$ Well known graph kernels belong to this family because they can be written as transformations on the Laplacian spectrum. Table \ref{tab:kernels} summarises them, assuming the usage of the normalised Laplacian - the unnormalised Laplacian can also be used as long as the resulting kernel is still positive semidefinite. Further details about this family of kernels, all available in our package \Rpackage{diffuStats}, can be found in the original manuscript \cite{smola}. \begin{table}[!t] \begin{center} \begin{tabular}{@{}ll@{}} \toprule Kernel & Function \\\midrule Regularised Laplacian & $r(\lambda) = 1 + \sigma^2\lambda$ \\ Diffusion process & $r(\lambda) = \exp(\frac{\sigma^2}{2}\lambda)$ \\ $p$-Step random walk & $r(\lambda) = (a - \lambda)^{-p}$ with $a \geq 2$, $p \geq 1$ \\ Inverse cosine & $r(\lambda) = (cos(\lambda \frac{\pi}{4}))^{-1}$ \\\bottomrule \end{tabular} \caption{Implemented diffusion kernels from \cite{smola}} \label{tab:kernels} \end{center} \end{table} Additionally, the \Rpackage{diffuStats} package includes the commute time kernel, introduced in \cite{ctkernel}. This kernel, also writable in terms of a regularisation function, is simply the pseudoinverse of the graph Laplacian, $K = L^+$. The default option in the \Rpackage{diffuStats} package is the regularised Laplacian kernel, as it is widely adopted and describes many physical models, for instance in \cite{hotnet, tiedie, electricalmodel}. \subsection{Diffusion scores} Besides choosing a graph kernel, the codification of the input and the presence of a statistical normalisation can lead to important differences in the results. Table \ref{tab:scores} gives an overview of the implemented scores, which will be detailed in the following sections. The argument \Rfunarg{method} in the function \Rfunction{diffuse} can be set to the desired scores in table \ref{tab:scores}, which are described in the following sections. The numeric values of the positive, negative and unlabelled examples are respectively $y^+$, $y^-$ and $y^u$. Column \textq{normalised} refers to the application of a statistical model involving permutations, whereas \textq{stochastic} enumerates the normalised scores that need actual Monte Carlo permutations. The scores that also accept quantitative inputs instead of binary labels are listed in the \textq{quantitative} column. \begin{table}[!t] \begin{center} \resizebox{\textwidth}{!}{ \begin{tabular}{@{}llllllll@{}} \toprule Score & $y^+$ & $y^-$ & $y^u$ & Normalised & Stochastic & Quantitative & Reference \\\midrule raw & 1 & 0 & 0 & No & No & Yes & \cite{hotnet} \\ ml & 1 & -1 & 0 & No & No & No & \cite{fastprotein, labelpropagation} \\ gm & 1 & -1 & $k$ & No & No & No & \cite{genemania} \\ ber\textsubscript{s} & 1 & 0 & 0 & No & No & Yes & \cite{mosca} \\ ber\textsubscript{p} & 1 & 0 & 0* & Yes & Yes & Yes & \cite{mosca} \\ mc & 1 & 0 & 0* & Yes & Yes & Yes & \cite{mosca} \\ z & 1 & 0 & 0* & Yes & No & Yes & \cite{kerneltesting} \\\bottomrule \end{tabular} } \caption{Implemented diffusion scores} \label{tab:scores} \end{center} \end{table} \subsubsection{Scores without statistical normalisation} The base diffusion score \Rcode{raw}, which has been used in algorithms like HotNet \cite{hotnet} and TieDIE \cite{tiedie}, solves a diffusion problem in terms of the regularised Laplacian kernel \cite{smola}. $$ f_{raw} = K\cdot y_{raw} $$ $K$ is the kernel matrix and $y_{raw}$ the vector of codified inputs. In general, the $i$-th component of $y$ equals $y^+$ if node $i$ is a positive, $y^-$ if $i$ is a negative and $y^u$ if $i$ is unlabelled. In the particular case of $y_{raw}$, the positively labelled nodes introduce one flow unit in the network ($y_{raw}^{+}=1$), whereas the negative and unlabelled nodes are treated equivalently and do not introduce anything ($y_{raw}^{-}=y_{raw}^{u}=0$). In the physical model using the regularised Laplacian kernel, the flow can be evacuated from the graph due to the presence of first-order leaking in every node, see \cite{hotnet} for further details on this. This formulation is proportional up to a scaling factor to the average score in \Rpackage{RANKS}. On the other hand, the classical label propagation \cite{labelpropagation} treats positives as $y_{ml}^{+}=1$ and negatives as $y_{ml}^{-}=-1$, while unlabelled nodes remain as $y_{ml}^{u}=0$, thus making a distinction between the last two. A biological example can be found in protein classification \cite{fastprotein}. This option is available as \Rcode{ml}, and intuitively scores a node by counting if the majority of its neighbours vote positive or negative: $$ f_{ml} = K\cdot y_{ml} $$ The authors of GeneMANIA \cite{genemania} propose a modification on the \Rcode{ml} input - they adhere to $y_{gm}^{+}=1$ and $y_{gm}^{-}=-1$, but introduce a bias term in the unlabelled nodes $$y_{gm}^{u} = \frac{n^{+}-n^{-}}{n^{+}+n^{-}+n^{u}}$$ being $n^{+}$, $n^{-}$ and $n^{u}$ the number of positives, negatives and unlabelled entities. The \Rcode{gm} score is then computed through $$ f_{gm} = K\cdot y_{gm} $$ The last option in this part, named \Rcode{ber\_s} \cite{mosca}, is a quantification of the relative change in the node score before and after the network smoothing. The score for a particular node $i$ can be written as $$ f_{ber_s, i} = \frac{f_{raw, i}}{y_{raw, i} + \epsilon}$$ where $\epsilon > 0$ is a parameter that regulates the importance of the relative change. \subsubsection{Scores with statistical normalisation} Recently, the combination of a permutation analysis with diffusion processes has been suggested \cite{mosca}. This is a way to quantify how the diffusion score of a certain node compares to its score if the input was randomised - nodes that might have systematically high or low scores regardless of the input are normalised accordingly. The cornerstone of normalised scores is the empirical p-value \cite{empiricalpvalue} that indicates, for a node $i$, the proportion of input permutations that led to a diffusion score as high or higher than the original diffusion score. Specifically, $f_{raw}$ is compared to scores from random trials $j$, $f_{raw}^{null, j} = K\cdot \pi_j(y_{raw})$, where $\pi_j(y_{raw})$ is a permutation of $y_{raw}$ on the labelled entities. The empirical p-value for node $i$ is therefore defined as in \cite{empiricalpvalue}: $$p_i = \frac{r_i + 1}{n + 1}$$ being $r_i$ the number of trials $j$ in which $f_{raw, i}^{null, j} \geq f_{raw, i}$, and $n$ the total number of trials. To be consistent with the increasing direction of the scores, the \Rcode{mc} scores are defined as $$f_{mc, i} = 1 - p_i$$ Importantly, the permutation has been applied only to the observed nodes. Therefore, any node outside the labelled background cannot receive a different input score in a permuted input. This implies that even if both negatives and unlabelled nodes are assigned the same input value ($y^{-}=y^{u}=0$), the negatives are actually permuted and eventually exchanged for a $1$ in some permutations provided that the number of runs is large enough. For this reason, negatives and unlabelled nodes are not equivalent in these scores. A parametric alternative to $f_{mc}$, are \Rcode{z} scores, where each node $i$ is scored as: $$f_{z,i} = \frac{f_{raw, i} - E(f_{raw, i}^{null, j})} {\sqrt{V(f_{raw, i}^{null, j})}}$$ The expectation and variance are computed in a closed form and these scores do not require running actual permutations, therefore saving on computational time and avoiding stochastic models. The analytical expression proven below also works for the general case when the input has quantitative labels. First of all, note that all the unlabelled nodes will cancel out when substracting the expected value, so they can be excluded without loss of generality. Thus, let the row vector $\tilde{k}_{i}$ be the $i$-th row of the submatrix from $K$ including the columns indexed by the labelled nodes only, and let $n_{lab}$ be the number of labelled nodes. Analogously, let $\tilde{y}$ be the (quantitative or binary) inputs for the observed nodes, with the same indexing as $\tilde{k}_{i}$. Consider the following scalar quantities: $$ Sk_i^I = \sum_{j = 1}^{n_{lab}} (\tilde{k}_{i})_{j} $$ $$ Sk_i^{II} = \sum_{j = 1}^{n_{lab}} (\tilde{k}_{i})_{j}^2 $$ $$ Sy^I = \sum_{j = 1}^{n_{lab}} \tilde{y}_j $$ $$ Sy^{II} = \sum_{j = 1}^{n_{lab}} \tilde{y}_j^2 $$ Using these definitions and notation, the raw score for node $i$ is $$f_i = \tilde{k}_{i} \cdot \tilde{y}$$ Its null score using a permuted input score is the random variable $$ f_i^{null} = \tilde{k}_{i} \cdot X $$ where $X$ is the random variable that arises from permuting $\tilde{y}$. In order to compute the z-scores, the expected value and variance of $f_i^{null}$ are needed. Starting with its expected value, $$ E_X(f_i^{null}) = E_X(\tilde{k}_{i} \cdot X) = \tilde{k}_{i}\cdot E_X(X) = \tilde{k}_i \cdot\frac{\sum_{j = 1}^{n_{lab}} \tilde{y}_j}{n_{lab}}\cdot\mathds{1}= \frac{Sk_i^I \cdot Sy^I}{n_{lab}} $$ where $\mathds{1}$ is the $n_{lab}$-th dimensional column vector full of ones. Regarding its variance, $$ V_X(f_i^{null}) = V_X(\tilde{k}_{i} \cdot X) = \tilde{k}_{i}\cdot V_X(X)\cdot \tilde{k}_{i}^T $$ The covariance of $X$ can be written as $$ V_X(X) = \left[ \frac{\sum_{j = 1}^{n_{lab}} \tilde{y}_j^2}{n_{lab}} - \left(\frac{\sum_{j = 1}^{n_{lab}} \tilde{y}_j}{n_{lab}}\right)^2 \right] \left[ \frac{n_{lab}}{n_{lab} - 1}Id - \frac{1}{n_{lab} - 1}\mathds{1}\mathds{1}^T \right]$$ being $Id$ the $n_{lab} \times n_{lab}$ identity matrix. Operating, the variance of $f_i^{null}$ can be finally computed as $$ V_X(f_i^{null}) = \frac{1}{(n_{lab} - 1)n_{lab}^2} \left[n_{lab}Sy^{II} - (Sy^I)^2 \right] \left[n_{lab}Sk_i^{II} - (Sk_i^I)^2 \right]$$ The z-score for node $i$ is, in terms of the new notation: $$f_{z,i} = \frac{f_{i} - E_X(f_{i}^{null})} {\sqrt{V_X(f_{i}^{null})}}$$ This closes the \Rcode{z} scoring option, which clearly does not require any actual permutations but normalises through the theoretical first and second statistical moments of the null distribution of the diffusion scores. Finally, the authors in \cite{mosca} also suggest a combination of a classical score with an statistically normalised one. Available as \Rcode{ber\_p}, the score of node $i$ is defined as $$f_{ber_p, i} = -\log_{10}(p_i)\cdot f_{raw, i}$$ This approach corrects the original diffusion scores by the effect of the network, in order to mitigate the effect of structures like hubs. \subsubsection{Quantitative inputs} In its current release, \Rpackage{diffuStats} also accepts quantitative labels as input in the following scores: \Rcode{raw}, \Rcode{ber\_s}, \Rcode{z}, \Rcode{ber\_p} and \Rcode{mc}. The scores \Rcode{ml} and \Rcode{gm} are naturally excluded from non-binary inputs due to their definitions. Currently \Rcode{mc} scores (and therefore \Rcode{ber\_p} scores as well) accept quantitative inputs that are treated as sparse. Dense continuous inputs might take more time to compute, but this will be extended in future versions. Beware that quantitative inputs should be meaningful and one-tailed (monotonic) - the nature of diffusion processes involves averaging and strong effects with opposing signs cancel out instead of adding up. \subsection{Implementation, functions and classes} The package \Rpackage{diffuStats} is mainly implemented in R \cite{r}, but takes advantage of existing classes in \Rpackage{igraph} \cite{igraph} and basic data types, thus not introducing any new data structure - this minimises the learning effort by the final user. Inputs and outputs are conceived to require minimal formatting effort in the whole analysis. The computationally intense stochastic part of the permutation analysis is implemented in C++ through the packages \Rpackage{Rcpp} \cite{rcpp}, \Rpackage{RcppArmadillo} \cite{rcpparmadillo} and parallelised through \Rpackage{RcppParallel} \cite{rcppparallel}. Package \Rpackage{diffuStats} also contains documented functions and unit testing with small cases to spot potential bugs. Two vignettes facilitate the user experience: an introductory vignette showing the basic usage of the functions on a synthetic example and this vignette, which further describes the contents of the package and shows an application to real data. \begin{figure}[!tpb]%figure1 {\includegraphics{diffustats_fig_overview.eps}} \caption{Overview of the main package functions.} \label{fig:overview} \end{figure} A diagram containing the main functions in the R package \Rpackage{diffuStats} can be found in (Fig. \ref{fig:overview}). The main funcion is \Rfunction{diffuse}, a wrapper for computing diffusion scores from several categories at once, stemming from possibly different observed backgrounds. Function \Rfunction{diffuse} makes use of the deterministic \Rfunction{diffuse\_raw} or the stochastic \Rfunction{diffuse\_mc} implementations and combines them to give the desired scores for all the nodes in each of the observed backgrounds. The second wrapper \Rfunction{perf} compares the result of the diffusion scores to target validation scores. Validation scores might include only part of the nodes of the network and these nodes can be background-specific. Regarding memory and processing power requirements, the analysis of networks with more than 10,000 nodes might need additional RAM memory and processing power. The main reason is the manipulation of dense graph kernel matrices that scale quadratically with the network order. To give a reference, a dense matrix of double-precision real numbers with 10,000 rows and columns uses roughly 800MB of memory. Computing a graph kernel on a large network can require -depending on the kernel- matrix diagonalisation, matrix products and matrix inversion operations that are likely to use a considerable amount of memory and time. \subsection{Limitations} The kernel framework is known to scale poorly with the number of nodes of the network when the kernel is explicitly computed and dense. Therefore, \Rpackage{diffuStats} is best suited for manipulating biological networks of a medium size - few tenths of thousands of nodes. Protein-protein interaction networks can have around 20,000 nodes, which is also the limit of the capabilities \Rpackage{diffuStats}, as it is now, in a standard workstation with 16GB of physical memory. In particular, the explicit kernel computation is a demanding step, although it only has to be performed once with a given network. Figure \ref{fig:profiling} contains a profiling on the kernel computation using a Compaq q8100 workstation (intel core i5 650 @3.20GHz, 16GB physical memory). This should give an approximation of what to expect in terms of time and memory consumption. The same figure contains the memory usage of the kernel matrix per se. \begin{figure}[!tpb]%figure1 {\includegraphics{diffustats_fig_profiling.eps}} \caption{Profiling of the computation of the regularised Laplacian kernel on a synthetic n-th order network. Undirected networks are generated through \Rfunction{barabasi.game} in \CRANpkg{igraph} using default parameters and \Rfunarg{m = 6} (each node adds 6 edges).} \label{fig:profiling} \end{figure} On the other hand, the parallel implementation of the stochastic permutation analysis is another demanding task. It has been optimised assuming that the number of positives is usually low. Lists of scores with a very high amount of positives might slow down the permutation analysis, but not the parametric \Rcode{z}. \section{Getting started} This vignette contains a classical example of label propagation on a biological network. The core tools for this analysis are the \Rpackage{igraph} R package \cite{igraph} and the \Rpackage{diffuStats} package, whereas \Rpackage{ggplot2} \cite{ggplot2} is a convenient tool to plot the results. The data for this example is the \Rcode{yeast} interactome with functional annotations, as found in the data package \Rcode{igraphdata} \cite{igraphdata}. <<01_imports>>= # Core library(igraph) library(diffuStats) # Plotting library(ggplot2) library(ggsci) # Data library(igraphdata) data(yeast) set.seed(1) @ \subsection{Data description} A summary of the network object can be obtained by just showing the object: <<02_data1>>= summary(yeast) @ For this analysis, only the largest connected component of this graph will be used, although the algorithms can handle graphs with several connected components. <<02_data2>>= yeast <- diffuStats::largest_cc(yeast) @ This yields to a graph with \Sexpr{format(vcount(yeast))} nodes and \Sexpr{format(ecount(yeast))} edges. There are several attributes that can be of interest. First of all, the \Rcode{name} of the protein nodes: <<02_data3>>= head(V(yeast)$name) @ Furthermore, the corresponding aliases and complete names can be found in \Rcode{Description} <<02_data4>>= head(V(yeast)$Description) @ The labels to perform network propagation are MIPS categories \cite{mips}, which provide means to classify proteins regarding their function. These functions are coded as characters in the \Rcode{yeast} object, in the node attribute \Rcode{Class} <<02_data5>>= table_classes <- table(V(yeast)$Class, useNA = "always") table_classes @ The graph attribute \Rcode{Classes} maps these abbreviations to the actual category: <<02_data6>>= head(yeast$Classes) @ Finally, the graph edges have a \Rcode{Confidence} attribute that assesses the amount of evidence supporting the interaction. All the edges will be kept in this analysis, but different confidences can be weighted to favour diffusion in high confidence edges. <<02_data7>>= table(E(yeast)$Confidence) @ More on the yeast object can be found through \Rcode{?yeast}. \subsection{First analysis: protein ranking} In this first case, the diffusion scores will be applied to the prediction of a single protein function. Let's assume that 50\% of the labelled proteins in the graph as \Rcode{transport and sensing} (category \Rcode{A}) are actually unlabelled. Now, using the labels of the known positive and negative examples for \Rcode{transport and sensing}, can we correctly label the remaining 50\%? First of all, the list of known and unknown positives is generated. The function \Rfunction{diffuse} uses (row)names in the input scores so that unlabelled nodes are accounted as so. <<03_datasplit>>= perc <- .5 # Transport and sensing is class A nodes_A <- V(yeast)[Class %in% "A"]$name nodes_unlabelled <- V(yeast)[Class %in% c(NA, "U")]$name nodes_notA <- setdiff(V(yeast)$name, c(nodes_A, nodes_unlabelled)) # Known labels known_A <- sample(nodes_A, perc*length(nodes_A)) known_notA <- sample(nodes_notA, perc*length(nodes_notA)) known <- c(known_A, known_notA) # Unknown target nodes target_A <- setdiff(nodes_A, known_A) target_notA <- setdiff(nodes_notA, known_notA) target <- c(target_A, target_notA) target_id <- V(yeast)$name %in% target # True scores scores_true <- V(yeast)$Class %in% "A" @ Now that the input is ready, the diffusion algorithm can be applied to rank all the proteins. As a first approach, the vanilla diffusion scores will be computed through the \Rfunarg{raw} method and the default regularised Laplacian kernel, which is calculated on the fly. <<03_diffuse>>= # Vector of scores scores_A <- setNames((known %in% known_A)*1, known) # Diffusion diff <- diffuStats::diffuse( yeast, scores = scores_A, method = "raw" ) @ Diffusion scores are ready and in the same format they were introduced: <<03_diff1>>= head(diff) @ Now, the scores obtained by the proteins actually belonging to \Rcode{transport and sensing} can be compared to proteins with other labels. <<03_diff2, fig=TRUE>>= # Compare scores df_plot <- data.frame( Protein = V(yeast)$name, Class = ifelse(scores_true, "Transport and sensing", "Other"), DiffusionScore = diff, Target = target_id, Method = "raw", stringsAsFactors = FALSE ) ggplot(subset(df_plot, Target), aes(x = Class, y = DiffusionScore)) + geom_boxplot(aes(fill = Method)) + theme_bw() + scale_y_log10() + xlab("Protein class") + ylab("Diffusion score") + ggtitle("Target proteins in 'transport and sensing'") @ The last plot justifies the usefulness of label propagation, as proteins in \Rcode{transport and sensing} obtain higher diffusion scores than the rest. The network analysis can be deepened by examining, for instance, the subnetwork containing the proteins with the top 30 diffusion scores, highlighting with squares the ones that were positive labels in the input. Notice the small clusters of proteins: <<03_diff3, fig=TRUE>>= # Top scores subnetwork vertex_ids <- head(order(df_plot$DiffusionScore, decreasing = TRUE), 30) yeast_top <- igraph::induced.subgraph(yeast, vertex_ids) # Overlay desired properties # use tkplot for interactive plotting igraph::plot.igraph( yeast_top, vertex.color = diffuStats::scores2colours( df_plot$DiffusionScore[vertex_ids]), vertex.shape = diffuStats::scores2shapes( df_plot$Protein[vertex_ids] %in% known_A), vertex.label.color = "gray10", main = "Top 30 proteins from diffusion scores" ) @ \subsection{Second example: comparing scores with single protein ranking} The proposed diffusion scores can be easily applied and compared. The regularised Laplacian kernel will be used to compute all the implemented scores for the target nodes in \Rcode{transport and sensing}. <<04_kernel>>= K_rl <- diffuStats::regularisedLaplacianKernel(yeast) @ Functions \Rfunction{diffuse} and \Rfunction{perf} do accept, however, an \Rcode{igraph} object as well, and compute the kernel automatically. For medium networks (10,000 nodes or more) the kernel computation can be computationally expensive in memory and time, so precomputing it avoids unnecessary recalculations. The diffusion scores can be computed over a list of methods or sets of parameters. This can be achieved with instructions like \Rfunction{lapply}, but \Rpackage{diffuStats} contains a wrapper to facilitate this task. The function \Rfunction{diffuse\_grid} takes the specified combinations of parameters -which can include the scoring method as well- and computes the diffusion scores for each one. The results are concatenated in a data frame that can be easily plotted: <<04_diff>>= list_methods <- c("raw", "ml", "gm", "ber_s", "ber_p", "mc", "z") df_diff <- diffuse_grid( K = K_rl, scores = scores_A, grid_param = expand.grid(method = list_methods), n.perm = 1000 ) df_diff$transport <- ifelse( df_diff$node_id %in% nodes_A, "Transport and sensing", "Other" ) @ The results can be directly plotted: <<04_plot, fig=TRUE>>= df_plot <- subset(df_diff, node_id %in% target) ggplot(df_plot, aes(x = transport, y = node_score)) + geom_boxplot(aes(fill = method)) + scale_fill_npg() + theme_bw() + theme(axis.text.x = element_text( angle = 45, vjust = 1, hjust = 1)) + facet_wrap( ~ method, nrow = 1, scales = "free") + xlab("Protein class") + ylab("Diffusion score") + ggtitle("Target proteins scores in 'transport and sensing'") @ As expected, all the diffusion scores qualitatively show differences between positive and negative labels, but the quality of class separation will generally depend on the dataset and scoring method. \subsection{Third example: benchmarking scores with multiple protein functions} The package \Rpackage{diffuStats} is able to perform several screenings at once. To show its usefulness, we will generalise the procedure in the last section but screening all the categories in the yeast graph. First of all, the input data must meet an adequate format - a straightforward approach is to populate a matrix with the input labels (one category per column). <<05_scores>>= # All classes except NA and unlabelled names_classes <- setdiff(names(table_classes), c("U", NA)) # matrix format mat_classes <- sapply( names_classes, function(class) { V(yeast)$Class %in% class } )*1 rownames(mat_classes) <- V(yeast)$name colnames(mat_classes) <- names_classes @ The former 50\% known / 50\% unknown approach will be kept with the same split, although not all the \Sexpr{ncol(mat_classes)} categories will be totally balanced in the splits now. All the methods will be compared using the area under the ROC curve (AUROC) as a performance index. Please note that \Rpackage{diffuStats} is equipped with basic performance measures for rankers: the AUROC, the area under the Precision-Recall curve, or AUPRC, and their partial versions. These are available through the helper function \Rfunction{metric\_fun} and can be passed in list format to \Rfunction{perf}. These measures are based on the \CRANpkg{precrec} R package - further detail can be found in the original manuscript \cite{precrec}. <<05_perf>>= list_methods <- c("raw", "ml", "gm", "ber_s", "ber_p", "mc", "z") df_methods <- perf( K = K_rl, scores = mat_classes[known, ], validation = mat_classes[target, ], grid_param = expand.grid( method = list_methods, stringsAsFactors = FALSE), n.perm = 1000 ) @ This allows plotting of the AUCs over the categories for each method in one step: <<05_plot, fig=TRUE>>= ggplot(df_methods, aes(x = method, y = auc)) + geom_boxplot(aes(fill = method)) + scale_fill_npg() + theme_bw() + xlab("Method") + ylab("Area under the curve") + ggtitle("Methods performance in all categories") @ <<05_plotsave, echo=FALSE, eval=FALSE, include=FALSE>>= # Save plot for manuscript # # library(extrafont) setEPS() postscript("data-raw/BoxplotAUC.eps", width = 3.38, height = 3.38) ggplot(df_methods, aes(x = method, y = auc)) + # geom_boxplot(aes(fill = method), outlier.size = 1) + geom_boxplot(outlier.size = 1) + scale_fill_npg(name = "Method") + theme_bw() + xlab("Method") + ylab("Area under the curve") + # coord_fixed() + ggtitle("Benchmark of protein categories") + theme( text = element_text(size = 10), axis.text.x = element_text( size = 8, angle = 45, vjust = 1, hjust = 1, color = "black"), axis.text.y = element_text(size = 8, color = "black"), plot.title = element_text(size = 12, hjust = .5)) dev.off() # Build table for manuscript df_perf <- reshape2::acast(df_methods, Column~method, value.var = "auc") df_test <- perf_wilcox( df_perf, digits_p = 1, adjust = function(p) p.adjust(p, method = "fdr"), scientific = FALSE) xtable::xtable(df_test) @ Scaling up the analysis can be useful for assessing how adequate a diffusion score is in the dataset of interest. These results suggest that, for the current \Rcode{yeast} interactome and protein functions, the best priorisations are those obtained through a statistical normalisation, which might motivate its usage in other biological networks. The user can also statistically compare the performance metrics through the function \Rfunction{perf\_wilcox}. This generates a table with (i) the estimates on the differences on performance between the methods in rows and columns, with their confidence intervals and (ii) their associated p-value (Wilcoxon test). Positive and negative estimates respectively favour the method in the row and the column. <<05_test, results=tex>>= # Format the data df_perf <- reshape2::acast(df_methods, Column~method, value.var = "auc") # Compute the comparison matrix df_test <- perf_wilcox( df_perf, digits_p = 1, adjust = function(p) p.adjust(p, method = "fdr"), scientific = FALSE) knitr::kable(df_test, format = "latex") @ \section{Conclusions} The \Rpackage{diffuStats} package is a new computational tool to compute and compare single-network diffusion scores that are object of active research in several bioinformatics areas. It is an effort to gather a collection of settings in the diffusion process like the graph kernel, the label codification and the choice of a statistical normalisation. The \Rpackage{diffuStats} package will help the end user in choosing and computing the best performing diffusion scores in the application of interest. \section{Funding} This work was supported by the Spanish Ministry of Economy and Competitiveness (MINECO) [TEC2014-60337-R to A.P.] and the National Institutes of Health (NIH) [R01GM104400 to W.T.]. AP. and S.P. thank for funding the Spanish Biomedical Research Centre in Diabetes and Associated Metabolic Disorders (CIBERDEM) and the Networking Biomedical Research Centre in the subject area of Bioengineering, Biomaterials and Nanomedicine (CIBER-BBN), both initiatives of Instituto de Investigaci\'on Carlos III (ISCIII). SP. thanks the AGAUR FI-scholarship programme. \newpage %\bibliographystyle{plain} % \bibliographystyle{apalike} \bibliography{bibliography} \newpage \appendix \section{Session info} Here is the output of \Rfunction{sessionInfo()} on the system that compiled this vignette: <>= toLatex(sessionInfo()) @ \end{document}