% NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. %% Modified using Sweave2knitr(). %\VignetteIndexEntry{Pigengene: Computing and using eigengenes} %\VignetteDepends{Pigengene} %\VignetteKeywords{Gene expression, Network, Biomedical Informatics, Systems Biology} %\VignettePackage{Pigengene} %\VignetteEngine{knitr::knitr} %% Commands: \newcommand{\pipa}{\Biocpkg{Pigengene} } \newcommand{\E}{\mathcal{E}} \documentclass[12pt]{article} <>= BiocStyle::latex() @ \bioctitle{\Biocpkg{Pigengene}: Computing and using eigengenes} \author{Habil Zare} \date{Modified: 26 April, 2016. Compiled: \today} \begin{document} \maketitle \tableofcontents \section{Introduction} Gene expression profiling technologies such as microarray or RNA Seq provide valuable datasets, however, inferring biological information from these data remains cumbersome. \pipa address two challenges: \begin{enumerate} \item \textbf{Curse of dimensionality:} The number of features in an expression profile is usually very high. For instance, there are about 20,000 genes in human. In contrast, the number of samples (patients) is often very limited in practice, and may not exceed a few hundreds. Yeung et al. have shown that standard data reduction methods such as principal component analysis (PCA) are not appropriate to directly apply on gene expression data \cite{yeung2001principal}. Instead, \pipa addresses this challenge by applying PCA on gene modules. \item \textbf{Normalization:} Data produced using different technologies, or in different labs, are not easily comparable. \pipa identifies {\em eigengenes}, informative biological signatures that are robust with respect to the profiling platform. For instance, it can identify the signatures (compute the eigengenes) on microarray data, and infer them on biologically-related RNA Seq data. The resulting signatures are directly comparable even if the set of samples (patients) are independent and disjoint in the two analyzed datasets. \end{enumerate} \section{How to run \pipa?} \subsection{Installation} \pipa is an \R{} package that can be downloaded and installed from \Bioconductor{} by the followig commands in R: \newline\newline \texttt{if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager")} \newline \texttt{BiocManager::install("Pigengene")} Alternatively, if the source code is already available, the package can be installed by the following command in Linux: \newline $\\$ \texttt{R CMD INSTALL Pigengene\_x.y.z.tar.gz} \newline $\\$ where x.y.z. determines the version. The second approach requires all the dependencies be installed manually, therefore, the first approach is preferred. \subsection{A quick overview} \pipa identifies gene modules (clusters), computes an eigengene for each module, and uses these biological signatures as features for classification. The main function is \Rfunction{one.step.pigengene} which requires a gene expression profile and the corresponding conditions (types). Individual functions are also provided to facilitate running the pipeline in a customized way. The inferred biological signatures (eigengenes) are useful for supervised or unsupervised analyses. \subsection{What is an eigengene?} In most functions of this package, eigenegenes are computed or used as robust biological signatures. Briefly, each eigengene $\E$ is a weighted average of the expression of all genes in a given set of $n$ genes (also known as a gene module or a cluster of genes). %%\begin{ceqn} \begin{equation} \E=\alpha_1 g_1 + \alpha_2 g_2+ \dots + \alpha_n g_n, \label{eq:eigen} \end{equation} %%\end{ceqn} \noindent where $\alpha_i$ represents the weight corresponding to gene $g_i$. The weights are adjusted in a way that the explained variance is maximized. This guarantees that the loss in the biological information in minimized. \subsection{A toy example} For a quick start, the application of \pipa pipeline on some leukemia dataset is demonstrated below \cite{mills2009microarray}. The first step is to load the package and data in \R{}: <>= library(Pigengene) data(aml) data(mds) d1 <- rbind(aml,mds) Labels <- c(rep("AML",nrow(aml)),rep("MDS",nrow(mds))) names(Labels) <- rownames(d1) Disease <- as.data.frame(Labels) h1 <- pheatmap.type(d1[,1:20],annRow=Disease,show_rownames=FALSE) @ Please note that the provided data in the package is sub-sampled for a quicker demonstration. For real applications, the expression of thousands of genes should be provided in order to the co-expression network analysis to be appropriate. It is common to first perform differential expression analysis, sort all the genes based on p-values, and use the top-third as the input \cite{zhang2013integrated}. Analyzing such input with \pipa can take a few hours and may require 5-10 GB of memory. The following command runs Pigengene pipeline on the toy data: %% <>= p1 <- one.step.pigengene(Data=d1,saveDir='pigengene', bnNum=0, verbose=1, seed=1, Labels=Labels, toCompact=FALSE, doHeat=FALSE) @ Results and figures are saved in \Rcode{pigengene} folder under the current directory. For more advanced applications, the user is encouraged to analyze the data step-by-step and customize the individual functions such as \Rfunction{compute.pigenegene} and \Rfunction{make.decision.tree}. In addition to the provided decision trees, the user can also take alternative approaches to perform classification, clustering, survival analysis, etc. {\em using eigengenes as robust biological signatures (informative features)}. Eigengenes and other useful objects can be retrieved from the output. For instance, \Rcode{c5treeRes} is a list containing the results of fitting decision trees to the eigengenes. As shown above, a couple of trees were fitted, one per a value for \Rcode{minPerLeaf}. The following command plots the tree corresponding to 34, i.e., it was fitted requiring the minimum number of samples per every leaf to be at least 34. %% <>= plot(p1$c5treeRes$c5Trees[["34"]]) @ The tree corresponding to other values are saved in \Rcode{pigengene} folder. Of note, is the \Robject{pigenegene} object that contains the matrix of inferred eigenegenes. Each row corresponds to a sample, and each column represents an eigengene. %% <>= dim(p1$pigengene$eigengenes) p1 <- pheatmap.type(p1$pigengene$eigengenes,annRow=Disease,show_rownames=FALSE) @ \subsection{Running the \pipa pipeline step by step} If you are curious about the specific steps in the \pipa pipeline, or you need to run some steps with different settings, you can follow the steps below. The results will be similar to the output of the \Rcode{one.step.pigengene} function. The first step is quality control to make sure that the matrices have row and column names, and do not include too many \Rcode{NA} values: <>= ## QC: checked <- check.pigengene.input(Data=d1, Labels=Labels) DataI <- checked$Data LabelsI <- checked$Labels @ We oversample the data such that the number of cases in each condition is almost balanced. <>= wData <- balance(Data=DataI, Labels=LabelsI, verbose=1)$balanced @ Weighted gene co-expression network analysis (WGCNA) \cite{langfelder2008wgcna} does not assume any cutoff (i.e., hard threshold) on the correlation values. Instead, it raises the correlation values to a power so that the correlation values that are close to zero diminish. This hyperparameter is called $\beta$, and can be estimated using as follows: <>= betaI <- calculate.beta(RsquaredCut=0.8, Data=wData, verbose=1)$power saveDir <- "steps" ## Results will be saved in this folder. dir.create(saveDir) @ Once we have an estimate for $\beta$, WGCNA can be done in one step using the following function to identify gene modules (i.e., clusters): <>= ## WGCNA wgRes <- wgcna.one.step(Data=wData, seed=1, power=betaI, saveDir=saveDir, verbose=1) @ The output of \Rcode{wgcna.one.step} is a list of objects including \Rcode{modules}, which is a numeric vector named with genes. Genes that map to the same numeric value are considered to be in the same module. We use this information to compute an eigengene for every module. <>= ## Eigengenes: pigengene <- compute.pigengene(Data=DataI, Labels=LabelsI, saveFile=file.path(saveDir, 'pigengene.RData'), modules=wgRes$modules, verbose=1) class(pigengene) print(dim(pigengene$eigengenes)) ##This is the eigengenes matrix. print(pigengene$eigengenes[1:3,1:4]) @ The number of columns in the eigengene matrix is equal to the number of modules, and the number of rows is equal to the number of samples. Eigengenes can be used as robust biological signatures (i.e., features in machine learning terms) for classification, clustering, exploratory analysis, etc. For example, we can use them as random variables to fit a Bayesian network to data \cite{agrahari2018applications}. <>= ## Learning the BN structure: library(bnlearn) learnt <- learn.bn(pigengene=pigengene, bnPath=file.path(saveDir, "bn"), bnNum=10, ## In real applications, at least 100-1000. seed=1, verbose=1) BN <- learnt$consensus1$BN @ The \Rcode{learn.bn} function has many arguments. See the corresponding documentation and publication \cite{agrahari2018applications} for technical details. Because usually thousands of individual networks are needed, it is wise to train many models in parallel on a cluster, which can be done with appropriate settings for the \Rcode{learn.bn} input arguments. We can replot the consensus Bayesian network, which is already saved on disk, using the \Rcode{draw.bn} function. <>= drawn <- draw.bn(BN) ## By construction, the Disease node can have no parents. @ The \Rcode{learn.bn} function tries to find the best structure for the optimum Bayesian network. The conditional probability tables are yet to be inferred from the data. <>= ## Fit the parameters of the Bayesian network: fit <- bn.fit(x=BN, data=learnt$consensus1$Data, method="bayes", iss=10) ##where learnt$consensus1$Data is the discretized data matrix. ## The conditional probability table for one of the children of the Disease node: selectedFeatures <- children("Disease", x=BN) print(fit[[selectedFeatures[1]]]) @ The fitted Bayesian network can be used for predicting the labels (i.e., values of the Disease node). <>= l2 <- predict(object=fit, node="Disease", data=learnt$consensus1$Data, method="bayes-lw") table(LabelsI, l2) @ Eigengenes can also be used in predictive models simpler than a Bayesian network. For example, a decision tree can be fitted using the \Rcode{make.decision.tree} function \cite{foroushani2016large}. <>= ## Decision trees: treePath <- file.path(saveDir, 'C5Trees') dir.create(path=treePath) treeRes <- make.decision.tree(pigengene=pigengene, Data=DataI, selectedFeatures=selectedFeatures, saveDir=treePath, verbose=1, toCompact=FALSE) @ If \Rcode{selectedFeatures="All"}, the \Rcode{make.decision.tree} function automatically selects a ``minimal'' subset of eigengenes in order to prevent overfitting. \subsection{Citation} The methodology and an interesting application of \pipa on studying hematological malignancies is presented in the following reference \cite{foroushani2016large}. <>= citation("Pigengene") @ \section{Session Information} The output of \Rfunction{sessionInfo} on the system that compiled this document is as follows: <>= toLatex(sessionInfo()) @ \bibliography{pigengene} \end{document}