% NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is likely to be overwritten. %% Modified using Sweave2knitr(). %\VignetteIndexEntry{iNETgrate: Integrating gene expression and DNA methylation data in a gene network} %\VignetteDepends{iNETgrate} %\VignetteKeywords{Gene expression, DNA methylation, Network, Biomedical Informatics, Systems Biology} %\VignettePackage{iNETgrate} %\VignetteEngine{knitr::knitr} %% Commands: \newcommand{\inet}{\Biocpkg{iNETgrate} } \newcommand{\W}{\mathcal{W}} \newcommand{\E}{\mathcal{E}} %% \DeclareMathOperator{\cor}{cor} \documentclass[12pt]{article} %% Packages: \usepackage{commath} %% To use \abs %% Narrower margines: %%\usepackage[margin=0in]{geometry} %% Did not work, the margines are set in BiocStyle::latex. <>= BiocStyle::latex(width=90) @ \bioctitle{\Biocpkg{iNETgrate}: Integrating gene expression and DNA methylation data in a gene network} \author{Isha Mehta and Habil Zare} \date{Modified: 5 December, 2021. Compiled: \today} \begin{document} \maketitle \tableofcontents \section{Introduction} The biological processes that take place within a cell often require intricate coordination among \emph{multiple} genes and proteins %%, not just one gene or a single protein \cite{cho2012network}. Thus, the traditional approaches that study associations between individual genes and a %%specific disease or phenotype cannot provide a full understanding of these complex biological phenomena \cite{halsey2015fickle, choi2009statistical}. These approaches fail to pinpoint the biological mechanisms of complicated conditions such as cancer because complex diseases are usually caused by collaboration of more than a few genes. In particular, the detection of {\color{red} \ul{the subtle but coordinated and consistent}} changes in the expression levels of a set of functionally related %%[space] interacting genes is often more important and informative %% for understanding the underlying biology %%and pathology than detecting dramatic changes in the expression levels of a few individual genes \cite{cho2012network}. \ul{Network analysis can detect subtle but consistent %% and coordinated changes in a set of interacting and functionally related genes} \cite{cho2012network}. The more data used to build the network, the better the network obtained because the resulting network models the interaction between genes more accurately. However, integrating different types of omics data into a single network can be challenging. For example, DNA methylation is usually measured on hundreds of thousands of loci, and there is no one--to--one correspondence between these loci and genes, which are the nodes of the coexpression network. The \inet package is specifically developed to fuse (i.e., effectively integrate) DNA methylation data with gene expression data in a single network \cite{samimi2021dna}. The nodes (vertices) of the network are genes and the connection (edge) between two nodes is weighted based on both gene expression and DNA methylation data. The \inet package is a significant enhancement over our \Biocpkg{Pigengene} approach \cite{foroushani2016large} because it effectively \ul{i}nte\ul{grate}s DNA methylation data into the gene \ul{net}work \cite{samimi2021dna}. A practical exemplar application of \inet is in prognostication of acute myeloid leukemia (AML), which is a cancer of the blood and bone marrow \cite{samimi2021dna}. Most AML patients are classified as low--, intermediate--, and high--risk. There are urgent and effective treatments for high-- and low--risk patients, such as bone marrow transplant and chemotherapy, respectively. The intermediate--risk group is a mixture of high-- and low--risk patients but, their actual risk level is not detectable based on current methods used in clinics. Since the survival rate of these patients cannot be predicted precisely, it is not possible to choose the most efficient treatment for them. Therefore, there is a demand to find a way to reclassify these patients into either high-- or low--risk groups based on their clinical and molecular omics data. In most methods, gene expression is the primary data source to find risk groups. Some research groups used mutation and molecular abnormalities along with gene expression and found risk groups of more patients. Here, we illustrate using gene expression and DNA methylation to reclassify intermediate--risk patients. \section{How to run \inet?} \subsection{Installation} \inet is an \R{} package that can be downloaded and installed from \Bioconductor{} by using following commands in R: \newline\newline \texttt{if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager")} \newline \texttt{BiocManager::install("iNETgrate")} %% \texttt{BiocManager::install("habilzare/iNETgrate")} installs the devel version. %%\iffalse Alternatively, if the built package is already available, it can be installed by the following command in Unix: \newline $\\$ \texttt{R CMD INSTALL iNETgrate\_x.y.z.tar.gz} \newline $\\$ where x.y.z determines the version. The second approach requires all the dependencies be installed manually, therefore, the first approach is preferred. %%\fi \subsection{A quick overview} \inet constructs a weighted gene network where the weights of the edges is calculated by a combined correlation scores of gene expression and DNA methylation beta values, i.e., %% \begin{equation} \W(g_i,g_j)=(1-\mu)(\abs{cor_E(g_i,g_j)}) + \mu(\abs{cor_M(g_i,g_j)}), \label{eq:sim} \end{equation} %% \noindent where $g_i$ and $g_j$ is a pair of genes for which edge weight is to be calculated, $\mu$ is a value between 0 and 1 given by the user, $\abs{cor_E(g_i, g_j)}$ is the absolute correlation between expression of the gene pair, and similarly, $\abs{cor_M(g_i, g_j)}$ is the absolute correlation between beta values of the gene pair. \inet then identifies gene modules (i.e., clusters of coexpressed and co-methylated genes), computes an eigengene for each module, and uses these biological signatures as features for classifying patients into risk categories. The main function is \Rfunction{iNETgrate} which requires a gene expression profile, a DNA methylation beta value profile and the corresponding conditions (types). Individual functions are also provided to facilitate running the pipeline in a customized way. The inferred biological signatures (eigengenes) are useful for further supervised or unsupervised analyses. \subsection{What is an eigengene?} In most functions of this package, eigenegenes are computed or used as robust biological signatures. Briefly, each eigengene $\E$ is a weighted average of the expression of all genes in a given set of $n$ genes (also known as a gene module or a cluster of genes). %%\begin{ceqn} \begin{equation} \E=\alpha_1 g_1 + \alpha_2 g_2 + \dots + \alpha_i g_i \label{eq:eigen} \end{equation} %%\end{ceqn} \noindent where $\alpha_i$ represents the weight corresponding to gene $g_i$. The weights are adjusted using PCA in a way that the explained variance is maximized. This guarantees that the loss in the biological information in minimized. \subsection{What is an eigenloci?} To compute an effective DNA methylation value at gene level, we calculate a weighted average of beta values per gene. Briefly, each eigenloci is a weighted average of the beta values of the probes (loci) corresponding to the gene of interest. If the number of loci for a gene is less than six, all of them contribute to the eigenloci with some weight, which is determined automatically. Otherwise, we identify and use a subset of highly correlating loci (i.e., "core") to compute the eigenloci. A gene expression profile and an eigenloci profile are the two key inputs to the \Rfunction{makeNetwork} function. \subsection{A toy example} For a quick start, we demonstrate the application of \inet pipeline on a leukemia dataset below \cite{samimi2021dna}. The first step is to load the package and data in~\R{}: <>= library(iNETgrate) set.seed(1) @ \iffalse You can download the AML dataset from TCGA using the following function: You can conveniently download the AML dataset from TCGA using the \Biocpkg{iNETgrate}::\Rfunction{downloaData} function. <>= tcgaAML <- downloaData(dataProject="TCGA-LAML", savePath=".") @ \fi In this vignette, we use the toy data that is included in the package. Please note that the provided data in the package is sub--sampled for a quick demonstration. For real applications, the expression of thousands of genes should be used. Analyzing such input with \inet can take a few hours and may require 5-10 GB of memory to save the results. The following commands run \inet pipeline on the toy data. First, we load the toy data. <>= data(toyRawAml) class(toyRawAml) names(toyRawAml) @ Then, we define the clinical settings and run \inet. %% Set to FALSE for a quicker build and check. <>= clinSettings <- c("patientIDCol"="bcr_patient_barcode", "eventCol"="vital_status", "timeCol"="days_to_last_followup", "riskCatCol"="acute_myeloid_leukemia_calgb_cytogenetics_risk_category", "riskFactorCol"="cytogenetic_abnormalities", "event"="Dead", "riskHigh"="Poor", "riskLow"="Favorable") print(clinSettings) @ See the documentation of the \Rcode{iNETgrate} function for a toy example like the following: <>= inetgrator <- iNETgrate(Data=toyRawAml, clinSettings=clinSettings, saveDir="iNETgrateOut", mus=0.6) @ Results and figures are saved in the "iNETgrateOut" directory under the current directory. For more advanced applications, the user is encouraged to analyze the data step-by-step and customize the individual functions such as \Rfunction{makeNetwork} and \Rfunction{computEigengenes}. \subsection{Running the \inet pipeline step by step} If you are curious about the specific steps in the \inet pipeline, or you need to run some steps with different settings, you can follow the steps below. With the default values, the results will be similar to the output of the \Rcode{iNETgrate} function. \subsubsection{Setting paths} Before starting the analysis, we need to create some directory to save the plots and results. <>= resPath <- file.path(tempdir(), "iNETgrateRes") ## the result path message(paste("Results will be saved in:", resPath, sep="\n")) netPath <- file.path(resPath, "net") survivalPath <- file.path(netPath, "surv") dir.create(survivalPath, showWarnings=FALSE, recursive=TRUE) @ \subsubsection{Cleaning data} The first step of the \inet pipeline is to clean the input data using the \Rfunction{cleanAllData} function. Here we make sure that the gene expression and DNA methylation matrices have row and column names, and do not include too many \Rcode{NA} values. We impute missing values in the DNA methylation data if needed. Also, we clean and subset clinical data. <>= riskCatCol <- "acute_myeloid_leukemia_calgb_cytogenetics_risk_category" riskFactorCol <- "cytogenetic_abnormalities" cleanedToy <- cleanAllData(genExpr=toyRawAml$genExpr, genExprSampleInfo=toyRawAml$genExprSampleInfo, rawDnam=toyRawAml$rawDnam, savePath=resPath, clinical=toyRawAml$clinical, riskCatCol=riskCatCol, riskFactorCol=riskFactorCol, riskHigh="Poor", riskLow="Favorable", verbose=1) @ The output must be saved by the user if they intent to use the cleaned data later. The output is a list of \Rcode{4} components namely \Rcode{genExpr}, \Rcode{dnam}, \Rcode{locus2gene} and \Rcode{survival}. \Rcode{survival} is a subset of clinical data, where rows are patients and columns are vitality and survival related information. \Rcode{genExpr} is a matrix of gene expression profile, where rows are genes and columns are patient IDs with sample type attached in cases where multiple sample types exists. \Rcode{dnam} is a matrix of DNA methylation profile, where rows are loci and columns are patient IDs with sample type attached in cases where multiple sample types exists. DNA methylation data is processed using \Rfunction{preprocess.Dnam} for missing beta values, and removing some non-CpG loci or SNP enriched loci.Finally \Rcode{locus2gene} is a dataframe where rows are loci and columns provide information on the related gene. This is followed by filtering and omitting genes or loci that have too low correlation with survival data and vital status. The \Rfunction{filterLowCor} function computes these correlations and selects the desired data. We then find a combined set of genes that have some correlation with survival time or vital status based on expression or DNA methylation profiles using \Rfunction{computeUnion}. The above--mentioned filtering and combining steps can be performed using the function \Rfunction{electGenes}, which is a wrapper for \Rfunction{filterLowCor} and \Rfunction{computeUnion} functions. <>= data(toyCleanedAml) cleaned <- toyCleanedAml elected <- electGenes(genExpr=cleaned$genExpr, dnam=cleaned$dnam, survival=cleaned$survival, savePath=resPath, event="Dead", locus2gene=cleaned$locus2gene, doAlLoci=FALSE, verbose=1) @ The output of \Rfunction{electGenes} is a list of \Rcode{5} components including \Rcode{unionGenes}, which is a character vector of gene IDs that have some correlation with survival time or vital status based on expression or DNA methylation profiles. This union gene set is further used to makeNetwork and compute eigengenes in next steps. \subsubsection{Computing DNA methylation values at the gene level} Computing effective DNA methylation profile at gene level is necessary to construct a combined network where the nodes of the network are genes. Hence, we compute eigenloci described earlier. <>= patientLabel <- setNames(as.character(cleaned$survival$Risk1), nm=rownames(cleaned$survival)) inBoth <- intersect(colnames(cleaned$dnam), names(patientLabel)) computedEloci <- computEigenloci(dnam=cleaned$dnam[ ,inBoth], Labels=patientLabel[inBoth], geNames=elected$unionGenes, locus2gene=cleaned$locus2gene, plotPath=resPath, Label1="Low", Label2="High", verbose=1) @ \subsubsection{Making the combined network} Now, our data is ready to make a network using \Rfunction{makeNetwork}. Here, we use a \Rcode{mu} value to combine adjacency matrices from the expression and DNA methylation (i.e., \Rcode{eigenloci}) datasets into a single network of genes (Eq.~\ref{eq:sim}). In real applications, several values for \Rcode{mu} can be used in a numeric vector and then the best value will be determined in later steps of the analyzeSurvival. <>= eigenloci <- computedEloci$eigenloci madeNet <- makeNetwork(genExpr=cleaned$genExpr, eigenloci=eigenloci, geNames=elected$unionGenes, mus=0.6, doRemoveTOM=TRUE, outPath=netPath, verbose=1) print(madeNet$mu2modules[,1:5, drop=FALSE]) @ The modules for each value of the input \Rcode{mu} are identified and returned in \Robject{madeNet\$mu2modules}, which is a matrix with the \Rcode{mu} values on rows and genes on columns. The entries of the matrix are integer values determining the module number a gene belongs to. \subsubsection{Computing eigengenes} The next step is to compute a representative value (feature) for each identified module (i.e., an eigengene) based on gene expression and optionally, DNA methylation data. <>= e1 <- computEigengenes(genExpr=cleaned$genExpr, eigenloci=eigenloci, netPath=netPath, geNames=elected$unionGenes, Labels=patientLabel, Label1="Low", Label2="High", mus=c(0.6), combiningMu=NA, doIgnoreNas=TRUE, survival=cleaned$survival, event="Dead", verbose=1, mu2modules=madeNet$mu2modules) @ \subsubsection{Survival analysis} Finally, our Cox regression analysis identifies the module (gene cluster) or the combination of up to 3 modules that are together most useful for prognostication. The identified modules are used in an accelerated failure time (AFT) model to classify the patients into different risk categories. <>= s1 <- analyzeSurvival(survival=cleaned$survival, favRisk="High", subSet="Int", mus=0.6, netPath=netPath, outPath=survivalPath, xmax1=15, xmin1=0, verbose=1) @ The above survival plots are saved in \Rcode{survivalPath}. Each plot compares the KM curves among risk categories based on \inet analysis, or the input labels from the clinical data, as mentioned in plot titles. The first two plots include all cases and the last two plots include only a subset of cases as described in each plot title. The confusion matrix is useful to compare the input labels vs. the risk categorization done by \inet (i.e., the numbers on the diagonal show the agreement while other numbers show the discrepancy). \subsubsection{Inferring eigengenes in another dataset} If you want to infer the selected eigengenes in an independent dataset for validation, the \Rfunction{bestInetgrator} can be used to extract the weights in an organized list, which then can be used by the \Rfunction{inferEigengenes} function. <>= inetgrator <- bestInetgrator(bestPvalues=s1$bestPvalues, usefuLoci=computedEloci$usefuLoci, lociPigen=computedEloci$lociPigen, netPath=netPath) @ \subsubsection{Pathway analysis} Pathwyay overrepresentation analysis of the modules selected by our Cox analysis can be done using the \Biocpkg{Pigengene}::\Rfunction{get.enriched.pw} function. For each selected module, a folder will be created, which includes an Excel file listing the pathways that are overrepresented by the genes in the corresponding module. <>= selectedModules <- names(inetgrator$modules) geneList <- list() for(m1 in selectedModules) geneList[[m1]] <- names(inetgrator$modules[[m1]]$genes) library(org.Hs.eg.db) ## Needed for human genes. got <- Pigengene::get.enriched.pw(geneList, idType="SYMBOL", pathwayDb="KEGG", outPath=survivalPath) @ %%<>= %%@ \subsection{Citation} <>= citation("iNETgrate") @ \section{Session Information} The output of \Rfunction{sessionInfo} on the system that compiled this document is as follows: <>= toLatex(sessionInfo()) @ \nocite{*} \bibliography{iNETgrate} \end{document}