%\VignetteIndexEntry{Transcription Factor Association Rule Miner} %\VignetteDepends{} %\VignetteKeywords{} %\VignettePackage{TFARM} %\VignetteEngine{knitr::knitr} \documentclass{article} \usepackage[algoruled]{algorithm2e} \usepackage{color} \usepackage{amsfonts} \usepackage{graphicx} %\tracingstats=0 \definecolor{bronze}{rgb}{0.93, 0.53, 0.18} \definecolor{darkblue}{rgb}{0, 0.3, 0.6} \SetKw{template}{Template:} \SetKw{assign}{Assignment:} \SetKw{norm}{Normalization:} <>= BiocStyle::latex() @ \newcommand{\bam}{\texttt{BAM}} \title{\Biocpkg{TFARM}: Transcription Factor Associatio Rule Miner} \author{Liuba Nausicaa Martino \email{liuban.martino@gmail.com} \\Alice Parodi \\Gaia Ceddia \\Piercesare Secchi \\Stefano Campaner \\Marco Masseroli } %\date{Modified: February 24, 2016. Compiled: \today} \date{\today} \begin{document} <>= library(knitr) opts_chunk$set( concordance=FALSE ) @ <>= library(knitr) opts_chunk$set( background = "#C0C0C0" ) @ \maketitle \tableofcontents <>= options(width=60) @ <>= library(GenomicRanges) @ \bigskip <>= library(TFARM) @ \section{Introduction} Looking for association rules between transcription factors in genomic regions of interest can be useful to find direct or indirect interactions among regulatory factors of DNA transcription. However, the results provided by the most recent algorithms for the search of association rules \cite{borgelt2002induction} \cite{agrawal1993mining} alone are often not enough intelligible and summarized, since they only provide a list of association rules. A novel method is proposed for a subsequent mining of these results to evaluate the contribution of the items in each association rule. The \Biocpkg{TFARM} package allows to identify and extract the most relevant association rules with a given transcription factor target, and compute the \textit{Importance Index} of a transcription factor (or a combination of some of them) in the extracted rules. Such index is useful to associate a numerical value to the contribution of one or more transcription factors to the co-regulation with a given transcription factor target. \section{Dataset} Association rules are extracted from a GRanges object in which metadata columns identify transcription factors and genomic coordinates are represented in the left-hand-side of the GRanges, therefore each rows is a different genomic region. The element (i,j) (with j > 4) of the metadata section is equal to 0 if a binding site of transcription factor j is absent in region i, or to 1 (or any other value) if it is present. This dataset, called \textit{Indicator of presence matrix}, should not have rows with only 0 values, since we consider regions with no transcription factors as uninteresting regions. The first column of the genomic ranges contains the chromosome name of each region, the second column contains the genomic coordinates of each region (i.e., \textit{left} coordinate and \textit{right} coordinate are the ends of the region along the DNA, respectively the coordinates of the leftmost and rightmost bases of the region); the third column contains the region strand of each region (encoded as "+", "-", or "*" if unknown). The GRanges we consider here is obtained from the analysis of ENCODE ChIP-seq data: it concerns the localization of transcription factors binding sites and histone modifications in DNA, as well as RefSeq data (https://www.ncbi.nlm.nih.gov/refseq/); specifically here we focus on promotorial regions, but further analysis are possible on distal regions (or any DNA region in general, for which an annotation exists). Such data have been processed and extracted with GMQL (GenoMetric Query Language \cite{masseroli2015genometric}, http://www.bioinformatics.deib.polimi.it/GMQL/) queries. In this example, the dataset we consider is the Indicator of presence matrix of the binding sites of 25 transcription factors of the MCF-7 human breast adenocarcinoma cell line (i.e., all the transcription factors evaluated in ENCODE for this cell line), in the 2,944 promotorial regions of chromosome 1: \bigskip <<>>= # Load and visualize the dataset: data("MCF7_chr1") length(MCF7_chr1) MCF7_chr1 @ \section{Extraction of the most relevant associations} We define a relevant association for the prediction of the presence of transcription factor TFt in the considered genomic regions as an association rule of the type: \begin{center} \{TF1=1,TF2=1,TF3=1\} $\rightarrow$ \{TFt=1\} \end{center} which means that the presence of the transcription factors TF1, TF2 and TF3 implies the presence of transcription factor TFt. Every association rule is completely characterized by a set of three measures: support, confidence and lift: \begin{itemize} \item \textit{support}: \begin{equation} supp(X \rightarrow Y) = \frac {supp(X \cup Y)}{N} \label{supp_rule} \end{equation} where N is the number of transactions, X $\cup$ Y is a set of items and Supp(X $\cup$ Y) is the support of the itemset \{X,Y\}, defined as \begin{equation} supp(X) = \frac {|\{t_i \in N ; X \subseteq t_i\}|} {N} \label{supp} \end{equation} that is the proportion of transactions $t_i$ in the dataset which contains the itemset X. The support of an association rule measures the frequency of a rule in the dataset and varies in the interval [0,1]. \item \textit{confidence}: \begin{equation} conf(X \rightarrow Y) = \frac {supp(X \cup Y)} {supp(X)} \label{conf} \end{equation} It gives an estimate of the conditioned probability P(Y|X), that is the probability to find the right-hand-side (RHS) of the rule (i.e., the itemset Y) in a set of transactions, given that these transactions also contain the left-hand-side (LHS) of the rule (i.e., the itemset X). Therefore, it measures the realiability of the inference made by the rule X $\rightarrow Y$. The higher is the confidence of the rule, the higher is the probability to find the itemset Y in a transaction containing the itemset X. It varies in the interval [0,1]. \item \textit{lift}: \begin{equation} lift(X \rightarrow Y) = \frac {supp(X \cup Y)}{supp(X) supp(Y)} \label{lift} \end{equation}It measures the strength of the rule, and varies in the interval [0,$\infty$]. \end{itemize} \bigskip To extract a set of relevant associations the user has to specify: \begin{itemize} \item[1.] the presence/absence of the transcription factor target to be predicted, TFt; \item[2.] the minimal support threshold of the rules to be extracted; \item[3.] the minimal confidence threshold of the rules to be extracted. \end{itemize} Points 2. and 3. strongly depend on the dimensions of the dataset (i.e., number of rows - regions - and numer of columns - transcription factors), the presence of the transcription factor target in the considered regions, the number of relevant associations that the user wants to find. Usually, the confidence threshold is set greater than 0.5, since it measures the posterior probability to have TFt given the presence of the pattern in the left-hand-side of the rule (e.g., \{TF1=1,TF2=1,TF3=1\}). \medskip The function \texttt{rulesGen} in the \Biocpkg{TFARM} package extracts the association rules by calling the \texttt{apriori} function of the \Rpackage{arules} package \cite{arules1} \cite{arules2} \cite{arules3}. It takes in input: \begin{itemize} \item the GRanges object in which the Indicator of presence matrix is represented; \item the transcription factor target; \item the minimum support threshold of the rules to be extracted; \item the minimum confidence threshold of the rules to be extracted; \item the logical parameter \textit{type} that sets the type of left-hand-side of the rules to be extracted (i.e., containing only present transcription factors, or containing present and/or absent transcription factors). \end{itemize} The result of the \texttt{rulesGen} function is a data.frame containing: \begin{itemize} \item in the first column the left-hand-side of each extracted rule; \item in the second column the right-hand-side of each extracted rule (that is the presence/absence of the given transcription factor target); \item in the third column the support of each extracted rule; \item in the fourth column the confidence of each extracted rule; \item in the fifth column the lift of each extracted rule. \end{itemize} See \Rpackage{arulesViz} package for visualization tools of association rules. \bigskip <<>>= # Coming back to the example on the transcription factors of cell line MCF-7, # in the promotorial regions of chromosome 1. # Suppose that the user wants to find the most relevant association rules for # the prediction of the presence of the transcription factor TEAD4 and such # that the left-hand-side of the rules contains only present transcription # factors. This means extracting all the association rules with # right-hand-side equal to {TEAD4=1} setting the parameter type = TRUE; # the minimun support and minimum confidence thresholds are set, # as an example, to 0.005 and 0.62, respectively: r_TEAD4 <- rulesGen(MCF7_chr1, "TEAD4=1", 0.005, 0.62, TRUE) dim(r_TEAD4) head(r_TEAD4) @ \bigskip Once the set of the most relevant association rules (i.e., with support and confidence higher than the thresholds specified as parameters) is extracted, the user can look for \textit{candidate co-regulator transcription factors} with the transcription factor target (in the example TEAD4), which are the transcription factors present in the LHS of the extracted rules. This is provided by the function \texttt{presAbs} of the \Biocpkg{TFARM} package. The function \texttt{presAbs} takes in input: \begin{itemize} \item a string vector containing the names of all the transcription factors present in the Indicator of presence matrix; \item the set of the most relevant association rules previously extracted with \texttt{rulesGen}; \item a logical parameter, \textit{type}, which refers to the type of rules extracted with the \texttt{rulesGen} function. If \textit{type = TRUE}, the LHS of the rules can contain only items of the type TF=1, otherwise, if \textit{type = FALSE}, the LHS of the rules can contain both items of the type TF=1 and TF=0. \end{itemize} The \texttt{presAbs} function has two outputs: \begin{itemize} \item \textit{pres}, which is a string vector containing all the items present in the LHSs of the considered set of rules; \item \textit{abs}, which is a string vector containing all the items absent in the LHSs of the considered set of rules. \end{itemize} \bigskip << >>= # Transcription factors present in at least one of the regions in the # considered dataset: c <- names(mcols(MCF7_chr1)) c lc <- length(c) names(presAbs(c, r_TEAD4, TRUE)) # Transcription factors present in at least one of the association rules: p_TFs <- presAbs(c, r_TEAD4, TRUE)$pres p_TFs # Transcription factors absent in all the association rules: a <- presAbs(c[1:lc], r_TEAD4, TRUE)$abs a @ \bigskip All the transcription factors in p are said to be \textit{candidate co-regulator transcription factors} with the TFt in the most relevant associations extracted with \texttt{rulesGen}. \section{Importance Index of a transcription factor} The extraction of candidate co-regulator transcription factors with a given transcription factor target TFt can be useful to provide a global vision of the possible associations of the transcription factor target TFt. However, since the number of association rules and candidate co-regulators can be very high, this list does not provide an intelligible result, giving the lack of a measure of how much each transcription factor contributes to the existence of a certain complex of transcription factors. Let us consider for example the rule \begin{center} \{TF1=1,TF2=1,TF3=1\} $\rightarrow$ \{TFt=1\} \end{center}Just looking at it, the user could not tell if the presences of TF1, TF2 and TF3 equally contribute to the prediction of the presence of TFt. A solution to this problem can be given by removing, alternatively, TF1, TF2 and TF3 from the rule and evaluate: \begin{itemize} \item[1)] if the rule keeps on existing and being relevant \item[2)] how the three quality measures of support, confidence and lift of the rule change. \end{itemize} If a rule is not found as relevant after removing a transcription factor from its LHS, then the presence of that transcription factor in the pattern \{TF1=1,TF2=1,TF3=1\} is fundamental for the existence of the association rule \{TF1=1,TF2=1,TF3=1\} $\rightarrow$ \{TFt=1\}. Otherwise, if the rule keeps on existing as relevant, and its quality measures are similar to the ones of the rule initially considered, then the presence of that transcription factor in the pattern \{TF1=1,TF2=1,TF3=1\} is not fundamental for the existence of the association rule \{TF1=1,TF2=1,TF3=1\} $\rightarrow$ \{TFt=1\}. Let us fix an item I (i.e., a candidate co-regulator transcription factor with the transcription factor target) and extract the subset of all the most relevant associations containing I, named \{R$^I$\} (with J number of rules in \{R$^I$\}, J=$|$\{R$^I$\}$|$). Each element of \{R$^I$$_j$\}$_{j=1:J}$ is described by a set of quality measures of support, confidence and lift: \{$s^I_j$, $c^I_j$, $l^I_j$\}$_{j=1:J}$. \begin{table}[htbp]\footnotesize \begin{tabular}{|c|c|c|c|} \hline rule & support & confidence & lift \\ \hline $R^I_1$ & $s^I_1$ & $c^I_1$ & $l^I_1$ \\ ... & ... & ... & ... \\ $R^I_J$ & $s^I_J$ & $c^I_J$ & $l^I_J$ \\ \hline \end{tabular} \caption{\small Rules containing item I, and corrispondent measures of support, confidence and lift.} \label{m_rules_1} \end{table} Let then be $\{R^{I-}_j\}_{j=1:J}$ the set of rules obtained substituing the presence of the item I with its absence in each element of $\{R^{I}_j\}_{j=1:J}$. For example, if I is TF1 and $R^I_j$ is the rule \{TF1=1,TF2=1,TF3=1\} $\rightarrow$ \{TFt=1\}, with measures $\{s^I_j, c^I_j, l^I_j\}$, then $R^{I-}_j$ will be the rule $\{TF1=0,TF2=1,TF3=1\} \rightarrow \{TFt=1\}$ with measures $\{s^{I-}_j, c^{I-}_j, l^{I-}_j\}$. Therefore given the set $R^{I}_j$, $R^{-I}_j$ considers the same number of association rules for each item. \begin{table}[htbp]\footnotesize \begin{tabular}{|c|c|c|c|} \hline rule & support & confidence & lift \\ \hline $R^{I-}_1$ & $s^{I-}_1$ & $c^{I-}_1$ & $l^{I-}_1$ \\ ... & ... & ... & ... \\ $R^{I-}_J$ & $s^{I-}_J$ & $c^{I-}_J$ & $l^{I-}_J$ \\ \hline \end{tabular} \caption{\small Rules originally containing item I obtained by removing I, and corrispondent support, confidence and lift measures.} \label{m_rules_2} \end{table} To analyze the importance of a transcription factor, for example TF1, we can compare the two distributions \{$s^I_j$, $c^I_j$\}$_{j=1:J}$ and \{$s^{I^-}_j$, $c^{I^-}_j$\}$_{j=1:J}$ for each j in \{1,...,J\}. Since support and confidence vary in [0,1], while lift is directly proportional to the confidence measure, we can define an index of importance of the item I in the rule R$^I$$_j$ for j in \{1,...,J\} as: \begin{center} \begin{equation} \label{imp_rule} imp(I)_j = {\Delta s}_j + {\Delta c}_j \end{equation} \end{center} with: \hspace{0.5cm} \begin{math} {\Delta s}_j = {s^I}_j - {s^{I^-}}_j\hspace{0.7cm} {\Delta c}_j = {c^I}_j - {c^{I^-}}_j\hspace{0.7cm} \end{math} The importance of I in its set of rules \{R$^I$\} is obtained evaluating the mean of all its importances imp(I)$_j$ in the set of rules: \begin{equation} \centering \label{imp_formula} imp(I) = \frac{\sum_{j=1}^{J} imp(I)_j}{J} \end{equation} Then, evaluating the index imp(I) for each item I in the relevant association rules extracted can be useful to rank the transcription factors by their importance in the association with the transcription factor target, TFt. The presence of the transcription factors with highest mean Importance Index is assumed to be fundamental for the existence of some regulatory complexes (i.e., association rules assumed to be relevant); the transcription factors with lower mean importances, instead, do not significantly influence the pattern of transcription factors associated with the transcription factor target. The definition of the Importance Index can be extended to couples of items, triplettes and so on. This can be easily done substituting the item I with a set of items (for example for a couple of items I becomes, for instance, I=\{TF1=1,TF2=1\}), and applying the rest of the procedure in a completely analogous way. Thus, we identify as $R^I$ the set of rules containing both TF1 and TF2 and $R^{I-}$ as the set of correspondent rules without the two transcription factors. This kind of approach allows the identification of interactions between transcription factors that would be unrevealed just looking at a list of association rules. The \texttt{rulesTF} function in \Biocpkg{TFARM} package provides the subset of input rules containing a given transcription factor TFi. It takes in input: \begin{itemize} \item a set of rules \item the transcription factor TFi that the user wants to find in the LHSs of a subset of the considered rules \item a logical parameter, \textit{verbose}: if \textit{verbose = TRUE}, a console message is returned if the searched subset of rules is empty. \end{itemize} The output of the function is a data.frame containing the subset of rules whose LHSs contain TFi, and the corresponding quality measures. Using the introduced notation, the output of the \texttt{rulesTF} function is \{R$^I$$_j$\}$_{j=1:J}$ with the quality measures \{s$^I$$_j$, c$^I$$_j$, l$^I$$_j$\}$_{j=1:J}$. The data.frame has J rows and five columns: the first colum contains the LHS of the selected rules, the second one contains the RHS of the rules and the last three columns contain s$^I$$_j$, c$^I$$_j$, l$^I$$_j$ (that is a data.frame like the one in Table \ref{m_rules_1}). \bigskip <<>>= # To find the subset of rules containing the transcription factor FOSL2: r_FOSL2 <- rulesTF(TFi = 'FOSL2=1', rules = r_TEAD4, verbose = TRUE) head(r_FOSL2) dim(r_FOSL2)[1] @ \bigskip <<>>= # If none of the rules in input to rulesTF contains the given item TFi, # and verbose = TRUE, a warnig message is reported to the user: r_CTCF <- rulesTF(TFi = 'CTCF=1', rules = r_TEAD4, verbose = TRUE) @ \bigskip If the user wants to evaluate the importance of an item I in a set of rules $R^I$, the user needs to substitute the presence of I in all the left-hand-side patterns of $R^I$ with its absence: this is done using the function \texttt{rulesTF0} in \Biocpkg{TFARM} package. This function takes in input: \begin{itemize} \item the transcription factor TFi to be removed \item a set of rules containing TFi \item the total set of rules \item the GRanges object containing the Indicator of presence matrix \item the transcription factor target. \end{itemize} It returns a data.frame with the rules obtained substituing the presence of TFi with its absence and the corrispondent measures. Using the introduced notation, the output of the \texttt{rulesTF0} function is \{R$^{I-}$$_j$\}$_{j=1:J}$ with the quality measures \{s$^{I-}$$_j$, c$^{I-}$$_j$, l$^{I-}$$_j$\}$_{j=1:J}$. The data.frame has J rows and five columns: the first colum contains the LHS of the rules in $R^I$ without TFi, the second one contains the RHS of the rules and the last three columns contain s$^{I-}$$_j$, c$^{I-}$$_j$, l$^{I-}$$_j$ (that is a data.frame like the one in Table \ref{m_rules_2}). \bigskip <>= # For example to evaluate FOSL2 importance in the set of rules r_FOSL2: r_noFOSL2 <- rulesTF0('FOSL2=1', r_FOSL2, r_TEAD4, MCF7_chr1, "TEAD4=1") @ <<>>= head(r_noFOSL2) @ \bigskip Now that the two sets of rules \{R$^I$$_j$\}$_{j=1:J}$ and \{R$^{I-}$$_j$\}$_{j=1:J}$ and the two sets of measures \{s$^I$$_j$, c$^I$$_j$, l$^I$$_j$\}$_{j=1:J}$ and \{s$^{I-}$$_j$, c$^{I-}$$_j$, l$^{I-}$$_j$\}$_{j=1:J}$ are obtained, the user can compute the Importance Index distribution for the chosen transcription factor TFi. This can be done with the function \texttt{IComp} in the \Biocpkg{TFARM} package which takes in input: \begin{itemize} \item the transcription factor TFi \item the subset of rules rules\_TF containing TFi (provided by the function \texttt{rulesTF}) with their quality measures of support, confidence and lift \item the subset of rules rules\_noTF obtained from rules\_TF removing TFi (provided by the function \texttt{rulesTF0}) \item a logical parameter (figures) to graphically rapresent \{s$^I$$_j$, c$^I$$_j$, l$^I$$_j$\}$_{j=1:J}$ and \{s$^{I-}$$_j$, c$^{I-}$$_j$, l$^{I-}$$_j$\}$_{j=1:J}$; set \textit{figures = TRUE} to get it as an output. \end{itemize} The function has five outputs: \begin{itemize} \item imp, which is the set of importance index values of TFi in the given set of rules (rules\_TF), one value for each rule. \item delta, which is the matrix of variations of standarded support, confidence and lift obtained removing TFi from rules\_TF. \item rwi, which is a data.frame that contains rules from \texttt{rulesTF} associated with each candidate co-regulator transcription factor. \item rwo, which is a data.frame with rules in rwi obtained removing each transcription factor TFi. \item the plots of \{s$^I$$_j$, c$^I$$_j$, l$^I$$_j$\}$_{j=1:J}$ and \{s$^{I-}$$_j$, c$^{I-}$$_j$, l$^{I-}$$_j$\}$_{j=1:J}$ obtained if the user sets \textit{figures = TRUE}. \end{itemize} \bigskip <>= # Perform the IComp function to compute the Importance Index distribution: imp_FOSL2 <- IComp('FOSL2=1', r_FOSL2, r_noFOSL2, figures=TRUE) names(imp_FOSL2) imp_FOSL2$imp head(imp_FOSL2$delta) head(imp_FOSL2$rwi) head(imp_FOSL2$rwo) @ \bigskip \incfig{figure/IComp-1}{\textwidth}{Support and Confidence for the extracted rules before and after the removal of item $I$.}{Left panel: Support distribution $\{s^I_j\}_{j=1:J}$, black thick line, and $\{s^{I-}_j\}_{j=1:J}$,red dotted line. Right panel: Confidence distribution $\{c^I_j\}_{j=1:J}$, black thick line, and $\{c^{I-}_j\}_{j=1:J}$, red dotted line.} The most useful application of the function \texttt{IComp} is the ranking of candidate co-regulator transcription factors through their importances. As previously seen, the candidate co-regulators are returned by the function \texttt{presAbs}. The evaluation of the mean importance of each co-regulator can be computed cycling the three functions \texttt{rulesTF}, \texttt{rulesTF0} and \texttt{IComp} over a string vector with all the transcription factors present in the set of relevant association rules extracted, as returned by the function presAbs. \bigskip <>= # For the considered example the user could run: DELTA_mean_supp <- vector("list", length(p_TFs)) DELTA_mean_conf <- vector("list", length(p_TFs)) all <- lapply(p_TFs, function(pi) { A <- rulesTF(pi, r_TEAD4, FALSE) B <- rulesTF0(pi, A, r_TEAD4, MCF7_chr1, "TEAD4=1") IComp(pi, A, B, figures=FALSE) }) for (i in 1:length(p_TFs)) { IMP_Z[[i]] <- all[[i]]$imp # Extract the delta variations of support and confidence: DELTA_mean_supp[[i]] <- apply(all[[i]]$delta[1], 2, mean) DELTA_mean_conf[[i]] <- apply(all[[i]]$delta[2], 2, mean) } IMP <- data.frame( TF = p_TFs, imp = sapply(IMP_Z, mean), sd = sapply(IMP_Z, sd), delta_support = as.numeric(DELTA_mean_supp), delta_confidence = as.numeric(DELTA_mean_conf), nrules = sapply(IMP_Z, length), stringsAsFactors=FALSE ) library(plyr) @ <<>>= # Sort by imp column of IMP IMP.ord <- arrange(IMP, desc(imp)) IMP.ord @ \bigskip In this way we get, besides the mean Importance Index of each candidate co-regulator of TFt (TFt = TEAD4 in the example), the standard deviation of the distribution of the Importance Index of each candidate co-regulator of TFt, and the number of rules in which each candidate co-regulator of TFt is present. \medskip The function \texttt{IComp} can be easily generalized for the computation of the mean Importance Index of combinations of transcription factors (see the example used for the \texttt{heatI} function in the following section). \subsection{Validation of the Importance Index formula} We defined the Importance Index of an item in an association rule as a linear combination of the variations of the support and confidence of the rule, obtained substituting the presence of the item in the left-hand-side of the association rule, with its absence (as in Formula \ref{imp_rule}). In this way we assume that each of the two variations equally contributes to the evaluation of the contribution of the item to the prediction of the presence of another item in the right-hand-side of the considered association rule. Neverthless, one of the two quality measures might be more or less sensitive than the others to the removal of the item from the rule, leading to a greater or smaller variation of one or more of the values of support and confidence. We observe that for each item I, the variations of support and confidence obtained removing I from a set of rules in which I is involved, are placed in a 2D space defined by the terns ($\Delta s$, $\Delta c$). \begin{table}[htbp]\footnotesize \begin{tabular}{c|c|c|c} TF & $\Delta s_s$ & $\Delta c_c$\\ \hline $TF_1$ & $\Delta s_{s,1}$ & $\Delta c_{c,1}$\\ ... & ... & ... & ...\\ $TF_1$ & $\Delta s_{s,n_1}$ & $\Delta c_{c,n_1}$\\ & & & \\ ... & ... & ... & ...\\ & & & \\ $TF_M$ & $\Delta s_{s,K-n_M+1}$ & $\Delta c_{c,K-n_M+1}$\\ ... & ... & ... & ...\\ $TF_M$ & $\Delta s_{s,K}$ & $\Delta c_{c,K}$\\ \end{tabular} \caption{\small Matrix with the variations of the support and confidence, obtained removing each transcription factor from the subset of rules in which it is present. M is the total number of transcription factors, K is the total number of rules and $n_i$ is the number of rules for transcription factor TFi.} \label{D} \end{table} Thanks to the Principal Components Analysis \cite{johnson2002applied} \cite{bro2014principal}, computed by the function \texttt{IPCA} in the \Biocpkg{TFARM} package, we can evaluate if it is possible to find a subspace of $\mathbb{R}^2$ in which the most variability of the dataset containing the variations of the measures (Table \ref{D}) is captured. This can be easily done by extracting the delta variations of support and confidence, using the function \texttt{IComp}, simply getting its \textit{delta} output, as well as a matrix containing the candidate co-regulators found, and the number of rules in which each of them appears. A principal component is a combination of the original variables after a linear transformation; the set of principal components defines a new reference system. The new coordinates of data rapresented in the reference system defined by principal components are called \textit{scores}, and the coefficients of the linear combination that define each principal component are called \textit{loadings} (so, loadings give a measure of the contribution of every observation to each principal component). The \texttt{IPCA} function takes in input: \begin{itemize} \item the list of variations of distributions of support and confidence measures, obtained from the \texttt{IComp} function \item a matrix with the mean importance of every candidate co-regulator transcription factor and the number of rules in which each of them appears. \end{itemize} It returns: \begin{itemize} \item a summary, containing: the standard deviation on each principal component, the proportion of variance explained by each principal component, and the cumulative proportion of variance described by each principal component; \item the scores of each principal component \item the loadings of each principal component \item a plot with the variability and the cumulate percentage of variance explained by each principal component \item a plot with the loadings of the principal components \end{itemize} \bigskip <>= # Select the candidate co-regulators and the number of rules # associated with them, then perform the Principal Component Analysis: colnames(IMP) TF_Imp <- data.frame(IMP$TF, IMP$imp, IMP$nrules) i.pc <- IPCA(DELTA, TF_Imp) names(i.pc) i.pc$summary head(i.pc$loadings) head(i.pc$scores) @ %% \incfig{figure/IPCA-1}{\textwidth}{Principal Component Analysis of Importance Index}{Variances of each of the two principal components (on the top left), cumulate proportion of variance explained by each principal component (on the top right), and loadings of the two principal components.} \bigskip As we can see looking at the value of the variance associated with the first principal component in Figure \ref{figure/IPCA-1}, this value explains the 89.13\% of the variability of the DELTA dataset. Moreover, from the plot of the loadings in Figure \ref{figure/IPCA-1}, it is easy to note that the first principal component is a linear combination of the variations of support and confidence, that equally contribute to the combination. So, it is reasonable to define the Importance Index as in Formula \ref{imp_rule}. \section{Visualization tools} The function \texttt{distribViz} in the \Biocpkg{TFARM} package provides a boxplot visualization of the Importance Index distributions of a set of transcription factors (or of combinations of transcription factors). \bigskip <>= # Considering for example the candidate co-regulator transcription factors # found in the set of rules r_TEAD4: distribViz(IMP_Z, p_TFs) @ \incfig{figure/distribViz-1}{\textwidth}{Importance Index distribution.}{Importance Index distributions of candidate co-regulators of TEAD4 in the set of the 30 most relevant associations for the prediction of the presence of TEAD4 in promotorial regions of chromosome 1 in cell line MCF7.} The shape of a boxplot changes as follows: \begin{itemize} \item The higher the number of rules containing the candidate co-regulator I, the larger the boxplot for I is \item The higher the variability of the Importance Index of I, the longer the boxplot for I is \item The higher the median of the Importance Index distribution of I, the higher the boxplot for I is aligned with respect to the y axis. \end{itemize} Moreover, named $q_1$ and $q_3$ the first and third quartiles of the Importance Index distribution for a given item I, all the rules where I has importance \begin{math}x \leq q_1 - 1.5*(q_3 - q_1)\end{math} or \begin{math} x \geq q_1 + 1.5*(q_3 - q_1)\end{math} are considered outlier rules, and represented with a circle outside the boxplot. For example, in the boxplots in Figure \ref{figure/distribViz-1} it is easy to notice that: \begin{itemize} \item[1.] SIN3AK20, HDAC2, GATA3 and GABPA have the highest median Importance Index, and they are present in an high number of relevant association rules \item[2.] HA.E2F1 and TCF12 have intermediate median Importance Index and the lowest variability Importance Index distribution \item[3.] ELF1 and NR2F2 are present in a high number of relevant association rules, but they have low median Importance Index and high variability of the Importance Index distribution. \end{itemize} It can also be noticed that for the transcription factors GABPA, MYC, MAX, NR2F2, FOSL2 and ZNF217 there are some outlier rules, that are rules in which the Importance Index of the candidate co-regulator transcription factor is a lot higher or lower than in the rest of the distribution. These outliers can be extracted as reported in the following text: \bigskip <<>>= # Select the index of the list of importances IMP_Z # containing importance distributions of transcription factor ZNF217 ZNF217_index <- which(p_TFs == 'ZNF217=1') # Select outlier rules where ZNF217 has importance greater than 0 o <- IMP_Z[[ZNF217_index]] > 0 rule_o <- all[[ZNF217_index]]$rwi[o,] # Display the one rule for example the sixth rule_o[6,] # So, ZNF217 is very relevant in the pattern of transcription factors # {FOSL2=1,GABPA=1,MYC=1,MAX=1,ZNF217=1} # for the prediction of the presence of TEAD4. # To extract support, confidence and lift of the corresponding rule # without ZNF217: all <- all[[ZNF217_index]]$rwo[o,] all[6,] # Since the measure of the rule obtained removing ZNF217 is equal to zero, # the rule {FOSL2=1,GABPA=1,MYC=1,MAX=1,ZNF217=0} -> {TEAD4=1}, # obtained removing ZNF217, is found in the relevant rules for the prediction # of the presence of TEAD4. @ \bigskip The function \texttt{heatI} is another useful visualization tool of the package \Biocpkg{TFARM}; it takes in input: \begin{itemize} \item a string vector with names of transcription factors \item a vector of mean importances of pairs of transcription factors in the previous input. \end{itemize} It returns a heatmap visualization of the mean importances of transcription factors in the considered string vector. Evaluating importances of combinations of transcription factors, the number of Importance Index distribution grows combinatorially. This makes more difficult to see which are the most important combinations (even sorting them by their mean importances). For the pairs of transcription factors, the function \texttt{heatI} gives an heatmap visualization of a square matrix whose elements are as follows (Table \ref{matrimp}): called M the number of candidate co-regulators transcription factors, the element (i,j) of such matrix is the mean importance of the couple of transcription factors ($TF_i$, $TF_j$). This matrix is symmetric with respect to the main diagonal. \begin{table}[htbp]\footnotesize \begin{tabular}{|c|c|c|c|c|c|} & $TF_1$ & $TF_2$ & ... & $TF_{M-1}$ & $TF_M$\\ \hline $TF_1$ & imp($TF_1$)& imp(\{$TF_1$,$TF_2$\}) & ... & imp(\{$TF_1$,$TF_{M-1}$\}) & imp(\{$TF_1$,$TF_M$\}) \\ $TF_2$ & imp(\{$TF_2$,$TF_1$\}) & imp($TF_2$)&... &imp(\{$TF_2$,$TF_{M-1}$\}) & imp(\{$TF_2$,$TF_M$\}) \\ ... & & & & & \\ $TF_{M-1}$ & imp(\{$TF_{M-1}$,$TF_1$\}) & imp(\{$TF_{M-1}$,$TF_2$\}) &... & imp($TF_{M-1}$)& imp(\{$TF_{M-1}$,$TF_M$\}) \\ $TF_M$ & imp(\{$TF_M$,$TF_1$\}) & imp(\{$TF_M$,$TF_2$\}) & ... & imp(\{$TF_M$,$TF_{M-1}$\}) & imp($TF_M$) \\ \end{tabular} \caption{\small Mean importance matrix of couples of transcription factors} \label{matrimp} \end{table} To get this matrix, all the possible combinations of two candidate co-regulator transcription factors need to be built. It can be easily computed with the function \texttt{combn} in the package \Rpackage{combinat}. This function takes as input a vector (which is a string vector of transcription factors) and the number of elements in the required combinations. Using combn(p, 2), it generates all combinations of the elements of p taken two at a time. The elements of each combination are then combined in the form \textit{TF1,TF2}. \bigskip <<>>= # Construct couples as a vector in which all possible combinations of # transcription factors (present in at least one association rules) # are included: couples_0 <- combn(p_TFs, 2) couples <- paste(couples_0[1,], couples_0[2,], sep=',') head(couples) @ <>= # The evaluation of the mean Importance Index of each pair is # then computed similarly as previously done for single transcription factors: # Compute rulesTF, rulesTF0 and IComp for each pair, avoiding pairs not # found in the r_TEAD4 set of rules IMP_c <- lapply(couples, function(ci) { A_c <- rulesTF(ci, r_TEAD4, FALSE) if (all(!is.na(A_c[[1]][1]))){ B_c <- rulesTF0(ci, A_c, r_TEAD4, MCF7_chr1, "TEAD4=1") IComp(ci, A_c, B_c, figures=FALSE)$imp } }) # Delete all NULL elements and compute the mean Importance Index of each pair I_c <- matrix(0, length(couples), 2) I_c <- data.frame(I_c) I_c[,1] <- paste(couples) null.indexes <- vapply(IMP_c, is.null, numeric(1)) IMP_c <- IMP_c[!null.indexes] I_c <- I_c[!null.indexes,] I_c[,2] <- vapply(IMP_c, mean, numeric(1)) colnames(I_c) <- colnames(IMP[,1:2]) @ <<>>= # Select rows with mean Importance Index different from NaN, then order I_c: I_c <- I_c[!is.na(I_c[,2]),] I_c_ord <- arrange(I_c, desc(imp)) head(I_c_ord) @ <>= # Construction of a vector in which mean Importance Index values of pairs # of transcription factors are represented. # These transcription factors are taken from the output of presAbs as # present in at least one association rules. # The function rbind is used to combine IMP columns and I_c_ord columns and # then the function arrange orders the data frame by the imp column. I_c_2 <- arrange(rbind(IMP[,1:2], I_c_ord), desc(imp)) i.heat <- heatI(p_TFs, I_c_2) @ \bigskip To build the heatmap the user must also consider the single transcription factor mean importances (since the heatmap diagonal elements are the mean importances of single transcritpion factors). \bigskip \incfig{figure/heatmap-1}{0.55\textwidth}{Heatmap.}{Mean importance of couples of candidate co-regulator transcription factors in the set of the 30 most relevant rules for the prediction of the presence of TEAD4 in promotorial regions of chromosome 1 in cell line MCF-7. The mean importances of single transcrption factors are rapresented in the main diagonal as in Table \ref{matrimp}.} The obtained heatmap is represented in Figure \ref{figure/heatmap-1}. The colour scale indicates that the lowest mean importances are represented in dark red, whereas the highest ones are represented in light white. This rapresentation is useful to notice that, for example: \begin{itemize} \item ZNF127 has high mean Importance Index alone and in couple with all the others candidate co-regulator transcription factors; \item TCF12 has low mean Importance Index alone and in couple with all the others candidate co-regulator transcription factors, except with GABPA, ZNF127 and NR2F2. \end{itemize} \bibliography{bibliography} \end{document}