\documentclass[a4paper,12pt]{article}
<<>>=
BiocStyle::latex()
@
%\VignetteIndexEntry{Manual for the TTMap library}
%\VignettePackage{TTmap}
\usepackage{amsmath} % need for subequations
\usepackage{amssymb} %useful mathematical symbols
\usepackage{bm} %needed for bold greek letters and math symbols
\usepackage{graphicx} % need for PS figures
\usepackage{hyperref} % use for hypertext links
\begin{document}
\title[Two-Tier Mapper]{\textbf{\LARGE{Two-Tier Mapper: a user-independent clustering method for global gene expression analysis based on topology}}}
\author{Rachel Jeitziner\\ \small{UPBRI\& ISREC} \\ \small{EPFL Lausanne}}
\date{} %comment to include current date
\maketitle
\tableofcontents

\section{Introduction}
\label{sec:intro}
We developed a new user-independent analytical framework, called \textit{Two-Tier Mapper (TTMap)}. TTMap consists of two separate and independent parts: (1) Hyperrectangle Deviation Assessment (HDA) and (2) Global-to-Local Mapper (GtLMap). The first part establishes properties of the control group and removes outliers in order to calculate the deviation of each vector in the test group from the corrected control group. The second part uses the traditional Mapper algorithm \cite{Extracting} with a two-tier cover and a special distance. This topological tool detects both global and local differences in the patterns of deviations and thereby captures the structure of the test group. The samples are clustered according to the shape of their deviation (whether two samples both deviate positively, both deviate negatively, or both resemble the control). To retain the information about the amount of deviation, the data are additionally separated into 4 clusters according to a function measuring the amount of deviation. These clusters form the second tier. Each cluster is colored by the extent of the deviation. A list of the differentially expressed genes is also provided.
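As a rough illustration of the second tier, the following base-R sketch splits samples into four groups by the quartiles of a filter value. The toy vector \texttt{overall\_dev} and the use of \texttt{quantile} and \texttt{cut} are assumptions made for illustration only; this is not the code used inside TTMap.

```r
# Toy filter values: one overall deviation per test sample (hypothetical data).
overall_dev <- c(s1 = 0.5, s2 = 3.2, s3 = 1.1, s4 = 7.8,
                 s5 = 2.4, s6 = 0.9, s7 = 5.0, s8 = 4.1)

# Quartile boundaries of the filter values.
breaks <- quantile(overall_dev, probs = seq(0, 1, by = 0.25))

# Assign each sample to one of four tiers (1 = lowest deviation).
tier <- cut(overall_dev, breaks = breaks, labels = FALSE,
            include.lowest = TRUE)
names(tier) <- names(overall_dev)

# Sample names per tier; the first (global) tier would use all samples at once.
tiers <- split(names(overall_dev), tier)
print(tiers)
```

Clustering is then run once on all samples and once within each of the four groups, which is what gives the global and the local views.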
The functions and methods presented in this \textit{vignette} explain how to use TTMap with its default settings and which parameters can be changed by the user.

\begin{figure*}[h]
\begin{center}
\includegraphics[height= 17.5cm]{Figure1new_test_small.png}
\caption{ Schematic overview of TTMap. The inputs (green) are given by two gene expression matrices, the control ($\mathsf{N}$) and the test group ($\mathsf{T}$); rows represent genes and columns represent samples. In Part 1, TTMap adjusts the control group for outlier values ($\bar{N}_*$), feature by feature. It calculates the deviation from this corrected control group for individual samples in the test group ($Dc.T_* $). In Part 2, TTMap computes a similarity measure, the mismatch distance (represented as a heatmap), using the deviation components. The Mapper \cite{Mapper} algorithm is used with a two-tier cover to generate a visual representation of the clustering, creating a network of global clusters (Overall) and local clusters (1st, 2nd, 3rd, 4th quartile of a filter function). It takes as inputs the mismatch distance and the deviation components.}
\end{center}
\end{figure*}

\section{Prepare the data}
\label{sec:prepare}
Load the file(s) to compare into R. Log-transform and subselect them. Use \emph{make\_matrices} to create the files needed by the first function of TTMap, since it generates the control and the test matrix in the right format. As an example, we use the airway data set available at Bioconductor.
\begin{scriptsize}
<<>>=
library(airway)
data(airway)
# keep genes with more than 80 reads in total across all samples
airway <- airway[rowSums(assay(airway))>80,]
# log2-transform the counts (adding 1 to avoid log(0))
assay(airway) <- log(assay(airway)+1,2)
# columns 1-4 form the control group, columns 5-8 the test group
experiment <- TTMap::make_matrices(airway,seq_len(4),seq_len(4)+4,
NAME = rownames(airway), CLID =rownames(airway))
@
\end{scriptsize}
This function can be used directly on a normalised count table from RNA-seq by specifying which columns belong to the control group (in col\_ctrl) and which columns belong to the test group (in col\_test).
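The preparation steps themselves (filtering, log transformation, splitting by column index) can be tried out on a small toy matrix without any Bioconductor package. The sketch below uses invented data and hypothetical variable names; it only mirrors the logic of the chunk above and does not reproduce the output format of \emph{make\_matrices}.

```r
# Toy count matrix: 5 genes x 8 samples (hypothetical data).
set.seed(1)
counts <- matrix(rpois(40, lambda = 50), nrow = 5,
                 dimnames = list(paste0("gene", 1:5), paste0("S", 1:8)))
counts["gene5", ] <- 0   # a gene without reads, to be filtered out

# Keep genes with more than 80 reads in total, then take log2(count + 1).
kept <- counts[rowSums(counts) > 80, , drop = FALSE]
logged <- log(kept + 1, 2)

# Column indices of the two groups (same indices as in the airway example).
col_ctrl <- seq_len(4)
col_test <- seq_len(4) + 4

ctrl <- logged[, col_ctrl, drop = FALSE]
test <- logged[, col_test, drop = FALSE]
print(dim(ctrl))
print(dim(test))
```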

\section{TTMap part1: Adjustment of the control group (ctrl\_adj)}
The first part of the method checks whether the control and the test matrices have the same row names; if not, the method subselects the common rows. It outputs the files with the common rows subselected (with the extension mesh). It then calculates the corrected control matrix, in which outliers are removed and replaced either by a chosen method (a function that takes as input the matrix with NAs where there is an outlier and returns a matrix without NAs) or, by default, by the median of the other values. The inputs can be given either by the CTRL and TEST variables of the list returned by \emph{make\_matrices} or by control and test matrices supplied in pcl format (see \cite{Monica}). The name of the control group and the project name need to be supplied, as well as the working directory in which the output files will be created. A value for what to consider as an outlier (called e) can be supplied; otherwise the data-driven default value given by the method is used. If there are any batch effects to consider, they can be supplied using the variable B, which is a vector of numbers representing the batches. Last, the parameter $P$ sets a limit on outliers per gene: genes that have a higher percentage than $P$ of outlier values are removed.
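The correction described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the function name \texttt{correct\_controls} is hypothetical, a value is assumed to be an outlier when it differs from the gene median by more than e, and P is treated as a fraction. The rule actually applied by \emph{control\_adjustment} may differ.

```r
# Replace control outliers by the median of the remaining values, gene by gene.
# ASSUMPTION: a value is an outlier if it differs from the gene median by more than e.
correct_controls <- function(ctrl, e = 1, P = 1.1) {
  med <- apply(ctrl, 1, median)
  is_out <- abs(ctrl - med) > e        # med is recycled down the rows
  masked <- ctrl
  masked[is_out] <- NA

  # Drop genes whose fraction of outliers exceeds P (P > 1 keeps every gene).
  keep <- rowMeans(is_out) <= P
  masked <- masked[keep, , drop = FALSE]

  # Default replacement: median of the non-outlying values of the same gene.
  fill <- apply(masked, 1, median, na.rm = TRUE)
  idx <- which(is.na(masked), arr.ind = TRUE)
  masked[idx] <- fill[idx[, "row"]]
  masked
}

ctrl <- rbind(geneA = c(5.0, 5.2, 4.9, 9.0),
              geneB = c(2.0, 2.1, 1.9, 2.0))
corrected <- correct_controls(ctrl, e = 1)
print(corrected)
```

In this toy example the value 9.0 of geneA is flagged and replaced by the median of the three remaining values, while geneB is left unchanged.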
\begin{scriptsize}
<<>>=
# e: outlier threshold, P: allowed percentage of outliers per gene, B: batches
E=1
Pw=1.1
Bw=0
TTMAP_part1prime <-TTMap::control_adjustment(normal.pcl = experiment$CTRL,
tumor.pcl = experiment$TEST,
normalname = "The_healthy_controls",
dataname = "The_effect_of_cancer",
org.directory = getwd(), e=E,P=Pw,B=Bw);
@
\end{scriptsize}
This outputs:
\begin{itemize}
\item A file with the number of outliers per sample (Dataname followed by the number of the batch followed by na\_numbers\_per\_col.txt)
\item A file with the number of outliers per row (Dataname followed by the number of the batch followed by na\_numbers\_per\_row.txt)
\item A picture of the distribution of the mean against variance for each gene before correction of outliers (Dataname followed by \_mean\_vs\_variance.pdf)
\item The same picture after correction of outliers (Dataname followed by \\ \_mean\_vs\_variance\_after\_correction.pdf).
\end{itemize}
The corrected control matrix is output in the next step. A possible output after this first step is shown in Figure \ref{fig_mean_vs_var}.
\begin{figure}
\begin{center}
\includegraphics[scale=14]{The_effect_of_cancer_mean_vs_variance.pdf}
\caption{\texttt{barplotSignifSignatures}: Plot of mean against variance per gene.}
\label{fig_mean_vs_var}
\end{center}
\end{figure}

\section{TTMap part1: Hyperrectangle deviation assessment (hda)}
This part calculates the deviation components from a hyperrectangle. These are needed by the third function (ttmap\_part2\_gtlmap) to calculate the shape of deviation. The parameter k determines whether all the vectors of the control group are kept or only the top k-dimensional principal component approximation of the control matrix, obtained using the singular value decomposition (as in \cite{Monica}). The default is to keep all the vectors.
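The two ideas of this section can be sketched in base R. Both function names below are hypothetical. The sketch assumes that the hyperrectangle is spanned, gene by gene, by the range of the corrected control values and that the k-dimensional approximation is the usual truncated singular value decomposition; \emph{hyperrectangle\_deviation\_assessment} may differ in its details.

```r
# Rank-k approximation of a matrix via the singular value decomposition.
rank_k_approx <- function(m, k) {
  s <- svd(m)
  k <- min(k, length(s$d))
  approx <- s$u[, 1:k, drop = FALSE] %*% diag(s$d[1:k], nrow = k) %*%
    t(s$v[, 1:k, drop = FALSE])
  dimnames(approx) <- dimnames(m)
  approx
}

# ASSUMPTION: the hyperrectangle is the per-gene range of the control values;
# the deviation is the signed distance of a test value to that range
# (0 inside the range, positive above it, negative below it).
hyperrectangle_deviation <- function(test, ctrl) {
  lo <- apply(ctrl, 1, min)
  hi <- apply(ctrl, 1, max)
  pmax(test - hi, 0) + pmin(test - lo, 0)
}

ctrl <- rbind(g1 = c(1, 2, 3), g2 = c(4, 4, 5))
test <- rbind(g1 = c(t1 = 5, t2 = 2), g2 = c(3, 4.5))

dev <- hyperrectangle_deviation(test, ctrl)
overall <- colSums(abs(dev))   # overall deviation per test sample
print(dev)
print(overall)
```

Here sample t1 lies above the control range for g1 and below it for g2, whereas t2 lies inside the range for both genes and therefore has zero deviation.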
\begin{scriptsize}
<<>>=
# k is set to the number of columns of the corrected control matrix (the default)
TTMAP_part1_hda <-
TTMap::hyperrectangle_deviation_assessment(x = TTMAP_part1prime,
k=dim(TTMAP_part1prime$Normal.mat)[2],
dataname = "The_effect_of_cancer",
normalname = "The_healthy_controls");
# first rows of the deviation component matrix
head(TTMAP_part1_hda$Dc.Dmat)
@
\end{scriptsize}
The outputs of this step are the following.
\begin{itemize}
\item The corrected control matrix, calculated in the first step, is given in \textit{The\_healthy\_controls.NormalModel.pcl}, with a possible trimming of columns if $k$ is different from the number of columns in the corrected matrix.
\item The deviation component of each test sample is written in \textit{The\_effect\_of\_cancer.Tdis.pcl}. An example of the deviation component is shown by the call \textit{head(TTMAP\_part1\_hda\$Dc.Dmat)} in the previous script.
\item The normal component of each test sample is written in \textit{The\_effect\_of\_cancer.Tnorm.pcl}.
\end{itemize}
The function returns two values: the deviation component matrix and the overall deviation (calculated by summing the absolute values of the deviation components).

\section{TTMap part2: Global-to-local Mapper (gtlmap)}
The third part corresponds to the Global-to-local Mapper part. One starts with an annotation file of the samples, which is used to annotate the obtained clusters. In this example we simply duplicated the column names. The row names of this annotation file must be the column names of the test samples followed by ``.Dis''. We then calculate the distance matrix between the samples using the \textit{generate\_mismatch\_distance} function, which uses a cutoff parameter $\alpha$ to decide what is considered as noise. Any other distance matrix can be computed here and used for the next step.
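To make the role of $\alpha$ concrete, here is a minimal base-R sketch of a sign-based mismatch distance. The function name \texttt{mismatch\_distance} is hypothetical, and the definition used (deviations with absolute value at most $\alpha$ count as noise, and the distance is the fraction of genes whose signs disagree) is an assumption for illustration; \textit{generate\_mismatch\_distance} may be defined differently.

```r
# ASSUMPTION: deviations with absolute value <= alpha count as noise (sign 0),
# and the distance is the fraction of genes whose signs disagree.
mismatch_distance <- function(dev, alpha = 1) {
  s <- sign(dev) * (abs(dev) > alpha)   # -1, 0 or +1 per gene and sample
  n <- ncol(s)
  d <- matrix(0, n, n, dimnames = list(colnames(s), colnames(s)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      d[i, j] <- mean(s[, i] != s[, j])
    }
  }
  d
}

# Toy deviation components: 4 genes x 3 samples (hypothetical data).
dev <- cbind(A = c( 2.0, -3.0, 0.2, 0.0),
             B = c( 1.5, -2.5, 0.0, 0.4),
             C = c(-2.0,  3.0, 0.0, 1.8))
dd <- mismatch_distance(dev, alpha = 1)
print(dd)
```

Samples A and B deviate in the same direction on every gene once small values are treated as noise, so their distance is 0, while C disagrees with both on three of the four genes.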
Then, we calculate and output the clusters using \textit{ttmap\_part2\_gtlmap}, which needs as inputs the values of \textit{ttmap\_part1\_ctrl\_adj} and \textit{ttmap\_part1\_hda}. By default all the genes are used to calculate the overall deviation, but if a subset should be selected (only one pathway, for example), it can be supplied here. ttmap\_part2\_gtlmap then uses \textit{calcul\_e} to calculate a closeness parameter from the data, in order to know what distance is ``close'' enough to cluster samples together. The parameter n determines which column of metadata should be chosen for the output files. Two further convenience parameters are available: if ad is set to a value other than 0 (the default), the clusters in the output picture are not annotated, and if bd is set to a value other than 0 (the default), the output excludes the outliers of the test data set. Once the picture has been adjusted as desired, it can be saved using the \textit{rgl.postscript} function.
\begin{scriptsize}
<<>>=
library(rgl)
ALPHA <- 1
# annotation: row names are the sample names followed by ".Dis"
annot <- c(paste(colnames(experiment$TEST[,-seq_len(3)]),"Dis",sep="."),
paste(colnames(experiment$CTRL[,-seq_len(3)]),"Dis",sep="."))
annot <- cbind(annot,annot)
rownames(annot)<-annot[,1]
# mismatch distance between the test samples
dd5_sgn_only <-TTMap::generate_mismatch_distance(TTMAP_part1_hda,
select=rownames(TTMAP_part1_hda$Dc.Dmat),alpha = ALPHA)
# two-tier clustering; e is computed from the data by calcul_e
TTMAP_part2_gtlmap <- TTMap::ttmap(TTMAP_part1_hda,TTMAP_part1_hda$m,
select=rownames(TTMAP_part1_hda$Dc.Dmat),
annot,e= TTMap::calcul_e(dd5_sgn_only,0.95,TTMAP_part1prime,1),
filename="first_comparison",
n=1,dd=dd5_sgn_only)
# save the current picture
rgl.postscript("first_output.pdf","pdf")
@
\end{scriptsize}

\section{TTMap: Finding the significant genes (sgn\_genes)}
This last function analyses the different clusters for significant features. It outputs a file per level (one for overall, called all, one for the lower quartile, called low, one for the second quartile, called mid1, one for the third quartile, called mid2, and one for the higher quartile, called high).
In each of them one file per cluster is given, with the list of significant genes linked to the cluster. The parameter Relaxed allows a sample whose deviation component is 0 to be counted as a match when the other samples deviate in the same shape.
\begin{scriptsize}
<<>>=
TTMap::ttmap_sgn_genes(TTMAP_part2_gtlmap,
TTMAP_part1_hda, TTMAP_part1prime,
annot, n = 2, a = ALPHA,
filename = "first_trial",
annot = TTMAP_part1prime$tag.pcl,
col = "NAME", path = getwd(), Relaxed = 0)
@
\end{scriptsize}

\section{Conclusion}
Two-Tier Mapper (TTMap) is a topology-based clustering tool, which is user-friendly and reliable. The algorithm first provides an overall clustering in an unbiased manner, since all the parameters are either derived from the data or set to reliable default values. This method enables a refined view on the composition of the clusters by delineating how clusters differ locally and how the local clusters relate to the global structure of the dataset. The output is a visual interpretation of the data given by a colored graph that is easy to interpret and that describes the shape of the data according to the chosen distance. \\
\bibliography{biblio2}
\end{document}