--- title: "A Guide to multiClust" author: "Nathan Lawlor" date: '`r Sys.Date()`' output: html_document: toc: yes toc_depth: 6 pdf_document: toc: yes vignette: | %\VignetteIndexEntry{"A Guide to multiClust"} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ## Introduction Whole transcriptomic profiles are useful for studying the expression levels of thousands of genes across samples. Clustering algorithms are used to identify patterns in these profiles to determine clinically relevant subgroups. Feature selection is a critical integral part of the process. Currently, there are many feature selection and clustering methods to identify the relevant genes and perform clustering of samples. However, choosing the appropriate methods is difficult as recent work demonstrates that no method is the clear winner. Hence, we present an R-package called `multiClust` that allows researchers to experiment with the choice of combination of methods for gene selection and clustering with ease. In addition, using multiClust, we present the merit of gene selection and clustering methods in the context of clinical relevance of clustering, specifically clinical outcome. Our integrative R-package contains: 1. A function to read in gene expression data and format appropriately for analysis in R. 2. Four different ways to select the number of genes a. Fixed b. Percent c. Poly d. GMM 3. Four gene ranking options that order genes based on different statistical criteria a. CV_Rank b. CV_Guided c. SD_Rank d. Poly 4. Two ways to determine the cluster number a. Fixed b. Gap Statistic 5. Two clustering algorithms a. Hierarchical clustering b. K-means clustering 6. A function to calculate average gene expression in each sample cluster 7. A function to correlate sample clusters with clinical outcome **Function Workflow** The seven functions listed below should be used in the following order: 1. `input_file`, a function to read-in a text file containing the matrix of gene probes and samples to analyze. 2. `number_probes`, a function to determine the number of gene probes to select for in the feature selection process. 3. `probe_ranking`, a function to rank and select for probes using one of the available ranking options. 4. `number_clusters`, a function to determine the number of clusters to be used to cluster gene probes and samples. 5. `cluster_analysis`, a function to perform Kmeans or Hierarchical clustering analysis of the selected gene probe expression data. 6. `avg_probe_exp`, a function to produce a matrix containing the average expression of each gene probe within each sample cluster. 7. `surv_analysis`, a function to produce Kaplan-Meier Survival Plots after the discretization of samples into different clusters. **Other Functions Included** * `WriteMatrixToFile`, a function to write a data matrix to a text file. * `nor.min.max`, a function to that uses feature scaling to normalize values in a matrix between 0 and 1. --- ## 1. Getting Started In order to use `multiClust`, the user will need two text files. * The first file is a gene probe expression dataset. This file should be a matrix with columns being the samples and the rows being genes or probes. * The second file contains the patient clinical parameters. This file should be a matrix consisting of two columns. The first column contains the patient survival time and the second column contains the patient survival event occurrence. Survival time should be recorded in months. Survival event occurrence should be indicated by a 0 or 1. A 0 indicates censorship and a 1 indicates an event has occurred. ### 1.1 Obtaining a Gene Expression Dataset and Clinical Information In this example, a gene expression dataset, GSE2034, will be obtained from the Gene Expression Omnibus (GEO) website . The following code can be used to extract the gene expression data and clinical information from the series matrix text zip file. When using this code, be sure that you have a strong internet connection. The GSE datasets are very large! ```{r, eval=FALSE} # Load GEO and biobase libraries library(GEOquery) library(Biobase) library(multiClust) # Obtain GSE series matrix file from GEO website using getGEO function gse <- getGEO(GEO="GSE2034") # Save the gene expression matrix as an object data.gse <- exprs(gse) # Save the patient clinical data to an object pheno <- pData(phenoData(gse)) # Write the gene expression and clinical data to text files WriteMatrixToFile(tmpMatrix=data.gse, tmpFileName="GSE2034.expression.txt", blnRowNames=TRUE, blnColNames=TRUE) WriteMatrixToFile(tmpMatrix=pheno, tmpFileName="GSE2034.clinical.txt", blnRowNames=TRUE, blnColNames=TRUE) ``` In the event that an error is produced when trying to obtain GSE datasets from the NCBI GEO FTP site, please use this alternative method: 1. On the NCBI GEO website (), simply search for the desired dataset (e.g."GSE2034") and scroll down to the "Analyze with GEO2R" section and click on "Series Matrix Files". 2. From there, download the series matrix txt.gz file to your computer. 3. Move the zipped file to a folder of your choice. 4. In R, set your working directory to the directory with the GSE file. 5. In R, read in the downloaded GSE series matrix file. After downloading the GSE series matrix file and moving it to a directory the following code can be used. ```{r, eval=FALSE} # Load GEO and biobase libraries library(GEOquery) library(Biobase) library(multiClust) # Obtain GSE series matrix file from GEO website using getGEO function gse <- getGEO(filename="GSE2034_series_matrix.txt.gz") # Save the gene expression matrix as an object data.gse <- exprs(gse) # Save the patient clinical data to an object pheno <- pData(phenoData(gse)) # Write the gene expression and clinical data to text files WriteMatrixToFile(tmpMatrix=data.gse, tmpFileName="GSE2034.expression.txt", blnRowNames=TRUE, blnColNames=TRUE) WriteMatrixToFile(tmpMatrix=pheno, tmpFileName="GSE2034.clinical.txt", blnRowNames=TRUE, blnColNames=TRUE) ``` #### 1.2 Normalization of Gene Expression Datasets Some datasets on GEO are already normalized using Robust Multichip Average (RMA) procedure available in the 'affy' R pacakge or quantile normalization and log2 scaling using the 'preprocessCore' R package. For datasets that are not normalized, the following code can be used to extract the gene expression matrix from the series matrix text file, quantile normalize the data, log2 scale the data, and lastly write the data to a text file. ```{r, eval=FALSE} # Load necessary libraries library(GEOquery) library(Biobase) library(preprocessCore) library(multiClust) # Obtain GSE series matrix file from GEO website using getGEo function gse <- getGEO(GEO="GSE2034") # Save the gene expression matrix as an object data.gse <- exprs(gse) # Quantile normalization of the dataset data.norm <- normalize.quantiles(data.gse, copy=FALSE) # shift data before log scaling to prevent errors from log scaling # negative numbers if (min(data.norm)> 0) { } else { mindata.norm=abs(min(data.norm)) + .001 data.norm=data.norm + mindata.norm } # Log2 scaling of the dataset data.log <- t(apply(data.norm, 1, log2)) # Write the gene expression and clinical data to text files WriteMatrixToFile(tmpMatrix=data.log, tmpFileName="GSE2034.normalized.expression.txt", blnRowNames=TRUE, blnColNames=TRUE) ``` #### 1.3 Formatting the Patient Clinical Information As mentioned in the beginning of this section, the patient survival information file should be a matrix consisting of two columns. However, when the patient clinical information is written to a text file, it is generally a matrix of more than two columns with all of the clinical parameters. It is recommended that this text file be loaded in R or Microsoft Excel to obtain the two columns of survival information needed. Below is an example of what the patient survival information file should look like: ```{r, eval=TRUE} # Obtain clinical outcome file clin_file <- system.file("extdata", "GSE2034-RFS-clinical-outcome.txt", package="multiClust") # Load in survival information file clinical <- read.delim2(file=clin_file, header=TRUE) # Display first few rows of the file clinical[1:5, 1:2] # Column one represents the survival time in months # Column two represents the relapse free survival (RFS) event ``` --- ## 2. Loading Your Gene Probe Expression Dataset into R The first function of `multiClust` is used to load a gene expression dataset into R and format the matrix so that the probe or gene names are the rownames of the matrix. ### 2.1 Loading Text Files Containing Gene Expression Matrix In this example, the GSE2034 dataset is loaded into the R console using the `input_file` function. **Function Arguments** * This function has one variable called "input" which is the string name of the gene expression text file to load. **Example** ```{r, eval=TRUE} # Load necessary libraries library(multiClust) # Obtain gene expression matrix exp_file <- system.file("extdata", "GSE2034.normalized.expression.txt", package= "multiClust") # Load the gene expression matrix data.exprs <- input_file(input=exp_file) # View the first few rows and columns of the matrix data.exprs[1:4,1:4] ``` In this example, the gene expression matrix is stored into the object "data.exprs". --- ## 3. Gene Selection Algorithms ### 3.1 Determining the Number of Desired Probes or Genes The second function of this package, `number_probes` is a function used to specify the number of desired probes or genes the user would like to select for in their gene selection analysis. **Function Arguments** * This function has multiple arguments. The first argument "input" represents the string name of the expression matrix text file to be loaded into R. * The second argument "data.exp" is the object containing the expression matrix * The last four arguments "Fixed", "Percent", "Poly", and "Adaptive" are the four different methods that determine how the number of genes are selected. The default option in the function is set to "Fixed". 1. **Fixed** is an option where the user provides a positive integer to specify desired number of genes or probes to select for. The default value of "Fixed" is set to 1000. If the user writes: ```{r, eval=TRUE} # Example 1: Choosing a fixed gene or probe number # Load necessary libraries library(multiClust) # Obtain gene expression matrix exp_file <- system.file("extdata", "GSE2034.normalized.expression.txt", package="multiClust") gene_num <- number_probes(input=exp_file, data.exp=data.exprs, Fixed=300, Percent=NULL, Poly=NULL, Adaptive=NULL) ``` As a result, 300 genes from the dataset will be selected during the gene or probe selection process. 2. **Percent** is another option where the user provides a positive integer between 0 and 100 indicating the percentage of total genes or probes to select from the dataset. For example, if the dataset has 20,000 total probes and the user specifies: ```{r, eval=TRUE} # Example 2: Choosing 50% of the total selected gene probes in a dataset # Load necessary libraries library(multiClust) # Obtain gene expression matrix exp_file <- system.file("extdata", "GSE2034.normalized.expression.txt", package="multiClust") gene_num <- number_probes(input=exp_file, data.exp=data.exprs, Fixed=NULL, Percent=50, Poly=NULL, Adaptive=NULL) ``` As a result, 50% of the total dataset probes, or 300 probes, will be selected. 3. **Poly** is the third option which fits three second degree polynomial functions to the gene expression dataset based on the dataset's mean and standard deviation. When the user specifies: ```{r, eval=TRUE} # Example 3: Choosing the polynomial selected gene probes in a dataset # Load necessary libraries library(multiClust) # Obtain gene expression matrix exp_file <- system.file("extdata", "GSE2034.normalized.expression.txt", package="multiClust") gene_num <- number_probes(input=exp_file, data.exp=data.exprs, Fixed=NULL, Percent=NULL, Poly=TRUE, Adaptive=NULL) ``` 4. **Adaptive** is the last option which implements Gaussian mixture modeling (GMM) to determine the appropriate number of genes or probes to select for. When the user specifies: ```{r, eval=FALSE} # Example 4: Choosing the Adaptive Gaussian Mixture Modeling method # Load necessary libraries library(multiClust) gene_num <- number_probes(input=exp_file, data.exp=data.exprs, Fixed=NULL, Percent=NULL, Poly=NULL, Adaptive=TRUE) ``` The "Mclust" function from the package mclust is used to apply Gaussian mixture modeling to the inputted gene expression matrix. In addition, files containing information about the dataset's mean, variance, mixing proportion, and gaussian assignment are also outputted. It should be noted that when choosing one of the three methods of gene or probe number selection, the other methods should be set equal to "NULL". Furthermore, the **Adaptive** option has a very long computational time. **Function Output** This function returns an object containing the number of genes/probes to select for during the feature selection process. #### 3.2 Choosing a Gene Selection Algorithm The third function of this package, `probe_ranking` is used to select the most informative gene probes within a dataset. This function allows the user choose from one of four different probe ranking algorithms. **Function Arguments** * This function has multiple arguments. The first argument "input" represents the string name of the expression matrix text file to be loaded into R. * The second argument "probe_number" is an output of the `number probes` function. For this argument, the user specifies the number of genes to select for. * The third argument "probe_num selection" is a string specifying which type of probe number selection method was used for the `number_probes` function. * The fourth argument "data.exp" is the object containing the expression matrix. * The last argument "method" is a string that specifies which of the four probe ranking methods to use. **List of the Five Probe Ranking Methods** 1. __CV_Rank__ is a gene probe ranking method that selects for probes using the coefficient of variation of the entire dataset (CV). 2. __CV_Guided__ is a gene probe ranking method that uses the coefficient of variation (CV) for each probe to select for probes. 3. __SD_Rank__ is a gene probe ranking method that selects for probes with the highest standard deviation within the dataset. 4. __Poly__ is a ranking method that fits three second degree polynomial functions of mean and standard deviation to the dataset to select the most variable probes in the dataset. This is the only method that does not use a probe number input from the `number_probes` function. When using this ranking method "probe_num" can be set to NULL. **Example** Below is an example of how to use the `probe_ranking` function: ```{r, eval=TRUE} # Load necessary libraries library(multiClust) # Obtain gene expression matrix exp_file <- system.file("extdata", "GSE2034.normalized.expression.txt", package="multiClust") # Load the gene expression matrix data.exprs <- input_file(input=exp_file) # Call probe_ranking function # Select for 500 probes # Choose genes using the SD_Rank method ranked.exprs <- probe_ranking(input=exp_file, probe_number=300, probe_num_selection="Fixed_Probe_Num", data.exp=data.exprs, method="SD_Rank") # Display the first few columns and rows of the matrix ranked.exprs[1:4,1:4] ``` **Function Outputs** The output of this function is a text file containing a selected gene or probe expression matrix. In addition, the selected expression matrix is stored into the chosen variable "ranked.exprs". --- ## 4. Cluster Analysis of Selected Genes and Samples ### 4.1 Determining the Number of Clusters to Divide Samples Into The fourth function is this package, `number_clusters`, is a function used to designate the number of clusters that samples will be assigned to. **Function Arguments** 1. The "data.exp" argument represents an object containing the original gene expression matrix to be used for clustering of genes and samples. This object is an output of the `input_file` function. 2. The "Fixed" argument is a positive integer used to represent the number of clusters the samples and probes will be divided into. 3. The "gap_statistic" argument is a logical indicating whether to calculate the optimal number of clusters to divide the samples into using a gap statistic function `clus_Gap` from the package `cluster`. **Note** The user should only choose either the "Fixed" or "gap_statistic" option, not both. When using the gap_statistic option, change the argument to TRUE and "Fixed" to NULL. **Fixed Cluster Number Example** Below is an example of how to use the `number_clusters` function with a fixed number of clusters: ```{r, eval=TRUE} # Load necessary libraries library(multiClust) # Call the number_clusters function # data.exp is the original expression matrix object outputted from # the input_file function # User specifies that samples will be separated into 3 clusters # with the "Fixed" argument cluster_num <- number_clusters(data.exp=data.exprs, Fixed=3, gap_statistic=NULL) ``` The output from this example will be the object "cluster_num" with a value of 3. **Gap statistic Example** The `number_clusters` function can be called in a similar way to use the gap_statistic option: ```{r, eval=FALSE} # Load necessary libraries library(multiClust) # Call the number_clusters function # data.exp is the original expression matrix object ouputted from # the input_file function # User chooses the gap_statistic option by making gap_statistic equal TRUE # The Fixed argument is also set to NULL cluster_num <- number_clusters(data.exp=data.exprs, Fixed=NULL, gap_statistic=TRUE) ``` The gap_statistic option has a very long computational time and can take up to several hours. The output of this function will be a positive integer indicating the optimal number of clusters to divide samples into. #### 4.2 Kmeans or Hierarchical Clustering of Genes/Probes and Samples The fifth function in this package, `cluster_analysis` is a function used to perform Kmeans or Hierarchical clustering analysis of genes/probes and samples after undergoing a feature selection process in the `probe_ranking` function. **Function Arguments:** 1. The first argument "sel.exp", is an object containing the numeric selected gene expression matrix. This object is an output of the `probe_ranking` function. 2. The second argument "cluster_type" is a string indicating the type of clustering method to use. *Kmeans* or *HClust* are the two options. 3. The third argument "distance" is a string describing the distance metric to use for Hierarchical clustering via the `dist` function. `dist` uses a default distance metric of Euclidean distance. Options include one of "euclidean", "maximum", manhattan", "canberra", "binary", or "minkowski". Kmeans does not use a distance metric. 4. The fourth argument, "linkage_type" is a string describing the linkage metric to be used for hierarchical clustering in the `hclust` function. The default is "ward.D2", however other options include "average", "complete", "median", "centroid", "single", and "mcquitty". Kmeans does not use a linkage metric. 5. The fifth argument, "gene_distance" is a string describing the distance measure to be used for the `Dist` function when performing hierarchical clustering of genes. Options include one of "euclidean", "maximum", "manhattan", "canberra", "binary", "pearson", "abspearson", "correlation", "abscorrelation", "spearman" or "kendall". The default of gene_distance is set to "correlation". The argument can be set to NULL when Kmeans clustering is used. 6. The sixth argument "num_clusters" is a positive integer to specify the number of clusters samples will be divided into. This number is determined by the `number_clusters` function. 7. The seventh argument "data_name" is a string indicating the cancer type and/or name of the dataset being analyzed. This name will be used to label the sample dendrograms and heatmap files. 8. The eighth argument "probe_rank" is a string indicating the feature selection method used in the `probe_ranking` function. Options include "CV_Rank", "CV_Guided", "SD_Rank", and "Poly". 9. The ninth argument "probe_num_selection" is a string indicating the way in which probes were selected in the `number_probes` function. Options include "Fixed_Probe_Num", "Percent_Probe_Num", "Poly_Probe_Num", and "Adaptive_Probe_Num". 10. The tenth argument, "cluster_num_selection" is a string indicating how the number of clusters were determined in the number_clusters function. Options include "Fixed_Clust_Num" and "Gap_Statistic". **Function Outputs** * A CSV file containing the sample names and their respective cluster. An object containing a vector of the sample names and their cluster number is returned. * When hierarchical clustering is chosen as the cluster method, a pdf file of the sample dendrogram as well as atr, gtr, and cdt files for viewing in Java TreeView are outputted. In the Java TreeView heatmap, the samples are clustered by the method indicated by the "linkage_type" argument while genes are clustered by the method indicated by the "gene_distance" argument. The default of sample clustering for the Java TreeView heatmap is "ward.D2" and the default for gene clustering is centered pearson "correlation". **Hierarchical Clustering Example** Below is an example of how to use the `cluster_analysis` function using the Hierarchical clustering option: ```{r, eval=TRUE, fig.show='hold', warning=FALSE} # Load necessary libraries suppressPackageStartupMessages(library(ctc)) suppressPackageStartupMessages(library(gplots)) suppressPackageStartupMessages(library(dendextend)) library(ctc) library(gplots) library(dendextend) library(graphics) library(grDevices) library(amap) library(multiClust) # Call the cluster_analysis function hclust_analysis <- cluster_analysis(sel.exp=ranked.exprs, cluster_type="HClust", distance="euclidean", linkage_type="ward.D2", gene_distance="correlation", num_clusters=3, data_name="GSE2034 Breast", probe_rank="SD_Rank", probe_num_selection="Fixed_Probe_Num", cluster_num_selection="Fixed_Clust_Num") # Display the first few columns and rows of the object head(hclust_analysis) ``` **Function Outputs** The outputs from this example would be an object "hclust_analysis" containing a vector of the sample names and cluster number. In addition, a sample dendrogram pdf file would be written. Lastly, the atr, gtr, and cdt files are outputted to view a heatmap of the genes and samples in Java TreeView. **Note** The `cluster_analysis` function does not display the sample dendrogram and heatmap in the console. To show what these images would look like, the following examples are provided below. **Java TreeView Heatmap Example** Below is an example of a heatmap produced in Java TreeView from the atr, gtr, abnd cdt files produced by this function. In this heatmap, rows represent genes and columns represent samples. ```{r, eval=TRUE, echo=FALSE, fig.show='hold', fig.width=8, fig.height=8, fig.align='center', warning=FALSE} # retrieve file path for image fpath <- system.file("extdata", "GSE2034_Breast.SD_Rank.ward.D2.Fixed_Probe_Num.Fixed_Clust_Num.jpg", package="multiClust") knitr::include_graphics(path = fpath) ``` **Sample Dendrogram Example** Below is an example of a sample dendrogram outputted by this function: ```{r, eval=TRUE, echo=FALSE, fig.show='hold', fig.width=8, fig.height=8, fig.align='center', warning=FALSE} # Load necessary libraries library(gplots) suppressPackageStartupMessages(library(dendextend)) library(dendextend) # Produce a sample dendrogram using the selected gene/probe expression object # In the function, cluster_analysis the dendrogram is not displayed in # the console. # Make the selected expression object numeric colname <- colnames(ranked.exprs) ranked.exprs <- t(apply(ranked.exprs, 1,as.numeric)) colnames(ranked.exprs) <- colname # Normalization of the selected expression matrix before clustering norm.sel.exp <- t(apply(ranked.exprs, 1, nor.min.max)) hc <- hclust(dist(x=t(norm.sel.exp), method="euclidean"), method="ward.D2") dc <- as.dendrogram(hc) # Portray the samples in each cluster as a different color dc <- set(dc, "branches_k_color", value=1:3, k=3) # Make a vector filled with blank spaces to replace sample names with # blank space vec <- NULL len <- length(labels(dc)) for (i in 1:len) { i <- "" vec <- c(vec, i) } # Remove sample labels from dendrogram dc <- dendextend::set(dc, "labels", vec) # Plot the sample dendrogram plot(dc, main=paste("GSE2034 Breast", "Sample Dendrogram", sep=" ")) # Display orange lines on the plot to clearly separate each cluster rect.dendrogram(dc, k=3, border='orange') ``` **Kmeans Clustering Example** Below is an example of how to use the `cluster_analysis` function using the Kmeans clustering option: ```{r, eval=TRUE} # Load necessary libraries library(multiClust) # Call the cluster_analysis function kmeans_analysis <- cluster_analysis(sel.exp=ranked.exprs, cluster_type="Kmeans", distance=NULL, linkage_type=NULL, gene_distance=NULL, num_clusters=3, data_name="GSE2034 Breast", probe_rank="SD_Rank", probe_num_selection="Fixed_Probe_Num", cluster_num_selection="Fixed_Clust_Num") # Display the first few rows and columns of the object head(kmeans_analysis) ``` **Function Outputs** The output from this example would solely be an object "kmeans_analysis" containing the vector of sample names and cluster number. There would be no sample dendrogram file. --- ## 5. Obtaining the Average Expression for Each Gene/Probe in Each Cluster The sixth function in this package `avg_probe_exp` is used to determine the average expression of each gene/probe for the samples in a particular cluster. **Function Arguments** 1. The first argument "sel.exp" is an object containing the numeric selected gene/prone expression matrix. This object is an output of the `probe_ranking` function. 2. The second argument "samp_cluster" is an object vector containing the samples and the cluster number they belong to. This is an output of the `cluster_analysis` function. 3. The third argument "data_name" is a string indicating the cancer type and name of the dataset being analyzed. 4. The fourth argument "cluster_type" is a string indicating the type of clustering method used in the `cluster_analysis` function. "Kmeans" or "HClust" are the two options. 5. The fifth argument "distance" is a string describing the distance metric used for HClust in the `cluster_analysis` function. Options include one of "euclidean", "maximum", manhattan", "canberra", "binary", or "minkowski". 6. The sixth argument "linkage_type" is a string describing the linkage metric used in the HClust method in the `cluster_analysis` function. Options include "ward.D2", "average", "complete", "median", "centroid", "single", and "mcquitty". 7. The seventh argument "probe_rank" is a string indicating the feature selection method used in the `probe_ranking` function. Options include "CV_Rank", "CV_Guided", "SD_Rank", and "Poly". 8. The eighth argument "probe_num_selection" is a string indicating the way in which probes were selected in the `number_probes` function. Examples include "Fixed_Probe_Num", "Percent_Probe_Num", "Poly_Probe_Num", and "Adaptive_Probe_Num". 9. The ninth argument "cluster_num_selection" is a string indicating how the number of clusters were determined in the `number_clusters` function. Examples include "Fixed_Clust_Num" and "Gap_Statistic". **Example** Below is an example of how to use the `avg_probe_exp` function after performing Kmeans clustering analysis with the `cluster_analysis` function: ```{r, eval=TRUE} # Load necessary libraries library(multiClust) # Call the avg_probe_exp function avg_matrix <- avg_probe_exp(sel.exp=ranked.exprs, samp_cluster=kmeans_analysis, data_name="GSE2034 Breast", cluster_type="Kmeans", distance=NULL, linkage_type=NULL, probe_rank="SD_Rank", probe_num_selection="Fixed_Probe_Num", cluster_num_selection="Fixed_Clust_Num") # Display the first few rows and columns of the matrix head(avg_matrix) ``` **Function Outputs** The outputs from this function are a text file containing the matrix of average expression of each probe/gene in each of the different clusters. In addition the object "avg_matrix" would also contain the average expression matrix. --- ## 6. Clinical Analysis of Selected Gene Probes and Samples After dividing the selected gene probes and samples into their respective clusters, the `surv_analysis` function can be used to produce Kaplan-Meier Survival Plots. This function uses a Cox Proportional Hazard Model to portray the survival probabilities of the patients within each cluster over time. **Function Arguments** 1. The first argument "samp_cluster" is an object vector containing the samples and the cluster number they belong to. This object is an output of the `cluster_analysis` function. 2. The second argument "clinical" is a string indicating the name of the text file containing patient clinical information. This file should be a matrix consisting of two columns. The first column contains the patient survival time information in months. The second column indicates the occurrence of a censorship (0) or an event (1). 3. The third argument "survival_type" is a string specifying the type of survival event being analyzed. Examples include "Disease-free survival (DFS)", "Overall Survival (OS)", "Relapse-free survival (RFS)", etc. 4. The fourth argument "data_name" is a string indicating the name to be used to label the Kaplan-Meier Survival Plot. 5. The fifth argument "cluster_type" is a string indicating the type of clustering method used in the `cluster_analysis` function. "Kmeans" or "HClust" are the two options. 6. The sixth argument "distance" is a string describing the distance metric uses for HClust in the `cluster_analysis` function. Options include one of "euclidean", "maximum", manhattan", "canberra", "binary", or "minkowski". 7. The seventh argument "linkage_type" is a string describing the linkage metric used in the `cluster_analysis` function. Options include "ward.D2", "average", "complete", "median", "centroid", "single", and "mcquitty". 8. The eighth argument "probe_rank" is a string indicating the feature selection method used in the `probe_ranking` function. Options include "CV_Rank", "CV_Guided", "SD_Rank", and "Poly". The ninth argument "probe_num_selection" is a string indicating the way in which probes were selected in the `number_probes` function. Options include "Fixed_Probe_Num", "Percent_Probe_Num", "Poly_Probe_Num", and "Adaptive_Probee_Num". The tenth argument "cluster_num_selection" is a string indicating how the number of clusters were determined in the `number_clusters` function. Options include "Fixed_Clust_Num" and "Gap_Statistic". Below is an example of how to use the `surv_analysis` function: **Example** ```{r, eval=TRUE, fig.show='hold'} # Load necessary libraries library(gplots) library(graphics) library(grDevices) library(multiClust) library(survival) # Obtain clinical outcome file clin_file <- system.file("extdata", "GSE2034-RFS-clinical-outcome.txt", package="multiClust") # Call the avg_probe_exp function surv <- surv_analysis(samp_cluster=kmeans_analysis, clinical=clin_file, survival_type="RFS", data_name="GSE2034 Breast", cluster_type="Kmeans", distance=NULL, linkage_type=NULL, probe_rank="SD_Rank", probe_num_selection="Fixed_Probe_Num", cluster_num_selection="Fixed_Clust_Num") # Display the survival P value surv ``` **Function Outputs** This function outputs a pdf Kaplan-Meier Survival Plot of the patient survival time in months vs. the survival probability. In this example, the object "surv" will also contain the Cox Survival P value. **Note** The `surv_analysis` function also does not display the Kaplan-Meier plot in the console, but rather outputs the plot as a pdf file. Below is an example of what the plots would look like. **Kaplan-Meier Survival Plot** ```{r, eval=TRUE, echo=FALSE, fig.show='hold', fig.width=8, fig.height=8, fig.align='center', warning=FALSE} # Load necessary libraries library(gplots) library(graphics) library(grDevices) library(multiClust) library(survival) # Obtain clinical outcome file clin_file <- system.file("extdata", "GSE2034-RFS-clinical-outcome.txt", package="multiClust") # Produce a Kaplan-Meier plot using the sample cluster information # from the cluster_analysis function # Read in survival data text file sample.anns <- read.delim2(file=clin_file, header=TRUE, stringsAsFactors=FALSE) time <- sample.anns[,1] event <- sample.anns[,2] time <- as.numeric(time) event <- as.numeric(event) event_type <- "RFS" time_type <- 'Months' # Determine max number of clusters in samp_cluster samp_cluster <- kmeans_analysis Number_Clusters <- max(samp_cluster) # Produce Kaplan Meier survival plots lev <- '1' for(i in 2:Number_Clusters) { lev <- c(lev, paste('',i, sep='')) } groupfac <- factor(samp_cluster, levels=lev) cox <- coxph(Surv(time, event==1) ~ groupfac) csumm <- summary(cox) cox.p.value <- c(csumm$logtest[3]) p <- survfit(Surv(time,event==1)~strata(samp_cluster)) plot (p,xlab= time_type ,ylab='Survival Probability',conf.int=FALSE, xlim=c(0,100), col=1:Number_Clusters, main=paste("GSE2034 Breast", "Kmeans", "SD_Rank", "RFS", sep =' ')) legend('bottomleft',legend=levels(strata(samp_cluster)),lty=1, col=1:Number_Clusters, text.col=1:Number_Clusters, title='clusters') legend('topright', paste('Pvalue =', signif(csumm$logtest[3],digits=2), sep ='')) ``` --- ## 7. References 1. Davis, S. and Meltzer, P. S. GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor. Bioinformatics, 2007, 14, 1846-1847 2. Orchestrating high-throughput genomic analysis with Bioconductor. W. Huber, V.J. Carey, R. Gentleman, ..., M. Morgan Nature Methods, 2015:12, 115. 3. Benjamin Milo Bolstad. preprocessCore: A collection of pre-processing functions. R package version 1.30.0. 4. Chris Fraley, Adrian E. Raftery, T. Brendan Murphy, and Luca Scrucca (2012) mclust Version 4 for R: Normal Mixture Modeling for Model-Based Clustering, Classification, and Density Estimation Technical Report No. 597, Department of Statistics, University of Washington 5. Antoine Lucas and Laurent Gautier (2005). ctc: Cluster and Tree Conversion.. R package version 1.42.0. http://antoinelucas.free.fr/ctc 6. Gregory R. Warnes, Ben Bolker, Lodewijk Bonebakker, Robert Gentleman, Wolfgang Huber Andy Liaw, Thomas Lumley, Martin Maechler, Arni Magnusson, Steffen Moeller, Marc Schwartz and Bill Venables (2015). gplots: Various R Programming Tools for Plotting Data. R package version 2.17.0. http://CRAN.R-project.org/package=gplots 7. Therneau T (2015). _A Package for Survival Analysis in S. version 2.38, . 8. Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., Hornik, K.(2015). cluster: Cluster Analysis Basics and Extensions. R package version 2.0.3. 9. Tal Galili (2015). dendextend: Extending R's Dendrogram Functionality. R package version 1.0.1. http://CRAN.R-project.org/package=dendextend 10. R Core Team (2015). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/. 11. Saldanha, A. J. "Java Treeview--extensible Visualization of Microarray Data."Bioinformatics 20.17 (2004): 3246-248. 12. Antoine Lucas (2014). amap: Another Multidimensional Analysis Package. R package version 0.8-14. http://CRAN.R-project.org/package=amap