%\VignetteIndexEntry{sampleClassifier Vignette} %\VignettePackage{sampleClassifier} %\VignetteEngine{utils::Sweave} \documentclass{article} <>= BiocStyle::latex() @ \title{Vignette for the \Biocpkg{sampleClassifier} R Package} \author{ Khadija El Amrani\thanks{\email{a.khadija@gmx.de}} and Andreas Kurtz\\ Charit\'e - Universit\"atsmedizin Berlin, Berlin Brandenburg Center for Regenerative Therapies (BCRT) } \begin{document} \SweaveOpts{concordance=TRUE} \maketitle \tableofcontents \newpage %--------------------------------------------------------- \section{Introduction} %--------------------------------------------------------- Discrimination between different classes of samples such as different cell types or tissues using gene expression profiles is an important problem in genetic and cell research. It has several implications and can contribute to our understanding of cell phenotype differences and will allow precise identification of various cell types and tissues. \Biocpkg{sampleClassifier} provides functions for the classification of samples using their gene expression profiles. The package supports the classification of microarray and RNA-seq gene expression profiles. The tool requires a reference and a test data set, and uses a simple algorithm called Shared Marker Genes (SMG). As the name suggests, the number of shared marker genes between a reference and a query sample is used as a similarity measure. Marker genes are detected using the tools \Biocpkg{MGFM} \cite{ElAmrani2015, mgfm} or \Biocpkg{MGFR} \cite{mgfr} that we have developed previously. \Biocpkg{sampleClassifier} can be applied: i) to evaluate the similarity of experimentally derived cells with their desired target cell type (e.g. to compare primary hepatocytes with induced hepatocytes); ii) to compare in vitro derived organoids to their in vivo counterparts; iii) to classify different types of diseases.\\ This vignette provides an introduction to gene expression profile based sample classification using the \Biocpkg{sampleClassifier} R package and the accompanying data package, \Biocpkg{sampleClassifierData}. It contains guidelines on how to select all inputs to the tool (such as reference and test matrices), and instructions on running \Biocpkg{sampleClassifier} using microarray and RNA-seq data from the \Biocpkg{sampleClassifierData}. %\newpage \section{Contents of the package} The \Biocpkg{sampleClassifier} package contains the following functions: <>= library("sampleClassifier") ls("package:sampleClassifier") @ \subsection{\Rfunction{classifyProfile()}} \Rfunction{classifyProfile()} is a function to classify microarray gene expression profiles. \subsubsection{Parameter Settings} \label{subsec:Input} \begin{enumerate} \item \textit{$ref\_matrix$}: Normalized microarray data matrix to be used as reference, with probe sets corresponding to rows and samples corresponding to columns. \item \textit{$query\_mat$} : Normalized microarray query matrix with query samples to be classified, with probe sets corresponding to rows and samples corresponding to columns. \item \textit{chip1} : Chip name of the reference matrix (e.g. 'hgu133plus2'). \item \textit{chip2}: Chip name of the query matrix. This parameter can be ignored if the reference and query matrix are from the same chip. \item \textit{fun1}: \Rfunction{mean} or \Rfunction{median}. This will specify the number of marker genes that will be used for classification. Default is \Rfunction{median}. \item \textit{fun2}: \Rfunction{mean} or \Rfunction{median}. This will be used to summarize the expression values of probe sets that belong to the same gene. This parameter can be ignored if the reference and query matrix are from the same chip. Default is \Rfunction{mean}. \item \textit{write2File}: If TRUE, the classification results for each query profile will be written to a file. \item \textit{out.dir}: Path to a directory to write the classification results, default is the current working directory. \end{enumerate} \subsubsection{Output} The function \Rfunction{classifyProfile()} returns a list with hits for each of the query samples in the query matrix. The hits are sorted according to their similarity to the query. \subsection{\Rfunction{classifyProfile.rnaseq()}} \Rfunction{classifyProfile.rnaseq()} is a function to classify RNA-seq gene expression profiles. \subsubsection{Parameter Settings} \label{subsec:Input} \begin{enumerate} \item \textit{$ref\_matrix$}: RNA-seq data matrix to be used as reference, with genes corresponding to rows and samples corresponding to columns. \item \textit{$query\_mat$} : RNA-seq query matrix with query samples to be classified, with genes corresponding to rows and samples corresponding to columns. \item \textit{gene.ids.type} : Type of the used gene identifiers, the following gene identifiers are supported: ensembl, refseq and ucsc gene ids. Default is ensembl. \item \textit{fun1}: \Rfunction{mean} or \Rfunction{median}. This will specify the number of marker genes that will be used for classification. Default is \Rfunction{median}. \item \textit{write2File}: If TRUE, the classification results for each query profile will be written to a file. \item \textit{out.dir}: Path to a directory to write the classification results, default is the current working directory. \end{enumerate} \subsubsection{Output} The function \Rfunction{classifyProfile.rnaseq()} returns a list with top hits for each query profile, sorted according to a similarity score. \subsection{\Rfunction{classifyProfile.svm()}} \Rfunction{classifyProfile.svm()} is a function to classify microarray gene expression profiles using Support Vector Machines (SVM). \subsubsection{Parameter Settings} \label{subsec:Input} \begin{enumerate} \item \textit{$ref\_matrix$}: Normalized microarray data matrix to be used as reference, with probe sets corresponding to rows and samples corresponding to columns. \item \textit{$query\_mat$} : Normalized microarray query matrix to be classified, with probe sets corresponding to rows and samples corresponding to columns. \item \textit{chip1} : Chip name of the reference matrix (e.g. 'hgu133plus2'). \item \textit{chip2}: Chip name of the query matrix. This parameter can be ignored if the reference and query matrix are from the same chip. \item \textit{fun1}: \Rfunction{mean} or \Rfunction{median}. This will specify the number of marker genes that will be used for classification. Default is \Rfunction{median}. \item \textit{fun2}: \Rfunction{mean} or \Rfunction{median}. This will be used to summarize the expression values of probe sets that belong to the same gene. This parameter can be ignored if the reference and query matrix are from the same chip. \end{enumerate} \subsubsection{Output} The function \Rfunction{classifyProfile.svm()} returns a data frame with the predicted classes for each query profile. \subsection{\Rfunction{classifyProfile.rnaseq.svm()}} \Rfunction{classifyProfile.rnaseq.svm()} is a function to classify RNA-seq gene expression profiles using Support Vector Machines (SVM). \subsubsection{Parameter Settings} \label{subsec:Input} \begin{enumerate} \item \Robject{$ref\_matrix$}: RNA-seq data matrix to be used as reference, with genes corresponding to rows and samples corresponding to columns. \item \Robject{$query\_mat$} : RNA-seq query matrix with query samples to be classified, with genes corresponding to rows and samples corresponding to columns. \item \textit{gene.ids.type} : Type of the used gene identifiers, the following gene identifiers are supported: ensembl, refseq and ucsc gene ids. Default is ensembl. \item \textit{fun1}: \Rfunction{mean} or \Rfunction{median}. This will specify the number of marker genes that will be used for classification. Default is \Rfunction{median}. \end{enumerate} \subsubsection{Output} The function \Rfunction{classifyProfile.rnaseq.svm()} returns a data frame with the predicted classes for each query profile.\\ Please note that all replicates of a sample type should have the same label in the reference matrix. \subsection{\Rfunction{get.heatmap()}} \Rfunction{get.heatmap()} is a function to display the classification predictions as a heatmap. \subsubsection{Parameter Settings} \label{subsec:Input} \begin{enumerate} \item \Robject{res.list}: the result list returned by the function \Rfunction{classifyProfile()} or \Rfunction{classifyProfile.rnaseq()} \end{enumerate} \subsubsection{Output} The function \Rfunction{get.heatmap()} is used only for the side effect of creating a heatmap. %--------------------------------------------------------- \section{Getting Started} %--------------------------------------------------------- The \Biocpkg{sampleClassifier} package can be downloaded and installed by running the following code from within R: <>= if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager") BiocManager::install("sampleClassifier") @ We also recommend installing the accompanying data package, \Biocpkg{sampleClassifierData}, which contains pre-processed microarray and RNA-seq data, which are derived from normal human tissues, and are available from public repositories. More details about the data can be found in the Vignette of the data package.\\ After installing and loading \Biocpkg{sampleClassifierData}, the individual sampleClassifierData datasets can be loaded using the function \Rfunction{data()}. For instance, the RNA-seq dataset named \Robject{$se\_rnaseq\_refmat$} is stored as a \Robject{SummarizedExperiment} object. The numeric matrix can be extracted using the function \Rfunction{assay()} from the \Biocpkg{SummarizedExperiment} R package: <>= library(sampleClassifierData) data("se_rnaseq_refmat") rnaseq_refmat <- assay(se_rnaseq_refmat) @ %--------------------------------------------------------- \section{Classification using sampleClassifier} %--------------------------------------------------------- We will use the data provided in the data package \Biocpkg{sampleClassifierData} to demonstrate how to classify samples using \Biocpkg{sampleClassifier}. \subsection{Classification of microarray data} To classify microarray gene expression profiles, we use the function \Rfunction{classifyProfile()}. It expects a reference and a test matrix. We recommend using 3 replicates for each sample type in the reference matrix. Please note that replicates of the same sample type should have the same name in the reference matrix. We also recommend processing the reference and the test matrix in the same way.\\ As a reference matrix we use here a microarray dataset derived from the study GSE3526 \cite{Roth2006a}, which is available from GEO \cite{Barrett2007a}. This dataset is named \Robject{$se\_micro\_refmat$} and can be loaded with the following code: <>= library(sampleClassifierData) data("se_micro_refmat") micro_refmat <- assay(se_micro_refmat) dim(micro_refmat) @ As a test matrix we use a microarray dataset derived from the study GSE2361 \cite{Ge2005a}, which is available from GEO \cite{Barrett2007a}. This dataset is named \Robject{$se\_micro\_testmat$} and can be loaded with the following code: <>= data("se_micro_testmat") micro_testmat <- assay(se_micro_testmat) dim(micro_testmat) @ Now, we can call the function \Rfunction{classifyProfile()}: <>= options(width=60) @ <>= res1.list <- classifyProfile(ref_matrix=micro_refmat, query_mat=micro_testmat, chip1="hgu133plus2",chip2="hgu133a", write2File=FALSE) @ For simplicity, we show only the two top hits for each query sample: <>= lapply(res1.list,"[",c(1,2),,drop=FALSE) @ To display the classification results as a heatmap, we call the function \Rfunction{get.heatmap()} with the resulted list as input. <>= names(res1.list) <- unlist(strsplit(names(res1.list),split= " : "))[seq(2,32,2)] @ <>= options(width=90) @ <>= get.heatmap(res1.list) @ In order to compare the classification predictions for microarray query profiles to those of Support Vector Machines (SVM), the function \Rfunction{classifyProfile.svm()} was implemented based on functions from the R-package \Rpackage{e1071}. <>= res1.svm.df <- classifyProfile.svm(ref_matrix=micro_refmat, query_mat=micro_testmat, chip1="hgu133plus2",chip2="hgu133a") res1.svm.df @ Our method \Rfunction{classifyProfile()} classified 14 samples correctly, whereas SVM classified 13 of the 16 query samples correctly. The sample 'GSM44678' was classified correctly by \Rfunction{classifyProfile()} as prostate and miclassified by SVM as urethra. The sample 'GSM44673' (tissue: spleen) was misclassified as bone marrow or adipose tissue omental by \Rfunction{classifyProfile()} or \Rfunction{classifyProfile.svm()}, respectively.\\ The sample 'GSM44706' (tissue: fetal liver) was misclassified as bone marrow by SVM. \Rfunction{classifyProfile()} predicted both bone marrow and liver as top hits with the same similarity score.\\ \subsection{Classification of RNA-seq data} To classify RNA-seq gene expression profiles, we use the function \Rfunction{classifyProfile.rnaseq()}. It expects a reference and a test matrix. As a reference matrix we use an RNA-seq dataset derived from the study E-MTAB-1733 \cite{Fagerberg2014a}, which is available from ArrayExpress \cite{Brazma2003}. This dataset is named \Robject{$se\_rnaseq\_refmat$} and can be loaded with the following code: <>= library(sampleClassifierData) data("se_rnaseq_refmat") rnaseq_refmat <- assay(se_rnaseq_refmat) dim(rnaseq_refmat) @ As a test matrix we use an RNA-seq dataset derived from the study E-MTAB-513 \cite{Illumina}, which is available from ArrayExpress. This dataset is named \Robject{$se\_rnaseq\_testmat$} and can be loaded with the following code: <>= data("se_rnaseq_testmat") rnaseq_testmat <- assay(se_rnaseq_testmat) dim(rnaseq_testmat) @ Now, we can call the function \Rfunction{classifyProfile.rnaseq()} to predict the classes of the samples in the test matrix: <>= options(width=60) @ <>= res2.list <- classifyProfile.rnaseq(ref_matrix=rnaseq_refmat, query_mat=rnaseq_testmat, gene.ids.type="ensembl",write2File=FALSE) @ For simplicity, we show only the two top hits for each query sample: <>= lapply(res2.list,"[",c(1,2),,drop=FALSE) @ To display the classification results as a heatmap, we call the function \Rfunction{get.heatmap()} with the resulted list as input. <>= options(width=80) @ <>= get.heatmap(res2.list) @ In order to compare the classification predictions for RNA-seq query profiles to those of SVM, the function \Rfunction{classifyProfile.rnaseq.svm()} was implemented based on functions from the R-package \Rpackage{e1071}. <>= res2.svm.df <- classifyProfile.rnaseq.svm(ref_matrix=rnaseq_refmat, query_mat=rnaseq_testmat, gene.ids.type="ensembl") res2.svm.df @ Our method \Rfunction{classifyProfile.rnaseq()} classified 9 samples correctly, whereas SVM classified 7 of the 12 query samples correctly. To show the query samples that were misclassified by our method or SVM, we run the following code: <>= misclas.inds <- which(as.character(res2.svm.df[,1])!=as.character(res2.svm.df[,2])) colnames(res2.svm.df) <- c("query_real_class","predicted_class_by_SVM") pred.classifyProfile.rnaseq <- as.character(unlist(lapply(res2.list[which(names(res2.list) %in% as.character(res2.svm.df[misclas.inds,1]))],"[",1,1,drop=TRUE))) comp.df <- cbind(res2.svm.df[misclas.inds,], predicted_by_classifyProfile.rnaseq=pred.classifyProfile.rnaseq) comp.df @ \section{sampleClassifier algorithm details} The algorithm used in sampleClassifier is a simple algorithm called Shared Marker Genes (SMG). As the name suggests, the number of shared marker genes between a query and a reference is used as similarity measure. The tool requires a reference matrix with at least three replicates for each sample type. This matrix is used for marker gene detection using \Biocpkg{MGFM} or \Biocpkg{MGFR}. Since the number of detected markers differs depending on the sample types, we filter the list of marker genes of each sample type. Using the complete list of markers of each sample type, will result in a bias towards the sample type with the most marker genes. For example, if testis is the tissue with the most marker genes, using all marker genes for classification will result in classifying query samples often as testis. Suppose the reference matrix contains four tissues: liver, lung, kidney and midbrain, and X=(16, 20, 100, 500) is the vector of lengths of predicted marker genes for these tissues. The filtering is based on the median number of marker genes, in this case \Rfunction{median}(X) = 60. If the number of predicted markers for a tissue > 60, then only the top 60 marker genes will be used for classification. If the number of predicted markers for a tissue <= 60, then all markers are used for classification. After the filtering step, each query sample will be compared to all sample types in the reference and the number of marker genes shared between the query and each sample type in the reference is calculated. A query shares a marker gene with a reference sample if this marker gene is highly expressed in the query sample compared to all other sample types in the reference. The ratio of the number of shared marker genes and the total number of markers used for classification is used as a similarity score. This score has a value between 0 and 1. A value of 1 means that the query shares all marker genes with the reference, and a value of 0 means that no marker genes are shared between the query and the reference. For each query, the hits are sorted according to this score. The class of the first top hit is predicted as a class for the query. \section{R sessionInfo} The results in this file were generated using the following packages: <>= sessionInfo() @ \nocite{Team2007} \bibliography{references} \end{document}