%\VignetteIndexEntry{Manual} \documentclass{article} \usepackage{amsmath} \usepackage{amsfonts} \usepackage{amssymb} <>= BiocStyle::latex() @ \title{CAFE Manual} \author{Sander Bollen} \begin{document} \bibliographystyle{plain} \maketitle \tableofcontents \newpage \section{Introduction} So, you have downloaded the {\bf CAFE} package, and are wondering "What on Earth can I do with this?". In answering this question, we can first of all refer to the package name: I'm sure you'd all have thought about a nice warm cup of coffee, but in reality {\bf CAFE} stands for \textit{{\bf C}hromosomal {\bf A}berrations {\bf F}inder in {\bf E}xpression data}. Now, that might not tell you much, so here is a slightly better - albeit longer - summary of what it does: {\bf CAFE} analyzes microarray expression data, and tries to find out whether your samples have any gross chromosomal gains or losses. On top of that, it provides some plotting functions - using the \Biocpkg{ggplot2} package - to give you a nice visual tool to determine where those aberrations are located. How it exactly does this is something you can find later in this document. CAFE does not invent any new algorithms for the detection of chromosomal aberrations. Instead, it takes the approach of Ben-David \textit{et al}, and molds it into an easy-to-use R package. \subsection{Prerequisites} As all software packages, {\bf CAFE} has its requirements. You should preferably have the latest version of R, but at the very least version 2.10. Other dependencies can be found in the DESCRIPTION file. Some packages imported by {\bf CAFE} require you to have \texttt{libcurl} installed in your system, and by extension the \Biocpkg{RCurl} and \Biocpkg{Rtracklayer} packages. If these are not yet installed in your system, this could take a while to install. \subsection{Preparing your file system} {\bf CAFE} analyzes microarray expression data. That means that without anything \textit{to} analyze, it won't do anything at all. So, how do we fix this obvious problem? As a starter, {\bf CAFE} only analyzes .CEL files. This means that {\bf CAFE} will only work with data from Affymetrix microarrays. Illumina or other platforms unfortunately will not work. To prepare your filesystem correctly, follow the following steps. Put all CEL files in a folder and then start R in that folder. Alternatively, you can set the working directory to this folder by \texttt{setwd("/some/folder/path/")}. And we're done. It's that simple. See figure 1 for a screenshot of how this looks in a file manager. See figure 1 for a screenshot of how this is layed out \begin{figure} \centering \includegraphics[scale=1]{simple} \caption{The file system layout} \end{figure} \section{Analysis} Now we're ready to start analyzing our dataset(s). {\bf CAFE} provides the \Rfunction{ProcessCels()} function. This function takes four arguments: \texttt{threshold.over}, \texttt{threshold.under} \texttt{remove\_method} and \texttt{local\_file}. The threshold will determine which probes are going to be considered as overexpressed. The default setting is \texttt{threshold.over=1.5}, meaning that probes with a relative expression over 1.5 times median will be considered overexpressed. Likewise \texttt{threshold.under} determines which probesets will be considered underexpressed. As for the \texttt{remove\_method} argument, this determines in what way \texttt{ProcessCels} will remove duplicate probes. See section 3.1 for an overview of this option. The \Rfunction{ProcessCels()} function will basically suck up all your CEL files in your working directory - but don't worry, they'll still be there in your file system - and perform some number magic on them; it will normalize, calculate probes that are overexpressed, log2 transform and remove duplicates. It returns you a \texttt{list} object of your entire dataset, which the rest of {\bf CAFE} requires to function. So lets try that. First of all, we of course need to load the package: <<>>= library(CAFE) @ Then, we should set our working directory to the dataset folder if we're not already there. <>= setwd("~/some/path") @ Now we're there we can process the CEL files. <>= datalist <- ProcessCels() @ \subsection{A moment of reproducibillity} The stuff we've done above here is unfortunately not really reproducible. It requires you to have .CEL files in a folder called \texttt{/some/path} in your home directory, and that's of course not very pretty. So to keep the rest of this document reproducible, the {\bf CAFE} package comes together with a data object. This is the \texttt{list} object returned by \texttt{ProcessCels()} when processing both \href{https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE10809}{GSE10809} and \href{https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE6561}{GSE6561}. <<>>= data("CAFE_data") #will put object named {\bf CAFE}_data in your global environment @ \subsection{Statistics} So now we come to the core of {\bf CAFE}: finding which chromosomes are significantly over- or underexpressed. {\bf CAFE} uses \emph{thresholding} to determine which probes are overexpressed or underexpressed. We use the probes we deemed overexpressed by our threshold to create a contingency table which can then be used in an Exact Fisher or Chi-Square test. Basically we assume that everything over our defined threshold.over is really overexpressed, and everything uder our defined threshold.under is underexpressed. To calculate p-values for chromosomes, we will use the \Rfunction{chromosomeStats()} function. We can then also select which test (fisher or chi square) we want to use. When performing multiple tests (i.e. if we are testing multiple chromosomes), we need to correct our p-values for type I errors. We can therefore set the \texttt{bonferroni} argument to true when testing multiple chromosomes. The \texttt{enrichment} argument controls which side (over- or underexpressed) we are testing for. <<>>= #we first have to decide which samples we want to use. names(CAFE_data[[2]]) #to see which samples we got sam <- c(1,3) #so we use sample numbers 1, and 3 to compare against the rest chromosomeStats(CAFE_data,chromNum=17,samples=sam, test="fisher",bonferroni=FALSE,enrichment="greater") # we are only testing 1 chromosome @ This uses an Exact Fisher test. Technically speaking, this is better than a chi square test, but can be slower for very large sample sizes. We can also do a chi square test, which is slightly faster. <<>>= chromosomeStats(CAFE_data,chromNum=17,samples=sam, test="chisqr",bonferroni=FALSE,enrichment="greater") @ As you will have seen, the output of this is different from \texttt{test="fisher"}. The reason for this is that the Fisher test gives an \textit{exact} p-value, whereas a chi square test is just an approximation. But if we now want to test multiple chromosomes, we have to correct our p-values <<>>= chromosomeStats(CAFE_data,chromNum="ALL",samples=sam, test="fisher",bonferroni=TRUE,enrichment="greater") @ The results which we got might be all nice and well - there seems be something amiss with chromosome 17 - but preferably we would like to delve a bit deeper. We would like to know which chromosome \textit{bands} are duplicated or lost. To do this we can use the same syntax as for chromosomes, except that we are using a different function: <<>>= bandStats(CAFE_data,chromNum=17,samples=sam,test="fisher", bonferroni=TRUE,enrichment="greater") #multiple bands per chromosome, so need bonferroni! @ So this way we see that there is most likely a duplication around band 17p11.2 \subsection{Plotting} So now we know that chromosome 17 for these two samples is aberrant, but we would like to plot that. A picture says more than a thousand words - as the saying goes. {\bf CAFE} provides four functions for plotting samples, \Rfunction{rawPlot()}, \Rfunction{slidPlot()}, \Rfunction{discontPlot()} and \Rfunction{facetPlot()}. \subsubsection{Raw plots} The \Rfunction{rawPlot()} function plots each individual probe along the chromosome with its 'raw', unaltered, log2 relative expression value. This can give a very rough overview of what is happening. As a visual tool, an ideogram of the chromosome can be plotted over the plot. See figure 2 for an example. <>= p <- rawPlot(CAFE_data,samples=c(1,3,10),chromNum=17) @ <>= print(p) @ \begin{figure}[h!] \centering <>= <> @ \caption{A rawPlot of samples 1, 3 and 10 for chromosome 17} \end{figure} \subsubsection{Sliding plots} The plots given by \Rfunction{rawPlot()} are often not very informative since the within-sample variation in expression levels is quite high. The \Rfunction{slidPlot()} function solves this problem by applying a sliding average to the entire sample. As such, patterns become visible that would otherwise have went unnoticed. The function has two extra arguments as \Rfunction{rawPlot()}: if \texttt{combine=TRUE} a raw plot will be plotted in the background. The size of the sliding window can be determined by argument \texttt{k}. See figure 3 for an example. <>= p <- slidPlot(CAFE_data,samples=c(1,3,10),chromNum=17,k=100) @ <>= print(p) @ \begin{figure}[h!] \centering <>= <> @ \caption{A slidPlot of samples 1, 3 and 10 for chromosome 17 with a sliding window size of 50, with raw data in the background} \end{figure} \subsubsection{Discontinuous plots} In reality, there can only be an integer number of copy numbers for a given region (be it an entire chromosome or a band). One cannot 2.5 times duplicate a region; no, it is 0, 1, 2, 3 times etc. Also, there should be a defined boundary where the duplication or loss begins and ends. Yet, functions like \Rfunction{slidPlot()} will give smooth transitions and variable regions, which is a relatively poor reflection of what is actually happening. As such, we need a discontinuous smoother rather than a sliding average smoother. A discontinuous smoother is a smoother which produces distinct "jumps" rather than smooth transitions. The \Rfunction{discontPlot()} function implements such a discontinuous smoother - called a \textit{Potts filter}. The smoothness (i.e. amount of jumps) can be determined by setting parameter \texttt{gamma}. A higher gamma will result in a smoother graph, with less jumps. See figure 4 for an example. <>= p <- discontPlot(CAFE_data,samples=c(1,3,10),chromNum=17,gamma=100) @ <>= print(p) @ \begin{figure}[h!] \centering <>= <> @ \caption{A discontPlot of samples 1, 3 and 10 for chromosome 17 with a gamma of 50, with raw data in the background} \end{figure} \subsubsection{Facet plots} With the \Rfunction{facetPlot()} function you can plot all chromosomes stitched together horizontally in one single splot. It includes options to use either a sliding average smoother. See figure 4 for an example.. <>= p <- facetPlot(CAFE_data,samples=c(1,3,10),slid=TRUE,combine=TRUE,k=100) @ <>= print(p) @ \begin{figure}[h!] \centering <>= <> @ \caption{A facetPlot of samples 1, 3 and 10 with a sliding average with window size of 50, with raw data in the background} \end{figure} \section{In a nutshell} As follows from the above, CAFE analysis basically boils down to just three steps \begin{enumerate} \item \texttt{data <- ProcessCels()} \item \texttt{xxxStats(data, ...)} \item \texttt{xxxPlot(data, ...)} \end{enumerate} For instance: <>= data <- ProcessCels() chromosomeStats(data,samples=c(1,3),chromNum="ALL") discontPlot(data,samples=c(1,3),chromNum="ALL",gamma=100) @ \section{Behind the scenes} \subsection{Normalization} Affymetrix CEL files are read in and normalized by the \Rfunction{justRMA()} function in the \Biocpkg{affy} package. Absent probes are then found (by the \Rfunction{mas5calls} function), and are subsequently removed if absent in more than 20\% of samples. After normalization is complete, relative expression values (to median) are calculated. The \texttt{remove\_method} argument controls which duplicate probes are removed from the dataset. When \texttt{remove\_method=0}, no duplicate probes will be removed. When \texttt{remove\_method=1}, duplicate probes linking to the same gene will be removed such that each gene only links to one single probe, according to the following priorities: \begin{enumerate} \item Probes with \texttt{...\_at} are preferably retained \item Probes with \texttt{...a\_at} \item Probes with \texttt{...n\_at} \item Probes with \texttt{...s\_at} \item And finally probes with \texttt{...x\_at} \end{enumerate} When this still results in multiple probes per gene, the probe with the earliest chromosomal location is retained. When \texttt{remove\_method=2}, duplicate probes are only removed when they link to the exact same location, using the same priority list as specified above. \subsection{Category testing} When using \Rfunction{chromosomeStats()} and \Rfunction{bandStats()} the dataset is eventually split into four categories: \begin{enumerate} \item Whole dataset \& On Chromosome (or band) \item Whole dataset \& Overexpressed \& On Chromosome (or band) \item Sample(s) \& On Chromosome (or band) \item Sample(s) \& Overexpressed \& On Chromosome (or band) \end{enumerate} The number of probes for each category are counted, and saved in a so-called contingency table. Then we test the likelihood that Sample(s) occurs within Whole via either the Exact Fisher test or the Chi Square test. The Exact Fisher test supplies \textit{exact} p-values, whereas the Chi Square test supplies merely an approximation. However, the Exact Fisher test can be slower than a Chi Square test, and the Chi Square test increases in accuracy with increasing sample size. "Overexpressed" is defined as those probes which were over the set threshold - the \texttt{threshold.over} argument in \texttt{ProcessCels()}. Likewise, "underexpressed" is defined as those probes which were under the set threshold - argument \texttt{threshold.under}. Please note that these thresholds are a defined as a fraction of median expression value for each probe. \subsection{Smoothers} \subsubsection{Sliding average} With the moving average we attempt to the smooth the raw data so that patterns emerge. We have a set of points $(x_i,y_i), i=1 \dots n$, which is ordered such that $x_1 \leq \dots \leq x_n$. Since we plotting chromosomal location versus log2 relative expression, $x_i$ refers to chromosomal locations and $y_i$ refers to log2 relative expressions. Since chromosomal locations are integers, $x_i \in \mathbb{N}:x_i \geq0$, and since log2 relative expressions can be any real number $y_i \in \mathbb{R}$. We have a sliding average \emph{windows}, denoted $k$, which determines the smoothness. We are reconstructiong $y_i$ (the log2 relative expressions) to come to a smoother $\hat{y}_i$. For $y_i$ at least $k$-elements from the boundary, the smoother is calculated as in equation 1. For the remaining $\hat{y}_i$, where $i \in \mathbb{N}: 1 \leq i \leq k$ and $i \in \mathbb{N}: (n-k+1) \leq i \leq n$ , we first define two variables $a$ and $b$, where $a = max(1,(i-k))$ and $b = min(n,(i+k))$. The remaining $\hat{y}_i$ are then calculated using equation 2. \begin{equation} \hat{y}_i = \frac{1}{k} \cdot \sum\limits_{i=j}^{i+k} y_j , i \in \mathbb{N}: (k+1) \leq i \leq (n-k) \end{equation} \begin{equation} \hat{y}_i = \frac{\sum\limits_{i=a}^b y_i}{(a-b+1)} \end{equation} \subsubsection{Discontinuous smoother} The discontinuous smoother used in \Rfunction{discontPlot()}, is essentially a minimization problem. We again have a set of points $(x_i,y_i), i=1 \dots n$. This set is ordered in such manner that $x_1 \leq \dots \leq x_n$. For our particular problem $x_i$ refers to chromosomal coordinates, such that $x_i \in \mathbb{N}:x_i \geq 0$. Conversely, $y_i$ refers to log2 relative expressions, such that $y_i \in \mathbb{R}$. We want to reconstruct $y_i$ in such a way that the reconstruction - called $\hat{y_i}$ - closely resembles $y_i$ but has as little jumps as possible. A jump is defined as $\hat{y_i} \neq \hat{y}_{i+1}$. The reconstruction $\hat{y}_i$ can be constructed by minimizing the following function: \begin{equation} H = \sum\limits_{i=1}^n ({y_i} - {\hat{y}_i})^2 + \gamma \cdot |(\hat{y}_i \neq \hat{y}_{i+1})| \end{equation} In the above formula, $||$ denotes cardinality. The first term in the formula denotes a least squares goodness-of-fit term. The second term determines how strict the formula is. A higher $\gamma$ will result in a flatter result, with fewer jumps, and this $\gamma$ can be any number over 0. The entire formula is called a \textit{Potts filter}. The mathematics behind this formula is described in detail by Winkler et al. The implementation in \texttt{discontPlot()} uses the algorithm descibed by Friedrich et al. \subsection{Non-Affymetrix data} Even though {\bf CAFE} is designed to work only with Affymetrix .CEl files, it is possible to analyse non-affymetrix data via a workaround. For {\bf CAFE} to work with non-affymetrix data, it is necessary you provide the correct data structure which we use to analyse our data. The major structure the software works with is a {\bf list of lists} structure. The list contains {\bf three} lists - named \texttt{\$whole}, \texttt{\$over} and \texttt{\$under} - which both contain data frames for each and every sample using the following format: \begin{enumerate} \item \texttt{\$ID}, some identifier (probe IDs in affymetrix) (character vector) \item \texttt{\$Sym}, gene symbol (character vector) \item \texttt{\$Value}, log2 transformed expression value (numerical vector) \item \texttt{\$LogRel}, log2 transformed relative expressions (as a function of mean) (numerical vector) \item \texttt{\$Loc}, chromosomal locations in bp (integer vector) \item \texttt{\$Chr}, chromosome number (character vector) \item \texttt{\$Band}, cytoband (character vector) \end{enumerate} For instance <<>>= str(CAFE_data$whole[[1]]) @ The \texttt{\$whole} list should contain data for all probes, the \texttt{\$over} list should only contain those probes which are deemed overexpressed (not required if one uses \texttt{thresholding=FALSE}). The order of probes and locations in \texttt{\$whole} should be \emph{identical} acros s all samples. Each list element (i.e. the dataframes) is named according to its sample name. For instance <<>>= print(names(CAFE_data$whole)) @ \section{References} \begin{enumerate} \item Uri Ben-David, Yoav Mayshar, and Nissim Benvenisty. Virtual karyotyping of pluripotent stem cells on the basis of their global gene expression profiles. \textit{Nature protocols}, \textbf{8(5):989-997, 2013}. \item G Winkler, Volkmar Liebscher, and V Aurich. Smoothers for Discontinuous Signals. \textit{Journal of Nonparametric Statistics}, \textbf{14:203-222, 2002}. \item F Friedrich, a Kempe, V Liebscher, and G Winkler. Complexity Penalized M- Estimation. \textit{Journal of Computational and Graphical Statistics}, \textbf{17(1):201-224, March 2008}. \end{enumerate} \end{document}