%\VignetteIndexEntry{HOWTO PROcess} %\VignettePackage{PROcess} \documentclass[12pt]{article} \usepackage{amsmath} \usepackage[authoryear,round]{natbib} \usepackage{hyperref} \textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle} \bibliographystyle{plainnat} %\bibliographystyle{apalike} \begin{document} @ \title{HowTo Use the Bioconductor PROcess package} \author{} \maketitle \tableofcontents % library(tools) % Rnwfile<- file.path("~/PROcess/inst/doc","howtoprocess.Rnw") % Sweave(Rnwfile,pdf=TRUE,eps=TRUE,stylepath=TRUE,driver=RweaveLatex()) \section{Introduction} The {\tt PROcess} package contains a collection of functions for processing spectra to remove baseline drifts if any, detect peaks and align them to a set of protobiomarkers. This document serves as a quick tutorial for using the {\tt PROcess} package. \section{Baseline subtraction} Our first observation of a raw spectrum is that it exhibits elevated baseline, more so at smaller m/z values than at larger values. This elevated baseline is mostly caused by the chemical noises in the EAM and ion overload. Ideally a spectrum should rest more or less on the zero horizontal line. This baseline needs to be subtracted from each raw spectrum. The following example shows the result of a spectrum with its baseline removed. <>= library(PROcess) fdat <- system.file("Test", package="PROcess") fs <- list.files(fdat, pattern="\\.*csv\\.*", full.names=TRUE) f1 <- read.files(fs[1]) fcut <- f1[f1[,1]>0,] bseoff <-bslnoff(fcut,method="loess",plot=TRUE, bw=0.1) title(basename(fs[1])) @ \section{Peak detection} After baseline is removed, peaks can be located by using {\tt isPeak}. A spectrum is smoothed first using moving average of the $k$ nearest neighbours. Smoothing helps to enhance peaks and get rid of spurious peaks. However, we do not recommend large amount of smoothing (controlled by parameter {\tt sm.span}) in this step because we do not wish to smooth away too many short and wide peaks and also we need the precision in peak locations. As a first step we do not mind getting more potential features. <>= pkgobj <- isPeak(bseoff,span=81,sm.span=11,plot=TRUE) @ We can also zoom in to inspect peaks in a particular range of m/z values. <>= specZoom(pkgobj, xlim=c(5000,10000)) @ \section{Batch operation} We demonstrate the batch functionality of this package using a set of 2 spectra. \subsection{Apply baseline subtraction to a set of spectra} <>= testdir <- system.file("Test", package = "PROcess") testM <- rmBaseline(testdir) @ \subsection{Renormalize spectra} Suppose we want to normalize a set of spectra to their median AUC (Area Under the Curve), where an AUC is calculated for m/z values greater than a cutoff point, 1500. <>= rtM <- renorm(testM, cutoff=1500) @ \subsection{Identify peaks of spectra} <>= peakfile <- paste(tempdir(), "testpeakinfo.csv", sep = "/") getPeaks(rtM, peakfile) @ \subsection{Quality assessment} Quality assessment is necessary because some spectra are very noisy and have hardly any peaks. Function {\tt quality} computes three parameters {\tt Quality}, {\tt Retain} and {\tt peak} for assessing a set of spectra. <>= qualRes <- quality(testM, peakfile, cutoff=1500) print(qualRes) @ A spectrum is deemed of poor quality and should be removed from subsequent analyses if it meets the following 3 conditions simultaneously: \begin{enumerate} \item $\mbox{\tt Quality}\, <\, 0.4$; \item $\mbox{\tt Retain}\, <\, 0.1 $; \item $\mbox{\tt peak} \,< \,1/2 \mbox{ of the mean peak number in the chip}$. \end{enumerate} \subsection{Get protobiomarkers} One challenge in MS data is that not only they exhibit variation vertically but also do they horizontally. This horizontal variation is not simply a constant shift but associated with value of m/z. Currently the accuracy in the m/z position is believed to be within $0.3\%$ o f the m/z value. Once the peaks are detected, we align the peaks by first generating an interval of size $0.3\%$ of the m/z value which centers at m/z for each m/z where a peak is detected. We treat those m/z intervals as interval censored data. We treat those intervals as a partially ordered set and use the locations of the maximal cliques to define the locations of the peaks \cite{Gent:Van:2001}. We call these aligned peaks (across spectra) proto-biomarkers and use the centers of the resulting intervals as the locations of the aligned peaks. For each spectrum we determine which actual peaks are represented by an aligned peak (proto-biomarker) and use the maximum of those as the height of the proto-biomarker. If there were no peaks then we use the maximum value within the resulting interval. <>= bmkfile <- paste(tempdir(), "testbiomarker.csv", sep = "/") testBio <- pk2bmkr(peakfile, rtM, bmkfile) mzs <- as.numeric(rownames(rtM)) matplot(mzs, rtM, type = "l", xlim = c(1000, 10000), ylab="intensities", main="proto-biomarkers") bks <- getMzs(testBio) print(round(bks)) abline(v = bks, col = "green") @ \section{An alternative way to obtain proto-biomarkers} \cite{baggMean} propose an alternative way for peak detection using the average spectrum of all spectra of a given experiment. Their algorithm is comprised of the following steps, (1) compute the mean of all raw spectra, (2) de-noise, baseline correct and find peaks by locating all local maxima in the mean spectrum, and (3) quantify the identified peaks in the individual spectra. The major advantage of this approach is the simplicity and the speed to arrive at a set of proto-biomarkers in comparison to the approach described in the previous sections to select peaks from individual spectra. In {\tt PROcess} we adopt the general idea of using mean spectrum to locate peaks, but leave the decision to users whether they should compute the mean spectrum of the raw spectra, or of the smoothed (de-noised), baseline corrected and normalized spectra for their experiment at hand. If the steps of baseline-subtraction and normalization are done properly, there may be improvement in peak detection using the mean spectrum computed from the pre-processed spectra. Our approach comprises the following steps. \begin{itemize} \item Compute the mean of all raw spectra using {\tt aveSpec}, or the mean of all pre-processed spectra using standard R function {\tt rowMeans}. \item Detect peaks of the mean spectrum by {\tt isPeak}. \item Align peaks by {\tt align} if there seems to be peak clusters. \item Quantify peaks in individual spectra that have been smoothed, baseline-removed and normalized, by {\tt getPeaks2} that locates for each peak the maximum intensity in a neighbourhood of the peak defined by a user-specified precision of peaks. \end{itemize} We now demonstrate this approach to the test data set using {\tt PROcess}. We execute the following code to compute the mean spectrum. <>= grandAve <- aveSpec(fs) mzs <- grandAve[,1] @ Should you wish to compute the mean spectrum of the pre-processed spectra, you can do {\tt rowMeans(rtM)} instead, then skip the baseline correction step and proceed to the peak detection step using {\tt isPeak} . We execute the following code to remove baseline, detect peaks in the overall mean spectrum and quantify them in individual spectra. <>= grandOff <- bslnoff(grandAve[mzs>0,], method="loess", plot=T, bw=0.1) @ <>= grandPkg <- isPeak(grandOff[grandOff[,1]>1500,], zerothrsh=1, plot=T, ratio=0.1) grandpvec <- round(grandPkg[grandPkg$peak, "mz"]) print(as.vector(grandpvec)) @ We can then quantify the peaks by running the following code. <>= grandBmk <- getPeaks2(rtM, grandpvec) @ \bibliography{howtoprocess} \end{document}