%\VignetteIndexEntry{BRAIN Usage} \documentclass[a4paper]{article} \usepackage{url} \title{Bioconductor BRAIN Package Vignette} \author{Piotr Dittwald, J\"{u}rgen Claesen, Dirk Valkenborg} \SweaveOpts{echo=true} \begin{document} \maketitle <>= library(BRAIN) @ % useful resources for Sweave: http://www.statistik.lmu.de/~leisch/Sweave The R package BRAIN ({\bf B}affling {\bf R}ecursive {\bf A}lgorithm for {\bf I}sotopic distributio{\bf N} calculations) provides a computation- and memory-efficient method to calculate the{\it{ aggregated isotopic distribution}} of peptides and proteins. \section{Introduction} The {\it{isotopic distribution}} is an important, but often forgotten, concept in the field of biomolecular mass spectrometry. Yet, it is particularly useful for the interpretation of the complex patterns observed in mass spectral data. The isotopic distribution reflects the probabilities of occurrence of different isotopic variants of a molecule and is visualized in the mass spectrum by the relative heights of the series of peaks related to the molecule. Ignoring the small deviations of the masses from integer values, {\it{aggregated isotopic variants}} of a molecule, with masses differing approximately by 1Da, can be used to calculate the{\it{ aggregated isotopic distribution}}. Prior knowledge about this distribution can be used to develop strategies for searching particular profiles in the spectra and, hence, for efficient processing of the spectral information. Computing the isotopic distribution for small molecules is relatively easy for small molecules, however the complexity of this computation increases drastically with the size of the molecule. Therefore, the calculation of the (aggregated) isotopic distribution for larger molecules can be sub-optimal or even problematic due to a combinatorial explosion of terms. We presented an alternate, computation- and memory-efficient method to calculate the probabilities of occurrence and exact center-masses of the aggregated isotopic distribution of a molecule in Claesen {\it{et al.}}, 2012. \section{Overview of the BRAIN package} This package has five functions: \begin{enumerate} \item {\texttt{useBRAIN}}, which computes the probabilities of aggregated isotopic variants and their center-masses for biomolecules composed out of carbon, hydrogen, oxygen, nitrogen and sulfur. Additionally the function returns the average mass of the biomolecule; \item {\texttt{calculateMonoisotopicMass}}, which computes the theoretical monoisotopic mass of biomolecules composed out of carbon, hydrogen, oxygen, nitrogen and sulfur; \item {\texttt{calculateAverageMass}}, which computes the theoretical average mass of biomolecules composed out of carbon, hydrogen, oxygen, nitrogen and sulfur; \item {\texttt{calculateIsotopicProbabilities}}, which computes the isotopic probabilities of the aggregated isotopic variants for biomolecules composed out of carbon, hydrogen, oxygen, nitrogen and sulfur; \item {\texttt{calculateNrPeaks}}, which computes heuristically the required number of consecutive aggregated isotopic variants (starting from the monoisotopic mass), with a minimum of five; \item {\texttt{getAtomsFromSeq}}, which creates atomic composition from a given aminoacid sequence. \end{enumerate} \section{Aggregated isotopic distribution calculation} Here, we will show how to use the package with angiotensine~II as an example. Its atomic composition is C$_{50}$H$_{71}$N$_{13}$O$_{12}$. \subsection{Theoretical monoisotopic mass} The theoretical monoisotopic mass of a biomolecule with atomic composition $C_vH_wN_xO_yS_z$ is calculated as follows: \begin{equation} %\label{form:mono} Monoisotopic \; mass = v M_{C_{12}}+ w M_{H_{1}} + x M_{N_{14}} + y M_{O_{16}} + z M_{S_{32}}\nonumber \end{equation} We can calculate the monoisotopic mass for angiotensine~II with the BRAIN-package as follows: <<>>= #angiotensineII angiotensineII <- list(C=50,H=71,N=13,O=12) monoisotopicMassAngiotensineII <- calculateMonoisotopicMass(aC = angiotensineII) monoisotopicMassAngiotensineII #human dynein heavy chain humandynein <- list(C=23832,H=37816,N=6528,O=7031,S=170) monoisotopicMassHumandynein <- calculateMonoisotopicMass(aC = humandynein) monoisotopicMassHumandynein @ \subsection{Theoretical average mass} The theoretical average mass of the same biomolecule is calculated as follows: \begin{eqnarray} \label{form:avg} \mbox{\emph{Average mass }}& = & v M_{C_{12}} \times P_{C_{12}} + v M_{C_{13}} \times P_{C_{13}} \nonumber\\ & + & w M_{H_{1}} \times P_{H_{1}}+ w M_{H_{2}} \times P_{H_{2}} \nonumber\\ & + & x M_{N_{14}} \times P_{N_{14}} + x M_{N_{15}} \times P_{N_{15}} \nonumber\\ & + & y M_{O_{16}} \times P_{O_{16}} + y M_{O_{17}} \times P_{O_{17}} + y M_{O_{18}} \times P_{O_{18}}\nonumber\\ & + & z M_{S_{32}} \times P_{S_{32}} + z M_{S_{33}} \times P_{S_{33}} + z M_{S_{34}} \times P_{S_{34}} + z M_{S_{36}} \times P_{S_{36}}\nonumber \end{eqnarray} We can calculate the average mass for angiotensine~II with the BRAIN-package with: <<>>= #angiotensineII averageMassAngiotensineII <- calculateAverageMass(aC = angiotensineII) averageMassAngiotensineII #humandynein averageMassHumandynein <- calculateAverageMass(aC = humandynein) averageMassHumandynein @ \subsection{Number of requested aggregated isotopic variants} The calculation of the aggregated isotopic distribution can be stopped when the required number of aggregated isotopic variants has been reached. The latter number can be heuristically obtained. For this purpose, we have added the function {\it{calculateNrPeaks}}, which uses following rule of thumb: the difference between the theoretical monoisotopic mass and the theoretical average mass is computed and multiplied by two. Subsequently, the obtained number is rounded to the nearest integer greater than or equal to the multiplied difference. For small molecules, the minimal number of requested variants is five. <<>>= #angiotensineII nrPeaksAngiotensineII <- calculateNrPeaks(aC = angiotensineII) nrPeaksAngiotensineII #human dynein heavy chain nrPeaksHumandynein <- calculateNrPeaks(aC = humandynein) nrPeaksHumandynein @ \subsection{Isotopic probabilities of the aggregated isotopic variants} The function {\it{calculateIsotopicProbabilities}} returns the isotopic probabilities of the aggregated isotopic variants for molecules with carbon, hydrogen, oxygen, nitrogen and sulfur as atomic building blocks. The function will stop computing as soon as the required number of aggregated variants has been reached (default), when the user-defined coverage has been reached, or when the computed occurrence probabilities become smaller than the defined /textbf{abundantEstim}. <<>>= #angiotensineII #with default options prob1 <- calculateIsotopicProbabilities(aC = angiotensineII) print(length(prob1)) #with user defined number of requested aggregated isotopic variants prob2 <- calculateIsotopicProbabilities(aC = angiotensineII, nrPeaks=20) print(length(prob2)) #with user-defined coverage as stopping criterium prob3 <- calculateIsotopicProbabilities(aC = angiotensineII, stopOption = "coverage", coverage = 0.99) print(length(prob3)) #with user-defined abundantEstim as stopping criterium prob4 <- calculateIsotopicProbabilities(aC = angiotensineII, stopOption = "abundantEstim", abundantEstim = 10) print(length(prob4)) #human dynein heavy chain prob1 <- calculateIsotopicProbabilities(aC = humandynein) print(length(prob1)) prob2 <- calculateIsotopicProbabilities(aC = humandynein, nrPeaks=300) print(length(prob2)) prob3 <- calculateIsotopicProbabilities(aC = humandynein, stopOption = "coverage", coverage = 0.99) print(length(prob3)) prob4 <- calculateIsotopicProbabilities(aC = humandynein, stopOption = "abundantEstim", abundantEstim = 150) print(length(prob4)) @ \subsection{Global function} The functions described above are incorporated in the global function {\it{useBRAIN}}. <>= #angiotensineII headr <- expression(paste(C[50], H[71], N[13],O[12])) #with default options res <- useBRAIN(aC= angiotensineII) plot(res$masses,res$isoDistr,xlab="mass",ylab="abundances", type="h", xlim=c(min(res$masses)-1,max(res$masses)+1.5)) title(headr) labelMono <- paste("mono-isotopic mass:", res$monoisotopicMass, "Da", sep=" ") labelAvg <- paste("average mass:", res$avgMass, "Da", sep=" ") text(x=1048.7, y=0.5, labelMono, col="purple") text(x=1048.7, y=0.45, labelAvg, col="blue") lines(x=rep(res$monoisotopicMass[1],2),y=c(0,res$isoDistr[1]), col = "purple") lines(x=rep(res$avgMass,2),y=c(0,max(res$isoDistr)), col = "blue", lty=2) @ <>= #human dynein heavy chain headr <- expression(paste(C[23832], H[37816], N[6528],O[7031], S[170])) res <- useBRAIN(aC=humandynein, stopOption="coverage", coverage=0.99) plot(res$masses,res$isoDistr,xlab="mass",ylab="abundances", type="h", xlim=c(min(res$masses)-1,max(res$masses)+1)) title(headr) labelMono <- paste("mono-isotopic mass: ", res$monoisotopicMass, "Da", sep="") labelAvg <- paste("average mass: ", res$avgMass, "Da", sep="") mostAbundant <- res$masses[which.max(res$isoDistr)] labelAbundant <- paste("most abundant mass: ", mostAbundant, "Da", sep="") text(x=533550, y=0.02, labelMono, col="purple") text(x=533550, y=0.0175, labelAvg, col="blue") text(x=533550, y=0.015, labelAbundant, col="red") lines(x=rep(res$avgMass,2),y=c(0,max(res$isoDistr)), col = "blue", lty=2) lines(x=rep(mostAbundant,2),y=c(0,max(res$isoDistr)), col = "red") @ % lines(x=rep(res$monoisotopic.mass[1],2),y=c(0,res$iso.distr[1]), col = "purple") #not seen Another stopping criteria as for the function {\it{calculateIsotopicDistribution}} are available (corresponding plots not shown in this document). <<>>= #with user defined number of requested aggregated isotopic variants res <- useBRAIN(aC = angiotensineII, nrPeaks = 20) plot(res$masses,res$isoDistr,xlab="mass",ylab="abundances", type="h", xlim=c(min(res$masses)-1,max(res$masses)+1)) title(headr) #with user defined coverage as stopping criterium res <- useBRAIN(aC = angiotensineII, stopOption = "coverage", coverage = 0.99) plot(res$masses,res$isoDistr,xlab="mass",ylab="abundances", type="h", xlim=c(min(res$masses)-1,max(res$masses)+1)) title(headr) #with user defined abundantEstim as stopping criterium res <- useBRAIN(aC = angiotensineII, stopOption = "abundantEstim", abundantEstim = 10) plot(res$masses,res$isoDistr,xlab="mass",ylab="abundances", type="h", xlim=c(min(res$masses)-1,max(res$masses)+1)) title(headr) @ \section{High-throughput calculation of the aggregated isotopic distribution and their exact center-masses}\label{High-throughput} We used the Uniprot database as a case example for the high-throughput calculations. The data was downloaded at 28.11.2011 (release 2011\_11) for query: \texttt{organism:"Homo sapiens (Human) [9606]" AND keyword:"Complete proteome [181]"} in UniProtKB (\url{http://www.uniprot.org/uniprot/?query=organism:9606+keyword:181}). We downloaded data in a tab-delimited format and considered both reviewed (20,245) (UniProtKB/Swiss-Prot) and unreviewed (37,802) (UniProtKB/TrEMBL) entries. We only used the availably sequence information in this example. % The data have in separate lines sequences of proteins and therefore were processed by the following script (last operation splitted space breaks in sequence): % The most interesting column for us is the column named Sequence containing sequences of amino acids for each protein. <>= tabUniprot <- read.table('data/uniprot.tab', sep="\t", header=TRUE) tabUniprot$Sequence <- gsub(" ", "", tabUniprot$Sequence) howMany <- nrow(tabUniprot) @ We will process only those proteins which are solely composed out of the $20$ natural amino acides, i.e. those which return TRUE values from the test below (other proteins are ignored). % DIRK's NOTE: I have to check later but I believe it is difficult to distinguish between leucine and isoleucine indicated by the letter X.\\ % DIRK's NOTE:. This means that wild cards [*] and uncertainty between [..] and [..] is not considered\\ <>= AMINOS <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H", "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V") !is.na(sum(match(seqVector, AMINOS))) @ To be able to use the function \texttt{useBRAIN} from the BRAIN package we need to change the amino acid sequences into chemical formulas containing the numbers of Carbon, Hydrogen, Nitrogen, Oxygen and Sulphur atoms. This may be done using the function \texttt{getAtomsFromSeq} changing the amino acid string into the list with corresponding atomic composition. Finally we run the following script to obtain \texttt{dfUniprot} data frame. <>= nrPeaks = 1000 howMany <- nrow(tabUniprot) dfUniprot <- data.frame() for (idx in 1:howMany){ seq <- as.character(tabUniprot[idx,"Sequence"]) seqVector <- strsplit(seq, split="")[[1]] if (!is.na(sum(match(seqVector, AMINOS)))){ aC <- getAtomsFromSeq(seq) res <- useBRAIN(aC = aC, nrPeaks = nrPeaks, stopOption = "abundantEstim", abundantEstim = 10) isoDistr <- res$isoDistr masses <- res$masses maxIdx <- which.max(isoDistr) mostAbundantPeakMass <- masses[maxIdx] monoMass <- res$monoisotopicMass dfAtomicComp <- data.frame(C=aC[1], H=aC[2], N=aC[3], O=aC[4], S=aC[5]) singleDfUniprot <- data.frame(dfAtomicComp, monoMass=monoMass, maxIdx=maxIdx, mostAbundantPeakMass=mostAbundantPeakMass ) if (!is.na(monoMass)){ #for huge atomic configuration numerical problems may occur dfUniprot <- rbind(dfUniprot, singleDfUniprot) } } } write.table(unique(dfUniprot), "data/uniprotBRAIN.txt") @ Execution of this code needed around 80 minutes on PC with two Intel(R) Core(TM)2 2.40GHz CPUs. \section{Predicting monoisotopic mass from most abundant peak mass} We obtain the data frame with the processed human proteins from Uniprot database (the processing procedure was described in Section~\ref{High-throughput}). We may also check the number of rows. <<>>= uniprotBRAIN <- read.table(system.file("extdata", "uniprotBRAIN.txt", package="BRAIN")) nrow(uniprotBRAIN) @ We only consider proteins with monoisotopic mass lower than $10^5$ Da. <>= uniprot10to5 <- uniprotBRAIN[uniprotBRAIN$monoMass < 10^5,] nrow(uniprot10to5) @ We can observe a linear relationship between most abundant peak mass and monoisotopic mass for the considered data. <>= library(lattice) mm <- uniprot10to5$monoMass mmMin <- floor(min(mm)) - 1 mmMax <- ceiling(max(mm)) + 1 sq <- seq(from=0, to=mmMax, by=5000) bw <- bwplot(cut(monoMass, sq) ~ mostAbundantPeakMass, data=uniprot10to5, ylab="monoMass (Da)", xlab="mostAbundantPeakMass (Da)") plot(bw) @ <>= bw2 <- bwplot(cut(monoMass, sq) ~ (mostAbundantPeakMass - monoMass), data=uniprot10to5, ylab="monoMass (Da)", xlab="mostAbundantPeakMass - monoMass (Da)") plot(bw2) @ % plots % <>= % @ Following obtained linear relationship we build the linear model for predicting monoisotopic mass just by knowing most abundant peak mass. <<>>= lmod <- lm(monoMass ~ mostAbundantPeakMass, data=uniprot10to5) summary(lmod) @ We can then further analyze the model using histogram which shows the residuals. % <>= % icpt.lm <- lmod$coefficients[1] % cff.lm <- lmod$coefficients[2] % cff <- (1 - cff.lm) % icpt <- -icpt.lm % plot(uniprot.10to5$most.abundant.peak.mass, (uniprot.10to5$most.abundant.peak.mass - uniprot.10to5$mono.mass), pch=1) % points(uniprot.10to5$most.abundant.peak.mass, round(icpt + cff * uniprot.10to5$mono.mass), col="green", pch=".") % points(uniprot.10to5$most.abundant.peak.mass, round(icpt + cff * uniprot.10to5$mono.mass)-1, col="violet", pch=".") % points(uniprot.10to5$most.abundant.peak.mass, round(icpt + cff * uniprot.10to5$mono.mass)-2, col="violet", pch=".") % points(uniprot.10to5$most.abundant.peak.mass, round(icpt + cff * uniprot.10to5$mono.mass)+1, col="red", pch=".") % points(uniprot.10to5$most.abundant.peak.mass, round(icpt + cff * uniprot.10to5$mono.mass)+2, col="red", pch=".") % @ <>= icptLm <- lmod$coefficients[1] cffLm <- lmod$coefficients[2] expected <- icptLm + cffLm * uniprot10to5$mostAbundantPeakMass residuals <- uniprot10to5$monoMass - expected hist <- hist(residuals, seq(floor(min(residuals)),ceiling(max(residuals)), by=0.1), main="", xlab="residuals of the linear model (Da)") plot(hist) @ \end{document}