\documentclass[article]{bioinf} \usepackage{amsmath} \usepackage{hyperref} \hypersetup{colorlinks=false, pdfborder=0 0 0, pdftitle={PrOCoil - A Web Service and an R Package for Predicting the Oligomerization of Coiled Coil Proteins}, pdfauthor={Ulrich Bodenhofer}} \title{PrOCoil --- A Web Service and an \R\ Package for\\ Predicting the Oligomerization of\\ Coiled Coil Proteins} \author{Ulrich Bodenhofer} \affiliation{Institute of Bioinformatics, Johannes Kepler University Linz\\Altenberger Str. 69, 4040 Linz, Austria\\ \email{procoil@bioinf.jku.at}} \newcommand{\PrOCoil}{PrOCoil} \newcommand{\R}{R} %\VignetteIndexEntry{PrOCoil - A Web Service and an R Package for Predicting the Oligomerization of Coiled-Coil Proteins} %\VignetteDepends{methods, stats, graphics, utils} %\VignetteEngine{knitr::knitr} \setlength{\fboxrule}{1.5pt} \begin{document} <>= options(width=72) knitr::opts_knit$set(width=72) set.seed(0) library(procoil, quietly=TRUE) procoilVersion <- packageDescription("procoil")$Version procoilDateRaw <- packageDescription("procoil")$Date procoilDateYear <- as.numeric(substr(procoilDateRaw, 1, 4)) procoilDateMonth <- as.numeric(substr(procoilDateRaw, 6, 7)) procoilDateDay <- as.numeric(substr(procoilDateRaw, 9, 10)) procoilDate <- paste(month.name[procoilDateMonth], " ", procoilDateDay, ", ", procoilDateYear, sep="") @ \newcommand{\PrOCoilVer}{\Sexpr{procoilVersion}} \newcommand{\PrOCoilDate}{\Sexpr{procoilDate}} \manualtitlepage[Version \PrOCoilVer, \PrOCoilDate] \section*{Scope and Purpose of this Document} This document is a user manual for \PrOCoil, the software suite accompanying the paper \cite{MahrenholzAbfalterBodenhoferVolkmerHochreiter11}. It provides a gentle introduction into how to use \PrOCoil. Not all features of the \R\ package are described in full detail. Such details can be obtained from the documentation enclosed in the \R\ package. Further note the following: (1) this is not an introduction to coiled coil proteins; (2) this is not an introduction to \R; (3) this is not an introduction to support vector machines. If you lack the background for understanding this manual, you first have to read introductory literature on the subjects mentioned above. \vspace{1cm} \newlength{\auxparskip} \setlength{\auxparskip}{\parskip} \setlength{\parskip}{0pt} \tableofcontents \clearpage \setlength{\parskip}{\auxparskip} \newlength{\Nboxwidth} \setlength{\Nboxwidth}{\textwidth} \addtolength{\Nboxwidth}{-2\fboxrule} \addtolength{\Nboxwidth}{-2\fboxsep} \newcommand{\notebox}[1]{% \begin{center} \fcolorbox{bioaz}{biowh}{\begin{minipage}{\Nboxwidth} \noindent{\sffamily\bfseries Note:} #1 \end{minipage}} \end{center}} \section{Introduction} This user manual describes the \PrOCoil\ software suite that accompanies the paper \cite{MahrenholzAbfalterBodenhoferVolkmerHochreiter11}. This software is concerned with analyzing coiled coil sequences in terms of their oligomerization behavior. \PrOCoil\ does not only provide a prediction, but also detailed insights into which residues or sub-sequences are responsible for the predicted oligomerization. \PrOCoil\ is available both as an easy-to-use Web interface and an \R\ package. Both variants offer the same prediction and analysis facilities. The following table summarizes the essential differences: \begin{center} \begin{tabular}{|p{0.4\textwidth}|p{0.4\textwidth}|} \hline \PrOCoil\ Web interface & \PrOCoil\ \R\ package\\ \hline \hline can be used instantly on any computer with Internet access and a Web browser & requires \R\ and installation of package {\ttfamily procoil} \\ \hline supports only the standard \PrOCoil\ model & supports the standard \PrOCoil\ model and the alternative model optimized for balanced accuracy; other models can be loaded from files \\ \hline graphics are produced in a non-customizable standard format & graphics are customizable \\ \hline only single sequences or at most pairs of sequences with the same heptad register can be analyzed at a time & multiple sequences can be analyzed per run \\ \hline beside standard input (amino acid sequence $+$ aligned heptad annotation), also PairCoil2 output format is supported; Marcoil output can be used upon a separate pre-processing step & only standard input (amino acid sequence(s) $+$ aligned heptad annotation(s)) \\ \hline \end{tabular} \end{center} We recommend beginners to use the Web interface. Experienced users can benefit from the greater flexibility of the \R\ package. \R\ package users can use the Web interface for converting PairCoil2 and/or Marcoil output into the input format the \R\ package requires. \section{Input Data}\label{sec:input} As already mentioned, \PrOCoil\ predicts whether a given coiled coil segment of an amino acid sequence is more likely to form a dimer or trimer. Such a segment has to consist of an amino acid (sub-)sequence and an aligned heptad annotation of the same length. As an example, the GCN4 yeast transcription factor is a dimer consisting of two equal sequences (i.e.\ it is a homo-dimer), the coiled coil parts of which (according to SOCKET \cite{WalshawWoolfson01}) look as follows: \begin{quote} {\footnotesize\begin{verbatim} MKQLEDKVEELLSKNYHLENEVARLKKLV abcdefgabcdefgabcdefgabcdefga \end{verbatim}} \end{quote} The heptad annotation is essential for \PrOCoil\ to work and cannot be omitted. The letters `\verb+a+'--`\verb+g+' correspond to the usual annotation of positions within the heptad motif. \PrOCoil\ can also process full amino acid sequences containing one or more coiled coil segments as sub-sequences, where the coiled coil segments must be annotated with the usual symbols `\verb+a+'--`\verb+g+' and non-coiled coil residues must be annotated with dashes `\verb+-+'. As an example, let us consider the following sample sequence (annotation derived from running Marcoil% \footnote{\url{http://www.isrec.isb-sib.ch/webmarcoil/webmarcoilC1.html}} \cite{DelorenziSpeed02} on this sequence with a cut-off of 50\%): \begin{quote} {\footnotesize\begin{verbatim} MGECDQLLVFMITSRVLVLSTLIIMDSRQVYLENLRQFAENLRQNIENVHSFLENLRADLENLRQKFPGKWYSAMPGRHG -------------------------------abcdefgabcdefgabcdefgabcdefgabcdefg-------------- \end{verbatim}} \end{quote} If the heptad annotation contains dashes, \PrOCoil\ extracts all coiled coil regions (contiguous non-dash regions) and classifies them separately. From the above sample sequence, \PrOCoil\ will extract and classify the following coiled coil segment (residues 32--66 of the above example): \begin{quote} {\footnotesize\begin{verbatim} LENLRQFAENLRQNIENVHSFLENLRADLENLRQK abcdefgabcdefgabcdefgabcdefgabcdefg \end{verbatim}} \end{quote} \section{Predictions Using the Web Interface}\label{sec:webpredict} The \PrOCoil\ Web interface can be accessed directly via the \PrOCoil\ homepage.\footnote{\url{http://www.bioinf.jku.at/software/procoil/}} The page contains an input field that is pre-filled with the GCN4 wild type as an example: \begin{center} \fcolorbox{bioaz}{biowh}{\includegraphics[scale=0.46]{Input}} \end{center} The \PrOCoil\ server first checks your input for validity (see Section\ \ref{sec:input}). If the input is correct, all coiled coil segments are extracted from the input and analyzed separately. For the standard example, the GCN4 wild type, the output looks as follows: \begin{center} \fcolorbox{bioaz}{biowh}{\includegraphics[scale=0.46]{GCN4output}} \end{center} The output of the \PrOCoil\ Web interface is structured into three sections: \begin{description} \item[Input data:] amino acid sequence and heptad annotation of the coiled coil segment; \item[Classification result:] discriminant function value and final classification; a positive value means that the sequence is classified as trimer, a negative value means that the sequence is classified as dimer. The higher the absolute value, the clearer the oligomerization tendency is. The closer the value is to zero, the more the sequence can be considered a borderline case. \item[Prediction profile:] this plot shows the contribution of each residue to the discriminant function value. The more positive a value is for a residue, the more this residue contributes to an oligomerization tendency towards trimers. The more negative the value is, the more this residue contributes to an oligomerization tendency towards dimers. The horizontal line corresponds to the base line of the classifier (see \ref{ssec:profiles}). The discriminant function value is actually obtained as the area above the grey baseline minus the area below the grey baseline. The links below the prediction profile plot allow for downloading the profile in various formats. \end{description} \notebox{Values in the prediction profile cannot be understood as general rules for which oligomerization behavior a given amino acid at a given heptad position is indicative for. \PrOCoil\ takes pairwise interactions of amino acids into account. Therefore, the values in the prediction profile are always to be considered in the context of the given sequence. The same amino acid at the same heptad position might have a completely different value in the prediction profile of a different sequence.} If the heptad annotation contains at least one dash `\verb+-+', \PrOCoil\ first extracts all coiled coil segments, i.e.\ all contiguous sub-sequences with no dashes in the heptad annotation. Then all these coiled coil segments are analyzed independently and the results are presented sequentially in the order they appear in the input sequence in the same format as above. The only notable difference is that the section ``Input data'' shows the complete input sequence, whereas the extracted coiled coil segment is displayed in a separate section ``Coiled coil segment'' along with its consecutive number and the corresponding start and end positions in the input sequence. For the above sample sequence, the output looks as follows: \begin{center} \fcolorbox{bioaz}{biowh}{\includegraphics[scale=0.46]{MarcoilOut}} \end{center} The \PrOCoil\ Web interface also facilitates easy analysis of mutations of coiled coil segments. This feature is limited to two sequences at a time. The two sequences must have the same length and heptad register. If these conditions are met, the second sequence can simply be written underneath the heptad register (with the first sequence remaining above the heptad register): \begin{center} \fcolorbox{bioaz}{biowh}{\includegraphics[scale=0.46]{MutationAnalysisInput}} \end{center} The output is structured as described for single sequences. The section ``Classification result'' shows discriminant function values and predictions of both sequences along with a comparison whether the second sequence is more dimeric or more trimeric than the first sequence. Only one profile plot is produced in which both profiles are visualized, the profile of the first sequence in red and the second sequence's profile in blue: \begin{center} \fcolorbox{bioaz}{biowh}{\includegraphics[scale=0.46]{MutationAnalysisOutput}} \end{center} \section{Preprocessing Predicted Coiled Coil Segments Using the Web interface}\label{sec:preproc} We expect \PrOCoil\ to be used mainly for sequences the exact 3D structures of which are unknown (otherwise the correct oligomerization can be determined using SOCKET \cite{WalshawWoolfson01}). For structurally unresolved sequences, a coiled coil predictor must be used first to determine the coiled coil segments and their presumed heptad annotations. Currently the two most commonly used predictors for this task are {\bf Marcoil} \cite{DelorenziSpeed02} and {\bf PairCoil2} \cite{BergerWilsonWolfTonchevMillaKim95,McDonnellJiangKeatingBerger06}. Both programs have their own output formats that do not comply with the format described in Section\ \ref{sec:input}. In order to ease integration with Marcoil and PairCoil2, the \PrOCoil\ Web interface offers preprocessing tools that allow to use the outputs of Marcoil and PairCoil2 for processing with \PrOCoil. \subsection{Processing PairCoil2 results} The PairCoil2 Web interface\footnote{\url{http://groups.csail.mit.edu/cb/paircoil2/paircoil2.html}} produces the following kind of output for the Marcoil sample sequence: \begin{center} \fcolorbox{bioaz}{biowh}{\includegraphics[scale=0.46]{PairCoil2output}} \end{center} If you click the link ``Positions and registers'', a text file with the predicted heptad annotation is displayed. Select the lines with the amino acids and heptad annotations in the following way: \begin{center} \fcolorbox{bioaz}{biowh}{\includegraphics[scale=0.46]{PairCoil2sel}} \end{center} Now copy this selection into the input field of the \PrOCoil\ Web interface. If you check ``Accept PairCoil2 output format'', you can directly process the PairCoil2 output with \PrOCoil: \begin{center} \fcolorbox{bioaz}{biowh}{\includegraphics[scale=0.46]{PairCoil2input}} \end{center} \subsection{Processing Marcoil results} If you submit a sequence to the Marcoil Web interface,\footnote{\url{http://toolkit.tuebingen.mpg.de/marcoil}} the results are shown in your Web browser. These results cannot be processed by \PrOCoil\ directly, since coiled coil segments must first be singled out by applying a probability cut-off threshold. This can be one as follows: first go to the ``Prob List'' tab and click ``Export'', then a plain text file with results is shown as follows: \begin{center} \fcolorbox{bioaz}{biowh}{\includegraphics[scale=0.46]{Marcoil}} \end{center} As in the above screenshot, select the four-column table with residue numbers, residues, probabilities, and most probable heptad positions. Then copy and paste it into the text field at the bottom of the \PrOCoil\ Web page as follows: \begin{center} \fcolorbox{bioaz}{biowh}{\includegraphics[scale=0.46]{MarcoilPreProc}} \end{center} After clicking ``Submit'', an output page is displayed that shows the result of applying the chosen probability cut-off in a read-only field at the bottom: \begin{center} \fcolorbox{bioaz}{biowh}{\includegraphics[scale=0.46]{MarcoilPreProcOut}} \end{center} If you are satisfied with the result, you can directly pass the data to \PrOCoil\ by clicking ``Proceed to PrOCoil''. If you think you should have used a different threshold, your input is displayed in the input field on top of the page. Select a different threshold until the result meets your expectation and then pass the data to \PrOCoil. \clearpage \section{\PrOCoil\ \R\ Package} \subsection{Installation} The \PrOCoil\ \R\ package (current version: \PrOCoilVer) is available via Bioconductor. The simplest way to install the package is the following: <>= if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager") BiocManager::install("procoil") @ If you wish to install the package manually instead, you can download the package archive that fits best to your computer system from the Bioconductor homepage. \subsection{Getting started}\label{ssec:getstarted} To load the \PrOCoil\ package, enter <>= library(procoil) @ \noindent in your \R\ session. If this command terminates without any error message or warning, you can be sure that the \PrOCoil\ package has been installed successfully. If so, the \PrOCoil\ package is ready for use now and you can start predicting the oligomerization of coiled coils. As a first example, the following command makes a prediction for the GCN4 wild type already used above: <>= GCN4wt <- predict(PrOCoilModel, "MKQLEDKVEELLSKNYHLENEVARLKKLV", "abcdefgabcdefgabcdefgabcdefga") @ The object `\texttt{PrOCoilModel}' is an object of class `\texttt{CCModel}' in which the \PrOCoil\ model is stored. Since \texttt{predict()} is a generic function that determines the function to call from the type of the first argument, it is essential that you provide an object of class `\texttt{CCModel}' as first argument of \texttt{predict()}. For more information about `\texttt{CCModel}' objects, enter \texttt{?CCModel}. Note that the sequence need not be a plain character string. Firstly, multiple sequences can be supplied at a time as a character vector, but also via objects of several other classes. Heptad registers need not be supplied via the `\texttt{reg}' argument. Instead, they can be attached to the sequence objects via attributes or metadata (see \ref{sssec:reg} for details). As mentioned in Section\ \ref{sec:input}, the sequence and the register annotation must have equal lengths. The function \texttt{predict()} creates an object of class `\texttt{CCProfile}' in which prediction results along with various additional information is stored. To obtain more information about this class, enter \texttt{?CCProfile}. Basic information about the result can be displayed by \texttt{show(GCN4wt)} or simply by entering the name of the object: <>= GCN4wt @ \noindent The discriminant value and its interpretation are the same as described in Section\ \ref{sec:webpredict} above. A prediction profile can be plotted simply as follows: \begin{center} <>= plot(GCN4wt) @ \end{center} \noindent The \texttt{plot()} function for `\texttt{CCProfile}' objects provides various ways for customizing the plot and for writing directly to graphics files (see also \ref{sssec:plots} and the documentation available via \verb+?plot.CCProfile+). \subsection{Predictions for non-trimmed sequences containing coiled coil segments} Like the Web version described above, the \R\ package is also capable of handling sequences that contain non-coiled-coil sub-sequences. If the heptad annotation contains at least one dash ``\texttt{-}'', \texttt{predict()} first extracts all coiled coil segments, i.e.\ all contiguous sub-sequences with no dashes in the heptad annotation. Then all these coiled coil segments are analyzed independently and the results are returned as a list, the components of which are the prediction results of the coiled coil segments in the order they appear in the sequence. Let us consider the Marcoil sample sequence again: {\footnotesize <>= res <- predict(PrOCoilModel, "MGECDQLLVFMITSRVLVLSTLIIMDSRQVYLENLRQFAENLRQNIENVHSFLENLRADLENLRQKFPGKWYSAMPGRHG", "-------------------------------abcdefgabcdefgabcdefgabcdefgabcdefg--------------") @ } \noindent The returned object `\texttt{res}' is a `\texttt{CCProfile}' object that contains the predictions and profiles of all coiled coil segments that were found in the sequences. In the given example, this is just one segment: <>= res @ \noindent If the untrimmed sequences were unnamed, the segments will also remain unnamed in the resulting `\texttt{CCProfile}' object. However, if the sequences were named, the segments are named according to a specific strategy. This strategy is best explained by the following example: suppose we have three sequences that are named `\texttt{S1}', `\texttt{S2}', and `\texttt{S3}', and assume that sequence `\texttt{S1}' contains one coiled coil segment, `\texttt{S2}' contains three coiled coil segments, and `\texttt{S3}' contains no coiled coil segment. Then the resulting `\texttt{CCProfile}' object will contain predictions and profiles of a total of four coiled coil segments whose names are `\texttt{S1.1}', `\texttt{S2.1}', `\texttt{S2.2}', and `\texttt{S2.3}'. The prediction profile must be plotted for each coiled coil segment separately. This can be done by accessing the respective entry in the \texttt{CCProfile} object explicitly: \begin{center} <>= plot(res[1]) @ \end{center} \subsection{Comparative mutation analysis}\label{ssec:compmut} The \PrOCoil\ \R\ package allows not only to consider input sequences completely independently of one another, but also allows for overlaying profiles of aligned coiled coil segments. This feature is useful, for example, for studying effects of different mutations. As an example, we consider multiple mutations of GCN4: <>= GCN4mSeq <- c("GCN4wt" ="MKQLEDKVEELLSKNYHLENEVARLKKLV", "GCN4_N16Y_L19T"="MKQLEDKVEELLSKYYHTENEVARLKKLV", "GCN4_E22R_K27E"="MKQLEDKVEELLSKNYHLENRVARLEKLV", "GCN4_V23K_K27E"="MKQLEDKVEELLSKNYHLENEKARLEKLV") GCN4mReg <- rep("abcdefgabcdefgabcdefgabcdefga", 4) GCN4mut <- predict(PrOCoilModel, GCN4mSeq, GCN4mReg) GCN4mut @ The \texttt{plot()} function allows for plotting two profiles in one plot: \begin{center} <>= plot(GCN4mut[c(1, 2)]) plot(GCN4mut[c(1, 3)]) plot(GCN4mut[c(1, 4)]) @ \end{center} By default, the first profile is plotted in red and the second profile is plotted in blue. The first sequence is shown above the graph and the second sequence is shown below the graph. Black letters correspond to residues that are the same in the two sequences, whereas mutations are highlighted in the corresponding colors. More information on how to customize plots is available in \ref{sssec:plots} and the documentation available via \verb+?plot.CCProfile+. \notebox{Profile overlays also work if the two sequences are not equally long and/or if their heptad annotations are not aligned to each other. However, this hardly makes sense, so the overlay of profiles is primarily made for showing profiles of aligned sequences with matching heptad registers.} \pagebreak For three or more sequences, the profiles can also be visualized as heatmaps: \begin{center} <>= heatmap(GCN4mut) @ \end{center} In this view, profiles/sequences are even hierarchically clustered. The dendrogram is shown to the left of the heatmap. In the example above, obviously, the two trimers are clustered together as rows 1--2 and the two dimers are clustered together as rows 3--4. The note from above applies to heatmaps too: heatmaps make most sense for showing profiles of aligned sequences with matching heptad registers. \subsection{Miscellanea} \subsubsection{Processing predicted coiled coil segments with the \R\ package} The \PrOCoil\ \R\ package itself is not able to process Marcoil or PairCoil2 output. Users who want to process prediction results obtained from these programs are recommended to use the \PrOCoil\ Web interface to convert Marcoil or PairCoil2 output (see Section\ \ref{sec:preproc}) into the correct input format that can be processed by the \PrOCoil\ \R\ package. \subsubsection{Heptad irregularities} Some coiled coils occurring in nature have heptad irregularities, e.g.\ incomplete heptads in which an `\verb+a+' position follows after a `\verb+d+' position. \PrOCoil\ can also process heptad annotations with such irregularities. In the profile plots created by the \PrOCoil\ Web interface, such irregularities can only be seen in the register labels in the middle of the plot. In the profile plots created by the \PrOCoil\ \R\ package, heptad irregularities are additionally visualized by a vertical red line between the two positions that do not conform to the usual regular pattern. \begin{center} <>= plot(predict(PrOCoilModel, "LQDTLVRQERPIRKSIEDLRNTV", "defgabcdefgabcdabcdefga")) @ \end{center} \subsubsection{Alternative models}\label{sssec:models} Trimers occur less frequently than dimers. Correspondingly, there were more dimers and trimers in the data set used for training the default model `\verb+PrOCoilModel+'. Since this model was optimized for (standard) classification accuracy, it tends to concentrate on classifying the larger class --- dimers --- properly. That is why the model performs better in terms of specificity/true negative rate (correctly classified dimers) than in terms of sensitivity/true positive rate (correctly classified trimers). In case one wants to attribute equal importance to sensitivity and specificity, we have trained a second model that is optimized for {\em balanced accuracy}, i.e., the average of sensitivity and specificity (see publication supplement of \cite{MahrenholzAbfalterBodenhoferVolkmerHochreiter11}). This model can be used simply by calling \verb+predict()+ with `\verb+PrOCoilModelBA+' as first argument. Like the standard model `\verb+PrOCoilModel+', the alternative model `\verb+PrOCoilModelBA+' is automatically available once the \PrOCoil\ \R\ package has been loaded. In case the user wants to define custom models or wishes to use previous versions of the prediction models, the functions \texttt{readCCModel()} and \texttt{writeCCModel()} can be used to read/write models from/to plain text files that can be viewed and also modified. In order to see how such a model file should be organized, the two models included in the \PrOCoil\ \R\ package are available for download at: \begin{quotation} \noindent\url{http://www.bioinf.jku.at/software/procoil/PrOCoilModel_v2.CCModel}\newline \noindent\url{http://www.bioinf.jku.at/software/procoil/PrOCoilModelBA_v2.CCModel} \end{quotation} \noindent It is also possible to read them directly using the \texttt{readCCModel()} function: <>= URL <- "http://www.bioinf.jku.at/software/procoil/PrOCoilModel_v2.CCModel" myModel <- readCCModel(URL) @ The present version of the \R\ package includes newer models than the ones reported in \cite{MahrenholzAbfalterBodenhoferVolkmerHochreiter11} (which were used in \PrOCoil\ versions before 2.0.0). However, for backward compatibility, these models can still be downloaded and used with the \PrOCoil\ \R\ package: \begin{quotation} \noindent\url{http://www.bioinf.jku.at/software/procoil/PrOCoilModel_v1.CCModel}\newline \noindent\url{http://www.bioinf.jku.at/software/procoil/PrOCoilModelBA_v1.CCModel} \end{quotation} \subsubsection{Customizing and saving plots}\label{sssec:plots} The \PrOCoil\ \R\ package provides several opportunities to customize plots. For more detailed documentation, see \verb+?plot.CCProfile+. We simply provide an example here that uses legends, shading, custom coloring of profiles, and a plot header: \begin{center} <>= plot(GCN4mut[c(1, 2)], legend=c("wild type", "mutant N16Y,L19T"), col=c(rgb(0.7, 0, 0), rgb(0, 0, 0.8)), main="GCN4 Mutation Analysis", shades=c(rgb(0.77, 0.85, 0.95), rgb(0.99, 0.84, 0.71))) @ \end{center} \R\ graphics have a fixed size and aspect ratio by default. That is why \PrOCoil\ prediction profiles appear quite stretched for shorter and constricted for longer sequences. When exporting profile plots to graphics files, we therefore recommend to adjust the graphics dimensions in order to achieve a consistent and optically pleasing apperance: \begin{itemize} \item Use a fixed value for the {\em height} of the graphics device; we recommend a value between 5'' and 7'' for vector formats (PDF and (encapsulated) PostScript) and a few hundred pixels for bitmap image formats (BMP, JPEG, TIFF, PNG). \item Scale the {\em width} according to the length of the sequence; we recommend 1/24 of the graphic's height per residue as a rule of thumb. \end{itemize} \pagebreak \noindent{\bf Examples:} <>= pdf(file="GCN4wt.pdf", height=6, width=max(nchar(sequences(GCN4wt))) * 6 / 24) plot(GCN4wt) dev.off() bmp(file="GCN4wt.bmp", height=480, width=max(nchar(sequences(GCN4wt))) * 480 / 24) plot(GCN4wt) dev.off() @ \subsubsection{Alternative ways of supplying heptad registers to \texttt{predict()}}\label{sssec:reg} As mentioned briefly in Subsection\ \ref{ssec:getstarted}, the \texttt{predict()} function requires a string that contains the heptad register annotation of the sequence to be classified. Normally, this string is passed as the third argument `\texttt{reg}'. However, the following alternatives exist as well: \begin{itemize} \item If the second argument `\texttt{seq}' is a character vector, the heptad register can also be supplied via the attribute `\texttt{reg}': <>= GCN4wtSeq1 <- "MKQLEDKVEELLSKNYHLENEVARLKKLV" attr(GCN4wtSeq1, "reg") <- "abcdefgabcdefgabcdefgabcdefga" res <- predict(PrOCoilModel, GCN4wtSeq1) @ \item If the second argument `\texttt{seq}' is a single amino acid sequence given as an `\texttt{AAString}' object, the heptad register can be specified via a metadata component `\texttt{reg}': <>= GCN4wtSeq2 <- AAString("MKQLEDKVEELLSKNYHLENEVARLKKLV") metadata(GCN4wtSeq2)$reg <- "abcdefgabcdefgabcdefgabcdefga" res <- predict(PrOCoilModel, GCN4wtSeq2) @ \item If the second argument `\texttt{seq}' is a set of amino acid sequences given as an `\texttt{AAStringSet}' or `\texttt{AAVector}' object, the heptad registers can be specified via a metadata component `\texttt{reg}' as above or via the annotation metadata: <>= GCN4mSeq2 <- AAStringSet(GCN4mSeq) annotationMetadata(GCN4mSeq2, annCharset="abcdefg") <- GCN4mReg res <- predict(PrOCoilModel, GCN4mSeq2) @ \end{itemize} In any case, the `\texttt{reg}' argument has priority over all other ways of specifying the heptad annotation. In other words, if `\texttt{reg}' is specified and `\texttt{seq}' contains heptad annotations in one of the ways described above, the `\texttt{reg}' argument has priority and the heptad annotation in `\texttt{seq}' is ignored. If no heptad register is found in any of the places mentioned above, \texttt{predict()} quits with an error message. \subsection{Upgrading from a version prior to 2.0.0} Version 2.0.0 of the \PrOCoil\ \R\ package has not only brought improved prediction models (see \ref{ssec:models}), but also several changes to the functionality of the package. While the basic functionality of the \texttt{predict()} function and the \texttt{plot()} function remained unchanged, users who upgrade from versions prior to 2.0.0 should take the following caveats into account: \begin{enumerate} \item The classes `\texttt{CCModel}' and `\texttt{CCProfile}' have changed signficantly internally. Objects saved with versions prior to 2.0.0 cannot be used with version 2.0.0 or newer. \item Calling \texttt{predict()} for a sequence that contains multiple coiled coil segments no longer results in a list of `\texttt{CCProfile}' objects, but will result in a single `\texttt{CCProfile}' object that contains the predictions for all segments. `\texttt{CCProfile}' objects are now subsettable to extract a subset of predictions/profiles. \item The \texttt{plot()} method with signature (`\texttt{CCProfile}', `\texttt{CCProfile}') is no longer available for plotting two prediction profiles together. Instead, the user has to call \texttt{predict()} for a sequence object that contains two or more sequences. Consequently, all predictions are returned in a single `\texttt{CCProfile}' object (see \ref{ssec:compmut} for examples). \item The argument `\texttt{legend.pos}' of the \texttt{plot()} method has been renamed to `\texttt{legendPos}' for the sake of compatibility with the KeBABS package. The argument `\texttt{rng}' is no longer available. \item The \texttt{readCCModel()} function expects patterns in a slightly different format now: instead of patterns that only refer to the heptad position of the first amino acid of the pattern (e.g.\ `\texttt{N..La}'), \texttt{readCCModel()} now expects the patterns to specify also the heptad position of the second amino acid in the pattern (e.g.\ `\texttt{N..La..d}'; see also \ref{ssec:pred}). \end{enumerate} \section{More Details About \PrOCoil} \subsection{How the prediction works}\label{ssec:pred} \PrOCoil's classification models are based on support vector machines \cite{Burges98,CortesVapnik95,MuellerMikaRaetschTsudaSchoelkopf01,SchoelkopfSmola02} with a kernel designed specifically for coiled coil classification: \[ k(x,y)=\sum\limits_{p\in P} N(p,x)\cdot N(p,y) \] In this formula, $x$ and $y$ are the two input sequences that are to be compared, $P$ is a set of coiled coil-specific patterns, and $N(p,x)$ is the number of occurrences of pattern $p$ in sequence $x$. \PrOCoil\ uses the following set of patterns $P$: pairs of amino acids at fixed heptad positions with no more than a maximum number $m$ of residues in between. Internally, these patterns are represented as strings with an amino acid letter on the first position, then a certain number of wildcards (between 0 and $m$ as noted above), then the second amino acid letter, and finally an aligned sequence starting and ending with a letter `\verb+a+'--`\verb+g+' denoting the heptad register position of the first and the last amino acid, e.g.\ ``\verb+N..La..d+''. This pattern matches a coiled coil sequence if the sequence has an `\verb+N+' (Asparagine) at an `\verb+a+' position and an `\verb+L+' (Leucine) at the next `\verb+d+' position. For instance, the GCN4 wild type has one occurrence of this pattern: \begin{quote} {\footnotesize\begin{verbatim} MKQLEDKVEELLSKNYHLENEVARLKKLV abcdefgabcdefgabcdefgabcdefga N..L a d \end{verbatim}} \end{quote} So, obviously, \PrOCoil\ considers pair interactions of amino acids at given heptad positions which are no more than $m$ positions apart. The kernel described above resembles the spatial sample kernel \cite{KuksaHuangPavlovic08} (however, with a heptad-specific property) and the kernel described in \cite{FongKeatingSingh04} (however, considering interactions within one sequence and not restricting to a particular subset of interactions). The two models included in \PrOCoil\ further employ kernel normalization \cite{SchoelkopfTsudaVert04} to correct for different sequence lengths. This means that the support vector machines have been trained with the kernel \[ k'(x, y)=\frac{k(x, y)}{\sqrt{k(x, x)\cdot k(y,y)}}, \] where $k(.,.)$ is the {\em coiled coil kernel} described above. Using an explicit representation of the feature mapping underlying the kernel, the support vector machines have been transformed into linear classifiers on sequence features (cf.\ \cite{BodenhoferSchwarzbauerIonescuHochreiter09} for details). Thus, \PrOCoil's prediction models consist of weights for specific sequence features (i.e.\ {\em patterns}) and a constant offset $b$. Let us assume that $x$ denotes a new coiled coil sequence. Without kernel normalization, the discriminant function value of the new sequence $x$ is given as \[ f(x)=b+\sum\limits_{p\in P} N(p,x)\cdot w(p), \] where $b$ is the constant offset of the support vector machine and $w(p)$ is the weight of pattern $p$. If kernel normalization is employed, the following, slightly more complicated, representation is obtained: \[ f(x)=b+\left(\sum\limits_{p\in P} N(p,x)\cdot w(p)\right)\Big/\underbrace{\sqrt{\sum_{p\in P} N(p,x)^2}}_{=R(x)} \] Obviously, $R(x)$ is the value that corrects for the sequence length. The longer the sequence, the larger $R(x)$. \subsection{\PrOCoil's built-in models}\label{ssec:models} As already mentioned above, \PrOCoil\ provides a default model that is optimized for classification accuracy (object `\verb+PrOCoilModel+' in the \PrOCoil\ \R\ package) and an alternative model that is optimized for balanced accuracy (object `\verb+PrOCoilModelBA+'). Both models have been trained on the same training set which consisted of verified coiled coils from the RCSB Protein Data Bank (PDB)% \footnote{\url{http://www.rcsb.org/}} \cite{BermanWestbrookFengGillilandBhatWeissigShindyalov00} enriched by putatively similar sequences that were obtained by BLAST \cite{AltschulGishMillerMyersLipman90} in conjunction with Marcoil \cite{DelorenziSpeed02}. The default model was created with the KeBABS package \cite{PalmeHochreiterBodenhofer15} using the coiled coil kernel mentioned above (in the lingo of the KeBABS package called {\em annotation-specific gappy pair kernel}) and the frontend to LibLineaR package, which provides an R interface to LIBLINEAR \cite{FanChangHsiehWangLin08}. The current models (since Version 2.0.0) use the normalized coiled coil kernel with $m=5$ and a penalty parameter of $C=2$. The alternative modes was created with PSVM \cite{HochreiterObermayer06} using the normalized coiled coil kernel with $m=8$, class balancing, and regularization parameters $C=8$ and $\varepsilon=0.8$. In the \PrOCoil\ \R\ package, all parameters describing a model are stored in the corresponding `\texttt{CCModel}' object: <>= PrOCoilModel weights(PrOCoilModel)["N..La..d"] PrOCoilModelBA weights(PrOCoilModelBA)["N..La..d"] @ \noindent In both models, the pattern weights are sorted decreasingly, i.e., from most trimeric to most dimeric. Thus, the user also has easy access to the most indicative patterns. Here we provide an example how to extract the 25 most trimeric (i.e.\ the first 25 patterns in the list) and the 25 most dimeric patterns (i.e.\ the last 25 patterns in the list) from the \PrOCoil model: <>= noP <- length(weights(PrOCoilModel)) names(weights(PrOCoilModel))[1:25] names(weights(PrOCoilModel))[noP:(noP - 24)] @ \subsection{How prediction profiles are obtained}\label{ssec:profiles} Regardless of whether kernel normalization is employed or not, the essential component of the discriminant function is the sum \[ \sum\limits_{p\in P} N(p,x)\cdot w(p). \] It is obvious that every match of a pattern $p$ contributes $w(p)$ to the sum. In order to find out the extent to which each residue contributes to the final classification, we can rewrite the sum as \begin{equation}\label{eq:profile} \sum\limits_{p\in P} N(p,x)\cdot w(p)=\sum\limits_{i=1}^L c_i(x), \end{equation} where $L$ is the length of the sequence $x$ and $c_i(x)$ is the contribution of the $i$-th residue of $x$. The contribution $c_i(x)$ can simply be computed as half the sum of weights of patterns matching the $i$-th residue.\footnote{As all patterns match two distinct residues, each weight must be split to the two residues in order to ensure that the equality \eqref{eq:profile} holds.} The weights $c_i(x)$ (or $c_i(x)/R(x)$ in case kernel normalization is employed) can be understood as a profile that can be plot over the sequence. We have already introduced these values as {\em prediction profiles} above. In order to have a unified terminology, let us denote the prediction profiles as \[ s_i(x)= \begin{cases} c_i(x)/R(x) & \text{if kernel normalization is employed,}\\ c_i(x) & \text{otherwise.} \end{cases} \] A positive values $s_i(x)$ indicates that the $i$-th residue is participating more strongly in patterns that are indicative for trimers. A negative values $s_i(x)$ indicates that the $i$-th residue is participating more strongly in patterns that are indicative for dimers. A value $s_i(x)$ or zero or close to zero either means that the $i$-th residue is not participating in any indicative patterns or that it participates in dimer and trimer patterns the contributions of which compensate/cancel each other. \PrOCoil\ computes prediction profiles each time it computes a prediction. It is clear that we can recover the discriminant value $f(x)$ as follows: \[ f(x)=b+\sum\limits_{i=1}^L s_i(x) \] If $b\not=0$, which is the normal case, the values $s_i(x)$ do not provide enough information to infer the final classification from the prediction profile. In order to facilitate a more sensible analysis, let us reformulate the above equality as \[ f(x)=\sum\limits_{i=1}^L \big(s_i(x)+\frac{b}{L}\big)=\sum\limits_{i=1}^L \big(s_i(x)-(-\frac{b}{L})\big). \] Hence, if we plot the profile values $s_i(x)$ along with a horizontal line at $-\frac{b}{L}$, we can recover the discriminant value $f(x)$ as the difference of the area above the $-\frac{b}{L}$ minus the area below $-\frac{b}{L}$. The value $-\frac{b}{L}$ exactly corresponds to the ``base line'' mentioned above. \section{Acknowledgments} Several colleagues have contributed to \PrOCoil\ and its implementation: {\bf Carsten C.\ Mahrenholz}, {\bf Ingrid G.\ Abfalter}, {\bf Rudolf Volkmer}, and {\bf Sepp Hochreiter}, who are also the co-authors of the first \PrOCoil\ paper \cite{MahrenholzAbfalterBodenhoferVolkmerHochreiter11}, have contributed to the inception and success of \PrOCoil\ with their biochemical knowledge, machine learning expertise, and by putting together the first coiled data set that was published along with \cite{MahrenholzAbfalterBodenhoferVolkmerHochreiter11} and used for training he original \PrOCoil\ models. The data set underlying the \PrOCoil\ 2.0.0 models was created by {\bf Annette Jacyszyn} who also performed the hyperparameter selection and training of the new models. Last but definitely not least, the new version of the package would have been completely impossible without {\bf Johannes Palme}'s amazing KeBABS package. \section{How to Cite PrOCoil} If you use PrOCoil for research that is published later, you are kindly asked to cite it as follows: \begin{quotation} \noindent C.\ C. Mahrenholz, I.\ G. Abfalter, U.\ Bodenhofer, R.\ Volkmer, and S.\ Hochreiter. \newblock Complex networks govern coiled coil oligomerization --- predicting and profiling by means of a machine learning approach. \newblock {\em Mol. Cell. Proteomics} 10(5), 2011. \newblock DOI:\ \href{http://dx.doi.org/10.1074/mcp.M110.004994}{10.1074/mcp.M110.004994} \end{quotation} \noindent To obtain this reference in Bib\TeX\ format, you can enter the following into your R session: <>= toBibtex(citation("procoil")) @ %\bibliographystyle{plain} %\bibliography{Bioinformatics,MachineLearningClassical,Biology,BodenhoferPub} \begin{thebibliography}{10} \bibitem{AltschulGishMillerMyersLipman90} S.~F. Altschul, W.~Gish, W.~Miller, E.~W. Myers, and D.~J. Lipman. \newblock Basic local alignment search tool. \newblock {\em J. Mol. Biol.}, 215:403--410, 1990. \bibitem{BergerWilsonWolfTonchevMillaKim95} B.~Berger, D.~B. Wilson, E.~Wolf, T.~Tonchev, M.~Milla, and P.~S. Kim. \newblock Predicting coiled coils by use of pairwise residue correlations. \newblock {\em Proc. Natl. Acad. Sci. USA}, 92:8259--8263, 1995. \bibitem{BermanWestbrookFengGillilandBhatWeissigShindyalov00} H.~M. Berman, J.~Westbrook, Z.~Feng, G.~Gilliland, T.~N. Bhat, H.~Weissig, I.~N. Shindyalov, and P.~E. Bourne. \newblock The {P}rotein {D}ata {B}ank. \newblock {\em Nucleic Acids Res.}, 28(1):235--242, 2000. \bibitem{BodenhoferSchwarzbauerIonescuHochreiter09} U.~Bodenhofer, K.~Schwarzbauer, M.~Ionescu, and S.~Hochreiter. \newblock Modeling position specificity in sequence kernels by fuzzy equivalence relations. \newblock In J.~P. Carvalho, D.~Dubois, U.~Kaymak, and J.~M.~C. Sousa, editors, {\em Proc. Joint 13th IFSA World Congress and 6th EUSFLAT Conference}, pages 1376--1381, Lisbon, July 2009. \bibitem{Burges98} C.~J.~M. Burges. \newblock A tutorial on support vector machines for pattern recognition. \newblock {\em Data Min. Knowl. Discov.}, 2:121--167, 1998. \bibitem{CortesVapnik95} C.~Cortes and V.~N. Vapnik. \newblock Support vector networks. \newblock {\em Machine Learning}, 20:273--297, 1986. \bibitem{DelorenziSpeed02} M.~Delorenzi and T.~Speed. \newblock An {HMM} model for coiled-coil domains and a comparison with {PSSM}-based predictions. \newblock {\em Bioinformatics}, 18(4):617--625, 2002. \bibitem{FanChangHsiehWangLin08} R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. \newblock {LIBLINEAR:} a library for large linear classification. \newblock {\em J. Mach. Learn. Res.}, 9:1871--1874, 2008. \bibitem{FongKeatingSingh04} J.~H. Fong, A.~E. Keating, and M.~Singh. \newblock Predicting specificity in {bZIP} coiled-coil protein interactions. \newblock {\em Genome Biol.}, 5:R11, 2004. \bibitem{HochreiterObermayer06} S.~Hochreiter and K.~Obermayer. \newblock Support vector machines for dyadic data. \newblock {\em Neural Comput.}, 18:1472--1510, 2006. \bibitem{KuksaHuangPavlovic08} P.~Kuksa, P.-H. Huang, and V.~Pavlovic. \newblock A fast, large-scale learning method for protein sequence classification. \newblock In {\em 8th Int. Workshop on Data Mining in Bioinformatics}, pages 29--37, Las Vegas, NV, 2008. \bibitem{MahrenholzAbfalterBodenhoferVolkmerHochreiter11} C.~C. Mahrenholz, I.~G. Abfalter, U.~Bodenhofer, R.~Volkmer, and S.~Hochreiter. \newblock Complex networks govern coiled coil oligomerization --- predicting and profiling by means of a machine learning approach. \newblock {\em Mol. Cell. Proteomics}, 10(5):M110.004994, 2011. \bibitem{McDonnellJiangKeatingBerger06} A.~V. McDonnell, T.~Jiang, A.~E. Keating, and B.~Berger. \newblock Paircoil2: improved prediction of coiled coils from sequence. \newblock {\em Bioinformatics}, 22(3):356--358, 2006. \bibitem{MuellerMikaRaetschTsudaSchoelkopf01} K.-R. M\"uller, S.~Mika, G.~R\"atsch, K.~Tsuda, and B.~Sch\"olkopf. \newblock An introduction to kernel-based learning algorithms. \newblock {\em IEEE Trans. Neural Networks}, 12(2):181--201, 2001. \bibitem{PalmeHochreiterBodenhofer15} J.~Palme, S.~Hochreiter, and U.~Bodenhofer. \newblock {KeBABS}: an {R} package for kernel-based analysis of biological sequences. \newblock {\em Bioinformatics}, 31(15):2574--2576, 2015. \bibitem{SchoelkopfSmola02} B.~Sch\"olkopf and A.~J. Smola. \newblock {\em Learning with Kernels}. \newblock Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA, 2002. \bibitem{SchoelkopfTsudaVert04} B.~Sch\"olkopf, K.~Tsuda, and J.-P. Vert, editors. \newblock {\em Kernel Methods in Computational Biology}. \newblock MIT Press, Cambridge, MA, 2004. \bibitem{WalshawWoolfson01} J.~Walshaw and D.~N. Woolfson. \newblock {SOCKET:} a program for identifying and analysing coiled-coil motifs within protein structures. \newblock {\em J. Mol. Biol.}, 307(5):1427--1450, 2001. \end{thebibliography} \end{document}