\documentclass[article]{bioinf}

\usepackage{amsmath}
\usepackage{hyperref}

\hypersetup{colorlinks=false,
   pdfborder=0 0 0,
   pdftitle={PrOCoil - A Web Service and an R Package for Predicting the Oligomerization of Coiled Coil Proteins},
   pdfauthor={Ulrich Bodenhofer}}

\title{PrOCoil --- A Web Service and an \R\ Package for\\ Predicting the Oligomerization of\\ Coiled Coil Proteins}
\author{Ulrich Bodenhofer}
\affiliation{Institute of Bioinformatics, Johannes Kepler University
Linz\\Altenberger Str. 69, 4040 Linz, Austria\\
\email{procoil@bioinf.jku.at}}

\newcommand{\PrOCoil}{PrOCoil}
\newcommand{\R}{R}

%\VignetteIndexEntry{PrOCoil - A Web Service and an R Package for Predicting the Oligomerization of Coiled-Coil Proteins}
%\VignetteDepends{methods, stats, graphics, utils}
%\VignetteEngine{knitr::knitr}

\setlength{\fboxrule}{1.5pt}

\begin{document}
<<LoadPackageToDetermineVersion,echo=FALSE,message=FALSE,results='hide'>>=
options(width=72)
knitr::opts_knit$set(width=72)
set.seed(0)
library(procoil, quietly=TRUE)
procoilVersion <- packageDescription("procoil")$Version
procoilDateRaw <- packageDescription("procoil")$Date
procoilDateYear <- as.numeric(substr(procoilDateRaw, 1, 4))
procoilDateMonth <- as.numeric(substr(procoilDateRaw, 6, 7))
procoilDateDay <- as.numeric(substr(procoilDateRaw, 9, 10))
procoilDate <- paste(month.name[procoilDateMonth], " ",
                     procoilDateDay, ", ",
                     procoilDateYear, sep="")
@
\newcommand{\PrOCoilVer}{\Sexpr{procoilVersion}}
\newcommand{\PrOCoilDate}{\Sexpr{procoilDate}}
\manualtitlepage[Version \PrOCoilVer, \PrOCoilDate]

\section*{Scope and Purpose of this Document}

This document is a user manual for \PrOCoil, the software suite accompanying
the paper \cite{MahrenholzAbfalterBodenhoferVolkmerHochreiter11}.
It provides a gentle
introduction into how to use \PrOCoil. Not all features of the \R\
package are described in full detail. Such details can be obtained
from the documentation enclosed in the  \R\ package. Further note
the following: (1) this is not an introduction to coiled coil
proteins; (2) this is not an introduction to \R; (3) this is not an
introduction to support vector machines. If you lack the background
for understanding this manual, you first have to read introductory
literature on the subjects mentioned above.

\vspace{1cm}

\newlength{\auxparskip}
\setlength{\auxparskip}{\parskip}
\setlength{\parskip}{0pt}
\tableofcontents
\clearpage
\setlength{\parskip}{\auxparskip}

\newlength{\Nboxwidth}
\setlength{\Nboxwidth}{\textwidth}
\addtolength{\Nboxwidth}{-2\fboxrule}
\addtolength{\Nboxwidth}{-2\fboxsep}

\newcommand{\notebox}[1]{%
\begin{center}
\fcolorbox{bioaz}{biowh}{\begin{minipage}{\Nboxwidth}
\noindent{\sffamily\bfseries Note:} #1
\end{minipage}}
\end{center}}

\section{Introduction}

This user manual describes the \PrOCoil\ software suite that accompanies
the paper \cite{MahrenholzAbfalterBodenhoferVolkmerHochreiter11}. This
software is concerned with analyzing coiled coil sequences in terms
of their oligomerization behavior. \PrOCoil\ does not only provide a
prediction, but also detailed insights into which residues or
sub-sequences are responsible for the predicted oligomerization.

\PrOCoil\ is available both as an easy-to-use Web interface and an
\R\ package. Both variants offer the same prediction and analysis
facilities. The following table summarizes the essential
differences:

\begin{center}
\begin{tabular}{|p{0.4\textwidth}|p{0.4\textwidth}|}
\hline
\PrOCoil\ Web interface & \PrOCoil\ \R\  package\\
\hline \hline
can be used instantly on any computer with Internet access and a Web browser &
requires \R\  and installation of package {\ttfamily
procoil}
\\
\hline
supports only the standard \PrOCoil\ model &
supports the standard \PrOCoil\ model and the alternative model optimized
for balanced accuracy; other models can be loaded from files
\\
\hline
graphics are produced in a non-customizable standard format
& graphics are customizable
\\
\hline
only single sequences or at most pairs of sequences with the same heptad register
can be analyzed at a time &
multiple sequences can be analyzed per run
\\
\hline
beside standard input (amino acid sequence $+$ aligned heptad
annotation), also PairCoil2 output format is
supported; Marcoil output can be used upon a separate
pre-processing step &
only standard input (amino acid sequence(s) $+$ aligned heptad
annotation(s))
\\
\hline
\end{tabular}
\end{center}

We recommend beginners to use the Web interface. Experienced users
can benefit from the greater flexibility of the \R\ package. \R\ package users
can use the Web interface for converting PairCoil2
and/or Marcoil output into the input format the \R\  package
requires.

\section{Input Data}\label{sec:input}

As already mentioned, \PrOCoil\ predicts whether a given coiled coil
segment of an amino acid sequence is more likely to form a dimer or
trimer. Such a segment has to consist of an amino acid
(sub-)sequence and an aligned heptad annotation of the same length.
As an example, the GCN4 yeast transcription factor is a dimer
consisting of two equal sequences (i.e.\ it is a homo-dimer), the
coiled coil parts of which (according to  SOCKET
\cite{WalshawWoolfson01}) look as follows:
\begin{quote}
{\footnotesize\begin{verbatim}
MKQLEDKVEELLSKNYHLENEVARLKKLV
abcdefgabcdefgabcdefgabcdefga
\end{verbatim}}
\end{quote}
The heptad annotation is essential for \PrOCoil\ to work and cannot
be omitted. The letters `\verb+a+'--`\verb+g+' correspond to the
usual annotation of positions within the heptad motif. \PrOCoil\ can
also process full amino acid sequences containing one or more coiled
coil segments as sub-sequences, where the coiled coil segments must
be annotated with the usual symbols `\verb+a+'--`\verb+g+' and
non-coiled coil residues must be annotated with dashes `\verb+-+'.
As an example, let us consider the following sample sequence (annotation
derived from running Marcoil%
\footnote{\url{http://www.isrec.isb-sib.ch/webmarcoil/webmarcoilC1.html}}
\cite{DelorenziSpeed02} on this sequence with a cut-off of 50\%):
\begin{quote}
{\footnotesize\begin{verbatim}
MGECDQLLVFMITSRVLVLSTLIIMDSRQVYLENLRQFAENLRQNIENVHSFLENLRADLENLRQKFPGKWYSAMPGRHG
-------------------------------abcdefgabcdefgabcdefgabcdefgabcdefg--------------
\end{verbatim}}
\end{quote}
If the heptad annotation contains dashes, \PrOCoil\ extracts all
coiled coil regions (contiguous non-dash regions) and classifies
them separately. From the above sample sequence, \PrOCoil\ will
extract and classify the following coiled coil segment (residues
32--66 of the above example):
\begin{quote}
{\footnotesize\begin{verbatim}
LENLRQFAENLRQNIENVHSFLENLRADLENLRQK
abcdefgabcdefgabcdefgabcdefgabcdefg
\end{verbatim}}
\end{quote}

\section{Predictions Using the Web Interface}\label{sec:webpredict}

The \PrOCoil\ Web interface can be accessed directly via the
\PrOCoil\
homepage.\footnote{\url{http://www.bioinf.jku.at/software/procoil/}}
The page contains an input field that is pre-filled with the GCN4
wild type as an example:

\begin{center}
\fcolorbox{bioaz}{biowh}{\includegraphics[scale=0.46]{Input}}
\end{center}

The \PrOCoil\ server first checks your input for validity (see
Section\ \ref{sec:input}). If the input is correct, all coiled coil
segments are extracted from the input and analyzed separately. For
the standard example, the GCN4 wild type, the output looks as
follows:

\begin{center}
\fcolorbox{bioaz}{biowh}{\includegraphics[scale=0.46]{GCN4output}}
\end{center}

The output of the \PrOCoil\ Web interface is structured into three
sections:
\begin{description}
\item[Input data:] amino acid sequence and heptad annotation of the coiled
coil segment;
\item[Classification result:] discriminant function value and final
classification; a positive value means that the sequence is classified
as trimer, a negative value means that the sequence is classified as
dimer. The higher the absolute value, the clearer the oligomerization
tendency is. The closer the value is to zero, the more the sequence can be
considered a borderline case.
\item[Prediction profile:] this plot shows
the contribution of each residue to the discriminant function value.
The more positive a value is for a residue, the more this residue
contributes to an oligomerization tendency towards trimers. The more
negative the value is, the more this residue contributes to an
oligomerization tendency towards dimers. The horizontal line
corresponds to the base line of the classifier (see \ref{ssec:profiles}).
The discriminant
function value is actually obtained as the area above the grey
baseline minus the area below the grey baseline. The links below the
prediction profile plot allow for downloading the profile in various
formats.
\end{description}

\notebox{Values in the prediction profile cannot be
understood as general rules for which oligomerization behavior a
given amino acid at a given heptad position is indicative for.
\PrOCoil\ takes pairwise interactions of amino acids into account.
Therefore, the values in the prediction profile are always to be
considered in the context of the given sequence. The same amino acid
at the same heptad position might have a completely different
value in the prediction profile of a different sequence.}

If the heptad annotation contains at least one dash `\verb+-+',
\PrOCoil\ first extracts all coiled coil segments, i.e.\ all
contiguous sub-sequences with no dashes in the heptad annotation.
Then all these coiled coil segments are analyzed independently and
the results are presented sequentially in the order they appear in
the input sequence in the same format as above. The only notable
difference is that the section ``Input data'' shows the complete
input sequence, whereas the extracted coiled coil segment is
displayed in a separate section ``Coiled coil segment'' along with
its consecutive number and the corresponding start and end positions
in the input sequence. For the above sample sequence, the output
looks as follows:
\begin{center}
\fcolorbox{bioaz}{biowh}{\includegraphics[scale=0.46]{MarcoilOut}}
\end{center}

The \PrOCoil\ Web interface also facilitates easy analysis of mutations of
coiled coil segments. This feature is limited to two sequences at a time.
The two sequences must have the same length and heptad register. If these
conditions are met, the second sequence can simply be written underneath
the heptad register (with the first sequence remaining above the heptad
register):
\begin{center}
\fcolorbox{bioaz}{biowh}{\includegraphics[scale=0.46]{MutationAnalysisInput}}
\end{center}

The output is structured as described for single sequences. The section
``Classification result'' shows discriminant function values and predictions
of both sequences along with a comparison whether the second sequence is
more dimeric or more trimeric than the first sequence. Only one profile
plot is produced in which both profiles are visualized, the profile of
the first sequence in red and the second sequence's profile in blue:
\begin{center}
\fcolorbox{bioaz}{biowh}{\includegraphics[scale=0.46]{MutationAnalysisOutput}}
\end{center}


\section{Preprocessing Predicted Coiled Coil Segments Using the Web
interface}\label{sec:preproc}

We expect \PrOCoil\ to be used mainly for sequences the exact 3D
structures of which are unknown (otherwise the correct
oligomerization can be determined using SOCKET
\cite{WalshawWoolfson01}). For structurally unresolved sequences, a
coiled coil predictor must be used first to determine the coiled
coil segments and their presumed heptad annotations. Currently the
two most commonly used predictors for this task are {\bf Marcoil}
\cite{DelorenziSpeed02} and {\bf PairCoil2}
\cite{BergerWilsonWolfTonchevMillaKim95,McDonnellJiangKeatingBerger06}.
Both programs have their own output formats that do not comply with
the format described in Section\ \ref{sec:input}. In order to ease
integration with Marcoil and PairCoil2, the \PrOCoil\ Web interface
offers preprocessing tools that allow to use the outputs of Marcoil
and PairCoil2 for processing with \PrOCoil.

\subsection{Processing PairCoil2 results}

The PairCoil2 Web
interface\footnote{\url{http://groups.csail.mit.edu/cb/paircoil2/paircoil2.html}}
produces the following kind of output for the Marcoil sample
sequence:
\begin{center}
\fcolorbox{bioaz}{biowh}{\includegraphics[scale=0.46]{PairCoil2output}}
\end{center}

If you click the link ``Positions and registers'', a text file with the
predicted heptad annotation is displayed. Select
the lines with the amino acids and heptad annotations in the following
way:
\begin{center}
\fcolorbox{bioaz}{biowh}{\includegraphics[scale=0.46]{PairCoil2sel}}
\end{center}

Now copy this selection into the
input field of the \PrOCoil\ Web interface. If you check ``Accept
PairCoil2 output format'', you can directly process the PairCoil2
output with \PrOCoil:
\begin{center}
\fcolorbox{bioaz}{biowh}{\includegraphics[scale=0.46]{PairCoil2input}}
\end{center}


\subsection{Processing Marcoil results}

If you submit a sequence to the Marcoil Web
interface,\footnote{\url{http://toolkit.tuebingen.mpg.de/marcoil}} the
results are shown in your Web browser. These results cannot be
processed by \PrOCoil\ directly, since coiled coil segments must first
be singled out by applying a probability cut-off threshold.
This can be one as follows: first go to the ``Prob List'' tab and click ``Export'',
then a plain text file with results is shown as follows:
\begin{center}
\fcolorbox{bioaz}{biowh}{\includegraphics[scale=0.46]{Marcoil}}
\end{center}
As in the above screenshot, select the four-column table with
residue numbers, residues, probabilities, and most probable heptad positions.
Then copy and paste it into the text field at the bottom of the
\PrOCoil\ Web page as follows:
\begin{center}
\fcolorbox{bioaz}{biowh}{\includegraphics[scale=0.46]{MarcoilPreProc}}
\end{center}

After clicking ``Submit'', an output page is displayed that shows
the result of applying the chosen probability cut-off in a read-only
field at the bottom:
\begin{center}
\fcolorbox{bioaz}{biowh}{\includegraphics[scale=0.46]{MarcoilPreProcOut}}
\end{center}
If you are satisfied with the result, you
can directly pass the data to \PrOCoil\ by clicking ``Proceed to
PrOCoil''. If you think you should have used a different
threshold, your input is displayed in the input field on top of the page.
Select a different threshold until the result meets your expectation and then
pass the data to \PrOCoil.

\clearpage

\section{\PrOCoil\ \R\ Package}

\subsection{Installation}

The \PrOCoil\ \R\ package (current version: \PrOCoilVer) is
available via Bioconductor. The simplest way to install the package
is the following:
<<InstallPrOCoil,eval=FALSE>>=
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("procoil")
@

If you wish to install the package manually instead, you can download
the package archive that fits best to your computer system
from the Bioconductor homepage.

\subsection{Getting started}\label{ssec:getstarted}

To load the \PrOCoil\ package, enter
<<LoadPrOCoil,eval=FALSE>>=
library(procoil)
@
\noindent in your \R\ session. If this command terminates without any error
message or warning, you can be sure that the \PrOCoil\ package has
been installed successfully. If so, the \PrOCoil\ package is ready
for use now and you can start predicting the oligomerization of
coiled coils.

As a first example, the following command makes a prediction for the
GCN4 wild type already used above:
<<PredictGCN4WildType>>=
GCN4wt <- predict(PrOCoilModel,
                  "MKQLEDKVEELLSKNYHLENEVARLKKLV",
                  "abcdefgabcdefgabcdefgabcdefga")
@

The object `\texttt{PrOCoilModel}' is an object of class
`\texttt{CCModel}' in which the \PrOCoil\ model is stored.
Since \texttt{predict()}
is a generic function that determines the function to call from
the type of the first argument, it is essential that you provide an
object of class `\texttt{CCModel}' as first argument of
\texttt{predict()}. For more information about `\texttt{CCModel}'
objects, enter \texttt{?CCModel}.

Note that the sequence need not be a plain character string.
Firstly, multiple sequences can be supplied at a time as a character vector, but
also via objects of several other classes. Heptad registers need not be supplied
via the `\texttt{reg}' argument. Instead, they can be attached to the
sequence objects via attributes or metadata (see \ref{sssec:reg}
for details).

As mentioned in Section\ \ref{sec:input}, the sequence and the register annotation must
have equal lengths. The function \texttt{predict()}
creates an object of class `\texttt{CCProfile}' in which prediction results
along with various additional information is stored. To obtain more information
about this class, enter \texttt{?CCProfile}. Basic information about
the result can be displayed by \texttt{show(GCN4wt)} or simply by entering
the name of the object:
<<DisplayResultForGCN4WildType>>=
GCN4wt
@
\noindent The discriminant value and its interpretation are the same as described in
Section\ \ref{sec:webpredict} above.

A prediction profile can be plotted simply as follows:
\begin{center}
<<PlotResultForGCN4WildType,fig.width=7,fig.height=5,out.width='0.6\\textwidth'>>=
plot(GCN4wt)
@
\end{center}
\noindent The \texttt{plot()} function for `\texttt{CCProfile}' objects
provides various ways for customizing the plot and for writing directly to
graphics files (see also \ref{sssec:plots} and the documentation available via
\verb+?plot.CCProfile+).

\subsection{Predictions for non-trimmed sequences containing coiled
coil segments}

Like the Web version described above, the \R\ package is also
capable of handling sequences that contain non-coiled-coil
sub-sequences. If the heptad annotation contains at least one dash
``\texttt{-}'', \texttt{predict()} first extracts all coiled coil
segments, i.e.\ all contiguous sub-sequences with no dashes in the
heptad annotation. Then all these coiled coil segments are analyzed
independently and the results are returned as a list, the components
of which are the prediction results of the coiled coil segments in
the order they appear in the sequence. Let us consider the Marcoil
sample sequence again:

{\footnotesize
<<PredictMarcoilExample>>=
res <- predict(PrOCoilModel,
"MGECDQLLVFMITSRVLVLSTLIIMDSRQVYLENLRQFAENLRQNIENVHSFLENLRADLENLRQKFPGKWYSAMPGRHG",
"-------------------------------abcdefgabcdefgabcdefgabcdefgabcdefg--------------")
@
}

\noindent The returned object `\texttt{res}' is a `\texttt{CCProfile}'
object that contains the predictions and profiles of all coiled coil segments
that were found in the sequences. In the given example, this is just one segment:
<<DisplayResultForMarcoilExample>>=
res
@
\noindent If the untrimmed sequences were unnamed, the segments will also
remain unnamed in the resulting `\texttt{CCProfile}' object. However, if the
sequences were named, the segments are named according to a specific strategy.
This strategy is best explained by the following example: suppose we have three
sequences that are named `\texttt{S1}', `\texttt{S2}', and `\texttt{S3}', and
assume that sequence `\texttt{S1}' contains one coiled coil segment,
`\texttt{S2}' contains three coiled coil segments, and `\texttt{S3}' contains
no coiled coil segment. Then the resulting `\texttt{CCProfile}' object will
contain predictions and profiles of a total of four coiled coil segments whose
names are `\texttt{S1.1}', `\texttt{S2.1}', `\texttt{S2.2}', and
`\texttt{S2.3}'.

The prediction profile must be plotted for each coiled coil segment separately.
This can be done by accessing the respective entry in the \texttt{CCProfile}
object explicitly:
\begin{center}
<<PlotResultForMarcoilExample,fig.width=8,fig.height=5,out.width='0.7\\textwidth'>>=
plot(res[1])
@
\end{center}

\subsection{Comparative mutation analysis}\label{ssec:compmut}

The \PrOCoil\ \R\ package allows not only to consider input sequences completely
independently of one another, but also allows for overlaying profiles of aligned coiled
coil segments. This feature is useful, for example, for studying effects of
different mutations. As an example, we consider multiple mutations of GCN4:
<<PredictGCN4Mutation>>=
GCN4mSeq <- c("GCN4wt"        ="MKQLEDKVEELLSKNYHLENEVARLKKLV",
              "GCN4_N16Y_L19T"="MKQLEDKVEELLSKYYHTENEVARLKKLV",
              "GCN4_E22R_K27E"="MKQLEDKVEELLSKNYHLENRVARLEKLV",
              "GCN4_V23K_K27E"="MKQLEDKVEELLSKNYHLENEKARLEKLV")
GCN4mReg <- rep("abcdefgabcdefgabcdefgabcdefga", 4)

GCN4mut <- predict(PrOCoilModel, GCN4mSeq, GCN4mReg)
GCN4mut
@
The \texttt{plot()} function allows for plotting two profiles in one plot:
\begin{center}
<<PlotGCN4Mutation,fig.width=7,fig.height=5,out.width='0.6\\textwidth'>>=
plot(GCN4mut[c(1, 2)])
plot(GCN4mut[c(1, 3)])
plot(GCN4mut[c(1, 4)])
@
\end{center}
By default, the first profile is plotted in red and the second profile is plotted in blue.
The first sequence is shown above the graph and the second sequence is
shown below the graph. Black letters correspond to residues that are the same in the
two sequences, whereas mutations are highlighted in the corresponding colors. More information on how to
customize plots is available in \ref{sssec:plots} and the documentation available
via \verb+?plot.CCProfile+.

\notebox{Profile overlays also work if the two sequences are not equally long and/or
if their heptad annotations are not aligned to each other. However, this hardly makes sense,
so the overlay of profiles is primarily made for showing profiles of aligned sequences with
matching heptad registers.}

\pagebreak

For three or more sequences, the profiles can also be visualized as heatmaps:
\begin{center}
<<HeatmapGCN4Mutation,fig.width=8,fig.height=4.5,out.width='0.9\\textwidth'>>=
heatmap(GCN4mut)
@
\end{center}
In this view, profiles/sequences are even hierarchically clustered. The dendrogram is shown to the left
of the heatmap. In the example above, obviously, the two trimers are clustered together as
rows 1--2 and the two dimers are clustered together as rows 3--4. The note from above applies to
heatmaps too: heatmaps make most sense for showing profiles of aligned sequences with
matching heptad registers.

\subsection{Miscellanea}

\subsubsection{Processing predicted coiled coil segments with the \R\ package}

The \PrOCoil\ \R\ package itself is not able to process Marcoil or PairCoil2 output.
Users who want to process prediction results obtained from these programs are
recommended to use the \PrOCoil\ Web interface to convert Marcoil or
PairCoil2 output (see Section\ \ref{sec:preproc}) into the correct input format that can be
processed by the \PrOCoil\ \R\ package.

\subsubsection{Heptad irregularities}

Some coiled coils occurring in nature have heptad irregularities,
e.g.\ incomplete heptads in which an `\verb+a+' position follows
after a `\verb+d+' position. \PrOCoil\ can also process heptad
annotations with such irregularities. In the profile plots created
by the \PrOCoil\ Web interface, such irregularities can only be seen
in the register labels in the middle of the plot. In the profile plots created by
the \PrOCoil\ \R\ package, heptad irregularities are additionally
visualized by a vertical red line between the two positions that do
not conform to the usual regular pattern.
\begin{center}
<<PlotResultForExampleWithHeptadIrregularity,fig.width=7,fig.height=5,out.width='0.6\\textwidth'>>=
plot(predict(PrOCoilModel,
             "LQDTLVRQERPIRKSIEDLRNTV",
             "defgabcdefgabcdabcdefga"))
@
\end{center}

\subsubsection{Alternative models}\label{sssec:models}

Trimers occur less frequently than dimers. Correspondingly, there were
more dimers and trimers in the data set used for training the default
model `\verb+PrOCoilModel+'. Since this model was optimized for (standard)
classification accuracy, it tends to concentrate on classifying the
larger class --- dimers --- properly. That is why the model performs
better in terms of specificity/true negative rate
(correctly classified dimers) than in terms of
sensitivity/true positive rate (correctly classified trimers). In case
one wants to attribute equal importance to sensitivity and
specificity, we have trained a second model that is
optimized for {\em balanced accuracy}, i.e., the average of
sensitivity and specificity (see publication supplement of
\cite{MahrenholzAbfalterBodenhoferVolkmerHochreiter11}).
This model can be used simply by calling \verb+predict()+ with
`\verb+PrOCoilModelBA+' as first argument. Like the
standard model `\verb+PrOCoilModel+', the alternative model
`\verb+PrOCoilModelBA+' is automatically
available once the \PrOCoil\ \R\ package has been loaded.

In case the user wants to define custom models or wishes to use previous
versions of the prediction models, the functions \texttt{readCCModel()} and
\texttt{writeCCModel()} can be used to read/write models from/to plain text
files that can be viewed and also modified. In order to see how such a model
file should be organized, the two models included in the \PrOCoil\ \R\
package are available for download at:
\begin{quotation}
\noindent\url{http://www.bioinf.jku.at/software/procoil/PrOCoilModel_v2.CCModel}\newline
\noindent\url{http://www.bioinf.jku.at/software/procoil/PrOCoilModelBA_v2.CCModel}
\end{quotation}
\noindent It is also possible to read them directly using the \texttt{readCCModel()} function:
<<ReadModelFile,eval=FALSE>>=
URL <- "http://www.bioinf.jku.at/software/procoil/PrOCoilModel_v2.CCModel"
myModel <- readCCModel(URL)
@

The present version of the \R\ package includes newer models than the ones reported in
\cite{MahrenholzAbfalterBodenhoferVolkmerHochreiter11} (which were used in \PrOCoil\ versions
before 2.0.0). However, for backward compatibility,
these models can still be downloaded and used with the \PrOCoil\ \R\ package:
\begin{quotation}
\noindent\url{http://www.bioinf.jku.at/software/procoil/PrOCoilModel_v1.CCModel}\newline
\noindent\url{http://www.bioinf.jku.at/software/procoil/PrOCoilModelBA_v1.CCModel}
\end{quotation}

\subsubsection{Customizing and saving plots}\label{sssec:plots}

The \PrOCoil\ \R\ package provides several opportunities to customize plots.
For more detailed documentation, see \verb+?plot.CCProfile+. We simply
provide an example here that uses legends, shading, custom coloring of profiles, and
a plot header:

\begin{center}
<<CustomPlot,fig.width=7,fig.height=5,out.width='0.6\\textwidth'>>=
plot(GCN4mut[c(1, 2)], legend=c("wild type", "mutant N16Y,L19T"),
     col=c(rgb(0.7, 0, 0), rgb(0, 0, 0.8)), main="GCN4 Mutation Analysis",
     shades=c(rgb(0.77, 0.85, 0.95), rgb(0.99, 0.84, 0.71)))
@
\end{center}

\R\ graphics have a fixed size and aspect ratio by default. That is why \PrOCoil\
prediction profiles appear quite stretched for shorter and constricted for longer
sequences. When exporting profile plots to graphics files, we therefore recommend
to adjust the graphics dimensions in order to achieve a consistent and optically
pleasing apperance:
\begin{itemize}
\item Use a fixed value for the {\em height} of the graphics device; we recommend
a value between 5'' and 7'' for vector formats (PDF and (encapsulated) PostScript)
and a few hundred pixels for bitmap image formats (BMP, JPEG, TIFF, PNG).
\item Scale the {\em width} according to the length of the sequence; we recommend
1/24 of the graphic's height per residue as a rule of thumb.
\end{itemize}

\pagebreak

\noindent{\bf Examples:}
<<PlotProfileToGraphicsFile,eval=FALSE>>=
pdf(file="GCN4wt.pdf", height=6,
    width=max(nchar(sequences(GCN4wt))) * 6 / 24)
plot(GCN4wt)
dev.off()

bmp(file="GCN4wt.bmp", height=480,
    width=max(nchar(sequences(GCN4wt))) * 480 / 24)
plot(GCN4wt)
dev.off()
@

\subsubsection{Alternative ways of supplying heptad registers to
\texttt{predict()}}\label{sssec:reg}

As mentioned briefly in Subsection\ \ref{ssec:getstarted}, the
\texttt{predict()} function requires a string that contains the heptad
register annotation of the sequence to be classified.
Normally, this string is passed as the third argument `\texttt{reg}'.
However, the following alternatives exist as well:
\begin{itemize}
\item If the second argument `\texttt{seq}' is a character vector,
the heptad register can also be supplied via
the attribute `\texttt{reg}':
<<GCN4CharacterExampleWithAttribute>>=
GCN4wtSeq1 <- "MKQLEDKVEELLSKNYHLENEVARLKKLV"
attr(GCN4wtSeq1, "reg") <- "abcdefgabcdefgabcdefgabcdefga"
res <- predict(PrOCoilModel, GCN4wtSeq1)
@
\item If the second argument `\texttt{seq}' is a single amino acid
sequence given as an `\texttt{AAString}' object,
the heptad register can be specified via a metadata component `\texttt{reg}':
<<GCN4AAStringExampleWithMetadata>>=
GCN4wtSeq2 <- AAString("MKQLEDKVEELLSKNYHLENEVARLKKLV")
metadata(GCN4wtSeq2)$reg <- "abcdefgabcdefgabcdefgabcdefga"
res <- predict(PrOCoilModel, GCN4wtSeq2)
@
\item If the second argument `\texttt{seq}' is a set of amino acid sequences
given as an `\texttt{AAStringSet}' or `\texttt{AAVector}' object, the heptad
registers can be specified via a metadata component `\texttt{reg}' as above
or via the annotation metadata:
<<GCN4AAStringSetExampleWithAnnotationMetadata>>=
GCN4mSeq2 <- AAStringSet(GCN4mSeq)
annotationMetadata(GCN4mSeq2, annCharset="abcdefg") <- GCN4mReg
res <- predict(PrOCoilModel, GCN4mSeq2)
@
\end{itemize}
In any case, the `\texttt{reg}' argument has priority over all other
ways of specifying the heptad annotation. In other words,
if `\texttt{reg}' is specified and `\texttt{seq}' contains heptad
annotations in one of the ways described above, the
`\texttt{reg}' argument has priority and the heptad annotation in
`\texttt{seq}' is ignored. If no heptad register is
found in any of the places mentioned above, \texttt{predict()}
quits with an error message.

\subsection{Upgrading from a version prior to 2.0.0}

Version 2.0.0 of the \PrOCoil\ \R\ package has not only brought improved
prediction models (see \ref{ssec:models}), but also several changes to
the functionality of the package. While the basic functionality of
the \texttt{predict()} function and the \texttt{plot()} function
remained unchanged, users who upgrade from versions prior to 2.0.0
should take the following caveats into account:
\begin{enumerate}
\item The classes `\texttt{CCModel}' and `\texttt{CCProfile}' have
  changed signficantly internally. Objects saved with versions prior
  to 2.0.0 cannot be used with version 2.0.0 or newer.
\item Calling \texttt{predict()} for a sequence that contains multiple
  coiled coil segments no longer results in a list of `\texttt{CCProfile}'
  objects, but will result in a single `\texttt{CCProfile}' object that
  contains the predictions for all segments. `\texttt{CCProfile}' objects
  are now subsettable to extract a subset of predictions/profiles.
\item The \texttt{plot()} method with signature (`\texttt{CCProfile}',
  `\texttt{CCProfile}') is no longer available for plotting two
  prediction profiles together. Instead, the user has to call
  \texttt{predict()} for a sequence object that contains two or
  more sequences. Consequently, all predictions are returned in a
  single `\texttt{CCProfile}' object (see \ref{ssec:compmut}
  for examples).
\item The argument `\texttt{legend.pos}' of the \texttt{plot()} method
  has been renamed to `\texttt{legendPos}' for the sake of compatibility
  with the KeBABS package. The argument `\texttt{rng}' is no longer available.
\item The \texttt{readCCModel()} function expects patterns in a slightly
  different format now: instead of patterns that only refer to the heptad
  position of the first amino acid of the pattern (e.g.\ `\texttt{N..La}'),
  \texttt{readCCModel()} now expects the patterns to specify also the heptad
  position of the second amino acid in the pattern (e.g.\
  `\texttt{N..La..d}'; see also \ref{ssec:pred}).
\end{enumerate}

\section{More Details About \PrOCoil}

\subsection{How the prediction works}\label{ssec:pred}

\PrOCoil's classification models are based on support vector machines
\cite{Burges98,CortesVapnik95,MuellerMikaRaetschTsudaSchoelkopf01,SchoelkopfSmola02}
with a kernel designed specifically for coiled coil classification:
\[
k(x,y)=\sum\limits_{p\in P} N(p,x)\cdot N(p,y)
\]
In this formula, $x$ and $y$ are the two input sequences that are to
be compared, $P$ is a set of coiled coil-specific patterns, and $N(p,x)$
is the number of occurrences of pattern $p$ in sequence $x$.

\PrOCoil\ uses the following set of patterns $P$: pairs of amino
acids at fixed heptad positions with no more than a maximum number
$m$ of residues in between. Internally, these patterns are
represented as strings with an amino acid letter on the first
position, then a certain number of wildcards (between 0 and $m$ as
noted above), then the second amino acid letter, and finally an
aligned sequence starting and ending with a letter `\verb+a+'--`\verb+g+'
denoting the heptad register position of the first and the last
amino acid, e.g.\ ``\verb+N..La..d+''. This pattern
matches a coiled coil sequence if the sequence has an `\verb+N+'
(Asparagine) at an `\verb+a+' position and an `\verb+L+' (Leucine)
at the next `\verb+d+' position. For instance, the GCN4 wild type has
one occurrence of this pattern:
\begin{quote}
{\footnotesize\begin{verbatim}
MKQLEDKVEELLSKNYHLENEVARLKKLV
abcdefgabcdefgabcdefgabcdefga
              N..L
              a  d
\end{verbatim}}
\end{quote}

So, obviously, \PrOCoil\ considers pair interactions of amino acids
at given heptad positions which are no more than $m$ positions
apart. The kernel described above resembles the spatial sample
kernel \cite{KuksaHuangPavlovic08} (however, with a heptad-specific
property) and the kernel described in \cite{FongKeatingSingh04}
(however, considering interactions within one sequence and not
restricting to a particular subset of interactions).

The two models included in \PrOCoil\ further employ kernel
normalization \cite{SchoelkopfTsudaVert04} to correct for different
sequence lengths. This means that the support vector machines have
been trained with the kernel
\[
k'(x, y)=\frac{k(x, y)}{\sqrt{k(x, x)\cdot k(y,y)}},
\]
where $k(.,.)$ is the {\em coiled coil kernel} described above.

Using an explicit representation of the feature mapping underlying
the kernel, the support vector machines have been transformed into
linear classifiers on sequence features (cf.\
\cite{BodenhoferSchwarzbauerIonescuHochreiter09} for details). Thus,
\PrOCoil's prediction models consist of weights for specific
sequence features (i.e.\ {\em patterns}) and a constant offset $b$. Let
us assume that $x$ denotes a new coiled coil sequence. Without
kernel normalization, the discriminant function value of the new
sequence $x$ is given as
\[
f(x)=b+\sum\limits_{p\in P} N(p,x)\cdot w(p),
\]
where $b$ is the constant offset of the support vector machine and
$w(p)$ is the weight of pattern $p$.

If kernel normalization is employed, the following, slightly more
complicated, representation is obtained:
\[
f(x)=b+\left(\sum\limits_{p\in P} N(p,x)\cdot
w(p)\right)\Big/\underbrace{\sqrt{\sum_{p\in P} N(p,x)^2}}_{=R(x)}
\]
Obviously, $R(x)$ is the value that corrects for the sequence
length. The longer the sequence, the larger $R(x)$.

\subsection{\PrOCoil's built-in models}\label{ssec:models}

As already mentioned above, \PrOCoil\ provides a default model that
is optimized for classification accuracy (object `\verb+PrOCoilModel+'
in the \PrOCoil\ \R\ package) and an alternative model that is
optimized for balanced accuracy (object `\verb+PrOCoilModelBA+').
Both models have been trained on the same training set which consisted
of verified coiled coils from the RCSB Protein Data Bank (PDB)%
\footnote{\url{http://www.rcsb.org/}}
\cite{BermanWestbrookFengGillilandBhatWeissigShindyalov00} enriched by
putatively similar sequences that were obtained by BLAST
\cite{AltschulGishMillerMyersLipman90} in conjunction with Marcoil
\cite{DelorenziSpeed02}. The default model was created with the KeBABS package
\cite{PalmeHochreiterBodenhofer15} using the coiled coil kernel mentioned above
(in the lingo of the KeBABS package called {\em annotation-specific gappy pair
kernel}) and the frontend to LibLineaR package, which provides an R interface
to LIBLINEAR \cite{FanChangHsiehWangLin08}. The current models (since Version
2.0.0) use the normalized coiled coil kernel with $m=5$ and a penalty parameter
of $C=2$. The alternative modes was created with PSVM
\cite{HochreiterObermayer06} using the normalized coiled coil kernel with $m=8$,
class balancing, and regularization parameters $C=8$ and $\varepsilon=0.8$.

In the \PrOCoil\ \R\ package, all parameters describing a model are
stored in the corresponding `\texttt{CCModel}' object:
<<DisplayModels>>=
PrOCoilModel
weights(PrOCoilModel)["N..La..d"]
PrOCoilModelBA
weights(PrOCoilModelBA)["N..La..d"]
@
\noindent In both models, the pattern weights  are sorted decreasingly,
i.e., from most trimeric to most dimeric. Thus, the user also has easy access
to the most indicative patterns. Here we provide an example how to extract the
25 most trimeric (i.e.\ the first 25 patterns in the list) and the 25 most
dimeric patterns (i.e.\ the last 25 patterns in the list) from the \PrOCoil
model:
<<DisplayWeights>>=
noP <- length(weights(PrOCoilModel))
names(weights(PrOCoilModel))[1:25]
names(weights(PrOCoilModel))[noP:(noP - 24)]
@

\subsection{How prediction profiles are obtained}\label{ssec:profiles}

Regardless of whether kernel normalization is employed or not, the
essential component of the discriminant function is the sum
\[
\sum\limits_{p\in P} N(p,x)\cdot w(p).
\]
It is obvious that every match of a pattern $p$ contributes $w(p)$
to the sum. In order to find out the extent to which each residue
contributes to the final classification, we can rewrite the sum as
\begin{equation}\label{eq:profile}
\sum\limits_{p\in P} N(p,x)\cdot w(p)=\sum\limits_{i=1}^L c_i(x),
\end{equation}
where $L$ is the length of the sequence $x$ and $c_i(x)$ is the
contribution of the $i$-th residue of $x$. The contribution $c_i(x)$
can simply be computed as half the sum of weights of patterns
matching the $i$-th residue.\footnote{As all patterns match two
distinct residues, each weight must be split to the two residues in
order to ensure that the equality \eqref{eq:profile} holds.} The
weights $c_i(x)$ (or $c_i(x)/R(x)$ in case kernel normalization is
employed) can be understood as a profile that can be plot over the
sequence. We have already introduced these values as {\em prediction
profiles} above. In order to have a unified terminology, let us
denote the prediction profiles as
\[
s_i(x)=
\begin{cases}
c_i(x)/R(x) & \text{if kernel normalization is employed,}\\
c_i(x) & \text{otherwise.}
\end{cases}
\]
A positive values $s_i(x)$ indicates that the $i$-th residue is
participating more strongly in patterns that are indicative for
trimers. A negative values $s_i(x)$ indicates that the $i$-th
residue is participating more strongly in patterns that are
indicative for dimers. A value $s_i(x)$ or zero or close to zero
either means that the $i$-th residue is not participating in any
indicative patterns or that it participates in dimer and trimer
patterns the contributions of which compensate/cancel each other.

\PrOCoil\ computes prediction profiles each time it computes a
prediction. It is clear that we can recover the discriminant value
$f(x)$ as follows:
\[
f(x)=b+\sum\limits_{i=1}^L s_i(x)
\]
If $b\not=0$, which is the normal case, the values $s_i(x)$ do not
provide enough information to infer the final classification from
the prediction profile. In order to facilitate a more sensible
analysis, let us reformulate the above equality as
\[
f(x)=\sum\limits_{i=1}^L
\big(s_i(x)+\frac{b}{L}\big)=\sum\limits_{i=1}^L
\big(s_i(x)-(-\frac{b}{L})\big).
\]
Hence, if we plot the profile values $s_i(x)$ along with a
horizontal line at $-\frac{b}{L}$, we can recover the discriminant
value $f(x)$ as the difference of the area above the $-\frac{b}{L}$
minus the area below $-\frac{b}{L}$. The value $-\frac{b}{L}$
exactly corresponds to the ``base line'' mentioned above.

\section{Acknowledgments}

Several colleagues have contributed to \PrOCoil\ and its implementation:
{\bf Carsten C.\ Mahrenholz}, {\bf Ingrid G.\ Abfalter}, {\bf Rudolf Volkmer},
and {\bf Sepp Hochreiter}, who are also the co-authors of the first \PrOCoil\
paper \cite{MahrenholzAbfalterBodenhoferVolkmerHochreiter11}, have contributed
to the inception and success of \PrOCoil\ with their biochemical knowledge,
machine learning expertise, and by putting together the first coiled data set
that was published along with
\cite{MahrenholzAbfalterBodenhoferVolkmerHochreiter11} and used for training
he original \PrOCoil\ models.

The data set underlying the \PrOCoil\ 2.0.0 models was created by
{\bf Annette Jacyszyn} who also performed the hyperparameter selection and training
of the new models.

Last but definitely not least, the new version of the package would have been
completely impossible without {\bf Johannes Palme}'s amazing KeBABS package.


\section{How to Cite PrOCoil}

If you use PrOCoil for research that is published later, you are kindly
asked to cite it as follows:
\begin{quotation}
\noindent C.\ C. Mahrenholz, I.\ G. Abfalter, U.\ Bodenhofer, R.\ Volkmer, and
S.\ Hochreiter.
\newblock Complex networks govern coiled coil oligomerization --- predicting
  and profiling by means of a machine learning approach.
\newblock {\em Mol. Cell. Proteomics} 10(5), 2011.
\newblock DOI:\
\href{http://dx.doi.org/10.1074/mcp.M110.004994}{10.1074/mcp.M110.004994}
\end{quotation}

\noindent To obtain this reference in Bib\TeX\ format, you can enter the
following into your R session:
<<DisplayBibTeXReference,eval=FALSE>>=
toBibtex(citation("procoil"))
@

%\bibliographystyle{plain}
%\bibliography{Bioinformatics,MachineLearningClassical,Biology,BodenhoferPub}

\begin{thebibliography}{10}

\bibitem{AltschulGishMillerMyersLipman90}
S.~F. Altschul, W.~Gish, W.~Miller, E.~W. Myers, and D.~J. Lipman.
\newblock Basic local alignment search tool.
\newblock {\em J. Mol. Biol.}, 215:403--410, 1990.

\bibitem{BergerWilsonWolfTonchevMillaKim95}
B.~Berger, D.~B. Wilson, E.~Wolf, T.~Tonchev, M.~Milla, and P.~S. Kim.
\newblock Predicting coiled coils by use of pairwise residue correlations.
\newblock {\em Proc. Natl. Acad. Sci. USA}, 92:8259--8263, 1995.

\bibitem{BermanWestbrookFengGillilandBhatWeissigShindyalov00}
H.~M. Berman, J.~Westbrook, Z.~Feng, G.~Gilliland, T.~N. Bhat, H.~Weissig,
  I.~N. Shindyalov, and P.~E. Bourne.
\newblock The {P}rotein {D}ata {B}ank.
\newblock {\em Nucleic Acids Res.}, 28(1):235--242, 2000.

\bibitem{BodenhoferSchwarzbauerIonescuHochreiter09}
U.~Bodenhofer, K.~Schwarzbauer, M.~Ionescu, and S.~Hochreiter.
\newblock Modeling position specificity in sequence kernels by fuzzy
  equivalence relations.
\newblock In J.~P. Carvalho, D.~Dubois, U.~Kaymak, and J.~M.~C. Sousa, editors,
  {\em Proc. Joint 13th IFSA World Congress and 6th EUSFLAT Conference}, pages
  1376--1381, Lisbon, July 2009.

\bibitem{Burges98}
C.~J.~M. Burges.
\newblock A tutorial on support vector machines for pattern recognition.
\newblock {\em Data Min. Knowl. Discov.}, 2:121--167, 1998.

\bibitem{CortesVapnik95}
C.~Cortes and V.~N. Vapnik.
\newblock Support vector networks.
\newblock {\em Machine Learning}, 20:273--297, 1986.

\bibitem{DelorenziSpeed02}
M.~Delorenzi and T.~Speed.
\newblock An {HMM} model for coiled-coil domains and a comparison with
  {PSSM}-based predictions.
\newblock {\em Bioinformatics}, 18(4):617--625, 2002.

\bibitem{FanChangHsiehWangLin08}
R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin.
\newblock {LIBLINEAR:} a library for large linear classification.
\newblock {\em J. Mach. Learn. Res.}, 9:1871--1874, 2008.

\bibitem{FongKeatingSingh04}
J.~H. Fong, A.~E. Keating, and M.~Singh.
\newblock Predicting specificity in {bZIP} coiled-coil protein interactions.
\newblock {\em Genome Biol.}, 5:R11, 2004.

\bibitem{HochreiterObermayer06}
S.~Hochreiter and K.~Obermayer.
\newblock Support vector machines for dyadic data.
\newblock {\em Neural Comput.}, 18:1472--1510, 2006.

\bibitem{KuksaHuangPavlovic08}
P.~Kuksa, P.-H. Huang, and V.~Pavlovic.
\newblock A fast, large-scale learning method for protein sequence
  classification.
\newblock In {\em 8th Int. Workshop on Data Mining in Bioinformatics}, pages
  29--37, Las Vegas, NV, 2008.

\bibitem{MahrenholzAbfalterBodenhoferVolkmerHochreiter11}
C.~C. Mahrenholz, I.~G. Abfalter, U.~Bodenhofer, R.~Volkmer, and S.~Hochreiter.
\newblock Complex networks govern coiled coil oligomerization --- predicting
  and profiling by means of a machine learning approach.
\newblock {\em Mol. Cell. Proteomics}, 10(5):M110.004994, 2011.

\bibitem{McDonnellJiangKeatingBerger06}
A.~V. McDonnell, T.~Jiang, A.~E. Keating, and B.~Berger.
\newblock Paircoil2: improved prediction of coiled coils from sequence.
\newblock {\em Bioinformatics}, 22(3):356--358, 2006.

\bibitem{MuellerMikaRaetschTsudaSchoelkopf01}
K.-R. M\"uller, S.~Mika, G.~R\"atsch, K.~Tsuda, and B.~Sch\"olkopf.
\newblock An introduction to kernel-based learning algorithms.
\newblock {\em IEEE Trans. Neural Networks}, 12(2):181--201, 2001.

\bibitem{PalmeHochreiterBodenhofer15}
J.~Palme, S.~Hochreiter, and U.~Bodenhofer.
\newblock {KeBABS}: an {R} package for kernel-based analysis of biological
  sequences.
\newblock {\em Bioinformatics}, 31(15):2574--2576, 2015.

\bibitem{SchoelkopfSmola02}
B.~Sch\"olkopf and A.~J. Smola.
\newblock {\em Learning with Kernels}.
\newblock Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA,
  2002.

\bibitem{SchoelkopfTsudaVert04}
B.~Sch\"olkopf, K.~Tsuda, and J.-P. Vert, editors.
\newblock {\em Kernel Methods in Computational Biology}.
\newblock MIT Press, Cambridge, MA, 2004.

\bibitem{WalshawWoolfson01}
J.~Walshaw and D.~N. Woolfson.
\newblock {SOCKET:} a program for identifying and analysing coiled-coil motifs
  within protein structures.
\newblock {\em J. Mol. Biol.}, 307(5):1427--1450, 2001.

\end{thebibliography}


\end{document}