% NOTE -- ONLY EDIT THE .Rnw FILE!!!  The .tex file is
% likely to be overwritten.
%
%\VignetteIndexEntry{Indentification of functional residues with BGA (HowTo)}
%\VignetteDepends{made4,seqinr}
%\VignetteKeywords{Multivariate Analysis, Graphics, Functional Residues, Alignment}
%\VignettePackage{bgafun}

\newcommand{\PrintVol}[1]{\textbf{#1}}                                                                                
\documentclass[12pt]{article}
                                                                                
                                                                                
\begin{document}
                                                                                
%------------------------------------------------------------
\title{Introduction to identifying functional residues in an alignment using BGAFUN}
%------------------------------------------------------------
\author{Iain Wallace}
\maketitle


Proteins that evolve from a common ancestor can change functionality over time, and it is important to be able identify residues that cause this change. 
There are numerous different methods that can be used to solve this problem. 

This package contains the functions to use supervised multivariate statistical method, Between Group Analysis (BGA) \cite{Culhane:2005:21} to identify these residues from families of proteins 
with different substrate specifities using multiple sequence alignments. 

The main steps involved are 

\begin{enumerate}
\item Reading in the alignment
\item Defining groups
\item Convert alignment into a matrix 
\item Run the Between Groups Analysis
\item Plot the results
\item Identify important positions in the Alignment
\end{enumerate}


\section{Reading in the alignment}

The sequences are read in using the seqinR \cite{Charif:2006:22} commands

<<>>=
library(bgafun)
LDH <- read.alignment(file = system.file("sequences/LDH-MDH-PF00056.fasta", package = "bgafun"), format = "fasta")
class(LDH)
@

\section{Defining groups}

Next, a factor assigning each sequence to a group has to be defined. One method of doing this is shown in the code below. The alignment object is first converted into a matrix, in which the rownames are sequence names
<<>>=
library(bgafun)
data(LDH)
LDH.amino=convert_aln_amino(LDH)
LDH.groups=rownames(LDH.amino)
LDH.groups[grep("LDH",LDH.groups)]="LDH"
LDH.groups[grep("MDH",LDH.groups)]="MDH"
LDH.groups=as.factor(LDH.groups)
LDH.groups
@


\section{Convert alignment into a matrix}
The alignment has to be converted into a matrix that is suitable for Between Group Analysis. There are two different encoding schemes available in this package
\begin{enumerate}
\item Binary 
\item Amino Acid Properties
\end{enumerate}

\subsection{Binary Encoding Scheme}
Each column in the sequence alignment is represented by a binary vector of length 20 which represents the presence or absence of a particular amino acid at this position. 
This results in a data matrix of dimension NxLx20 where N is the number of sequences and L is the length of the alignment. Gaps are removed from the matrix using the remove\_gaps\_groups function. This removes columns that contain more than a certain percentage, gap\_fraction, of gaps in a column in the alignment in any of the defined groups.

<<>>=
library(bgafun)
data(LDH)
data(LDH.groups)
LDH.amino=convert_aln_amino(LDH)
dim(LDH.amino)
LDH.amino.gapless=remove_gaps_groups(LDH.amino,LDH.groups)
dim(LDH.amino.gapless)
@

We found that it is important to add pseudo count as otherwise the absence of amino acids in a column dominates the analysis. There are two methods of adding pseudocounts:
\begin{enumerate}
\item Add 1 to all elements in the matrix
<<>>=
library(bgafun)
data(LDH.amino.gapless)
LDH.pseudo=LDH.amino.gapless+1
dim(LDH.pseudo)
@
\item Use the add\_pseudo\_counts function
<<>>=
library(bgafun)
data(LDH.amino.gapless)
LDH.pseudo=add_pseudo_counts(LDH.amino.gapless,LDH.groups)
dim(LDH.pseudo)
@
\end{enumerate}

\subsection{Amino Acid Properties Encoding Scheme}

Each residue in the alignment is represented by a vector of five continuous variables as given by Atchley et al \cite{Atchley:2005:23}. They applied a multivariate statistic approach to reduce the information in 494 amino acid attributes into a set of five factors for each amino acid. 
\begin{enumerate}
\item Factor A is termed the polarity index. It correlates well with a large variety of descriptors including the number of hydrogen bond donors, polarity versus nonpolarity, and hydrophobicity versus hydrophilicity. 
\item Factor B is a secondary structure index. It represents the propensity of an amino acid to be in a particular type of secondary structure, such as a coil, turn or bend versus the frequency of it in an a-helix. 
\item Factor C is correlated with molecular size, volume and molecular weight. 
\item Factor D reflects the number of codons coding for an amino acid and amino acid composition. These attributes are related to various physical properties including refractivity and heat capacity. 
\item Factor E is related to the electrostatic charge. 
\end{enumerate}

Columns in the alignment containing more than 80\% gaps are removed. When using the AAP encoding, the remaining gaps were replaced with the average value for the subfamily in that column using the function average\_cols\_aap

<<>>=
library(bgafun)
data(LDH)
data(LDH.groups)
LDH.aap=convert_aln_AAP(LDH)
dim(LDH.aap)
LDH.aap.ave=average_cols_aap(LDH.aap,LDH.groups)
dim(LDH.aap.ave)
@


\section{Run the Between Groups Analysis}
Two different ordination techniques can be used, either Principal Components Analysis or Correspondance Analysis. PCA is the most suitable ordination technique for the AAP encoding as it can be used to ordinate data that contains continuous variables. CA, on the other hand, can only be used to ordinate data that contains positive integer values, such as the binary representation of a sequence alignment.
To run the analysis with PCA the run\_between\_pca function is used, 
It requires an binary amino acid matrix which is used to calculate sequence weights, the matrix to be analysised and the group definition

<<>>=
library(bgafun)
data(LDH)
data(LDH.groups)
data(LDH.amino.gapless)
data(LDH.aap.ave)
LDH.aap.ave.bga=run_between_pca(LDH.amino.gapless,LDH.aap.ave,LDH.groups)
class(LDH.aap.ave.bga)

@
To run the analysis with CA, without weights and with very simple pseudocounts

<<>>=
library(bgafun)
data(LDH.groups)
data(LDH.amino.gapless)
LDH.binary.bga=bga(t(LDH.amino.gapless+1),LDH.groups)
@


\section{Plot the results}

The results are plotted using the MADE4 functions. When the labels overlab, like below, the easiest method to clean them is to use Photoshop.

\begin{figure}[htbp]
\begin{center}
<<PCAplot, fig=TRUE>>=
plot(LDH.aap.ave.bga)
@
\caption{\label{figure 1} PCA plot of the LDH example}
\end{center}
\end{figure}
\clearpage

\section{Identify important positions in the Alignment}

Important positions for two groups can be identified by finding the most extreme positions on the axis using the top\_residues\_2\_groups function
<<>>=
top_res=top_residues_2_groups(LDH.binary.bga)
names(top_res)=sub("X","",names(top_res))
@

Important positions for three groups can be identified by finding the most extreme positions that are plotted same direction as the sequences on the graph


Now, to look at the amino acid content in the alignment that the analsyis has identified one can use the following code. It creates a profile string for each of the groups. This shows how many of each type of amino acids are present in each group at every position. The
<<>>=
LDH.profiles=create_profile_strings(LDH.amino,LDH.groups)
LDH.profiles[, colnames(LDH.profiles) %in% names(top_res)]

@

\section{Bib}
\bibliography{Bibtext}
\bibliographystyle{mystyle}

\end{document}