%\VignetteIndexEntry{genphen overview}

\documentclass[a4paper, 11pt]{article}
\usepackage[utf8]{inputenc}
\let\chapter\section
\usepackage[lined]{algorithm2e}

\usepackage[numbers]{natbib}
\bibliographystyle{plainnat}

\usepackage[english]{babel}
\selectlanguage{english}

\usepackage{graphicx}
\usepackage{placeins}
\usepackage{amsmath}
\usepackage{amscd}
\usepackage{ifthen}
\usepackage{float}
\usepackage{subfig}

\usepackage{lscape}

\usepackage{parskip}
\usepackage{multirow}
\usepackage{todonotes}
\usepackage{color}
\usepackage{colortbl}
\definecolor{hellgrau}{rgb}{0.9,0.9,0.9}
\definecolor{dunkelgrau}{rgb}{0.8,0.8,0.8}

\usepackage{hyperref}
% remove ref boxes
%\usepackage[hidelinks]{hyperref}

%add-on renews
\renewcommand\topfraction{0.85}
\renewcommand\bottomfraction{0.85}
\renewcommand\textfraction{0.1}
\renewcommand\floatpagefraction{0.85}

%confusion matrix
\newcommand\confusion[1]{
\fbox{\lower0.75cm
\vbox to 0.5cm{\vfil
\hbox to 0.5cm{\hfil\parbox{1cm}{#1}}
\vfil}%
}%
}

\setlength\floatsep{1.25\baselineskip plus 3pt minus 2pt}
\setlength\textfloatsep{1.25\baselineskip plus 3pt minus 2pt}
\setlength\intextsep{1.25\baselineskip plus 3pt minus 2 pt}

\begin{document}
\SweaveOpts{concordance=TRUE}


\title{Tutorial on using \texttt{genphen}}
\author{Simo Kitanovski, \\
Bioinformatics, University of Duisburg-Essen, \\
Essen, Germany}
\maketitle

\tableofcontents

This tutorial gives you some of the technical background underlying
\texttt{genphen} that should enable you to understand and use this tool.

\section{\texttt{genphen} quantifies genotype-phenotype associations}
Genome wide association studies (GWAS) have become an important tool to
understand the association between genotypes and phenotypes. With GWAS we try
to answer questions such as ``what are the genotypes in the human genome which
predispose to a disease?'' or ``what are the genotypes in certain strains of
mice which allow them to be more resistant against a specific virus?''. The are
countless applications of genotype-phenotype association studies, whereby the
genotype can either be a set of single nucleotide polymorphisms (SNPs) found in
nucleotide sequences, or a set of single amino acid polymorphisms (SAAPs) found
at sites in specific protein sequence. The phenotype can be any measured
quantity related to the individuals or sequences in which the polymorphic
genotypes were found.

To conduct GWAS frequentist statistical methods are typically used, relying on
P values to report the strength of the association between the genotypes and the
phenotypes. This often leads to massive multiple hypothesis problems, typically
answered with stringent P value correction methods, which are used to curb the
false positives but have the effect of introducing many false negatives. More
sophisticated statistical learning approaches such as random forest (RF) and
support vector machines (SVM) alleviate the P value problem and are able to
capture the most complex relationships between the genotypes and phenotypes but
have the drawback of poor interpretability. Bayesian inference on the other
hand, provides a solution to both the P value issue and the interpretabilitiy
of the statistical learning approaches, but comes at a substantially higher
computational cost.

Here we introduce \texttt{genphen}, a tool for GWAS which implements both
statistical learning techniques such as RF and SVM as well as Bayesian inference
using hierarchical models, to quantify the association between genotypes and
continuous phenotypes. Additionally, \texttt{genphen} implements a set of
visualization procedures which allow the user to inspect the vast results of
the main association analyses and pinpoint the significant genotype-phenotype
associations.


\section{Conducting GWAS with \texttt{genphen}}
\subsection{Input}
Two data types are necessary to perform a genetic association study, namely the
genotype data and the phenotype data. As an example of genotype data imagine a
set 1000 SNPs obtained for 10 different strains of laboratory mice taken from
the Mouse Hapmap Project. The phenotype, on the other hand can be an
experimental continuous measurement made for each of the 10 different mouse
strains (e.g. height, body weight, temperature, immune response, etc.)
(see SNP-Phenotype example in Fig \hyperref[fig:fig1]{1a}).

Another example of a genotype data can be a multiple sequence alignment (MSA)
of 100 protein homologs (e.g. 120 aligned protein sequences of different
organisms with 154 protein sites, some of which may contain amino acid
polymorphisms). The phenotype can once again be continuous measurement made for
each of the 120 organisms. Similar to the previous example, in this case too we
can use \texttt{genphen} to estimate the association between the polymorphic
protein sites and a given phenotype of the organisms.

More specifically, we can think of the genotype data as a character matrix with
dimensions N$\times$M, whereby the M columns represent different SNPs or SAAPs,
and the N rows represent different individuals or sequences for which we have
measures some phenotype. On the other hand, we can think of the phenotype
as a numerical vector of length N, where each phenotype corresponds to a
particular individual.

\subsection{Methods}
The T-test is a popular method which is often used to quantify the association
between the genotypes and phenotypes. Given a SNP like the one shown in Fig.
\hyperref[fig:fig1]{1a}, the T-test checks whether there exists a significant
difference between the two states (alleles) of the SNP w.r.t. the phenotype,
whereby the strength of the association is summarized with a P value.
\texttt{genphen} implements three statistical methods which perform similar
tasks, but are superior to the T-test. These methods can be used through the
following three interfaces:
\begin{itemize}
\item runGenphenRf - RF-based analysis
\item runGenphenSvm - SVM-based analysis
\item runGenphenBayes - Bayesian inference based analysis
\end{itemize}

\subsubsection{runGenphenRf and runGenphenSvm}
The procedures runGenphenRf and runGenphenSvm perform similar tasks using
different statistical learning techniques, namely RF and SVM. The following
metrics are estimated using each of them:

\paragraph{Classification accuracy($CA$)}
This metric is used to quantify the strength of the association between each
specific genotype and the phenotype. Both of the procedures generate
classification models between the phenotype (a numerical predictor) and a
specific genotype (a categorical response). If there exists a strong association
between the genotype and the phenotype, one should be able to build an accurate
classification model be ($CA = 1$, for a perfect classifier). To obtain a robust
$CA$ we applied cross-validation, where a subset of the genotype-phenotype data
is selected at random for training the classifier, followed by testing based on
the remaining data. The final $CA$ is the mean accuracy of resulting from this
procedure. The following confusion matrix represents the result of one
cross-validation step and is the data based on which $CA$ is computed:
\newcommand{\M}{\textmd}
\newcommand{\B}{$\bullet$}
\newcommand{\MC}{\multicolumn}
\newcommand{\MR}{\multirow}
\setlength{\doublerulesep}{.9\arrayrulewidth}
\addtolength{\tabcolsep}{2pt}
\begin{table}[H] \centering
\begin{tabular}%
{|cl|c|c|}
\hline
&   &  \MC{2}{c||}{\textbf{Real}} \\
&   &  \textbf{$allele_1$}  &  \textbf{$allele_2$} \\
\hline\hline
& \textbf{$allele_1$} & a & b\\
\cline{2-4}
\MR{-2}{*}{\textbf{Predicted}} & \textbf{$allele_2$} & c & d\\
\hline
\end{tabular}
\caption{Confusion matrix resulting from a classification analysis}
\label{tab:t0}
\end{table}
The CA of the cross-validation step $i$ is then estimated as:
\begin{gather*}
CA_i = \frac{a+d}{a+b+c+d}
\end{gather*}
The final $CA$ for $N$ cross-validation steps is then estimated as:
\begin{gather*}
CA = \frac{1}{N} \sum_{i=1}^{N} CA_{i}
\end{gather*}
In addition to estimating $CA$, one can also compute the 95\% highest density
interval (95\% HDI) from the  distribution of the individual $CA$s obtained
throughout the cross-validation. In order to obtain reliable HDIs, the user
should run these procedures with at least a hundred cross-validation steps
(parameter nboots > 100). Those genotypes for which a $CA$ is close to 1,
coupled with a narrow CI, have the most significant association with the
phenotype.

The metric $CA$ has two advantages over the P values. First, it is more
intuitive that a P value as a $CA=0.9$ means that on average, one is able to
correctly predict 90\% of the states of the genotype from the phenotypes.
Second, we do not run into multiple hypothesis problems and therefore there is
no need for a P value correction. One drawback of $CA$ is that it is difficult
to find a ``perfect'' cuttoff to separate the strong associations from the weak
ones.

\paragraph{Cohen's $\kappa$ statistics}
Often we are interested in comparing the observed $CA$ with classification
accuracy which is expected simply by chance ($CA_{exp}$). This is in particular
useful when the genetic states of the genotype are not evenly represented, i.e.
allele A of a given SNP may be represented in 80\% of the individuals, while
the other allele T may be represented in only 20\% of the individuals. Such
uneven composition of the genotype can affect the classification analysis,
resulting in high $CA$s simply because the classifier only predicts the
dominant label. Cohen's $\kappa$ statistics can be used to estimate how much
stronger the observed $CA$ is, compared to $CA_{exp}$. To compute the $\kappa$
statistics, the confusion matrix shown before in Table \hyperref[tab:t0]{1} is
used:
\begin{gather*}
\kappa =  \frac{CA - CA_{exp}}
{1 - CA_{exp}} \\
CA_{exp} = \frac{a+b}{a+b+c+d} \cdot \frac{a+c}{a+b+c+d} +
\frac{c+d}{a+b+c+d} \cdot \frac{b+d}{a+b+c+d} \\
\end{gather*}
The $\kappa$ statistics is a quality metric, which is to be used together with
$CA$. Cohen defines the following meaningful $\kappa$ intervals: [$\kappa$<0]:
``no agreement'', [0.0-0.2]: ``slight agreement'', [0.2-0.4]: ``fair agreement''
, [0.4-0.6]: ``moderate agreement'', [0.6-0.8]: ``substantial agreement'' and
[0.8-1.0]: ``almost perfect agreement''. Similarly to the estimation of $CA$,
the final Cohen's $\kappa$ is also estimated by averaging the individual
$\kappa$s computed for each step of the cross-validation. Here too, highest
density intervals are estimated.


\paragraph{Cohen's effect size ($d$)}
Using the package \texttt{effsize}, we compute the Cohen's $d$ for each
genotype-phenotype pair. High Cohen's $d$ estimations indicate that there is
a large difference in the measured phenotype between the two genetic states of
the specific genotype. Cohen (1992) defines thresholds which define the
magnitude of the effects as: |$d$|<0.2 ``negligible'', |$d$|<0.5 ``small'',
|$d$|<0.8 ``medium'', otherwise ``large''. \texttt{genphen} computes both the
Cohen's $d$ statistics and the corresponding 95\% confidence intervals. The
Cohen's $d$ is computed as follows:
\begin{gather*}
d = \frac{\mu_1 - \mu_2}{\sqrt{\frac{(n_1 - 1) * \sigma_1 +
(n_2 - 1) * \sigma_2}{n_1 + n_2 - 2}}}
\end{gather*}
where $\mu_1$, $\mu_2$ and $\sigma_1$, $\sigma_2$ represent the mean and the
standard deviations of the phenotypes in the two genetic states of the genotype,
while $n_1$ and $n_2$ represent the sample sizes of the two genetic states of
the genotype.

Both runGenphenRf and runGenphenSvm estimate the three metrics $CA$, $\kappa$
and $d$ with their corresponding highest density intervals, and with that
provide an alternative approach for GWAS.


\subsubsection{runGenphenBayes}
The procedure runGenphenBayes employes Bayesian inference to quantify the
association between genotypes and phenotypes. It performs Bayesian T-test
(Kruschke, 2014), using hierarchical models. The results of this procedure are
the most probable estimates (posterior estimates) for the parameters $\mu_1$
and $\mu_2$ which represent the mean phenotypes in each group of a given
genotype, as well as the standard deviations $\sigma_1$ and $\sigma_2$. While
the classical T-test has to adhere to two assumptions, namely: 1) the phenotypes
should be normally distributed in each group and 2) the variances in the two
groups should be equal; the Bayesian hierarchical models have been designed
such that both of these assumptions are relaxed. Using the posterior estimates
for $\mu_1$ and $\mu_2$ for a given genotype-phenotype combination, one can
estimate the contrast $\mu_1 - \mu_2$ and the so-called highest density
interval (HDI) to evaluate whether there exists a significant phenotypic
effect between the two groups which excludes the null-effect.

Using runGenphenBayes and analysing genotype-phenotype association based on the
Bayesian effect size $\mu_1 - \mu_2$ and the corresponding HDI has multiple
advantages:
\begin{itemize}
\item the effect size is easy to interpret
\item it helps avoid the multiple hypothesis problem
\item the HDI of the effect size provides a convenient filtering technique with
which the associations can be classified as either reliable (the HDI excludes
the null-effect) or unreliable (the HDI spans the null-effect)
\item the Bayesian hierarchical models relax the assumptions of the classical
T-test, and employ prior which model more accurately the observed data
\end{itemize}
The only disadvantage of applying this method genome-wide is its computational
complexity.


\subsection{Case studies}
\subsubsection{SNP-phenotype association with \texttt{genphen}}
This example is intended to guide the user through the previously described
\texttt{genphen} methodology. In particular, we present a situation in which
the association is to be computed between SNPs and phenotypes. We use a simple
example shown in Fig \hyperref[fig:fig1]{1a} to explain the \texttt{genphen}
procedures runGenphenRf/runGenphenSvm and runGenphenBayes.

\begin{itemize}
\item Input:
\begin{itemize}
\item genotype: a SNP column vector of 14 elements, where the two alleles A
and T are contained with 5 and 9 elements, respectively. The 14 elements of
this vector represent 14 mouse species.
\item phenotype: a vector of 14 elements, where each element represents the
measured immune response of a specific mouse.
\item Hint: we can inspect the given genotype-phenotype pair using the
procedure \texttt{plotSpecificGenotype} whose results are shown in Fig.
\hyperref[fig:fig1]{1b}.
\end{itemize}

\item Goal: can we quantify the association between the genotype and the
phenotype vectors using \texttt{genphen}?
\item Methods: we use the procedures runGenphenRf/runGenphenSvm and
runGenphenBayes to compute the association.
\end{itemize}

\begin{figure}[ht]
\centering
\includegraphics[width=12cm]{fig1.pdf}
\caption{\textbf{a}, genotype-phenotype data pair. The genotype is a SNP with
two genetic states (alleles A and T), found in 14 individuals. \textbf{b},
boxplot showing the phenotypic distribution of the genotype as a function of the
two alleles.}
\label{fig:fig1}
\end{figure}

\paragraph{runGenphenSvm and runGenphenRf}
<<sectionTag>>=
library(genphen)
genotype <- c("A", "A", "A", "T", "T", "T", "T", "T", "T",
              "T", "T", "T", "A", "A")

phenotype <- c(6.14, 3.95, 6.19, 8.02, 8.07, 7.32, 7.28,
               7.10, 7.10, 8.26, 8.38, 7.48, 5.22, 5.60)
@

<<sectionTag>>=
result <- runGenphenSvm(genotype = genotype,
                        phenotype = phenotype,
                        cv.fold = 0.66,
                        cv.steps = 1000,
                        hdi.level = 0.95)
@

\begin{table}[H] \centering
\begin{tabular}{@{\extracolsep{-4pt}} ccccccccc}
\hline
site & g.1 & g.2 & count.1 & count.2 & $d$ & $CA$ & $\kappa$ & t.test.pvalue \\
\hline \\[-1.8ex]
1 & A & T & $5$ & $9$ & $$-$3.339$ & $0.960$ & $0.918$ & $0.003$ \\
\hline
\end{tabular}
\caption{Results of the runGenphenSvm procedure (HDIs have been left out)}
\label{runGenphenSvm}
\end{table}

In Fig. \hyperref[fig:fig1]{1b} we notice that the two states of the genotype
are associated with divergent distributions of the phenotype. Hence, we can
expect a significant association between the SNP and the phenotype. This is
confirmed by the results of the procedure runGenphenSvm, which yield $CA$ close
to 1 ($CA = 0.960$), high Cohen's $\kappa$ ($\kappa = 0.918$) and a substantial
Cohen's effect size ($d = -3.339$). Similar results can be obtained with the
alternative statistical method (RF).


\paragraph{runGenphenBayes}
<<sectionTag>>=
result <- runGenphenBayes(genotype = genotype,
                          phenotype = phenotype,
                          chain.nr = 2,
                          mcmc.iter = 1000,
                          model = "tdist",
                          hdi.levels = 0.95)
@
\begin{table}[H] \centering
\begin{tabular}{@{\extracolsep{-4pt}} ccccccccccc}
\hline
site & g.1 & g.2 & count.1 & count.2 & mu.1 & mu.2 & effect & effect.0.95.HDI \\
\hline \\[-1.8ex]
1 & A & T & $5$ & $9$ & $5.460$ & $7.662$&$$-$2.202$&($$-$3.771$,$$-$0.757$) \\
\hline
\end{tabular}
\caption{Results of the runGenphenBayes procedure}
\label{runGenphenBayes}
\end{table}

The results of runGenphenBayes confirm the previous claims about the association
between the SNP and the phenotype. The state A is associatied with a mean
phenotype of 5.460, while T is associated with 7.662. The Bayesian effect size
therefore equals to -2.202 with 95\% HDI of (-3.771, -0.757), which excludes the
null-effect. The HDI criterion is an intuitive quality measure which can be used
to filter out poor association. In contrast to this, the filtering is not as
intuitive based on $CA$ and $\kappa$ in the case of the statistical learning
procedures.


\subsubsection{SAAP-phenotype association with \texttt{genphen}}
This example is intended to guide the user through the previously described
\texttt{genphen} methodology, this time when the association is to be computed
between a SAAP and a phenotype.


\begin{itemize}
\item Input:
\begin{itemize}
\item genotype: a SAAP column vector of 120 elements, where the four amino
acid states are present H, Q, N and K with the following counts 62, 55, 2 and
1, respectively. This genotype vector is a single site taken from the protein
sequence of 120 organisms, therefore each amino acid state corresponds to a
specific organism.
\item phenotype: a vector of 120 numerical elements (artificially generated).
\item Hint: we can first inspect the genotype-phenotype pair using the
procedure \texttt{plotSpecificGenotype} whose results are shown in Fig
\hyperref[fig:fig2]{2}.
\end{itemize}

\item Goal: can we quantify the association between the genotype and the
phenotype vectors using \texttt{genphen}?

\item Method: we can use the \texttt{genphen} method \texttt{runGenphen}
to compute the association. This procedure will first need to decompose the
genotype-phenotype pair into the 6 possible amino acid substitution pairs,
namely (H, Q), (H, K), (H, N), (Q, K), (Q, N) and (K, N), and then compute
the association between substitution pair and the phenotype just as it was
presented in the previous example.
\end{itemize}

\begin{figure}[ht]
\centering
\includegraphics[width=8cm]{fig2.pdf}
\caption{Boxplot showing the phenotypic distribution as a function of four
amino acids (H, Q, N, K) at a given SAAP of 120 amino acids in total.}
\label{fig:fig2}
\end{figure}


\paragraph{runGenphenSvm and runGenphenRf}
<<sectionTag>>=
library(genphen)
data(genotype.saap)
data(phenotype.saap)

genotype <- genotype.saap[, 82]
phenotype <- phenotype.saap
@

<<sectionTag>>=
result <- runGenphenSvm(genotype = genotype,
                        phenotype = phenotype,
                        cv.fold = 0.66,
                        cv.steps = 1000,
                        hdi.level = 0.95)
@

\begin{table}[H] \centering
\begin{tabular}{@{\extracolsep{-4pt}} ccccccccc}
\hline
site & g.1 & g.2 & count.1 & count.2 & $d$ & $CA$ & $\kappa$ & t.test.pvalue \\
\hline \\[-1.8ex]
1 & H & Q & $62$ & $55$ & $2.381$ & $0.879$ & $0.755$ & $0$ \\
1 & H & K & $62$ & $1$ & $NA$ & $NA$ & $NA$ & $NA$ \\
1 & H & N & $62$ & $2$ & $2.339$ & $0.974$ & $$-$0.0001$ & $0.221$ \\
1 & Q & K & $55$ & $1$ & $NA$ & $NA$ & $NA$ & $NA$ \\
1 & Q & N & $55$ & $2$ & $$-$0.102$ & $0.973$ & $0$ & $0.934$ \\
1 & K & N & $1$ & $2$ & $NA$ & $NA$ & $NA$ & $NA$ \\
\hline
\end{tabular}
\caption{Results of the runGenphenSvm procedure (HDIs have been left out)}
\label{runGenphenSvm}
\end{table}

For each SAAP (amino acid substitution), runGenphenSvm and runGenphenRf
estimates the previously defined metrics w.r.t. the phenotype. One can then use
the $CA$, Cohen's $\kappa$ and Cohen's effect size $d$, to find the most
significant associations. The SAAP involving the amino acid (H, Q) stands out
in this example not only due to its high $CA$, $\kappa$ and $d$, but also
because these estimates are based on substantial data of 62 + 55 data points.
If only a single data point is available for a given genetic state of a SAAP
(e.g. amino acid state K), then NA (not available) estimates are returned.

\paragraph{runGenphenBayes}
<<sectionTag>>=
result <- runGenphenBayes(genotype = genotype,
                          phenotype = phenotype,
                          chain.nr = 2,
                          mcmc.iter = 1000,
                          model = "tdist",
                          hdi.levels = 0.95)
@
\begin{table}[H] \centering
\begin{tabular}{@{\extracolsep{-4pt}} ccccccccccc}
\hline
site & g.1 & g.2 & count.1 & count.2 & mu.1 & mu.2 & effect & effect.0.95.HDI \\
\hline \\[-1.8ex]
1 & H & Q & $62$ & $55$ & $2.285$ & $$-$0.514$ & $2.798$ & ($2.355$, $3.219$) \\
1 & H & K & $62$ & $1$ & $NA$ & $NA$ & $NA$ & ($NA$, $NA$) \\
1 & H & N & $62$ & $2$ & $2.286$ & $$-$0.575$ & $2.861$&($$-$6.145$, $12.663$)\\
1 & Q & K & $55$ & $1$ & $NA$ & $NA$ & $NA$ & ($NA$, $NA$) \\
1 & Q & N & $55$ & $2$ & $$-$0.513$ & $$-$0.604$&$0.091$&($$-$8.121$, $8.140$)\\
1 & K & N & $1$ & $2$ & $NA$ & $NA$ & $NA$ & ($NA$, $NA$) \\
\hline
\end{tabular}
\caption{Results of the runGenphenBayes procedure}
\label{runGenphenBayes}
\end{table}
The results of runGenphenBayes once again confirm the previous claims about the
association between the different SAAPs and the phenotype. The amino acid state
H is  associatied with a mean phenotype of 2.285, while Q is associated with
-0.514. The Bayesian effect size therefore equals to 2.798 with 95\% HDI of
(2.355, 3.219), which excludes the null-effect. None of the other associations
satisfies this criterion.


\subsection{Visualization}
The tool \texttt{genphen} implements visualization procedures which help the
user evaluate its results. Here we introduce the following four procedures:
\begin{itemize}
\item plotSpecificGenotype
\item plotGenphenRfSvm
\item plotGenphenBayes
\item plotManhattan
\end{itemize}

\paragraph{plotSpecificGenotype}
The results of the procedure \texttt{plotGenphenResults} are shown in
Fig \hyperref[fig:fig3]{3}. With it, one can plot the phenotype as a function
of a specific genotype, in which the user might be interested (e.g. specific
SNP).

\paragraph{plotGenphenRfSvm}
The results generated either by runGenphenRf and runGenphenSvm can be inspected
via this procedure. With it one visualizes the quantified genotype-phenotype
associations with respect to their $CA$ and $d$ measures.

\paragraph{plotGenphenBayes}
The results generated by plotGenphenBayes can be inspected via this procedure.
With it one visualizes the quantified genotype-phenotype associations with
respect to their Bayesian effect sizes $\mu_1 - \mu_2$ and the corresponding
HDIs.

\paragraph{plotManhattan}
The -log10 transformed P values estimated via the two-sample T-test for each
genotype-phenotype association can be inspected via this procedure using via a
Manhattan plot.

All four plotting procedures are applied in the following section
\ref{sec:example}.


\section{Complete worked example}
\label{sec:example}

\subsection{Workflow 1}
\begin{enumerate}
\item Loading data
<<sectionTag>>=
library(genphen)
data(genotype.saap)
# One can also use a multiple sequence alignment as a genotype input (in the
# form of either a DNAMultipleAlignment or AAMultipleAlignment objects of the
# Biostring package).
data(phenotype.saap)
@


\item Running genphen algorithm using random forests (RF) and linear support
vector machines (LSVM):
<<sectionTag>>=
# if DNAMultipleAlignment is loaded you cannot subset
# with genotype.snp[, 1:10]
genphen.rf <- runGenphenRf(genotype = genotype.saap[, 1:10],
                           phenotype = phenotype.saap,
                           cv.fold = 0.66,
                           cv.steps = 100,
                           hdi.level = 0.99,
                           ntree = 1000)


genphen.svm <- runGenphenSvm(genotype = genotype.saap[, 1:10],
                             phenotype = phenotype.saap,
                             cv.fold = 0.66,
                             cv.steps = 100,
                             hdi.level = 0.99)
@


\item Filtering out association which could not be quantified by
\texttt{genphen}, due to lack of data:
<<sectionTag>>=
genphen.rf <- genphen.rf[complete.cases(genphen.rf), ]
genphen.svm <- genphen.svm[complete.cases(genphen.svm), ]
@


\item Plotting results
\begin{figure}[H]
\centering
<<fig=true, echo = FALSE, width=7, height=5>>=
plotGenphenRfSvm(genphen.results = genphen.rf)
@
\caption{Effect site - classification accuracy plot}
\end{figure}


\begin{figure}[H]
\centering
<<fig=true, echo = FALSE, width=7, height=5>>=
plotGenphenRfSvm(genphen.results = genphen.svm)
@
\caption{Effect site - classification accuracy plot}
\end{figure}


\begin{figure}[H]
\centering
<<fig=true, echo = FALSE, width=7, height=5>>=
plotManhattan(genphen.results = genphen.rf)
@
\caption{Manhattan plot}
\end{figure}


\item Running the genphen algorithm using Bayesian inference:
<<sectionTag>>=
# if DNAMultipleAlignment is loaded you cannot subset
# with genotype.snp[, 1:10]
genphen.bayes <- runGenphenBayes(genotype = genotype.saap[, 1:10],
                                 phenotype = phenotype.saap,
                                 chain.nr = 2,
                                 mcmc.iter = 1000,
                                 model = "tdist",
                                 hdi.levels = c(0.90, 0.95, 0.99))
@


\item Filtering out association which could not be quantified by
\texttt{genphen}, due to lack of data:
<<sectionTag>>=
genphen.bayes <- genphen.bayes[complete.cases(genphen.bayes), ]
@


\item Plotting results
\begin{figure}[H]
\centering
<<fig=true, echo = FALSE, width=7, height=5>>=
plotGenphenBayes(genphen.results = genphen.bayes, hdi = 0.9)
@
\caption{Effect site - Bayesian effect size plot}
\end{figure}


\begin{figure}[H]
\centering
<<fig=true, echo = FALSE, width=7, height=5>>=
plotManhattan(genphen.results = genphen.bayes)
@
\caption{Manhattan plot}
\end{figure}


\item Visual inspection of the association between a specific genotypes and
the phenotypes:
\begin{figure}[H]
\centering
<<fig=true, echo = FALSE, height=2.5>>=
plotSpecificGenotype(genotype = genotype.saap,
                     phenotype = phenotype.saap,
                     index = 9)
@
\caption{Specific genotype-phenotype plot}
\end{figure}
\end{enumerate}


\end{document}