%\VignetteIndexEntry{genphen overview}

\documentclass[a4paper, 11pt]{article}
\usepackage[utf8]{inputenc}
\let\chapter\section
\usepackage[lined]{algorithm2e}

\usepackage[numbers]{natbib}
\bibliographystyle{plainnat}

\usepackage[english]{babel}
\selectlanguage{english}

\usepackage{graphicx}
\usepackage{placeins}
\usepackage{amsmath}
\usepackage{amscd}
\usepackage{ifthen}
\usepackage{float}
\usepackage{subfig}

\usepackage{lscape}

\usepackage{parskip}
\usepackage{multirow}
\usepackage{todonotes}
\usepackage{color}
\usepackage{colortbl}
\definecolor{hellgrau}{rgb}{0.9,0.9,0.9}
\definecolor{dunkelgrau}{rgb}{0.8,0.8,0.8}

\usepackage{hyperref}
% remove ref boxes
%\usepackage[hidelinks]{hyperref}

%add-on renews
\renewcommand\topfraction{0.85}
\renewcommand\bottomfraction{0.85}
\renewcommand\textfraction{0.1}
\renewcommand\floatpagefraction{0.85}

%confusion matrix
\newcommand\confusion[1]{
\fbox{\lower0.75cm
\vbox to 0.5cm{\vfil
\hbox to 0.5cm{\hfil\parbox{1cm}{#1}}
\vfil}%
}%
}

\setlength\floatsep{1.25\baselineskip plus 3pt minus 2pt}
\setlength\textfloatsep{1.25\baselineskip plus 3pt minus 2pt}
\setlength\intextsep{1.25\baselineskip plus 3pt minus 2 pt}
\setlength{\parindent}{0pt}

\begin{document}
\SweaveOpts{concordance=TRUE}


\title{Tutorial on using \texttt{genphen}}
\author{Simo Kitanovski, Daniel Hoffmann,\\
Bioinformatics and Computational Biophysics,\\
University of Duisburg-Essen, Essen, Germany}
\maketitle

\tableofcontents

This tutorial gives you some of the technical background underlying
\texttt{genphen} that should enable you to understand and use this tool.

\section{\texttt{genphen} quantifies genotype-phenotype associations}
Genome wide association studies (GWAS) have become an important tool to
understand the association between genotypes and phenotypes. With GWAS we try
to answer questions such as ``what are the genotypes in the human genome which
predispose to a disease?'' or ``what are the genotypes in certain strains of
mice which confer resistance against a specific virus?''. The advances in high-
throughput sequencing technology have provided massive genetic data and thus
potentially countless applications of genotype-phenotype association studies.
The genotype can be a set of single nucleotide polymorphisms (SNPs) or a set of
single amino acid polymorphisms (SAAPs) identified in a group of individuals,
whereas the phenotype can be any continuous or discrete individual quantity or
characteristics.

To conduct GWAS, frequentist statistical methods are typically used, relying on
simple and often inadequate methods to capture the complex and potentially non-
linear genotype-phenotype association. Moreover, these methods often use P-
values to quantify the strength of association, bringing with them a set of
disadvantages, some of which include poor interpretation, difficulty to compare
between different studies, as well as massive multiple hypothesis problems.

With \texttt{genphen} we provide a hybrid method which reaps the benefits of
sophisticated statistical learning approaches such as random forest (RF) and
support vector machine (SVM) to capture complex genotype-phenotype associations,
on the one hand, and Bayesian inference on the other hand, to accurately
quantify the strength of association using models consistent with the data.
The results of \texttt{genphen} are multiple association scores for each
genotype. Visualizing these scores together can present the researcher a
meaningful guide to selecting the most promising association.

Furthermore, \texttt{genphen} provides a set of procedures including a test for
phylogenetic bias (used to discover biases in the data due to the population
structure), procedure for data reduction (used for removing non-informative
genotypes and thereby simplifying the GWAS), data augmentation (used to augment
small sample datasets) and methods for gene prioritization based on network
diffusion algorithms using functional network data.

\section{Methods}
\label{sec:methods}
\subsection{Input}
Two data types are necessary to perform a genetic association study:
\begin{itemize}
\item genotype data (e.g. set of 1,000 SNPs found along the aligned genomes of
10 individuals)
\item phenotype data (experimental measurement made for each individual such
as body mass index, immune response, survival, case-control, etc.)
\end{itemize}

More generally, we can think of the genotype data as a character matrix with
dimensions $N \times M$, whereby the $M$ columns represent different SNPs or
SAAPs, and the $N$ rows represent different individuals or sequences for which
we have some measured phenotype. Therefore, we can think of the phenotype as a
numerical vector of length $N$, where each phenotype corresponds to a particular
individual. Moreover, the phenotype can be of specific type (e.g. quantitative,
dichotomous, etc.), imposing requirements on the type of statistical test that
can be be used for the association analysis.

\subsection{Association Scores}
Between each genotype (SNP/SAAP) and phenotype, \texttt{genphen} computes
several measures of association, each of which is explained in the following
paragraphs.

\paragraph{Classification accuracy($CA$)}
$CA$ measures the degree of accuracy with which one can classify (predict) the
alleles of an SNP from the phenotype measurements. If there exists a strong
association between a particular SNP and the phenotype, one should be able to
build a statistical model which accurately classifies the two alleles of that
SNP solely from the phenotype data ($CA \approx 1$). Otherwise, the
classification accuracy of statistical model should be approximately similar to
that of simple guessing ($CA \approx 0.5$).

To estimate a robust $CA$ estimate, \texttt{genphen} uses cross-validation (CV),
obtaining a distribution of possible $CAs$. During the CV procedure a subset
(e.g. 70\%) of the genotype-phenotype data is selected at random for training
the classifier, followed by testing based on the remaining data. The following
confusion matrix represents the result of one CV step:
\newcommand{\M}{\textmd}
\newcommand{\B}{$\bullet$}
\newcommand{\MC}{\multicolumn}
\newcommand{\MR}{\multirow}
\setlength{\doublerulesep}{.9\arrayrulewidth}
\addtolength{\tabcolsep}{2pt}
\begin{table}[H] \centering
\begin{tabular}%
{|cl|c|c|}
\hline
&   &  \MC{2}{c||}{\textbf{Real}} \\
&   &  \textbf{$allele_1$}  &  \textbf{$allele_2$} \\
\hline\hline
& \textbf{$allele_1$} & a & b\\
\cline{2-4}
\MR{-2}{*}{\textbf{Predicted}} & \textbf{$allele_2$} & c & d\\
\hline
\end{tabular}
\caption{Confusion matrix resulting from a classification analysis}
\label{tab:t0}
\end{table}
The CA of the cross-validation step $i$ is then estimated as:
\begin{gather*}
CA_i = \frac{a+d}{a+b+c+d}
\end{gather*}
The final $CA$ after 1000 CV steps is then estimated as:
\begin{gather*}
CA = \frac{1}{1000} \sum_{i=1}^{1000} CA_{i}
\end{gather*}
In addition to estimating $CA$, the distribution of $CAs$ enables us to also
compute the 95\% highest density interval (95\% HDI) of CA. The mutations with
$CA \approx 1$, with narrow HDI have the strongest associations.

The metric $CA$ has the following advantages:
\begin{itemize}
\item simple to estimate
\item simple to interpret (bound between 0 and 1)
\item simple to compare across studies
\item one can have high CA despite small effects
\item detects strong non-linear associations as well
\end{itemize}


\paragraph{Cohen's $\kappa$ statistic}
There is one pitfall where the $CA$ estimate can be truly misleading, and this
is the case when the analyzed SNP is composed of unevenly represented genetic
states (alleles). For instance, the allele A of a given SNP is found in 90\%
of the individuals, while the other allele T in only 10\%. Such an uneven
composition of the alleles can lead to misleading results, i.e. even without
learning the algorithm can produce a high $CA \approx 0.9$ simply by constantly
predicting only the dominant label. The Cohen's $\kappa$ statistics can be used
to estimate how much better the observed $CA$ is, compared to the classification
accuracy expected by chance ($CA_{exp}$). To compute the $\kappa$ statistics,
the confusion matrix shown before in Table \hyperref[tab:t0]{1} is used:
\begin{gather*}
\kappa =  \frac{CA - CA_{exp}}
{1 - CA_{exp}} \\
CA_{exp} = \frac{a+b}{a+b+c+d} \cdot \frac{a+c}{a+b+c+d} +
\frac{c+d}{a+b+c+d} \cdot \frac{b+d}{a+b+c+d} \\
\end{gather*}
The $\kappa$ statistics is a quality metric, which is to be used together with
$CA$. Cohen defines the following meaningful $\kappa$ intervals: [$\kappa$<0]:
``no agreement'', [0.0-0.2]: ``slight agreement'', [0.2-0.4]: ``fair agreement''
, [0.4-0.6]: ``moderate agreement'', [0.6-0.8]: ``substantial agreement'' and
[0.8-1.0]: ``almost perfect agreement''. Similarly to the estimation of $CA$,
the final Cohen's $\kappa$ is also estimated by averaging the individual
$\kappa$ scores computed for each step of the CV. Here too,
95\% HDIs are estimated.


\paragraph{Cohen's $d$ effect size}
Given data from two groups (two allele groups in a SNP), we ask the question:
How much is one group different from the other with respect to the phenotype
observed in each group? In the case of quantitative phenotypes we answer this
question by computing the Cohen's $d$ for each genotype-phenotype pair. Cohen's
$d$ estimates which are significantly greater or smaller than 0 indicate that
there is a large difference in the phenotype between the two genetic states of
the specific genotype. Cohen (1992) defines level which define the magnitude of
the effects as: |$d$|<0.2``negligible'', |$d$|<0.5 ``small'', |$d$|<0.8
``medium'', otherwise ``large''. The Cohen's $d$ is computed as follows:
\begin{gather*}
d = \frac{\mu_1-\mu_2}{\sqrt{({\sigma_1}^2+{\sigma_2}^2)/2}}
\end{gather*}
where $\mu_1$, $\mu_2$ and $\sigma_1$, $\sigma_2$ represent the mean and the
standard deviations of the phenotypes in the two genetic states of the genotype.

\texttt{genphen} uses the following Bayesian inference models designed in STAN
\footnote{Stan Development Team. 2017. Stan Modeling Language Users Guide and
Reference Manual, Version 2.17.0. http://mc-stan.org}, to estimate each of the
parameters in the Cohen's $d$ equation from the data:
\begin{gather*}
Y_{ijk} \sim T(\nu_i, \mu_{ijk}, \sigma_{ijk}) \\
\mu_{jk} \sim N(\hat{\mu}, \hat{\sigma} \cdot 100) \\
\sigma_{jk} \sim U(\hat{\sigma}/100, \hat{\sigma} \cdot 100) \\
\nu_j \sim Gamma(2.0, 0.1)
\end{gather*}
where $i$, $j$ and $k$ index phenotype observation $i$ at site $j$ and genotype
$k$, respectively in the phenotype vector $Y$; $\mu$, $\sigma$ and $\nu$ are the
mean, standard deviation and degrees of freedom parameters of the T-distribution
which is used to model the phenotype observed in each allele; $\hat{\sigma}$ and
$\hat{\mu}$ are the empirically estimated mean and standard deviations of the
phenotype in each allele which are used to setup the broad priors for $\mu$ and
$\sigma$.

For each parameter we estimate a complete posterior distribution with MCMC
sampling implemented in \texttt{rstan}. Therefore, by plugging in the entire
posterior distributions of the parameters into the Cohen's $d$ equation, we
estimate a complete posterior distribution for $d$ as well. This also allows us
to compute the corresponding 95\% HDI of $d$. A complete description of the
hierarchical model is provided in Krushke, 2013) \footnote{Kruschke, John K.
"Bayesian estimation supersedes the t test."Journal of Experimental Psychology:
General 142.2 (2013): 573}. For SNPs with more than two groups (e.g. 3 alleles),
the Cohen's $d$ estimate is computed between each pair of alleles.

This approach, although computationally more challenging than a simple t-test,
has a few advantages related to the fact that the phenotype is modeled in a
more consistent way, i.e. the distribution of the phenotype in each group is
described as a T-distribution with a mean ($\mu$), standard deviation ($\sigma$)
and shape parameter ($\nu$).
\begin{itemize}
\item complete information about the credible parameter values is obtained
\item handling of outliers achieved by describing the data with heavy-tailed
distributions instead of normal distribution $\rightarrow$ assumption of
normality is alleviated
\item the T-distribution which describes the phenotype in each group has its
own variance parameter $\rightarrow$ assumption of homoscedasticity is
alleviated.
\item in addition to computing the differences between the central tendencies of
the groups, we can use the standard deviation parameters to compute differences
between their variabilities
\end{itemize}


\paragraph{Absolute effect size ($a$)}
For dichotomous phenotypes, \texttt{genphen} simply computes the absolute
difference (also known as contrast $a$) between the phenotypes of any two
genetic states of each polymorphism as follows:
\begin{gather*}
a = p_1 - p_2
\end{gather*}
where $p_1$, $p_2$ represent the proportion of "successes" in a given number of
observed trials in the two genetic states of the given SNP/SAAP. Similar to the
case of having a continuous phenotype, \texttt{genphen} uses a Bayesian
inference models designed in STAN to estimate each of the parameters in the
equation above from the data:
\begin{gather*}
Y_{ijk} \sim Bern(p_{jk}) \\
p_{jk} \sim Beta(1/2, 1/2)
\end{gather*}
where $i$, $j$ and $k$ index phenotype observation $i$ at site $j$ and genotype
$k$, respectively in the phenotype vector $Y$; $p$, is the probability parameter
of the Bernoulli distribution which is used to model the dichotomous phenotype
data as a set of sucesses in a set of trials observed each allele; The prior of
$p$ is a Beta distribution whose two parameters are fixed at 0.5 (Jeffrey's
prior).

Plugging in the estimated posterior distributions of $p$ into the equation above
allows us to compute the mean $a$ point estimate and its 95\% HDI.


\paragraph{Bhattacharyya coefficient ($BC$)}
We further use the infered parameters needed to compute the effect size, to
perform posterior predictive checks, i.e. we simulate phenotype data. The
simulated data for each genotype is then used to estimate the degree of overlap
between the simulated phenotype distributions in two genetic states of the
mutation. The overlap is quantified using the Bhattacharyya coefficient (BC):
\begin{gather*}
BC(p_1, p_2) = \int_x\sqrt(p_1(x) \cdot p_2(x)) dx
\end{gather*}
where $p_1$ and $p_2$ are the simulated phenotype distributions in both genetic
states of a mutation. For a complete overlap BC = 1 (i.e. no difference between
the phenotype distributions in the two genetic state), and BC = 0 for no overlap
(significant difference).


\subsection{Phylogenetic Bias ($B$)}
To control for potential phylogenetic biases (population structure), we devised
the following procedure. First, we use the complete genotype data (all SNPs) to
compute a kinship matrix (NxN dissimilarity matrix between the N-individuals).
Alternatively, the users can provide their own kinship matrix (e.g. estimated
using more accurate phylogenetic methods). For a group of individuals which
belong to a group defined by an alleles of a given SNP, we next compute their
mean kinship distance using the kinship matrix data. If the individuals in the
group are related, the compute mean kinship distance must be significantly lower
than the mean kinship distance computed from the complete kinship matrix. Thus,
we define the phylogenetic bias as:
\begin{gather*}
B = 1 - \hat{d}_{g}/\hat{d}_{t}
\end{gather*}
where $\hat{d}_{g}$ is the mean kinship distance between the individuals who
share the genotype $g$; $\hat{d}_{t}$ is the mean kinship distance of the
complete kinship matrix. For a complete phylogenetic bias, B = 1 ($\hat{d}_{g}
<< \hat{d}_{t}$), and $B = 0$ (or slightly negative) for no bias. This estimate
is computed for each SNP and genotype group within each SNP.

To compute the phylogenetic bias associated with a mutation ($g_{1} -> g_{2}$),
we compute:
\begin{gather*}
B = 1 - min(\hat{d}_{g_{1}}, \hat{d}_{g_{2}})/\hat{d}_{t}
\end{gather*}
where $\hat{d}_{g_{1}}$ and $\hat{d}_{g_{2}}$ represent the mean kinship
distance between the individuals who share the genotype (allele) $g_{1}$ and
$g_{2}$ or a given SNP; $\hat{d}_{t}$ is the mean kinship distance in the
complete kinship matrix. For a complete phylogenetic bias, B = 1 and B = 0
(or slightly negative) for no bias. This estimate is computed for each SNP
and each pair of genotypes.

\section{Case studies}
\subsection{I: Association between SNPs and a *continuous* phenotype}
\label{sec:case1}

In the first case study, we show a typical genotype-phenotype analysis, whereby
the genotype is a protein sequence alignment composed of 6 sites and 120
individuals (sequences), and a continuous phenotype measured for each of the
individuals.

<<echo=FALSE, include=TRUE, results=hide>>=
require(genphen)
require(ggplot2)
require(knitr)
require(ggrepel)
require(reshape2)
require(ape)
require(xtable)
options(xtable.floating = FALSE)

# genotype inputs:
data(genotype.saap)
# phenotype inputs:
data(phenotype.saap)

# run genphen
c.out <- runGenphen(genotype = genotype.saap[, 80:85],
                    phenotype = phenotype.saap,
                    phenotype.type = "continuous",
                    mcmc.chains = 2,
                    mcmc.iterations = 2000,
                    mcmc.warmup = 500,
                    mcmc.cores = 4,
                    hdi.level = 0.95,
                    stat.learn.method = "rf",
                    cv.iterations = 300)

# scores
c.convergence <- c.out$convergence
c.score <- c.out$scores
@

\paragraph{Genotype-phenotype data}
First we show an overview of the distribution of the phenotype across each of
the 6 studied polymorphic sites in the sequence alignment, and the underlying
genotype states. \texttt{genphen} will list the mutations found at each size,
followed by quantification fo the association strength as explained in
\ref{sec:methods}.

\begin{figure}[H]
\centering
<<echo=FALSE, include=TRUE, results=hide, fig=TRUE, width=8, height=4>>=
df <- data.frame(genotype.saap[, 80:85], stringsAsFactors = FALSE)
df$phenotype <- phenotype.saap
df <- melt(data = df, id.vars = "phenotype")
colnames(df) <- c("phenotype", "site", "genotype")
df$site <- gsub(pattern = "X", replacement = '', x = df$site)

ggplot(data = df)+
  facet_wrap(facets = ~site, ncol = 6, scales = "free_x")+
  geom_violin(aes(x = genotype, y = phenotype))+
  geom_point(aes(x = genotype, y = phenotype, col = genotype), size = 1,
             shape = 21, position = position_jitterdodge(jitter.width = 0.2,
                                                         jitter.height = 0.1,
                                                         dodge.width = 0.5))+
  scale_color_discrete(name = "genotype")+
  theme_bw(base_size = 16)+
  theme(legend.position="none")
@
\end{figure}

\paragraph{Association analysis}
A typical way of visualizing the \texttt{genphen} results is with the following
plot where each point represents a polymorphism (here SAAP) plotted according
to x=classification accuracy ($CA$), y=Cohen's $d$, color=Cohen's $\kappa$. The
most promising SAAPs have CA and $\kappa$ close to 1, with Cohen's $d$ estimate
whose 95\% highest density interval (HDI) does not overlap with the null effect
(dashed line in figure: $d = 0$). The labels indicate the MSA site number,
followed by the type of the polymorphism, and finally the BC score.

\begin{figure}[H]
\centering
<<echo=FALSE, include=TRUE, results=hide, fig=TRUE, width=8, height=6>>=
c.score$label <- paste(c.score$site, c.score$mutation,
                       round(x = c.score$bc, digits=2), sep=':')
ggplot(data = c.score)+
  geom_errorbar(aes(x = ca, ymin = cohens.d.L, ymax = cohens.d.H))+
  geom_point(aes(x = ca, y = cohens.d, fill = kappa), shape = 21, size = 3)+
  geom_text_repel(aes(x = ca, y = cohens.d, label = label), size = 4)+
  theme_bw(base_size = 16)+
  xlab(label = "CA")+
  ylab(label = "Cohen's d (with 95% HDI)")+
  scale_x_continuous(limits = c(0, 1))+
  geom_hline(yintercept = 0, linetype = "dashed")+
  scale_fill_gradientn(colours = terrain.colors(n = 10), limits = c(0, 1))
@
\end{figure}

The association scores are also shown in the following table:

<<echo=FALSE, include=TRUE, results=tex>>=
print(xtable(c.out$debug.scores[, c("site", "mutation", "cohens.d",
                                    "cohens.d.hdi", "bc", "ca",
                                    "ca.hdi", "kappa", "kappa.hdi"), ],
             align = rep(x = "c", times = 10, digits = 2)),
      include.rownames = FALSE, size = "small")
@

\paragraph{MCMC convergence}
Next, we want to check the validity our Bayesian inference model by inspecting
the \texttt{genphen} output named convergence which contains information about
the markov chain monte carlo (MCMC) simulation done with R package rstan
including potential scale reduction factor (Rhat) and effective sampling size
(ESS), as well as information concerning potential convergence issues such as
divergences, tree depth exceeded warnings, etc. The small sample size of
specific alleles (of a site) are often the cause of such warnings, which are
then reported for the the alleles of that site.

<<echo=FALSE, include=TRUE, results=tex>>=
print(xtable(c.convergence, align = rep(x = "c", times = 10, digits = 2)),
      include.rownames = FALSE, size = "small")
@


\paragraph{Phylogenetic bias control}

Next, we compute the phylogenetic bias of each mutation, shown in the table
below:

<<echo=FALSE, include=TRUE, results=tex>>=
# compute the phylogenetic bias
bias <- runPhyloBiasCheck(genotype = genotype.saap, input.kinship.matrix = NULL)

# extract kinship matrix
kinship.matrix <- bias$kinship.matrix

# extract the bias associated with mutations of the sites which were included
# in the association analysis
mutation.bias <- bias$bias.mutations
mutation.bias <- mutation.bias[mutation.bias$site %in% 80:85, ]
mutation.bias <- mutation.bias[mutation.bias$mutation %in% c.score$mutation, ]

# show the bias table
print(xtable(mutation.bias, align = rep(x = "c", times = 3+1, digits = 2)),
      include.rownames = FALSE, size = "small")
@


We use the kinship matrix to perform hierarchical clustering, visualizing the
population strcuture and two examples (mutations) with genotype 1 marked with
blue and genotype 2 marked with orange in either case. Individuals not covered
by either genotype are marked with gray color. The shown examples differ in
the degree of phylogenetic bias.

\begin{figure}[H]
\centering
<<echo=FALSE, include=TRUE, results=hide, fig=TRUE, width=8, height=5>>=
color.a <- character(length = nrow(genotype.saap))
color.a[1:length(color.a)] <- "gray"
color.a[which(genotype.saap[, 82] == "h")] <- "orange"
color.a[which(genotype.saap[, 82] == "q")] <- "blue"

color.b <- character(length = nrow(genotype.saap))
color.b[1:length(color.b)] <- "gray"
color.b[which(genotype.saap[, 84] == "a")] <- "orange"
color.b[which(genotype.saap[, 84] == "d")] <- "blue"

c.hclust <- hclust(as.dist(kinship.matrix), method = "average")

par(mfrow = c(1, 2), mar = c(0,0,1,0) + 0.1)
plot(as.phylo(c.hclust), tip.color = color.a, cex = 0.6, type = "fan",
     main = "B = 0.15")
plot(as.phylo(c.hclust), tip.color = color.b, cex = 0.6, type = "fan",
     main = "B = 0.43")
@
\end{figure}


\subsection{II: Association between SNPs and a *dichotomous* phenotype}
\label{sec:case2}

In the second case study we show you how to use \texttt{genphen} in case the
phenotype is of dichotomous type. The genotype input is a protein sequence
alignment composed of 12 sites and 120 individuals (sequences), and the
phenotype is a vector of 120 dichotomous values measured for each individual.

<<echo=FALSE, include=TRUE, results=hide>>=
# genotype inputs:
data(genotype.saap)
# phenotype inputs:
data(dichotomous.phenotype.saap)

# run genphen
d.out <- runGenphen(genotype = genotype.saap[, 68:79],
                    phenotype = dichotomous.phenotype.saap,
                    phenotype.type = "dichotomous",
                    mcmc.chains = 2,
                    mcmc.iterations = 2000,
                    mcmc.warmup = 500,
                    mcmc.cores = 4,
                    hdi.level = 0.95,
                    stat.learn.method = "rf",
                    cv.iterations = 300)

# scores
d.convergence <- d.out$convergence
d.score <- d.out$scores
@

\paragraph{Genotype-phenotype data}
First we show an overview of the distribution of the phenotype across each of
the 12 studied polymorphic sites in the sequence alignment, and the underlying
genotype states. \texttt{genphen} will list the mutations found at each size,
followed by quantification fo the association strength as explained in
\ref{sec:methods}.

\begin{figure}[H]
\centering
<<echo=FALSE, include=TRUE, results=hide, fig=TRUE, width=8, height=6>>=
df <- data.frame(genotype.saap[, 68:79], stringsAsFactors = FALSE)
df$phenotype <- dichotomous.phenotype.saap
df <- melt(data = df, id.vars = "phenotype")
colnames(df) <- c("phenotype", "site", "genotype")
df$site <- gsub(pattern = "X", replacement = '', x = df$site)

ggplot(data = df)+
  facet_wrap(facets = ~site, ncol = 6, scales = "free_x")+
  geom_point(aes(x = genotype, y = phenotype, col = genotype), size = 1,
             shape = 21, position = position_jitterdodge(jitter.width = 0.2,
                                                         jitter.height = 0.1,
                                                         dodge.width = 0.5))+
  scale_color_discrete(name = "genotype")+
  theme_bw(base_size = 16)+
  theme(legend.position="none")
@
\end{figure}

\paragraph{Association analysis}
A typical way of visualizing the \texttt{genphen} results is with the following
plot where each point represents a polymorphism (here SAAP) plotted according
to x=classification accuracy ($CA$), y=absolute $d$ (with error bars 95\% HDI),
color=Cohen's $\kappa$. The most promising SAAPs have CA and $\kappa \approx 1$,
with absolute $d$ estimate whose 95\% highest density interval (HDI) does not
overlap with the null effect ($d \neq 0$). The labels indicate the MSA site
number, followed by the type of the polymorphism, and finally the BC score.

\begin{figure}[H]
\centering
<<echo=FALSE, include=TRUE, results=hide, fig=TRUE, width=8, height=6>>=
d.score$label <- paste(d.score$site, d.score$mutation,
                       round(x = d.score$bc, digits = 2),
                       sep = ':')
ggplot(data = d.score)+
  geom_errorbar(aes(x = ca, ymin = absolute.d.L, ymax = absolute.d.H))+
  geom_point(aes(x = ca, y = absolute.d, fill = kappa), shape = 21, size = 3)+
  geom_text_repel(aes(x = ca, y = absolute.d, label = label), size = 4)+
  theme_bw(base_size = 16)+
  xlab(label = "CA")+
  ylab(label = "Absolute d (with 95% HDI)")+
  scale_x_continuous(limits = c(0, 1))+
  geom_hline(yintercept = 0, linetype = "dashed")+
  scale_fill_gradientn(colours = terrain.colors(n = 10), limits = c(0, 1))
@
\end{figure}

\paragraph{MCMC convergence}
Next, we want to check the validity our Bayesian inference model by inspecting
the \texttt{genphen} output named convergence. A similar analysis was performed
and described in \ref{sec:case1}.

<<echo=FALSE, include=TRUE, results=tex>>=
print(xtable(d.convergence, align = rep(x = "c", times = 8, digits = 2)),
      include.rownames = FALSE, size = "small")
@


\section{Extra Utilities}
\label{sec:utilities}
\subsection{Data Reduction}
The methods implemented in \texttt{genphen} are statistically superior to the
ones implemented by classical tools for GWAS. This however comes at the cost of
increased computational burdain. Therefore, using \texttt{genphen} to study the
association between hundreeds of thousands of SNPs and a phenotype can be quite
costly. Motivated by the biological fact that a major fraction of the SNPs are
non-informative (genetic noise) with respect to the selected phenotype, we
implemented a low-weight diagnostics procedure in \texttt{genphen}, which allows
us to quickly scan the SNP space and quickly discard large portion of the
non-informative SNPs.

The data reduction procedure includes the following steps:
\begin{enumerate}
\item using random forest and their variable importance measures, we obtain one
importance value for each SNP.
\item next, we rank the SNPs by their importance and use the importance
distribution as a 'rough' guide to sample and evaluate SNPs (so-called anchor
SMPs) using a lighter-weight version of the standard genphen approach.
\item using the previously explained \texttt{genphen} association scores.
\item we can therefore determine the importance rank at which the SNPs no longer
carry any information and would be discarded, reduce the data and perform the
analysis on the remaining data using the standard \texttt{genphen} method.
\end{enumerate}

<<echo=FALSE, include=F, results=hide>>=
set.seed(seed = 551155)
g1 <- replicate(n=10^4, expr = as.character(rbinom(n=30, size = 1, prob = 0.5)))
g2 <- replicate(n=10^4, expr = as.character(rbinom(n=30, size = 1, prob = 0.5)))
gen <- rbind(g1, g2)
phen <- c(rnorm(n = 30, mean = 3, sd = 3), rnorm(n = 30, mean = 5, sd = 3))

# run diagnostics
diag.pre <- runDiagnostics(genotype = gen,
                           phenotype = phen,
                           phenotype.type = "continuous",
                           rf.importance.trees = 50000,
                           with.anchor.points = FALSE,
                           mcmc.chains = 2,
                           mcmc.iterations = 1500,
                           mcmc.warmup = 500,
                           mcmc.cores = 2,
                           hdi.level = 0.95,
                           anchor.points = c(1:10))

# run diagnostics
diag.post <- runDiagnostics(genotype = gen,
                           phenotype = phen,
                           phenotype.type = "continuous",
                           rf.importance.trees = 50000,
                           with.anchor.points = TRUE,
                           mcmc.chains = 2,
                           mcmc.iterations = 1500,
                           mcmc.warmup = 500,
                           mcmc.cores = 2,
                           hdi.level = 0.99,
                           anchor.points = c(1:5, 101:105, 301:305,
                                             501:505, 1001:1005,
                                             2001:2005, 5001:5005,
                                             9001:9005))

pre.scores <- diag.pre$importance.scores
post.scores <- diag.post$scores
post.scores$significant <- ifelse(test = post.scores$cohens.d.L <= 0
                                  & post.scores$cohens.d.H >= 0,
                                  yes = FALSE, no = TRUE)
@


Using a case study, we explain the typical data reduction steps in more detail.
First we use random forest to get the distribution of variable (SNP) importance,
shown in the figure below.

\begin{figure}[H]
\centering
<<echo=FALSE, include=TRUE, results=hide, fig=TRUE, width=8, height=4>>=
ggplot(data = pre.scores)+
  geom_step(aes(x = importance.rank, y = importance))+
  xlab("Rank")+
  ylab("Importance")+
  ggtitle("Importance distribution")+
  theme_bw(base_size = 16)+
  scale_fill_gradientn(colours = terrain.colors(n = 10), limits = c(0, 1))
@
\end{figure}

We then select a set of anchor points (SNPs) to analyze. In this particular
example we selected 25 anchor points with ranks in the interval 1-5, 101-105,
301-305, 501-505, 1001-1005, 2001-2005, 5001-5005 and 9001-9005 from the total
set of 10,000 genotypes. We then visualize the estimated Cohen's d effect size
(and 99\% HDI) as shown in the figure below. The anchor points (SNPs) are shown
as points, colored green if the HDI (gray error bar) does not include the 0
effect, and green otherwise. We observe that the non-informative SNPs become
prevalent past rank 100. We can therefore select all SNPs with rank higher than
100 and perform the main analysis with only 100 SNPs (yielding 99\% data
reduction). We might also select a more conservative threshold such as 500
(yielding 95\% data reduction).

\begin{figure}[H]
\centering
<<echo=FALSE, include=TRUE, results=hide, fig=TRUE, width=8, height=4>>=
ggplot(data = post.scores)+
  geom_errorbar(aes(x = anchor.point, ymin = cohens.d.L, ymax = cohens.d.H),
                col = "gray", width = 0.1)+
  geom_point(aes(x = anchor.point, y = cohens.d, col = significant), size = 2)+
  geom_hline(yintercept = 0, linetype = "dashed", col = "gray")+
  theme_bw(base_size = 16)+
  xlab(label = "Rank")+
  ylab(label = "Cohen's d (with 99% HDI)")+
  scale_x_continuous(trans = "log10", breaks = c(1, 10, 100, 1000, 10000),
                     labels = c(1, 10, 100, 1000, 10000))+
  annotation_logticks(sides = "b")+
  theme(legend.position = "top")
@
\end{figure}


\end{document}