%\VignetteIndexEntry{genphen overview} \documentclass[a4paper, 11pt]{article} \usepackage[utf8]{inputenc} \let\chapter\section \usepackage[lined]{algorithm2e} \usepackage[numbers]{natbib} \bibliographystyle{plainnat} \usepackage[english]{babel} \selectlanguage{english} \usepackage{graphicx} \usepackage{placeins} \usepackage{amsmath} \usepackage{amscd} \usepackage{ifthen} \usepackage{float} \usepackage{subfig} \usepackage{lscape} \usepackage{parskip} \usepackage{multirow} \usepackage{todonotes} \usepackage{color} \usepackage{colortbl} \definecolor{hellgrau}{rgb}{0.9,0.9,0.9} \definecolor{dunkelgrau}{rgb}{0.8,0.8,0.8} \usepackage{hyperref} % remove ref boxes %\usepackage[hidelinks]{hyperref} %add-on renews \renewcommand\topfraction{0.85} \renewcommand\bottomfraction{0.85} \renewcommand\textfraction{0.1} \renewcommand\floatpagefraction{0.85} %confusion matrix \newcommand\confusion[1]{ \fbox{\lower0.75cm \vbox to 0.5cm{\vfil \hbox to 0.5cm{\hfil\parbox{1cm}{#1}} \vfil}% }% } \setlength\floatsep{1.25\baselineskip plus 3pt minus 2pt} \setlength\textfloatsep{1.25\baselineskip plus 3pt minus 2pt} \setlength\intextsep{1.25\baselineskip plus 3pt minus 2 pt} \setlength{\parindent}{0pt} \begin{document} \SweaveOpts{concordance=TRUE} \title{Tutorial on using \texttt{genphen}} \author{Simo Kitanovski, Daniel Hoffmann,\\ Bioinformatics and Computational Biophysics,\\ University of Duisburg-Essen, Essen, Germany} \maketitle \tableofcontents This tutorial gives you some of the technical background underlying \texttt{genphen} that should enable you to understand and use this tool. \section{\texttt{genphen} quantifies genotype-phenotype associations} Genome wide association studies (GWAS) have become an important tool to understand the association between genotypes and phenotypes. With GWAS we try to answer questions such as ``what are the genotypes in the human genome which predispose to a disease?'' or ``what are the genotypes in certain strains of mice which confer resistance against a specific virus?''. The advances in high- throughput sequencing technology have provided massive genetic data and thus potentially countless applications of genotype-phenotype association studies. The genotype can be a set of single nucleotide polymorphisms (SNPs) or a set of single amino acid polymorphisms (SAAPs) identified in a group of individuals, whereas the phenotype can be any continuous or discrete individual quantity or characteristics. To conduct GWAS, frequentist statistical methods are typically used, relying on simple and often inadequate methods to capture the complex and potentially non- linear genotype-phenotype association. Moreover, these methods often use P- values to quantify the strength of association, bringing with them a set of disadvantages, some of which include poor interpretation, difficulty to compare between different studies, as well as massive multiple hypothesis problems. With \texttt{genphen} we provide a hybrid method which reaps the benefits of sophisticated statistical learning approaches such as random forest (RF) and support vector machine (SVM) to capture complex genotype-phenotype associations, on the one hand, and Bayesian inference on the other hand, to accurately quantify the strength of association using models consistent with the data. The results of \texttt{genphen} are multiple association scores for each genotype. Visualizing these scores together can present the researcher a meaningful guide to selecting the most promising association. Furthermore, \texttt{genphen} provides a set of procedures including a test for phylogenetic bias (used to discover biases in the data due to the population structure), procedure for data reduction (used for removing non-informative genotypes and thereby simplifying the GWAS), data augmentation (used to augment small sample datasets) and methods for gene prioritization based on network diffusion algorithms using functional network data. \section{Methods} \label{sec:methods} \subsection{Input} Two data types are necessary to perform a genetic association study: \begin{itemize} \item genotype data (e.g. set of 1,000 SNPs found along the aligned genomes of 10 individuals) \item phenotype data (experimental measurement made for each individual such as body mass index, immune response, survival, case-control, etc.) \end{itemize} More generally, we can think of the genotype data as a character matrix with dimensions $N \times M$, whereby the $M$ columns represent different SNPs or SAAPs, and the $N$ rows represent different individuals or sequences for which we have some measured phenotype. Therefore, we can think of the phenotype as a numerical vector of length $N$, where each phenotype corresponds to a particular individual. Moreover, the phenotype can be of specific type (e.g. quantitative, dichotomous, etc.), imposing requirements on the type of statistical test that can be be used for the association analysis. \subsection{Association Scores} Between each genotype (SNP/SAAP) and phenotype, \texttt{genphen} computes several measures of association, each of which is explained in the following paragraphs. \paragraph{Classification accuracy($CA$)} $CA$ measures the degree of accuracy with which one can classify (predict) the alleles of an SNP from the phenotype measurements. If there exists a strong association between a particular SNP and the phenotype, one should be able to build a statistical model which accurately classifies the two alleles of that SNP solely from the phenotype data ($CA \approx 1$). Otherwise, the classification accuracy of statistical model should be approximately similar to that of simple guessing ($CA \approx 0.5$). To estimate a robust $CA$ estimate, \texttt{genphen} uses cross-validation (CV), obtaining a distribution of possible $CAs$. During the CV procedure a subset (e.g. 70\%) of the genotype-phenotype data is selected at random for training the classifier, followed by testing based on the remaining data. The following confusion matrix represents the result of one CV step: \newcommand{\M}{\textmd} \newcommand{\B}{$\bullet$} \newcommand{\MC}{\multicolumn} \newcommand{\MR}{\multirow} \setlength{\doublerulesep}{.9\arrayrulewidth} \addtolength{\tabcolsep}{2pt} \begin{table}[H] \centering \begin{tabular}% {|cl|c|c|} \hline & & \MC{2}{c||}{\textbf{Real}} \\ & & \textbf{$allele_1$} & \textbf{$allele_2$} \\ \hline\hline & \textbf{$allele_1$} & a & b\\ \cline{2-4} \MR{-2}{*}{\textbf{Predicted}} & \textbf{$allele_2$} & c & d\\ \hline \end{tabular} \caption{Confusion matrix resulting from a classification analysis} \label{tab:t0} \end{table} The CA of the cross-validation step $i$ is then estimated as: \begin{gather*} CA_i = \frac{a+d}{a+b+c+d} \end{gather*} The final $CA$ after 1000 CV steps is then estimated as: \begin{gather*} CA = \frac{1}{1000} \sum_{i=1}^{1000} CA_{i} \end{gather*} In addition to estimating $CA$, the distribution of $CAs$ enables us to also compute the 95\% highest density interval (95\% HDI) of CA. The mutations with $CA \approx 1$, with narrow HDI have the strongest associations. The metric $CA$ has the following advantages: \begin{itemize} \item simple to estimate \item simple to interpret (bound between 0 and 1) \item simple to compare across studies \item one can have high CA despite small effects \item detects strong non-linear associations as well \end{itemize} \paragraph{Cohen's $\kappa$ statistic} There is one pitfall where the $CA$ estimate can be truly misleading, and this is the case when the analyzed SNP is composed of unevenly represented genetic states (alleles). For instance, the allele A of a given SNP is found in 90\% of the individuals, while the other allele T in only 10\%. Such an uneven composition of the alleles can lead to misleading results, i.e. even without learning the algorithm can produce a high $CA \approx 0.9$ simply by constantly predicting only the dominant label. The Cohen's $\kappa$ statistics can be used to estimate how much better the observed $CA$ is, compared to the classification accuracy expected by chance ($CA_{exp}$). To compute the $\kappa$ statistics, the confusion matrix shown before in Table \hyperref[tab:t0]{1} is used: \begin{gather*} \kappa = \frac{CA - CA_{exp}} {1 - CA_{exp}} \\ CA_{exp} = \frac{a+b}{a+b+c+d} \cdot \frac{a+c}{a+b+c+d} + \frac{c+d}{a+b+c+d} \cdot \frac{b+d}{a+b+c+d} \\ \end{gather*} The $\kappa$ statistics is a quality metric, which is to be used together with $CA$. Cohen defines the following meaningful $\kappa$ intervals: [$\kappa$<0]: ``no agreement'', [0.0-0.2]: ``slight agreement'', [0.2-0.4]: ``fair agreement'' , [0.4-0.6]: ``moderate agreement'', [0.6-0.8]: ``substantial agreement'' and [0.8-1.0]: ``almost perfect agreement''. Similarly to the estimation of $CA$, the final Cohen's $\kappa$ is also estimated by averaging the individual $\kappa$ scores computed for each step of the CV. Here too, 95\% HDIs are estimated. \paragraph{Cohen's $d$ effect size} Given data from two groups (two allele groups in a SNP), we ask the question: How much is one group different from the other with respect to the phenotype observed in each group? In the case of quantitative phenotypes we answer this question by computing the Cohen's $d$ for each genotype-phenotype pair. Cohen's $d$ estimates which are significantly greater or smaller than 0 indicate that there is a large difference in the phenotype between the two genetic states of the specific genotype. Cohen (1992) defines level which define the magnitude of the effects as: |$d$|<0.2``negligible'', |$d$|<0.5 ``small'', |$d$|<0.8 ``medium'', otherwise ``large''. The Cohen's $d$ is computed as follows: \begin{gather*} d = \frac{\mu_1-\mu_2}{\sqrt{({\sigma_1}^2+{\sigma_2}^2)/2}} \end{gather*} where $\mu_1$, $\mu_2$ and $\sigma_1$, $\sigma_2$ represent the mean and the standard deviations of the phenotypes in the two genetic states of the genotype. \texttt{genphen} uses the following Bayesian inference models designed in STAN \footnote{Stan Development Team. 2017. Stan Modeling Language Users Guide and Reference Manual, Version 2.17.0. http://mc-stan.org}, to estimate each of the parameters in the Cohen's $d$ equation from the data: \begin{gather*} Y_{ijk} \sim T(\nu_i, \mu_{ijk}, \sigma_{ijk}) \\ \mu_{jk} \sim N(\hat{\mu}, \hat{\sigma} \cdot 100) \\ \sigma_{jk} \sim U(\hat{\sigma}/100, \hat{\sigma} \cdot 100) \\ \nu_j \sim Gamma(2.0, 0.1) \end{gather*} where $i$, $j$ and $k$ index phenotype observation $i$ at site $j$ and genotype $k$, respectively in the phenotype vector $Y$; $\mu$, $\sigma$ and $\nu$ are the mean, standard deviation and degrees of freedom parameters of the T-distribution which is used to model the phenotype observed in each allele; $\hat{\sigma}$ and $\hat{\mu}$ are the empirically estimated mean and standard deviations of the phenotype in each allele which are used to setup the broad priors for $\mu$ and $\sigma$. For each parameter we estimate a complete posterior distribution with MCMC sampling implemented in \texttt{rstan}. Therefore, by plugging in the entire posterior distributions of the parameters into the Cohen's $d$ equation, we estimate a complete posterior distribution for $d$ as well. This also allows us to compute the corresponding 95\% HDI of $d$. A complete description of the hierarchical model is provided in Krushke, 2013) \footnote{Kruschke, John K. "Bayesian estimation supersedes the t test."Journal of Experimental Psychology: General 142.2 (2013): 573}. For SNPs with more than two groups (e.g. 3 alleles), the Cohen's $d$ estimate is computed between each pair of alleles. This approach, although computationally more challenging than a simple t-test, has a few advantages related to the fact that the phenotype is modeled in a more consistent way, i.e. the distribution of the phenotype in each group is described as a T-distribution with a mean ($\mu$), standard deviation ($\sigma$) and shape parameter ($\nu$). \begin{itemize} \item complete information about the credible parameter values is obtained \item handling of outliers achieved by describing the data with heavy-tailed distributions instead of normal distribution $\rightarrow$ assumption of normality is alleviated \item the T-distribution which describes the phenotype in each group has its own variance parameter $\rightarrow$ assumption of homoscedasticity is alleviated. \item in addition to computing the differences between the central tendencies of the groups, we can use the standard deviation parameters to compute differences between their variabilities \end{itemize} \paragraph{Absolute effect size ($a$)} For dichotomous phenotypes, \texttt{genphen} simply computes the absolute difference (also known as contrast $a$) between the phenotypes of any two genetic states of each polymorphism as follows: \begin{gather*} a = p_1 - p_2 \end{gather*} where $p_1$, $p_2$ represent the proportion of "successes" in a given number of observed trials in the two genetic states of the given SNP/SAAP. Similar to the case of having a continuous phenotype, \texttt{genphen} uses a Bayesian inference models designed in STAN to estimate each of the parameters in the equation above from the data: \begin{gather*} Y_{ijk} \sim Bern(p_{jk}) \\ p_{jk} \sim Beta(1/2, 1/2) \end{gather*} where $i$, $j$ and $k$ index phenotype observation $i$ at site $j$ and genotype $k$, respectively in the phenotype vector $Y$; $p$, is the probability parameter of the Bernoulli distribution which is used to model the dichotomous phenotype data as a set of sucesses in a set of trials observed each allele; The prior of $p$ is a Beta distribution whose two parameters are fixed at 0.5 (Jeffrey's prior). Plugging in the estimated posterior distributions of $p$ into the equation above allows us to compute the mean $a$ point estimate and its 95\% HDI. \paragraph{Bhattacharyya coefficient ($BC$)} We further use the infered parameters needed to compute the effect size, to perform posterior predictive checks, i.e. we simulate phenotype data. The simulated data for each genotype is then used to estimate the degree of overlap between the simulated phenotype distributions in two genetic states of the mutation. The overlap is quantified using the Bhattacharyya coefficient (BC): \begin{gather*} BC(p_1, p_2) = \int_x\sqrt(p_1(x) \cdot p_2(x)) dx \end{gather*} where $p_1$ and $p_2$ are the simulated phenotype distributions in both genetic states of a mutation. For a complete overlap BC = 1 (i.e. no difference between the phenotype distributions in the two genetic state), and BC = 0 for no overlap (significant difference). \subsection{Phylogenetic Bias ($B$)} To control for potential phylogenetic biases (population structure), we devised the following procedure. First, we use the complete genotype data (all SNPs) to compute a kinship matrix (NxN dissimilarity matrix between the N-individuals). Alternatively, the users can provide their own kinship matrix (e.g. estimated using more accurate phylogenetic methods). For a group of individuals which belong to a group defined by an alleles of a given SNP, we next compute their mean kinship distance using the kinship matrix data. If the individuals in the group are related, the compute mean kinship distance must be significantly lower than the mean kinship distance computed from the complete kinship matrix. Thus, we define the phylogenetic bias as: \begin{gather*} B = 1 - \hat{d}_{g}/\hat{d}_{t} \end{gather*} where $\hat{d}_{g}$ is the mean kinship distance between the individuals who share the genotype $g$; $\hat{d}_{t}$ is the mean kinship distance of the complete kinship matrix. For a complete phylogenetic bias, B = 1 ($\hat{d}_{g} << \hat{d}_{t}$), and $B = 0$ (or slightly negative) for no bias. This estimate is computed for each SNP and genotype group within each SNP. To compute the phylogenetic bias associated with a mutation ($g_{1} -> g_{2}$), we compute: \begin{gather*} B = 1 - min(\hat{d}_{g_{1}}, \hat{d}_{g_{2}})/\hat{d}_{t} \end{gather*} where $\hat{d}_{g_{1}}$ and $\hat{d}_{g_{2}}$ represent the mean kinship distance between the individuals who share the genotype (allele) $g_{1}$ and $g_{2}$ or a given SNP; $\hat{d}_{t}$ is the mean kinship distance in the complete kinship matrix. For a complete phylogenetic bias, B = 1 and B = 0 (or slightly negative) for no bias. This estimate is computed for each SNP and each pair of genotypes. \section{Case studies} \subsection{I: Association between SNPs and a *continuous* phenotype} \label{sec:case1} In the first case study, we show a typical genotype-phenotype analysis, whereby the genotype is a protein sequence alignment composed of 6 sites and 120 individuals (sequences), and a continuous phenotype measured for each of the individuals. <>= require(genphen) require(ggplot2) require(knitr) require(ggrepel) require(reshape2) require(ape) require(xtable) options(xtable.floating = FALSE) # genotype inputs: data(genotype.saap) # phenotype inputs: data(phenotype.saap) # run genphen c.out <- runGenphen(genotype = genotype.saap[, 80:85], phenotype = phenotype.saap, phenotype.type = "continuous", mcmc.chains = 2, mcmc.iterations = 2000, mcmc.warmup = 500, mcmc.cores = 4, hdi.level = 0.95, stat.learn.method = "rf", cv.iterations = 300) # scores c.convergence <- c.out$convergence c.score <- c.out$scores @ \paragraph{Genotype-phenotype data} First we show an overview of the distribution of the phenotype across each of the 6 studied polymorphic sites in the sequence alignment, and the underlying genotype states. \texttt{genphen} will list the mutations found at each size, followed by quantification fo the association strength as explained in \ref{sec:methods}. \begin{figure}[H] \centering <>= df <- data.frame(genotype.saap[, 80:85], stringsAsFactors = FALSE) df$phenotype <- phenotype.saap df <- melt(data = df, id.vars = "phenotype") colnames(df) <- c("phenotype", "site", "genotype") df$site <- gsub(pattern = "X", replacement = '', x = df$site) ggplot(data = df)+ facet_wrap(facets = ~site, ncol = 6, scales = "free_x")+ geom_violin(aes(x = genotype, y = phenotype))+ geom_point(aes(x = genotype, y = phenotype, col = genotype), size = 1, shape = 21, position = position_jitterdodge(jitter.width = 0.2, jitter.height = 0.1, dodge.width = 0.5))+ scale_color_discrete(name = "genotype")+ theme_bw(base_size = 16)+ theme(legend.position="none") @ \end{figure} \paragraph{Association analysis} A typical way of visualizing the \texttt{genphen} results is with the following plot where each point represents a polymorphism (here SAAP) plotted according to x=classification accuracy ($CA$), y=Cohen's $d$, color=Cohen's $\kappa$. The most promising SAAPs have CA and $\kappa$ close to 1, with Cohen's $d$ estimate whose 95\% highest density interval (HDI) does not overlap with the null effect (dashed line in figure: $d = 0$). The labels indicate the MSA site number, followed by the type of the polymorphism, and finally the BC score. \begin{figure}[H] \centering <>= c.score$label <- paste(c.score$site, c.score$mutation, round(x = c.score$bc, digits=2), sep=':') ggplot(data = c.score)+ geom_errorbar(aes(x = ca, ymin = cohens.d.L, ymax = cohens.d.H))+ geom_point(aes(x = ca, y = cohens.d, fill = kappa), shape = 21, size = 3)+ geom_text_repel(aes(x = ca, y = cohens.d, label = label), size = 4)+ theme_bw(base_size = 16)+ xlab(label = "CA")+ ylab(label = "Cohen's d (with 95% HDI)")+ scale_x_continuous(limits = c(0, 1))+ geom_hline(yintercept = 0, linetype = "dashed")+ scale_fill_gradientn(colours = terrain.colors(n = 10), limits = c(0, 1)) @ \end{figure} The association scores are also shown in the following table: <>= print(xtable(c.out$debug.scores[, c("site", "mutation", "cohens.d", "cohens.d.hdi", "bc", "ca", "ca.hdi", "kappa", "kappa.hdi"), ], align = rep(x = "c", times = 10, digits = 2)), include.rownames = FALSE, size = "small") @ \paragraph{MCMC convergence} Next, we want to check the validity our Bayesian inference model by inspecting the \texttt{genphen} output named convergence which contains information about the markov chain monte carlo (MCMC) simulation done with R package rstan including potential scale reduction factor (Rhat) and effective sampling size (ESS), as well as information concerning potential convergence issues such as divergences, tree depth exceeded warnings, etc. The small sample size of specific alleles (of a site) are often the cause of such warnings, which are then reported for the the alleles of that site. <>= print(xtable(c.convergence, align = rep(x = "c", times = 10, digits = 2)), include.rownames = FALSE, size = "small") @ \paragraph{Phylogenetic bias control} Next, we compute the phylogenetic bias of each mutation, shown in the table below: <>= # compute the phylogenetic bias bias <- runPhyloBiasCheck(genotype = genotype.saap, input.kinship.matrix = NULL) # extract kinship matrix kinship.matrix <- bias$kinship.matrix # extract the bias associated with mutations of the sites which were included # in the association analysis mutation.bias <- bias$bias.mutations mutation.bias <- mutation.bias[mutation.bias$site %in% 80:85, ] mutation.bias <- mutation.bias[mutation.bias$mutation %in% c.score$mutation, ] # show the bias table print(xtable(mutation.bias, align = rep(x = "c", times = 3+1, digits = 2)), include.rownames = FALSE, size = "small") @ We use the kinship matrix to perform hierarchical clustering, visualizing the population strcuture and two examples (mutations) with genotype 1 marked with blue and genotype 2 marked with orange in either case. Individuals not covered by either genotype are marked with gray color. The shown examples differ in the degree of phylogenetic bias. \begin{figure}[H] \centering <>= color.a <- character(length = nrow(genotype.saap)) color.a[1:length(color.a)] <- "gray" color.a[which(genotype.saap[, 82] == "h")] <- "orange" color.a[which(genotype.saap[, 82] == "q")] <- "blue" color.b <- character(length = nrow(genotype.saap)) color.b[1:length(color.b)] <- "gray" color.b[which(genotype.saap[, 84] == "a")] <- "orange" color.b[which(genotype.saap[, 84] == "d")] <- "blue" c.hclust <- hclust(as.dist(kinship.matrix), method = "average") par(mfrow = c(1, 2), mar = c(0,0,1,0) + 0.1) plot(as.phylo(c.hclust), tip.color = color.a, cex = 0.6, type = "fan", main = "B = 0.15") plot(as.phylo(c.hclust), tip.color = color.b, cex = 0.6, type = "fan", main = "B = 0.43") @ \end{figure} \subsection{II: Association between SNPs and a *dichotomous* phenotype} \label{sec:case2} In the second case study we show you how to use \texttt{genphen} in case the phenotype is of dichotomous type. The genotype input is a protein sequence alignment composed of 12 sites and 120 individuals (sequences), and the phenotype is a vector of 120 dichotomous values measured for each individual. <>= # genotype inputs: data(genotype.saap) # phenotype inputs: data(dichotomous.phenotype.saap) # run genphen d.out <- runGenphen(genotype = genotype.saap[, 68:79], phenotype = dichotomous.phenotype.saap, phenotype.type = "dichotomous", mcmc.chains = 2, mcmc.iterations = 2000, mcmc.warmup = 500, mcmc.cores = 4, hdi.level = 0.95, stat.learn.method = "rf", cv.iterations = 300) # scores d.convergence <- d.out$convergence d.score <- d.out$scores @ \paragraph{Genotype-phenotype data} First we show an overview of the distribution of the phenotype across each of the 12 studied polymorphic sites in the sequence alignment, and the underlying genotype states. \texttt{genphen} will list the mutations found at each size, followed by quantification fo the association strength as explained in \ref{sec:methods}. \begin{figure}[H] \centering <>= df <- data.frame(genotype.saap[, 68:79], stringsAsFactors = FALSE) df$phenotype <- dichotomous.phenotype.saap df <- melt(data = df, id.vars = "phenotype") colnames(df) <- c("phenotype", "site", "genotype") df$site <- gsub(pattern = "X", replacement = '', x = df$site) ggplot(data = df)+ facet_wrap(facets = ~site, ncol = 6, scales = "free_x")+ geom_point(aes(x = genotype, y = phenotype, col = genotype), size = 1, shape = 21, position = position_jitterdodge(jitter.width = 0.2, jitter.height = 0.1, dodge.width = 0.5))+ scale_color_discrete(name = "genotype")+ theme_bw(base_size = 16)+ theme(legend.position="none") @ \end{figure} \paragraph{Association analysis} A typical way of visualizing the \texttt{genphen} results is with the following plot where each point represents a polymorphism (here SAAP) plotted according to x=classification accuracy ($CA$), y=absolute $d$ (with error bars 95\% HDI), color=Cohen's $\kappa$. The most promising SAAPs have CA and $\kappa \approx 1$, with absolute $d$ estimate whose 95\% highest density interval (HDI) does not overlap with the null effect ($d \neq 0$). The labels indicate the MSA site number, followed by the type of the polymorphism, and finally the BC score. \begin{figure}[H] \centering <>= d.score$label <- paste(d.score$site, d.score$mutation, round(x = d.score$bc, digits = 2), sep = ':') ggplot(data = d.score)+ geom_errorbar(aes(x = ca, ymin = absolute.d.L, ymax = absolute.d.H))+ geom_point(aes(x = ca, y = absolute.d, fill = kappa), shape = 21, size = 3)+ geom_text_repel(aes(x = ca, y = absolute.d, label = label), size = 4)+ theme_bw(base_size = 16)+ xlab(label = "CA")+ ylab(label = "Absolute d (with 95% HDI)")+ scale_x_continuous(limits = c(0, 1))+ geom_hline(yintercept = 0, linetype = "dashed")+ scale_fill_gradientn(colours = terrain.colors(n = 10), limits = c(0, 1)) @ \end{figure} \paragraph{MCMC convergence} Next, we want to check the validity our Bayesian inference model by inspecting the \texttt{genphen} output named convergence. A similar analysis was performed and described in \ref{sec:case1}. <>= print(xtable(d.convergence, align = rep(x = "c", times = 8, digits = 2)), include.rownames = FALSE, size = "small") @ \section{Extra Utilities} \label{sec:utilities} \subsection{Data Reduction} The methods implemented in \texttt{genphen} are statistically superior to the ones implemented by classical tools for GWAS. This however comes at the cost of increased computational burdain. Therefore, using \texttt{genphen} to study the association between hundreeds of thousands of SNPs and a phenotype can be quite costly. Motivated by the biological fact that a major fraction of the SNPs are non-informative (genetic noise) with respect to the selected phenotype, we implemented a low-weight diagnostics procedure in \texttt{genphen}, which allows us to quickly scan the SNP space and quickly discard large portion of the non-informative SNPs. The data reduction procedure includes the following steps: \begin{enumerate} \item using random forest and their variable importance measures, we obtain one importance value for each SNP. \item next, we rank the SNPs by their importance and use the importance distribution as a 'rough' guide to sample and evaluate SNPs (so-called anchor SMPs) using a lighter-weight version of the standard genphen approach. \item using the previously explained \texttt{genphen} association scores. \item we can therefore determine the importance rank at which the SNPs no longer carry any information and would be discarded, reduce the data and perform the analysis on the remaining data using the standard \texttt{genphen} method. \end{enumerate} <>= set.seed(seed = 551155) g1 <- replicate(n=10^4, expr = as.character(rbinom(n=30, size = 1, prob = 0.5))) g2 <- replicate(n=10^4, expr = as.character(rbinom(n=30, size = 1, prob = 0.5))) gen <- rbind(g1, g2) phen <- c(rnorm(n = 30, mean = 3, sd = 3), rnorm(n = 30, mean = 5, sd = 3)) # run diagnostics diag.pre <- runDiagnostics(genotype = gen, phenotype = phen, phenotype.type = "continuous", rf.importance.trees = 50000, with.anchor.points = FALSE, mcmc.chains = 2, mcmc.iterations = 1500, mcmc.warmup = 500, mcmc.cores = 2, hdi.level = 0.95, anchor.points = c(1:10)) # run diagnostics diag.post <- runDiagnostics(genotype = gen, phenotype = phen, phenotype.type = "continuous", rf.importance.trees = 50000, with.anchor.points = TRUE, mcmc.chains = 2, mcmc.iterations = 1500, mcmc.warmup = 500, mcmc.cores = 2, hdi.level = 0.99, anchor.points = c(1:5, 101:105, 301:305, 501:505, 1001:1005, 2001:2005, 5001:5005, 9001:9005)) pre.scores <- diag.pre$importance.scores post.scores <- diag.post$scores post.scores$significant <- ifelse(test = post.scores$cohens.d.L <= 0 & post.scores$cohens.d.H >= 0, yes = FALSE, no = TRUE) @ Using a case study, we explain the typical data reduction steps in more detail. First we use random forest to get the distribution of variable (SNP) importance, shown in the figure below. \begin{figure}[H] \centering <>= ggplot(data = pre.scores)+ geom_step(aes(x = importance.rank, y = importance))+ xlab("Rank")+ ylab("Importance")+ ggtitle("Importance distribution")+ theme_bw(base_size = 16)+ scale_fill_gradientn(colours = terrain.colors(n = 10), limits = c(0, 1)) @ \end{figure} We then select a set of anchor points (SNPs) to analyze. In this particular example we selected 25 anchor points with ranks in the interval 1-5, 101-105, 301-305, 501-505, 1001-1005, 2001-2005, 5001-5005 and 9001-9005 from the total set of 10,000 genotypes. We then visualize the estimated Cohen's d effect size (and 99\% HDI) as shown in the figure below. The anchor points (SNPs) are shown as points, colored green if the HDI (gray error bar) does not include the 0 effect, and green otherwise. We observe that the non-informative SNPs become prevalent past rank 100. We can therefore select all SNPs with rank higher than 100 and perform the main analysis with only 100 SNPs (yielding 99\% data reduction). We might also select a more conservative threshold such as 500 (yielding 95\% data reduction). \begin{figure}[H] \centering <>= ggplot(data = post.scores)+ geom_errorbar(aes(x = anchor.point, ymin = cohens.d.L, ymax = cohens.d.H), col = "gray", width = 0.1)+ geom_point(aes(x = anchor.point, y = cohens.d, col = significant), size = 2)+ geom_hline(yintercept = 0, linetype = "dashed", col = "gray")+ theme_bw(base_size = 16)+ xlab(label = "Rank")+ ylab(label = "Cohen's d (with 99% HDI)")+ scale_x_continuous(trans = "log10", breaks = c(1, 10, 100, 1000, 10000), labels = c(1, 10, 100, 1000, 10000))+ annotation_logticks(sides = "b")+ theme(legend.position = "top") @ \end{figure} \end{document}