\documentclass{ubmanual}

\usepackage{bm}
\usepackage{amsmath,amssymb,amsfonts}
\usepackage{color,colortbl}
\usepackage{hyperref}
\hypersetup{colorlinks=false,
   pdfborder={0 0 0},
   pdftitle={PODKAT - An R Package for Association Testing Involving
      Rare and Private Variants},
   pdfauthor={Ulrich Bodenhofer}}

\title{{\Huge PODKAT}\\[5mm] An R Package for Association Testing Involving
Rare and Private Variants}
\author{Ulrich Bodenhofer\affilmark{1,2}}
\affiliation{\affilmark{1} School of Informatics, Communication and Media\\
University of Applied Sciences Upper Austria\\
Softwarepark 11, 4232 Hagenberg, Austria\\[2mm]
{\grey\affilmark{2} previously with: Institute of Bioinformatics,
Johannes Kepler University\\Altenberger Str.\ 69, 4040 Linz, Austria}}

\newcommand{\PODKAT}{PODKAT}
\newcommand{\SKAT}{SKAT}
\newcommand{\podkat}{\texttt{podkat}}
\newcommand{\R}{R}
\newcommand{\TM}{\textsuperscript{\tiny TM}}
\newcommand{\RTM}{\textsuperscript{\textcircled{\tiny R}}}
\renewcommand{\vec}[1]{\mathbf{#1}}
\newcommand{\Mat}[1]{\mathbf{#1}}
\newcommand{\logit}{\mathrm{logit}}
\newcommand{\Real}{\mathbb{R}}
\newcommand{\Natural}{\mathbb{N}}
\newcommand{\MAF}{\mathrm{MAF}}

\definecolor{ubye}{rgb}{1.00,0.83,0.16}
\newcommand{\gO}{\cellcolor{ubye}1}
\newcommand{\gF}{\cellcolor{ubye}4}

\DeclareFontShape{OT1}{cmtt}{bx}{n}%
{<5><6><7><8><9><10><10.95><12><14.4><17.28><20.74><24.88>cmttb10}{}
\newcommand{\ttbf}{\fontencoding{OT1}\fontfamily{cmtt}%
\fontseries{bx}\selectfont}

%\VignetteIndexEntry{PODKAT - An R Package for Association Testing Involving Rare and Private Variants}
%\VignetteDepends{methods, stats, graphics, utils, TxDb.Hsapiens.UCSC.hg38.knownGene, GWASTools, BSgenome.Hsapiens.UCSC.hg38.masked, BSgenome.Mmusculus.UCSC.mm10.masked}
%\VignetteEngine{knitr::knitr}

\setlength{\fboxrule}{1.5pt}

\newcommand{\notebox}[1]{%
\begin{center}
\fcolorbox{bioaz}{biowh}{\begin{minipage}{\Nboxwidth}
\noindent{\sffamily\bfseries Note:} #1
\end{minipage}}
\end{center}}

\begin{document}

<<>>=
options(width=65)
set.seed(0)
library(podkat)
podkatVersion <- packageDescription("podkat")$Version
podkatDateRaw <- packageDescription("podkat")$Date
podkatDateYear <- as.numeric(substr(podkatDateRaw, 1, 4))
podkatDateMonth <- as.numeric(substr(podkatDateRaw, 6, 7))
podkatDateDay <- as.numeric(substr(podkatDateRaw, 9, 10))
podkatDate <- paste(month.name[podkatDateMonth], " ", podkatDateDay, ", ",
                    podkatDateYear, sep="")
@
\newcommand{\PODKATVer}{\Sexpr{podkatVersion}}
\newcommand{\PODKATDate}{\Sexpr{podkatDate}}
\newcommand{\PODKATYear}{\Sexpr{podkatDateYear}}
\manualtitlepage{Version \PODKATVer, \PODKATDate}{https://github.com/UBod/podkat}

\section*{Scope and Purpose of this Document}

This document is a user manual for \PODKAT, an \R\ package implementing
non-burden association tests for rare and private variants, most importantly,
the {\em position-dependent kernel association test (PODKAT)}. It provides a
gentle introduction to how to use \PODKAT. Not all features of the \R\
package are described in full detail. Such details can be obtained from the
documentation enclosed in the \R\ package. Further note the following:
(1) this is not an introduction to statistical genetics or association
testing; (2) this is not an introduction to \R\ or any of the Bioconductor
packages used in this document; (3) this is not an introduction to
statistical hypothesis testing or probability theory.
If you lack the background for understanding this manual, you first have to
read introductory literature on the subjects mentioned above.

All \R\ code in this document is written to be runnable by any user. However,
some of the code chunks require the download of external files, require an
Internet connection, or require too much computation time to be runnable when
the package is built, checked, or installed. The output lines of \R\ code
chunks that are not actually executed when processing this document are marked
with `\verb+##!!##+' and, in case the user needs to perform extra steps to
execute the code, these steps are listed explicitly.

\clearpage

\newlength{\auxparskip}
\setlength{\auxparskip}{\parskip}
\setlength{\parskip}{0pt}
\tableofcontents
\clearpage
\setlength{\parskip}{\auxparskip}

\newlength{\Nboxwidth}
\setlength{\Nboxwidth}{\textwidth}
\addtolength{\Nboxwidth}{-2\fboxrule}
\addtolength{\Nboxwidth}{-2\fboxsep}

\section{Introduction}

This user manual describes the \R\ package \PODKAT. This \R\ package
implements non-burden association tests for rare and private variants, most
importantly, the {\em position-dependent kernel association test (PODKAT)}.

Before discussing details of how to use the package, let us first discuss the
general aim and setup of association studies. Suppose we have a certain number
of {\em samples} (study participants, patients, etc.) for each of which we can
measure/sequence the {\em genotype} and for each of which we know/have
measured/have observed a certain {\em trait} that we want to study. This trait
may be {\em continuous}, i.e.\ real-valued on a continuous scale (for
instance, age, height, body mass index, etc.), or {\em categorical}, i.e.\
from a discrete set of categories (for instance, case vs.\ control, treatment
outcome, disease type, etc.). In the following, we will only consider
continuous traits and categorical traits with two categories and refer to this
case as a {\em binary trait} (sometimes called {\em dichotomous trait} as
well) with values 0 or 1. The goal of association testing is to find out
whether there are any {\em statistically significant associations between the
genotype and the trait}.

In some studies, additional information about the samples' phenotypes or
environmental conditions is available that might also have an influence on the
trait (for instance, age, sex, ethnicity, family status, etc.). Such
additional features can be treated as {\em covariates}. More specifically, it
is rather common to first train a model that predicts the trait from the
covariates. Then the association between the genotype and those components of
the trait that have not been sufficiently explained by the covariates is
studied. In the case of \PODKAT, this is done by a {\em kernel-based
variance-component score test} \cite{Lin97,WuLeeCaiLiBoehnkeLin11}.

Assume that, for a given set of samples, we are given a trait vector (one
entry for each sample), genotypes of all samples (in matrix format or as a VCF
file\footnote{Variant Call Format; see
\url{http://www.1000genomes.org/wiki/analysis/variant-call-format/vcf-variant-call-format-version-42}
(last visited 2021-04-30) for a detailed specification of this file format}),
and a matrix of covariates (if any).
Then an association test using \PODKAT\ consists of the following basic steps:
\begin{description}
\item[Training of null model:] pre-processing of trait vector and covariates
(if any) for later use in an association test (see
Section~\ref{sec:nullModel});
\item[Selection of regions of interest:] specification of one or more genomic
regions for which association tests should be performed (see
Section~\ref{sec:regInterest});
\item[Association testing:] testing of association between genotype and
trait/null model for each selected region of interest (see
Section~\ref{sec:assocTest});
\item[Analysis of results:] post-processing (such as multiple testing
correction or filtering) and visualization of results (see
Section~\ref{sec:analysis}).
\end{description}
Figure~\ref{fig:workflow} shows a graphical overview of these basic steps
along with dependencies and data types.
\begin{figure}[p]
\centerline{\includegraphics[width=\textwidth]{PODKAT_workflow}}
\caption{Overview of the basic steps of the data analysis pipeline offered by
\PODKAT\ for analyzing associations between traits and genotypes.}
\label{fig:workflow}
\end{figure}

This manual is organized as follows: after some basic instructions on how to
install the package (Section~\ref{sec:install}), Section~\ref{sec:impatient}
provides a simple, yet complete, example that illustrates the general
workflow. Sections~\ref{sec:nullModel}--\ref{sec:analysis} provide more
details about the steps necessary to perform association tests with \PODKAT.
Sections~\ref{sec:misc}--\ref{sec:cite} provide miscellaneous additional
information.

\section{Installation}\label{sec:install}

The \PODKAT\ \R\ package (current version: \PODKATVer) is available via
Bioconductor. The simplest way to install the package is the following:
<<eval=FALSE>>=
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("podkat")
@
If you wish to install the package manually instead, you can download the
package archive that best fits your computer system from the Bioconductor
homepage.

To test the installation of the \PODKAT\ package, enter
<<>>=
library(podkat)
@
\noindent in your \R\ session. If this command terminates without any error
message or warning, you can be sure that the \PODKAT\ package has been
installed successfully. The \PODKAT\ package is then ready for use and you can
start performing association tests.

\section{\PODKAT\ for the Impatient}\label{sec:impatient}

In order to illustrate the basic workflow, this section presents two simple
examples without going into the details of each step. Let us first retrieve
the file names of the example files that are supplied as part of the \PODKAT\
package:
<<>>=
phenoFileLin <- system.file("examples/example1lin.csv", package="podkat")
phenoFileLog <- system.file("examples/example1log.csv", package="podkat")
vcfFile <- system.file("examples/example1.vcf.gz", package="podkat")
@
Now let us train the null model for the continuous trait contained in the
file \verb+example1lin.csv+:
<<>>=
pheno.c <- read.table(phenoFileLin, header=TRUE, sep=",")
model.c <- nullModel(y ~ ., pheno.c)
model.c
@
The examples are based on the small artificial genome \verb+hgA+ that is also
supplied as part of \PODKAT.
So we load it first and then partition it into overlapping windows:
<<>>=
data(hgA)
hgA
windows <- partitionRegions(hgA)
windows
@
The VCF file used for these two examples is small enough to be loadable at
once:
<<>>=
geno <- readGenotypeMatrix(vcfFile)
geno
@
Now we can already perform the two association tests. Let us start with the
continuous trait:
<<>>=
res.c <- assocTest(geno, model.c, windows)
print(res.c)
@
Now we perform multiple testing correction:
<<>>=
res.c <- p.adjust(res.c)
print(res.c)
@
Finally, we create a Manhattan plot:
\begin{center}
<<>>=
plot(res.c, which="p.value.adj")
@
\end{center}

For a binary trait, the whole pipeline looks the same. The \verb+nullModel()+
function automatically detects that the trait is binary and this information
is passed on to the subsequent steps without the need for additional settings:
<<>>=
pheno.b <- read.table(phenoFileLog, header=TRUE, sep=",")
model.b <- nullModel(y ~ ., pheno.b)
model.b
@
Now we can already perform the association tests for the binary trait. This
time, however, we do not load the entire genotype first, but we let
\verb+assocTest()+ read from the VCF file directly (which is only done piece
by piece in order to avoid excessive use of memory):
<<>>=
res.b <- assocTest(vcfFile, model.b, windows)
print(res.b)
@
Multiple testing correction again:
<<>>=
res.b <- p.adjust(res.b)
print(res.b)
@
Finally, we create a Manhattan plot:
\begin{center}
<<>>=
plot(res.b, which="p.value.adj")
@
\end{center}
The following sections provide details and more background information about
the functions used in the above steps.

\section{Training a Null Model}\label{sec:nullModel}

Before an association test can be performed, we have to pre-process the trait
vector and create a so-called {\em null model}, i.e.\ a probabilistic model of
the trait under the null assumption that the trait is independent of the
genotype and only depends on the covariates (if any). \PODKAT\ currently
offers three types of such null models:
\begin{description}
\item[Linear model:] the trait is continuous and depends linearly on the
covariates, i.e.
\[
y = \alpha_0+\bm{\alpha}^T\cdot\vec{x} + \varepsilon,
\]
where $y$ is the trait, $\alpha_0$ is the intercept, $\bm{\alpha}$ is a weight
vector, $\mathbf{x}$ is the vector of covariates, and $\varepsilon$ is
normally distributed random noise. If there are no covariates, $y$ is normally
distributed around the intercept $\alpha_0$.
\item[Logistic linear model:] the trait is binary and depends on the
covariates in the following way (with the same notations as above):
\[
\logit\big(p(y=1)\big) = \alpha_0+\bm{\alpha}^T\cdot\vec{x}
\]
If there are no covariates, $y$ is a binary Bernoulli-distributed random
variable with constant $p(y=1)=\logit^{-1}(\alpha_0)=1/(1+\exp(-\alpha_0))$.
\item[Bernoulli-distributed trait:] the trait is binary, does not depend on
any covariates, and follows a simple Bernoulli distribution with constant $p$.
\end{description}
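To make the three model types more concrete, the following purely illustrative
simulation generates traits that follow the above equations; all variable
names and parameter values below are made up for this sketch and are not part
of \PODKAT:
<<>>=
n <- 200
X <- cbind(X.1=rnorm(n), X.2=rnorm(n))   # covariate matrix for n samples
alpha0 <- 1                              # intercept (made up)
alpha <- c(0.5, -0.3)                    # covariate weights (made up)
## linear model: continuous trait = linear predictor + Gaussian noise
y.lin <- alpha0 + as.vector(X %*% alpha) + rnorm(n, sd=0.5)
## logistic linear model: logit of p(y=1) equals the linear predictor
p <- 1 / (1 + exp(-(alpha0 + as.vector(X %*% alpha))))
y.log <- rbinom(n, size=1, prob=p)
## Bernoulli-distributed trait: constant success probability, no covariates
y.bern <- rbinom(n, size=1, prob=0.4)
@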
\PODKAT\ offers one function \verb+nullModel()+ that allows for the creation
of any of the above three types of null models. In order to demonstrate how
\verb+nullModel()+ works, we first load two examples that are shipped with the
\PODKAT\ package. For the subsequent examples, we consider the two data frames
\verb+pheno.c+ and \verb+pheno.b+ that we created in
Section~\ref{sec:impatient} and investigate them in more detail.

The object \verb+pheno.c+ is a data frame with two covariate columns and one
column \verb+y+ containing a continuous trait:
<<>>=
colnames(pheno.c)
summary(pheno.c)
@
The object \verb+pheno.b+ is a data frame with two covariate columns and one
column \verb+y+ containing a binary trait:
<<>>=
colnames(pheno.b)
summary(pheno.b)
table(pheno.b$y)
@
As we have seen in Section~\ref{sec:impatient} already, the simplest way of
creating a null model is to call \verb+nullModel()+ via the formula interface,
in a way that is largely analogous to the \R\ standard functions \verb+lm()+
and \verb+glm()+:
<<>>=
model.c <- nullModel(y ~ ., pheno.c)
model.c
model.b <- nullModel(y ~ ., pheno.b)
model.b
@
Note that, in the above calls to \verb+nullModel()+, we did not explicitly
specify the type of the model. Whenever the \verb+type+ argument is not
specified, \verb+nullModel()+ tries to guess the right type of model. If the
trait vector/column is a factor or a numeric vector containing only 0's and
1's (where both values must be present, otherwise an association test would be
meaningless), the trait is assumed to be binary and a logistic linear model is
trained, unless the following conditions are satisfied:
\begin{enumerate}
\item The number of samples does not exceed 100.
\item No intercept and no covariates have been specified.
\end{enumerate}
If these two conditions are fulfilled for a binary trait, \verb+nullModel()+
considers the trait as a Bernoulli-distributed random variable (i.e.\ as the
third type of model described above). If the trait is numeric and not binary,
a linear model is trained. If the user wants to enforce a specific type of
model explicitly, he/she can do so by setting the argument \verb+type+ to one
of the three choices \verb+"linear"+, \verb+"logistic"+, or
\verb+"bernoulli"+ (see \verb+?nullModel+ for details).

An example using only the intercept, but no covariates:
<<>>=
nullModel(y ~ 1, pheno.c)
@
An example in which we want to consider the trait as a Bernoulli-distributed
variable:
<<>>=
nullModel(y ~ 0, pheno.b, type="bernoulli")
@
Apart from the formula interface used above, \verb+nullModel()+ also allows
for supplying a covariate matrix as first argument \verb+X+ (optional, omit if
no covariates should be considered) and a trait vector as second argument
\verb+y+:
<<>>=
covX <- as.matrix(pheno.c[, 1:2])
traitY <- pheno.c$y
nullModel(covX, traitY)
nullModel(y=traitY)
covX <- as.matrix(pheno.b[, 1:2])
traitY <- pheno.b$y
nullModel(covX, traitY)
nullModel(y=traitY)
nullModel(y=traitY, type="bernoulli")
@
As with many other \R\ functions, it is also possible to attach the data frame
with the phenotype data (trait plus covariates) to the global environment.
Then it is no longer necessary to pass the data frame to the
\verb+nullModel()+ function. However, one has to be more cautious with the
selection of the covariates. The option to simply select all covariates with
\verb+.+ is no longer available then.
<<>>=
attach(pheno.c)
nullModel(y ~ X.1 + X.2)
@
Regardless of the type of model and of which interface has been used to call
\verb+nullModel()+, the function always creates an \R\ object of class
\verb+NullModel+ (the objects named \verb+model.c+ and \verb+model.b+ in the
examples above) that can be used in subsequent association tests.
<<>>=
detach(pheno.c)
@

Variance-component score tests based on logistic linear models may not
necessarily determine the null distribution of the test statistic correctly
\cite{LeeEmondBamshadBarnesRiederNickerson12,Lin97} and, therefore, they may
not control the type-I error rate correctly. Following a philosophy inspired
by the \SKAT\ package \cite{LeeEmondBamshadBarnesRiederNickerson12,%
WuLeeCaiLiBoehnkeLin11}, \PODKAT\ offers two means to counteract this issue:
\begin{description}
\item[Resampling:] under the null assumption that the trait only depends on
the covariates (if any) and not on the genotype, a certain number of model
residuals are sampled. Then, when association testing is performed,
$p$-values are computed also for all these sampled residuals, and an
additional estimated $p$-value is computed as the relative frequency of
$p$-values of sampled residuals that are at least as significant as the
test's $p$-value (a small sketch follows below). The number of sampled
residuals is controlled with the \verb+n.resampling+ argument (the default is
0) and the type of sampling procedure is controlled with the
\verb+type.resampling+ argument (see \verb+?nullModel+ for more details).
\item[Small sample correction and adjustment of higher moments:]
Lee~{\em et~al.}\label{page:sscor}
\cite{LeeEmondBamshadBarnesRiederNickerson12} proposed a correction of the
null distribution for small samples and a sampling method for adjusting
higher moments of the null distribution of the test statistic (see also
Subsections~\ref{ssec:teststat} and~\ref{ssec:sscor}). \PODKAT\ implements
both corrections (see Subsection~\ref{ssec:sscor} about implementation
details). The argument \verb+adj+ controls whether the null model is created
such that any of the two corrections can be used later. The default is that
the corrections are switched on for sample sizes up to 2,000, while
\verb+adj="force"+ always turns corrections on and \verb+adj="none"+ always
turns corrections off. The adjustment of higher moments requires sampled null
model residuals. The number of those is controlled with the
\verb+n.resampling.adj+ argument and the type of sampling procedure is again
controlled with the \verb+type.resampling+ argument (see \verb+?nullModel+
and Subsection~\ref{ssec:sscor} for more details).
\end{description}
For linear models, there is no need for any correction of the null
distribution (cf.\ Subsection~\ref{ssec:teststat}). Consequently, small
sample correction is not available for linear models. Resampling, however, is
available for linear models, too. Neither of the two methods is available for
association tests using a Bernoulli-distributed trait.

Some examples showing how to control resampling and small sample corrections
for logistic linear models:
<<>>=
nullModel(y ~ ., pheno.b, n.resampling=1000, adj="none")
nullModel(y ~ ., pheno.b, n.resampling.adj=2000)
@
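The resampling-based $p$-value estimate mentioned above can be illustrated
with a few lines of plain \R\ (a conceptual sketch with made-up numbers, not
a \PODKAT\ function):
<<>>=
p.observed <- 0.003        # p-value of the actual test (made up)
p.sampled <- runif(1000)   # p-values of 1000 sampled residuals (made up)
## estimated p-value: relative frequency of sampled p-values that are
## at least as significant as the observed one
mean(p.sampled <= p.observed)
@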
\section{Selection of Regions of Interest}\label{sec:regInterest}

Association tests with \PODKAT\ typically consider multiple regions of
interest along the samples' genome. The most common scenarios are
whole-genome association testing, whole-exome association testing, or
association tests for specific user-defined regions. In the following, we
will highlight the basic steps necessary for each of these three scenarios.

\subsection{Regions of Interest for Whole-Genome Association Testing}%
\label{ssec:regInterestWGS}

Suppose that the samples' genotypes have been determined by whole-genome
sequencing or any other technology that covers variants across the whole
genome. The first step in this case is to define the genome and where it has
been sequenced. \PODKAT\ comes with five ready-made \texttt{GRangesList}
objects (see Bioconductor package \texttt{GenomicRanges}) that define these
regions for autosomal chromosomes, sex chromosomes, and the mitochondrial DNA
of the human genome. Those objects are called \verb+hg18Unmasked+,
\verb+hg19Unmasked+, \verb+hg38Unmasked+, \verb+b36Unmasked+, and
\verb+b37Unmasked+. The first three are the standard hg18, hg19, and hg38
builds as shipped with the Bioconductor packages
\begin{itemize}
\item \verb+BSgenome.Hsapiens.UCSC.hg18.masked+,
\item \verb+BSgenome.Hsapiens.UCSC.hg19.masked+, and
\item \verb+BSgenome.Hsapiens.UCSC.hg38.masked+.
\end{itemize}
The latter two are basically the same regions as in \verb+hg18Unmasked+ and
\verb+hg19Unmasked+, but with chromosomes named as in the genomes b36 and b37
that are frequently used by the Genome Analysis Toolkit (GATK).%
\footnote{\url{https://www.broadinstitute.org/gatk/} (last visited
2021-04-30)}

The five objects are available upon \verb+data()+ calls as in the following
example:
<<>>=
data(hg38Unmasked)
hg38Unmasked
names(hg38Unmasked)
hg38Unmasked$chr1
seqinfo(hg38Unmasked)
@
All five objects are organized in the same way; they consist of 31
components: one for each of the 22 autosomal chromosomes, one for each of the
two sex chromosomes, one for the mitochondrial DNA, and two for each of the
three pseudoautosomal regions. This structure has been chosen to allow the
user to consider different chromosomes and pseudoautosomal regions
separately. Table~\ref{tab:unmaskedChrs} gives an overview of the list
components of each of those \verb+GRangesList+ objects, how their list
components are named, and how they relate to chromosomes in the human genome.
\begin{table}
\caption{Overview of how the \texttt{GRangesList} objects
\texttt{hg18Unmasked}, \texttt{hg19Unmasked}, \texttt{hg38Unmasked},
\texttt{b36Unmasked}, and \texttt{b37Unmasked} are organized: each row
corresponds to one chromosome/sequence of the human genome and lists the
names of those list components that contain regions from these
chromosomes/sequences.}
\label{tab:unmaskedChrs}
\begin{center}\small%
\begin{tabular}{|c|l|l|l|}
\hline
\bf Chromosome & \ttbf hg*Unmasked & \ttbf b36Unmasked & \ttbf b37Unmasked \\
\hline\hline
1 & \texttt{"chr1"} & \texttt{"1"} & \texttt{"1"} \\
\vdots & \qquad\vdots & \ \vdots & \ \vdots \\
22 & \texttt{"chr22"} & \texttt{"22"} & \texttt{"22"} \\
\hline
X & \texttt{"chrX"}, \texttt{"X.PAR1"}, & \texttt{"X"}, \texttt{"X.PAR1"}, &
\texttt{"X"}, \texttt{"X.PAR1"}, \\
 & \texttt{"X.PAR2"}, \texttt{"X.XTR"} & \texttt{"X.PAR2"}, \texttt{"X.XTR"} &
\texttt{"X.PAR2"}, \texttt{"X.XTR"} \\
\hline
Y & \texttt{"chrY"}, \texttt{"Y.PAR1"}, & \texttt{"Y"}, \texttt{"Y.PAR1"}, &
\texttt{"Y"}, \texttt{"Y.PAR1"}, \\
 & \texttt{"Y.PAR2"}, \texttt{"Y.XTR"} & \texttt{"Y.PAR2"}, \texttt{"Y.XTR"} &
\texttt{"Y.PAR2"}, \texttt{"Y.XTR"} \\
\hline
mtDNA & \texttt{"chrM"} & \texttt{"M"} & \texttt{"MT"} \\
\hline
\end{tabular}
\end{center}
\end{table}
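Before working with these objects, it may be instructive to check how many
unmasked bases each list component contains. This is standard
\verb+GenomicRanges+ functionality, shown here only as a small illustration:
<<>>=
## total number of unmasked bases per list component (first components only)
head(sum(width(hg38Unmasked)))
@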
A simpler structure can be created easily. As an example, the pseudoautosomal
regions can be re-united with the X and Y chromosomes as follows:
<<>>=
hg38basic <- hg38Unmasked[paste0("chr", 1:22)]
hg38basic$chrX <- reduce(unlist(hg38Unmasked[c("chrX", "X.PAR1",
                                               "X.PAR2", "X.XTR")]))
hg38basic$chrY <- reduce(unlist(hg38Unmasked[c("chrY", "Y.PAR1",
                                               "Y.PAR2", "Y.XTR")]))
hg38basic
names(hg38basic)
@
If the user prefers to have all unmasked regions in one single \verb+GRanges+
object, this can be done as follows:
<<>>=
hg38all <- reduce(unlist(hg38Unmasked))
hg38all
@
If association testing should be done for any other genome, the user must
specify unmasked regions as a \verb+GRanges+ or \verb+GRangesList+ object
first. This can be done manually, but it is more convenient to start from a
\verb+MaskedBSgenome+ object. Subsection~\ref{ssec:newGenome} provides more
details.

It makes little sense to perform association tests for whole chromosomes (or
unmasked regions thereof). The most common approach is to split these regions
into overlapping windows of (almost) equal lengths. In order to do this
conveniently, \PODKAT\ provides the function \verb+partitionRegions()+. A toy
example:
<<>>=
gr <- GRanges(seqnames="chr1", ranges=IRanges(start=1, end=140000))
partitionRegions(gr, width=10000, overlap=0.5)
partitionRegions(gr, width=15000, overlap=0.8)
partitionRegions(gr, width=10000, overlap=0)
@
Obviously, the \verb+width+ argument controls the width of the windows (the
default is 5,000) and the \verb+overlap+ argument controls the relative
overlap (the default is 0.5, which corresponds to 50\%\ overlap). The windows
are placed such that possible overhangs are balanced at the beginning and end
of the partitioned region.

The choice of the right window width is crucial. If the windows are too
narrow, causal regions may be split across multiple windows, which may impair
statistical power and require more aggressive multiple testing correction.
However, if the windows are too large, associations may be diluted by the
large number of variants considered by every single test. We recommend a
width between 5,000~bp and 50,000~bp along with 50\%\ overlap.

If called for a \verb+GRanges+ object, \verb+partitionRegions()+ returns a
\verb+GRanges+ object with partitioned regions. If called for a
\verb+GRangesList+ object, \verb+partitionRegions()+ returns a
\verb+GRangesList+ object, where each component of the output object
corresponds to the partitioning of one of the components of the input object.
<<>>=
partitionRegions(hg38Unmasked, width=20000)
@
The \verb+partitionRegions()+ function also allows for partitioning only a
subset of chromosomes. This can be done by specifying the \verb+chrs+
argument, e.g.\ \verb+chrs="chr22"+ only considers regions on chromosome 22
and omits all other regions. This works both for \verb+GRanges+ and
\verb+GRangesList+ objects. However, \verb+partitionRegions()+ works for any
\verb+GRangesList+ object and makes no prior assumption about which
chromosomes appear in each of the list components. Technically, this means
that all list components will be searched for regions that lie on the
specified chromosome(s). The \verb+GRangesList+ objects \verb+hg18Unmasked+,
\verb+hg19Unmasked+, \verb+hg38Unmasked+, \verb+b36Unmasked+, and
\verb+b37Unmasked+ included in the \PODKAT\ package, however, are organized
such that each list component only contains regions from one chromosome (see
Table~\ref{tab:unmaskedChrs}). Therefore, it is not necessary to search all
list components.
The following example does this more efficiently by restricting to
chromosomes 21 and 22 from the beginning:
<<>>=
partitionRegions(hg38Unmasked[c("chr21", "chr22")], width=20000)
@
The following call using the \verb+chrs+ argument would give exactly the same
result as the command above, but takes approximately 10 times as much time:
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{partitionRegions}\hlstd{(hg38Unmasked,} \hlkwc{chrs}\hlstd{=}\hlkwd{c}\hlstd{(}\hlstr{"chr21"}\hlstd{,} \hlstr{"chr22"}\hlstd{),} \hlkwc{width}\hlstd{=}\hlnum{20000}\hlstd{)}
\end{alltt}
\begin{verbatim}
##!!## GRangesList object of length 2:
##!!## $chr21
##!!## GRanges object with 3997 ranges and 0 metadata columns:
##!!##          seqnames               ranges strand
##!!##             <Rle>            <IRanges>  <Rle>
##!!##      [1]    chr21 [ 5010001,  5028123]      *
##!!##      [2]    chr21 [ 5018124,  5038123]      *
##!!##      [3]    chr21 [ 5028124,  5048123]      *
##!!##      [4]    chr21 [ 5038124,  5058123]      *
##!!##      [5]    chr21 [ 5048124,  5068123]      *
##!!##      ...      ...                  ...    ...
##!!##   [3993]    chr21 [46641223, 46661222]      *
##!!##   [3994]    chr21 [46651223, 46671222]      *
##!!##   [3995]    chr21 [46661223, 46681222]      *
##!!##   [3996]    chr21 [46671223, 46691222]      *
##!!##   [3997]    chr21 [46681223, 46699983]      *
##!!##
##!!## ...
##!!## <1 more element>
##!!## -------
##!!## seqinfo: 25 sequences (1 circular) from hg38 genome
\end{verbatim}
\end{kframe}
\end{knitrout}

\subsection{Regions of Interest for Whole-Exome Association Testing}%
\label{ssec:regInterestWXS}

Suppose that the samples' genotypes have been determined by whole-exome
sequencing. In this case, it makes little sense to use a partition of the
whole genome as regions of interest. Instead, the best way is to use exactly
those regions that have been targeted by the capturing technology. If these
regions are available as a BED file%
\footnote{\url{https://genome.ucsc.edu/FAQ/FAQformat.html\#format1} (last
visited 2021-04-30)}, this file can be read with the function
\verb+readRegionsFromBedFile()+. In the following example, we demonstrate
this for a BED file that specifies the regions targeted by the Illumina\RTM{}
TruSeq DNA Exome Kit. The regions are based on the hg19 human genome build.
In order to make this code example work, users must first download the file
from the Illumina\RTM{} website%
\footnote{\url{https://emea.support.illumina.com/downloads/truseq-exome-product-files.html}
(last visited: 2021-04-30)}:
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{readRegionsFromBedFile}\hlstd{(}\hlstr{"truseq-exome-targeted-regions-manifest-v1-2.bed"}\hlstd{)}
\end{alltt}
\begin{verbatim}
##!!## GRanges object with 214126 ranges and 0 metadata columns:
##!!##                        seqnames      ranges strand
##!!##                           <Rle>   <IRanges>  <Rle>
##!!##   CEX-chr1-12099-12258     chr1 12098-12258      *
##!!##   CEX-chr1-12554-12721     chr1 12553-12721      *
##!!##   CEX-chr1-13332-13701     chr1 13331-13701      *
##!!##   CEX-chr1-30335-30503     chr1 30334-30503      *
##!!##   CEX-chr1-35046-35544     chr1 35045-35544      *
##!!##                    ...      ...         ...    ...
##!!##   CEX-chrY-59355682-59355884     chrY 59355681-59355884      *
##!!##   CEX-chrY-59355972-59356131     chrY 59355971-59356131      *
##!!##   CEX-chrY-59356790-59356943     chrY 59356789-59356943      *
##!!##   CEX-chrY-59357687-59357786     chrY 59357686-59357786      *
##!!##   CEX-chrY-59357911-59358045     chrY 59357910-59358045      *
##!!## -------
##!!## seqinfo: 25 sequences from an unspecified genome; no seqlengths
\end{verbatim}
\end{kframe}
\end{knitrout}
Since a BED file does not contain any genomic annotation,
\verb+readRegionsFromBedFile()+ is not able to set chromosome names and
chromosome lengths properly. In order to overcome this limitation,
\verb+readRegionsFromBedFile()+ allows for passing a \verb+Seqinfo+ object
via the \verb+seqInfo+ argument. Then the metadata of the returned object are
properly set to those passed as \verb+seqInfo+ argument:
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{data}\hlstd{(hg19Unmasked)}
\hlstd{reg} \hlkwb{<-} \hlkwd{readRegionsFromBedFile}\hlstd{(}\hlstr{"truseq-exome-targeted-regions-manifest-v1-2.bed"}\hlstd{,}
                               \hlkwc{seqInfo}\hlstd{=}\hlkwd{seqinfo}\hlstd{(hg19Unmasked))}
\hlkwd{seqinfo}\hlstd{(reg)}
\end{alltt}
\begin{verbatim}
##!!## Seqinfo object with 25 sequences from hg19 genome:
##!!##   seqnames seqlengths isCircular genome
##!!##   chr1      249250621       <NA>   hg19
##!!##   chr2      243199373       <NA>   hg19
##!!##   chr3      198022430       <NA>   hg19
##!!##   chr4      191154276       <NA>   hg19
##!!##   chr5      180915260       <NA>   hg19
##!!##   ...             ...        ...    ...
##!!##   chr21      48129895       <NA>   hg19
##!!##   chr22      51304566       <NA>   hg19
##!!##   chrX      155270560       <NA>   hg19
##!!##   chrY       59373566       <NA>   hg19
##!!##   chrM          16571       <NA>   hg19
\end{verbatim}
\end{kframe}
\end{knitrout}
Locations of transcripts can be used as regions of interest, too:
<<>>=
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
hg38tr <- transcripts(TxDb.Hsapiens.UCSC.hg38.knownGene, columns="tx_name")
hg38tr
@
The \verb+GRanges+ object returned by \verb+transcripts()+ in the above
example code includes strand information. This does no harm to a subsequent
association test, since all \verb+assocTest()+ methods ignore all information
except the regions' chromosomes and start/end positions.

In any case, the user is rather advised to use exactly those regions that
were targeted by the biotechnology that was applied. If this information is
not available, we recommend using transcripts, perhaps extended by promoter
regions and untranslated regions. Alternatively, to narrow down the
association analysis by removing introns, it is also possible to use exons
only. This can simply be done by replacing the above call to
\verb+transcripts()+ by \verb+exons()+ (as sketched below).

If regions located on sex chromosomes or in pseudo-autosomal regions should
be treated differently, the best option is to split the regions object first
such that the different regions are grouped together. For convenience,
\PODKAT\ provides a \verb+split()+ method that allows for splitting a
\verb+GRanges+ object along grouped regions contained in a \verb+GRangesList+
object. The following example splits up transcripts (for the sake of shorter
computation times, we restrict to the X chromosome and pseudo-autosomal
regions located on the X chromosome):
<<>>=
strand(hg38tr) <- "*"
split(hg38tr, hg38Unmasked[c("chrX", "X.PAR1", "X.PAR2", "X.XTR")])
@
The \verb+split()+ function is strand-specific, which is why we have to
discard the strand information in \verb+hg38tr+ first (whereas
\verb+hg38Unmasked+ does not contain any strand information anyway).
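As mentioned above, the exon-based variant only requires replacing
\verb+transcripts()+ by the standard \verb+exons()+ accessor from the
\verb+GenomicFeatures+ infrastructure; a minimal sketch:
<<>>=
## exon regions instead of transcript regions (for illustration)
hg38ex <- exons(TxDb.Hsapiens.UCSC.hg38.knownGene)
hg38ex
@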
The lengths of exons, transcripts, and captured target sequences vary quite a
lot. In order to avoid that the results of an association test are biased by
the lengths of the regions of interest, we suggest partitioning longer exons
or transcripts as well. This can be done by simply calling
\verb+partitionRegions()+ for the \verb+GRanges+ objects containing exons or
transcripts (the call to \verb+reduce()+ removes duplicates and unifies
partial overlaps):
<<>>=
partitionRegions(reduce(hg38tr))
@

\subsection{Defining Custom Regions of Interest}%
\label{ssec:regInterestCustom}

If a user is not interested in a genome-wide analysis, he/she might want to
restrict to a particular genomic region, for example, a particular gene or
set of genes that are likely to be relevant for his/her study. In such a
case, there are multiple options to define regions of interest. The simplest,
but also most tedious, approach is to enter the regions manually. As a simple
example, suppose we are interested in the two human hemoglobin alpha genes
HBA1 and HBA2. If we search for these genes in the UCSC Genome
Browser\footnote{\url{http://genome.ucsc.edu/} (last visited 2021-04-30)}
\cite{KentSugnetFureyRoskinPringleZahlerHaussler02}, we see that the genes
are in the following regions (according to the hg38 human genome build):
<<>>=
hbaGenes <- GRanges(seqnames="chr16",
                    ranges=IRanges(start=c(176680, 172847),
                                   end=c(177521, 173710)))
names(hbaGenes) <- c("HBA1", "HBA2")
hbaGenes
@
Another variant is to use the \verb+TxDb.Hsapiens.UCSC.hg38.knownGene+
package to access the genes' locations. In order to do that, we have to take
into account that the corresponding transcripts have the IDs
\verb+ENST00000320868.9+ and \verb+ENST00000251595.11+ (which enable us to
select the right regions by searching for these IDs in the \verb+tx_name+
metadata column):
<<>>=
hbaGenes <- hg38tr[which(mcols(hg38tr)$tx_name %in%
                         c("ENST00000320868.9", "ENST00000251595.11"))]
names(hbaGenes) <- c("HBA1", "HBA2")
hbaGenes
@
Probably the most general, most automated, and most elegant way to determine
the genes' locations is via direct access to a biological database. The
Bioconductor package \verb+biomaRt+ facilitates such interfaces. However,
this interface returns its results as data frames. So, we have to convert the
data to a \verb+GRanges+ object ourselves.
The hemoglobin alpha example again, this time using \verb+biomaRt+:
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{library}\hlstd{(biomaRt)}
\hlstd{ensem} \hlkwb{<-} \hlkwd{useMart}\hlstd{(}\hlstr{"ensembl"}\hlstd{)}
\hlstd{hsEnsem} \hlkwb{<-} \hlkwd{useDataset}\hlstd{(}\hlstr{"hsapiens_gene_ensembl"}\hlstd{,} \hlkwc{mart}\hlstd{=ensem)}
\hlstd{res} \hlkwb{<-} \hlkwd{getBM}\hlstd{(}\hlkwc{attributes}\hlstd{=}\hlkwd{c}\hlstd{(}\hlstr{"hgnc_symbol"}\hlstd{,} \hlstr{"chromosome_name"}\hlstd{,}
                        \hlstr{"start_position"}\hlstd{,} \hlstr{"end_position"}\hlstd{,} \hlstr{"ucsc"}\hlstd{),}
             \hlkwc{filters}\hlstd{=}\hlstr{"hgnc_symbol"}\hlstd{,} \hlkwc{values}\hlstd{=}\hlkwd{c}\hlstd{(}\hlstr{"HBA1"}\hlstd{,} \hlstr{"HBA2"}\hlstd{),}
             \hlkwc{mart}\hlstd{=hsEnsem)}
\hlstd{res}
\end{alltt}
\begin{verbatim}
##!!##   hgnc_symbol chromosome_name start_position end_position
##!!## 1        HBA1              16         176680       177522
##!!## 2        HBA1              16         176680       177522
##!!## 3        HBA1              16         176680       177522
##!!## 4        HBA1              16         176680       177522
##!!## 5        HBA2              16         172847       173710
##!!## 6        HBA2              16         172847       173710
##!!## 7        HBA2              16         172847       173710
##!!## 8        HBA2              16         172847       173710
##!!##         ucsc
##!!## 1 uc002cfx.2
##!!## 2 uc059ohb.1
##!!## 3 uc059ohc.1
##!!## 4 uc059ohd.1
##!!## 5 uc002cfv.4
##!!## 6 uc059ogy.1
##!!## 7 uc059ogz.1
##!!## 8 uc059oha.1
\end{verbatim}
\begin{alltt}
\hlstd{hbaGenes} \hlkwb{<-} \hlkwd{GRanges}\hlstd{(}\hlkwc{seqnames}\hlstd{=}\hlkwd{paste0}\hlstd{(}\hlstr{"chr"}\hlstd{, res}\hlopt{$}\hlstd{chromosome_name),}
                    \hlkwc{ranges}\hlstd{=}\hlkwd{IRanges}\hlstd{(}\hlkwc{start}\hlstd{=res}\hlopt{$}\hlstd{start_position,}
                                   \hlkwc{end}\hlstd{=res}\hlopt{$}\hlstd{end_position))}
\hlkwd{names}\hlstd{(hbaGenes)} \hlkwb{<-} \hlstd{res}\hlopt{$}\hlstd{hgnc_symbol}
\hlstd{hbaGenes}
\end{alltt}
\begin{verbatim}
##!!## GRanges object with 8 ranges and 0 metadata columns:
##!!##        seqnames           ranges strand
##!!##           <Rle>        <IRanges>  <Rle>
##!!##   HBA1    chr16 [176680, 177522]      *
##!!##   HBA1    chr16 [176680, 177522]      *
##!!##   HBA1    chr16 [176680, 177522]      *
##!!##   HBA1    chr16 [176680, 177522]      *
##!!##   HBA2    chr16 [172847, 173710]      *
##!!##   HBA2    chr16 [172847, 173710]      *
##!!##   HBA2    chr16 [172847, 173710]      *
##!!##   HBA2    chr16 [172847, 173710]      *
##!!##   -------
##!!##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
\end{verbatim}
\end{kframe}
\end{knitrout}
As already mentioned at the end of Subsection~\ref{ssec:regInterestWXS}, the
lengths of exons and transcripts vary quite a lot. So it is recommended to
partition even custom-defined regions of interest using
\verb+partitionRegions()+ if the regions' lengths differ strongly and/or if
the regions are longer than the window size one would usually employ for
whole-genome studies.

\section{Performing an Association Test}\label{sec:assocTest}

We have already seen in Section~\ref{sec:impatient} that the function for
performing the actual association tests is \verb+assocTest()+. We have also
seen that there are basically two ways to use this function. The simpler one
is to load the genotype data into a matrix-like object first and to perform
an association test on this matrix-like object. To first load the genotype,
the following command can be used:
<<>>=
geno <- readGenotypeMatrix(vcfFile)
geno
@
The object returned by \verb+readGenotypeMatrix()+ is of class
\verb+GenotypeMatrix+.
This class is defined by \PODKAT\ and essentially consists of the genotypes
in column-oriented sparse matrix format (with rows corresponding to samples
and columns corresponding to variants) along with information about the
genomic positions of the variants. The \verb+readGenotypeMatrix()+ function
has some additional arguments for controlling how variants are pre-processed
and filtered (see \verb+?readGenotypeMatrix+ and
Subsection~\ref{ssec:readVCF} for more details).

Information about the genomic positions of the variants can be obtained as
follows:
<<>>=
variantInfo(geno)
@
Obviously, this is a \verb+GRanges+ object that also contains a metadata
column with minor allele frequencies (MAFs). For convenience, there is a
separate accessor function \verb+MAF()+ for retrieving these MAFs:
<<>>=
summary(MAF(geno))
@
Moreover, there is a custom variant of the \verb+summary()+ method:
<<>>=
summary(variantInfo(geno), details=TRUE)
@
The simplest approach is to perform a test for the association between the
null model and the entire genotype:
<<>>=
assocTest(geno, model.c)
assocTest(geno, model.b)
@
As already mentioned in Section~\ref{sec:regInterest}, it is not necessarily
the best idea to consider associations between the null model and all
variants of the whole genome at once. The more variants are considered at
once, the smaller the proportion of potentially associated variants becomes.
Thus, it may become harder for the test to disentangle random effects from
true associations. Therefore, the better and more common approach is to split
the genome into a set of (overlapping) windows/regions of interest as
described in Section~\ref{sec:regInterest} --- hoping that potentially causal
variants will accumulate in certain regions and, thereby, lead to highly
significant $p$-values for these regions. If we already have a certain set of
regions of interest, we can simply pass them to \verb+assocTest()+ as third
argument:
<<>>=
res.c <- assocTest(geno, model.c, windows)
print(res.c)
@
Obviously, \verb+assocTest()+ performs the association test for each region
independently and computes a $p$-value for each region.

The \verb+assocTest()+ method as used above has a few arguments that
influence the way the tests are performed. Most importantly, the
\verb+kernel+ argument allows for choosing among six different kernels.
Details about these kernels are available in Subsection~\ref{ssec:kernels}.
We suggest the default setting \verb+"linear.podkat"+, i.e.\ the
position-dependent linear kernel, \PODKAT's most important contribution and
achievement. For comparison purposes, \PODKAT\ also provides an up-to-date
implementation of the SKAT test \cite{WuLeeCaiLiBoehnkeLin11} which can be
chosen by setting \verb+kernel="linear.SKAT"+ (see the example below). The
four other kernels have not turned out to be advantageous in simulations;
moreover, they require much longer computation times. Anyway, they are
available and ready to be used. Another important decision is the choice of a
weighting scheme, i.e.\ whether and how to choose weights for variants
depending on their minor allele frequency (MAF). Details are provided in
Subsection~\ref{ssec:weightFunc}.
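As an illustration of the \verb+kernel+ argument described above, the test
can be re-run with the SKAT kernel simply by changing this one argument
(shown for comparison purposes only):
<<>>=
## same data and null model as above, but with the SKAT linear kernel
res.c.skat <- assocTest(geno, model.c, windows, kernel="linear.SKAT")
print(res.c.skat)
@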
If the genotype matrix is too large to fit into the computer's main memory or
if parallelization is desired, the alternative, as already mentioned, is to
call \verb+assocTest()+ for a VCF file name. Then \verb+assocTest()+ splits
the regions of interest into batches and loads only one batch of variants
from the VCF file at a time (see Subsection~\ref{sssec:chunking}). If called
for a VCF file name (or \verb+TabixFile+ object), the same arguments for
controlling how variants are pre-processed and filtered are available as for
the \verb+readGenotypeMatrix()+ function (see \verb+?assocTest+,
\verb+?readGenotypeMatrix+, and Subsection~\ref{ssec:readVCF} of this manual
for more details). This interface also allows for carrying out these
computations on multiple processor cores and/or on a computing cluster (see
Subsection~\ref{sssec:parallel}). We first provide a simple example here
without any parallelization:
<<>>=
res.c <- assocTest(vcfFile, model.c, windows)
print(res.c)
@
The above steps can be carried out for binary traits as well. For brevity, we
only show the variant with the genotype matrix here:
<<>>=
res.b <- assocTest(geno, model.b, windows)
print(res.b)
@
So, it seems as if the computations were completely analogous to continuous
traits. However, there is an important aspect that indeed makes a difference:
small sample adjustment. As the reader may have noticed above, the null model
\verb+model.b+ includes 10,000 resampled residuals for correction of higher
moments of the null distribution. This correction has actually been performed
by the above association test. Let us briefly run the association test
without any correction:
<<>>=
res.b.noAdj <- assocTest(geno, model.b, windows, adj="none")
print(res.b.noAdj)
@
Obviously, the $p$-values seem to have become more significant. Actually,
this is not the case; it is only the result of an overly crude approximation
of the null distribution for smaller sample sizes. So, in a case like this
one, small sample adjustment and correction of higher moments are essential.

\section{Analyzing and Visualizing Results}\label{sec:analysis}

\subsection{Multiple Testing Correction}\label{ssec:multtest}

As soon as multiple tests are performed in parallel, the tests' raw
$p$-values do not allow for a correct assessment of the overall type I error
rate anymore and multiple testing correction must be employed. \PODKAT\
provides a simple method for multiple testing correction that is a wrapper
around the \R\ standard function \verb+p.adjust()+ from the \verb+stats+
package. If called for an object of class \verb+AssocTestResultRanges+ (the
class of objects the \verb+assocTest()+ function creates when called for
multiple regions), \verb+p.adjust()+ adds a metadata column named
\verb+p.value.adj+ with adjusted $p$-values:
<<>>=
print(p.adjust(res.c))
@
For consistency with the standard \verb+p.adjust()+ function, the default
correction procedure is \verb+"holm"+, which corresponds to the
Holm-Bonferroni method for controlling the familywise error rate (FWER)
\cite{Holm79}. If a different method is desired, e.g.\ the popular
Benjamini-Hochberg false discovery rate (FDR) correction
\cite{BenjaminiHochberg95}, the \verb+method+ argument must be used to select
the desired method:
<<>>=
res.c.adj <- p.adjust(res.c, method="BH")
print(res.c.adj)
res.b.adj <- p.adjust(res.b, method="BH")
print(res.b.adj)
@
The user has to make sure to choose a correction method whose assumptions are
fulfilled by the given setup. Note that the tests performed by \PODKAT\ are
usually not independent of each other, at least not if overlapping windows
and/or windows close to each other are tested.

\subsection{Visualization}\label{ssec:plots}

\PODKAT\ also offers functions for visualizing the results of association
tests. Suppose we have carried out an association test involving multiple
regions, e.g.\ a whole-genome or whole-exome association test.
In such cases, \verb+assocTest()+ returns an \verb+AssocTestResultRanges+
object that contains a metadata column with $p$-values and, if multiple
testing correction has been employed (see Subsection~\ref{ssec:multtest}
above), another metadata column with adjusted $p$-values. If the \R\ standard
generic function \verb+plot()+ is called on such an object, a so-called {\em
Manhattan plot} is produced, that is, log-transformed $p$-values are plotted
along the genome:
\begin{center}
<<>>=
plot(res.c.adj)
@
\end{center}
Obviously, the function plots raw (uncorrected) $p$-values, where the ones
passing the significance threshold are plotted in red and the insignificant
ones are plotted in gray. In order to correctly account for multiple testing,
we would either have to choose a stricter threshold or use adjusted
$p$-values instead. The latter can be accomplished by choosing a different
$p$-value column for plotting:
\begin{center}
<<>>=
plot(res.c.adj, which="p.value.adj")
@
\end{center}
If the genome consists of multiple chromosomes, the default is to plot the
insignificant $p$-values in alternating colors (gray/light gray by default).
If very many regions have been tested, in particular in whole-genome studies,
it is advisable to use semi-transparent colors (using the alpha channel of
functions like \verb+gray()+, \verb+rgb()+, or \verb+hsv()+) to get an
impression of the density of $p$-values. Although this simple example does
not require this technique, we show an example to demonstrate how it works:
\begin{center}
<<>>=
plot(res.c.adj, which="p.value.adj", col=gray(0.5, alpha=0.25))
@
\end{center}
\PODKAT\ further provides a function for making quantile-quantile (Q-Q)
plots. If called for a single \verb+AssocTestResultRanges+ object, the
function \verb+qqplot()+ plots log-transformed $p$-values against a uniform
distribution of $p$-values (which one would expect under the null
hypothesis):
\begin{center}
\setkeys{Gin}{width=0.6\textwidth}
<<>>=
qqplot(res.c)
@
\end{center}
The same function can also be used to compare two association test results in
terms of their distributions of $p$-values, e.g.\ to compare two different
kernels or to compare results with and without small-sample correction:
\begin{center}
<<>>=
qqplot(res.b, res.b.noAdj)
@
\end{center}
For the above example, we see that the $p$-values without small-sample
correction are supposedly more significant. The reason is that, in this
example, the $p$-values without correction are actually systematically
inflated:
\begin{center}
<<>>=
qqplot(res.b.noAdj)
@
\end{center}

\subsection{Filtering Significant Regions}\label{ssec:filter}

\PODKAT\ offers a simple method for stripping off all insignificant results
from an association test result. The method is called \verb+filterResult()+
and can be applied to association test results given as objects of class
\verb+AssocTestResultRanges+. The user can choose the significance threshold
and which $p$-value column the filter should be applied to. The result is a
subset of the input object consisting of those regions whose $p$-values
passed the threshold.
<<>>=
res.c.f <- filterResult(res.c, cutoff=1.e-6)
print(res.c.f, cutoff=1.e-6)
res.c.adj.f <- filterResult(res.c.adj, filterBy="p.value.adj")
print(res.c.adj.f)
@

\subsection{Contributions of Individual Variants}\label{ssec:contrib}

The association tests provided by \PODKAT\ do not test single-locus variants,
but consider multiple variants located in the same genomic window
simultaneously, i.e.\ multiple variants are ``collapsed'' into a single score
and tested together. As a consequence, \PODKAT\ provides association test
results per window, but does not allow for pinpointing which individual
variants may have made a major contribution to the test's outcome. For linear
kernels (choices \verb+kernel="linear.podkat"+ and
\verb+kernel="linear.SKAT"+), \PODKAT\ offers a method named \verb+weights()+
that allows for computing the contributions that individual variants made to
the outcome of a test. It should be clear that this makes little sense for
non-significant windows. Hence, this method should be applied to a filtered
result.
<<>>=
w.res.c.adj <- weights(res.c.adj.f, geno, model.c)
w.res.c.adj
@
Obviously, \verb+weights()+ returns a \verb+GRangesList+ object with as many
components as the first argument has regions. Each of the list components is
a \verb+GRanges+ object with two metadata columns \verb+weight.raw+ and
\verb+weight.contribution+. The former corresponds to the raw contributions
of the variants. For each variant, the corresponding entry in the column
\verb+weight.raw+ is positive if the variant is positively associated with
the residual/trait. It is negative if the variant is negatively associated
with the residual/trait. The value is around zero if there is (almost) no
association. The absolute value gives an indication about the magnitude of
contribution to the test's statistic. The metadata column
\verb+weight.contribution+ corresponds to the relative contribution of each
variant. It consists of nothing other than the squares of the
\verb+weight.raw+ column, normalized such that they sum up to 1 (see the
short consistency check below). As an example, a value of 0.2 means that this
variant contributed 20\%\ to the test statistic of this region.
Subsection~\ref{ssec:contrib} provides more details about how the
contributions are computed.

As an example, we plot a histogram of relative contributions of variants of
the first region:
\begin{center}
<<>>=
hist(mcols(w.res.c.adj[[1]])$weight.contribution, col="lightblue",
     border="lightblue", xlab="weight.contribution", main="")
@
\end{center}
The histogram shows a bimodal distribution which indicates that the major
contributions are made by only a few variants, whereas all others only make a
minor contribution.

The \PODKAT\ package provides a method for plotting numerical metadata
columns of \verb+GRanges+ objects. This method can be used to plot the
contributions of individual variants in a window:
\begin{center}
<<>>=
plot(w.res.c.adj[[1]], "weight.contribution")
@
\end{center}
In the above plot, each variant is visualized as an equally large
bar/interval along the horizontal axis. Alternatively, the function can also
plot contributions (or any other numerical metadata column) along the genome.
In the following plot, the type \verb+"b"+ has been chosen in order to
indicate the positions of variants:
\begin{center}
<<>>=
plot(w.res.c.adj[[1]], "weight.raw", alongGenome=TRUE, type="b")
@
\end{center}
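The relationship between the two metadata columns described above can be
verified directly; the following small consistency check (plain \R, not a
\PODKAT\ function) recomputes the relative contributions of the first region
from the raw weights:
<<>>=
## weight.contribution should equal the normalized squares of weight.raw
w <- mcols(w.res.c.adj[[1]])
all.equal(w$weight.contribution, w$weight.raw^2 / sum(w$weight.raw^2))
@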
In order to allow the user to easily find the most indicative variants, the
\verb+filterResult()+ method can be applied to weights, too. If called for
\verb+GRanges+ or \verb+GRangesList+ objects, the function
\verb+filterResult()+ checks whether a metadata column
\verb+weight.contribution+ is available and strips off all variants with a
relative contribution lower than the given threshold \verb+cutoff+ (the
default is 0.1):
<<>>=
filterResult(w.res.c.adj, cutoff=0.07)
@
That the same variants stand out twice in the above result is neither an
error nor a coincidence: the most indicative variants of both windows are
located in their overlap.

As a further analysis tool, \PODKAT\ allows for plotting the genotype in a
heatmap-like fashion, either as is or in relation to traits/phenotypes. In
the following example, we read the region that has been identified as most
significant from the VCF file and display it:
\begin{center}
<<>>=
res.c.adj.sorted <- sort(res.c.adj, sortBy="p.value.adj")
Zi <- readGenotypeMatrix(vcfFile, regions=res.c.adj.sorted[1])
plot(Zi, labRow=NA)
@
\end{center}
The plot is more expressive if the genotypes are plotted along with the
trait/phenotype:
\begin{center}
<<>>=
plot(Zi, y=pheno.c$y, labRow=NA)
@
\end{center}
In this plot, the samples (rows) are sorted according to the phenotype value.
This makes it easier to spot variants whose minor alleles (gray or black)
accumulate in the upper or lower part of the plot. For instance, it is
clearly visible that the minor alleles of variant \verb+snv:239+, the one
that had the highest contribution, accumulate in the upper part of the plot.
Note, however, that the genotypes have not been tested for associations with
the trait directly, but for associations with the trait after correction for
covariates (i.e.\ with the null model residuals):
\begin{center}
<<>>=
plot(Zi, y=residuals(model.c), labRow=NA)
@
\end{center}
Now the accumulation of minor alleles of variant \verb+snv:239+ in the upper
part of the plot becomes even more prominent.

Analogous functionality is also available for binary traits. In such a case,
however, samples are sorted according to class (trait 0 or 1) and
color-coded:
\begin{center}
<<>>=
res.b.adj.sorted <- sort(res.b.adj, sortBy="p.value.adj")
Zi <- readGenotypeMatrix(vcfFile, regions=res.b.adj.sorted[1])
plot(Zi, y=factor(pheno.b$y), labRow=NA)
@
\end{center}
Note that, if the binary trait is numerical, it must be passed as a factor in
order to ensure that \verb+plot()+ correctly recognizes it as a categorical
entity.

\section{Miscellanea}\label{sec:misc}

\subsection{Creating Suitable VCF Files}\label{ssec:prepVCF}

The main file format for storing variant calls, no matter whether they have
been determined by microarrays or next-generation sequencing, is the {\em
Variant Call Format (VCF)}.%
\footnote{\url{http://www.1000genomes.org/wiki/analysis/variant-call-format/vcf-variant-call-format-version-42}
(last visited 2021-04-30)} Not surprisingly, this is the main file format
that is used and supported by \PODKAT. However, in order to make VCF files
suitable as input files for \PODKAT, some preparations may be necessary. Most
importantly, {\em \PODKAT\ expects all samples of a study to be included in
one single VCF file.} In other words, if samples are spread over multiple VCF
files, these files must be {\em merged} before \PODKAT\ can be used.

\subsubsection{Software tools}

All steps described below require that \verb+tabix+ and \verb+bgzip+ are
available on the computer system on which the VCF preprocessing steps are to
be performed.
If they are not yet available, the latest version of \verb+tabix+
\cite{LiHandsakerWysokerFennellRuanHomerMarth09} must be downloaded,%
\footnote{\url{http://sourceforge.net/projects/samtools/files/tabix/} (last
visited 2021-04-30)} compiled, and installed. This package also includes
\verb+bgzip+, so no extra effort is necessary to install \verb+bgzip+. Make
sure that the executables \verb+tabix+ and \verb+bgzip+ are in the default
search path.

For merging and concatenating VCF files, either install the Perl-based
VCFtools\footnote{\url{https://vcftools.github.io/index.html} (last visited
2021-04-30)} \cite{DanecekAutonAbecasisAlbersBanksDePristoHandsaker11} or the
faster bcftools/htslib VCF commands.%
\footnote{\url{https://vcftools.github.io/htslib.html} (last visited
2021-04-30)} Make sure that the necessary executables are in the default
search path.

All examples below are supposed to run on Unix-like systems (including Linux
and Mac OS X). The examples below that are prefixed with \verb+$+ are
supposed to be run in a Unix/Linux terminal, not in an \R\ session. If the
user wants to run them on a MS Windows system, some modifications may be
necessary.

\subsubsection{Merging VCF files}

Suppose that the samples of a study are distributed across multiple
gzip-compressed VCF files, e.g.\ \verb+s1.vcf.gz+ -- \verb+s100.vcf.gz+.
Further suppose that a Tabix index file is available for each of these files.
If this is not the case, \verb+tabix+ must be run on the VCF files to create
this index file. As an example,
\begin{quote}
\begin{verbatim}
$ tabix -p vcf s1.vcf.gz
\end{verbatim}
\end{quote}
will create an index file \verb+s1.vcf.gz.tbi+.

As soon as an index file is available for each VCF file, the files are ready
to be merged. Using the Perl-based VCFtools, this can be done as follows:
\begin{quote}
\begin{verbatim}
$ vcf-merge -c both s*.vcf.gz | bgzip -c > merged.vcf.gz
$ tabix -p vcf merged.vcf.gz
\end{verbatim}
\end{quote}
The first command above merges all files \verb+s1.vcf.gz+ --
\verb+s100.vcf.gz+ into a newly created VCF file \verb+merged.vcf.gz+. The
second command again creates a Tabix index file of the merged VCF file. The
option \verb+-c both+ ensures that single-nucleotide variants (SNVs)%
\footnote{which of course includes single-nucleotide polymorphisms (SNPs)}
and indels occurring at overlapping locations are kept separate and not
merged.

If the bcftools/htslib VCF commands are used, merging can be done in a very
similar way:
\begin{quote}
\begin{verbatim}
$ bcftools merge -m both s*.vcf.gz | bgzip -c > merged.vcf.gz
$ tabix -p vcf merged.vcf.gz
\end{verbatim}
\end{quote}
Note that \verb+-c both+ must be replaced by \verb+-m both+ in this variant;
the meaning, however, is the same.

{\em The above commands are to be used in cases where the sample sets of the
individual VCF files are disjoint/non-overlapping. Do not use them in case
that the sample sets of the VCF files overlap, as this would lead to
duplication of samples in the merged VCF file!}

\subsubsection{Concatenating VCF files}

Suppose that the study data are split over multiple files each of which
contains the same samples, but each of which contains different variants,
e.g.\ each file contains variants of one chromosome.
Then \PODKAT\ can be used in two ways: (1) by running association tests for
each VCF file/chromosome independently and merging the results afterwards or
(2) by {\em concatenating the VCF files into one single VCF file.} The latter
can be accomplished as follows (for an example in which we have files
\verb+chr1.vcf.gz+ -- \verb+chr22.vcf.gz+ and \verb+chrX.vcf.gz+, all of
which have been indexed previously):
\begin{quote}
\begin{verbatim}
$ vcf-concat chr*.vcf.gz | bgzip -c > concat.vcf.gz
$ tabix -p vcf concat.vcf.gz
\end{verbatim}
\end{quote}
Suppose the same scenario as above, but with an additional file
\verb+chrY.vcf.gz+ which contains only the male samples of the study. This
case can be handled with the \verb+-p+ option, which concatenates files even
if the sample sets do not agree. In such a case, missing values are imputed
on the Y chromosome for female samples:
\begin{quote}
\begin{verbatim}
$ vcf-concat -p chr*.vcf.gz | bgzip -c > concat.vcf.gz
$ tabix -p vcf concat.vcf.gz
\end{verbatim}
\end{quote}

\subsubsection{Filtering VCF files}

If special filtering steps are to be performed prior to running an
association test with \PODKAT, the commands \verb+vcf-annotate+ or
\verb+bcftools annotate+ can be used. See the tools' documentation for more
information \cite{DanecekAutonAbecasisAlbersBanksDePristoHandsaker11}.

\subsection{Reading from VCF Files}\label{ssec:readVCF}

The \PODKAT\ package provides a lightweight, fast, and memory-efficient
method for reading genotypes from VCF files. Like the \verb+readVcf()+
function from the \verb+VariantAnnotation+ package
\cite{ObenchainLawrenceCareyGogartenShannonMorgan14}, it uses the
\verb+tabix+ API provided by the \verb+Rsamtools+ package
\cite{LiHandsakerWysokerFennellRuanHomerMarth09,MorganPagesObenchainHayen2015}.
In contrast to \verb+readVcf()+, however, it concentrates on the absolutely
necessary minimum and returns the result as a sparse matrix, which greatly
reduces memory usage, in particular if many variants have a low minor allele
frequency. The name of the function that has already been used above is
\verb+readGenotypeMatrix()+. As its first argument, it expects a
\verb+TabixFile+ object or simply the file name of a VCF file. The function
allows for reading the entire VCF file at once or for restricting reading to
certain genomic regions. The latter can be accomplished by passing a
\verb+GRanges+ object with the regions of interest to
\verb+readGenotypeMatrix()+ as the \verb+regions+ argument. All of this
functionality has been used above already (cf.~Sections~\ref{sec:impatient}
and \ref{sec:assocTest}). The function \verb+readGenotypeMatrix()+ can be
used as an alternative to the \verb+readVcf()+ function from the
\verb+VariantAnnotation+ package
\cite{ObenchainLawrenceCareyGogartenShannonMorgan14}. However, the following
restrictions have to be taken into account:
\begin{itemize}
\item The returned object does not provide the exact genotypes. Instead,
only integer values are returned in sparse matrix format. A 1 corresponds to
one minor allele, whereas higher numbers correspond to a higher number of
minor alleles. Phasing information is not taken into account and different
minor alleles are not distinguished either. All values not present in the
sparse matrix object are considered to be major alleles only.
\item The \verb+readGenotypeMatrix()+ function does not allow for returning
missing values.
The \verb+na.action+ option allows for three ways of treating missing values
in the VCF file: if \verb+"impute.major"+ (the default), all missing values
are imputed with major alleles; if \verb+"omit"+, all variants with missing
values are ignored and omitted from the output object; if \verb+"fail"+, the
function stops with an error when it encounters the first missing value.
\end{itemize}
The function further allows for omitting indels, for omitting variants that
do not have a \verb+PASS+ in the \verb+FILTER+ column of the VCF file, for
omitting variants exceeding a certain ratio of missing values, for omitting
variants with an MAF above a certain threshold, and for swapping minor and
major alleles if a variant has an MAF greater than 50\%. For details, see the
help page of \verb+readGenotypeMatrix()+.

The \PODKAT\ package further provides a method for reading basic information,
such as alleles, types of mutations, and minor allele frequencies (MAFs),
from a VCF file without actually reading the genotypes:
<>=
vInfo <- readVariantInfo(vcfFile, omitZeroMAF=FALSE, refAlt=TRUE)
vInfo
summary(vInfo)
@
The object returned by \verb+readVariantInfo()+ is of class
\verb+VariantInfo+, which is essentially a \verb+GRanges+ object with the
information about the variants stored in its metadata columns. By default,
\verb+readVariantInfo()+ does not return reference and alternate alleles.
This must be enforced with \verb+refAlt=TRUE+. In any case, even if reference
and alternate alleles are not returned, the function returns information
about the type of mutation (transition, transversion, multiple alternate
alleles, indel, or unknown/other).

\subsection{Using Genotypes from Other Data Sources}\label{ssec:otherSource}

The Variant Call Format%
\footnote{\url{http://www.1000genomes.org/wiki/analysis/variant-call-format/vcf-variant-call-format-version-42} (last visited 2021-04-30)}
(VCF) is the primary file format supported by the \PODKAT\ package. If a user
has genotype data in another format, there are basically two ways of using
such data:
\begin{enumerate}
\item Use some software tool to convert the genotype data into a VCF file.
\item Convert the data to a matrix format (inside or outside of R) and pass
the matrix data to \PODKAT.
\end{enumerate}
The first approach obviously requires no adaptation of the \PODKAT\ workflow.
For the second approach, the matrix data have to be converted to a
\verb+GenotypeMatrix+ object first. This can be done simply with the
\verb+genotypeMatrix()+ constructor. This function converts the matrix to a
column-wise sparse matrix and attaches positional information to it. The
original matrix has to be passed to \verb+genotypeMatrix()+ such that rows
correspond to samples and columns correspond to variants. The values in the
matrix need to conform to \PODKAT's interpretation (0 \dots\ only major
alleles / other values \dots\ number of minor alleles). Here is a simple
example with a random matrix:
<>=
A <- matrix(rbinom(10000, size=1, prob=0.05), 200, 50)
pos <- sort(sample(1:200000, 50))
Z <- genotypeMatrix(A, pos=pos, seqnames="chr1")
Z
variantInfo(Z)
@
In the example above, positions and chromosome names are supplied as a
numeric vector and a character vector, respectively. It is also possible to
supply a \verb+GRanges+ object directly. Though \PODKAT\ has mainly been
designed for analyzing sequencing data, there are also
\verb+genotypeMatrix()+ constructors that create a \verb+GenotypeMatrix+
object from an expression set (\verb+eSet+) object.
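For illustration, the positional information from the example above could
also be wrapped into a \verb+GRanges+ object and passed to
\verb+genotypeMatrix()+ directly; the following sketch assumes that the
objects \verb+A+ and \verb+pos+ from the previous code chunk are still
available and that, as indicated above, the constructor accepts a
\verb+GRanges+ object in place of the positions (see the help page of
\verb+genotypeMatrix()+ for the exact signature):
\begin{quote}
\begin{verbatim}
library(GenomicRanges)
gr <- GRanges(seqnames="chr1", ranges=IRanges(start=pos, width=1))
Z2 <- genotypeMatrix(A, pos=gr)
\end{verbatim}
\end{quote}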
Finally, it is worth mentioning that \PODKAT\ can also handle VCF data that
are stored in an object of class \verb+VCF+ (cf.\ package
\verb+VariantAnnotation+
\cite{ObenchainLawrenceCareyGogartenShannonMorgan14}). For details, see the
help page of \verb+genotypeMatrix()+.

\subsection{Preparations for a New Genome}\label{ssec:newGenome}

As described in Section~\ref{sec:regInterest}, it is necessary to define
regions of interest for association testing. For whole-genome association
testing, this is typically done by partitioning the sequenced regions of a
genome into windows (overlapping or non-overlapping). The \PODKAT\ package
provides ready-made \verb+GRangesList+ objects with the sequenced regions of
the following genomes: hg18, hg19, hg38, b36, and b37. For other genomes, the
sequenced regions must be pre-processed first and stored in an object of
class \verb+GRanges+ or \verb+GRangesList+. \PODKAT\ provides a function
\verb+unmaskedRegions()+ that allows for performing this pre-processing step
conveniently for genomes given as \verb+MaskedBSgenome+ objects. The
following example shows how this can be done for the autosomal chromosomes of
the mouse mm10 genome:
<>=
library(BSgenome.Mmusculus.UCSC.mm10.masked)
regions <- unmaskedRegions(BSgenome.Mmusculus.UCSC.mm10.masked,
                           chrs=paste0("chr", 1:19))
names(regions)
regions$chr1
@
\PODKAT\ also allows for treating pseudo-autosomal regions separately. In
order to facilitate association testing in which pseudo-autosomal regions are
treated like autosomal regions (and unlike sex chromosomes), it is necessary
to define pseudo-autosomal regions from the beginning. The
\verb+unmaskedRegions()+ function can do that; all it needs is a data frame
with positional information about pseudo-autosomal regions. The format of the
data frame has been chosen deliberately to allow direct use of the
definitions of pseudo-autosomal regions provided by the \verb+GWASTools+
package. The following code shows how to extract unmasked regions of the sex
chromosomes taking pseudo-autosomal regions into account (based on hg38):
<>=
library(BSgenome.Hsapiens.UCSC.hg38.masked)
library(GWASTools)
pseudoautosomal.hg38 ## from GWASTools package
psaut <- pseudoautosomal.hg38
psaut$chrom <- paste0("chr", psaut$chrom)
psaut
regions <- unmaskedRegions(BSgenome.Hsapiens.UCSC.hg38.masked,
                           chrs=c("chrX", "chrY"),
                           pseudoautosomal=psaut)
names(regions)
regions$chrX
regions$X.PAR1
@
Except for the fact that the above example only considers the two sex
chromosomes, this is exactly the R code that was used to create the
\verb+hg38Unmasked+ data object provided by \PODKAT. The other objects for
hg18, hg19, b36, and b37 also contain pseudo-autosomal regions as separate
components.

\subsection{Handling Large Data Sets}\label{ssec:parallel}

Small or moderately sized association tests like the examples presented above
can be performed within seconds on a regular desktop computer. Large
whole-genome studies, such as the two whole-genome cohorts from the UK10K
project,\footnote{\url{http://www.uk10k.org/studies/cohorts.html} (last visited 2021-04-30)}
comprise thousands of samples and tens of millions of variants, and the
compressed VCF files amount to hundreds of gigabytes. In order to analyze
such vast amounts of data, \PODKAT\ implements two complementary strategies,
chunking and parallelization, which we describe in more detail in the
following.
\subsubsection{Chunking}\label{sssec:chunking}

The genotype data stored in a 300GB compressed VCF file would hardly fit into
the main memory of a supercomputer, let alone a desktop computer. It has
already been mentioned above that \verb+assocTest()+ can be called for a
\verb+TabixFile+ object or simply the file name of a compressed VCF file,
though we have not yet gone into detail about how \verb+assocTest()+
processes a VCF file. In fact, it does {\em not load the entire file} (as we
have done above by calling \verb+readGenotypeMatrix()+ and then calling
\verb+assocTest()+ on the returned \verb+GenotypeMatrix+ object). Instead, it
processes batches of regions, i.e.\ the variants from a certain number of
regions are read from the VCF file and processed at once. How many regions
are processed at once is determined by the \verb+batchSize+ argument (the
default is 20). Suppose, for example, that \verb+assocTest()+ is called with
a \verb+ranges+ argument that contains 1000 regions to be tested: then 50
read operations are performed, each of which reads 20 regions from the VCF
file at once and processes them. The \verb+batchSize+ should be chosen such
that the entire genotype matrix of the regions of one batch still fits into
the computer's main memory; otherwise, swapping will severely slow down the
computations. A batch size of 1 is not optimal either, since the large number
of read operations and the redundancies from reading variants of overlapping
regions may lead to considerable overhead. The default of 20 regions is a
cautious choice that need not be optimal under all possible circumstances.
Depending on the number of samples and the size of the regions, one may
safely increase the batch size to 100 or even higher.

By means of its chunking strategy, \PODKAT\ is in principle able to process
large VCF files even on desktop computers in a single-processor manner. The
resulting computation times, however, can be considerably long. If this is
not acceptable, a multi-core system must be used along with \PODKAT's
parallelization abilities (see next).

\subsubsection{Parallel Processing}\label{sssec:parallel}

As mentioned above already, \PODKAT\ allows for performing large association
tests on multiple processors. \PODKAT\ makes use of R socket clusters as
provided by the \verb+parallel+ package (formerly \verb+snow+/\verb+snowfall+).
The simplest way of using this approach is to set the \verb+nnodes+ argument
to a number larger than 1:
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlkwd{assocTest}\hlstd{(vcfFile, model.b, windows,} \hlkwc{nnodes}\hlstd{=}\hlnum{8}\hlstd{)}
\end{alltt}
\begin{verbatim}
##!!## Overview of association test:
##!!## Null model: logistic
##!!## Number of samples: 200
##!!## Number of regions: 79
##!!## Number of regions without variants: 0
##!!## Average number of variants in regions: 24.1
##!!## Genome: hgA
##!!## Kernel: linear.podkat
##!!## p-value adjustment: none
\end{verbatim}
\end{kframe}
\end{knitrout}
In this example, computations are carried out by 8 parallel R client
processes. If the computer system has 8 or more cores/processors, these R
client processes are typically assigned to different cores/processors. So it
makes no sense to set \verb+nnodes+ to a number higher than the number of
cores/processors; otherwise, only unnecessary overhead would be caused.
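The number of cores available on the current machine can be determined, for
instance, with the \verb+detectCores()+ function from the \verb+parallel+
package (the value shown below is obviously system-dependent):
\begin{quote}
\begin{verbatim}
library(parallel)
detectCores()
##!!## [1] 8
\end{verbatim}
\end{quote}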
If the \verb+nnodes+ argument is used, the \verb+assocTest()+ function
creates the socket cluster internally and also shuts it down as soon as the
computations are finished. This is surely acceptable for one-time analyses,
but starting and shutting down the cluster creates unnecessary overhead if
multiple association tests are to be performed. In such a scenario, it is
more efficient if the user creates the socket cluster outside of
\verb+assocTest()+ and uses it multiple times via the \verb+cl+ argument, as
in the following example, which creates a socket cluster with 8 R client
processes, runs two association tests on this cluster, and then shuts down
the cluster:
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
\hlstd{cl} \hlkwb{<-} \hlkwd{makePSOCKcluster}\hlstd{(}\hlnum{8}\hlstd{)}
\hlkwd{assocTest}\hlstd{(vcfFile, model.b, windows,} \hlkwc{cl}\hlstd{=cl)}
\end{alltt}
\begin{verbatim}
##!!## Overview of association test:
##!!## Null model: logistic
##!!## Number of samples: 200
##!!## Number of regions: 79
##!!## Number of regions without variants: 0
##!!## Average number of variants in regions: 24.1
##!!## Genome: hgA
##!!## Kernel: linear.podkat
##!!## p-value adjustment: none
\end{verbatim}
\begin{alltt}
\hlkwd{assocTest}\hlstd{(vcfFile, model.c, windows,} \hlkwc{cl}\hlstd{=cl)}
\end{alltt}
\begin{verbatim}
##!!## Overview of association test:
##!!## Null model: linear
##!!## Number of samples: 200
##!!## Number of regions: 79
##!!## Number of regions without variants: 0
##!!## Average number of variants in regions: 24.1
##!!## Genome: hgA
##!!## Kernel: linear.podkat
##!!## p-value adjustment: none
\end{verbatim}
\begin{alltt}
\hlkwd{stopCluster}\hlstd{(cl)}
\end{alltt}
\end{kframe}
\end{knitrout}
The granularity of parallelization is determined by the \verb+batchSize+
argument described in Subsection~\ref{sssec:chunking} above: each client
process reads and processes as many regions at once as determined by the
\verb+batchSize+ argument and then hands back control to the R master session
until it is assigned the next chunk/batch. Each client reads directly from
the VCF file itself. Therefore, no large genomic data need to be exchanged
between master and client processes. Note that socket connections are used
for the communication between the R master session and its clients. Since the
number of connections that can be opened in an R session is currently limited
to 128, the maximum number of possible clients is also limited to 128, or
less if other connections are open in the R master session at the same time.

The above examples are geared to running parallelized association tests on a
multi-core/multi-processor machine. The \verb+parallel+ package also allows
for distributing computations over multiple machines (see the documentation
of the \verb+parallel+ package for more details). This also works with
\PODKAT, but only under the following two conditions:
\begin{enumerate}
\item All necessary packages (including \PODKAT) are installed on all
machines. Moreover, R versions and package versions on the different machines
need to be at least compatible (ideally, exactly the same).
\item The path to the VCF file that should be analyzed is accessible to all R
processes (master and clients). This can be accomplished by (1) storing a
copy of the VCF file on each machine at exactly the same location or (2) by
sharing the file between all machines over the network or, in the ideal case,
within a clustered file system.
Moreover, when \verb+assocTest()+ distributes its computations over multiple
client processes, the exchange of the null model object between the master
process and the client processes is done via a temporary file. This file must
reside in a directory that is accessible with exactly the same path from all
clients. For details on how to ensure this, see \verb+?assocTest+
(\verb+tmpdir+ argument).
\end{enumerate}
As said, it is in principle possible to distribute association tests over
multiple machines; however, it is more convenient and more efficient to run
large association tests on sufficiently large multi-core/multi-processor
computers. It is hard to give general estimates of computation times, since
the performance of \PODKAT's association tests depends on various factors,
such as the number of cores/processors, main memory, bus architecture, and
file system performance. So we restrict ourselves to one, not necessarily
representative, example: Johannes Kepler University Linz (JKU) operates an
SGI\RTM\ Altix\RTM\ UltraViolet 1000 compute server with 2,048 cores. On this
system, a whole-genome association test (all autosomal human chromosomes)
with about 1,800 samples and about 570,000 regions completed within less than
15 minutes (testing against a continuous trait; size of compressed VCF file:
about 350GB; computation distributed over 120 cores). Despite the
intimidating figures of this example, \PODKAT\ is implemented such that it
can perform this analysis also on a regular desktop computer with a single
processor only. The computation time, however, would amount to about 30
hours.

\section{More Details About \PODKAT}\label{sec:details}

\subsection{Test Statistics}\label{ssec:teststat}

In line with the SNP-set Kernel Association Test%
\footnote{formerly known as Sequence Kernel Association Test}
(\SKAT) \cite{WuLeeCaiLiBoehnkeLin11}, \PODKAT\ uses a variance component
score test to test for associations between genotypes and traits. \SKAT\
assumes that traits are distributed according to the following
semi-parametric mixed models:
\begin{align*}
\mathrm{logit}\big(p(y=1)\big)&= \alpha_0 + \bm{\alpha}^T\cdot\bm{x}+
h(\bm{z}) & \text{(binary trait)} \\
y &= \alpha_0 + \bm{\alpha}^T\cdot\bm{x}+ h(\bm{z})+\varepsilon &
\text{(continuous trait)}
\end{align*}
In the above formulas, $y$ is the trait, $\bm{x}$ is the covariate vector,
$\bm{z}$ is the genotype vector, $\alpha_0$ is the intercept, $\bm{\alpha}$
are the fixed effect coefficients, $h(.)$ is an unknown centered smooth
function, and $\varepsilon$ is the error term. \SKAT\ and \PODKAT\ both
assume that the function $h(.)$ is from a function space that is generated by
a given positive semi-definite kernel function $K(.,.)$ \cite{LiuGhoshLin08}.
\SKAT's and \PODKAT's null hypothesis is that $y$ is not influenced by the
genotype:
\begin{align*}
p(y=1)&= \mathrm{logit}^{-1}\big(\alpha_0 + \bm{\alpha}^T\cdot\bm{x} \big) &
\text{(binary trait)} \\
y &= \alpha_0 + \bm{\alpha}^T\cdot\bm{x}+ \varepsilon &
\text{(continuous trait)}
\end{align*}
As mentioned above, we use a {\em variance component score test}
\cite{Lin97,LiuGhoshLin08,WuLeeCaiLiBoehnkeLin11} to test against the null
hypothesis. More specifically, assume that a study consists of $l$ samples.
The traits are, therefore, given as a vector $\bm{y}\in\{0,1\}^l$ (if the
trait is binary) or $\bm{y}\in\Real^l$ (if the trait is continuous).
Covariates are given as an $l\times n$ matrix $\Mat{X}$, and genotypes are
given as an $l\times d$ matrix $\Mat{Z}$.
Further suppose that a (logistic) linear model has been trained to predict
$\bm{y}$ from the covariates only. For a continuous trait, we denote the
obtained predictions/fitted values with $\hat{\bm{y}}$. For a binary trait,
$\hat{\bm{y}}$ denotes the estimated/fitted probabilities that each sample
belongs to the positive class. In both cases, $\bm{y}-\hat{\bm{y}}$
corresponds to the null model residuals, i.e.\ what cannot be explained by
the covariates only. The test statistic is then defined as
\[
Q=(\bm{y}-\hat{\bm{y}})^T\cdot\Mat{K} \cdot(\bm{y}-\hat{\bm{y}}),
\]
where $\Mat{K}$ is an $l\times l$ positive semi-definite kernel matrix
defined as $K_{i,j}=K(\bm{z}_i,\bm{z}_j)$, where $\bm{z}_i$ and $\bm{z}_j$
are the genotypes of the $i$-th and $j$-th sample, respectively, i.e.\ the
$i$-th and $j$-th row of the genotype matrix $\Mat{Z}$. The kernel matrix
$\Mat{K}$ can be understood as a matrix that measures the pairwise similarity
of genotypes of samples. Since $\Mat{K}$ is positive semi-definite, $Q$ is
non-negative. The more structure the residual $\bm{y}-\hat{\bm{y}}$ and the
matrix $\Mat{K}$ share, the larger $Q$ (see Figure~\ref{fig:SKATinterp} for
illustrative examples). If the residuals and the genotypes are independent,
i.e.\ if the test's null hypothesis holds true, large values of $Q$ occur
only with low probability. Hence, we test whether the actually observed $Q$
value is higher than expected by pure chance. In other words, the test's
$p$-value is computed as the (estimated) probability of observing, under the
null hypothesis of independence between genotypes and traits, a value at
least as large as the observed $Q$.
\begin{figure}
\begin{center}
\begingroup\setlength{\arraycolsep}{2pt}
\fbox{\begin{minipage}{0.98\textwidth}
{\bfseries\sffamily A}\\[-2ex]
{\tiny
\[
Q=\overbrace{\left(\begin{array}{cccccccc} \gO & \gO & \gO & \gO & -1 & -1 &
-1 & -1 \end{array}\right)}^{(\bm{y}-\bm{\hat y})^T}\cdot
\overbrace{\left(
\begin{array}{cccccccc}
\gO & \gO & \gO & \gO & 0 & 0 & 0 & 0 \\
\gO & \gO & \gO & \gO & 0 & 0 & 0 & 0 \\
\gO & \gO & \gO & \gO & 0 & 0 & 0 & 0 \\
\gO & \gO & \gO & \gO & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & \gO & \gO & \gO & \gO \\
0 & 0 & 0 & 0 & \gO & \gO & \gO & \gO \\
0 & 0 & 0 & 0 & \gO & \gO & \gO & \gO \\
0 & 0 & 0 & 0 & \gO & \gO & \gO & \gO
\end{array}
\right)}^{\Mat{K}}\cdot
\overbrace{\left(
\begin{array}{r}
\gO\\ \gO\\ \gO\\ \gO\\ -1\\ -1\\ -1\\ -1
\end{array}
\right)}^{\bm{y}-\bm{\hat y}}
=\overbrace{\left(\begin{array}{cccccccc} \gF & \gF & \gF & \gF & -4 & -4 &
-4 & -4 \end{array}\right)}^{(\bm{y}-\bm{\hat y})^T\cdot\Mat{K}}\cdot
\overbrace{\left(
\begin{array}{r}
\gO\\ \gO\\ \gO\\ \gO\\ -1\\ -1\\ -1\\ -1
\end{array}
\right)}^{\bm{y}-\bm{\hat y}}=32
\]}
\end{minipage}}
\fbox{\begin{minipage}{0.98\textwidth}
{\bfseries\sffamily B}\\[-2ex]
{\tiny
\[
Q=\overbrace{\left(\begin{array}{cccccccc} \gO & -1 & \gO & -1 & \gO & -1 &
\gO & -1 \end{array}\right)}^{(\bm{y}-\bm{\hat y})^T}\cdot
\overbrace{\left(
\begin{array}{cccccccc}
\gO & \gO & \gO & \gO & 0 & 0 & 0 & 0 \\
\gO & \gO & \gO & \gO & 0 & 0 & 0 & 0 \\
\gO & \gO & \gO & \gO & 0 & 0 & 0 & 0 \\
\gO & \gO & \gO & \gO & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & \gO & \gO & \gO & \gO \\
0 & 0 & 0 & 0 & \gO & \gO & \gO & \gO \\
0 & 0 & 0 & 0 & \gO & \gO & \gO & \gO \\
0 & 0 & 0 & 0 & \gO & \gO & \gO & \gO
\end{array}
\right)}^{\Mat{K}}\cdot
\overbrace{\left(
\begin{array}{r}
\gO\\ -1\\ \gO\\ -1\\ \gO\\ -1\\ \gO\\ -1
\end{array}
\right)}^{\bm{y}-\bm{\hat y}}
=\overbrace{\left(\begin{array}{cccccccc} 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{array}\right)}^{(\bm{y}-\bm{\hat y})^T\cdot\Mat{K}}\cdot
\overbrace{\left(
\begin{array}{r}
\gO\\ -1\\ \gO\\ -1\\ \gO\\ -1\\ \gO\\ -1
\end{array}
\right)}^{\bm{y}-\bm{\hat y}}=0
\]}
\end{minipage}}
\fbox{\begin{minipage}{0.98\textwidth}
{\bfseries\sffamily C}\\[-2ex]
{\tiny
\[
Q=\overbrace{\left(\begin{array}{cccccccc} \gO & \gO & \gO & \gO & -1 & -1 &
-1 & -1 \end{array}\right)}^{(\bm{y}-\bm{\hat y})^T}\cdot
\overbrace{\left(
\begin{array}{cccccccc}
\gO & \gO & \gO & \gO & 0 & 0 & 0 & 0 \\
\gO & \gO & \gO & \gO & 0 & 0 & 0 & 0 \\
\gO & \gO & \gO & \gO & 0 & 0 & 0 & 0 \\
\gO & \gO & \gO & \gO & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
\end{array}
\right)}^{\Mat{K}}\cdot
\overbrace{\left(
\begin{array}{r}
\gO\\ \gO\\ \gO\\ \gO\\ -1\\ -1\\ -1\\ -1
\end{array}
\right)}^{\bm{y}-\bm{\hat y}}
=\overbrace{\left(\begin{array}{cccccccc} \gF & \gF & \gF & \gF & 0 & 0 & 0 &
0 \end{array}\right)}^{(\bm{y}-\bm{\hat y})^T\cdot\Mat{K}}\cdot
\overbrace{\left(
\begin{array}{r}
\gO\\ \gO\\ \gO\\ \gO\\ -1\\ -1\\ -1\\ -1
\end{array}
\right)}^{\bm{y}-\bm{\hat y}}=16
\]}
\end{minipage}}
\endgroup
\end{center}
\caption{Examples with 8 samples illustrating the variance component test
score. {\bf A:}~strong correspondence between the signs of the residuals and
the blocks in the kernel matrix $\Mat{K}$ results in a high $Q$. {\bf B:}~no
correspondence between the signs of the residuals and the blocks in $\Mat{K}$
results in a low $Q$. {\bf C:}~even if there is only a partial correspondence
between the signs of the residuals and the blocks in $\Mat{K}$, a relatively
high $Q$ is obtained.}
\label{fig:SKATinterp}
\end{figure}
In order to compute the $p$-value, we need to know the null distribution of
$Q$, i.e.\ how $Q$ is distributed under the assumption that $y$ does not
depend on $\bm{z}$. For continuous traits and normally distributed noise
$\varepsilon$, the residuals are normally distributed and the distribution of
$Q$ is obviously a {\em mixture of $\chi^2$ distributions}. For binary
traits, $Q$ approximately follows a {\em mixture of $\chi^2$ distributions},
too \cite{Lin97,WuLeeCaiLiBoehnkeLin11}. In either case,
\begin{equation}
Q\sim \sum\limits_{k=1}^q \lambda_k\cdot\chi_{1,k}^2,\label{eq:chi2mixt}
\end{equation}
where $\chi_{1,k}^2$ are independent $\chi^2$-distributed random variables
with one degree of freedom and $\lambda_1,\dots,\lambda_q$ are the non-zero
eigenvalues of the positive semi-definite $l\times l$ matrix
\[
\Mat{P}_0^{\frac{1}{2}}\cdot\Mat{K}\cdot\Mat{P}_0^{\frac{1}{2}}
\]
with
\begin{equation}
\Mat{P}_0 = \Mat{V} - \Mat{V}\cdot\Mat{\tilde X}\cdot
\big(\Mat{\tilde X}^T\cdot\Mat{V}\cdot\Mat{\tilde X}\big)^{-1}\cdot
\Mat{\tilde X}^T\cdot\Mat{V}.\label{eq:P0}
\end{equation}
In \eqref{eq:P0}, $\Mat{V}$ is a diagonal matrix defined as
$\Mat{V}=\hat{\sigma}^2\cdot\Mat{I}$ for continuous traits (with
$\hat{\sigma}^2$ being the estimated variance of the regression residuals
under the null hypothesis) and as
$\Mat{V}=\mathrm{diag}\big(\hat{y}_1(1-\hat{y}_1),\dots,
\hat{y}_l(1-\hat{y}_l)\big)$ for binary traits, where the diagonal entries
$\hat{y}_i(1-\hat{y}_i)$ are obviously the estimated individual variances of
the trait values under the null model \cite{WuLeeCaiLiBoehnkeLin11}.
Furthermore, $\Mat{\tilde X}=(\vec{1}\mid \Mat{X})$ is the model matrix of
the (logistic) linear model trained on the covariates.
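To make these formulas more concrete, the following minimal sketch computes
the test statistic $Q$ and the mixture weights $\lambda_k$ for a small
simulated data set with a continuous trait and an intercept-only null model.
This is only an illustration of the definitions above, not how \PODKAT\
computes these quantities internally:
\begin{quote}
\begin{verbatim}
set.seed(1)
l <- 20
Z <- matrix(rbinom(l * 10, size=2, prob=0.1), l, 10)
y <- rnorm(l)
res <- y - mean(y)                 # null model residuals
K <- Z %*% t(Z)                    # linear kernel matrix
Q <- drop(t(res) %*% K %*% res)    # test statistic
## P0 as defined above; model matrix is intercept-only here
V <- diag(var(res), l)
Xt <- matrix(1, l, 1)
P0 <- V - V %*% Xt %*% solve(t(Xt) %*% V %*% Xt) %*% t(Xt) %*% V
e <- eigen(P0, symmetric=TRUE)
P0sqrt <- e$vectors %*% diag(sqrt(pmax(e$values, 0))) %*%
    t(e$vectors)
lambda <- eigen(P0sqrt %*% K %*% P0sqrt, symmetric=TRUE,
                only.values=TRUE)$values
lambda <- lambda[lambda > 1e-8]    # non-zero mixture weights
\end{verbatim}
\end{quote}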
Once the eigenvalues $\lambda_1,\dots,\lambda_q$ have been determined, a
method for estimating the probability distribution function of a mixture of
$\chi^2$ distributions can be used to compute the test's $p$-value, such as
Davies' method \cite{Davies80} or Liu's method \cite{LiuTangZhang09}. Like
\SKAT, \PODKAT\ implements both methods; which one is used can be chosen with
the \verb+method+ argument (unless small sample correction is used; see
Subsection~\ref{ssec:sscor} below).

\PODKAT\ further implements a new variant of the above test that is suitable
for smaller studies with binary traits in which no covariates need to be
taken into account. This test assumes the following random effects model:
\[
\mathrm{logit}\big(p(y=1)\big)= \alpha_0 + h(\bm{z})
\]
The null hypothesis of this variant is that the trait is
Bernoulli-distributed with a constant probability and independent of the
genotype, i.e.
\[
p(y=1)=\mathrm{logit}^{-1}(\alpha_0).
\]
With the same assumptions as above, the test statistic is given as
\[
Q=\bm{y}^T\cdot\Mat{K} \cdot\bm{y}.
\]
Since all $y_i$ are Bernoulli-distributed (we estimate the probability of a
positive outcome as the relative frequency of positive outcomes
$\bar p=\frac{1}{l}\sum_{i=1}^l y_i$), $Q$ is distributed according to a
{\em mixture of Bernoulli distributions} and the exact $p$-value of the test
can be computed as
\[
p=\sum\limits_{\bm{y}} p(\bm{y})\cdot
\left\{\begin{array}{ll}
1 & \text{if } \bm{y}^T\cdot\Mat{K} \cdot\bm{y}\geq Q \\
0 & \text{otherwise}
\end{array}\right\},
\]
i.e.\ the sum of probabilities of outcomes that produce a test statistic at
least as large as $Q$. For a given outcome $\bm{y}=(y_1,\dots,y_l)$, the
probability is given as
\begin{align*}
p(\bm{y})&=\bar p^k\cdot(1-\bar p)^{l-k} & \text{with } &
k=\sum\limits_{i=1}^l y_i.
\end{align*}
Since all $y_i$ are binary, $Q=\bm{y}^T\cdot\Mat{K} \cdot\bm{y}$ is nothing
else but the sum of the entries of the sub-matrix of $\Mat{K}$ that consists
of all those rows and columns $i$ for which $y_i=1$ holds. If all entries of
$\Mat{K}$ are non-negative, the test's exact $p$-value can be computed by a
recursive algorithm that starts from $\bm{y}=(1,\dots,1)$ and recursively
removes ones as long as $\bm{y}^T\cdot\Mat{K} \cdot\bm{y}\geq Q$ holds. In
order to use this variant, the null model must be created with
\verb+type="bernoulli"+. With the default setting \verb+type="automatic"+,
this variant is chosen if the trait is binary, no covariates are present, and
the number of samples does not exceed 100. The latter restriction is
necessary to avoid excessive computation times caused by the recursive
computation of exact $p$-values.

\subsection{Kernels}\label{ssec:kernels}

It is clear from the previous section that the association tests implemented
in \PODKAT\ rely on the choice of a kernel function that computes the
pairwise similarities of the samples' genotypes. The simplest kernel is the
{\em linear kernel} that computes the similarities as the product of the
genotype matrix with its transpose:
\[
\Mat{K}=\Mat{Z}\cdot\Mat{Z}^T
\]
This kernel can be used by calling \verb+assocTest()+ with the option
\verb+kernel="linear.SKAT"+ and without any weighting
(\verb+weightFunc=NULL+). The kernel's name has been deliberately chosen to
indicate that this is nothing else but the linear \SKAT\ test without
weights.
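For example, such an unweighted linear-kernel test could be run as follows
(reusing the objects \verb+vcfFile+, \verb+model.c+, and \verb+windows+ from
the examples above; the call is not executed here):
\begin{quote}
\begin{verbatim}
assocTest(vcfFile, model.c, windows, kernel="linear.SKAT",
          weightFunc=NULL)
\end{verbatim}
\end{quote}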
The same test can also be carried out with the {\em weighted linear kernel}
\begin{equation}
\Mat{K}=\Mat{Z}\cdot\Mat{W}\cdot\Mat{W}^T\cdot\Mat{Z}^T,\label{eq:linKernel}
\end{equation}
where $\Mat{W}$ is a diagonal matrix of weights
$\Mat{W}=\mathrm{diag}(w_1,\dots,w_d)$ with which the columns (i.e.\
variants) in the genotype matrix $\Mat{Z}$ are scaled before the outer
product is computed. If \verb+assocTest()+ is called for a VCF file, the
weighting must be done via a {\em weighting function} that computes the
variants' weights as a function of their minor allele frequencies (see
Subsection~\ref{ssec:weightFunc} below). If \verb+assocTest()+ is called for
a matrix-like object, weights can also be specified as a per-column weight
vector using the \verb+weights+ argument.

The acronym \PODKAT\ stands for {\em Position-Dependent Kernel Association
Test}. This test uses a {\em position-dependent linear kernel}:
\begin{equation}
\Mat{K}=\Mat{Z}\cdot\Mat{W}\cdot\Mat{P}\cdot\Mat{P}^T
\cdot\Mat{W}^T\cdot\Mat{Z}^T\label{eq:posDepKernel}
\end{equation}
The $d\times d$ matrix $\Mat{P}$ is a positive semi-definite kernel matrix
that measures the similarity/closeness of the positions of variants, i.e.
\[
P_{i,j}=\max\big(1-{\textstyle\frac{1}{w}}|\mathrm{pos}_i-\mathrm{pos}_j|,
0\big),
\]
where $\mathrm{pos}_i$ and $\mathrm{pos}_j$ are the genomic positions of the
$i$-th and the $j$-th variant, respectively. The parameter $w$ determines the
{\em maximal radius of tolerance}: if two positions are the same, the kernel
gives a similarity of 1; the similarity decreases linearly with increasing
genomic distance of the two variants under consideration; if the distance is
$w$ or larger, the positional similarity is 0. This similarity is actually a
positive semi-definite kernel
\cite{BelancheVazquezVazquez08,BodenhoferSchwarzbauerIonescuHochreiter09}.
Figure~\ref{fig:posKernel} shows how the similarity depends on the difference
of positions. This kernel can be used by calling \verb+assocTest()+ with the
option \verb+kernel="linear.podkat"+ (which is the default). Weighting can be
configured in the same way as described for the linear kernel above. The
radius of tolerance $w$ can be set with the parameter \verb+width+ (the
default is 1,000~bp).
\begin{figure}
\begin{center}
\includegraphics[width=0.8\textwidth]{PosKernel}
\end{center}
\caption{Graph demonstrating how the similarity of genomic positions is
computed depending on the difference of positions.}\label{fig:posKernel}
\end{figure}
The motivation behind \PODKAT's position-dependent kernel is to better
account for very rare or even {\em private variants}, that is, variants that
only occur in one sample. The linear kernel is not able to take private
variants into account. In order to demonstrate this, consider how a single
entry of the kernel matrix is computed (for simplicity, without any
weighting):
\[
K_{i,j}=\sum\limits_{k=1}^d Z_{i,k}\cdot Z_{j,k}
\]
If the $k$-th variant is private, there is only one $i$ for which
$Z_{i,k}>0$, whereas all other $Z_{j,k}=0$. Hence, the $k$-th variant makes
no contribution to $K_{i,j}$ for $i\not=j$. The position-dependent kernel,
however, computes a convolution of the genotype matrix with the position
kernel and thereby makes use of agglomerations of private variants in the
same genomic region.
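The positional similarity matrix $\Mat{P}$ is easy to compute explicitly. The
following small sketch, with freely invented positions and the default width
$w=1{,}000$, illustrates the formula above:
\begin{quote}
\begin{verbatim}
pos <- c(100, 350, 400, 1200)   # hypothetical variant positions
w <- 1000                       # maximal radius of tolerance
P <- pmax(1 - abs(outer(pos, pos, "-")) / w, 0)
round(P, 2)
##!!##      [,1] [,2] [,3] [,4]
##!!## [1,] 1.00 0.75 0.70 0.00
##!!## [2,] 0.75 1.00 0.95 0.15
##!!## [3,] 0.70 0.95 1.00 0.20
##!!## [4,] 0.00 0.15 0.20 1.00
\end{verbatim}
\end{quote}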
\begin{figure}
\begin{center}
\fbox{\begin{minipage}{0.95\textwidth}
{\bfseries\sffamily A}\hspace{0.34\textwidth}%
\makebox[0pt][c]{$\Mat{Z}$}
\hspace{0.42\textwidth}\makebox[0pt][c]{$\Mat{K}=\Mat{Z}\cdot\Mat{Z}^T$}
\\[1.6ex]
\includegraphics[width=\textwidth]{SKAT_example}
\end{minipage}}
\fbox{\begin{minipage}{0.95\textwidth}
{\bfseries\sffamily B}\hspace{0.34\textwidth}%
\makebox[0pt][c]{$\Mat{Z}\cdot\Mat{P}$}\hspace{0.42\textwidth}
\makebox[0pt][c]{$\Mat{K}=\Mat{Z}\cdot\Mat{P}\cdot\Mat{P}^T\cdot\Mat{Z}^T$}
\\[1.6ex]
\includegraphics[width=\textwidth]{PODKAT_example}
\end{minipage}}
\end{center}
\caption{Simple example demonstrating how \PODKAT's position-dependent kernel
takes private variants into account. {\bf A:}~genotype matrix $\Mat{Z}$
(left) and kernel matrix of the linear kernel (right); {\bfseries
B:}~convolution of genotype matrix $\Mat{Z}$ with positional similarities
$\Mat{P}$ (left) and kernel matrix of the position-dependent kernel (right).}
\label{fig:posDepKernelExample}
\end{figure}
Figure~\ref{fig:posDepKernelExample} shows a simple example that demonstrates
how \PODKAT's position-dependent kernel takes private variants into account.
The left side of Panel~A shows the genotype matrix $\Mat{Z}$. Obviously,
$\Mat{Z}$ consists of 6 samples and 11 variants, each of which occurs in only
one sample. The right-hand side of Panel~A shows the kernel matrix that would
be obtained for the linear kernel. Since all variants are private, the kernel
matrix is diagonal, which does not allow for any meaningful association
testing, since there are no blocks of similar samples in the kernel matrix
(compare with Section~\ref{ssec:teststat} and the examples in
Figure~\ref{fig:SKATinterp}). The left side of Panel~B shows the convolution
of the genotype matrix $\Mat{Z}$ with the positional similarities $\Mat{P}$,
and the right side shows the resulting kernel matrix for the
position-dependent kernel. Suddenly, two blocks of samples (samples 1--3 and
samples 4--6) become visible, because samples 1--3 have minor alleles in the
left half of the sequence and samples 4--6 have minor alleles in the right
half of the sequence. Now suppose that samples 1--3 are cases and samples
4--6 are controls. If the mutations/minor alleles in the left half of the
sequence (e.g.\ a particular exon or transcription factor binding site) are
causal for the disease, \PODKAT\ would be able to detect this association,
whereas \SKAT\ with the regular linear kernel would not.

\PODKAT\ further provides the {\em IBS (identity by state) kernel}
\[
K_{i,j}=\frac{1}{\sum_{k=1}^d w_k}
\sum\limits_{k=1}^d w_k\cdot(2-|Z_{i,k}-Z_{j,k}|)
\]
and the {\em quadratic kernel}
\[
K_{i,j}=\big(1+\sum\limits_{k=1}^d w_k\cdot Z_{i,k}\cdot Z_{j,k}\big)^2
\]
that are also available in \SKAT\ \cite{WuLeeCaiLiBoehnkeLin11}. In order to
make use of these two kernels, \verb+assocTest()+ must be called with the
parameters \verb+kernel="localsim.SKAT"+ or \verb+kernel="quadratic.SKAT"+,
respectively. Analogously to the linear kernel, weighting must be controlled
with the argument \verb+weightFunc+ (or \verb+weights+) and can be switched
off with \verb+weightFunc=NULL+, in which case \verb+assocTest()+ internally
sets all $w_k=1$. Additionally, in order to enable these kernels to take
private variants into account as well, there are position-dependent variants
of the IBS kernel and the quadratic kernel that can be used with kernels
\verb+"localsim.podkat"+ or \verb+"quadratic.podkat"+, respectively.
These kernels first compute the convolution of the genotype matrix $\Mat{Z}$
with the positional similarities $\Mat{P}$ (compare with the example in
Figure~\ref{fig:posDepKernelExample}B) and then compute the kernel matrices
according to the formulas above. It must be pointed out that the linear
kernel and the position-dependent linear kernel (kernel choices
\verb+linear.SKAT+ and \verb+linear.podkat+) can be represented as outer
products, which, in many cases, allows for much more efficient computations
than the other four kernels. So, the computing times for large whole-genome
studies with the four kernel choices \verb+localsim.SKAT+,
\verb+localsim.podkat+, \verb+quadratic.SKAT+, and \verb+quadratic.podkat+
can be prohibitively long.

\subsection{Weighting Functions}\label{ssec:weightFunc}

As mentioned above, all six kernels implemented in \PODKAT\ can be used with
and without weighting. In all six kernels, weights need to be defined in a
per-variant fashion (i.e.\ one weight per column of the genotype matrix
$\Mat{Z}$). If \verb+assocTest()+ is called for a VCF file, i.e.\ its
argument \verb+Z+ is the file name of a VCF file or a \verb+TabixFile+
object, then \verb+assocTest()+ requires a one-argument function that
computes a vector of weights from a vector of minor allele frequencies
(MAFs). This function must be passed as the \verb+weightFunc+ argument (if it
is \verb+NULL+, the kernels are used without weights). For convenience,
\PODKAT\ provides three built-in weighting schemes.

Firstly, there is the weighting function
\begin{equation}
f(x)=\frac{1}{\sqrt{x\cdot(1-x)}},\label{eq:invSdWeights}
\end{equation}
which computes weights as the reciprocals of the standard deviations of the
minor allele probabilities \cite{MadsenBrowning09}:
\[
w_k=\frac{1}{\sqrt{\MAF_k\cdot(1-\MAF_k)}}
\]
This variant is provided as the function \verb+invSdWeights()+.
Figure~\ref{fig:weightFuncs}A visualizes this weighting function.
\begin{figure}
\begin{center}
\fbox{\begin{minipage}{0.98\textwidth}
{\bfseries\sffamily A}\\
\centerline{\includegraphics[width=0.8\textwidth]{invSdWeights}}
\end{minipage}}
\fbox{\begin{minipage}{0.98\textwidth}
{\bfseries\sffamily B}\\
\centerline{\includegraphics[width=0.8\textwidth]{betaWeights}}
\end{minipage}}
\fbox{\begin{minipage}{0.98\textwidth}
{\bfseries\sffamily C}\\
\centerline{\includegraphics[width=0.8\textwidth]{logisticWeights}}
\end{minipage}}
\end{center}
\caption{{\bfseries A:}~weighting function \eqref{eq:invSdWeights};
{\bfseries B:}~beta distribution weighting functions \eqref{eq:betaWeights}
for different parameters; {\bfseries C:}~soft threshold weighting functions
\eqref{eq:logisticWeights} for different parameters.}\label{fig:weightFuncs}
\end{figure}
Secondly, there is a parameterized family of weighting functions that
correspond to the densities of the beta distribution:
\begin{equation}
f_{\alpha,\beta}(x)=\frac{x^{\alpha-1}\cdot(1-x)^{\beta-1}}{B(\alpha,\beta)},
\label{eq:betaWeights}
\end{equation}
where $\alpha,\beta$ are the two shape parameters and $B(\alpha,\beta)$ is a
normalization constant that only depends on the shape parameters. The package
provides the function \verb+betaWeights()+. If called with the two arguments
\verb+shape1+ and \verb+shape2+, this function creates the weighting function
with these particular shape parameters. The default parameters are $\alpha=1$
and $\beta=25$ as suggested by Wu {\em et al.} \cite{WuLeeCaiLiBoehnkeLin11}.
This is actually the default setting of the \verb+weightFunc+ parameter of
the \verb+assocTest()+ function. Figure~\ref{fig:weightFuncs}B shows some
examples with different shape parameters.

Thirdly, \PODKAT\ offers the possibility to use a {\em soft threshold
function} (logistic function)
\begin{equation}
f(x)=\frac{1}{1+\exp\big(a\cdot(x-\theta)\big)},\label{eq:logisticWeights}
\end{equation}
where $\theta$ is the threshold and $a$ is the slope that determines how
hard/soft the threshold is. The function \verb+logisticWeights()+ can be used
to create a weighting function with a given threshold and slope.
Figure~\ref{fig:weightFuncs}C shows some examples with different parameters.

Users are not limited to the three weighting schemes described above: it is
possible to pass arbitrary weighting functions as the \verb+weightFunc+
argument. However, user-provided weighting functions must conform to some
conventions: (1) the MAFs must be passed as the first argument; (2) the
function must accept a numeric vector of MAFs and return a numeric vector of
the same length; (3) the returned weights must be non-negative; (4) the
function must not require any arguments other than the first one.

\subsection{Computing Single-Variant Contributions}\label{ssec:svcontrib}

As mentioned in Subsection~\ref{ssec:contrib} above, \PODKAT\ allows for
decomposing the test statistic $Q$ into individual contributions of single
variants. This is possible for the linear kernel and the position-dependent
linear kernel. The linear kernel can be reformulated as
\[
Q=(\bm{y}-\hat{\bm{y}})^T\cdot\Mat{Z}\cdot\Mat{W}\cdot\Mat{W}^T\cdot
\Mat{Z}^T\cdot(\bm{y}-\hat{\bm{y}})
=\|\underbrace{\Mat{W}^T\cdot\Mat{Z}^T\cdot(\bm{y}-\hat{\bm{y}})}_{=\bm{p}}\|^2,
\]
where $\bm{p}=\Mat{W}^T\cdot\Mat{Z}^T\cdot(\bm{y}-\hat{\bm{y}})$ is a vector
of length $d$. In the same way, the position-dependent linear kernel can be
rewritten as
\[
Q=(\bm{y}-\hat{\bm{y}})^T\cdot\Mat{Z}\cdot\Mat{W}\cdot\Mat{P}\cdot\Mat{P}^T\cdot
\Mat{W}^T\cdot\Mat{Z}^T\cdot(\bm{y}-\hat{\bm{y}})
=\|\underbrace{\Mat{P}^T\cdot\Mat{W}^T\cdot\Mat{Z}^T\cdot
(\bm{y}-\hat{\bm{y}})}_{=\bm{p}}\|^2.
\]
Again,
$\bm{p}=\Mat{P}^T\cdot\Mat{W}^T\cdot\Mat{Z}^T\cdot(\bm{y}-\hat{\bm{y}})$ is a
vector of length $d$. So, in both cases, $Q$ can be represented as the
squared norm, i.e.\ the sum of squares, of a vector $\bm{p}$ of length $d$,
with one entry per variant:
\[
Q=\|\bm{p}\|^2=\sum\limits_{k=1}^d p_k^2
\]
So, the $k$-th variant contributes $p_k^2$ to the test statistic $Q$. The
function \verb+weights()+ described in Subsection~\ref{ssec:contrib} computes
two values per variant: on the one hand, the raw contribution $p_k$ is stored
in the metadata column \verb+weight.raw+. These values are particularly
helpful for finding out how a variant is associated with the residuals
$\bm{y}-\hat{\bm{y}}$: if $p_k$ is positive, the association is positive; if
it is negative, the association is negative. On the other hand, the {\em
relative contribution}
\[
p'_k=\frac{p_k^2}{\|\bm{p}\|^2}=\frac{p_k^2}{Q}
\]
is stored in the metadata column \verb+weight.contribution+. This value
measures how much each variant contributes to the test statistic $Q$, where
these contributions are normalized to sum up to one.
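In \R\ terms, this decomposition is easy to carry out manually. The following
minimal sketch (unweighted linear kernel, i.e.\ $\Mat{W}$ is the identity,
with simulated data in place of real genotypes and null model residuals)
illustrates it:
\begin{quote}
\begin{verbatim}
set.seed(1)
Z <- matrix(rbinom(200, size=2, prob=0.1), 20, 10)
res <- rnorm(20)           # stands in for y - y.hat
p <- drop(t(Z) %*% res)    # raw contributions (one per variant)
Q <- sum(p^2)              # equals the test statistic
contrib <- p^2 / Q         # relative contributions, sum to 1
\end{verbatim}
\end{quote}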
\subsection{Details on the Small Sample Correction}\label{ssec:sscor}

As mentioned in Subsection~\ref{ssec:teststat}, the test statistic of
\PODKAT's association tests is assumed to follow a mixture of $\chi^2$
distributions \eqref{eq:chi2mixt} \cite{WuLeeCaiLiBoehnkeLin11}, where, as
already mentioned above, the mixture weights
$\bm{\lambda}=(\lambda_1,\dots,\lambda_q)$ (with $q\leq l$) are given as the
non-zero eigenvalues of the matrix
\begin{equation}
\Mat{P}_0^{\frac{1}{2}}\cdot\Mat{K}\cdot\Mat{P}_0^{\frac{1}{2}}\label{eq:P0KP0}
\end{equation}
(with $\Mat{P}_0$ being defined as described in \eqref{eq:P0} above). This is
theoretically provable for continuous traits under the assumptions of the
null hypothesis, but holds only approximately for binary traits. In many
cases, determining the null distribution as the mixture of $\chi^2$
distributions described above leads to inflated type I error rates for binary
traits. In order to overcome this problem, Lee {\em et al.}
\cite{LeeEmondBamshadBarnesRiederNickerson12} have proposed a {\em small
sample correction} for binary traits for \SKAT. \PODKAT\ introduces a new
position-dependent kernel in order to be able to take private and very rare
variants into account, but otherwise builds upon the same test framework as
\SKAT. Therefore, \PODKAT\ also implements the small sample correction
according to \cite{LeeEmondBamshadBarnesRiederNickerson12}.

The small sample correction alternatively computes the $p$-value of an
association test for binary traits (as described in
Subsection~\ref{ssec:teststat}) as
\begin{equation}
1-P_{\chi^2_{\mathrm{df}}}\Big(\frac{(Q-\mu_Q)\cdot\sqrt{2\mathrm{df}}}%
{\sigma_Q}+\mathrm{df}\Big),\label{eq:nullDistSScor}
\end{equation}
where $P_{\chi^2_{\mathrm{df}}}$ is the cumulative distribution function of a
$\chi^2$ distribution with $\mathrm{df}$ degrees of freedom, $\mu_Q$ is the
theoretical mean of the mixture
\begin{equation}
\mu_Q=\sum\limits_{k=1}^q \lambda_k,\label{eq:muQ}
\end{equation}
i.e.\ the expected test statistic under the null hypothesis, $\sigma_Q^2$ is
the theoretical variance of the mixture
\begin{equation}
\sigma_Q^2=\bm{\lambda}^T\cdot\Mat{C}\cdot\bm{\lambda},\label{eq:theorVar}
\end{equation}
and $\mathrm{df}$ is the number of degrees of freedom that is computed as
\begin{equation}
\mathrm{df}=\frac{\Big(\sum\limits_{k=1}^q \lambda_k'^2\Big)^2}%
{\sum\limits_{k=1}^q \lambda_k'^4}.\label{eq:dfChi2}
\end{equation}
In \eqref{eq:theorVar} above, $\Mat{C}$ is a $q\times q$ matrix defined as
\[
\Mat{C}=\Mat{U}_2^T\cdot\mathrm{diag}(\bm{\varphi})\cdot\Mat{U}_2+2\cdot\Mat{I},
\]
where $\Mat{U}_2$ is an $l\times q$ matrix containing the elementwise squares
of $\Mat{U}$, the $l\times q$ matrix of eigenvectors of non-zero eigenvalues
of the matrix \eqref{eq:P0KP0}, and $\bm{\varphi}$ is a vector of length $l$
whose entries are defined as
\[
\varphi_i=\frac{3p(y_i=1)^2-3p(y_i=1)+1}{p(y_i=1)(1-p(y_i=1))}-3.
\]
Practically, the probabilities $p(y_i=1)$ are estimated by the null model's
fitted probabilities $\hat y_i$ (see Subsection~\ref{ssec:teststat}).
Finally, the values $\lambda_k'$ in \eqref{eq:dfChi2} are defined as
\[
\lambda'_k=\frac{\lambda_k\cdot C_{k,k}}{\sqrt{2}},
\]
where $C_{k,k}$ are the diagonal elements of the matrix $\Mat{C}$ defined
above.
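In \R\ terms, once $Q$, $\mu_Q$, $\sigma_Q$, and $\mathrm{df}$ have been
computed according to the formulas above, the corrected $p$-value
\eqref{eq:nullDistSScor} amounts to a single call of the standard function
\verb+pchisq()+ (a sketch, assuming these four quantities are available as
variables of the corresponding names):
\begin{quote}
\begin{verbatim}
## corrected p-value; Q, muQ, sigmaQ, df computed beforehand
p.value <- 1 - pchisq((Q - muQ) * sqrt(2 * df) / sigmaQ + df,
                      df=df)
\end{verbatim}
\end{quote}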
The R package \verb+SKAT+ does not exactly use the above method, but instead
uses the matrix
\begin{equation}
\Mat{V}^{\frac{1}{2}}\cdot\Mat{K}\cdot\Mat{V}^{\frac{1}{2}}\label{eq:VKV}
\end{equation}
(with $\Mat{V}$ defined as in Subsection~\ref{ssec:teststat}) to determine
the mixture weights. This is mainly for computational efficiency, since the
square root of the diagonal matrix $\Mat{V}$ is much easier to compute than
the square root of the dense matrix $\Mat{P}_0$. It can be proven easily that
the eigenvalues of matrix \eqref{eq:VKV} are the same as those of matrix
\eqref{eq:P0KP0}. However, the eigenvectors, which are needed for computing
the matrix $\Mat{C}$, are not identical, but sufficiently close, as
computational simulations have demonstrated. \PODKAT, by default, uses this
approximation too, but also allows for using the exact matrix
\eqref{eq:P0KP0}. In order to make use of this option, the null model has to
be created by calling \verb+nullModel()+ with the option
\verb+adjExact=TRUE+. With this option, \verb+nullModel()+ pre-computes the
square root of the matrix $\Mat{P}_0$ and stores it in the slot
\verb+P0sqrt+ of the returned \verb+NullModel+ object.

Lee {\em et al.} \cite{LeeEmondBamshadBarnesRiederNickerson12} further
suggested adjusting the higher moments of the estimated null distribution
used in \eqref{eq:nullDistSScor} by correcting the degrees of freedom
parameter $\mathrm{df}$ such that the kurtosis matches the kurtosis observed
in test statistics of residuals sampled according to the null distribution.
\PODKAT, like \SKAT, offers two methods that can be specified when training a
null model with \verb+nullModel()+: \verb+type.resampling="bootstrap"+
creates residuals according to the estimated Bernoulli distributions with
probabilities $\hat{y}_i$; \verb+type.resampling="permutations"+ computes
permutations of the residuals. The former variant is the default and strongly
recommended. So, given the test statistics $Q_1,\dots,Q_N$ of $N$ sampled
vectors of residuals, the correction for higher moments estimates the excess
kurtosis $\hat{\gamma}$ and then replaces the parameter $\mathrm{df}$ in
\eqref{eq:nullDistSScor} by
\[
\mathrm{df}=\frac{12}{\hat{\gamma}}.
\]
\PODKAT\ offers multiple ways of estimating the excess kurtosis from the
sampled test statistics $Q_1,\dots,Q_N$ (the number $N$ can be determined
with the \verb+n.resampling.adj+ argument of the function
\verb+nullModel()+). Which variant is chosen can be determined with the
argument \verb+method+ when calling \verb+assocTest()+:
\begin{description}
\item[Unbiased sample kurtosis:] this method uses the theoretical mean of the
null distribution $\mu_Q$ (if available; see \eqref{eq:muQ}) and estimates
the excess kurtosis as
\[
\hat{\gamma}=\frac{\frac{1}{N}\sum\limits_{k=1}^N(Q_k-\mu_Q)^4}%
{\Big(\frac{1}{N}\sum\limits_{k=1}^N(Q_k-\mu_Q)^2\Big)^2}-3.
\]
This variant can be selected with \verb+method="unbiased"+.
\item[Biased sample kurtosis:] this method first computes the empirical mean
$\hat{\mu}_1$ of the sampled test statistics and estimates the excess
kurtosis as
\[
\hat{\gamma}=\frac{\frac{1}{N}\sum\limits_{k=1}^N(Q_k-\hat{\mu}_1)^4}%
{\Big(\frac{1}{N}\sum\limits_{k=1}^N(Q_k-\hat{\mu}_1)^2\Big)^2}-3.
\]
This variant can be selected with \verb+method="sample"+.
\item[Corrected sample kurtosis:] this variant is aimed at computing an
(almost) unbiased estimator of the excess kurtosis as
\[
\hat{\gamma}=\frac{(N+1)\cdot N\cdot (N-1)}{(N-2)\cdot(N-3)}\cdot
\frac{\sum\limits_{k=1}^N(Q_k-\hat{\mu}_1)^4}%
{\Big(\sum\limits_{k=1}^N(Q_k-\hat{\mu}_1)^2\Big)^2}-3\cdot
\frac{(N-1)^2}{(N-2)\cdot(N-3)}.
\]
This method can be selected with \verb+method="population"+.
\item[\SKAT\ compatibility mode:] this variant is aimed at consistency with
the implementation in the R package \verb+SKAT+; it estimates the excess
kurtosis as
\[
\hat{\gamma}=\frac{\frac{1}{N}\sum\limits_{k=1}^N(Q_k-\hat{\mu}_1)^4}%
{\Big(\frac{1}{N-1}\sum\limits_{k=1}^N(Q_k-\hat{\mu}_1)^2\Big)^2}-3.
\]
This method can be selected with \verb+method="SKAT"+.
\end{description}
As already mentioned briefly in Section~\ref{sec:nullModel}, the argument
\verb+adj+ of the \verb+nullModel()+ function controls how the null model is
prepared to be ready for small sample correction and higher moment
correction. The setting \verb+adj="force"+ creates sampled residuals for
later higher moment correction in any case, while \verb+adj="none"+ generally
switches off the creation of sampled residuals. The default is
\verb+adj="automatic"+, which creates sampled residuals if the number of
samples in the study is at most 2,000. The number of sampled residuals is
controlled by the \verb+n.resampling.adj+ argument as noted in
Section~\ref{sec:nullModel} already. The default number of sampled residuals
is 10,000.

The function \verb+assocTest()+ has an \verb+adj+ argument too, the meaning
of which is similar to the \verb+adj+ argument of the \verb+nullModel()+
function: if \verb+assocTest()+ is called with \verb+adj="automatic"+ (which
is the default), corrections are only made for at most 2,000 samples. For
more than 2,000 samples, all corrections are switched off. For
\verb+adj="force"+, corrections are generally switched on, and for
\verb+adj="none"+, corrections are generally switched off.

If corrections are switched on, the small sample correction described above
is performed. As already mentioned, which variant is used is determined by
whether the null model has been created with \verb+adjExact=TRUE+ or
\verb+adjExact=FALSE+. If the null model includes sampled residuals, the
small sample correction is complemented by the higher moment correction
described above. In case \verb+assocTest()+ fails to compute the mixture
weights $\bm{\lambda}=(\lambda_1,\dots,\lambda_q)$, it still uses the formula
\eqref{eq:nullDistSScor} for computing the $p$-values, but with values for
$\mu_Q$ and $\sigma_Q$ that are estimated from the sampled test statistics as
well.

It must be emphasized that both small sample correction and higher moment
correction, especially the latter, result in a significant increase in
computation time. For larger studies, we suggest trying \verb+assocTest()+
without small sample correction first and creating a Q-Q plot to analyze the
results. If the $p$-values in the Q-Q plot are sufficiently close to the
diagonal, the test correctly controls the type I error rate and no correction
is necessary anyway. If this is not the case and the test is either too
conservative (i.e.\ the $p$-values in the Q-Q plot are consistently below the
diagonal) or the $p$-values are inflated (i.e.\ the $p$-values in the Q-Q
plot are consistently above the diagonal), some correction should be applied.
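Such a first pass without corrections might look as follows. This is only a
sketch that reuses the objects \verb+pheno.b+, \verb+vcfFile+, and
\verb+windows+ from the examples above and assumes that the result object can
be passed to the \verb+qqplot()+ method provided for association test
results; the object names \verb+model.b0+ and \verb+res.b0+ are freely
chosen:
\begin{quote}
\begin{verbatim}
model.b0 <- nullModel(y ~ ., pheno.b, adj="none")
res.b0 <- assocTest(vcfFile, model.b0, windows)
qqplot(res.b0)   # p-values should follow the diagonal
\end{verbatim}
\end{quote}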
To use both corrections regardless of the number of samples, both
\verb+nullModel()+ and \verb+assocTest()+ must be called with
\verb+adj="force"+. To use the small sample correction, but without
correction for higher moments, call \verb+nullModel()+ with \verb+adj="none"+
(which turns off the creation of sampled residuals), but call
\verb+assocTest()+ with \verb+adj="force"+.

\section{Future Extensions}\label{sec:ext}

We plan or at least consider the following extensions in future versions of
this package:
\begin{itemize}
\item Option in the VCF reader that allows for splitting up multiple minor
alleles into separate variants. Currently, multiple minor alleles are
considered together as if they were synonymous.
%%\item
\end{itemize}

\section{How to Cite This Package}\label{sec:cite}

If you use this package for research that is published later, you are kindly
asked to cite it as follows:
\begin{quotation}
\noindent U.\ Bodenhofer (\PODKATYear). {\em PODKAT: an R package for
association testing involving rare and private variants.} R package version
\PODKATVer.
\end{quotation}

%\bibliographystyle{plain}
%\bibliography{Bioinformatics,BioStatistics,Biology,BodenhoferPub,Statistics,MachineLearningClassical}

\begin{thebibliography}{10}

\bibitem{BelancheVazquezVazquez08}
L.~Belanche, J.~L. V\'azquez, and M.~V\'azquez.
\newblock Distance-based kernels for real-valued data.
\newblock In C.~Preisach, H.~Burkhardt, L.~Schmidt-Thieme, and R.~Decker,
editors, {\em Data Analysis, Machine Learning and Applications}, Studies in
Classification, Data Analysis, and Knowledge Organization, pages 3--10.
Springer, Berlin, 2008.

\bibitem{BenjaminiHochberg95}
Y.~Benjamini and Y.~Hochberg.
\newblock Controlling the false discovery rate: a practical and powerful
approach to multiple testing.
\newblock {\em J. Roy. Statist. Soc. Ser. B}, 57(1):289--300, 1995.

\bibitem{BodenhoferSchwarzbauerIonescuHochreiter09}
U.~Bodenhofer, K.~Schwarzbauer, M.~Ionescu, and S.~Hochreiter.
\newblock Modeling position specificity in sequence kernels by fuzzy
equivalence relations.
\newblock In J.~P. Carvalho, D.~Dubois, U.~Kaymak, and J.~M.~C. Sousa,
editors, {\em Proc. Joint 13th IFSA World Congress and 6th EUSFLAT
Conference}, pages 1376--1381, Lisbon, July 2009.

\bibitem{DanecekAutonAbecasisAlbersBanksDePristoHandsaker11}
P.~Danecek, A.~Auton, G.~Abecasis, C.~A. Albers, E.~Banks, M.~A. DePristo,
R.~E. Handsaker, G.~Lunter, G.~T. Marth, S.~T. Sherry, G.~McVean, R.~Durbin,
and {1000 Genomes Project Analysis Group}.
\newblock The variant call format and {VCFtools}.
\newblock {\em Bioinformatics}, 27(15):2156--2158, 2011.

\bibitem{Davies80}
R.~B. Davies.
\newblock The distribution of a linear combination of $\chi^2$ random
variables.
\newblock {\em J. R. Stat. Soc. Ser. C-Appl. Stat.}, 29:323--333, 1980.

\bibitem{Holm79}
S.~Holm.
\newblock A simple sequentially rejective multiple test procedure.
\newblock {\em Scand. J. Stat.}, 6(2):65--70, 1979.

\bibitem{KentSugnetFureyRoskinPringleZahlerHaussler02}
W.~J. Kent, C.~W. Sugnet, T.~S. Furey, K.~M. Roskin, T.~H. Pringle, A.~M.
Zahler, and D.~Haussler.
\newblock The human genome browser at {UCSC}.
\newblock {\em Genome Res.}, 12:996--1006, 2002.

\bibitem{LeeEmondBamshadBarnesRiederNickerson12}
S.~Lee, M.~J. Emond, M.~J. Bamshad, K.~C. Barnes, M.~J. Rieder, D.~A.
Nickerson, {NHLBI GO Exome Sequencing Project---ESP Lung Project Team}, D.~C.
Christiani, M.~M. Wurfel, and X.~Lin.
\newblock Optimal unified approach for rare-variant association testing with
application to small-sample case-control whole-exome sequencing studies.
\newblock {\em Am. J. Hum. Genet.}, 91(2):224--237, 2012.

\bibitem{LiHandsakerWysokerFennellRuanHomerMarth09}
H.~Li, B.~Handsaker, A.~Wysoker, T.~Fennell, J.~Ruan, N.~Homer, G.~Marth,
G.~Abecasis, R.~Durbin, and {1000 Genome Project Data Processing Subgroup}.
\newblock The sequence alignment/map format and {SAMtools}.
\newblock {\em Bioinformatics}, 25(16):2078--2079, 2009.

\bibitem{Lin97}
X.~Lin.
\newblock Variance component testing in generalised linear models with random
effects.
\newblock {\em Biometrika}, 84(2):309--326, 1997.

\bibitem{LiuGhoshLin08}
D.~Liu, D.~Ghosh, and X.~Lin.
\newblock Estimation and testing for the effect of a genetic pathway on a
disease outcome using logistic kernel machine regression via logistic mixed
models.
\newblock {\em BMC Bioinformatics}, 9:292, 2008.

\bibitem{LiuTangZhang09}
H.~Liu, Y.~Tang, and H.~Zhang.
\newblock A new chi-square approximation to the distribution of non-negative
definite quadratic forms in non-central normal variables.
\newblock {\em Comput. Stat. Data Anal.}, 53:853--856, 2009.

\bibitem{MadsenBrowning09}
B.~E. Madsen and S.~R. Browning.
\newblock A groupwise association test for rare mutations using a weighted
sum statistic.
\newblock {\em PLoS Genetics}, 5(2):e1000384, 2009.

\bibitem{MorganPagesObenchainHayen2015}
M.~Morgan, H.~Pag\`es, V.~Obenchain, and N.~Hayden.
\newblock {\em {Rsamtools}: Binary alignment ({BAM}), {FASTA}, variant call
({BCF}), and tabix file import}, 2015.
\newblock R package version 1.19.47.

\bibitem{ObenchainLawrenceCareyGogartenShannonMorgan14}
V.~Obenchain, M.~Lawrence, V.~Carey, S.~Gogarten, P.~Shannon, and M.~Morgan.
\newblock {VariantAnnotation}: a {Bioconductor} package for exploration and
annotation of genetic variants.
\newblock {\em Bioinformatics}, 30(14):2076--2078, 2014.

\bibitem{WuLeeCaiLiBoehnkeLin11}
M.~C. Wu, S.~Lee, T.~Cai, Y.~Li, M.~Boehnke, and X.~Lin.
\newblock Rare-variant association testing for sequencing data with the
sequence kernel association test.
\newblock {\em Am. J. Hum. Genet.}, 89(1):82--93, 2011.

\end{thebibliography}

\end{document}