\documentclass[10pt]{article} %\VignetteEngine{knitr::knitr} %\VignetteIndexEntry{gcat Package} \usepackage{tikz} \usepackage{float} \usepackage{gensymb} \usepackage{times} \usepackage{graphicx} \usepackage{url} \usepackage{fullpage} \usepackage{float} \usepackage{amsmath} \usepackage{amsthm} \usepackage{bm} \usepackage{bbm} \usepackage{enumerate} \usepackage[font=small,format=plain,labelfont=bf,up,textfont=up]{caption} \usepackage{hyperref} \usepackage{subfig} \title{Genotype Conditional Association Test Vignette} \author{Wei Hao, Minsun Song, John D. Storey} \date{\today} \begin{document} \maketitle \section{Introduction} The \texttt{gcatest} package provides an implementation of the Genotype Conditional Association Test (GCAT)~\cite{Song2015}. GCAT is a test for genetic association that is powered by Logistic Factor Analysis (LFA)~\cite{Hao2013}. LFA is a method of modeling population structure in a genome wide association study. GCAT performs a test for association between each SNP and a trait (either quantitative or binary). We have shown that GCAT is robust to confounding from population structure. \section{Sample usage} We include a sample dataset with the package. \texttt{sim\_geno} is a simulated genotype matrix. \texttt{sim\_trait} is a simulated trait. There are 10,000 SNPs and 1,000 individuals. The first five SNPs are associated with the trait. This simulations were done under the Pritchard-Stephens-Donnelly model with $K=3$, with Dirichlet parameter $\alpha=0.1$ and variance allotment in the trait corresponding to $5\%$ genetic, $5\%$ environmental, and $90\%$ noise. This dataset is simulated under identical parameters as the PSD simulation in Figure 2 of the paper~\cite{Song2015}, except that we have adjusted the size of the simulation to be appropriate for a small demo. <>= library(lfa) library(gcatest) dim(sim_geno) length(sim_trait) @ \subsection{\texttt{gcat}} The first step of \texttt{gcat} is to estimate the logistic factors: <>= LF = lfa(sim_geno, 3) dim(LF) @ Then, we call the \texttt{gcat} function, which returns a vector of p-values: <>= gcat_p = gcat(sim_geno, LF, sim_trait) @ We can look at the p-values for the associated SNPs: <>= gcat_p[1:5] @ And also plot the histogram of the unassociated SNPs: <>= library(ggplot2) dat = data.frame(gcat_p[6:10000]) colnames(dat) = "gcat_p" ggplot(dat, aes(gcat_p, ..density..)) + geom_histogram(binwidth=1/20) + theme_bw() @ \section{Data Input} The \texttt{lfa} package provides the function \texttt{read.bed} for parsing PLINK binary genotypes (extension: \texttt{.bed}) into an R object of the format needed for the \texttt{gcat} function. \bibliographystyle{plain} \bibliography{gcatest} \end{document}