%\VignetteIndexEntry{Summary Tables} %\VignetteDepends{GeneticsBase} %\VignetteKeywords{Genetics} %\VignettePackage{GeneticsBase} \documentclass[10pt]{article} \usepackage{url} \usepackage{amsmath} \usepackage{natbib} \usepackage{isorot} \usepackage{fullpage} % standard 1 inch margins \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\textit{#1}}} \newcommand{\code}[1]{\texttt{#1}} \newcommand{\email}[1]{\texttt{#1}} %%% Hyperlinks for ``PDF Latex'' : \ifx\pdfoutput\undefined%%--- usual ``latex'' : %% Stuff w/out hyperref \else%%---------------------- `` pdflatex '' : -- still gives funny errors \RequirePackage{hyperref} %% The following is R's share/texmf/hyperref.cfg : %% Stuff __with__ hyperref : \hypersetup{% %default: hyperindex,% colorlinks,% %default: pagebackref,% linktocpage,% %%plainpages=false,% linkcolor=Green,% citecolor=Blue,% urlcolor=Red,% pdfstartview=Fit,% pdfview={XYZ null null null}% } \RequirePackage{color} \definecolor{Blue}{rgb}{0,0,0.8} \definecolor{Green}{rgb}{0.1,0.75,0.1} \definecolor{Red}{rgb}{0.7,0,0} %% ESS JCGS v2 : %%\hypersetup{backref,colorlinks=true,pagebackref=true,hyperindex=true} %%\hypersetup{backref,colorlinks=false,pagebackref=true,hyperindex=true} \fi \usepackage{Sweave} \begin{document} \title{Tutorial: Using \Rpackage{GeneticsBase}} \author{ Gregory Warnes \\ \email{gregory\_warnes\@urmc.rochester.edu} \\ University of Rochester \\ \\ Ross Lazarus \\ \email{ross.lazarus@channing.harvard.edu}\\ Channing Laboratory } \maketitle %% %% Set up some defaults %% <>= options(width=90) @ \section{Introduction} This vignette was created as a tutorial for the 2007 BioConductor User's Conference held in Seattle, WA, USA during August 2007, and was presented by Dr. Warnes and Dr. Lazarus. The material is structured as a tutorial with a small example data set (8184 Markers x 180 Subjects belonging to 50 Families) . \section{Outline} \begin{enumerate} \item Preliminaries \begin{enumerate} \item Install the necessary packages \item Load the libraries \end{enumerate} \item Loading Data \begin{enumerate} \item Read data from files \item Error check the loaded data \end{enumerate} \item Descriptive Statistics \begin{enumerate} \item Summary of allele/genotyope frequency \item HWE test \item Visualize Disequilbrium \end{enumerate} \item Hypothesis Testing \begin{enumerate} \item Armitage test \item Logistic with synthetic covariates \item GLM with synthetic covariates \& outcome \end{enumerate} \item Subsetting Results and Formatting Output \begin{enumerate} \item Construct some output tables for top 100 significant markers \end{enumerate} \item Study planning tools (power, sample size) \end{enumerate} \section{Preliminaries} \subsection{Install RGenetics packages and depenedencies} For MS-Windows it is necessary to manually install dependencies from CRAN <>= install.packages( c("xtable","combinat","gdata","gplots","mvtnorm"), dep=TRUE) @ Now to install the necessary packages: <>= repos <- c("http://www.warnes.net/bioc2007/", "http://cran.fhcrc.org") install.packages(c("GeneticsBase", #"GeneticsQC", "GeneticsDesign", "fbat"), repos=repos, type="source", dep=TRUE) @ \subsection{Load the libraries} <<>>= library(GeneticsBase) #library(GeneticsQC) library(GeneticsDesign) library(fbat) @ \section{Loading Data} \subsection{Read data from files} Load the full data: <<>>= hm.a<-readGenes(gfile="hmCEU_YRI_chr22_ALLfbat.ped", gformat="fbat") print(hm.a) @ We'll also need just the founders later: <<>>= hm.f<-readGenes(gfile="hmCEU_YRI_chr22_Foundersfbat.ped", gformat="fbat") @ For the purpose of speeding execution of examples, we'll also create a smaller subset of 100 markers from the original file. <<>>= hm.a2<-hm.a[1:100,] @ \subsection{Error check the loaded data} Count frequencies of missing genotypes (requires 26 seconds on my MacBook Pro) Number of missing genotypes per subject: <<>>= mG<-missGFreq(hm.a2, founderOnly=TRUE, quiet=FALSE) head(mG$nMissSubjects) @ Column headers: \begin{description} \item{00} missing both alleles \item{0*} 1st allele missing while the 2nd allele is not missing \item{*0} 1st allele is not missing while the 2nd allele is missing \end{description} Number of missing genotypes per marker: <<>>= head(mG$nMissMarkers) @ Plot counts of missing genotypes: <>= par(mfrow=c(2,1)) hist(mG$nMissSubjects[,1], main="", xlab="number of missing genotypes") plot(1:nrow(mG$nMissSubjects),mG$nMissSubjects[,1], xlab="subject ID", ylab="number of missing genotypes", type="h") title("Counts of missing genotypes") par(mfrow=c(1,1)) @ \section{Descriptive Statistics} \begin{enumerate} \item Summary of allele/genotyope frequency \item HWE test \item Visualize Disequilbrium \end{enumerate} Basic data quality checks for markers. Column headings are: \begin{description} \item{ObsHET} observed proportion of heterozygous genotypes per marker \item{PredHET} predicted proportion of heterozygous genotypes per marker \item{HWpval} pvalues of Hardy-Weinberg test per marker \item{pGeno} percentage of non-missing genotypes for markes \item{MAF} minor allele frequencies. missing allele are excluded from calculation \item{Rating} 1 if passes HW test; 0 if failed HW test. \end{description} <<>>= ################ cM<-checkMarkers(hm.a2) head(cM) @ Check Mendelian errors: <<>>= cMend<-checkMendelian(hm.a2, quiet = FALSE) @ Number of Mendelian errors per marker: <<>>= head(cMend$nMerrMarker) @ Number of Mendelian errors per family: <<>>= head(cMend$nMerrFamily) @ Plot counts of Mendelian errors <>= par(mfrow=c(2,1)) hist(cMend$nMerrMarker, main="", xlab="number of Mendelian errors") plot(1:length(cMend$nMerrFamily),cMend$nMerrFamily, xlab="family ID", ylab="number of Mendian Errors") par(mfrow=c(1,1)) @ \subsection{Summary of allele/genotype frequency based on GeneticsBase functions} Allele summary, including allele counts, allele frequencies, 95\% CI of allele frequencies: <<>>= t1<-alleleSummary(hm.a2[1:10]) t1 @ Same for genotypes: <<>>= t2<-genotypeSummary(hm.a2[1:10], founderOnly=TRUE) t2 @ HWE test: <<>>= hwe<-HWE(hm.a2[1:10]) hwe @ Defatult graphical display of LD: <>= ld.small<-LD(hm.a2[1:20]) plot(ld.small) @ Alternative graphical display of LD: <>= ld <- LD(hm.a2) LDView(ld@"X^2") @ \section{Hypothesis Testing} \subsection{Armitage test} For the following examples, suppose 'A' is the minor allele, and 'a' is the major allele. Armitage test using additive model to code genotype: \begin{tabular}{cc} genotype & coding \\ AA & 2 \\ Aa & 1 \\ aa & 0 \\ \end{tabular} <<>>= res.A<-Armitage(hm.a2, method="A") head(res.A) @ Armitage test using recessive model to code genotype: \begin{tabular}{cc} genotype & coding \\ AA & 1 \\ Aa & 0 \\ aa & 0 \\ \end{tabular} <<>>= res.R<-Armitage(hm.a2, method="R") head(res.R) @ Armitage test using dominant model to code genotype: \begin{tabular}{cc} genotype & coding \\ AA & 1 \\ Aa & 1 \\ aa & 0 \\ \end{tabular} <<>>= res.D<-Armitage(hm.a2, method="D") head(res.D) @ \subsection{Logistic regression} First, we need to construct some synthetic covariates on the founders <<>>= sampleInfo(hm.f)$race <- sampleInfo(hm.f)$affected raceval <- sampleInfo(hm.f)$race-1 sampleInfo(hm.f)$Norm0.0 <- rnorm(nobs(hm.f), mean=0.0*raceval) sampleInfo(hm.f)$Norm0.5 <- rnorm(nobs(hm.f), mean=0.5*raceval) sampleInfo(hm.f)$Norm1.0 <- rnorm(nobs(hm.f), mean=1.0*raceval) sampleInfo(hm.f)$Norm1.5 <- rnorm(nobs(hm.f), mean=1.5*sampleInfo(hm.f)$race) doSample <- function(raceval, mult) { prob <- c( 0.33-raceval*mult, 0.33+(raceval*mult)/2, 0.33+(raceval*mult)/2) factor(sample( x=c("Red","Green","Blue"), size=1, p=prob, rep=T)) } sampleInfo(hm.f)$Cat0.0 <- sapply( raceval, doSample, mult=0.0 ) sampleInfo(hm.f)$Cat0.1 <- sapply( raceval, doSample, mult=0.1 ) sampleInfo(hm.f)$Cat0.2 <- sapply( raceval, doSample, mult=0.2 ) sampleInfo(hm.f)$Cat0.3 <- sapply( raceval, doSample, mult=0.3 ) @ Now, construct a function to fit the regression model and return the parameters and statistics that are of interest. %% %% Put model function here both in executable form and as pure %% verbatim so comments are preserved. %% \begin{verbatim} model <- function( markerName ) { # extract requested genetic marker genotype <- genotypes(hm.f,marker=markerName) # get data frame to use for fitting the model mframe <- model.frame(race ~ sex + Norm0.0 + Norm0.5 + Norm1.0 + Norm1.5 + genotype, data=sampleInfo(hm.f) ) # To test significance of a term, best method is to do anova of # the full model against a submodel omitting the particular term. # This avoids issues with changes in names of factor levels, # presence or absence of covaraiates, etc. result <- try( { fit.with <- glm( race==1 ~ sex + Norm0.0 + Norm0.5 + Norm1.0 + Norm1.5 + as.factor(genotype), data=mframe, family="binomial") fit.without <- glm( race==1 ~ sex + Norm0.0 + Norm0.5 + Norm1.0 + Norm1.5, data=mframe, family="binomial") anova(fit.with, fit.without, test="Chisq")$"P(>|Chi|)"[2] } ) if(class(result)=="try-error") return(NA) # or return(result) to see the error messages else result # full result. Usually we want to specify exactly which # parameters and stats get returned so the format is consistent # across all markers. } \end{verbatim} <>= model <- function( markerName ) { # extract requested genetic marker genotype <- genotypes(hm.f,marker=markerName) # get data frame to use for fitting the model mframe <- model.frame(race ~ sex + Norm0.0 + Norm0.5 + Norm1.0 + Norm1.5 + genotype, data=sampleInfo(hm.f) ) # To test significance of a term, best method is to do anova of # the full model against a submodel omitting the particular term. # This avoids issues with changes in names of factor levels, # presence or absence of covaraiates, etc. result <- try( { fit.with <- glm( race==1 ~ sex + Norm0.0 + Norm0.5 + Norm1.0 + Norm1.5 + as.factor(genotype), data=mframe, family="binomial") fit.without <- glm( race==1 ~ sex + Norm0.0 + Norm0.5 + Norm1.0 + Norm1.5, data=mframe, family="binomial") anova(fit.with, fit.without, test="Chisq")$"P(>|Chi|)"[2] } ) if(class(result)=="try-error") return(NA) # or return(result) to see the error messages else result # full result. Usually we want to specify exactly which # parameters and stats get returned so the format is consistent # across all markers. } @ Fit to a subset of 50, then 100 and use this information to compute the expected run time for all markers: <<>>= t1 <- unix.time( fits <- sapply( markerNames(hm.f)[1:50], model ) )[3] t1 @ (This takes 5.365 seconds on my MacBook Pro, R 2.4.1) <<>>= t2 <- unix.time( fits <- sapply( markerNames(hm.f)[1:100], model ) )[3] t2 @ (This takes 11.878 seconds on my MacBook Pro, R 2.4.1) Estimate total time to complete, in minutes <<>>= t1 + (t2 - t1)/50 * (nmarker(hm.f)-50) / 60 @ (This yields 23.02 minutes on my MacBook Pro, R 2.4.1) Plot the p-values for the first 100 markers <>= fits.sorted <- sort(fits) plot(-log(fits), type="h", col="blue") labels <- c(0.05, 0.01, 0.001, 0.0001, 0.00001, 0.00001) abline(h=-log(labels), lty=2, col="red") mtext(text=as.character(labels), side=4, at=-log(labels), col="red", las=1) title("Per-marker statistical significance") @ \subsection{Family-Based Associaton Test ('fbat')} Do the fbat calculations: <<>>= f<-fbat(hm.a2) @ Show the p-values: <<>>= summaryPvalue(f) @ Look at the fit details for a specific marker: <<>>= viewstat(f, "22_14884399_rs4911642") @ \section{Study planning tools (GeneticsDesign package)} \subsection{Power to detect a low-frequencty allele} Compute the probability of missing an allele with frequency 0.15 when 20 genotypes are sampled: <<>>= gregorius(freq=0.15, N=20) @ Determine what sample size is required to observe all alleles with true frequency 0.15 with probability 0.95 <<>>= gregorius(freq=0.15, missprob=1-0.95) @ \subsection{Power for a genetics study using a quantitative outcome} Calculate power for genetics study using a quantitative outcome: <<>>= GeneticPower.Quantitative.Numeric( N=50, freq=0.1, minh="recessive", alpha=0.05 ) GeneticPower.Quantitative.Factor( N=50, freq=0.1, minh="recessive", alpha=0.05 ) @ For a range of sample sizes: <<>>= power.range <- function( N, ... ) { sapply( N, function(n) GeneticPower.Quantitative.Numeric( N=n, ... ) ) } power.range( N=c(25,50,100,200,500), freq=0.1, minh="recessive", alpha=0.05 ) @ Create a power table: <<>>= fun <- function(p) power.range(freq=p, N=seq(100,1000,by=100), alpha=5e-2, minh='recessive') m <- sapply( X=seq(0.1,0.9, by=0.1), fun ) colnames(m) <- seq(0.1,0.9, by=0.1) rownames(m) <- seq(100,1000,by=100) print(round(m,2)) @ \subsection{Power calculator for genetic linear trend association studies.} The power is for the test that disease is associated with a marker, given high risk allele frequency ('A'), disease prevalence, genotype relative risk ('Aa'), genotype relative risk ('AA'), LD measure ($D'$ or $R^2$), marker allele frequency ('B'), number of cases, control:case ratio, and probability of the Type I error. The linear trend test (Cochran 1954; Armitage 1955) is used. Using $R^2$ as the measuer of LD: <<>>= res1<-GPC(pA=0.05, pD=0.1, RRAa=1.414, RRAA=2, r2=0.9, pB=0.06, nCase=500, ratio=1, alpha=0.05, quiet=FALSE) @ Using $D'$ as the measure of LD: <<>>= res2<-GPC.default(pA=0.05, pD=0.1, RRAa=1.414, RRAA=2, Dprime=0.9, pB=0.06, nCase=500, ratio=1, alpha=0.05, quiet=FALSE) @ \begin{center} \textbf{ The End. } \end{center} %\clearpage % %\bibliographystyle{biometrics} %\begin{thebibliography}{} % %\item [~\hspace{0.125in}~Warnes~GR.]{ ``The Genetics % Package: Utilities for handling genetic data'' \code{R News}, % Volume 3, Issue 1, June 2003.} % %\item [~\hspace{0.125in}~Warnes~GR.] ``genetics'', a package for % handling marker-based genetic data within the open-source % statistical package ``R''. The package includes function to compute % allele frequencies, use genetic markers in statistical models, % estimate disequilibrium, and test for departure from Hardy-Weinberg % equilibrium. \\ % \verb+http://cran.us.r-project.org/src/contrib/PACKAGES.html#genetics+, % 2002- % %\end{thebibliography} \end{document}