%\VignetteIndexEntry{Summary Tables} %\VignetteDepends{GeneticsBase} %\VignetteKeywords{Genetics} %\VignettePackage{GeneticsBase} \documentclass[10pt]{article} \usepackage{url} \usepackage{amsmath} \usepackage{natbib} \usepackage{isorot} \usepackage{fullpage} % standard 1 inch margins \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\textit{#1}}} \newcommand{\code}[1]{\texttt{#1}} \newcommand{\email}[1]{\texttt{#1}} %%% Hyperlinks for ``PDF Latex'' : \ifx\pdfoutput\undefined%%--- usual ``latex'' : %% Stuff w/out hyperref \else%%---------------------- `` pdflatex '' : -- still gives funny errors \RequirePackage{hyperref} %% The following is R's share/texmf/hyperref.cfg : %% Stuff __with__ hyperref : \hypersetup{% %default: hyperindex,% colorlinks,% %default: pagebackref,% linktocpage,% %%plainpages=false,% linkcolor=Green,% citecolor=Blue,% urlcolor=Red,% pdfstartview=Fit,% pdfview={XYZ null null null}% } \RequirePackage{color} \definecolor{Blue}{rgb}{0,0,0.8} \definecolor{Green}{rgb}{0.1,0.75,0.1} \definecolor{Red}{rgb}{0.7,0,0} %% ESS JCGS v2 : %%\hypersetup{backref,colorlinks=true,pagebackref=true,hyperindex=true} %%\hypersetup{backref,colorlinks=false,pagebackref=true,hyperindex=true} \fi \usepackage{Sweave} \begin{document} \title{Generating Marker Summary Reports Using the \Rpackage{GeneticsBase}} \author{ Gregory Warnes\\ \email{gregory\_warnes\@urmc.rochester.edu}, \\ Nitin Jain\\ \email{nitin.jain\@pfizer.com} } \maketitle \section{Introduction} This document demonstrates how to use the \Rpackage{GeneticsBase} (version 0.6.0) package to generate marker summary tables \emph{for studies with a small number of markers}. It is written as a step-by-step tutorial. For additional details on each of the R functions utilized, please see the individual help pages \textbf{Note: The textual displays described here are not suitable for large numbers of markers. They are intended for reviewing detailed information on a small number of markers, such as those in candidate gene studies, or a small set of markers achieving a 'quality' or 'significance' cutoff from a larger set.} \section{Example} \subsection{Prepare phenotype data \label{data}} The first step is to prepare the phenotype data. It may be in the form of a SAS dataset, SAS export file, comma-delimited text file (CSV), tab-delimited text file (TSV), or Microsoft Excel spreadsheet file (XLS). It should have one row per observation and one column per variable, and must contain a subject identifier variable that can be used to match observations with the corresponding genotype data. \subsection{Prepare genotype data \label{data}} You also need to store the genetic call data in a file that can be read into R. GeneticsBase packge accepts genotype data in a variety of formats: \begin{itemize} \item standard pedigree (ped) format. \begin{verbatim} a2m apoe 50103 5010004 5090005 5090004 2 2 1 2 3 4 50103 5010005 5090005 5090004 2 2 1 1 3 4 50105 5010049 5090021 5090022 2 2 1 1 4 4 50105 5010070 5090020 5090019 1 2 1 1 3 4 \end{verbatim} \item hapmap format : The hapmap .ped format is a variant of the standard pedigree format. A portion of the first two lines of the hapmap file for chrmosome 1 are shown below: \begin{verbatim} rs2298011 rs1320571 rs11721 rs4018608 rs6685064 rs604618 .... 1347 14 0 0 1 1 1 1 3 3 3 3 1 1 2 2 3 3 3 3 2 2 3 3 2 ..... \end{verbatim} \item Pfizer format: First few lines of an example file in Pfizer's data format are shown below: \begin{verbatim} Locus,Gene Marker,Locus Start,Project,Protocol,Sample ID,Donor ID,Genotype A1A,C1556G,-1243,P234,103,1028022.1,1028022,G/G A1B,T127A,20141,P234,103,1028022.1,1028022,A/T A1B,T5094A,102358,P234,103,1028022.1,1028022,A/T A1A,C1556G,-1243,P234,103,1035130.1,1035130,G/G \end{verbatim} \item Perlegen format: A portion of first two lines of data in the Perlegen format are shown below: \begin{verbatim} snp_id genotype 753527 rrrrrrrrrrrrrrrrhrrrrrrrrrrrrrrrrrrrrrrrrrr...... 752848 hhahaararhhhahrhhhahaharhaahhahrhhahhahnhha...... \end{verbatim} \end{itemize} \subsection{Load the genotype and phenotype data} In GeneticsBase, both genotype and phentype data is loaded by a single function \emph{readgenes}. This function \emph{readGenes} has four primary arguments: \emph{gfile}, \emph{gfromat}, \emph{file}, and \emph{pformat}. Arguments \emph{gfile} and \emph{pfile} are the names of files containing the genotype and phenotype data (respectively), and arguments \emph{gformat} and \emph{pformat} are the corresponding file formats for the genotype and phenotype data. Various types of data can be loaded by readGenes function. Example files are provided in \code{data} subdirectory of the \code{GeneticsBase} package. To access these execute <<1,eval=FALSE,print=TRUE>>= library(GeneticsBase) setwd(file.path(.path.package("GeneticsBase"), "data")) @ \noindent% Supported file formats include: \begin{itemize} \item fbat .ped format The Alzheimer's example dataset is stored in the Fbat variant of the .ped Pedigree Format. As it does not include phenotype data, we only use the genotyep filename and file type arguments: <<2,echo=TRUE,eval=FALSE>>= ALZH <- readGenes(gfile="ALZH.ped", gformat="fbat.ped") @ The CAMP example dataset is from from the `Childhood Asthma Management Program (CAMP)' and incudes both genotype and phenotype information. It can be loaded by: <<3,echo=TRUE,eval=FALSE>>= CAMP <- readGenes(gfile="CAMP.ped", gformat="fbat.ped", pfile="CAMPZ.phe", pformat="fbat.phe") @ \item HAPMAP .ped format A subset of the data for the International HapMap project is available in the hapmap example data set. This file can be loaded via: <<4,echo=TRUE,eval=FALSE>>= hapmapchr1 <- readGenes(gfile="hapmapchr1.ped", gformat="hapmap") @ \item Pfizer genetics data format <<5,echo=TRUE,eval=FALSE>>= PfizerExample <- readGenes.pfizer("PfizerExample.txt", format="Listing") @ \item Perlegen data format <<6,echo=TRUE,eval=FALSE>>= PerlegenExample <- readGenes("PerlegenExample.txt", gformat="perlegen") @ \end{itemize} \subsection{Reviewing the data} For the purpose of this example, we are going to use CAMP data set, which can be loaded manually as shown in the previous sub-section, or via <<7,echo=TRUE,eval=TRUE>>= library(GeneticsBase) data(CAMP) @ Now you can see a brief summary of the data that was loaded by simply entering the name of the object on a line by itself: %% Invisible in text <<8,echo=FALSE, eval=TRUE>>= upper <- function(x) { tmp <- capture.output(print(x)) cat(paste(head(tmp, n=15), collapse="\n")) cat("\n") cat("\n...\n") } @ <<9,echo=TRUE>>= CAMP @ \begin{verbatim} Warning messages: 1: geneSet Object has 121 observations. Only first and last 6 displayed in: .local(object, ...) \end{verbatim} The phenotype data can be extracted from the CAMP data object using the \code{sampleInfo} command: <<10,echo=TRUE,eval=FALSE>>= pdata <- sampleInfo(CAMP) summary(pdata) @ <<11,echo=FALSE,eval=TRUE>>= pdata <- sampleInfo(CAMP) upper(summary(pdata)) @ \subsection{Generate the tables} We can generate a variety of summary tables on our genetics data. \begin{itemize} \item Allele information <<12,echo=TRUE,eval=FALSE,fig=FALSE>>= alleleSummary(CAMP) @ <<13,echo=FALSE,eval=TRUE>>= upper(alleleSummary(CAMP)) @ \item Genotype information <<14,echo=TRUE,eval=FALSE>>= genotypeSummary(CAMP) @ <<15,echo=FALSE,eval=TRUE>>= upper(genotypeSummary(CAMP)) @ \item Marker information <<16,echo=FALSE,eval=TRUE>>= # get descriptive statistics for markers #desMarkers(CAMP, founderOnly=T) @ \item Linkage disequilibrium, text view <<17,echo=TRUE,eval=FALSE>>= ld <- LD(CAMP) ld @ \begin{small} <<18,echo=FALSE,eval=TRUE>>= ld <- LD(CAMP) upper(ld) @ \end{small} \item Linkage disequilibrium, matrix plot <<19,echo=TRUE,fig=TRUE>>= plot(ld) @ \clearpage \item Linkage disequilibrium, graphical view using LDView <<20,echo=TRUE,fig=TRUE>>= # visualize r^2 matrix r2<-t(ld@"R^2") diag(r2)<-1 LDView(r2, labRow=markerNames(CAMP)) @ \end{itemize} \clearpage \section{Generating tables for inclusion in reports} To make it simple to include the summary tables in written reports, they can be written to files in a variety of formats, including plain text, html, and LaTeX. \subsection{Plain text files} <<21>>= aS <- alleleSummary(CAMP) txt(aS, file="CAMP_alleleSummary.txt") @ \subsection{LaTex files} <<22,echo=TRUE,eval=FALSE>>= aS <- alleleSummary(CAMP) latex(aS) @ <<23,echo=FALSE,eval=TRUE,results=tex>>= aS <- alleleSummary(CAMP) latex(aS, floating=FALSE) @ \clearpage %\begin{sideways} %\begin{minipage}{\textheight} \begin{landscape} <<24,echo=TRUE,eval=FALSE>>= gs <- genotypeSummary(CAMP[-2,]) latex(gS) # drop ALR2 SSLP so table fits on page @ \begin{small} <<25,echo=FALSE,eval=TRUE,results=tex>>= gS <- genotypeSummary(CAMP[-2,]) latex(gS, floating=FALSE) # drop ALR2 SSLP so table fits on page @ \end{small} \end{landscape} %\end{minipage} %\end{sideways} \clearpage <<26,echo=TRUE,eval=FALSE>>= ld <- LD(CAMP) latex(ld) @ \begin{tiny} <<27,echo=FALSE,eval=TRUE,results=tex>>= ld <- LD(CAMP) latex(ld, floating=FALSE) @ \end{tiny} \clearpage <<28,echo=TRUE,eval=TRUE,fig=TRUE>>= plot(ld) @ \subsection{HTML files} \subsection{Graphics files} As usual, plots can be generated in any format R supports. We can also output everything all at once to a set of files, encoded as plain text (format="print"), html (format="html"), or LaTeX (format="latex"): <<29,echo=TRUE,eval=TRUE>>= PGtables(CAMP, filename="CAMP", sep="_", format="html") @ which creates a set of html and a PDF files in the current directory. \begin{figure}[!ht] \begin{center} \caption{HTML allele summary table} \vspace{1em} \includegraphics[height=0.9\textheight]{CAMP_alleleSummary_html.pdf} \end{center} \end{figure} \begin{figure}[!ht] \begin{center} \caption{HTML genotype summary table} \vspace{1em} \includegraphics[height=0.9\textheight]{CAMP_genotypeSummary_html.pdf} \end{center} \end{figure} \begin{figure}[!ht] \begin{center} \caption{HTML linkage disequilibrium table} \vspace{1em} \includegraphics[height=0.9\textheight]{CAMP_LD_html.pdf} \end{center} \end{figure} \begin{figure}[!ht] \begin{center} \caption{Linkage disequilibrium plot} \vspace{1em} \includegraphics[height=0.9\textheight]{CAMP_LD.pdf} \end{center} \end{figure} \clearpage \section{Subsetting by Group} The \code{alleleSummary} and \code{genotypeSummary} functions also allow you to create tables which show the summary information separated out by a grouping variable, which must be discrete ``factor'' variables (such as Sex). To accomplish this, add the argument \code{by=Sex} to the function call. For example: <<30,echo=TRUE,eval=TRUE>>= alleleSummary(CAMP, by='sex') @ This will display a table within a separate block within each marker for each level of the variable \code{Sex}. To control whether the summary table for entire data in addition to individual factor levels, add \code{includeOverall=TRUE} or \code{includeOverall=FALSE} (the default) as appropriate. \appendix \section{Example R script\label{script}} <<31,echo=TRUE,eval=FALSE>>= library(GeneticsBase) data(CAMP) ## HTML Output PGtables(CAMP, filename="test", format="html") ## LaTex Output PGtables(CAMP, filename="test", format="latex") @ \clearpage \bibliographystyle{biometrics} \begin{thebibliography}{} \item [~\hspace{0.125in}~Warnes~GR.]{ ``The Genetics Package: Utilities for handling genetic data'' \code{R News}, Volume 3, Issue 1, June 2003.} \item [~\hspace{0.125in}~Warnes~GR.] ``genetics'', a package for handling marker-based genetic data within the open-source statistical package ``R''. The package includes function to compute allele frequencies, use genetic markers in statistical models, estimate disequilibrium, and test for departure from Hardy-Weinberg equilibrium. \\ \verb+http://cran.us.r-project.org/src/contrib/PACKAGES.html#genetics+, 2002- \end{thebibliography} \end{document}