% % NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. % %\VignetteIndexEntry{GGtools 2012: efficient tools for eQTL discovery} %\VignetteDepends{GGdata} %\VignetteKeywords{genetics of gene expression} %\VignettePackage{GGtools} \documentclass[12pt]{article} \usepackage{auto-pst-pdf} \usepackage{amsmath,pstricks} \usepackage[authoryear,round]{natbib} \usepackage{hyperref} \textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \textwidth=6.2in \bibliographystyle{plainnat} \begin{document} %\setkeys{Gin}{width=0.55\textwidth} \title{Using \textit{GGtools} for eQTL discovery and interpretation} \author{VJ Carey \texttt{stvjc at channing.harvard.edu}} \maketitle \tableofcontents \section{Overview and installation} This document addresses data structure and analytic workflow for investigations of genetic sources of expression variation. Key background references are \citet{Williams:2007p21} for general biologic overview, \citet{Cheung:2005p446} and \citet{Stranger:2007p114} for key applications, and \citet{Stegle:2010p2015}, \citet{Petretto:2010p2678}, and \citet{Leek:2010p1819} for various methodological issues. \citet{Majewski:2011p3139} reviews potentials of eQTL investigations with expression measures based on RNA sequencing. This document is constructed using R version \Sexpr{version$major}.\Sexpr{version$minor}. See the session information at the end of the document for full details. Using a comparable version of R, you can obtain the software needed for the production of this document using \begin{verbatim} source("http://www.bioconductor.org/biocLite.R") biocLite("GGtools", dependencies=TRUE) \end{verbatim} \section{Data structures} \subsection{Reference data supplied with Bioconductor} A collection of 30 trios of central European ancestry was genotyped for 4 million SNP loci in HapMap phase II. Immortalized B-cell lines were assayed for gene expression using Illumina's HumanWG6v1 bead array. Digital data on expression and genotype for the 90 CEU individuals is distributed in Bioconductor package \textit{GGdata}; the expression data were retrieved from the GENEVAR website of Wellcome Trust, e.g., \begin{verbatim} ftp://ftp.sanger.ac.uk/pub/genevar/CEU_parents_norm_march2007.zip \end{verbatim} and the genotype data were obtained directly from hapmap.org at build 36: \begin{verbatim} ftp://ftp.ncbi.nlm.nih.gov/hapmap/genotypes/2008-03/forward/non-redundant/ \end{verbatim} The data in GGdata are likely derived from r23, while r23a is now distributed. Some effort at updating genotypes may be supplied in the future. Acquire the genome-wide expression data and the genotype data for chromosome 20 as follows: <>= suppressPackageStartupMessages(library(GGtools)) library(parallel) <>= g20 = getSS("GGdata", "20") @ <>= g20 class(g20) @ The \texttt{smlSet} class was designed in 2006 as an experiment in unifying high-throughput expression and genotype data. A key resource was the \textit{snpMatrix} (now \textit{snpStats}) package of David Clayton, which defined an 8-bit representation of genotype calls (including uncertain calls obtained by statistical imputation), import and coercion infrastructure allowing use of the 8-bit representation with popular genetic data formats (pedfiles, mach and beagle outputs, etc.), and statistical testing infrastructure employing this representation. The expression and sample-level data are handled just as with familiar \texttt{ExpressionSet} instances: <>= exprs(g20)[1:5,1:5] pData(g20)[1:4,] @ The genotype data are held in a list with elements intended to represent chromosomes, and the list is stored in an environment to reduce copying efforts. <>= smList(g20) as(smList(g20)[[1]][1:5,1:5], "matrix") @ The leading zeroes in the display above indicate that raw bytes are used to represent the genotypes per sample (rows) per SNP (columns). Coercions: <>= as(smList(g20)[[1]][1:5,1:5], "numeric") as(smList(g20)[[1]][1:5,1:5], "character") @ Any number of chromosomes can be held in the genotype list component, but the design allows working with only one chromosome at a time and examples emphasize this approach. Amalgamation of results across chromosomes is generally straightforward. \subsection{Working with your own data} The \verb+make_smlSet+ function can be used to bind a list of suitably named \texttt{SnpMatrix} instances with an \texttt{ExpressionSet} instance to create a single \texttt{smlSet} instance covering multiple chromosomes. The \texttt{externalize} function can be applied to such an \texttt{smlSet} instance to create a new \textit{package} which can be installed for use with \texttt{getSS}. This is the preferred way of managing work with large genotyping panels (such as the 10 million locus panel achievable with ``thousand genomes imputation''). Briefly, \texttt{externalize} arranges a DESCRIPTION file and system of folders that can be installed as an R package. The expression data are stored as object \texttt{ex} in data/eset.rda, and the \texttt{SnpMatrix} instances are stored separately as .rda files in inst/parts. The \texttt{getSS} function will create \texttt{smlSet} instances on the fly from the externalize-generated package. \subsection{Filters and permutation support} Here ``filter'' is used to refer to any function that processes an \texttt{smlSet} instance and returns an \texttt{smlSet} instance after altering the contents, which may involve eliminating probes or SNPs, performing numerical transformations to expression or genotype measures, transforming sample data or eliminating samples. \subsubsection{Bracket operations} Coordinated manipulations of genotype, expression, and phenotype information on samples can be accomplished with the second subscript to the bracket operator. Thus \texttt{g20[,1:5]} is the restriction of \texttt{g20} to the first five samples. Sample names may also be used for such manipulations. Reduction of the expression component can be accomplished with the first subscript to the bracket operator. Thus \texttt{g20[1:5,]} is the restriction of \texttt{g20} to five expression probes; all other data are unaltered. If it is desired to use feature names for such manipulations, the character vector of feature names must be cast to class \texttt{probeId} with the \texttt{probeId()} method. At present no such operations can be used to alter the genotype data contents. \subsubsection{Large scale filters} It is known that non-specific filtering (removal of probes with low variation across samples, without regard to sample phenotype or class information) can increase sensitivity and specificity of certain differential expression test procedures \citep{Bourgon:2010p1763}. The \texttt{nsFilter} function of the \textit{genefilter} package has been adapted to work with \texttt{smlSet} instances. SNPs can be filtered out of the \texttt{smlSet} instance on the basis of observed minor allele or genotype frequencies using \texttt{MAFfilter} and \texttt{GTFfilter} respectively. Various approaches to reduction of ``expression heterogeneity'' can be examined. \texttt{clipPCs(x, vec)} will form the singular value decomposition of the expression matrix and remove principal components enumerated in \texttt{vec} by reassembling the expression matrix after setting eigenvalues in \texttt{vec} to zero. It is also possible to employ any computed quantities such as principal components or surrogate variables identified in SVA \citep{Leek:2007p1723} as covariates in the formula element of analysis functions in \texttt{GGtools}, but note that simple permutations do not lead to valid permutation tests of SNP effects in the presence of covariates (see \citep{Buzkova:2011p3368} who focus on interaction, but describe the problem for main effects models, with references, early in the paper.) The introduction of novel approaches to expression transformation can be accomplished using code similar to the following, illustrating of the use of PEER \citep{Stegle:2010p2015}: <>= library(peer) model = PEER() PEER_setPhenoMean(model, t(exprs(g20))) PEER_setNk(model, 10) PEER_setCovariates(model, matrix(1*g20$male,nc=1)) PEER_update(model) resid=t(PEER_getResiduals(model)) rownames(resid) = featureNames(g20) colnames(resid) = sampleNames(g20) g20peer10 = g20 g20peer10@assayData = assayDataNew("lockedEnvironment", exprs=resid) @ At this point, \texttt{g20peer10} holds expression data with 10 latent factors removed after adjustment for gender. \subsubsection{Permutation of expression against genotype} Because the \textit{snpStats} testing procedures defensively match predictor to response variable orderings using sample labels, special steps must be taken to ensure that tests use properly permuted responses. The \texttt{permEx} function takes care of this, using the current state of the random number generator. \subsection{Post-analysis data structures} While it is possible to construe the results of an eQTL search as a static report, it is more productive to conceptualize the result as a data object for further analysis. Unfortunately, the number of tests to be managed can be very large -- at least hundreds of millions, and these must be joinable with location metadata to be maximally useful. Several data structures for managing post-analysis results have emerged as this package has matured. Of particular concern are those that use \textit{ff} out-of-memory archiving for test statistic values or effect estimates and those that use the \texttt{GRanges} infrastructure to facilitate efficient query resolution in genomic coordinates. These will be described along with the related analytic workflow steps. \section{Focused analyses} A specific gene can be checked for eQTL on a given chromosome or set of chromosomes with \texttt{gwSnpTests}. There are various convenience facilities. In the call to \texttt{gwSnpTests} below, a gene symbol is used to pick out an expression element, and adjustment for gender is commodated in an additive genetic model for effects of B allele copy number on expression of CPNE1. One degree of freedom chi-squared tests are computed. <>= t1 = gwSnpTests(genesym("CPNE1")~male, g20, chrnum("20")) t1 topSnps(t1) @ It is possible to compute tests for this specific gene for association with SNP across several chromosomes if desired; change the value of the third argument to a vector. There are a few approaches to visualization of the results that are relevant, but complications arise in relation to choice of genomic coordinates. <>= pdf(file="t1.pdf") plot(t1, "SNPlocs.Hsapiens.dbSNP.20110815") dev.off() pdf(file="t1evg.pdf") plot_EvG(genesym("CPNE1"), rsid("rs17093026"), g20) dev.off() @ <>= plot(t1, "SNPlocs.Hsapiens.dbSNP.20110815") plot_EvG(genesym("CPNE1"), rsid("rs17093026"), g20) @ \setkeys{Gin}{width=0.45\textwidth} \begin{tabular}{cc} \includegraphics{t1} & \includegraphics{t1evg} \\ \end{tabular} Code like the following can be used to display scores ($-\log_{10} p$) on the genome browser, here with hg19 locations. <>= library(SNPlocs.Hsapiens.dbSNP.20110815) S20 = getSNPlocs("ch20", as.GRanges=TRUE) GR20 = makeGRanges(t1, S20) library(rtracklayer) export(GR20, "~/cpne1new.wig") @ With this code, it will be necessary to manually alter the chr assignment in the wig file, and place an informative title to get the following display. \clearpage \begin{center} \setkeys{Gin}{width=0.95\textwidth} \includegraphics{cpne1Brow} \end{center} \clearpage \section{Comprehensive surveys} \subsection{A set of genes vs. all SNP on a chromosome} The performance of \textit{snpStats} \texttt{snp.rhs.tests} is very good and so our principle for large-scale searches is to compute all association statistics, save them in truncated form, and filter results later. This is carried out with the \texttt{eqtlTests} function. To illustrate, the expression data is sharply filtered to the 50 most variable genes on chromosome 20 as measured by cross-sample median absolute deviation, SNP with MAF $<$ 0.05 are removed, and then all SNP-gene association tests are executed. <>= <>= g20 = GGtools:::restrictProbesToChrom(g20, "20") mads = apply(exprs(g20),1,mad) oo = order(mads, decreasing=TRUE) g20 = g20[oo[1:50],] tf = tempfile() dir.create(tf) e1 = eqtlTests(MAFfilter(g20, lower=0.05), ~male, geneApply=mclapply, targdir=tf) e1 @ On a two-core macbook pro, this computation takes well under a minute. The details of the underlying data structure are involved. Briefly, a short integer is used to represent each chi-squared statistic obtained in the \Sexpr{length(nrow(e1@fffile)*ncol(e1@fffile))} tests computed, in an \texttt{ff} archive. Use \texttt{topFeats} to manually harvest this. <>= pm1 = colnames(e1@fffile) tops = sapply(pm1, function(x) topFeats(probeId(x), mgr=e1, n=1)) top6 = sort(tops, decreasing=TRUE)[1:6] @ <>= print(top6) @ R has propagated the names of probes and SNPs with the scores so that a table can be created as follows: <>= nms = strsplit(names(top6), "\\.") gn = sapply(nms,"[",1) sn = sapply(nms,"[",2) tab = data.frame(snp=sn,score=as.numeric(top6)) rownames(tab) = gn tab @ Statistical interpretation of the scores in this table is not clear as the data structure includes familial aggregation in trios and extended pedigrees, and may include population stratification, Nevertheless, consistency of these findings with other published results involving multiple populations can be checked. \textit{GGtools} includes a table published by Stranger and colleagues in 2007 enumerating multipopulation eQTL \citep{Stranger:2007p114}. <>= data(strMultPop) strMultPop[ strMultPop$rsid %in% tab$snp, ] @ Thus the top two SNP in the table computed here are identified as multipopulation eQTL by Stranger. The other association scores are not very strong and likely do not correspond to genuine associations. \subsection{Tabulating best associated cis-rSNP with permutation-based FDR: small example} The workhorse for identifying genes to which can be associated putatively regulating SNP (rSNP) is \texttt{best.cis.eQTLs}. This can be used for genome-wide analysis, but here an alternative table is created for the sharply filtered chromosome 20 data given above. This call says that gene location information will be acquired from the Bioconductor TxDb.Hsapiens annotation package for hg19 UCSC known genes, and that tests for association within 1 Mbase of the coding region for each gene will be considered. The expression data will be permuted against genotype data in two independent draws to assemble the null reference distribution for association scores; these are used to enumerate false significance claims at various magnitudes of the distribution of association scores. The plug-in procedure for estimating FDR XXX cite Hastie Tibs Friedman is used. <>= if (file.exists("db2")) unlink("db2", recursive=TRUE) fn = probeId(featureNames(g20)) exTx = function(x) MAFfilter( x[fn, ], lower=0.05) b1 = best.cis.eQTLs("GGdata", ~male, radius=1e6, folderstem="db2", nperm=2, geneApply=mclapply, smFilter= exTx, chrnames="20") <>= b1 @ \subsection{Tabulating all cis-rSNP with association score exceeding a given threshold} The gene-centered analysis described in the previous subsection yields a set of thresholds corresponding to various FDRs. It may be of interest to enumerate all SNP with association scores exceeding some threshold. All.cis.eQTLs can be used for this. <>= args(All.cis.eQTLs) @ The computation can be done \textit{de novo} on the basis of an \texttt{smpack} argument, or can be regarded as a followup on a completed \texttt{best.cis.eQTLs} call. <>= b1.all = All.cis.eQTLs( maxfdr = 0.05, inbestcis = b1, smpack="GGdata", rhs=~1, chrnames = "20", smFilter4all = function(x) MAFfilter(x, lower=0.05)) @ Here a second SNP is identified for the one gene exhibiting an eQTL at FDR $<= 0.05$. \subsection{Removal of empirically identified expression heterogeneity} \citet{Leek:2010p1819} describe implications of batch effects in high-throughput experimental contexts. Various methods for adjustment of responses have been proposed; a very simple but evidently risky approach is nonspecific removal of principal components of variation. To maximize information on possible batch effects, the entire expression matrix is restored and decomposed for principal component removal. <>= if (file.exists("db2")) unlink("db2", recursive=TRUE) g20 = getSS("GGdata", "20") exTx = function(x) MAFfilter( clipPCs(x,1:10)[fn, ], lower=0.05) g20f = exTx(g20) <>= b2 = best.cis.eQTLs("GGdata", ~male, radius=1e6, folderstem="db2", nperm=2, geneApply=mclapply, smFilter= exTx, chrnames="20") b2 @ An improvement in sensitivity is seen after this adjustment. The probes with FDR at 0.001 or lower are identified by the helper function <>= goodProbes = function(x) names(x@scoregr[elementMetadata(x@scoregr)$fdr<0.001]) @ All probes identified as significant (at FDR $\leq 0.001$) before the PCA adjustment are identified as such after it: <>= setdiff(goodProbes(b2), goodProbes(b1)) @ The effects of the adjustment for genes that were significant only in the adjusted analysis can be visualized: \setkeys{Gin}{width=.95\textwidth} <>= newp = setdiff(goodProbes(b2), goodProbes(b1)) np = length(newp) bestSnp = function(pn, esm) elementMetadata(esm@scoregr[pn])$snpid par(mfrow=c(2,2)) plot_EvG(probeId(newp[1]), rsid(bestSnp(newp[1], b2)), g20, main="raw") plot_EvG(probeId(newp[1]), rsid(bestSnp(newp[1], b2)), g20f, main="PC-adjusted") plot_EvG(probeId(newp[np]), rsid(bestSnp(newp[np], b2)), g20, main="raw") plot_EvG(probeId(newp[np]), rsid(bestSnp(newp[np], b2)), g20f, main="PC-adjusted") @ \clearpage \section{Exercises} \begin{enumerate} \item All computations performed above ignore familial structure in the data that can be determined using the \texttt{famid}, \texttt{mothid}, \texttt{fathid} variables in the \texttt{pData(g20)}. Reduce the \texttt{smlSet} instance used for eQTL testing to parents only, who have parent identifiers equal to zero, and recompute the main tables. \item For selected eQTLs that are were significant with low FDR in the full data ignoring, but are not significant in the analysis of the reduced data, use a reasonably specified variance components model on the full data with familial structure and compute a third test statistic. Is the restriction to parents only a good policy for eQTL discovery? Is there evidence of substantial familial aggregation in expression after heterogeneity reduction? \item How can we select in a principled way the number of principal components to be removed for heterogeneity reduction? \end{enumerate} \clearpage \section{Session information} <>= toLatex(sessionInfo()) @ \bibliography{ggtvig} \end{document}