%
% NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is
% likely to be overwritten.
%
%\VignetteIndexEntry{GGtools: software for eQTL identification}
%\VignetteDepends{GGdata, SNPlocs.Hsapiens.dbSNP.20120608}
%\VignetteKeywords{genetics of gene expression}
%\VignettePackage{GGtools}
\documentclass[12pt]{article}
\usepackage{auto-pst-pdf}
\usepackage{amsmath,pstricks}
\usepackage[authoryear,round]{natbib}
\usepackage{hyperref}
\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in
\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\textwidth=6.2in
\bibliographystyle{plainnat}

\begin{document}
%\setkeys{Gin}{width=0.55\textwidth}

\title{Using \textit{GGtools} for eQTL discovery and interpretation}
\author{VJ Carey \texttt{stvjc at channing.harvard.edu}}
\maketitle

\tableofcontents

\section{Overview and installation}

This document addresses data structure and analytic workflow for investigations of genetic sources of expression variation. Key background references are \citet{Williams:2007p21} for a general biologic overview, \citet{Cheung:2005p446} and \citet{Stranger:2007p114} for key applications, and \citet{Stegle:2010p2015}, \citet{Petretto:2010p2678}, and \citet{Leek:2010p1819} for various methodological issues. \citet{Majewski:2011p3139} reviews the potential of eQTL investigations with expression measures based on RNA sequencing.

This document is constructed using R version \Sexpr{version$major}.\Sexpr{version$minor}. See the session information at the end of the document for full details. Using a comparable version of R, you can obtain the software needed for the production of this document using
\begin{verbatim}
source("http://www.bioconductor.org/biocLite.R")
biocLite("GGtools", dependencies=TRUE)
\end{verbatim}

\section{Data structures}

\subsection{Reference data supplied with Bioconductor}

A collection of 30 trios of central European ancestry was genotyped for 4 million SNP loci in HapMap phase II. Immortalized B-cell lines were assayed for gene expression using Illumina's HumanWG6v1 bead array. Digital data on expression and genotype for the 90 CEU individuals are distributed in Bioconductor package \textit{GGdata}; the expression data were retrieved from the GENEVAR website of the Wellcome Trust, e.g.,
\begin{verbatim}
ftp://ftp.sanger.ac.uk/pub/genevar/CEU_parents_norm_march2007.zip
\end{verbatim}
and the genotype data were obtained directly from hapmap.org at build 36:
\begin{verbatim}
ftp://ftp.ncbi.nlm.nih.gov/hapmap/genotypes/2008-03/forward/non-redundant/
\end{verbatim}
The data in GGdata are likely derived from r23, while r23a is now distributed; the genotypes may be updated in a future release.

Acquire the genome-wide expression data and the genotype data for chromosome 20 as follows:
<<>>=
suppressPackageStartupMessages(library(GGtools))
library(parallel)
@
<<>>=
g20 = getSS("GGdata", "20")
@
<<>>=
g20
class(g20)
@
The \texttt{smlSet} class was designed in 2006 as an experiment in unifying high-throughput expression and genotype data.
A key resource was the \textit{snpMatrix} (now \textit{snpStats}) package of David Clayton, which defined an 8-bit representation of genotype calls (including uncertain calls obtained by statistical imputation), import and coercion infrastructure allowing use of the 8-bit representation with popular genetic data formats (pedfiles, mach and beagle outputs, etc.), and statistical testing infrastructure employing this representation.

The expression and sample-level data are handled just as with familiar \texttt{ExpressionSet} instances:
<<>>=
exprs(g20)[1:5,1:5]
pData(g20)[1:4,]
@
The genotype data are held in a list with elements intended to represent chromosomes, and the list is stored in an environment to reduce copying efforts.
<<>>=
smList(g20)
as(smList(g20)[[1]][1:5,1:5], "matrix")
@
The leading zeroes in the display above indicate that raw bytes are used to represent the genotypes per sample (rows) per SNP (columns). Coercions:
<<>>=
as(smList(g20)[[1]][1:5,1:5], "numeric")
as(smList(g20)[[1]][1:5,1:5], "character")
@
Any number of chromosomes can be held in the genotype list component, but the design allows working with only one chromosome at a time and examples emphasize this approach. Amalgamation of results across chromosomes is generally straightforward.

\subsection{Working with your own data}

The \verb+make_smlSet+ function can be used to bind a list of suitably named \texttt{SnpMatrix} instances with an \texttt{ExpressionSet} instance to create a single \texttt{smlSet} instance covering multiple chromosomes.

The \texttt{externalize} function can be applied to such an \texttt{smlSet} instance to create a new \textit{package} which can be installed for use with \texttt{getSS}. This is the preferred way of managing work with large genotyping panels (such as the 10 million locus panel achievable with ``thousand genomes imputation''). Briefly, \texttt{externalize} arranges a DESCRIPTION file and system of folders that can be installed as an R package. The expression data are stored as object \texttt{ex} in data/eset.rda, and the \texttt{SnpMatrix} instances are stored separately as .rda files in inst/parts. The \texttt{getSS} function will create \texttt{smlSet} instances on the fly from the externalize-generated package.

\subsection{Filters and permutation support}

Here ``filter'' is used to refer to any function that processes an \texttt{smlSet} instance and returns an \texttt{smlSet} instance after altering the contents, which may involve eliminating probes or SNPs, performing numerical transformations to expression or genotype measures, transforming sample data or eliminating samples.

\subsubsection{Bracket operations}

Coordinated manipulations of genotype, expression, and phenotype information on samples can be accomplished with the second subscript to the bracket operator. Thus \texttt{g20[,1:5]} is the restriction of \texttt{g20} to the first five samples. Sample names may also be used for such manipulations.

Reduction of the expression component can be accomplished with the first subscript to the bracket operator. Thus \texttt{g20[1:5,]} is the restriction of \texttt{g20} to five expression probes; all other data are unaltered. If it is desired to use feature names for such manipulations, the character vector of feature names must be cast to class \texttt{probeId} with the \texttt{probeId()} method. At present no such operations can be used to alter the genotype data contents.
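As a minimal illustrative sketch of these operations (shown unevaluated; it assumes only the \texttt{g20} object loaded above, and the chunk label is arbitrary), sample and probe restrictions can be combined:
<<bracketSketch,eval=FALSE>>=
# restrict to the first five samples; expression, genotype, and
# sample-level data are subset in a coordinated way
g20small = g20[, 1:5]
# restrict to five expression probes, naming them via probeId();
# the genotype component is left unaltered
g20few = g20[probeId(featureNames(g20)[1:5]), ]
dim(exprs(g20few))
@
Such restrictions are typically wrapped in filter functions (next section) so that they can be supplied to the search routines described later.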
\subsubsection{Large scale filters}

It is known that non-specific filtering (removal of probes with low variation across samples, without regard to sample phenotype or class information) can increase sensitivity and specificity of certain differential expression test procedures \citep{Bourgon:2010p1763}. The \texttt{nsFilter} function of the \textit{genefilter} package has been adapted to work with \texttt{smlSet} instances.

SNPs can be filtered out of the \texttt{smlSet} instance on the basis of observed minor allele or genotype frequencies using \texttt{MAFfilter} and \texttt{GTFfilter} respectively.

Various approaches to reduction of ``expression heterogeneity'' can be examined. \texttt{clipPCs(x, vec)} will form the singular value decomposition of the expression matrix and remove the principal components enumerated in \texttt{vec} by reassembling the expression matrix after setting the corresponding singular values to zero. It is also possible to employ any computed quantities, such as principal components or surrogate variables identified in SVA \citep{Leek:2007p1723}, as covariates in the formula element of analysis functions in \texttt{GGtools}, but note that simple permutations do not lead to valid permutation tests of SNP effects in the presence of covariates (see \citet{Buzkova:2011p3368}, who focus on interaction but describe the problem for main-effects models, with references, early in the paper).

The introduction of novel approaches to expression transformation can be accomplished using code similar to the following, illustrating the use of PEER \citep{Stegle:2010p2015}:
<<>>=
library(peer)
model = PEER()
PEER_setPhenoMean(model, t(exprs(g20)))
PEER_setNk(model, 10)
PEER_setCovariates(model, matrix(1*g20$male, nc=1))
PEER_update(model)
resid = t(PEER_getResiduals(model))
rownames(resid) = featureNames(g20)
colnames(resid) = sampleNames(g20)
g20peer10 = g20
g20peer10@assayData = assayDataNew("lockedEnvironment", exprs=resid)
@
At this point, \texttt{g20peer10} holds expression data with 10 latent factors removed after adjustment for gender.

\subsubsection{Permutation of expression against genotype}

Because the \textit{snpStats} testing procedures defensively match predictor to response variable orderings using sample labels, special steps must be taken to ensure that tests use properly permuted responses. The \texttt{permEx} function takes care of this, using the current state of the random number generator.

\subsection{Post-analysis data structures}

While it is possible to construe the results of an eQTL search as a static report, it is more productive to conceptualize the result as a data object for further analysis. Unfortunately, the number of tests to be managed can be very large -- at least hundreds of millions, and these must be joinable with location metadata to be maximally useful.

Several data structures for managing post-analysis results have emerged as this package has matured. Of particular interest are those that use \textit{ff} out-of-memory archiving for test statistic values or effect estimates and those that use the \texttt{GRanges} infrastructure to facilitate efficient query resolution in genomic coordinates. These will be described along with the related analytic workflow steps.

\section{Focused analyses}

A specific gene can be checked for eQTL on a given chromosome or set of chromosomes with \texttt{gwSnpTests}. There are various convenience facilities.
In the call to \texttt{gwSnpTests} below, a gene symbol is used to pick out an expression element, and adjustment for gender is accommodated in an additive genetic model for effects of B allele copy number on expression of CPNE1. One degree of freedom chi-squared tests are computed.
<<>>=
t1 = gwSnpTests(genesym("CPNE1")~male, g20, chrnum("20"))
t1
topSnps(t1)
@
It is possible to compute tests for this specific gene for association with SNP across several chromosomes if desired; change the value of the third argument to a vector.

There are a few relevant approaches to visualization of the results, but complications arise in relation to the choice of genomic coordinates.
<<>>=
pdf(file="t1.pdf")
plot(t1, snplocsDefault())
dev.off()
pdf(file="t1evg.pdf")
plot_EvG(genesym("CPNE1"), rsid("rs17093026"), g20)
dev.off()
@
<<>>=
plot(t1, snplocsDefault())
plot_EvG(genesym("CPNE1"), rsid("rs17093026"), g20)
@
\setkeys{Gin}{width=0.45\textwidth}
\begin{tabular}{cc}
\includegraphics{t1} & \includegraphics{t1evg} \\
\end{tabular}

Code like the following can be used to display scores ($-\log_{10} p$) on the genome browser, here with hg19 locations.
<<>>=
library(snplocsDefault(), character.only=TRUE)
S20 = getSNPlocs("ch20", as.GRanges=TRUE)
GR20 = makeGRanges(t1, S20)
library(rtracklayer)
export(GR20, "~/cpne1new.wig")
@
With this code, it will be necessary to manually alter the chromosome assignment in the wig file and add an informative title to obtain the following display.

\clearpage
\begin{center}
\setkeys{Gin}{width=0.95\textwidth}
\includegraphics{cpne1Brow}
\end{center}
\clearpage

\section{Comprehensive surveys}

\subsection{A set of genes vs. all SNP on a chromosome}

The performance of \textit{snpStats} \texttt{snp.rhs.tests} is very good, and so our principle for large-scale searches is to compute all association statistics, save them in truncated form, and filter results later. This is carried out with the \texttt{eqtlTests} function. To illustrate, the expression data are sharply filtered to the 50 most variable genes on chromosome 20 as measured by cross-sample median absolute deviation, SNP with MAF $<$ 0.05 are removed, and then all SNP-gene association tests are executed.
<<>>=
g20 = GGtools:::restrictProbesToChrom(g20, "20")
mads = apply(exprs(g20), 1, mad)
oo = order(mads, decreasing=TRUE)
g20 = g20[oo[1:50],]
tf = tempfile()
dir.create(tf)
e1 = eqtlTests(MAFfilter(g20, lower=0.05), ~male,
   geneApply=mclapply, targdir=tf)
e1
@
On a two-core macbook pro, this computation completes in less than a minute. The details of the underlying data structure are involved. Briefly, a short integer is used to represent each chi-squared statistic obtained in the \Sexpr{nrow(e1@fffile)*ncol(e1@fffile)} tests computed, in an \texttt{ff} archive. Use \texttt{topFeats} to manually harvest this.
<<>>=
pm1 = colnames(e1@fffile)
tops = sapply(pm1, function(x) topFeats(probeId(x), mgr=e1, n=1))
top6 = sort(tops, decreasing=TRUE)[1:6]
@
<<>>=
print(top6)
@
R has propagated the names of probes and SNPs with the scores, so that a table can be created as follows:
<<>>=
nms = strsplit(names(top6), "\\.")
gn = sapply(nms, "[", 1)
sn = sapply(nms, "[", 2)
tab = data.frame(snp=sn, score=as.numeric(top6))
rownames(tab) = gn
tab
@
Statistical interpretation of the scores in this table is not clear, as the data structure includes familial aggregation in trios and extended pedigrees, and may include population stratification. Nevertheless, consistency of these findings with other published results involving multiple populations can be checked. \textit{GGtools} includes a table published by Stranger and colleagues in 2007 enumerating multipopulation eQTL \citep{Stranger:2007p114}.
<<>>=
data(strMultPop)
strMultPop[ strMultPop$rsid %in% tab$snp, ]
@
Thus the top two SNP in the table computed here are identified as multipopulation eQTL by Stranger. The other association scores are not very strong and likely do not correspond to genuine associations.

\subsection{Tabulating best associated cis-rSNP with permutation-based FDR: small example}

The workhorse for identifying genes to which putatively regulating SNP (rSNP) can be associated is \texttt{best.cis.eQTLs}. This can be used for genome-wide analysis, but here an alternative table is created for the sharply filtered chromosome 20 data given above. This call says that gene location information will be acquired from the Bioconductor TxDb.Hsapiens annotation package for hg19 UCSC known genes, and that tests for association within 1 Mbase of the coding region for each gene will be considered. The expression data will be permuted against genotype data in two independent draws to assemble the null reference distribution for association scores; these are used to enumerate false significance claims at various magnitudes of the distribution of association scores. The plug-in procedure for estimating FDR (see Hastie, Tibshirani, and Friedman) is used.
<<>>=
fn = probeId(featureNames(g20))
@
<<>>=
if (file.exists("db2")) unlink("db2", recursive=TRUE)
fn = probeId(featureNames(g20))
exTx = function(x) MAFfilter( x[fn, ], lower=0.05)
b1 = best.cis.eQTLs("GGdata", ~male, radius=1e6,
   folderstem="db2", nperm=2, geneApply=mclapply,
   smFilter= exTx, chrnames="20", snpannopk=snplocsDefault())
@
<<>>=
data(b1)
@
<<>>=
b1
@
%=====

\section{Comprehensive reporting on FDR for all SNP cis to a set of genes}

We want to support improved control of cis-eQTL searches and better reporting of cis-eQTL search outputs. The basic idea is that we can, in most modern computing facilities, retain fairly substantial archives resulting from comprehensive searches; previous versions of GGtools emphasized gene-centric analysis which could then be extended to higher levels of resolution in a targeted manner. We can do a single comprehensive search and then filter to understand sensitivity of FDR measures to aspects of upstream filtering.
\subsection{Configuring a search}
<<>>=
options("showHeadLines"=3)
options("showTailLines"=1)
@
<<>>=
suppressPackageStartupMessages(library(GGtools))
ini = new("CisConfig")
ini
radius(ini) = 75000L
smFilter(ini) = function(x) nsFilter(x, var.cutoff=.98)
smpack(ini) = "GGtools"
chrnames(ini) = "20"
library(parallel)  # to define mclapply
geneApply(ini) = mclapply
ini
@

\subsection{Executing a search}

\subsubsection{Computing and serializing scored GRanges}

The FDRs reported are for the specific search (here one chromosome). Various tools for combining multiple searches are used to get FDRs for more comprehensive searches.
<<>>=
options(mc.cores=max(1, detectCores()-3))
system.time(t20 <- All.cis(ini))
t20
@
Note that the observed score is retained along with all associated scores achieved under permutation of expression against genotype. We can update to a different chromosome:
<<>>=
chrnames(ini) = "21"
system.time(t21 <- All.cis(ini))
@
For realistic and comprehensive filtering settings (radius 250000 or greater, MAF bound .005), the results of All.cis can be quite large, so most downstream utilities assume the results are serialized to disk.
<<>>=
td = tempdir()
save(t20, file=paste0(td, "/t20.rda"))
#save(t21, file=paste0(td, "/t21.rda"))
@

\subsubsection{Amalgamation of chromosome-specific results, recomputation of FDR}

We will use a collection method that focuses on sensitivity analysis, generating FDRs for different tuning parameters within the scope of the search.
<<>>=
fns = dir(td, full=TRUE, patt="^t2.*rda$")
cf = collectFiltered(fns, mafs=c(.02,.03,.1), hidists=c(1000,10000,75000))
class(cf)
names(cf)  # MAFs are primary organization, distances secondary
names(cf[[1]])
sapply(cf, sapply, function(x) sum(x$fdr <= 0.05))  # best per gene
of = order(cf[[3]][[1]]$fdr)
cf[[3]][[1]][of,][1:4,]  # shows best hits
@
<<>>=
g20 = getSS("GGtools", "20")
@
<<>>=
plot_EvG(probeId("GI_4502372-S"), rsid("rs290449"), g20)
@
\texttt{collectFiltered} has a default \texttt{filterFun} which isolates the best scoring SNP per gene. A SNP-centric FDR can also be computed.
<<>>=
cf2 = collectFiltered(fns, mafs=c(.02,.05,.1), hidists=c(10000,75000),
   filterFun=cis.FDR.filter.SNPcentric)
sapply(cf2, sapply, function(x) sum(x$fdr<=0.01))
@

\section{Full search on an SGE-based cluster}

The stages are:
\begin{itemize}
\item create the smlSet-based package via externalize
\item set up the chromosome-specific run as a function
<<>>=
onepopConfig = function(chrn="22", nperm=3L, MAF=.05,
   npc=10, radius=50000, exclRadius=NULL) {
  if (!is.null(npc)) {
    bf = basicFilter = function(ww)
       MAFfilter(clipPCs(ww, 1:npc), lower=MAF)[, which(ww$isFounder==1)]
  } else {
    bf = basicFilter = function(ww)
       MAFfilter(ww, lower=MAF)[, which(ww$isFounder==1)]
  }
  ssm(library(GGtools))
  iniconf = new("CisConfig")
  smpack(iniconf) = "GGdata"
  rhs(iniconf) = ~1
  folderStem(iniconf) = paste0("cisScratch_", chrn)
  chrnames(iniconf) = as.character(chrn)
  geneannopk(iniconf) = "illuminaHumanv1.db"
  snpannopk(iniconf) = snplocsDefault()
  smchrpref(iniconf) = ""
  geneApply(iniconf) = mclapply
  exFilter(iniconf) = function(x) x
  smFilter(iniconf) = bf
  nperm(iniconf) = as.integer(nperm)
  radius(iniconf) = radius
  estimates(iniconf) = TRUE
  MAFlb(iniconf) = MAF
  library(parallel)
  options(mc.cores=3)
  options(error=recover)
  set.seed(1234)
  tmp = All.cis( iniconf )
  metadata(tmp)$config = iniconf
  obn = paste("pop2_", "np_", nperm, "_maf_", MAF, "_chr_", chrn,
     "_npc_", npc, "_rad_", radius, "_exc_", exclRadius, sep="")
  fn = paste(obn, ".rda", sep="")
  assign(obn, tmp)
  save(list=obn, file=fn)
}
@
\item Create a script that invokes the function
\begin{verbatim}
#!/bin/bash

RCMD=/udd/stvjc/bin/R3
CHR=$1
MAF=$2
NPERM=$3
NPC=$4
RAD=$5
EXCL=$6

LANG=C echo "source('onepopConfig.R'); \
  onepopConfig(chrn=${CHR}, MAF=${MAF}, nperm=${NPERM}, \
  npc=${NPC}, radius=${RAD}, exclRadius=${EXCL})" | ${RCMD} \
  --no-save >& lk${CHR}_${MAF}_${NPERM}_${RAD}_${EXCL}.txt
\end{verbatim}
\item Use the scheduler to submit one run per chromosome
\begin{verbatim}
#!/bin/bash

for i in {1..22}; do
  qsub -cwd -l lx6,large ./doOne.sh $i .005 3L 18 250000L NULL;
  echo $i;
done;
\end{verbatim}
At conclusion, 22 .rda files will be present for harvesting using the collectFiltered function.
\end{itemize}

%=====
\subsection{Removal of empirically identified expression heterogeneity}

\citet{Leek:2010p1819} describe implications of batch effects in high-throughput experimental contexts. Various methods for adjustment of responses have been proposed; a very simple but evidently risky approach is nonspecific removal of principal components of variation. To maximize information on possible batch effects, the entire expression matrix is restored and decomposed for principal component removal.
<<>>=
if (file.exists("db2")) unlink("db2", recursive=TRUE)
g20 = getSS("GGdata", "20")
exTx = function(x) MAFfilter( clipPCs(x,1:10)[fn, ], lower=0.05)
g20f = exTx(g20)
@
<<>>=
b2 = best.cis.eQTLs("GGdata", ~male, radius=50000,
   folderstem="db3", nperm=2, geneApply=mclapply,
   smFilter= exTx, chrnames="20", snpannopk=snplocsDefault())
@
<<>>=
data(b2)
@
<<>>=
b2
@
An improvement in sensitivity can be seen after this adjustment. The probes with FDR at or below 0.13 (a threshold used for demonstration purposes with this drastically reduced data) are identified by the helper function
<<>>=
goodProbes = function(x) names(x@scoregr[elementMetadata(x@scoregr)$fdr<0.13])
@
All probes identified as significant (at FDR $\leq 0.13$) before the PCA adjustment are identified as such after it:
<<>>=
setdiff(goodProbes(b2), goodProbes(b1))
@
The adjustment can lead to loss of significance for some probes.
<<>>=
setdiff(goodProbes(b1), goodProbes(b2))
@
The effects of the adjustment for genes that were significant only in the adjusted analysis can be visualized:
\setkeys{Gin}{width=.95\textwidth}
<<>>=
newp = setdiff(goodProbes(b2), goodProbes(b1))
np = length(newp)
bestSnp = function(pn, esm) elementMetadata(esm@scoregr[pn])$snpid
par(mfrow=c(2,2))
plot_EvG(probeId(newp[1]), rsid(bestSnp(newp[1], b2)), g20, main="raw")
plot_EvG(probeId(newp[1]), rsid(bestSnp(newp[1], b2)), g20f, main="PC-adjusted")
plot_EvG(probeId(newp[np]), rsid(bestSnp(newp[np], b2)), g20, main="raw")
plot_EvG(probeId(newp[np]), rsid(bestSnp(newp[np], b2)), g20f, main="PC-adjusted")
@

\clearpage

\section{Exercises}

\begin{enumerate}
\item All computations performed above ignore familial structure in the data, which can be determined using the \texttt{famid}, \texttt{mothid}, and \texttt{fathid} variables in \texttt{pData(g20)}. Reduce the \texttt{smlSet} instance used for eQTL testing to parents only, who have parent identifiers equal to zero, and recompute the main tables.
\item For selected eQTLs that were significant at low FDR in the full data (which ignores familial structure) but are not significant in the analysis of the reduced data, use a reasonably specified variance components model on the full data with familial structure and compute a third test statistic. Is the restriction to parents only a good policy for eQTL discovery? Is there evidence of substantial familial aggregation in expression after heterogeneity reduction?
\item How can we select in a principled way the number of principal components to be removed for heterogeneity reduction?
\end{enumerate}

\clearpage

\section{Appendix: building your own structures}

The following code illustrates construction of a minimal smlSet instance.
<<>>=
library(Biobase)
suppressPackageStartupMessages(library(GGtools))
ex = matrix(0, nr=5, nc=3)
pd = data.frame(v1 = 1:3, v2=5:7)
colnames(ex) = rownames(pd) = LETTERS[1:3]
adf = AnnotatedDataFrame(pd)
rownames(ex) = letters[1:5]
es = ExpressionSet(ex, phenoData=adf)
exprs(es)
pData(es)
library(snpStats)
mysnps = matrix(rep(1:3, 10), nr=3)  # note 1=A/A, ..., 0=NA
rownames(mysnps) = colnames(ex)
mysm = new("SnpMatrix", mysnps)
as(mysm, "character")
as(mysm, "numeric")
sml = make_smlSet(es, list(c1=mysm))
annotation(sml)
colnames(smList(sml)[[1]])
@

\clearpage

\section{Session information}

<<>>=
toLatex(sessionInfo())
@

\bibliography{ggtvig}

\end{document}