%
% NOTE -- ONLY EDIT THE .Rnw FILE!!!  The .tex file is
% likely to be overwritten.
%

%\VignetteIndexEntry{GGtools 2012: efficient tools for eQTL discovery}
%\VignetteDepends{GGdata}
%\VignetteKeywords{genetics of gene expression}
%\VignettePackage{GGtools}

\documentclass[12pt]{article}

\usepackage{auto-pst-pdf}
\usepackage{amsmath,pstricks}
\usepackage[authoryear,round]{natbib}
\usepackage{hyperref}


\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}


\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}

\textwidth=6.2in

\bibliographystyle{plainnat} 
 
\begin{document}
%\setkeys{Gin}{width=0.55\textwidth}

\title{Using \textit{GGtools} for eQTL discovery and interpretation}
\author{VJ Carey \texttt{stvjc at channing.harvard.edu}}
\maketitle

\tableofcontents

\section{Overview and installation}

This document addresses data structure and analytic workflow
for investigations of genetic sources of expression variation.
Key background references are
\citet{Williams:2007p21} for general biologic overview,
\citet{Cheung:2005p446} and \citet{Stranger:2007p114} for key applications,
and \citet{Stegle:2010p2015}, \citet{Petretto:2010p2678}, and
\citet{Leek:2010p1819} for various methodological issues.
\citet{Majewski:2011p3139}
reviews the potential of eQTL investigations with
expression measures based on RNA sequencing.

This document is constructed using R version 
\Sexpr{version$major}.\Sexpr{version$minor}.
See the session information
at the end of the document for full details.  Using a comparable
version of R, you can obtain the software needed for the production
of this document using
\begin{verbatim}
source("http://www.bioconductor.org/biocLite.R")
biocLite("GGtools", dependencies=TRUE)
\end{verbatim}

\section{Data structures}

\subsection{Reference data supplied with Bioconductor}

A collection of 30 trios of northern and western
European ancestry (HapMap CEU) was genotyped for
4 million SNP loci in HapMap phase II.
Immortalized B-cell lines were assayed for
gene expression using Illumina's HumanWG6v1 bead array.
Digital data on expression and genotype for the 90 CEU individuals
are distributed in Bioconductor package \textit{GGdata}; the expression 
data were retrieved from the GENEVAR website of Wellcome Trust, e.g.,
\begin{verbatim} 
ftp://ftp.sanger.ac.uk/pub/genevar/CEU_parents_norm_march2007.zip
\end{verbatim} 
and the genotype data were obtained directly from
hapmap.org at build 36:
\begin{verbatim}
ftp://ftp.ncbi.nlm.nih.gov/hapmap/genotypes/2008-03/forward/non-redundant/
\end{verbatim}
The data in GGdata are likely derived from r23, while r23a is now distributed.
Some effort at updating genotypes may be supplied in the future.

Acquire the genome-wide expression data and the genotype data
for chromosome 20 as follows:
<<loadm>>=
suppressPackageStartupMessages(library(GGtools))
library(parallel)
<<getd,cache=TRUE>>=
g20 = getSS("GGdata", "20")
@
<<lkc>>=
g20
class(g20)
@

The \texttt{smlSet} class was designed in 2006 as an experiment
in unifying high-throughput expression and genotype data.
A key resource was the \textit{snpMatrix} (now \textit{snpStats})
package of David Clayton, which defined an 8-bit representation of
genotype calls (including uncertain calls obtained by statistical
imputation), import and coercion infrastructure allowing use
of the 8-bit representation with popular genetic data formats
(pedfiles, mach and beagle outputs, etc.),
and statistical testing infrastructure employing this
representation.

The expression and sample-level data are handled just as with familiar
\texttt{ExpressionSet} instances:
<<doex>>=
exprs(g20)[1:5,1:5]
pData(g20)[1:4,]
@

The genotype data are held in a list with elements intended to
represent chromosomes, and the list is stored in an environment to
reduce copying efforts.
<<lksm>>=
smList(g20)
as(smList(g20)[[1]][1:5,1:5], "matrix")
@

The leading zeroes in the display above indicate that raw bytes are
used to represent the genotypes per sample (rows) per SNP (columns).
Coercions to numeric and character representations are available:

<<lksm2>>=
as(smList(g20)[[1]][1:5,1:5], "numeric")
as(smList(g20)[[1]][1:5,1:5], "character")
@

Any number of chromosomes can be held in the genotype list component,
but the design allows working with only one chromosome at a time and
examples emphasize this approach.  Amalgamation of results across
chromosomes is generally straightforward.

\subsection{Working with your own data}

The \verb+make_smlSet+ function can be used to bind a list of
suitably named
\texttt{SnpMatrix} instances with an \texttt{ExpressionSet} instance
to create a single \texttt{smlSet} instance covering multiple chromosomes.

The \texttt{externalize} function can be applied to such an \texttt{smlSet}
instance to create a new \textit{package} which can be installed for use
with \texttt{getSS}.  This is the preferred way of managing work with
large genotyping panels (such as the 10 million locus panel achievable with
``thousand genomes imputation'').

Briefly, \texttt{externalize} arranges a DESCRIPTION file and system of
folders that can be installed as an R package.  The expression data
are stored as object \texttt{ex} in data/eset.rda, and the \texttt{SnpMatrix}
instances are stored separately as .rda files in inst/parts.
The \texttt{getSS} function will create \texttt{smlSet} instances on
the fly from the externalize-generated package.
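
As a minimal sketch of these steps, the following unevaluated code binds
simulated expression and genotype data into an \texttt{smlSet} and
externalizes it.  The dimensions, the sample and feature names, and the
package name \texttt{toyGG} are hypothetical.  The sketch assumes that
\verb+make_smlSet+ accepts an \texttt{ExpressionSet} followed by a list of
\texttt{SnpMatrix} instances named by chromosome, and that
\texttt{externalize} takes the \texttt{smlSet} and a package name; the
manual pages should be consulted before relying on it.

<<ownDataSketch,eval=FALSE>>=
library(GGtools)
library(Biobase)
library(snpStats)
set.seed(1)
nsamp = 20; ngene = 5; nsnp = 8      # toy dimensions (hypothetical)
sids = paste0("S", 1:nsamp)
# expression: features in rows, samples in columns
ex = matrix(rnorm(ngene*nsamp), nrow=ngene,
   dimnames=list(paste0("g", 1:ngene), sids))
es = ExpressionSet(assayData=ex)
# genotypes: samples in rows, SNPs in columns, raw codes 01/02/03
# for the three genotype classes (00 would denote a missing call)
gt = matrix(as.raw(sample(1:3, nsamp*nsnp, replace=TRUE)), nrow=nsamp,
   dimnames=list(sids, paste0("snp", 1:nsnp)))
sm = new("SnpMatrix", gt)
# the list names identify chromosomes
toy = make_smlSet(es, list("20"=sm))
# create an installable package skeleton named toyGG (assumed to be
# written under the working directory); once it is installed,
# getSS("toyGG", "20") should rebuild the smlSet on the fly
externalize(toy, "toyGG")
@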

\subsection{Filters and permutation support}

Here ``filter'' is used to refer to any function that processes an \texttt{smlSet}
instance and returns an \texttt{smlSet} instance after altering the contents,
which may involve eliminating probes or SNPs, performing numerical transformations
to expression or genotype measures, transforming sample data or eliminating samples.

\subsubsection{Bracket operations}

Coordinated manipulations of genotype, expression, and phenotype
information on samples can be accomplished with the second subscript to
the bracket operator.  Thus \texttt{g20[,1:5]} is the restriction of \texttt{g20}
to the first five samples.  Sample names may also be used for such manipulations.

Reduction of the expression component can be accomplished with the first
subscript to the bracket operator.  Thus \texttt{g20[1:5,]} is the restriction
of \texttt{g20} to five expression probes; all other data are unaltered.
If it is desired to use feature names for such manipulations, the character
vector of feature names must be cast to class \texttt{probeId} with the \texttt{probeId()}
method.  At present no such operations can be used to alter the genotype data contents.
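
The bracket operations just described can be sketched as follows, using
only \texttt{g20} and functions named in this section.  The object names
on the left-hand sides are arbitrary, and the chunk is not evaluated.

<<bracketSketch,eval=FALSE>>=
library(GGtools)
g20 = getSS("GGdata", "20")
# second subscript: restrict genotype, expression, and phenotype
# data to the first five samples
s5 = g20[, 1:5]
# the same restriction using sample names
s5b = g20[, sampleNames(g20)[1:5]]
# first subscript: restrict to five expression probes by position
p5 = g20[1:5, ]
# by name: feature names must first be cast with probeId()
p5b = g20[probeId(featureNames(g20)[1:5]), ]
@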

\subsubsection{Large scale filters}

It is known that non-specific filtering (removal of probes with low variation
across samples, without regard to sample phenotype or class information)
can increase sensitivity and specificity of certain differential expression
test procedures \citep{Bourgon:2010p1763}.  The \texttt{nsFilter} function of
the \textit{genefilter} package has been adapted to work with \texttt{smlSet}
instances.

SNPs can be filtered out of the \texttt{smlSet} instance on the basis of
observed minor allele or genotype frequencies using \texttt{MAFfilter} and
\texttt{GTFfilter} respectively.
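
What a minor allele frequency filter does can be illustrated in base R on
a toy matrix of B allele counts.  This is only a sketch of the idea: the
matrix, the function name \texttt{mafSketch}, and the SNP names are
hypothetical, and \texttt{MAFfilter} itself operates on the 8-bit
\texttt{SnpMatrix} representation rather than on integer counts.

<<mafSketch,eval=FALSE>>=
# rows are samples, columns are SNPs, entries are B allele counts
geno = cbind(snpA=c(rep(0, 11), 1),        # rare variant
             snpB=rep(c(0, 1, 2, 1), 3),   # common variant
             snpC=rep(2, 12))              # monomorphic
mafSketch = function(g) {
  pB = colMeans(g, na.rm=TRUE)/2   # B allele frequency per SNP
  pmin(pB, 1 - pB)                 # minor allele frequency
}
maf = mafSketch(geno)
# mimic a lower bound of 0.05 on minor allele frequency
kept = geno[, maf >= 0.05, drop=FALSE]
colnames(kept)
@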

Various approaches to reduction of ``expression heterogeneity'' can be
examined.  \texttt{clipPCs(x, vec)} will form the singular value decomposition
of the expression matrix and remove the principal components enumerated in \texttt{vec}
by reassembling the expression matrix after setting the corresponding singular values
to zero.  It is also possible to employ any computed quantities such as principal
components or surrogate variables identified in SVA \citep{Leek:2007p1723}
as covariates in the formula element of analysis functions in
\texttt{GGtools}, but note that simple permutations
do not lead to valid permutation tests of SNP effects in
the presence of covariates (see \citet{Buzkova:2011p3368}, who focus on
interaction but describe the problem for main effects models,
with references, early in the paper).
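
The reassembly described for \texttt{clipPCs} can be sketched in base R.
This is an illustration of the technique on a simulated matrix, not the
package's implementation; the function name \texttt{clipSketch} and the
choice to center each feature before decomposition are assumptions.

<<clipSketch,eval=FALSE>>=
clipSketch = function(ex, vec) {
  # ex: features in rows, samples in columns
  mu = rowMeans(ex)
  cen = ex - mu                 # center each feature across samples
  s = svd(cen)
  d = s$d
  d[vec] = 0                    # zero the selected singular values
  out = s$u %*% diag(d, nrow=length(d)) %*% t(s$v) + mu
  dimnames(out) = dimnames(ex)
  out
}
set.seed(2)
ex = matrix(rnorm(200), nrow=20)   # 20 features, 10 samples
batch = rep(c(-1, 1), each=5)      # simulated two-level batch effect
ex = ex + outer(rnorm(20, sd=3), batch)
adj = clipSketch(ex, 1)
# leading singular value before and after clipping component 1
svd(ex - rowMeans(ex))$d[1]
svd(adj - rowMeans(adj))$d[1]
@

After clipping, the leading singular value of the adjusted matrix equals
the second singular value of the original centered matrix.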

The introduction of novel approaches to expression transformation can
be accomplished using code similar to the following, illustrating
the use of PEER \citep{Stegle:2010p2015}:

<<dopeer,eval=FALSE>>=
library(peer)
model = PEER()
PEER_setPhenoMean(model, t(exprs(g20)))
PEER_setNk(model, 10)
PEER_setCovariates(model, matrix(1*g20$male,nc=1))
PEER_update(model)
resid=t(PEER_getResiduals(model))
rownames(resid) = featureNames(g20)
colnames(resid) = sampleNames(g20)
g20peer10 = g20
g20peer10@assayData = assayDataNew("lockedEnvironment", exprs=resid)
@
At this point, \texttt{g20peer10} holds expression data with 10 latent
factors removed after adjustment for gender.

\subsubsection{Permutation of expression against genotype}

Because the \textit{snpStats} testing procedures defensively
match predictor to response variable orderings using sample labels,
special steps must be taken to ensure that tests use properly
permuted responses.  The \texttt{permEx} function takes care of this,
using the current state of the random number generator.  
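
The reason for these special steps can be seen in a small base R sketch.
It does not reproduce the internals of \texttt{permEx}; the matrix and
names are hypothetical.  Shuffling columns together with their labels
changes nothing for a procedure that realigns by label, so the permuted
values must be placed under the original sample labels.

<<permSketch,eval=FALSE>>=
set.seed(123)   # results depend on the random number generator state
ex = matrix(rnorm(12), nrow=3,
   dimnames=list(paste0("g", 1:3), paste0("S", 1:4)))
idx = sample(ncol(ex))
# naive permutation: labels travel with the values, so matching on
# sample labels recovers the original alignment
naive = ex[, idx]
all.equal(naive[, colnames(ex)], ex)
# effective permutation: values are shuffled, labels stay in place
permuted = ex[, idx]
colnames(permuted) = colnames(ex)
@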

\subsection{Post-analysis data structures}

While it is possible to construe the results of an eQTL search as a
static report, it is more productive to conceptualize the result as a
data object for further analysis.  Unfortunately, the number of
tests to be managed can be very large -- at least hundreds of millions,
and these must be joinable with location metadata to be maximally useful.

Several data structures for managing post-analysis results
have emerged as this package has matured.  Of particular concern
are those that use \textit{ff} out-of-memory archiving for
test statistic values or effect estimates and those that use the
\texttt{GRanges} infrastructure to facilitate efficient query
resolution in genomic coordinates.  These will be described along
with the related analytic workflow steps.
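
As a hedged illustration of a coordinate-based query, the sketch below
stores a few scores in a \texttt{GRanges} instance and retrieves those
falling in a window.  The positions, identifiers, and scores are invented
for illustration, and the object is far simpler than the structures
produced by \textit{GGtools}.

<<grSketch,eval=FALSE>>=
library(GenomicRanges)
scores = GRanges(seqnames="chr20",
   ranges=IRanges(start=c(1000, 5000, 9000, 20000), width=1),
   snpid=paste0("snp", 1:4),
   score=c(2.1, 35.4, 0.7, 12.9))
win = GRanges(seqnames="chr20", ranges=IRanges(start=4000, end=10000))
# keep only the SNPs located inside the window
hits = subsetByOverlaps(scores, win)
# order the SNPs in the window by decreasing score
hits[order(hits$score, decreasing=TRUE)]
@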

\section{Focused analyses}

A specific gene can be checked for eQTL on a given chromosome
or set of chromosomes with
\texttt{gwSnpTests}.  There are various convenience facilities.
In the call to \texttt{gwSnpTests} below,
a gene symbol is used to pick out an expression element,
and adjustment for gender is accommodated in an additive genetic model
for effects of B allele copy number on expression of CPNE1.
One degree of freedom chi-squared tests are computed.
<<dot,cache=TRUE>>=
t1 = gwSnpTests(genesym("CPNE1")~male, g20, chrnum("20"))
t1
topSnps(t1)
@ 
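
As a sketch of the model behind this call, and not a statement of the
\textit{snpStats} internals, the expression value $y_i$ of CPNE1 in
sample $i$ can be taken to satisfy
\begin{equation*}
E(y_i) = \beta_0 + \beta_1 m_i + \gamma g_i ,
\end{equation*}
where $m_i$ indicates male sex and $g_i \in \{0, 1, 2\}$ is the number of
copies of the B allele at the SNP under test.  The symbols are introduced
here for illustration only.  Each SNP yields a test of
$H_0: \gamma = 0$, whose statistic is referred to the chi-squared
distribution with one degree of freedom.
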
It is possible to compute tests for this specific gene for association
with SNP across several chromosomes if desired; change the value of
the third argument to a vector.

There are a few approaches to visualization of the results that are
relevant, but complications arise in relation to choice of genomic
coordinates.

<<dopl,echo=FALSE,results=hide>>=
pdf(file="t1.pdf")
plot(t1, "SNPlocs.Hsapiens.dbSNP.20110815")
dev.off()
pdf(file="t1evg.pdf")
plot_EvG(genesym("CPNE1"), rsid("rs17093026"), g20)
dev.off()
@

<<dol,eval=FALSE>>=
plot(t1, "SNPlocs.Hsapiens.dbSNP.20110815")
plot_EvG(genesym("CPNE1"), rsid("rs17093026"), g20)
@

\setkeys{Gin}{width=0.45\textwidth}
\begin{tabular}{cc}
\includegraphics{t1} & \includegraphics{t1evg} \\
\end{tabular}

Code like the following can be used to display
scores ($-\log_{10} p$) on the genome browser, here with hg19 locations.
<<doloc,eval=FALSE>>=
library(SNPlocs.Hsapiens.dbSNP.20110815)
S20 = getSNPlocs("ch20", as.GRanges=TRUE)
GR20 = makeGRanges(t1, S20)
library(rtracklayer)
export(GR20, "~/cpne1new.wig")
@

With this code, it will be necessary
to manually alter the chr assignment in the wig file, and
place an informative title to get the following display.

\clearpage

\begin{center}
\setkeys{Gin}{width=0.95\textwidth}
\includegraphics{cpne1Brow}
\end{center}

\clearpage

\section{Comprehensive surveys}

\subsection{A set of genes vs. all SNP on a chromosome}

The performance of \textit{snpStats} \texttt{snp.rhs.tests} is
very good and so our principle for large-scale searches is to
compute all association statistics, save them in truncated form,
and filter results later.  This is carried out with the
\texttt{eqtlTests} function.  To illustrate, 
the expression data is sharply filtered
to the 50 most variable genes on chromosome 20 as
measured by cross-sample median absolute deviation, SNP
with MAF $<$ 0.05 are removed, and
then all SNP-gene association tests are executed.

<<prep,echo=FALSE,results=hide>>=
<<do20,cache=TRUE,keep.source=TRUE>>=
g20 = GGtools:::restrictProbesToChrom(g20, "20")
mads = apply(exprs(g20),1,mad)
oo = order(mads, decreasing=TRUE)
g20 = g20[oo[1:50],]
tf = tempfile()
dir.create(tf)
e1 = eqtlTests(MAFfilter(g20, lower=0.05), ~male, 
    geneApply=mclapply, targdir=tf)
e1
@
On a two-core MacBook Pro, this computation takes well under a minute.
The details of the underlying data structure are involved.  Briefly,
a short integer is used to represent each chi-squared statistic
obtained in the \Sexpr{nrow(e1@fffile)*ncol(e1@fffile)}
tests computed, in an \texttt{ff} archive.  Use \texttt{topFeats}
to harvest the top-scoring features manually.

<<gettop,cache=TRUE>>=
pm1 = colnames(e1@fffile)
tops = sapply(pm1, function(x) topFeats(probeId(x), mgr=e1, n=1)) 
top6 = sort(tops, decreasing=TRUE)[1:6]
@
<<dopr6>>=
print(top6)
@

R has propagated the names of probes and SNPs with the scores so that 
a table can be created as follows:
<<gettab>>=
nms = strsplit(names(top6), "\\.")
gn = sapply(nms,"[",1)
sn = sapply(nms,"[",2)
tab = data.frame(snp=sn,score=as.numeric(top6))
rownames(tab) = gn
tab
@

Statistical interpretation of the scores in this table is not clear as the
data structure includes familial aggregation in
trios and extended pedigrees, and may include population stratification.
Nevertheless, consistency of these findings with other published
results involving multiple populations can be checked.
\textit{GGtools} includes a table published by Stranger and colleagues
in 2007 enumerating multipopulation eQTL \citep{Stranger:2007p114}.
<<ddstr>>=
data(strMultPop)
strMultPop[ strMultPop$rsid %in% tab$snp, ]
@
Thus the top two SNP in the table computed here are identified as multipopulation eQTL by
Stranger.  The other association scores are not very strong and likely do
not correspond to genuine associations.
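
Returning to the storage of scores by \texttt{eqtlTests}, the idea of
saving chi-squared statistics in truncated form as short integers can be
pictured with a base R sketch.  The scale factor of 100 and the function
names are assumptions made for illustration, not a description of the
\texttt{ff} archive that the package writes.

<<shortSketch,eval=FALSE>>=
fac = 100        # hypothetical scale factor
smax = 32767L    # largest signed 16-bit integer
toShort = function(chisq) as.integer(pmin(round(fac*chisq), smax))
fromShort = function(s) s/fac
set.seed(3)
chisq = rchisq(5, df=1)
stored = toShort(chisq)
# rounding error is at most half a unit of 1/fac below the ceiling
max(abs(fromShort(stored) - chisq))
# very large scores saturate at the ceiling smax/fac
fromShort(toShort(1000))
@

The design trades precision for space: two bytes per test instead of
eight, at the cost of a ceiling on the recorded score.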

\subsection{Tabulating best associated cis-rSNP with permutation-based FDR: small example}

The workhorse for identifying genes with which putatively regulating
SNP (rSNP) can be associated is \texttt{best.cis.eQTLs}.  This can be used for genome-wide analysis,
but here an alternative table is created for the sharply filtered chromosome 20 data
given above.  This call says that gene location information will be acquired from
the Bioconductor TxDb.Hsapiens annotation package for hg19 UCSC known genes,
and that tests for association within 1 Mbase of the coding region for each gene
will be considered.  The expression data will be permuted against genotype data
in two independent draws to assemble the null reference distribution for
association scores; these are used to enumerate false significance claims
at various magnitudes of the distribution of association scores.  The plug-in
procedure for estimating FDR described by Hastie, Tibshirani, and Friedman is used.

<<doc,cache=TRUE,keep.source=TRUE>>=
if (file.exists("db2")) unlink("db2", recursive=TRUE)
fn = probeId(featureNames(g20))
exTx = function(x) MAFfilter( x[fn, ], lower=0.05)
b1 = best.cis.eQTLs("GGdata", ~male,  radius=1e6,
   folderstem="db2", nperm=2, geneApply=mclapply,
   smFilter= exTx, chrnames="20")
<<lkc>>=
b1
@
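
The plug-in FDR estimate can be sketched in base R as the average number
of permutation scores at or above a threshold, divided by the number of
observed scores at or above it.  The function name \texttt{pluginFDR} and
the simulated scores are hypothetical, and the sketch is not the code
used inside \texttt{best.cis.eQTLs}.

<<fdrSketch,eval=FALSE>>=
pluginFDR = function(obs, perm, thresh) {
  # obs: observed scores; perm: matrix, one column per permutation
  nfalse = mean(colSums(perm >= thresh))  # expected false calls
  ncalled = sum(obs >= thresh)            # calls in the observed data
  if (ncalled == 0) return(NA_real_)
  min(1, nfalse/ncalled)
}
set.seed(1)
# 45 null scores and 5 scores simulated with a strong effect
obs = c(rchisq(45, df=1), rchisq(5, df=1, ncp=30))
# two permutation draws, as with nperm=2
perm = matrix(rchisq(100, df=1), ncol=2)
est = sapply(c(5, 10, 20), function(th) pluginFDR(obs, perm, th))
est
@

With only two permutations the estimate is coarse; more draws give a
smoother null reference distribution at greater computational cost.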

\subsection{Tabulating all cis-rSNP with association score exceeding a given threshold}

The gene-centered analysis described in the previous subsection yields a
set of thresholds corresponding to various FDRs.  It may be of interest to
enumerate all SNP with association scores exceeding some threshold.
\texttt{All.cis.eQTLs} can be used for this.

<<lkall>>=
args(All.cis.eQTLs)
@  

The computation can be done \textit{de novo} on the basis of an \texttt{smpack}
argument, or can be regarded as a followup on a completed \texttt{best.cis.eQTLs}
call.

<<doall1>>=
b1.all = All.cis.eQTLs( maxfdr = 0.05, inbestcis = b1, smpack="GGdata", rhs=~1, 
 chrnames = "20",
 smFilter4all = function(x) MAFfilter(x, lower=0.05))
@

Here a second SNP is identified for the one gene exhibiting an eQTL
at FDR $\leq 0.05$.

      
\subsection{Removal of empirically identified expression heterogeneity}

\citet{Leek:2010p1819} describe implications of
batch effects in high-throughput experimental contexts.
Various methods for adjustment of responses have been proposed;
a very simple but evidently risky approach is nonspecific removal
of principal components of variation.  To maximize information
on possible batch effects, the entire expression matrix is restored and
decomposed for principal component removal.

<<domo,cache=TRUE, keep.source=TRUE>>=
if (file.exists("db2")) unlink("db2", recursive=TRUE)
g20 = getSS("GGdata", "20")
exTx = function(x) MAFfilter( clipPCs(x,1:10)[fn, ], lower=0.05)
g20f = exTx(g20)
<<lkc2,cache=TRUE>>=
b2 = best.cis.eQTLs("GGdata", ~male,  radius=1e6,
   folderstem="db2", nperm=2, geneApply=mclapply,
   smFilter= exTx, chrnames="20")
b2
@

An improvement in sensitivity is seen after this adjustment.  The
probes with FDR at 0.001 or lower are identified by the helper
function
<<ggg>>=
goodProbes = function(x) names(x@scoregr[elementMetadata(x@scoregr)$fdr<=0.001])
@
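All probes identified as significant
(at FDR $\leq 0.001$) before the PCA adjustment
are identified as such after it, so the following set difference is empty:
<<chkp>>=
# probes significant before the adjustment but not after it
setdiff(goodProbes(b1), goodProbes(b2))
@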

The effects of the adjustment for
genes that were significant only in the
adjusted analysis can be visualized:

\setkeys{Gin}{width=.95\textwidth}
<<domopic,fig=TRUE>>=
newp = setdiff(goodProbes(b2), goodProbes(b1))
np = length(newp)
bestSnp = function(pn, esm) elementMetadata(esm@scoregr[pn])$snpid
par(mfrow=c(2,2))
plot_EvG(probeId(newp[1]), rsid(bestSnp(newp[1], b2)), g20, main="raw")
plot_EvG(probeId(newp[1]), rsid(bestSnp(newp[1], b2)), g20f, main="PC-adjusted")
plot_EvG(probeId(newp[np]), rsid(bestSnp(newp[np], b2)), g20, main="raw")
plot_EvG(probeId(newp[np]), rsid(bestSnp(newp[np], b2)), g20f, main="PC-adjusted")
@


\clearpage
\section{Exercises}

\begin{enumerate}
\item All computations performed above ignore familial structure
in the data that can be determined using the \texttt{famid}, 
\texttt{mothid}, \texttt{fathid}
variables in the \texttt{pData(g20)}.  Reduce the
\texttt{smlSet} instance used
for eQTL testing to parents only, who have parent identifiers
equal to zero, and recompute the main tables.
\item For selected eQTLs that were significant
with low FDR in the full data ignoring familial structure, but are
not significant in the analysis of the
reduced data, use a reasonably specified variance components
model on the full data with familial structure and compute
a third test statistic.  Is the restriction to parents only
a good policy for eQTL discovery?  Is there evidence of substantial
familial aggregation in expression after heterogeneity reduction?
\item How can we select in a principled way the number of principal components
to be removed for heterogeneity reduction?
\end{enumerate}


\clearpage
\section{Session information}
<<getss,results=tex>>=
toLatex(sessionInfo())
@


\bibliography{ggtvig}

\end{document}