%\documentclass[a4paper,12pt]{article}
\documentclass[12pt]{article}
%\usepackage{times}
\usepackage{mathptmx}
\renewcommand{\ttdefault}{cmtt}
\usepackage{graphicx}
\usepackage[pdftex,
            bookmarks,
            bookmarksopen,
            pdfauthor={David Clayton and Chris Wallace},
            pdftitle={snpMatrix Vignette}]
           {hyperref}
\setlength{\topmargin}{-20mm}
\setlength{\headheight}{5mm}
\setlength{\headsep}{15mm}
\setlength{\textheight}{245mm}
\setlength{\oddsidemargin}{10mm}
\setlength{\evensidemargin}{0mm}
\setlength{\textwidth}{150mm}
%\newcommand{\R}{\includegraphics[height=2ex]{/home/david/TeX/graphics/Rlogo.eps}}
\title{snpMatrix vignette\\Example of genome-wide association testing}
\author{David Clayton and Chris Wallace}
\date{\today}
\usepackage{Sweave}
\SweaveOpts{echo=TRUE, pdf=TRUE, eps=FALSE, keep.source=FALSE}
\begin{document}
\setkeys{Gin}{width=0.9\textwidth}
%\VignetteIndexEntry{snpMatrix}
%\VignettePackage{snpMatrix}
\maketitle

\section*{The {\tt snpMatrix} package}
The package ``{\tt snpMatrix}'' was written to provide data classes and
methods to facilitate the analysis of whole genome association studies
in R. In the data classes it implements, each genotype call is stored
as a single byte and, at this density, data for single chromosomes
derived from large studies and new high-throughput gene chip platforms
can be handled in memory by modern PCs and workstations. The
object--oriented programming model introduced with version 4 of the
S-plus package, usually termed ``S4 methods'', was used to implement
these classes. At the current state of development the package only
supports population--based studies, although we hope to provide
support for family--based studies soon. Both quantitative and
qualitative phenotypes may be analysed. Flexible association testing
functions are provided which can carry out single SNP tests that
control for potential confounding by quantitative and qualitative
covariates. Tests involving several SNPs taken together as ``tags''
are also supported. Efficient calculation of pair-wise linkage
disequilibrium measures is implemented, and data input functions
include a function which can download data directly from the
international HapMap project website. The package is described by
Clayton and Leung (2007) {\it Human Heredity}, {\bf 64}: 45--51.

\section*{Getting started}
We shall start by loading the packages and the data to be used in this
exercise:
<<>>=
library(chopsticks)
library(hexbin)
data(for.exercise)
@
In addition to the {\tt snpMatrix} package, we have also loaded the
{\tt hexbin} package, which reduces file sizes and improves the
legibility of plots with very many data points.

The data have been created artificially from publicly available
datasets. The SNPs have been selected from those genotyped by the
International HapMap Project\footnote{\tt http://www.hapmap.org} to
represent the typical density found on a whole genome association chip
(the Affymetrix 500K
platform\footnote{\tt http://www.affymetrix.com/support/technical/sample\_data/500k\_hapmap\_genotype\_data.affx})
for a moderately sized chromosome (chromosome 10). A (rather too)
small study of 500 cases and 500 controls has been simulated, allowing
for recombination, using beta software from Su and Marchini.
Re-sampling of cases was weighted in such a way as to simulate three
``causal'' loci on this chromosome, with multiplicative effects of
1.3, 1.4 and 1.5 for each copy of the risk allele at each locus. It
should be noted that this is a somewhat optimistic scenario!
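To make the multiplicative model concrete (a worked example of ours,
assuming only the model just described): if each copy of the risk
allele multiplies disease risk by a factor $r$, the three genotypes
(0, 1 or 2 copies of the risk allele) carry relative risks
\[
1, \qquad r, \qquad r^2,
\]
so that at the strongest simulated locus ($r = 1.5$) homozygous
carriers of the risk allele have $1.5^2 = 2.25$ times the risk of
non-carriers.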
You have loaded three objects:
\begin{enumerate}
\item {\tt snps.10}, an object of class ``{\tt snp.matrix}'' containing
  a matrix of SNP genotype calls. Rows of the matrix correspond to
  subjects and columns correspond to SNPs:
<<>>=
show(snps.10)
@
\item {\tt snp.support}, a conventional R data frame containing
  information about the SNPs typed. To see its contents:
<<>>=
summary(snp.support)
@
  Row names of this data frame correspond with column names of
  {\tt snps.10} and comprise the (unique) SNP identifiers.
\item {\tt subject.support}, another conventional R data frame
  containing further information about the subjects. The row names
  coincide with the row names of {\tt snps.10} and comprise the
  (unique) subject identifiers. In this simulated study there are only
  two variables:
<<>>=
summary(subject.support)
@
  The variable {\tt cc} identifies cases ({\tt cc=1}) and controls
  ({\tt cc=0}), while {\tt stratum}, coded 1 or 2, identifies a
  stratification of the study population --- more on this later.
\end{enumerate}
In general, analysis of a whole--genome association study will require
a subject support data frame, a SNP support data frame for each
chromosome, and a SNP data file for each chromosome\footnote{Support
  files are usually read in with general tools such as
  {\tt read.table()}. The {\tt snpMatrix} package contains a number of
  tools for reading SNP genotype data into an object of class
  ``{\tt snp.matrix}''.}.

You may have noticed that it was not suggested that you should examine
the contents of {\tt snps.10} by typing {\tt summary(snps.10)}. The
reason is that this command produces one line of summary statistics
for each of the 28,501 SNPs. Instead we shall compute the summary and
then summarise it!
<<>>=
snpsum <- summary(snps.10)
summary(snpsum)
@
The contents of {\tt snpsum} are fairly obvious from the output of the
last command. We could look at a couple of summary statistics in more
detail:
<<fig=TRUE>>=
par(mfrow = c(1, 2))
hist(snpsum$MAF)
hist(snpsum$z.HWE)
@
The latter should represent a $z$-statistic, {\it i.e.}\ a statistic
normally distributed with mean zero and unit standard deviation under
the hypothesis of Hardy--Weinberg equilibrium (HWE). Quite clearly
there is extreme deviation from HWE, but this can be accounted for by
the manner in which this synthetic dataset was created.

A useful tool to detect samples that have been genotyped poorly is
{\tt row.summary()}. This calculates the call rate and mean
heterozygosity across all SNPs for each subject in turn:
<<>>=
sample.qc <- row.summary(snps.10)
summary(sample.qc)
@
<<fig=TRUE>>=
par(mfrow = c(1, 1))
plot(sample.qc)
@

\section*{The analysis}
We'll start by removing the three `outlying' samples above (the three
samples with heterozygosity near zero):
<<>>=
use <- sample.qc$Heterozygosity > 0
snps.10 <- snps.10[use, ]
subject.support <- subject.support[use, ]
@
Then we'll see if there is any difference between call rates for cases
and controls. First generate logical arrays for selecting out cases or
controls:\footnote{These commands assume that the subject support
  frame has the same number of rows as the SNP matrix and that they
  are in the same order.
Otherwise a slightly more complicated derivation is necessary.}
<<>>=
if.case <- subject.support$cc == 1
if.control <- subject.support$cc == 0
@
Now we recompute the genotype summary separately for cases and
controls:
<<>>=
sum.cases <- summary(snps.10[if.case, ])
sum.controls <- summary(snps.10[if.control, ])
@
and plot the call rates, using hexagonal binning and superimposing a
line of slope 1 through the origin:
<<fig=TRUE>>=
hb <- hexbin(sum.controls$Call.rate, sum.cases$Call.rate, xbin=50)
sp <- plot(hb)
hexVP.abline(sp$plot.vp, a=0, b=1, col="black")
@
There is no obvious difference in call rates. This is not a surprise,
since no such difference was built into the simulation. In the same
way we could look for differences between allele frequencies, again
superimposing a line of slope 1 through the origin:
<<fig=TRUE>>=
sp <- plot(hexbin(sum.controls$MAF, sum.cases$MAF, xbin=50))
hexVP.abline(sp$plot.vp, a=0, b=1, col="white")
@
This is not a very effective way to look for associations, but if the
SNP calling algorithm has been run separately for cases and controls
this plot can be a useful diagnostic for things going wrong
({\it e.g.}\ different labelling of clusters).

It should be stressed that, for real data, the plots described above
would usually show many more outliers. Our simulation did not model
the various biases and genotype failures that affect real studies.

The fastest tool for carrying out simple tests for association, taking
the SNPs one at a time, is {\tt single.snp.tests}. The output from
this function is a data frame with one line of data for each SNP.
Running this on our data and summarising the results:
<<>>=
tests <- single.snp.tests(cc, data = subject.support,
                          snp.data = snps.10)
@
Some words of explanation are required. In the call, the
{\tt snp.data=} argument is mandatory and gives the name of the matrix
containing the genotype data. The {\tt data=} argument gives the name
of the data frame that contains the remaining arguments --- usually
the subject support data frame\footnote{This is not mandatory --- we
  could have made {\tt cc} available in the global environment.
  However, we would then have to be careful that its values are in the
  right order; by specifying the data frame, the correct order is
  enforced by checking the row names for the {\tt data} and
  {\tt snp.data} arguments.}. Let us now see what has been calculated:
<<>>=
summary(tests)
@
We have, for each SNP, chi-squared tests on 1 and 2 degrees of freedom
(df), together with $N$, the number of subjects for whom data were
available. The 1 df test is the familiar Cochran--Armitage test for
codominant effect, while the 2 df test is the conventional Pearsonian
test for the $3\times 2$ contingency table. The large number of
{\tt NA} values for the latter test reflects the fact that, for these
SNPs, the minor allele frequency was such that one homozygous genotype
did not occur in the data.

We will probably wish to restrict our attention to SNPs that pass
certain criteria. For example
<<>>=
use <- snpsum$MAF > 0.01 & snpsum$z.HWE^2 < 200
@
(The Hardy--Weinberg filter is ridiculous and reflects the strange
characteristics of these simulated data. In real life you might want
to use something like 16, equivalent to a 4SE cut-off.)
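For real data, then, the corresponding filter might look like the
following sketch (ours, not part of the original exercise; the name
{\tt use.realistic} is hypothetical). Note that $z^2 < 16$ is
equivalent to $|z| < 4$:
\begin{verbatim}
## Illustrative only -- not run in this exercise
use.realistic <- snpsum$MAF > 0.01 & snpsum$z.HWE^2 < 16
\end{verbatim}
Here, however, we retain the looser filter defined above.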
To see how many SNPs pass this filter:
<<>>=
sum(use)
@
We will now throw away the discarded test results and save the
positions of the remaining SNPs:
<<>>=
tests <- tests[use, ]
position <- snp.support[use, "position"]
@
We now calculate $p$-values for the Cochran--Armitage tests and plot
minus logs (base 10) of the $p$-values against position, with a
horizontal line corresponding to $p = 10^{-6}$:
<<fig=TRUE>>=
p1 <- pchisq(tests$chi2.1df, df = 1, lower.tail = FALSE)
sp <- plot(hexbin(position, -log10(p1), xbin=50))
hexVP.abline(sp$plot.vp, h=-log10(1e-06), col="black")
@
Clearly there are far too many ``significant'' results, an impression
which is made even clearer by the quantile-quantile (QQ) plot:
<<fig=TRUE>>=
qq.chisq(tests$chi2.1df, df = 1)
@
The three numbers returned by this command are the number of tests
considered, the number of outliers falling beyond the plot boundary,
and the slope of a line fitted to the smallest 90\% of values. The
``concentration band'' for the plot is shown in grey. This region is
defined by upper and lower probability bounds for each order
statistic. The default is to use the 2.5\% and 97.5\%
bounds\footnote{Note that this is not a simultaneous confidence
  region; the probability that the plot will stray outside the band at
  some point exceeds 95\%.}.

This over-dispersion of chi-squared values was built into our
simulation. The data were constructed by re-sampling individuals from
{\em two} groups of HapMap subjects, the CEU sample (of European
origin) and the JPT$+$CHB sample (of Asian origin). 55\% of the cases
were of European ancestry, as compared with only 45\% of the controls.
We can deal with this by stratification of the tests, achieved by
adding the {\tt stratum} argument to the call to
{\tt single.snp.tests} (otherwise the following commands are as
before):
<<fig=TRUE>>=
tests <- single.snp.tests(cc, stratum, data = subject.support,
                          snp.data = snps.10)
tests <- tests[use, ]
p1 <- pchisq(tests$chi2.1df, df = 1, lower.tail = FALSE)
sp <- plot(hexbin(position, -log10(p1), xbin=50))
hexVP.abline(sp$plot.vp, h=-log10(1e-06), col="black")
@
<<fig=TRUE>>=
qq.chisq(tests$chi2.1df, df = 1)
@
Most of the over-dispersion of test statistics has been removed (the
residual is probably due to ``cryptic relatedness'' owing to the way
in which the data were simulated).

Now let us find the names and positions of the 10 most significant
SNPs. The first step is to compute an array which gives the positions
at which the first, second, third, {\it etc.}\ smallest $p$-values are
to be found:
<<>>=
ord <- order(p1)
top10 <- ord[1:10]
top10
@
We now list the 1 df $p$-values, the corresponding SNP names, and
their positions on the chromosome:
<<>>=
names <- rownames(tests)
p1[top10]
names[top10]
position[top10]
@
The most associated SNPs lie within three small regions of the
chromosome. To concentrate on the rightmost region (the most
associated region on the left contains just one SNP), we'll first sort
the names of the SNPs into position order along the chromosome and
select those lying in the region approximately one megabase either
side of the second most associated SNP:
<<>>=
posord <- order(position)
position <- position[posord]
names <- names[posord]
local <- names[position > 9.6e+07 & position < 9.8e+07]
@
The variable {\tt posord} contains the permutation necessary to sort
SNPs into position order, and {\tt names} and {\tt position} have now
been reordered in this manner. The variable {\tt local} contains the
names of the SNPs in the selected 2 megabase region.
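As a quick sanity check (ours, not part of the original exercise), you
could confirm that the reordering has worked and count the SNPs in the
selected window:
\begin{verbatim}
## Illustrative only -- not run in this exercise
is.unsorted(position)   # should now be FALSE
length(local)           # number of SNPs in the 2 megabase window
\end{verbatim}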
Now create a matrix containing just these SNPs, in position order, and
compute the linkage disequilibrium (LD) between them:
<<>>=
snps.2mb <- snps.10[, local]
ld.2mb <- ld.snp(snps.2mb)
@
A plot of the $D'$ values across the region may be written to a file
(in encapsulated postscript format) as follows:
\begin{verbatim}
plot(ld.2mb, file="ld2.eps")
\end{verbatim}
This can be viewed (outside R) using a postscript viewer such as
``gv'' or ``ggv''. Alternatively it can be converted to a .pdf file
and viewed in a pdf viewer such as ``acroread''. The associated SNPs
fall in a region of tight LD towards the middle of the plot.

Next we shall estimate the size of the effect at the most associated
SNP in each region (rs870041, rs7088765, rs1916572). In the following
commands, we extract each of these SNPs from the matrix as a numerical
variable (coded 0, 1, or 2) and then, using the {\tt glm()} function,
carry out a logistic regression of case--control status on this
numerical coding of the SNP and upon stratum. The variable
{\tt stratum} must be included in the regression in order to allow for
the different population structure of cases and controls. We first
attach {\tt subject.support} so that we can refer to the {\tt cc} and
{\tt stratum} variables directly:
<<>>=
attach(subject.support)
top <- snps.10[, "rs870041"]
top <- as.numeric(top)
glm(cc ~ top + stratum, family = "binomial")
@
The coefficient of {\tt top} in this regression is estimated as
0.5048, equivalent to a relative risk of $\exp(0.5048) = 1.657$. For
the other top SNPs we have:
<<>>=
top2 <- snps.10[, "rs7088765"]
top2 <- as.numeric(top2)
glm(cc ~ top2 + stratum, family = "binomial")
top3 <- snps.10[, "rs1916572"]
top3 <- as(top3, "numeric")
glm(cc ~ top3 + stratum, family = binomial)
@
So the relative risks are, respectively, $\exp(-0.4097) = 0.664$ and
$\exp(0.3783) = 1.460$.

Finally, you might like to repeat the analysis above using the 2 df
tests. The conclusion would have been much the same. A word of caution,
however; with real data, the 2 df test is less robust against
artifacts due to genotyping error. On the other hand, it is much more
powerful against recessive or near-recessive variants.

\section*{Advanced topics: multi-locus tests}
There are two other functions for carrying out association tests
({\tt snp.lhs.tests()} and {\tt snp.rhs.tests()}) in the package.
These are somewhat slower, but much more flexible. For example, the
former function allows one to test for differences in allele
frequencies between more than two groups. An important use of the
latter function is to carry out tests using {\em groups} of SNPs
rather than single SNPs. We shall explore this use in the final part
of the exercise.

A prerequisite to multi-locus analyses is to decide on how SNPs should
be grouped in order to ``tag'' the genome rather more completely than
by use of single markers. The {\tt snpMatrix} package will eventually
contain tools to compute such groups, for example by using HapMap
data. The function {\tt ld.snp()}, which we encountered earlier, will
be an essential tool in this process. However, this work is not
complete and, for now, we demonstrate the testing tool by grouping the
27,828 SNPs we have decided to use into 20~kb blocks.
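Before doing this with the real positions, a toy illustration (ours,
not part of the original exercise) may help show how {\tt cut()} and
{\tt split()} combine to group indices into bins:
\begin{verbatim}
## Toy illustration only: group the indices of x into bins of width 10
x <- c(3, 12, 15, 27, 31)
split(seq_along(x), cut(x, seq(0, 40, 10)))
\end{verbatim}
Each element of the resulting list holds the indices of the values
falling in one bin.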
The following commands compute such a grouping, tabulate the block
sizes, and remove empty blocks:
<<>>=
blocks <- split(posord, cut(position, seq(100000, 135300000, 20000)))
bsize <- sapply(blocks, length)
table(bsize)
blocks <- blocks[bsize > 0]
@
You can check that this has worked by listing the column positions of
the first 20 SNPs together with those contained in the first five
blocks:
<<>>=
posord[1:20]
blocks[1:5]
@
Note that these positions refer to the reduced set of SNPs after
application of the filter on MAF and HWE. Therefore, before proceeding
further, we create a new matrix of SNP genotypes containing only these
27,828 SNPs:
<<>>=
snps.use <- snps.10[, use]
remove(snps.10)
@
The command to carry out the tests on these groups, controlling for
the known population structure differences, is:
<<>>=
mtests <- snp.rhs.tests(cc ~ stratum, family = "binomial",
                        data = subject.support, snp.data = snps.use,
                        tests = blocks)
summary(mtests)
@
The first argument, together with the second, specifies the model
which corresponds to the null hypothesis. In this case we have allowed
for the variation in ethnic origin ({\tt stratum}) between cases and
controls. We complete the analysis by calculating the $p$-values and
plotting minus their logs (base 10):
<<fig=TRUE>>=
pm <- pchisq(mtests$Chi.squared, mtests$Df, lower.tail = FALSE)
plot(hexbin(-log10(pm), xbin=50))
@
The same associated region is picked out, albeit with a rather larger
$p$-value; in this case the multiple df test cannot be as powerful as
the 1 df test, since the simulation ensured that the ``causal'' locus
was actually one of the SNPs typed on the Affymetrix platform. QQ
plots are somewhat more difficult, since the tests are on differing
degrees of freedom. This difficulty is neatly circumvented by noting
that, because $p$ is uniformly distributed on $(0,1)$ under the null
hypothesis, $-2\log p$ is then distributed as chi-squared on 2 df:
<<fig=TRUE>>=
qq.chisq(-2 * log(pm), df = 2)
@
\end{document}