%\VignetteIndexEntry{Oligo Vignette} %\VignetteDepends{oligo, hapmap100kxba, pd.mapping50k.xba240} %\VignetteKeywords{Expression, SNP, Affymetrix, NimbleGen, Oligonucleotide Arrays} %\VignettePackage{oligo} \documentclass{article} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rcode}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textsf{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\oligo}{\Rpackage{oligo }} \usepackage{graphicx} \begin{document} \title{An Introduction to the Oligo Package} \date{March, 2007} \author{Benilton Carvalho} \maketitle <>= options(width=60) options(continue=" ") options(prompt="R> ") @ \section{Introduction} The \oligo package is designed to support all microarray designs provided by Affymetrix and NimbleGen: expression, tiling, SNP and exon arrays. As of now, chip-specific packages are built via \Rpackage{makePlatformDesign} and transitioning to the \Rpackage{pdInfoBuilder} package, which creates the data packages for the Affymetrix SNP arrays. \section{Analyzing Affymetrix SNP Arrays} Genotyping can be performed using \oligo and you will need: \begin{itemize} \item \oligo and its dependencies; \item Chip specific data package, eg. \Rpackage{pd.mapping50k.xba240}: package that contains the array specifications and SNP annotation. \item CEL files. \end{itemize} We will start by loading the \oligo package and importing the CEL files available on the \Rpackage{hapmap100kxba} package. An output directory should also be defined and that is the location where the summary files, including genotype calls and confidences are stored. <>= library("oligo") library("hapmap100kxba") pathCelFiles <- system.file("celFiles", package="hapmap100kxba") fullFilenames <- list.celfiles(path=pathCelFiles, full.names=TRUE) outputDir <- file.path(getwd(), "crlmmTest") @ The density of the SNP arrays increased in a way that it is not interesting to store the intensity matrix in memory; efficient methods to handle this situations have been developed and batches of SNPs are analyzed one at a time. The genotyping algorithm implemented in \oligo is described in \cite{Carvalho2006}. The whole process can be described in three steps: \begin{enumerate} \item Normalization against a reference distribution; \item Summarization via SNPRMA; \item Genotype calling via CRLMM. \end{enumerate} The normalization step is done by equalizing the observed intensities accordingly to a reference distribution, based on the 270 Hapmap samples. These samples are available to the public at http://www.hapmap.org. The normalization step will remove systematic biases, which are not due to biological factors. The SNPRMA algorithm is responsible for summarizing the data. The design of the SNP arrays is such that up to the 250K density, there are probes for both alleles on both strands. On the SNP 6.0 platform, given a SNP, there are probes only on one strand. Therefore, for the designs up to 250K, SNPRMA will create summaries at the SNP-Allele-Strand-level. For each SNP there are four numbers $(\theta_{A-}, \theta_{B-}, \theta_{A+}, \theta_{B+})$, which are proportional to the log-intensities in each of these combinations of allele and strand ($-$: antisense; $+$: sense). They are represented by four matrices: \Robject{antisenseThetaA}, \Robject{antisenseThetaB}, \Robject{senseThetaA} and \Robject{senseThetaB}, which are the components of the \Rclass{SnpQSet} object. One can extract these objects using accessors of the same name. For the SNP 6.0 array, a similar approach is taken, but the summaries are given at the SNP-Allele-level and there will be only $(\theta_A, \theta_B)$ estimates for each SNP. An object of class \Rclass{SnpCnvQSet} is returned and contains two matrices: \Robject{thetaA} and \Robject{thetaB}. Accessors with the same name are provided. Average intensities and log-ratios are defined as across allele and within strand, ie: \begin{eqnarray} A_{s} & = & \frac{\theta_{A, s}+\theta_{B, s}}{2} \\ M_{s} & = & \theta_{A, s} - \theta_{B, s}, \end{eqnarray} where $s$ defines the strand (antisense or sense). These quantities can be obtained via \Rmethod{getA()} and \Rmethod{getM} methods, which return high-dimensional arrays with dimensions corresponding to SNP's, samples and strands, respectively. These measures are later used for genotyping. The CRLMM algorithm can be applied on a \Rclass{SnpQSet} or \Rclass{SnpCnvQSet} object in order to produce genotype calls. It involves running a mixture of regressions via EM algorithm to adjust for average intensity and fragment length in the log-ratio scale. These adjustments may take long time to run, depending on the combination of number of samples and computer resources available. %% \textbf{A word of warning}: the \Rmethod{crlmm()} method searches for a %% variable \Robject{gender} in the \Rclass{phenoData} slot of the %% \Rclass{SnpQSet} object. If it fails to find that variable, it will try %% estimate the gender from the data. If there is not enough discrimination %% power to estimate the gender, the following error message will be %% returned: %% %% {\bf empty cluster: try a better set of initial centers} %% %% Increasing the sample size is one of the possible solutions, although %% the preferred one is to have \Robject{gender} already defined in the %% \Rclass{phenoData} slot. %% The \Rclass{phenoData} slot includes covariates about the %% samples. Genotyping and copy number analyses often make use of gender %% information in order to provide more precise inferences. The code %% below exemplifies the creation of the \Rclass{phenoData} object. %% %% <>= %% aboutSamples <- data.frame(gender=c("female", "female", "male")) %% rownames(aboutSamples) <- basename(fullFilenames) %% aboutVars <- data.frame(labelDescription="male/female") %% rownames(aboutVars) <- "gender" %% pd <- new("AnnotatedDataFrame", %% data=aboutSamples, %% varMetadata=aboutVars) %% @ %% %% <>= %% crlmmOut <- justCRLMM(fullFilenames, phenoData=pd, verbose=FALSE) %% @ <>= crlmm(fullFilenames, outputDir, verbose=FALSE) @ The \Rfunction{crlmm} method does not return an object to the R session. Instead, it saves the objects to disk, as not all systems are guaranteed to meet the memory requirements that \Rclass{SnpCallSetPlus} (for 100K and 500K arrays) or \Rclass{SnpCnvCallSetPlus} (for SNP 5.0 and SNP 6.0 arrays) objects might need. For convenience, the \Rfunction{getCrlmmSummaries} will read the information from disk and make a \Rclass{SnpCallSetPlus} or \Rclass{SnpCnvCallSetPlus} object available to the user. %% The \Robject{crlmmOut} object above belongs to the \Rclass{SnpCallSet} %% class and contains the genotype calls and confidence measures %% associated to the calls, represented respectively by the %% \Robject{calls} and \Robject{callsConfidence} matrices. These matrices %% can be accessed using the methods of the same name as demonstrated %% below: <>= crlmmOut <- getCrlmmSummaries(outputDir) calls(crlmmOut)[1:5,1:2] callsConfidence(crlmmOut)[1:5,1:2] @ <>= crlmmCalls <- readSummaries("calls", outputDir) crlmmConf <- readSummaries("conf", outputDir) crlmmCalls[1:5, 1:2] crlmmConf[1:5, 1:2] @ The genotype calls are represented by 1 (AA), 2 (AB) and 3 (BB). The confidence is the predicted probability that the algorithm made the right call. Summaries generated by the algorithm can also be accessed from the R session. The options for summaries are ``alleleA'', ``alleleB'', ``alleleA-sense'', ``alleleA-antisense'', ``alleleB-sense'', ``alleleB-antisense''. The options ``alleleA'' and ``alleleB'' must be used \textbf{only} with SNP 5.0 and SNP 6.0 platforms. The remaining options must be used with 50K and 250K arrays. <>= alleleAsense <- readSummaries("alleleA-sense", outputDir)[1:5,] alleleBsense <- readSummaries("alleleB-sense", outputDir)[1:5,] log.ratios.sense <- alleleAsense-alleleBsense log.ratios.sense[, 1:2] @ <>= unlink(outputDir, recursive=TRUE) @ \subsection{Exploring the Annotation Package} The user who is willing to make deeper investigation using the annotations provided for each SNP array can use SQL queries to access more other information that might not be directly exposed. The example below demonstrates how to see the available tables, fields and extract chromosome, physical location and cytoband for the first five SNP's (probes querying specific SNP's have names starting with the string ``SNP''). <<>>= conn <- db(pd.mapping50k.xba240) dbListTables(conn) dbListFields(conn, "featureSet") sql <- "SELECT man_fsetid, chrom, physical_pos FROM featureSet WHERE man_fsetid LIKE 'SNP%' LIMIT 5" dbGetQuery(conn, sql) @ \bibliography{oligoVignette}{} \bibliographystyle{plain} \section{Details} This document was written using: <<>>= sessionInfo() @ \end{document}