%\VignetteIndexEntry{Classes Used in the Oligo Packages}
%\VignetteDepends{oligo}
%\VignetteKeywords{Expression, SNP, Affymetrix, NimbleGen, Oligonucleotide Arrays}
%\VignettePackage{oligo}
\documentclass{article}
\usepackage{hyperref, amsfonts}

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rcode}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textsf{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\oligo}{\Rpackage{oligo }}

\begin{document}
\title{Classes Used in the Oligo Package}
\date{March, 2007}
\author{Benilton Carvalho}
\maketitle

\section{Introduction}

This document describes the classes used in the \oligo package.
The \oligo package uses essentially two groups of classes:
\begin{itemize}
\item Static data classes: these hold chip-specific information. Each
  chip has its own annotation, which is shared across all experiments
  that use that array. These classes are generated by the
  \Rpackage{makePlatformDesign} and \Rpackage{pdInfoBuilder} packages.
\item Experimental data classes: these classes hold the experimental
  data, i.e., the CEL and XYS files that the user has. All the
  experimental data classes derive from \Rclass{eSet}, defined in
  \Rpackage{Biobase}.
\end{itemize}

The \Rclass{platformDesign} class is one of the static data classes and
is generated by the \Rpackage{makePlatformDesign} package. It is a
container for the chip-specific information. We are transitioning the
creation of the chip-specific packages to \Rpackage{pdInfoBuilder},
which makes more efficient use of memory (via SQLite) and is much more
flexible than the environment approach used by
\Rpackage{makePlatformDesign}.

\section{\Rclass{platformDesign} Class}

The \Rclass{platformDesign} class is the container for information on
the expression (NimbleGen), tiling (Affymetrix, NimbleGen) and exon
(Affymetrix) arrays.
It contains the following slots:

\begin{table}[h]
  \centering
  \begin{tabular}{|l|l|}
    \hline
    Slot & Type \\
    \hline
    \Robject{manufacturer} & \Rclass{character} \\
    \Robject{genomebuild} & \Rclass{character} \\
    \Robject{featureInfo} & \Rclass{environment} \\
    \Robject{featureTypeDescription} & \Rclass{list} \\
    \Robject{type} & \Rclass{character} \\
    \Robject{nrow} & \Rclass{numeric} \\
    \Robject{ncol} & \Rclass{numeric} \\
    \Robject{nwells} & \Rclass{numeric} \\
    \Robject{lookup} & \Rclass{data.frame} \\
    \Robject{indexes} & \Rclass{list} \\
    \Robject{platforms} & \Rclass{character} \\
    \hline
  \end{tabular}
  \caption{Description of the \Rclass{platformDesign} class}
  \label{tab:platformDesign}
\end{table}

\begin{itemize}
\item \Robject{manufacturer}: lowercase string containing the name of
  the manufacturer of the array (e.g., \Rcode{``affymetrix''} or
  \Rcode{``nimblegen''}).
\item \Robject{genomebuild}: lowercase string containing the genome
  release information in UCSC notation, as described at
  \url{http://genome.ucsc.edu/FAQ/FAQreleases#release1}.
\item \Robject{featureInfo}: an environment containing vectors of the
  same length, which together fully characterize the array being used.
  See details below.
\item \Robject{type}: a string describing the type of the array (e.g.,
  \Rcode{``expression''}, \Rcode{``tiling''}, \Rcode{``exon''},
  \Rcode{``SNP''}).
\item \Robject{nrow} and \Robject{ncol}: array dimensions.
\item \Robject{nwells}: number of wells (specific to NimbleGen data).
\item \Robject{lookup}: data.frame used to map features in complex
  NimbleGen designs.
\item \Robject{indexes}: no longer used. To be removed.
\item \Robject{platforms}: no longer used. To be removed.
\end{itemize}

\subsection{Details on the \Robject{featureInfo} slot}

The \Robject{featureInfo} slot is the home for the majority of the
information used by \Rpackage{oligo}.
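
To make the idea of an environment of equal-length vectors concrete,
the following base-R sketch builds a toy stand-in for
\Robject{featureInfo}. It is an illustration only, not code from
\Rpackage{oligo}: the four features and their values are made up, and
only a few of the field names listed in this subsection are used.

\begin{verbatim}
# Toy stand-in for a featureInfo environment (hypothetical values).
featureInfo <- new.env()
assign("X", c(0L, 1L, 0L, 1L), envir = featureInfo)
assign("Y", c(0L, 0L, 1L, 1L), envir = featureInfo)
assign("feature_set_name", c("ps1", "ps1", "ps2", "ps2"),
       envir = featureInfo)
assign("feature_type", factor(c("PM", "MM", "PM", "MM")),
       envir = featureInfo)

# Every vector has the same length: element i describes feature i.
lens <- sapply(ls(featureInfo),
               function(nm) length(get(nm, envir = featureInfo)))
stopifnot(length(unique(lens)) == 1)

# Collect every field for the third feature with a single index.
third <- lapply(mget(ls(featureInfo), envir = featureInfo),
                function(v) v[3])
\end{verbatim}

Because element $i$ of every vector refers to the same feature, a
single index is enough to retrieve all the annotation for that feature.
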
\Robject{featureInfo} is an \Rclass{environment} containing the
following vectors:

\begin{table}[h]
  \centering
  \begin{tabular}{|l|c|c|c|}
    \hline
    \textbf{Field} & \textbf{Expression} & \textbf{Tiling} & \textbf{Exon} \\
    \hline
    \Robject{X} and \Robject{Y} & \checkmark & \checkmark & \checkmark \\
    \hline
    \Robject{feature\_set\_name} & \checkmark & \checkmark & \checkmark \\
    \hline
    \Robject{feature\_ID} & \checkmark & \checkmark & \checkmark \\
    \hline
    \Robject{feature\_type} & \checkmark & \checkmark & \checkmark \\
    \hline
    \Robject{target\_strand} & \checkmark & \checkmark & \checkmark \\
    \hline
    \Robject{sequence} & \checkmark & \checkmark & \checkmark \\
    \hline
    \Robject{order\_index} & \checkmark & \checkmark & \checkmark \\
    \hline
    \Robject{length} & & \checkmark & \checkmark \\
    \hline
    \Robject{chromosome} & & \checkmark & \\
    \hline
    \Robject{ambiguous\_feature} & & \checkmark & \\
    \hline
    \Robject{position} & & \checkmark & \\
    \hline
    \Robject{location} & \checkmark & & \checkmark \\
    \hline
    \Robject{atomID} & & & \checkmark \\
    \hline
    \Robject{gc\_count} & & & \checkmark \\
    \hline
  \end{tabular}
  \caption{Fields in \Robject{featureInfo}}
  \label{tab:featureInfo}
\end{table}

\begin{itemize}
\item \Robject{X} and \Robject{Y}: X/Y coordinates on the array.
  Class: \Rclass{integer}.
\item \Robject{feature\_set\_name}: name of the feature set (probeset).
  Class: \Rclass{character}.
\item \Robject{feature\_ID}: ID matching a PM feature to its MM
  feature. Class: \Rclass{integer}.
\item \Robject{feature\_type}: type of the feature (PM/MM).
  Class: \Rclass{factor}.
\item \Robject{target\_strand}: target strandedness (antisense/sense).
  Class: \Rclass{factor}.
\item \Robject{sequence}: probe sequence. Class: \Rclass{character}.
\item \Robject{order\_index}: row number of the feature when the fields
  are ordered by X/Y location; see the discussion of
  \Robject{order\_index} later in this section.
\item \Robject{length}: probe length. Class: \Rclass{integer}.
\item \Robject{chromosome}: chromosome (e.g., chr1/chr22/chrX).
  Class: \Rclass{character}.
\item \Robject{ambiguous\_feature}: indicator of whether the sequence
  maps to more than one genomic location.
  Class: \Rclass{logical}.
\item \Robject{position}: genomic location within the chromosome.
  Class: \Rclass{numeric}.
\item \Robject{location}: genomic location within the chromosome. To be
  removed and merged with \Robject{position}.
\item \Robject{atomID}: pairing key between PM and MM features.
\item \Robject{gc\_count}: number of GC bases. To be removed, as this
  can be obtained from the sequence information.
\end{itemize}

\subsubsection{Particularities of Tiling Arrays}

For tiling arrays, I have been using the genomic position as
\Robject{feature\_set\_name}, but it is not uncommon for a probe
sequence to match $k>1$ genomic positions. In such cases,
\Robject{feature\_set\_name} is set to the concatenation of the $k$
genomic positions, using \Robject{``;''} as the separator, and
\Robject{ambiguous\_feature} is set to \Robject{TRUE}. For example:

\begin{table}[h]
  \centering
  \begin{tabular}{|c|c|c|c|}
    \hline
    \Robject{sequence} & \Robject{position} & \Robject{feature\_set\_name} & \Robject{ambiguous\_feature} \\
    \hline
    \Robject{AAATC...GCCAT} & 12345 & \Robject{``12345''} & \Robject{FALSE} \\
    \Robject{CCACG...ATTCC} & 34567 / 87654 & \Robject{``34567;87654''} & \Robject{TRUE} \\
    \hline
  \end{tabular}
  \caption{Naming convention for tiling arrays}
  \label{tab:namingTiling}
\end{table}

An even more effective naming convention would be
\Robject{CHRnnPmmmmmm}, which would be more robust for designs that
involve multiple chromosomes.

\subsubsection{Particularities of Exon Arrays}

Basic support for exon arrays is offered by \Rpackage{oligo}.

\subsubsection{Particularities of SNP Arrays}

The data packages for SNP arrays are now built via the
\Rpackage{pdInfoBuilder} package. The packages for the Affymetrix 100K
and 500K sets are available via Bioconductor.

\subsubsection{About \Robject{order\_index}}

In the final version of the data packages, the fields described in
Table~\ref{tab:featureInfo} are ordered by
\Robject{feature\_set\_name}, \Robject{feature\_type} and
\Robject{target\_strand}.
Note that this breaks the link between the intensity file (which is
often ordered by X/Y location) and the annotation available in the
\Rclass{platformDesign} object. To keep this link, we initially order
the \Rclass{platformDesign} object by X/Y location, so that it matches
the intensity files. Then we add the field \Robject{order\_index},
which is simply the row number. Later, the \Robject{featureInfo}
object is reordered by \Robject{feature\_set\_name},
\Robject{feature\_type} and \Robject{target\_strand}. Because
\Robject{order\_index} records the original row of each feature, we can
still map the intensities correctly to their probe-level annotations.

\section{\Rclass{DBPDInfo} Class}

The \Rclass{DBPDInfo} class is the database-backed counterpart of the
\Rclass{platformDesign} class. Table~\ref{DBPDInfo} describes the class
structure.

\begin{table}[h]
  \centering
  \begin{tabular}{|l|l|}
    \hline
    Slot & Type \\
    \hline
    \Robject{getdb} & \Rclass{function} \\
    \Robject{tableInfo} & \Rclass{data.frame} \\
    \Robject{geometry} & \Rclass{integer} \\
    \Robject{manufacturer} & \Rclass{character} \\
    \Robject{genomebuild} & \Rclass{character} \\
    \hline
  \end{tabular}
  \caption{Description of the \Rclass{DBPDInfo} class}
  \label{DBPDInfo}
\end{table}

\begin{itemize}
\item \Robject{getdb}: function that accesses the external database (we
  use SQLite, via RSQLite).
\item \Robject{tableInfo}: a data.frame with two columns (\Robject{tbl}
  and \Robject{row\_count}). This data.frame contains the name and the
  number of rows of each table available in the database.
\item \Robject{geometry}: an integer vector of length 2, containing the
  number of rows and columns of the array.
\item \Robject{manufacturer}: a string with the manufacturer's name.
\item \Robject{genomebuild}: a string with the genome build
  information.
\end{itemize}

\section{\Rclass{FeatureSet} Class}

The \Rclass{FeatureSet} class is a virtual class to be used with
feature-level data and is derived from the \Rclass{eSet} class.
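
As a rough illustration of what a virtual parent class implies, the
base-R sketch below defines a toy S4 hierarchy with the
\Rpackage{methods} package. The class names \texttt{ToyFeatureSet} and
\texttt{ToyExpressionFeatureSet} and the \texttt{manufacturer} slot are
hypothetical and are not the definitions used by \Rpackage{oligo}; in
particular, the toy parent does not extend \Rclass{eSet}.

\begin{verbatim}
library(methods)

# Hypothetical virtual parent: it cannot be instantiated directly.
setClass("ToyFeatureSet",
         representation("VIRTUAL", manufacturer = "character"))

# Hypothetical concrete child that inherits the parent's slots.
setClass("ToyExpressionFeatureSet", contains = "ToyFeatureSet")

x <- new("ToyExpressionFeatureSet", manufacturer = "affymetrix")

# The child object is also an instance of the virtual parent.
is_child   <- is(x, "ToyFeatureSet")
is_virtual <- isVirtualClass("ToyFeatureSet")

# Instantiating the virtual class itself signals an error.
direct <- tryCatch(new("ToyFeatureSet"),
                   error = function(e) "error")
\end{verbatim}

A virtual class only provides a common type and shared slots for the
concrete classes that extend it, which is the role
\Rclass{FeatureSet} plays for feature-level data.
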
The following classes derive from \Rclass{FeatureSet}:
\begin{itemize}
\item \Rclass{ExpressionFeatureSet}: for expression arrays;
\item \Rclass{SnpFeatureSet}: for SNP arrays;
\item \Rclass{ExonFeatureSet}: for exon arrays;
\item \Rclass{TilingFeatureSet}: for tiling arrays.
\end{itemize}

\section{\Rclass{SnpQSet} Class}

The \Rclass{SnpQSet} class is created by the \Rmethod{snprma()} method.
It holds the summarized information for SNP data in four matrices:
\begin{itemize}
\item \Robject{antisenseThetaA}: summarized data at the SNP level for
  the antisense strand and allele A;
\item \Robject{antisenseThetaB}: summarized data at the SNP level for
  the antisense strand and allele B;
\item \Robject{senseThetaA}: summarized data at the SNP level for the
  sense strand and allele A;
\item \Robject{senseThetaB}: summarized data at the SNP level for the
  sense strand and allele B.
\end{itemize}

An object of this class is the expected input to the genotyping
algorithm, \Rmethod{crlmm()}.

\section{\Rclass{SnpCallSet} Class}

The \Rclass{SnpCallSet} class is a container for the output of a
genotyping algorithm, e.g., \Rmethod{crlmm()}. It contains two
matrices, \Robject{calls} and \Robject{callsConfidence}, which hold,
respectively, the genotype calls and the associated measures of
confidence.

\section{\Rclass{SnpCopyNumberSet} Class}

The \Rclass{SnpCopyNumberSet} class is a container for the output of
copy number analysis. It contains two matrices, \Robject{copyNumber}
and \Robject{copyNumberConfidence}, which hold, respectively, the copy
number estimates and the associated measures of confidence.

\end{document}