%\VignetteIndexEntry{Preparing Affymetrix Data}
%\VignetteDepends{GWASTools, GWASdata}

\documentclass[11pt]{article}
\usepackage{fullpage}
\usepackage{Sweave}
\usepackage{amsmath}  
\usepackage{graphicx}
\usepackage[utf8]{inputenc}
\usepackage[pdftex,plainpages=false, letterpaper, bookmarks, bookmarksnumbered, colorlinks, linkcolor=blue, citecolor=blue, filecolor=blue, urlcolor=blue]{hyperref}

% Different font in captions
\newcommand{\captionfonts}{\footnotesize}

\begin{document}

\title{Preparing Affymetrix Data}
\author{GENEVA Coordinating Center\\
  Department of Biostatistics\\
  University of Washington}

\maketitle

\SweaveOpts{keep.source=TRUE, eps=FALSE} 


\section{Overview}
This vignette describes how to prepare Affymetrix data for use in
GWASTools.  After following the examples here for prepapring the data,
the analysis procedure is the same as presented in
the GWAS Data Cleaning vignette.


\section{Creating the SNP Annotation Data Object}
\label{create_snpann}

All of the functions in GWASTools require a minimum set of variables in the SNP annotation data object.  The minimum required variables are

\begin{itemize}
\item{{\tt snpID}, a unique integer identifier for each SNP}
\item{{\tt chromosome}, an integer mapping for each chromosome, with values 1-27, mapped in order from 1-22, 23=X, 24=XY (the pseudoautosomal region), 25=Y, 26=M (the mitochondrial probes), and 27=U (probes with unknown positions)}
\item{{\tt position}, the base position of each SNP on the chromosome.}
\end{itemize}


We create the integer chromosome mapping for a few reasons.  The
chromosome is stored as an integer in the NetCDF files, so in order to
link the SNP annotation with the NetCDF file, we use the integer
values in the annotation as well.
For convenience when using GWASTools functions,
the chromosome variable is most times assumed to be an integer value.  Thus, for the sex chromosomes, we can simply
use the {\tt chromosome} values.  For presentation of results, it is important to have the mapping of the
integer values back to the standard designations for the chromosome
names, thus the {\tt getChromosome()} functions in the GWASTools
objects have a {\tt char=TRUE} option to return the characters 1-22,
X, XY, Y, M, U.
The position variable should hold all numeric values of the physical position of a probe.  {\em The SNP annotation file is assumed to list the probes in order of chromosome and 
position within chromosome.}

Note that for Affymetrix data, the rs ID is often not unique, as there
may be multiple probes for a given SNP.  The probe ID is usually the
unique SNP identifier.
Also, Affymetrix annotation generally has a separate column or columns
to indicate pseudoautosomal regions; some manipulation is usually
required to include this information within the chromosome column itself.

<<>>=
library(GWASTools)
library(GWASdata)
# Load the SNP annotation (simple data frame)
data(affy_snp_annot)
# Create a SnpAnnotationDataFrame
snpAnnot <- SnpAnnotationDataFrame(affy_snp_annot)
# names of columns
varLabels(snpAnnot)
# data
head(pData(snpAnnot))
# Add metadata to describe the columns
meta <- varMetadata(snpAnnot)
meta[c("snpID", "chromosome", "position", "rsID", "probeID"), 
  "labelDescription"] <- c("unique integer ID for SNPs",                      
  paste("integer code for chromosome: 1:22=autosomes,",
   "23=X, 24=pseudoautosomal, 25=Y, 26=Mitochondrial, 27=Unknown"), 
  "base pair position on chromosome (build 36)",
  "RS identifier",
  "unique ID from Affymetrix")
varMetadata(snpAnnot) <- meta
@ 


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Creating the Scan Annotation Data Object}
\label{create_sampann}

The scan annotation file holds attributes for each genotyping scan that are relevant to genotypic data
cleaning.  These data include processing variables such as tissue type, DNA extraction method, and genotype processing
batch.  They also include individual characteristics such as gender and race.  The initial sample
annotation file is created from the raw data supplied by the genotyping center and/or study investigator, providing a 
mapping from the raw data file(s) for each sample scan to other sample information such as sex,
coded as {\tt M} and {\tt F}, ethnicity,  
unique scan identifier, called {\tt scanID}, and unique subject identifier.  
Since a single subject may have been genotyped multiple times as a
quality control measure, it is important to distinguish between the
scanID (unique genotyping instance) and subjectID (person
providing a DNA sample).

<<>>=
# Load the scan annotation (simple data frame)
data(affy_scan_annot)
# Create a ScanAnnotationDataFrame
scanAnnot <- ScanAnnotationDataFrame(affy_scan_annot)
# names of columns
varLabels(scanAnnot)
# data
head(pData(scanAnnot))
# Add metadata to describe the columns
meta <- varMetadata(scanAnnot)
meta[c("scanID", "subjectID", "family", "father", "mother",
  "CoriellID", "race", "sex", "status", "genoRunID", "plate",
  "alleleFile", "chpFile"), "labelDescription"] <- 
   c("unique integer ID for scans", 
  "subject identifier (may have multiple scans)",
  "family identifier", 
  "father identifier as subjectID",
  "mother identifier as subjectID",
  "Coriell subject identifier",
  "HapMap population group",
  "sex coded as M=male and F=female",
  "simulated case/control status" ,
  "genotyping instance identifier",
  "plate containing samples processed together for genotyping chemistry",
  "data file with intensities",
  "data file with genotypes and quality scores")
varMetadata(scanAnnot) <- meta
@ 


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\section{Creating the NetCDF Files}


The data for genotype calls, allelic intensities and other variables such as BAlleleFrequency are stored as NetCDF files. 
More information on the NetCDF file format in general can be found at \\ \href{http://www.unidata.ucar.edu/software/netcdf/}{http://www.unidata.ucar.edu/software/netcdf/}.
The GWASTools package depends on the ncdf library.
Documentation for the ncdf R library can be found at \\
\href{http://cran.r-project.org/web/packages/ncdf/ncdf.pdf}{http://cran.r-project.org/web/packages/ncdf/ncdf.pdf}.
The ncdf library provides a convenient R interface to create, populate and extract data from NetCDF files. 

For each study, three different NetCDF files are  
created to be used in subsequent cleaning and analysis steps.   
This format is used for the ease with which multi-dimensional arrays of data can be stored and accessed.

All NetCDF files created have two dimensions, one called {\tt snp} and one titled {\tt sample}.  The {\tt snp}
dimension is of the same length as the number of probes that were released from the genotyping center and listed in the
SNP annotation file.  The 
{\tt sample} dimension is of length equal to the number of genotyping scans released as listed in the sample annotation file.
Further, all NetCDF files have three variables in common: {\tt sampleID, chromosome} and {\tt position}.
The {\tt sampleID} is used for indexing the columns of the two dimensional values stored in the NetCDF files
(genotype calls, for example).  The {\tt sampleID} ordering must match the {\tt scanID} values as listed
in the sample annotation file, see Section~\ref{create_sampann}.
The index to the SNP probes in the NetCDF file is the {\tt snpID}, which is stored as values of the SNP dimension.  Since {\tt snpID} is in chromosome and position order, these variables also provide a check on ordering and are often used to select subsets of SNPs for analysis.  Analogous to the sample ordering, these values must match the {\tt snpID} values listed
in the SNP annotation file, see \ref{create_snpann}.
To prevent errors in ordering samples or SNPs, the functions in the GWASTools package take as arguments R objects
which will return an error on creation if the sample and SNP
annotation does not match the NetCDF file.
We recommend always checking the ordering of these variables before writing new versions of the SNP or sample annotation
data files.

\subsection*{Genotype NetCDF Files}

The genotype NetCDF files store genotypic data in 0, 1, 2 format
indicating the number of ``A'' alleles in the genotype 
(i.e. AA=2, AB=1, BB=0 and missing=-1).  The conversion from AB format and forward strand (or other)
allele formats can be stored in the SNP annotation file.

The genotypic data are stored as a two-dimensional array, where rows are SNPs and columns are samples.
To store the genotype data, the raw data files are opened and checked to ensure the sample identifier from
the sample annotation file and the genotype data file match.  If no discrepencies exist, the probes
listed in the file are checked against the expected list of probes, then ordered and written to the NetCDF file.
This process iterates over each file (sample).  Diagnostics are stored as the process continues so that after the
data are written one can ensure the function performed as expected.   

\subsection*{Creating the Genotype NetCDF file}
                                     

The first step in creating a NetCDF file is to create a `shell' NetCDF file to which the 
data will be written.  The function {\tt ncdfCreate} creates the two dimensions and 
three common variables as described above. It also assigns values of the SNP dimension as {\tt snpID} values and populates the {\tt chromosome} and {\tt position} variables.

In this case, we also want the {\tt genotype} variable to be created, so the {\tt vars} argument must be set to 
{\tt "genotype"}. In the interest of computational feasibility, only 3 samples will be written in this step.

<<>>=
geno.nc.file <- "tmp.geno.nc"
# get snp annotation data frame for function
snp <- affy_snp_annot[,c("snpID", "chromosome", "position")]
ncdfCreate(ncdf.filename = geno.nc.file,
       snp.annotation = snp,
       variables = c("genotype"),
       array.name = "AffyGenomeWideSNP_6",
       genome.build = "36",
       n.samples = 3,
       precision = "single")
@ 

 Now the shell has been created so we can call the {\tt ncdfAddData} function to 
populate the NetCDF with the genotype data stored in the raw data files.  The data are written to the NetCDF file one sample at a time and, simultaneously, the corresponding sample identifier {\tt scanID} is written to the {\tt sampleID} variable. The {\tt chpFile}
variable from the sample annotation file
holds the name of the CHP file for each sample scan; 
these are the files we must read in to get genotype data for each sample.

<<>>=
# first 3 samples only
scan.annotation <- affy_scan_annot[1:3, c("scanID", "genoRunID", "chpFile")]
names(scan.annotation) <- c("scanID", "scanName", "file")
# indicate which column of SNP annotation is referenced in data files
snp.annotation <- affy_snp_annot[,c("snpID", "probeID")]
names(snp.annotation) <- c("snpID", "snpName")
@ 

The arguments to {\tt ncdfAddData} must be created. We need {\tt
      col.nums}, which is an integer vector indicating which columns
    of the raw text file contain variables for input.

<<>>=
col.nums <- as.integer(c(2,3))
names(col.nums) <- c("snp", "geno")
# Define a path to the raw data CHP text files which are read by 
#   ncdfAddData to access the raw genotypic data
path <- system.file("extdata", "affy_raw_data",
  package="GWASdata")
@ 

All required arguments have been defined so we are ready to call {\tt ncdfAddData}.
This function will take the previously created genotype NetCDF file and populate it with the genotype
data from the CHP text files and with the sample identifier {\tt scanID} corresponding to that CHP file in the sample annotation data.frame.
A set of diagnostic values are written and stored in {\tt diag.geno}, so we must look at those to
ensure no errors occurred.

<<>>=
diag.geno.file <- "diag.geno.RData"
diag.geno <- ncdfAddData(path = path,
  ncdf.filename = geno.nc.file, 
  snp.annotation = snp.annotation,
  scan.annotation = scan.annotation, 
  sep.type = "\t",
  skip.num = 1,
  col.total = 6,
  col.nums = col.nums,
  scan.name.in.file = -1,
  scan.start.index = 1,
  diagnostics.filename = diag.geno.file,
  verbose = FALSE)
# Look at the values included in the "diag.geno" object which holds 
#   all output from the function call
names(diag.geno)
# `read.file' is a vector indicating whether (1) or not (0) each file 
#   specified in the `files' argument was read successfully
table(diag.geno$read.file)
# `row.num' is a vector of the number of rows read from each file
table(diag.geno$row.num) 
# `sample.match' is a vector indicating whether (1) or not (0) 
#   the sample name inside the raw text file matches that in the 
#   sample annotation data.frame
table(diag.geno$sample.match)
# `snp.chk' is a vector indicating whether (1) or not (0) 
#   the raw text file has the expected set of SNP names
table(diag.geno$snp.chk)
# `chk' is a vector indicating whether (1) or not (0) all previous
#   checks were successful and the data were written to the NetCDF file
table(diag.geno$chk)
@ 

Although the diagnostic values indicated no issues, we will open
    the NetCDF file and extract a small sampling of data.  This
    illustrates the use of the {\tt NcdfGenotypeReader} class for
    retrieving data from a NetCDF file.

  
<<>>=
(genofile <- NcdfGenotypeReader(geno.nc.file))
# Take out genotype data for the first 3 samples and 
#   the first 5 SNPs
(genos <- getGenotype(genofile, snp=c(1,5), scan=c(1,3)))
# Close the NetCDF file
close(genofile)
@ 

Run the function {\tt ncdfCheckGenotype} to check that the
    NetCDF file contains the same data as the raw data files.

<<>>=
check.geno.file <- "check.geno.RData"
check.geno <- ncdfCheckGenotype(path = path,
  ncdf.filename = geno.nc.file, 
  snp.annotation = snp.annotation,
  scan.annotation = scan.annotation, 
  sep.type = "\t",
  skip.num = 1,
  col.total = 6,
  col.nums = col.nums,
  scan.name.in.file = -1,
  check.scan.index = 1:3,
  n.scans.loaded = 3,
  diagnostics.filename = check.geno.file,
  verbose = FALSE)
# Look at the values included in the "check.geno" object which holds 
#   all output from the function call
names(check.geno)
# 'snp.order' is a vector indicating whether (1) or not (0) the snp ids 
#   are in the same order in each file.
table(check.geno$snp.order)
# 'geno.chk' is a vector indicating whether (1) or not (0) the genotypes 
#   in the netCDF match the text file
table(check.geno$geno.chk)
@ 


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection*{Intensity NetCDF Files}


The intensity NetCDF files store quality scores and allelic intensity data for each SNP. 
The normalized X and Y intensities as well as the confidence scores are written to the NetCDF for all
samples, for all SNPs. (A separate NetCDF file will store the BAlleleFreq and LogRRatio data.)

Aside from the three variables held in common with all NetCDF files ({\tt sampleID, chromosome} and {\tt position}),
the intensity and quality data are written to the intensity NetCDF in a two dimensional format, with SNPs corresponding
to rows and samples corresponding to columns.
To write the intensity data, the raw data files are opened and the intensities and quality score
are read.  Like with the genotype data, if all sample and probe identifiers match between the data files
and the annotation files, the data are populated in the NetCDF and diagnostics are written.

Affymetrix data are provided in two files per genotyping scan.  The CHP file holds the genotype calls, used to create the 
genotype NetCDF file, as well as the confidence score, which is written to the {\tt quality} variable in
the intensity NetCDF file.
The normalized X and Y intensity data are stored in the {\tt allele\_summary} files in the format of two rows
per SNP, one for each allelic probe. 
A separate function {\tt ncdfAddIntensity} reads these data and writes them to the NetCDF file. 
Thus, when writing the intensity NetCDF file using Affymetrix data, there are two function calls
needed, each of which opens and reads from the two sets of files.

\subsection*{Creating the Intensity NetCDF file}
                       

The first step in creating the intensity NetCDF is to create a `shell' NetCDF
file to which the data will be written.  We can use the same function used for creating the shell
genotype NetCDF, {\tt ncdfCreate}.  The two dimensions and three common variables will be created
along with the X, Y and quality variables; the {\tt vars} argument is set to write these.
For a reasonable computation time, we will create the file to hold data for 3 samples only.

<<>>=
qxy.nc.file <- "tmp.qxy.nc"
# get snp annotation data frame for function
snp <- affy_snp_annot[,c("snpID", "chromosome", "position")]
ncdfCreate(ncdf.filename = qxy.nc.file,
       snp.annotation = snp,
       variables = c("quality","X","Y"),
       array.name = "AffyGenomeWideSNP_6",
       genome.build = "36",
       n.samples = 3,
       precision = "single")
@ 

		
We next call the {\tt ncdfAddData} function to populate the intensity NetCDF with the 
X and Y intensities and the quality score stored in the raw CHP text files. The sample identifier {\tt scanID} is also populated at the same time.  The {\tt chpFile}
variable from the sample annotation file holds the name of the CHP file for each sample scan.  
From this file we can get all needed data for the intensity NetCDF file.

<<>>=
# first 3 samples only
scan.annotation <- affy_scan_annot[1:3, c("scanID", "genoRunID", "chpFile")]
names(scan.annotation) <- c("scanID", "scanName", "file")
# indicate which column of SNP annotation is referenced in data files
snp.annotation <- affy_snp_annot[,c("snpID", "probeID")]
names(snp.annotation) <- c("snpID", "snpName")
@ 

The arguments to {\tt ncdfAddData} must be created. We need {\tt
      col.nums}, which is an integer vector indicating which columns
    of the raw text file contain variables for input.

<<>>=
col.nums <- as.integer(c(2,4))
names(col.nums) <- c("snp","qs")
# Define a path to the raw data CHP text files which are read by 
#   ncdfAddData to access the raw genotypic data
path <- system.file("extdata", "affy_raw_data",
  package="GWASdata")
@ 

All required arguments have been defined so we are ready to call {\tt ncdfAddData}.
This function will take the previously created intensity NetCDF file and populate it with 
the quality score from the CHP text files.  Recall the intensity values are stored in a different set of files 
so those will be populated in the next step.
A set of diagnostic values are written and stored in {\tt diag.qual}.

<<>>=
diag.qual.file <- "diag.qual.RData"
diag.qual <- ncdfAddData(path = path,
  ncdf.filename = qxy.nc.file, 
  snp.annotation = snp.annotation,
  scan.annotation = scan.annotation, 
  sep.type = "\t",
  skip.num = 1,
  col.total = 6,
  col.nums = col.nums,
  scan.name.in.file = -1,
  scan.start.index = 1,
  diagnostics.filename = diag.qual.file,
  verbose = FALSE)
@ 

As alluded to above, the normalized X and Y intensity values are stored in the {\tt allele files}.  
To write the intensities, we will call {\tt ncdfAddIntensity} with similar arguments to the {\tt ncdfAddData}
function.

<<>>=
scan.annotation <- affy_scan_annot[1:3, c("scanID", "genoRunID", "alleleFile")]
names(scan.annotation) <- c("scanID", "scanName", "file")
diag.xy.file <- "diag.xy.RData"
diag.xy <- ncdfAddIntensity(path = path,
  ncdf.filename = qxy.nc.file, 
  snp.annotation = snp.annotation, 
  scan.annotation = scan.annotation, 
  scan.start.index = 1, 
  n.consecutive.scans = 3,
  diagnostics.filename = diag.xy.file,
  verbose = FALSE)
@ 

We will open
    the NetCDF file and extract a small sampling of data.  In this
    case we use the {\tt NcdfIntensityReader} class.
 
<<>>=
# Open the NetCDF file we just created
(intenfile <- NcdfIntensityReader(qxy.nc.file))
# Take out the normalized X intensity values for the first 
#     5 SNPs for the first 3 samples
(xinten <- getX(intenfile, snp=c(1,5), scan=c(1,3)))
# Close the NetCDF file
close(intenfile)
@

Run the function {\tt ncdfCheckIntensity} to check that the
    NetCDF file contains the same data as the raw data files.

<<>>=
scan.annotation <- affy_scan_annot[1:5,
  c("scanID", "genoRunID", "chpFile", "alleleFile")]
names(scan.annotation) <- c("scanID", "scanName", "file", "inten.file")
check.qxy.file <- "check.qxy.RData"
check.qxy <- ncdfCheckIntensity(path = path,
  intenpath = path,
  ncdf.filename = qxy.nc.file, 
  snp.annotation = snp.annotation,
  scan.annotation = scan.annotation, 
  sep.type = "\t",
  skip.num = 1,
  col.total = 6,
  col.nums = col.nums,
  scan.name.in.file = -1,
  check.scan.index = 1:3,
  n.scans.loaded = 3,
  affy.inten = TRUE,
  diagnostics.filename = check.qxy.file,
  verbose = FALSE)
@ 


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\subsection*{BAlleleFrequency and LogRRatio NetCDF Files}


The BAlleleFrequency and LogRRatio NetCDF file stores these values for every sample by SNP.
For Affymetrix data, these values must be calculated by the user.  
For a thorough explanation and presentation of an application of these values, please refer to 
Peiffer, Daniel A., et al. (2006).\footnote{Peiffer, Daniel A., et al. High-resolution genomic profiling of chromosomal aberrations using
Infinium whole-genome genotyping. \textit{Genome Research} \textbf{16}, 1136-1148 (September 2006).}
          
For a given sample and SNP, R and $\theta$ are calculated using the X and Y intensities, where
\begin{equation}
\label{rthe}
R=X+Y 
\end{equation}
\begin{equation*}
\theta = \frac{2\arctan(Y/X)}{\pi}
\end{equation*}
$\theta$ corresponds to the polar coordinate angle and R is the sum of the normalized X and Y intensities (not, as one might assume, the magnitude of the polar coordinate vector).
It is from these values that we calculate the LogRRatio and BAlleleFrequency.

The LogRRatio is given below.  The expected value of R is derived from a plot of $\theta$ versus R for a given SNP.  
It is the predicted value of R derived from a line connecting the centers of the two nearest genotype clusters.  
\begin{equation}
 {\rm LogRRatio} =\log \left( \frac{R_{\text{observed values}}}{R_{\text{expected values}}}\right)
\end{equation} 
Variation in the LogRRatio across a single chromosome indicates possible duplication or deletion, and is an indication
of overall sample quality. 
 
The BAlleleFrequency is the frequency of the B allele in the population of cells from which the DNA is extracted.  
Each sample and SNP combination has a BAlleleFrequency value.  Note the BAlleleFrequency values vary for a subject
with each DNA extraction and tissue used. 
After all SNPs have been read and all samples have been clustered for a probe, 
the mean $\theta$ ``cluster'' value is calculated for each probe, for each of the three genotype clusters, 
resulting in $\theta_{AA}, \theta_{AB} \text{ and }\theta_{BB}$ for every probe.  
Then the $\theta$ value for each sample, call it $\theta_{n}$, 
is compared to $\theta_{AA}, \theta_{AB}\text{ and }\theta_{BB}$.
The BAlleleFrequency is calculated
\begin{equation*}
 \text{BAlleleFrequency}= 
\begin{cases}
0 & \text{if } \theta_{n} < \theta_{AA}\\
 \mbox{}\\
\frac{{\textstyle (1/2)(\theta_{n}-\theta_{AA})}}{{\textstyle \theta_{AB}-\theta_{AA}}} & \text{if } \theta_{AA} \leq \theta_{n} < \theta_{AB}\\
\mbox{}\\
\frac{{\textstyle 1}}{{\textstyle 2}}+\frac{{\textstyle (1/2)(\theta_{n}-\theta_{AB})}}
  {{\textstyle \theta_{BB}-\theta_{AB}}} & \text{if } \theta_{AB} \leq \theta_{n} < \theta_{BB}\\
 \mbox{}\\
1 & \text{if } \theta_{n} \geq \theta_{BB}
\end{cases}
\end{equation*}
A $\theta_{n}$ value of 0 or 1 corresponds to a homozygote genotype for sample $n$ at that
particular probe, and a $\theta_{n}$ value of 1/2 indicates a heterozygote genotype.
Thus, $BAlleleFrequency \in[0,1]$ for each probe.  Across a chromosome, three bands are expected, 
one hovering around 0, one around 1 and one around 0.5, and any deviation from this is considered aberrant.

We use the BAlleleFrequency and LogRRatio values to detect mixed samples or samples of low quality, 
as well as chromosomal duplications and deletions.  Samples that have a significantly large 
(partial or full chromosome) aberration for a particular chromosome as detected from the BAlleleFrequency 
values are recommended to be filtered out, for the genotype data are not reliable in these situations.  
Because of these applications, the BAlleleFrequency and LogRRatio values are a salient part of the data cleaning steps.

In addition to the three variables held in common with all NetCDF files ({\tt sampleID, chromosome} and {\tt position}),
the BAlleleFrequency and LogRRatio values are calculated and written to this NetCDF in a two dimensional format, 
with SNPs corresponding to rows and samples corresponding to columns.
Because we have already completed the creation of both the genotype and intensity NetCDF files,
we simply use those files to access the data.
The BAlleleFrequency and LogRRatio values are calculated in subsets for efficiency and written to the 
corresponding subset indices in the NetCDF file.

\subsection*{Creating the BAlleleFrequency and LogRRatio NetCDF file}


The first step in creating the BAlleleFrequency NetCDF is to create a `shell' NetCDF
file to which the data will be written.  We can use the same function used for creating the other shell
NetCDF files, {\tt ncdfCreate}.  The two dimensions and three common variables will be created
along with the BAlleleFrequency and LogRRatio variables.
In order to allow for a reasonable computation time, we will create the file to hold data for 3 samples only.

<<>>=
bl.nc.file <- "tmp.bl.nc"
# get snp annotation data frame for function
snp <- affy_snp_annot[,c("snpID", "chromosome", "position")]
ncdfCreate(ncdf.filename = bl.nc.file,
       snp.annotation = snp,
       variables = c("BAlleleFreq","LogRRatio"),
       array.name = "AffyGenomeWideSNP_6",
       genome.build = "36",
       n.samples = 3,
       precision = "single")
@ 

We now will calculate the BAlleleFrequency and LogRRatio values for each sample by SNP and
write these values to the NetCDF by calling the function {\tt BAFfromGenotypes}.
We will also select ``by.study'' as the call method, so all 3 samples
have their genotype clusters called together.  In normal usage, we
recommend calling Affymetrix genotypes ``by.plate'' (in which case the {\tt
  plate.name} argument is passed to the function).
For more detail regarding the {\tt BAFfromGenotypes} function, please see the function documentation.
After the function is complete, we will look at a few values to ensure the file was created successfully.

<<>>=
xyNC <- NcdfIntensityReader(qxy.nc.file)
genoNC <- NcdfGenotypeReader(geno.nc.file)
BAFfromGenotypes(xyNC, genoNC,
           bl.ncdf.filename = bl.nc.file, 
           min.n.genotypes = 0,
           call.method ="by.study")
close(xyNC)
close(genoNC)
# Open the NetCDF file we just created
(blfile <- NcdfIntensityReader(bl.nc.file))
# Look at the BAlleleFrequency values for the first 5 SNPs
(baf <- getBAlleleFreq(blfile, snp=c(1,5), scan=c(1,3)))
# Close the NetCDF file
close(blfile)
@ 


<<echo=FALSE, results=hide>>=
file.remove(geno.nc.file, qxy.nc.file, bl.nc.file)
file.remove(diag.geno.file, diag.qual.file, diag.xy.file)
file.remove(check.geno.file, check.qxy.file)
@ 

\end{document}