%\VignetteIndexEntry{Data formats in GWASTools}
%\VignetteDepends{GWASTools, SNPRelate}
\documentclass[11pt]{article}
\usepackage{fullpage}
\usepackage{Sweave}
\usepackage{amsmath}
\usepackage{graphicx}
\usepackage[utf8]{inputenc}
\usepackage[pdftex,plainpages=false, letterpaper, bookmarks, bookmarksnumbered, colorlinks, linkcolor=blue, citecolor=blue, filecolor=blue, urlcolor=blue]{hyperref}

\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rcode}[1]{{\texttt{#1}}}

\begin{document}

\title{Data formats in GWASTools}
\author{Stephanie M. Gogarten}

\maketitle
%\tableofcontents

\SweaveOpts{keep.source=TRUE, eps=FALSE}

The central classes of the \Rpackage{GWASTools} package are
\Rcode{GenotypeData} and \Rcode{IntensityData}.  They are designed to
link all parts of a GWAS analysis (genotype data, SNP information, and
sample information) in a single S4 object, even when the genotype data
is too large to be stored in R's memory at one time.
In designing \Rpackage{GWASTools}, we took care to separate the
application programming interface (API) of the \Rcode{GenotypeData}
and \Rcode{IntensityData} classes from the format in which the data are
stored.

Each class contains a data slot (for a \Rcode{GenotypeReader} or
\Rcode{IntensityReader} object, respectively)
and annotation slots (a \Rcode{SnpAnnotationReader} and a \Rcode{ScanAnnotationReader}).
These Reader classes are actually class unions, allowing multiple options for
storing data and enabling new storage methods to be added without
changing any code that uses \Rcode{GenotypeData} and \Rcode{IntensityData} objects.
The class unions are currently defined as follows:
\begin{itemize}
\item \Rcode{GenotypeReader}:
\Rcode{NcdfGenotypeReader}, \Rcode{GdsGenotypeReader}, or
\Rcode{MatrixGenotypeReader}
\item \Rcode{IntensityReader}: \Rcode{NcdfIntensityReader} or \Rcode{GdsIntensityReader}
\item \Rcode{SnpAnnotationReader}: \Rcode{SnpAnnotationDataFrame} or \Rcode{SnpAnnotationSQLite}
\item \Rcode{ScanAnnotationReader}: \Rcode{ScanAnnotationDataFrame} or \Rcode{ScanAnnotationSQLite}
\end{itemize}

We use the term ``scan'' to indicate a unique genotyping instance, as
the same DNA sample may be genotyped more than once.
Each SNP and scan must have a
unique integer ID (``snpID'' and ``scanID'') that serves as the primary key between the genotype
data and the annotation.  Validity methods ensure that these IDs, as
well as chromosome and base position of SNPs, are consistent between
the data and annotation slots.  Chromosome and position values must be
integers, so all classes which include SNP data have slots to record
integer codes for non-autosomal chromosome types (X, Y,
pseudoautosomal, and mitochondrial).


\section{Genotype data formats}

\subsection{NetCDF}

The Network Common Data Form (NetCDF, \href{http://www.unidata.ucar.edu/software/netcdf/}{http://www.unidata.ucar.edu/software/netcdf/}) allows array-oriented data to be
stored on disk with fast access to subsets of the data in R using the
\Rpackage{ncdf4} package.  The \Rcode{NcdfReader} class provides an S4
wrapper for \Rcode{ncdf} objects.  \Rcode{NcdfGenotypeReader} and
\Rcode{NcdfIntensityReader} extend \Rcode{NcdfReader} with
methods specific to genotype and intensity data.

All NetCDF files created for \Rcode{GWASTools} have two dimensions, one called \Rcode{snp} and one titled \Rcode{sample}.
Further, all NetCDF files have three variables in common: \Rcode{sampleID, chromosome} and \Rcode{position}.
The \Rcode{sampleID} is used for indexing the columns of the two dimensional values stored in the NetCDF files
(genotype calls, for example).
The index to the SNP probes in the NetCDF file is the \Rcode{snpID}, which is stored as values of the SNP dimension.

\subsection{GDS}

Genomic Data Structure (GDS,
\href{http://corearray.sourceforge.net/}{http://corearray.sourceforge.net/})
is a storage format for bioinformatics data similar to NetCDF.  An R
interface is provided with the \Rcode{gdsfmt} package.  The
\Rpackage{GWASTools} functions \Rcode{convertNcdfGds} and
\Rcode{convertGdsNcdf} allow conversion between NetCDF and GDS
format.  GDS format is required for the \Rpackage{SNPRelate} package,
which computes relatedness and PCA as demonstrated in the ``GWAS Data
Cleaning'' vignette, so it may be convenient to store data in this
format from the start.  The \Rcode{GdsReader} class provides a wrapper
for \Rpackage{gdsfmt} objects with the same API as the
\Rcode{NcdfReader} class.  \Rcode{GdsGenotypeReader} and
\Rcode{GdsIntensityReader} extend \Rcode{GdsReader} with
methods specific to genotype and intensity data.

All GDS files created for \Rcode{GWASTools} have variables
\Rcode{sample.id, snp.id, snp.chromosome}, and \Rcode{snp.position}.
For genotype files, which store the count of the ``A'' allele in two
bits, the A and B alleles are stored in the \Rcode{snp.allele}
variable in the form ``A/B''.  Character SNP identifiers are often
stored in the variable \Rcode{snp.rs.id}.

\subsection{Matrix}

The \Rcode{MatrixGenotypeReader} class is convenient for analyses on smaller data sets which can easily fit into R's
memory.  It combines a matrix of genotypes with scanID, snpID,
chromosome, and position.

\section{Annotation}

SNP and scan annotation can be stored in either of two formats: an
annotated data frame, or a SQLite database.  Either format may be
supplied to the \Rcode{snpAnnot} and \Rcode{scanAnnot} slots of a
\Rcode{GenotypeData} or \Rcode{IntensityData} object.
Each annotation object consists of two component data frames (or tables).  The
main annotation data frame has one row for each SNP (or scan) and columns
containing variables such as (for SNPs) snpID, chromosome, position, rsID, A and B alleles and (for
scans) scanID, subject ID (to link duplicate scans of the same
subject), sex, and phenotype.  The metadata data frame has one row for
each column in the annotation data frame, and (at minimum) a column containing a description
of the variable.
Both formats
share methods to return annotation columns and metadata.

\subsection{Annotated data frames}

The \Rcode{SnpAnnotationDataFrame} and \Rcode{ScanAnnotationDataFrame}
classes extend the \Rcode{AnnotatedDataFrame} class in the
\Rcode{Biobase} package.  In addition to GWASTools methods, all
methods defined for \Rcode{AnnotatedDataFrame} are available to these
classes, including operators which allow these objects to be used like
standard data frames in many ways.
This format provides some built-in functionality from
\Rcode{AnnotatedDataFrame} to ensure that the annotation and metadata
data frames are consistent.

\subsection{SQLite databases}

The \Rcode{ScanAnnotationSQLite} and \Rcode{ScanAnnotationSQLite}
classes provide an alternate means of storing annotation that is
portable across multiple platforms.  In addition to the methods shared
with the annotation data frame classes, these classes have
\Rcode{getQuery} methods to pass any SQL query to the database.

\section{Input}

\subsection{Plain text}

Data in plain text format (for example, FinalReport files
produced by Illumina's GenomeStudio) can be converted to NetCDF or GDS files
using the function \Rcode{createDataFile}.  See
the ``GWAS Data Cleaning'' and ``Preparing Affymetrix Data'' vignettes for
examples.


\subsection{PLINK}

PLINK ped/map files can be converted to NetCDF with accompanying SNP
and scan annotation using the function \Rcode{plinkToNcdf}.
\Rcode{plinkToNcdf} will automatically convert between the sex
chromosome codes used by PLINK and the default codes used by
\Rpackage{GWASTools}.

\Rcode{snpgdsBED2GDS} in the \Rpackage{SNPRelate} package converts
binary PLINK to GDS.  \Rcode{snpgdsBED2GDS} is significantly faster than
\Rcode{plinkToNcdf}, and the resulting GDS file may be used with
\Rpackage{SNPRelate} as well.  The option \Rcode{cvt.snpid="int"} is required to
generate integer snpIDs.  Chromosome codes are not converted.

<<>>=
library(GWASTools)
library(SNPRelate)
bed.fn <- system.file("extdata", "plinkhapmap.bed.gz", package="SNPRelate")
fam.fn <- system.file("extdata", "plinkhapmap.fam.gz", package="SNPRelate")
bim.fn <- system.file("extdata", "plinkhapmap.bim.gz", package="SNPRelate")
gdsfile <- "snps.gds"
snpgdsBED2GDS(bed.fn, fam.fn, bim.fn, gdsfile, family=TRUE,
              cvt.chr="int", cvt.snpid="int", verbose=FALSE)
@ 

Now that the file has been created, we can access it in
\Rpackage{GWASTools} using the \Rcode{GdsGenotypeReader} class.  We
create sample and SNP annotation from the variables stored in the GDS file.
Note that PLINK sex chromosome coding is different from the
\Rpackage{GWASTools} default, so specify codes if your file contains Y
or pseudoautosomal SNPs.

<<>>=
(gds <- GdsGenotypeReader(gdsfile, YchromCode=24L, XYchromCode=25L))
scanID <- getScanID(gds)
family <- getVariable(gds, "sample.annot/family")
father <- getVariable(gds, "sample.annot/father")
mother <- getVariable(gds, "sample.annot/mother")
sex <- getVariable(gds, "sample.annot/sex")
sex[sex == ""] <- NA # sex must be coded as M/F/NA
phenotype <- getVariable(gds, "sample.annot/phenotype")
scanAnnot <- ScanAnnotationDataFrame(data.frame(scanID, father, mother, 
                                                sex, phenotype,
                                                stringsAsFactors=FALSE))

snpID <- getSnpID(gds)
chromosome <- getChromosome(gds)
position <- getPosition(gds)
alleleA <- getAlleleA(gds)
alleleB <- getAlleleB(gds)
rsID <- getVariable(gds, "snp.rs.id")
snpAnnot <- SnpAnnotationDataFrame(data.frame(snpID, chromosome, position,
                                              rsID, alleleA, alleleB,
                                              stringsAsFactors=FALSE),
                                   YchromCode=24L, XYchromCode=25L)

genoData <- GenotypeData(gds, scanAnnot=scanAnnot, snpAnnot=snpAnnot)
getGenotype(genoData, snp=c(1,5), scan=c(1,5))

close(genoData)
@ 

<<echo=FALSE, results=hide>>=
unlink(gdsfile)
@ 


\subsection{VCF}

Bi-allelic SNP data from Variant Call Format (VCF) can be converted to
GDS using the function \Rcode{snpgdsVCF2GDS} in the \Rcode{SNPRelate} package.

<<>>=
library(GWASTools)
library(SNPRelate)
vcffile <- system.file("extdata", "sequence.vcf", package="SNPRelate")
gdsfile <- "snps.gds"
snpgdsVCF2GDS(vcffile, gdsfile, verbose=FALSE)
@ 

Now that the file has been created, we can access it in
\Rpackage{GWASTools} using the \Rcode{GdsGenotypeReader} class.  We
create a \Rcode{SnpAnnotationDataFrame} from the variables stored in the GDS file.

<<>>=
(gds <- GdsGenotypeReader(gdsfile))
getScanID(gds)

snpID <- getSnpID(gds)
chromosome <- as.integer(getChromosome(gds))
position <- getPosition(gds)
alleleA <- getAlleleA(gds)
alleleB <- getAlleleB(gds)
rsID <- getVariable(gds, "snp.rs.id")
qual <- getVariable(gds, "snp.annot/qual")
filter <- getVariable(gds, "snp.annot/filter")
snpAnnot <- SnpAnnotationDataFrame(data.frame(snpID, chromosome, position,
                                              rsID, alleleA, alleleB,
                                              qual, filter,
                                              stringsAsFactors=FALSE))

genoData <- GenotypeData(gds, snpAnnot=snpAnnot)
getGenotype(genoData)

close(genoData)
@ 

<<echo=FALSE, results=hide>>=
unlink(gdsfile)
@ 


\subsection{Imputed genotypes}

Genotype probabilities or dosages from IMPUTE2, BEAGLE, or MaCH can be
converted into A allele dosage and stored in NetCDF or GDS with the function
\Rcode{imputedDosageFile}.


\section{Output}

\subsection{PLINK}

A \Rcode{GenotypeData} object can be written to PLINK ped/map files
with the function \Rcode{plinkWrite}.

\subsection{VCF}

A \Rcode{GenotypeData} object can be written to VCF
with the function \Rcode{vcfWrite}.
\Rcode{genoDataAsVCF} converts a \Rcode{GenotypeData} object to
a \Rcode{VCF} object for use with the \Rpackage{VariantAnnotation} package.

\subsection{snpStats}

\Rcode{asSnpMatrix} converts a \Rcode{GenotypeData} object to a
\Rcode{SnpMatrix} object for use with the \Rpackage{snpStats} package.

\end{document}