%\VignetteIndexEntry{Data formats in GWASTools} %\VignetteDepends{GWASTools, SNPRelate} \documentclass[11pt]{article} \usepackage{fullpage} \usepackage{Sweave} \usepackage{amsmath} \usepackage{graphicx} \usepackage[utf8]{inputenc} \usepackage[pdftex,plainpages=false, letterpaper, bookmarks, bookmarksnumbered, colorlinks, linkcolor=blue, citecolor=blue, filecolor=blue, urlcolor=blue]{hyperref} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rcode}[1]{{\texttt{#1}}} \begin{document} \title{Data formats in GWASTools} \author{Stephanie M. Gogarten} \maketitle %\tableofcontents \SweaveOpts{keep.source=TRUE, eps=FALSE} The central classes of the \Rpackage{GWASTools} package are \Rcode{GenotypeData} and \Rcode{IntensityData}. They are designed to link all parts of a GWAS analysis (genotype data, SNP information, and sample information) in a single S4 object, even when the genotype data is too large to be stored in R's memory at one time. In designing \Rpackage{GWASTools}, we took care to separate the application programming interface (API) of the \Rcode{GenotypeData} and \Rcode{IntensityData} classes from the format in which the data are stored. Each class contains a data slot (for a \Rcode{GenotypeReader} or \Rcode{IntensityReader} object, respectively) and annotation slots (a \Rcode{SnpAnnotationReader} and a \Rcode{ScanAnnotationReader}). These Reader classes are actually class unions, allowing multiple options for storing data and enabling new storage methods to be added without changing any code that uses \Rcode{GenotypeData} and \Rcode{IntensityData} objects. The class unions are currently defined as follows: \begin{itemize} \item \Rcode{GenotypeReader}: \Rcode{NcdfGenotypeReader}, \Rcode{GdsGenotypeReader}, or \Rcode{MatrixGenotypeReader} \item \Rcode{IntensityReader}: \Rcode{NcdfIntensityReader} or \Rcode{GdsIntensityReader} \item \Rcode{SnpAnnotationReader}: \Rcode{SnpAnnotationDataFrame} or \Rcode{SnpAnnotationSQLite} \item \Rcode{ScanAnnotationReader}: \Rcode{ScanAnnotationDataFrame} or \Rcode{ScanAnnotationSQLite} \end{itemize} We use the term ``scan'' to indicate a unique genotyping instance, as the same DNA sample may be genotyped more than once. Each SNP and scan must have a unique integer ID (``snpID'' and ``scanID'') that serves as the primary key between the genotype data and the annotation. Validity methods ensure that these IDs, as well as chromosome and base position of SNPs, are consistent between the data and annotation slots. Chromosome and position values must be integers, so all classes which include SNP data have slots to record integer codes for non-autosomal chromosome types (X, Y, pseudoautosomal, and mitochondrial). \section{Genotype data formats} \subsection{NetCDF} The Network Common Data Form (NetCDF, \href{http://www.unidata.ucar.edu/software/netcdf/}{http://www.unidata.ucar.edu/software/netcdf/}) allows array-oriented data to be stored on disk with fast access to subsets of the data in R using the \Rpackage{ncdf} package. The \Rcode{NcdfReader} class provides an S4 wrapper for \Rcode{ncdf} objects. \Rcode{NcdfGenotypeReader} and \Rcode{NcdfIntensityReader} extend \Rcode{NcdfReader} with methods specific to genotype and intensity data. All NetCDF files created for \Rcode{GWASTools} have two dimensions, one called \Rcode{snp} and one titled \Rcode{sample}. Further, all NetCDF files have three variables in common: \Rcode{sampleID, chromosome} and \Rcode{position}. The \Rcode{sampleID} is used for indexing the columns of the two dimensional values stored in the NetCDF files (genotype calls, for example). The index to the SNP probes in the NetCDF file is the \Rcode{snpID}, which is stored as values of the SNP dimension. \subsection{GDS} Genomic Data Structure (GDS, \href{http://corearray.sourceforge.net/}{http://corearray.sourceforge.net/}) is a storage format for bioinformatics data similar to NetCDF. An R interface is provided with the \Rcode{gdsfmt} package. The \Rpackage{GWASTools} functions \Rcode{convertNcdfGds} and \Rcode{convertGdsNcdf} allow conversion between NetCDF and GDS format. GDS format is required for the \Rpackage{SNPRelate} package, which computes relatedness and PCA as demonstrated in the ``GWAS Data Cleaning'' vignette, so it may be convenient to store data in this format from the start. The \Rcode{GdsReader} class provides a wrapper for \Rpackage{gdsfmt} objects with the same API as the \Rcode{NcdfReader} class. \Rcode{GdsGenotypeReader} and \Rcode{GdsIntensityReader} extend \Rcode{GdsReader} with methods specific to genotype and intensity data. All GDS files created for \Rcode{GWASTools} have variables \Rcode{sample.id, snp.id, snp.chromosome}, and \Rcode{snp.position}. For genotype files, which store the count of the ``A'' allele in two bits, the A and B alleles are stored in the \Rcode{snp.allele} variable in the form ``A/B''. Character SNP identifiers are often stored in the variable \Rcode{snp.rs.id}. \subsection{Matrix} The \Rcode{MatrixGenotypeReader} class is convenient for analyses on smaller data sets which can easily fit into R's memory. It combines a matrix of genotypes with scanID, snpID, chromosome, and position. \section{Annotation} SNP and scan annotation can be stored in either of two formats: an annotated data frame, or a SQLite database. Either format may be supplied to the \Rcode{snpAnnot} and \Rcode{scanAnnot} slots of a \Rcode{GenotypeData} or \Rcode{IntensityData} object. Each annotation object consists of two component data frames (or tables). The main annotation data frame has one row for each SNP (or scan) and columns containing variables such as (for SNPs) snpID, chromosome, position, rsID, A and B alleles and (for scans) scanID, subject ID (to link duplicate scans of the same subject), sex, and phenotype. The metadata data frame has one row for each column in the annotation data frame, and (at minimum) a column containing a description of the variable. Both formats share methods to return annotation columns and metadata. \subsection{Annotated data frames} The \Rcode{SnpAnnotationDataFrame} and \Rcode{ScanAnnotationDataFrame} classes extend the \Rcode{AnnotatedDataFrame} class in the \Rcode{Biobase} package. In addition to GWASTools methods, all methods defined for \Rcode{AnnotatedDataFrame} are available to these classes, including operators which allow these objects to be used like standard data frames in many ways. This format provides some built-in functionality from \Rcode{AnnotatedDataFrame} to ensure that the annotation and metadata data frames are consistent. \subsection{SQLite databases} The \Rcode{ScanAnnotationSQLite} and \Rcode{ScanAnnotationSQLite} classes provide an alternate means of storing annotation that is portable across multiple platforms. In addition to the methods shared with the annotation data frame classes, these classes have \Rcode{getQuery} methods to pass any SQL query to the database. \section{Input} \subsection{Plain text} Data in plain text format (for example, FinalReport files produced by Illumina's GenomeStudio) can be converted to NetCDF or GDS files using the function \Rcode{createDataFile}. See the ``GWAS Data Cleaning'' and ``Preparing Affymetrix Data'' vignettes for examples. \subsection{PLINK} PLINK ped/map files can be converted to NetCDF with accompanying SNP and scan annotation using the function \Rcode{plinkToNcdf}. \Rcode{plinkToNcdf} will automatically convert between the sex chromosome codes used by PLINK and the default codes used by \Rpackage{GWASTools}. \Rcode{snpgdsBED2GDS} in the \Rpackage{SNPRelate} package converts binary PLINK to GDS. \Rcode{snpgdsBED2GDS} is significantly faster than \Rcode{plinkToNcdf}, and the resulting GDS file may be used with \Rpackage{SNPRelate} as well. The option \Rcode{cvt.snpid="int"} is required to generate integer snpIDs. Chromosome codes are not converted. <<>>= library(GWASTools) library(SNPRelate) bed.fn <- system.file("extdata", "plinkhapmap.bed.gz", package="SNPRelate") fam.fn <- system.file("extdata", "plinkhapmap.fam.gz", package="SNPRelate") bim.fn <- system.file("extdata", "plinkhapmap.bim.gz", package="SNPRelate") gdsfile <- "snps.gds" snpgdsBED2GDS(bed.fn, fam.fn, bim.fn, gdsfile, family=TRUE, cvt.chr="int", cvt.snpid="int", verbose=FALSE) @ Now that the file has been created, we can access it in \Rpackage{GWASTools} using the \Rcode{GdsGenotypeReader} class. We create sample and SNP annotation from the variables stored in the GDS file. Note that PLINK sex chromosome coding is different from the \Rpackage{GWASTools} default, so specify codes if your file contains Y or pseudoautosomal SNPs. <<>>= (gds <- GdsGenotypeReader(gdsfile, YchromCode=24L, XYchromCode=25L)) scanID <- getScanID(gds) family <- getVariable(gds, "sample.annot/family") father <- getVariable(gds, "sample.annot/father") mother <- getVariable(gds, "sample.annot/mother") sex <- getVariable(gds, "sample.annot/sex") sex[sex == ""] <- NA # sex must be coded as M/F/NA phenotype <- getVariable(gds, "sample.annot/phenotype") scanAnnot <- ScanAnnotationDataFrame(data.frame(scanID, father, mother, sex, phenotype, stringsAsFactors=FALSE)) snpID <- getSnpID(gds) chromosome <- getChromosome(gds) position <- getPosition(gds) alleleA <- getAlleleA(gds) alleleB <- getAlleleB(gds) rsID <- getVariable(gds, "snp.rs.id") snpAnnot <- SnpAnnotationDataFrame(data.frame(snpID, chromosome, position, rsID, alleleA, alleleB, stringsAsFactors=FALSE), YchromCode=24L, XYchromCode=25L) genoData <- GenotypeData(gds, scanAnnot=scanAnnot, snpAnnot=snpAnnot) getGenotype(genoData, snp=c(1,5), scan=c(1,5)) close(genoData) @ <>= unlink(gdsfile) @ \subsection{VCF} Bi-allelic SNP data from Variant Call Format (VCF) can be converted to GDS using the function \Rcode{snpgdsVCF2GDS} in the \Rcode{SNPRelate} package. <<>>= library(GWASTools) library(SNPRelate) vcffile <- system.file("extdata", "sequence.vcf", package="SNPRelate") gdsfile <- "snps.gds" snpgdsVCF2GDS(vcffile, gdsfile, verbose=FALSE) @ Now that the file has been created, we can access it in \Rpackage{GWASTools} using the \Rcode{GdsGenotypeReader} class. We create a \Rcode{SnpAnnotationDataFrame} from the variables stored in the GDS file. <<>>= (gds <- GdsGenotypeReader(gdsfile)) getScanID(gds) snpID <- getSnpID(gds) chromosome <- as.integer(getChromosome(gds)) position <- getPosition(gds) alleleA <- getAlleleA(gds) alleleB <- getAlleleB(gds) rsID <- getVariable(gds, "snp.rs.id") qual <- getVariable(gds, "snp.annot/qual") filter <- getVariable(gds, "snp.annot/filter") snpAnnot <- SnpAnnotationDataFrame(data.frame(snpID, chromosome, position, rsID, alleleA, alleleB, qual, filter, stringsAsFactors=FALSE)) genoData <- GenotypeData(gds, snpAnnot=snpAnnot) getGenotype(genoData) close(genoData) @ <>= unlink(gdsfile) @ \subsection{Imputed genotypes} Genotype probabilities or dosages from IMPUTE2, BEAGLE, or MaCH can be converted into A allele dosage and stored in NetCDF or GDS with the function \Rcode{imputedDosageFile}. \section{Output} \subsection{PLINK} A \Rcode{GenotypeData} object can be written to PLINK ped/map files with the function \Rcode{plinkWrite}. \subsection{VCF} A \Rcode{GenotypeData} object can be written to VCF with the function \Rcode{vcfWrite}. \Rcode{genoDataAsVCF} converts a \Rcode{GenotypeData} object to a \Rcode{VCF} object for use with the \Rpackage{VariantAnnotation} package. \subsection{snpStats} \Rcode{asSnpMatrix} converts a \Rcode{GenotypeData} object to a \Rcode{SnpMatrix} object for use with the \Rpackage{snpStats} package. \end{document}