%\VignetteIndexEntry{Converting VCF data for use in GWASTools}
%\VignetteDepends{GWASTools, VariantAnnotation}

\documentclass[11pt]{article}
\usepackage{fullpage}
\usepackage{Sweave}
\usepackage{amsmath}  
\usepackage{graphicx}
\usepackage[utf8]{inputenc}
\usepackage[pdftex,plainpages=false, letterpaper, bookmarks, bookmarksnumbered, colorlinks, linkcolor=blue, citecolor=blue, filecolor=blue, urlcolor=blue]{hyperref}

\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rcode}[1]{{\texttt{#1}}}

\begin{document}

\title{Converting VCF data for use in GWASTools}
\author{Stephanie M. Gogarten}

\maketitle

\SweaveOpts{keep.source=TRUE, eps=FALSE} 

This vignette demonstrates how to convert a Variant Call Format (VCF)
file into a format compatible with \Rpackage{GWASTools}.
As \Rpackage{GWASTools} expects genotype data represented as the number of A
alleles, with values of 0, 1, 2, or missing, only bi-allelic SNP data
can be preserved in the conversion.  Other types of variants will be
ignored.  The VCF format is described here: \href{http://www.1000genomes.org/wiki/Analysis/Variant\%20Call\%20Format/vcf-variant-call-format-version-41}{http://www.1000genomes.org/wiki/Analysis/Variant\%20Call\%20Format/vcf-variant-call-format-version-41}.

\section{Converting to GDS}

The simplest way to convert a VCF file is to use the function
\Rcode{convertVcfGds}.  This function will extract sample name, SNP
ID, chromosome, position, reference and alternate alleles, and
genotypes from a VCF file and store them in a GDS file (described in
the \Rpackage{gdsfmt} package).  Unique integer IDs are generated
for all samples and SNPs.

<<>>=
library(GWASTools)
vcffile <- system.file("extdata", "sequence.vcf", package="SNPRelate")
gdsfile <- "snps.gds"
convertVcfGds(vcffile, gdsfile)
@ 

Now that the file has been created, we can access it in
\Rpackage{GWASTools} using the \Rcode{GdsGenotypeReader} class.

<<>>=
(gds <- GdsGenotypeReader(gdsfile))

getScanID(gds)
getSnpID(gds)
getChromosome(gds)
getPosition(gds)
getVariable(gds, "sample.name")
getVariable(gds, "snp.rs.id")
getVariable(gds, "snp.allele")
getGenotype(gds)

close(gds)
@ 

<<echo=FALSE, results=hide>>=
unlink(gdsfile)
@ 

\section{Converting to NetCDF}

In some cases it may be desirable to extract additional information
from a VCF file, e.g., monomorphic SNPs, quality scores, etc.
In that case one can use the \Rpackage{VariantAnnotation} package to read the VCF file.
We then create a NetCDF
file with SNP and scan annotation for use in \Rpackage{GWASTools}.

<<>>=
library(VariantAnnotation)
vcffile <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
(vcf <- readVcf(vcffile, "hg18"))
@ 

Retrieve the reference and alternate alleles.  Since we are keeping
only bi-allelic SNPs, we discard any variants where the reference or
alternate sequences have length $>1$, and variants with more than one
alternate allele.
<<>>=
# width of ref seq
rw <- width(ref(vcf))
# width of first alt seq
aw <- unlist(lapply(alt(vcf), function(x) {width(x[1])}))
# number of alternate genotypes
nalt <- elementLengths(alt(vcf))
# select only bi-allelic SNPs (monomorphic OK, so aw can be 0 or 1)
snp <- rw == 1 & aw <= 1 & nalt == 1
# subset vcf
vcf <- vcf[snp,]
@ 

Retrieve the row data from the VCF, which describes the SNPs.
<<>>=
chrom <- as.vector(seqnames(rowData(vcf)))
position <- start(ranges(rowData(vcf)))
alleleA <- as.character(ref(vcf))
alleleB <- as.character(unlist(alt(vcf)))
QUAL <- qual(vcf)
FILTER <- filt(vcf)
@

Chromosome must be stored as an integer in the NetCDF file.
<<>>=
chromosome <- chrom
chromosome[chrom == "X"] <- 23
chromosome[chrom == "XY"] <- 24 # pseudoautosomal
chromosome[chrom == "Y"] <- 25
chromosome[chrom == "M"] <- 26 # mitochrondrial
chromosome <- as.integer(chromosome)
@ 

Create a data.frame from the row data, as well as metadata
describing each column.  Since FILTER is a single column, its metadata is
combined into a single character string.  The SNPs are ordered by the
integer value of chromosome and position.  The "snpID"
column contains a unique integer ID that will be used to index SNPs in
the NetCDF file.
<<>>=
# make annotation
snp.df <- data.frame(chromosome, position, alleleA, alleleB, QUAL, FILTER,
                     row.names=rownames(vcf), stringsAsFactors=FALSE)
snp.ord <- order(snp.df$chromosome, snp.df$position)
snp.df <- snp.df[snp.ord,]
snp.df$snpID <- 1:nrow(snp.df)
cols <- c("snpID", "chromosome", "position", "alleleA", "alleleB", "QUAL", "FILTER")
hdr <- exptData(vcf)$header
desc <- c("unique integer ID = snp dimension values in NetCDF file",
  "chromosome as integer where X=23, XY=24, Y=25, M=26",
  "base position",
  "allele A (the reference allele)",
  "allele B (the alternate allele)",
  "phred-scaled quality score for the assertion made in ALT. i.e. -10log_10 prob(call in ALT is wrong)",
  paste(paste(row.names(fixed(hdr)$FILTER), fixed(hdr)$FILTER$Description, sep="="), collapse="; "))
meta <- data.frame("labelDescription"=desc, row.names=cols)
snp.df <- snp.df[,cols]
@ 

The INFO field of the VCF may contain many columns, some of which may
contain many values.  Here we keep only columns which have a
single value for each SNP (Number=0, 1, or A).  'A' means one value
per alternate allele, but we have kept only variants with one
alternate allele.
<<>>=
info <- info(vcf)
info.meta <- info(hdr)
vals <- list()
for (i in 1:nrow(info.meta)) {
  field <- row.names(info.meta)[i]
  number <- info.meta$Number[i]
  if (number == "A") {
    # one element per alternate allele
    # should only have one value per variant
    dat <- unlist(info[[field]])
    if (length(dat) > nrow(info)) next
    vals[[field]] <- dat
  } else if (number %in% 0:1) {
    vals[[field]] <- info[[field]]
  }
}
snp.df <- cbind(snp.df, vals)
meta.vals <- as.data.frame(info.meta[names(vals), "Description", drop=FALSE])
names(meta.vals) <- "labelDescription"
meta <- rbind(meta, meta.vals)
@ 

Combine the SNP annotation and metadata into a \Rcode{SnpAnnotationDataFrame}.
<<>>=
library(GWASTools)
snpAnnot <- SnpAnnotationDataFrame(snp.df, meta)
@ 

Create a unique integer scanID that will be used to index samples
in the NetCDF file, and store it in a \Rcode{ScanAnnotationDataFrame}.
<<>>=
scan.df <- data.frame("scanID"=1:ncol(vcf), row.names=colnames(vcf))
meta.scan <- data.frame("labelDescription"="unique integer ID = sampleID in NetCDF file",
                        row.names=names(scan.df))
scanAnnot <- ScanAnnotationDataFrame(scan.df, meta.scan)
@ 

Extract the genotypes from the VCF, and convert to the number of
REF alleles so each genotype may be stored as a single byte.  Phase
information (denoted by ``|'' or ``/'') is lost.
<<>>=
geno <- geno(vcf)$GT
geno.ab <- geno
geno.ab[,] <- NA
geno.ab[geno %in% c("0/0", "0|0")] <- 2 # AA = REF/REF
geno.ab[geno %in% c("0/1", "0|1", "1/0", "1|0")] <- 1 # AB = REF/ALT
geno.ab[geno %in% c("1/1", "1|1")] <- 0 # BB = ALT/ALT
# new snp order (by chrom and position)
geno.ab <- geno.ab[snp.ord,]
# geno.ab should be integer, not character
mode(geno.ab) <- "integer"
@ 

Extract the genotype quality scores from the VCF.
<<>>=
gq <- geno(vcf)$GQ
@ 

Create a genotype NetCDF file and add the integer sample IDs,
genotypes, and quality scores.
<<>>=
ncfile <- "ex2.nc"
ncdfCreate(snp.annotation=snp.df, ncdf.filename=ncfile,
           n.samples=ncol(vcf), variables=c("genotype", "quality"))
nc <- open.ncdf(ncfile, write=TRUE)
put.var.ncdf(nc, "sampleID", scanAnnot$scanID, start=1, count=-1)
put.var.ncdf(nc, "genotype", geno.ab)
put.var.ncdf(nc, "quality", gq)
close.ncdf(nc)
@ 

Create a \Rcode{GenotypeData} object to check that everything worked
correctly.  This object can be used in the \Rpackage{GWASTools} functions.
<<>>=
nc <- NcdfGenotypeReader(ncfile)
(genoData <- GenotypeData(nc, snpAnnot=snpAnnot, scanAnnot=scanAnnot))
pData(snpAnnot)
pData(scanAnnot)
getGenotype(genoData)
getVariable(genoData, "quality")
close(genoData)
@ 

<<echo=FALSE, results=hide>>=
unlink(ncfile)
@ 


\section{References}
The variant call format and VCFtools.
Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker RE,
Lunter G, Marth GT, Sherry ST, McVean G, Durbin R; 1000 Genomes Project Analysis Group.
Bioinformatics. 2011 Aug 1;27(15):2156-8. Epub 2011 Jun 7.

\vspace{2mm}
\noindent
GDS home page: \href{http://corearray.sourceforge.net}{http://corearray.sourceforge.net}

\vspace{2mm}
\noindent
NetCDF home page: \href{http://www.unidata.ucar.edu/software/netcdf}{http://www.unidata.ucar.edu/software/netcdf}

\end{document}