%\VignetteIndexEntry{Converting VCF data for use in GWASTools} %\VignetteDepends{GWASTools, VariantAnnotation} \documentclass[11pt]{article} \usepackage{fullpage} \usepackage{Sweave} \usepackage{amsmath} \usepackage{graphicx} \usepackage[utf8]{inputenc} \usepackage[pdftex,plainpages=false, letterpaper, bookmarks, bookmarksnumbered, colorlinks, linkcolor=blue, citecolor=blue, filecolor=blue, urlcolor=blue]{hyperref} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rcode}[1]{{\texttt{#1}}} \begin{document} \title{Converting VCF data for use in GWASTools} \author{Stephanie M. Gogarten} \maketitle \SweaveOpts{keep.source=TRUE, eps=FALSE} This vignette demonstrates how to convert a Variant Call Format (VCF) file into a format compatible with \Rpackage{GWASTools}. As \Rpackage{GWASTools} expects genotype data represented as the number of A alleles, with values of 0, 1, 2, or missing, only bi-allelic SNP data can be preserved in the conversion. Other types of variants will be ignored. The VCF format is described here: \href{http://www.1000genomes.org/wiki/Analysis/Variant\%20Call\%20Format/vcf-variant-call-format-version-41}{http://www.1000genomes.org/wiki/Analysis/Variant\%20Call\%20Format/vcf-variant-call-format-version-41}. \section{Converting to GDS} The simplest way to convert a VCF file is to use the function \Rcode{convertVcfGds}. This function will extract sample name, SNP ID, chromosome, position, reference and alternate alleles, and genotypes from a VCF file and store them in a GDS file (described in the \Rpackage{gdsfmt} package). Unique integer IDs are generated for all samples and SNPs. <<>>= library(GWASTools) vcffile <- system.file("extdata", "sequence.vcf", package="SNPRelate") gdsfile <- "snps.gds" convertVcfGds(vcffile, gdsfile) @ Now that the file has been created, we can access it in \Rpackage{GWASTools} using the \Rcode{GdsGenotypeReader} class. <<>>= (gds <- GdsGenotypeReader(gdsfile)) getScanID(gds) getSnpID(gds) getChromosome(gds) getPosition(gds) getVariable(gds, "sample.name") getVariable(gds, "snp.rs.id") getVariable(gds, "snp.allele") getGenotype(gds) close(gds) @ <>= unlink(gdsfile) @ \section{Converting to NetCDF} In some cases it may be desirable to extract additional information from a VCF file, e.g., monomorphic SNPs, quality scores, etc. In that case one can use the \Rpackage{VariantAnnotation} package to read the VCF file. We then create a NetCDF file with SNP and scan annotation for use in \Rpackage{GWASTools}. <<>>= library(VariantAnnotation) vcffile <- system.file("extdata", "ex2.vcf", package="VariantAnnotation") (vcf <- readVcf(vcffile, "hg18")) @ Retrieve the reference and alternate alleles. Since we are keeping only bi-allelic SNPs, we discard any variants where the reference or alternate sequences have length $>1$, and variants with more than one alternate allele. <<>>= # width of ref seq rw <- width(ref(vcf)) # width of first alt seq aw <- unlist(lapply(alt(vcf), function(x) {width(x[1])})) # number of alternate genotypes nalt <- elementLengths(alt(vcf)) # select only bi-allelic SNPs (monomorphic OK, so aw can be 0 or 1) snp <- rw == 1 & aw <= 1 & nalt == 1 # subset vcf vcf <- vcf[snp,] @ Retrieve the row data from the VCF, which describes the SNPs. <<>>= chrom <- as.vector(seqnames(rowData(vcf))) position <- start(ranges(rowData(vcf))) alleleA <- as.character(ref(vcf)) alleleB <- as.character(unlist(alt(vcf))) QUAL <- qual(vcf) FILTER <- filt(vcf) @ Chromosome must be stored as an integer in the NetCDF file. <<>>= chromosome <- chrom chromosome[chrom == "X"] <- 23 chromosome[chrom == "XY"] <- 24 # pseudoautosomal chromosome[chrom == "Y"] <- 25 chromosome[chrom == "M"] <- 26 # mitochrondrial chromosome <- as.integer(chromosome) @ Create a data.frame from the row data, as well as metadata describing each column. Since FILTER is a single column, its metadata is combined into a single character string. The SNPs are ordered by the integer value of chromosome and position. The "snpID" column contains a unique integer ID that will be used to index SNPs in the NetCDF file. <<>>= # make annotation snp.df <- data.frame(chromosome, position, alleleA, alleleB, QUAL, FILTER, row.names=rownames(vcf), stringsAsFactors=FALSE) snp.ord <- order(snp.df$chromosome, snp.df$position) snp.df <- snp.df[snp.ord,] snp.df$snpID <- 1:nrow(snp.df) cols <- c("snpID", "chromosome", "position", "alleleA", "alleleB", "QUAL", "FILTER") hdr <- exptData(vcf)$header desc <- c("unique integer ID = snp dimension values in NetCDF file", "chromosome as integer where X=23, XY=24, Y=25, M=26", "base position", "allele A (the reference allele)", "allele B (the alternate allele)", "phred-scaled quality score for the assertion made in ALT. i.e. -10log_10 prob(call in ALT is wrong)", paste(paste(row.names(fixed(hdr)$FILTER), fixed(hdr)$FILTER$Description, sep="="), collapse="; ")) meta <- data.frame("labelDescription"=desc, row.names=cols) snp.df <- snp.df[,cols] @ The INFO field of the VCF may contain many columns, some of which may contain many values. Here we keep only columns which have a single value for each SNP (Number=0, 1, or A). 'A' means one value per alternate allele, but we have kept only variants with one alternate allele. <<>>= info <- info(vcf) info.meta <- info(hdr) vals <- list() for (i in 1:nrow(info.meta)) { field <- row.names(info.meta)[i] number <- info.meta$Number[i] if (number == "A") { # one element per alternate allele # should only have one value per variant dat <- unlist(info[[field]]) if (length(dat) > nrow(info)) next vals[[field]] <- dat } else if (number %in% 0:1) { vals[[field]] <- info[[field]] } } snp.df <- cbind(snp.df, vals) meta.vals <- as.data.frame(info.meta[names(vals), "Description", drop=FALSE]) names(meta.vals) <- "labelDescription" meta <- rbind(meta, meta.vals) @ Combine the SNP annotation and metadata into a \Rcode{SnpAnnotationDataFrame}. <<>>= library(GWASTools) snpAnnot <- SnpAnnotationDataFrame(snp.df, meta) @ Create a unique integer scanID that will be used to index samples in the NetCDF file, and store it in a \Rcode{ScanAnnotationDataFrame}. <<>>= scan.df <- data.frame("scanID"=1:ncol(vcf), row.names=colnames(vcf)) meta.scan <- data.frame("labelDescription"="unique integer ID = sampleID in NetCDF file", row.names=names(scan.df)) scanAnnot <- ScanAnnotationDataFrame(scan.df, meta.scan) @ Extract the genotypes from the VCF, and convert to the number of REF alleles so each genotype may be stored as a single byte. Phase information (denoted by ``|'' or ``/'') is lost. <<>>= geno <- geno(vcf)$GT geno.ab <- geno geno.ab[,] <- NA geno.ab[geno %in% c("0/0", "0|0")] <- 2 # AA = REF/REF geno.ab[geno %in% c("0/1", "0|1", "1/0", "1|0")] <- 1 # AB = REF/ALT geno.ab[geno %in% c("1/1", "1|1")] <- 0 # BB = ALT/ALT # new snp order (by chrom and position) geno.ab <- geno.ab[snp.ord,] # geno.ab should be integer, not character mode(geno.ab) <- "integer" @ Extract the genotype quality scores from the VCF. <<>>= gq <- geno(vcf)$GQ @ Create a genotype NetCDF file and add the integer sample IDs, genotypes, and quality scores. <<>>= ncfile <- "ex2.nc" ncdfCreate(snp.annotation=snp.df, ncdf.filename=ncfile, n.samples=ncol(vcf), variables=c("genotype", "quality")) nc <- open.ncdf(ncfile, write=TRUE) put.var.ncdf(nc, "sampleID", scanAnnot$scanID, start=1, count=-1) put.var.ncdf(nc, "genotype", geno.ab) put.var.ncdf(nc, "quality", gq) close.ncdf(nc) @ Create a \Rcode{GenotypeData} object to check that everything worked correctly. This object can be used in the \Rpackage{GWASTools} functions. <<>>= nc <- NcdfGenotypeReader(ncfile) (genoData <- GenotypeData(nc, snpAnnot=snpAnnot, scanAnnot=scanAnnot)) pData(snpAnnot) pData(scanAnnot) getGenotype(genoData) getVariable(genoData, "quality") close(genoData) @ <>= unlink(ncfile) @ \section{References} The variant call format and VCFtools. Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker RE, Lunter G, Marth GT, Sherry ST, McVean G, Durbin R; 1000 Genomes Project Analysis Group. Bioinformatics. 2011 Aug 1;27(15):2156-8. Epub 2011 Jun 7. \vspace{2mm} \noindent GDS home page: \href{http://corearray.sourceforge.net}{http://corearray.sourceforge.net} \vspace{2mm} \noindent NetCDF home page: \href{http://www.unidata.ucar.edu/software/netcdf}{http://www.unidata.ucar.edu/software/netcdf} \end{document}