%\VignetteIndexEntry{Introduction to VariantAnnotation} %\VignetteKeywords{variants, sequence, sequencing, alignments} %\VignettePackage{VariantAnnotation} \documentclass[10pt]{article} \usepackage{times} \usepackage{hyperref} \textwidth=6.5in \textheight=8.5in %\parskip=.3cm \oddsidemargin=-.1in \evensidemargin=-.1in \headheight=-.3in \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rcode}[1]{{\texttt{#1}}} \newcommand{\software}[1]{\textsf{#1}} \newcommand{\R}{\software{R}} \newcommand{\Bioconductor}{\software{Bioconductor}} \newcommand{\VariantAnnotation}{\Rpackage{VariantAnnotation}} \title{Introduction to \VariantAnnotation} \author{Valerie Obenchain} \date{\today} \begin{document} \maketitle \tableofcontents <>= options(width=72) @ \section{Introduction} This vignette outlines a general workflow for annotating and filtering genetic variants using the \VariantAnnotation package. Sample data are in VariantCall Format (VCF) and are a subset of chromosome 22 from 1000 Genomes, \url{ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/}. VCF is a text file format that contains meta-information lines, a header line with column names, data lines with information about a position in the genome, and optional genotype information on samples for each position. A full description of the VCF format can be found on the 1000 Genomes page, \url{http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41} The sample dataare read in from a VCF file and variants are identified according to region such as \Rcode{coding}, \Rcode{intron}, \Rcode{intergenic}, \Rcode{spliceSite} etc. Amino acid coding changes are computed for the non-synonymous variants and SIFT and PolyPhen databases provide predictions of how severly the coding changes affect protein function. The end of the vignette covers other transformations of VCF data such as the creation of a \Robject{SnpMatrix} or a `long form' \Robject{GRanges}. \section{Variant Call Format (VCF) files} \subsection{Import complete files} Data are parsed into a \Robject{VCF} object with \Rfunction{readVcf}. <>= library(VariantAnnotation) fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation") vcf <- readVcf(fl, "hg19") vcf @ Extract the header information stored in the \Rcode{exptData} slot <>= hdr <- exptData(vcf)[["header"]] hdr @ and explore it with the \Rcode{fixed}, \Rcode{info} and \Rcode{geno} accessors. More information on this object can be found at ?\Rcode{VCFHeader}. <>= fixed(hdr) head(info(hdr), 3) @ The \Robject{GRanges} in the \Rcode{rowData} slot is created from information in the the CHROM, POS, and ID fields of the VCF file. Values in the \Rcode{paramRangeID} column are meaningful when ranges have been specified in the the \Rcode{param} argument to \Rfunction{readVcf}. This is discussed further in the \Rcode{Data Subsets} section. <>= head(rowData(vcf)) @ The REF, ALT, QUAL and FILTER fields can be accessed together with \Rcode{fixed} accessor or individually with \Rcode{ref}, \Rcode{alt}, \Rcode{qual} and \Rcode{filt} accessors. <>= head(fixed(vcf), 3) qual(vcf)[1:3,] @ The \Robject{ALT} column is stored as a \Robject{DNAStringSetList} unless the file is a structural VCF, in which case it is stored as a \Robject{CharacterList}. Extract \Robject{ALT} from the \Robject{GRanges} and determine the number of elements in the list. <>= alternate <- values(alt(vcf))[["ALT"]] alternate ## number of ALT values per variant unique(elementLengths(alternate)) head(unlist(alternate)) @ Data from the \Rcode{INFO} field is accessed with the \Rcode{info} accessor. Look at the heder information related to the \Rcode{INFO} fields <>= head(info(hdr), 4) @ and retrieve the first couple of the fields, <>= info(vcf)[1:3, 2:3] @ Genotype data described in the \Rcode{FORMAT} field are parsed into matrices or arrays and can be accessed with the \Rcode{geno} accessor. These data are not returned with the \Robject{GRanges} from \Rcode{rowData} because they are unique for each sample and the data structures can be multidimensional. This is in contrast to the \Rcode{fixed} and \Rcode{info} data which are the same for a each variant across all samples. Extract the header information for the genotypes. <<>>= geno(hdr) @ Elements of the genotype list can be accessed in the usual way. <>= geno(vcf) geno(vcf)$GT[1:3,1:5] geno(vcf)$DS[1:3,1:5] @ \subsection{Import data subsets} When working with large VCF files it may be more efficient to read in subsets of the data. Data can be subset by selecting genomic coordinates (ranges) or by selecting fields from the VCF file. \subsubsection{Genomic coordinates} Subset by genomic coordinates by creating a \Robject{GRanges}, \Robject{RangedData} or \Robject{RangesList}. To read in a portion of chromosome 22, we create a \Robject{GRanges} with the regions of interest. <>= rng <- GRanges(seqnames="22", ranges=IRanges(c(50301422, 50989541), c(50312106, 51001328))) names(rng) <- c("gene_79087", "gene_644186") @ When ranges are specified, the VCF file must have an accompanying Tabix index file; if one does not exist it must be created. See ?\Rcode{indexTabix} for help creating an index. Once the index exists a \Robject{TabixFile} instance can be created, see ?\Rcode{TabixFile}. This object creates a reference to the VCF and its index. Once opened, the reference remains open across calls to methods, avoiding costly index re-loading. An index file for our sample data is included in the package so the \Rcode{TabixFile} can be created with, <>= tab <- TabixFile(fl) tab @ Call \Rfunction{readVcf} with \Robject{TabixFile} and the ranges as the \Robject{param}. The dimension of the resulting \Robject{VCF} object shows 397 records overlaped with the specified ranges. <>= vcf_rng <- readVcf(tab, "hg19", rng) vcf_rng @ The \Rcode{paramRangesID} column now has meaning as it distinguishes which variant records came from which param range. <<>>= rowData(vcf_rng) @ \subsubsection{VCF fields} Data can also be subset on the \Rcode{fixed}, \Rcode{info} and/or \Rcode{geno} fields in the VCF file. Fields available for import are described in the header information. To view the header before reading in the data in use \Rfunction{ScanVcfHeader}. <>= hdr <- scanVcfHeader(fl) hdr @ The \Rcode{info} and \Rcode{geno} accessors return \Robject{DataFrames} containing descriptions of the fields, data type and number of values. A listing of all possible \Rcode{info} or \Rcode{geno} values is constructed by selecting the rownames of the \Rcode{DataFrames}. <>= ## INFO fields info_DF <- info(hdr) rownames(info_DF) ## FORMAT fields geno_DF <- geno(hdr) rownames(geno_DF) @ We are interested in "LDAF" in INFO which is 'allele frequency accounting for linkage disequlibrium', and "GT" in FORMAT which is 'genotype'. Full descriptions of the elements can be seen in the header INFO and FORMAT \Robject{DataFrames}. <>= info_DF[rownames(info_DF) == "LDAF", ] geno_DF[rownames(geno_DF) == "GT", ] @ To subset on "LDAF" and "GT" we specify them as \Rcode{character} vectors in the \Rcode{info} and \Rcode{geno} arguments to \Rcode{ScanVcfParam}. This creates a \Robject{ScanVcfParam} object which is used as the \Robject{param} argument to \Rfunction{readVcf}. <>= ## Return "ALT" from 'fixed', "LAF" from 'info' and "GT" from 'geno' svp <- ScanVcfParam(fixed="ALT", info="LDAF", geno="GT") ## Return all 'fixed' fields, "LAF" from 'info' and "GT" from 'geno' svp <- ScanVcfParam(info="LDAF", geno="GT") svp @ Note that subsetting by the VCF fields does not affect the number of ranges read in. Instead the results of the filtering are reflected in the names of the elements returned from the \Rcode{info} and \Rcode{geno} accessors. <>= vcf_flds <- readVcf(fl, "hg19", svp) geno(vcf_flds) head(info(vcf_flds), 3) @ In the previous section we saw that a Tabix index file must exist when data are subset by genomic coordinates (i.e., ranges). This is not the case when subsetting on INFO and FORMAT elements. An index file is only needed when subsetting by ranges. \subsubsection{Subset on both genomic coordinates and VCF fields} To subset on both genomic coordinates and INFO and FORMAT fields the \Robject{ScanVcfParam} object must contain both. Our previous \Robject{ScanVcfParam} did not have ranges associated with it so we create a new instance with the ranges and INFO and FORMAT fields. <>= svp_all <- ScanVcfParam(info="LDAF", geno="GT", which=rng) svp_all @ The subsetting here involves genomic coordinates so we need to use the Tabix index file we created. <>= readVcf(tab, "hg19", svp_all) @ \subsection{Adjusting chromosome names} When functions involve the comparision of ranges by overlaps. For overlap methods to work properly the chromosome names (seqlevels) must be compatible. The VCF data chromosome names are represented by number, i.e. '22', <>= rowdat <- rowData(vcf) seqlevels(rowdat) @ but the TxDb chromosome names are preceded with 'chr'. <>= library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene head(seqlevels(txdb)) @ Chromosome names can be modified with the \Rfunction{renameSeqlevels} function. Seqlevels are modified at the \Robject{GRanges} level in the \Robject{rowData} slot of the \Robject{VCF} which means all future data extractions from this \Robject{VCF} will have the new seqlevels. If the data are read in from the file again, however, the seqlevels will need to be adjusted again. See ?\Rfunction{VCF} and ?\Rfunction{renameSeqlevels} for examples with \Robject{VCF} and \Robject{GRanges} objects. <>= ## rename variant seqlevels in the VCF object vcf <- renameSeqlevels(vcf, c("22"="chr22")) ## extract the rowData with modified seqlevels rd <- rowData(vcf) ## confirm seqlevels are the same intersect(seqlevels(rd), seqlevels(txdb)) @ To subset a \Robject{VCF} or \Robject{GRanges} by chromosome use \Rfunction{keepSeqlevels}. As an example we extract transcripts for all chromosomes in TxDb.Hsapiens.UCSC.hg19.knownGene then keep only 'chr21' and 'chr22'. See ?\Robject{VCF} and ?\Rfunction{keepSeqlevels} for details. \begin{verbatim} ## initially there are 93 chromosomes > rngs <- transcripts(txdb) > length(seqlevels(rngs)) [1] 93 ## keep only chr21 and chr22 > rngs <- keepSeqlevels(rngs, c("chr21", "chr22")) > seqlevels(rngs) [1] "chr21" "chr22" \end{verbatim} \section{Variant location} \Rfunction{locateVariants} identifies where the ranges in \Rcode{query} fall with respect to the annotation supplied in \Robject{subject}. Regions are specified in the \Rcode{region} argument and can be one of the following constructors: CodingVariants, IntronVariants, FiveUTRVariants, ThreeUTRVariants, IntergenicVariants, or SpliceSiteVariants. Location definitions are shown in Table \ref{table:location}. \begin{table}[h!] \begin{center} \begin{tabular}{l|l} \hline Location & Details \\ \hline coding & falls \emph{within} a coding region \\ fiveUTR & falls \emph{within} a 5' untranslated region \\ threeUTR & falls \emph{within} a 3' untranslated region \\ intron & falls \emph{within} an intron region \\ intergenic & does not fall \emph{within} a transcript associated with a gene \\ spliceSite & overlaps any portion of the first 2 or last 2 nucleotides of an intron \\ \hline \end{tabular} \end{center} \caption{Variant locations} \label{table:location} \end{table} When the \Rcode{query} is a \Robject{VCF} the variant ranges are taken from the \Rcode{rowData} slot. If \Rcode{query} is a \Robject{GRanges} it can have additional elementMetadata columns but they are ignored. As an alternative to a \Rcode{TranscriptDb}, the \Robject{subject} can be a \Robject{GRangesList} of the appropriate type. \Rcode{CodingVariants} would require coding regions by transcript, for \Rcode{IntronVariants} introns by transcripts would be necessary, etc. See ?\Rcode{locateVariants} man page for details. Identify the coding variants, <>= loc <- locateVariants(rd, txdb, CodingVariants()) head(loc, 4) @ \Rcode{SpliceSiteVariants} are those overlapping the first 2 or last 2 nucleotides of an intron. <>= head(locateVariants(rd, txdb, SpliceSiteVariants()), 4) @ To locate variants in all regions use the \Rcode{AllVariants()} constructor, <>= allvar <- locateVariants(rd, txdb, AllVariants()) @ The \Robject{GRanges} output of \Rfunction{locateVariants} includes only the ranges that fell in the specified region. Each row is a variant-transcript match which may result in multiple rows for each variant. \Rcode{elementMetadata} columns returned include \Rcode{location}, \Rcode{queryID}, \Rcode{txID}, \Rcode{cdsID}, and \Rcode{geneID}. In the case of \Robject{IntergenicVariants} columns for \Rcode{precedesID} and \Rcode{followsID} are also included. The \Rcode{queryID} column maps back to the row number in the original \Robject{query}. To answer gene-centric questions data can be summarized by gene reguardless of transcript. <>= ## Did any coding variants match more than one gene? table(sapply(split(values(loc)[["geneID"]], values(loc)[["queryID"]]), function(x) length(unique(x)) > 1)) ## Summarize the number of coding variants by gene ID idx <- sapply(split(values(loc)[["queryID"]], values(loc)[["geneID"]]), unique) sapply(idx, length) @ \section{Amino acid coding changes} \Rfunction{predictCoding} computes amino acid coding changes for non-synonymous variants. Only ranges in \Rcode{query} that overlap with a coding region in the \Rcode{subject} are considered. Reference sequences are retrieved from either a \Robject{BSgenome} or fasta file specified in \Rcode{seqSource}. Variant sequences are constructed by substituting, inserting or deleting values in the \Robject{varAllele} column into the reference sequence. Amino acid codes are computed for the variant codon sequence when the length is a multiple of 3. Examples of coding situations are shown in Table \ref{table:aacoding}. The \Robject{query} argument to \Rfunction{predictCoding} can be a \Robject{GRanges} or \Robject{VCF}. When a \Robject{GRanges} is supplied the \Rcode{varAllele} argument must be specified. In the case of a \Robject{VCF}, the alternate alleles are taken from \Robject{values(alt())[["ALT"]]} and the \Robject{varAllele} argument is not specified. \begin{table}[h] \begin{center} \begin{tabular}{l|l|l|l|l|c} \hline Type & refAllele & varAllele & refCodon & varCodon & translation possible \\ \hline substitution & G & T & aag & aaT & yes\\ substitution & G & TG & tga & tTGa & no\\ substitution & G & TGCG & gtc & TGCGtc & yes\\ insertion & `' & G & cgg & Gcgg & no\\ insertion & `' & TTG & gaa & gaTTGa & yes\\ deletion & A & `' & atc & tc & no\\ deletion & GGCCTA & `' & acggcctaa & aca & yes\\ \hline \end{tabular} \end{center} \caption{Amino acid coding} \label{table:aacoding} \end{table} The result is a modified \Rcode{query} containing only variants that fall within coding regions. Each row represents a variant-transcript match so more than one row per original variant is possible. <>= library(BSgenome.Hsapiens.UCSC.hg19) coding <- predictCoding(vcf, txdb, seqSource=Hsapiens) coding[5:9] @ Using variant rs114264124 as an example, we see varAllele \Rcode{A} has been substituted into the \Rcode{refCodon} \Rcode{CGG} to produce \Rcode{varCodon} \Rcode{CAG}. The \Rcode{refCodon} is the sequence of codons necessary to make the variant allele substitution and therefore often includes more nucleotides than indicated in the range (i.e. the range is 50302962, 50302962, width of 1). Notice it is the second position in the \Rcode{refCodon} that has been substituted. This position in the codon, the position of substitution, corresponds to genomic position 50302962. This genomic position maps to position 698 in coding region-based coordinates and to triplet 233 in the protein. This is a non-synonymous coding variant where the amino acid has changed from \Rcode{R} (Arg) to \Rcode{Q} (Gln). When the resulting \Rcode{varCodon} is not a multiple of 3 it cannot be translated. The consequence is considered a \Rcode{frameshift} and \Robject{varAA} will be missing. <>= ## consequence is 'frameshift' where translation is not possible coding[values(coding)[["consequence"]] == "frameshift"] @ \section{SIFT and PolyPhen Databases} From \Rfunction{predictCoding} we identified the amino acid coding changes for the non-synonymous variants. For this subset we can retrieve predictions of how damaging these coding changes may be. SIFT (Sorting Intolerant From Tolerant) and PolyPhen (Polymorphism Phenotyping) are methods that predict the impact of amino acid substitution on a human protein. The SIFT method uses sequence homology and the physical properties of amino acids to make predictions about protein function. PolyPhen uses sequence-based features and structural information characterizing the substitution to make predictions about the structure and function of the protein. Collated predictions for specific dbSNP builds are available as downloads from the SIFT and PolyPhen web sites. These results have been packaged into \Rpackage{SIFT.Hsapiens.dbSNP132.db} and \Rpackage{PolyPhen.Hapiens.dbSNP131.db} and are designed to be searched by rsid. Variants that are in dbSNP can be searched with these database packages. When working with novel variants, SIFT and PolyPhen must be called directly. See references for home pages. Identify the non-synonymous variants and obtain the rsids. <>= nms <- names(coding) idx <- values(coding)[["consequence"]] == "nonsynonymous" nonsyn <- coding[idx] names(nonsyn) <- nms[idx] rsids <- unique(names(nonsyn)[grep("rs", names(nonsyn), fixed=TRUE)]) @ Detailed descriptions of the database columns can be found with \Rcode{?SIFTDbColumns} and \Rcode{?PolyPhenDbColumns}. Variants in these databases often contain more than one row per variant. The variant may have been reported by multiple sources and therefore the source will differ as well as some of the other variables. <>= library(SIFT.Hsapiens.dbSNP132) ## rsids in the package head(keys(SIFT.Hsapiens.dbSNP132)) ## list available columns cols(SIFT.Hsapiens.dbSNP132) ## select a subset of columns ## a warning is thrown when a key is not found in the database subst <- c("RSID", "PREDICTION", "SCORE", "AACHANGE", "PROTEINID") sift <- select(SIFT.Hsapiens.dbSNP132, keys=rsids, cols=subst) head(sift) @ Next we query the PolyPhen database for information on these variants. PolyPhen provides predictions using two different training datasets and has considerable information about 3D protein structure. See \Rcode{?PolyPhenDbColumns} or the PolyPhen web site listed in the references for more details. Query the PolyPhen database with the rsids found in SIFT, <>= library(PolyPhen.Hsapiens.dbSNP131) inSIFT <- unique(sift$RSID[!is.na(sift$PREDICTION)]) pp <- select(PolyPhen.Hsapiens.dbSNP131, keys=inSIFT, cols=c("TRAININGSET", "PREDICTION", "PPH2PROB")) head(pp[!is.na(pp$PREDICTION), ]) @ \section{Other operations} \subsection{Create a SnpMatrix} The 'GT' element in the \Rcode{FORMAT} field of the VCF represents the genotype. These data can be converted into a \Robject{snpMatrix} object which can then be used with the functions offered in \Rpackage{snpStats} and other packages making use of the \Robject{SnpMatrix} class. The \Rfunction{MatrixToSnpMatrix} function converts the genotype calls in \Rcode{geno} to a \Robject{SnpMatrix}. No dbSNP package is used in this computation. The return value is a named list where 'genotypes' is a \Robject{SnpMatrix} and 'map' is a \Robject{DataFrame} with SNP names and alleles at each loci. The \Rcode{ignore} column in 'map' indicates which variants were set to NA (missing) because they met one or more of the following criteria, \begin{itemize} \item{only diploid calls are included; others are set to NA} \item{only single nucleotide variants are included; others are set to NA} \item{variants with >1 ALT allele are set to NA} \end{itemize} See ?\Rfunction{MatrixToSnpMatrix} for more details. <>= calls <- geno(vcf)$GT a0 <- values(ref(vcf))[["REF"]] a1 <- values(alt(vcf))[["ALT"]] res <- MatrixToSnpMatrix(calls, a0, a1) res @ The \Rcode{ALT} value in the 'map' \Robject{DataFrame} will be a \Robject{CharacterList} if the VCF was for structural variants or a \Robject{DNAStringSetList} otherwise. The column is not clearly visable inside the \Robject{DataFrame} but can be extracted and inspected as follows, <>= allele2 <- res$map[["allele.2"]] ## number of alternate alleles per variant unique(elementLengths(allele2)) unlist(allele2) @ \subsection{Long form GRanges} The \Rcode{readVcfLongForm} function reads data from a VCF file in the same manner as \Rcode{readVcf} but outputs a long form \Rcode{GRanges} instead of a \Rclass{VCF} class. This format is driven by the fact that the alternate allele (ALT) in the VCF file often has more than one value per record. In the long form \Robject{GRanges}, the rows of the \Robject{GRanges} are replicated to match the length of the `unlisted' alternate allele. This format provides access to each possible REF, ALT pair for each variant. Input arguments and data subsetting is the same for \Rcode{readVcfLongForm} as for \Rcode{readVcf}. The \Rcode{fixed} and \Rcode{info} fields are included as elementMetadata columns. Currently no \Rcode{geno} information is included. \Rcode{info} information was previously collected from the file header. We import `HOMSEQ' and `ALT'. <>= rownames(info_DF) @ <>= param <- ScanVcfParam(fixed="ALT", info="HOMSEQ") gr <- readVcfLongForm(fl, "hg19", param) head(gr) @ \subsection{Write out VCF files} A VCF file can be written out from data stored in a \Rcode{VCF} class. Methods to write out from more general structures are in progress. <>= fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation") out1.vcf <- tempfile() out2.vcf <- tempfile() in1 <- readVcf(fl, "hg19") writeVcf(in1, out1.vcf) in2 <- readVcf(out1.vcf, "hg19") writeVcf(in2, out2.vcf) in3 <- readVcf(out2.vcf, "hg19") identical(in2, in3) @ \section{References} Wang K, Li M, Hakonarson H, (2010), ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Research, Vol 38, No. 16, e164.\\ \noindent McLaren W, Pritchard B, RiosD, et. al., (2010), Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor. Bioinformatics, Vol. 26, No. 16, 2069-2070.\\ \noindent SIFT home page : \url{http://sift.bii.a-star.edu.sg/}\\ \noindent PolyPhen home page : \url{http://genetics.bwh.harvard.edu/pph2/} \section{Session Information} <>= sessionInfo() @ \end{document}