\documentclass[a4paper]{article}
\usepackage{listings}

\title{RareVariantVis: Package for visualization of rare variants in whole genome sequencing data}
\author{Tomasz Stokowy}

%\VignetteIndexEntry{RareVariantVis}
%\VignetteEngine{knitr::knitr}

\begin{document}
\maketitle

\section*{Introduction}

This vignette shows how to efficiently visualize and interpret genomic variants in R. The RareVariantVis package presents genomic variants (especially rare ones) in a global, per-chromosome view. Visualization is performed in two ways: a standard mode that outputs png figures and an interactive mode that uses the JavaScript d3 library. Interactive visualization supports the analysis of trio/family data, for example in the search for causative variants in rare Mendelian diseases.

\section*{Input data}

This example uses Complete Genomics whole genome sequencing data. The son sample from the Ashkenazim trio from Stanford GIAB Personal Genome Project was used to prepare the visualization input file, a data frame with variants. An example of such an input file is presented below:

<<>>=
library(RareVariantVis)
library(AshkenazimSonChr21)
head(SonVariantsChr21)
@

The data frame consists of columns from a vcf file. Mandatory columns are:
\begin{itemize}
\item Start.position - location on the chromosome,
\item SNP.Frequency - dbSNP frequency,
\item DP - sequencing depth,
\item AD - allelic depths for the reference and alternative alleles,
\item Gene.name - gene symbol,
\item Gene.component - part of the gene,
\item Variant.type - type of variant.
\end{itemize}

The Gene.component field accepts regions such as \texttt{EXON\_REGION}, \texttt{SA\_SITE\_CANONICAL}, \texttt{SD\_SITE\_CANONICAL}, \texttt{UTR}, \texttt{INTRON\_REGION} and others. It can also be empty for intergenic regions. The Variant.type field accepts the following types: \texttt{Substitution - nonsynonymous}, \texttt{Substitution - nonsense}, \texttt{Complex}, \texttt{Deletion - frameshift}, \texttt{Insertion - frameshift}, \texttt{Substitution}, \texttt{Substitution - synonymous} and others.
Large variant files are also accepted: a computer with 16 GB of RAM can handle input files of up to 1 GB.

\section*{Visualization options}

The package offers two main visualization modes: static for all variants and dynamic for rare variants. Static visualization plots all variants on the chromosome, whereas dynamic visualization interactively plots the variants selected by the filtering procedure.

\subsection*{Static visualization}

Static visualization is performed by the key function chromosomeVis. chromosomeVis writes a png plot and a filtered list of variants to the working directory. Alternatively, it can send the plot to the current device.

Example of a chromosomeVis call that saves the plot to disk:

<<>>=
chromosomeVis(file = SonVariantsChr21, sample = "Son",
              chromosome = 21, centromeres = CentromeresHg19,
              pngWidth = 1600, pngHeight = 1200, plot = FALSE)
@

Example of a chromosomeVis call that sends the visualization to the plot device:

\begin{center}
<<>>=
chromosomeVis(file = SonVariantsChr21, sample = "Son",
              chromosome = 21, centromeres = CentromeresHg19,
              plot = TRUE)
@
\end{center}

The chromosomeVis function can also visualize vcf files directly. An example of the input data and function setup is given in the chromosomeVis manual page.

The functions chromosomeVis and chromosomePlot produce a plot with genomic loci on the x axis and the ratio of alternative reads to sequencing depth on the y axis. This means that heterozygous variants (for which the expected ratio is 0.5) are placed below the red line at 0.75. Alternative homozygous variants, on the other hand, are placed above 0.75, with an expected ratio of 1. Some homozygous reference variants are also called; however, these are ignored by the majority of calling tools used for Illumina sequencing data. Position of the chromosome is marked with orange vertical lines. The continuous red line depicts the moving average of the ratio of alternative reads to sequencing depth.
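
As a small worked illustration of the quantity on the y axis (the symbols $r$, $a$ and $d$ are notation introduced here for illustration only and are not package identifiers), the ratio for a single variant with $a$ alternative reads at sequencing depth $d$ can be written as
\[
  r = \frac{a}{d}.
\]
For instance, hypothetical counts of $a = 11$ and $d = 23$ give $r = 11/23 \approx 0.48$, close to the expected heterozygous value of 0.5 and below the 0.75 line, whereas $a = 30$ and $d = 30$ give $r = 1$, the expected value for an alternative homozygous variant. The moving average drawn as the continuous red line is taken over this ratio across neighbouring variants.
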
This moving average value is based on 2000 variants and provides information about possible homozygous regions (where it potentially rises above 0.75). Green dots depict rare, coding, nonsynonymous variants with reliable depth. Currently, only one rare variant filter setting is provided:
\begin{itemize}
\item dbSNP frequency lower than 0.01,
\item coding,
\item nonsynonymous or nonsense variant,
\item sequencing depth greater than 10.
\end{itemize}

These rare variants are also written to the output file by the chromosomeVis function. Users can define their own list of variants to exclude, for example variants from an external or in-house database, using the posFilter option. The output file includes the same columns as the input file, but for rare variants only. Input files may include more columns than the example data (AshkenazimSonChr21). The package was tested to work efficiently with files containing 100 annotations (columns).

\subsection*{Dynamic visualization}

Dynamic visualization is performed by the rareVariantVis function. The function takes as input the rare variants file generated by the chromosomeVis function. An example of such a data frame is data(SonRareVariantsChr21). Example of the data and a function call:

<<>>=
head(SonRareVariantsChr21)
@

<<>>=
rareVariantVis(file = SonRareVariantsChr21, sample = "Son",
               chromosome = 21, centromeres = CentromeresHg19)
@

The rareVariantVis function produces an html file with genomic loci on the x axis and the ratio of alternative reads to sequencing depth on the y axis. The html file with the visualized rare variants is written to the current working directory. It is possible to point at variants and zoom in on the plot. Pointing at a variant highlights its properties. A right click on the plot cancels all changes made.

\section*{Trio analysis}

The RareVariantVis package is also designed for trio analysis. This approach makes it possible to observe inheritance patterns and potential de novo mutations.
Moreover, technical effects, regions with sequencing alignment problems and highly polymorphic genome regions can also be observed in the trio visualization. The trioVis function accepts chromosomeVis output from the mother, index and father samples. As output, it writes an interactive visualization to the current working directory.

<<>>=
trioVis(fileMother = MotherRareVariantsChr21,
        fileIndex = SonRareVariantsChr21,
        fileFather = FatherRareVariantsChr21,
        sampleMother = "Mother", sampleIndex = "Son",
        sampleFather = "Father",
        chromosome = 21, centromeres = CentromeresHg19)
@

Another output of the trio analysis is a summary table with Inheritance and Gene Count columns. This table provides information about the inheritance pattern for all variants of the Index sample. It is important to note that some records (for example de novo variants) may be false positives. These can be caused by highly polymorphic regions, alignment issues in some data, or inadequate variant calling in the mother or father samples. It is recommended to check candidate variants in the raw data (bam files), for example using the Integrative Genomics Viewer.

\end{document}