%\VignetteIndexEntry{VariantFiltering: filter coding and non-coding genetic variants} %\VignetteKeywords{genetics, variants, annotation, filtering, inheritance} %\VignettePackage{VariantFiltering} \documentclass{article} <>= BiocStyle::latex(use.unsrturl=FALSE) @ \usepackage{natbib} \title{VariantFiltering: filtering of coding and non-coding genetic variants} \author{Dei M. Elurbe\,$^{1,2}$ and Robert Castelo\,$^{3,4}$} \begin{document} \SweaveOpts{eps=FALSE} \DefineVerbatimEnvironment{Sinput}{Verbatim} {formatcom = {\color{Sinput}}} \DefineVerbatimEnvironment{Soutput}{Verbatim} {formatcom = {\color{Soutput}}} \DefineVerbatimEnvironment{Scode}{Verbatim} {formatcom = {\color{Scode}}} \definecolor{Sinput}{rgb}{0.21,0.49,0.72} \definecolor{Soutput}{rgb}{0.32,0.32,0.32} \definecolor{Scode}{rgb}{0.75,0.19,0.19} <<>= pdf.options(useDingbats=FALSE) options(width=60) @ \maketitle \begin{quote} {\scriptsize \noindent $^{1}$CIBER de Enfermedades Raras (CIBERER), Barcelona, Spain \\ $^{2}$Present address: CMBI, Radboud University Medical Centre, Nijmegen, The Netherlands. \\ $^{2}$Department of Experimental and Health Sciences, Universitat Pompeu Fabra, Barcelona, Spain. \\ $^{3}$Research Program on Biomedical Informatics (GRIB), Hospital del Mar Medical Research Institute, Barcelona, Spain. } \end{quote} \section{Overview} The aim of this software package is to facilitate the filtering and annotation of coding and non-coding genetic variants from a group of related or unrelated individuals among which at least one of them is affected by a genetic disorder. When working with related individuals, \Biocpkg{VariantFiltering} can search for variants from the affected individuals that segregate according to a particular inheritance model acting on autosomes (dominant, recessive homozygous or recessive heterozygous -also known as compound heterozygous) or allosomes (X-linked), or that occur \emph{de novo}. When working with unrelated individuals, no mode of inheritance is used for filtering but it can be used to search for variants shared among individuals affected by a common genetic disorder. \Biocpkg{VariantFiltering} exploits the R/Bioconductor infrastructure to annotate the sought variants with a number of functional annotations. Many of these annotations come from annotation packages, such as \Biocpkg{MafDb.ALL.wgs.phase1.release.v3.20101123}, which stores and exposes to the user minimum allele frequency (MAF) values frozen from the latest realease of the 1000 Genomes project. The main input are Variant Call Format (VCF) files which are parsed using the functionality from the \Biocpkg{VariantAnnotation} package for that purpose. This package contains a toy data set for illustrative purposes only, consisting of a multisample VCF file with variants from chromosomes 20, 21, 22 and allosomes from a trio of CEU individuals from the 1000 Genomes project. To further reduce the execution time of this vignette, only the code for the first analysis that searches for autosomal recessive homozygous variants is actually called and its results reported. \section{Using the package with parallel execution} Functions in \Biocpkg{VariantFiltering} to annotate and filter variants leverage the functionality of the Bioconductor package \Biocpkg{BiocParallel} to perform in parallel some of the tasks and calculations and reduce the overall execution time. These functions have an argument called \Robject{BPPARAM} that allows the user to control how this parallelism is exploited. In particular the user must give as value to this argument the result from a call to the function \Rfunction{bpparam()}, which actually is its default behavior. Here below we modify that behavior to force a call being executed without parallelism. The interested reader should consult the help page of \Rfunction{bpparam()} and the vignette of the \Biocpkg{BiocParallel} for further information. \section{Setting up the analysis} To start using \Biocpkg{VariantFiltering} the user should consider installing the packages listed in the \Rcode{Suggests:} field from its \Rcode{DESCRIPTION} file. After loading \Biocpkg{VariantFiltering} the first step is to build a parameter object, of class \Rclass{VariantFilteringParam} which requires at least a character vector of VCF filenames, as follows: <<>>= library(VariantFiltering) CEUvcf <- file.path(system.file("extdata", package="VariantFiltering"), "CEUtrio.vcf.bgz") CEUped <- file.path(system.file("extdata", package="VariantFiltering"), "CEUtrio.ped") param <- VariantFilteringParam(vcfFilenames=CEUvcf, pedFilename=CEUped) param @ In this case, we are also providing a PED file because the analysis illustrated below works on a trio filtering variants with a particular inheritance model. \section{Autosomal recessive inheritance analysis: Homozygous} Homozygous variants responsible for a recessive trait in the affected individuals can be identified calling the \Rfunction{autosomalRecessiveHomozygous()} function. This function takes a \Rclass{VariantFilteringParam} object as main argument and selects homozygous variants that are present in all the affected individuals and occur in heterozygosity in the unaffected ones. We change the default value of the \Rcode{BPPARAM} argument to the object returned by a call to \Rfunction{bpparam("SerialParam")} to disable the parallel execution of this function when building this vignette. For this reason, we need to load the \Biocpkg{BiocParallel} package first and perform the call to the function as follows: <>= library(BiocParallel) reHo <- autosomalRecessiveHomozygous(param, BPPARAM=bpparam("SerialParam")) reHo @ The resulting object belongs to the \Rclass{VariantFilteringResults} class of objects which eases the task of applying further functional annotation filters to select and prioritize variants. The help page of the \Rclass{VariantFilteringResults} class contains a list of available accessor methods. Here we illustrate the use of a few of them: <>= maxMAF(reHo) <- 0.05 MAFmask <- MAFpop(reHo) MAFmask MAFpop(reHo) <- !MAFmask MAFpop(reHo, "ASN_AFKG") <- TRUE MAFpop(reHo) minCRYP5ss(reHo) <- 0 reHo filteredVariants(reHo) @ \section{Autosomal recessive inheritance analysis: Heterozygous} To filter by this mode of inheritance, also known as compound heterozygous, we need two unaffected parents/ancestors and at least one affected descendant. Variants are filtered in five steps: 1. select heterozygous variants in one of the parents and homozygous in the other; 2. discard previously selected variants that are common between the two parents; 3. group variants by gene; 4. select those genes, and the variants that occur within them, which have two or more variants and there is at least one from each parent; 5. from the previously selected variants, discard those that do not occur in the affected descendants. This is implemented in the function \Rfunction{autosomalRecessiveHeterozygous()}. <>= compHet <- autosomalRecessiveHeterozygous(param) @ \section{Autosomal dominant inheritance analysis} The function \Rfunction{autosomalDominant()} identifies variants present in all the affected individual(s) discarding the ones that also occur in at least one of the unaffected subjects. <>= dom <- autosomalDominant(param) @ \section{X-Linked inheritance analysis} The function \Rfunction{xLinked()} identifies variants that appear only in the X chromosome of the unaffected females as heterozygous, don't appear in the unaffected males analyzed and finally are present (as homozygous) in the affected male(s). This function is currently restricted to affected males, and therefore, it cannot search for X-linked segregating variants affecting daughters. <>= xlid <- xLinked(param) @ \section{De Novo variants analysis} The function \Rfunction{deNovo()} searches for \emph{de novo} variants which are present in one descendant and present in both parents/ancestors. It is currently restricted to a trio of individuals. <>= deNovo <- deNovo(param) @ \section{Variants analysis for all possible inheritance models} Sometimes more than one mode of inheritance may be compatible with the phenotype of the individuals. To facilitate exploring the different inheritance models we provide a function called \Rfunction{allInheritanceModels()} which does not filter upfront for any mode of inheritance but it allows the user to switch among them by using the shiny app launched by the function \Rfunction{reportVariants()}. As a consequence, the object returned by \Rfunction{allInheritanceModels()} will be have a larger memory footprint. This analysis can be currently performed to a number of related individuals between one and four with at least one of them being affected. <>= aim <- allInheritanceModels(param) aim <- reportVariants(aim) @ \section{Variants analysis on unrelated individuals} When the individuals to analyze are unrelated, and therefore, there is no mode of inheritance to use as a filter, we can filter variants using the function \Rfunction{unrelatedIndividuals()}. Thus, the goal here will be to identify variants that are responsible for a phenotype common to all the individuals. Similarly to \Rfunction{allInheritanceModels()}, no variants are filtered out upfront, and therefore, the main memory requirements may be larger. <>= uind <- unrelatedIndividuals(param) @ \section{Create a report from the filtered variants} The function \Rfunction{reportVariants()} allows us to easily create a report from the filtered variants into a CSV or a TSV file as follows: <>= reportVariants(reHo, type="csv", file="reHo.csv") @ However, the default value on the \Robject{type} argument (\Robject{"shiny"}) starts a shiny web app which allows one to interactively filter the variants, obtaining an udpated \Rclass{VariantFilteringResults} object and downloading the filtered variants and the corresponding full reproducible R code, if necessary. This \Rfunction{reportVariants()} function is designed to provide for different options according to the exact type of \Rclass{VariantFilteringResults} object is dispatching. For example, for results objects resulting from the function \Rfunction{allInheritanceModels()}, it will show a tab named "Inheritance" where the user can allocate the individuals present in the VCF file according to different inheritance models. <>= aim <- reportVariants(aim) @ The previous call should open a browser window with a web app similar to the one in Figure~\ref{shinyapp}. The button "Save \& Close" will close the web app and return the results object updated according to the filters switched on or off at the web app. \begin{figure}[ht] \centerline{\includegraphics[width=\textwidth]{shinysnapshotVariantFiltering.jpg}} \caption{Snapshot of the shiny web app run from VariantFiltering with the function \Rfunction{reportVariants()}. Some of the parameters has been filled for illustrative purposes.} \label{shinyapp} \end{figure} \section{Current limitations of the package} The package has a number of limitations and shortcomings which we will list here and update with feedback from users. We will make our best to try to address these issues in forthcoming releases of the package: \begin{itemize} \item Only SNVs are identified to a particular version of dbSNP using a Bioconductor SNPlocs.* package. \item Multi-allelic variants, i.e., variants with more than one alternate allele, are not yet supported and can prompt errors. \item Filtering by the autosomal recessive heterozygous mode of inheritance is currently restricted to coding variants. \item The X-linked mode of inheritance is currently restricted to affected sons, and therefore, it cannot search for X-linked segregating variants from affected daughters. \item The {\it de novo} mode of inheritance is currently restricted to a trio with two parents and one descendant. \item The function \Rfunction{allInheritanceModels} is currently restricted to a maximum of four individuals. \item The "Generate Report" button from the shiny app does not provide the full reproducible code when using \Rfunction{allInheritanceModels()} function for the analysis. \item Scoring of splice sites is currently restricted to human and the whole package is currently pretty much human-oriented. \end{itemize} \section{Session information} <>= toLatex(sessionInfo()) @ \end{document}