%\VignetteIndexEntry{Introduction to VariantTools} %\VignetteKeywords{variants, sequencing, SNPs, somatic mutations, RNA editing} %\VignettePackage{VariantTools} \documentclass[10pt]{article} \usepackage{times} \usepackage{hyperref} \textwidth=6.5in \textheight=8.5in \oddsidemargin=-.1in \evensidemargin=-.1in \headheight=-.3in \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rcode}[1]{{\texttt{#1}}} \newcommand{\software}[1]{\textsf{#1}} \newcommand{\R}{\software{R}} \newcommand{\Bioconductor}{\software{Bioconductor}} \newcommand{\VariantTools}{\Rpackage{VariantTools}} \title{An Introduction to \VariantTools} \author{Michael Lawrence, Jeremiah Degenhardt} \date{\today} \begin{document} \maketitle \tableofcontents <<>= options(width=72) @ \newpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Introduction} %%%%%%%%%%%%%%%%%%%%%%%%%%%%% This vignette outlines the basic usages of the \VariantTools package and the general workflow for loading data, calling single sample variants and tumor-specific somatic mutations or other sample-specific variant types (eg RNA editing). Most of the functions operate on alignments (BAM files) or datasets of called variants. The user is expected to have already aligned the reads with a separate tool, e.g., \software{GSNAP} via \Rpackage{gmapR}. \section{Calling single-sample variants} \subsection{Basic usage} \label{sec:quick-start} For our example, we take paired-end RNA-seq alignments from two lung cancer cell lines from the same individual. H1993 is derived from a metastatis and H2073 is derived from the primary tumor. Below, we call variants from a region around the p53 gene: <>= p53 <- gmapR:::exonsOnTP53Genome("TP53") @ <>= library(VariantTools) bams <- LungCancerLines::LungCancerBamFiles() bam <- bams$H1993 tally.param <- VariantTallyParam(gmapR::TP53Genome(), readlen = 100L, high_base_quality = 23L, which = range(p53)) called.variants <- callVariants(bam, tally.param) @ %$ % In the above, we load the genome corresponding to the human p53 gene region and the H1993 BAM file (stripped down to the same region). We pass the BAM, genome, read length and quality cutoff to the \Rfunction{callVariants} workhorse. The read length is not strictly required, but it is necessary for one of the QA filters. The value given for the high base quality cutoff is appropriate for Sanger and Illumina 1.8 or above. By default, the high quality counts are used by the likelihood ratio test during calling. The returned \Robject{called\_variants} is a variant \Rclass{GRanges}, in the same form as that returned by \Rfunction{bam\_tally} in the \Rpackage{gmapR} package. Unsurprisingly, \Rfunction{callVariants} uses \Rfunction{bam\_tally} internally to generate the per-nucleotide counts (pileup) from the BAM file. The result is then filtered to generate the variant calls. The \Rclass{VCF} class holds similar information; however, we favor the simple tally \Rclass{GRanges}, because it has a separate record for each ALT, at each position. \Rclass{VCF}, the class and the file format, has a single record for a position, collapsing over multiple ALT alleles, and this is much less convenient for our purposes. If we subset the variants by those in an actual p53 exon (not an intron), we find two: one with strong evidence for a homozygous mutation, and another with much weaker evidence (low coverage). <>= subsetByOverlaps(called.variants, p53, ignore.strand = TRUE) @ The next section goes into further detail on the process, including the specific filtering rules applied, and how one might, for example, tweak the parameters to avoid calling low-coverage variants, like the one above. \subsection{Step by step} \label{sec:step-by-step} The \Rfunction{callVariants} method for BAM files, introduced above, is a convenience wrapper that delegates to several low-level functions to perform each step of the variant calling process: generating the tallies, basic QA filtering and the actual variant calling. Calling these functions directly affords the user more control over the process and provides access to intermediate results, which is useful e.g. for diagnostics and for caching results. The workflow consists of three function calls that rely on argument defaults to achieve the same result as our call to \Rfunction{callVariants} above. Please see their man pages for the arguments available for customization. The first step is to tally the variants from the BAM file. By default, this will return observed differences from the reference, excluding N calls and only counting reads above 13 in mapping quality (MAPQ) score. There are three cycle bins: the first 10 bases, the final 10 bases, and the stretch between them (these will be used in the QA step). <>= raw.variants <- tallyVariants(bam, tally.param) @ Next, basic QA filters are applied. These include a minimum read count (2) check, minimum unique cycle count (2) check, and Fisher Exact Test on the per-strand counts vs. reference for strand bias (p-value cutoff: 0.001). If there are at least three cycle bins in the tallies, at least one read must present the variant in an internal cycle bin. The intent is to ensure that we have sufficient data and that the data are not due to strand-specific nor cycle-specific artifacts. <>= qa.variants <- qaVariants(raw.variants) @ The final step is to actually call the variants. The \Rfunction{callVariants} function uses a binomial likelihood ratio test for this purpose. The ratio is $P(D|p=p_{lower}) / P(D|p=p_{error})$, where $p_{lower} = 0.2$ is the assumed lowest variant frequency and $p_{error} = 0.001$ is the assumed error rate in the sequencing (default: 0.001). <>= called.variants <- callVariants(qa.variants) @ \subsection{Diagnosing the filters} \label{sec:filter-diag} The calls to \Rfunction{qaVariants} and \Rfunction{callVariants} are essentially filtering the tallies, so it is important to know, especially when faced with a new dataset, the effect of each filter and the effect of the individual parameters on each filter. The filters are implemented as modules and are stored in a \Rclass{FilterRules} object from the \Rpackage{IRanges} package. We can create those filters directly and rely on some \Rclass{FilterRules} utilities to diagnose the filtering process. Here we construct the \Rclass{FilterRules} that implements the \Rfunction{qaVariants} function. Again, we rely on the argument defaults to generate the same answer. <>= qa.filters <- VariantQAFilters() @ % We can now ask for a summary of the filtering process, which gives the number of variants that pass each filter, separately and then combined: <>= summary(qa.filters, raw.variants) @ % Now we retrieve the variants that pass the filters: <>= qa.variants <- subsetByFilter(raw.variants, qa.filters) @ We could do the same, except modify a filter parameter, such as the p-value cutoff for the Fisher Exact Test for strand bias: <>= qa.filters.custom <- VariantQAFilters(fisher.strand.p.value = 1e-4) summary(qa.filters.custom, raw.variants) @ % To get a glance at the additional variants we are discarding compared to the previous cutoff, we can subset the filter sets down to the Fisher strand filter, evaluate the old and new filter, and compare the results: <>= fs.original <- eval(qa.filters["fisherStrand"], raw.variants) fs.custom <- eval(qa.filters.custom["fisherStrand"], raw.variants) raw.variants[fs.original != fs.custom] @ We can also manipulate the filters that call the variants that have already passed the basic QA checks. <>= calling.filters <- VariantCallingFilters() summary(calling.filters, qa.variants) @ \subsection{Extending and customizing the workflow} \label{sec:extending} Since the built-in filters are implemented using \Rclass{FilterRules}, it is easy to mix and match different filters, including those implemented externally to the \Rpackage{VariantTools} package. This is the primary means of extending and customizing the variant calling workflow. % \section{Comparing variant sets across samples} % We have already called variants for the metastatic H1993 sample. We % leave the processing of the primary tumor H2073 sample as an exercise % to the reader and instead turn our attention to detecting the variants % that are specific to the metastatic sample, as compared to the primary % tumor. % \subsection{Calling sample-specific variants} % The function \Rfunction{callSampleSpecificVariants} takes the case % (e.g., tumor) sample and control (e.g., matched normal) sample as % input. In our case, we are comparing the metastatic line (H1993) to % the primary tumor line (H2073) from the same patient, a smoker. To % avoid inconsistencies, it is recommended to pass BAM files as input, % for which tallies are automatically generated, subjected to QA, and % called as variants vs. reference, prior to determining the % sample-specific variants. % Here, we find the somatic mutations from a matched tumor/normal % pair. Since we are starting from BAM files, we have to provide % \Robject{tally.param} for the tally step. % <>= % options(error=recover) % control.raw <- tallyVariants(bams$H2073, tally.param) % control.called.noqa <- callVariants(control.raw) % control.called <- callVariants(qaVariants(control.raw)) % control.cov <- coverage(bams$H2073, drop.D.ranges = TRUE) % filters <- SampleSpecificVariantFilters(control.raw, control.cov, % VariantCallingFilters()) % somatic <- callSampleSpecificVariants(bams$H1993, bams$H2073, tally.param) % @ % %$ % This can be time-consuming for the entire genome, since the tallies % need to be computed. To avoid repeated computation of the tallies, the % user can pass the raw tally \Rclass{GRanges} objects instead of the % BAM files. This is less safe, because anything could have happened to % those \Rclass{GRanges} objects. % The QA and initial calling are optionally controlled by passing % \Rclass{FilterRules} objects, typically those returned by % \Rfunction{VariantQAFilters} and \Rfunction{VariantCallingFilters}, % respectively. For controlling the final step, determining the % sample-specific variants, one may pass filter parameters directly to % \Rfunction{callSampleSpecificVariants}. Here is an example of % customizing some parameters. % <>= % calling.filters <- VariantCallingFilters(read.count = 3L) % somatic <- callSampleSpecificVariants(tumor.bam, normal.bam, tally.param, % calling.filters = calling.filters, % power = 0.9, p.value = 0.001) % @ \section{Exporting the calls as VCF} VCF is a common file format for communicating variants. To export our variants to a VCF file, we first need to coerce the \Rclass{GRanges} to a \Rclass{VCF} object. Then, we use \Rfunction{writeVcf} from the \Rpackage{VariantAnnotation} package to write the file (indexing is highly recommended for large files). <>= vcf <- variantGR2Vcf(called.variants, sample.id = "H1993", project = "VariantTools_Vignette") @ <>= writeVcf(vcf, "H1993.vcf", index = TRUE) @ % \subsection{Checking variant concordance} % <>= % concord <- checkVariantConcordance(tumor.variants, normal.variants) % concord % @ % \section{Making some plots} % It is important to diagnose the behavior of these algorithms, and we % provide some exploratory plots to facilitate this. % The first plot shows us the mutation transition/transversion rate matrix % <>= % #plotTitv(TS, main = "tumor specific mutations") % @ % and finally we want to plot our variants on the genome % <>= % #plotTumor(TS, tumor$filtered.granges) % @ \end{document}