Comparative Identification and Assessment of Single Nucleotide Variants Through Shifts in Base Call Frequencies
Table of Contents
1 Motivation
The advances in high-throughput nucleotide sequencing have lead to new possibilities of identifying genomic alterations, which is of particular interest for cancer research. While the detection of genomic variants is often conducted by comparing against a population-based reference sequence (such as the human GRCh 37), in the investigation of tumors a comparison between matched samples is predominantly pursued. Considering a test sample \(T\) and a control sample \(C\), the aim is to identify single nucloetide variants (SNVs) which show significant changes between the two. Typically, this is applied for comparing a tumor sample against a matched normal/constitutive sample of the same patient, focusing on alterations that may have occurred in the transition to the cancerous tissue, and therefore are often termed somatic variants. However, a broader range of questions can be addressed by extending the focus to include genomic changes of temporally or spatially separated samples to each other, to investigate tumor evolution and subclonal shifts [11].
Extending the scope of comparative variant analyses in a cancer related setting may require an accompanying change in definitions. In the classical tumor versus normal setting, a somatic event is characterized by both the presence of the variant in the tumor and the absence in the matched normal [16]. This definition is not satisfactory for comparing different tumor stages against each other. If we imagine two stages of a tumor haboring multiple subclones, variants may be present in both stages with changing abundances. Here, the classical definition of a somatic variant would result in missing the event. In the following, we will therefore focus on estimating the shift in variant allele frequencies between a test and a control sample. The classical somatic definition requiring the variant to be absent in the control constitutes a special case of this.
Numerous approaches focus on more narrowly defined somatic comparisons, motivated by scientific questions in the field of cancer genomics [15] [16]. Most existing methods do not allow an explicit statement about the evidence for the absence of a putative variant in a sample, and a clear distinction is needed whether a negative call results from a true negative site or the lack of statistical power to detect it. As an example, imagine a variant calling approach using a hypothesis test with the null hypothesis that the variant frequencies at a locus are identical for both samples. While a significant p-value provides evidence for the presence of a variant, inverting this argument does not work: A non-significant p-value can both be indicative of the absence or insufficient power. Furthermore, most methods return only a point estimate for the variant frequency in the tumor, without stating the confidence of the estimate. However, this information would be critical to reliably differentiate between variants arising from competing subclonal populations.
With the Rariant
package, we pursue a generalized approach for identifying and
quantifying SNV alterations from high-throughput sequencing data in comparative
settings. By focusing on shifts of the non-consensus base call frequencies,
events like loss of heterozygosity and clonal expansions in addition to somatic
variants can be detected. In the following, we will 1) outline the methodological
framework, 2) provide an example workflow on how to obtain a call set starting
from short-read alignments, and 3) illustrate potential benefits of the methods on
a biological cancer data set. This methodology can be used for a high
performance identification of variant sites as well as to quantitatively assess
the presence or absence in comparisons between matched samples.
2 Methodology
2.1 Comparative shift of non-consensus base call frequencies
Starting with a set of aligned reads we observe the sequencing depth \(N^{S}_{i}\) and the number of non-consensus base calls \(K^{S}_{i}\) for a sample \(S\) at position \(i\). Non-consensus is here defined as base calls that differ from the consensus sequence, which can be either the reference sequence or, in a comparative setting, the most abundant base call in the control sample. The non-consensus rate \(p^{S}_{i}\), which we assume to be binomially distributed, can then be estimated as
$$\hat{p}^{S}_{i} = \frac{K^{S}_{i}}{N^{S}_{i}}$$
for sites with \(N^{S}_{i} > 0\). Since \(K^{S}_{i} \leq N^{S}_{i}\), the rates are bound to the range \([0,1]\).
The true non-consensus rate
$$p^{S}_{i} = v^{S}_{i} + e^{S}_{i}$$
comprises the presence of a putative variant with a frequency \(v^{S}_{i}\) and a technical error rate \(e^{S}_{i}\). In order to detect and describe the change in the variant frequency, we focus on the shift \(d_{i}\) in non-consensus rates as the difference of the rates between the test and control samples, which we estimate as
$$\hat{d}_{i} = \hat{p}^{T}_{i} - \hat{p}^{C}_{i}.$$
If we assume that the true site-specific technical error rates are identical between the two matched samples [10], the difference of the rates yields an unbiased estimate for the change in the variant frequency. Thus, positions not haboring biological alterations will result in \(\hat{d}_{i} \approx 0\).
2.2 Confidence intervals
Distinguishing biological variants from noise requires knowledge about the variance of the point estimate \(\hat{d_{i}}\). By constructing a confidence interval (CI) for \(d_{i}\) with confidence level \(\beta\) [14], we assess the certainty of the estimated shift in non-consensus frequencies. The probability of the true value being outside the confidence interval is less than \(\alpha = 1 - \beta\). This is in concordance with the type I or \(\alpha\) error definition in statistical testing.
Under the assumption that the non-consensus counts \(K^{S}_{i}\) in our samples follow binomial distributions with parameters \(p^{S}_{i}\) and \(N^{S}_{i}\), several methods have been established for estimating confidence intervals for the difference of two rate parameters [12] [7]. The performance of an approach is generally described in terms of its coverage probability indicating the probability of a confidence interval to cover the true value (see 4.2). Coverage probabilities greater and less than the confidence level \(\beta\) describe conservative and liberal behaviors, respectively. Due to the conservative coverage probabilities and high computational effort of exact confidence interval estimates, approximate methods are generally preferred [1] [7].
The Agresti-Caffo (AC) confidence interval [2]
$$\tilde{p}^{T} - \tilde{p}^{C} \pm z \sqrt{ \frac{\tilde{p}^{T} (1 - \tilde{p}^{T})} {\tilde{N}^{T}} + \frac{\tilde{p}^{C}(1 - \tilde{p}^{C})} {\tilde{N}^{C}} }$$
with
$$\tilde{p}^{X} = \frac{K^{X}+\zeta}{N^{X}+2\zeta},$$
$$\tilde{N}^{X} = N^{X} + 2\zeta,$$
$$\zeta = \frac{1}{4} z^2,$$
and \(z = z_{(1-\beta)/2}\) as the upper \((1-\beta)/2\) percentile of the standard normal distribution), is an approximation of the score test-based confidence interval. Several publications emphasize the usefulness and advantages of the AC method over related approaches [7] [3] [5].
2.2.1 Decision making with confidence intervals
While the estimate for the shift in the non-consenus frequency \(\hat{d}\) indicates the change in abundance and direction of a variant, the corresponding confidence interval gives us information about the precision and power of the estimate. Generally, wide confidence intervals will be present at sites with little statistical power, as due to low sequencing depths.
For the case that we compare a tumor to a matched normal sample, we show a set of hypothetical cases that can be distinguished by regarding the point estimate and its confidence interval:
- Presence of a somatic, heterozygous variant
- Presence of a somatic, subclonal variant
- Presence of loss of heterozygosity
- Absence of a somatic variant
- Presence or absence of a variant cannot be distinguished due to the low certainty of the estimate
- No power due to insufficient sequencing depth
2.3 Distinguishing event classes
Focusing on the comparative shift of non-consensus frequencies allows us to detect and
distinguish different types of events. Since Rariant
does not make explicit
assumptions about the abundance of a potential variant in the control sample, we
are further able to find clonal shifts, for example between different tumor
samples, or losses of heterozygocity. Generally, gains and losses of variant
alleles are characterized by positive and negative values of \(d\), respectively.
For a differentiated interpretation of the results, we classify a variant into
one of four classes:
- somatic
- A somatic variant that does not occur in the control sample
- hetero/LOH
- A shift away from heterozygous SNP in the control sample
- undecided
- Both of the 'somatic' or 'hetero' are possible
- powerless
- A distinction between the two classes cannot be made due to a lack of power
The classification is based on two binomial tests for each position:
- Somatic variants where the variant allele is not present in the control sample, rejecting a binomial test with the alternative hypothesis \(H_{1}: p^{C} > 0\).
- Sites with a loss of heterozygosity with a shift away from a heterozygous variant in the control sample, rejecting a binomial test with the alternative hypothesis \(H_{1}: p^{C} \neq \frac{1}{2}\).
2.4 Identifying variant sites in large datasets
The method that we have described before is suited for detecting variant positions efficiently in large sequencing datasets, including whole-exome and whole-genome sequencing. For this purpose, we test for a shift in non-consensus frequencies between two samples at each genomic position individually:
- Form the base counts table for four bases A, C, G, T from the aligned reads. In order to reduce the number of false counts, we can optionally exclude reads with low base calling quality and clip the head of each read.
- Determine the consensus sequence: In a comparative setting, we will use the most abundant base call.
- Calculate the sequencing depth \(N^{S}_{i}\), mismatch counts \(K^{S}_{i}\), and derived statistics for both samples, based on the consensus sequence (see 2.1).
- Find potential variant sites with a Fisher's Exact Test, comparing the number of mismatching and total bases between the samples: \({K^{T}_{i}, N^{T}_{i}, K^{C}_{i}, N^{C}_{i}}\). The p-values are corrected for multiple testing according to the Benjamini-Hochberg procedure. Only positions rejecting the null hypothesis at a significance level \(\alpha\) are further on considered as potential variants.
- Calculate Agresti-Caffo confidence intervals with confidence level \(\beta\), in order to evaluate presence or absence of the variant (see 2.2).
- Classify variant sites into the groups: somatic, LOH, undecided, and powerless (see 2.3).
3 Workflow
3.1 Multiple Sample Simulation Study
We want to further demostrate the usage and abilities of the Rariant
package
on a real-life data set. Due to legal and privacy issues, most human cancer
sequencing data is not publicly accessible and therefore cannot serve as an
example data set here. Alternatively, we conduct an analysis to mimic the
characteristics of current cancer sequencing studies.
For the purpose of the analysis, we compare three samples from the 1000 Genomes
project [6], serving as a
control/normal (control
) and two related test/tumor samples (test
and
test2
). Further, we simulate a clonal mixture (mix
) of the two test samples
by combining their reads.
To better understand the evidence for the presence or absence of particular variants across samples, we plot the confidence intervals, colored according to the predicted event type, and abundance shifts for all sites of interest, colored according to the sign of the shift.
In the following, we look at positions which showed a significant effect in at least one sample. This gives us 12 positions to consider in the following.
While most of the variants are somatic, i.e. they do not appear in the control sample, the last variant position shows a loss of a heterozygous SNP. Looking for example in more detail into the group of 5 variant sites around 7.85 Mbp: We can identify them as consistent with a heterozygous somatic variant in the first sample, since their 95% CIs overlap the value of 0.5. In contrast, we can show the absence of the same variants in the second sample. The third sample again shows the presence of the variants, as seen in the first case, but with lower abundance. Such a result could be expected in a mixture of subclones, in which some clones carry a somatic variant and others not. Further, we can also see the case of the next variant which consistently exists in all three samples with the same abundance.
4 Supplementary Information
4.1 Strand-specific analysis
By comparing the confidence intervals between the two strands, we can further detect and characterize effects such as variations in sequencing depth and strand biases. We illustrate this with a set of hypothetical cases for confidence intervals for two strands. The upper row (cases 4-7) corresponds to sites with overlapping CIs, whereas the lower row (cases 1-3) shows cases of disagreements between the CIs indicative of strand biases. When analyzing the probability for the overlap of confidence intervals, an adjustment of the confidence level has to be taken into account [8].
Motivated by the analysis of different Illumina genome and exome sequencing, we consider strand-biases, in which the non-consensus base call rates differ significantly between strands at sites with sufficient sequencinq depth, a neglectable problem with current data sets and analysis pipelines (see also Best practices for short-read processing). In the presence of strand biases, pooling the counts of both plus and minus strand may be not desirable. A possible solution may be to perform a strand-specific analysis, and later combine the resulting statistics. Gerstung and colleagues discuss different approaches for combining p-values [9], in particular taking the minimum, maximum, average, or Fisher combination. These can be also applied for confidence intervals, with Fisher's method being equivalent to taking the sum of both strands.
4.2 Assessing performance of confidence interval methods
As outlined before, an important property for assessing confidence intervals is given by their coverage probabilities. Ideally, we would expect a method to have coverage probabilities close to the nominal confidence level β over a wide range in the parameter space. Previous publications analyzing the performance focus on parameter settings that deviate from those of sequencing data sets [7]. Therefore, we perform a simulation that demonstrates the behavior of the Agresti-Caffo methods for a whole-genome sequencing study. For a fixed sequencing depth of 30 in both test and control sample, the coverage probability of 95% AC confidence intervals is computed for all possible combinations of mismatch counts \(K^{T}\) and \(K^{C}\).
For mismatch rates close to 0 or 1 in both samples, the Agresti-Caffo method shows a conservative perfomance.
4.3 Benchmarking of performance and resources
For an analysis of two matched human tumor samples, we performed a benchmark to
assess the computational time and memory usage on a standard laptop (Thinkpad
X220 built in 2011). Both samples contain about 95M reads mapped to the
1000genomes reference sequence reads that are considered in the analysis. For
the analysis of chromosome 22, the analysis with default parameters required
~873s and 600MB of RAM. For an analysis of all linear toplevel chromosomes
(autosomes and allosomes), this would require ~15h of time. Please consider
that the current version of Rariant
is under active development and
computational efficiency will increase with newer versions.
5 Frequently Asked Questions
5.1 Getting help
We welcome emails with questions or suggestions about our software, and want to ensure that we eliminate issues if and when they appear. We have a few requests to optimize the process:
- All emails and follow-up questions should take place over the Bioconductor mailing list, which serves as a repository of information.
- The subject line should contain Rariant and a few words describing the problem. First search the Bioconductor mailing list, for past threads which might have answered your question.
- If you have a question about the behavior of a function, read the sections of
the manual page for this function by typing a question mark and the function
name, e.g.
?rariant
. Additionally, read through the vignette to understand the interplay between different functions of the package. We spend a lot of time documenting individual functions and the exact steps that the software is performing. - Include all of your R code related to the question you are asking.
- Include complete warning or error messages, and conclude your message with the
full output of
sessionInfo()
.
5.2 Installing the package
Before you want to install the Rariant
package, please ensure that
you have the latest version of R
and Bioconductor
installed. For details on
this, please have a look at the help packages for R and Bioconductor. Then you
can open R
and run the following commands which will install the latest
release version of Rariant
:
6 References
References
[1] | Alan Agresti and Brent A. Coull. Approximate is better than exact for interval estimation of binomial proportions. The American Statistician, 52(2):119126, 1998. [ http ] |
[2] | Alan Agresti and Brian Caffo. Simple and effective confidence intervals for proportions and differences of proportions result from adding two successes and two failures. The American Statistician, 54(4):280288, 2000. [ http ] |
[3] | Walter W. Piegorsch. Sample sizes for improved binomial confidence intervals. Computational Statistics & Data Analysis, 46(2):309-316, June 2004. [ DOI | http ] |
[4] | Yoav Benjamini and Daniel Yekutieli. False discovery rateadjusted multiple confidence intervals for selected parameters. Journal of the American Statistical Association, 100(469):7181, 2005. [ http ] |
[5] | Frank Schaarschmidt, Martin Sill, and Ludwig A. Hothorn. Approximate simultaneous confidence intervals for multiple contrasts of binomial proportions. Biometrical Journal, 50(5):782792, 2008. [ DOI | http ] |
[6] | The 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature, 467(7319):1061-1073, October 2010. [ DOI | .html ] |
[7] | Morten W. Fagerland, Stian Lydersen, and Petter Laake. Recommended confidence intervals for two independent binomial proportions. Statistical methods in medical research, 2011. [ http ] |
[8] | Mirjam J. Knol, Wiebe R. Pestman, and Diederick E. Grobbee. The (mis)use of overlap of confidence intervals to assess effect modification. European Journal of Epidemiology, 26(4):253-254, April 2011. PMID: 21424218 PMCID: PMC3088813. [ DOI | http ] |
[9] | Moritz Gerstung, Christian Beisel, Markus Rechsteiner, Peter Wild, Peter Schraml, Holger Moch, and Niko Beerenwinkel. Reliable detection of subclonal single-nucleotide variants in tumour cell populations. Nature Communications, 3:811, May 2012. [ DOI | .html ] |
[10] | Omkar Muralidharan, Georges Natsoulis, John Bell, Hanlee Ji, and Nancy R. Zhang. Detecting mutations in mixed sample sequencing data using empirical bayes. The Annals of Applied Statistics, 6(3):1047-1067, September 2012. Zentralblatt MATH identifier06096521, Mathematical Reviews number (MathSciNet) MR3012520. [ DOI | http ] |
[11] | Lucy R. Yates and Peter J. Campbell. Evolution of the cancer genome. Nature Reviews Genetics, 13(11):795, November 2012. [ DOI | .html ] |
[12] | Joseph L. Fleiss, Bruce Levin, and Myunghee Cho Paik. Statistical methods for rates and proportions. John Wiley & Sons, 2013. [ http ] |
[13] | Geoffrey Decrouez and Peter Hall. Split sample methods for constructing confidence intervals for binomial and poisson parameters. Journal of the Royal Statistical Society: Series B (Statistical Methodology), page n/an/a, 2013. [ DOI | http ] |
[14] | Alan Agresti. Categorical data analysis. Wiley, Hoboken, NJ, 2013. |
[15] | Su Y. Kim and Terence P. Speed. Comparing somatic mutation-callers: beyond venn diagrams. BMC Bioinformatics, 14(1):189, June 2013. PMID: 23758877. [ DOI | http ] |
[16] | Nicola D. Roberts, R. Daniel Kortschak, Wendy T. Parker, Andreas W. Schreiber, Susan Branford, Hamish S. Scott, Garique Glonek, and David L. Adelson. A comparative analysis of algorithms for somatic SNV detection in cancer. Bioinformatics, 29(18):2223-2230, September 2013. PMID: 23842810. [ DOI | http ] |