%\VignetteIndexEntry{MBASED} %\VignetteDepends{} %\VignetteKeywords{MBASED} \documentclass{article}[12pt] <>= BiocStyle::latex() @ \begin{document} \title{MBASED Overview} \maketitle \section{MBASED algorithm} The package MBASED (\textbf{M}eta-analysis-\textbf{B}ased \textbf{A}llele-\textbf{S}pecific \textbf{E}xpression \textbf{D}etection) implements an algorithm for identifying instances of allele-specific gene expression in RNA count data. In one-sample analyses, MBASED identifies within-sample deviations from the null hypothesis of equal allele expression, while for two-sample analyses, significant differences in allelic ratios between samples with the same haplotypes are detected. This means that the comparison is meaningful for samples from the same individual, but not necessarily for samples from different individuals. MBASED uses principles of meta-analysis to combine information across loci within a single unit of expression (which we call \texttt{aseID}, e.g., a gene, a transcript, or an individual SNV), without relying on \emph{a priori} knowledge of phased haplotypes. The informative loci are usually heterozygous SNVs, but could also be aggregations of SNVs or other variants that have been resolved into the same haplotype. In the extreme case of complete phasing, loci could be genes themselves, and the approach presented here reduces to a (beta-)binomial test of difference between reads mapping to different haplotypes. In the following discussion, we will use `gene' in place of aseID and `SNV' in place of `locus', as these are the units we typically use in practice. The details of the statistical approach implented by MBASED can be found in the accompanying paper. Here we briefly mention a few items pertaining to the interpretation of MBASED output. \subsection{1-Sample analysis} For each gene $g$ with at least one expressed heterozygous locus, we test the null hypothesis $p_{hap1,g}=p_{hap2,g}=0.5$, where $p_{hap1,g}$ and $p_{hap2,g}$ denote the underlying fractions of transcribed copies of gene $g$, specific to each haplotype. Note that we assume that gene $g$ is represented by exactly two, and not more, haplotypes. The alternative hypothesis is $p_{hap1,g} \neq p_{hap2,g}$. The beta-binomial statistical model used by MBASED is flexible and allows ASE assessment in presence of pre-existing allelic bias (e.g., over-representation of reference allele-supporting reads due to enrichment protocols even in absence of any underlying ASE) and overdispersion. We show examples of this flexibility below, and the user is referred to the accompanying paper for the details of the algorithm used by MBASED. For each gene $g$ the output of MBASED includes \begin{enumerate} \item Estimate of the underlying RNA major haplotype (allele) frequency (\texttt{majorAlleleFrequency}, MAF). \item P-value for the test of null hypothesis of no ASE (\texttt{pValueASE}). Note that the null hypothesis can be reparametrized in terms of major haplotype frequency as $H_0: MAF>0.5$. The reported p-values are not adjusted for multiple hypothesis testing, and such adjustment should be carried out by the user when looking for statistical significance across multiple genes. \item P-value for test of null hypothesis of no inter-SNV variability of ASE in the gene (only for genes with $>1$ tested SNVs, \texttt{pValueHeterogeneity}). Small p-values correspond to significantly inconsistent ASE measurements at individual SNVs within a gene. If the unit of expression is a gene with multiple transcript isoforms, the observed inconsitency might indicate isoform-specific ASE. Alternatively, SNVs for such genes might exhibit background variability of ASE measures in excess of what is specified by the user (MBASED requires the user to specify the level of variability in the data, see below). \end{enumerate} \subsection{2-Sample analysis} \label{sec:2sample} For each gene $g$ with at least one heterozygous locus expressed in both samples, we test the null hypothesis $p_{hap1,g, sample1}=p_{hap2,g, sample2}=p_{hap,1}$, where $p_{hap1,g, sample1}$ and $p_{hap1,g, sample1}$ denote the underlying fractions of transcribed copies of gene $g$, specific to haplotype 1 in each sample. No requirement is imposed that $p_{hap1,g,sample1}=0.5$, since we are interested in differences in ASE pattern between the two samples, not in whether they both exhibit ASE. Just as in 1-sample analysis, we assume that gene $g$ is represented by exactly two, and not more, haplotypes, and we further require that these haplotypes be the same in both samples. The alternative hypothesis is $p_{hap1,g, sample1} \neq p_{hap1,g, sample2}$. For each gene $g$ the output of MBASED includes \begin{enumerate} \item Estimate of underlying major haplotype frequency difference, $p_{hap_{maj},g,sample1} - p_{hap_{maj},g,sample2}$, where $hap_{maj}$ is the major of the two haplotypes in sample 1 (\texttt{majorAlleleFrequencyDifference}). This is our measure of between-sample allelic imblance (the 2-sample ASE). Note that assignment of haplotypes to `major' and `minor' is based exclusively on sample 1, since the 2-sample analysis is designed to look for ASE specific to sample 1. If ASE specific to sample 2 is of interest, the roles of the samples should be reversed. \item P-value for the test of null hypothesis of no differential ASE (\texttt{pValueASE}). Analogous to 1-sample analysis, the reported p-values are not adjusted for multiple hypothesis testing, and such adjustment should be carried out by the user when looking for statistical significance across multiple genes. \item P-value for test test of null hypothesis of no inter-SNV variability of ASE in the gene (only for genes with $>1$ tested SNVs, \texttt{pValueHeterogeneity}). Small p-values correspond to significantly inconsistent 2-sample ASE measurements at individual SNVs within a gene. If the unit of expression is a gene with multiple transcript isoforms, the observed inconsitency might indicate isoform-specific allelic imbalance between the two samples. \end{enumerate} \section{Caveat Emptor} MBASED operates on read counts from heterozygous expressed SNVs. The onus is on the user to make sure that the supplied counts do in fact correspond to heterozygous SNVs. Any read counts coming from a homozygous SNV will by definition result in the detection of monoallelic expression. Further, it is up to the user to pre-filter SNVs for any known artifacts that might give rise to spurious ASE calls. The p-values returned by MBASED are not adjusted for multiple hypothesis testing, and it is left to the user to choose the appropriate adjustment procedure and significance cutoff. Further, to ensure the sufficient power for ASE detection, we strongly recommend that only SNVs with sufficient read coverage be retained for the analysis. What constitutes `sufficient coverage' will depend on the multiple hypothesis testing adjustment procedure and cutoffs chosen by the user to employ on the p-values reported by MBASED. In our work, we required the depth of coverage of at least 10 reads, and higher cutoffs may be employed with more deeply sequenced samples. Finally, while MBASED is designed to detect instances of statistically significant ASE, it is up to the user to decide what constitutes \emph{biologically significant} ASE. Apropriate cutoffs on the reported measures of ASE (\texttt{majorAlleleFrequency} in case of 1-sample analysis, \texttt{majorAlleleFrequencyDifference} in case of 2-sample analysis) should then be employed to define the set of ASE genes for follow-up. MBASED allows the user to provide information about pre-existing allelic bias and extra-binomial dispersion of read counts at individual loci (see examples later in the vignette), but currently does not provide explicit functionality for estimating these quantities. See the accompanying paper for further discussion of these and other issues. \section{Examples} ASE assessment is performed by function runMBASED(). Please check the help page for a full list of arguments. Here we show some examples of its use. We start by discussing simple examples of both 1-sample and 2-sample analyses, and then show some examples of advanced usage, including adjusting for pre-existing allelic bias and extra-binomial dispersion in the data. \subsection{Input/Output format} The input and output of runMBASED() function are RangedSummarizedExperiment objects, defined in R package 'SummarizedExperiment'. Please see the RangedSummarizedExperiment help page for details. Briefly, the input object has rows corresponding to individual SNVs and columns corresponding to samples. In 1-sample analysis, there is a single column. In 2-sample analysis, there are 2 columns, one for each sample, with first column corresponding to `sample1' and second column corresponding to `sample2' in the language of section \ref{sec:2sample}. The output is also a RangedSummarizedExperiment object, with rows corresponding to individual aseIDs and a single column. MBASED output at the level of individual SNV is stored as metadata on the output object. The examples below will illustrate the usage. \subsection{1-sample analysis} Suppose we are looking at 2 genes, one of which has 1 SNV and the other has 3 SNVs: \begin{table}[h] \centering \begin{tabular}{|c|c|c|c|c|c|} \hline locus & location & refAllele & altAllele & refCount & altCount \\ \hline gene1:SNV1 & chr1:100 & G & T & 25 & 20\\ \hline gene2:SNV1 & chr2:1000 & A & C & 10 & 16\\ \hline gene2:SNV2 & chr2:1100 & C & T & 22 & 15\\ \hline gene2:SNV3 & chr2:1200 & A & G & 14 & 16\\ \hline \end{tabular} \caption{1-Sample example} \label{tab:ex1s} \end{table} If we know the true underlying haplotypes of the gene, we can run MBASED as follows (assuming all reference alleles are on the same haplotype): <>= set.seed(988482) library(MBASED) ##a quick look at the main function args(runMBASED) ## create GRanges object for SNVs of interest. ## note: the only required column is 'aseID' mySNVs <- GRanges( seqnames=c('chr1', 'chr2', 'chr2', 'chr2'), ranges=IRanges(start=c(100, 1000, 1100, 1200), width=1), aseID=c('gene1', rep('gene2', 3)), allele1=c('G', 'A', 'C', 'A'), allele2=c('T', 'C', 'T', 'G') ) names(mySNVs) <- c('gene1_SNV1', 'gene2_SNV1', 'gene2_SNV2', 'gene2_SNV3') ## create input RangedSummarizedExperiment object mySample <- SummarizedExperiment( assays=list( lociAllele1Counts=matrix( c(25, 10, 22, 14), ncol=1, dimnames=list( names(mySNVs), 'mySample' ) ), lociAllele2Counts=matrix( c(20,16,15,16), ncol=1, dimnames=list( names(mySNVs), 'mySample' ) ) ), rowRanges=mySNVs ) ##example of use ASEresults_1s_haplotypesKnown <- runMBASED( ASESummarizedExperiment=mySample, isPhased=TRUE, numSim=10^6, BPPARAM = SerialParam() ) ## explore the return object class(ASEresults_1s_haplotypesKnown) names(assays(ASEresults_1s_haplotypesKnown)) assays(ASEresults_1s_haplotypesKnown)$majorAlleleFrequency assays(ASEresults_1s_haplotypesKnown)$pValueASE assays(ASEresults_1s_haplotypesKnown)$pValueHeterogeneity rowRanges(ASEresults_1s_haplotypesKnown) names(metadata(ASEresults_1s_haplotypesKnown)) class(metadata(ASEresults_1s_haplotypesKnown)$locusSpecificResults) names(assays(metadata(ASEresults_1s_haplotypesKnown)$locusSpecificResults)) assays(metadata(ASEresults_1s_haplotypesKnown)$locusSpecificResults)$allele1IsMajor assays(metadata(ASEresults_1s_haplotypesKnown)$locusSpecificResults)$MAF rowRanges(metadata(ASEresults_1s_haplotypesKnown)$locusSpecificResults) ## define function to print out the summary of ASE results summarizeASEResults_1s <- function(MBASEDOutput) { geneOutputDF <- data.frame( majorAlleleFrequency=assays(MBASEDOutput)$majorAlleleFrequency[,1], pValueASE=assays(MBASEDOutput)$pValueASE[,1], pValueHeterogeneity=assays(MBASEDOutput)$pValueHeterogeneity[,1] ) lociOutputGR <- rowRanges(metadata(MBASEDOutput)$locusSpecificResults) lociOutputGR$allele1IsMajor <- assays(metadata(MBASEDOutput)$locusSpecificResults)$allele1IsMajor[,1] lociOutputGR$MAF <- assays(metadata(MBASEDOutput)$locusSpecificResults)$MAF[,1] lociOutputList <- split(lociOutputGR, factor(lociOutputGR$aseID, levels=unique(lociOutputGR$aseID))) return( list( geneOutput=geneOutputDF, locusOutput=lociOutputList ) ) } summarizeASEResults_1s(ASEresults_1s_haplotypesKnown) @ The output of \texttt{runMBASED()} is a RangedSummarizedExperiment object with 3 single-column assay matrices. The rows of the matrices correspond to individual aseIDs, in our case `gene1' and `gene2'. The matrices are majorAlleleFrequency (MBASED estimate of gene-level MAF), pValueASE (the coresponding p-value), and pValueHeterogeneity (the inter-SNV variability p-values). Note that heterogeneity p-value is \texttt{NA} for gene1, since this gene only has 1 tested SNV. The metadata slot of the return object contains a single element, `locusSpecificResults', which is a RangedSummarizedExperiment object with information on MBASED results at the level of individual SNV. The assay matrices specify whether at a given locus allele1 belongs to the major (gene-level) haplotype (`allele1IsMajor'), and what the allele frequency is at that SNV of the allele belonging to the major (gene-level) haplotype (`MAF'). Since assignment of 'major' status to a haplotypes is based on gene-wide information, there may be cases when the locus-specific MAF (frequency of `major' gene haplotype at the locus) is $< 0.5$, as is the case for the second locus of gene2 in the previous example. When true haplotypes are unknown, MBASED assigns alleles to `major' and `minor' haplotypes based on observed read counts, assigning the allele with higher than expected (under conditions of no ASE) read counts to the same `major' haplotype. The following example illustrates this approach using the same data as before, by setting argument `isPhased' to FALSE: <>= summarizeASEResults_1s( runMBASED( ASESummarizedExperiment=mySample, isPhased=FALSE, numSim=10^6, BPPARAM = SerialParam() ) ) @ Note that the pseudo-phasing based`major' haplotype of gene2 consists of the reference allele at SNV2 and alternative alleles at SNV1 and SNV3. Also note that by design, the gene-level estimate of MAF for the multi-locus gene2 is higher in the case of unphased data (0.58 vs. 0.505) and the SNV-level estimates are all $\geq 0.5$. We rely on properly calibrated p-values to prevent these higher estimates from resulting in spurious ASE calls due to pseudo-phasing. This is accomplished by employing simulations (argument \texttt{numSim}) to obtain an approximation to null distribution of MAF estimates. We used $10^6$ simulations in the previous example and we recommend at least this many simulations in practice. If parallel architecture is available (see argument \texttt{BPPARAM} and documentation for function \texttt{bplapply} in \texttt{R} package \texttt{BiocParallel}), it can be employed when many genes are tested simulatneously (parallelization is done over \emph{genes (aseIDs)}, and not \emph{simulations}). To illustrate the need for simulations, we show how the results are affected by bias introduced during the pseudo-phasing step: <>= ##re-run analysis without simulations summarizeASEResults_1s( runMBASED( ASESummarizedExperiment=mySample, isPhased=FALSE, numSim=0, BPPARAM = SerialParam() ) )$geneOutput @ Notice the 3-fold decrease in pValueASE for gene2 when we skipped the simulations. See the accompanying paper for further discussion of the internal simulations used by MBASED. The use of parallelization is illustrated in the example below (not evaluated): <>= ##re-run analysis while parallelizing computations ## results are same as before ## with some random flactuations due to simulations summarizeASEResults_1s( runMBASED( ASESummarizedExperiment=mySample, isPhased=FALSE, numSim=10^6, BPPARAM = MulticoreParam() ) )$geneOutput ## Number of seconds it takes to run results without parallelizing: system.time(runMBASED( ASESummarizedExperiment=mySample, isPhased=FALSE, numSim=10^6, BPPARAM = SerialParam() ))['elapsed'] ## ~ 15 sec on our machine ## Number of seconds it takes to run results with parallelizing: system.time(runMBASED( ASESummarizedExperiment=mySample, isPhased=FALSE, numSim=10^6, BPPARAM = MulticoreParam() ))['elapsed'] ## ~9 sec on our machine @ We also note that for a single-locus gene1, the phasing plays no role, and the simulation-based p-values are very close to each other (there's some variability due to random nature of simulations) and are approximately equal to the analytical p-value based on the binomial test: <>= binom.test(25, 45, 0.5, 'two.sided')$p.value @ Finally, let's consider a case of isoform-specific ASE. Suppose that the 3 SNVs in gene2 correspond to 3 different exons, and that 2 transcript isoforms of gene 2 exist: isoform1 uses all 3 exons, and isoform2 uses only the last 2 exons. Further, let us assume that \begin{itemize} \item both isoforms are present in 50/50 ratio \item isoform1 shows no ASE \item isoform2 shows complete silencing of one of the alleles (monoallelic expression, an extreme form of ASE). \end{itemize} Here is an example of data one could observe under this toy scenario: \begin{table}[h] \centering \begin{tabular}{|c|c|c|c|} \hline locus & location & refCount & altCount \\ \hline gene2:SNV1 & chr2:1000 & 23& 26 \\ \hline gene2:SNV2 & chr2:1100 & 65 & 25 \\ \hline gene2:SNV3 & chr2:1200 & 30 & 70 \\ \hline \end{tabular} \caption{Example of isoform-specific ASE} \label{tab:ex1sIsoformSpecific} \end{table} Notice that SNV1 shows lower coverage than the other 2 SNVs, since it is only expressed by isoform1. Also note that alternative allele at SNV2 and reference allele at SNV3 are silenced in isoform2. Overall, each isoform is represented by approximately 50 reads at each SNV that is a part of that isoform. <>= isoSpecificExampleSNVs <- mySNVs[2:4,] ## create input RangedSummarizedExperiment object isoSpecificExample <- SummarizedExperiment( assays=list( lociAllele1Counts=matrix( c(23, 65, 30), ncol=1, dimnames=list( names(isoSpecificExampleSNVs), 'mySample' ) ), lociAllele2Counts=matrix( c(26,25,70), ncol=1, dimnames=list( names(isoSpecificExampleSNVs), 'mySample' ) ) ), rowRanges=isoSpecificExampleSNVs ) summarizeASEResults_1s( runMBASED( ASESummarizedExperiment=isoSpecificExample, isPhased=FALSE, numSim=10^6, BPPARAM = MulticoreParam() ) ) @ We see that MBASED reports major allele frequency of 0.675 for this gene. Since the true MAF is 0.5 for isoform1 and 1.0 for isoform2, the reproted value represents an average of sorts. The small value of heterogeneity p-value (0.003) indicates that there is strong evidence that ASE estimates are inconsistent across individual SNVs and that the reported MAF value might be misleading (it is an aggregate of isoform-specific values). Also note that the reported ASE p-value is 0. In practice, given the restricitions imposed by the number of simulations, this means that we estimate $p_{ASE} \leq \frac{1}{N_{sim}}$. In this instance, we should interpret the outcome as p-value $< 10^{-6}$. \subsection{2-sample analysis} Next, we show an example of 2-sample ASE analysis using MBASED. Suppose we want to identify instances of tumor-specifc ASE and observe the following data: \begin{table}[h] \centering \begin{tabular}{|c|c|c|c|c|c|} \hline locus & location & refCountTumor & altCountTumor & refCountNormal & altCountNormal \\ \hline gene1:SNV1 & chr1:100 & 25 & 20 & 18 & 23\\ \hline gene2:SNV1 & chr2:1000 & 10 & 29 & 17 & 19\\ \hline gene2:SNV2 & chr2:1100 & 35 & 15 & 21 & 24\\ \hline gene2:SNV3 & chr2:1200 & 14 & 40 & 25 & 31\\ \hline gene3:SNV1 & chr3:2000 & 35 & 9 & 40 & 10 \\ \hline \end{tabular} \caption{2-Sample example} \label{tab:ex2s} \end{table} Visual examination reveals that gene1 doesn't exhibit ASE in either sample, gene2 shows signs of ASE in tumor sample only, and gene3 shows signs of ASE in both samples. <>= mySNVs_2s <- GRanges( seqnames=c('chr1', 'chr2', 'chr2', 'chr2', 'chr3'), ranges=IRanges(start=c(100, 1000, 1100, 1200, 2000), width=1), aseID=c('gene1', rep('gene2', 3), 'gene3'), allele1=c('G', 'A', 'C', 'A', 'T'), allele2=c('T', 'C', 'T', 'G', 'G') ) names(mySNVs_2s) <- c('gene1_SNV1', 'gene2_SNV1', 'gene2_SNV2', 'gene2_SNV3', 'gene3_SNV1') ## create input RangedSummarizedExperiment object myTumorNormalExample <- SummarizedExperiment( assays=list( lociAllele1Counts=matrix( c( c(25,10,35,14,35), c(18,17,21,25,40) ), ncol=2, dimnames=list( names(mySNVs_2s), c('tumor', 'normal') ) ), lociAllele2Counts=matrix( c( c(20,29,15,40,9), c(23,19,24,31,10) ), ncol=2, dimnames=list( names(mySNVs_2s), c('tumor', 'normal') ) ) ), rowRanges=mySNVs_2s ) ##example of use ASEresults_2s <- runMBASED( ASESummarizedExperiment=myTumorNormalExample, isPhased=FALSE, numSim=10^6, BPPARAM = SerialParam() ) ## explore the return object class(ASEresults_2s) names(assays(ASEresults_2s)) assays(ASEresults_2s)$majorAlleleFrequencyDifference assays(ASEresults_2s)$pValueASE assays(ASEresults_2s)$pValueHeterogeneity rowRanges(ASEresults_2s) names(metadata(ASEresults_2s)) class(metadata(ASEresults_2s)$locusSpecificResults) names(assays(metadata(ASEresults_2s)$locusSpecificResults)) assays(metadata(ASEresults_2s)$locusSpecificResults)$allele1IsMajor assays(metadata(ASEresults_2s)$locusSpecificResults)$MAFDifference rowRanges(metadata(ASEresults_2s)$locusSpecificResults) ## define function to print out the summary of ASE results summarizeASEResults_2s <- function(MBASEDOutput) { geneOutputDF <- data.frame( majorAlleleFrequencyDifference=assays(MBASEDOutput)$majorAlleleFrequencyDifference[,1], pValueASE=assays(MBASEDOutput)$pValueASE[,1], pValueHeterogeneity=assays(MBASEDOutput)$pValueHeterogeneity[,1] ) lociOutputGR <- rowRanges(metadata(MBASEDOutput)$locusSpecificResults) lociOutputGR$allele1IsMajor <- assays(metadata(MBASEDOutput)$locusSpecificResults)$allele1IsMajor[,1] lociOutputGR$MAFDifference <- assays(metadata(MBASEDOutput)$locusSpecificResults)$MAFDifference[,1] lociOutputList <- split(lociOutputGR, factor(lociOutputGR$aseID, levels=unique(lociOutputGR$aseID))) return( list( geneOutput=geneOutputDF, locusOutput=lociOutputList ) ) } summarizeASEResults_2s(ASEresults_2s) @ Analogous to 1-sample analysis, the output of \texttt{runMBASED()} for 2-sample analysis is a RangedSummarizedExperiment object with 3 single-column assay matrices. The matrices are are similar to those in 1-sample case, except our measure of ASE is the difference in major allele frequency between the two samples. For example, for a single-SNV gene1, this difference is $\frac{25}{25+20}-\frac{18}{18+23}=0.1165$. Notice that which allele is `major' and which is `minor' is determined entirely by read counts in the first sample (tumor, in this case). We would like to emphasize that the 2-sample MBASED algorithm is designed to find instances of sample1-specific ASE (whatever sample1 may be) and is \emph{not} symmetric with respect to the 2 samples. Therefore, if we were interested in cases of normal-specific ASE (e.g., loss of imprinting in cancer), we would re-run MBASED using normal sample as sample1 and tumor sample as sample2. The following examples illustrate this important point. <>= ## single-SNV gene with strong ASE in tumor (25 vs. 5 reads) ## and no ASE in normal (22 vs. 23 reads): mySNV <- GRanges( seqnames=c('chr1'), ranges=IRanges(start=c(100), width=1), aseID=c('gene1'), allele1=c('G'), allele2=c('T') ) names(mySNV) <- c('gene1_SNV1') summarizeASEResults_2s( runMBASED( ASESummarizedExperiment=SummarizedExperiment( assays=list( lociAllele1Counts=matrix( c( c(25), c(22) ), ncol=2, dimnames=list( names(mySNV), c('tumor', 'normal') ) ), lociAllele2Counts=matrix( c( c(5), c(23) ), ncol=2, dimnames=list( names(mySNV), c('tumor', 'normal') ) ) ), rowRanges=mySNV ), isPhased=FALSE, numSim=10^6, BPPARAM=SerialParam() ) ) ## MBASED detects highly significant ASE. ## ## Now, suppose that the normal sample had the two allele counts ## exchanged (due to chance variation), : summarizeASEResults_2s( runMBASED( ASESummarizedExperiment=SummarizedExperiment( assays=list( lociAllele1Counts=matrix( c( c(25), c(23) ## used to be 22 ), ncol=2, dimnames=list( names(mySNV), c('tumor', 'normal') ) ), lociAllele2Counts=matrix( c( c(5), c(22) ## used to be 23 ), ncol=2, dimnames=list( names(mySNV), c('tumor', 'normal') ) ) ), rowRanges=mySNV ), isPhased=FALSE, numSim=10^6, BPPARAM=SerialParam() ) ) ## We get virtually the same results ## ## Now, suppose we treated normal as sample1 and tumor as sample2 ## Let's use original normal sample allele counts summarizeASEResults_2s( runMBASED( ASESummarizedExperiment=SummarizedExperiment( assays=list( lociAllele1Counts=matrix( c( c(22), c(25) ), ncol=2, dimnames=list( names(mySNV), c('normal', 'tumor') ) ), lociAllele2Counts=matrix( c( c(23), c(5) ), ncol=2, dimnames=list( names(mySNV), c('normal', 'tumor') ) ) ), rowRanges=mySNV ), isPhased=FALSE, numSim=10^6, BPPARAM=SerialParam() ) ) ## We appear to have recovered the same result: highly significant MAF difference ## (but note that allele2 is now 'major', since allele classification into ## major/minor is based entirely on sample1) ## BUT: consider what happens if by chance the allelic ## ratio in the normal was 23/22 instead of 22/23: summarizeASEResults_2s( runMBASED( ASESummarizedExperiment=SummarizedExperiment( assays=list( lociAllele1Counts=matrix( c( c(23), ## used to be 22 c(25) ), ncol=2, dimnames=list( names(mySNV), c('normal', 'tumor') ) ), lociAllele2Counts=matrix( c( c(22), ## used to be 23 c(5) ), ncol=2, dimnames=list( names(mySNV), c('normal', 'tumor') ) ) ), rowRanges=mySNV ), isPhased=FALSE, numSim=10^6, BPPARAM=SerialParam() ) ) ## MAF difference is large in size but negative and the finding is no longer ## significant (MBASED uses a 1-tailed significance test). ## ## We therefore strongly emphasize that the proper way to detect normal-specific ASE ## would be to treat normal sample as sample1 and tumor sample as sample2. @ \subsection{Adjusting for pre-exiting allelic bias} In a given sample, MBASED models the detected haplotype 1 allele counts $X_{hap1, SNV}$ at each SNV within gene $g$ as independent random variables with \begin{equation} \label{eq:betaBinomModel} X_{hap1, SNV} \sim BetaBin(n=n_{SNV},\mu=f_{hap1,SNV}(p_{hap1,g}),\rho=\rho_{SNV}) \end{equation} where \begin{equation} E(X_{hap1,SNV})=n\mu \end{equation} and \begin{equation} var(X_{hap1,SNV})={\mu}(1-\mu){\rho}n(n-1+frac{1}{\rho}) \end{equation} If dispersion parameter $\rho_{SNV}=0$ (this is the default model assumed by MBASED), the model reduces to the binomial model \begin{equation} \label{eq:binomModel} X_{hap1, SNV} \sim Bin(n=n_{SNV}, p=f_{hap1,SNV}(p_{hap1,g})) \end{equation} We will now assume the no-overdispersion binomial model (\ref{eq:binomModel}) and will describe extensions to non-zero dispersions in the next section. By default, MBASED assumes that \begin{equation} f_{hap1,SNV}(p_{hap1,g})=p_{hap1,g} \end{equation} In other words, the underlying frequency of observed haplotype 1-supporting reads is equal to the frequency of haplotype 1 transcripts in the cell. In practice, this may not be the case. For example, the sample might undergo some enrichment procedure that results in global over-representation of reference allele-supporting reads. Or the aligner used to map reads to reference genome might favor one or the other allele at some SNVs due to the presence of a homologous region elsewhere in the genome. In such cases, \begin{equation} f_{hap1,SNV}(p_{hap1,g}) \neq p_{hap1,g} \end{equation} We refer to the situation where one allele is favored over another in the read data, even in absence of ASE, as `pre-existing allelic bias'. MBASED assumes that $f_{hap1,SNV}(p_{hap1,g})$ satisfies the following functional constraint: \begin{equation} f_{hap1,SNV}(p_{hap1,g})=\frac{f_{hap1,SNV}(0.5) \times p_{hap1,g}}{f_{hap1,SNV}(0.5) \times p_{hap1,g} + f_{hap2,SNV}(0.5) \times p_{hap2,g}} \end{equation} In words, $f_{hap1,SNV}(p_{hap1,g})$ is proportional to both $p_{hap1,g}$ and $f_{hap1,SNV}(0.5)$ (the underlying frequency of observed haplotype 1-supporting reads under conditions of no ASE) and is uniquely determined by these two frequencies. MBASED adjusts its estimates of ASE and corresponding statistical significance, when supplied with values of probabilities of observing allele 1-supporting reads under conditions of no ASE, $f_{hap1,SNV}(0.5)$ in our notation. As an example, consider a hypothetical case of known reference allele bias, where under conditions of no ASE ($p_{hap1,g}=p_{hap2,g}=0.5$) the frequency of observed reference allele-supporting reads is 60\% ($f_{ref,SNV}(0.5)=0.6$ and $f_{alt,SNV}(0.5)=0.4$) for all SNVs. Suppose we observe the following read counts: \begin{table}[h] \centering \begin{tabular}{|c|c|c|c|c|} \hline locus & location & refCount & altCount & fractionRef \\ \hline gene1:SNV1 & chr1:100 & 20 & 25 & 0.49 \\ \hline gene2:SNV1 & chr2:1000 & 17 & 9 & 0.65 \\ \hline gene2:SNV2 & chr2:1100 & 22 & 15 & 0.59 \\ \hline gene2:SNV3 & chr2:1200 & 20 & 10 & 0.67 \\ \hline \end{tabular} \caption{Example of reference bias in the data} \label{tab:ex1sRefBias} \end{table} Notice that gene2 is exhibiting 60/40 ratio of reference to alternative allele read counts, which is what we expect of genes with no ASE under the reference allele bias described above. In contrast, gene1 shows overexpression of alternative allele, which may actually be seen as evidence \emph{for} ASE. MBASED allows one to specify the extent of pre-existing allelic bias by providing a corresponding matrix in the assays of the input RangedSummarizedExperiment object. By default, MBASED assumes that the background (no ASE) allelic ratio is 50/50: <>= mySNVs ## run MBASED with default values: summarizeASEResults_1s( runMBASED( SummarizedExperiment( assays=list( lociAllele1Counts=matrix( c(20,17,15,20), ncol=1, dimnames=list( names(mySNVs), 'mySample' ) ), lociAllele2Counts=matrix( c(25,9,22,10), ncol=1, dimnames=list( names(mySNVs), 'mySample' ) ) ), rowRanges=mySNVs ), isPhased=FALSE, numSim=10^6, BPPARAM=SerialParam() ) ) ## we get the same results by explicitly specifying 50/50 pre-existing allelic probability summarizeASEResults_1s( runMBASED( SummarizedExperiment( assays=list( lociAllele1Counts=matrix( c(20,17,15,20), ncol=1, dimnames=list( names(mySNVs), 'mySample' ) ), lociAllele2Counts=matrix( c(25,9,22,10), ncol=1, dimnames=list( names(mySNVs), 'mySample' ) ), lociAllele1CountsNoASEProbs=matrix( rep(0.5, 4), ncol=1, dimnames=list( names(mySNVs), 'mySample' ) ) ), rowRanges=mySNVs ), isPhased=FALSE, numSim=10^6, BPPARAM=SerialParam() ) ) @ Note that we labeled 3 out of 4 reference alleles as allele1 and the corresponding alternative alleles as allele2 (the sole exception is SNV2 of gene2, where alternative allele is allele1). As can be expected, not accounting for pre-existing alleic bias results in gene2 showing strong evidence of ASE (p-value=0.04, all reference alleles are assigned to major haplotype). We can fix the problem by changing the value of \texttt{lociAllele1CountsNoASEProbs} assay matrix to the correct values of probabilites of observing allele1 under the conditions of no ASE: <>= ##run the analysis adjusting for pre-existing allelic bias summarizeASEResults_1s( runMBASED( SummarizedExperiment( assays=list( lociAllele1Counts=matrix( c(20,17,15,20), ncol=1, dimnames=list( names(mySNVs), 'mySample' ) ), lociAllele2Counts=matrix( c(25,9,22,10), ncol=1, dimnames=list( names(mySNVs), 'mySample' ) ), lociAllele1CountsNoASEProbs=matrix( c(0.6, 0.6, 0.4, 0.6), ncol=1, dimnames=list( names(mySNVs), 'mySample' ) ) ), rowRanges=mySNVs ), isPhased=FALSE, numSim=10^6, BPPARAM=SerialParam() ) ) @ Notice that the estimate of MAF for gene2 went down from 0.63 to 0.54, and the p-value went up from 0.04 to 0.86. In contrast, MAF for gene1 went up from 0.56 to 0.66 and the p-value dropped from 0.55 to 0.05 (as we are much less likely to observe overexpression of alternative allele if the bias in the data favors the reference allele). When 2-sample analysis is performed, the second column of \texttt{lociAllele1CountsNoASEProbs} specifies no-ASE allelic probabilities for the same alleles in sample 2, (and thus MBASED allows for possibility that pre-existing allelic biases are different in the two samples). In our own work, we assumed that the no-ASE probability of observing reference allele-supporting read was the same across all heterozygous loci within each sample (global reference bias) and estimated it for our data to be approximately 0.52 (see the accompanying paper for the details of the estimation procedure). A better approach would be to estimate this probability separately at each SNV based on a large number of samples known to not exhibit ASE at that locus (and keeping the protocol/sequencing/alignment pipeline constant). Currently MBASED provdies no functionality for estimating the extent of pre-existing allelic bias and the user is advised to investigate their data prior to changing the default values of the relevent function arguments. \subsection{Adjusting for overdispersion} By default, MBASED assumes that dispersion parameter $\rho$ in model \ref{eq:betaBinomModel} is 0, leading to binomial model \ref{eq:binomModel}. In practice, extra-binomial dispersion may be present in the sequencing read count data and needs to be taken into account in order to assign proper statistical significance to observed levels of ASE. Below we generate some counts from beta-binomial distribution and show how model misspecification (assuming data to be binomial) affects ASE calling. We use our earlier total read counts for single-SNV gene1 and 3-SNV gene2. We also use the same value of overdispersion parameter $\rho=0.02$ at each SNV. <>= set.seed(5) ## we chose this seed to get a 'significant' result totalAlleleCounts=c(45, 26, 37, 30) ## simulate allele1 counts allele1Counts <- MBASED:::vectorizedRbetabinomMR( n=4, size=totalAlleleCounts, mu=rep(0.5, 4), rho=rep(0.02, 4) ) ## run MBASED without accounting for overdispersion summarizeASEResults_1s( runMBASED( SummarizedExperiment( assays=list( lociAllele1Counts=matrix( allele1Counts, ncol=1, dimnames=list( names(mySNVs), 'mySample' ) ), lociAllele2Counts=matrix( totalAlleleCounts-allele1Counts, ncol=1, dimnames=list( names(mySNVs), 'mySample' ) ) ), rowRanges=mySNVs ), isPhased=FALSE, numSim=10^6, BPPARAM=SerialParam() ) ) ## this is the same as explicitly setting dispersion parameters to 0. summarizeASEResults_1s( runMBASED( SummarizedExperiment( assays=list( lociAllele1Counts=matrix( allele1Counts, ncol=1, dimnames=list( names(mySNVs), 'mySample' ) ), lociAllele2Counts=matrix( totalAlleleCounts-allele1Counts, ncol=1, dimnames=list( names(mySNVs), 'mySample' ) ), lociCountsDispersions=matrix( rep(0, 4), ncol=1, dimnames=list( names(mySNVs), 'mySample' ) ) ), rowRanges=mySNVs ), isPhased=FALSE, numSim=10^6, BPPARAM=SerialParam() ) ) ## now re-run MBASED supplying the correct dispersion values: summarizeASEResults_1s( runMBASED( SummarizedExperiment( assays=list( lociAllele1Counts=matrix( allele1Counts, ncol=1, dimnames=list( names(mySNVs), 'mySample' ) ), lociAllele2Counts=matrix( totalAlleleCounts-allele1Counts, ncol=1, dimnames=list( names(mySNVs), 'mySample' ) ), lociCountsDispersions=matrix( rep(0.02, 4), ncol=1, dimnames=list( names(mySNVs), 'mySample' ) ) ), rowRanges=mySNVs ), isPhased=FALSE, numSim=10^6, BPPARAM=SerialParam() ) ) @ We observe that after taking overdispersion into account, gene1 had its ASE p-value increase from 0.55 to 0.67, and gene2 had the pvalue increase from 0.05 to 0.19. In practice, not taking overdispersion into account leads to a number of false positive ASE calls, with the number of false positives rising as the extent of overdispersion increases. Therefore it is important to have a good grasp of how much dispersion is present in the data. Also note a slight change in the estimate of major allele frequency for gene2, which results from incorporation of extra-variability into the estimate during the meta-analytic integration of ASE information from multiple SNVs within that gene. When 2-sample analysis is performed, the second column of \texttt{lociCountsDispersions} specifies dispersion parameter values for the same SNVs in sample 2, (and thus MBASED allows for possibility that the extent of overdispersion is different in the two samples). In our own work, we assumed that the dispersion was the same across all loci within each sample and estimated it for our data to be approximately 0.004 (see the accompanying paper for the details of the estimation procedure). A better approach would be to estimate the dispersion separately at each SNV based on a large number of samples known to not exhibit ASE at that locus (and keeping the protocol/sequencing/alignment pipeline constant). Currently MBASED provdies no functionality for estimating the extent the dispersion and the user is advised to investigate their data prior to changing the default values of the relevent function arguments. \section{Session Info} <>= sessionInfo() @ \end{document}