%\VignetteIndexEntry{Package maskBAD} %\VignettePackage{maskBAD} \documentclass{article} \usepackage{Sweave} \usepackage[a4paper]{geometry} \usepackage{hyperref,graphicx} \SweaveOpts{keep.source=TRUE,eps=FALSE,include=FALSE,width=4,height=4.5} \newcommand{\Robject}[1]{\texttt{#1}} \newcommand{\Rpackage}[1]{\textit{#1}} \newcommand{\Rclass}[1]{\textit{#1}} \newcommand{\Rfunction}[1]{{\small\texttt{#1}}} \title{\textsf{\textbf{Package ``maskBAD''}}} \author{Michael Dannemann, Michael Lachmann, Anna Lorenc} \SweaveOpts{echo=FALSE} \usepackage{a4wide} \begin{document} \maketitle \begin{abstract} Package ``maskBAD'' allows to users to identify, inspect and remove probes for which binding affinity differs between two groups of samples. It works for Affymetrix 3' IVT and whole transcript (exon and gene) arrays. When there are SNPs, transcript isoforms or cross hybrydizing transcripts specific to only one of compared groups, probes' binding affinities may differ between groups. Such probes bias estimates of gene expression, introduce false positives in differential gene expression analysis and reduce the power to detect real expression differences. Hence they should be identified and removed from the data at preprocessing stage. For details about impact of BAD (binding affinity difference) probes on differential gene expression estimates, implemented detection method and its power to improve data quality, please consult our papers Dannemann, Lorenc et al.: ``The effects of probe binding affinity differences on gene expression measurements and how to deal with them'', Bioinformatics, 2009 (\cite{Dannemann09}) and Dannemann et al.: ``'maskBAD' - a package to detect and remove Affymetrix probes with binding affinity differences'', BMC Bioinformatics 2012 (\cite{Dannemann12}). \end{abstract} \section{Introduction} The ``maskBAD'' package implements functions to detect and remove probes with different binding affinity (BAD) in Affymetrix array expression data. Identification and removal of BAD probes removes spurious gene expression differences and helps recover true ones. BAD probes are prevalent in comparisons of genetically distinct samples, like belonging to different strains or species, but systematic qualitative differences in transcriptome might introduce them also when samples differ by treatment, health status or tissue type. Masking therefore should be a routine step in data preprocessing. In this package we introduce functions that allow to identify, inspect and remove BAD probes and show how it can be integrated in a standard gene expression analysis pipeline. \section{Identification of BAD probes} Detection of BAD (binding affinity difference) probes is done on an \Robject{AffyBatch} object, usually prepared with \Rfunction{ReadAffy()} from package \Rpackage{affy}. We'll use 100 random probe sets from human and chimpanzee expression dataset (\cite{Khaitovich04}), each with 10 individuals, available with the package. <>= library("maskBAD") data(AffyBatch, package="maskBAD") data(exmask, package="maskBAD") ## load(system.file("data/AffyBatch.rda", package="maskBAD")) ## load(system.file("data/exmask.rda", package="maskBAD")) @ <>= ## data available by loading data(AffyBatch) newAffyBatch exmask <- mask(newAffyBatch,ind=rep(1:2,each=10),verbose=FALSE,useExpr=F) @ The scoring of probes is performed with the function \Rfunction{mask} and is based on the algorithm presented in Dannemann et. al (\cite{Dannemann09}). Briefly, for a given probe, its signal is proportional to the amount of RNA in the sample and its binding affinity. When one target is measured with several probes (a probe set), as on Affymetrix arrays, the probes' signals are correlated for each sample. BAD probes correlation differs between groups, hence comparison of probes' pairwise correlations between those groups allows to identify of BAD probes. Function \Rfunction{mask()} detects probes differing in binding affinity between groups of samples defined by argument \texttt{ind}. The masking algorithm workes on a gene/probe set basis as the expression signal for the same transcript should be correlated among probes. Therefore, to identify BAD probes use probe sets defined at transcript/gene level. \\ The method works properly only for probe sets with signal above background in both groups. By default (\texttt{use.expr=TRUE} option) only probe sets with \Rfunction{mas5calls()} ``P'' in at least 90\% of samples from both groups are considered in BAD detection. Another desired set of probe sets might be specified by the parameter \texttt{exprlist}. \subsection{Evaluation of the mask and choice of cutoff} For human-chimpanzee, with assumption of sequence divergence 1\%, we'll expect naively $\sim$22\% of probes comprise a SNP. To filter out lower 22\% of quality scores, we have to put a cutoff at 0.029. <>= quantile(exmask[[1]]$quality.score,0.22) @ Manual inspection of borderline cases with \Rfunction{plotProbe()} might help to decide about the cutoff. It plots signal intensity of a probe against all other probes from the same probe set. <>= ## add rownames containing x and y coordinates of each probe rownames(exmask[[1]]) <-apply(exmask[[1]][,c("x","y")],1, function(x)paste(x[1],x[2],sep=".")) ## filtering for probes around the choosen cutoff probesToSee = rownames(exmask[[1]][exmask[[1]]$quality.score<0.03 & exmask[[1]]$quality.score >0.028,]) ## select a random probe and plot it against ## all other probes of its probe set plotProbe(affy=newAffyBatch,probeset=as.character(exmask[[1]][ probesToSee[1],"probeset"]), probeXY=probesToSee[1],scan=TRUE,ind=rep(1:2,each=10), exmask=exmask$probes,names=FALSE) @ \subsection{Including external data} Mask might be calibrated with a subset of probes with known BAD status (called later on external mask), for example probes with target sequences known in both compared groups. External mask contains \texttt{x} and \texttt{y} coordinates and probe set assignement for the probes, along with their BAD status coded as 0/1 (BAD probe/not BAD probe). For a given cutoff, compare BAD status according to expression-based mask and given by external mask. Here we use the object sequenceMask - the information whether the probe sequence is affected by a SNP between the groups (0) or not (1) . <>= head(exmask$probes) head(sequenceMask) rownames(sequenceMask) <- paste(sequenceMask$x,sequenceMask$y,sep="_") rownames(exmask$probes)<- paste(exmask$probes$x,exmask$probes$y,sep="_") cutoff=0.029 table((exmask$probes[rownames(sequenceMask),"quality.score"]>cutoff)+0, sequenceMask$match) @ Here, 952 probes are concordant between masks, whereas 79 probes with known SNP are not detected by expression-based mask.\\\\ With \Rfunction{overlapExprExtMasks()} expression mask may be compared with any other designation of BAD probes along a range of cutoffs. Type 1 (fraction of external mask not BAD probes, identified as BAD by expression mask) and type 2 (fraction of external mask BAD probes, not detected as BAD in expression mask) errors are plotted with their binomial confidence intervals. <>= head(exmask$probes[,1:3]) head(sequenceMask[,c(1,2,4)]) @ The function \Rfunction{overlapExprExtMasks()} applied on our example data, produces a plot like shown in Figure \ref{figErr}. <>= overlapExSeq <- overlapExprExtMasks(exmask$probes[,1:3], sequenceMask[,c(1,2,4)],verbose=FALSE, cutoffs=quantile(exmask$probes[,3],seq(0,1,0.05))) @ \begin{figure} \centering \includegraphics[width=.7\textwidth]{maskBAD-figErr} \caption{Overlap of expression-based mask and external mask (red line), with binomial confidence intervals (dashed lines). Several cutoff levels are marked on the plot.} \label{figErr} \end{figure} <>= overlapExSeq <- overlapExprExtMasks(exmask$probes[,1:3], sequenceMask[,c(1,2,4)],verbose=FALSE, cutoffs=quantile(exmask$probes[,3],seq(0,1,0.05))) @ The cutoff depends on the goal of masking. SFPs detection needs tighter control of type 2 than type 1 error - low cutoff values. Removal of differential expression bias needs higher cutoff. The distribution of quality scores for probes declared in external mask as BAD and not BAD may be compared with Wilcoxon rank sum test and Kolomogornov-Smirnow test for several cutoff levels. When those distributions do not differ (high p-values on \ref{figWil}), cutoff is too high. <>= overlapTests <- overlapExprExtMasks(exmask$probes[,1:3],verbose=FALSE, sequenceMask[,c(1,2,4)], wilcox.ks=T,sample=100) @ <>= plot(overlapTests$testCutoff[[1]],overlapTests$ksBoot,col="lightgrey", main="Kolmogorov-Smirnov Test",xlab="Quality score cutoff", ylab="p value (Kolmogorov-Smirnov Test)",ylim=c(0,1),pch=16,xaxt="n",cex=1.5) axis(1,at=seq(1,length(unique(overlapTests$testCutoff[[2]])),2), labels=signif(unique(overlapTests$testCutoff[[2]]),2)[seq(1,length(unique(overlapTests$testCutoff[[2]])),2)], las=3) lines(which(unique(overlapTests$testCutoff[[2]]) %in% overlapTests$testCutoff[[2]]), overlapTests$ksP[!is.na(overlapTests$ksP)],type="p",pch=16,col="red") @ <>= plot(overlapTests$testCutoff[[1]],overlapTests$wilcoxonBoot,col="lightgrey", main="Wilcoxon Rank Sum Test",xlab="Quality score cutoff", ylab="p value (Wilcoxon Rank Sum Test)",ylim=c(0,1),pch=16,xaxt="n",cex=1.5) axis(1,at=seq(1,length(unique(overlapTests$testCutoff[[2]])),2), labels=signif(unique(overlapTests$testCutoff[[2]]),2)[seq(1,length(unique(overlapTests$testCutoff[[2]])),2)], las=3) lines(which(unique(overlapTests$testCutoff[[2]]) %in% overlapTests$testCutoff[[2]]), overlapTests$wilcoxonP[!is.na(overlapTests$wilcoxonP)],type="p",pch=16,col="red") @ \begin{figure} \centering \includegraphics[width=\textwidth]{maskBAD-figWil1} \includegraphics[width=\textwidth]{maskBAD-figWil2} \caption{Kolmogorow-Smirnov-Test (upper panel) and Wilcoxon-Rank-Sum-Test (lower panel) comparing distributions of quality scores of BAD and not BAD probes. For low cutoffs, two distributions are different (red - actual tests; gray - tests on bootstraped data, to estimate variance).} \label{figWil} \end{figure} <>= hist(exmask[[1]]$quality.score,breaks=100,xlab="mask quality score",main="") @ \begin{figure} \centering \includegraphics[width=.7\textwidth]{maskBAD-figHist} \caption{Probes' quality scores.} \label{figHist} \end{figure} When no information about sequence divergence between the groups is available, a cutoff might be suggested by the distribution of quality scores for the mask. On Figure 3, an excess of probes with low quality scores is obvious. Assuming a uniform distribution of quality scores for probes not different between groups, a cutoff should be set so that the remaining distribution is uniform - like. \section{Removal of BAD probes and estimation of expression values} \subsection{Removal of BAD probes} To remove probes from \texttt{affybatch}, use \Rfunction{prepareMaskedAffybatch()}. It produces an \texttt{affybatch} object (without the masked probes) and an according \texttt{CDF} environment. When one probe appears in several probe sets (as is a case for some custom annotations) and it is detected as BAD in one probe set only, it will be nevertheless removed from all probe sets it appears. With \texttt{cdftable} argument, \Rfunction{prepareMaskedAffybatch()} allows also to remove any list of probes. <>= affyBatchAfterMasking <- prepareMaskedAffybatch(affy=newAffyBatch,exmask=exmask$probes, cutoff=quantile(exmask[[1]]$quality.score,0.22)) newAffyBatch affyBatchAfterMasking @ Note changed number of genes (probe sets) in masked \texttt{affybatch}. \subsection{Expression estimation with \Rpackage{rma}} For the new \texttt{affybatch} expression values might be estimated as usual. New \texttt{affybatch} object has a masked \texttt{cdf} declared, instead of a standard one. <>= cdfName(affyBatchAfterMasking[[1]]) @ Therefore it is important to save new \texttt{cdf} for further use and make it available in the workspace (with a name matching \texttt{cdfName} slot of masked \texttt{affybatch}) whenever a function to estimate expression is called. <>= new_cdf=affyBatchAfterMasking[[2]] head(exprs(rma(affyBatchAfterMasking[[1]]))) @ \subsection{Expression estimation with \Rpackage{gcrma}} Normalization and expression estimates with \Rfunction{gcrma()} require probe affinities. Those should be computed with standard \texttt{cdf}. Then both \texttt{affybatch} object and probe affinities object should be filtered with the mask. <>= library(gcrma) library(hgu95av2probe) affy.affinities=compute.affinities("hgu95av2",verbose=FALSE) @ <>= affy.affinities.m=prepareMaskedAffybatch(affy=affy.affinities, exmask=exmask$probes,cdfName="MaskedAffinitesCdf") @ To calculate expression values, just use both masked objects (make sure that a \texttt{cdf} with name matching \texttt{cdf} slot of masked \texttt{affyBatch} is accessible!). <>= head(exprs(gcrma(affyBatchAfterMasking[[1]],affy.affinities.m))) @ \subsection{Exon and gene arrays} Exon and gene arrays require setting \texttt{useExpr=FALSE}, as those arrays do not contain MM probes and \Rfunction{mas5calls()} - based definition of expressed genes is not applicable. As computing time increases with number of probe sets, we recommend to use a subset of probe sets (specified with \texttt{exprlist}) for exon arrays. \bibliographystyle{plain} \bibliography{maskBAD} \end{document}