% Created 2014-02-13 Thu 11:48 \documentclass{article} \usepackage{hyperref} %\VignetteIndexEntry{Using geneRxCluster} \author{Charles C. Berry} \date{\today} \title{Using geneRxCluster} \hypersetup{ pdfkeywords={}, pdfsubject={}, pdfcreator={Emacs 24.3.1 (Org mode 8.2.5h)}} \begin{document} \maketitle \tableofcontents \section{Overview} \label{sec-1} The \verb~geneRxCluster~ package provides some functions for exploring genomic insertion sites originating from two different sources. Possibly, the two sources are two different gene therapy vectors. In what follows, some simulations are used to create datasets to illustrate functions in the package, but it is not necessary to follow the details of the simulations to get an understanding of the functions. More examples and details are given by Supplement 2 of Berry et al \cite{berry2014} available at the \href{http://dx.doi.org/10.1093/bioinformatics/btu035}{Bioinformatics web site}. \section{Basic Use} \label{sec-2} It might be helpful to look at these help pages briefly before getting started: \begin{center} \begin{tabular}{ll} \hline Function & Purpose\\ \hline critVal.target & a helper for gRxCluster\\ gRxCluster & the main function\\ gRxCluster-object & says what gRxCluster returns\\ gRxPlot & plots results and crtical regions\\ gRxSummary & quick summary of results\\ \hline \end{tabular} \end{center} \subsection{Reading Data from a File} \label{sec-2-1} The core function in the package is \verb~gRxCluster~ and it requires genomic locations and group indicators. Those basic data might be represented by a table like this: \begin{verbatim} chromo pos grp chr1 176812 FALSE chr1 191298 TRUE chr1 337906 TRUE chr1 356317 TRUE chr1 516904 FALSE chr1 661124 FALSE . . . . . . . . . \end{verbatim} In \texttt{R} that table might be a \verb~data.frame~ or a collection of three equal length vectors. The first one here, \texttt{chromo}, indicates the chromosome. The \texttt{pos} column indicates the position on the chromosome (and note that the positions have been ordered from lowest to highest), and the \texttt{grp} vector indicates which of the two groups the row is associated with. If a table called \texttt{exptData.txt} contained the table above, this command would read it in: <>= df <- read.table("exptData.txt", header=TRUE) @ %def \subsection{Simulating Data} \label{sec-2-2} Here, \texttt{df} will be simulated. For a start some insertion sites are simulated according to a null distribution - i.e. the two sources are chosen according to a coin toss at each location. First the chromosome lengths are given <>= chr.lens <- structure(c(247249719L, 242951149L, 199501827L, 191273063L, 180857866L, 170899992L, 158821424L, 146274826L, 140273252L, 135374737L, 134452384L, 132349534L, 114142980L, 106368585L, 100338915L, 88827254L, 78774742L, 76117153L, 63811651L, 62435964L, 46944323L, 49691432L, 154913754L, 57772954L), .Names = c("chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8", "chr9", "chr10", "chr11", "chr12", "chr13", "chr14", "chr15", "chr16", "chr17", "chr18", "chr19", "chr20", "chr21", "chr22", "chrX", "chrY")) @ %def Now a sample is drawn from the chromosomes and for each chromosome a sample of positions is drawn. The function \verb~sample.pos~ is defined that samples the desired number of positions in the right range. These results are placed in a \verb~data.frame~ <>= set.seed(13245) chr.names <- names(chr.lens) chr.factor <- factor(chr.names,chr.names) chrs <- sample(chr.factor,40000,repl=TRUE, prob=chr.lens) chr.ns <- table(chrs) sample.pos <- function(x,y) sort(sample(y,x,repl=TRUE)) chr.pos <- mapply( sample.pos, chr.ns,chr.lens,SIMPLIFY=FALSE) df <- data.frame(chromo=rep(chr.factor,chr.ns), pos=unlist(chr.pos)) @ %def Now two groups are sampled as a logical vector: <>= df$grp <- rbinom(40000, 1, 0.5)==1 @ %def \subsection{Invoking \texttt{gRxCluster}} \label{sec-2-3} With this \texttt{data.frame} the function can be invoked. <>= require(geneRxCluster,quietly=TRUE) null.results <- gRxCluster(df$chromo,df$pos,df$grp,15L:30L,nperm=100L) as.data.frame(null.results)[,c(-4,-5)] @ %def The function call specified window widths of \texttt{15L:30L} sites and called for 100 permutations of the data with \verb~nperm=100L~. The resulting object, \verb~null.results~, is a \texttt{GRanges} object (which is supported by the \texttt{GenomicRanges} package \cite{lawrence2013software}) has 5 clumps. These clumps can be compared to the number of expected False Discoveries by invoking the function \verb~gRxSummary~: <>= gRxSummary( null.results ) @ %def The printed summary indicates 5 clusters (or clumps) were discovered, and that the estimated False Discovery Rate was 0.68 is a bit less than 1.0, which we know to be the actual False Discovery Rate. However, this is well within the bounds of variation in a simulation like this. The last part of the printout shows the values of all the arguments used in the call to \verb~gRxCluster~ including two that were filled in by default, and which will be discussed later on. \subsection{Simulating Clumps} \label{sec-2-4} Let's look at another example, but first add some true clumps to the simulation. We start by sampling chromosomes 30 times: <>= clump.chrs <- sample(chr.factor,30,repl=TRUE, prob=chr.lens) @ %def For each sample a position is chosen using the \verb~sample.pos~ function defined above \begin{verbatim} \end{verbatim} <<>>= clump.chr.pos.bound <- sapply(chr.lens[clump.chrs], function(y) sample.pos(1,y)) @ %def For each position, the number of sites in the clump is determined: <<>>= clump.site.ns <- rep(c(15,25,40),each=10) @ %def For every position, nearby sites (\(<1\) Mbase) are sampled: <>= clump.sites <- lapply(seq_along(clump.chrs), function(x) { chromo <- clump.chrs[x] n <- clump.site.ns[x] ctr <- clump.chr.pos.bound[x] chrLen <- chr.lens[chromo] if (ctr>= clump.grps <- rep(0:1,15)==1 @ %def then a \texttt{data.frame} is constructed, added to the \verb~df~ \texttt{data.frame} and the positions are put in order: <>= df2 <- data.frame( chromo=rep(clump.chrs,clump.site.ns), pos=unlist(clump.sites), grp=rep(clump.grps,clump.site.ns) ) df3 <- rbind(df,df2) df3 <- df3[order(df3$chromo,df3$pos),] @ %def Finally, the clump discovery takes place: <>= alt.results <- gRxCluster(df3$chromo,df3$pos,df3$grp, 15L:30L, nperm=100L) gRxSummary(alt.results) @ %def There were plenty of clumps discovered. Were they the simulated clumps or just False Discoveries? Several functions from the \verb~GenomicRanges~ package \cite{lawrence2013software} are useful in sorting this out. Here sites in the simulated clumps are turned into a \verb~GRanges~ object. <>= df2.GRanges <- GRanges(seqnames=df2$chromo,IRanges(start=df2$pos,width=1), clump=rep(1:30,clump.site.ns)) @ %def The function \verb~findOverlaps~ is used to map the regions in which clumps were found to the sites composing those simulated clumps, then the function \verb~subjectHits~ indicates which of the simulated clumps were found. <<>>= clumps.found <- subjectHits(findOverlaps(alt.results,df2.GRanges)) @ %def Finally, the number of sites in the simulated clumps that are covered by each estimated clump is printed. <>= matrix( table(factor(df2.GRanges$clump[ clumps.found ],1:30)), nrow=10,dimnames=list(clump=NULL,site.ns=c(15,25,40))) @ %def Notice that fewer than half of the clumps consisting of just 15 sites are found, the clumps of 25 sites are usually found, but usually all of the sites composing each clump are not found. The clumps formed from 40 sites are found and all or almost all of the sites in each clump are found. And here the clumps that are False Discoveries are counted by using the \verb~countOverlaps~ function <<>>= sum( countOverlaps(alt.results, df2.GRanges ) == 0 ) @ %def \section{Customizing Critical Regions and Filters} \label{sec-3} The critical regions used above can be displayed like this: <>= gRxPlot(alt.results,method="criticalRegions") @ %def Notice that the regions are not perfectly symmetrical. This is because the proportions of the two classes are not exactly equal: <<>>= xtabs(~grp, df3) @ %def The \verb~gRxCluster~ function provides a means of using another set of criitical regions and another filter expression. The expression for settings up critical regions is found in the is found in the \verb~metadata()~ slot of \verb~alt.results~ in the \texttt{\$call} component: <<>>= as.list(metadata(alt.results)$call)[['cutpt.tail.expr']] @ %def The expression is evaluated in an enviroment that has objects \verb~k~, \verb~n~, and an object called \verb~x~ that the expression may use. The object \verb~k~ is a set of values for the number of sites to include in a window, \verb~n~ is the results of \verb~table(df3$grp)~, and \verb~x~ is a matrix of the lagged differences of \verb~df3[,"pos"]~. The lags of order \texttt{(15:30)-1} (setting those that cross chromosome boundaries to \verb~NA~) make up the columns of \verb~x~. One obvious change that a user might make is to reset the value of \verb~target~. <>= generous.target.expr <- quote(critVal.target(k,n, target=20, posdiff=x)) generous.results <- gRxCluster(df3$chromo,df3$pos,df3$grp, 15L:30L,nperm=100L, cutpt.tail.expr=generous.target.expr) gRxSummary(generous.results) @ %def Many more discoveries are made, but look at the count of false discoveries: <<>>= sum( 0==countOverlaps(generous.results,df2.GRanges)) @ %def The filter function is also found in the \verb~metadata()~ slot of \verb~alt.results~ in the \texttt{\$call} component: <<>>= as.list(metadata(alt.results)$call)[['cutpt.filter.expr']] @ %def \verb~alt.result~ filtered out the windows whose widths were less than the median number of bases. The expression is evaluated in the enviroment as before, but only the object \verb~x~ has been added in at the time the expression is called. If filtering is not desired it can be tunred off by using an expression that returns values higher than any seen in \verb~x~ such as this: <>= no.filter.expr <- quote(rep(Inf,ncol(x))) no.filter.results <- gRxCluster(df3$chromo,df3$pos,df3$grp,15L:30L,nperm=100L, cutpt.filter.expr=no.filter.expr) gRxSummary(no.filter.results) @ %def The effect of using non-specific filters to increase power is applied in gene-expression microarray studies \cite{bourgon2010independent}. The less stringent filtering results in fewer discoveries, but the number of false discoveries also decreased: <<>>= sum( 0==countOverlaps(no.filter.results,df2.GRanges)) @ %def Here a more stringent filter is used <>= hard.filter.expr <- quote(apply(x,2,quantile, 0.15, na.rm=TRUE)) hard.filter.results <- gRxCluster(df3$chromo,df3$pos,df3$grp,15L:30L, nperm=100L, cutpt.filter.expr=hard.filter.expr) gRxSummary(hard.filter.results) @ %def The number of discoveries here needs to be corrected for the number of false discoveries if comparisons are to be made: <<>>= sum( 0==countOverlaps(hard.filter.results,df2.GRanges)) @ %def It seems to do a bit better than the other two alternatives when true and false discovery numbers are considered. \begin{thebibliography}{1} \bibitem{berry2014} Charles C. Berry and Karen E. Ocwieja and Nirvav Malani and Frederic D. Bushman. \newblock Comparing DNA site clusters with Scan Statistics. \newblock{\em Bioinformatics}, doi: 10.1093/bioinformatics/btu035, 2014. \bibitem{lawrence2013software} Michael Lawrence, Wolfgang Huber, Herv{\'e} Pag{\`e}s, Patrick Aboyoun, Marc Carlson, Robert Gentleman, Martin~T Morgan, and Vincent~J Carey. \newblock Software for computing and annotating genomic ranges. \newblock {\em PLoS Computational Biology}, 9(8):e1003118, 2013. \bibitem{bourgon2010independent} R.~Bourgon, R.~Gentleman, and W.~Huber. \newblock {Independent filtering increases detection power for high-throughput experiments}. \newblock {\em Proceedings of the National Academy of Sciences}, 107(21):9546, 2010. \end{thebibliography} % Emacs 24.3.1 (Org mode 8.2.5h) \end{document}