%\VignetteIndexEntry{RRHO} \documentclass[11pt]{article} \usepackage[utf8]{inputenc} \usepackage{natbib} \usepackage{amsthm} \newtheorem{remark}{Remark}[section] \title{A Practitioner's Guide to Multiple Testing Error Rates} \author{Jonathan Rosenblatt \\ Department of Statistics and Operations Research,\\ The Sackler Faculty of Exact Sciences, \\ Tel Aviv University \\ Israel} \begin{document} \SweaveOpts{concordance=TRUE} \maketitle \section{Introduction} Consider the problem of comparing two lists of same length. In the following, we will assume these lists consist of genes. A possible way of comparison would be to test the hypothesis that the lists' ordering is arbitrary, versus some type of agreement. This is the purpose of this package, based on the work of~\citet{plaisier_rankrank_2010}. The proposed approach is to count the number of common genes in the first $i$ and $j$ elements of the first and second list respectively. As the count of common elements could be driven by chance, the significance of the observed count is computed assuming the hypothesis of completely random list orderings. As this is performed for all $i$ and $j$, multiplicity kicks in. The package offers both FWER control \footnote{ For a general introduction to multiple testing error rates, see~\citep{rosenblatt_practitioners_2013}.} using permutation testing and FDR control using the B-Y procedure~\citep{benjamini_control_2001} as proposed in the original work by~\citet{plaisier_rankrank_2010}. \begin{remark} FDR or FWER? ~\citet{plaisier_rankrank_2010} recommend the control of the FDR over the different $i$s and $j$s. Each $i,j$ combination tests the null hypothesis of ``arbitrary rankings of the two lists'', versus an alternative of ``non-arbitrary ranking in the first $i$ and $j$ elements of the first and second list respectively''. FDR control is thus appropriate if concerned with the number of false $i,j$ statements made. If only concerned with the existence of any non-arbitrariness, without claiming at which part of the lists it resides, than FWER control is more appropriate. \end{remark} \section{Workflow} We start with a sketch of the workflow. The details follow. \begin{itemize} \item Compute the significance of the gene overlap for all $i$ and $j$ first elements of the two lists. \item Correct the nominal significance levels for the multiple $i$s and $j$s. \item Report finding using the exported significance matrices and accompanying Venn diagrams. \end{itemize} <>= library(RRHO) # Create "gene" lists: list.length <- 100 list.names <- paste('Gene',1:list.length, sep='') gene.list1<- data.frame(list.names, sample(100)) gene.list2<- data.frame(list.names, sample(100)) # Compute overlap and significance RRHO.example <- RRHO(gene.list1, gene.list2, BY=TRUE) @ <>= # Examine Nominal (-log) pvalues image(RRHO.example$hypermat) # Note: If lattice is available try: levelplot(RRHO.example$hypermat) @ <<>>= pval.testing <- pvalRRHO(RRHO.example,50) # FWER corrected pvalue: pval.testing$pval # The sampling distribution of the minimum of the (-log) nominal p-values: xs<- seq(0,10,length=100) plot(Vectorize(pval.testing$FUN.ecdf)(xs)~xs, xlab='-log(pvalue)', ylab='ECDF', type='S') @ <>= # Examine B-Y corrected pvalues # Note: probably nothing will be rejected in this example as the data is generated from the null. image(RRHO.example$hypermat.by) @ \bibliography{Project-RRHO} \bibliographystyle{plainnat} \end{document}