% \VignetteEngine{knitr::knitr}
% \VignetteIndexEntry{ccrepe}
\documentclass{article}
%% -------------------------
%% Pre-amble:
%% -------------------------
\usepackage{float}
\usepackage[nonamebreak,square]{natbib}
\usepackage{amsmath}
<<echo=FALSE, results="asis">>=
BiocStyle::latex(width=78, use.unsrturl=FALSE)
@
<<echo=FALSE>>=
library(ccrepe)
@
\title{CCREPE: Compositionality Corrected by PErmutation and REnormalization}
\author{Emma Schwager, George Weingart, Craig Bielski, Curtis Huttenhower}
%% -------------------------
%% End pre-amble
%% -------------------------
\begin{document}
\maketitle
\tableofcontents

\section{Introduction}
\Biocpkg{ccrepe} is a package for the analysis of sparse compositional data.
Specifically, it determines the significance of association between features
in a composition, using any similarity measure (e.g.\ Pearson correlation or
Spearman correlation). The CCREPE methodology stands for Compositionality
Corrected by Renormalization and Permutation, as detailed below. The package
also provides a novel similarity measure, the N-dimensional checkerboard score
(NC-score), particularly appropriate to compositions derived from microbial
community sequencing data. The analysis produces p-values and false discovery
rate q-values corrected for the effects of compositionality. The package
contains two functions, \Rfunction{ccrepe} and \Rfunction{nc.score}, and is
maintained by the Huttenhower lab (\email{ccrepe-users@googlegroups.com}).

%---------------------------------------------------------
\section{ccrepe}
%---------------------------------------------------------

\Rfunction{ccrepe} is the main package function. It calculates
compositionality-corrected p-values and q-values for a user-selected
similarity measure, operating on either one or two input matrices. If given
one matrix, all features (columns) in the matrix are compared to each other
using the selected similarity measure. If given two matrices, each feature in
the first is compared against all features in the second.

\subsection{General functionality}
Compositional data induce spurious correlations between features because the
values in each sample must sum to a fixed total and are therefore not
independent. CCREPE corrects for this effect when determining the significance
of a similarity measure for each feature pair, using two main steps:
permutation/renormalization and bootstrapping. First, given two features to
compare, CCREPE generates a null distribution of the similarity expected due
to compositionality alone by iteratively permuting one feature, renormalizing
all samples in the composition to their previous sum, and recomputing the
similarity measure. Second, CCREPE bootstraps over sample subsets in order to
assess confidence in the ``true'' similarity measure. Finally, the two
resulting distributions are compared using a pooled-variance Z-test to give a
compositionality-corrected p-value. False discovery rate q-values are
additionally calculated using the Benjamini-Hochberg-Yekutieli procedure. For
greater detail, see \citet{faust2012microbial} and
\citet{schwager2012unpublished}.\\

CCREPE employs several filtering steps before the data are processed. It
removes any missing subjects using \Rcode{na.omit}: in the two-dataset case,
any subjects missing in \emph{either} dataset will be removed. Any subjects or
features which are all zero are removed as well: an all-zero subject cannot be
normalized (its sum is 0) and an all-zero feature has standard deviation 0 (in
addition to being uninteresting biologically).
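To make the permutation/renormalization step concrete, the following minimal
sketch (not run) builds such a null distribution by hand for two features of a
normalized relative-abundance matrix such as the \Rcode{data.norm} matrix
constructed in the examples below. It uses Spearman correlation as the
similarity measure and is an illustration of the idea only, not the package's
internal implementation.
<<eval=FALSE>>=
## Illustrative sketch only -- assumes a samples-by-features matrix
## `data.norm` whose rows sum to 1 (see the examples below).
null.sim <- replicate(1000, {
    permuted <- data.norm
    permuted[,1] <- sample(permuted[,1])      # permute feature 1
    permuted <- permuted/rowSums(permuted)    # renormalize each sample (row)
    cor(permuted[,1], permuted[,2], method="spearman")
})
observed <- cor(data.norm[,1], data.norm[,2], method="spearman")
## ccrepe additionally bootstraps the observed similarity over sample subsets
## and compares the two distributions with a pooled-variance Z-test.
@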
\subsection{Arguments}
\begin{description}
\setlength{\itemsep}{1em}
\item[\Rcode{x}] First \Rclass{dataframe} or \Rclass{matrix} containing
  relative abundances. Columns are features, rows are samples. Rows should
  therefore sum to a constant. Row names are used for identification if
  present.
\item[\Rcode{y}] Second \Rclass{dataframe} or \Rclass{matrix} (optional)
  containing relative abundances. Columns are features, rows are samples.
  Rows should therefore sum to a constant. If both \Rcode{x} and \Rcode{y} are
  specified, they will be merged by row names. If no row names are specified
  for either or both datasets, the default is to merge by row number.
\item[\Rcode{sim.score}] Similarity measure, such as \Rfunction{cor} or
  \Rfunction{nc.score}. This can be either an existing R function or
  user-defined. If the latter, certain properties should be satisfied as
  detailed below (also see the examples). The default similarity measure is
  Spearman correlation. A user-defined similarity measure should mimic the
  interface of \Rfunction{cor}:
  \begin{enumerate}
  \item Take either two \Rclass{vector} inputs or one \Rclass{matrix} or
    \Rclass{dataframe} input.
  \item In the case of two inputs, return a single number.
  \item In the case of one input, return a matrix in which the
    (\Rcode{i},\Rcode{j})th entry is the similarity score for column \Rcode{i}
    and column \Rcode{j} in the original matrix.
  \item The resulting matrix (in the case of one input) must be symmetric.
  \item The inputs must be named \Rcode{x} and \Rcode{y}.
  \end{enumerate}
\item[\Rcode{sim.score.args}] An optional list of arguments for the similarity
  measure function. When given, they are passed to the \Rcode{sim.score}
  function directly. For example, in the case of \Rfunction{cor}, the
  following would be acceptable:
<<eval=FALSE>>=
sim.score.args = list(method="spearman", use="complete.obs")
@
\item[\Rcode{min.subj}] Minimum number of samples that must be non-missing in
  order to apply the similarity measure. This is to ensure that there are
  sufficient samples to perform a bootstrap (default: 20).
\item[\Rcode{iterations}] The number of iterations for both bootstrap and
  permutation calculations (default: 1000).
\item[\Rcode{subset.cols.x}] A vector of column indices from \Rcode{x}
  indicating which features to compare.
\item[\Rcode{subset.cols.y}] A vector of column indices from \Rcode{y}
  indicating which features to compare.
\item[\Rcode{errthresh}] If a feature has a fraction of zeros greater than
  \Rcode{errthresh}${}^{1/n}$, where $n$ is the number of samples, that
  feature is excluded.
\item[\Rcode{verbose}] If \Rcode{TRUE}, print periodic progress of the
  algorithm through the dataset(s), as well as more detailed debugging output
  (default: \Rcode{FALSE}).
\item[\Rcode{iterations.gap}] If \Rcode{verbose=TRUE}, the number of
  iterations between issuing status messages (default: 100).
\item[\Rcode{distributions}] Optional output file to which a detailed log of
  all intermediate permutation and renormalization distributions is written
  (if given).
\item[\Rcode{compare.within.x}] A boolean value indicating whether to compare
  all pairs of features within \Rcode{subset.cols.x} (\Rcode{TRUE}) or all
  combinations of features in \Rcode{subset.cols.x} with features in
  \Rcode{subset.cols.y} (\Rcode{FALSE}). If \Rcode{FALSE} but
  \Rcode{subset.cols.y=NA}, all comparisons involving any features in
  \Rcode{subset.cols.x} are returned. This argument is only used when
  \Rcode{y=NA}.
\item[\Rcode{concurrent.output}] Optional output file to which each comparison
  will be written as it is calculated.
\item[\Rcode{make.output.table}] A boolean value indicating whether to include
  table-formatted output.
\end{description}

\subsection{Output}
\Rcode{ccrepe} returns a \Rclass{list} containing both the calculation results
and the parameters used:
\begin{description}
\setlength{\itemsep}{1em}
\item[\Rcode{sim.score}] \Rclass{matrix} of similarity scores for all
  requested comparisons. In the one-dataset case, the (\Rcode{i},\Rcode{j})th
  element is the similarity score of column \Rcode{i} and column \Rcode{j}
  (or of the \Rcode{i}th and \Rcode{j}th columns selected by
  \Rcode{subset.cols.x}). In the two-dataset case, it is the similarity score
  of column \Rcode{i} of dataset \Rcode{x} (or the \Rcode{i}th column selected
  by \Rcode{subset.cols.x}) and column \Rcode{j} of dataset \Rcode{y} (or the
  \Rcode{j}th column selected by \Rcode{subset.cols.y}).
\item[\Rcode{p.values}] \Rclass{matrix} of the corrected p-values for all
  requested comparisons. The (\Rcode{i},\Rcode{j})th element corresponds to
  the p-value of the (\Rcode{i},\Rcode{j})th element of \Rcode{sim.score}.
\item[\Rcode{q.values}] \Rclass{matrix} of the Benjamini-Hochberg-Yekutieli
  corrected q-values. The (\Rcode{i},\Rcode{j})th element corresponds to the
  q-value of the (\Rcode{i},\Rcode{j})th element of \Rcode{sim.score}.
\item[\Rcode{z.stat}] \Rclass{matrix} of the z-statistics used in generating
  the p-values for all requested comparisons. The (\Rcode{i},\Rcode{j})th
  element corresponds to the z-statistic generating the
  (\Rcode{i},\Rcode{j})th element of \Rcode{p.values}.
\end{description}

\subsection{Usage}
<<eval=FALSE>>=
ccrepe(
 x = NA,
 y = NA,
 sim.score = cor,
 sim.score.args = list(),
 min.subj = 20,
 iterations = 1000,
 subset.cols.x = NULL,
 subset.cols.y = NULL,
 errthresh = 1e-04,
 verbose = FALSE,
 iterations.gap = 100,
 distributions = NA,
 compare.within.x = TRUE,
 concurrent.output = NA,
 make.output.table = FALSE)
@

\subsection{Example 1}
An example of how to use \Rfunction{ccrepe} with one dataset.
<<>>=
data <- matrix(rlnorm(40,meanlog=0,sdlog=1),nrow=10,ncol=4)
data[,1] <- 2*data[,2] + rnorm(10,0,0.01)
data.rowsum <- apply(data,1,sum)
data.norm <- data/data.rowsum
apply(data.norm,1,sum)  # The rows sum to 1, so the data are normalized
test.input <- data.norm
dimnames(test.input) <- list(
    c("Sample 1","Sample 2","Sample 3","Sample 4","Sample 5",
      "Sample 6","Sample 7","Sample 8","Sample 9","Sample 10"),
    c("Feature 1","Feature 2","Feature 3","Feature 4"))
test.output <- ccrepe(x=test.input, iterations=20, min.subj=10)
@
<<>>=
par(mfrow=c(1,2))
plot(data[,1],data[,2],xlab="Feature 1",ylab="Feature 2",main="Non-normalized")
plot(data.norm[,1],data.norm[,2],xlab="Feature 1",ylab="Feature 2",
     main="Normalized")
test.output
@

\subsection{Example 2}
An example of how to use \Rfunction{ccrepe} with two datasets.
<<>>=
data <- matrix(rlnorm(40,meanlog=0,sdlog=1),nrow=10,ncol=4)
data[,1] <- 2*data[,2] + rnorm(10,0,0.01)
data.rowsum <- apply(data,1,sum)
data.norm <- data/data.rowsum
apply(data.norm,1,sum)  # The rows sum to 1, so the data are normalized
test.input <- data.norm

data2 <- matrix(rlnorm(105,meanlog=0,sdlog=1),nrow=15,ncol=7)
aligned.rows <- c(seq(1,4),seq(6,9),11,12)  # The datasets don't need
                                            # to have subjects line up exactly
data2[aligned.rows,1] <- 2*data[,3] + rnorm(10,0,0.01)
data2.rowsum <- apply(data2,1,sum)
data2.norm <- data2/data2.rowsum
apply(data2.norm,1,sum)  # The rows sum to 1, so the data are normalized
test.input.2 <- data2.norm

dimnames(test.input) <- list(paste("Sample",seq(1,10)),paste("Feature",seq(1,4)))
dimnames(test.input.2) <- list(paste("Sample",c(seq(1,4),11,seq(5,8),12,9,10,13,14,15)),
                               paste("Feature",seq(1,7)))
test.output.two.datasets <- ccrepe(x=test.input, y=test.input.2, iterations=20, min.subj=10)
@
Note that a warning is issued because the subjects do not completely match
between the two datasets; only the paired observations are used.
<<>>=
par(mfrow=c(1,2))
plot(data2[aligned.rows,1],data[,3],xlab="dataset 2: Feature 1",
     ylab="dataset 1: Feature 3",main="Non-normalized")
plot(data2.norm[aligned.rows,1],data.norm[,3],xlab="dataset 2: Feature 1",
     ylab="dataset 1: Feature 3",main="Normalized")
test.output.two.datasets
@

\subsection{Example 3}
An example of how to use \Rfunction{ccrepe} with \Rfunction{nc.score} as the
similarity score.
<<>>=
data <- matrix(rlnorm(40,meanlog=0,sdlog=1),nrow=10,ncol=4)
data[,1] <- 2*data[,2] + rnorm(10,0,0.01)
data.rowsum <- apply(data,1,sum)
data.norm <- data/data.rowsum
apply(data.norm,1,sum)  # The rows sum to 1, so the data are normalized
test.input <- data.norm
dimnames(test.input) <- list(paste("Sample",seq(1,10)),paste("Feature",seq(1,4)))
test.output.nc.score <- ccrepe(x=test.input, sim.score=nc.score, iterations=20, min.subj=10)
@
<<>>=
par(mfrow=c(1,2))
plot(data[,1],data[,2],xlab="Feature 1",ylab="Feature 2",main="Non-normalized")
plot(data.norm[,1],data.norm[,2],xlab="Feature 1",ylab="Feature 2",
     main="Normalized")
test.output.nc.score
@

\subsection{Example 4}
An example of how to use \Rfunction{ccrepe} with a user-defined
\Rcode{sim.score} function.
<<>>=
data <- matrix(rlnorm(40,meanlog=0,sdlog=1),nrow=10,ncol=4)
data[,1] <- 2*data[,2] + rnorm(10,0,0.01)
data.rowsum <- apply(data,1,sum)
data.norm <- data/data.rowsum
apply(data.norm,1,sum)  # The rows sum to 1, so the data are normalized
test.input <- data.norm
dimnames(test.input) <- list(paste("Sample",seq(1,10)),paste("Feature",seq(1,4)))

my.test.sim.score <- function(x,y=NA,constant=0.5){
   if(is.vector(x) && is.vector(y)) return(constant)
   if(is.matrix(x) && is.na(y)) return(matrix(rep(constant,ncol(x)^2),ncol=ncol(x)))
   if(is.data.frame(x) && is.na(y)) return(matrix(rep(constant,ncol(x)^2),ncol=ncol(x)))
   else stop('ERROR')
}

test.output.sim.score <- ccrepe(x=test.input, sim.score=my.test.sim.score,
                                iterations=20, min.subj=10,
                                sim.score.args=list(constant=0.6))
@
<<>>=
par(mfrow=c(1,2))
plot(data[,1],data[,2],xlab="Feature 1",ylab="Feature 2",main="Non-normalized")
plot(data.norm[,1],data.norm[,2],xlab="Feature 1",ylab="Feature 2",
     main="Normalized")
test.output.sim.score
@

\subsection{Example 5}
An example of how to use \Rfunction{ccrepe} when specifying column subsets.
<<>>=
data <- matrix(rlnorm(40,meanlog=0,sdlog=1),nrow=10,ncol=4)
data.rowsum <- apply(data,1,sum)
data.norm <- data/data.rowsum
apply(data.norm,1,sum)  # The rows sum to 1, so the data are normalized
test.input <- data.norm
dimnames(test.input) <- list(paste("Sample",seq(1,10)),paste("Feature",seq(1,4)))

test.output.1.3  <- ccrepe(x=test.input, iterations=20, min.subj=10,
                           subset.cols.x=c(1,3))
test.output.1    <- ccrepe(x=test.input, iterations=20, min.subj=10,
                           subset.cols.x=c(1), compare.within.x=FALSE)
test.output.12.3 <- ccrepe(x=test.input, iterations=20, min.subj=10,
                           subset.cols.x=c(1,2), subset.cols.y=c(3),
                           compare.within.x=FALSE)
test.output.1.3$sim.score
test.output.1$sim.score
test.output.12.3$sim.score
@

%---------------------------------------------------------
\section{nc.score}
%---------------------------------------------------------

The \Rfunction{nc.score} similarity measure is an N-dimensional extension of
the checkerboard score, particularly suited to similarity calculations between
compositions derived from ecological relative abundance measurements. In such
cases, features typically represent species abundances, and the NC-score
discretizes these continuous values into one of N bins before computing a
normalized similarity of co-occurrence or co-exclusion. It can be used as a
standalone function or with \Rfunction{ccrepe} as above to obtain
compositionality-corrected p-values.

\subsection{General Functionality}
The NC-score is an extension of Diamond's checkerboard score (see
\citet{cody1975ecology}) to ordinal data, and simplifies to a calculation of
Kendall's $\tau$ on binned data instead of ranked data. Let two features in a
dataset with $n$ subjects be denoted by
\[
\left[
\begin{array}{cccc}
x_1 & x_2 & \dots & x_n\\
y_1 & y_2 & \dots & y_n\\
\end{array}
\right].
\]
The binning function maps from the original data to $b$ numbered bins in
$\{1,...,b\}$. Let the binning function be denoted by $B(\cdot)$. The
co-variation and co-exclusion patterns are the same as concordant and
discordant pairs in Kendall's $\tau$. Considering a $2 \times 2$ submatrix of
the form
\[
\left[
\begin{array}{cc}
B(x_i) & B(x_j)\\
B(y_i) & B(y_j)\\
\end{array}
\right],
\]
a co-variation pattern is counted when
$(B(x_i) - B(x_j))(B(y_i) - B(y_j)) > 0$ and a co-exclusion pattern,
conversely, when $(B(x_i) - B(x_j))(B(y_i) - B(y_j)) < 0$. The NC-score
statistic for features $x$ and $y$ is then defined as
\[
(\text{number of co-variation patterns}) - (\text{number of co-exclusion patterns}),
\]
normalized by the Kendall's $\tau$ normalization factor accounting for ties
described in \citet{kendall1970}. A short sketch illustrating this
relationship to Kendall's $\tau$ is given after the argument descriptions
below.

\subsection{Arguments}
\begin{description}
\setlength{\itemsep}{1em}
\item[\Rcode{x}] First numerical \Rclass{vector}, or single \Rclass{dataframe}
  or \Rclass{matrix}, containing relative abundances. If the latter, columns
  are features and rows are samples. Rows should therefore sum to a constant.
\item[\Rcode{y}] If provided, second numerical \Rclass{vector} containing
  relative abundances. If given, \Rcode{x} must be a \Rclass{vector} as well.
\item[\Rcode{nbins}] A non-negative integer giving the number of bins to
  generate (cutoffs will be generated by the \Rfunction{discretize} function
  from the infotheo package).
\item[\Rcode{bin.cutoffs}] A list of values demarcating the bin cutoffs. The
  binning is performed using the \Rfunction{findInterval} function.
\item[\Rcode{use}] An optional character string giving a method for computing
  covariances in the presence of missing values. This must be (an abbreviation
  of) one of the strings \Rcode{"everything"}, \Rcode{"all.obs"},
  \Rcode{"complete.obs"}, \Rcode{"na.or.complete"}, or
  \Rcode{"pairwise.complete.obs"}.
\end{description}
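As a rough illustration of the relationship to Kendall's $\tau$ described
above, the following sketch (not run; the vectors and cutoffs are purely
hypothetical) bins two small vectors with \Rfunction{findInterval} and then
computes Kendall's $\tau$ on the binned values with \Rfunction{cor}; the
result should closely agree with the corresponding \Rfunction{nc.score} call.
<<eval=FALSE>>=
## Illustrative sketch only: bin two features with user-supplied cutoffs and
## compare nc.score to Kendall's tau computed on the binned values.
x <- c(0.05, 0.15, 0.25, 0.35, 0.12, 0.28)   # hypothetical relative abundances
y <- c(0.07, 0.18, 0.22, 0.31, 0.09, 0.33)
cutoffs <- c(0.1, 0.2, 0.3)
nc.score(x=x, y=y, bin.cutoffs=cutoffs)
cor(findInterval(x, cutoffs), findInterval(y, cutoffs), method="kendall")
@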
\subsection{Output}
\Rfunction{nc.score} returns either a single number (if called with two
vectors) or a \Rclass{matrix} of all pairwise scores (if called with a
\Rclass{matrix}). This behavior is precisely analogous to that of the
\Rfunction{cor} function in R.

\subsection{Usage}
<<eval=FALSE>>=
nc.score(
 x,
 y = NULL,
 use = "everything",
 nbins = NULL,
 bin.cutoffs = NULL)
@

\subsection{Example 1}
An example of using \Rfunction{nc.score} to get a single similarity score or a
matrix.
<<>>=
data <- matrix(rlnorm(40,meanlog=0,sdlog=1),nrow=10,ncol=4)
data[,1] <- 2*data[,2] + rnorm(10,0,0.01)
data.rowsum <- apply(data,1,sum)
data.norm <- data/data.rowsum
apply(data.norm,1,sum)  # The rows sum to 1, so the data are normalized
test.input <- data.norm
dimnames(test.input) <- list(paste("Sample",seq(1,10)),paste("Feature",seq(1,4)))

test.output.matrix <- nc.score(x=test.input)
test.output.num    <- nc.score(x=test.input[,1],y=test.input[,2])
@
<<>>=
par(mfrow=c(1,2))
plot(data[,1],data[,2],xlab="Feature 1",ylab="Feature 2",main="Non-normalized")
plot(data.norm[,1],data.norm[,2],xlab="Feature 1",ylab="Feature 2",
     main="Normalized")
test.output.matrix
test.output.num
@

\subsection{Example 2}
An example of using \Rfunction{nc.score} with an arbitrary bin number.
<<>>=
data <- matrix(rlnorm(40,meanlog=0,sdlog=1),nrow=10,ncol=4)
data[,1] <- 2*data[,2] + rnorm(10,0,0.01)
data.rowsum <- apply(data,1,sum)
data.norm <- data/data.rowsum
apply(data.norm,1,sum)  # The rows sum to 1, so the data are normalized
test.input <- data.norm
dimnames(test.input) <- list(paste("Sample",seq(1,10)),paste("Feature",seq(1,4)))
test.output <- nc.score(x=test.input,nbins=4)
@
<<>>=
par(mfrow=c(1,2))
plot(data[,1],data[,2],xlab="Feature 1",ylab="Feature 2",main="Non-normalized")
plot(data.norm[,1],data.norm[,2],xlab="Feature 1",ylab="Feature 2",
     main="Normalized")
test.output
@

\subsection{Example 3}
An example of using \Rfunction{nc.score} with user-defined bin edges.
<<>>=
data <- matrix(rlnorm(40,meanlog=0,sdlog=1),nrow=10,ncol=4)
data[,1] <- 2*data[,2] + rnorm(10,0,0.01)
data.rowsum <- apply(data,1,sum)
data.norm <- data/data.rowsum
apply(data.norm,1,sum)  # The rows sum to 1, so the data are normalized
test.input <- data.norm
dimnames(test.input) <- list(paste("Sample",seq(1,10)),paste("Feature",seq(1,4)))
test.output <- nc.score(x=test.input,bin.cutoffs=c(0.1,0.2,0.3))
@
<<>>=
par(mfrow=c(1,2))
plot(data[,1],data[,2],xlab="Feature 1",ylab="Feature 2",main="Non-normalized")
plot(data.norm[,1],data.norm[,2],xlab="Feature 1",ylab="Feature 2",
     main="Normalized")
test.output
@

%---------------------------------------------------------
\section{References}
%---------------------------------------------------------
\bibliographystyle{plainnat}
\bibliography{ccrepe}

\end{document}