%\VignetteIndexEntry{MAST-intro}
%\VignetteEngine{knitr::knitr}
\documentclass{article}
\usepackage{url, graphicx}
\usepackage{color}
\usepackage[cm]{fullpage}
\usepackage[usenames,dvipsnames]{xcolor}
%\usepackage[authoryear]{natbib}
%\makeatletter
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% User specified LaTeX commands.
%\makeatother
\newcommand{\future}[1]{TODO: {\color{gray} #1}}
\newcommand{\sca}{\texttt{MAST}}
\input{symbols.tex}
\begin{document}
\title{An Introduction to MAST}
\author{Andrew McDavid and Greg Finak}
\maketitle

\section{Philosophy}
\sca{} is an R/Bioconductor package for managing and analyzing qPCR and sequencing-based single--cell gene expression data, as well as data from other types of single--cell assays. Our goal is to support assays that have multiple \emph{features} (genes, markers, etc.) per \emph{well} (cell, etc.) in a flexible manner. Assays are assumed to be mostly \emph{complete} in the sense that most \emph{wells} contain measurements for all features.

\subsection{Internals}
A \texttt{SingleCellAssay} object can be manipulated as a matrix, with rows giving wells and columns giving features.

\subsection{Statistical Testing}
Apart from reading and storing single--cell assay data, the package also provides functionality for significance testing of differential expression using a combined binomial and normal--theory likelihood ratio test, as well as filtering of individual outlier wells. These methods are described in our papers. % Add citations

\section{Examples}
With the cursory background out of the way, we'll proceed with some examples to help understand how the package is used.

\subsection{Reading Data}
Data can be imported in a Fluidigm instrument-specific format (the details of which are undocumented, and likely subject to change), in some derived, annotated format, or in ``long'' (melted) format, in which each row is a single measurement. Thus if there are $N$ wells and $M$ features, the \texttt{data.frame} should contain $N \times M$ rows. The use of key--value mappings makes the reading of various input formats very flexible, provided that they contain the minimal information required by the package.

For example, the following data set was provided as a comma-separated values file. It has the cycle threshold ($\ct$) recorded. Non-detected genes are recorded as NAs. For the Fluidigm/qPCR single cell expression functions to work as expected, we must use the \emph{expression threshold}, defined as $et = c_{\mathrm{max}} - \ct$, which is proportional to the log-expression.

Below, we load the package and the data, then compute the expression threshold from the $\ct$ and construct a \texttt{FluidigmAssay}.
<<>>=
suppressPackageStartupMessages(library(Biobase))
library(knitr)
opts_chunk$set(error=FALSE)
library(MAST)
library(data.table)
library(plyr)
data(vbeta)
colnames(vbeta)
vbeta <- computeEtFromCt(vbeta)
vbeta.fa <- FluidigmAssay(vbeta, idvars=c("Subject.ID", "Chip.Number", "Well"),
                          primerid='Gene', measurement='Et',
                          ncells='Number.of.Cells', geneid="Gene",
                          cellvars=c('Number.of.Cells', 'Population'),
                          phenovars=c('Stim.Condition','Time'), id='vbeta all')
show(vbeta.fa)
@
We see that the variable \texttt{vbeta} is a \texttt{data.frame} from which we construct the \texttt{FluidigmAssay} object. \texttt{idvars} is the set of column(s) in \texttt{vbeta} that uniquely identify a well (globally), and \texttt{primerid} is the column that identifies the feature measured in each well. \texttt{measurement} gives the column containing the log-expression measurement, and \texttt{ncells} gives the column containing the number of cells (or other normalizing factor) for the well. \texttt{geneid}, \texttt{cellvars} and \texttt{phenovars} all specify additional columns to be included in the \texttt{featureData}, \texttt{cellData} and \texttt{phenoData} (\future{wellData}).
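To make the long-format requirement concrete, the sketch below builds a tiny \texttt{data.frame} by hand and passes it to the constructor with only the minimal arguments. The well names, gene names and \texttt{Et} values are hypothetical, and the chunk is not evaluated; it only illustrates the expected shape of the input.
<<toyLongFormat, eval=FALSE>>=
## A toy long-format data set: 2 wells x 2 genes = 4 rows.
## All values are hypothetical, for illustration only.
toy <- data.frame(Well=rep(c('A01', 'A02'), each=2),
                  Gene=rep(c('GeneA', 'GeneB'), times=2),
                  Et=c(12.1, 0, 9.7, 3.2),
                  Number.of.Cells=1)
toy.fa <- FluidigmAssay(toy, idvars='Well', primerid='Gene',
                        measurement='Et', ncells='Number.of.Cells',
                        id='toy example')
@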
Returning to the \texttt{vbeta} example, the constructor returned a \texttt{FluidigmAssay} object with \Sexpr{nrow(cData(vbeta.fa))} wells and \Sexpr{nrow(fData(vbeta.fa))} features. We can access the feature--level metadata and the cell--level metadata using the \texttt{fData} and \texttt{cData} accessors.
<<>>=
head(fData(vbeta.fa),3)
head(cData(vbeta.fa),3)
@
We see this gives us the set of genes measured in the assay, and the cell-level metadata (i.e.\ the number of cells measured in the well, the population this cell belongs to, the subject it came from, the chip it was run on, the well id, the stimulation it was subjected to, and the timepoint for the experiment this cell was part of). The \texttt{wellKey} is formed by concatenating the \texttt{idvars} columns, which helps ensure consistency when splitting and merging \sca{} objects. \future{Some of this ``cell--level'' information could arguably be part of the \texttt{@phenoData} slot of the object. This functionality is forthcoming but doesn't limit what can be done with the package at this stage}.

\subsection{Subsetting, splitting, combining}
It's possible to subset \sca{} objects by wells and features. Square brackets (``['') index wells on the first index and features on the second index. Integer and boolean indices may be used, as well as character vectors naming the \texttt{wellKey} or the feature (via the \texttt{primerid}). There is also a \texttt{subset} method, which evaluates its argument in the frame of the \texttt{cData}, and hence subsets by wells.
<<>>=
sub1 <- vbeta.fa[1:10,]
show(sub1)
sub2 <- subset(vbeta.fa, Well=='A01')
show(sub2)
sub3 <- vbeta.fa[1:10,6:10]
show(sub3)
cellData(sub3)
featureData(sub3)
@
The \texttt{cellData} and \texttt{featureData} \texttt{AnnotatedDataFrames} are subset accordingly as well.

A \sca{} may be split into a list of \sca{} objects, known as an \texttt{SCASet}. The \texttt{split} method takes an argument naming the column (factor) on which to split the data. Each level of the factor will be placed in its own \sca{} within the \texttt{SCASet}.
<<>>=
sp1 <- split(vbeta.fa, 'Subject.ID')
show(sp1)
@
The splitting variable can either be a character vector naming column(s) of the \sca, or may be a \texttt{factor} or \texttt{list} of \texttt{factor}s.

It's possible to combine \sca{} objects or an \texttt{SCASet} with the \texttt{combine} method.
<<>>=
combine(x=sp1[[1]], y=sp1[[2]])
combine(sp1)
@

\subsubsection{Modifying the cellData or featureData}
\texttt{combine} can also take a \texttt{vector}, \texttt{data.frame} or \texttt{AnnotatedDataFrame} and append it to the \texttt{cellData} or \texttt{featureData}. The number of rows must match the number of wells or the number of features, and determines to which component the argument will be appended.
<<>>=
newData <- data.frame(otherVariable=rnorm(nrow(vbeta.fa)))
vbetaWithNewData <- combine(vbeta.fa, newData)
head(cData(vbetaWithNewData))
@
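Because the dimension of the argument determines the target, the same call can append feature-level metadata instead. The sketch below, not evaluated and using a hypothetical \texttt{geneSet} annotation, passes one row per feature so that \texttt{combine} appends to the \texttt{featureData}:
<<addFeatureData, eval=FALSE>>=
## One row per feature, so 'combine' appends to the featureData.
## The 'geneSet' labels are hypothetical.
newFData <- data.frame(geneSet=sample(c('setA', 'setB'),
                                      nrow(fData(vbeta.fa)), replace=TRUE))
vbetaWithFData <- combine(vbeta.fa, newFData)
head(fData(vbetaWithFData))
@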
\subsection{Filtering}
We can filter and perform some significance tests on the \sca. We may want to filter out any wells with at least two outlier cells, where the discrete and continuous parts of the signal are at least 9 standard deviations from the mean. This is a very conservative filtering criterion. We'll group the filtering by the number of cells.

We'll split the assay by the number of cells and look at the concordance plot after filtering.
<<>>=
vbeta.split <- split(vbeta.fa, "Number.of.Cells")
## see default parameters for plotSCAConcordance
plotSCAConcordance(vbeta.split[[1]], vbeta.split[[2]],
                   filterCriteria=list(nOutlier = 1, sigmaContinuous = 9,
                                       sigmaProportion = 9))
@
The filtering function has several other options, including whether the filter should be applied (thus returning a new \texttt{SingleCellAssay} object) or returned as a matrix of boolean values.
<<>>=
vbeta.fa
## Split by 'ncells', apply to each component, then recombine
vbeta.filtered <- filter(vbeta.fa, groups='ncells')
## Returned as a boolean matrix
was.filtered <- filter(vbeta.fa, apply_filter=FALSE)
## Wells filtered for being discrete outliers
head(subset(was.filtered, pctout))
@
There's also some functionality for visualizing the filtering.
<<>>=
burdenOfFiltering(vbeta.fa, 'ncells', byGroup=TRUE)
@

\subsection{Significance testing under the Hurdle model}
There are two frameworks available in the package. The first framework, \texttt{zlm}, offers a full linear model to allow arbitrary comparisons and adjustment for covariates. The second framework, \texttt{LRT}, can be thought of as performing t-tests (respecting the discrete/continuous nature of the data) between pairs of groups. \texttt{LRT} is subsumed by the first framework, but might be simpler for some users, so it has been kept in the package.

We'll describe \texttt{zlm}. Models are specified in terms of the variable used as the measure and the covariates present in the \texttt{cellData}, using symbolic notation, just as with the \texttt{lm} function in R.
<<>>=
vbeta.1 <- subset(vbeta.fa, ncells==1)
## Consider the first 20 genes
vbeta.1 <- vbeta.1[,1:20]
layername(vbeta.1)
head(cData(vbeta.1))
@
Now, for each gene, we can regress \texttt{Et} on the factors \texttt{Population} and \texttt{Subject.ID}: we'll fit a Hurdle model with a separate intercept for each population and subject. An S4 object of class \texttt{ZlmFit} is returned, containing slots with the genewise coefficients, variance-covariance matrices, etc.
<<>>=
library(ggplot2)
library(reshape)
library(abind)
zlm.output <- zlm.SingleCellAssay(~ Population + Subject.ID, vbeta.1,
                                  method='glm', ebayes=TRUE)
show(zlm.output)
## returns a data.table with a summary of the fit
coefAndCI <- summary(zlm.output, logFC=FALSE)
coefAndCI <- coefAndCI[contrast != '(Intercept)',]
coefAndCI[,contrast:=abbreviate(contrast)]
ggplot(coefAndCI, aes(x=contrast, y=coef, ymin=ci.lo, ymax=ci.hi, col=component))+
    geom_pointrange(position=position_dodge(width=.5)) + facet_wrap(~primerid) +
    theme(axis.text.x=element_text(angle=45, hjust=1)) + coord_cartesian(ylim=c(-3, 3))
@
Try \verb|?ZlmFit-class| or \verb|showMethods(classes='ZlmFit')| to see a full list of methods.

The combined test for differences in the proportion of cells expressing/the average level of expression is found by calling a likelihood ratio test on the fitted object. An array of genes, metrics and test types is returned. We'll plot the $-\log_{10}$ P values by gene and test type.
<<>>=
zlm.lr <- lrTest(zlm.output, 'Population')
dimnames(zlm.lr)
pvalue <- ggplot(melt(zlm.lr[,,'Pr(>Chisq)']), aes(x=primerid, y=-log10(value)))+
    geom_bar(stat='identity') + facet_wrap(~test.type) + coord_flip()
print(pvalue)
@
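The returned array can also be sliced directly, for instance to rank genes by the combined test. A minimal sketch, not evaluated: we assume here that the combined test is labeled \texttt{hurdle} in the \texttt{test.type} dimension, which \texttt{dimnames(zlm.lr)} above will confirm.
<<rankGenes, eval=FALSE>>=
## P-values of the combined (hurdle) test, smallest first.
## Check dimnames(zlm.lr) if the test.type labels differ.
pvals <- zlm.lr[, 'hurdle', 'Pr(>Chisq)']
head(sort(pvals))
@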
In fact, the \texttt{zlm} framework is quite general, and has wrappers allowing a variety of modeling functions that accept \texttt{glm}-like arguments to be used, such as mixed models (using \texttt{lme4}) and Bayesian regression models (using \texttt{arm}). Multicore support is offered by setting \texttt{options(mc.cores=4)} (or however many cores your system has).
<<>>=
library(lme4)
lmer.output <- zlm.SingleCellAssay(~ Stim.Condition + (1|Subject.ID), vbeta.1,
                                   method='glmer')
@

\subsection{Two-sample Likelihood Ratio Test}
Another way to test for differential expression is available through the \texttt{LRT} function, which is analogous to two-sample t-tests.
<<>>=
library(car)
two.sample <- LRT(vbeta.1, 'Population', referent='CD154+VbetaResponsive')
car::some(two.sample)
@
Here we compare each population (\Sexpr{unique(cData(vbeta.1)$Population)[-1]}) to \Sexpr{unique(cData(vbeta.1)$Population)[1]}. The \texttt{Population} column shows which population is being compared, while \texttt{test.type} is \texttt{comb} for the combined normal theory/binomial test. Column \texttt{primerid} gives the gene being tested, \texttt{direction} shows whether the comparison group mean is greater (1) or less (-1) than the referent group, and \texttt{lrtstat} and \texttt{p.value} give the test statistic and $\chi^2$ p-value (two degrees of freedom). Other options control whether additional information about the tests is returned (\texttt{returnall=TRUE}) and whether the testing should be stratified by grouping variables, given as a character vector naming columns of the \texttt{cData} (\texttt{groups}).

These tests have been subsumed by \texttt{zlm.SingleCellAssay} but remain in the package for user convenience.

\section{Use with single cell RNA-sequencing data}
RNA-sequencing data is essentially no different from qPCR-based single cell gene expression once it has been aligned and mapped, if one is willing to reduce the experiment to counts or count-like data for a fixed set of genes/features. We assume that suitable tools (e.g., SAMseq or TopHat) have been applied to do this.

\future{More details on convenience functions for RNAseq data}
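In the meantime, count-like data can be brought into the package through the same long-format constructor used above. The sketch below is not evaluated and assumes a hypothetical \texttt{counts} matrix with cells in rows and genes in columns:
<<rnaseqSketch, eval=FALSE>>=
## 'counts' is a hypothetical cells x genes matrix of counts.
## Melt it to long format (one row per cell/gene pair) and
## construct a SingleCellAssay on a log-like scale.
longCounts <- data.frame(wellKey=rep(rownames(counts), times=ncol(counts)),
                         primerid=rep(colnames(counts), each=nrow(counts)),
                         value=log2(as.vector(counts) + 1))
sca <- SingleCellAssay(longCounts, idvars='wellKey', primerid='primerid',
                       measurement='value', id='RNAseq example')
@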
\section{Implementation Details}
Here we provide some background on the implementation of the package.

There are several fundamental new object types provided by the package. \texttt{DataLayer} is the base class, which provides an array-like object to store tabular data that might have multiple derived representations. A \texttt{SingleCellAssay} object contains a \texttt{DataLayer}, plus cell and feature data. New types of single cell assays can be incorporated by extending \texttt{SingleCellAssay}.

Different derived classes of \sca{} require different fields to be present in the \texttt{cellData} and \texttt{featureData}. These requirements are set for each class by the slots \texttt{cmap} and \texttt{fmap}, giving the required columns in the cell and feature data, respectively. We have found it useful to enforce naming conventions to reduce confusion when combining data across projects, so the constructor will rename the fields the user provides to match the values specified in \texttt{cmap} and \texttt{fmap}.

Sets of single cell assays are stored in the \texttt{SCASet} class. A constructor is provided to build an \texttt{SCASet} directly from a data frame. Alternatively, a \texttt{SingleCellAssay} or derived class can be \texttt{split} on an arbitrary variable to produce an \texttt{SCASet}.

On construction of a \texttt{SingleCellAssay} object, the package tests for completeness, and will fill in missing measurements with NA if the data are incomplete, so assays with lots of missing data can be marginally slower to read.

\end{document}