%\VignetteIndexEntry{SDAMS Vignette} %\VignettePackage{SDAMS} %\VignetteKeyword{Semi-parametric Differential Adundance Analysis} \documentclass[12pt]{article} \usepackage{float} \usepackage{Sweave} \usepackage{amsmath} \usepackage{amssymb} \RequirePackage{Bioconductor} \AtBeginDocument{\bibliographystyle{unsrturl}} \renewcommand{\baselinestretch}{1.3} \SweaveOpts{keep.source=TRUE,eps=FALSE,include=TRUE,width=4,height=4} \author{Yuntong Li$^{1}$, Chi Wang$^{2,3}$\footnote{to whom correspondence should be addressed}, Li Chen$^{2,3}$\footnote{to whom correspondence should be addressed}\\[1em] \small{$^{1}$Department of Statistics , University of Kentucky,Lexington, KY;}\\ \small{$^{2}$Markey Cancer Center, University of Kentucky, Lexington, KY ;}\\ \small{$^{3}$Department of Biostatistics, University of Kentucky, Lexington, KY;}\\ \small{\texttt{yuntong.li@uky.edu}}\\ \small{\texttt{chi.wang@uky.edu}}\\ \small{\texttt{lichenuky@uky.edu}}} \title{\textsf{\textbf{The SDAMS package}}} %\bibliographystyle{abbrv} \begin{document} \SweaveOpts{concordance=TRUE} \maketitle \begin{abstract} This vignette introduces the use of the Bioconductor package {\tt SDAMS}, which is designed for differential abundance analysis for metabolomics and proteomics data from mass spectrometry. These data may contain a large fraction of zero values and non-zero part may not be normally distributed. {\tt SDAMS} considers a two-part semi-parametric model, a logistic regression for the zero proportion and a semi-parametric log-linear model for the non-zero values. A kernel-smoothed likelihood method is proposed to estimate regression coefficients in the two-part model and a likelihood ratio test is constructed for differential abundant analysis. \end{abstract} \newpage \tableofcontents \newpage \section{Citation} The package {\tt SDAMS} implements statistical methods from the following publication. If you use {\tt SDAMS} in the published research, please cite: \\ Li, Y., Fan, T.W., Lane, A.N. et al. SDA: a semi-parametric differential abundance analysis method for metabolomics and proteomics data. BMC Bioinformatics 20, 501 (2019).\\ Yuntong Li, Chi Wang and Li Chen: SDAMS for differential expression analysis of single-cell RNA sequencing data. (Manuscript) \section{Quick Start} This section show the most basic {\tt SDAMS} work flow for a differential abundance analysis for metabolomics and proteomics data from mass spectrometry: \begin{enumerate} \item Create a {\tt SummarizedExperiment} object using function {\tt createSEFromMatrix} or {\tt createSEFromCSV}. In this section we use an example {\tt SummarizedExperiment} object directly, which is an object of {\tt SummarizedExperiment} class named {\tt exampleSumExp} contained in this package. \item Perform a differential abundance analysis using {\tt SDA}. \end{enumerate} <>= library("SDAMS") data("exampleSumExp") results <- SDA(exampleSumExp) @ Here, the {\tt SummarizedExperiment} class object {\tt exampleSumExp} contained in the package is the proteomics dataset, which a matrix-like container for proteomic features with experimental subject grouping information. There are 560 features for 202 experimental subjects with 49 prostate cancer subjects and 153 healthy subjects (0 for healthy control and 1 for patient in this case). This is a 10\% subsample of the original dataset. The features are stored as a matrix in the assay slot. Each row in this matrix represents a proteomic feature and each column represents a subject. See Reference~ \cite{siwy2011human} for detailed information regarding this dataset. \section{Data Input} \subsection{Create SummarizedExperiment object from csv files} The proteomics or metabolomics data is stored as a matrix with each row being a feature and each column corresponding to a subject. All data in this matrix are non-negative. Another information required is the phenotype covariates. Here we focus on the binary grouping information, such as numeric 1 for control group and 0 for case group. But it can also be characters, such as "healthy" and "disease". To utilize {\tt SDAMS} package, we should have two separate csv files (for example 'feature.csv' and 'group.csv') as inputs for {\tt createSEfromCSV} to creat a {\tt SummarizedExperiment} object. Note: \begin{enumerate} \item The $1^{st}$ column in 'feature.csv' represents feature names and the $1^{st}$ row represents subject codes. \item The $1^{st}$ column in 'group.csv' represents subject codes, for example, Subject1, Subject2.... \end{enumerate} The format for "csv files" should look like as Figure~\ref{example feature} and Figure~\ref{example group}: \begin{figure}[h!] \centering \includegraphics{feature.png} \caption{Example of 'feature.csv' pattern} \label{example feature} \end{figure} \begin{figure}[ht] \centering \includegraphics[width=2cm]{group.png} \caption{Example of 'group.csv' pattern} \label{example group} \end{figure} After creating the two csv files, we need the paths for the two csv files: <>= path1 <- "/path/to/your/feature.csv/" path2 <- "/path/to/your/group.csv/" @ Here for demonstration purpose, we use the data stored in {\tt inst/extdata} directory. This is the csv format of the data in exampleSumExp which is a {\tt SummarizedExperiment} object we described before. <>= directory1 <- system.file("extdata", package = "SDAMS", mustWork = TRUE) path1 <- file.path(directory1, "ProstateFeature.csv") directory2 <- system.file("extdata", package = "SDAMS", mustWork = TRUE) path2 <- file.path(directory2, "ProstateGroup.csv") @ then use the function {\tt createSEFromCSV} after loading the {\tt SDAMS} package. <>= library("SDAMS") exampleSE1 <- createSEFromCSV(path1, path2) exampleSE1 @ The feature data and grouping information can be accessed using {\tt SummarizedExperiment} commands: <>= head(assay(exampleSE1)[,1:10]) head(colData(exampleSE1)$grouping) @ \subsection{Create SummarizedExperiment object from seperate matrix} If the two datasets have already been cleaned and loaded into R as matrices, then we can use {\tt createSEFromMatrix} to create a {\tt SummarizedExperiment} object. <>= set.seed(100) featureInfo <- matrix(runif(800, -2, 5), ncol = 40) featureInfo[featureInfo<0] <- 0 rownames(featureInfo) <- paste("feature", 1:20, sep = '') colnames(featureInfo) <- paste('subject', 1:40, sep = '') groupInfo <- data.frame(grouping=matrix(sample(0:1, 40, replace = TRUE), ncol = 1)) rownames(groupInfo) <- colnames(featureInfo) exampleSE2 <- createSEFromMatrix(feature = featureInfo, colData = groupInfo) exampleSE2 head(assay(exampleSE2)[,1:10]) head(colData(exampleSE2)$grouping) @ \section{Data Analysis} \subsection{Proteomics Example Data} Finally, we perform differential abundance analyais using {\tt SummarizedExperiment} object created in previous section. This can be done by using function {\tt SDA}. The theory behind {\tt SDA} can be reached at section \ref{theory}. A list with point estimates, p-values, q-values and corresponding feature names is returned. Below are the results generated by using the {\tt SummarizedExperiment} object exampleSE1. <>= results <- SDA(exampleSE1) head(results$gamma) head(results$pv_gamma) head(results$qv_gamma) @ In this example , there is only one group covariate applied to each subject. Here $\textbf{X}_i$ is one dimension. The covariate effect on the fraction of zero values for feature $g$ is $\gamma_g$, which is estimated to be 0.11 for the first feature, and 0.86 for the second feature, etc. The corresponding hypothesis is $H_0$: $\gamma_g=0$ vs. $H_1$: $\gamma_g \ne 0$. The p-values calculated from likelihood ratio test are returned in {\tt pv\_gamma}. Users can determine their own significance level to make inference, such as 0.05 nominal level. We also provide a FDR adjustment method \cite{storey2003statistical} used in {\tt SDA} for multiple comparison issues. Those results for $\gamma_g$ are stored in {\tt qv\_gamma}. <>= head(results$beta) head(results$pv_beta) head(results$qv_beta) @ The model parameter $\beta_g$ is the log fold change in the non-zero abundance comparing different values of the single group covariate for feature $g$. The corresponding two-sided hypothesis is $H_0$: $\beta_g=0$ vs. $H_1$: $\beta_g \ne 0$. Again, {\tt SDA} will return p-values and adjusted p-values(q-values) for parameter $\beta_g$, and they are stored in {\tt pv\_beta} and {\tt qv\_beta} respectively. <>= head(results$pv_2part) head(results$qv_2part) @ Hypothesis testing on overall effect of group covariate on the $g$th feature is performed by assessing $\gamma_{g}$ and $\beta_{g}$. The null hypothesis $H_0:$ $\gamma_{g}=0$ and $\beta_{g}=0$ against alternative hypothesis $H_1:$ at least one of the two parameters is non-zero. The p-values are calculated based on chi-square distribution with 2 degrees of freedom. And the corresponding q-values are calculated using the same procedure as in one-part model. <>= head(results$feat.names) @ A vector of feature names is returned for convenience which corresponds to the results in the other components. \subsection{Single Cell Example Data} SDAMS can also perform differential expression analysis for single-cell RNA sequence data with a large number of cell which are not expressed. The example data in a {\tt SummarizedExperiment} object can loaded and analyzed. <>= data(exampleSingleCell) @ We are interested in identifying genes that are differentially expressed in two different cell subpopulations, which is quantified by $\gamma_{Zg}$ and $\beta_{Zg}$. Three types of hypotheses can be tested through function {\tt SDA}: $H_0^1: \gamma_{Zg}=0$, $H_0^2: \beta_{Zg}=0$, and $H_0^3: \gamma_{Zg}=0 \textrm{ and } \beta_{Zg} = 0$. Likelihood ratio test is conducted for each test and p-value is returned by {\tt SDA} for each single gene. The corresponding q-value is also calculated to address multiple comparison correction. <>= results_SC <- SDA(exampleSingleCell) head(results_SC$pv_gamma) head(results_SC$qv_gamma) head(results$pv_beta) head(results$qv_beta) head(results$pv_2part) head(results$qv_2part) @ \section{Theory for SDAMS}\label{theory} As mentioned in the abstract, MS data is a mixture of zero intensity values and possibly non-normally distributed non-zero intensity values. Therefore, the differential abundance analysis needs to be performed to compare both the zero proportion and the mean of non-zero values between groups and also allows adjustment of covariates. SDA is a two-part model which addresses these issues that uses a logistic regression model to characterize the zero proportion and a semiparametric model to characterize non-zero values. \subsection{A two-part semi-parametric model} The differential abundance analysis in SDAMS has the following forms. For binary part: \[\mathrm{log}(\frac{\pi_{ig}}{1-\pi_{ig}})=\gamma_{0g}+\boldsymbol{\gamma}_g \boldsymbol{X}_{i} , \] For continuous non-zero part: \[\mathrm{log}(Y_{ig})=\boldsymbol{\beta}_g \boldsymbol{X}_i + \varepsilon_{ig}, \] where $Y_{ig}$ is the random variable representing the abundance of feature $g$ in subject $i$, $\pi_{ig}=Pr(Y_{ig}=0)$ is the probability of point mass. $\boldsymbol{X}_i=(X_{i1},X_{i2},...,X_{iQ})^T$ is a $Q$-vector covariates that specifies the treatment conditions applied to subject $i$. The corresponding $Q$-vector of model parameters $\boldsymbol{\gamma}_g=(\gamma_{1g},\gamma_{2g}, ...,\gamma_{Qg})^T$ quantify the covariates effects on the fraction of zero values for feature $g$ and $\gamma_{0g}$ is the intercept. $\boldsymbol{\beta}_g=(\beta_{1g},\beta_{2g},...,\beta_{Qg})$ is a $Q$-vector of model parameters quantifying the covariates effects on the non-zero values for the feature, and $\varepsilon_{ig}$'s $(i=1,2,..n)$ are independent error terms with a common but completely unspecified density function $f_g$. Importantly, we do not impose any distributional assumption on $f_g$. Without assuming a specific parametric distribution for $\varepsilon_{ig}$, this model is much more flexible to characterize data with unknown and possibly non-normal distribution. \subsection{Identification of differentially abundant features} We replace $f_g$ by its kernel density estimator in the likelihood function. The maximum likelihood estimator is obtained through a trust region maximization algorithm. The likelihood ratio test is performed on the null hypothesis $H_0:$ $\gamma_{qg}=0$ and $\beta_{qg}=0$ against alternative hypothesis $H_1:$ at least one of the two parameters is non-zero. We also consider the hypotheses for testing $\gamma_{qg}=0$ and $\beta_{qg}=0$ separately. To adjust for multiple comparisons across features, the false discovery discovery rate (FDR) q-value is calculated based on the {\tt qvalue} function in {\tt qvalue} package in R/Bioconductor. \section{Session Info} <>= toLatex(sessionInfo()) @ \bibliography{reference} \end{document}