%\VignetteEngine{knitr::knitr} %\VignetteIndexEntry{semisup} %\VignetteEncoding{UTF-8} \documentclass{article} \setlength\parindent{0pt} \usepackage{amsmath,amssymb,tikz,hyperref} \usetikzlibrary{positioning} \hypersetup{colorlinks = true,allcolors = blue} \title{semisup: detecting SNPs with interactive effects on a quantitative trait} \date{\today} \author{\textbf{A Rauschenberger}, \textbf{RX Menezes}, \textbf{MA van de Wiel}, \\ \textbf{NM van Schoor}, and \textbf{MA Jonker}} \bibliographystyle{plain} \begin{document} \maketitle <>= options(width=60) knitr::knit_hooks$set(document = function(x) {sub('\\usepackage[]{color}', '\\usepackage{xcolor}', x, fixed = TRUE)}) @ This vignette explains how to use the R~package \textbf{semisup}. Use the function \hyperref[mixtura]{\mbox{mixtura}} for model fitting, and the function \hyperref[scrutor]{\mbox{scrutor}} for hypothesis testing. %------------------------------------------------------------------------------- \vspace{0cm} \section{Initialisation}%------------------------------------------ %------------------------------------------------------------------------------- Start with installing \textbf{semisup} from Bioconductor\footnote{\href{https://cran.r-project.org/web/packages/devtools/README.html}{devtools} and \href{https://github.com/rauschenberger/semisup}{GitHub}: \texttt{devtools::install\_github("rauschenberger/semisup")}}: <>= source("https://bioconductor.org/biocLite.R") BiocInstaller::biocLite("semisup") @ Then load and attach the package: <<>>= library(semisup) @ If you want to reproduce the examples, you should attach the toy database: <>= attach(toydata) @ <>= names <- names(toydata) for(i in seq_along(names)){ assign(names[i],toydata[[i]]) } rm(names) @ The following commands access the reference manual: <>= help(semisup) ?semisup @ %------------------------------------------------------------------------------- \newpage \section{Scope}%------------------------------------------------------- %------------------------------------------------------------------------------- Data is available for $n$~samples. Let $\boldsymbol{y}={(y_1,\ldots,y_n)^T}$ represent the observations, $\boldsymbol{x}={(x_1,\ldots,x_n)^T}$ the groups, and $\boldsymbol{z}={(z_1,\ldots,z_n)^T}$ the classes. We assume all observations from the \textit{labelled} group are in class~\textcolor{blue}{A}, and those from the \textit{unlabelled} group are in class~\textcolor{blue}{A} or in class~\textcolor{red}{B}. \begin{table}[!h] \centering \begin{tabular}{l|ccc|ccccc} & $1$ & \ldots & $s$ & $s+1$ & \ldots & $n=s+u$ \\ \hline $\boldsymbol{y}$ & $y_1$ & \ldots & $y_{s}$ & $y_{s+1}$ & \ldots & $y_{s+u}$ \\ $\boldsymbol{x}$ & $0$ & \ldots & $0$ & $1$ & \ldots & $1$ \\ $\boldsymbol{z}$ & \textcolor{blue}{A} & \ldots & \textcolor{blue}{A} & \textcolor{blue}{A}/\textcolor{red}{B} & \ldots & \textcolor{blue}{A}/\textcolor{red}{B} \end{tabular} \label{table} \caption{Observations $\boldsymbol{y}$, groups $\boldsymbol{x}$, and classes $\boldsymbol{z}$. Here, the first~$s$ observations are \textit{labelled} (class~\textcolor{blue}{A}), and the last~$u$ observations are \textit{unlabelled} (class \textcolor{blue}{A} or \textcolor{red}{B}).} \end{table} We assume all observations come from the same probability distribution, but with different parameters for the two classes: \begin{align*} & Y_i | (Z_i = \textcolor{blue}{\mathrm{A}}) \sim F(\boldsymbol{\cdot},\textcolor{blue}{\boldsymbol{\theta_a}}), \\ & Y_i | (Z_i = \textcolor{red}{\mathrm{B}}) \sim {F(\boldsymbol{\cdot},\textcolor{red}{\boldsymbol\theta_b}}). \end{align*} The mixing proportion~$\tau$ is the probability that a random \textit{unlabelled} observations is in class~\textcolor{red}{B}. It is of interest to test whether $\tau$ is significantly larger than zero. \begin{align*} & \tau =~ \mathbb{P}[Z_i=\textcolor{red}{\mathrm{B}}|X_i=1], \\ & H_0:~ \tau = 0, \\ & H_1:~ \tau > 0. \end{align*} The function \hyperref[mixtura]{mixtura} estimates the unknown parameters ($\textcolor{blue}{\boldsymbol{\theta_a}}$, $\textcolor{red}{\boldsymbol{\theta_b}}$, $\tau$) and predicts the missing class labels in $\boldsymbol{z}=(z_1,\ldots,z_n)^T$. The function \hyperref[scrutor]{scrutor} tests homogeneity (${\tau=0}$) against heterogeneity (${\tau>0}$). %------------------------------------------------------------------------------- \newpage \section{Model fitting} \label{mixtura} %------------------------------ %------------------------------------------------------------------------------- Observing two groups of observations, we assume the \textit{labelled} observations are in class~\textcolor{blue}{A}, and the \textit{unlabelled} observations are in class~\textcolor{blue}{A} or in class~\textcolor{red}{B}. <>= par(mar=(c(2,4,1,5))) plot.new() plot.window(xlim=c(1,length(y)),ylim=range(y)) box() axis(side=2) title(ylab="y") mtext(text="labelled",at=25,side=1) mtext(text="unlabelled",at=75,side=1) abline(v=50.5,lty=2) points(y,col=rep(c("blue","black"),each=50)) @ The function \hyperref[mixtura]{mixtura} estimates the unknown parameters and predicts the missing class labels: <<>>= fit <- mixtura(y,z) @ <>= mean0 <- round(fit$estim1$mean0,0) sd0 <- round(fit$estim1$sd0,0) mean1 <- round(fit$estim1$mean1,0) sd1 <- round(fit$estim1$sd1,0) tau <- round(fit$estim1$p1,2) @ Here, \Sexpr{100*tau}\% of the \textit{unlabelled} observations are assigned to class \textcolor{red}{B}, and all other observations are assigned to class \textcolor{blue}{A}: <>= class <- round(fit$posterior) @ % alternative hypothesis <>= par(mar=(c(2,4,1,5))) plot.new() plot.window(xlim=c(1,length(y)),ylim=range(y)) box() axis(side=2) title(ylab="y") mtext(text="labelled",side=1,at=25) mtext(text="unlabelled",side=1,at=75) abline(v=50.5,lty=2) points(y,col=ifelse(class,"red","blue")) # component zero text <- paste("N(",mean0,",",sd0,")",sep="") mtext(text=text,side=4,at=0,line=0.5,las=1,col="blue") # mixing proportion text <- bquote(tau==.(tau)) mtext(text=text,side=4,at=2,line=0.5,las=2) # component one text <- paste("N(",mean1,",",sd1,")",sep="") mtext(text=text,side=4,at=4,line=0.5,las=1,col="red") @ These are the parameter estimates: <>= fit$estim1 @ %------------------------------------------------------------------------------- \newpage \section{Hypothesis testing} \label{scrutor}%-------------------------- %------------------------------------------------------------------------------- Under the null hypothesis, all observations are in class~\textcolor{blue}{A}. Under the alternative hypothesis, some \textit{unlabelled} observations are in class~\textcolor{red}{B}. \newline The function \hyperref[mixtura]{mixtura} not only fits the model under the alternative hypothesis (see above), but also under the null hypothesis: <>= fit$estim0 @ <>= par(mar=(c(2,4,1,5))) plot.new() plot.window(xlim=c(1,length(y)),ylim=range(y)) box() axis(side=2) title(ylab="y") mtext(text="labelled",side=1,at=25) mtext(text="unlabelled",side=1,at=75) abline(v=50.5,lty=2) points(y,col="blue") # estimates mean0 <- round(fit$estim0$mean0,0) sd0 <- round(fit$estim0$sd0,0) # component zero text <- paste("N(",mean0,",",sd0,")",sep="") mtext(text=text,side=4,at=0,line=0.5,las=1,col="blue") # mixing proportion text <- bquote(tau==0) mtext(text=text,side=4,at=2,line=0.5,las=2) # component one text <- paste("N(-,-)",sep="") mtext(text=text,side=4,at=4,line=0.5,las=1,col="red") @ Because the null distribution of the likelihood-ratio test statistic is unknown, we compare the hypotheses by resampling. The function \hyperref[scrutor]{scrutor} uses parametric bootstrapping or permutation: <>= scrutor(y,z) @ If the \mbox{$p$-value} is less than or equal to the significance level, we reject the null hypothesis in favour of the alternative hypothesis. %------------------------------------------------------------------------------- \vspace{3cm} \section*{Options} %----------------------------------------------- %------------------------------------------------------------------------------- % \hspace{0.1cm} \textbf{Further arguments}\newline \fbox{ \begin{minipage}{\textwidth} The functions \hyperref[mixtura]{mixtura} and \hyperref[scrutor]{scrutor} have similar arguments. Set \texttt{dist} equal to \texttt{"norm"} or \texttt{"nbinom"} to choose between the Gaussian and the negative binomial distributions. In the latter case, optionally provide a dispersion estimate \texttt{phi} or an offset \texttt{gamma}. All other arguments are technical. \end{minipage}} %------------------------------------------------------------------------------- \newpage \section{Application}%------------------------------------------------- %------------------------------------------------------------------------------- %------------------------------------------------------------------------------- \subsection{Data preparation}%-------------------------------------------------- %------------------------------------------------------------------------------- Let $n$ be the sample size, $q$ the number of quantitative traits, and $p$ the number of single nucleotide polymorphisms (\textsc{snp}s). \begin{itemize} \item Transform the quantitative trait to a vector of length~$n$, or transform the quantitative traits to a matrix with $n$~rows (samples) and $q$~columns (variables). \item Transform the \textsc{snp} to a vector of length~$n$, or transform the \textsc{snp}s to a matrix with $n$~rows (samples) and $p$~columns (variables). \item Binarise the \textsc{snp}(s), indicating the \textit{labelled} group by zero, and the \textit{unlabelled} group by a missing value. \end{itemize} For example, assign observations with zero minor alleles to the \textit{labelled} group, and those with one or two minor alleles to the \textit{unlabelled} group: \vspace{0.4cm} \begin{minipage}[left]{0.48\textwidth} <>= n <- length(snp) z <- rep(NA,times=n) z[snp==0] <- 0 @ \end{minipage} \hspace{0.02\textwidth} \begin{minipage}[right]{0.48\textwidth} <>= n <- nrow(SNPs) p <- ncol(SNPs) Z <- matrix(NA,nrow=n,ncol=p) Z[SNPs==0] <- 0 @ \end{minipage} \vspace{0.1cm} %------------------------------------------------------------------------------- \subsection{Test of association}%----------------------------------------------- %------------------------------------------------------------------------------- Use \hyperref[scrutor]{scrutor} to test for association between a quantitative trait (vector) and a \textsc{snp} (vector). The function returns a test statistic and a \mbox{$p$-value}. \vspace{0.3cm} \begin{minipage}{0.48\textwidth} \begin{equation*} \boldsymbol{y} = \begin{pmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{pmatrix} \qquad \boldsymbol{z} = \begin{pmatrix} z_{1} \\ z_{2} \\ \vdots \\ z_{n}, \end{pmatrix} \end{equation*} \end{minipage} \hspace{0.02\textwidth} \begin{minipage}{0.48\textwidth} \begin{center} \begin{tikzpicture}[every node/.style={circle,draw}] \node (y) at (0,0) {$\boldsymbol{y}$}; \node (z) at (2,0) {$\boldsymbol{z}$}; \path[thick] (y) edge (z); \end{tikzpicture} \end{center} \end{minipage} \vspace{0.3cm} <>= scrutor(y,z) @ %------------------------------------------------------------------------------- \newpage \subsection{Genome-wide association study}%---------------------------- %------------------------------------------------------------------------------- Use \hyperref[scrutor]{scrutor} to test for association between a quantitative trait (vector) and several \textsc{snp}s (matrix). For each \textsc{snp}, the function returns a test statistic and a \mbox{$p$-value}. \vspace{0.3cm} \begin{minipage}{0.49\textwidth} \begin{equation*} \boldsymbol{y} = \begin{pmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{pmatrix} \end{equation*} \begin{equation*} \underset{n \times p}{\boldsymbol{Z}} = \begin{pmatrix} z_{1,1} & z_{1,2} & \cdots & z_{1,p} \\ z_{2,1} & z_{2,2} & \cdots & z_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ z_{n,1} & z_{n,2} & \cdots & z_{n,p} \end{pmatrix} \end{equation*} \end{minipage} \hspace{0.02\textwidth} \begin{minipage}{0.49\textwidth} \vspace{0.2cm} \begin{center} \begin{tikzpicture}[every node/.style={circle,draw}] \node (y) at (0,0) {$\boldsymbol{y}$}; \node (z1) at (-2,-2.3) {$\boldsymbol{z_1}$}; \node (z2) at (-1,-2.3) {$\boldsymbol{z_2}$}; \node (z3) at (0,-2.3) {$\boldsymbol{z_3}$}; \node[draw=none] (dots) at (1,-2.3) {$\cdots$}; \node (zp) at (2,-2.3) {$\boldsymbol{z_p}$}; \path[thick] (y) edge (z1) (y) edge (z2) (y) edge (z3) (y) edge (zp); \end{tikzpicture} \end{center} \end{minipage} \vspace{0.3cm} <>= scrutor(y,Z) @ %------------------------------------------------------------------------------- \subsection{Differential expression analysis}%---------------------------------- %------------------------------------------------------------------------------- Use \hyperref[scrutor]{scrutor} to test for association between several quantitative traits (matrix) and a \textsc{snp} (vector). For each quantitative trait, the function returns a test statistic and a \mbox{$p$-value}. \vspace{0.3cm} \begin{minipage}{0.49\textwidth} \begin{equation*} \underset{n \times q}{\boldsymbol{Y}} = \begin{pmatrix} y_{1,1} & y_{1,2} & \cdots & y_{1,q} \\ y_{2,1} & y_{2,2} & \cdots & y_{2,q} \\ \vdots & \vdots & \ddots & \vdots \\ y_{n,1} & y_{n,2} & \cdots & y_{n,q} \end{pmatrix} \end{equation*} \begin{equation*} \boldsymbol{z} = \begin{pmatrix} z_{1} \\ z_{2} \\ \vdots \\ z_{n} \end{pmatrix} \end{equation*} \end{minipage} \hspace{0.02\textwidth} \begin{minipage}{0.49\textwidth} \vspace{0.4cm} \begin{center} \begin{tikzpicture}[every node/.style={circle,draw}] \node (z) at (0,0) {$\boldsymbol{z}$}; \node (y1) at (-2,2.4) {$\boldsymbol{y_1}$}; \node (y2) at (-1,2.4) {$\boldsymbol{y_2}$}; \node (y3) at (0,2.4) {$\boldsymbol{y_3}$}; \node[draw=none] (dots) at (1,2.4) {$\cdots$}; \node (yq) at (2,2.4) {$\boldsymbol{y_q}$}; \path[thick] (z) edge (y1) (z) edge (y2) (z) edge (y3) (z) edge (yq); \end{tikzpicture} \end{center} \end{minipage} \vspace{0.3cm} <>= scrutor(Y,z) @ %------------------------------------------------------------------------------- \subsection{Expression quantitative trait loci analysis}%----------------------- %------------------------------------------------------------------------------- Use \hyperref[scrutor]{scrutor} to test for association between several quantitative traits (matrix) and several \textsc{snp}s (matrix). If their numbers are different, all pairwise combinations are considered. If their numbers are equal, a one-to-one correspondence is assumed. For each combination, the function returns a test statistic and a \mbox{$p$-value}. \vspace{0.3cm} \begin{minipage}{0.49\textwidth} \begin{equation*} \underset{n \times q}{\boldsymbol{Y}} = \begin{pmatrix} y_{1,1} & y_{1,2} & \cdots & y_{1,q} \\ y_{2,1} & y_{2,2} & \cdots & y_{2,q} \\ \vdots & \vdots & \ddots & \vdots \\ y_{n,1} & y_{n,2} & \cdots & y_{n,q} \end{pmatrix} \end{equation*} \begin{equation*} \underset{n \times p}{\boldsymbol{Z}} = \begin{pmatrix} z_{1,1} & z_{1,2} & \cdots & z_{1,p} \\ z_{2,1} & z_{2,2} & \cdots & z_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ z_{n,1} & z_{n,2} & \cdots & z_{n,p} \end{pmatrix} \end{equation*} \end{minipage} \hspace{0.02\textwidth} \begin{minipage}{0.49\textwidth} \begin{center} \vspace{0.4cm} \begin{tikzpicture}[every node/.style={circle,draw}] \node (y1) at (0,0) {$\boldsymbol{y_1}$}; \node (y2) at (1,0) {$\boldsymbol{y_2}$}; \node (y3) at (2,0) {$\boldsymbol{y_3}$}; \node[draw=none] (dots) at (3,0) {$\cdots$}; \node (yq) at (4,0) {$\boldsymbol{y_q}$}; \node (z1) at (0,-2.4) {$\boldsymbol{z_1}$}; \node (z2) at (1,-2.4) {$\boldsymbol{z_2}$}; \node (z3) at (2,-2.4) {$\boldsymbol{z_3}$}; \node[draw=none] (dots) at (3,-2.4) {$\cdots$}; \node (zp) at (4,-2.4) {$\boldsymbol{z_p}$}; \path[thick] (y1) edge[gray] (z2) (y1) edge[gray] (z3) (y1) edge[gray] (zp) (y2) edge[gray] (z1) (y2) edge[gray] (z3) (y2) edge[gray] (zp) (y3) edge[gray] (z1) (y3) edge[gray] (z2) (y3) edge[gray] (zp) (yq) edge[gray] (z1) (yq) edge[gray] (z2) (yq) edge[gray] (z3) (y1) edge[black] (z1) (y2) edge[black] (z2) (y3) edge[black] (z3) (yq) edge[black] (zp); \end{tikzpicture} \end{center} \end{minipage} \vspace{0.3cm} <>= scrutor(Y,Z) @ %------------------------------------------------------------------------------- \section*{References}%---------------------------------------------------------- %------------------------------------------------------------------------------- The R package \texttt{semisup} is based on Rauschenberger et al.~\cite{Rauschenberger2018}, where detailed references to previous work are given. If you use \texttt{semisup} for publications, please cite Rauschenberger et al.~\cite{Rauschenberger2018}. \newline Consider shrinkage estimation (Robinson et al.~\cite{Robinson2008}) and scale normalisation (Robinson et al.~\cite{Robinson2010}) to improve the negative binomial mixture model (R package \href{http://bioconductor.org/packages/edgeR/}{edgeR}). Use the non-parametric mixture test (van Wieringen et al.~\cite{VanWieringen2008}) to increase robustness against outliers (R package \href{http://www.few.vu.nl/~wvanwie/software/TestForPDE/TestForPDE.html}{PDGEtest}). \begingroup \renewcommand{\section}[2]{} \bibliography{references} \endgroup \end{document}