%\VignetteIndexEntry{BUScorrect_user_guide} %\VignetteEngine{knitr::knitr} \documentclass[12pt]{article} <>= BiocStyle::latex() @ \geometry{verbose,tmargin=2.5cm,bmargin=2.5cm,lmargin=2.5cm,rmargin=2.5cm} \title{BUScorrect: Batch Effects Correction with Unknown Subtypes\\ User's Guide} \author{Xiangyu Luo\thanks{\email{xyluo1991@gmail.com}} and Yingying Wei\\ The Chinese University of Hong Kong} \begin{document} \maketitle \tableofcontents \newpage \section{Introduction} High-throughput experimental data are accumulating exponentially in public databases. However, mining valid scientific discoveries from these abundant resources is hampered by technical artifacts and inherent biological heterogeneity. The former are usually termed ``batch effects,'' and the latter is often modelled by subtypes. \\ \noindent Researchers have long been aware that samples generated on different days are not directly comparable. Samples processed at the same time are usually referred to as coming from the same ``batch.'' Even when the same biological conditions are measured, data from different batches can present very different patterns. The variation among different batches may be due to changes in laboratory conditions, preparation time, reagent lots, and experimenters \cite{leek2010tackling}. The effects caused by these systematic factors are called ``batch effects.'' \\ \noindent Various ``batch effects'' correction methods have been proposed when the subtype information for each sample is known \cite{johnson2007adjusting, leek2007capturing}. Here we adopt a ``broad'' definition for ``subtype.'' ``Subtype'' is defined as a set of samples that share the same underlying genomic profile, in other words biological variability, when measured with no technical artifacts. For instance, groupings such as ``case'' and ``control'' can be viewed as two subtypes. However, subtype information is usually unknown, and it is often the main interest of the study to learn the subtype for each collected sample, especially in personalized medicine.\\ \noindent Here, the R package \Rpackage{BUScorrect} fits a Bayesian hierarchical model---the Batch-effects-correction-with-Unknown-Subtypes model (BUS)---to correct batch effects in the presence of unknown subtypes \cite{xxx2018xxx}. BUS is capable of (a) correcting batch effects explicitly, (b) grouping samples that share similar characteristics into subtypes, (c) identifying features that distinguish subtypes, and (d) enjoying a linear-order computation complexity. After correcting the batch effects with BUS, the corrected value can be used for other analysis as if all samples are measured in a single batch. BUS can integrate batches measured from different platforms and allow subtypes to be measured in some but not all of the batches as long as the experimental design fulfils the conditions listed in \cite{xxx2018xxx}.\\ \noindent This guide provides step-by-step instructions for applying the BUS model to correct batch effects and identify the unknown subtype information for each sample. \\ \section{Data Preparation} \noindent To fit the BUS model, we first look at the input data format. Two types of data formats are allowed, the R list and the \Rpackage{SummarizedExperiment} object \cite{sum_exp}. Specifically, assuming there are 3 batches, the R list consists of 3 gene expression matrices with genes in rows and samples in columns. Alternatively, the user may have a \Rpackage{SummarizedExperiment} object at hand. The \Rpackage{SummarizedExperiment} object is a matrix, where rows represent genes, columns are samples, and the column data record the batch information for each sample. In the following, we provide concrete examples for the two allowable input formats. \\ <>= library(BUScorrect) data("BUSexample_data", package="BUScorrect") #BUSexample_data is a list class(BUSexample_data) #The list's length is three, thus we have three batches length(BUSexample_data) #Each element of the list is a matrix class(BUSexample_data[[1]]) #In the matrix, a row is a gene, and a column corresponds to a sample dim(BUSexample_data[[1]]) dim(BUSexample_data[[2]]) dim(BUSexample_data[[3]]) #Look at the expression data head(BUSexample_data[[1]][ ,1:4]) @ \noindent The example data \Robject{BUSexample\_data} consist of 3 batches. In total, 2000 genes are measured. The sample sizes of each batch are 70, 80, and 70, respectively. Because it is a simulation data set, we actually know that all of the samples come from 3 subtypes. In addition, we can prepare a \Rpackage{SummarizedExperiment} object named \Robject{BUSdata\_SumExp} using the \Robject{BUSexample\_data} as follows.\\ <>= #require the SummarizedExperiment from Bioconductor require(SummarizedExperiment) #batch number B <- length(BUSexample_data) #sample size vector n_vec <- sapply(1:B, function(b){ ncol(BUSexample_data[[b]])}) #gene expression matrix GE_matr <- NULL for(b in 1:B){ GE_matr <- cbind(GE_matr, BUSexample_data[[b]]) } rownames(GE_matr) <- NULL colnames(GE_matr) <- NULL #batch information Batch <- NULL for(b in 1:B){ Batch <- c(Batch, rep(b, n_vec[b])) } #column data frame colData <- DataFrame(Batch = Batch) #construct a SummarizedExperiment object BUSdata_SumExp <- SummarizedExperiment(assays=list(GE_matr=GE_matr), colData=colData) head(assays(BUSdata_SumExp)$GE_matr[ ,1:4]) head(colData(BUSdata_SumExp)$Batch) @ \noindent In a nutshell, the user can use either the R list or the \Rpackage{SummarizedExperiment} object as input. Regarding the R list, the list length is equal to the batch number, and each list component is a gene expression matrix from a batch. With respect to the \Rpackage{SummarizedExperiment} object, it is a matrix where rows are genes and columns represent samples, and the user has to specify the sample-specific batch indicators through the \Robject{Batch} vector in the \Robject{colData} column data frame.\\ \section{Model Fitting} \noindent Once we have prepared the input data and specified the subtype number, we are able to fit the BUS model, which requires the function \Rfunction{BUSgibbs}. The first argument, \Robject{Data}, of \Rfunction{BUSgibbs} should be either an R list or a SummarizedExperiment object. If \Robject{Data} is an R list, each element of \Robject{Data} is a data matrix for a specific batch, where each row corresponds to a gene or a genomic feature and each column corresponds to a sample. If \Robject{Data} is a SummarizedExperiment object, \Robject{assays(Data)} must contain a gene expression matrix named ``GE\_matr,'' where one row represents a gene and one column corresponds to a sample. \Robject{colData(Data)} must include a vector named ``Batch'', which indicates the batch information for each sample. The second argument, \Robject{n.subtypes}, is the number of subtypes among samples, which needs to be specified by the user in advance. As discussed later, if \Robject{n.subtypes} is unknown, we can vary the subtype number and use BIC to select the optimal number. The third argument, \Robject{n.iterations}, is the total number of iterations to run by the Gibbs sampler for posterior inference of the BUS model. The first \Robject{n.iterations}/2 iterations are treated as burn-in, and posterior samples from the second \Robject{n.iterations}/2 iterations are kept for statistical inference. The fourth argument, \Robject{showIteration}, lets the user decide whether \Rfunction{BUSgibbs} should display the number of iterations that have been run. To reproduce the results, the users are highly recommended to set \Rfunction{set.seed} before running \Rfunction{BUSgibbs}.\\ <>= #For R list input format set.seed(123) BUSfits <- BUSgibbs(Data = BUSexample_data, n.subtypes = 3, n.iterations = 300, showIteration = FALSE) #For SummarizedExperiment object input format #set.seed(123) #BUSfits_SumExp <- BUSgibbs(Data = BUSdata_SumExp, n.subtypes = 3, n.iterations = 300, # showIteration = FALSE) @ \noindent The \Rfunction{summary} command provides an overview of the output object \Robject{BUSfits} from \Rfunction{BUSgibbs}. \Robject{BUSfits} collects all the posterior samples and posterior estimates for the intrinsic gene indicators, the subtype class for each sample, the subtype proportions for each batch, the baseline expression levels, the subtype effects, the location batch effects, and the scale batch effects.\\ <>= summary(BUSfits) @ \section{Estimated Subtypes and Batch Effects} Our main interests lie in the estimation of the subtype class for each sample and the batch effects. We can call the \Rfunction{Subtypes} function to extract the subtype information from \Robject{BUSfits}.\\ <>= est_subtypes <- Subtypes(BUSfits) @ \noindent There is a message from the function \Rfunction{Subtypes} to remind the user of the format of \Robject{est\_subtypes}. \Robject{est\_subtypes} is a list of length 3, corresponding to the three batches in the study. \Robject{est\_subtypes[[1]]} shows the subtype for each of the 70 samples on batch 1. \\ \noindent Similarly, you can call \Rfunction{location\_batch\_effects} and \Rfunction{scale\_batch\_effects} functions to get the estimated location and scale batch effects.\\ <>= est_location_batch_effects <- location_batch_effects(BUSfits) head(est_location_batch_effects) est_scale_batch_effects <- scale_batch_effects(BUSfits) head(est_scale_batch_effects) @ \noindent The first batch is taken as the reference batch, therefore its location batch effects are zeros for all the genes and its scale batch effects are all ones.\\ \noindent The subtype effects can be obtained by the \Rfunction{subtype\_effects} function. Notice that the first subtype is taken as the baseline subtype.\\ <>= est_subtype_effects <- subtype_effects(BUSfits) @ \section{Intrinsic Gene Identification} The intrinsic genes are the genes that differentiate subtypes \cite{huo2016meta}. We use the following functions to identify the intrinsic genes by controlling the false discovery rate.\\ <>= #select posterior probability threshold to identify the intrinsic gene indicators thr0 <- postprob_DE_thr_fun(BUSfits, fdr_threshold=0.01) est_L <- estimate_IG_indicators(BUSfits, postprob_DE_threshold=thr0) #obtain the intrinsic gene indicators intrinsic_gene_indices <- IG_index(est_L) @ \noindent The function \Rfunction{postprob\_DE\_thr\_fun} calculates the best posterior probability threshold that results in a false discovery rate less than \Robject{fdr\_threshold}. \Rfunction{postprob\_DE\_thr\_fun} also gives the user a message about the selected posterior probability threshold. \Rfunction{estimate\_IG\_indicators} then obtain the selected threshold to estimate intrinsic gene indicators. Finally, one can obtain the intrinsic gene indices by calling the function \Rfunction{IG\_index}. \\ \noindent The \Rfunction{postprob\_DE} function calculates the posterior probability of being differentially expressed for genes in subtypes $k$ ($k\geq 2$). <<>>= postprob_DE_matr <- postprob_DE(BUSfits) @ \section{Visualize the Adjusted Genomic Data} The function \Rfunction{adjusted\_values} adjusts the batch effects for the original data.\\ <>= adjusted_data <- adjusted_values(BUSfits, BUSexample_data) @ \noindent The message is a reminder of the output format. Subsequently, we can compare the original data suffering from batch effects and the adjusted data with the batch effects removed.\\ \noindent The function \Rfunction{visualize\_data} plots a heatmap for the expression values across batches. The user can specify the argument \Rfunction{gene\_ind\_set}, which is a vector, to select the genes to be displayed in the heatmap.\\ \newpage <>= #only show the first 100 genes visualize_data(BUSexample_data, title_name = "original expression values", gene_ind_set = 1:100) #try the following command to show the whole set of genes #visualize_data(BUSexample_data, title_name = "original expression values", # gene_ind_set = 1:2000) @ \newpage <>= #only show the first 100 genes visualize_data(adjusted_data, title_name = "adjusted expression values", gene_ind_set = 1:100) #try the following command to show the whole set of genes #visualize_data(adjusted_data, title_name = "adjusted expression values", # gene_ind_set = 1:2000) @ \newpage \noindent In these two heatmaps, the top bar indicates the batch origin for each sample. Samples under the same colour are from the same batch. The batch effects present in the original data are correctly removed, and only the biological variability is kept. \section{Model Selection using BIC} If we have no prior knowledge about the subtype number, we can vary the argument \Robject{n.subtypes} in the function \Rfunction{BUSgibbs}, e.g., from 2 to 10 and identify the underlying true subtype number K as the one that achieves the minimal BIC. \\ \noindent In BUScorrect R package, one can obtain the BIC value as follows.\\ <>= BIC_val <- BIC_BUS(BUSfits) @ \noindent In this example, the underlying true subtype number is three. For an illustration, we vary the \Robject{n.subtypes} from 2 to 4. <>= BIC_values <- NULL for(k in 2:4){ set.seed(123) BUSfits <- BUSgibbs(Data = BUSexample_data, n.subtypes = k, n.iterations = 300, showIteration = FALSE) BIC_values <- c(BIC_values, BIC_BUS(BUSfits)) } plot(2:4, BIC_values, xlab="subtype number", ylab="BIC", main="BIC plot", type="b") @ \noindent The BIC attains the minimum at \Robject{n.subtypes}$=3$, thus correctly recovering the true subtype number. \bibliography{user_guide} \end{document}