%\VignetteIndexEntry{BUScorrect_user_guide}
%\VignetteEngine{knitr::knitr}
%\VignetteEncoding{UTF-8}

\documentclass[12pt]{article}

<<style-knitr, eval=TRUE, echo=FALSE, results="asis">>=
BiocStyle::latex()
@

  \geometry{verbose,tmargin=2.5cm,bmargin=2.5cm,lmargin=2.5cm,rmargin=2.5cm}



\title{BUSseq: Batch Effects Correction with Unknown Subtypes for 
scRNA-seq data\\ User's Guide}
  \author{Fangda Song\thanks{\email{sfd1994895@gmail.com}}, 
  Ga Ming Chan and Yingying Wei\\
  The Chinese University of Hong Kong}
  
  \begin{document}
  
  \maketitle
  
  \tableofcontents
  \newpage
  
  \section{Introduction}
  Single-cell RNA-sequencing (scRNA-seq) technologies enable the
  measurement of the transcriptome of individual cells, which provides 
  unprecedented opportunities to discover cell types and understand 
  cellular heterogeneity \cite{Bacher2016Design}. Despite their 
  widespread applications, single-cell RNA-sequencing (scRNA-seq) 
  experiments are still plagued by batch effects and dropout events. \\
  
  One of the major tasks of scRNA-seq experiments is to identify cell 
  types for a population of cells \cite{Bacher2016Design}. Therefore, 
  the cell type of each individual cell is always unknown and is the 
  target of inference. However, most existing methods for batch effects
  correction, such as Combat space \cite{johnson2007adjusting} and the 
  surrogate variable analysis (SVA)(\cite{leek2007capturing}, 
  \cite{leek2014svaseq}), are designed for bulk experiments and require
  knowledge of the subtype information, which corresponds to cell type 
  information for scRNA-seq data, of each sample a priori.\\
  
  \noindent Here, the R package \Rpackage{BUSseq} fits an interpretable
  Bayesian hierarchical model---the Batch Effects Correction with
  Unknown Subtypes for scRNA seq Data(BUSseq)---to correct batch 
  effects in the presence of unknown cell types 
  \cite{song2020flexible}. BUSseq is able to simultaneously correct
  batch effects, clusters cell types, and takes care of the count data
  nature, the overdispersion, the dropout events, and the cell-specific
  sequencing depth of scRNA-seq data. After correcting the batch
  effects with BUSseq, the corrected value can be used for downstream 
  analysis as if all cells were sequenced in a single batch. BUSseq can
  integrate the read count matrices measured from different platforms 
  and allow cell types to be measured in some but not all of the 
  batches as long as the experimental design fulfills the conditions 
  listed in \cite{song2020flexible}.\\
  
  \noindent This guide provides step-by-step instructions for applying 
  the BUSseq model to correct batch effects and identify the unknown 
  cell type indicators for each cell for scRNA-seq data. \\    
  
  \section{Methodolgy}
  BUSseq is a hierarchical model that closely mimics the data generating 
  mechanism of scRNA-seq experiments \cite{song2020flexible}. The hierarchical 
  structure of BUSseq can be illustrated by the following diagram.
  
    \begin{figure}[!htbp]
    \centering
  \includegraphics[width=.6\textwidth]{Images/Model_assumption.jpg}
  \caption{The hierarchical stucture of BUSseq model. 
  Only $Y_{big}$ in the gray rectangle is observed.}
  \end{figure}
  
  \noindent Assuming that there are totally $B$ batches and $n_b$ cells 
  in $b$-th batch, $b=1,2,\cdots,B$, we define the underlying 
  gene expression level of gene $g$ in cell $i$ of batch $b$ as $X_{big}$. 
  Given the cell type $W_{bi}=k$, $X_{big}$ follows a negative binomial 
  distribution with mean expression level $\mu_{big}$ and a gene-specific 
  and batch-specific overdispersion parameter $\phi_{bg}$. The mean expression 
  level $\mu_{big}$ is determined by the log-scale baseline expression level 
  $\alpha_g$, the cell type effect $\beta_{gk}$, the location batch effect 
  $\nu_{bg}$, and the cell-specific size factor $\delta_{bi}$. It is of note 
  that the cell type $W_{bi}$ of each individual cell is unknown and is 
  our target of inference. Therefore, we assume that a cell on batch $b$ 
  comes from cell type $k$ with probability $\text{Pr}(W_{bi}=k)=\pi_{bk}$, 
  and the proportions of cell types $(\pi_{b1},\cdots,\pi_{bK})$ can 
  vary across batches.\\

  \noindent Unfortunately, it is not always possible to observe the expression 
  level $X_{big}$ due to dropout events. Without dropout ($Z_{big}=0$), 
  we can directly observe $Y_{big}=X_{big}$. However, if a dropout event occurs 
  ($Z_{big}=1$), then we observe $Y_{big}=0$ instead of $X_{big}$. 
  In other words, when we observe a zero read count $Y_{big}=0$, 
  there are two circumstances: a non-expressed gene---biological zeros---
  or a dropout event. When gene $g$ is not expressed in cell $i$ of batch 
  $b$ ($X_{big}=0$), we always have $Y_{big}=0$; when gene $g$ is actually 
  expressed in cell $i$ of batch $b$ ($X_{big}>0$) but a dropout event occurs, 
  we can only observe $Y_{big}=0$, and hence $Z_{big}=1$. 
  It has been noted that highly expressed genes are less-likely 
  to suffer from dropout events \cite{kharchenko2014bayesian}. We thus model 
  the dependence of the dropout rate $Pr(Z_{big}=1|X_{big})$ on 
  the expression level using a logistic regression with batch-specific 
  intercept $\gamma_{b0}$ and log-odds ratio $\gamma_{b1}$. 
  Noteworthy, BUSseq includes the negative binomial distribution without zero 
  inflation as a special case. When all cells are from a single cell type and 
  the cell-specific size factor $\delta_{bi}$ is estimated a priori according 
  to spike-in genes, BUSseq can reduce to a form similar to BASiCS 
  \cite{vallejos2015basics}.\\
  
  \section{Entire workflow}
  \subsection{Data Preparation}
  \noindent The input data of our MCMC algorithm should be a 
  \Robject{SingleCellExperiment} object or a \Robject{list}. The input
  \Robject{SingleCellExperiment} object should include a \Robject{counts} assay 
  of raw read count data and \Robject{colData} indicating the corresponding 
  batch of each cell. On the other hand, if the input is a \Robject{list}, 
  then each element of the list corresponds to the read count matrix of a batch, 
  where each row represents a gene and each column corresponds to a cell. 
  Here, we take the raw read count data 
  \Robject{assay(BUSseqfits\_example, "counts")} 
  stored in our package as an example to illustrate how to prepare the input.\\
  
  <<data_preparation1>>=
  library(BUSseq)
  library(SingleCellExperiment)

  #Input data is should be a SingleCellExperiment object or a list
  CountData <- assay(BUSseqfits_example, "counts")
  batch_ind <- unlist(colData(BUSseqfits_example))
  
  # Construct a SingleCellExperiment object with colData as batch indicators 
  sce_input <- SingleCellExperiment(assays = list(counts = CountData),
                            colData = DataFrame(Batch_ind = factor(batch_ind)))
  
  # Or, construct a list with each element represents the count data matrix 
  # of a batch
  num_cell_batch <- table(batch_ind)
  list_input <- list(Batch1 = CountData[,1:num_cell_batch[1]], 
                Batch2 = CountData[,1:num_cell_batch[2] + num_cell_batch[1]])
  
  # Cell numbers within each batch
  print(num_cell_batch)
  
  #Peek at the read counts
  print(CountData[1:5,1:5])
  @
    
    \noindent The example raw count data \Robject{CountData} consist of two 
    batches. Each batch includes 150 cells, and there are 300 genes measured 
    in each cell. Because it is a simulated dataset, we actually know that 
    all of the cells come from 4 cell types.\\
    
    \noindent In a nutshell, users can use a \Robject{SingleCellExperiment} 
    object or a \Robject{list} as the input of our MCMC algorithm. Note that 
    the gene numbers of all batches need to be the same.\\ 
  
  
  \subsection{Model Fitting}
  \noindent Once we have prepared the input data and specified the cell
  type number, we are able to fit the BUSseq model by running the 
  \Rfunction{BUSseq\_MCMC} function.\\
  
    <<BUSgibbs>>=
    # Conduct MCMC sampling and posterior inference for BUSseq model
    BUSseqfits_res <- BUSseq_MCMC(ObservedData = sce_input, 
                              seed = 1234, n.cores = 2,
                              n.celltypes = 4, n.iterations = 500)
  @
  
  \noindent The first argument, \Robject{ObservedData}, of 
  \Rfunction{BUSseq\_MCMC} should be a \Robject{SingleCellExperiment} object or 
  a \Robject{list} as we discuss before.\\
  
  \noindent The second argument, \Robject{seed}, allows the users to obtain 
  reproducible results.\\
  
  \noindent The third argument, \Robject{n.celltypes}, is the number of
  cell types among all cells, which needs to be specified by the user
  in advance. As discussed later, if \Robject{n.celltypes} is unknown, 
  we can vary the cell type number and use the Bayesian Information 
  Criterion (BIC) to select the optimal number.\\
  
  \noindent The forth argument, \Robject{n.iterations}, is the total 
  number of iterations of the MCMC algorithm for the posterior 
  inference of the BUSseq model. Users can also set the number of 
  burnin iterations by the argument, \Robject{n.burnin}. Given 
  \Robject{n.iterations}, the default number of burnins is 
  \Robject{n.iterations}/2 iterations. The parameters are inferred by 
  samples after the burnin iterations. \\

  \noindent Now, let us take a look at the output:
  <<BUSseq_output>>=
    # The output is a SingleCellExperiment object
    class(BUSseqfits_res)
    
    # Peek at the output
    BUSseqfits_res
    
  @
    
    \noindent  %The \Rfunction{summary} command provides an overview of 
    %the output object \Robject{BUSseqfits\_res} from 
    %\Rfunction{BUSseq\_MCMC}. 
    \Robject{BUSseqfits\_res} is a \Robject{SingleCellExperiment} object. 
    Compared with the input, \Rfunction{BUSseq\_MCMC} incoporates the inferred 
    underlying true read counts after imputing the dropout events and 
    the posterior inference of parameters into \Robject{BUSseqfits\_res}. 
    The posterior inference includes the posterior mode of the cell-type 
    indicators for each cell, and the posterior mean and variance of 
    the cell-type proportions within each batch, the cell-type-specific mean
    expression levels, the location batch effects, the overdispersion 
    parameters and the log-odds ratios of the logistic regression for 
    dropout events. Here, we show how to extract the imputed data 
    from the output.
    
      <<Output_extract>>=
    # Extract the imputed read counts
    Imputed_count <- assay(BUSseqfits_res, "imputed_data")
  @
    
    We will further explain how to obtain the parameter
    estimation from the output \Robject{BUSseqfits\_res} in the next section.
    
    \subsection{Estimated Cell Type, Batch and Cell-Specific Effect Extraction}
  Our main interests are the estimation of the cell type for each cell 
  and the estimation of the batch effects. We can call the 
  \Rfunction{celltypes} function to extract the estimated cell type labels  
  $\widehat{W}_{bi}$ from \Robject{BUSseqfits\_res}.\\
  
  <<Celltypes>>=
    celltyes_est <- celltypes(BUSseqfits_res)
  @ 
    
    \noindent There is a message from the 
    \Rfunction{celltypes} function to remind the format of it output.\\
  
  \noindent Similarly, you can call the
  \Rfunction{location\_batch\_effects} and \Rfunction{overdispersions} 
  functions to get the estimated location batch effects $\widehat{\nu}_{bg}$ and 
  batch-specific and gene-specific overdispersion parameters 
  $\widehat{\phi}_{bg}$. Note that the first batch is taken as 
  the reference batch, so its location batch effects are zeros for all genes.\\
  
  <<BatchEffects>>=  
    location_batch_effects_est <- location_batch_effects(BUSseqfits_res)
    head(location_batch_effects_est)
    
    overdispersion_est <- overdispersions(BUSseqfits_res)
    head(overdispersion_est)
  @

\noindent The estimated cell-specific size factors $\widehat{\delta}_{bi}$ 
are available by calling the \Rfunction{cell\_effect\_values} function. 
Here, the first cell in each batch is regarded 
as the reference cell, and its size factor is set as zero.
    
  <<CellEffects>>=
    cell_effects_est <- cell_effect_values(BUSseqfits_res)
    head(cell_effects_est)
  @
  
  \noindent The \Rfunction{celltype\_mean\_expression} function provides 
  the estimated cell-type-specific mean expression levels 
  $exp(\widehat{\alpha}_g + \widehat{\beta}_{gk})$. The 
  estimates remove the technical artifacts, including the batch effects 
  and the cell-spcific size factors, but retain the biological variability 
  across different cell types. Moreover, the estimated cell type effects
  $\widehat{\beta}_{gk}$ can be obtained 
  by the \Rfunction{celltype\_effects} function. Notice that the first cell 
  type is regarded as the reference cell type implying all zeros in the first 
  column of \Robject{celltype\_effects\_est}. \\
  
  <<CelltypeEffects>>=
    celltype_mean_expression_est <- celltype_mean_expression(BUSseqfits_example)
    head(celltype_mean_expression_est)
    
    celltype_effects_est <- celltype_effects(BUSseqfits_res)
    head(celltype_effects_est)
  @
  \subsection{Intrinsic Gene Identification}
  
  <<IG>>=
    #obtain the intrinsic gene indicators
    intrinsic_gene_indicators <- intrinsic_genes_BUSseq(BUSseqfits_res)
    print(intrinsic_gene_indicators)
    
    #The estimated FDR, the first 240 genes are known as intrinsic 
    #genes in the simulation setting.
    index_intri <- which(unlist(intrinsic_gene_indicators) == "Yes")
    false_discovery_ind <- !(index_intri %in% 1:240)
    fdr_est <- sum(false_discovery_ind)/length(index_intri)
    print(fdr_est)
  @
    
    \noindent Therefore, the true FDR is \Sexpr{fdr_est} much less than 
    the estimated FDR, 0.05.
    
    
  \subsection{Corrected Read Count Data and Visualization}
  The \Rfunction{BUSseq\_MCMC} function not only conducts MCMC sampling
  and posterior inference, but also imputes the missing data caused by
  dropout events. Furthermore, based on the imputed data, we expect to 
  correct batch effects as if all the batches were measured in a single 
  scRNA-seq experiment. The \Rfunction{corrected\_read\_counts} function 
  adjusts to the imputed data and adds the corrected read 
  count data into the input \Robject{SingleCellExperiment}.\\ 

  <<adjusted>>=
    # Obtain the corrected read count data
    BUSseqfits_res <- corrected_read_counts(BUSseqfits_res)
    
    # An new assay "corrected_data" is incorporated
    BUSseqfits_res
  @
  
  \noindent Subsequently, we visualize the raw count data that suffer
  from batch effects and dropout events, the inferred true expression 
  levels after imputing dropout events, and the corrected count data 
  which impute the dropout events and remove the batch effects.
  The \Rfunction{heatmap\_data\_BUSseq} function draws the 
  heatmap for the count data across batches for the output 
  \Robject{SinleCellExperiment} object of the functions 
  \Rfunction{BUSseq\_MCMC} and \Rfunction{corrected\_read\_counts}.\\
  
  \noindent First, the used assay to draw the heatmap is determined by 
  the arugment \Robject{data\_type}, 
  including \Robject{Raw}, \Robject{Imputed} and \Robject{Corrected}.
  Moreover, the heatmap will be stored in the local folder according to 
  the argument \Robject{image\_dir}. The image name can be modified by the 
  argument \Robject{project\_name}. Besides, the user can specify 
  the argument \Robject{gene\_set} to only display a subset of genes 
  in the heatmap.\\

  <<visualize1>>=
    #generate the heatmap of raw read count data
    heatmap_data_BUSseq(BUSseqfits_res, data_type = "Raw", 
                            project_name = "BUSseq_raw_allgenes",
                            image_dir = "./heatmap")
    
    #display only the first 100 genes in the heatmap
    heatmap_data_BUSseq(BUSseqfits_res, data_type = "Raw", 
                            gene_set = 1:100,
                            project_name = "BUSseq_raw_100genes",
                            image_dir = "./heatmap")
  @
  \begin{figure}[!htbp]
    \centering
  \includegraphics[width=.45\textwidth]
  {heatmap/BUSseq_raw_allgenes_log1p_data.png}
  \includegraphics[width=.45\textwidth]
  {heatmap/BUSseq_raw_100genes_log1p_data.png}
  \caption{The heatmap of the raw count data of all genes (left) and 
  the first 100 genes (right). Each row represents a gene, and each column 
  denotes a cell. The color bar indicates the corresponding batch of each cell.}
  \end{figure}
  <<visualize2>>=
    #generate the heatmap of imputed read count data
    heatmap_data_BUSseq(BUSseqfits_res, data_type = "Imputed", 
                            project_name = "BUSseq_imputed_allgenes",
                            image_dir = "./heatmap")
    #generate the heatmap of corrected read count data
    heatmap_data_BUSseq(BUSseqfits_res, data_type = "Corrected",
                            project_name = "BUSseq_corrected_allgenes",
                            image_dir = "./heatmap")
  @
  \begin{figure}[!htbp]
    \centering
    \includegraphics[width=.45\textwidth]
    {heatmap/BUSseq_imputed_allgenes_log1p_data.png}
    \includegraphics[width=.45\textwidth]
    {heatmap/BUSseq_corrected_allgenes_log1p_data.png}
    \caption{The heatmap for the imputed (left) and corrected (right) count 
    data of all genes.}
  \end{figure}
  \noindent In all these heatmaps, the top bar indicates the corresponding 
  batch of each cell. That's to say, cells under the same color 
  are from the same batch. The batch effects present in the raw data 
  are correctly removed in the corrected count data, and only the biological 
  variabilities are kept. We can also only display the identified 
  intrinsic genes in the corrected count data by setting the argument 
  \Robject{gene\_set} as the indices of the identified intrinsic genes 
  \Robject{index\_intri} in the last section.
  
    <<visualize3>>=
    #Only show the identified intrinsic genes
    heatmap_data_BUSseq(BUSseqfits_res, data_type = "Corrected",
                    gene_set = index_intri,
                    project_name = "BUSseq_corrected_intrinsic_genes",
                    image_dir = "./heatmap")
  @
  \begin{figure}[!htbp]
    \centering
    \includegraphics[width=.8\textwidth]
    {heatmap/BUSseq_corrected_intrinsic_genes_log1p_data.png}
    \caption{The heatmap for the corrected count data of the identified
    intrinsic genes.}
  \end{figure}
  
  \section{Performance of BUSseq in real data analysis}
  
  In the scRNA-seq experiments, the unknown cell types 
  of individual cells are usually the target of inference. 
  Because of severe batch effects, if we directly pool cells 
  from different experiments together, the cells are often 
  clustered by batch or experiment rather than by cell type. 
  After correcting batch effects, cells can be clustered 
  based on the corrected read count data or their corresponding 
  low-dimensional embedding. Therefore, to benchmark different 
  batch effects correct methods, we expect that the estimated 
  cell-type labels are highly consistent with the reference 
  cell type labels generated by the fluorescence-activated 
  cell sorting (FACS) technique. The adjusted Rand index (ARI) 
  can measure the consistency between the estimated labels 
  and the reference ones. ARI ranges from 0 to 1, and the higher 
  value means the better consistency \cite{rand1971objective}.\\ 
  
  
  \cite{song2020flexible} benchmarked BUSseq with the state-of-the-art methods 
  of batch effects correction for scRNA-seq data, including 
  LIGER \cite{welch2019single}, MNN \cite{haghverdi2018batch}, 
  Scanorama \cite{hie2019efficient}, scVI \cite{lopez2018deep}, 
  Seurat \cite{stuart2019comprehensive} and ZINB-WaVE \cite{risso2018general}. 
  We applied all these methods to integrate multiple scRNA-seq experiments in a 
  mouse hematopoietic study and a human pancreatic study, respectively.\\
  
  \begin{table}
  	\centering
  	\begin{tabular}{c|cc}
  	\hline\hline
  		Method & ARI on hematopoietic study & ARI on pancreatic study \\
  		\hline
  		BUSseq & 0.582 & 0.608 \\
  		LIGER & 0.307 & 0.542 \\
  		MNN & 0.575 & 0.279 \\
  		Scanorama &  0.518 & 0.527 \\
  		scVI & 0.197 & 0.282 \\
  		Seurat 3.0 & 0.266 & 0.287 \\
  		ZINB-WaVE & 0.348 & 0.380 \\
  		 \hline\hline
  	\end{tabular}
  \caption{The comparison of different methods in the cell type clustering.}
  \end{table} 
  
  According to the above table, BUSseq outperforms all of the other methods 
  in being consistent with the reference cell-type labels for these 
  two real studies. 

  \section{Session information}
    <<RSession>>=
  sessionInfo()
  @

  
  \bibliography{user_guide}
  \end{document}