---
title: "Introduction to BatchQC"
author: 
- name: W. Evan Johnson
  affiliation:
  - Division of Infectious Disease, Department of Medicine, Rutgers University 
  - Director, Center for Data Science, Rutgers University
  email: wj183@njms.rutgers.edu 
- name: Jessica McClintock
  affiliation: 
  - Division of Infectious Disease, Department of Medicine, Rutgers University 
  email: jessica.mcclintock@rutgers.edu
date: '`r format(Sys.Date(), "%B %e, %Y")`'
package: BatchQC
output: 
  BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{Introdution to BatchQC}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  chunk_output_type: console
---

# Introduction
Sequencing and microarray samples are often collected or processed in multiple 
batches or at different times. This often produces technical biases that can 
lead to incorrect results in the downstream analysis. BatchQC is a software tool
that streamlines batch preprocessing and evaluation by providing shiny app
interactive diagnostics, visualizations, and statistical analyses to explore
the extent to which batch variation impacts the data. This is contained in a
full R package which allows reproducibility of all shiny evaluations and
analyses.  BatchQC diagnostics help determine whether batch adjustment needs to
be done, and how correction should be applied before proceeding with a
downstream analysis. Moreover, BatchQC interactively applies multiple common
batch effect approaches to the data, and the user can quickly see the benefits
of each method. BatchQC is developed as a Shiny App, but is reproducible at the
command line through the functions contained int he R package. The output of the
shiny app is organized into multiple tabs, and each tab features an important 
part of the batch effect analysis and visualization of the data. The BatchQC 
interface has the following features:

1. Upload Data; apply desired Normalization and Batch Effect Correction (incl. 
ComBat and ComBat-Seq)
2. Experimental Design to view summary of data
3. Variation Analysis
4. Heatmap plot of gene expressions
5. Median Correlation Plot
6. Circular Dendrogram clustered and colored by batch, condition, and covariates
7. Principal Component Analysis and plots
8. Differential Expression Plots and Analysis using DESeq2
9. Data Download to export any corrections or normalizations as an SE object

`BatchQC()` is the function that launches the Shiny App in interactive mode.

# Installation
## Bioconductor Version
To begin, install [Bioconductor](http://www.bioconductor.org/) and then install
BatchQC:

```{R, eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("BatchQC")
```

## Github Version
To install the most up-to-date version of BatchQC, please install directly from
github. You will need the devtools package. You can install both of these with
the following commands:
```{R, eval = FALSE}
if (!require("devtools", quietly = TRUE)) {
    install.packages("devtools")   
}
library(devtools)
install_github("wejlab/BatchQC")
```

## Load BatchQC and Launch Shiny App
You should now be able to load BatchQC and launch the shiny app.

```{R load}
library(BatchQC)
```

```{R, eval = FALSE, echo = TRUE}
BatchQC()
```

# Example Usage
Below is an example of how you may use the shiny app to analyze your data:

## 1. Upload data set 
Upon launching the shiny app, you will be on the "Upload Data" screen. From
here, you may upload the following data:

1. **Your own data w/Count and Metadata file.** Count file should have sample
ID as column and item of interest as row. Metadata should have sample ID as row
and all meta data labeled in the column. Both files should be uploaded as .csv
by selecting "Browse" and selecting the appropriate files from your computer.
2. **Your own data as a Summarized Experiment object.** Upload should be a .RDS
file type and can be uploaded by selecting "Browse" and then the appropriate 
file from your computer
3. **An example data set.** See the examples vignette for additional
information on each example data set

A preview of the selected data will show up under the "Input Summary" and "Full
Metadata" tab. Input summary shows the counts file/assay on the Input summary
tab and the metadata on the "Full Metadata" tab. If a preview does not load
properly, please ensure your dataset is structured properly and a valid file
type.

**You MUST hit "Upload" for your data to be available to interact with the shiny
app.**

All shiny app functions can also be run from the command line; all command line
execution require that your data is in a Summarized Experiment object. You
can create a summarized experiment object from a counts and metadata matrix. 
The remainder of this documentation will utilize the signature data example data
set included in the package. This can be loaded from the command line as a
summarized experiment object with the following:
```{r}
data(signature_data)
data(batch_indicator)
se_object <- BatchQC::summarized_experiment(signature_data, batch_indicator)
SummarizedExperiment::assayNames(se_object) <- "log_intensity"
se_object$batch <- as.factor(se_object$batch)
se_object$condition <- as.factor(se_object$condition)
```

Likewise, you can upload your own data into a summarized experiment object for
use with command line functions by providing a data matrix and sample info
matrix to the summarized_experiment function.

## 2. Apply Normalization methods
After you have successfully uploaded your data set of interest, you can apply
one of the following Normalization methods if appropriate:

1. **CPM or counts per million.** CPM calculates the counts mapped to a feature 
relative to the total counts mapped to a sample times one million. CPM may be 
used to adjust expression count biases introduced by sequencing depth. CPM 
adjusted values are not recommended for differential expression analysis or 
within sample comparison.
2. **DESeq.** DESeq calculates the counts mapped to a feature divided by sample-
specific size factors. Size factors are determined by the median ratio of gene 
counts relative to the geometric mean per feature. DESeq may be used to adjust
expression count biases introduced by sequencing depth and RNA composition. 
DESeq adjusted values are not recommended for within sample comparison.

To use one of these two methods, select it from the normalization method drop 
down, select an assay to perform the normalization on from the drop down, and
name the new assay (the prepopulated name can be changed to your preference).
You may also choose to log-transform the results of either of these
normalization methods or your raw data by selecting the "Log transform" box. Hit
"Normalize" to complete the analysis. Your new assay will now be available for
further analysis.

Alternatively, this can be called directly from the command line on an SE
object and will return the same SE object with an additional assay with the name
that you provide as the output_assay_name.

Our signature data set already has a log adjusted assay, so normalization is not
needed. Examples are provided below on how you would complete this if needed via
the command line:

```{r}
# CPM Normalization
se_object <- BatchQC::normalize_SE(se = se_object, method = "CPM",
    log_bool = FALSE, assay_to_normalize = "log_intensity",
    output_assay_name = "CPM_normalized_counts")

# CPM Normalization w/log
se_object <- BatchQC::normalize_SE(se = se_object, method = "CPM",
    log_bool = TRUE, assay_to_normalize = "log_intensity",
    output_assay_name = "CPM_log_normalized_counts")

# DESeq Normalization
se_object <- BatchQC::normalize_SE(se = se_object, method = "DESeq",
    log_bool = FALSE, assay_to_normalize = "log_intensity",
    output_assay_name = "DESeq_normalized_counts")

# DESeq Normalization w/log
se_object <- BatchQC::normalize_SE(se = se_object, method = "DESeq",
    log_bool = TRUE, assay_to_normalize = "log_intensity",
    output_assay_name = "DESeq_log_normalized_counts")

# log adjust
se_object <- BatchQC::normalize_SE(se = se_object, method = "none",
    log_bool = TRUE, assay_to_normalize = "log_intensity",
    output_assay_name = "log_normalized_counts")
```

## 3. Apply Batch Effect Correction
Select the appropriate Batch Effect correction model to create a batch-corrected
assay for your data. You may select one of the following:

1. **ComBat-Seq.** ComBat-Seq uses a negative binomial regression to model batch
effects. It requires untransformed, raw count data to adjust for batch effect. 
Only use this model on an untransformed raw counts assay (not normalized 
assays).
2. **ComBat.** ComBat corrects for Batch effect using a parametric empirical 
Bayes framework and data should be cleaned and normalized. Select a normalized 
assay to run this on.
3. **limma.** limma fitsa linear model to tohe data and then removes components
due to batch effect. It shoudl be used on log transformed data.
4. **sva** sva identifies and estimates surrogate varaibels for unknown sources
of variation. It is appropriate when you know your experimental variable of
interest (like the disease state of samples), but unknown sources of variation.
It should be run on cleaned and log transformed data. This can be run with or
without additional adjustment variables.

To apply your desired batch effect correction model, select the method from the 
drop down menu, then select the appropriate assay in the second drop down menu,
select your batch variable (labeled batch in all built-in examples), and the 
covariates that you would like to preserve (not including the batch variable).
If desired, revise the name for the corrected assay and then hit "Correct."

A progress bar will pop up in the bottom right hand corner and a message will 
state "Correction Complete" in the bottom right hand corner when complete.

You are now ready to analyze your raw and batch effect corrected data sets! 

The signature data is log normalized, so ComBat is one appropriate batch
correction tool. Therefore, to run from the command line use the following:

```{r}
# ComBat correction
se_object <- BatchQC::batch_correct(se = se_object, method = "ComBat",
    assay_to_normalize = "log_intensity", batch = "batch",
    covar = c("condition"), output_assay_name = "ComBat_corrected")
```

ComBat-Seq (or limma or sva) can be applied to a data set by providing
"ComBat-Seq" (or "limma" or "sva") to the method parameter in the function call.

## 4. Experimental Design
### Batch Design
This tab allows you to view the batch status of each condition. Select the 
variable that represents your batch variable and then select the variable that 
represents the condition. A table will then appear listing each condition in the
row and how many samples in each condition are in each batch (displayed in the 
columns).

To recreate this table, run the following code, which will produce a tibble
as output:
```{R}
batch_design_tibble <- BatchQC::batch_design(se = se_object, batch = "batch", 
    covariate = "condition")
batch_design_tibble
```

### Confounding Statistics
This tab displays the Pearson correlation coefficient and Cramer's V estimation
based on the selected batch and condition.

This table can be recreated with the following code:
```{R}
confound_table <- BatchQC::confound_metrics(se = se_object, batch = "batch")
confound_table
```

#### Pearson Correlation Coefficient
Pearson correlation coefficient indicates the strength of the linear 
relationship between your batch and condition variables. It can range from -1 to
1. The closer the value is to 0, the less likely there is a batch effect. The 
closer the value is to -1 or 1, the more likely there is a batch effect.

To calculate only the pearson correlation coefficient, run the following:
```{R}
pearson_cor_result <- BatchQC::std_pearson_corr_coef(batch_design_tibble)
pearson_cor_result
```

#### Cramer's V
This is an additional metric for batch effect and will be between 0 and 1. The 
closer the value is to 0, the lower the batch effect. The closer the value is to
1, the greater the batch effect

To calculate only Cramer's V, run the following:
```{R}
cramers_v_result <- BatchQC::cramers_v(batch_design_tibble)
cramers_v_result
```

If both the Pearson Correlation Coefficient and the Cramer's V test indicate low
batch effect, you likely do not need to adjust your results for batch. However,
if one or both metrics indicate some or high batch effect, you should use
additional analysis tools (listed below) to explore the need for a batch
correction for your results.

## 5. Variation Analysis
First, select your unadjusted assay, your batch variable and any covariates of
interest. Click "Here we go!" to see the variation analysis results. The
"Individual Variable" tab shows the individual, or raw variation, explained
by each variable. The "Residual" tab displays the type 2, or residual, variation
for each variable. A box plot is displayed along with a table with the variation
specific to each gene listed. This table is searchable and can be displayed in
ascending or descending order. Notice the amount of variation attributed to
batch. In our signature data set, most of the variation is attributed to batch
rather than the condition. This indicates a need for a batch correction.

The "Individual Variation Variable/Batch Ratio" tab displays the individual, or
raw variation, divided by the batch. A ratio greater that 1 indicates that batch
has a stronger affect than the variable of interest. And the "Residual Variation
Variable/Batch Ratio" displays the residual, or type 2 variation, divided by the
batch. A ratio greater that 1 indicates that batch has a stronger affect than
the variable of interest.

Next, select the batch corrected assay, with the same batch and condition
variable and observe the changes in variation. When batch effect is being
properly corrected, you should notice that the variation explained by the batch
variable has decreased in the batch corrected assay, indicating that the
variation in the data is now more likely to be associated with your condition of
interest rather than the batch. The ratio statistics should also be less than 1.
This is visible in our example data set and confirms the need for batch
adjustment.

To recreate the variation analysis plot for the log_intensity assay, use the
following code:
```{r}
ex_variation <- batchqc_explained_variation(se = se_object,
    batch = "batch", condition = "condition", assay_name = "log_intensity")
EV_boxplot <- BatchQC::EV_plotter(batchqc_ev = ex_variation[[1]])
EV_boxplot
EV_boxplot_type_2 <- BatchQC::EV_plotter(batchqc_ev = ex_variation[[2]])
EV_boxplot_type_2
```

To recreate the variation analysis table for the log_intensity assay, use the
following code:
```{r}
ex_variation <- batchqc_explained_variation(se = se_object,
    batch = "batch", condition = "condition", assay_name = "log_intensity")
EV_table <- BatchQC::EV_table(batchqc_ev = ex_variation[[1]])
EV_table
EV_table_type_2 <- BatchQC::EV_table(batchqc_ev = ex_variation[[2]])
EV_table_type_2
```

We can then compare our batch corrected assay, using the following code:
```{r}
ex_variation <- batchqc_explained_variation(se = se_object,
    batch = "batch", condition = "condition", assay_name = "ComBat_corrected")
EV_boxplot <- BatchQC::EV_plotter(batchqc_ev = ex_variation[[1]])
EV_boxplot
EV_boxplot_type_2 <- BatchQC::EV_plotter(batchqc_ev = ex_variation[[2]])
EV_boxplot_type_2
EV_table <- BatchQC::EV_table(batchqc_ev = ex_variation[[1]])
EV_table
EV_table_type_2 <- BatchQC::EV_table(batchqc_ev = ex_variation[[2]])
EV_table_type_2
```

## 6. Heatmaps
Select your assay of interest and the batch and condition(s) of interest. We
recommend viewing no more than 500 elements at a time (or the max number in your
data set). 

### Sample Correlations
This heat map shows how similar each sample is to each other sample in your
data (samples are plotted on both the x and y axis, so are identical to 
their self). The samples are clustered together based on similarity. Ideally, 
you would expect your samples to cluster and be more similar to samples from
the same condition rather than from the sample batch. If there is clear visual 
clustering of the batch in your raw data, you should apply a batch correction 
method. Your batch corrected assay should have stronger clustering by condition.

Using the signature dataset, we initially see clustering based on batch, but
after apply ComBat, our adjusted data set clusters more based on condition.

Use the following code to recreate the sample correlation heatmap:
```{r}
heatmaps <- BatchQC::heatmap_plotter(se = se_object, assay = "log_intensity",
    nfeature = 38, annotation_column = c("batch", "condition"),
    log_option = "FALSE")
correlation_heatmap <- heatmaps$correlation_heatmap
correlation_heatmap
```

### Heatmap
The heatmap shoes the sample clustering on the x axis and the y axis is each
gene. The values on the heatmap represent gene expression. The dendrogram on the
x axis is the same clustering arrangement as was seen in the sample
correlations. This heatmap allows the viewer to see patterns of gene expression
across the samples based on the selected variables. Remember that if batch is
clustering together on your raw data, you should perform a batch effect 
correction to adjust your data.

Use the following code to recreate the heatmap:
```{r}
heatmaps <- BatchQC::heatmap_plotter(se = se_object, assay = "log_intensity",
    nfeature = 38, annotation_column = c("batch", "condition"),
    log_option = FALSE)
heatmap <- heatmaps$topn_heatmap
heatmap
```

## 7. Dendrograms
### Dendrogram
This tab shows a more detailed dendrogram than seen in the heatmap. When samples
cluster together by batch, you are likely to have a strong batch effect in your
data and you should consider a batch correction method for your data. Samples
are clustered based on similarity of gene expression and other metadata using an
Euclidian distance metric. Labels will include the batch variable and one
covariate of interest.

Use the following code to recreate the dendrogram for the signature data set: 
```{r}
dendrogram_plot <- BatchQC::dendrogram_plotter(se = se_object,
    assay = "log_intensity",
    batch_var = "batch",
    category_var = "condition")
dendrogram_plot$dendrogram
```

### Circular Dendrogram
This circularizes the previous dendrogram to better show relatedness of all 
branches on the dendrogram without the appearance of great distance of the 
samples at the top and the bottom of the chart.

Use the following code to recreate the circular dendrogram: 
```{r}
circular_dendrogram_plot <- BatchQC::dendrogram_plotter(
    se = se_object, assay = "log_intensity", batch_var = "batch",
    category_var = "condition")
circular_dendrogram_plot$circular_dendrogram
```

## 8. PCA Analysis
You may select multiple assays to plot PCA analysis side by side. Therefore, 
select your raw data assay and batch corrected assay, the batch variable as 
shape, condition as color, and the number of top variable features you are
interested in. Review the clustering pattern. In data sets where a strong batch
effect is seen, similar shapes (batch) will all cluster together. Whereas if a
batch effect is not seen, shapes (batch) should be dispersed throughout and you
should see clustering by condition (color).

Use the following code to recreate the PCA plot and associated table that lists
the explained variation of each PC for our signature data set:
```{r}
pca_plot <- BatchQC::PCA_plotter(se = se_object, nfeature = 20, color = "batch",
    shape = "condition", batch = "batch",
    assays = c("log_intensity", "ComBat_corrected"),
    xaxisPC = 1, yaxisPC = 2, log_option = FALSE)
pca_plot$plot
pca_plot$var_explained
```

## 9. umap Analysis
You may plot your data on a umap by selecting the assay, the batch variable, and
selecting a value for nearest neighbors and minimum distance for the points to
plot. The default for neighbors is 15 and minimum distance default is 0.1. We
expect each batch to cluster together when there is a batch effect present.
Otherwise, having dispersed points in each batch indicates there is not a strong
batch effect. You can adjust the nearest neighbor parameter, a lower value will
priortize local structure and a higher value will represent bigger picture, but
lose finer details. A higher minimum distance value will also put less emphasis
on the global structure.

Use the following code to recreate the umap for the signature data set:

```{r umap}
umap_plot <- BatchQC::umap(se = se_object, assay_of_interest = "log_intensity",
    batch = "batch", neighbors = 15, min_distance = 0.1, spread = 1)

umap_plot
```

This code creates a umap for the batch corrected assay:
```{r umap batch corrected}
umap_plot_batch_corrected <- BatchQC::umap(se = se_object,
    assay_of_interest = "ComBat_corrected",
    batch = "batch", neighbors = 15, min_distance = 0.1, spread = 1)

umap_plot_batch_corrected
```

## 10. Differential Expression Analysis
To explore which features are differentially expressed between conditions, use
the differential expression tab. DE_Seq2 should be used for count data
(non-negative, integer values) while limma can be used for all other data. 

Our signature data is logged data, so we will use limma, we will select the
assay we are interested in, select the batch variable and select all covariates
that we would like to include in the analysis. The "Results Table" return a
table for each condition within each variable. The results table displays the
log2 fold change, p-value and adjusted p-value for each feature. The "P-Value
Analysis" tab displays a box plot of p-values for each analysis. In both the
"Results Table" and "P-Value Analysis" tabs, the tables can be sorted by
ascending or descending order of a column of your choice. They are also
searchable with the "Search" bar (which can be helpful if you are interested in
a specific feature or p-value threshold).
Differential Express
To run the limma differential expression analysis from the command line, run the
following code (which can be modified to run DESeq2 when appropriate by changing
the method argument to "DESeq2"):
```{r}
differential_expression <- BatchQC::DE_analyze(se = se_object,
    method = "limma", batch = "batch", conditions = c("condition"),
    assay_to_analyze = "log_intensity")
```

After running the differential expression analysis, you can see all the
completed analysis by running this code: 

```{r}
names(differential_expression)
```

You can then view the log2 fold change, p-value and adjusted p-value table for a
specific analysis of interest, by running the following:
```{r}
head(differential_expression[["batch2"]])
```

After running a differential expression analysis, you may replicate the p-value
box plot and table from the command line by running the following code:

```{r}
pval_plotter(differential_expression)
head(pval_summary(differential_expression))
```

### Volcano Plot
The "Volcano Plot" is based on the limma or DESeq2 analysis results and must be
run after completing the differential expression analysis. You should select the
analysis result that you are interested in, specify the threshold for p-value
cut off (the default is 0.05), and the magnitude of expression change (the
shiny default is the average of the absolute value of the minimum and maximum 
value in the data set). After clinking "Run," a volcano plot will appear. All
values that are below the p-value threshold and exceed the set expression
change threshold will appear in red, indicating that these may be potential
features that differ between the conditions in that variable.

In a data set with a batch effect, you will likely see many significant values
in both your p-value analysis and in the volcano plot. When you apply a batch
correction analysis, you should see more significant values in your covariate
conditions than in your batch conditions and fewer significant values that meet
the volcano plot thresholds.

The volcano plot is interactive and you can select specific points to view the
feature information, log2fold change, and -log10(p-value).

To recreate the volcano plot for the "batch2" differential expression result 
assay (which you may change by using any of the named results assays), run the
following code:

```{r}
value <- round((max(abs(
    differential_expression[[length(differential_expression)]][, 1]))
    + min(
        abs(differential_expression[[length(differential_expression)]][, 1])))
    / 2)
volcano_plot(DE_results = differential_expression[["batch2"]],
    pslider = 0.05,
    fcslider = value)
```

## 10. Data Download
You may download the data set that you have been working with int he shiny app,
including the  assays you added in the "Normalization" or "Batch Effect
Correction" steps when you uploaded your data. A preview is provided that shows
the Summarized Experiment (SE) object, including the dimensions, metadata table,
assays, rownames and colNames.

Click the "Download" button and you can save this SE object to your computer for
additional analysis outside of the BatchQC shiny app. The object is saved as a
.RDS Summarized Experiment object in the folder of your choice. You should give
it a descriptive name (default is "se.RDS").

If you run code from the command line according to the above instructions, all
assays will be saved to your original SE object and available for your use in
other applications. You may save it as an SE object by running the following:

```{r, eval = FALSE, echo = TRUE}
file_location <- "location/to/save/object"

saveRDS(object = se_object, file = file_location)
```

# Session info {.unnumbered}

```{r sessionInfo, echo=FALSE}
sessionInfo()
```