---
title: "Preprocessing"
author: 
- name: Lis Arend
bibliography: references.bib
biblio-style: apalike
link-citation: yes
colorlinks: yes
output: 
  bookdown::html_document2:
    toc: true
    toc_depth: 2
    number_sections: true
    fig_caption: true
pkgdown:
  as_is: true
vignette: >
  %\VignetteIndexEntry{2. Preprocessing}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>", message = TRUE, warning = FALSE,
  fig.width=8,
  fig.height =6
)
```

# Load PRONE Package

```{r, message = FALSE}
# Load and attach PRONE
library(PRONE)
```

# Load Data (TMT)

Here, we are directly working with the [SummarizedExperiment](https://bioconductor.org/packages/release/bioc/html/SummarizedExperiment.html) data. For more information on how to create the SummarizedExperiment from a proteomics data set, please refer to the ["Get Started"](PRONE.html) vignette.

The example TMT data set originates from [@biadglegne_mycobacterium_2022].

```{r load_real_tmt}
data("tuberculosis_TMT_se")
se <- tuberculosis_TMT_se
```

This SummarizedExperiment object already includes data of different normalization methods. Since this vignette should show you how to use the PRONE workflow for novel proteomics data, we will remove the normalized data and only keep the raw and log2 data that are available after loading the data accordingly. 

```{r, eval = TRUE, include = TRUE}
se <- subset_SE_by_norm(se, ain = c("raw", "log2"))
```

# Overview of the Data

To get an overview on the number of NAs, you can simply use the function `get_NA_overview()`:

```{r overview_NA_table}
knitr::kable(get_NA_overview(se, ain = "log2"))
```

To get an overview on the number of samples per sample group or batch, you can simply use the function `plot_condition_overview()` by specifying the column of the meta-data that should be used for coloring. By default (condition = NULL), the column specified in `load_data()` will be used.

```{r, fig.cap = "Overview barplot of the number of samples per condition."}
plot_condition_overview(se, condition = NULL)
```


```{r, fig.cap = "Overview barplot of the number of samples per pool."}
plot_condition_overview(se, condition = "Pool")
```

A general overview of the protein intensities across the different samples is provided by the function `plot_heatmap()`. The parameter "ain" specifies the data to plot, currently only "raw" and "log2" is available (names(assays(se)). Later if multiple normalization methods are executed, these will be saved as assays, and the normalized data can be visualized.

```{r, fig.cap = "Heatmap of the log2-protein intensities with columns and proteins being clustered with hclust."}
available_ains <- names(assays(se))

plot_heatmap(se, ain = "log2", color_by = c("Pool", "Group"), 
             label_by = NULL, only_refs = FALSE)
```

Similarly, an upset plot can be generated to visualize the overlaps between sets defined by a specific column in the metadata. The sets are generated by using non-NA values. 

```{r, fig.cap = "Upset plot of the non-NA protein intensities with sets defined by the Pool column.", fig.width = 8, fig.height = 8}
plot_upset(se, color_by = NULL, label_by = NULL, mb.ratio = c(0.7,0.3), 
           only_refs = FALSE)
```

If you are interested in the intensities of specific biomarkers, you can use the `plot_markers_boxplots()` function to compare the distribution of intensities per group. The plot can be generated per marker and facet by normalization method (facet_norm = TRUE) or by normalization method and facet by marker (facet_marker = TRUE).

```{r, fig.cap = "Boxplot of the log2-protein intensities of the markers Q92954;J3KP74;E9PLR3, Q9Y6Z7, and Q68CQ4. If specific biomarkers are known to be upregulated in some conditions, you can use this plot to get an impression of the intensities before doing differential expression analysis."}
p <- plot_markers_boxplots(se, 
                           markers = c("Q92954;J3KP74;E9PLR3", "Q9Y6Z7"), 
                           ain = "log2", 
                           id_column = "Protein.IDs", 
                           facet_norm = FALSE, 
                           facet_marker = TRUE)
p[[1]] + ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 90, vjust = 0.5))
```


# Filter Proteins 

## Remove Proteins With Missing Values in ALL Samples

```{r}
se <- filter_out_complete_NA_proteins(se)
```

## Remove Proteins With a Specific Value in a Specific Column

Typically proteins with "+" in the columns "Reverse", "Only.identified.by.site", and "Potential.contaminant" are removed in case of a MaxQuant proteinGroups.txt output file.

```{r}
se <- filter_out_proteins_by_value(se, "Reverse", "+")
se <- filter_out_proteins_by_value(se, "Only.identified.by.site", "+")
#se <- filter_out_proteins_by_value(se, "Potential.contaminant", "+")
```

## Remove Proteins by ID

If you don't want to remove for instance all proteins with "Potential.contaminant == +", you can also first get the protein ID with the specific value, check them in Uniprot, and then remove only some by using the function `filter_out_proteins_by_ID()`.

```{r} 
pot_contaminants <- get_proteins_by_value(se, "Potential.contaminant", "+")
se <- filter_out_proteins_by_ID(se, pot_contaminants)
```

## Explore Missing Value Pattern

Due to the high amount of missing values in MS-based proteomics data, it is important to explore the missing value pattern in the data. The function `plot_NA_heatmap()` provides a heatmap of the proteins with at least one missing value across all samples. 

```{r, fig.cap = "Heatmap of the missing value pattern. Proteins with at least one missing value are included in this heatmap."}
plot_NA_heatmap(se, color_by = NULL, label_by = NULL, cluster_samples = TRUE, 
                cluster_proteins = TRUE)
```

Another way to explore the missing value pattern is to use the functions `plot_NA_density()` and `plot_NA_frequency()`. 

```{r, fig.cap = "Density plot of protein intensities of proteins with missing values and proteins without missing values. This can help to check whether the missing values are biased to lower intense proteins. In this case, the missing values are biased towards the lower intense proteins."}
plot_NA_density(se)
```

```{r, fig.cap = "Protein identification overlap. This plot shows the number of identified proteins (y-axis) per number of samples (x-axis)."}
plot_NA_frequency(se)
```

## Filter Proteins By Applying a Missing Value Threshold

To reduce the amount of missing values, it is possible to filter proteins by applying a missing value threshold. The function `filter_out_NA_proteins_by_threshold()` removes proteins with more missing values than the specified threshold. The threshold is a value between 0 and 1, where 0.7, for instance, means that proteins with less than 70% of real values will be removed, i.e., proteins with more than 30% missing values will be removed.

```{r, fig.cap = "Heatmap of the missing value pattern after filtering proteins with too many missing values (threshold was set to 70%). Proteins with at least one missing value are included in this heatmap."}
se <- filter_out_NA_proteins_by_threshold(se, thr = 0.7) 

plot_NA_heatmap(se)
```

# Filter Samples

Following filtering proteins by different criteria, samples can be analyzed more in detail. PRONE provides some functions, such as `plot_nr_prot_samples()` and `plot_tot_int_samples()`, to get an overview of the number of proteins and the total intensity per sample, but also offers the automatic outlier detection method of POMA.

## Quality Control

```{r, fig.cap = "Barplot on the number of non-zero proteins per samples colored by condition. This plot helps as quality control to check whether the number of proteins is similar across samples. In this case, only very few proteins have been detected for sample 1.HC_Pool1 compared to the other samples."}
plot_nr_prot_samples(se, color_by = NULL, label_by = NULL)
```

```{r, fig.cap = "Barplot on the total protein intensity per samples colored by condition. This plot helps as quality control to check whether the total protein intensity is similar across samples. In this case, the total protein intensities of samples 1.HC_Pool1 and 1.HC_Pool2 are much lower compared to the other samples."}
plot_tot_int_samples(se, color_by = NULL, label_by = NULL)
```

## Remove Samples Manually

Based on these plots, samples "1.HC_Pool1" and 1_HC_Pool2 seem to be outliers. You can easily remove samples manually by using the `remove_samples_manually()` function.

```{r}
se <- remove_samples_manually(se, "Label", c("1.HC_Pool1", "1.HC_Pool2"))
```

## Remove Reference Samples

And you can remove the reference samples directly using the function `remove_reference_samples()`. But attention: possibly you need them for normalization! That is exactly why we currently keep them!

```{r}
se_no_refs <- remove_reference_samples(se)
```


## Outlier Detection via POMA R Package

The POMA R package provides a method to detect outliers in proteomics data. The function `detect_outliers_POMA()` detects outliers in the data based on the POMA algorithm. The method calculates the euclidean distances (or maximum or manhattan distances) between all sample pairs and then group centroids are computed either as the average distance of group members or the spatial median. The distances of the group members to the corresponding group centroid are then calculated to apply the classical univariate outlier detection formula Q3 + 1.5 * IQR (1.5 is adjustable) to detect group-dependent outliers.
 
The function returns a list with the following elements: polygon plot, distance boxplot, and the outliers. For further information on the POMA algorithm, please refer to the original publication [@castellano-escuder_pomashiny:_2021]: 

```{r}
poma_res <- detect_outliers_POMA(se, ain = "log2")
```

```{r, fig.cap = "Polygon plot of the principal coordinates calculated using the POMA algorithm. The original distances are reduced to principal coordinates (see POMA and the vegan::betadisper method)."}
poma_res$polygon_plot
```

```{r, fig.cap = "Boxplot of the distances of the samples to the group centroid calculated using POMA."}
poma_res$distance_boxplot
```

```{r}
knitr::kable(poma_res$outliers, 
             caption = "Outliers detected by the POMA algorithm.", digits = 2)
```

To remove the outliers detected via the POMA algorithm, just put the data.table of the `detect_outliers_POMA()` function into the `remove_POMA_outliers()` function.

```{r}
se <- remove_POMA_outliers(se, poma_res$outliers)
```

# Session Info

```{r}
utils::sessionInfo()
```

# References