---
title: "MSstatsTMTPTM Example: An example workflow and analysis of the MSstatsTMTPTM package"
author: "Devon Kohler (<kohler.d@northeastern.edu>)"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{MSstatsTMTPTM Example: An example workflow and analysis of the MSstatsTMTPTM package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width=8, 
  fig.height=8
)
```

```{r, message=FALSE, warning=FALSE}
library(MSstatsTMTPTM)
library(MSstatsTMT)
library(MSstats)
library(dplyr)
```

This Vignette provides an example workflow for how to use the package 
MSstatsTMTPTM. It also provides examples and an analysis of how adjusting for 
global protein levels allows for better interpretations of PTM modeling results.

## Installation

To install this package, start R (version "4.0") and enter:

``` {r, eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("MSstatsTMTPTM")
```

## 1. Workflow

### 1.1 Raw Data Format

The first step is to load in the raw dataset for both the PTM and Protein 
datasets. This can be the output of the MSstatsTMT converter functions: 
`PDtoMSstatsTMTFormat`, `SpectroMinetoMSstatsTMTFormat`, and 
`OpenMStoMSstatsTMTFormat`. Both the PTM and protein datasets must include the 
following columns: `ProteinName`, `PeptideSequence`, `Charge`, `PSM`, `Mixture`,
`TechRepMixture`, `Run`, `Channel`, `Condition`, `BioReplicate`, and 
`Intensity`.

#### 1.1.1 Raw PTM Data

```{r}
# read in raw data files
# raw.ptm <- read.csv(file="raw.ptm.csv", header=TRUE)
head(raw.ptm)
```

It is important to note the `ProteinName` column in the PTM dataset represents 
the modification sites. The location of the modification must be added into the 
`ProteinName`. For example the first row shows `Protein_12_S703` for 
ProteinName, with Y474 being the modificaiton site. This can be done as shown 
above, or by adding the PeptideSequence into the ProteinName, ex. 
`Protein_12_Peptide_491` for the first row.

#### 1.1.1 Raw Protein Data

``` {r}
head(raw.protein)
# raw.protein <- read.csv(file="raw.protein.csv", header=TRUE)
```

The raw Protein dataset looks similar to the PTM dataset, however the 
`ProteinName` column does not contain a modification site.

### 1.2 proteinSummarization

After loading in the input data, the next step is to use the 
proteinSummarization function from MSstatsTMT. This provides the summarized 
dataset needed to model the protein/PTM abundance. The summarization for PTM and
Protein datasets should be done separately. The function will summarize the 
Protein dataset up to the protein level and will summarize the PTM dataset up to
the PTM level. The different summarizations are caused by adding the PTM site 
into the `ProteinName` field. For details about normalization and imputation 
options in proteinSummarization please review the package documentation here: 
[MSstatsTMT Package](https://www.bioconductor.org/packages/release/bioc/html/MSstatsTMT.html).

``` {r, results='hide', message=FALSE, warning=FALSE}

# Use proteinSummarization function from MSstatsTMT to summarize raw data
quant.msstats.ptm <- proteinSummarization(raw.ptm,
                                          method = "msstats",
                                          global_norm = TRUE,
                                          reference_norm = FALSE,
                                          MBimpute = TRUE)

quant.msstats.protein <- proteinSummarization(raw.protein,
                                          method = "msstats",
                                          global_norm = TRUE,
                                          reference_norm = FALSE,
                                          MBimpute = TRUE)

```

``` {r}
head(quant.msstats.ptm)
head(quant.msstats.protein)
```


### 1.3 groupComparisonTMTPTM

After the two datasets are summarized, both the summarized PTM and protein 
datasets should be used in the modeling function `groupComparisonTMTPTM`. First 
a full pairwise comparison is made between all conditions in the experiment.

``` {r, results='hide', message=FALSE, warning=FALSE}

# test for all the possible pairs of conditions
model.results.pairwise <- groupComparisonTMTPTM(data.ptm=quant.msstats.ptm,
                                       data.protein=quant.msstats.protein)
```

Optionally, a specific contrast matrix can be defined. Below is an example of a 
contrast matrix and how it is passed into the `groupComparisonTMTPTM` function.

``` {r,  message=FALSE, warning=FALSE}
# Specify comparisons
comparison<-matrix(c(1,0,0,-1,0,0,
                     0,1,0,0,-1,0,
                     0,0,-1,0,0,-1,
                     1,0,-1,0,0,0,
                     0,1,-1,0,0,0,
                     0,0,0,1,0,-1,
                     0,0,0,0,1,-1,
                     .25,.25,-.5,.25,.25,-.5,
                     1/3,1/3,1/3,-1/3,-1/3,-1/3),nrow=9, ncol=6, byrow=TRUE)

# Set the names of each row
row.names(comparison)<-c('1-4', '2-5', '3-6', '1-3', 
                         '2-3', '4-6', '5-6', 'Partial', 'Third')
# Set the column names
colnames(comparison)<- c('Condition_1', 'Condition_2', 'Condition_3', 
                         'Condition_4', 'Condition_5', 'Condition_6')

comparison
```

``` {r, results='hide', message=FALSE, warning=FALSE}
# test for specified condition comparisons only
model.results.contrast <- groupComparisonTMTPTM(data.ptm=quant.msstats.ptm,
                                       data.protein=quant.msstats.protein,
                                       contrast.matrix = comparison)

```

``` {r}
names(model.results.contrast)
ptm_model <- model.results.contrast[[1]]
protein_model <- model.results.contrast[[2]]
adjusted_model <- model.results.contrast[[3]]

head(adjusted_model)

```

The modeling function will return a list consisting of three dataframes.One each
for the PTM-level, Protein-level, and adjusted PTM-level group comparison 
result.

### 1.4 Example Volcano Plot

The models from the `groupComparisonTMTPTM` function can be used in the model 
visualization function, `groupComparisonPlots`, from the base MSstats. Below is 
a Volcano Plot for the Adjusted PTM model.
**Note: the input for groupComparisonPlots should be one data.frame from output of groupComparisonTMTPTM.**

``` {r}
groupComparisonPlots(data = adjusted_model,
                     type = 'VolcanoPlot',
                     ProteinName = FALSE,
                     which.Comparison = '1-4',
                     address = FALSE)

```


## 2. Analysis

### 2.1 How to adjust PTMs for changes in global protein levels?

In order to adjust the PTM abundance for global protein abundance there are two 
main steps. First the PTM and global protein datasets must be modeled and then 
the resulting model parameters are combined. The combination is done using the 
formulas below:

Log2FC: $\log_2FC_{PTM} - \log_2FC_{Protein}$

SE: $\sqrt{SE_{PTM}^2 + SE_{Protein}^2}$

DF: $(SE_{PTM}^2 + SE_{Protein}^2)^2 \biggm/ (\frac{SE_{PTM}^2}{DF_{PTM}} + \frac{SE_{Protein}^2}{DF_{Protein}})$

Please see the package MSstatsPTM for further explanation of adjustment strategy
and formulas.

### 2.2 Example PTM

``` {r}
dataProcessPlotsTMTPTM(data.ptm = raw.ptm,
                       data.protein = raw.protein,
                       data.ptm.summarization = quant.msstats.ptm,
                       data.protein.summarization = quant.msstats.protein,
                       type = 'ProfilePlot',
                       which.Protein = 'Protein_2391_Y40',
                       address = FALSE)

```

In the plots above the data points for the PTM `Protein_2391_Y40` and Protein 
`Protein_2391` are shown for all conditions. To take a look at a specific 
comparison, the input data can be filtered.

``` {r}
dataProcessPlotsTMTPTM(data.ptm = raw.ptm %>% filter(
  Condition %in% c('Condition_1', 'Condition_4')),
                       data.protein = raw.protein %>% filter(
                         Condition %in% c('Condition_1', 'Condition_4')),
                       data.ptm.summarization = quant.msstats.ptm %>% filter(
                         Condition %in% c('Condition_1', 'Condition_4')),
                       data.protein.summarization = 
    quant.msstats.protein %>% filter(Condition %in% c(
      'Condition_1', 'Condition_4')),
                       type = 'ProfilePlot',
                       which.Protein = 'Protein_2391_Y40',
                       originalPlot = FALSE,
                       address = FALSE)

model_df <- rbind(adjusted_model %>% filter(
  Protein == 'Protein_2391_Y40' & Label == '1-4') %>% select(-Tvalue),
                  ptm_model %>% filter(
                    Protein == 'Protein_2391_Y40' & Label == '1-4'
                    ) %>% select(-issue),
                  protein_model %>% filter(
                    Protein == 'Protein_2391' & Label == '1-4'
                    ) %>% select(-issue))
model_df <- data.frame(model_df)
rownames(model_df) <- c('Adjusted PTM', 'PTM', 'Protein')
model_df
```

The example above shows the added insight of adjusting for Protein level. 
Originally the PTM model shows a small positive log2FC of 0.144. However, the 
global protein abundance changed strongly in the negative direction, -0.508. 
When the PTM is adjusted for the global the log2FC is increased to 0.652, which
is a much larger absolute change than without adjustment. The change in 
abundance of this PTM may have been marked as insignificant without adjusting 
for protein levels. Note the change in SE and DF once the PTM model is adjusted.

Additionally, there are other situations where the PTM log2FC will be reduced
after adjustment.

``` {r}
dataProcessPlotsTMTPTM(data.ptm = raw.ptm %>% filter(
  Condition %in% c('Condition_2', 'Condition_5')),
                       data.protein = raw.protein %>% filter(
                         Condition %in% c('Condition_2', 'Condition_5')),
                       data.ptm.summarization = quant.msstats.ptm %>% 
    filter(Condition %in% c('Condition_2', 'Condition_5')),
                       data.protein.summarization = quant.msstats.protein %>% 
    filter(Condition %in% c('Condition_2', 'Condition_5')),
                       type = 'ProfilePlot',
                       which.Protein = 'Protein_1076_Y67',
                       originalPlot = FALSE,
                       address = FALSE)

model_df <- rbind(adjusted_model %>% filter(
  Protein == 'Protein_1076_Y67' & Label == '2-5') %>% select(-Tvalue),
                  ptm_model %>% filter(
                    Protein == 'Protein_1076_Y67' & Label == '2-5'
                    ) %>% select(-issue),
                  protein_model %>% filter(
                    Protein == 'Protein_1076' & Label == '2-5'
                    ) %>% select(-issue))

model_df <- data.frame(model_df)
rownames(model_df) <- c('Adjusted PTM', 'PTM', 'Protein')
model_df
```

**Note - In this example Protein_1076 was only available in one run of the global protein experiment.**

In this example the PTM model originally showed a large log2FC of 0.242, however
most of this log2FC was due to the global protein abundance change which was 
0.356. Once adjusted, the PTM shows a much smaller log2FC of -0.114. Without 
adjustment a potentially incorrect inference could have been drawn for this PTM.

From these two examples the added insight into PTM abundance change from protein
adjustment can be observed. If the goal of the experiment is to find the true 
abundance change due to PTMs than removing the effect of the protein is very 
important.

## Session information

```{r session}
sessionInfo()
```