---
title: "Processing quantitative proteomics data with QFeatures"
author:
- name: Laurent Gatto
package: QFeatures
abstract: >
  This vignette describes how to process quantitative mass
  spectrometry data with QFeatures: cleaning up unneeded feature
  variables, adding an experimental design, filtering out contaminants
  and reverse hits, managing missing values, log-transforming,
  normalising and aggregating data. This vignette is distributed under
  a CC BY-SA license.
output:
  BiocStyle::html_document:
    toc_float: true
bibliography: QFeatures.bib
vignette: >
  %\VignetteIndexEntry{Processing quantitative proteomics data with QFeatures}
  %\VignetteEngine{knitr::rmarkdown}
  %%\VignetteKeywords{Mass Spectrometry, MS, MSMS, Proteomics, Metabolomics, Infrastructure, Quantitative}
  %\VignetteEncoding{UTF-8}
---

```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
```

```{r env, message = FALSE, warning = FALSE, echo = FALSE}
library("QFeatures")
library("ggplot2")
library("dplyr")
```

# Reading data as `QFeatures`

We are going to use a subset of the CPTAC study 6 containing
conditions A and B [@Paulovich:2010]. The peptide-level data, as
processed by MaxQuant [@Cox:2008] is available in the `msdata`
package:


```{r msdata}
basename(f <- msdata::quant(pattern = "cptac", full.names = TRUE))
```

From the names of the columns, we see that the quantitative columns,
starting with `"Intensity."` (note the dot!) are at positions 56 to
61.

```{r cptac_cols}
x <- read.delim(f)
names(x)
(i <- grep("Intensity\\.", names(x)))
```

We now read these data using the `readQFeatures` function. The peptide
level expression data will be imported into R as an instance of class
`QFeatures` named `cptac` with an assay named `peptides`. We also use
the `fnames` argument to set the row-names of the `peptides` assay to
the peptide sequences.

```{r read_cptac}
library("QFeatures")
cptac <- readQFeatures(x, quantCols = i, name = "peptides", fnames = "Sequence")
cptac
```

# Encoding the experimental design

Below we update the sample (column) annotations to encode the two
groups, 6A and 6B, and the original sample numbers.

```{r}
cptac$group <- rep(c("6A", "6B"), each = 3)
cptac$sample <- rep(7:9, 2)
colData(cptac)
```

# Filtering out contaminants and reverse hits


```{r}
filterFeatures(cptac, ~ Reverse == "")
```
```{r}
filterFeatures(cptac, ~ Potential.contaminant == "")
```

```{r}
cptac <- cptac |>
    filterFeatures(~ Reverse == "") |>
    filterFeatures(~ Potential.contaminant == "")
```

# Removing up unneeded feature variables

The spreadsheet that was read above contained numerous variables that
are returned by MaxQuant, but not necessarily necessary in the frame
of a downstream statistical analysis.

```{r}
rowDataNames(cptac)
```

The only ones that we will be needing below are the peptides sequences
and the protein identifiers. Below, we store these variables of
interest and filter them using the `selectRowData` function.

```{r}
rowvars <- c("Sequence", "Proteins", "Leading.razor.protein")
cptac <- selectRowData(cptac, rowvars)
rowDataNames(cptac)
```

# Managing missing values

Missing values can be very numerous in certain proteomics experiments
and need to be dealt with carefully. The first step is to assess their
presence across samples and features. But before being able to do so,
we need to replace 0 by `NA`, given that MaxQuant encodes missing data
with a 0 using the `zeroIsNA` function.

```{r}
cptac <- zeroIsNA(cptac, i = seq_along(cptac))
nNA(cptac, i = seq_along(cptac))
```

The output of the `nNA` function tells us that

- there are currently close to 50% is missing values in the data;
- there are 4051 peptides with 0 missing values, 989 with a single
  missing values, ... and 3014 peptides composed of only missing
  values;
- the range of missing values in the 6 samples is comparable and
  ranges between 4651 and 5470.

In this dataset, we have such a high number of peptides without any
data because the 6 samples are a subset of a larger dataset, and these
peptides happened to be absent in groups A and B. Below, we use
`filterNA` to remove all the peptides that contain one or more missing
values by using `pNA = 0` (which also is the default value).


```{r}
cptac <- filterNA(cptac, i = seq_along(cptac), pNA = 0)
cptac
```

I we wanted to keep peptides that have up to 90% of missing values,
corresponsing in this case to those that have only one value (i.e 5/6
percent of missing values), we could have set `pNA` to 0.9.

# Counting unique features

Counting the number of unique features across samples can be used for
quality control or for assessing the identification efficiency between
different conditions or experimental set-ups. `countUniqueFeatures`
can be used to count the number of features that are contained in each
sample of an assay from a `QFeatures` object. For instance, we can
count the number of (non-missing) peptides per sample from the
`peptides` assay. Note that the counts are automatically stored in the
`colData` of `cptac`, under `peptide_counts`:


```{r count_peptides}
cptac <- countUniqueFeatures(cptac,
                             i = "peptides",
                             colDataName = "peptide_counts")
colData(cptac)
```

We can also count the number of unique proteins. We therefore need to
tell `countUniqueFeatures` that we need to group by protein (the
protein name is stored in the `rowData` under `Proteins`):

```{r count_proteins}
cptac <- countUniqueFeatures(cptac,
                             i = "peptides",
                             groupBy = "Proteins",
                             colDataName = "protein_counts")
colData(cptac)
```

# Imputation

The `impute` method can be used to perform missing value imputation
using a variety of imputation methods. The method takes an instance of
class `QFeatures` (or a `SummarizedExperiment`) as input, an a
character naming the desired method (see `?impute` for the complete
list with details) and returns a new instance of class `QFeatures` (or
`SummarizedExperiment`) with imputed data.

As described in more details in [@Lazar:2016], there are two types
of mechanisms resulting in missing values in LC/MSMS experiments.

* Missing values resulting from absence of detection of a feature,
  despite ions being present at detectable concentrations.  For
  example in the case of ion suppression or as a result from the
  stochastic, data-dependent nature of the MS acquisition
  method. These missing value are expected to be randomly distributed
  in the data and are defined as *missing at random* (MAR) or *missing
  completely at random* (MCAR).

* Biologically relevant missing values, resulting from the *absence*
  of the low abundance of ions (below the limit of detection of the
  instrument). These missing values are not expected to be randomly
  distributed in the data and are defined as *missing not at random*
  (MNAR).


MAR and MCAR values can be reasonably well tackled by many imputation
methods. MNAR data, however, requires some knowledge about the
underlying mechanism that generates the missing data, to be able to
attempt data imputation. MNAR features should ideally be imputed with
a *left-censor* (for example using a deterministic or
probabilistic minimum value) method. Conversely, it is recommended to
use *hot deck* methods (for example nearest neighbour, maximum
likelihood, etc) when data are missing at random.

```{r miximp, echo = FALSE, fig.cap = "Mixed imputation method. Black cells represent presence of quantitation values and light grey corresponds to missing data. The two groups of interest are depicted in green and blue along the heatmap columns. Two classes of proteins are annotated on the left: yellow are proteins with randomly occurring missing values (if any) while proteins in brown are candidates for non-random missing value imputation."}
data(se_na2)
x <- assay(impute(se_na2, "zero"))
x[x != 0] <- 1
suppressPackageStartupMessages(library("gplots"))
heatmap.2(x, col = c("lightgray", "black"),
          scale = "none", dendrogram = "none",
          trace = "none", keysize = 0.5, key = FALSE,
          RowSideColors = ifelse(rowData(se_na2)$randna, "orange", "brown"),
          ColSideColors = rep(c("steelblue", "darkolivegreen"), each = 8))
```

It is anticipated that the identification of both classes of missing
values will depend on various factors, such as feature intensities and
experimental design. Below, we use perform mixed imputation, applying
nearest neighbour imputation on the `r sum(rowData(se_na2)$randna)`
features that are assumed to contain randomly distributed missing
values (if any) (yellow on figure \@ref(fig:miximp)) and a
deterministic minimum value imputation on the
`r sum(!rowData(se_na2)$randna)` proteins that display a non-random pattern
of missing values (brown on figure \@ref(fig:miximp)).


# Data transformation

When analysing continuous data using parametric methods (such as
t-test or linear models), it is often necessary to log-transform the
data. The figure below (left) show that how our data is mainly composed of
small values with a long tail of larger ones, which is a typical
pattern of quantitative omics data.

Below, we use the `logTransform` function to log2-transform our
data. This time, instead of overwriting the peptides assay, we are
going to create a new one to contain the log2-transformed data.

```{r}
cptac <- addAssay(cptac,
                  logTransform(cptac[[1]]),
                  name = "peptides_log")
cptac
```

The `addAssay()` function is the general function that adds new assays
to a `QFeatures` object. The step above could also be fun with the
following syntax, that implicitly returns an updated `QFeatures`
object.

```{r, eval = FALSE}
logTransform(cptac,
             i = "peptides",
             name = "log_peptides")
```

```{r, fig.cap = "Quantitative data in its original scale (left) and log2-transformed (right)."}
par(mfrow = c(1, 2))
limma::plotDensities(assay(cptac[[1]]))
limma::plotDensities(assay(cptac[[2]]))
```

# Normalisation

Assays in `QFeatures` objects can be normalised with the `normalize`
 function. The type of normalisation is defined by the `method`
 argument; below, we use quantile normalisation, store the normalised
 data into a new experiment, and visualise the resulting data.

```{r}
cptac <- addAssay(cptac,
                  normalize(cptac[["peptides_log"]], method = "center.median"),
                  name = "peptides_norm")
cptac
```

As above, the `normalize()` function can also be firectly applied to
the `QFeatures` object.

```{r, eval = FALSE}
normalize(cptac,
          i = "log_peptides",
          name = "lognorm_peptides",
          method = "center.median")
```


```{r, fig.cap = "Distribution of log2 peptide intensities before (left) and after (right) median normalisation."}
par(mfrow = c(1, 2))
limma::plotDensities(assay(cptac[["peptides_log"]]))
limma::plotDensities(assay(cptac[["peptides_norm"]]))
```

# Feature aggregation

At this stage, it is possible to directly use the peptide-level
intensities to perform a statistical analysis [@Goeminne:2016], or
aggregate the peptide-level data into protein intensities, and perform
the differential expression analysis at the protein level.

To aggregate feature data, we can use the `aggregateFeatures` function
that takes the following inputs:

- the name of the `QFeatures` instance that contains the peptide
  quantitation data - `"cptac"` in our example;
- **`i`**: the name or index of the assay that contains the
  (normalised) peptide quantitation data - `"peptides_norm"` in our
  case;
- **`fcol`**: the feature variable (in the assay above) to be used to
  define what peptides to aggregate - `"Proteins"` here, given that we
  want to aggregate all peptides that belong to one protein (group);
- **`name`**: the name of the new aggregates assay - `"proteins"` in this case;
- and finally **`fun`**, the function that will compute this
  aggregation - we will be using the default value, namely
  `robustSummary` [@Sticker:2019].


```{r, warning = FALSE}
cptac <- aggregateFeatures(cptac, i = "peptides_norm", fcol = "Proteins", name = "proteins")
cptac
```

We obtain a final 1125 quantified proteins in the new `proteins`
assay. Below, we display the quantitation data for the first 6
proteins and their respective variables. The latter shown that number
of peptides that were using during the aggregation step (`.n` column).

```{r}
head(assay(cptac[["proteins"]]))
rowData(cptac[["proteins"]])
```

We can get a quick overview of this `.n` variable by computing the
table below, that shows us that we have 405 proteins that are based on
a single peptides, 230 that are based on two, 119 that are based on
three, ... and a single protein that is the results of aggregating 44
peptides.

```{r}
table(rowData(cptac[["proteins"]])$.n)
```

Let's choose `P02787ups|TRFE_HUMAN_UPS` and visualise its expression
pattern in the 2 groups at the protein and peptide level.

```{r, message = TRUE, fig.cap = "Expression intensities for the protein *P02787ups|TRFE_HUMAN_UPS* (right, green) and its peptides (left) in groups A (circles) and B (triangles)."}
library("ggplot2")
library("dplyr")
longFormat(cptac["P02787ups|TRFE_HUMAN_UPS", ]) |>
    as.data.frame() |>
    mutate(group = ifelse(grepl("A", colname), "A", "B")) |>
    mutate(sample = sub("Intensity\\.", "", colname)) |>
    ggplot(aes(x = sample, y = value, colour = rowname, shape = group)) +
    geom_point() +
    facet_grid(~ assay)
```

# TODO

- Improve on data visualisation.

# See also

- The
  [QFeaturesWorkshop2020](https://lgatto.github.io/QFeaturesWorkshop2020/index.html)
  workshop, presented at the EuroBioc2020 meeting. It also documents
  how to use a custom docker container to run the workshop code.

- The [Quantitative proteomics data
  analysis](https://uclouvain-cbio.github.io/WSBIM2122/sec-prot.html)
  chapter of the WSBIM2122 course.

# Session information {-}

```{r sessioninfo, echo=FALSE}
sessionInfo()
```


# References {-}