---
title: "Single Cell Proteomics data processing and analysis."
author:
  - name: Laurent Gatto
  - name: Christophe Vanderaa
output:
  BiocStyle::html_document:
    self_contained: yes
    toc: true
    toc_float: true
    toc_depth: 2
    code_folding: show
bibliography: scp.bib
date: "`r BiocStyle::doc_date()`"
package: "`r BiocStyle::pkg_ver('scp')`"
vignette: >
  %\VignetteIndexEntry{Single Cell Proteomics data processing and analysis}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    crop = NULL ## Related to https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html
)
```


# The `scp` package

The `scp` package is used to process and analyse mass
spectrometry-based single cell proteomics (SCP) data. It relies on the
[`QFeatures`](https://rformassspectrometry.github.io/QFeatures/)
(@QFeatures) package to manage and process
[`SingleCellExperiment`](http://bioconductor.org/packages/release/bioc/html/SingleCellExperiment.html)
(@sce) objects.

```{r scp_framework, results='markup', fig.cap="`scp` relies on `SingleCellExperiment` and `QFeatures` objects.", echo=FALSE, out.width='100%', fig.align='center'}
knitr::include_graphics("./figures/SCP_framework.png", error = FALSE)
```

This vignette will guide you through the common steps of mass
spectrometry-based single-cell proteomics data analysis. To start, we
load the `scp` package.

```{r load_scp, message = FALSE}
library("scp")
```

We also load `ggplot2`, `magrittr` and `dplyr` for convenient data
manipulation and plotting.

```{r load_other_package, message = FALSE}
library("ggplot2")
library("magrittr")
library("dplyr")
```


# Read in SCP data

The workflow starts with reading in the tabular quantification data
generated by, for example, MaxQuant (@Cox:2008). We created a small
example data by subsetting the MaxQuant table provided in the SCoPE2
preprint (@Specht2019-jm). The `mqScpData` table is a typical example
of what you would get after reading in a CSV file using `read.csv` or
`read.table`. See `?mqScpData` for more information about the table
content.

```{r load_mqScpData}
data("mqScpData")
dim(mqScpData)
mqScpData[1:10, 1:5]
```

In order to convert this tabular data to a `scp`-compatible
`QFeatures` object, we need to provide a **metadata table** where rows
contain sample information and columns must contain at least:

* The name of the batch the sample was acquired in
* The name of the channel the sample was acquired in

Any additional information about the samples will be stored in the
`colData`.

We provide an example of such a data frame. It was formatted from the
annotation table provided in the SCoPE2 preprint. See
`?sampleAnnotation` for more information about the table content.

```{r load_sampleAnnotation}
data("sampleAnnotation")
sampleAnnotation
```

The two tables are supplied to the `readSCP` function.

```{r readSCP}
scp <- readSCP(quantTable = mqScpData,
               metaTable = sampleAnnotation,
               channelCol = "Channel",
               batchCol = "Set")
```

As indicated by the output on the console, `readSCP` proceeds as follows:

1. If `quantTable` is the path to a CSV file, it reads the file using
   `read.csv`. The table is converted to a `SingleCellExperiment`
   object. `readSCP` needs to know in which field(s) the quantative
   data is stored. Those field name(s) is/are provided by the
   `channelCol` field in the `metaData` table. So in this example, the
   column names hodling the quantitative data in `mqScpData` are
   stored in the column named `Channel` in `sampleAnnotation`.

2. The `SingleCellExperiment` object is then split according to
   batch. The split is performed depending on the `batchCol` field in
   `quantTable`, in this case the data will be split according to the
   `Set` column in `mqScpData`.

3. The sample metadata is generated from the `metaTable`. Note that in
   order for `readSCP` to correctly match the feature data with the
   metadata, `metaTable` must also contain the `batchCol` field with
   batch names.

4. Finally, the split feature data and the sample metadata are stored
   in a single `QFeatures` object.

Here is a compact overview of the data:

```{r overview}
scp
```

We can see that the `scp` object we created is a `QFeatures` object
containing 4 assays. Each assay has an associated name, this is the
batch name that was used for splitting. We can also see that each
assay is a `SingleCellExperiment` object. The rows represent the
peptide to spectrum matches (PSMs), the number vary depending on the
batch. Finally, all three assays contains 16 columns that correspond
to the 16 TMT channels recorded during the 4 MS runs.

# Clean missing data

Single-cell (proteomics or transcriptomics) data contains many
zeros. The zeros can be biological zeros or technical zeros and
differentiating between the two types is very hard. To avoid artefacts
in dowstream steps, we replace the zeros by the missing value
`NA`. The `zeroIsNA` function takes the `QFeatures` object and the
name(s) or index/indices of the assay(s) to clean and automatically
replaces any zero in the selected quantitative data by `NA`.

```{r zeroIsNA}
scp <- zeroIsNA(scp, i = 1:4)
```


# Filter PSMs

A common steps in SCP is to filter out low-confidence PSMs. Each PSM
assay contains feature meta-information that are stored in the assay's
`rowData`. The `QFeatures` package allows to quickly filter the rows
of an assay by using these information. The available variables in the
`rowData` are listed below for each assay.

```{r rowDataNames}
rowDataNames(scp)
```

## Filter features based on feature metadata

Below are some examples of criteria that are used to identify
low-confidence.  The information is readily available since this was
computed by MaxQuant:

- Remove PSMs that are matched to contaminants
- Remove PSMs that are matched to the decoy database
- Keep PSMs that exhibit a high PIF (parental ion fraction),
  indicative of the purity of a spectrum

We can perform this filtering using the
[`filterFeatures` function](https://rformassspectrometry.github.io/QFeatures/reference/QFeatures-filtering.html).
`filterFeatures` automatically accesses the feature metadata and
selects the rows that meet the provided condition(s). For instance,
`Reverse != "+"` keeps the rows for which the `Reverse` variable in
the `rowData` is not `"+"` (*i.e.* the PSM is not matched to the decoy
database).

```{r filter_psms}
scp <- filterFeatures(scp,
                      ~ Reverse != "+" &
                        !grepl("REV|CON", protein) &
                        Potential.contaminant != "+" &
                        !is.na(PIF) & PIF > 0.8)
```

## Filter assays based on detected features

To avoid proceeding with failed runs, another interesting filter is to
remove assays with too few features. If a batch contains less than,
for example, 150 features we can then suspect something wrong happened
in that batch and it should be removed. Using `dims`, we can query the
dimensions (hence the number of features and the number of samples) of
all assays contained in the dataset.

```{r dims}
dims(scp)
```

Actually, a `QFeatures` object can be seen as a three-order array:
$features \times samples \times assay$. Hence, `QFeatures` supports
three-order subsetting `x[rows, columns, assays]`. We first select the
assay that have sufficient PSMs (the number of rows is greater than
150), and then subset the `scp` object for the assays that meet the
criterion.

```{r filter_assays}
keepAssay <- dims(scp)[1, ] > 150
scp <- scp[, , keepAssay]
scp
```

Notice the `190321S_LCA10_X_FP97_blank_01` sample was removed because
it did not contain sufficient features, as expected from a blank
sample. This could also have been the case for failed runs.

## Filter features based on SCP metrics

Another type of filtering is specific to SCP. In the SCoPE2 analysis,
the authors suggest a filters based on the sample to carrier ratio
(SCR), that is the reporter ion intensity of a single-cell sample
divided by the reporter ion intensity of the carrier channel (200
cells) from the same batch. It is expected that the carrier
intensities are much higher than the single-cell intensities.

The SCR can be computed using the `computeSCR` function from
`scp`. The function must be told which channels are the samples that
must be divided and which channel contains the carrier. This
information is provided in the sample metadata and is accessed using
the `colData`, under the `SampleType` field.

```{r show_annotation}
colData(scp)[, "SampleType"] %>%
  table
```

In this dataset, `SampleType` gives the type of sample that is present
in each TMT channel. The SCoPE2 protocole includes 5 types of samples:

* The carrier channels (`Carrier`) contain 200 cell equivalents and
  are meant to boost the peptide identification rate.
* The normalization channels (`Reference`) contain 5 cell equivalents
  and are used to partially correct for between-run variation.
* The unused channels (`Unused`) are channels that are left empty due
  to isotopic cross-contamination by the carrier channel.
* The blank channels (`Blank`) contain samples that do not contain any
  cell but are processed as single-cell samples.
* The single-cell sample channels contain the single-cell samples of
  interest, that are macrophage (`Macrophage`) or monocyte
  (`Monocyte`).

The `computeSCR` function expects the following input:

* The `QFeatures` dataset
* The assay name(s) or index/indices for which the SCR should be
  computed
* The sample metadata variable pointing to the channel annotation
* A string pattern (following regular expression syntax) that uniquely
  identifies the carrier channel in each batch
* A string pattern (following regular expression syntax) that
  identifies the samples to divide

The function creates an field `.meanSCR` and stores it in the
`rowData` of each assay.

```{r computeSCR}
scp <- computeSCR(scp,
                  i = 1:3,
                  colDataCol = "SampleType",
                  carrierPattern = "Carrier",
                  samplePattern = "Macrophage|Monocyte")
```

Before applying the filter, we plot the distribution of the average
SCR. We can extract the `.meanSCR` variable from the `rowData` of
several assays using the `rowDataToDF`. It takes the `rowData`
field(s) of interest and returns a `DataFrame` table.

```{r plot_SCR, warning=FALSE, message=FALSE}
scp %>%
  rowDataToDF(i = 1:3,
              vars = ".meanSCR") %>%
  data.frame %>%
  ggplot(aes(x = .meanSCR)) +
  geom_histogram() +
  geom_vline(xintercept = 0.1)
```

Most values are close to 0.02 as expected since the experimental ratio
is 1/200.  There are a few point that have higher signal than
expected. We therefore filter out those points using a cutoff of 0.1
using again the `filterFeatures` functions.

```{r filter_SCR}
scp <- filterFeatures(scp,
                      ~ !is.na(.meanSCR) &
                        .meanSCR < 0.1)
```

## Filter out PSMs with high false discovery rate

Finally, a last PSM filter criterion is the false discovery rate (FDR)
for identification. Filtering on the PEP is too conservative
(@Kall2008-hb) so we provide the `computeFDR` function to convert PEPs
to FDR. Beside the dataset and the assay(s) for which to compute the
FDR, we also need to give the feature grouping variable, here the
peptide sequence, and the variable containing the PEPs. Those are
contained in the feature metadata.

```{r computeFDR}
scp <- computeFDR(scp,
                  i = 1:3,
                  groupCol = "peptide",
                  pepCol = "dart_PEP")
```

Note that a new variable `.FDR` containing the computed FDRs is added
to the `rowData`. We filter the PSMs that have an associated peptide
FDR smaller than 1\%.

```{r filter_FDR}
scp <- filterFeatures(scp,
                      ~ .FDR < 0.01)
```

# Process the PSM data

## Relative reporter ion intensity

In order to partialy correct for between-run variation, SCoPE2
suggests computing relative reporter ion intensities. This means that
intensities measured for single-cells are divided by the reference
channel containing 5-cell equivalents. We use the `divideByReference`
function that divides channels of interest by the reference
channel. Similarly to `computeSCR`, we can point to the samples and
the reference columns in each assay using the annotation contained in
the `colData`.

We here divide all columns (using the regular expression wildcard `.`)
by the reference channel (`Reference`).

```{r divideByReference}
scp <- divideByReference(scp,
                         i = 1:3,
                         colDataCol = "SampleType",
                         samplePattern = ".",
                         refPattern = "Reference")
```

# Aggregate PSM data to peptide data

Now that the PSM assays are processed, we can aggregate them to
peptides. This is performed using the `aggregateFeaturesOverAssays`
function. For each assay, the function aggregates several PSMs into a
unique peptide. This is best illustrated by the figure below.

```{r features_aggregation, results = 'markup', fig.cap = "Conceptual illustration of feature aggregation.", echo=FALSE, out.width='100%', fig.align='center'}
knitr::include_graphics("./figures/feature_aggregation.png", error = FALSE)
```

Remember there currently are three assays containing the PSM data.

```{r show_agg_psms}
scp
```

The PSMs are aggregated over the `fcol` feature variable, here
peptides. We also need to supply an aggregating function that will
tell how to combine the quantitative data of the PSMs to aggregate. We
here use the median value. We name the aggregated assays using the
original names and appending `peptides_` at the start.

```{r aggregate_peptides, message = FALSE}
scp <- aggregateFeaturesOverAssays(scp,
                                   i = 1:3,
                                   fcol = "peptide",
                                   name = paste0("peptides_", names(scp)),
                                   fun = matrixStats::colMedians, na.rm = TRUE)
```

Notice that 3 new assays were created in the `scp` object. Those new
assays contain the aggregated features while the three first assays
are unchanged. This allows to keep track of the data processing.

```{r show_agg_peptides}
scp
```


Under the hood, the `QFeatures` architecture preserves the
relationship between the aggregated assays. See `?AssayLinks` for more
information on relationships between assays.

# Join the SCoPE2 sets in one assay

Up to now, we kept the data belonging to each MS run in separate
assays. We now combine all batches into a single assay. This is done
using the `joinAssays` function from the `QFeatures` package. Note
that we now use the aggregated assays, so assay 4 to 6.

```{r joinAssays}
scp <- joinAssays(scp,
                  i = 4:6,
                  name = "peptides")
```

In this case, one new assay is created in the `scp` object that
combines the data from assay 4 to 6. The samples are always distinct
so the number of column in the new assay (here $48$) will always
equals the sum of the columns in the assays to join (here $16 + 16 +
16$). The feature in the joined assay might contain less features than
the sum of the rows of the assays to join since common features
between assays are joined in a single row.

```{r show_join}
scp
```

# Filter single-cells

Another common step in single-cell data analysis pipelines is to
remove low-quality cells. After subseting for the samples of interest,
we will use 2 metrics: the median relative intensities per cell and
the median coefficient of variation (CV) per cell.

## Filter samples of interest

We first subset the cells of interest, that is the blank samples
(`sc_0`), the macrophages (`sc_m0`) and the monocytes (`sc_u`).

We extract the `SingleCellExperiment` assay with the joined peptides,
subset the channels corresponding to blank, macrophage or monocytes
and add it as a new assay in the `QFeatures` object. However, the
sample annotation is contained in the `colData` of the `QFeatures`
dataset, but we need to access it from the `SingeCellExperiment`
object. Therefore, we provide the `transferColDataToAssay` to copy the
sample metadata from the `QFeatures` to a target assay.

```{r transferColDataToAssay}
colData(scp[["peptides"]])
scp <- transferColDataToAssay(scp, "peptides")
colData(scp[["peptides"]])
```

Once the metadata is transfered, we can subset the
`SingleCellExperiment` assay.

```{r subset_single_cells}
sce <- scp[["peptides"]]
sce <- sce[, sce$SampleType %in% c("Blank", "Macrophage", "Monocyte")]
```

Then we add this assay back into the `QFeatures` object. This is done
using the `addAssay` function. We also preserve the links between
dfeatures using the `addAssayLinkOneToOne`. Since the features did not
change (only the samples did), one-to-one links between features are
added.

```{r addAssay_filter1}
scp %>%
  addAssay(y = sce,
           name = "peptides_filter1") %>%
  addAssayLinkOneToOne(from = "peptides",
                       to = "peptides_filter1") ->
  scp
```

Note the added assay `peptides_filter1` contains the same number of
rows than its parent assay `peptides`, but less samples as 20 samples
are not single-cells or blank samples bu rather unused, reference or
carrier samples.

```{r show_cell_filter}
scp
```


## Filter based on the median relative intensity

We compute the median relative reporter ion intensity for each cell
separately and apply a filter based on this statistic. This procedure
recalls that of library size filtering commonly performed in scRNA-Seq
data analysis, where the library size is the sum of the counts in each
single cell. We will store the median intensity in the `colData` of
the `peptides_filter1` assay (so the `SingleCellExperiment` object)
because this metric is specific to that assay.  The medians are
computed on the quantitative data using `colMedians` that requires a
data matrix. The data matrix can be extracted from a
`SingleCellExperiment` using the `assay` function.

```{r compute_colMedians}
scp[["peptides_filter1"]] %>%
  assay %>%
  matrixStats::colMedians(na.rm = TRUE) ->
  scp[["peptides_filter1"]]$MedianRI
```

Looking at the distribution of the median per cell can highlight
low-quality cells.

```{r plot_medianRI, warning=FALSE, message=FALSE}
scp[["peptides_filter1"]] %>%
  colData %>%
  data.frame %>%
  ggplot(aes(x = MedianRI, fill = SampleType)) +
  geom_boxplot() +
  scale_x_log10()
```

Here all values seems reasonable so no filtering is needed. If it were
not the case, the same procedure as the previous section can be used
for selecting the cells that have an associated median RI lower that a
defined threshold.

## Filter based on the median CV

The median CV measures the consistency of quantification for a group
of peptides that belong to a protein. We remove cells that exhibit
high median CV over the different proteins. We compute the median CV
per cell using the `computeMedianCV` function from the `scp`
package. The function takes the `peptides_filter1` assay, protein and
peptide information from the assay `rowData`, and the batch names in
the `colData` of the `QFeatures` object. The peptide and batch
information are required to perform normalization prior to the CV
computation. Protein information is required since the CV are computed
at the protein level. The computed median CVs are automatically stored
in the `colData` of the `SingleCellExperiment` assay under the
`.medianCV` field

```{r computeMedianCV}
scp <- computeMedianCV(scp,
                       i = "peptides_filter1",
                       proteinCol = "protein",
                       peptideCol = "peptide",
                       batchCol = "Set")
```

The computed CVs are stored in the `colData` of the `peptides_filter1`
assay and holds the median CV per cell computed using at least 5
observations (peptides).  The main interest of computing the median CV
per cell is to filter cells with reliable quantification. The blank
samples are not expected to have reliable quantifications and hence
can be used to estimate an empirical null distribution of the CV. This
distribution helps defining a threshold that filters out single-cells
that contain noisy quantification.

```{r plot_medianCV, message = FALSE, warning = FALSE}
scp[["peptides_filter1"]] %>%
  colData %>%
  data.frame %>%
  ggplot(aes(x = .MedianCV,
             fill = SampleType)) +
  geom_boxplot() +
  geom_vline(xintercept = 0.3)
```

We can see that the protein quantifiation for single-cells are much
more consistent within cells than witin the blank channels. Based on
the distribution of the blanks, we decide to keep the cells that have
a median CV lower than 0.3. Note this example is inaccurate because
the null distribution is based on only 3 blank samples, but more sets
could lead to a better estimation of the CV null distribution. Note we
also keep only the samples of interest (macrophages and monocytes)
since all QC metrics are now computed.

```{r create_filter2}
keepSample <- scp[["peptides_filter1"]]$.MedianCV < 0.3 &
  scp[["peptides_filter1"]]$SampleType %in% c("Macrophage", "Monocyte")
keepSample[is.na(keepSample)] <- FALSE
```

Then, we subset the assay for samples that pass the filter.

```{r apply_filter2}
sce <- scp[["peptides_filter1"]]
sce <- sce[, keepSample]
```

Finally, we add the filtered assay to the `QFeatures`object and
created the required linking with the previous assay.

```{r add_filter2}
addAssay(scp,
         y = sce,
         name = "peptides_filter2") %>%
  addAssayLinkOneToOne(from = "peptides_filter1",
                       to = "peptides_filter2") ->
  scp
```

# Process the peptide data

In this vignette, the peptide data are further processed before
aggregation to proteins. The steps are: normalization, filter peptides
based on missing data and log-transformation.

## Normalization

The columns (samples) of the peptide data are first normalized by
dividing the relative intensities by the median relative
intensities. Then, the rows (peptides) are normalized by dividing the
relative intensities by the mean relative intensities. The normalized
data is stored in a separate assay.  This normalization procedure is
suggested in the SCoPE2 analysis and is applied using the `sweep`
method. Beside the dataset and the assay to normalize, the method
expects a `MARGIN`, that is either row-wise (`1`) or column-wise (`2`)
transformation, the `FUN` function to apply and `STATS`, a vector of
values to apply. More conventional normalization procedure can be
found in `?QFeatures::normalize`.

```{r normalize_scale}
scp %>%
    ## Divide columns by median
    sweep(i = "peptides_filter2",
        MARGIN = 2,
        FUN = "/",
        STATS = colMedians(assay(scp[["peptides_filter2"]]),
                           na.rm = TRUE),
        name = "peptides_norm_col") %>%
    ## Divide rows by mean
    sweep(i = "peptides_norm_col",
        MARGIN = 1,
        FUN = "/",
        STATS = rowMeans(assay(.[["peptides_norm_col"]]),
                         na.rm = TRUE),
        name = "peptides_norm") ->
    scp
```

Notice each call to `sweep` created a new assay.

```{r show_sweep}
scp
```


## Remove peptides with high missing rate

Peptides that contain many missing values are not
informative. Therefore, another common procedure is to remove higly
missing data. In this example, we remove peptides with more than 99 \%
missing data. This is done using the `filterNA` function from
`QFeatures`.

```{r filterNA}
scp <- filterNA(scp,
                i = "peptides_norm",
                pNA = 0.99)
```

## Log-transformation

In this vignette, we perform log2-transformation using the
`logTransform` method from `QFeatures`. Other log-transformation can
be applied by changing the `base` argument.

```{r logTransform}
scp <- logTransform(scp,
                    base = 2,
                    i = "peptides_norm",
                    name = "peptides_log")
```

Similarly to `sweep`, `logTransform` creates a new assay in `scp`.

# Aggregate peptide data to protein data

Similarly to aggregating PSM data to peptide data, we can aggregate
peptide data to protein data using the `aggregateFeatures` function.

```{r aggregate_proteins}
scp <- aggregateFeatures(scp,
                         i = "peptides_log",
                         name = "proteins",
                         fcol = "protein",
                         fun = matrixStats::colMedians, na.rm = TRUE)
```

The only difference between `aggregateFeatures` and
`aggregateFeaturesOverAssays` is that the second function can
aggregate several assay at once whereas the former only takes one
assay to aggregate. Hence, only a single assay, `proteins`, was
created in the `scp` object.

```{r show_agg_proteins}
scp
```

After the second aggregation, the `proteins` assay in this example
contains quantitative information for 89 proteins in 15 single-cells.

# Process the protein data

The protein data is further processed in three steps: normalization,
imputation (using the KNN algorithm) and batch correction (using the
`ComBat` algorithm).

## Normalization

Normalization is performed similarly to peptide normalization. We use
the same functions, but since the data were log-transformed at the
peptide level, we subtract by the statistic (median or mean) instead
of dividing.

```{r normalize_center}
scp %>%
    ## Center columns with median
    sweep(i = "proteins",
          MARGIN = 2,
          FUN = "-",
          STATS = colMedians(assay(scp[["proteins"]]),
                             na.rm = TRUE),
          name = "proteins_norm_col") %>%
    ## Center rows with mean
    sweep(i = "proteins_norm_col",
          MARGIN = 1,
          FUN = "-",
          STATS = rowMeans(assay(.[["proteins_norm_col"]]),
                           na.rm = TRUE),
          name = "proteins_norm") ->
    scp
```


## Imputation

The protein data contains a lot of missing values.

```{r missingness}
scp[["proteins_norm"]] %>%
  assay %>%
  is.na %>%
  mean
```

The average missingness in the `proteins` assay is around 25
\%. Including more samples and hence more batches can increase the
missingness up to 70 \% as seen for the complete SCoPE2 dataset
(@Specht2019-jm). Whether imputation is beneficial or deleterious for
the data will not be discussed in this vignette.  But taking those
missing value into account is essential to avoid artefacts in
downstream analyses. The data imputation is performed using the K
nearest neighbors algorithm, with k = 3. This is available from the
`impute` mehtod.  More details about the arguments can be found in
`?impute::impute.knn`.

```{r impute}
scp <- impute(scp,
              i = "proteins_norm",
              method = "knn",
              k = 3, rowmax = 1, colmax= 1,
              maxp = Inf, rng.seed = 1234)
```

Note that after imputation, no value are missing.

```{r missingness_imputed}
scp[["proteins_norm"]] %>%
  assay %>%
  is.na %>%
  mean
```


## Batch correction

A very important step for processing SCP data is to correct for batch
effects.  Batch effects are caused by technical variation occuring
during different MS runs. Since only a small number of single-cells
can be acquired at once, batch effects are unavoidable.

The `ComBat` function from the `sva` package can be used to perform
batch correction as it is performed in the SCoPE2 analysis. We do not
claim that `ComBat` is the best algorithm for batch correcting SCP
data and other batch correcting methods could be used using the same
procedure.

We first extract the assay to process.

```{r transferColDataToAssay_proteins}
scp <- transferColDataToAssay(scp, i = "proteins_norm")
sce <- scp[["proteins_norm"]]
```

Next, we need to provide a design matrix and the batch annotation to
`Combat`.  The design matrix allows to protect variables of interest,
in our case `SampleType`.

```{r prepare_batch_correction}
batch <- colData(sce)$Set
model <- model.matrix(~ SampleType, data = colData(sce))
```

We then load and call `ComBat` and overwrite the data matrix. Recall
the data matrix can be accessed using the `assay` function.

```{r compute_batch_correction, results = 'hide', message = FALSE}
library(sva)
assay(sce) <- ComBat(dat = assay(sce),
                     batch = batch,
                     mod = model)
```

Finally, we add the batch corrected assay to the `QFeatures` object
and create the feature links.

```{r add_batch_correction}
addAssay(scp,
         y = sce,
         name = "proteins_batchC") %>%
  addAssayLinkOneToOne(from = "proteins_norm",
                       to = "proteins_batchC") ->
  scp
```

# Dimension reduction

Because each assay contains `SingelCellExperiment` objects, we can
easily apply methods developed in the scRNA-Seq field. A useful
package for dimension reduction on single-cell data is the `scater`.

```{r load_scater}
library(scater)
```

This package provides streamline functions to computes various
dimension reduction such as PCA, UMAP, t-SNE, NMF, MDS, ....

## PCA

PCA can be computed using the `runPCA` method. It returns a
`SingleCellExperiment` object for which the dimension reduction
results are stored in the `reducedDim` slot.

```{r run_PCA}
scp[["proteins_batchC"]] <- runPCA(scp[["proteins_batchC"]],
                                   ncomponents = 5,
                                   ntop = Inf,
                                   scale = TRUE,
                                   exprs_values = 1,
                                   name = "PCA")
```

The computed PCA can be displayed using the `plotReducedDim` function. The
`dimred` arguments should give the name of the dimension reduction results to
plot, here we called it `PCA`. The samples are colored by type of sample.

```{r plot_PCA}
plotReducedDim(scp[["proteins_batchC"]],
               dimred = "PCA",
               colour_by = "SampleType",
               point_alpha = 1)
```

This is a minimalistic example with only a few plotted cells, but the
original SCoPE2 dataset contained more than thousand cells.

## UMAP

Similarly to PCA, we can compute a UMAP using the `runUMAP`
method. Note however that the UMAP implementation requires a
initialization, usually provided by PCA. The previous PCA results are
used automatically when supplying `dimred = "PCA"` (`PCA` is the name
of the dimension reduction result that we supplied in the previous
section).

```{r run_UMAP}
scp[["proteins_batchC"]] <- runUMAP(scp[["proteins_batchC"]],
                                    ncomponents = 2,
                                    ntop = Inf,
                                    scale = TRUE,
                                    exprs_values = 1,
                                    n_neighbors = 3,
                                    dimred = "PCA",
                                    name = "UMAP")
```

The computed UMAP can be displayed using the `plotReducedDim`
function. The `dimred` arguments gives the name of the dimension
reduction results to plot, here we called it `UMAP`. The samples are
colored by type of sample.

```{r plot_UMAP}
plotReducedDim(scp[["proteins_batchC"]],
               dimred = "UMAP",
               colour_by = "SampleType",
               point_alpha = 1)
```

The UMAP plot is a very interesting plot for large datasets. A UMAP on
this small example dataset is not useful but is shown for
illustration.


# Monitoring data processing

The `QFeatures` plot shows the quantitative data for a features at the
different expression levels. For instance, suppose we are interested
in the protein *LMNA* (protein ID is `P02545`). A useful QC is to
monitor the data processing at the PSM, peptide and protein
level. This can easily be done thanks to the `QFeatures`
framework. Using the `subsetByFeature`, we can extract the protein of
interest and its related features in the other assays. The data is
formated to a long format table that can easily be plugged in the
`ggplot2` visualization tool.

```{r monitor_plot, warning = FALSE, fig.width = 10, fig.height = 10, out.width = "100%"}
scp %>%
  ## Get the features related to LMNA (P02545)
  subsetByFeature("P02545") %>%
  ## Format the `QFeatures` to a long format table
  longFormat(colDataCols = c("Set", "SampleType", "Channel")) %>%
  data.frame %>%
  ## This is used to preserve ordering of the samples and assays in ggplot2
  mutate(assay = factor(assay, levels = names(scp)),
         Channel = sub("RI", "TMT-", Channel),
         Channel = factor(Channel, levels = unique(Channel))) %>%
  ## Start plotting
  ggplot(aes(x = Channel, y = value, group = rowname, col = SampleType)) +
  geom_point() +
  ## Plot every assay in a separate facet
  facet_wrap(facets = vars(assay), scales = "free_y", ncol = 3) +
  ## Annotate plot
  xlab("Channels") +
  ylab("Intensity (arbitrary units)") +
  ## Improve plot aspect
  theme(axis.text.x = element_text(angle = 90),
        strip.text = element_text(hjust = 0),
        legend.position = "bottom")
```

This graph helps to keep track of the data processing and highlights
anomalies in the data. For instance, we can see that no data were
recorded for 5 channels (TMT-12 to 16). Those channels contain either
macrophages or monocytes (from the experimental design) and are
expected to contain information. Thanks to this diagnostic plot, we
reported the issue to Specht and colleagues and this led to version 3
of the SCoPE2 dataset.

# Session information {-}

```{r setup2, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "",
    crop = NULL
)
```


```{r sessioninfo, echo=FALSE}
sessionInfo()
```

# Reference