--- title: "Single Cell Proteomics data processing and analysis." author: - name: Laurent Gatto - name: Christophe Vanderaa output: BiocStyle::html_document: self_contained: yes toc: true toc_float: true toc_depth: 2 code_folding: show bibliography: scp.bib date: "`r BiocStyle::doc_date()`" package: "`r BiocStyle::pkg_ver('scp')`" vignette: > %\VignetteIndexEntry{Single Cell Proteomics data processing and analysis} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", crop = NULL ## Related to https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html ) ``` # The `scp` package The `scp` package is used to process and analyse mass spectrometry (MS)-based single cell proteomics (SCP) data. The functions rely on a specific data structure that wraps [`QFeatures`](https://rformassspectrometry.github.io/QFeatures/) objects (@Gatto2020-ry) around [`SingleCellExperiment`](http://bioconductor.org/packages/release/bioc/html/SingleCellExperiment.html) objects (@Amezquita2019-bf). This data structure could be seen as Matryoshka dolls were the `SingleCellExperiment` objects are small dolls contained in the bigger `QFeatures` doll. The `SingleCellExperiment` class provides a dedicated framework for single-cell data. The `SingleCellExperiment` serves as an interface to many cutting-edge methods for processing, visualizing and analysis single-cell data. More information about the `SingleCellExperiment` class and associated methods can be found in the [OSCA book](http://bioconductor.org/books/release/OSCA/). The `QFeatures` class is a data framework dedicated to manipulate and process MS-based quantitative data. It preserves the relationship between the different levels of information: peptide to spectrum match (PSM) data, peptide data and protein data. The `QFeatures` package also provides an interface to many utility functions to streamline the processing MS data. More information about MS data analysis tools can be found in the [RforMassSpectrometry project](https://www.rformassspectrometry.org/). ```{r scp_framework, results='markup', fig.cap="`scp` relies on `SingleCellExperiment` and `QFeatures` objects.", echo=FALSE, out.width='100%', fig.align='center'} knitr::include_graphics("./figures/SCP_framework.png", error = FALSE) ``` Before running the vignette we need to load the `scp` package. ```{r load_scp, message = FALSE} library("scp") ``` We also load `ggplot2`, `magrittr` and `dplyr` for convenient data manipulation and plotting. ```{r load_other_package, message = FALSE} library("ggplot2") library("magrittr") library("dplyr") ``` # Before you start This vignette will guide you through some common steps of mass spectrometry-based single-cell proteomics (SCP) data analysis. SCP is an emerging field and further research is required to develop a principled analysis workflow. Therefore, we **do not guarantee** that the steps presented here are the best steps for this type of data analysis. This vignette performs the steps that were described in the SCoPE2 landmark paper (@Specht2021-jm). We hope to convince the reader that, although the workflow is probably not optimal, `scp` has the full potential to perform standardized and principled data analysis. All functions presented here are comprehensively documented, highly modular, can easily be extended with new algorithms. Suggestions, feature requests or bug reports are warmly welcome. 
# Read in SCP data

The first step is to read in the PSM quantification table generated by, for example, MaxQuant (@Tyanova2016-yj). We created a small example dataset by subsetting the MaxQuant `evidence.txt` table provided in the SCoPE2 landmark paper (@Specht2021-jm). The `mqScpData` table is a typical example of what you would get after reading in a CSV file using `read.csv` or `read.table`. See `?mqScpData` for more information about the table content.

```{r load_mqScpData}
data("mqScpData")
dim(mqScpData)
mqScpData[1:10, 1:5]
```

In order to convert this tabular data to a `scp`-compatible `QFeatures` object, we need to provide a **metadata table** in which rows contain sample information and that contains at least the following columns:

* The name of the batch the sample was acquired in
* The name of the channel the sample was acquired in

Any additional information about the samples will be stored in the `colData`. We provide an example of such a data frame. It was formatted from the annotation table provided in the SCoPE2 article. See `?sampleAnnotation` for more information about the table content.

```{r load_sampleAnnotation}
data("sampleAnnotation")
sampleAnnotation
```

The two tables are supplied to the `readSCP` function.

```{r readSCP}
scp <- readSCP(featureData = mqScpData,
               colData = sampleAnnotation,
               channelCol = "Channel",
               batchCol = "Raw.file")
```

As indicated by the output on the console, `readSCP` proceeds as follows:

1. If `featureData` is the path to a CSV file, it reads the file using `read.csv`. The table is converted to a `SingleCellExperiment` object. `readSCP` needs to know in which field(s) the quantitative data is stored. The name(s) of those field(s) are provided in the `colData` table, in the column given by the `channelCol` argument. So, in this example, the column names holding the quantitative data in `mqScpData` are stored in the column named `Channel` in `sampleAnnotation`.
2. The `SingleCellExperiment` object is then split according to batch. The split is performed depending on the `batchCol` field in `featureData`; in this case the data will be split according to the `Raw.file` column in `mqScpData`. `Raw.file` contains the names of the acquisition runs that were used by MaxQuant to retrieve the raw data files.
3. The sample metadata is generated from the supplied `colData`. Note that, in order for `readSCP` to correctly match the feature data with the metadata, `colData` must contain the same `batchCol` field with the batch names.
4. Finally, the split feature data and the sample metadata are stored in a single `QFeatures` object.

Here is a compact overview of the data:

```{r overview}
scp
```

We can see that the `scp` object we created is a `QFeatures` object containing 4 assays. Each assay has an associated name, which is the batch name that was used for splitting. We can also see that each assay is a `SingleCellExperiment` object. The rows represent the peptide-to-spectrum matches (PSMs); their number varies from batch to batch. Finally, all four assays contain 16 columns that correspond to the 16 TMT channels recorded during the 4 MS runs.
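As a quick, optional illustration of the Matryoshka-like structure described above, each assay can be extracted as a `SingleCellExperiment` using `[[`, and the sample annotations can be retrieved from the `colData`:

```{r explore_scp}
## Each assay of the QFeatures object is a SingleCellExperiment
scp[[1]]
## The sample annotations are stored in the colData
head(colData(scp))
```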
## Special case: empty columns

The SCoPE2 data we provide as an example contains runs that were acquired with two TMT protocols: the first three assays were acquired using the TMT-11 protocol and the last assay was acquired using the TMT-16 protocol. When exporting the table, the authors combined the data in a single table where the missing channels in the TMT-11 data are filled with `NA`. This is essential when working in a tabular format but, since `scp` keeps the runs separate, different numbers of channels per run are allowed. `readSCP` can automatically detect and remove columns that contain only `NA`s when supplying the argument `removeEmptyCols = TRUE`.

```{r readSCP2}
scp <- readSCP(featureData = mqScpData,
               colData = sampleAnnotation,
               channelCol = "Channel",
               batchCol = "Raw.file",
               removeEmptyCols = TRUE)
scp
```

The first three assays now contain 11 columns that correspond to the TMT-11 labels, and the last assay contains 16 columns that correspond to the TMT-16 labels.

# Clean missing data

All single-cell data contain many zeros. The zeros can be biological zeros or technical zeros, and differentiating between the two types is not a trivial task. To avoid artefacts in downstream steps, we replace the zeros with the missing value `NA`. The `zeroIsNA` function takes the `QFeatures` object and the name(s) or index/indices of the assay(s) to clean, and automatically replaces any zero in the selected quantitative data with `NA`.

```{r zeroIsNA}
scp <- zeroIsNA(scp, i = 1:4)
```

# Filter PSMs

A common step in SCP is to filter out low-confidence PSMs. Each PSM assay contains feature metadata stored in the `rowData` of the assay. The `QFeatures` package allows the rows of an assay to be quickly filtered using this information. The variables available in the `rowData` are listed below for each assay.

```{r rowDataNames}
rowDataNames(scp)
```

## Filter features based on feature metadata

Below are some examples of criteria used to identify low-confidence PSMs. The required information is readily available, as it was computed by MaxQuant:

- Remove PSMs that are matched to contaminants
- Remove PSMs that are matched to the decoy database
- Keep PSMs that exhibit a high PIF (parental ion fraction), indicative of the purity of a spectrum

We can perform this filtering using the `filterFeatures` function from `QFeatures`. `filterFeatures` automatically accesses the feature metadata and selects the rows that meet the provided condition(s). For instance, `Reverse != "+"` keeps the rows for which the `Reverse` variable in the `rowData` is not `"+"` (*i.e.* the PSM is not matched to the decoy database).

```{r filter_psms}
scp <- filterFeatures(scp,
                      ~ Reverse != "+" &
                          Potential.contaminant != "+" &
                          !is.na(PIF) & PIF > 0.8)
```

## Filter assays based on detected features

To avoid proceeding with failed runs, another useful filter is to remove assays with too few features. If a batch contains fewer than, for example, 150 features, we can suspect that something went wrong in that batch, and it should be removed. Using `dims`, we can query the dimensions (hence the number of features and the number of samples) of all assays contained in the dataset.

```{r dims}
dims(scp)
```

A `QFeatures` object can in fact be seen as a three-order array: $features \times samples \times assays$. Hence, `QFeatures` supports three-order subsetting, `x[rows, columns, assays]`, as illustrated below.
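For instance (an illustration only, not a step of the workflow), we can keep all features and samples but retain only the first two assays:

```{r subset_assays_example}
scp[, , 1:2]
```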
We first select the assays that have sufficient PSMs (the number of rows is greater than 150), and then subset the `scp` object for the assays that meet this criterion.

```{r filter_assays}
keepAssay <- dims(scp)[1, ] > 150
scp <- scp[, , keepAssay]
scp
```

Notice that the `190321S_LCA10_X_FP97_blank_01` assay was removed because it did not contain sufficient features, as expected from a blank run. The same could happen for failed runs.

## Filter features based on SCP metrics

Another type of filtering is specific to SCP. In the SCoPE2 analysis, the authors suggest a filter based on the sample-to-carrier ratio (SCR), that is, the reporter ion intensity of a single-cell sample divided by the reporter ion intensity of the carrier channel (200 cells) from the same batch. The carrier intensities are expected to be much higher than the single-cell intensities.

The SCR can be computed using the `computeSCR` function from `scp`. The function needs to know which channels contain the samples to be divided and which channel contains the carrier. This information is provided in the sample metadata and is accessed via the `colData`, under the `SampleType` field.

```{r show_annotation}
table(colData(scp)[, "SampleType"])
```

In this dataset, `SampleType` gives the type of sample present in each TMT channel. The SCoPE2 protocol includes 5 types of samples:

* The carrier channels (`Carrier`) contain 200 cell equivalents and are meant to boost the peptide identification rate.
* The normalization channels (`Reference`) contain 5 cell equivalents and are used to partially correct for between-run variation.
* The unused channels (`Unused`) are left empty due to isotopic cross-contamination by the carrier channel.
* The blank channels (`Blank`) contain samples that do not contain any cell but are processed as single-cell samples.
* The single-cell sample channels contain the single-cell samples of interest, that are either macrophages (`Macrophage`) or monocytes (`Monocyte`).

The `computeSCR` function expects the following input:

* The `QFeatures` dataset
* The assay name(s) or index/indices for which the SCR should be computed
* The sample metadata variable pointing to the channel annotation
* A string pattern (following the regular expression syntax) that uniquely identifies the carrier channel in each batch
* A string pattern (following the regular expression syntax) that identifies the samples to divide

The function stores the computed SCR as a new field in the `rowData` of the assays; we here call it `MeanSCR`.

```{r computeSCR}
scp <- computeSCR(scp,
                  i = 1:3,
                  colDataCol = "SampleType",
                  carrierPattern = "Carrier",
                  samplePattern = "Macrophage|Monocyte",
                  rowDataName = "MeanSCR")
```

Before applying the filter, we plot the distribution of the average SCR. We can extract the `MeanSCR` variable from the `rowData` of several assays using the `rowDataToDF` function. It takes the `rowData` field(s) of interest and returns them as a `DataFrame` table.

```{r plot_SCR, warning=FALSE, message=FALSE}
scp %>%
    rowDataToDF(i = 1:3,
                vars = "MeanSCR") %>%
    data.frame %>%
    ggplot(aes(x = MeanSCR)) +
    geom_histogram() +
    geom_vline(xintercept = c(1/200, 0.1),
               lty = c(2, 1)) +
    scale_x_log10()
```

The expected ratio between single cells and the carrier is 1/200 (dashed line). We can see that the distribution is slightly shifted towards higher ratios, with a mode around 0.01. A few PSMs, however, stand out of the distribution with a much higher signal than expected, indicating that something went wrong during the quantification of those PSMs. We therefore filter out PSMs with an average SCR higher than 0.1 (solid line). This is again easily performed using the `filterFeatures` function.

```{r filter_SCR}
scp <- filterFeatures(scp,
                      ~ !is.na(MeanSCR) &
                          MeanSCR < 0.1)
```
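As an optional sanity check (not part of the original workflow), we can verify that the number of features (rows) per assay decreased after the successive PSM filters:

```{r check_SCR_filter}
dims(scp)
```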
## Filter features to control for FDR

Finally, we might also want to control the false discovery rate (FDR). MaxQuant already computes posterior error probabilities (PEPs), but filtering on PEPs is too conservative (@Kall2008-hb), so we provide the `pep2qvalue` function to convert PEPs to q-values, which are directly related to FDR. We here compute the q-values from the PEPs (`dart_PEP`) across all 3 assays. `dart_PEP` contains the PEP values that have been updated using the DART-ID algorithm (@Chen2019-uc). The function stores the results in the `rowData`; we here ask to name the new column `qvalue_PSMs`.

```{r compute_qvalues}
scp <- pep2qvalue(scp,
                  i = 1:3,
                  PEP = "dart_PEP",
                  rowDataName = "qvalue_PSMs")
```

The q-values can also be computed at the peptide or protein level rather than at the PSM level. In that case, you need to supply the `groupBy` argument. Suppose we want to compute the q-values at the protein level: we can fetch the protein information stored under `protein` in the `rowData`. This time, we store the q-values in a new field called `qvalue_proteins`.

```{r compute_qvalues2}
scp <- pep2qvalue(scp,
                  i = 1:3,
                  PEP = "dart_PEP",
                  groupBy = "protein",
                  rowDataName = "qvalue_proteins")
```

We can now filter the PSMs to control, let's say, the protein FDR at 1 %. This can be performed using `filterFeatures` because the q-values were stored in the `rowData`.

```{r filter_FDR}
scp <- filterFeatures(scp,
                      ~ qvalue_proteins < 0.01)
```

# Process the PSM data

## Relative reporter ion intensity

In order to partially correct for between-run variation, SCoPE2 suggests computing relative reporter ion intensities. This means that the intensities measured for single cells are divided by the reference channel containing 5-cell equivalents. We use the `divideByReference` function, which divides the channels of interest by the reference channel. Similarly to `computeSCR`, we can point to the sample and reference columns in each assay using the annotation contained in the `colData`. We here divide all columns (using the regular expression wildcard `.`) by the reference channel (`Reference`).

```{r divideByReference}
scp <- divideByReference(scp,
                         i = 1:3,
                         colDataCol = "SampleType",
                         samplePattern = ".",
                         refPattern = "Reference")
```

# Aggregate PSM data to peptide data

Now that the PSM assays are processed, we can aggregate them to peptides. This is performed using the `aggregateFeaturesOverAssays` function. For each assay, the function aggregates several PSMs into a unique peptide. This is best illustrated by the figure below.

```{r features_aggregation, results = 'markup', fig.cap = "Conceptual illustration of feature aggregation.", echo=FALSE, out.width='100%', fig.align='center'}
knitr::include_graphics("./figures/feature_aggregation.png", error = FALSE)
```

Remember that there currently are three assays containing the PSM data.

```{r show_agg_psms}
scp
```

The PSMs are aggregated over the `fcol` feature variable, here peptides. We also need to supply an aggregating function that specifies how to combine the quantitative data of the PSMs to aggregate. We here aggregate the PSM data using the median value per sample, thanks to the `matrixStats::colMedians` function. Other functions can be used; we refer to `?aggregateFeatures` for more information about the available aggregation functions. The toy example below illustrates what the aggregating function does.
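As a small illustration (the matrix below is made up for the example and is not part of the data), `matrixStats::colMedians` summarizes each column of a matrix into a single value. During aggregation, the rows of such a matrix are the PSMs that map to the same peptide, and the columns are the samples:

```{r colMedians_example}
## 3 PSMs (rows) quantified in 2 samples (columns)
(m <- matrix(1:6, ncol = 2))
## One value per sample: this is how several PSM rows collapse into a
## single peptide row
matrixStats::colMedians(m)
```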
The `aggregateFeaturesOverAssays` function creates a new assay for each aggregated assay. We name the aggregated assays by prepending `peptides_` to the original assay names.

```{r aggregate_peptides, message = FALSE}
scp <- aggregateFeaturesOverAssays(scp,
                                   i = 1:3,
                                   fcol = "peptide",
                                   name = paste0("peptides_", names(scp)),
                                   fun = matrixStats::colMedians,
                                   na.rm = TRUE)
```

Notice that 3 new assays were created in the `scp` object. Those new assays contain the aggregated features, while the three first assays remain unchanged. This allows keeping track of the data processing.

```{r show_agg_peptides}
scp
```

Under the hood, the `QFeatures` architecture preserves the relationships between the aggregated assays. See `?AssayLinks` for more information on relationships between assays.

# Join the SCoPE2 sets in one assay

Up to now, we kept the data belonging to each MS run in separate assays. We now combine all batches into a single assay. This is done using the `joinAssays` function from the `QFeatures` package. Note that we now use the aggregated assays, that is assays 4 to 6.

```{r joinAssays}
scp <- joinAssays(scp,
                  i = 4:6,
                  name = "peptides")
```

In this case, one new assay is created in the `scp` object, combining the data from assays 4 to 6. The samples are always distinct, so the number of columns in the new assay (here $38$) will always equal the sum of the columns of the assays to join (here $11 + 11 + 16$). The joined assay, however, may contain fewer features than the sum of the rows of the assays to join, since features common to several assays are merged into a single row.

```{r show_join}
scp
```

# Filter single-cells

Another common step in single-cell data analysis pipelines is to remove low-quality cells. After subsetting for the samples of interest, we will use two metrics: the median relative intensity per cell and the median coefficient of variation (CV) per cell.

## Filter samples of interest

We can subset the cells of interest, that is, the blank samples (`Blank`), the macrophages (`Macrophage`) and the monocytes (`Monocyte`). This can easily be done by taking advantage of the `colData` and the subsetting operators. Recall that `QFeatures` objects support three-order subsetting, `x[rows, columns, assays]`, where the columns are the samples of interest.

```{r subset_samples}
scp
scp <- scp[, scp$SampleType %in% c("Blank", "Macrophage", "Monocyte"), ]
```

The subsetting removes unwanted samples from all assays. The filtered dataset contains the same number of assays with the same number of features, but the number of columns (hence samples) decreased.

```{r show_cell_filter}
scp
```

## Filter based on the median relative intensity

We compute the median relative reporter ion intensity for each cell separately and apply a filter based on this statistic. This procedure recalls the library size filtering commonly performed in scRNA-Seq data analysis, where the library size is the sum of the counts in each single cell. We will store the median intensity in the `colData`. The medians are computed for every sample from the quantitative data using the `apply` function. The data matrix can be extracted from a `SingleCellExperiment` using the `assay` function. Setting `MARGIN = 2` means that the medians are computed column-wise.

```{r compute_colMedians}
scp[["peptides"]] %>%
    assay %>%
    apply(MARGIN = 2,
          FUN = median,
          na.rm = TRUE) ->
    medians
head(medians)
```
The computed medians are then inserted into the `colData`.

```{r}
colData(scp)[names(medians), "MedianRI"] <- medians
```

Looking at the distribution of the median per cell can highlight low-quality cells.

```{r plot_medianRI, warning=FALSE, message=FALSE}
scp %>%
    colData %>%
    data.frame %>%
    ggplot() +
    aes(x = MedianRI,
        y = SampleType,
        fill = SampleType) +
    geom_boxplot() +
    scale_x_log10()
```

The blank samples should not contain any peptide information and are therefore used to assess the amount of background signal. The graph above confirms that the signal measured in single cells (macrophages and monocytes) is above the background signal, hence no filtering is needed. Were that not the case, the same procedure as in the previous section could be used to remove the cells whose median RI falls below a defined threshold.

## Filter based on the median CV

The median CV measures the consistency of quantification for the groups of peptides that belong to the same protein. We remove cells that exhibit a high median CV over the different proteins. We compute the median CV per cell using the `medianCVperCell` function from the `scp` package. The function computes the CV for each protein in each cell; to do so, we must supply the name of the `rowData` field that contains the protein information through the `groupBy` argument. We also only want to compute CVs if we have at least 5 peptides per protein. Finally, we also perform a normalization and divide the columns by their median. The computed median CVs are automatically stored in the `colData` under the supplied name, here `MedianCV`.

```{r medianCVperCell}
scp <- medianCVperCell(scp,
                       i = 1:3,
                       groupBy = "protein",
                       nobs = 5,
                       norm = "div.median",
                       na.rm = TRUE,
                       colDataName = "MedianCV")
```

The computed CVs are stored in the `colData` of the `peptides` assay and hold the median CV per cell, computed using at least 5 observations (peptides). The main interest of computing the median CV per cell is to filter out cells with unreliable quantifications. The blank samples are not expected to have reliable quantifications and can hence be used to estimate an empirical null distribution of the CV. This distribution helps define a threshold to filter out single cells with noisy quantifications.

```{r plot_medianCV, message = FALSE, warning = FALSE}
scp %>%
    getWithColData("peptides") %>%
    colData %>%
    data.frame %>%
    ggplot(aes(x = MedianCV, fill = SampleType)) +
    geom_boxplot() +
    geom_vline(xintercept = 0.65)
```

We can see that the protein quantifications are much more consistent for single cells than for blank samples. Based on the distribution of the blanks, we decide to keep the cells with a median CV lower than 0.65. Note that this example is inaccurate because the null distribution is based on only 3 blank samples; more sets would lead to a better estimation of the CV null distribution.

```{r create_filter}
scp <- scp[, !is.na(scp$MedianCV) & scp$MedianCV < 0.65, ]
```

We can now remove the blank samples, since all QC metrics have been computed.

```{r remove_blanks}
scp <- scp[, scp$SampleType != "Blank", ]
```
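As a quick optional check (not part of the original workflow), we can confirm that only macrophages and monocytes remain:

```{r check_sample_types}
table(scp$SampleType)
```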
# Process the peptide data

In this vignette, the peptide data are further processed before aggregation to proteins. The steps are: normalization, filtering of peptides based on missing data and log-transformation.

## Normalization

The columns (samples) of the peptide data are first normalized by dividing the relative intensities by the median relative intensities. Then, the rows (peptides) are normalized by dividing the relative intensities by the mean relative intensities. The normalized data are stored in a separate assay. This normalization procedure is suggested in the SCoPE2 analysis and is applied using the `sweep` method. Besides the dataset and the assay to normalize, the method expects a `MARGIN`, that is, either a row-wise (`1`) or column-wise (`2`) transformation, the `FUN` function to apply and `STATS`, a vector of values to apply. More conventional normalization procedures can be found in `?QFeatures::normalize`.

```{r normalize_scale}
## Divide columns by median
scp <- sweep(scp,
             i = "peptides",
             MARGIN = 2,
             FUN = "/",
             STATS = colMedians(assay(scp[["peptides"]]), na.rm = TRUE),
             name = "peptides_norm_col")
## Divide rows by mean
scp <- sweep(scp,
             i = "peptides_norm_col",
             MARGIN = 1,
             FUN = "/",
             STATS = rowMeans(assay(scp[["peptides_norm_col"]]), na.rm = TRUE),
             name = "peptides_norm")
```

Notice that each call to `sweep` created a new assay.

```{r show_sweep}
scp
```

## Remove peptides with high missing rate

Peptides that contain many missing values are not informative. Therefore, another common procedure is to remove highly missing data. In this example, we remove peptides with more than 99 % missing data. This is done using the `filterNA` function from `QFeatures`.

```{r filterNA}
scp <- filterNA(scp,
                i = "peptides_norm",
                pNA = 0.99)
```

## Log-transformation

In this vignette, we perform a log2-transformation using the `logTransform` method from `QFeatures`. Other log-transformations can be applied by changing the `base` argument.

```{r logTransform}
scp <- logTransform(scp,
                    base = 2,
                    i = "peptides_norm",
                    name = "peptides_log")
```

Similarly to `sweep`, `logTransform` creates a new assay in `scp`.

# Aggregate peptide data to protein data

Similarly to the aggregation of PSM data to peptide data, we can aggregate peptide data to protein data using the `aggregateFeatures` function.

```{r aggregate_proteins}
scp <- aggregateFeatures(scp,
                         i = "peptides_log",
                         name = "proteins",
                         fcol = "protein",
                         fun = matrixStats::colMedians,
                         na.rm = TRUE)
```

The only difference between `aggregateFeatures` and `aggregateFeaturesOverAssays` is that the latter can aggregate several assays at once, whereas the former takes a single assay to aggregate. Hence, only a single assay, `proteins`, was created in the `scp` object.

```{r show_agg_proteins}
scp
```

After this second aggregation, the `proteins` assay in this example contains quantitative information for 89 proteins in 15 single cells.

# Process the protein data

The protein data are further processed in three steps: normalization, imputation (using the KNN algorithm) and batch correction (using the `ComBat` algorithm).

## Normalization

Normalization is performed similarly to the peptide normalization. We use the same functions but, since the data were log-transformed at the peptide level, we subtract the statistic (median or mean) instead of dividing.

```{r normalize_center}
scp %>%
    ## Center columns with median
    sweep(i = "proteins",
          MARGIN = 2,
          FUN = "-",
          STATS = colMedians(assay(scp[["proteins"]]), na.rm = TRUE),
          name = "proteins_norm_col") %>%
    ## Center rows with mean
    sweep(i = "proteins_norm_col",
          MARGIN = 1,
          FUN = "-",
          STATS = rowMeans(assay(.[["proteins_norm_col"]]), na.rm = TRUE),
          name = "proteins_norm") ->
    scp
```
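As a quick optional check (an illustration, not part of the original workflow), the medians of the column-centered assay should now be (essentially) zero:

```{r check_centering}
summary(colMedians(assay(scp[["proteins_norm_col"]]), na.rm = TRUE))
```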
## Imputation

The protein data contain a lot of missing values.

```{r missingness}
scp[["proteins_norm"]] %>%
    assay %>%
    is.na %>%
    mean
```

The average missingness in the `proteins_norm` assay is around 25 %. Including more samples, and hence more batches, can increase the missingness up to 70 %, as seen for the complete SCoPE2 dataset (@Specht2021-jm). Whether imputation is beneficial or deleterious for the data will not be discussed in this vignette, but taking those missing values into account is essential to avoid artefacts in downstream analyses.

The data imputation is performed using the K-nearest-neighbors algorithm, with k = 3. This is available from the `impute` method. More details about the arguments can be found in `?impute::impute.knn`.

```{r impute}
scp <- impute(scp,
              i = "proteins_norm",
              method = "knn",
              k = 3, rowmax = 1, colmax = 1,
              maxp = Inf, rng.seed = 1234)
```

Note that after imputation, no values are missing.

```{r missingness_imputed}
scp[["proteins_norm"]] %>%
    assay %>%
    is.na %>%
    mean
```

## Batch correction

A very important step in processing SCP data is to correct for batch effects. Batch effects are caused by technical variation occurring during different MS runs. Since only a small number of single cells can be acquired at once, batch effects are unavoidable.

The `ComBat` function from the `sva` package can be used to perform batch correction, as is done in the SCoPE2 analysis. We do not claim that `ComBat` is the best algorithm for batch correcting SCP data, and other batch correction methods could be applied using the same procedure.

We first extract the assay to process.

```{r get_proteins}
sce <- getWithColData(scp, "proteins_norm")
```

Next, we need to provide a design matrix and the batch annotation to `ComBat`. The design matrix allows protecting variables of interest, in our case `SampleType`.

```{r prepare_batch_correction}
batch <- colData(sce)$Raw.file
model <- model.matrix(~ SampleType, data = colData(sce))
```

We then load `sva`, call `ComBat` and overwrite the data matrix. Recall that the data matrix can be accessed using the `assay` function.

```{r compute_batch_correction, results = 'hide', message = FALSE}
library(sva)
assay(sce) <- ComBat(dat = assay(sce),
                     batch = batch,
                     mod = model)
```

Finally, we add the batch-corrected assay to the `QFeatures` object and create the feature links.

```{r add_batch_correction}
addAssay(scp,
         y = sce,
         name = "proteins_batchC") %>%
    addAssayLinkOneToOne(from = "proteins_norm",
                         to = "proteins_batchC") ->
    scp
```

# Dimension reduction

Because each assay contains `SingleCellExperiment` objects, we can easily apply methods developed in the scRNA-Seq field. A useful package for dimension reduction on single-cell data is `scater`.

```{r load_scater}
library(scater)
```

This package provides streamlined functions to compute various dimension reductions, such as PCA, UMAP, t-SNE, NMF or MDS.

## PCA

PCA can be computed using the `runPCA` method. It returns a `SingleCellExperiment` object for which the dimension reduction results are stored in the `reducedDim` slot.

```{r run_PCA}
scp[["proteins_batchC"]] <- runPCA(scp[["proteins_batchC"]],
                                   ncomponents = 5,
                                   ntop = Inf,
                                   scale = TRUE,
                                   exprs_values = 1,
                                   name = "PCA")
```
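Optionally, the percentage of variance explained by each component can be inspected. This assumes the usual `scater` convention of storing it as the `percentVar` attribute of the PCA coordinates:

```{r pca_percentVar}
attr(reducedDim(scp[["proteins_batchC"]], "PCA"), "percentVar")
```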
The computed PCA can be displayed using the `plotReducedDim` function. The `dimred` argument should give the name of the dimension reduction results to plot; here, we called it `PCA`. The samples are colored by sample type.

```{r plot_PCA}
plotReducedDim(scp[["proteins_batchC"]],
               dimred = "PCA",
               colour_by = "SampleType",
               point_alpha = 1)
```

This is a minimalistic example with only a few plotted cells, but the original SCoPE2 dataset contained more than a thousand cells.

## UMAP

Similarly to PCA, we can compute a UMAP using the `runUMAP` method. Note, however, that the UMAP implementation requires an initialization, usually provided by PCA. The previous PCA results are used automatically when supplying `dimred = "PCA"` (`PCA` is the name of the dimension reduction results that we supplied in the previous section).

```{r run_UMAP}
scp[["proteins_batchC"]] <- runUMAP(scp[["proteins_batchC"]],
                                    ncomponents = 2,
                                    ntop = Inf,
                                    scale = TRUE,
                                    exprs_values = 1,
                                    n_neighbors = 3,
                                    dimred = "PCA",
                                    name = "UMAP")
```

The computed UMAP can be displayed using the `plotReducedDim` function. The `dimred` argument gives the name of the dimension reduction results to plot; here, we called it `UMAP`. The samples are colored by sample type.

```{r plot_UMAP}
plotReducedDim(scp[["proteins_batchC"]],
               dimred = "UMAP",
               colour_by = "SampleType",
               point_alpha = 1)
```

UMAP plots are most interesting for large datasets. A UMAP of this small example dataset is not very informative but is shown for illustration.

# Monitoring data processing

`QFeatures` keeps track of the quantitative data for a feature across the different expression levels. For instance, suppose we are interested in the protein *Plastin-2* (protein ID `P13796`). A useful QC is to monitor the data processing at the PSM, peptide and protein levels. This can easily be done thanks to the `QFeatures` framework. Using `subsetByFeature`, we can extract the protein of interest and its related features in the other assays. The data are then formatted to a long table that can easily be plugged into the `ggplot2` visualization tool.

```{r monitor_plot, warning = FALSE, fig.width = 10, fig.height = 10, out.width = "100%"}
scp %>%
    ## Get the features related to Plastin-2 (P13796)
    subsetByFeature("P13796") %>%
    ## Format the `QFeatures` to a long format table
    longFormat(colDataCols = c("Raw.file", "SampleType", "Channel")) %>%
    data.frame %>%
    ## This is used to preserve ordering of the samples and assays in ggplot2
    mutate(assay = factor(assay, levels = names(scp)),
           Channel = sub("Reporter.intensity.", "TMT-", Channel),
           Channel = factor(Channel, levels = unique(Channel))) %>%
    ## Start plotting
    ggplot(aes(x = Channel,
               y = value,
               group = rowname,
               col = SampleType)) +
    geom_point() +
    ## Plot every assay in a separate facet
    facet_wrap(facets = vars(assay),
               scales = "free_y",
               ncol = 3) +
    ## Annotate plot
    xlab("Channels") +
    ylab("Intensity (arbitrary units)") +
    ## Improve plot aspect
    theme(axis.text.x = element_text(angle = 90),
          strip.text = element_text(hjust = 0),
          legend.position = "bottom")
```

This graph helps keeping track of the data processing. We can see how the different PSMs are progressively aggregated to peptides and then to proteins, as well as how the normalization, imputation or batch correction impact the distribution of the quantifications.

# Session information {-}

```{r setup2, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "",
    crop = NULL
)
```

```{r sessioninfo, echo=FALSE}
sessionInfo()
```

# References