---
title: "Working with PSM data"
output:
BiocStyle::html_document:
toc_float: true
vignette: >
%\VignetteIndexEntry{Working with PSM data}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
%\VignettePackage{PSMatch}
%\VignetteDepends{mzR,mzID,BiocStyle,msdata}
---
```{r style, echo = FALSE, results = 'asis', message=FALSE}
BiocStyle::markdown()
```
**Package**: `r Biocpkg("PSMatch")`
**Authors**: `r packageDescription("PSMatch")[["Author"]] `
**Last modified:** `r file.info("PSM.Rmd")$mtime`
**Compiled**: `r date()`
```{r setup, message = FALSE, echo = FALSE}
library("PSMatch")
```
# Installation instructions
To install the package from Bioconductor, make sure you have the
`BiocManager` package, available from CRAN, and then run
```r
BiocManager::install("PSMatch")
```
# Introduction
This vignette is one among several illustrating how to use the
`PSMatch` package, focusing on the handling and processing of
proteomics identification data using the `PSM` class. For a general
overview of the package, see the `PSMatch` package manual page
(`?PSMatch`) and references therein.
# Handling and processing identification data
## Loading PSM data
We are going to use an `mzid` file from the `msdata` package.
```{r}
f <- msdata::ident(full.names = TRUE, pattern = "TMT")
basename(f)
```
The `PSM()` function parses one or multiple `mzid` files and returns
an object of class `PSM`. This class is a simple extension of the
`DFrame` class, representing the peptide-spectrum matches in a tabular
format.
```{r}
library("PSMatch")
id <- PSM(f)
id
```
```{r, echo = FALSE}
n_matches <- nrow(id)
n_scans <- length(unique(id$spectrumID))
n_seqs <- length(unique(id$sequence))
n_cols <- ncol(id)
```
This table contains `r n_matches` matches for `r n_scans` scans and
`r n_seqs` peptides sequences, each annotated by `r n_cols` variables.
```{r}
nrow(id) ## number of matches
length(unique(id$spectrumID)) ## number of scans
length(unique(id$sequence)) ## number of peptide sequences
names(id)
```
The PSM data are read as is, without any filtering. As we can see
below, we still have all the hits from the forward and reverse (decoy)
databases.
```{r}
table(id$isDecoy)
```
## Keeping all matches
The data also contains multiple matches for several spectra. The table
below shows the number of individual MS scans that have 1, 2, ... up
to 5 matches.
```{r}
table(table(id$spectrumID))
```
More specifically, we can see below how scan 1774 has 4 matches, all
to sequence `RTRYQAEVR`, which itself matches to 4 different proteins:
```{r}
i <- grep("scan=1774", id$spectrumID)
id[i, ]
id[i, "DatabaseAccess"]
```
If the goal is to keep all the matches, but arranged by scan/spectrum,
one can *reduce* the `DataFrame` object by the `spectrumID` variable,
so that each scan correponds to a single row that still stores all
values[^rownames]:
[^rownames]: The rownames aren't needed here and are removed to reduce
to output in the the next code chunks displaying parts of `id2`.
```{r, warning = FALSE}
id2 <- reducePSMs(id, id$spectrumID)
rownames(id2) <- NULL ## rownames not needed here
dim(id2)
```
The resulting object contains a single entrie for scan 1774 with
information for the multiple matches stored as a list within the table
cell.
```{r}
j <- grep("scan=1774", id2$spectrumID)
id2[j, ]
```
```{r}
id2[j, "DatabaseAccess"]
```
The identification data could be used to annotate an raw mass
spectrometry `Spectra` object (see the `Spectra::joinSpectraData()`
function for details).
## Filtering data
Often, the PSM data is filtered to only retain reliable matches. The
`MSnID` package can be used to set thresholds to attain user-defined
PSM, peptide or protein-level FDRs. Here, we will simply filter out
wrong or the least reliable identifications.
### Remove decoy hits
```{r}
id <- filterPsmDecoy(id)
id
```
### Keep first rank matches
```{r}
id <- filterPsmRank(id)
id
```
### Remove shared peptides
The data still contains shared peptides, i.e. those that different
proteins. For example `QKTRCATRAFKANKGRAR` matches proteins `ECA2869`
and `ECA4278`.
```{r}
id[id$sequence == "QKTRCATRAFKANKGRAR", "DatabaseAccess"]
```
We can filter these out to retain unique peptides.
```{r}
id <- filterPsmShared(id)
id
```
This last filtering leaves us with `r nrow(id)` PSMs.
Note that the `ConnectedComponents` approach defined in this package
allows one to explore protein groups defined by such shared peptides
more thoroughly and make informed decision as to which shared peptides
to retain and which ones to drop. For details see the related vignette
or the
[`ConnectedComponents`](https://rformassspectrometry.github.io/PSMatch/reference/ConnectedComponents.html)
manual page.
### All filters in one function
This can also be achieved with the `filterPSMs()` function:
```{r}
id <- PSM(f)
filterPSMs(id)
```
# The `mzR` and `mzID` parsers
The `PSM()` function can take two different values for the `parser`
parameter, namely `"mzR"` (the default value) and `"mzID"`.
- **mzR** uses the `openIDfile()` function from the
`r BiocStyle::Biocpkg("mzR")` to parse the `mzId` file(s), and then
coerces the data to a `data.frame` which is eventually returned as
a `PSM` object. The parser function uses dedicated code from the
Proteowizard project (included in `mzR`) and is generally the
fastest approach.
- **mzID** parses the `mzId` file with `mzID()` function from the
`r BiocStyle::Biocpkg("mzID")` package, and then flattens the data to
a `data.frame` with `mzID::flatten()` and eventuelly returns a
`PSM` object. The `mzID` package relies on the `r BiocStyle::CRANpkg("XML")`
package. Is is slower but is is more robust to variations in the
`mzID` implementation, as is thus a useful backup when the `mzR`
backend fails.
```{r, warning = FALSE}
system.time(id1 <- PSM(f, parser = "mzR"))
system.time(id2 <- PSM(f, parser = "mzID"))
```
Other differences in the two parsers include the columns that are
returned, the way they name them, and, as will shown below the matches
that are returned. Note for instance (and this will be important
later), that there is no equivalent of `"modLocation"` in `id2`.
```{r}
names(id1)
names(id2)
```
We also have different number of matches in the two tables:
```{r}
nrow(id1)
nrow(id2)
```
```{r}
table(id1$isDecoy)
table(id2$isdecoy)
```
Let's first filter the PSM tables to facilitate focus the comparison
of relevant scans. Note that the default `filterPSMs()` arguments are
set to work with both parser.
```{r}
id1_filtered <- filterPSMs(id1)
id2_filtered <- filterPSMs(id2)
```
As can be seen, we are also left with `r nrow(id1_filtered)` vs
`r nrow(id2_filtered)` PSMs after filtering.
The difference doesn't stem from different scans, given that the
spectum identifiers are identical in both tables:
```{r}
identical(sort(unique(id1_filtered$spectrumID)),
sort(unique(id2_filtered$spectrumid)))
```
The difference is obvious when we tally a table of spectrum id
occurences in the filtered tables. In `id2_filtered`, each scan is
unique, i.e matched only once.
```{r}
anyDuplicated(id2_filtered$spectrumid)
```
However, for `id1_filtered`, we see that some scans are still repeat
up to 4 times in the table:
```{r}
table(table(id1_filtered$spectrumID))
```
The example below shows that these differences stem from the
modification location (`"modLocation"`), that is not report by the
`mzID` parser:
```{r}
k <- names(which(table(id1_filtered$spectrumID) == 4))
id1_filtered[id1_filtered$spectrumID == k, "sequence"]
id1_filtered[id1_filtered$spectrumID == k, "modLocation"]
id1_filtered[id1_filtered$spectrumID == k, "modName"]
```
If we remove the `"modLocation"` column, we recoved the same number of
PSMs than with the `mzID` parser.
```{r}
id1_filtered$modLocation <- NULL
nrow(unique(id1_filtered))
nrow(unique(id2_filtered))
```
# Session information
```{r si}
sessionInfo()
```