--- title: "Working with PSM data" output: BiocStyle::html_document: toc_float: true vignette: > %\VignetteIndexEntry{Working with PSM data} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} %\VignettePackage{PSMatch} %\VignetteDepends{mzR,mzID,BiocStyle,msdata} --- ```{r style, echo = FALSE, results = 'asis', message=FALSE} BiocStyle::markdown() ``` **Package**: `r Biocpkg("PSMatch")`
**Authors**: `r packageDescription("PSMatch")[["Author"]] `
**Last modified:** `r file.info("PSM.Rmd")$mtime`
**Compiled**: `r date()` ```{r setup, message = FALSE, echo = FALSE} library("PSMatch") ``` # Installation instructions To install the package from Bioconductor, make sure you have the `BiocManager` package, available from CRAN, and then run ```r BiocManager::install("PSMatch") ``` # Introduction This vignette is one among several illustrating how to use the `PSMatch` package, focusing on the handling and processing of proteomics identification data using the `PSM` class. For a general overview of the package, see the `PSMatch` package manual page (`?PSMatch`) and references therein. # Handling and processing identification data ## Loading PSM data We are going to use an `mzid` file from the `msdata` package. ```{r} f <- msdata::ident(full.names = TRUE, pattern = "TMT") basename(f) ``` The `PSM()` function parses one or multiple `mzid` files and returns an object of class `PSM`. This class is a simple extension of the `DFrame` class, representing the peptide-spectrum matches in a tabular format. ```{r} library("PSMatch") id <- PSM(f) id ``` ```{r, echo = FALSE} n_matches <- nrow(id) n_scans <- length(unique(id$spectrumID)) n_seqs <- length(unique(id$sequence)) n_cols <- ncol(id) ``` This table contains `r n_matches` matches for `r n_scans` scans and `r n_seqs` peptides sequences, each annotated by `r n_cols` variables. ```{r} nrow(id) ## number of matches length(unique(id$spectrumID)) ## number of scans length(unique(id$sequence)) ## number of peptide sequences names(id) ``` The PSM data are read as is, without any filtering. As we can see below, we still have all the hits from the forward and reverse (decoy) databases. ```{r} table(id$isDecoy) ``` ## Keeping all matches The data also contains multiple matches for several spectra. The table below shows the number of individual MS scans that have 1, 2, ... up to 5 matches. ```{r} table(table(id$spectrumID)) ``` More specifically, we can see below how scan 1774 has 4 matches, all to sequence `RTRYQAEVR`, which itself matches to 4 different proteins: ```{r} i <- grep("scan=1774", id$spectrumID) id[i, ] id[i, "DatabaseAccess"] ``` If the goal is to keep all the matches, but arranged by scan/spectrum, one can *reduce* the `DataFrame` object by the `spectrumID` variable, so that each scan correponds to a single row that still stores all values[^rownames]: [^rownames]: The rownames aren't needed here and are removed to reduce to output in the the next code chunks displaying parts of `id2`. ```{r, warning = FALSE} id2 <- reducePSMs(id, id$spectrumID) rownames(id2) <- NULL ## rownames not needed here dim(id2) ``` The resulting object contains a single entrie for scan 1774 with information for the multiple matches stored as a list within the table cell. ```{r} j <- grep("scan=1774", id2$spectrumID) id2[j, ] ``` ```{r} id2[j, "DatabaseAccess"] ``` The identification data could be used to annotate an raw mass spectrometry `Spectra` object (see the `Spectra::joinSpectraData()` function for details). ## Filtering data Often, the PSM data is filtered to only retain reliable matches. The `MSnID` package can be used to set thresholds to attain user-defined PSM, peptide or protein-level FDRs. Here, we will simply filter out wrong or the least reliable identifications. ### Remove decoy hits ```{r} id <- filterPsmDecoy(id) id ``` ### Keep first rank matches ```{r} id <- filterPsmRank(id) id ``` ### Remove shared peptides The data still contains shared peptides, i.e. those that different proteins. For example `QKTRCATRAFKANKGRAR` matches proteins `ECA2869` and `ECA4278`. ```{r} id[id$sequence == "QKTRCATRAFKANKGRAR", "DatabaseAccess"] ``` We can filter these out to retain unique peptides. ```{r} id <- filterPsmShared(id) id ``` This last filtering leaves us with `r nrow(id)` PSMs. Note that the `ConnectedComponents` approach defined in this package allows one to explore protein groups defined by such shared peptides more thoroughly and make informed decision as to which shared peptides to retain and which ones to drop. For details see the related vignette or the [`ConnectedComponents`](https://rformassspectrometry.github.io/PSMatch/reference/ConnectedComponents.html) manual page. ### All filters in one function This can also be achieved with the `filterPSMs()` function: ```{r} id <- PSM(f) filterPSMs(id) ``` # The `mzR` and `mzID` parsers The `PSM()` function can take two different values for the `parser` parameter, namely `"mzR"` (the default value) and `"mzID"`. - **mzR** uses the `openIDfile()` function from the `r BiocStyle::Biocpkg("mzR")` to parse the `mzId` file(s), and then coerces the data to a `data.frame` which is eventually returned as a `PSM` object. The parser function uses dedicated code from the Proteowizard project (included in `mzR`) and is generally the fastest approach. - **mzID** parses the `mzId` file with `mzID()` function from the `r BiocStyle::Biocpkg("mzID")` package, and then flattens the data to a `data.frame` with `mzID::flatten()` and eventuelly returns a `PSM` object. The `mzID` package relies on the `r BiocStyle::CRANpkg("XML")` package. Is is slower but is is more robust to variations in the `mzID` implementation, as is thus a useful backup when the `mzR` backend fails. ```{r, warning = FALSE} system.time(id1 <- PSM(f, parser = "mzR")) system.time(id2 <- PSM(f, parser = "mzID")) ``` Other differences in the two parsers include the columns that are returned, the way they name them, and, as will shown below the matches that are returned. Note for instance (and this will be important later), that there is no equivalent of `"modLocation"` in `id2`. ```{r} names(id1) names(id2) ``` We also have different number of matches in the two tables: ```{r} nrow(id1) nrow(id2) ``` ```{r} table(id1$isDecoy) table(id2$isdecoy) ``` Let's first filter the PSM tables to facilitate focus the comparison of relevant scans. Note that the default `filterPSMs()` arguments are set to work with both parser. ```{r} id1_filtered <- filterPSMs(id1) id2_filtered <- filterPSMs(id2) ``` As can be seen, we are also left with `r nrow(id1_filtered)` vs `r nrow(id2_filtered)` PSMs after filtering. The difference doesn't stem from different scans, given that the spectum identifiers are identical in both tables: ```{r} identical(sort(unique(id1_filtered$spectrumID)), sort(unique(id2_filtered$spectrumid))) ``` The difference is obvious when we tally a table of spectrum id occurences in the filtered tables. In `id2_filtered`, each scan is unique, i.e matched only once. ```{r} anyDuplicated(id2_filtered$spectrumid) ``` However, for `id1_filtered`, we see that some scans are still repeat up to 4 times in the table: ```{r} table(table(id1_filtered$spectrumID)) ``` The example below shows that these differences stem from the modification location (`"modLocation"`), that is not report by the `mzID` parser: ```{r} k <- names(which(table(id1_filtered$spectrumID) == 4)) id1_filtered[id1_filtered$spectrumID == k, "sequence"] id1_filtered[id1_filtered$spectrumID == k, "modLocation"] id1_filtered[id1_filtered$spectrumID == k, "modName"] ``` If we remove the `"modLocation"` column, we recoved the same number of PSMs than with the `mzID` parser. ```{r} id1_filtered$modLocation <- NULL nrow(unique(id1_filtered)) nrow(unique(id2_filtered)) ``` # Session information ```{r si} sessionInfo() ```