--- title: "Using pRoloc for spatial proteomics data analysis" author: - name: Lisa M. Breckels affiliation: Computational Proteomics Unit, Cambridge, UK - name: Laurent Gatto affiliation: Computational Proteomics Unit, Cambridge, UK package: pRoloc abstract: > This tutorial illustrates the usage of the `pRoloc` R package for the analysis and interpretation of spatial proteomics data. It walks the reader through the creation of *MSnSet* instances, that hold the quantitative proteomics data and meta-data and introduces several aspects of data analysis, including data visualisation and application of machine learning to predict protein localisation. output: BiocStyle::html_document: toc_float: true bibliography: pRoloc.bib vignette: > %\VignetteIndexEntry{Using pRoloc for spatial proteomics data analysis} %\VignetteEngine{knitr::rmarkdown} %\VignetteKeywords{Bioinformatics, Machine learning, Organelle, Spatial Proteomics} %\VignetteEncoding{UTF-8} --- ```{r style, echo = FALSE, results = 'asis'} BiocStyle::markdown() ``` ```{r env, include=FALSE, echo=FALSE, cache=FALSE} library("knitr") opts_chunk$set(stop_on_error = 1L) suppressPackageStartupMessages(library("MSnbase")) suppressWarnings(suppressPackageStartupMessages(library("pRoloc"))) suppressPackageStartupMessages(library("pRolocdata")) suppressPackageStartupMessages(library("class")) set.seed(1) ``` # Foreword {-} *[MSnbase](http://bioconductor.org/packages/MSnbase)* and *[pRoloc](http://bioconductor.org/packages/pRoloc)* are under active developed; current functionality is evolving and new features are added on a regular basis. This software is free and open-source software. If you use it, please support the project by citing it in publications: > Gatto L. and Lilley K.S. *MSnbase - an R/Bioconductor package for > isobaric tagged mass spectrometry data visualization, processing and > quantitation*. [Bioinformatics 28, 288-289 (2011)](https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btr645). > Gatto L, Breckels LM, Wieczorek S, Burger T, Lilley > KS. *Mass-spectrometry-based spatial proteomics data analysis using > pRoloc and pRolocdata.* > [Bioinformatics. 2014 May 1;30(9):1322-4.](https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btu013). > Breckels LM, Mulvey CM, Lilley KS and Gatto L. A Bioconductor > workflow for processing and analysing spatial proteomics > data. F1000Research 2016, 5:2926 > [doi: 10.12688/f1000research.10411.1](https://f1000research.com/articles/5-2926/). If you are using the *phenoDisco* function, please also cite > Breckels L.M., Gatto L., Christoforou A., Groen A.J., Kathryn Lilley > K.S. and Trotter M.W. *The effect of organelle discovery upon > sub-cellular protein localisation.* J Proteomics, > [S1874-3919(13)00094-8 (2013)](http://www.sciencedirect.com/science/article/pii/S1874391913000948). For an introduction to spatial proteomics data analysis: > Gatto L, Breckels LM, Burger T, Nightingale DJ, Groen AJ, Campbell > C, Nikolovski N, Mulvey CM, Christoforou A, Ferro M, Lilley KS. *A > foundation for reliable spatial proteomics data analysis*. Mol Cell > Proteomics. [2014 Aug;13(8):1937-52](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4125728/). doi: > 10.1074/mcp.M113.036350. The *[pRoloc](http://bioconductor.org/packages/pRoloc)* package contains additional vignettes and reference material: * *pRoloc-tutorial*: pRoloc tutorial. * *pRoloc-ml*: Machine learning techniques available in pRoloc. * *pRoloc-transfer-learning*: A transfer learning algorithm for spatial proteomics. * *pRoloc-goannotations*: Annotating spatial proteomics data. # Questions and bugs {-} You are welcome to contact me directly about `r Biocpkg("pRoloc")`. For bugs, typos, suggestions or other questions, please file an issue in our issue tracking system () providing as much information as possible, a reproducible example and the output of `sessionInfo()`. If you wish to reach a broader audience for general questions about proteomics analysis using `R`, you may want to use the Bioconductor support site: . # Introduction {#sec:intro} ## Spatial proteomics Spatial (or organelle) proteomics is the study of the localisation of proteins inside cells. The sub-cellular compartment can be organelles, i.e. structures defined by lipid bi-layers,macro-molecular assemblies of proteins and nucleic acids or large protein complexes. In this document, we will focus on mass-spectrometry based approaches that assay a population of cells, as opposed as microscopy based techniques that monitor single cells, as the former is the primary concern of `r Biocpkg("pRoloc")`, although the techniques described below and the infrastructure in place could also be applied the processed image data. The typical experimental use-case for using `r Biocpkg("pRoloc")` is a set of fractions, originating from a total cell lysate. These fractions can originate from a continuous gradient, like in the LOPIT [@Dunkley2006] or PCP [@Foster2006] approaches, or can be discrete fractions. The content of the fractions is then identified and quantified (using labelled or un-labelled quantitation techniques). Using relative quantitation of known organelle residents, termed organelle markers, organelle-specific profiles along the gradient are determined and new residents are identified based on matching of these distribution profiles. See for example [@Gatto2010] and references therein for a detailed review on organelle proteomics. It should be noted that large protein complexes, that are not necessarily separately enclosed within their own lipid bi-layer, can be detected by such techniques, as long as a distinct profile can be defined across the fractions. ## About R and *pRoloc* R [@Rstat] is a statistical programming language and interactive working environment. It can be expanded by so-called packages to confer new functionality to users. Many such packages have been developed for the analysis of high-throughput biology, notably through the Bioconductor project [@Gentleman2004]. Two packages are of particular interest here, namely `r Biocpkg("MSnbase")` [@Gatto2012] and `r Biocpkg("pRoloc")`. The former provides flexible infrastructure to store and manipulate quantitative proteomics data and the associated meta-data and the latter implements specific algorithmic technologies to analyse organelle proteomics data. Among the advantages of R are robust statistical procedures, good visualisation capabilities, excellent documentation, reproducible research^[The content of this document is compiled (the code is executed and its output, text and figures, is displayed dynamically) to generate the pdf file.], power and flexibility of the R language and environment itself and a rich environment for specialised functionality in many domains of bioinformatics: tools for many omics technologies, including proteomics, bio-statistics, gene ontology and biological pathway analysis, ... Although there exists some specific graphical user interfaces (GUI), interaction with R is executed through a command line interface. While this mode of interaction might look alien to new users, experience has proven that after a first steep learning curve, great results can be achieved by non-programmers. Furthermore, specific and general documentation is plenty and beginners and advanced course material are also widely available. Once R is started, the first step to enable functionality of a specific packages is to load them using the `library` function, as shown in the code chunk below: ```{r libraries} library("MSnbase") library("pRoloc") library("pRolocdata") ``` `r Biocpkg("MSnbase")` implements the data containers that are used by `r Biocpkg("pRoloc")`. `r Biocexptpkg("pRolocdata")` is a data package that supplies several published organelle proteomics data sets. As a final setup step, we set the default colour palette for some of our custom plotting functionality to use semi-transparent colours in the code chunk below (see *?setStockcol* for details). This facilitates visualisation of overlapping points. ```{r setcols} setStockcol(NULL) ## reset first setStockcol(paste0(getStockcol(), 70)) ``` # Data structures {#sec:data} ## Example data The data used in this tutorial has been published in [@Tan2009]. The LOPIT technique [@Dunkley2006] is used to localise integral and associated membrane proteins in *Drosophila melanogaster* embryos. Briefly, embryos were collected at 0 -- 16 hours, homogenised and centrifuged to collect the supernatant, removing cell debris and nuclei. Membrane fractionation was performed on a iodixanol gradient and fractions were quantified using iTRAQ isobaric tags [@Ross2004] as follows: fractions 4/5, 114; fractions 12/13, 115; fraction 19, 116 and fraction 21, 117. Labelled peptides were then separated using cation exchange chromatography and analysed by LS-MS/MS on a QSTAR XL quadrupole-time-of-flight mass spectrometer (Applied Biosystems). The original localisation analysis was performed using partial least square discriminant analysis (PLS-DA). Relative quantitation data was retrieved from the supplementary file `pr800866n_si_004.xls` () and imported into R as described below. We will concentrate on the first replicate. ## Importing and loading data This section illustrates how to import data in comma-separated value (csv) format into an appropriate R data structure. The first section shows the original `csv` (comma separated values) spreadsheet, as published by the authors, and how one can read such a file into \R using the *read.csv* function. This spreadsheet file is similar to the output of many quantitation software. In the next section, we show 2 `csv` files containing a subset of the columns of original `pr800866n_si_004-rep1.csv` file and another short file, created manually, that will be used to create the appropriate R data. ### The original data file {#sec:orgcsv} ```{r readCsvData0} ## The original data for replicate 1, available ## from the pRolocdata package f0 <- dir(system.file("extdata", package = "pRolocdata"), full.names = TRUE, pattern = "pr800866n_si_004-rep1.csv") csv <- read.csv(f0) ``` The three first lines of the original spreadsheet, containing the data for replicate one, are illustrated below (using the function *head*). It contains `r nrow(csv)` rows (proteins) and `r ncol(csv)` columns, including protein identifiers, database accession numbers, gene symbols, reporter ion quantitation values, information related to protein identification, ... ```{r showOrgCsv} head(csv, n=3) ``` ### From `csv` files to R data {#sec:csv} There are several ways to create the desired R data object, termed *MSnSet*, that will be used to perform the actual sub-cellular localisation prediction. Here, we illustrate a method that uses separate spreadsheet files for quantitation data, feature meta-data and sample (fraction) meta-data and the `readMSnSet` constructor function, that will hopefully be the most straightforward for new users. ```{r readCsvData1, tidy = FALSE} ## The quantitation data, from the original data f1 <- dir(system.file("extdata", package = "pRolocdata"), full.names = TRUE, pattern = "exprsFile.csv") exprsCsv <- read.csv(f1) ## Feature meta-data, from the original data f2 <- dir(system.file("extdata", package = "pRolocdata"), full.names = TRUE, pattern = "fdataFile.csv") fdataCsv <- read.csv(f2) ## Sample meta-data, a new file f3 <- dir(system.file("extdata", package = "pRolocdata"), full.names = TRUE, pattern = "pdataFile.csv") pdataCsv <- read.csv(f3) ``` * `exprsFile.csv` containing the quantitation (expression) data for the `r nrow(exprsCsv)` proteins and 4 reporter tags. ```{r showExprsFile} head(exprsCsv, n=3) ``` * `fdataFile.csv` containing meta-data for the `r nrow(fdataCsv)` features (here proteins). ```{r showFdFile} head(fdataCsv, n=3) ``` * `pdataFile.csv` containing samples (here fractions) meta-data. This simple file has been created manually. ```{r showPdFile} pdataCsv ``` A self-contained data structure, called *MSnSet* (defined in the `r Biocpkg("MSnbase")` package) can now easily be generated using the *readMSnSet* constructor, providing the respective `csv` file names shown above and specifying that the data is comma-separated (with `sep = ","`). Below, we call that object `tan2009r1` and display its content. ```{r makeMSnSet} tan2009r1 <- readMSnSet(exprsFile = f1, featureDataFile = f2, phenoDataFile = f3, sep = ",") tan2009r1 ``` ## A shorter input work flow The `readMSnSet2` function provides a simplified import pipeline. It takes a single spreadsheet as input (default is `csv`) and extract the columns identified by `ecol` to create the expression data, while the others are used as feature meta-data. `ecol` can be a `character` with the respective column labels or a numeric with their indices. In the former case, it is important to make sure that the names match exactly. Special characters like `'-'` or `'('` will be transformed by R into `'.'` when the `csv` file is read in. Optionally, one can also specify a column to be used as feature names. Note that these must be unique to guarantee the final object validity. ```{r readMSnSet2} ecol <- paste("area", 114:117, sep = ".") fname <- "Protein.ID" eset <- readMSnSet2(f0, ecol, fname) eset ``` The `ecol` columns can also be queried interactively from R using the `getEcols` and `grepEcols` function. The former return a character with all column names, given a splitting character, i.e. the separation value of the spreadsheet (typically `","` for `csv`, `"\t"` for `tsv`, ...). The latter can be used to grep a pattern of interest to obtain the relevant column indices. ```{r ecols} getEcols(f0, ",") grepEcols(f0, "area", ",") e <- grepEcols(f0, "area", ",") readMSnSet2(f0, e) ``` The `phenoData` slot can now be updated accordingly using the replacement functions `phenoData<-` or `pData<-` (see `?MSnSet` for details). ### The *MSnSet* class Although there are additional specific sub-containers for additional meta-data (for instance to make the object MIAPE compliant), the feature (the sub-container, or slot `featureData`) and sample (the `phenoData` slot) are the most important ones. They need to meet the following validity requirements (see [figure below](#fig:msnset)): * the number of row in the expression/quantitation data and feature data must be equal and the row names must match exactly, and * the number of columns in the expression/quantitation data and number of row in the sample meta-data must be equal and the column/row names must match exactly. It is common, in the context of `r Biocpkg("pRoloc")` to update the feature meta-data (described in section \@ref(sec:analysis)) by adding new columns, without breaking the objects validity. Similarly, the sample meta-data can also be updated by adding new sample variables. A detailed description of the `MSnSet` class is available by typing `?MSnSet` in the R console. ![Dimension requirements for the respective expression, feature and sample meta-data slots.](./Figures/msnset.png){#fig:msnset} The individual parts of this data object can be accessed with their respective accessor methods: * the quantitation data can be retrieved with `exprs(tan2009r1)`, * the feature meta-data with `fData(tan2009r1)` and * the sample meta-data with `pData(tan2009r1)`. The advantage of this structure is that it can be manipulated as a whole and the respective parts of the data object will remain compatible. The code chunk below, for example, shows how to extract the first 5 proteins and 2 first samples: ```{r showSubset} smallTan <- tan2009r1[1:5, 1:2] dim(smallTan) exprs(smallTan) ``` Several data sets, including the 3 replicates from [@Tan2009], are distributed as *MSnSet* instances in the `r Biocexptpkg("pRolocdata")` package. Others include, among others, the *Arabidopsis thaliana* LOPIT data from [@Dunkley2006] (`dunkley2006`) and the mouse PCP data from [@Foster2006] (`foster2006`). Each data set can be loaded with the `data` function, as show below for the first replicate from [@Tan2009]. ```{r rmtan, echo=FALSE} ## remove manullay constructred data rm(tan2009r1) data(tan2009r1) ``` ```{r loadTan1} data(tan2009r1) ``` The original marker proteins are available as a feature meta-data variables called `markers.orig` and the output of the partial least square discriminant analysis, applied in the original publication, in the `PLSDA` feature variable. The most up-to-date marker list for the experiment can be found in `markers`. This feature meta-data column can be added from a simple `csv` markers files using the `addMarkers` function - see `?addMarkers` for details. The organelle markers are illustrated below using the convenience function *getMarkers*, but could also be done manually by accessing feature variables directly using *fData()*. ```{r lookAtTan} getMarkers(tan2009r1, fcol = "markers.orig") getMarkers(tan2009r1, fcol = "PLSDA") ``` **Important** As can be seen above, some proteins are labelled `"unknown"`, defining non marker proteins. This is a convention in many `r Biocpkg("pRoloc")` functions. Missing annotations (empty string) will not be considered as of unknown localisation; we prefer to avoid empty strings and make the absence of known localisation explicit by using the `"unknown"` tag. This information will be used to separate marker and non-marker (unlabelled) proteins when proceeding with data visualisation and clustering (sections \@ref(sec:viz) and \@ref(sec:usml)) and classification analysis (section \@ref(sec:sml)). ## *pRoloc*'s organelle markers The `r Biocpkg("pRoloc")` package distributes a set of markers that have been obtained by mining the `r Biocexptpkg("pRolocdata")` datasets and curation by various members of the [Cambridge Centre for Proteomics](http://proteomics.bio.cam.ac.uk/). The available marker sets can be obtained and loaded using the `pRolocmarkers` function: ```{r markers} pRolocmarkers() head(pRolocmarkers("dmel")) table(pRolocmarkers("dmel")) ``` These markers can then be added to a new *MSnSet* using the *addMarkers* function by matching the marker names (protein identifiers) and the feature names of the *MSnSet*. See *?addMarkers* for examples. ## Data processing The quantitation data obtained in the supplementary file is normalised to the sum of intensities of each protein; the sum of fraction quantitation for each protein equals 1 (considering rounding errors). This can quickly be verified by computing the row sums of the expression data. ```{r realtiveQuants} summary(rowSums(exprs(tan2009r1))) ``` The `normalise` method (also available as `normalize`) from the `r Biocpkg("MSnbase")` package can be used to obtain relative quantitation data, as illustrated below on another iTRAQ test data set, available from `r Biocpkg("MSnbase")`. Several normalisation `methods` are available and described in `?normalise`. For many algorithms, including classifiers in general and support vector machines in particular, it is important to properly per-process the data. Centering and scaling of the data is also available with the `scale` method. In the code chunk below, we first create a test *MSnSet* instance^[Briefly, the `itraqdata` raw iTRAQ4-plex data is quantified by trapezoidation using `r Biocpkg("MSnbase")` functionality. See the `MSnbase-demo` vignette for details.] and illustrate the effect of `normalise(..., method = "sum")`. ```{r norm, echo=TRUE, message=FALSE, cache=TRUE} ## create a small illustrative test data data(itraqdata) itraqdata <- quantify(itraqdata, method = "trap", reporters = iTRAQ4) ## the quantification data head(exprs(itraqdata), n = 3) summary(rowSums(exprs(itraqdata))) ## normalising to the sum of feature intensitites itraqnorm <- normalise(itraqdata, method = "sum") processingData(itraqnorm) head(exprs(itraqnorm), n = 3) summary(rowSums(exprs(itraqnorm))) ``` Note above how the processing undergone by the *MSnSet* instances `itraqdata` and `itraqnorm` is stored in another such specific sub-container, the `processingData` slot. The different features (proteins in the `tan2009r1` data above, but these could also represent peptides or MS$^2$ spectra) are characterised by unique names. These can be retrieved with the `featureNames` function. ```{r featurenames} head(featureNames(tan2009r1)) ``` If we look back at section \@ref(sec:csv), we see that these have been automatically assigned using the first columns in the `exprsFile.csv` and `fdataFile.csv` files. It is thus crucial for these respective first columns to be identical. Similarly, the sample names can be retrieved with `sampleNames(tan2009r1)`. # Data visualisation {#sec:viz} The following sections will focus on two closely related aspects, data visualisation and data analysis (i.e. organelle assignments). Data visualisation is used in the context on quality control, to convince ourselves that the data displays the expected properties so that the output of further processing can be trusted. Visualising results of the localisation prediction is also essential, to control the validity of these results, before proceeding with orthogonal (and often expensive) *dry* or *wet* validation. ## Profile plots {#sec:profplot} The underlying principle of gradient approaches is that we have separated organelles along the gradient and by doing so, generated organelle-specific protein distributions along the gradient fractions. The most natural visualisation is shown on figure \@ref(fig:plotdist1), obtained using the sub-setting functionality of *MSnSet* instances and the `plotDist` function, as illustrated below. ```{r showplotdist, echo=TRUE, eval=FALSE} ## indices of the mito markers j <- which(fData(tan2009r1)$markers.orig == "mitochondrion") ## indices of all proteins assigned to the mito i <- which(fData(tan2009r1)$PLSDA == "mitochondrion") plotDist(tan2009r1[i, ], markers = featureNames(tan2009r1)[j]) ``` ```{r plotdist1, echo=FALSE, fig.cap="Distribution of protein intensities along the fractions of the separation gradient for 4 organelles: mitochondrion (red), ER/Golgi (blue, ER markers and green, Golgi markers) and plasma membrane (purple)."} par(mfrow = c(2,2), mar = c(4, 4, 2, 0.1)) cls <- getStockcol()[1:4] plotDist(tan2009r1[which(fData(tan2009r1)$PLSDA == "mitochondrion"), ], markers = featureNames(tan2009r1) [which(fData(tan2009r1)$markers.orig == "mitochondrion")], mcol = cls[1]) title(main = "mitochondrion") plotDist(tan2009r1[which(fData(tan2009r1)$PLSDA == "ER/Golgi"), ], markers = featureNames(tan2009r1) [which(fData(tan2009r1)$markers.orig == "ER")], mcol = cls[2]) title(main = "ER") plotDist(tan2009r1[which(fData(tan2009r1)$PLSDA == "ER/Golgi"), ], markers = featureNames(tan2009r1) [which(fData(tan2009r1)$markers.orig == "Golgi")], mcol = cls[3]) title(main = "Golgi") plotDist(tan2009r1[which(fData(tan2009r1)$PLSDA == "PM"), ], markers = featureNames(tan2009r1) [which(fData(tan2009r1)$markers.orig == "PM")], mcol = cls[4]) title(main = "PM") ``` ## Sub-cellular cluster dendrogram {#sec:dendro} To gain a quick overview of the distance/similarity between the sub-cellular clusters, it can useful to compare average marker profiles, rather than all profiles, as presented in the profile plots above. The `mrkHClust` calculates average class profiles and generates the resulting dendrogram. ```{r dendro, fig.cap="Hierarchical clustering. Average distance between organelle classes."} mrkHClust(tan2009r1) ``` On figure \@ref(fig:dendro), we see that the lysosome and the ribosome 60S are separated by the smallest distance. The advantage of this representation is that it provides a quick snapshot of the average similarity between organelles using the complete profiles (as opposed to a PCA plot, discussed in the next section). The main drawback is that it ignores any variability in individual markers (cluster thighness). It is however a good guide for a more thorough exploration of the data, as described in the next sections. Note the colours of the labels on the dendrogram (figure \@ref(fig:dendro)), which match the colours used to annotate PCA plots, described in the next section (figure \@ref(fig:plot2d). These colours are defined at the session level (see `getStockcol` and `setStockcol`) and re-used throughout `r Biocpkg("pRoloc")` for consistent annotation. ## Dimensionality reduction {#sec:pcalot} Alternatively, we can combine all organelle groups in one single 2 dimensional figure by applying a dimensionality reduction technique such as Principal Component Analysis (PCA) using the `plot2D` function (see figure \@ref(fig:plot2d)). The protein profile vectors are summarised into 2 values that can be visualised in two dimensions, so that the variability in the data will be maximised along the first principal component (PC1). The second principal component (PC2) is then chosen as to be orthogonal to PC1 while explaining as much variance in the data as possible, and so on for PC3, PC4, etc. Using a PCA representation to visualise a spatial proteomics experiment, we can easily plot all the proteins on the same figure as well a many sub-cellular clusters. These clusters are defined in a feature meta-data column (slot `featureData`, accessed as a `data.frame` with the `fData` function) that is declared with the `fcol` argument (default is `"markers"` which contains the most current known markers for this experiment, while the original markers published with the data can be found in the slot `"markers.orig"`). ```{r plot2d, echo=TRUE, fig.asp=1, fig.cap="PCA plot. Representation of the proteins of `tan2009r1` after reduction of the 4 reporter quantitation data to 2 principal components."} plot2D(tan2009r1, fcol = "markers") addLegend(tan2009r1, fcol = "markers", cex = .7, where = "bottomright", ncol = 2) ``` As the default value for the `fcol` argument is `"markers"`, it is not necessary to specify it. It is however mandatory to specify other annotation feature variables, such as to visualise the set of markers described in the original publication. ```{r plot2dorg, echo=TRUE, fig.asp=1, fig.cap="PCA plot. Reduced set of markers for the `tan2009r1` data projected onto 2 principal components."} plot2D(tan2009r1, fcol = "markers.orig") addLegend(tan2009r1, fcol = "markers.orig", where = "bottomright") ``` It is also possible to visualise the data along 3 dimensions using the *plot3D* function, which works like the 2 dimension version ([figure below](#fig:plot3D)). The resulting figure can be rotated by the user. ```{r plot3Danim, echo=FALSE, eval=FALSE} ## here's how the animation below was created data(hyperLOPIT2015) suppressPackageStartupMessages(library("rgl")) plot3D(hyperLOPIT2015) movie3d(spin3d(c(1, 1, 1), rpm = 10), duration = 5, clean = FALSE, dir = "~/dev/00_github/pRoloc/vignettes/Figures/") ``` ```{r plot3D, eval = FALSE} plot3D(tan2009r1) ``` ![Snapshot of the 3-dimensional PCA plot. The `tan2009r1` data is represented along the first 3 principal components.](./Figures/plot3D.png){#fig:plot3D} ```{r scree0, echo=FALSE} pcvar <- plot2D(tan2009r1, method = "scree", plot = FALSE) pcvar <- round(pcvar, 2) ``` As can be seen on the figures \@ref(fig:plot2d), \@ref(fig:plot2dorg) and on the [3D plot above](#fig:plot3D), we indicate on the axis labels the percentage of total variance explained by the individual PCs. It is not unusual to reach over 75% along the first two PCs, even for experiments with several tens of fractions. One can calculate this information for all PCs by setting `method = "scree"` in `plot2D`. On figure \@ref(fig:scree), we see that the four PCs in the `tan2009r1` data account for `r paste(pcvar[1:3], collapse = ", ")` and `r pcvar[4]` percent of the total variance. ```{r scree, fig.cap="Percentage of variance explained."} plot2D(tan2009r1, method = "scree") ``` A variety of dimensionality reduction methods are available in `plot2D`: `r paste(pRoloc:::plot2Dmethods, collapse = ", ")`. Except for `scree` (see above) and `none` (no data transformation, which can be useful when the data is already transformed and only needs to be plotted), these can used to produce visualisation of the data in two dimensions. Two are worth some discussion here; the readers are redirected to the manual page for more details. Linear discriminant analysis (LDA) will project the protein occupancy profiles in a new set of dimensions using as a criterion the separation of marker classes by maximising the between class variance to the within class variance ratio. As opposed to the *unsupervised* PCA, the *supervised* LDA should not be used as an to explore the data or as a quality control, but can be useful to assess if one or more organelles have been preferentially separated. The t-Distributed Stochastic Neighbour Embedding (t-SNE)^[] [@tsne:2008] is widely applied in many areas in computational biology and more generally field that need to visualise high-dimensional data. The t-SNE method is non-linear, and will emphasise separation of different features while grouping features with similar profiles. In addition, different transformations are applied on different regions leading to plots that can substantially differ from a PCA plot. As a result, proximity in two dimensions and tightness of the clusters can't be related to these quantities in the original data. See *How to Use t-SNE Effectively*^[] for a useful non-technical introduction. The results of the algorithm crucially depend on the values of its input parameters, in particular the *perplexity*, which balances global and local aspects of the data. The suggested range of value ranges from 5 to 50, and should not be greater than the number of data points (which we can assume not to be the case in modern spatial proteomics experiments). Below, we test the effect of this parameter along the suggested range, including the default value of 30, at which the algorithm converges. ```{r tsne, eval=FALSE} perps <- sort(c(30, seq(5, 50, 15))) data(HEK293T2011) par(mfrow = c(2, 3)) plot2D(HEK293T2011, main = "PCA") sapply(perps, function(perp) { plot2D(HEK293T2011, method = "t-SNE", methargs = list(perplexity = perp)) title(main = paste("t-SNE, perplexity", perp)) }) ``` ![Effect of t-SNE's perplexity parameter on the human `HEK293T2011` data.](./Figures/tsne.png){#fig:tsne} Other parameters that can effect the results are the number of iterations and the learning rate epsilon. The t-SNE algorithm takes much more time to complete than the other available methods. In such cases, saving the results and re-plotting with method `none` can be useful (see `?plot2D`). In the case of this document, the [figure above](#fig:tsne), was pre-generated rather than computed upon compilation. ## Features of interest In addition to highlighting sub-cellular niches as coloured clusters on the PCA plot, it is also possible to define some arbitrary *features of interest* that represent, for example, proteins of a particular pathway or a set of interaction partners. Such sets of proteins are recorded as *FeaturesOfInterest* instances, as illstrated below (using the ten first features of our experiment): ```{r foi0} foi1 <- FeaturesOfInterest(description = "Feats of interest 1", fnames = featureNames(tan2009r1[1:10])) description(foi1) foi(foi1) ``` Several such features of interest can be combined into collections: ```{r } foi2 <- FeaturesOfInterest(description = "Feats of interest 2", fnames = featureNames(tan2009r1[880:888])) foic <- FoICollection(list(foi1, foi2)) foic ``` *FeaturesOfInterest* instances can now be overlaid on the PCA plot with the `highlightOnPlot` function. The `highlightOnPlot3D` can be used to overlay data onto a 3 dimensional figure produced by `plot3D`. ```{r plotfoi, fig.asp=1, fig.cap="Adding features of interest on a PCA plot."} plot2D(tan2009r1, fcol = "PLSDA") addLegend(tan2009r1, fcol = "PLSDA", where = "bottomright", cex = .7) highlightOnPlot(tan2009r1, foi1, col = "black", lwd = 2) highlightOnPlot(tan2009r1, foi2, col = "purple", lwd = 2) legend("topright", c("FoI 1", "FoI 2"), bty = "n", col = c("black", "purple"), pch = 1) ``` See `?FeaturesOfInterest` and `?highlightOnPlot` for more details. ## Interactive visualisation {#sec:gui} The `r Biocpkg("pRolocGUI")` application allows one to explore the spatial proteomics data using an interactive, web-based \CRANpkg{shiny} interface [@shiny]. The package is available from Bioconductor and can be installed and started as follows: ```{r guiinstall, eval = FALSE} library("BiocInstaller") biocLite("pRolocGUI") library("pRolocGUI") pRolocVis(tan2009r1) ``` ![Screenshot of the `r Biocpkg("pRolocGUI")` interface. The GUI is also available as an online app for the hyperLOPIT experiment from [@Christoforou:2016] at .](./Figures/Screenshot_PCA2.png){#fig:gui} More details are available in the vignette that can be started from the application by clicking on any of the question marks, by starting the vignette from R with `vignette("pRolocGUI")` or can be accessed online (). # Assessing sub-cellular resolution {#sec:qsep} TODO # Data analysis {#sec:analysis} Classification of proteins, i.e. assigning sub-cellular localisation to proteins, is the main aspect of the present data analysis. The principle is the following and is, in its basic form, a 2 step process. First, an algorithm learns from the known markers that are shown to him and models the data space accordingly. This phase is also called the training phase. In the second phase, un-labelled proteins, i.e. those that have not been labelled as resident of any organelle, are matched to the model and assigned to a group (an organelle). This 2 step process is called machine learning (ML), because the computer (machine) learns by itself how to recognise instances that possess certain characteristics and classifies them without human intervention. That does however not mean that results can be trusted blindly. In the above paragraph, we have defined what is called supervised ML, because the algorithm is presented with some know instances from which it learns (see section \@ref(sec:sml)). Alternatively, un-supervised ML does not make any assumptions about the group memberships, and uses the structure of the data itself to defined sub-groups (see section \@ref(sec:usml)). It is of course possible to classify data based on labelled and unlabelled data. This extension of the supervised classification problem described above is called semi-supervised learning. In this case, the training data consists of both labelled and unlabelled instances with the obvious goal of generating a better classifier than would be possible with the labelled data only. The `phenoDisco` algorithm, will be illustrated in that context (section \@ref(sec:ssml)). ## Unsupervised ML {#sec:usml} The `plot2D` can also be used to visualise the data on a PCA plot omitting any marker definitions, as shown on figure \@ref(fig:plot2dnull). This approach avoids any bias towards marker definitions and concentrate on the data and its underlying structure itself. ```{r plot2dnull, echo=TRUE, fig.cap="Plain PCA representation of the `tan2009r1` data."} plot2D(tan2009r1, fcol = NULL) ``` Alternatively, `r Biocpkg("pRoloc")` also gives access to `r Biocpkg("MLInterfaces")`'s `MLean` unified interface for, among others, unsupervised approaches using k-means (figure \@ref(fig:plotKmeans)), hierarchical (figure \@ref(fig:plotHclust)) or partitioning around medoids (figure \@ref(fig:plotPam)), clustering. ```{r plotKmeans, echo=TRUE, fig.asp=1, fig.cap="k-means clustering on the `tan2009r1` data."} kcl <- MLearn( ~ ., tan2009r1, kmeansI, centers=5) plot(kcl, exprs(tan2009r1)) ``` ```{r plotHclust, echo=TRUE, tidy = FALSE, fig.asp=1, fig.cap="Hierarchical clustering on the `tan2009r1` data."} hcl <- MLearn( ~ ., tan2009r1, hclustI(distFun = dist, cutParm = list(k = 5))) plot(hcl, labels = FALSE) ``` ```{r plotPam, echo=TRUE, fig.asp=1, fig.cap="Partitioning around medoids on the `tan2009r1` data."} pcl <- MLearn( ~ ., tan2009r1, pamI(dist), k = 5) plot(pcl, data = exprs(tan2009r1)) ``` ## Supervised ML {#sec:sml} In this section, we show how to use `r Biocpkg("pRoloc")` to run a typical supervised ML analysis. Several ML methods are available, including k-nearest neighbour (knn), partial least square discriminant analysis (plsda), random forest (rf), support vector machines (svm), \ldots The detailed description of each method is outside of the scope of this document. We will use support vector machines to illustrate a typical pipeline and the important points that should be paid attention to. These points are equally valid and work, from a `r Biocpkg("pRoloc")` user perspective, exactly the same for the other approaches. ### Classification algorithm parameters optimisation Before actually generating a model on the new markers and classifying unknown residents, one has to take care of properly setting the model parameters. Wrongly set parameters can have a very negative impact on classification performance. To do so, we create testing (to model) and training (to predict) subsets using known residents, i.e. marker proteins. By comparing observed and expected classification prediction, we can assess how well a given model works using the macro F1 score (see below). This procedure is repeated for a range of possible model parameter values (this is called a grid search), and the best performing set of parameters is then used to construct a model on all markers and predict un-labelled proteins. For this parameter optimisation procedure to perform well and produce useful results, it is essential to run it with a reasonable amount of markers. In our experience, 15 such marker proteins are necessary. Model accuracy is evaluated using the F1 score, $F1 = 2 ~ \frac{precision \times recall}{precision + recall}$, calculated as the harmonic mean of the precision ($precision = \frac{tp}{tp+fp}$, a measure of *exactness* -- returned output is a relevant result) and recall ($recall=\frac{tp}{tp+fn}$, a measure of *completeness* -- indicating how much was missed from the output). What we are aiming for are high generalisation accuracy, i.e high $F1$, indicating that the marker proteins in the test data set are consistently correctly assigned by the algorithms. In order to evaluate how well a classifier performs on profiles it was not exposed to during its creation, we implemented the following schema. Each set of marker protein profiles, i.e. labelled with known organelle association, are separated into *training* and *test/validation* partitions by sampling 80\% of the profile corresponding to each organelle (i.e. stratified) without replacement to form the training partition $S_{tr}$ with the remainder becoming the test/validation partition $S_{ts}$. The svm regularisation parameter $C$ and Gaussian kernel width $sigma$ are selected using a further round of stratified five-fold cross-validation on each training partition. All pairs of parameters $(C_i, sigma_j)$ under consideration are assessed using the macro F1 score and the pair that produces the best performance is subsequently employed in training a classifier on all training profiles $S_{tr}$ prior to assessment on all test/validation profiles $S_{ts}$. This procedure is repeated $N$ times (in the example below 10) in order to produce $N$ macro F1 estimated generalisation performance values (figure \@ref(fig:params)). This procedure is implemented in the `svmOptimisation`. See `?svmOptimisation` for details, in particular the range of $C$ and $sigma$ parameters and how the relevant feature variable is defined by the `fcol` parameters, which defaults to `"markers"`. ```{r svmParamOptim, eval = FALSE, message = FALSE, tidy=FALSE} params <- svmOptimisation(tan2009r1, fcol = "markers.orig", times = 100, xval = 5, verbose = FALSE) ``` In the interest of time, the optimisation is not executed but loaded with ```{r loadParams, echo = TRUE} fn <- dir(system.file("extdata", package = "pRoloc"), full.names = TRUE, pattern = "params.rda") load(fn) ``` ```{r showParams} params ``` ```{r params, echo=TRUE, fig.cap="Assessing parameter optimisation. On the left, we see the respective distributions of the 10 macro F1 scores for the best cost/sigma parameter pairs. See also the output of *f1Count* in relation to this plot. On the right, we see the averaged macro F1 scores, for the full range of parameter values."} plot(params) levelPlot(params) ``` In addition to the plots on figure \@ref(fig:params), `f1Count(params)` returns, for each combination of parameters, the number of best (highest) F1 observations. One can use `getParams` to see the default set of parameters that are chosen based on the executed optimisation. Currently, the first best set is automatically extracted, and users are advised to critically assess whether this is the most wise choice. ```{r f1count} f1Count(params) getParams(params) ``` **Important** It is essential to emphasise that the accuracy scores obtained during parameter optimisation are only a reflection of the classification performance on a set of distinct and ideally separated spatial clusters. Here, we assume that the data is characterised by good separation of the various spatial niches which are reflected by sub-cellular markers. Quality control of the data and the markers using the visualisation described in section \@ref(sec:viz) is essential and subsequent analyses are doomed to fail in the absence of such separation. These classification scores are not representative of the reliability of the final classification (described in section \@ref(sec:sml)), in particular along the boundaries separating the different sub-cellular niches. High scores at the optimisation stage are a requirement to proceed with the analysis, but are by no means indicative of the false positive rate in the final sub-cellular assignment of non-marker proteins. ### Classification {#sec:sml} We can now re-use the result from our parameter optimisation (a best cost/sigma pair is going to be automatically extracted, using the `getParams` method, although it is possible to set them manually), and use them to build a model with all the marker proteins and predict unknown residents using the *svmClassification* function (see the manual page for more details). By default, the organelle markers will be defined by the `"markers"` feature variables (and can be defined by the `fcol` parameter) e.g. here we use the original markers in `"markers.orig"` as a use case. New feature variables containing the organelle assignments and assignment probabilities, called scores hereafter, are automatically added to the `featureData` slot; in this case, using the `svm` and `svm.scores` labels. **Important** The calculation of the classification probabilities is dependent on the classification algorithm. These probabilities are not to be compared across algorithms; they do **not** reflect any **biologically relevant** sub-cellular localisation probability but rather an algorithm-specific classification confidence score.} ```{r svmRes0, warning=FALSE, eval = FALSE, tidy = FALSE} ## manual setting of parameters svmres <- svmClassification(tan2009r1, fcol = "markers.orig", sigma = 1, cost = 1) ``` ```{r svmRes, warning=FALSE, tidy = FALSE} ## using default best parameters svmres <- svmClassification(tan2009r1, fcol = "markers.orig", assessRes = params) processingData(svmres) tail(fvarLabels(svmres), 4) ``` The original markers, classification results and scores can be accessed with the `fData` accessor method, e.g. `fData(svmres)$svm` or `fData(svmres)$svm.scores`. Two helper functions, `getMarkers` and `getPredictions` are available and add some level of automation and functionality, assuming that the default feature labels are used. Both (invisibly) return the corresponding feature variable (the markers or assigned classification) and print a summary table. The `fcol` parameter must be specified for `getPredictions`. It is also possible to defined a classification probability below which classifications are set to `"unknown"`. ```{r getPredictions} p1 <- getPredictions(svmres, fcol = "svm") p1 <- fData(p1)$svm.pred minprob <- median(fData(svmres)$svm.scores) p2 <- getPredictions(svmres, fcol = "svm", t = minprob) p2 <- fData(p2)$svm.pred table(p1, p2) ``` To graphically illustrate the organelle-specific score distributions, we can use a boxplot and plot the scores for the respective predicted svm classes, as shown on figure \@ref(fig:predscores). As can be seen, different organelles are characterised by different score distributions. Using a unique threshold (`minprob` with value `r round(minprob, 2)` above) results in accepting `r round(sum(p2 == "ER")/sum(p1 == "ER"), 2) * 100`% of the initial ER predictions and only `r round(sum(p2 == "mitochondrion")/sum(p1 == "mitochondrion"), 2) * 100`% of the mitochondrion predictions. The *getPredictions* function also accepts organelle-specific score thresholds. Below, we calculate organelle-specific median scores. ```{r predscores, fig.cap="Organelle-specific SVM score distributions."} boxplot(svm.scores ~ svm, data = fData(svmres), ylab = "SVM scores") abline(h = minprob, lwd = 2, lty = 2) ``` ```{r medscores2, tidy = FALSE} ts <- orgQuants(svmres, fcol = "svm", t = .5) ``` Using these scores equates to choosing the 50\% predictions with highest scores for organelle. ```{r medscorepreds, tidy = FALSE} getPredictions(svmres, fcol = "svm", t = ts) ``` We can now visualise these results using the plotting functions presented in section \@ref(sec:usml), as shown on figure \@ref(fig:svmres). We clearly see that besides the organelle marker clusters that have been assigned high confidence members, many other proteins have substantially lower prediction scores. ```{r svmres, fig.asp=1, fig.cap="Representation of the svm prediction on the `tan2009r1` data set. The svm scores have been used to set the point size (`cex` argument; the scores have been transformed to emphasise the extremes). Different symbols (`fpch`) are used to differentiate markers and new assignments."} ptsze <- exp(fData(svmres)$svm.scores) - 1 plot2D(svmres, fcol = "svm", fpch = "markers.orig", cex = ptsze) addLegend(svmres, fcol = "svm", where = "bottomright", cex = .5) ``` ## Semi-supervised ML {#sec:ssml} It is obvious that the original set of markers initially used (`r paste(levels(fData(tan2009r1)$markers.orig)[1:4], collapse = ", ")`) is not a biologically realistic representation or the organelle diversity. Manually finding markers is however time consuming, as it requires careful verification of the annotation, and possibly critical for the subsequent analysis, as markers are directly used in the training phase of the supervised ML approach. As can be seen in the PCA plots above, there is inherent structure in the data that can be made use of to automate the detection of new clusters. The `phenoDisco` algorithm [@Breckels2013] is an iterative method, that combines classification of proteins to known groups and detection of new clusters. It is available in `r Biocpkg("pRoloc")` though the `phenoDisco` function. ```{r runPhenoDisco, eval=FALSE, warning=FALSE} pdres <- phenoDisco(tan2009r1, GS = 10, times = 100, fcol = "PLSDA") ``` Again, in the interest of time, *phenoDisco* is not executed when the vignette is dynamically built. The data object can be located in the `extdata` direcoty and loaded with: ```{r loadpdres, echo=TRUE} fn <- dir(system.file("extdata", package = "pRoloc"), full.names = TRUE, pattern = "pdres.rda") load(fn) ``` The results are also appended to the `featureData` slot. ```{r phenoDiscoFvar} processingData(pdres) tail(fvarLabels(pdres), 3) ``` The `plot2D` function, can, as previously, be utilised to visualise the results, as shown on figure \@ref(fig:pdres). ```{r pdres, echo=TRUE, fig.asp=1, fig.cap="Representation of the phenoDisco prediction and cluster discovery results."} plot2D(pdres, fcol = "pd") addLegend(pdres, fcol = "pd", ncol = 2, where = "bottomright", cex = .5) ``` ## Following up on novelty discovery {#sec:followup} The newly discovered phenotypes need to be carefully validated prior to further analysis. Indeed, as the structure of the data is made use of in the discovery algorithm, some might represent peculiar structure in the data and not match with biologically relevant groups. The `tan2009r1` data has been submitted to a careful `phenodisco` analysis and validation in [@Breckels2013]. The results of this new, augmented marker set is available in the `pd.markers` feature data. These markers represent a combined set of the original markers and validated proteins from the new phenotypes. ```{r pdmarkers} getMarkers(tan2009r1, fcol = "pd.markers") ``` The augmented set of markers is now employed to repeat the classification using the support vector machine classifier. We apply a slightly different analysis than described in section \@ref(sec:sml). In the code chunks below, we use class specific weights when creating the svm model; the weights are set to be inversely proportional to class frequencies. ```{r weights, eval = TRUE, echo = TRUE} w <- classWeights(tan2009r1, fcol = "pd.markers") w ``` ```{r pdsvmParams, eval = FALSE, tidy = FALSE} params2 <- svmOptimisation(tan2009r1, fcol = "pd.markers", times = 10, xval = 5, class.weights = w, verbose = FALSE) ``` As above, the results are pre-computed and available in the `extdata` package directory. ```{r loadParams2, echo = TRUE} fn <- dir(system.file("extdata", package = "pRoloc"), full.names = TRUE, pattern = "params2.rda") load(fn) ``` ```{r pdsvm, cache = TRUE, message = FALSE, warning = FALSE, tidy = FALSE} tan2009r1 <- svmClassification(tan2009r1, params2, class.weights = w, fcol = "pd.markers") ``` The data is visualised as described previously, were we use the svm classification a-posteriori probability to set the point size. ```{r pdres2, fig.asp=1, fig.cap="Classification results. The results of the second round of classification, using the augmented set of markers obtained using *phenoDisco* as detailed in [@Breckels2013] and a weighted svm classifier."} ptsze <- exp(fData(tan2009r1)$svm.scores) - 1 plot2D(tan2009r1, fcol = "svm", cex = ptsze) addLegend(tan2009r1, fcol = "svm", where = "bottomright", ncol = 2, cex = .5) ``` # Conclusions {#sec:ccl} This tutorial focuses on practical aspects of organelles proteomics data analysis using `r Biocpkg("pRoloc")`. Two important aspects have been illustrates: (1) data generation, manipulation and visualisation and (2) application of contemporary and novel machine learning techniques. Other crucial parts of a full analysis pipeline that were not covered here are raw mass-spectrometry quality control, quantitation, post-analysis and data validation. Data analysis is not a trivial task, and in general, one can not assume that any off-the-shelf algorithm will perform well. As such, one of the emphasis of the software presented in this document is allowing users to track data processing and critically evaluate the results. # Acknowledgement {-} We would like to thank Dr Daniel J.H. Nightingale, Dr Arnoud J. Groen, Dr Claire M. Mulvey and Dr Andy Christoforou for their organelle marker contributions. # Session information {-} All software and respective versions used to produce this document are listed below. ```{r sessioninfo, echo=FALSE} sessionInfo() ``` # References {-}