--- title: "MSnbase IO capabilities" author: - name: Laurent Gatto affiliation: Computational Proteomics Unit, Cambridge, UK. package: MSnbase abstract: > This vignette describes *MSnbase*'s input and output capabilities. bibliography: MSnbase.bib output: BiocStyle::html_document: toc_float: true vignette: > %\VignetteIndexEntry{MSnbase IO capabilities} %\VignetteEngine{knitr::rmarkdown} %\VignetteKeywords{Mass Spectrometry, Proteomics, Infrastructure } %\VignetteEncoding{UTF-8} --- ```{r env, echo=FALSE} suppressPackageStartupMessages(library("BiocStyle")) suppressPackageStartupMessages(library("MSnbase")) suppressPackageStartupMessages(library("pRolocdata")) ``` ```{r include_forword, echo=FALSE, results="asis"} cat(readLines("./Foreword.md"), sep = "\n") ``` ```{r include_bugs, echo=FALSE, results="asis"} cat(readLines("./Bugs.md"), sep = "\n") ``` # Overview `r Biocpkg("MSnbase")`'s aims are to facilitate the reproducible analysis of mass spectrometry data within the R environment, from raw data import and processing, feature quantification, quantification and statistical analysis of the results [@Gatto2012]. Data import functions for several formats are provided and intermediate or final results can also be saved or exported. These capabilities are presented below. # Data input #### Raw data {-} Data stored in one of the published `XML`-based formats. i.e. `mzXML` [@Pedrioli2004], `mzData` [@Orchard2007] or `mzML` [@Martens2010], can be imported with the `readMSData` method, which makes use of the `r Biocpkg("mzR")` package to create `MSnExp` objects. The files can be in profile or centroided mode. See `?readMSData` for details. Data from `mzML` files containing chromatographic data (e.g. generated in SRM/MRM experiments) can be imported with the `readSRMData` function that returns the chromatographic data as a `Chromatograms` object. See `?readSRMData` for more details. #### Peak lists {-} Peak lists in the `mgf` format^[http://www.matrixscience.com/help/data_file_help.html] can be imported using the `readMgfData`. In this case, the peak data has generally been pre-processed by other software. See `?readMgfData` for details. #### Quantitation data {-} Third party software can be used to generate quantitative data and exported as a spreadsheet (generally comma or tab separated format). This data as well as any additional meta-data can be imported with the `readMSnSet` function. See `?readMSnSet` for details. `r Biocpkg("MSnbase")` also supports the `mzTab` format^[https://github.com/HUPO-PSI/mzTab], a light-weight, tab-delimited file format for proteomics data developed within the Proteomics Standards Initiative (PSI). `mzTab` files can be read into R with `readMzTabData` to create and `MSnSet` instance. ![*MSnbase* input capabilities. The white and red boxes represent R functions/methods and objects respectively. The blue boxes represent different disk storage formats.](./Figures/MSnbase-io-in.png) # Data output #### RData files {-} R objects can most easily be stored on disk with the `save` function. It creates compressed binary images of the data representation that can later be read back from the file with the `load` function. #### mzML/mzXML files {-} `MSnExp` and `OnDiskMSnExp` files can be written to MS data files in `mzML` or `mzXML` files with the `writeMSData` method. See `?writeMSData` for details. #### Peak lists {-} `MSnExp` instances as well as individual spectra can be written as `mgf` files with the `writeMgfData` method. Note that the meta-data in the original R object can not be included in the file. See `?writeMgfData` for details. #### Quantitation data {-} Quantitation data can be exported to spreadsheet files with the `write.exprs` method. Feature meta-data can be appended to the feature intensity values. See `?writeMgfData` for details. **Deprecated** `MSnSet` instances can also be exported to `mzTab` files using the `writeMzTabData` function. ![*MSnbase* output capabilities. The white and red boxes represent R functions/methods and objects respectively. The blue boxes represent different disk storage formats.](./Figures/MSnbase-io-out.png) # Creating `MSnSet` from text spread sheets This section describes the generation of `MSnSet` objects using data available in a text-based spreadsheet. This entry point into R and `r Biocpkg("MSnbase")` allows to import data processed by any of the third party mass-spectrometry processing software available and proceed with data exploration, normalisation and statistical analysis using functions available in \R and the numerous Bioconductor packages. ## A complete work flow The following section describes a work flow that uses three input files to create the `MSnSet`. These files respectively describe the quantitative expression data, the sample meta-data and the feature meta-data. It is taken from the `r Biocpkg("pRoloc")` tutorial and uses example files from the `r Biocpkg("pRolocdat")` package. We start by describing the `csv` to be used as input using the `read.csv` function. ```{r readCsvData0} ## The original data for replicate 1, available ## from the pRolocdata package f0 <- dir(system.file("extdata", package = "pRolocdata"), full.names = TRUE, pattern = "pr800866n_si_004-rep1.csv") csv <- read.csv(f0) ``` The three first lines of the original spreadsheet, containing the data for replicate one, are illustrated below (using the function `head`). It contains `r nrow(csv)` rows (proteins) and `r ncol(csv)` columns, including protein identifiers, database accession numbers, gene symbols, reporter ion quantitation values, information related to protein identification, ... ```{r showOrgCsv} head(csv, n=3) ``` Below read in turn the spread sheets that contain the quantitation data (`exprsFile.csv`), feature meta-data (`fdataFile.csv`) and sample meta-data (`pdataFile.csv`). ```{r readCsvData1} ## The quantitation data, from the original data f1 <- dir(system.file("extdata", package = "pRolocdata"), full.names = TRUE, pattern = "exprsFile.csv") exprsCsv <- read.csv(f1) ## Feature meta-data, from the original data f2 <- dir(system.file("extdata", package = "pRolocdata"), full.names = TRUE, pattern = "fdataFile.csv") fdataCsv <- read.csv(f2) ## Sample meta-data, a new file f3 <- dir(system.file("extdata", package = "pRolocdata"), full.names = TRUE, pattern = "pdataFile.csv") pdataCsv <- read.csv(f3) ``` `exprsFile.csv` contains the quantitation (expression) data for the `r nrow(exprsCsv)` proteins and 4 reporter tags. ```{r showExprsFile} head(exprsCsv, n = 3) ``` `fdataFile.csv` contains meta-data for the `r nrow(fdataCsv)` features (here proteins). ```{r showFdFile} head(fdataCsv, n = 3) ``` `pdataFile.csv` contains samples (here fractions) meta-data. This simple file has been created manually. ```{r showPdFile} pdataCsv ``` The self-contained `MSnSet` can now easily be generated using the `readMSnSet` constructor, providing the respective `csv` file names shown above and specifying that the data is comma-separated (with `sep = ","`). Below, we call that object `res` and display its content. ```{r makeMSnSet} library("MSnbase") res <- readMSnSet(exprsFile = f1, featureDataFile = f2, phenoDataFile = f3, sep = ",") res ``` ### The `MSnSet` class Although there are additional specific sub-containers for additional meta-data (for instance to make the object MIAPE compliant), the feature (the sub-container, or slot `featureData`) and sample (the `phenoData` slot) are the most important ones. They need to meet the following validity requirements (see figure below): - the number of row in the expression/quantitation data and feature data must be equal and the row names must match exactly, and - the number of columns in the expression/quantitation data and number of row in the sample meta-data must be equal and the column/row names must match exactly. A detailed description of the `MSnSet` class is available by typing `?MSnSet` in the R console. ![Dimension requirements for the respective expression, feature and sample meta-data slots.](./Figures/msnset.png) The individual parts of this data object can be accessed with their respective accessor methods: - the quantitation data can be retrieved with `exprs(res)`, - the feature meta-data with `fData(res)` and - the sample meta-data with `pData(res)`. ## A shorter work flow The `readMSnSet2` function provides a simplified import workforce. It takes a single spreadsheet as input (default is `csv`) and extract the columns identified by `ecol` to create the expression data, while the others are used as feature meta-data. `ecol` can be a `character` with the respective column labels or a numeric with their indices. In the former case, it is important to make sure that the names match exactly. Special characters like `'-'` or `'('` will be transformed by R into `'.'` when the `csv` file is read in. Optionally, one can also specify a column to be used as feature names. Note that these must be unique to guarantee the final object validity. ```{r readMSnSet2} ecol <- paste("area", 114:117, sep = ".") fname <- "Protein.ID" eset <- readMSnSet2(f0, ecol, fname) eset ``` The `ecol` columns can also be queried interactively from R using the `getEcols` and `grepEcols` function. The former return a character with all column names, given a splitting character, i.e. the separation value of the spreadsheet (typically `","` for `csv`, `"\t"` for `tsv`, ...). The latter can be used to grep a pattern of interest to obtain the relevant column indices. ```{r ecols} getEcols(f0, ",") grepEcols(f0, "area", ",") e <- grepEcols(f0, "area", ",") readMSnSet2(f0, e) ``` The `phenoData` slot can now be updated accordingly using the replacement functions `phenoData<-` or `pData<-` (see `?MSnSet` for details). # Session information ```{r} sessionInfo() ``` # References {-}