---
title: "An R interface to the ProteomeXchange repository"
author:
- name: Laurent Gatto
  affiliation: Computational Proteomics Unit, Cambridge, UK
package: rpx
output:
  BiocStyle::html_document2:
    toc_float: true
vignette: >
  %\VignetteEngine{knitr}
  %\VignetteIndexEntry{An R interface to the ProteomeXchange repository}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteKeywords{Infrastructure, Bioinformatics, Proteomics, Mass spectrometry}
  %\VignetteEncoding{UTF-8}
---

```{r env, echo = FALSE}
suppressPackageStartupMessages(library("BiocStyle"))
suppressPackageStartupMessages(library("Biostrings"))
suppressPackageStartupMessages(library("MSnbase"))
```

# Introduction

The goal of the `r Biocpkg("rpx")` package is to provide programmatic
access to proteomics data from R, in particular to the ProteomeXchange
(PX) central repository (see http://www.proteomexchange.org/ and
http://central.proteomexchange.org/).

> Vizcaino J.A. et al. *ProteomeXchange: globally co-ordinated
> proteomics data submission and dissemination*, Nature Biotechnology
> 2014, 32, 223 -- 226, doi:10.1038/nbt.2839.

Additional repositories are likely to be added in the future.

# The `r Biocpkg("rpx")` package

## `PXDataset` objects

The central object that handles data access is the `PXDataset`
class. An instance is created by passing a valid PX experiment
identifier to the `PXDataset` constructor.

```{r pxdata}
library("rpx")
id <- "PXD000001"
px <- PXDataset(id)
px
```

## Data and meta-data

Several attributes can be extracted from a `PXDataset` instance, as
described below.

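
Because every accessor depends on a dataset built from a PX experiment
identifier, it can be useful to check the shape of an identifier before
contacting the repository. The following is a minimal sketch and not
part of `rpx`: the helper name `is_px_id` is hypothetical, and the
pattern assumes identifiers of the form `PXD` followed by six digits,
as in `PXD000001`. It only checks the format, not whether the
experiment exists.

```r
## Hypothetical helper, not part of rpx: checks that a value looks like
## a PX experiment identifier. The pattern (PXD followed by six digits)
## is an assumption based on identifiers such as "PXD000001".
is_px_id <- function(id) {
    is.character(id) && length(id) == 1L && !is.na(id) &&
        grepl("^PXD[0-9]{6}$", id)
}

is_px_id("PXD000001")  ## well-formed identifier
is_px_id("pxd1")       ## wrong case and too few digits
```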
The experiment identifier that was originally used to create the
`PXDataset` instance can be extracted with the `pxid` method:

```{r pxid}
pxid(px)
```

The file transfer URL where the data files can be accessed can be
queried with the `pxurl` method:

```{r purl}
pxurl(px)
```

The species from which the data were generated can be obtained by
calling the `pxtax` function:

```{r pxtax}
pxtax(px)
```

Relevant bibliographic references can be queried with the `pxref`
method:

```{r pxref}
strwrap(pxref(px))
```

All files available for the PX experiment can be obtained with the
`pxfiles` method:

```{r pxfiles}
pxfiles(px)
```

The complete or partial data set can be downloaded with the `pxget`
function. The function takes an instance of class `PXDataset` as its
first, mandatory argument. The next argument, `list`, specifies which
files to download. If missing, a menu is printed and the user can
select a file. If set to `"all"`, all files of the experiment are
downloaded to the working directory. Alternatively, numerics or
logicals can be used to subset the files to be downloaded, based on
the `pxfiles(.)` output. The last argument, `force`, can be set to
`TRUE` to force the download of files that already exist in the
working directory.

```{r pxget}
pxget(px, "erwinia_carotovora.fasta")
dir(pattern = "fasta")
```

By default, `pxget` will not download and overwrite a file that is
already available unless `force = TRUE`.

```{r pxget2}
(i <- grep("fasta", pxfiles(px)))
pxget(px, i) ## same as above
```

Finally, a list of recent PX additions and updates can be obtained
using the `pxannounced()` function:

```{r pxan}
pxannounced()
```

## A simple use-case

Below, we show how to automate the extraction of files of interest
(fasta and mzTab files), download them and read them using appropriate
Bioconductor infrastructure.

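
The file selection in this use-case relies on ordinary base R pattern
matching, so it can be tried offline first. The following is a minimal
sketch, assuming a made-up vector of file names that stands in for the
output of `pxfiles(px)`; none of these names are taken from the
repository. It also shows that a selection by name, by numeric index
and by logical vector pick out the same file, which are the three
forms the `list` argument of `pxget` is described as accepting.

```r
## Hypothetical file names standing in for pxfiles(px); these are
## illustrative only and do not reflect real repository content.
files <- c("README.txt",
           "F012345.dat-mztab.txt",
           "example_organism.fasta",
           "raw_run_01.raw")

## Select files by name pattern, returning the matching names
mzt <- grep("F0.+mztab", files, value = TRUE)
fas <- grep("fasta", files, value = TRUE)

## Equivalent selections as a numeric index and as a logical vector
idx <- grep("fasta", files)
lgl <- grepl("fasta", files)

## All three forms identify the same file
stopifnot(identical(files[idx], fas),
          identical(files[lgl], fas))

c(mzt, fas)
```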
(Note that we read version 0.9 of the mzTab format below. For recent
data, the `version` argument would be omitted.)

```{r more, warning=FALSE}
## Identify the mzTab and fasta files among the experiment's files
(mzt <- grep("F0.+mztab", pxfiles(px), value = TRUE))
(fas <- grep("fasta", pxfiles(px), value = TRUE))
## Download both files
pxget(px, c(mzt, fas))
## Read the protein sequences
library("Biostrings")
readAAStringSet(fas)
## Read the peptide-level mzTab data
library("MSnbase")
(x <- readMzTabData(mzt, "PEP", version = "0.9"))
head(exprs(x))
head(fData(x)[, 1:2])
```

# Questions and help

Either post questions on the
[Bioconductor support forum](https://support.bioconductor.org/) or
open a GitHub [issue](https://github.com/lgatto/rpx/issues).

# Session information

```{r si}
sessionInfo()
```

```{r clean, echo = FALSE}
unlink("erwinia_carotovora.fasta")
unlink("F063721.dat-mztab.txt")
```