---
title: "An R interface to the ProteomeXchange repository"
author:
- name: Laurent Gatto
package: rpx
output:
  BiocStyle::html_document:
    toc_float: true
vignette: >
  %\VignetteEngine{knitr}
  %\VignetteIndexEntry{An R interface to the ProteomeXchange repository}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteKeywords{Infrastructure, Bioinformatics, Proteomics, Mass spectrometry}
  %\VignetteEncoding{UTF-8}
---

```{r env, echo = FALSE}
suppressPackageStartupMessages(library("BiocStyle"))
suppressPackageStartupMessages(library("Biostrings"))
```

# Introduction

The goal of the `r Biocpkg("rpx")` package is to provide programmatic
access to proteomics data from R, in particular to the ProteomeXchange
(PX) central repository (see http://www.proteomexchange.org/ and
http://central.proteomexchange.org/).

> Vizcaino J.A. et al. *ProteomeXchange: globally co-ordinated
> proteomics data submission and dissemination*, Nature Biotechnology
> 2014, 32, 223 -- 226, doi:10.1038/nbt.2839.

Additional repositories are likely to be added in the future.

# The `r Biocpkg("rpx")` package

## `PXDataset` objects

The central object that handles data access is the `PXDataset`
class. An instance can be created by passing a valid PX experiment
identifier to the `PXDataset` constructor.

```{r pxdata}
library("rpx")
id <- "PXD000001"
px <- PXDataset(id)
px
```

## Data and meta-data

Several attributes can be extracted from a `PXDataset` instance, as
described below.

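As a quick overview before going through each accessor in turn, the
sketch below gathers the main attributes into a single named list. It
only uses accessors documented in this section; `px_summary()` is a
hypothetical helper written for illustration and is not part of
`rpx`. Like the other chunks in this vignette, it needs network access
(or a populated cache) to run.

```{r pxsummary}
library("rpx")
px <- PXDataset("PXD000001")

## Hypothetical helper (not part of rpx): collect the main attributes
## of a PXDataset object into a named list.
px_summary <- function(object) {
    list(id = pxid(object),               ## experiment identifier
         url = pxurl(object),             ## file transfer URL
         taxonomy = pxtax(object),        ## species
         reference = pxref(object),       ## bibliographic reference(s)
         nfiles = length(pxfiles(object))) ## number of available files
}

str(px_summary(px))
```

Each element of the list is detailed individually in the remainder of
this section.
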
The experiment identifier that was originally used to create the
`PXDataset` instance can be extracted with the `pxid()` method:

```{r pxid}
pxid(px)
```

The file transfer URL where the data files can be accessed can be
queried with the `pxurl()` method:

```{r purl}
pxurl(px)
```

The species from which the data were generated can be obtained by
calling the `pxtax()` function:

```{r pxtax}
pxtax(px)
```

Relevant bibliographic references can be queried with the `pxref()`
method:

```{r pxref}
strwrap(pxref(px))
```

All files available for the PX experiment can be obtained with the
`pxfiles()` method:

```{r pxfiles}
pxfiles(px)
```

The complete or partial data set can be downloaded with the `pxget()`
function. The function takes an instance of class `PXDataset` as its
first, mandatory argument. The next argument, `list`, specifies which
files to download. If it is missing, a menu is printed and the user can
select a file. If it is set to `"all"`, all files of the experiment are
downloaded. Alternatively, numerics or logicals can be used to subset
the files to be downloaded, based on the `pxfiles(.)` output.

```{r pxget}
f <- pxget(px, "PXD000001_mztab.txt")
f
```

The `rpx` package makes use of the `r Biocpkg("BiocFileCache")`
package to avoid repeatedly downloading files. Once downloaded, files
are cached, i.e. stored centrally in the package's cache directory. The
next time the `pxget()` function attempts to get that file, it is
retrieved directly from the cache instead of being downloaded again.

Finally, a list of recent PX additions and updates can be obtained
using the `pxannounced()` function:

```{r pxan}
pxannounced()
```

## A simple use-case

Below, we download the fasta file from the PXD000001 dataset and load
it with the Biostrings package.

```{r more, warning=FALSE}
fas <- grep("fasta", pxfiles(px), value = TRUE)
fas
f <- pxget(px, fas)
f ## files available in the rpx cache
```

```{r example1}
library("Biostrings")
readAAStringSet(f)
```

# Questions and help

Either post questions on the
[Bioconductor support forum](https://support.bioconductor.org/) or
open a GitHub [issue](https://github.com/lgatto/rpx/issues).

# Session information

```{r si}
sessionInfo()
```