---
title: "An R interface to the ProteomeXchange repository"
author:
- name: Laurent Gatto
package: rpx
output:
  BiocStyle::html_document:
    toc_float: true
vignette: >
  %\VignetteIndexEntry{An R interface to the ProteomeXchange repository}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteKeywords{Infrastructure, Bioinformatics, Proteomics, Mass spectrometry}
  %\VignetteEncoding{UTF-8}
---

```{r env, echo = FALSE}
suppressPackageStartupMessages(library("BiocStyle"))
suppressPackageStartupMessages(library("Biostrings"))
```


# Introduction

The goal of the `r Biocpkg("rpx")` package is to provide programmatic
access to proteomics data from R, in particular to the ProteomeXchange
(PX) central repository (see http://www.proteomexchange.org/ and
http://central.proteomexchange.org/).

> Vizcaino J.A. et al. *ProteomeXchange: globally co-ordinated
> proteomics data submission and dissemination*, Nature Biotechnology
> 2014, 32, 223 -- 226, doi:10.1038/nbt.2839.

Additional repositories are likely to be added in the future.


# The `r Biocpkg("rpx")` package

## `PXDataset` objects

The central object that handles data access is the `PXDataset`
class. An instance is created by passing a valid PX experiment
identifier to the `PXDataset` constructor.

```{r pxdata}
library("rpx")
id <- "PXD000001"
px <- PXDataset(id)
px
```
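
Because the constructor expects a valid identifier, it can be
convenient to check the shape of an identifier before calling it. The
sketch below assumes that PX identifiers consist of the prefix `PXD`
(or `RPXD` for reprocessed datasets) followed by six digits;
`is_px_id()` is a hypothetical helper, not part of `rpx`, and a
well-formed identifier may still not exist in the repository.

```{r px-id-check}
## Hypothetical helper (not part of rpx): checks only the shape of an
## identifier, assuming "PXD" or "RPXD" followed by six digits.
is_px_id <- function(x) {
    grepl("^R?PXD[0-9]{6}$", x)
}

## One well-formed identifier and two malformed ones
is_px_id(c("PXD000001", "PXD1", "pxd000001"))
```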

## Data and meta-data

Several attributes can be extracted from a `PXDataset` instance, as
described below.

The experiment identifier that was originally used to create the
`PXDataset` instance can be extracted with the `pxid()` method:

```{r pxid}
pxid(px)
```

The file transfer URL from which the data files can be accessed can be
queried with the `pxurl()` method:

```{r purl}
pxurl(px)
```

The species from which the data were generated can be obtained by
calling the `pxtax()` function:

```{r pxtax}
pxtax(px)
```

Relevant bibliographic references can be queried with the
`pxref` method:

```{r pxref}
strwrap(pxref(px))
```

All files available for the PX experiment can be obtained with the
`pxfiles` method:

```{r pxfiles}
pxfiles(px)
```
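
The accessors above can also be combined to keep a compact record of a
dataset. This is only a sketch: `px_meta` is a name chosen here for
illustration, and the chunk is not evaluated because it queries the
remote repository.

```{r px-summary, eval = FALSE}
library("rpx")
px <- PXDataset("PXD000001")

## Gather the accessor results in a plain named list
px_meta <- list(id = pxid(px),
                url = pxurl(px),
                taxonomy = pxtax(px),
                reference = pxref(px),
                n_files = length(pxfiles(px)))
str(px_meta)
```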

The complete or partial data set can be downloaded with the `pxget()`
function. The function takes an instance of class `PXDataset` as its
first, mandatory argument.

The next argument, `list`, specifies which files to download. If
missing, a menu is printed and the user can select a file. If set to
`"all"`, all files of the experiment are downloaded. Alternatively,
numerics or logicals can be used to select the files to be downloaded
by subsetting the `pxfiles(.)` output.

```{r pxget}
f <- pxget(px, "PXD000001_mztab.txt")
f
```
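
The index-based forms of the `list` argument described above can be
sketched as follows. The indices are derived from the `pxfiles()`
output rather than hard-coded; the variable names are illustrative,
and the chunk is not evaluated because it requires network access.

```{r pxget-list, eval = FALSE}
library("rpx")
px <- PXDataset("PXD000001")
files <- pxfiles(px)

## Logical index: TRUE for file names that contain "mztab"
is_mztab <- grepl("mztab", files)
f_lgl <- pxget(px, is_mztab)

## Numeric index: positions of the same files in pxfiles(px)
f_num <- pxget(px, which(is_mztab))

## "all" requests every file of the experiment (potentially large)
## f_all <- pxget(px, "all")
```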

The `rpx` package uses the `r Biocpkg("BiocFileCache")` package to
avoid repeatedly downloading files. Once downloaded, files are
cached, i.e. stored centrally in the package's cache directory. The
next time the `pxget()` function is asked for such a file, it is
retrieved directly from the cache instead of being downloaded again.
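
One way to observe the caching behaviour is to request the same file
twice and compare the results. This is a sketch rather than a
benchmark: timings depend on the network and on whether the file is
already cached, and the chunk is not evaluated here.

```{r pxget-cache, eval = FALSE}
library("rpx")
px <- PXDataset("PXD000001")

## First call: downloads the file unless it is already cached
t1 <- system.time(f1 <- pxget(px, "PXD000001_mztab.txt"))
## Second call: expected to be served from the cache
t2 <- system.time(f2 <- pxget(px, "PXD000001_mztab.txt"))

## Both calls should point to the same cached file
identical(f1, f2)
## Elapsed time (in seconds) of each call
c(first = t1[["elapsed"]], second = t2[["elapsed"]])
```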

Finally, a list of recent PX additions and updates can be obtained
using the `pxannounced()` function:

```{r pxan}
pxannounced()
```

## A simple use-case

Below, we download the fasta file from the PXD000001 dataset and load
it with the Biostrings package.

```{r more, warning=FALSE}
fas <- grep("fasta", pxfiles(px), value = TRUE)
fas
f <- pxget(px, fas)
f ## files available in the rpx cache
```

```{r example1}
library("Biostrings")
readAAStringSet(f)
```
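
Once loaded, the sequences can be summarised with standard Biostrings
accessors. The sketch below uses two made-up toy sequences written to
a temporary file so that it runs without network access; with real
data, `tmp` would be replaced by the cached fasta path `f`.

```{r fasta-summary, eval = FALSE}
library("Biostrings")

## Toy sequences, made up for illustration only
toy <- AAStringSet(c(protA = "MKVLAA", protB = "MSTNPKPQRK"))
tmp <- tempfile(fileext = ".fasta")
writeXStringSet(toy, tmp)

## Read the file back and summarise its content
aa <- readAAStringSet(tmp)
length(aa)   # number of sequences
width(aa)    # length of each sequence
names(aa)    # fasta headers
```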

# Questions and help

Either post questions on the [Bioconductor support
forum](https://support.bioconductor.org/) or open a GitHub
[issue](https://github.com/lgatto/rpx/issues).

# Session information

```{r si}
sessionInfo()
```