---
title: "An R interface to the ProteomeXchange repository"
author: 
- name: Laurent Gatto
  affiliation: Computational Proteomics Unit, Cambridge, UK
package: rpx
output:
  BiocStyle::html_document2:
    toc_float: true
vignette: >
  %\VignetteIndexEntry{An R interface to the ProteomeXchange repository}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteKeywords{Infrastructure, Bioinformatics, Proteomics, Mass spectrometry}
  %\VignetteEncoding{UTF-8}
---

```{r env, echo = FALSE}
suppressPackageStartupMessages(library("BiocStyle"))
suppressPackageStartupMessages(library("Biostrings"))
suppressPackageStartupMessages(library("MSnbase"))
```


# Introduction

The goal of the `r Biocpkg("rpx")` package is to provide programmatic
access to proteomics data from R, in particular to the ProteomeXchange
(PX) central repository (see http://www.proteomexchange.org/ and
http://central.proteomexchange.org/).

> Vizcaino J.A. et al. *ProteomeXchange: globally co-ordinated
> proteomics data submission and dissemination*, Nature Biotechnology
> 2014, 32, 223 -- 226, doi:10.1038/nbt.2839.

Additional repositories are likely to be added in the future.


# The `r Biocpkg("rpx")` package

## `PXDataset` objects

The central object that handles data access is the `PXDataset`
class. Such an instance can be generated by passing a valid PX
experiment identifier to the `PXDataset` constructor.

```{r pxdata}
library("rpx")
id <- "PXD000001"
px <- PXDataset(id)
px
```
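Since the constructor expects a valid identifier, it can be useful to screen user-supplied identifiers before querying the repository. The chunk below is a minimal sketch, not part of `rpx`: it assumes that PX identifiers consist of the prefix `PXD` followed by six digits, as in `PXD000001`, and the vector `ids` is made up for illustration.

```{r pxidcheck}
## Hypothetical candidate identifiers (only the first one is well formed)
ids <- c("PXD000001", "PXD1", "pxd000001")
## Assumed format: "PXD" followed by exactly six digits
ok <- grepl("^PXD[0-9]{6}$", ids)
## Keep only the identifiers that match the assumed format
ids[ok]
```

A check of this kind only tests the format of the identifier; whether the experiment actually exists can only be established by querying the repository.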

## Data and meta-data

Several attributes can be extracted from a `PXDataset` instance, as
described below.


The experiment identifier that was originally used to create the
`PXDataset` instance can be extracted with the `pxid` method:

```{r pxid}
pxid(px)
```

The file transfer URL from which the data files can be accessed can be
queried with the `pxurl` method:

```{r purl}
pxurl(px)
```

The species from which the data were generated can be obtained by
calling the `pxtax` function:

```{r pxtax}
pxtax(px)
```


Relevant bibliographic references can be queried with the
`pxref` method:

```{r pxref}
strwrap(pxref(px))
```

All files available for the PX experiment can be obtained with the
`pxfiles` method:

```{r pxfiles}
pxfiles(px)
```


The complete or partial data set can be downloaded with the `pxget`
function. The function takes an instance of class `PXDataset` as its
first, mandatory argument.

The next argument, `list`, specifies which files to download. If it is
missing, a menu is printed and the user can select a file. If set to
`"all"`, all files of the experiment are downloaded to the working
directory. Alternatively, file names, numerics or logicals can be used
to select the files to be downloaded, based on the `pxfiles(.)`
output.
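
The different ways of selecting files can be sketched without any download. The chunk below uses a small, made-up character vector `fls` in place of the real `pxfiles(px)` output (the first two names appear elsewhere in this vignette, `README.txt` is purely hypothetical) and shows that a numeric and a logical index pick out the same file.

```{r pxlistarg}
## Illustrative stand-in for the output of pxfiles(px)
fls <- c("F063721.dat-mztab.txt",
         "erwinia_carotovora.fasta",
         "README.txt")
## Numeric selection: position(s) of the fasta file(s)
i <- grep("fasta", fls)
## Logical selection: TRUE for the file(s) of interest
l <- grepl("fasta", fls)
## Both indices select the same file name
sel_num <- fls[i]
sel_lgl <- fls[l]
identical(sel_num, sel_lgl)
## With a real PXDataset 'px', any of these could be passed on, e.g.
## pxget(px, i), pxget(px, l) or pxget(px, "erwinia_carotovora.fasta")
```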

```{r pxget}
pxget(px, "erwinia_carotovora.fasta")
dir(pattern = "fasta")
```

By default, `pxget` will not download and overwrite a file that is
already available locally. The last argument of `pxget`, `force`, can
be set to `TRUE` to force the download of files that already exist in
the working directory.

```{r pxget2}
(i <- grep("fasta", pxfiles(px)))
pxget(px, i) ## same as above
```
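
To see in advance which files a default call would need to fetch, one can compare the files of interest with what is already on disk. The chunk below is only a sketch of that idea in base R and says nothing about how `pxget` is implemented: it simulates a working directory in a temporary folder (`wd`, a hypothetical location) that already holds one of the two files used in this vignette.

```{r pxexists}
## Files of interest (names taken from the examples in this vignette)
wanted <- c("erwinia_carotovora.fasta", "F063721.dat-mztab.txt")
## Simulate a working directory that already contains the fasta file
wd <- tempfile("pxdemo")
dir.create(wd)
invisible(file.create(file.path(wd, "erwinia_carotovora.fasta")))
## Files that are not yet available locally
missing_files <- wanted[!file.exists(file.path(wd, wanted))]
missing_files
## Remove the temporary directory
unlink(wd, recursive = TRUE)
```

In a real session, `file.path(wd, wanted)` would simply be replaced by the file names relative to the working directory.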

Finally, a list of recent PX additions and updates can be obtained
using the `pxannounced()` function:

```{r pxan}
pxannounced()
```

## A simple use-case

Below, we show how to automate the extraction of files of interest
(fasta and mzTab files), download them and read them using appropriate
Bioconductor infrastructure. (Note that we read version 0.9 of the
mzTab format below. For recent data, the `version` argument would be
omitted.)

```{r more, warning=FALSE}
(mzt <- grep("F0.+mztab", pxfiles(px), value = TRUE))
(fas <- grep("fasta", pxfiles(px), value = TRUE))
pxget(px, c(mzt, fas))

library("Biostrings")
readAAStringSet(fas)

library("MSnbase")
(x <- readMzTabData(mzt, "PEP", version = "0.9"))
head(exprs(x))
head(fData(x)[, 1:2])
```

# Questions and help

Either post questions on the
[Bioconductor support forum](https://support.bioconductor.org/) or
open a GitHub [issue](https://github.com/lgatto/rpx/issues).

# Session information

```{r si}
sessionInfo()
```

```{r clean, echo = FALSE}
unlink("erwinia_carotovora.fasta")
unlink("F063721.dat-mztab.txt")
```