---
title: "Human Protein Atlas in R"
author:
- name: Laurent Gatto
  affiliation: de Duve Institute, UCLouvain, Belgium
package: hpar
abstract: >
  The Human Protein Atlas (HPA) is a systematic study oh the human
  proteome using antibody-based proteomics. Multiple tissues and cell
  lines are systematically assayed affinity-purified antibodies and
  confocal microscopy. The *hpar* package is an R interface to the HPA
  project. It distributes three data sets, provides functionality to
  query these and to access detailed information pages, including
  confocal microscopy images available on the HPA web page.
bibliography: hpar.bib
output:
  BiocStyle::html_document:
   toc_float: true
vignette: >
  %\VignetteIndexEntry{Human Protein Atlas in R}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteKeywords{Infrastructure, Bioinformatics, Proteomics}
  %\VignetteEncoding{UTF-8}
---


```{r env, echo=FALSE}
suppressPackageStartupMessages(library("BiocStyle"))
suppressPackageStartupMessages(library("org.Hs.eg.db"))
suppressPackageStartupMessages(library("GO.db"))
```


# Introduction

## The HPA project

From the [Human Protein Atlas](http://www.proteinatlas.org/)
[@Uhlen2005; @Uhlen2010] site:


> The Swedish Human Protein Atlas project, funded by the Knut and
> Alice Wallenberg Foundation, has been set up to allow for a
> systematic exploration of the human proteome using Antibody-Based
> Proteomics. This is accomplished by combining high-throughput
> generation of affinity-purified antibodies with protein profiling in
> a multitude of tissues and cells assembled in tissue
> microarrays. Confocal microscopy analysis using human cell lines is
> performed for more detailed protein localisation. The program hosts
> the Human Protein Atlas portal with expression profiles of human
> proteins in tissues and cells.

The `r Biocpkg("hpar")` package provides access to HPA data from the R
interface. It also distributes the following data sets:


- `hpaNormalTissue` *Normal tissue data*: Expression profiles
  for proteins in human tissues based on immunohistochemisty using tissue
  micro arrays. The tab-separated file includes Ensembl gene identifier 
  ("Gene"), tissue name ("Tissue"), annotated cell type ("Cell type"), 
  expression value ("Level"), and the gene reliability of the expression 
  value ("Reliability").}

- `hpaNormalTissue16.1`: Same as above, for version 16.1.

- `hpaCancer` *Pathology data*: Staining profiles for 
  proteins in human tumor tissue based on immunohistochemisty using 
  tissue micro arrays and log-rank P value for Kaplan-Meier analysis 
  of correlation between mRNA expression level and patient survival. 
  The tab-separated file includes Ensembl gene identifier ("Gene"), 
  gene name ("Gene name"), tumor name ("Cancer"), the number of 
  patients annotated for different staining levels ("High", "Medium",
  "Low" & "Not detected") and log-rank p values for patient survival 
  and mRNA correlation ("prognostic - favourable", 
  "unprognostic - favourable", "prognostic - unfavourable", 
  "unprognostic - unfavourable"). }

- `hpaCancer16.1`: Same as above, for version 16.1.

- `rnaGeneTissue` *RNA HPA tissue gene data*: Transcript 
  expression levels summarized per gene in 37 tissues based on 
  RNA-seq. The tab-separated file includes Ensembl gene identifier 
  ("Gene"), analysed sample ("Tissue"), transcripts per million 
  ("TPM"), protein-transcripts per million ("pTPM") and 
  normalized expression ("NX"). }

- `rnaGeneCellLine` *RNA HPA cell line gene data*: Transcript 
  expression levels summarized per gene in 64 cell lines.
  The tab-separated file includes Ensembl gene identifier 
  ("Gene"), analysed sample ("Cell line"), transcripts per
  million ("TPM"), protein-coding transcripts per million ("pTPM") 
  and normalized expression ("NX"). }
  
- `rnaGeneCellLine16.1`: Same as above, for version 16.1.

- `hpaSubcellularLoc` *Subcellular location data*: Subcellular 
  location of proteins based on immunofluorescently stained cells. 
  The tab-separated file includes the following columns: Ensembl 
  gene identifier ("Gene"), name of gene ("Gene name"), gene reliability 
  score ("Reliability"), enhanced locations ("Enhanced"), supported 
  locations ("Supported"), Approved locations ("Approved"), uncertain 
  locations ("Uncertain"), locations with single-cell variation in 
  intensity ("Single-cell variation intensity"), locations with spatial 
  single-cell variation ("Single-cell variation spatial"), locations 
  with observed cell cycle dependency (type can be one or more of 
  biological definition, custom data or correlation) ("Cell cycle 
  dependency"), Gene Ontology Cellular Component term identifier 
  ("GO id").}

- `hpaSubcellularLoc14` and `*16.1`: Same as above, for versions 14
  and 16.1.
  
- `hpaSecretome` *Secretome data*: The human secretome is here 
  defined as all Ensembl genes with at least one predicted 
  secreted transcript according to HPA predictions. The complete 
  information about the HPA Secretomedata is given on 
  \url{https://www.proteinatlas.org/humanproteome/blood/secretome}. 
  This dataset has 230 columns and includes the Ensembl gene identifier 
  ("Gene"). Information about the additionnal variables can be found 
  \href{https://www.proteinatlas.org/search}{here} by clicking on 
  \emph{Show/hide columns}.

## HPA data usage policy

The use of data and images from the HPA in publications and
presentations is permitted provided that the following conditions are
met:

- The publication and/or presentation are solely for informational and
  non-commercial purposes.
- The source of the data and/or image is referred to the HPA
  site^[www.proteinatlas.org] and/or one or more of our publications
  are cited.


## Installation

 `r Biocpkg("hpar")` is available through the Bioconductor
project. Details about the package and the installation procedure can
be found on its
[landing page](http://bioconductor.org/packages/devel/bioc/html/hpar.html). To
install using the dedicated Bioconductor infrastructure, run :

```{r install, eval = FALSE}
## install BiocManager only one
install.packages("BiocManager")
## install hpar
BiocManager::install("hpar")
```

After installation, `r Biocpkg("hpar")` will have to be explicitly
loaded with

```{r load}
library("hpar")
```

so that all the package's functionality and data is available to the
user.


# The  `r Biocpkg("hpar")`  package

## Data sets

The data sets described above can be loaded with the `data` function,
as illustrated below for `hpaNormalTissue` below. Each data set is a
`data.frame` and can be easily manipulated using standard R
functionality. The code chunk below illustrates some of its
properties.


```{r hpaData}
data(hpaNormalTissue)
dim(hpaNormalTissue)
names(hpaNormalTissue)
## Number of genes
length(unique(hpaNormalTissue$Gene))
## Number of cell types
length(unique(hpaNormalTissue$Cell.type))
head(levels(hpaNormalTissue$Cell.type))
## Number of tissues
length(unique(hpaNormalTissue$Tissue))
head(levels(hpaNormalTissue$Tissue))
```


## HPA interface


The package provides a interface to the HPA data. The `getHpa` allows
to query the data sets described above. It takes three arguments,
`id`, `hpadata` and `type`, that control the query, what data set to
interrogate and how to report results respectively. The HPA data uses
Ensembl gene identifiers and `id` must be a valid
identifier. `hpadata` must be one of available dataset. `type` can be
either `"data"` or `"details"`. The former is the default and returns
a `data.frame` containing the information relevant to `id`. It is also
possible to obtained detailed information, (including cell images) as
web pages, directly from the HPA web page, using `"details"`.

We will illustrate this functionality with using the TSPAN6
(tetraspanin 6) gene (ENSG00000000003) as example.

```{r getHpa}
id <- "ENSG00000000003"
head(getHpa(id, hpadata = "hpaNormalTissue"))
getHpa(id, hpadata = "hpaSubcellularLoc")
head(getHpa(id, hpadata = "rnaGeneCellLine"))
```

If we ask for `"detail"`, a browser page pointing to the
relevant page is open (see figure below)

```{r getHpa2, eval=FALSE}
getHpa(id, type = "details")
```

![The HPA web page for the tetraspanin 6 gene (ENSG00000000003).](./hpa.png)

If a user is interested specifically in one data set, it is possible
to set `hpadata` globally and omit it in `getHpa`. This is done by
setting the `hpar` options `hpardata` with the `setHparOptions`
function. The current default data set can be tested with
`getHparOptions`.

```{r opts}
getHparOptions()
setHparOptions(hpadata = "hpaSubcellularLoc")
getHparOptions()
getHpa(id)
```

## HPA release information

Information about the HPA release used to build the installed


`r Biocpkg("hpar")` package can be accessed with `getHpaVersion`,
`getHpaDate` and `getHpaEnsembl`. Full release details can be found on
the [HPA release history](http://www.proteinatlas.org/about/releases)
page.

```{r rel}
getHpaVersion()
getHpaDate()
getHpaEnsembl()
```

# A small use case

Let's compare the subcellular localisation annotation obtained from
the HPA subcellular location data set and the information available in
the Bioconductor annotation packages.

```{r uc-hpar}
id <- "ENSG00000001460"
getHpa(id, "hpaSubcellularLoc")
```

Below, we first extract all cellular component GO terms available for
`id` from the `r Biocannopkg("org.Hs.eg.db")` human annotation and
then retrieve their term definitions using the `r Biocannopkg("GO.db")`
database.

```{r uc-db}
library("org.Hs.eg.db")
library("GO.db")
ans <- select(org.Hs.eg.db, keys = id,
              columns = c("ENSEMBL", "GO", "ONTOLOGY"),
              keytype = "ENSEMBL")
ans <- ans[ans$ONTOLOGY == "CC", ]
ans
sapply(as.list(GOTERM[ans$GO]), slot, "Term")
```


# Session information {-}

```{r sessioninfo, echo = FALSE}
sessionInfo()
```