---
title: "ExpressionAtlas package vignette"
author: "Maria Keays, Pedro Madrigal, Anil Thanki"
affiliation: "EMBL-EBI"
date: "`r Sys.Date()`"
output:
  html_document:
    toc: true
    toc_depth: 3
    number_sections: true
bibliography: references.bib
keywords: ["gene expression", "RNA-seq", "ExpressionAtlas", "Bioconductor"]
vignette: >
    %\VignetteEngine{ knitr::rmarkdown }
    %\VignetteBuilder{ knitr }
    %\VignetteIndexEntry{ ExpressionAtlas }
    %\VignetteEncoding{UTF-8}
---

# Expression Atlas resources at EMBL-EBI

The [EMBL-EBI](http://www.ebi.ac.uk) [Expression
Atlas](http://www.ebi.ac.uk/gxa) and [Single Cell Expression Atlas](http://www.ebi.ac.uk/gxa/sc) 
resources consist of carefully selected high-quality datasets
from [ArrayExpress](https://www.ebi.ac.uk/biostudies/arrayexpress) and 
other data sources that have been manually 
curated and re-analyzed via the Functional Genomics Team analysis pipelines [@George2023]. Since October 
2022, ArrayExpress is a collection of functional genomics data in BioStudies
(https://www.ebi.ac.uk/biostudies).

The Expression Atlas website allows users to search these datasets for genes and/or
experimental conditions, to discover which genes are expressed in which
tissues, cell types, developmental stages, and hundreds of other experimental
conditions.

The *ExpressionAtlas* R package allows you to search and download pre-packaged data from Expression Atlas inside an R session. Raw counts
are provided for RNA-seq datasets, while normalized intensities are available
for microarray experiments. Protocols describing how the data was generated are
contained within the downloaded R objects, with more detailed information
available on the [Expression Atlas website](http://www.ebi.ac.uk/gxa).
Sample annotations are also included in the R object.

Single-cell datasets are downloaded in annData format, and loaded into a 
SingleCellExperiment object for access and customised visualisation.

# Searching and downloading Expression Atlas data

The package communicates with the EBI RESTful Web Services API and Expression Atlas
API to enable efficient data searching and retrieval.

## Searching

You can search for experiments in Atlas using the `searchAtlasExperiments()`
function. This function returns a *DataFrame* (see
[S4Vectors](http://bioconductor.org/packages/release/bioc/html/S4Vectors.html))
containing the results of your search. The first argument to
`searchAtlasExperiments()` should be a character vector of sample properties,
e.g. biological sample attributes, experimental treatments nad/or species. You may also
optionally provide a secondary filter yo your query to limit your search to, as a second argument.

```{r setup, include=FALSE}
knitr::opts_chunk$set(message=FALSE)
knitr::opts_chunk$set(warning=FALSE)
```

```{r}
invisible(lapply(
  c("ExpressionAtlas", "SummarizedExperiment", "SingleCellExperiment", "ggplot2"),
  function(pkg) suppressMessages(library(pkg, character.only = TRUE))
))
```

```{r eval=FALSE}
atlas_res <- searchAtlasExperiments( query = "salt", secondaryFilter = "oryza" )
# Searching for experiments matching your query ...
# Query successful.
# Found 4 experiments matching your query.
```

```{r, echo=FALSE}
data( "atlasRes" )
```

We will proceed with a subset of three accessions:

```{r}
atlasRes
```

The *Accession* column contains the ArrayExpress accession of each dataset --
the unique identifier assigned to it. The species, experiment type (e.g.
microarray or RNA-seq), and title of each dataset are also listed.

## Detailed search through experiment metadata

```{r}
searchAtlasExperiments( query = "lung" , detailed = FALSE )
```

The `detailed` argument can be set to `TRUE` to return a more detailed search through Atlas experiment metadata information:

```{r}
searchAtlasExperiments( query = "lung" , detailed = TRUE )
```

It can work with a secondary filter as well:

```{r}
searchAtlasExperiments( query = "lung", secondaryFilter = "human", detailed = FALSE )

searchAtlasExperiments( query = "lung", secondaryFilter = "human", detailed = TRUE )
```

As we can see, a detailed search returns a *DataFrame* with more Atlas studies.

## Downloading the data

To download the data for any/all of the experiments in your results, you can
use the function `getAtlasData()`. This function accepts a vector of
ArrayExpress accessions. The data is downloaded into a *SimpleList* object (see package
[S4Vectors](http://bioconductor.org/packages/release/bioc/html/S4Vectors.html)), with one
entry per experiment, listed by accession.

For example, to download all the datasets in your results:

```{r eval=FALSE}
allExps <- getAtlasData( atlasRes$Accession )
# Downloading Expression Atlas experiment summary from:
#  ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/atlas/experiments/E-GEOD-11175/E-GEOD-11175-atlasExperimentSummary.Rdata
# Successfully downloaded experiment summary object for E-GEOD-11175
# Downloading Expression Atlas experiment summary from:
#  ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/atlas/experiments/E-MTAB-1625/E-MTAB-1625-atlasExperimentSummary.Rdata
# Successfully downloaded experiment summary object for E-MTAB-1625
# Downloading Expression Atlas experiment summary from:
#  ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/atlas/experiments/E-MTAB-1624/E-MTAB-1624-atlasExperimentSummary.Rdata
# Successfully downloaded experiment summary object for E-MTAB-1624
```

```{r, echo=FALSE}
data( "allExps" )
```

```{r}
allExps
```

To only download the RNA-seq experiment(s):

```{r eval=FALSE}
rnaseqExps <- getAtlasData( 
    atlasRes$Accession[ 
        grep( 
            "rna-seq", 
            atlasRes$Type, 
            ignore.case = TRUE 
        ) 
    ] 
)
# Downloading Expression Atlas experiment summary from:
#  ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/atlas/experiments/E-MTAB-1625/E-MTAB-1625-atlasExperimentSummary.Rdata
# Successfully downloaded experiment summary object for E-MTAB-1625
```

```{r, echo=FALSE}
data( "rnaseqExps" )
```

```{r}
rnaseqExps
```

To access an experiment summary, use the accession:

```{r}
mtab1624 <- allExps[[ "E-MTAB-1624" ]]
mtab1625 <- allExps[[ "E-MTAB-1625" ]]
```

Each dataset is also represented by a *SimpleList*, with one entry per platform
used in the experiment. For RNA-seq data there will only ever be one entry,
named `rnaseq`. For microarray data, there is one entry per array design used,
listed by ArrayExpress array design accession (see below).

### RNA-seq experiment summaries

Following on from above, `mtab1625` now contains a *SimpleList* object 
with a single entry named `rnaseq`. For RNA-seq experiments, this entry is a
*RangedSummarizedExperiment* object (see package
[SummarizedExperiment](http://bioconductor.org/packages/release/bioc/html/SummarizedExperiment.html)).

```{r}
sumexp <- mtab1625$rnaseq
sumexp
```

The matrix of raw counts for this experiment is stored in the *assays* slot:

```{r}
head( assays( sumexp )$counts )
```

The sample annotations can be found in the *colData* slot:

```{r}
colData( sumexp )
```

Information describing how the raw data files were processed to obtain the raw
counts matrix are found in the *metadata* slot:

```{r}
metadata( sumexp )
```

### Single-channel microarray experiments

Data from a single-channel microarray experiment, e.g.
[E-MTAB-1624](http://www.ebi.ac.uk/gxa/experiments/E-MTAB-1624), is
represented as one or more
*[ExpressionSet](https://www.bioconductor.org/packages/release/bioc/vignettes/Biobase/inst/doc/ExpressionSetIntroduction.pdf)*
object(s) in the SimpleList that is downloaded. *ExpressionSet* objects are
indexed by the ArrayExpress accession(s) of the microarray design(s) used in the
original experiment.

```{r}
names( mtab1624 )
affy126data <- mtab1624[[ "A-AFFY-126" ]]
affy126data
```

The matrix of normalized intensity values is in the *assayData* slot:

```{r}
head( exprs( affy126data ) )
```

The sample annotations are in the *phenoData* slot:

```{r}
pData( affy126data )
```

A brief outline of how the raw data was normalized is in the *experimentData* slot:

```{r}
preproc( experimentData( affy126data ) )
```

# Downloading and visualising a bulk Expression Atlas experiment

## Downloading a single Expression Atlas experiment summary

You can download data for a single Expression Atlas experiment using the
`getAtlasExperiment()` function:

```{r eval=FALSE}
mtab3007 <- getAtlasExperiment( "E-MTAB-3007" )
# Downloading Expression Atlas experiment summary from:
#  ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/atlas/experiments/E-MTAB-3007/E-MTAB-3007-atlasExperimentSummary.Rdata
# Successfully downloaded experiment summary object for E-MTAB-3007
```

## Downloading a single Expression Atlas experiment normalised data

You can download normalised data for a single Expression Atlas experiment using the
`getNormalisedAtlasExpression()` function:

```{r eval=FALSE}
mtab4045_tpm <- getNormalisedAtlasExpression( "E-MTAB-4045", "tpm" )
# Downloading XML file from FTP...
# E-MTAB-4045  is  rnaseq_mrna_baseline , will continue downloading data
# Downloading expression file from:
#  ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/atlas/experiments/E-MTAB-4045/E-MTAB-4045-tpms.tsv
# Downloading XML file from FTP...
```


```{r eval=FALSE}
mtab4045_cpm <- getNormalisedAtlasExpression( "E-MTAB-4045", "cpm" )
# Downloading XML file from FTP...
# E-MTAB-4045  is  rnaseq_mrna_baseline , will continue downloading data
# Downloading Expression Atlas experiment summary from:
#  ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/atlas/experiments/E-MTAB-4045/E-MTAB-4045-atlasExperimentSummary.Rdata
# Successfully downloaded experiment summary object for E-MTAB-4045
```

## Downloading a single Expression Atlas differential experiment analytics data

You can also download analytics data for a single Expression Atlas deifferential experiment using the
`getAnalyticsDifferentialAtlasExpression()` function:

```{r eval=FALSE}
mtab10104_dea <- getAnalyticsDifferentialAtlasExpression( "E-MTAB-10104" )
# Downloading expression file from:
# ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/atlas/experiments/E-MTAB-10104/E-MTAB-10104-analytics.tsv
```


## Creating heatmap from normalised data of a Expression Atlas experiment

You can create heatmap from downloaded normalised data for a single Expression Atlas experiment using the
`getNormalisedAtlasExpression()` function:

```{r eval=TRUE, fig.width=7, fig.height=8}
mtab4045_tpm <- getNormalisedAtlasExpression( "E-MTAB-4045", "tpm" ) 
mtab4045 <- heatmapAtlasExperiment( df = mtab4045_tpm, save_pdf = FALSE, show_plot = TRUE, palette = "viridis", top_n = 20, scaled = FALSE, show_heatmap_title = TRUE )

```

And now using CPM data, instead of TPM:
```{r eval=TRUE, fig.width=7, fig.height=8}
mtab4045_cpm <- getNormalisedAtlasExpression( "E-MTAB-4045", "cpm" ) 
mtab4045 <- heatmapAtlasExperiment( df = mtab4045_cpm, save_pdf = FALSE, show_plot = TRUE,  palette = "viridis", top_n = 20, scaled = TRUE, show_heatmap_title = TRUE )
```


## Creating a Volcano plot using normalised data from a differential Expression Atlas experiment

You can create volcano plot from downloaded normalised data for a single Expression Atlas experiment using the
`volcanoDifferentialAtlasExperiment()` function:

```{r eval=TRUE, fig.width=7, fig.height=5}
mtab10104_dea <- getAnalyticsDifferentialAtlasExpression( "E-MTAB-10104" )

head(mtab10104_dea)
mtab10104 <-  volcanoDifferentialAtlasExperiment( df = mtab10104_dea, save_pdf = FALSE, show_plot = TRUE, low_fc_colour = "Gray", high_fc_colour = "Blue", cutoff = 1, show_volcanoplot_title = TRUE )

```


# Searching, downloading and visualising Single-Cell Expression Atlas experiments

## Search Single Expression Atlas experiments

You can search for experiments in Atlas using the `searchSCAtlasExperiments()`
function.

```{r}
search_1 <- searchSCAtlasExperiments( query = "endoderm" )
print(search_1)

search_2 <- searchSCAtlasExperiments( query = "endoderm", secondaryFilter = "human" )
print(search_2)

search_3 <- searchSCAtlasExperiments( query = "endoderm", secondaryFilter = "mouse" )
print(search_3)

search_4 <- searchSCAtlasExperiments( query = "arabidopsis" )
print(search_4)

search_5 <- searchSCAtlasExperiments( query = "pluripotency" )
print(search_5)

```

## Downloading a study into a SingleCellExperiment object

You can download data for a Single-Cell Expression Atlas experiment using the 
`getAtlasSCExperiment()` function:

```{r eval=FALSE}
enad19 <- getAtlasSCExperiment( "E-ENAD-19" )
# returns a SingleCellExperiment object

```

## Visualise a dimensionality reduction plot of a SingleCellExperiment object

Let's use and example for pluripotency of human early embryos and embryonic stem cells by single cell RNA-seq.

```{r}
egeod36552 <- getAtlasSCExperiment( "E-GEOD-36552" )
# returns a SingleCellExperiment object

print("Reduced dimension names:")
print(reducedDimNames(egeod36552))

print("Column data names:")
print(colnames(colData(egeod36552)))

plotDimRedSCAtlasExperiment(egeod36552, dimRed = "X_pca", colorby = "time" )

plotDimRedSCAtlasExperiment(egeod36552, dimRed = "X_umap_neighbors_n_neighbors_20", colorby = "cell_type")

plotDimRedSCAtlasExperiment(egeod36552, dimRed = "X_umap_neighbors_n_neighbors_20", colorby = "louvain_resolution_2.0") + theme_classic() + theme(legend.position = "left")

```

## Plot a gene expression heatmap from a SingleCellExperiment object

For the default cluster or selected clusters with Expression Atlas marker genes, you can use the `heatmapSCAtlasExperiment()` function:
```{r}
egeod36552 <- getAtlasSCExperiment( "E-GEOD-36552" )

heatmapSCAtlasExperiment( egeod36552, genes=NULL, sel.K=NULL, scaleNormExp=TRUE, show_row_names=FALSE ) 

# heatmapSCAtlasExperiment( egeod36552, genes=NULL, sel.K=NULL, scaleNormExp=TRUE, show_row_names=FALSE ) 

heatmapSCAtlasExperiment( egeod36552, genes=NULL, sel.K=6, scaleNormExp=TRUE, show_row_names=FALSE ) 

```

For user selected genes:
```{r}
egeod36552 <- getAtlasSCExperiment( "E-GEOD-36552" )

heatmapSCAtlasExperiment( egeod36552, genes=c('ENSG00000151611','ENSG00000020577', 'ENSG00000188869' ), sel.K=NULL, scaleNormExp=FALSE, show_row_names=TRUE ) 

# heatmapSCAtlasExperiment( egeod36552, genes=c('ENSG00000151611','ENSG00000020577', 'ENSG00000188869' ), sel.K=NULL, scaleNormExp=TRUE, show_row_names=TRUE ) 

```

## Dot-plot for a SingleCellExperiment object

For example, if we chose one marker gene from each of the clusters (k = 4) in the previous example, we can use the `dotPlotSCAtlasExperiment()` function to plot the average expression of these genes across the clusters:

```{r, fig.width=7, fig.height=4}
egeod36552 <- getAtlasSCExperiment( "E-GEOD-36552" )

# dotPlotSCAtlasExperiment(egeod36552, genes=c('ENSG00000166681','ENSG00000178928', 'ENSG00000142182' , 'ENSG00000160282' ), sel.K=4)

dotPlotSCAtlasExperiment(egeod36552, genes=c('ENSG00000166681','ENSG00000178928', 'ENSG00000142182' , 'ENSG00000160282' ), sel.K=4, scaleNormExp=TRUE) + theme_classic()
```

# sessionInfo

```{r}
sessionInfo()
```

# References