---
title: "Use of CAGE resources with CAGEr"
author:
 - "Vanja Haberle"
 - "Charles Plessy"
package: CAGEr
output: 
  BiocStyle::html_document:
    toc: true
bibliography: CAGEr.bib
vignette: >
  %\VignetteIndexEntry{Use of CAGE resources with CAGEr}
  %\VignetteEncoding[utf8]{inputenc}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  chunk_output_type: console
---

```{r setup, echo = FALSE, results = "hide"}
options(signif = 3, digits = 3)
knitr::opts_chunk$set(tidy = FALSE, cache = TRUE, autodep = TRUE, fig.height = 5.5,
                      message = FALSE, error = FALSE, warning = TRUE)
set.seed(0xdada)
```


Available CAGE data resources
=============================

Note: this still uses the `CAGEset` class.


There are several large collections of CAGE data available that provide single
base-pair resolution TSSs for numerous human and mouse primary cells, cell lines
and tissues. Together with several minor datasets for other model organisms
(_Drosophila melanogaster_, _Danio rerio_) they are a valuable resource that
provides cell/tissue type and developmental stage specific TSSs essential for
any type of promoter centred analysis. By enabling direct and user-friendly
import of TSSs for selected samples into R, _CAGEr_ facilitates the integration
of these precise TSS data with other genomic data types. Each of the available
CAGE data resources accessible from within _CAGEr_ is explained in more detail
further below.

## FANTOM5

FANTOM consortium provides single base-pair resolution TSS data for numerous
human and mouse primary cells, cell lines and tissues. The main FANTOM5
publication [@Consortium:2014hz] released ~1000 human and ~400 mouse CAGE
samples that cover the vast majority of human primary cell types, mouse
developmental tissues and a large number of commonly used cell lines. These data
is available from FANTOM web resource <http://fantom.gsc.riken.jp/5/data/> in
the form of TSS files with genomic coordinates and number of tags mapping to
each TSS detected by CAGE. The list of all available samples for both human and
mouse (as presented in the Supplementary Table 1 of the publication) has been
included in _CAGEr_ to facilitate browsing, searching and selecting samples
of interest. TSSs for selected samples are then fetched directly form the web
resource and imported into a `CAGEset` object enabling their further
manipulation with _CAGEr_.

## FANTOM3 and 4

Previous FANTOM projects (3 and 4) [@Consortium:2005kp,@Faulkner:2009fw,@Suzuki:2009gy])
produced CAGE datasets for multiple human and mouse tissues as well as several
timecourses, including differentiation of a THP-1 human myeloid leukemia cell
line. All this TSS data has been grouped into datasets by the organism and
tissue of origin and has been collected into an R data package named
_FANTOM3and4CAGE_, which is available from Bioconductor <https://bioconductor.org/packages/FANTOM3and4CAGE>.
The vignette accompanying the package provides information on available datasets
and lists of samples. When the data package is installed, _CAGEr_ can import the
TSSs for selected samples directly into a `CAGEset` object for further
manipulation.

## ENCODE cell lines

ENCODE consortium produced CAGE data for common human cell lines
[@Djebali:2012hc], which were used by ENCODE for various other types of
genome-wide analyses. The advantage of this dataset is that it enables the
integration of precise TSSs from a specific cell line with many other
genome-wide data types provided by ENCODE for the same cell line. However, the
format of CAGE data provided by ENCODE at UCSC (<http://genome.ucsc.edu/ENCODE/dataMatrix/encodeDataMatrixHuman.html>)
includes only raw mapped CAGE tags and their coverage along the genome, and
coordinates of enriched genomic regions (peaks), which do not take advantage of
the single base-pair resolution provided by CAGE. To address this, we have used
the raw CAGE tags to derive single base-pair resolution TSSs and collected them
into an R data package named _ENCODEprojectCAGE_. This data package is
available for download from CAGEr web site at <http://promshift.genereg.net/CAGEr>
and includes TSSs for 36 different cell lines fractionated by cellular
compartment. The vignette accompanying the package provides information on
available datasets and lists of individual samples. Once the package has been
downloaded and installed, _CAGEr_ can access it to import TSS data for
selected subset of samples for further manipulation and integration.

## Zebrafish developmental timecourse

Precise TSSs are also available for zebrafish (_Danio Rerio_) from CAGE data
published by [@Nepal:2013]. The timecourse covering early
embryonic development of zebrafish includes 12 developmental stages. The TSS
data has been collected into an R data package named _ZebrafishDevelopmentalCAGE_,
which is available for download from CAGEr web site at <http://promshift.genereg.net/CAGEr>.
As with other data packages mentioned above, once the package is installed
_CAGEr_ can use it to import stage-specific single base pair TSSs into a
`CAGEset` object.

Importing public TSS data for manipulation in _CAGEr_
=====================================================

Note: this still uses the `CAGEset` class.

The data from above mentioned resources can be imported into a `CAGEset` object
using the `importPublicData()` function.  It function has four arguments: `source`,
`dataset`, `group` and `sample`. Argument `source` accepts one of the following
values: `"FANTOM5"`, `"FANTOM3and4"`, `"ENCODE"`, or `"ZebrafishDevelopment"`,
which refer to one of the four resources listed above. The following sections
explain how to utilize this function for each of the four currently supported
resources.

## FANTOM5 human and mouse samples

Lists of all human and mouse CAGE samples produced within FANTOM5 project are
available in _CAGEr_. To load the information on human samples type:

```{r}
library(CAGEr)
data(FANTOM5humanSamples)
head(FANTOM5humanSamples)
nrow(FANTOM5humanSamples)
```

There are 988 human samples in total and for each the following information is provided:

 - `sample`: a unique name/label of the sample which should be provided to
   `importPublicData()` function to retrieve given sample
  
 - `type`: type of sample, which can be "cell line", "primary cell" or "tissue"
  
 - `description`: short description of the sample as provided in FANTOM5 main
    publication [@Consortium:2014hz]
  
 - `library_id`: unique ID of the CAGE library within FANTOM5
 
 - `data_url`: URL to corresponding CTSS file at FANTOM5 web resource from which
    the data is fetched

Provided information facilitates searching for samples of interest, _e.g._ we
can search for astrocyte samples:

```{r}
astrocyteSamples <- FANTOM5humanSamples[grep("Astrocyte", 
				FANTOM5humanSamples[,"description"]),]
astrocyteSamples
```

```{r}
data(FANTOM5mouseSamples)
head(FANTOM5mouseSamples)
nrow(FANTOM5mouseSamples)
```

To import TSS data for samples of interest from FANTOM5 we use the `importPublicData()`
function and set the argument `source = "FANTOM5"`. The `dataset` argument can
be set to either `"human"` or `"mouse"`, and the `sample` argument is provided
by a vector of sample lables/names. For example, names of astrocyte samples from
above are:

```{r}
astrocyteSamples[,"sample"]
```

and to import first three samples type:

```
astrocyteCAGEset <- importPublicData(source = "FANTOM5", dataset = "human", 
					sample = astrocyteSamples[1:3,"sample"])
```

The resulting `astrocyteCAGEset` is a `CAGEset` object that can be included in
the _CAGEr_ workflow described above to perform normalisation, clustering,
visualisation, etc.

## _FANTOM3and4CAGE_ data package

To use TSS data from FANTOM3 and FANTOM4 projects, a data package
_FANTOM3and4CAGE_ has to be installed and loaded.  This package is available
from Bioconductor and can be installed by calling:

```
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("FANTOM3and4CAGE")
```

For the list of available datasets with group and sample labels for specific human or mouse samples load the data package and get list of samples:

```{r}
library(FANTOM3and4CAGE)
data(FANTOMhumanSamples)
head(FANTOMhumanSamples)
data(FANTOMmouseSamples)
head(FANTOMmouseSamples)
```

In the above data frames, the columns `dataset`, `group` and `sample` provide
values that should be passed to corresponding arguments in `importPublicData()`
function. For example to import human kidney normal and malignancy samples call:

```
kidneyCAGEset <- importPublicData(source = "FANTOM3and4",
			dataset = "FANTOMtissueCAGEhuman", 
			group = "kidney", sample = c("kidney", "malignancy"))
```

When the samples belong to different groups or different datasets, it is necessary to provide the dataset and group assignment for each sample separately:

```
mixedCAGEset <- importPublicData(source = "FANTOM3and4",
		dataset = c("FANTOMtissueCAGEmouse", "FANTOMtissueCAGEmouse", 
		"FANTOMtimecourseCAGEmouse"), group = c("liver", "liver", 
		"liver_under_constant_darkness"), 
		sample = c("cloned_mouse", "control_mouse", "4_hr"))
```

For more details about datasets available in the \emph{FANTOM3and4CAGE} data package please refer to the vignette accompanying the package.

## _ENCODEprojectCAGE_ data package

TSS data derived from ENCODE CAGE datasets has been collected into
_ENCODEprojectCAGE_ data package, which is available for download from the
_CAGEr_ web site (<http://promshift.genereg.net/CAGEr/>). Downloaded package can
be installed from local using `install.packages()` function from within R and
used with _CAGEr_ as described below.
List of datasets available in this data package can be obtained like this:

```
library(ENCODEprojectCAGE)
data(ENCODEhumanCellLinesSamples)
```

The information provided in this data frame is analogous to the one in previously discussed data package and provides values to be used with `importPublicData()` function. The command to import whole cell CAGE samples for three different cell lines would look like this:

```
ENCODEset <- importPublicData(source = "ENCODE", 
	dataset = c("A549", "H1-hESC", "IMR90"), 
	group = c("cell", "cell", "cell"), sample = c("A549_cell_rep1", 
	"H1-hESC_cell_rep1", "IMR90_cell_rep1"))
```

For more details about datasets available in the _ENCODEprojectCAGE_ data
package please refer to the vignette accompanying the package.

## _ZebrafishDevelopmentalCAGE_ data package

The zebrafish TSS data for 12 developmental stages is collected in
_ZebrafishDevelopmentalCAGE_ data package, which is also available for download
from the _CAGEr_ web site (<http://promshift.genereg.net/CAGEr/>). It can be
installed from local using `install.packages()` function. To get a list of
samples within the package type:

```
library(ZebrafishDevelopmentalCAGE)
data(ZebrafishSamples)
```

In this package there is only one dataset called `ZebrafishCAGE` and all samples
belong to the same group called `development`. To import selected samples from
this dataset type:

```
zebrafishCAGEset <- importPublicData(source = "ZebrafishDevelopment", 
			dataset = "ZebrafishCAGE", group = "development", 
			sample = c("zf_64cells", "zf_prim6"))
```

For more details please refer to the vignette accompanying the data package.


Importing TSS data from any of the four above explained resources results in the
`CAGEset` object that can be directly included into the workflow provided by
_CAGEr_ to perform normalisation, clustering, promoter width analysis,
visualisation, _etc_. This high-resolution TSS data can then easily be
integrated with other genomic data types to perform various promoter-centred
analyses, which does not rely on annotation but uses precise and matched
cell/tissue type TSSs.


Session info {.unnumbered}
==========================

```{r sessionInfo}
sessionInfo()
```

References
==========