---
title: "Package Vignette for Genomic Interactions: ChIA-PET data"
author: "Malcolm Perry"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{GenomicInteractions-ChIAPET}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---


```{r global_chunk_opts, echo=F, include=F, eval=FALSE}
library(knitr)
opts_chunk$set(fig.width=12, fig.height=8)
```

## ChIA-PET

Chromatin interaction analysis with paired-end tag sequencing (ChIA-PET) is a recent
method to study protein-mediated interactions at a genome-wide scale. Like most techniques
for studying chromatin interaction it is based on [chromosome conformation capture](http://en.wikipedia.org/wiki/Chromosome_conformation_capture)
technology. Unlike 3C, 4C and 5C, however, it can detect interactions genome-wide, and 
includes a [ChIP](http://en.wikipedia.org/wiki/Chromatin_immunoprecipitation) step to
purify interactions involving a protein of interest.

The raw data from ChIA-PET is in the form of paired-end reads attached to one of two
linker sequences. Reads with chimeric linkers are removed, and the data is aligned to the
reference genome. The [**ChIA-PET tool**](http://genomebiology.com/2010/11/2/R22) can then
be used to find pairs of regions ("anchors") which have a significant number of reads mapping
between them and therefore represent biologically meaningful chromatin 
interactions in the sample.

## Imports

First we need to load the GenomicInteractions package, and the mm9 reference genome:

```{r imports, warning=F, results="hide"}
library(GenomicInteractions)
library(InteractionSet)
library(GenomicRanges)
```

## Data

We can then read in our data directly from the output of the 
[**ChIA-PET tool**](http://genomebiology.com/2010/11/2/R22). At
this stage we can also provide information about the cell type and a
description tag for the experiment. The data is taken from Li et al., 2012, 
published in [**Cell**](http://www.sciencedirect.com/science/article/pii/S0092867411015170).
They have used antibodies against the initiation form of Pol II, which you would expect to
find at active promoters, and we are looking at data from the K562 myelogenous leukemia cell
line. The data should therefore give us an insight into the processes which regulate genes
that are being actively transcribed.

```{r load_data}
chiapet.data = system.file("extdata/k562.rep1.cluster.pet3+.txt", 
                           package="GenomicInteractions")

k562.rep1 = makeGenomicInteractionsFromFile(chiapet.data, 
                                type="chiapet.tool", 
                                experiment_name="k562", 
                                description="k562 pol2 8wg16")
```

This loads the data into a `GenomicInteractions` object, which consists of two
linked `GenomicRanges` objects containing the anchors in each interaction, as
well as the p-value, FDR and the number of reads supporting each interaction.

## GenomicInteractions Objects

The metadata we have added can easily be accesed, and edited:

```{r metadata}
name(k562.rep1)
description(k562.rep1) = "PolII-8wg16 Chia-PET for K562"
```

As can the data from the ChIA-PET experiment:

```{r gi_data_access}
head(interactionCounts(k562.rep1))
head((k562.rep1)$fdr)
hist(-log10(k562.rep1$p.value))
```

The two linked `GRanges` objects can be returned, but not altered in-place:

```{r anchor_access}
anchorOne(k562.rep1)
anchorTwo(k562.rep1)
```

`GenomicInteractions` objects can easily handle interactions detected between
chromosomes, known as *trans*-chromosomal interactions, since the anchors can
be at any point along the genome. `is.trans` returns a logical vector;
likewise `is.cis` is the opposite of this function.

```{r trans}
sprintf("Percentage of trans-chromosomal interactions %.2f", 
        100*sum(is.trans(k562.rep1))/length(k562.rep1))
```

The length of each interaction is not stored as metadata, but we can calculate
the distance of each interaction using either the inner edge, outer edge or
midpoints of the anchors. This is undefined for inter-chromosomal interactions,
so NA is returned, so it is important to exclude these interactions from some
analyses.

```{r short_range_interactions}
head(calculateDistances(k562.rep1, method="midpoint"))
```

`GenomicRanges` objects can be subsetted by either integer or logical vectors like
most R objects, and also BioConductor `Rle` objects. 

```{r subsetting}
k562.rep1[1:10] # first interactions in the dataset
k562.rep1[sample(length(k562.rep1), 100)] # 100 interactions subsample
k562.cis = k562.rep1[is.cis(k562.rep1)]
```

The length of each interaction is not stored as metadata, but we can calculate
the distance of each interaction using either the inner edge, outer edge or
midpoints of the anchors. Since this is undefinable for *trans*-chromosomal interactions
it is best to first subset only *cis* interactions before calling `calculateDistances`,
otherwise `NA`s will be present in the returned vector.

```{r susbet_distance}
head(calculateDistances(k562.cis, method="midpoint"))
k562.short = k562.cis[calculateDistances(k562.cis) < 1e6] # subset shorter interactions
hist(calculateDistances(k562.short)) 
```

We can also subset based on the properties of the linked `GRanges` objects.

```{r subset_chr}
chrom = c("chr17", "chr18")
sub = as.vector(seqnames(anchorOne(k562.rep1)) %in% chrom & seqnames(anchorTwo(k562.rep1)) %in% chrom)
k562.rep1 = k562.rep1[sub]
```

## Annotation

Genomic Interaction data is often used to look at the interactions between different
elements in the genome, which are believed to have different functional roles. Interactions
between promoters and their transcription termination sites, for example, are thought to
be a by-product of the transcription process, whereas long-range interactions with enhancers
play a role in gene regulation.

Since `GenomicInteractions` is based on `GenomicRanges`, it is very easy to
interrogate `GenomicInteractions` objects using `GenomicRanges` data. In the
example, we want to annotate interactions that overlap the promoters,
transcription termination sites or the body of any gene. Since this can 
be a time-consuming and data-heavy process, this example runs the analysis
for only chromosomes 17 & 18.

First we need the list of RefSeq transcripts:

```{r annotation_features, eval=F}
library(GenomicFeatures)

hg19.refseq.db <- makeTxDbFromUCSC(genome="hg19", table="refGene")
refseq.genes = genes(hg19.refseq.db)
refseq.transcripts = transcriptsBy(hg19.refseq.db, by="gene")
non_pseudogene = names(refseq.transcripts) %in% unlist(refseq.genes$gene_id) 
refseq.transcripts = refseq.transcripts[non_pseudogene] 
```

Rather than downloading the whole Refseq database, these are provided for
chromosomes 17 & 18:

```{r load_trascripts}
data("hg19.refseq.transcripts")
refseq.transcripts = hg19.refseq.transcripts
```

We can then use functions from `GenomicRanges` to call promoters and
terminators for these transcripts. We have taken promoter regions to be within
2.5kb of an annotated TSS and terminators to be within 1kb of the end of an
annotated transcript. Since genes can have multiple transcripts, they can also
have multiple promoters/terminators, so these are `GRangesList` objects, which makes
handling these objects slightly more complicated.

```{r magic}
refseq.promoters = promoters(refseq.transcripts, upstream=2500, downstream=2500)
# unlist object so "strand" is one vector
refseq.transcripts.ul = unlist(refseq.transcripts) 
# terminators can be called as promoters with the strand reversed
strand(refseq.transcripts.ul) = ifelse(strand(refseq.transcripts.ul) == "+", "-", "+") 
refseq.terminators.ul = promoters(refseq.transcripts.ul, upstream=1000, downstream=1000) 
# change back to original strand
strand(refseq.terminators.ul) = ifelse(strand(refseq.terminators.ul) == "+", "-", "+") 
# `relist' maintains the original names and structure of the list
refseq.terminators = relist(refseq.terminators.ul, refseq.transcripts)
```

These can be used to subset a `GenomicInteractions` object directly from
`GRanges` using the `GenomicRanges` overlaps methods. `findOverlaps` called on
a `GenomicInteractions` object will return a list containing `Hits` objects for
both anchors.

We can finds any interactions involving a RefSeq promoter:

```{r overlaps_methods}
subsetByFeatures(k562.rep1, unlist(refseq.promoters))
```

However, one of the most powerful features in the `GenomicInteractions` package
is the ability to annotate each anchor with a list of genomic regions and then
summarise interactions according to these features. This annotation is
implemented as metadata columns for the anchors in the `GenomicInteractions`
object and so is fast, and facilitates more complex analyses.

The order in which we annotate the anchors is important, since each anchor can
only have one `node.class`. The first listed take precedence. Any regions not
overlapping ranges in `annotation.features` will be labelled as `distal`.

```{r annotation}
annotation.features = list(promoter=refseq.promoters, 
                           terminator=refseq.terminators, 
                           gene.body=refseq.transcripts)
annotateInteractions(k562.rep1, annotation.features)
annotationFeatures(k562.rep1)
```

We can now find interactions involving promoters using the annotated
`node.class` for each anchor:

```{r node.class}
p.one = anchorOne(k562.rep1)$node.class == "promoter"
p.two = anchorTwo(k562.rep1)$node.class == "promoter"
k562.rep1[p.one|p.two]
```

This information can be used to categorise interactions into promoter-distal,
promoter-terminator etc. A table of interaction types can be generated with 
`categoriseInteractions`:

```{r categorise_interactions}
categoriseInteractions(k562.rep1)
```

Alternatively, we can subset the object based on interaction type:

```{r is_interaction_type}
k562.rep1[isInteractionType(k562.rep1, "terminator", "gene.body")]
```

The 3 most common `node.class` values have short functions defined for convenience
(see `?is.pp` for a complete list):

```{r short_types, eval=F}
k562.rep1[is.pp(k562.rep1)] # promoter-promoter interactions
k562.rep1[is.dd(k562.rep1)] # distal-distal interactions
k562.rep1[is.pt(k562.rep1)] # promoter-terminator interactions
```

Summary plots of interactions classes can easily be produced to get an overall feel
for the data:

```{r interaction_classes}
plotInteractionAnnotations(k562.rep1, other=5)
```

`viewpoints` will only take those interactions with a certain `node.class`:

```{r promoter_classes, warning=F}
plotInteractionAnnotations(k562.rep1, other=5, viewpoints="promoter")
```

These are also combined in the function `plotSummaryStats`.

## Feature Summaries

The `summariseByFeatures` allows us to look in more detail at interactions
involving a specific set of loci. In this example we use all RefSeq promoters,
which we already have loaded in a `GRangesList` object. 

It is however possible to use any dataset which can be represented as a named
`GRanges` object, for example transcription-factor ChIP data, predicted
cis-regulatory sites or certain categories of genes.

The categories are generated automatically from the annotated `node.class`
values in the object.

```{r summarise, warning=F}
k562.rep1.promoter.annotation = summariseByFeatures(k562.rep1, refseq.promoters, 
                                                    "promoter", distance.method="midpoint", 
                                                    annotate.self=TRUE)
colnames(k562.rep1.promoter.annotation)
```

This allows us to very quickly generate summaries of the data and provides a
quick method to isolate genes of interest. In this case we produce lists of
RefSeq IDs, which can easily be converted to EntrezIDs or gene symbols through
existing BioConductor packages (in this case `org.Hs.eg.db` provides bimaps between
common human genome annotations).

Which promoters have the strongest Promoter-Promoter interactions based on PET-counts?

```{r p.p.interactions}
i = order(k562.rep1.promoter.annotation$numberOfPromoterPromoterInteractions, 
          decreasing=TRUE)[1:10]
k562.rep1.promoter.annotation[i,"Promoter.id"]
```

Which promoters are contacting the largest number of distal elements?

```{r enhancers}
i = order(k562.rep1.promoter.annotation$numberOfUniquePromoterDistalInteractions, 
          decreasing=TRUE)[1:10]
k562.rep1.promoter.annotation[i,"Promoter.id"]
```

What percentage of promoters are in contact with transcription termination sites?

```{r terminators}
total = sum(k562.rep1.promoter.annotation$numberOfPromoterTerminatorInteractions > 0)
sprintf("%.2f%% of promoters have P-T interactions", 100*total/nrow(k562.rep1.promoter.annotation))
```

## References
1. Li, Guoliang, et al. "Software ChIA-PET tool for comprehensive chromatin
   interaction analysis with paired-end tag sequencing." Genome Biol 11 (2010):
   R22.

2. Li, Guoliang, et al. "Extensive promoter-centered chromatin interactions
   provide a topological basis for transcription regulation." Cell 148.1
   (2012): 84-98