---
title: 'cTRAP: identifying candidate causal perturbations from differential gene expression data'
author: Bernardo P. de Almeida & Nuno Saraiva-Agostinho
date: "`r Sys.Date()`"
output: 
    rmarkdown::html_vignette:
        toc: true
vignette: >
    \usepackage[utf8]{inputenc}
    %\VignetteIndexEntry{cTRAP: DGE comparison with cellular perturbations}
    %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  chunk_output_type: console
---

---

# Introduction

`cTRAP` is an R package designed to compare differential gene
expression results with those from known cellular perturbations (such as gene
knock-down, overexpression or small molecules) derived from the
[Connectivity Map][clue.io]
([Subramanian et al., Cell 2017][subramanian2017]). Such analyses allow
not only to infer the molecular causes of the observed difference in gene
expression but also to identify small molecules that could drive or revert
specific transcriptomic alterations.

To illustrate the package functionalities, we will use an example based on a
gene knock-down dataset from the [ENCODE project][ENCODE] for which there is
available RNA-seq data. After performing differential expression analyses to the
macthed-control sample, we will compare the respective transcriptomic changes
with the ones caused by all Connectivity Map's gene knock-down perturbations to
identify which ones have similar or inverse transcriptomic changes to the
observed ones. As a positive control, we expect to find the knock-down of the
gene depleted in the ENCODE experiment as one of the most similar transcriptomic
perturbations.

# Getting started

To load the `cTRAP` package into your R environment type:

```{r load package}
library(cTRAP)
```

# Load ENCODE RNA-seq data and perform differential gene expression analysis

In this example, we will use the EIF4G1 shRNA knockdown followed by RNA-seq
experiment in HepG2 cell line from the ENCODE project as the dataset of
interest. The RNA-seq processed data (gene quantifications from RSEM method) for
the EIF4G1 knock-down and respective controls (two replicates each) can be
downloaded by typing:

```{r load ENCODE samples, eval=FALSE}
gene <- "EIF4G1"
cellLine <- "HepG2"

ENCODEmetadata <- downloadENCODEknockdownMetadata(cellLine, gene)
table(ENCODEmetadata$`Experiment target`)
length(unique(ENCODEmetadata$`Experiment target`))

ENCODEsamples <- downloadENCODEsamples(ENCODEmetadata)
counts <- prepareENCODEgeneExpression(ENCODEsamples)
```

```{r load data, include=FALSE}
gene <- "EIF4G1"
cellLine <- "HepG2"

data("ENCODEmetadata")
data("counts")
```

Gene expression data (read counts) were quantile-normalized using [`voom`][voom]
and differential expression analysis was performed using the [`limma`][limma] R
package.

```{r differential gene expression}
# Remove low coverage (at least 10 counts shared across two samples)
minReads   <- 10
minSamples <- 2
filter <- rowSums(counts[ , -c(1, 2)] >= minReads) >= minSamples
counts <- counts[filter, ]

# Convert ENSEMBL identifier to gene symbol
library(biomaRt)
mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
genes <- sapply(strsplit(counts$gene_id, "\\."), `[`, 1)
geneConversion <- getBM(filters="ensembl_gene_id", values=genes, mart=mart,
                        attributes=c("ensembl_gene_id", "hgnc_symbol"))
counts$gene_id <- geneConversion$hgnc_symbol[
    match(genes, geneConversion$ensembl_gene_id)]

# Perform differential gene expression analysis
diffExpr <- performDifferentialExpression(counts)
```

As our metric of differential expression after EIF4G1 shRNA knock-down, we will
use the respective t-statistic.

```{r t-statistics}
# Get t-statistics of differential expression with respective gene names 
# (expected input for downstream analyses)
diffExprStat <- diffExpr$t
names(diffExprStat) <- diffExpr$Gene_symbol
```

# Load Connectivity Map perturbation data

We will compare our differential gene expression metric with all Connectivity
Map's gene knock-down perturbations in the same cell line (HepG2). Note that
this comparison can also be done to perturbations in a different cell line or in
all cell lines, using in the latter the average result across cell lines.
Differential gene expression data for each Connectivity Map's perturbation are
available in z-score normalised values (see
[Subramanian et al., Cell 2017][subramanian2017] for more details).

```{r L1000 metadata conditions}
# Check available conditions for L1000 perturbations
l1000metadata <- downloadL1000data("l1000metadata.txt", "metadata")
getL1000conditions(l1000metadata)
```

```{r L1000 knockdown perturbations, eval=FALSE}
# Code for loading CMap gene KD HepG2 data
l1000metadataKnockdown <- filterL1000metadata(
    l1000metadata, cellLine="HepG2",
    perturbationType="Consensus signature from shRNAs targeting the same gene")
l1000zscores  <- downloadL1000data("l1000zscores.gctx", "zscores",
                                   l1000metadataKnockdown$sig_id)
l1000geneInfo <- downloadL1000data("l1000geneInfo.txt", "geneInfo")

l1000perturbationsKnockdown <- loadL1000perturbations(
    l1000metadataKnockdown, l1000zscores, l1000geneInfo)
```

If the interest is in small molecules, one can download the differential gene
expression z-scores for each small molecule perturbation in HepG2 (given a
specific concentration and time point of RNA extraction) through the following
command:

```{r L1000 small molecule perturbations, eval=FALSE}
l1000metadataSmallMolecules <- filterL1000metadata(
    l1000metadata, cellLine="HepG2", timepoint="24 h", 
    dosage="5 \U00B5M", # \U00B5 is the unicode code for the micro symbol
    perturbationType="Compound")
l1000zscores  <- downloadL1000data("l1000zscores.gctx", "zscores",
                                   l1000metadataSmallMolecules$sig_id)
l1000geneInfo <- downloadL1000data("l1000geneInfo.txt")

l1000perturbationsSmallMolecules <- loadL1000perturbations(
    l1000metadataSmallMolecules, l1000zscores, l1000geneInfo,
    sanitizeCompoundNames=TRUE)
```

```{r load L1000 perturbations, include=FALSE}
data("l1000perturbationsKnockdown")
data("l1000perturbationsSmallMolecules")
```

# Comparison with Connectivity Map's perturbations

This is the main function of the package. Here we compare the statistic for
differential expression (the t-statistic, in this case) with the z-scores for
the perturbations we have loaded using the three different methods available
(Spearman's correlation, Pearson's correlation and Gene Set Enrichment Analysis
(GSEA)). For GSEA, the default option is to use the top and bottom 150 genes
(ranked by the user's t-statistic) as gene sets, but this can be changed by the
user.

```{r compare knockdown, include=FALSE}
compareKnockdown <- list()

# Compare against L1000 using Spearman correlation
compareKnockdown$spearman <- compareAgainstL1000(
    diffExprStat, l1000perturbationsKnockdown, cellLine, method="spearman")

# Compare against L1000 using Pearson correlation
compareKnockdown$pearson <- compareAgainstL1000(
    diffExprStat, l1000perturbationsKnockdown, cellLine, method="pearson")

# Compare against L1000 using gene set enrichment analysis (GSEA) with the top
# and bottom 150 genes
compareKnockdown$gsea <- compareAgainstL1000(
    diffExprStat, l1000perturbationsKnockdown, cellLine, method="gsea",
    geneSize=150)
```

For small molecules:

```{r compare small molecules}
compareSmallMolecule <- list()
# Compare against L1000 using Spearman correlation
compareSmallMolecule$spearman <- compareAgainstL1000(
    diffExprStat, l1000perturbationsSmallMolecules, cellLine, method="spearman")

# Compare against L1000 using Pearson correlation
compareSmallMolecule$pearson <- compareAgainstL1000(
    diffExprStat, l1000perturbationsSmallMolecules, cellLine, method="pearson")

# Compare against L1000 using gene set enrichment analysis (GSEA) with the top
# and bottom 150 genes
compareSmallMolecule$gsea <- compareAgainstL1000(
    diffExprStat, l1000perturbationsSmallMolecules, cellLine, method="gsea",
    geneSize=150)
```

For each method used, this function will return a table with the results of the
comparison with each perturbation tested, including the test statistics
(depending on the method used: Spearman's correlation coefficient, Pearson's
correlation coefficient or GSEA Enrichment Score), the respective p-value and
the Benjamini-Hochberg-adjusted p-value.

```{r order knockdown perturbation comparison}
# Order knockdown perturbations according to similarity
compareKnockdown$spearman_ordered <- compareKnockdown$spearman[
    order(compareKnockdown$spearman$HepG2_spearman_coef, decreasing=TRUE)]
compareKnockdown$pearson_ordered <- compareKnockdown$pearson[
    order(compareKnockdown$pearson$HepG2_pearson_coef, decreasing=TRUE)]
compareKnockdown$gsea_ordered <- compareKnockdown$gsea[
    order(compareKnockdown$gsea$HepG2_WTCS, decreasing=TRUE)]

# Most positively associated perturbations (note that EIF4G1 knockdown is the
# 6th, 1st and 2nd most positively associated perturbation based on Spearman
# correlation, Pearson correlation and GSEA, respectively)
head(compareKnockdown$spearman_ordered)
head(compareKnockdown$pearson_ordered)
head(compareKnockdown$gsea_ordered)

# Most negatively associated perturbations
tail(compareKnockdown$spearman_ordered)
tail(compareKnockdown$pearson_ordered)
tail(compareKnockdown$gsea_ordered)
```

For small molecules:

```{r order small molecule perturbation comparison}
# Order small molecule perturbations according to similarity
compareSmallMolecule$spearman_ordered <- compareSmallMolecule$spearman[
    order(compareSmallMolecule$spearman$HepG2_spearman_coef, decreasing=TRUE)]
compareSmallMolecule$pearson_ordered <- compareSmallMolecule$pearson[
    order(compareSmallMolecule$pearson$HepG2_pearson_coef, decreasing=TRUE)]
compareSmallMolecule$gsea_ordered <- compareSmallMolecule$gsea[
    order(compareSmallMolecule$gsea$HepG2_WTCS, decreasing=TRUE)]

# Most positively associated perturbations
head(compareSmallMolecule$spearman_ordered)
head(compareSmallMolecule$pearson_ordered)
head(compareSmallMolecule$gsea_ordered)

# Most negatively associated perturbations
tail(compareSmallMolecule$spearman_ordered)
tail(compareSmallMolecule$pearson_ordered)
tail(compareSmallMolecule$gsea_ordered)
```

# Plot

To analyse the relationship between the user-provided differential expression
profile with that associated with a specific perturbation, we provide the option
to create a scatter plot (for Spearman and Pearson analyses) or a GSEA plot.

Below you can see how to plot the relationship between the ENCODE's EIF4G1 shRNA
knockdown with the perturbations of interest.

```{r plot knockdown perturbation comparison}
# Plot relationship with EIF4G1 knockdown
EIF4G1knockdown <- grep("EIF4G1", compareKnockdown$gsea_ordered$genes,
                        value=TRUE)
plotL1000comparison(compareKnockdown$spearman, EIF4G1knockdown)
plotL1000comparison(compareKnockdown$pearson, EIF4G1knockdown)
plotL1000comparison(compareKnockdown$gsea, EIF4G1knockdown, topGenes=TRUE)
plotL1000comparison(compareKnockdown$gsea, EIF4G1knockdown, topGenes=FALSE)

# Plot relationship with most negatively associated perturbation
plotL1000comparison(compareKnockdown$spearman,
                    tail(compareKnockdown$spearman_ordered, 1)$genes)
plotL1000comparison(compareKnockdown$pearson,
                    tail(compareKnockdown$pearson_ordered, 1)$genes)
plotL1000comparison(compareKnockdown$gsea,
                    tail(compareKnockdown$gsea_ordered, 1)$genes, 
                    topGenes=TRUE)
plotL1000comparison(compareKnockdown$gsea, 
                    tail(compareKnockdown$gsea_ordered, 1)$genes, 
                    topGenes=FALSE)
```

For small molecules:

```{r plot small molecule perturbation comparison}
# Plot relationship with most positively associated perturbation
plotL1000comparison(compareSmallMolecule$spearman,
                    head(compareSmallMolecule$spearman_ordered, 1)$genes)
plotL1000comparison(compareSmallMolecule$pearson,
                    head(compareSmallMolecule$pearson_ordered, 1)$genes)
plotL1000comparison(compareSmallMolecule$gsea,
                    head(compareSmallMolecule$gsea_ordered, 1)$genes)

# Plot relationship with most negatively associated perturbation
plotL1000comparison(compareSmallMolecule$spearman,
                    tail(compareSmallMolecule$spearman_ordered, 1)$genes)
plotL1000comparison(compareSmallMolecule$pearson,
                    tail(compareSmallMolecule$pearson_ordered, 1)$genes)
plotL1000comparison(compareSmallMolecule$gsea,
                    tail(compareSmallMolecule$gsea_ordered, 1)$genes)
```

# Feedback

All feedback on the program, documentation and associated material (including
this tutorial) is welcome. Please send any suggestions and comments to:

> Nuno Saraiva-Agostinho (nunoagostinho@medicina.ulisboa.pt)
>
> Bernardo P. de Almeida (bernardo.almeida94@gmail.com)
>
> [Disease Transcriptomics Lab, Instituto de Medicina Molecular (Portugal)][iMM]

[iMM]: http://imm.medicina.ulisboa.pt/group/distrans/
[ENCODE]: https://www.encodeproject.org/
[clue.io]: https://clue.io/
[limma]: https://www.ncbi.nlm.nih.gov/pubmed/25605792
[voom]: https://www.ncbi.nlm.nih.gov/pubmed/24485249
[subramanian2017]: https://doi.org/10.1016/j.cell.2017.10.049