---
title: "GeneExpressionSignature: Computing pairwise distances between different biological states"
author: 
  - Yang Cao
  - Fei Li
package: GeneExpressionSignature
output: BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{GeneExpressionSignature}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: ref.bib
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(BiocStyle)
```

# Introduction

The `r Biocpkg("GeneExpressionSignature")` package utilizes gene expression 
profiles to measure the similarity between different biological states. It
provides two algorithms for similarity measurement: the GSEA algorithm which is 
mentioned in [@iorio2010discovery] and the PGSEA algorithm in 
`r Rpackage("PGSEA")` package. A further description of the measurement methods 
based on gene expression signature can be found in Lamb [@lamb2006connectivity], 
Hu [@hu2009human] and Iorio [@iorio2010discovery].

This manual is a brief introduction to structure, functions and usage of 
`r Biocpkg("GeneExpressionSignature")` package. It shows how the biological 
similarity is determined through a series of calculation steps and how that 
information can be used for further cluster analysis.

The current version of `r Biocpkg("GeneExpressionSignature")` can be used only 
with data coming from the same platform, examples are on the HG-U133A platform.

# Getting Started

A complete analysis procedure accepts a set of gene expression profiles 
representing different biological states as input, and generates a similarity 
matrix as output. It can be divided into three steps: 1)data ranking, 2)rank 
merging, and 3)similarity measuring.

First, we load the package by entering the following command in your R session:

```{r load-ges}
library(GeneExpressionSignature)
```

## Data Ranking

Gene expression profiles should be properly preprocessed before analysis as 
prerequisite, including background correction, normalization and summarization. 
Instead of the exact values, ranks of gene expression levels are used in the 
following procedure. A ranked list of genes was obtained first by sorting the 
micro-array probe-set identifiers according to the different expression 
values (count or ratio). It should be noticed that there is no standard methods 
for data preprocessing, and there is a function `getRLs()` which takes the 
method in C-MAP for data preprocessing just for reference. We can obtain ranked 
lists matrix by calling `getRLs()`.

Your experimental data could be used for analyzing, or users can download 
gene-expression profiles from the GEO database with R package 
`r Biocpkg("GEOquery")`. Users can see the doc in the `r Biocpkg("GEOquery")` 
for more details.

As an example, we download data from GEO database with package 
`r Biocpkg("GEOquery")`. Then combined the treatment expression values to form 
a treatment matrix as well as the control expression values.

```{r get-ranklist,message=FALSE}
# If you have network access
#GSM118720 <- getGEO('GSM118720')
# GSM118721 <- getGEO('GSM118721')
if (require(GEOquery)){
  #treatment gene-expression profiles
  GSM118720 <- getGEO(
    filename = system.file(
      "extdata/GSM118720.soft",
      package = "GeneExpressionSignature")
  )
  #control gene-expression profiles
  GSM118721 <- getGEO(
    filename=system.file(
      "extdata/GSM118721.soft",
      package = "GeneExpressionSignature")
  )
  #data ranking according to the different expression values 
  control <- as.matrix(as.numeric(Table(GSM118721)[, 2]))
  treatment <- as.matrix(as.numeric(Table(GSM118720)[, 2]))
  ranked_list <- getRLs(control, treatment)
}
```

## Rank Merging

By rank merging, multiple ranked lists are merging into a single ranked list, 
referred as prototype ranked list (PRL), representing certain kind of biological 
state. This procedure is mainly performed before similarity measuring, and 
applied to specific situations that occur when multiple ranked list are 
assigned to one single biological state with different cell types or 
experimental condition.

However, two different cases should be considered: 1) all ranked list with the 
same biological state are treated equally important; 2) each individual ranked 
lists has its own ranked weights. This package provides two commonly employed 
algorithms: one utilizes the Kruskal algorithm proposed by Iorio 
[@iorio2010discovery] for the former case and another takes the average ranking 
technique a simple but rather useful method. Function `RankMering()` is provided 
for aggregating the ranked lists into one or many PRLs according their 
phenotypic data. All the things that we need to do is construct a `ExpressionSet` 
object as input, with ranked lists as assay data and corresponding biological 
states as phenotypic data.

For convenience, ranking data stored as `ExpressionSet` class in `eset` object 
as input data, with ranked lists (obtained by calling `getRLs()`) as assay data 
and corresponding biological states as phenotypic data. As an example, we start 
from loading cultured the exampleSet data, a subset of C-MAP 
[@lamb2006connectivity] as sample data, which is a large reference catalogue of
gene expression data from cultured human cells perturbed with many chemicals 
and genetic reagents. The sub dataset is composed of 50 paired gene expression 
profiles involving 22283 genes. This profiles are obtained from cells treated 
15 compounds respectively, the values of which already converted to rank orders. 

```{r sample-data}
data(exampleSet)
show(exampleSet)
exprs(exampleSet)[c(1:10), c(1:3)]
levels(as(phenoData(exampleSet), "data.frame")[, 1])
```

Rank merging process will generate a mergingSet of 15 PRLs from 50 paired 
expression profiles with each PRL corresponding one of 15 compounds 
respectively.

```{r rank-merge}
MergingSet <- RankMerging(exampleSet, "Spearman", weighted = TRUE)
show(MergingSet)
```

## Similarity Measuring

One single combined PRL for a state was obtained after rank merging procedure. 
These PRLs are used to measure the similarity of the gene signature across 
different biological states by scoring functions `ScoreGSEA()` and 
`ScorePGSEA()`. Not all the genes are involved in similarity measuring, but 
only a subset of genes called gene signature whose combined expression pattern 
is uniquely characteristic of the biological state. Generally the genes used as 
gene signatures in the similarity scoring procedure are predefined by priori 
knowledge. Iorio [@iorio2010discovery] proposed an "optimal signature" 
approach by taking the most up-regulated genes and the most down-regulated 
genes as gene signature.The size of gene signatures need to be considered, which 
is taken as another parameter besides the PRLs in similarity measuring. In most 
cases, the default size of gene signature is 250 for genome-wide expression 
profile.

Suppose `N` is the number of PRLs (also same as the number of biological states), 
an `N x N` distance matrix is generated by similarity measurement. For mergingSet, 
we will get a `15 x 15` matrix corresponding to the similarity distances 
between these compounds.

```{r gsea}
ds <- ScoreGSEA(MergingSet, 250, "avg")
ds[1:5, 1:5]
```

## Signature Distance

As we mentioned above, four algorithms implemented as functions `getRLs()`, 
`RankMerging()`, `ScoreGSEA()`, and `ScorePGSEA()`, one is for data 
preprocessing, one called Iorio algorithm is for rank merging, the other two 
algorithms called GSEA and PGSEA are for similarity measuring. Moreover, 
function `SignatureDistance()` is provided to serve as a single entry and easy 
access point to rank merging and similarity measuring, which runs through the  
including rank merging and scoring, and is recommended to use in most cases. 
Data ranking is not integration into this function for no standard methods for 
data preprocessing and gene-expression data types is uncertain. Furthermore, 
there is no effective method to integrate data from different platforms. 
Function `getRLs()` which takes the method in C-MAP for data preprocessing just 
for reference.

```{r sig-dis}
SignatureDistance(
  exampleSet,
  SignatureLength = 250,
  MergingDistance = "Spearman",
  ScoringMethod = "GSEA",
  ScoringDistance = "avg",
  weighted = TRUE
)
```

# Implementation Details

## Adaptively Weighted Rank Merging

The Iorio's rank merging algorithm utilizes Kruskal algorithm 
[@cormen1990introduction] to merge the ranked lists which corresponding to a 
same biological state. The distance of these ranked lists must be calculated 
first, a measure of the distance between two ranked lists is computed using 
"Spearman" algorithm or "Kendall tau" algorithm. It should be noticed that
rank merging with  Kendall tau distance is time consuming, so we recommend 
selecting the "Spearman" distance. Next, merge the two or more ranked lists 
with the same biological state using "Borda merging" algorithm.

According to the "Kruskal" algorithm method [@cormen1990introduction], this 
rank merging algorithm searches for the two ranked lists with the smallest 
Spearman's Footrule distance first, and then merges them using the Borda 
Merging method, obtaining a new ranked list. Finally, the new list replaces the 
two unmerged lists. This process won't terminate until only one list remains.

For convenience, users can directly obtain a PRL for each state by the function 
`RankMerging()`, which uses "Sprearman", "BordaMerging", and "Kruskal" 
algorithms to aggregate the ranked lists obtained with the same biological 
state. For instance, we will merge the sample data which with 50 samples into 
15 samples.

```{r merge-detail}
MergingSet <- RankMerging(exampleSet, "Spearman", weighted = TRUE)
show(MergingSet)
```

### Eqully Weighted Rank Merging

A simple but rather useful method for this problem is the average ranking 
technique. The technique is a two step process when we are under the assumption 
that importance is equally weighted for each ranked list. First step is
to calculate average rank for each ranked list and then the second step is to 
construct their final rankings.


### Similarity Measuring

Once ranked lists with same biological states are merged to one single PRL, 
Gene Set Enrichment Analysis (GSEA) and Parametric Gene Set Enrichment Analysis 
(PGSEA) are adopted to measure the similarity among these PRLs. 

GSEA algorithm [@subramanian2005gene] is a nonparametric, rank-based method for 
similarity measuring to determine whether a priori defined set of genes shows 
statistically significant, concordant differences between two biological states,
whereas PGSEA algorithm is a modified gene set enrichment analysis method based 
on a parametric statistical analysis model. Both of these two functions gives 
the corresponding p value, function `ScoreGSEA()` calcutes the empirical p 
values from Monte Carlo Procedures [@north2002note].

```{r gsea-detail}
ds <- ScoreGSEA(MergingSet,250,"avg")
ds[1:5,1:5]
```

```{r pgsea-detail}
ds <- ScorePGSEA(MergingSet,250,"avg")
ds[1:5,1:5]
```

# Futher Analysis

To illustrate how to use `r Biocpkg("GeneExpressionSignature")` in analysis of 
gene expression signatures, affinity propagation clustering can be used to group 
these biological states by the similarity of gene signature. Affinity 
propagation cluster algorithm iteratively searches for optimal clustering by 
maximizing an objective function called net similarity. Here, we use function 
in `r CRANpkg("apcluster")` package to classify the 15 biological states into 
3 groups. In this step, R package `r CRANpkg("apcluster")` should also be 
installed on your computer.

```{r cluster}
if (require(apcluster)){
  library(apcluster)
  clusterResult <- apcluster(1 - ds)
  show(clusterResult)
}
```

Cytoscape is used to visualize the result of clustering. In the network, nodes 
denotes different compounds (cell states treated with different compounds), and 
the edge means the similarity distance between these two compounds is lower 
than a threshold, which is 0.68 here. Different colors denote different groups, 
as the classification of compounds. We note that the largest group is numbered 
9 nodes, and the other two consist of 3 nodes for each group.

```{r cluster-graph,echo=FALSE}
knitr::include_graphics("cluster.png")
```

# Session Information {-}

```{r session}
sessionInfo()
```

# References