# Simple DESeq2 Work flow

## Experiment

The data in this work flow is from an experiment archived in
ArrayExperss with the identifier
[E-MTAB-1147](http://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-1147/)

## Data and annotation sources

The BAM files (containing a subset of the data) are in the current directory, and end with the extension `.bam`.

```{r}
fls <- dir(".", pattern=".bam$")
names(fls) <- sub("_chr14.bam", "", fls)
```

Genome annotations define genome coordinates of gen. Load a
TranscriptDb package

```{r}
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
```

For gene-level differential expression, load all exons from the data
base, and group them by gene

```{r}
ex <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "gene")
```

## count overlaps

Count overlaps using the default counting mode

```{r}
library(GenomicRanges)
library(Rsamtools)
library(parallel); options(mc.cores=detectCores())
counts <- summarizeOverlaps(ex, fls)
```

Remove rows that have 0 counts

```{r}
counts <- counts[rowSums(assay(counts)) != 0,]
```

As a sanity check, calculcate the distance between each sample and
visualize as a heatmap

```{r}
heatmap(as.matrix((dist(t(assay(counts))))))
```

## sample description, from ArrayExpress

The samples are described in ArrayExpress; query ArrayExpress for the
data

```{r}
url <- "http://www.ebi.ac.uk/arrayexpress/files/E-MTAB-1147/E-MTAB-1147.sdrf.txt"
df0 <- read.delim(url, stringsAsFactors=FALSE)
```

Clean the data. We had to look at the data carefully and understand
each column. In the end, we do not need the columns describing FASTQ
files, and the remaining columns then contain duplicated
information. Remove the columns related to FASTQ files, and reduce the
data frame.

```{r}
idx <- sapply(df0, function(x) length(unique(x))) != 16
df <- unique(df0[, idx])                # remove duplicate rows
```

We make the row names equal to the identifier used for BAM files, so
that we can start to coordinate the sample information and the BAM
file information.

```{r}
rownames(df) <- df[,"Comment.ENA_RUN."] # set rownames to names of BAM files
```

Add the replicate and treatment information to the data frame. Compute
these values from the relevant columns of the data frame to avoid data
entry errors.

```{r}
df$Replicate <- factor(sub(".*Replicate ", "Rep.", df$Extract.Name))
df$Treatment <- factor(df$Factor.Value.RNAI.)
```

Finally, add the sample (column) information to `counts`, being sure
that the rows of the data frame are in the same order as the names of
the counts columns

```{r}
colData(counts) <- as(df[colnames(counts),], "DataFrame")
```

Re-do the heat map, but with additional information about replicate
and treatment

```{r}
m <- as.matrix((dist(t(assay(counts)))))
dimnames(m) <- list(counts$Treatment, counts$Replicate)
heatmap(m)
```

## Differential expression

Load the DESeq2 library

```{r}
library(DESeq2)
```

Create a DESeqDataSet, specifying the formula of the experimental
design under consideration

```{r}
dds <- DESeqDataSet(counts, ~ Replicate + Treatment)
```

Run the analysis and summarize the results

```{r}
dds <- DESeq(dds)
res <- results(dds)
```

Explore the results with an MA plot

```{r}
plotMA(dds)
```

and take a look at the genes with strongest log fold change

```{r}
idx <- order(abs(res$log2FoldChange), decreasing=TRUE)
res <- res[idx,]
head(res)
```

## Annotation

Place the results in biological context by using the H. sapiens
organism package to look up the gene names (rownames of the `res`
object) and map them to more human-friendly "SYMBOL" and "GENENAME"
identifiers.

```{r}
library(org.Hs.eg.db)
anno <- select(org.Hs.eg.db, rownames(res), c("SYMBOL", "GENENAME"))
```

Merge the annotation and statistical result

```{r}
idx <- match(anno$ENTREZID, rownames(res))
res <- cbind(res, anno[idx,])
head(res)
```

Save the results to a csv file
```{r}
write.table(as.data.frame(res), "/tmp/E-MTAB-1147-DESeq2-output.csv")
```

## Packages used

These results were created with the following packages
```{r}
sessionInfo()
```