A simple example for a power calculation for RNA-Seq
====================================================

Assume we want to find genes that differ between two sub-types
of a cancer entity, say without and with a certain somatice mutation. We envision to get tumour samples from N
patients of one sub-type and from N patients of the other subtype. While many genes may be weakly affected by the presence of the somatic variant, we hope to find most of those genes, which change at least by a factor of 2.

How many samples do we need?

Let's use DESeq2 to perform a mock analysis.

```{r}
library( Biobase )
library( DESeq2 )
```


How strong is the inter-subject variability in typical tumour samples?

Find a suitable data set.

We use here the Pickrell et al. (Nature, 2010) dataset with RNA-Seq data from lymphoblastoid cell lines (LCLs) derived from 69 Yoruban HapMap subjects. A true cancer data set might be more suitable, though.

We download the count table for this data set from ReCount (Frazee et al., BMC Bioinformatics, 2011).

```{r}
#montpick <- load( url( "http://bowtie-bio.sourceforge.net/recount/ExpressionSets/montpick_eset.RData" ) )
load( "montpick_eset.RData" )
countTable <- exprs(montpick.eset)[ , pData( montpick.eset )$study == "Pickrell" ]
```

Let's pick 30 samples at random, assign half to sub-type "A" and half to sub-type "B", then make a DESeq2 object.

```{r}
n <- 30

randomPick <- sample( colnames(countTable), n )

dds <- DESeqDataSetFromMatrix( 
   countTable[ , randomPick ],
   data.frame( subtype = rep( c( "A", "B" ), each=n/2 ) ),
   ~ subtype )
```

Pick 100 genes at random, make half of them 2-fold higher in subtype "A" and half two-fold higher in subtype "B"

```{r}
nGenes <- 100
affectedGenes <- sample( rownames(dds), nGenes )
upGenes <- affectedGenes[ 1:(nGenes/2) ]
dnGenes <- affectedGenes[ (nGenes/2+1):nGenes ]
```
To make things symmetric, we either multiply or divide by sqrt(2):
```{r}
fc <- 2

counts(dds)[ upGenes, colData(dds)$subtype == "A" ] <- as.integer(round( 
   counts(dds)[ upGenes, colData(dds)$subtype == "A" ] / sqrt(fc) ))
counts(dds)[ upGenes, colData(dds)$subtype == "B" ] <- as.integer(round( 
   counts(dds)[ upGenes, colData(dds)$subtype == "B" ] * sqrt(fc) ))
counts(dds)[ dnGenes, colData(dds)$subtype == "A" ] <- as.integer(round( 
   counts(dds)[ dnGenes, colData(dds)$subtype == "A" ] * sqrt(fc) ))
counts(dds)[ dnGenes, colData(dds)$subtype == "B" ] <- as.integer(round( 
   counts(dds)[ dnGenes, colData(dds)$subtype == "B" ] / sqrt(fc) ))
```

Now, do an analysis:

```{r}
dds <- DESeq( dds )
```

```{r}
res <- results( dds )
signGenes <- rownames(res)[ which( res$padj < .1 ) ]

table( 
   found = rownames(dds) %in% signGenes, 
   truth = rownames(dds) %in% affectedGenes )
```

Highlight the affected genes in an MA plot

```{r}
plotMA( res, ylim=c( -3, 3 ) )
points( res[upGenes,"baseMean"], res[upGenes,"log2FoldChange"], col="red", cex=1.5 )
points( res[dnGenes,"baseMean"], res[dnGenes,"log2FoldChange"], col="blue", cex=1.5 )
```


Session info
```{r}
sessionInfo()
```