# Simple DESeq2 Work flow ## Experiment The data in this work flow is from an experiment archived in ArrayExperss with the identifier [E-MTAB-1147](http://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-1147/) ## Data and annotation sources The BAM files (containing a subset of the data) are in the current directory, and end with the extension `.bam`. ```{r} fls <- dir(".", pattern=".bam$") names(fls) <- sub("_chr14.bam", "", fls) ``` Genome annotations define genome coordinates of gen. Load a TranscriptDb package ```{r} library(TxDb.Hsapiens.UCSC.hg19.knownGene) ``` For gene-level differential expression, load all exons from the data base, and group them by gene ```{r} ex <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "gene") ``` ## count overlaps Count overlaps using the default counting mode ```{r} library(GenomicRanges) library(Rsamtools) library(parallel); options(mc.cores=detectCores()) counts <- summarizeOverlaps(ex, fls) ``` Remove rows that have 0 counts ```{r} counts <- counts[rowSums(assay(counts)) != 0,] ``` As a sanity check, calculcate the distance between each sample and visualize as a heatmap ```{r} heatmap(as.matrix((dist(t(assay(counts)))))) ``` ## sample description, from ArrayExpress The samples are described in ArrayExpress; query ArrayExpress for the data ```{r} url <- "http://www.ebi.ac.uk/arrayexpress/files/E-MTAB-1147/E-MTAB-1147.sdrf.txt" df0 <- read.delim(url, stringsAsFactors=FALSE) ``` Clean the data. We had to look at the data carefully and understand each column. In the end, we do not need the columns describing FASTQ files, and the remaining columns then contain duplicated information. Remove the columns related to FASTQ files, and reduce the data frame. ```{r} idx <- sapply(df0, function(x) length(unique(x))) != 16 df <- unique(df0[, idx]) # remove duplicate rows ``` We make the row names equal to the identifier used for BAM files, so that we can start to coordinate the sample information and the BAM file information. ```{r} rownames(df) <- df[,"Comment.ENA_RUN."] # set rownames to names of BAM files ``` Add the replicate and treatment information to the data frame. Compute these values from the relevant columns of the data frame to avoid data entry errors. ```{r} df$Replicate <- factor(sub(".*Replicate ", "Rep.", df$Extract.Name)) df$Treatment <- factor(df$Factor.Value.RNAI.) ``` Finally, add the sample (column) information to `counts`, being sure that the rows of the data frame are in the same order as the names of the counts columns ```{r} colData(counts) <- as(df[colnames(counts),], "DataFrame") ``` Re-do the heat map, but with additional information about replicate and treatment ```{r} m <- as.matrix((dist(t(assay(counts))))) dimnames(m) <- list(counts$Treatment, counts$Replicate) heatmap(m) ``` ## Differential expression Load the DESeq2 library ```{r} library(DESeq2) ``` Create a DESeqDataSet, specifying the formula of the experimental design under consideration ```{r} dds <- DESeqDataSet(counts, ~ Replicate + Treatment) ``` Run the analysis and summarize the results ```{r} dds <- DESeq(dds) res <- results(dds) ``` Explore the results with an MA plot ```{r} plotMA(dds) ``` and take a look at the genes with strongest log fold change ```{r} idx <- order(abs(res$log2FoldChange), decreasing=TRUE) res <- res[idx,] head(res) ``` ## Annotation Place the results in biological context by using the H. sapiens organism package to look up the gene names (rownames of the `res` object) and map them to more human-friendly "SYMBOL" and "GENENAME" identifiers. ```{r} library(org.Hs.eg.db) anno <- select(org.Hs.eg.db, rownames(res), c("SYMBOL", "GENENAME")) ``` Merge the annotation and statistical result ```{r} idx <- match(anno$ENTREZID, rownames(res)) res <- cbind(res, anno[idx,]) head(res) ``` Save the results to a csv file ```{r} write.table(as.data.frame(res), "/tmp/E-MTAB-1147-DESeq2-output.csv") ``` ## Packages used These results were created with the following packages ```{r} sessionInfo() ```