%\VignetteEngine{knitr::knitr} %\VignetteIndexEntry{3. Common Workflows: RNA-seq} # Common Workflows, with an RNA-seq example useR! 2014
Author: Martin Morgan (mtmorgan@fhcrc.org), Sonali Arora
Date: 30 June, 2014 ```{r setup, echo=FALSE} knitr::opts_chunk$set(cache=TRUE) ``` ## Overview RNASeq - Counting -- overlapping alignments - Known genes - Known transcripts - (Novel transcripts) ChIPSeq -- Chromatin immunopreciptation - (Peak calling) - Peak identification - Differential binding - Motif and other down-stream analysis Variants - (Calling) - Annotating - Effect prediction - GWAS - Integration Structural - Copy number - Rearrangements - ... ## In depth: RNASeq Work flow - Experimental design - Replication - Batch effects - Sequencing & alignment - Differential expression - Counting - Statistical evaluation - Annotation & down-stream analysis Statistical challenges - Designed experiments - Library size normalization - Counts (vs. RPKM) - Negative binomial distribution - Dispersion - Borrowing information -- shrinkage - Independent filtering ### RNASeq in parathyroid adenomas This work flow derives from a more comprehensive example developed in the [parathyroidSE][] package and the [CSAMA 2014](http://bioconductor.org/help/course-materials/2014/CSAMA2014/) workshop. It uses [DESeq2][] to estimate gene-level differential expression. Analysis picks up after counting reads aligning to Ensembl gene regions; counting can be accomplished using `summarizeOverlaps()`, as described in the [parathyroidSE][] vignette. The data is from [Haglund et al](http://www.ncbi.nlm.nih.gov/pubmed/23024189)., Evidence of a Functional Estrogen Receptor in Parathyroid Adenomas, J Clin Endocrin Metab, Sep 2012. The experiment investigated the role of the estrogen receptor in parathyroid tumors. The investigators derived primary cultures of parathyroid adenoma cells from 4 patients. These primary cultures were treated with diarylpropionitrile (DPN), an estrogen receptor β agonist, or with 4-hydroxytamoxifen (OHT). RNA was extracted at 24 hours and 48 hours from cultures under treatment and control. Start by loading the count data, which has been carefully curatd as a `SummarizedExperiment` object. Explore the object with accessors such as `rowData(se)`, `colData(se)`, `exptData(se)$MIAME`, and `head(assay(se))` ```{r deseq} require("DESeq2") require("parathyroidSE") data("parathyroidGenesSE") se <- parathyroidGenesSE colnames(se) <- se$run ``` The first step is to add an experimental design and perform additional preparatory steps. The experiment contains several technical replicates, and while these could be incorporated in the model, here we simply pool them (pooling technical replicates is often appropriate for RNASeq data). ```{r deseq-prep} ddsFull <- DESeqDataSet(se, design = ~ patient + treatment) ddsCollapsed <- collapseReplicates(ddsFull, groupby = ddsFull$sample, run = ddsFull$run) dds <- ddsCollapsed[, ddsCollapsed$time == "48h"] dds$time <- droplevels(dds$time) dds$treatment <- relevel(dds$treatment, "Control") ``` The entire work flow is executed with a call to the `DESeq()` function. This includes library size factor estimation, dispersion estimates, model fitting, independent filtering, and formulating test statistics based on the specified design. Here we generate the results for a comparison between DPN and control. ```{r deseq-analysis} dds <- DESeq(dds) res <- results(dds, contrast = c("treatment", "DPN", "Control")) ``` The remaining steps provide some basic checks and visualizations of the data. We identify genes for which the adjusted (for multiple comparison) _p_ value is less than 0.1, and order these by their log2 fold change. ```{r deseq-interpret} resSig <- res[which(res$padj < 0.1 ),] head(resSig[order(resSig$log2FoldChange),]) ``` An 'MA' plot represents each gene as a point, with average expression over all samples on the x-axis, and log2 fold change between treatment groups on the y-axis; highlighted values are genes with adjusted _p_ values less than 0.1. ```{r deseq-MA} plotMA(res, ylim = c(-1, 1)) ``` The strategy taken by [DESeq2][] for dispersion estimation is summarized by the `plotDispEsts()` function. It shows (a) black per-gene dispersion estimates, (b) a red trend line representing the global relationship between dispersion and normalized count, (c) blue 'shrunken' values moderating individual dispersion estimates by the global relationship, and (d) blue-circled dispersion outliers with high gene-wise dispersion that were not adjusted. ```{r deseq-dispest} plotDispEsts(dds, ylim = c(1e-6, 1e1)) ``` A final diagnostic is a plot of the histogram of _p_ values, which should be uniform under the null hypothesis; skew to the right may indicate batch or other effects not accommodated in the model ```{r deseq-hist} hist(res$pvalue, breaks=40, col="grey") ``` [DESeq2]: http://bioconductor.org/packages/release/bioc/html/DESeq2.html [parathyroidSE]: http://bioconductor.org/packages/release/data/experiment/html/parathyroidSE.html