--- title: "A.3 -- Statistics" author: Martin Morgan
Lori Shepherd
date: "July 26-28, 2017" output: BiocStyle::html_document2: toc: true toc_depth: 2 vignette: > % \VignetteIndexEntry{A.3 -- Statistics} % \VignetteEngine{knitr::rmarkdown} --- ```{r style, echo = FALSE, results = 'asis'} knitr::opts_chunk$set( eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")), cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE"))) ``` # Exploration and simple univariate measures ```{r echo=FALSE} path <- system.file(package="BiocIntro", "extdata", "BRFSS-subset.csv") ``` ```{r ALL-choose, eval=FALSE} path <- file.choose() # look for BRFSS-subset.csv ``` ```{r ALL-input} stopifnot(file.exists(path)) brfss <- read.csv(path) ``` ## Descriptive Statistics Let's look a little more closely at patient information in the `brfss` object: ```{r} names(brfss) median(brfss$Age) ``` The value `NA` shows up because some of the `brfss$age` values are NA. We can verify this by asking _R_ if there are any NA values ```{r brfss-anyNA} any(is.na(brfss$Age)) anyNA(brfss$Age) # same, but more efficient ``` Consulting the help page for `?median` suggests a solution -- specify the argument `na.rm=TRUE`. Explore other aspects of age, like `range()` and `quantile()`. ```{r} median(brfss$Age, na.rm=TRUE) ``` Some simple plots of patient ages -- note the nested functions! ```{r, eval=FALSE} plot(brfss$age) plot(sort(brfss$age)) sortedAge = sort(brfss$age) ?plot ``` - **Exercise**: Plot the `sortedAge` with markers at each data point and connect the points with red lines. You'll need to use the graphics parameters `type="b"` and `col="red"`, see `?plot.default`. - **Exercise**: Plot one variable (e.g., `age`) as a function of another (e.g., `sex`). Since `sex` is a factor, _R_ chooses to create a box plot; does this make sense? ## Clean data _R_ read `Year` as an integer value, but it's really a `factor` ```{r} summary(brfss) brfss$Year <- factor(brfss$Year) ``` ## Weight in 1990 vs. 2010 Females Create a subset of the data ```{r} brfssFemale <- brfss[brfss$Sex == "Female",] summary(brfssFemale) ``` Visualize ```{r} plot(Weight ~ Year, brfssFemale) ``` Statistical test ```{r} t.test(Weight ~ Year, brfssFemale) ``` ## Weight and height in 2010 Males Create a subset of the data ```{r} brfss2010Male <- subset(brfss, Year == 2010 & Sex == "Male") summary(brfss2010Male) ``` Visualize the relationship ```{r} hist(brfss2010Male$Weight) hist(brfss2010Male$Height) plot(Weight ~ Height, brfss2010Male) ``` Fit a linear model (regression) ```{r} fit <- lm(Weight ~ Height, brfss2010Male) fit ``` Summarize as ANOVA table ```{r} anova(fit) ``` Plot points, superpose fitted regression line; where am I? ```{r} plot(Weight ~ Height, brfss2010Male) abline(fit, col="blue", lwd=2) points(180, 88, col="red", cex=4, pch=20) ``` Class and available 'methods' ```{r, eval=FALSE} class(fit) # 'noun' methods(class=class(fit)) # 'verb' ``` Diagnostics ```{r, eval=FALSE} plot(fit) ?plot.lm ``` # Multivariate analysis This is a classic microarray experiment. Microarrays consist of 'probesets' that interogate genes for their level of expression. In the experiment we're looking at, there are 12625 probesets measured on each of the 128 samples. The raw expression levels estimated by microarray assays require considerable pre-processing, the data we'll work with has been pre-processed. ## Input and setup Start by finding the expression data file on disk. ```{r echo=FALSE} path <- system.file(package="BiocIntro", "extdata", "ALL-expression.csv") ``` ```{r ALL-choose-again, eval=FALSE} path <- file.choose() # look for ALL-expression.csv stopifnot(file.exists(path)) ``` The data is stored in 'comma-separate value' format, with each probeset occupying a line, and the expression value for each sample in that probeset separated by a comma. Input the data using `read.csv()`. There are three challenges: 1. The row names are present in the first column of the data. Tell _R_ this by adding the argument `row.names=1` to `read.csv()`. 2. By default, _R_ checks that column names do not look like numbers, but our column names _do_ look like numbers. Use the argument `check.colnames=FALSE` to over-ride _R_'s default. 3. `read.csv()` returns a `data.frame`. We could use a `data.frame` to work with our data, but really it is a `matrix()` -- the columns are of the same type and measure the same thing. Use `as.matrix()` to coerce the `data.frame` we input to a `matrix`. ```{r ALL-input-exprs} exprs <- read.csv(path, row.names=1, check.names=FALSE) exprs <- as.matrix(exprs) class(exprs) dim(exprs) exprs[1:6, 1:10] range(exprs) ``` We'll make use of the data describing the samples ```{r echo=FALSE} path <- system.file(package="BiocIntro", "extdata", "ALL-phenoData.csv") ``` ```{r ALL-phenoData.csv-clustering-student, eval=FALSE} path <- file.choose() # look for ALL-phenoData.csv stopifnot(file.exists(path)) ``` ```{r} pdata <- read.csv(path, row.names=1) class(pdata) dim(pdata) head(pdata) ``` Some of the results below involve plots, and it's convenient to choose pretty and functional colors. We use the [RColorBrewer][] package; see [colorbrewer.org][] [RColorBrewer]: https://cran.r-project.org/?package=RColorBrewer [colorbrewer.org]: http://colorbrewer.org ```{r colors} library(RColorBrewer) ## not available? install package via RStudio highlight <- brewer.pal(3, "Set2")[1:2] ``` `highlight' is a vector of length 2, light and dark green. For more options see `?RColorBrewer` and to view the predefined palettes `display.brewer.all()` ## Cleaning We'll add a column to `pdata`, derived from the `BT` column, to indicate whether the sample is B-cell or T-cell ALL. ```{r ALL-BorT} pdata$BorT <- factor(substr(pdata$BT, 1, 1)) ``` Microarray expression data is usually represented as a matrix of genes as rows and samples as columns. Statisticians usually think of their data as samples as rows, features as columns. So we'll transpose the expression values ```{r} exprs <- t(exprs) ``` Confirm that the `pdata` rows correspond to the `exprs` rows. ```{r} stopifnot(identical(rownames(pdata), rownames(exprs))) ``` ## Unsupervised machine learning -- multi-dimensional scaling Reduce high-dimensional data to lower dimension for visualization. Calculate distance between _samples_ (requires that the expression matrix be transposed). ```{r} d <- dist(exprs) ``` Use the `cmdscale()` function to summarize the distance matrix into two points in two dimensions. ```{r} cmd <- cmdscale(d) ``` Visualize the result, coloring points by B- or T-cell status ```{r} plot(cmd, col=highlight[pdata$BorT]) ```