---
title: "A.2 -- Data Input and Manipulation"
author: "Martin Morgan <Martin.Morgan@RoswellPark.org>"
date: "8 - 9 May 2017"
output:
  BiocStyle::html_document2:
    toc: true
    toc_depth: 2
vignette: >
  % \VignetteIndexEntry{A.2 -- Data Input and Manipulation}
  % \VignetteEngine{knitr::rmarkdown}
---

```{r style, echo = FALSE, results = 'asis'}
knitr::opts_chunk$set(
    eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
    cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")))
```

# Exercise 1: BRFSS Survey Data

We will explore a subset of data collected by the CDC through its
extensive Behavioral Risk Factor Surveillance System ([BRFSS][])
telephone survey. Check out the link for more information. We'll look
at a subset of the data.

1. Use `file.choose()` to find the path to the file 'BRFSS-subset.csv'

    ```{r file.choose, eval=FALSE}
    path <- file.choose()
    ```

<!--
    ```{r file.path, echo=FALSE}
    path <- file.path("extdata", "BRFSS-subset.csv")
    ```
-->

2. Input the data using `read.csv()`, assigning to a variable `brfss`

    ```{r read.csv}
    brfss <- read.csv(path)
    ```

3. Use command like `class()`, `head()`, `dim()`, `summary()` to
   explore the data.

    - What variables have been measured?

    - Can you guess at the units used for, e.g., Weight and Height?
    
    ```
    class(brfss)
    head(brfss)
    dim(brfss)
    summary(brfss)
    ```

4. Use the `$` operator to extract the 'Sex' column, and summarize the
   number of males and females in the survey using `table()`. Do the
   same for 'Year', and for both `Sex` and `Year`

    ```{r brfss-sex}
    table(brfss$Sex)
    table(brfss$Year)
    table(brfss$Sex, brfss$Year)
    with(brfss, table(Sex, Year))                # same, but easier
    ```

5. Use `aggregate()` to summarize the mean weight of each group. What
   about the median weight of each group? What about the _number_ of
   observations in each group?

    ```{r brfss-aggregate}
    with(brfss, aggregate(Weight, list(Year, Sex), mean, na.rm=TRUE))
    with(brfss, aggregate(Weight, list(Year=Year, Sex=Sex), mean, na.rm=TRUE))
    ```

6. Use a `formula` and the `aggregate()` function to describe the
   relationship between Year, Sex, and Weight

    ```{r brfss-aggregate-formula}
    aggregate(Weight ~ Year + Sex, brfss, mean)  # same, but more informative
    aggregate(. ~ Year + Sex, brfss, mean)       # all variables
    ```

7. Create a subset of the data consisting of only the 1990
   observations. Perform a t-test comparing the weight of males and
   females ("'Weight' as a function of 'Sex'", `Weight ~ Sex`)

    ```{r t-test-1990}
    brfss_1990 = brfss[brfss$Year == 1990,]
    t.test(Weight ~ Sex, brfss_1990)
    t.test(Weight ~ Sex, brfss, subset = Year == 1990)
    ```
   What about differences between weights of males (or females) in
   1990 versus 2010? Check out the help page `?t.test.formula`. Is
   there a way of performing a t-test on `brfss` without explicitly
   creating the object `brfss_1990`?

8. Use `boxplot()` to plot the weights of the Male individuals.
   Can you transform weight, e.g., `sqrt(Weight) ~ Year`?
   Interpret the results. Do similar boxplots for the t-tests of the
   previous question.

    ```{r brfss-boxplot, fig.width=5, fig.height=5}
    boxplot(Weight ~ Year, brfss, subset = Sex == "Male",
            main="Males")
    ```

9. Use `hist()` to plot a histogram of weights of the 1990 Female individuals.

    ```{r brfss-hist, fig.width=5, fig.height=5}
    hist(brfss_1990[brfss_1990$Sex == "Female", "Weight"],
         main="Females, 1990", xlab="Weight" )
    ```

[BRFSS]: http://www.cdc.gov/brfss/about/index.htm

# Exercise 2: ALL Phenotypic Data

This data comes from an (old) Acute Lymphoid Leukemia microarray data
set.

Choose the file that contains ALL (acute lymphoblastic leukemia)
patient information and input the date using `read.csv()`; for
`read.csv()`, use `row.names=1` to indicate that the first column
contains row names.

```{r ALL-choose, eval=FALSE}
path <- file.choose()    # look for ALL-phenoData.csv
```

<!--
```{r echo=FALSE}
path <- file.path("extdata", "ALL-phenoData.csv")
```
-->

```{r ALL-input}
stopifnot(file.exists(path))
pdata <- read.csv(path, row.names=1)
```

Check out the help page `?read.delim` for input options. The exercises
use `?read.csv`; Can you guess why? Explore basic properties of the
object you've created, for instance...

```{r ALL-properties}
class(pdata)
colnames(pdata)
dim(pdata)
head(pdata)
summary(pdata$sex)
summary(pdata$cyto.normal)
```

Remind yourselves about various ways to subset and access columns of a
data.frame

```{r ALL-subset}
pdata[1:5, 3:4]
pdata[1:5, ]
head(pdata[, 3:5])
tail(pdata[, 3:5], 3)
head(pdata$age)
head(pdata$sex)
head(pdata[pdata$age > 21,])
```

It seems from below that there are 17 females over 40 in the data
set. However, some individuals have `NA` for the age and / or sex, and
these `NA` values propagate through some computations. Use `table()`
to summarize the number of females over 40, and the number of samples
for which this classification cannot be determined. When _R_
encounters an `NA` value in a subscript index, it introduces an `NA`
into the result. Observe this (rows of `NA` values introduced into the
result) when subsetting using `[` versus using the `subset()`
function.

```{r ALL-subset-NA}
idx <- pdata$sex == "F" & pdata$age > 40
table(idx, useNA="ifany")
dim(pdata[idx,])           # WARNING: 'NA' rows introduced
tail(pdata[idx,])
dim(subset(pdata, idx))    # BETTER: no NA rows
dim(subset(pdata, (sex == "F") & (age > 40)))  # alternative
tail(subset(pdata,idx))

## robust `[`: exclude NA values
dim(pdata[idx & !is.na(idx),])
```

Use the `mol.biol` column to subset the data to contain just
individuals with 'BCR/ABL' or 'NEG', e.g.,

```{r ALL-BCR/ABL-subset}
bcrabl <- subset(pdata, mol.biol %in% c("BCR/ABL", "NEG"))
```

The `mol.biol` column is a factor, and retains all levels even after
subsetting. It is sometimes convenient to retain factor levels, but in
our case we use `droplevels()` to removed unused levels

```{r ALL-BCR/ABL-drop-unused}
bcrabl$mol.biol <- droplevels(bcrabl$mol.biol)
```

The `BT` column is a factor describing B- and T-cell subtypes

```{r ALL-BT}
levels(bcrabl$BT)
```

How might one collapse B1, B2, ... to a single type B, and likewise
for T1, T2, ..., so there are only two subtypes, B and T? One strategy
is to replace two-letter level (e.g., `B1`) with the single-letter
level (e.g., `B`). Do this using `substring()` to select the first
letter of level, and update the previous levels with the new value
using `levels<-`.

```{r ALL-BT-recode}
table(bcrabl$BT)
levels(bcrabl$BT) <- substring(levels(bcrabl$BT), 1, 1)
table(bcrabl$BT)
```

Use `aggregate()` to count the number of samples with B- and T-cell
types in each of the BCR/ABL and NEG groups

```{r ALL-BCR/ABL-BT}
aggregate(rownames(bcrabl) ~ BT + mol.biol, bcrabl, length)
```

Use `aggregate()` to calculate the average age of males and females in
the BCR/ABL and NEG treatment groups.

```{r ALL-aggregate}
aggregate(age ~ mol.biol + sex, bcrabl, mean)
```

Use `t.test()` to compare the age of individuals in the BCR/ABL versus
NEG groups; visualize the results using `boxplot()`. In both cases,
use the `formula` interface. Consult the help page `?t.test` and re-do
the test assuming that variance of ages in the two groups is
identical. What parts of the test output change?

```{r ALL-age}
t.test(age ~ mol.biol, bcrabl)
boxplot(age ~ mol.biol, bcrabl)
```