% \VignetteEngine{knitr::knitr} % \VignetteIndexEntry{02. Working with R} \documentclass{article} \usepackage{Exercise} <>= BiocStyle::latex() @ \title{Introduction to \R{}} \author{Martin T.\ Morgan\footnote{\url{mtmorgan@fhcrc.org}}} \date{27-28 February 2014} \begin{document} \maketitle \section{\R{}} \begin{Exercise} This exercise uses data describing 128 microarray samples as a basis for exploring \R{} functions. Covariates such as age, sex, type, stage of the disease, etc., are in a data file \Robject{pData.csv}. The following command creates a variable \Robject{pdataFiles} that is the location of a comma-separated value (`csv') file to be used in the exercise. A csv file can be created using, e.g., `Save as...' in spreadsheet software. <>= pdataFile <- system.file(package="BiocIntro", "extdata", "pData.csv") @ <>= pdataFile <- "~/extdata/pData.csv" @ %% Make sure the file exists (you've specified the correct location) before continuing! <>= stopifnot(file.exists(pdataFile)) @ Input the csv file using \Rcode{read.table}, assigning the input to a variable \Robject{pdata}. Use \Rfunction{dim} to find out the dimensions (number of rows, number of columns) in the object. Are there 128 rows? Use \Rfunction{names} or \Rfunction{colnames} to list the names of the columns of \Robject{pdata}. Use \Rfunction{summary} to summarize each column of the data. What are the data types of each column in the data frame? A data frame is a list of equal length vectors. Select the `sex' column of the data frame using \Rfunction{[[} or \Rfunction{\$}. Pause to explain to your neighbor why this sub-setting works. Since a data frame is a list, use \Rfunction{sapply} to ask about the class of each column in the data frame. Explain to your neighbor what this produces, and why. Use \Rfunction{table} to summarize the number of males and females in the sample. Consult the help page \Rcode{?table} to figure out additional arguments required to include \Rcode{NA} values in the tabulation. The \Rcode{mol.biol} column summarizes molecular biological attributes of each sample. Use \Rfunction{table} to summarize the different molecular biology levels in the sample. Use \Rfunction{\%in\%} to create a logical vector of the samples that are either \Rcode{BCR/ABL} or \Rcode{NEG}. Subset the original phenotypic data to contain those samples that are \Rcode{BCR/ABL} or \Rcode{NEG}. After sub-setting, what are the levels of the \Rcode{mol.biol} column? Set the levels to be \Rcode{BCR/ABL} and \Rcode{NEG}, i.e., the levels in the subset. One would like covariates to be similar across groups of interest. Use \Rfunction{t.test} to assess whether \Rcode{BCR/ABL} and \Rcode{NEG} have individuals with similar age. To do this, use a \Robject{formula} that describes the response \Rcode{age} in terms of the predictor \Rcode{mol.biol}. If age is not independent of molecular biology, what complications might this introduce into subsequent analysis? Use the \Rfunction{boxplot} function to visualize the relationship between \Rcode{age} and \Rcode{mol.biol}. \end{Exercise} \begin{Solution} Here we input the data and explore basic properties. <>= pdata <- read.table(pdataFile) dim(pdata) names(pdata) summary(pdata) @ A data frame can be subset as if it were a matrix, or a list of column vectors. <>= head(pdata[,"sex"], 3) head(pdata$sex, 3) head(pdata[["sex"]], 3) sapply(pdata, class) @ The number of males and females, including \Rcode{NA}, is <>= table(pdata$sex, useNA="ifany") @ An alternative version of this uses the \Rfunction{with} function: \Rcode{with(pdata, table(sex, useNA="ifany"))}. The \Rcode{mol.biol} column contains the following samples: <>= with(pdata, table(mol.biol, useNA="ifany")) @ A logical vector indicating that the corresponding row is either \Rcode{BCR/ABL} or \Rcode{NEG} is constructed as <>= ridx <- pdata$mol.biol %in% c("BCR/ABL", "NEG") @ We can get a sense of the number of rows selected via \Rfunction{table} or \Rfunction{sum} (discuss with your neighbor what \Rfunction{sum} does, and why the answer is the same as the number of \Rcode{TRUE} values in the result of the \Rfunction{table} function). <>= table(ridx) sum(ridx) @ The original data frame can be subset to contain only \Rcode{BCR/ABL} or \Rcode{NEG} samples using the logical vector \Robject{ridx} that we created. <>= pdata1 <- pdata[ridx,] @ The levels of each factor reflect the levels in the original object, rather than the levels in the subset object, e.g., <>= levels(pdata1$mol.biol) @ These can be re-coded by updating the new data frame to contain a factor with the desired levels. <>= pdata1$mol.biol <- factor(pdata1$mol.biol) table(pdata1$mol.biol) @ To ask whether age differs between molecular biology types, we use a formula \Rcode{age \~{} mol.biol} to describe the relationship (`age as a function of molecular biology') that we wish to test <>= with(pdata1, t.test(age ~ mol.biol)) @ This summary can be visualize with, e.g., the \Rfunction{boxplot} function <>= ## not evaluated boxplot(age ~ mol.biol, pdata1) @ Molecular biology seem to be strongly associated with age; individuals in the \Rcode{NEG} group are considerably younger than those in the \Rcode{BCR/ABL} group. We might wish to include age as a covariate in any subsequent analysis seeking to relate molecular biology to gene expression. \end{Solution} \end{document}