`r knitr::opts_chunk$set(tidy=FALSE)`

# The R language

Motivation and relevance

- Large data requires computationally efficient approaches to analysis.
- Understanding the way R *works* provides some insight into how we can write efficient code, expanding the range of problems that we can reasonably tackle in R. It is not unusual for efficient R code to be just as correct as naive code but 100x faster.
- Parallel evaluation allows us to scale efficient code to exploit computational resources both on our personal computers and in high performance environments, further expanding the scale of analysis reasonably undertaken in R.

## 'Atomic' vectors

- `logical()`, `integer()`, `numeric()`, `complex()`, `character()`, `raw()`, `list()`

## `data.frame()`, `matrix()`

Attributes

`str()` (including 2nd argument) and `dput()`

**Exercise**: what's a `factor()`?

## `environment()`

- Pass-by-value versus pass-by-reference
    - Utility? Why would pass-by-value be useful?
    - Confusion? How would pass-by-value confuse the user?
- Each environment has a parent. By default, a symbol that `get()` does not find in an environment is searched for in the parent, then in the parent's parent, and so on.

**Exercise**: finding variables

- How does R know the value of `pi`?
- Why does this question appear under an exercise about environments?
- If you are the [Indiana state legislature in 1897][PiBill] and define `pi <- 3.2`, where does your symbol `pi` reside?
- Is there still a `pi` with a more precise value?
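The parent-chain lookup described above can be illustrated with a minimal sketch. The names `parent`, `child`, and `answer` are hypothetical, chosen only for this illustration, and the snippet is assumed to be run at the top level of an R session.

```r
# Hypothetical environments: `child` has `parent` as its enclosing environment
parent <- new.env()
child <- new.env(parent = parent)

assign("answer", 42, envir = parent)

# `answer` is not defined in `child` itself...
exists("answer", envir = child, inherits = FALSE)   # FALSE
# ...but get() continues the search in the parent environment
get("answer", envir = child)                        # 42

# Defining the symbol in `child` masks, but does not remove, the parent's value
assign("answer", 0, envir = child)
get("answer", envir = child)                        # 0
get("answer", envir = parent)                       # 42

# parent.env() walks one step up the chain
identical(parent.env(child), parent)                # TRUE
```

The same masking logic is worth keeping in mind when working through the `pi` questions above.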
**Exercise**: NAMESPACE

- An R package contains symbols defined in the package namespace. The namespace is an environment.
- Use `getNamespace("IRanges")` to retrieve the IRanges package namespace
- Use `ls()` to list the content of the namespace
- Use `parent.env()` to recursively discover the search path for a symbol mentioned in the namespace

## `function`

Argument basics

- Named, unnamed, unspecified arguments
- Default values
- Argument matching -- by exact name, then partial name, then position

Function environments

- Scope and symbol resolution
- What about `<<-` ?

**Exercise**: bank account: explain...

```{r bank-account}
account <- function(initial=0) {
    available <- initial
    list(deposit=function(amount) {
        available <<- available + amount
        available
    }, balance=function() {
        available
    })
}

my_acct <- account()
my_acct$deposit(100)
your_acct <- account(20)
my_acct$deposit(200)
my_acct$balance()
your_acct$balance()
```

- Implement, if necessary, `my_acct$withdraw`.
- Implement, if you're quick, `my_bank` to manage a number of accounts.

## Primitives and essential 'classes' and 'methods'

- conceptual class hierarchy, with functions and C `SEXP` type (mentioned further below)
vector
   o length, [, [<-, [[, [[<-, names, names<-, class, class<-, ...
-- raw()                   RAWSXP
-- logical()               LGLSXP
-- numeric()               REALSXP
   -- integer()            INTSXP
-- complex()               CPLXSXP
-- character()             STRSXP
-- list()                  VECSXP
   -- data.frame()
   -- ... many S3 objects
-- structure()
   -- array()
      -- matrix()
-- expression()            EXPRSXP
environment (new.env())    ENVSXP
   o ls
   o [[, [[<-
closure (e.g., function)   CLOSXP
S4 class                   S4SXP
...
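
One way to explore the hierarchy above from within R is `typeof()`, which reports the internal type underlying an object. The following is a minimal sketch; the list `objs` and its element names are hypothetical and exist only for this illustration.

```r
# A hypothetical collection of objects, one per branch of the hierarchy
objs <- list(
    integer = 1:5,
    numeric = pi,
    character = "a",
    list = list(),
    data.frame = data.frame(),     # a list with attributes
    matrix = matrix(1:4, 2),       # a vector with a 'dim' attribute
    environment = new.env(),
    closure = function() {}
)

# Internal type of each object; note that data.frame reports "list"
# and the integer matrix reports "integer"
vapply(objs, typeof, character(1))

# The attributes are what distinguish a matrix from a plain vector
attributes(objs$matrix)
```

Comparing `typeof()` with `class()` on the same objects is a useful follow-up, since `class()` reflects the attributes layered on top of the underlying type.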
# Deeper...

## Copy-on-write

Some tools

- `.Internal(inspect())`
- `tracemem()`

**Exercise**: explain...

```{r internals, eval=FALSE}
x <- 1:5; tracemem(x)
x[1] <- 2L
x[1] <- 2

x <- y <- seq(1, 5); tracemem(x)
x[1] <- 2L

df <- data.frame(x=1:5, y=5:1)
tracemem(df); tracemem(df$x)
df[1,1] <- 2

m <- matrix(1:10, 2); tracemem(m)
m[1, 1] <- 2L

f <- function(x) x[1]
g <- function(x) { x[1] <- 2L; x }
tracemem(x <- 1:5); f(x)
tracemem(x <- 1:5); g(x)
```

## Data representation

- Something like `x <- 1:5` associates the _symbol_ `x` with the value `1:5` in a particular environment (e.g., `.GlobalEnv`).
- `x` points to a location in memory, where there's a C `struct`.
- The C `struct` is an [S-expression][SEXP] (SEXP) atom that represents the data values (`1:5`) as well as information about the data (e.g., that they are integers, hence `INTSXP`). We can peek into the S-expression structure with

    ```{r sexp}
    x <- 1:5
    .Internal(inspect(x))
    ```

- Among other pieces of information, we see the memory address of the value bound to the symbol `x` (following the `@`), that this is an instance of type `INTSXP` (integer), and that it has length `len=5`.

**Exercise**: Use `.Internal(inspect())` to discover other common S-expression types, in addition to `INTSXP`. Some examples:

```{r sexp-types, eval=FALSE}
.Internal(inspect(pi))
.Internal(inspect(data.frame()))
.Internal(inspect(function() {}))
.Internal(inspect(expression(1 + 2)))
```

## Garbage collection

- Memory is allocated and reclaimed by R itself; there is no need for the user to explicitly remove objects with `rm()` or trigger garbage collection with `gc()` (many experienced R programmers _never_ use `rm()`).
Uses `NAMED` rather than reference counts (this describes R before version 4.0.0; later versions use reference counting)

- NAMED 0: memory that is not bound to a symbol

    ```{r named-0}
    .Internal(inspect(1:5))
    ```

- NAMED 1: memory that is or _has been_ bound to exactly one symbol

    ```{r named-1}
    .Internal(inspect(x <- 1:5))
    ```

- NAMED 2: memory that is or _has been_ bound to two or more symbols

    ```{r named-2}
    .Internal(inspect(y <- x <- 1:5))
    ```

Copy-on-write _illusion_

- Only duplicate when updating memory that is NAMED(2)

# Styles of programming

## Declarative vs. imperative

Example from [Rowe][]: 'clamp' data so that values are no greater than 5 standard deviations from the mean.

Data:

```{r styles-data}
set.seed(123)
x <- rnorm(10000000)
```

Find values:

Declarative

```{r declarative-find}
x[abs(x) > 5 * sd(x)]
```

Imperative

```{r imperative-find, eval=FALSE}
ans <- numeric()
for (xi in x)
    if (abs(xi) > 5 * sd(x))
        ans <- c(ans, xi)
```

Clamp:

Declarative

```{r declarative-clamp}
x[abs(x) > 5 * sd(x)] <- 5 * sd(x)
```

Imperative

```{r imperative-clamp, eval=FALSE}
for (i in seq_along(x))
    if (abs(x[i]) > 5 * sd(x))
        x[i] <- 5 * sd(x)
```

- (how could the imperative style be made more efficient?)

**Question**: What are the merits of declarative vs. imperative styles?

## Functional

- Ideal: return value entirely determined by arguments, no 'side effects'
- Very easy to reason about
- Easy to parallelize

**Exercise**: Few R functions are truly functional, but it's possible to recognize 'more' versus 'less' functional ways of writing R code. Consider the following:

```{r functional}
df <- data.frame(x=1:5, y=5:1)
x0 <- sapply(names(df), function(x) sqrt(df[[x]]))
x1 <- sapply(names(df), function(x, df) sqrt(df[[x]]), df)
x3 <- sapply(df, function(x, fun) fun(x), sqrt)
x2 <- sapply(df, sqrt)
```

- Are these expressions equivalent?
- Are some of these examples more functional than others?
- Is updating an object, e.g., `df$x <- 5:1`, consistent with functional programming?
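
As a small illustration of the functional ideal stated above (return value entirely determined by arguments), here is a minimal sketch contrasting a function that reads a global variable with one that receives everything it needs as arguments. The names `clamp_global`, `clamp_pure`, `threshold`, and `v` are hypothetical and not part of the original material.

```r
threshold <- 2

# Less functional: the result depends on `threshold`, which is looked up
# in the enclosing (here, global) environment rather than passed in
clamp_global <- function(x) pmin(x, threshold)

# More functional: the result is entirely determined by the arguments
clamp_pure <- function(x, threshold) pmin(x, threshold)

v <- c(1, 2, 3)
clamp_global(v)        # 1 2 2
clamp_pure(v, 2)       # 1 2 2

# A change elsewhere in the session silently changes clamp_global()...
threshold <- 1
clamp_global(v)        # 1 1 1
# ...but not clamp_pure(), whose inputs are all explicit
clamp_pure(v, 2)       # 1 2 2
```

Because `clamp_pure()` needs nothing beyond its arguments, it could be handed to a parallel worker without also shipping the state of the calling session.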
## Object-oriented

- Association of _methods_ with object _classes_
- In R: methods associated with _generic_ functions
    - Different from main-stream languages (Java, C++)
    - Preserves illusion of copy-on-write
    - But not different from many other languages
- Much detail this afternoon

[SEXP]: http://en.wikipedia.org/wiki/S-expression
[Rowe]: http://cartesianfaith.files.wordpress.com/2013/09/rowe-modeling-data-with-functional-programming.pdf
[PiBill]: http://en.wikipedia.org/wiki/Indiana_Pi_Bill