`r knitr::opts_chunk$set(tidy=FALSE)` # Interfacing to C Motivation and relevance - Two primary reasons to write C code 1. Access to existing libraries (e.g., samtools) 2. Fast or memory efficient implementation of advanced algorithms (e.g., `findOverlaps`). - Significant cost in terms of development time, opportunity for introducing errors, and incomplete handling of rich R data types including such basics as `NA` values. **Exercise**: sum a numeric vector, in C ## .C 1. Create a file in your home directory `~/c_sum.c` containing the following lines of code ``` void c_sum(const double *x, const int *length, double *answer) { answer[0] = 0; for (int i = 0; i < length[0]; ++i) answer[0] += x[i]; } ``` We'll pass the arguments `x`, `length`, and `answer` in to C from R; everything in R is a vector, so all values seen in C are pointers. 2. Compile this function to a 'dynamic library' by opening a shell, navigating to the directory where `c_sum.c` is located, and evaluating the command ``` R CMD SHLIB c_sum.c ``` This creates (if there are no syntax or other errors!) a file `c_sum.so`. 3. In R, write a short wrapper function to call `c_sum` ```{r c_sum} c_sum <- function(x) .C("c_sum", as.numeric(x), length(x), answer=numeric(1))$answer ``` The first argument to `.C` is the name of the C function we want to invoke. The remainder of the arguments are the values we pass to C; all are vectors in R, hence pointers in C. The type of the R argument must match the type of the C signature; there is no type-checking provided (e.g., it would be a mistake to pass the integer vector `x <- 1:10`, without first coercing to numeric). `x` and `length(x)` MUST BE treated as read-only in C; we've enforced this by declaring them as `const`, e.g., `const double *`. We'll modify the final argument, but to do so we must be confident that there are no other references to it in R, i.e., that it's `NAMED` level is 0. Arguments to `.C` can be named; names are not relevant at the C level, but are useful in interpretting the return value. The return value is a `list()` of the original arguments, perhaps updated during the call The paradigm adopted above names the argument that we wish to use in subsequent calculations, and selects that value from the returned list. 4. Use the function after loading the dynamic library ```{r c_sum-invoke, eval=FALSE} dyn.load("~/c_sum.so") c_sum(1:100) ``` 5. Compare the time required for `c_sum()` to sum a large vector, e.g., 100M values, with the time required for base R's `sum()`. Any ideas on the difference? Break `c_sum()` in various ways. 6. (Advanced) Explore C headers provided with R ```{r c-headers} R.home("include") ``` ## .Call The `.C` interface is useful but quite restricted, e.g., we had to pass in the value that we returned, rather than creating it in C. 1. Add the following line at the begining of `c_sum.c` ``` #include ``` This includes the C header file that defines function prototypes and other information required to work with R. Explore the header file (at `R.home("include")`) at your leisure. 2. Add the following function after `c_sum`: ``` SEXP call_sum(SEXP x) { SEXP answer = PROTECT(Rf_allocVector(REALSXP, 1)); int len = Rf_length(x); c_sum(REAL(x), &len, REAL(answer)); UNPROTECT(1); return answer; } ``` This defines a function that takes an `SEXP` (S-expression, i.e., any R object; an SEXP is a pointer to a C `struct`) as an argument, and returns an `SEXP`. The first line declares a new `SEXP`, allocates memory to hold information for a `REALSXP` of length 1 (the equivalent of `numeric(1)` in R), and `PROTECT`s the allocation from garbage collection. The second line creates an integer variable, and uses an accessor to reach in to `x` to discover its length. The third line uses accessors to reach in to the various `SEXP` to retrieve the underlying data; these are passed to our previously defined, pure C function `c_sum`. In the third line we prepare to hand the memory we've allocated back to R for memory management; we don't need to `PROTECT` it any more. The final line returns our allocated SEXP to R. 3. Compile as before 4. Write a short wrapper. Because we're able to determine the length of the object we've passed in, and because we can allocate and return memory from C, we just need to pass in the vector we're adding. ```{r call_sum} call_sum <- function(x) .Call("call_sum", as.numeric(x)) ``` 5. Load the DLL and invoke `call_sum` as before. It may be necessary to start an new R session to replace the previously loaded shared object. 6. Compare the output and performance of `sum()`, `c_sum()`, and `call_sum()`. 7. Move the vector coercion `as.numeric()` in R to `Rf_coerceVector(x, REALSXP)` in C. Do you need to `PROTECT` the result of the coerion? Any thoughts on whether it's better to coerce in R, or in C? ## Rcpp `.Call()` is flexible, but exposes the details of R's internal representation and memory management to the programmer. The `Rcpp` package hides this complexity behind C++ classes, trading off specialized understanding of R's internals for more general understanding of C++. 1. Start a new file "rcpp_sum.cpp" with the following ``` #include using namespace Rcpp; // [[Rcpp::export]] NumericVector rcpp_sum(NumericVector x) { NumericVector ans(1); for (int i = 0; i < x.size(); ++i) ans[0] += x[i]; return ans; } ``` The file starts by including the headers for Rcpp. `using namespace Rcpp` is a C++ directive to manage where symbols (e.g., `NumericVector`) will be discovered. `// [[Rcpp::export]]` is an Rcpp attribute that indicates that the following function should be made available in R (not all functions need to be exposed to the R user). `NumericVector` is a C++ class representing R's `numeric()` type, so the signature to `rcpp_sum()` says that it expects a single `numeric()` vector argument, and will return a `numeric()` vector to R. Rcpp is wired so that type coercion and memory management are automatic -- if the user invokes `rcpp_sum()` with an integer vector (Rcpp type `IntegerVector`), it will be coerced to type `NumericVector`. `NumericVector` behaves like standard C++ containers, so has a `size()` method and array subset operations `[]` that behave in predictable ways. Rcpp implements `NumericVector` and friends _on top of_ R's interface, but hides from us the need to explicitly allocate and manage memory. 2. It's possible to compile this as before, but let's instead do this from within R ```{r rccp-compile, eval=FALSE} library(Rcpp) sourceCpp("~/rcpp_sum.cpp") ``` This creates an object in R, `rccp_sum()`, that can be invoked immediately. ```{r eval=FALSE} rcpp_sum(1:10) ``` 3. Compare the output and performance of `rcpp_sum()` with `sum()` and the other functions we've written. What issues have been addressed by Rcpp? What problems remain?