<!--
%\VignetteEngine{knitr::knitr}
%\VignetteIndexEntry{D0: C Interface}
-->
`r knitr::opts_chunk$set(tidy=FALSE)`

# Interfacing to C

Motivation and relevance

- Two primary reasons to write C code

  1. Access to existing libraries (e.g., samtools)

  2. Fast or memory efficient implementation of advanced algorithms
     (e.g., `findOverlaps`).

- Significant cost in terms of development time, opportunity for
  introducing errors, and incomplete handling of rich R data types
  including such basics as `NA` values.

**Exercise**: sum a numeric vector, in C

## .C

1. Create a file in your home directory `~/c_sum.c` containing the
   following lines of code
   ```
   void c_sum(const double *x, const int *length, double *answer)
   {
       answer[0] = 0;
       for (int i = 0; i < length[0]; ++i)
           answer[0] += x[i];
   }
   ```
   We'll pass the arguments `x`, `length`, and `answer` in to C from
   R; everything in R is a vector, so all values seen in C are
   pointers.
   
2. Compile this function to a 'dynamic library' by opening a shell,
   navigating to the directory where `c_sum.c` is located, and
   evaluating the command
   ```
   R CMD SHLIB c_sum.c
   ```
   This creates (if there are no syntax or other errors!) a file `c_sum.so`.

3. In R, write a short wrapper function to call `c_sum`
   ```{r c_sum}
   c_sum <- function(x) 
       .C("c_sum", as.numeric(x), length(x), answer=numeric(1))$answer
   ```
   The first argument to `.C` is the name of the C function we want to
   invoke. The remainder of the arguments are the values we pass to C;
   all are vectors in R, hence pointers in C. The type of the R
   argument must match the type of the C signature; there is no
   type-checking provided (e.g., it would be a mistake to pass the
   integer vector `x <- 1:10`, without first coercing to numeric). `x`
   and `length(x)` MUST BE treated as read-only in C; we've enforced
   this by declaring them as `const`, e.g., `const double *`. We'll
   modify the final argument, but to do so we must be confident that
   there are no other references to it in R, i.e., that it's `NAMED`
   level is 0.

   Arguments to `.C` can be named; names are not relevant at the C
   level, but are useful in interpretting the return value. The return
   value is a `list()` of the original arguments, perhaps updated
   during the call The paradigm adopted above names the argument that
   we wish to use in subsequent calculations, and selects that value
   from the returned list.

4. Use the function after loading the dynamic library
   ```{r c_sum-invoke, eval=FALSE}
   dyn.load("~/c_sum.so")
   c_sum(1:100)
   ```
   
5. Compare the time required for `c_sum()` to sum a large vector,
   e.g., 100M values, with the time required for base R's `sum()`. Any
   ideas on the difference? Break `c_sum()` in various ways.

6. (Advanced) Explore C headers provided with R
   ```{r c-headers}
   R.home("include")
   ```

## .Call

The `.C` interface is useful but quite restricted, e.g., we had to
pass in the value that we returned, rather than creating it in C.

1. Add the following line at the begining of `c_sum.c`
   ```
   #include <Rinternals.h>
   ```
   This includes the C header file that defines function prototypes
   and other information required to work with R. Explore the header
   file (at `R.home("include")`) at your leisure.

2. Add the following function after `c_sum`:
   ```
   SEXP call_sum(SEXP x)
   {
       SEXP answer = PROTECT(Rf_allocVector(REALSXP, 1));
       int len = Rf_length(x);
       c_sum(REAL(x), &len, REAL(answer));
       UNPROTECT(1);
       return answer;
   }
   ```
   This defines a function that takes an `SEXP` (S-expression, i.e.,
   any R object; an SEXP is a pointer to a C `struct`) as an argument,
   and returns an `SEXP`.
   
   The first line declares a new `SEXP`, allocates memory to hold
   information for a `REALSXP` of length 1 (the equivalent of
   `numeric(1)` in R), and `PROTECT`s the allocation from garbage
   collection.
   
   The second line creates an integer variable, and uses an accessor
   to reach in to `x` to discover its length.

   The third line uses accessors to reach in to the various `SEXP` to
   retrieve the underlying data; these are passed to our previously
   defined, pure C function `c_sum`.
   
   In the third line we prepare to hand the memory we've allocated
   back to R for memory management; we don't need to `PROTECT` it any
   more.
   
   The final line returns our allocated SEXP to R.
     
3. Compile as before

4. Write a short wrapper. Because we're able to determine the length
   of the object we've passed in, and because we can allocate and
   return memory from C, we just need to pass in the vector we're
   adding.
   ```{r call_sum}
   call_sum <- function(x) .Call("call_sum", as.numeric(x))
   ```

5. Load the DLL and invoke `call_sum` as before. It may be necessary
   to start an new R session to replace the previously loaded shared
   object.
   
6. Compare the output and performance of `sum()`, `c_sum()`, and
   `call_sum()`.

7. Move the vector coercion `as.numeric()` in R to `Rf_coerceVector(x,
   REALSXP)` in C. Do you need to `PROTECT` the result of the coerion?
   Any thoughts on whether it's better to coerce in R, or in C?

## Rcpp

`.Call()` is flexible, but exposes the details of R's internal
representation and memory management to the programmer. The `Rcpp`
package hides this complexity behind C++ classes, trading off
specialized understanding of R's internals for more general
understanding of C++.

1. Start a new file "rcpp_sum.cpp" with the following
   ```
   #include <Rcpp.h>
   using namespace Rcpp;

   // [[Rcpp::export]]
   NumericVector rcpp_sum(NumericVector x)
   {
       NumericVector ans(1);
       for (int i = 0; i < x.size(); ++i)
           ans[0] += x[i];
       return ans;
   }
   ```
   The file starts by including the headers for Rcpp. `using namespace
   Rcpp` is a C++ directive to manage where symbols (e.g.,
   `NumericVector`) will be discovered. `// [[Rcpp::export]]` is an
   Rcpp attribute that indicates that the following function should be
   made available in R (not all functions need to be exposed to the R
   user).
   
   `NumericVector` is a C++ class representing R's `numeric()` type,
   so the signature to `rcpp_sum()` says that it expects a single
   `numeric()` vector argument, and will return a `numeric()` vector
   to R. Rcpp is wired so that type coercion and memory management are
   automatic -- if the user invokes `rcpp_sum()` with an integer
   vector (Rcpp type `IntegerVector`), it will be coerced to type
   `NumericVector`. `NumericVector` behaves like standard C++
   containers, so has a `size()` method and array subset operations
   `[]` that behave in predictable ways.
   
   Rcpp implements `NumericVector` and friends _on top of_ R's
   interface, but hides from us the need to explicitly allocate and
   manage memory.

2. It's possible to compile this as before, but let's instead do this
   from within R
   ```{r rccp-compile, eval=FALSE}
   library(Rcpp)
   sourceCpp("~/rcpp_sum.cpp")
   ```
   This creates an object in R, `rccp_sum()`, that can be invoked
   immediately.
   ```{r eval=FALSE}
   rcpp_sum(1:10)
   ```

3. Compare the output and performance of `rcpp_sum()` with `sum()` and
   the other functions we've written. What issues have been addressed
   by Rcpp? What problems remain?