\documentclass{article}

\usepackage{hyperref}
\usepackage{theorem}
\usepackage{underscore}

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\classdef}[1]{%
  {\em #1}
}
\newcommand{\code}[1]{{\texttt{#1}}}
\newcommand{\file}[1]{{\texttt{#1}}}

\theoremstyle{break} \newtheorem{Ex}{Exercise}

\begin{document}

\title{Lab: Implementing a simple (but fast) Position Weight Matrix matching
       algorithm for long DNA sequences}

\author{Herv\'e Pag\`es \and Martin Morgan}
\date{\today}
\maketitle

<<setup00, echo=FALSE, results=hide>>=
options(width=72, continue=" ")
@ 


\section{Goals}

\begin{itemize}
\item Create a fully working package (\Rpackage{simpleMatchPWM}) around
      a simple PWM matching function (\Rfunction{matchPWM}) and some
      related functions (see below).
\item Write the core algorithms in C (for speed purpose) and use the
      \Rfunction{.Call} interface to call them from the R code.
\item Make our \Rfunction{matchPWM} function work on long DNA sequences
      like full chromosome sequences. The standard container for such sequences
      in Bioconductor is the \Robject{DNAString} container defined in the
      \Rpackage{Biostrings} package. This will require us to learn a little
      bit about the \Rpackage{Biostrings} internals and the use of the
      {\it Biostrings C interface} but an attempt has been made to keep
      this as simple as possible.
\end{itemize}


\section{Prerequisites}

\begin{itemize}
\item A laptop with a recent version of R 2.7 and all the tools required
      for package development (compilers, Perl, etc...)
\item The latest version of the \Rpackage{Biostrings} package from the 2.7
      series (the current development series). Note that this document
      reflects the state of affairs in \Rpackage{Biostrings} 2.7.46
      and is guaranteed to be accurate for this version only.
\item The \Rpackage{BSgenome.Dmelanogaster.FlyBase.r51} package.
\item Some basic knowledge of R package authoring.
\item Some C knowledge and previous experience in C programming.
\item You should have received 2 package source tarballs:
      \file{simpleMatchPWM.stub_0.0.2.tar.gz}
      and \file{simpleMatchPWM_0.99.1.tar.gz}.
      The former is an incomplete package (it doesn't pass \verb+R CMD check+):
      you will add the code that is missing in it.
      The latter is the fully working version of the former (it passes
      \verb+R CMD check+ without any errors or warnings).
      Note that these packages, like this document, reflect the state of
      affairs in \Rpackage{Biostrings} 2.7.46.
\end{itemize}


\section{Checking your installation}

\begin{Ex}
Extract the \file{simpleMatchPWM.stub_0.0.2.tar.gz} tarball somewhere.
\end{Ex}

Now look what's under the newly created \file{simpleMatchPWM.stub} folder:
you'll find the layout of files and subfolders that you've already
seen in most R packages. Except maybe for the \file{src} subfolder:
this is the standard folder where to put files containing native code
(\file{.c}, \file{.h}, \file{.cpp}, \file{.f} files, etc...).
In the \Rpackage{simpleMatchPWM.stub} package, \file{src} contains
only 2 files: \file{matchPWM.c} and \file{Biostrings_stubs.c}.

You could have as many files as you want in \file{src}, and you
could mix files written in different languages (proper
extensions must be used). Note that, most of the times, you don't
need a \file{Makefile} or any other extra file:
by default \verb+R CMD INSTALL+ knows what to do with them.
First it will invoke the compiler on each of them (skipping the
header files (\file{.h}) and anything else that does not require
compilation), then it will link together all the \file{.o} files
produced by the individual compilations into a shared object
(\file{.so} on Unix/Linux/Mac and \file{.DLL} on Windows).

\begin{Ex}
Run \verb+R CMD INSTALL+ on the \file{simpleMatchPWM.stub} folder
and look at how the compiler is invoked.

What compiler flags are used?

Note that \verb+R CMD INSTALL+ doesn't clean after itself:
can you see the compilation products in \file{src}?
\end{Ex}

On Unix/Linux/Mac, if you are using the gcc compiler, you can turn on the
-Wall flag (in order to get warnings about potential problems in your code).
This is done by editing \file{\$R_HOME/etc/Makeconf} (e.g. append \verb+-Wall+
to the \verb+CFLAGS+ line and you'll get the warnings during the compilation
of C files).

\begin{Ex}
Turn on the -Wall gcc flag and run \verb+R CMD INSTALL+ again (you should
see 2 warnings). Are those warnings really serious?
\end{Ex}


\section{Is folder \file{src} all I need in order to support native code
in my package?}

Basically yes. But in the case of the \Rpackage{simpleMatchPWM.stub} package,
since it has a namespace, then the following line needs to be added to its
\file{NAMESPACE} file (generally to the top):

\verb+useDynLib(simpleMatchPWM.stub)+

\begin{Ex}
Check \Rpackage{simpleMatchPWM.stub}'s \file{NAMESPACE} file.
\end{Ex}


\section{Trying to use \Rpackage{simpleMatchPWM.stub}}

\begin{Ex}
Start R, load the \Rpackage{simpleMatchPWM.stub} package,
and look at the man page for the \Rfunction{matchPWM}
function (\Robject{?matchPWM}).

Try to run the examples. Anything wrong?

Load the \Rpackage{BSgenome.Dmelanogaster.FlyBase.r51} package
and display the \Robject{Dmelanogaster} object.
Display chromosome 3R. What's its length? What's the class of this
object?

Use \Rfunction{alphabetFrequency} (from the \Rpackage{Biostrings} package)
on it. Are there any other letters than A, C, G or T in this sequence?

What about chromosome 3L?
\end{Ex}


\section{A first example using .Call}

Before we try to implement the \Rfunction{PWMscore}, \Rfunction{matchPWM}
and \Rfunction{countPWM} functions, let's go thru the classic {\it Hello
World} exercise. We'll start by adding a little function
to the \file{src/matchPWM.c} file that prints \verb+Hello world!+.
Then we'll take any required step to make this C function callable from R
via the .Call interface. In other words, we want to make our C function
a {\it .Call entry point}.

\begin{Ex}
Open the \file{src/matchPWM.c} file and add the \verb+hello_world+ function
near the bottom of the file, right before the section called
REGISTRATION OF THE .Call ENTRY POINTS.
\end{Ex}

For any {\it .Call entry point}, the arguments and the returned value
must be \verb+SEXP+ objects. In the case of the \verb+hello_world+ function,
we don't need any argument, which is fine, but we must return an
\verb+SEXP+ object. A common solution is to return \verb+NULL_USER_OBJECT+
(symbol defined in \file{\$R_HOME/include/Rdefines.h}) which is
the \verb+SEXP+ object representing the \Robject{NULL} value in R.

\begin{Ex}
Make \verb+hello_world+ return \verb+NULL_USER_OBJECT+.

Save, reinstall \Rpackage{simpleMatchPWM.stub}, restart R, load
\Rpackage{simpleMatchPWM.stub} and try:
<<helloworld1, eval=false>>=
.Call("hello_world", PACKAGE="simpleMatchPWM.stub")
@
\end{Ex}

The above didn't work because all {\it .Call entry points} must be
registered. This is done by adding an entry for \verb+hello_world+
to the \verb+callMethods+ array defined at the bottom of the file.
Note that the last value of each entry must be the number of arguments
of the {\it .Call entry point}.

\begin{Ex}
Register \verb+hello_world+ and try to call it again from R as before.
\end{Ex}


\section{Manipulating \Rclass{DNAString} objects in C}

All the R code in \Rpackage{simpleMatchPWM.stub} has been put in a
single file, the \file{R/matchPWM.R} file.

\begin{Ex}
Open \file{R/matchPWM.R} and find the definition of the
\Rfunction{PWMscore} function.

What C function does it call? How many arguments are passed to this
C function?

Find the definition of the \verb+PWM_score()+ entry point in
\file{src/matchPWM.c}.
\end{Ex}

The 2nd argument of {\it .Call entry points} \verb+PWM_score()+ and
\verb+match_PWM()+ must be a \Rclass{DNAString} object. This is enforced
at the R level: the callers will check the nature of their \Robject{subject}
argument and raise an error if it's not a \Rclass{DNAString} object.
Note that this kind of sanity checking could be done at the C level
but they are generally much easier to do (and to read, understand and
modify) at the R level.

You don't need to know all the details about the \Rclass{DNAString} class
in order to manipulate a \Rclass{DNAString} object at the C level.
Most of the time, all you need to know is the address in memory of its
first letter and its length.
This information can be retrieved by calling the \verb+get_XString_asRoSeq+
function. This function is part of the {\it Biostrings C interface} which
will be introduced now.


\section{The Biostrings C interface}

The {\it Biostrings C interface} is defined in the
\file{inst/include/Biostrings_interface.h} file of the \Rpackage{Biostrings}
package. By default all R packages are installed under \file{\$R_HOME/library}
(see \Robject{?install.packages} for more information on this),
so \file{Biostrings_interface.h} should be located in
\file{\$R_HOME/library/Biostrings/inst/include/}.

\begin{Ex}
Consult the \file{Biostrings_interface.h} file for more information about
the {\it Biostrings C interface}.

In particular, check that the \Rpackage{simpleMatchPWM.stub} package is
set up properly:
\begin{itemize}
\item check the \verb+Depends:+, \verb+Imports:+ and \verb+LinkingTo:+
      fields in the \file{DESCRIPTION} file;
\item check the \code{import} directives in the \file{NAMESPACE} file;
\item check the \file{src/Biostrings_stubs.c} file;
\item check the \verb+#include "Biostrings_interface.h"+ line in
      \file{src/matchPWM.c}.
\end{itemize}
\end{Ex}

\begin{Ex}
Copy/paste the definition of the \verb+print_XString_bytes+
function given in \file{Biostrings_interface.h} into your \file{src/matchPWM.c}
file, and register it as a {\it .Call entry point}.
Save, reinstall \Rpackage{simpleMatchPWM.stub}, restart R,
load \Rpackage{simpleMatchPWM.stub} and use \Rfunction{.Call} to call
\verb+print_XString_bytes+ on a \Rclass{DNAString} object.
\end{Ex}

In fact, \Rclass{DNAString} objects, like \Rclass{RNAString}, \Rclass{AAString}
and \Rclass{BString} objects, are particular kinds of \Rclass{XString} objects.
The \verb+get_XString_asRoSeq+ function can be used on any of them.

\begin{Ex}
Call \verb+print_XString_bytes+ on \Robject{DNAString("ACGTacgt")},
\Robject{RNAString("ACGUacgu")} and \Robject{BString("ACGTacgt")}.
\end{Ex}

As stated in \file{Biostrings_interface.h}, there are 2 important things
to remember about \Rclass{XString} objects:
\begin{itemize}
\item they are {\it not} null-terminated like standard strings in C: they can
      eventually contain the null byte so you should never use the C standard
      string functions on them;
\item \Rclass{DNAString} and \Rclass{RNAString} objects have their data
      {\it encoded}. This means that a code (different from the ASCII code)
      is used to represent each nucleotide internally.
\end{itemize}

\begin{Ex}
Modify the \verb+print_XString_bytes+ function to make it display
(in clear) the nucleotide letters contained in a \Rclass{DNAString} object.
Try to use this modified version on a \Rclass{BString} object.
\end{Ex}


\section{Specifications of the PWM_score() function}

Now we are almost ready to implement \verb+PWM_score()+. But before we start, we
need to describe {\it exactly} what this function will do.

We start by describing \verb+PWM_score()+'s arguments:
\begin{itemize}
\item \verb+pwm+: the Position Weight Matrix (PWM). This is an integer matrix
      with row names A, C, G and T (in this order);
\item \verb+subject+: a \Rclass{DNAString} object containing the
      subject (or target) sequence;
\item \verb+start+: the set of starting positions. This is an integer vector
      of arbitrary length (\Robject{NA}s accepted).
\end{itemize}

Now what the function will do: given a PWM, a DNA sequence and a set
of starting positions, \verb+PWM_score()+ must walk thru the set of
starting positions, and, for each of them, {\it place} the PWM
such that its first column is aligned with the current starting
position and compute a score.
For a given starting position, the score is obtained by picking,
in each column of the PWM, the coefficient that corresponds to the
nucleotide in the DNA sequence that is aligned with the column,
and by summing them.

Type of the returned value: given that the \verb+pwm+ argument
will always be a matrix with integer coefficients (see the caller,
\Rfunction{PWMscore}, in \file{R/matchPWM.R}), then the score can
only be an integer too. Hence we want \verb+PWM_score()+ to return an
\verb+SEXP+ object representing an integer vector.

Vectorization: the \verb+PWM_score()+ function is vectorized in respect
to the \verb+start+ argument. This means that applying the function
to the \verb+start+ vector is equivalent to applying the same function
on each of its elements.
Also, as it is usually the case with vectorized functions, we want to
allow \Robject{NA} values in the \verb+start+ object: when a starting
position is \Robject{NA}, then the corresponding score must be \Robject{NA}
too.

Finally, we want to raise an error if one of the starting positions is
{\it invalid}. When the first column of the PWM is aligned with the
starting position, then the entire PWM should fit within the limits
of the DNA sequence, otherwise, the starting position is considered
{\it invalid}.


\section{Some notes about the compute_score() helper function}

The \verb+compute_score()+ helper function is provided so you can use it
any time you need to compute the score for a given starting position.
This will make implementing \verb+PWM_score()+ and \verb+match_PWM()+ easier.

The arguments of \verb+compute_score()+ are:
\begin{itemize}
\item \verb+pwm+: a pointer to the first coefficient of the PWM. In R,
      a matrix (like any array) is just an atomic vector with a
      \Robject{"dim"} attribute. This means that, at the C level, its
      coefficients are stored one next to each other in memory.
      Most importantly, they are stored {\it column by column}.
      For example, in the case of the PWM, the first column is stored
      in \verb+pwm[0]+, \verb+pwm[1]+, \verb+pwm[2]+, \verb+pwm[3]+,
      the second column 
      in \verb+pwm[4]+, \verb+pwm[5]+, \verb+pwm[6]+, \verb+pwm[7]+,
      and so on...
\item \verb+pwm_ncol+: the number of column of the PWM. Hence the last
      valid element in \verb+pwm[]+ is \verb+pwm[4*pwm_ncol-1]+.
\item \verb+S+: a pointer to the first letter in the target sequence.
\item \verb+nS+: the length of the target sequence.
\item \verb+pwm_shift+: how far the PWM must be shifted along the target.
      A shift of zero corresponds to a starting position of one and
      the shift is in fact always the starting position minus one.
\end{itemize}

Speed: inside \verb+compute_score()+, everytime we move to the next letter in the
target sequence (\verb+S+), then we need not move to the next column in the PWM
(\verb+pwm+), we need to map this new letter to the corresponding row in the PWM.
In order to make this as fast as possible, we use the translation
table \verb+DNAcode2PWMrowoffset+.
This table must be initialized before \verb+compute_score()+ can be used.
So don't forget to call \verb+init_DNAcode2PWMrowoffset()+
in \verb+PWM_score()+ and \verb+match_PWM()+ before they call
\verb+compute_score()+ for the first time.

\begin{Ex}
Have a close look at \verb+compute_score()+ and try to understand how
it works.
\end{Ex}

Note that we didn't make \verb+compute_score()+ a {\it .Call entry point}
because we never need to call it directly from R. Even more, by declaring
this function static (see the use of the \verb+static+ keyword at the
beginning of the function definition), we restrict its use to the \file{matchPWM.c}
file. That is, only functions defined in the same file can call it.
In the case of the \Rpackage{simpleMatchPWM.stub} package, it doesn't really
matter whether a C function is declared static or not, because all C functions
are in the same file. But, when a program is made of many \file{.c} files,
making some functions statics can make the code easier to maintain,
and it helps to keep things organized.


\section{Implementing the PWM_score() function}

We are finally ready to implement \verb+PWM_score()+!

\begin{Ex}
Follow the instructions given in the \verb+PWM_score()+ stub to
implement the function.

Tips:
\begin{itemize}
\item Apply \verb+INTEGER()+ to an \verb+SEXP+ object representing an
      integer vector in order to get the address of its first element
      (remember that in C, the index of the first element in an array
      is 0).
\item Use \verb+PROTECT(ans = NEW_INTEGER(LENGTH(start)));+ in order
      to allocate memory for the {\it answer object}.
      Then, for example, to assign a value to its first element, use
      \verb+INTEGER(ans)[0]+ on the left side of the assignement.
\item Every time you try to compile your code, try to decipher
      all compiler errors or warnings. They can sometimes be cryptic,
      but getting familiar with gcc's jargon becomes essential in
      troubleshooting times.
\item If you get stuck, look at the \Rpackage{simpleMatchPWM} package.
\end{itemize}
\end{Ex}


\section{Implementing the match_PWM() function}

The \verb+match_PWM()+ function must {\it find} and return all the starting
positions in the target sequence that realize a score greater or equal to
a given value (the {\it min score}).

We will use {\it brute force} for this, that is, we will try all valid
starting positions in the target sequence and return only those that lead
to a match. This means that our main loop in \verb+match_PWM()+ will walk
the target sequence from its first letter to a letter close to its last letter
(the last \verb+pwm_ncol-1+ letters in the target sequence are invalid
starting positions). For example, with a chromosome sequence, we will
typically have to examine millions of starting positions!

\begin{Ex}
Make sure you know exactly what the \verb+match_PWM()+ function will do.
\end{Ex}

\begin{Ex}
Follow the instructions given in the \verb+match_PWM()+ stub to
implement the function. Tips: you could start by ignoring
the \verb+count_only+ argument and come back to it later.

Look at the \Rfunction{matchPWM} function in \file{R/matchPWM.R}
and see how it converts the object returned by \Rfunction{.Call}
into a \Rclass{BStringViews} object.

When you are done, try to run the examples in \Robject{?matchPWM}.
Is \verb+match_PWM()+ fast enough?
\end{Ex}


\section{Finishing your package}

\begin{Ex}
Set the \verb+Author:+ and \verb+Maintainer:+ fields in the \file{DESCRIPTION}
file.

Finally, make sure that your package passes \verb+R CMD check+ without
any errors or warnings.
\end{Ex}

That's it!

\end{document}