\documentclass{article} \usepackage{hyperref} \usepackage{theorem} \usepackage{underscore} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\classdef}[1]{% {\em #1} } \newcommand{\code}[1]{{\texttt{#1}}} \newcommand{\file}[1]{{\texttt{#1}}} \theoremstyle{break} \newtheorem{Ex}{Exercise} \begin{document} \title{Lab: Implementing a simple (but fast) Position Weight Matrix matching algorithm for long DNA sequences} \author{Herv\'e Pag\`es \and Martin Morgan} \date{\today} \maketitle <>= options(width=72, continue=" ") @ \section{Goals} \begin{itemize} \item Create a fully working package (\Rpackage{simpleMatchPWM}) around a simple PWM matching function (\Rfunction{matchPWM}) and some related functions (see below). \item Write the core algorithms in C (for speed purpose) and use the \Rfunction{.Call} interface to call them from the R code. \item Make our \Rfunction{matchPWM} function work on long DNA sequences like full chromosome sequences. The standard container for such sequences in Bioconductor is the \Robject{DNAString} container defined in the \Rpackage{Biostrings} package. This will require us to learn a little bit about the \Rpackage{Biostrings} internals and the use of the {\it Biostrings C interface} but an attempt has been made to keep this as simple as possible. \end{itemize} \section{Prerequisites} \begin{itemize} \item A laptop with a recent version of R 2.7 and all the tools required for package development (compilers, Perl, etc...) \item The latest version of the \Rpackage{Biostrings} package from the 2.7 series (the current development series). Note that this document reflects the state of affairs in \Rpackage{Biostrings} 2.7.46 and is guaranteed to be accurate for this version only. \item The \Rpackage{BSgenome.Dmelanogaster.FlyBase.r51} package. \item Some basic knowledge of R package authoring. \item Some C knowledge and previous experience in C programming. \item You should have received 2 package source tarballs: \file{simpleMatchPWM.stub_0.0.2.tar.gz} and \file{simpleMatchPWM_0.99.1.tar.gz}. The former is an incomplete package (it doesn't pass \verb+R CMD check+): you will add the code that is missing in it. The latter is the fully working version of the former (it passes \verb+R CMD check+ without any errors or warnings). Note that these packages, like this document, reflect the state of affairs in \Rpackage{Biostrings} 2.7.46. \end{itemize} \section{Checking your installation} \begin{Ex} Extract the \file{simpleMatchPWM.stub_0.0.2.tar.gz} tarball somewhere. \end{Ex} Now look what's under the newly created \file{simpleMatchPWM.stub} folder: you'll find the layout of files and subfolders that you've already seen in most R packages. Except maybe for the \file{src} subfolder: this is the standard folder where to put files containing native code (\file{.c}, \file{.h}, \file{.cpp}, \file{.f} files, etc...). In the \Rpackage{simpleMatchPWM.stub} package, \file{src} contains only 2 files: \file{matchPWM.c} and \file{Biostrings_stubs.c}. You could have as many files as you want in \file{src}, and you could mix files written in different languages (proper extensions must be used). Note that, most of the times, you don't need a \file{Makefile} or any other extra file: by default \verb+R CMD INSTALL+ knows what to do with them. First it will invoke the compiler on each of them (skipping the header files (\file{.h}) and anything else that does not require compilation), then it will link together all the \file{.o} files produced by the individual compilations into a shared object (\file{.so} on Unix/Linux/Mac and \file{.DLL} on Windows). \begin{Ex} Run \verb+R CMD INSTALL+ on the \file{simpleMatchPWM.stub} folder and look at how the compiler is invoked. What compiler flags are used? Note that \verb+R CMD INSTALL+ doesn't clean after itself: can you see the compilation products in \file{src}? \end{Ex} On Unix/Linux/Mac, if you are using the gcc compiler, you can turn on the -Wall flag (in order to get warnings about potential problems in your code). This is done by editing \file{\$R_HOME/etc/Makeconf} (e.g. append \verb+-Wall+ to the \verb+CFLAGS+ line and you'll get the warnings during the compilation of C files). \begin{Ex} Turn on the -Wall gcc flag and run \verb+R CMD INSTALL+ again (you should see 2 warnings). Are those warnings really serious? \end{Ex} \section{Is folder \file{src} all I need in order to support native code in my package?} Basically yes. But in the case of the \Rpackage{simpleMatchPWM.stub} package, since it has a namespace, then the following line needs to be added to its \file{NAMESPACE} file (generally to the top): \verb+useDynLib(simpleMatchPWM.stub)+ \begin{Ex} Check \Rpackage{simpleMatchPWM.stub}'s \file{NAMESPACE} file. \end{Ex} \section{Trying to use \Rpackage{simpleMatchPWM.stub}} \begin{Ex} Start R, load the \Rpackage{simpleMatchPWM.stub} package, and look at the man page for the \Rfunction{matchPWM} function (\Robject{?matchPWM}). Try to run the examples. Anything wrong? Load the \Rpackage{BSgenome.Dmelanogaster.FlyBase.r51} package and display the \Robject{Dmelanogaster} object. Display chromosome 3R. What's its length? What's the class of this object? Use \Rfunction{alphabetFrequency} (from the \Rpackage{Biostrings} package) on it. Are there any other letters than A, C, G or T in this sequence? What about chromosome 3L? \end{Ex} \section{A first example using .Call} Before we try to implement the \Rfunction{PWMscore}, \Rfunction{matchPWM} and \Rfunction{countPWM} functions, let's go thru the classic {\it Hello World} exercise. We'll start by adding a little function to the \file{src/matchPWM.c} file that prints \verb+Hello world!+. Then we'll take any required step to make this C function callable from R via the .Call interface. In other words, we want to make our C function a {\it .Call entry point}. \begin{Ex} Open the \file{src/matchPWM.c} file and add the \verb+hello_world+ function near the bottom of the file, right before the section called REGISTRATION OF THE .Call ENTRY POINTS. \end{Ex} For any {\it .Call entry point}, the arguments and the returned value must be \verb+SEXP+ objects. In the case of the \verb+hello_world+ function, we don't need any argument, which is fine, but we must return an \verb+SEXP+ object. A common solution is to return \verb+NULL_USER_OBJECT+ (symbol defined in \file{\$R_HOME/include/Rdefines.h}) which is the \verb+SEXP+ object representing the \Robject{NULL} value in R. \begin{Ex} Make \verb+hello_world+ return \verb+NULL_USER_OBJECT+. Save, reinstall \Rpackage{simpleMatchPWM.stub}, restart R, load \Rpackage{simpleMatchPWM.stub} and try: <>= .Call("hello_world", PACKAGE="simpleMatchPWM.stub") @ \end{Ex} The above didn't work because all {\it .Call entry points} must be registered. This is done by adding an entry for \verb+hello_world+ to the \verb+callMethods+ array defined at the bottom of the file. Note that the last value of each entry must be the number of arguments of the {\it .Call entry point}. \begin{Ex} Register \verb+hello_world+ and try to call it again from R as before. \end{Ex} \section{Manipulating \Rclass{DNAString} objects in C} All the R code in \Rpackage{simpleMatchPWM.stub} has been put in a single file, the \file{R/matchPWM.R} file. \begin{Ex} Open \file{R/matchPWM.R} and find the definition of the \Rfunction{PWMscore} function. What C function does it call? How many arguments are passed to this C function? Find the definition of the \verb+PWM_score()+ entry point in \file{src/matchPWM.c}. \end{Ex} The 2nd argument of {\it .Call entry points} \verb+PWM_score()+ and \verb+match_PWM()+ must be a \Rclass{DNAString} object. This is enforced at the R level: the callers will check the nature of their \Robject{subject} argument and raise an error if it's not a \Rclass{DNAString} object. Note that this kind of sanity checking could be done at the C level but they are generally much easier to do (and to read, understand and modify) at the R level. You don't need to know all the details about the \Rclass{DNAString} class in order to manipulate a \Rclass{DNAString} object at the C level. Most of the time, all you need to know is the address in memory of its first letter and its length. This information can be retrieved by calling the \verb+get_XString_asRoSeq+ function. This function is part of the {\it Biostrings C interface} which will be introduced now. \section{The Biostrings C interface} The {\it Biostrings C interface} is defined in the \file{inst/include/Biostrings_interface.h} file of the \Rpackage{Biostrings} package. By default all R packages are installed under \file{\$R_HOME/library} (see \Robject{?install.packages} for more information on this), so \file{Biostrings_interface.h} should be located in \file{\$R_HOME/library/Biostrings/inst/include/}. \begin{Ex} Consult the \file{Biostrings_interface.h} file for more information about the {\it Biostrings C interface}. In particular, check that the \Rpackage{simpleMatchPWM.stub} package is set up properly: \begin{itemize} \item check the \verb+Depends:+, \verb+Imports:+ and \verb+LinkingTo:+ fields in the \file{DESCRIPTION} file; \item check the \code{import} directives in the \file{NAMESPACE} file; \item check the \file{src/Biostrings_stubs.c} file; \item check the \verb+#include "Biostrings_interface.h"+ line in \file{src/matchPWM.c}. \end{itemize} \end{Ex} \begin{Ex} Copy/paste the definition of the \verb+print_XString_bytes+ function given in \file{Biostrings_interface.h} into your \file{src/matchPWM.c} file, and register it as a {\it .Call entry point}. Save, reinstall \Rpackage{simpleMatchPWM.stub}, restart R, load \Rpackage{simpleMatchPWM.stub} and use \Rfunction{.Call} to call \verb+print_XString_bytes+ on a \Rclass{DNAString} object. \end{Ex} In fact, \Rclass{DNAString} objects, like \Rclass{RNAString}, \Rclass{AAString} and \Rclass{BString} objects, are particular kinds of \Rclass{XString} objects. The \verb+get_XString_asRoSeq+ function can be used on any of them. \begin{Ex} Call \verb+print_XString_bytes+ on \Robject{DNAString("ACGTacgt")}, \Robject{RNAString("ACGUacgu")} and \Robject{BString("ACGTacgt")}. \end{Ex} As stated in \file{Biostrings_interface.h}, there are 2 important things to remember about \Rclass{XString} objects: \begin{itemize} \item they are {\it not} null-terminated like standard strings in C: they can eventually contain the null byte so you should never use the C standard string functions on them; \item \Rclass{DNAString} and \Rclass{RNAString} objects have their data {\it encoded}. This means that a code (different from the ASCII code) is used to represent each nucleotide internally. \end{itemize} \begin{Ex} Modify the \verb+print_XString_bytes+ function to make it display (in clear) the nucleotide letters contained in a \Rclass{DNAString} object. Try to use this modified version on a \Rclass{BString} object. \end{Ex} \section{Specifications of the PWM_score() function} Now we are almost ready to implement \verb+PWM_score()+. But before we start, we need to describe {\it exactly} what this function will do. We start by describing \verb+PWM_score()+'s arguments: \begin{itemize} \item \verb+pwm+: the Position Weight Matrix (PWM). This is an integer matrix with row names A, C, G and T (in this order); \item \verb+subject+: a \Rclass{DNAString} object containing the subject (or target) sequence; \item \verb+start+: the set of starting positions. This is an integer vector of arbitrary length (\Robject{NA}s accepted). \end{itemize} Now what the function will do: given a PWM, a DNA sequence and a set of starting positions, \verb+PWM_score()+ must walk thru the set of starting positions, and, for each of them, {\it place} the PWM such that its first column is aligned with the current starting position and compute a score. For a given starting position, the score is obtained by picking, in each column of the PWM, the coefficient that corresponds to the nucleotide in the DNA sequence that is aligned with the column, and by summing them. Type of the returned value: given that the \verb+pwm+ argument will always be a matrix with integer coefficients (see the caller, \Rfunction{PWMscore}, in \file{R/matchPWM.R}), then the score can only be an integer too. Hence we want \verb+PWM_score()+ to return an \verb+SEXP+ object representing an integer vector. Vectorization: the \verb+PWM_score()+ function is vectorized in respect to the \verb+start+ argument. This means that applying the function to the \verb+start+ vector is equivalent to applying the same function on each of its elements. Also, as it is usually the case with vectorized functions, we want to allow \Robject{NA} values in the \verb+start+ object: when a starting position is \Robject{NA}, then the corresponding score must be \Robject{NA} too. Finally, we want to raise an error if one of the starting positions is {\it invalid}. When the first column of the PWM is aligned with the starting position, then the entire PWM should fit within the limits of the DNA sequence, otherwise, the starting position is considered {\it invalid}. \section{Some notes about the compute_score() helper function} The \verb+compute_score()+ helper function is provided so you can use it any time you need to compute the score for a given starting position. This will make implementing \verb+PWM_score()+ and \verb+match_PWM()+ easier. The arguments of \verb+compute_score()+ are: \begin{itemize} \item \verb+pwm+: a pointer to the first coefficient of the PWM. In R, a matrix (like any array) is just an atomic vector with a \Robject{"dim"} attribute. This means that, at the C level, its coefficients are stored one next to each other in memory. Most importantly, they are stored {\it column by column}. For example, in the case of the PWM, the first column is stored in \verb+pwm[0]+, \verb+pwm[1]+, \verb+pwm[2]+, \verb+pwm[3]+, the second column in \verb+pwm[4]+, \verb+pwm[5]+, \verb+pwm[6]+, \verb+pwm[7]+, and so on... \item \verb+pwm_ncol+: the number of column of the PWM. Hence the last valid element in \verb+pwm[]+ is \verb+pwm[4*pwm_ncol-1]+. \item \verb+S+: a pointer to the first letter in the target sequence. \item \verb+nS+: the length of the target sequence. \item \verb+pwm_shift+: how far the PWM must be shifted along the target. A shift of zero corresponds to a starting position of one and the shift is in fact always the starting position minus one. \end{itemize} Speed: inside \verb+compute_score()+, everytime we move to the next letter in the target sequence (\verb+S+), then we need not move to the next column in the PWM (\verb+pwm+), we need to map this new letter to the corresponding row in the PWM. In order to make this as fast as possible, we use the translation table \verb+DNAcode2PWMrowoffset+. This table must be initialized before \verb+compute_score()+ can be used. So don't forget to call \verb+init_DNAcode2PWMrowoffset()+ in \verb+PWM_score()+ and \verb+match_PWM()+ before they call \verb+compute_score()+ for the first time. \begin{Ex} Have a close look at \verb+compute_score()+ and try to understand how it works. \end{Ex} Note that we didn't make \verb+compute_score()+ a {\it .Call entry point} because we never need to call it directly from R. Even more, by declaring this function static (see the use of the \verb+static+ keyword at the beginning of the function definition), we restrict its use to the \file{matchPWM.c} file. That is, only functions defined in the same file can call it. In the case of the \Rpackage{simpleMatchPWM.stub} package, it doesn't really matter whether a C function is declared static or not, because all C functions are in the same file. But, when a program is made of many \file{.c} files, making some functions statics can make the code easier to maintain, and it helps to keep things organized. \section{Implementing the PWM_score() function} We are finally ready to implement \verb+PWM_score()+! \begin{Ex} Follow the instructions given in the \verb+PWM_score()+ stub to implement the function. Tips: \begin{itemize} \item Apply \verb+INTEGER()+ to an \verb+SEXP+ object representing an integer vector in order to get the address of its first element (remember that in C, the index of the first element in an array is 0). \item Use \verb+PROTECT(ans = NEW_INTEGER(LENGTH(start)));+ in order to allocate memory for the {\it answer object}. Then, for example, to assign a value to its first element, use \verb+INTEGER(ans)[0]+ on the left side of the assignement. \item Every time you try to compile your code, try to decipher all compiler errors or warnings. They can sometimes be cryptic, but getting familiar with gcc's jargon becomes essential in troubleshooting times. \item If you get stuck, look at the \Rpackage{simpleMatchPWM} package. \end{itemize} \end{Ex} \section{Implementing the match_PWM() function} The \verb+match_PWM()+ function must {\it find} and return all the starting positions in the target sequence that realize a score greater or equal to a given value (the {\it min score}). We will use {\it brute force} for this, that is, we will try all valid starting positions in the target sequence and return only those that lead to a match. This means that our main loop in \verb+match_PWM()+ will walk the target sequence from its first letter to a letter close to its last letter (the last \verb+pwm_ncol-1+ letters in the target sequence are invalid starting positions). For example, with a chromosome sequence, we will typically have to examine millions of starting positions! \begin{Ex} Make sure you know exactly what the \verb+match_PWM()+ function will do. \end{Ex} \begin{Ex} Follow the instructions given in the \verb+match_PWM()+ stub to implement the function. Tips: you could start by ignoring the \verb+count_only+ argument and come back to it later. Look at the \Rfunction{matchPWM} function in \file{R/matchPWM.R} and see how it converts the object returned by \Rfunction{.Call} into a \Rclass{BStringViews} object. When you are done, try to run the examples in \Robject{?matchPWM}. Is \verb+match_PWM()+ fast enough? \end{Ex} \section{Finishing your package} \begin{Ex} Set the \verb+Author:+ and \verb+Maintainer:+ fields in the \file{DESCRIPTION} file. Finally, make sure that your package passes \verb+R CMD check+ without any errors or warnings. \end{Ex} That's it! \end{document}