\documentclass{article}

\usepackage{mathptmx}
\usepackage{amsmath}
\usepackage{graphicx}
\usepackage{float}
\usepackage{epsfig}

%% \usepackage{semcolor}
\usepackage[pdftex,colorlinks=false,pdfborder=0 0 0,pageanchor=false,bookmarks=false,urlcolor=black]{hyperref}

\pdfpagewidth=11truein
\pdfpageheight=8.5truein
\pdfhorigin=1truein     % default value(?), but doesn't work without
\pdfvorigin=1truein     % default value(?), but doesn't work without

%% \newcommand{\fs}{\fromSlide}
%% \newcommand{\bs}{\begin{slide}}
%% \newcommand{\es}{\end{slide}}

%% \slideframe{none}
\setcounter{secnumdepth}{0}

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textsf{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}

\newcommand{\R}{\textsf{R}}

\newcommand{\term}[1]{\emph{#1}}
\newcommand{\mref}[2]{\htmladdnormallinkfoot{#2}{#1}}

\begin{document}

\author{H. Pag\`es \\ 
  Gentleman Lab \\
  Fred Hutchinson Cancer Research Center \\
  Seattle, WA}
\title{Biostrings Lab (BioC2007)}
\date{6 August, 2007}
\maketitle

% ---------------------------------------------------------------------------

\section{Introduction}
The \Rpackage{Biostrings} package: provides the infrastructure for storing
and manipulating large nucleotide and amino acid sequences (up to hundreds
of millions of letters) in {\R}.

We need {\R} version 2.6 and \Rpackage{Biostrings} version 2.5.19 for this
lab:
<<>>=
library(Biostrings)
sessionInfo()
@

The \Rpackage{BSgenome.*} packages: data packages containing the full genome
sequence for a given organism. Ex: \Rpackage{BSgenome.Celegans.UCSC.ce2},
\Rpackage{BSgenome.Hsapiens.UCSC.hg18}, etc...
Use the \Rfunction{available.genomes()} function from the \Rpackage{BSgenome}
package to see the list of available \Rpackage{BSgenome.*} packages (network
access required):
<<>>=
library(BSgenome)
available.genomes()
@

Then install with \Rfunction{biocLite()}:
<<eval=FALSE>>=
source("http://bioconductor.org/biocLite.R")
biocLite("BSgenome.Dmelanogaster.FlyBase.r51")
@

Note that those \Rpackage{BSgenome.*} packages are big (from a few Mb to more
than 800 Mb for \Rpackage{BSgenome.Hsapiens.UCSC.hg18}) so this installation
over the network can take a long time.


% ---------------------------------------------------------------------------

\section{\Rclass{BString} objects}

The \Rclass{BString} class is the basic container for big sequences.
Unlike standard character vectors in {\R} that can store an arbitrary number
of strings, a \Rclass{BString} object can only contain 1 string.
For sequences commonly found in biology, 3 variants of the \Rclass{BString}
class are available:
\begin{enumerate}
\item The \Rclass{DNASstring} class for storing a DNA sequence.
\item The \Rclass{RNAString} class for storing a RNA sequence.
\item The \Rclass{AAString} class for storing a sequence of amino acids.
\end{enumerate}

For each of these classes, there is a constructor that allows to create
an instance of the class from a character string:

<<>>=
s1 <- BString("hello")
s2 <- DNAString("GACCCT")
s3 <- AAString("MARKSLEMSIR")
@ 

\subsection{Exercise 1}
\begin{enumerate}
\item Create some arbitrary \Rclass{BString}, \Rclass{DNAString}
      and \Rclass{AAString} instances.
\item Apply \Rfunction{nchar} and \Rfunction{alphabet} on them.
\item Apply \Rfunction{alphabetFrequency} and \Rfunction{reverseComplement}
      on the \Rclass{DNAString} objects.
\item Extract substrings by using the subsetting operator~\Rmethod{[}
      and by using the \Rfunction{subBString} function.
\end{enumerate}

IMPORTANT: The \Rfunction{subBString} function should always be preferred
to the subsetting operator~\Rmethod{[} for substring extraction. More on this
later...


% ---------------------------------------------------------------------------

\section{\Rpackage{BSgenome.*} packages}

A \Rpackage{BSgenome.*} package is a data package containing the full genome
sequence for a given organism.

Please refer to the introduction of this document for how to display the list of
\Rpackage{BSgenome.*} packages currently available and how to install them
over the network.

The name of a \Rpackage{BSgenome.*} package is made of 4 parts separated by a dot
(e.g. \Rpackage{BSgenome.Celegans.UCSC.ce2}).
The 1st part is always {\tt BSgenome}, the 2nd part is the name of the organism
(abbreviated), the 3rd part is the name of the organisation who assembled the genome
and the 4th part is the release string or number used by this organisation
for this assembly of the genome.

A \Rpackage{BSgenome.*} package contains a single top level object whose name matches
the second part of the package name:
<<>>=
library(BSgenome.Celegans.UCSC.ce2)
Celegans
@

Displaying this object shows some information about the provenance of this genome
plus 2 indexes of sequences: the single sequences and the multiple sequences.
The single sequences are \Rclass{DNAString} objects and the multiple sequences
are \Rclass{BStringViews} objects (will be introduced soon).

\subsection{Exercise 2}
\begin{enumerate}
\item Display several chromosomes of C. elegans. Do they contain IUPAC extended
      letters?
\item Load \Rpackage{BSgenome.Dmelanogaster.FlyBase.r51} (fruit fly genome).
      Do the chromosomes contain IUPAC extended letters?
\item Extract the last 10M bases from fruit fly chromosome 2L.
      Use the \Rfunction{system.time} function to compare the performance
      of \Rfunction{subBString} vs the subsetting operator~\Rmethod{[}.
\end{enumerate}

% ---------------------------------------------------------------------------

\section{\Rclass{BStringViews} objects}

A \Rclass{BStringViews} object contains a set of views on the same sequence
called "the subject". Each view is defined by its start and end locations:
both are integers such that start <= end.
We use the \Rfunction{views} function to create a \Rclass{BStringViews} object
for a given set of start and end locations.
Here we create a \Rclass{BStringViews} object with 4 views:
<<>>=
v <- views(DNAString("TAATAATG"), 3:0, 5:8)
v
@

Note that two views can overlap and that a view can be "out of
limits" i.e. it can start before the first letter of the subject
or/and end after its last letter.

The subject is the DNA sequence {\tt TAATAATG}:
<<>>=
subject(v)
@

A \Rclass{BStringViews} object can be subsetted like a standard character
vector:
<<>>=
length(v)
v[4:2]
@

The start/end/width (integer vectors) can be extracted with \Robject{start()},
\Robject{end()} and \Robject{width()}:
<<>>=
start(v)
end(v)
width(v)
@

A given view can be extracted as a \Rclass{BString} (or derived) object
with the \Rmethod{[[} operator:
<<>>=
v[[2]]
@
Note this operator returns an instance of the same class as the subject.

\subsection{Exercise 3}
\begin{enumerate}
\item Compare \Rfunction{nchar} vs \Rfunction{width} on \Robject{v}.
\item Find a programatical way to remove "out of limits" views from
      a \Rclass{BStringViews} object.
\end{enumerate}

\subsection{Exercise 4}
\begin{enumerate}
\item Use the \Rfunction{mask} function to mask the {\tt N}s in fruit fly chromosome 2L.
      Apply \Rfunction{mask} again to the result (and without the second argument)
      to see the location of the {\tt N}s. Do we have isolated {\tt N}s or {\tt N}-blocks?
\item Find the location of the {\tt N}s in C. elegans chromosome III and in
      H. sapiens chromosome 1.
\item What's the longest {\tt A}-block in fruit fly chromosome 2L?
\end{enumerate}

\subsection{Exercise 5}
The objective of this exercise is to "look at" the introns of a given
eukaryote. For this we need:
\begin{enumerate}
\item The \Rpackage{BSgenome.*} package with the full genome sequence of the
      target organism.
\item Annotations that give us the gene and exon locations of the organism.
      Note that it's important to make sure that these annotations are
      targetting exactly the version (build/assembly) of the genome
      that we are using.
\item The \Rpackage{Biostrings} package.
\end{enumerate}

We do this analysis on the fruit fly genome: we use genome r5.1 from
FlyBase (the \Rpackage{BSgenome.Dmelanogaster.FlyBase.r51} package)
and a package of annotations for this genome, the \Rpackage{ann.Dmelanogaster.FlyBase.r51}
package, made from the GFF files provided by FlyBase:

<<>>=
library(ann.Dmelanogaster.FlyBase.r51)
@

The annotations in this package are organized as follow:
\Robject{CHR\_SHORTNAMES} gives the list of chromosomes for which
annotations are provided:
<<>>=
CHR_SHORTNAMES
@
and for each chromosome, 2 data frames are provided: 1 for the genes and 1 for the
exons:
<<>>=
colnames(getAnnGenes("2L"))
colnames(getAnnExons("2L"))
@
For our analysis, we only need the \Robject{"seqname"}, \Robject{"start"},
\Robject{"end"} and \Robject{"strand"}.
<<>>=
getAnnGenes("2L")[1:5, c("seqname", "start", "end", "strand")]
getAnnExons("2L")[1:5, c("seqname", "start", "end", "strand")]
@
The row names are gene ids for the former and exon ids for the later.

Finally, the data object \Robject{EXON2GENE} (named character vector) provides
the mappping between exon ids and gene ids (many-to-one relationship):
<<>>=
data(EXON2GENE)
EXON2GENE["CG11023:1"]
@

To get all the exons belonging to a given gene:
<<>>=
names(EXON2GENE)[EXON2GENE == "FBgn0031208"]
@

\begin{enumerate}
\item Extract the start and end position of the exons belonging to gene
      FBgn0025803 (chromosome 3R).
\item Load \Rpackage{BSgenome.Dmelanogaster.FlyBase.r51} and create
      a \Rclass{BStringViews} object where the subject is chromosome 3R
      and the views are the exon locations found previously.
\item Apply the \Rfunction{mask} function on this \Rclass{BStringViews}
      object. Is this new \Rclass{BStringViews} object representing the
      introns sequences for gene FBgn0025803? What needs to be adjusted?
\item Are we discovering something?
\end{enumerate}


% ---------------------------------------------------------------------------

\section{The \Rfunction{matchPattern} function}

This function finds all matches of a given pattern in a sequence.
Using \Rfunction{matchPattern} on a \Rclass{BString}, \Rclass{DNAString} or
\Rclass{AAString} object is much faster than working with regular expressions,
be it in R (on a standard character string) or in another language like Python.
<<>>=
library(BSgenome.Hsapiens.UCSC.hg18)
chr1 <- Hsapiens$chr1
system.time(m <- matchPattern("AAAAAAAAAATT", chr1))
m
@

Also, regular expressions can miss some matches:
<<>>=
length(gregexpr("ATA", "ATATA"))
@

\Rfunction{matchPattern} returns all matches:
<<>>=
length(matchPattern("ATA", "ATATA"))
@

\subsection{Exercise 6}
\begin{enumerate}
\item Find all the matches of an arbitrary nucleotide sequence in fruit fly chromosome 2L.
\item In fact, if we don't take any special action, we only get the hits in the plus
      strand of the chromosome. Find the matches in the minus strand too.
      IMPORTANT: You should always avoid to \Rfunction{reverseComplement} an entire
      chromosome sequence (this is very inefficient).
\end{enumerate}

\subsection{Exercise 7}
The objective of this exercise is to check that, for a given probe of a given chip,
the hit location found by \Rfunction{matchPattern} is in agrement with the location
reported in the annotations available for this chip. We use Affymetrix human chip
{\tt hgu95av2} for this analysis.
\begin{enumerate}
\item Use \Rpackage{hgu95av2} to find the chromosome location of the gene targetted
      by probeset {\tt 1138\_at}.
\item Use \Rpackage{hgu95av2probe} to find all the probe sequences for this probeset
      id (put the result in a character vector called \Robject{probes}).
\item Use \Rfunction{matchPattern} to find the corresponding hit for each probe in
      the corresponding chromosome (use a \Robject{sapply(probes, function(probe) ...)}
      construction to put all the results in a single list, the returned list will have
      one element per probe in \Robject{probes}). Do we have a hit for every probe?
      Compare with \Robject{hgu95av2CHRLOC[["1138\_at"]]}.
      Do you see something wrong?
\item Try again with \Robject{mismatch=1}. Do we have a hit for every probe?
\item In fact, according to \Rpackage{hgu95av2probe}, probeset {\tt 1138\_at} should
      hit a gene in the antistrand. Does this look correct?
\end{enumerate}

\subsection{Exercise 8}
IUPAC extended letters can be used to express ambiguities in the pattern or in the
subject of the search. This is controlled via the \Robject{fixed} argument of the
\Rfunction{matchPattern} function. If \Robject{fixed} is \Robject{TRUE} (the default),
all letters in the pattern and the subject are interpreted litterally.
If \Robject{fixed} is \Robject{FALSE}, IUPAC extended letters in the pattern and
in the subject are interpreted as ambiguities e.g. {\tt M} will match {\tt A} or {\tt C}
and {\tt N} will match any letter (the \Robject{IUPAC\_CODE\_MAP} named character vector
gives the mapping between IUPAC letters and possible nucleotides they stand for).
The most common use of this feature is to introduce wildcards in the pattern
by replacing some of its letters with {\tt N}s.

\begin{enumerate}
\item Search pattern {\tt GAACTTTGCCACTC} in fruit fly chromosome 2L.
\item Repeat but this time allow the 2nd {\tt T} in the pattern (6th letter) to match
      anything. Anything wrong?
\item You can use the \Rfunction{mask} function or \Robject{fixed="subject"} to work
      around the previous problem. Try and compare.
\end{enumerate}

\subsection{Exercise 9}
The objective of this exercise is to simulate a {\it PCR experiment} on the
fruit fly genome. A {\it PCR experiment} uses primer pairs specifically designed
to target the genome of a given organism to "amplify" certain regions of
this genome. Each primer pair is made of a forward and a reverse primer. Both
primers are short nucleotide sequences with 2 parts: a gene-specific part (the "probe")
and a "tag" used for the amplification processus.
We'll call the forward probe and the reverse probe the probe parts
of the forward and reverse primer respectively.
For each primer pair, the forward and reverse probes are
designed to hit the same chromosome at nearby locations.
The region between the 2 hits is called an amplicons: it is the region
that gets "amplified" by the PCR processus.

For our {\it computer-simulated PCR experiment}, we'll use the small library of
primer pairs stored in the {\tt plate1\_probes.txt} file. This library can be
loaded into a data frame with:
<<>>=
probes <- read.table("plate1_probes.txt", stringsAsFactors=FALSE)
dim(probes)
probes[1:5, ]
@

We've put one primer pair per row in this data frame.
The row names of the data frame contain the primer pair ids.
Also note that, as suggested by the col names, we've only put the probe
parts for each primer pair (the "tag" part has been removed, it can 
be ignored for the purpose of this exercise).

The first goal of this exercise is to write a function that, given
a probe pair and a subject (the PCR target), returns the amplicons
associated with this probe pair.

In order to acquire some expertise, we'll try to find the amplicon(s)
associated with the following arbitrary probe pair:
<<>>=
Fprobe0 <- DNAString("AGCTCCGAGTT")
Rprobe0 <- DNAString("CGTTGTTCACA")
@

We choose short probes on purpose so the risk to have several hits in several
chromosome is high. This is a good way to learn how to deal in a systematical
way with the multi hit problem!

\begin{enumerate}
\item Find all \Robject{Fprobe0} and \Robject{Rprobe0} hits in chromosome 3R.
      Remember that nothing prevents a probe to hit the strand that it was
      not designed for!
      Put the start positions of all the "plus hits" (i.e. the hits of
      \Robject{Fprobe0} and \Robject{Rprobe0} in the plus strand)
      in the \Robject{Phits} object (integer vector).
      Put the end positions of all the "minus hits" in the \Robject{Mhits}
      object (integer vector).
\item Ideally \Robject{Phits} and \Robject{Mhits} should be of length 1
      but it is not always the case. However the fact that we get
      several "plus hits" and/or "minus hits" doesn't necessarily mean
      that our original primer pair has a bad design.
      In fact, some of these hits don't prevent the PCR experiment to
      work properly: this is the case for any "plus hit" that is followed
      (from left to right) by another "plus hit" with no "minus hit" between.
      And similarly, it is the case for a "minus hit" that is preceded by
      another "minus hit" with no "plus hit" between.
      All these hits can be ignored.
      Write a function that takes 2 arguments, \Robject{Phits} and \Robject{Mhits},
      that removes all the above hits. The function will return a data frame
      with 2 columns, \Robject{Phits} and \Robject{Mhits}. Note that using
      a data frame now makes sense because after this removal, \Robject{Phits}
      and \Robject{Mhits} will always have the same length.
\item Use the previous function to "clean" \Robject{Phits} and \Robject{Mhits}.
      Compare your result with the "theoretical amplicons" returned by
      \Robject{matchProbePair(Fprobe0, Rprobe0, Dmelanogaster[["3R"]])}.
      From now we will use the \Rfunction{matchProbePair} function to achive this.
      Are all these "theoretical amplicons" realistic (PCR amplification is 
      supposed to work for regions up to 5k-10k bases).
      Would you still consider the \Robject{Fprobe0}/\Robject{Rprobe0} pair
      good for amplifying the region it was designed for?
\item Another problem that can prevent the PCR amplification from working
      properly is when a given primer pair produces more that 1 valid amplicon
      (e.g. <= 20k) over the whole genome.
      What about \Robject{Fprobe0}/\Robject{Rprobe0}?
\item Write a function that takes our probes library (the \Robject{probes} data frame)
      as input and produces a data frame similar to \Robject{probes} but with 3
      additional columns \Robject{chr}, \Robject{start} and \Robject{end} that
      contain the chromosome name/start/end of the amplicon found for each
      primer pair or NAs if 0 or more than 1 amplicon were found.
      Having NAs in a row means that the design of the primer pair is bad.
\end{enumerate}

% ---------------------------------------------------------------------------

\end{document}