\SweaveOpts{keep.source=TRUE}
\documentclass{article}

\usepackage{hyperref}
\usepackage{theorem}
\usepackage{underscore}

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textsf{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}

\newcommand{\code}[1]{{\texttt{#1}}}
\newcommand{\file}[1]{{\texttt{#1}}}

\theoremstyle{break} \newtheorem{Ex}{Exercise}

\begin{document}

\title{Biostrings Lab (BioC2008)}

\author{H. Pag\`es \\ 
  Gentleman Lab \\
  Fred Hutchinson Cancer Research Center \\
  Seattle, WA}

\date{\today}
\maketitle


% ---------------------------------------------------------------------------

\section{Lab overview}

\subsection{Goals}

  Learn the basics of the \Rpackage{Biostrings} package and how to use it
  together with the {\it BSgenome data packages} for different kinds
  of genome-wide sequence analyses.


\subsection{Prerequisites}

  For this lab you need:
  \begin{itemize}
    \item A laptop with the latest release version of R (R 2.7 series).
    \item The following Bioconductor packages (installed with
          \Rfunction{biocLite()}):
          \begin{itemize}
          \item \Rpackage{Biostrings}
          \item \Rpackage{BSgenome}
          \item \Rpackage{hgu95av2probe}
          \item \Rpackage{drosophila2probe}
          \item \Rpackage{BSgenome.Hsapiens.UCSC.hg18}
          \item \Rpackage{BSgenome.Dmelanogaster.UCSC.dm3}
          \item \Rpackage{BSgenome.Celegans.UCSC.ce2}
          \item \Rpackage{SNPlocs.Hsapiens.dbSNP.20071016}
          \end{itemize}
    \item Some familiarity with R and object oriented programming.
  \end{itemize}

\subsection{Useful links and additional resources}

  \begin{itemize}
    \item Bioconductor website: http://bioconductor.org/
    \item This document and other material used for this lab (including solutions
          to the exercises):
          http://bioconductor.org/workshops/2008/BioC2008/labs/biostrings/
    \item The Bioconductor mailing lists: http://bioconductor.org/docs/mailList.html
  \end{itemize}


% ---------------------------------------------------------------------------

\section{Introduction}

\subsection{Biostrings}

  The \Rpackage{Biostrings} package provides:
  \begin{itemize}
  \item The infrastructure for representing and manipulating large nucleotide
        sequences (up to hundreds of millions of letters) in Bioconductor.
        This infrastructure consists of a class system (the \Rclass{DNAString},
        \Rclass{DNAStringSet}, \Rclass{XStringViews} containers, etc...),
        plus a set of constructors for creating objects belonging to these classes,
        plus a set of accessor methods for querying or modifying the content
        of these objects.
  \item Utility functions for extracting basic sequence information
        (\Rfunction{alphabetFrequency}, \Rfunction{dinucleotideFrequency}
        functions, etc...), or for performing basic sequence transformations
        (\Rfunction{reverseComplement}, \Rfunction{chartr},
        \Rfunction{injectHardMask} functions, etc...).
  \item String matching functions for finding all the occurrences (hits) of
        arbitrary motifs in a reference sequence
        (\Rfunction{matchPattern}, \Rfunction{matchProbePair},
        \Rfunction{matchPWM}, \Rfunction{matchPDict} functions, etc...)
        or for finding palindromes or complemented palindromes in a reference
        sequence (\Rfunction{findPalindromes} and
        \Rfunction{findComplementedPalindromes} functions).
  \item String alignment functions (\Rfunction{pairwiseAlignment} function
        and related utilities). Not covered in this lab.
  \end{itemize}

\subsection{BSgenome data packages}

  The Bioconductor project also provides a collection of {\it BSgenome data
  packages} (e.g. \Rpackage{BSgenome.Dmelanogaster.UCSC.dm3}).
  Each of them contains the full genomic sequence for a given organism.
  These sequences are stored in the containers defined in the
  \Rpackage{Biostrings} package so they are ready to be used (no format
  conversion needed). Because of this, these packages are also called
  {\it Biostrings-based genome data packages}.
  Regardless to the particular genome that they contain, all the {\it BSgenome
  data packages} packages are very similar and can be manipulated in a consistent
  and easy way. They all require the \Rpackage{BSgenome} software package in order
  to work properly.

  Note that these packages are big (from a few Mb to more than 800 Mb for
  \Rpackage{BSgenome.Hsapiens.UCSC.hg18}) so installating them over the network
  can take a long time.


% ---------------------------------------------------------------------------

\section{Checking your installation}

\begin{Ex}
  \begin{enumerate}
  \item Start R and load the \Rpackage{Biostrings} package.
  \item Then load the \Rpackage{BSgenome} package and use the
        \Rfunction{available.genomes} function to check the list of
        {\it BSgenome data packages} that are available on the
        Bioconductor repositories for your version of R.
  \item Check that you have \Rpackage{BSgenome.Dmelanogaster.UCSC.dm3}
        (try to load it).
  \end{enumerate}
\end{Ex}


% ---------------------------------------------------------------------------

\section{Basic containers}

\subsection{DNAString objects}

The \Rclass{DNAString} class is the basic container for storing a large
nucleotide sequence. Unlike a standard character vector in R that can store
an arbitrary number of strings, a \Rclass{DNAString} object can only contain
1 sequence.
Like for most classes defined in \Rpackage{Biostrings}, \Rfunction{DNAString}
is also the name of the constructor function for \Rclass{DNAString} objects.

\begin{Ex}
\begin{enumerate}
\item Use the \Rfunction{DNAString} constructor to convert R character
      string \Robject{"aagtac"} into a \Rclass{DNAString} instance.
\item Use \Rfunction{nchar} and \Rfunction{alphabetFrequency} on it.
\item Get its reverse complement with \Rfunction{reverseComplement}.
\item Extract an arbitrary substring with \Rfunction{subseq}.
\end{enumerate}
\end{Ex}

\subsection{DNAStringSet objects}

The \Rclass{DNAStringSet} class is the basic container for storing an
arbitrary number of nucleotide sequences.
Like with R character vectors (and any vector-like object in general),
use \Rfunction{length} to get the number of sequences stored in a
\Rclass{DNAStringSet} object and the subsetting operator \Rfunction{[}
to subset it.
In addition, the subsetting operator \Rfunction{[[} can be used to
extract an arbitrary element as a \Rclass{DNAString} object.

\begin{Ex}
\begin{enumerate}
\item Use the \Rfunction{DNAStringSet} constructor to convert R character
      vector \Robject{c("aagtac", "ccc", "gtt")} into a \Rclass{DNAStringSet}
      instance. Let's call this instance \Robject{x0}.
\item Use \Rfunction{width} and \Rfunction{length} on \Robject{x0}.
\item Use subsetting operator \Rfunction{[} to remove its 2nd element.
\item How would you invert the order of its elements?
\item Use subsetting operator \Rfunction{[[} to extract its 1st element
      as a \Rclass{DNAString} object.
\item Use the \Rfunction{DNAStringSet} constructor to remove the last 2
      nucleotides of each element.
\item Try \Rfunction{alphabetFrequency} and \Rfunction{reverseComplement}
      on \Robject{x0}.
\end{enumerate}
\end{Ex}

For the next exercise, you need the \Rpackage{hgu95av2probe} package.
It contains all the probe sequences for microarray hgu95av2 from
Affymetrix.

\begin{Ex}
\begin{enumerate}
\item Load the \Rpackage{hgu95av2probe} package.
\item Store the \Robject{sequence} variable of the \Robject{hgu95av2probe}
      object into a \Rclass{DNAStringSet} object (let's call this object
      \Robject{x1}). Display it.
\item How many probes have a GC content of 80\% or more?
\item What's the GC content for the entire microarray?
\end{enumerate}
\end{Ex}

\subsection{XStringViews objects}

An \Rclass{XStringViews} object contains a set of views on the same sequence
called {\it the subject} (for example this can be a \Rclass{DNAString} object).
Each view is defined by its start and end locations: both are integers
such that start <= end.
The \Rfunction{views} function can be used to create an \Rclass{XStringViews}
object given a subject and a set of start and end locations.
Like for \Rclass{DNAStringSet} objects, \Rfunction{length}, \Rfunction{width},
\Rfunction{[} and \Rfunction{[[} are available for \Rclass{XStringViews} objects.
Additional \Rfunction{subject}, \Rfunction{start}, \Rfunction{end} and
\Rfunction{gaps} methods are also provided.

\begin{Ex}
\begin{enumerate}
\item Use the \Rfunction{views} function to create an \Rclass{XStringViews}
      object with a \Rclass{DNAString} subject. Make it such that some
      views are overlapping but also that the set of views don't cover
      the subject entirely.
\item Try \Rfunction{subject}, \Rfunction{start}, \Rfunction{end}
      and \Rfunction{gaps} on this \Rclass{XStringViews} object.
\item Try \Rfunction{alphabetFrequency} on it.
\item Turn it into a \Rclass{DNAStringSet} object with the
      \Rfunction{DNAStringSet} constructor.
\end{enumerate}
\end{Ex}


% ---------------------------------------------------------------------------

\section{BSgenome data packages}

The name of a {\it BSgenome data package} is made of 4 parts separated by
a dot (e.g. \Rpackage{BSgenome.Celegans.UCSC.ce2}):
\begin{itemize}
\item The 1st part is always {\tt BSgenome}.
\item The 2nd part is the name of the organism (abbreviated).
\item The 3rd part is the name of the organisation who assembled the genome.
\item The 4th part is the release string or number used by this organisation
      for this assembly of the genome.
\end{itemize}

All {\it BSgenome data package} contain a single top level object whose name
matches the second part of the package name.

\begin{Ex}
\begin{enumerate}
\item Load \Rpackage{BSgenome.Celegans.UCSC.ce2} and display its top level
      object.
\item Display some of the chromosomes. Do they contain IUPAC extended
      letters?
\item Load \Rpackage{BSgenome.Dmelanogaster.UCSC.dm3} (Fruit Fly genome).
      Do the chromosomes contain IUPAC extended letters?
\item Put the first 5M bases from Fruit Fly chromosome 2L into a
      \Robject{DNAString} object and {\it slice it}, that is, create
      the set of views obtained by moving an imaginary windows of width
      20 from the 5' end to the 3' end of the sequence and one nucleotide
      at a time.
\item Use \Rfunction{chartr} to simulate a bisulfite transformation of
      chromosome 4.
\end{enumerate}
\end{Ex}

Starting with Bioconductor 2.2 (the current release), some {\it BSgenome
data packages} contain {\it masked sequences}, that is, chromosome
sequences that have built-in masks defined on them (those masks are
not active by default). More on this later...


% ---------------------------------------------------------------------------

\section{String matching}

\subsection{The matchPattern function}

This function finds all matches of a given pattern in a reference sequence
called {\it the subject}.

\begin{Ex}
\begin{enumerate}
\item Find all the matches of an arbitrary nucleotide sequence in Fruit Fly
      chromosome 2L. (Don't choose it too short!)
\item In fact, if we don't take any special action, we only get the hits in
      the plus strand of the chromosome. Find the matches in the minus
      strand too. (Note: the cost of taking the reverse complement of an
      entire chromosome sequence can be high in terms of memory usage.
      Try to do something better.)
\end{enumerate}
\end{Ex}

IUPAC extended letters can be used to express ambiguities in the pattern
or in the subject of a search with \Rfunction{matchPattern}. This is
controlled via the \Robject{fixed} argument of the function.
If \Robject{fixed} is \Robject{TRUE} (the default), all letters in the
pattern and the subject are interpreted litterally.
If \Robject{fixed} is \Robject{FALSE}, IUPAC extended letters in the
pattern and in the subject are interpreted as ambiguities e.g. {\tt M}
will match {\tt A} or {\tt C} and {\tt N} will match any letter
(the \Robject{IUPAC\_CODE\_MAP} named character vector gives the mapping
between IUPAC letters and the set of nucleotides that they stand for).
The most common use of this feature is to introduce wildcards in the
pattern by replacing some of its letters with {\tt N}s.

\begin{Ex}
\begin{enumerate}
\item Search pattern {\tt GAACTTTGCCACTC} in Fruit Fly chromosome 2L.
\item Repeat but this time allow the 2nd {\tt T} in the pattern (6th letter) to match
      anything. Anything wrong?
\item Call \Rfunction{matchPattern} with \Robject{fixed="subject"} to work
      around this problem.
\end{enumerate}
\end{Ex}

\subsection{Masking}

Starting with Bioconductor 2.2, some {\it BSgenome data packages} contain
{\it masked sequences}, that is, chromosome sequences that have built-in
masks defined on them. There are 3 built-in masks per sequence:
\begin{enumerate}
\item the mask of assembly gaps,
\item the mask of repeat regions (as reported by RepeatMasker),
\item and the mask of tandem repeats with a period <= 12 (as reported by
      Tandem Repeats Finder).
\end{enumerate}
Note that each of these masks corresponds to a {\it track} in the UCSC genome
browser.

Each masked sequence is stored in a new container: the \Rclass{MaskedDNAString}
class. Use the \Rfunction{unmasked} accessor to turn a \Rclass{MaskedDNAString}
object into a \Rclass{DNAString} object (the mask information will be dropped).
Use the \Rfunction{masks} accessor to extract the mask information (the
sequence that is masked will be dropped).
The built-in masks are NOT active by default. They can be activated
individually with:
<<echo=FALSE, results=hide>>=
library(BSgenome.Hsapiens.UCSC.hg18)
@
<<>>=
chr1 <- Hsapiens$chr1
active(masks(chr1))[3] <- TRUE # activate Tandem Repeats Finder mask
@
or all together with:
<<>>=
active(masks(chr1)) <- TRUE # activate all the masks
@

Some functions in \Rpackage{Biostrings} like \Rfunction{alphabetFrequency}
or the string matching functions will skip the masked region when walking
along a sequence with active masks.

\begin{Ex}
\begin{enumerate}
\item Activate the mask of assembly gaps in Human chromosome 1.
      What percentage of the sequence is masked?
\item Check the alphabet frequency of this masked sequence
      (compare with non-masked chromosome 1).
\item Activate all masks in Human chromosome 1 and find the occurences of
      an arbitrary DNA pattern in it. Compare to what you get with non-masked
      chromosome 1.
\end{enumerate}
\end{Ex}

In addition to the built-in masks, the user can put its own mask on
a sequence. Two types of user-controlled masking are supported:
{\it by content} or {\it by position}.
The \Rfunction{maskMotif} function will mask the regions of a
sequence that contain a motif specified by the user.
The \Rfunction{Mask} constructor will return the mask made of
the regions defined by the start and end locations specified by the
user (like with the \Rfunction{views} function).

\begin{Ex}
\begin{enumerate}
\item Activate the mask of assembly gaps in Human chromosome 2
      and check its alphabet frequency. Do you see something unexpected?
\item Use \code{as(chr2 , "XStringViews")} to turn this masked sequence
      into an \Rclass{XStringViews} object. Do you see the unmasked Ns?
\item Which contig do they belong to?
\item Have a look at the inter-view gaps of this \Rclass{XStringViews}
      object.
\end{enumerate}
\end{Ex}

\subsection{Finding palindromes}

The \Rfunction{findPalindromes} function finds palindromic regions in a
sequence. The palindromes that will be detected are either strict
palindromes or 2-arm palindromes i.e. palindromes where the 2 arms
are separated by an arbitrary sequence called "the loop".
The \Rfunction{findComplementedPalindromes} function finds complemented
palindromes i.e. palindromes where the 2 arms are reverse-complementary
sequences from each other.

\begin{Ex}
\begin{enumerate}
\item Find all the complemented palindromes with arms of at least 30 bp
      and a loop of at most 10 bp in Fruit Fly chromosome X.
\item Try to do the same with Human chromosome 2. Which palindromic
      regions contain the 4 bases?
\end{enumerate}
\end{Ex}

Note that \Rfunction{findPalindromes} and
\Rfunction{findComplementedPalindromes} are still a work-in-progress:
they don't support the \code{max.looplength} and \code{max.mismatch}
args yet and they don't return a ``clean'' set of palindromic regions
(the same palindromic region is reported several times).
These issues need to be addressed in future versions of
\Rpackage{Biostrings}.

\subsection{Finding the hits of a large set of short motifs}

The set of short reads that one gets from an ultra-high throughput
sequencing experiments can be very large (typically a few millions
of 35-mers with a Solexa machine) so we need fast algorithms if we
want to be able to {\it align} all those reads to the reference
genome in reasonable time.

Starting with Bioconductor 2.2, this can be done with the
\Rfunction{matchPDict} function from the \Rpackage{Biostrings}
package. Note that this new functionality is currently under
active development in the devel version of Bioconductor (2.3)
that will be released in the fall of 2008. If you are planning
to use Bioconductor for processing ultra-high throughput sequencing
data, you are invited to check the \Rpackage{ShortRead} package
(not released yet, available only in BioC devel) and to
subscribe to the bioc-sig-sequencing mailing list (see
http://bioconductor.org/docs/mailList.html for the details).

The workflow with \Rfunction{matchPDict} is the following:
\begin{enumerate}
\item Preprocess the set of short reads with the \Rfunction{PDict}
      constructor.
\item Call \Rfunction{matchPDict} on it.
\item Query the \Rclass{MIndex} object returned by \Rfunction{matchPDict}.
\end{enumerate}

\begin{Ex}
In this exercise we want to find the hits of all the probes of chip
hgu95av2 in Human chromosome 1. For now, ``hit'' means ``location of
an exact match''.
\begin{enumerate}
\item Load the {\it BSgenome data package} for Human (hg18 from UCSC).
\item Load the \Rpackage{hgu95av2probe} package and put the probe
      sequences in a \Robject{DNAStringSet} object (let's call this object
      \Robject{dict0}).
\item Preprocess \Robject{dict0} with the \Rfunction{PDict} constructor.
\item Call \Rfunction{matchPDict} on this \Rclass{PDict} object (1st arg)
      and Human chromosome 1 (2nd arg). Put the result in a variable
      called \Robject{mindex} (\Rfunction{matchPDict} returns an
      \Rclass{MIndex} object).
\item Use \Rfunction{countIndex} on \Robject{mindex} to extract the count
      of hits.
\item Which probe has the highest number of hits? Display those hits as
      an \Rclass{XStringViews} object.
\end{enumerate}
\end{Ex}


% ---------------------------------------------------------------------------

\section{SNP injection}

When the patterns to match against a genome of reference come from
a sequencing experiment then they will contain the SNPs of the
individual that was used for the experiment.
In this case looking for exact matches will miss a lot of hits,
not only because of the sequencing errors inherent to the sequencing
technology, but because of the presence of SNPs.

This second type of hit loss can be partially avoided by {\it injecting}
the known SNPs in the reference genome, that is, by replacing the
non-ambiguous letter by an IUPAC ambiguity letter at each SNP location
in the reference genome.
Then, when \Rfunction{matchPattern} or \Rfunction{matchPDict} are
used on this modified genome and with \code{fixed=FALSE}, hits that
contain a known SNP will be found too.
This provides a way to detect known SNPs in the individual.

Starting with Bioconductor 2.2, one {\it SNPlocs} package is available:
the \Rpackage{SNPlocs.Hsapiens.dbSNP.20071016} package.
It contains the locations of all known SNPs for Human (extracted from
dbSNP) together with the alleles information (an IUPAC ambiguity letter
for each SNP). It is aimed to be used in conjonction with the
\Rpackage{BSgenome.Hsapiens.UCSC.hg18} package (more on this below).

Note that the {\it SNPlocs} packages are a new type of annotation packages
and more packages can be added in the future if they are found useful.
The \Rfunction{available.SNPs} function (from the \Rpackage{BSgenome}
software package) will allow you to check the list of {\it SNPlocs packages}
that are available on the Bioconductor repositories for your version of R.

It's easy to {\it inject} SNPs in the reference genome. For example, with
the Human genome:
<<>>=
library(SNPlocs.Hsapiens.dbSNP.20071016)
hg18snp <- injectSNPs(Hsapiens, 'SNPlocs.Hsapiens.dbSNP.20071016')
@
The resulting \Robject{hg18snp} object is a modified version of the
original genome (the \Robject{Hsapiens} object) where IUPAC ambiguity
letters have been injected in the chromosome sequences at the SNP
locations. You can display and use \Robject{hg18snp} exactly in the
same way that you use \Robject{Hsapiens} (both are \Rclass{BSgenome}
objects).

\begin{Ex}
In this exercise, we reuse the \Rclass{PDict} object obtained by
preprocessing the probe sequences of the \Rpackage{hgu95av2probe}
package.
\begin{enumerate}
\item Inject the SNPs from \Rpackage{SNPlocs.Hsapiens.dbSNP.20071016}
      in the hg18 genome and display the resulting \Rclass{BSgenome}
      object (note the additional line \code{"with SNPs injected
      from..."}).
\item Load the modified sequence for chromosome 1 and look at its
      alphabet frequency (compare with the original chromosome 1).
\item Mask the assembly gaps in this modified chromosome 1 and
      look at its alphabet frequency again.
\item Use \Rfunction{countPDict} with \code{fixed="pattern"}
      to count the nb of hits for all the hgu95av2 probes.
\item Try to ``see'' some hits with SNPs.
\end{enumerate}
\end{Ex}


% ---------------------------------------------------------------------------

\section{Genome-wide analysis}

So far you've written code for performing single chromosome analyses.
Most of the time, your analysis will need to be extended to the entire
genome (plus and minus strands of all chromosomes) so your code will need
to be adapted.

For example here is some simple code that finds all the hits of the
drosophila2 probes (Affymetrix microarray) in the Fruit Fly genome:
<<eval=false>>=
walkGenome <- function(bsgenome, seqnames, pdict, objname_suffix)
{
    for (seqname in seqnames) {
        objname <- paste(seqname, objname_suffix, sep="")
        filename <- paste(objname, ".rda", sep="")
        cat("Walking ", seqname, " (will save hits ",
            "in ", filename, " file)... ", sep="")
        subject <- unmasked(bsgenome[[seqname]])
        hits <- matchPDict(pdict, subject)
        rm(subject)
        unload(bsgenome, seqname)
        assign(objname, hits)
        save(list=objname, file=filename)
        rm(hits, list=objname)
        cat("OK\n")
    }
}

library(drosophila2probe)
dict0 <- DNAStringSet(drosophila2probe$sequence)
library(BSgenome.Dmelanogaster.UCSC.dm3)
SEQNAMES <- seqnames(Dmelanogaster)

### Find hits in the + strands
pdict <- PDict(dict0)
walkGenome(Dmelanogaster, SEQNAMES, pdict, "phits")

### Find hits in the - strands
rc_dict0 <- reverseComplement(dict0)
pdict <- PDict(rc_dict0)
walkGenome(Dmelanogaster, SEQNAMES, pdict, "mhits")
@

Writing code that performs genome-wide analysis is described in the
GenomeSearching vignette from the \Rpackage{BSgenome} software package.
Note that this vignette is still a work-in-progress (feedback is always
appreciated).

\end{document}