\SweaveOpts{keep.source=TRUE} \documentclass{article} \usepackage{hyperref} \usepackage{theorem} \usepackage{underscore} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textsf{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\code}[1]{{\texttt{#1}}} \newcommand{\file}[1]{{\texttt{#1}}} \theoremstyle{break} \newtheorem{Ex}{Exercise} \begin{document} \title{Biostrings Lab (BioC2008)} \author{H. Pag\`es \\ Gentleman Lab \\ Fred Hutchinson Cancer Research Center \\ Seattle, WA} \date{\today} \maketitle % --------------------------------------------------------------------------- \section{Lab overview} \subsection{Goals} Learn the basics of the \Rpackage{Biostrings} package and how to use it together with the {\it BSgenome data packages} for different kinds of genome-wide sequence analyses. \subsection{Prerequisites} For this lab you need: \begin{itemize} \item A laptop with the latest release version of R (R 2.7 series). \item The following Bioconductor packages (installed with \Rfunction{biocLite()}): \begin{itemize} \item \Rpackage{Biostrings} \item \Rpackage{BSgenome} \item \Rpackage{hgu95av2probe} \item \Rpackage{drosophila2probe} \item \Rpackage{BSgenome.Hsapiens.UCSC.hg18} \item \Rpackage{BSgenome.Dmelanogaster.UCSC.dm3} \item \Rpackage{BSgenome.Celegans.UCSC.ce2} \item \Rpackage{SNPlocs.Hsapiens.dbSNP.20071016} \end{itemize} \item Some familiarity with R and object oriented programming. \end{itemize} \subsection{Useful links and additional resources} \begin{itemize} \item Bioconductor website: http://bioconductor.org/ \item This document and other material used for this lab (including solutions to the exercises): http://bioconductor.org/workshops/2008/BioC2008/labs/biostrings/ \item The Bioconductor mailing lists: http://bioconductor.org/docs/mailList.html \end{itemize} % --------------------------------------------------------------------------- \section{Introduction} \subsection{Biostrings} The \Rpackage{Biostrings} package provides: \begin{itemize} \item The infrastructure for representing and manipulating large nucleotide sequences (up to hundreds of millions of letters) in Bioconductor. This infrastructure consists of a class system (the \Rclass{DNAString}, \Rclass{DNAStringSet}, \Rclass{XStringViews} containers, etc...), plus a set of constructors for creating objects belonging to these classes, plus a set of accessor methods for querying or modifying the content of these objects. \item Utility functions for extracting basic sequence information (\Rfunction{alphabetFrequency}, \Rfunction{dinucleotideFrequency} functions, etc...), or for performing basic sequence transformations (\Rfunction{reverseComplement}, \Rfunction{chartr}, \Rfunction{injectHardMask} functions, etc...). \item String matching functions for finding all the occurrences (hits) of arbitrary motifs in a reference sequence (\Rfunction{matchPattern}, \Rfunction{matchProbePair}, \Rfunction{matchPWM}, \Rfunction{matchPDict} functions, etc...) or for finding palindromes or complemented palindromes in a reference sequence (\Rfunction{findPalindromes} and \Rfunction{findComplementedPalindromes} functions). \item String alignment functions (\Rfunction{pairwiseAlignment} function and related utilities). Not covered in this lab. \end{itemize} \subsection{BSgenome data packages} The Bioconductor project also provides a collection of {\it BSgenome data packages} (e.g. \Rpackage{BSgenome.Dmelanogaster.UCSC.dm3}). Each of them contains the full genomic sequence for a given organism. These sequences are stored in the containers defined in the \Rpackage{Biostrings} package so they are ready to be used (no format conversion needed). Because of this, these packages are also called {\it Biostrings-based genome data packages}. Regardless to the particular genome that they contain, all the {\it BSgenome data packages} packages are very similar and can be manipulated in a consistent and easy way. They all require the \Rpackage{BSgenome} software package in order to work properly. Note that these packages are big (from a few Mb to more than 800 Mb for \Rpackage{BSgenome.Hsapiens.UCSC.hg18}) so installating them over the network can take a long time. % --------------------------------------------------------------------------- \section{Checking your installation} \begin{Ex} \begin{enumerate} \item Start R and load the \Rpackage{Biostrings} package. \item Then load the \Rpackage{BSgenome} package and use the \Rfunction{available.genomes} function to check the list of {\it BSgenome data packages} that are available on the Bioconductor repositories for your version of R. \item Check that you have \Rpackage{BSgenome.Dmelanogaster.UCSC.dm3} (try to load it). \end{enumerate} \end{Ex} % --------------------------------------------------------------------------- \section{Basic containers} \subsection{DNAString objects} The \Rclass{DNAString} class is the basic container for storing a large nucleotide sequence. Unlike a standard character vector in R that can store an arbitrary number of strings, a \Rclass{DNAString} object can only contain 1 sequence. Like for most classes defined in \Rpackage{Biostrings}, \Rfunction{DNAString} is also the name of the constructor function for \Rclass{DNAString} objects. \begin{Ex} \begin{enumerate} \item Use the \Rfunction{DNAString} constructor to convert R character string \Robject{"aagtac"} into a \Rclass{DNAString} instance. \item Use \Rfunction{nchar} and \Rfunction{alphabetFrequency} on it. \item Get its reverse complement with \Rfunction{reverseComplement}. \item Extract an arbitrary substring with \Rfunction{subseq}. \end{enumerate} \end{Ex} \subsection{DNAStringSet objects} The \Rclass{DNAStringSet} class is the basic container for storing an arbitrary number of nucleotide sequences. Like with R character vectors (and any vector-like object in general), use \Rfunction{length} to get the number of sequences stored in a \Rclass{DNAStringSet} object and the subsetting operator \Rfunction{[} to subset it. In addition, the subsetting operator \Rfunction{[[} can be used to extract an arbitrary element as a \Rclass{DNAString} object. \begin{Ex} \begin{enumerate} \item Use the \Rfunction{DNAStringSet} constructor to convert R character vector \Robject{c("aagtac", "ccc", "gtt")} into a \Rclass{DNAStringSet} instance. Let's call this instance \Robject{x0}. \item Use \Rfunction{width} and \Rfunction{length} on \Robject{x0}. \item Use subsetting operator \Rfunction{[} to remove its 2nd element. \item How would you invert the order of its elements? \item Use subsetting operator \Rfunction{[[} to extract its 1st element as a \Rclass{DNAString} object. \item Use the \Rfunction{DNAStringSet} constructor to remove the last 2 nucleotides of each element. \item Try \Rfunction{alphabetFrequency} and \Rfunction{reverseComplement} on \Robject{x0}. \end{enumerate} \end{Ex} For the next exercise, you need the \Rpackage{hgu95av2probe} package. It contains all the probe sequences for microarray hgu95av2 from Affymetrix. \begin{Ex} \begin{enumerate} \item Load the \Rpackage{hgu95av2probe} package. \item Store the \Robject{sequence} variable of the \Robject{hgu95av2probe} object into a \Rclass{DNAStringSet} object (let's call this object \Robject{x1}). Display it. \item How many probes have a GC content of 80\% or more? \item What's the GC content for the entire microarray? \end{enumerate} \end{Ex} \subsection{XStringViews objects} An \Rclass{XStringViews} object contains a set of views on the same sequence called {\it the subject} (for example this can be a \Rclass{DNAString} object). Each view is defined by its start and end locations: both are integers such that start <= end. The \Rfunction{views} function can be used to create an \Rclass{XStringViews} object given a subject and a set of start and end locations. Like for \Rclass{DNAStringSet} objects, \Rfunction{length}, \Rfunction{width}, \Rfunction{[} and \Rfunction{[[} are available for \Rclass{XStringViews} objects. Additional \Rfunction{subject}, \Rfunction{start}, \Rfunction{end} and \Rfunction{gaps} methods are also provided. \begin{Ex} \begin{enumerate} \item Use the \Rfunction{views} function to create an \Rclass{XStringViews} object with a \Rclass{DNAString} subject. Make it such that some views are overlapping but also that the set of views don't cover the subject entirely. \item Try \Rfunction{subject}, \Rfunction{start}, \Rfunction{end} and \Rfunction{gaps} on this \Rclass{XStringViews} object. \item Try \Rfunction{alphabetFrequency} on it. \item Turn it into a \Rclass{DNAStringSet} object with the \Rfunction{DNAStringSet} constructor. \end{enumerate} \end{Ex} % --------------------------------------------------------------------------- \section{BSgenome data packages} The name of a {\it BSgenome data package} is made of 4 parts separated by a dot (e.g. \Rpackage{BSgenome.Celegans.UCSC.ce2}): \begin{itemize} \item The 1st part is always {\tt BSgenome}. \item The 2nd part is the name of the organism (abbreviated). \item The 3rd part is the name of the organisation who assembled the genome. \item The 4th part is the release string or number used by this organisation for this assembly of the genome. \end{itemize} All {\it BSgenome data package} contain a single top level object whose name matches the second part of the package name. \begin{Ex} \begin{enumerate} \item Load \Rpackage{BSgenome.Celegans.UCSC.ce2} and display its top level object. \item Display some of the chromosomes. Do they contain IUPAC extended letters? \item Load \Rpackage{BSgenome.Dmelanogaster.UCSC.dm3} (Fruit Fly genome). Do the chromosomes contain IUPAC extended letters? \item Put the first 5M bases from Fruit Fly chromosome 2L into a \Robject{DNAString} object and {\it slice it}, that is, create the set of views obtained by moving an imaginary windows of width 20 from the 5' end to the 3' end of the sequence and one nucleotide at a time. \item Use \Rfunction{chartr} to simulate a bisulfite transformation of chromosome 4. \end{enumerate} \end{Ex} Starting with Bioconductor 2.2 (the current release), some {\it BSgenome data packages} contain {\it masked sequences}, that is, chromosome sequences that have built-in masks defined on them (those masks are not active by default). More on this later... % --------------------------------------------------------------------------- \section{String matching} \subsection{The matchPattern function} This function finds all matches of a given pattern in a reference sequence called {\it the subject}. \begin{Ex} \begin{enumerate} \item Find all the matches of an arbitrary nucleotide sequence in Fruit Fly chromosome 2L. (Don't choose it too short!) \item In fact, if we don't take any special action, we only get the hits in the plus strand of the chromosome. Find the matches in the minus strand too. (Note: the cost of taking the reverse complement of an entire chromosome sequence can be high in terms of memory usage. Try to do something better.) \end{enumerate} \end{Ex} IUPAC extended letters can be used to express ambiguities in the pattern or in the subject of a search with \Rfunction{matchPattern}. This is controlled via the \Robject{fixed} argument of the function. If \Robject{fixed} is \Robject{TRUE} (the default), all letters in the pattern and the subject are interpreted litterally. If \Robject{fixed} is \Robject{FALSE}, IUPAC extended letters in the pattern and in the subject are interpreted as ambiguities e.g. {\tt M} will match {\tt A} or {\tt C} and {\tt N} will match any letter (the \Robject{IUPAC\_CODE\_MAP} named character vector gives the mapping between IUPAC letters and the set of nucleotides that they stand for). The most common use of this feature is to introduce wildcards in the pattern by replacing some of its letters with {\tt N}s. \begin{Ex} \begin{enumerate} \item Search pattern {\tt GAACTTTGCCACTC} in Fruit Fly chromosome 2L. \item Repeat but this time allow the 2nd {\tt T} in the pattern (6th letter) to match anything. Anything wrong? \item Call \Rfunction{matchPattern} with \Robject{fixed="subject"} to work around this problem. \end{enumerate} \end{Ex} \subsection{Masking} Starting with Bioconductor 2.2, some {\it BSgenome data packages} contain {\it masked sequences}, that is, chromosome sequences that have built-in masks defined on them. There are 3 built-in masks per sequence: \begin{enumerate} \item the mask of assembly gaps, \item the mask of repeat regions (as reported by RepeatMasker), \item and the mask of tandem repeats with a period <= 12 (as reported by Tandem Repeats Finder). \end{enumerate} Note that each of these masks corresponds to a {\it track} in the UCSC genome browser. Each masked sequence is stored in a new container: the \Rclass{MaskedDNAString} class. Use the \Rfunction{unmasked} accessor to turn a \Rclass{MaskedDNAString} object into a \Rclass{DNAString} object (the mask information will be dropped). Use the \Rfunction{masks} accessor to extract the mask information (the sequence that is masked will be dropped). The built-in masks are NOT active by default. They can be activated individually with: <>= library(BSgenome.Hsapiens.UCSC.hg18) @ <<>>= chr1 <- Hsapiens$chr1 active(masks(chr1))[3] <- TRUE # activate Tandem Repeats Finder mask @ or all together with: <<>>= active(masks(chr1)) <- TRUE # activate all the masks @ Some functions in \Rpackage{Biostrings} like \Rfunction{alphabetFrequency} or the string matching functions will skip the masked region when walking along a sequence with active masks. \begin{Ex} \begin{enumerate} \item Activate the mask of assembly gaps in Human chromosome 1. What percentage of the sequence is masked? \item Check the alphabet frequency of this masked sequence (compare with non-masked chromosome 1). \item Activate all masks in Human chromosome 1 and find the occurences of an arbitrary DNA pattern in it. Compare to what you get with non-masked chromosome 1. \end{enumerate} \end{Ex} In addition to the built-in masks, the user can put its own mask on a sequence. Two types of user-controlled masking are supported: {\it by content} or {\it by position}. The \Rfunction{maskMotif} function will mask the regions of a sequence that contain a motif specified by the user. The \Rfunction{Mask} constructor will return the mask made of the regions defined by the start and end locations specified by the user (like with the \Rfunction{views} function). \begin{Ex} \begin{enumerate} \item Activate the mask of assembly gaps in Human chromosome 2 and check its alphabet frequency. Do you see something unexpected? \item Use \code{as(chr2 , "XStringViews")} to turn this masked sequence into an \Rclass{XStringViews} object. Do you see the unmasked Ns? \item Which contig do they belong to? \item Have a look at the inter-view gaps of this \Rclass{XStringViews} object. \end{enumerate} \end{Ex} \subsection{Finding palindromes} The \Rfunction{findPalindromes} function finds palindromic regions in a sequence. The palindromes that will be detected are either strict palindromes or 2-arm palindromes i.e. palindromes where the 2 arms are separated by an arbitrary sequence called "the loop". The \Rfunction{findComplementedPalindromes} function finds complemented palindromes i.e. palindromes where the 2 arms are reverse-complementary sequences from each other. \begin{Ex} \begin{enumerate} \item Find all the complemented palindromes with arms of at least 30 bp and a loop of at most 10 bp in Fruit Fly chromosome X. \item Try to do the same with Human chromosome 2. Which palindromic regions contain the 4 bases? \end{enumerate} \end{Ex} Note that \Rfunction{findPalindromes} and \Rfunction{findComplementedPalindromes} are still a work-in-progress: they don't support the \code{max.looplength} and \code{max.mismatch} args yet and they don't return a ``clean'' set of palindromic regions (the same palindromic region is reported several times). These issues need to be addressed in future versions of \Rpackage{Biostrings}. \subsection{Finding the hits of a large set of short motifs} The set of short reads that one gets from an ultra-high throughput sequencing experiments can be very large (typically a few millions of 35-mers with a Solexa machine) so we need fast algorithms if we want to be able to {\it align} all those reads to the reference genome in reasonable time. Starting with Bioconductor 2.2, this can be done with the \Rfunction{matchPDict} function from the \Rpackage{Biostrings} package. Note that this new functionality is currently under active development in the devel version of Bioconductor (2.3) that will be released in the fall of 2008. If you are planning to use Bioconductor for processing ultra-high throughput sequencing data, you are invited to check the \Rpackage{ShortRead} package (not released yet, available only in BioC devel) and to subscribe to the bioc-sig-sequencing mailing list (see http://bioconductor.org/docs/mailList.html for the details). The workflow with \Rfunction{matchPDict} is the following: \begin{enumerate} \item Preprocess the set of short reads with the \Rfunction{PDict} constructor. \item Call \Rfunction{matchPDict} on it. \item Query the \Rclass{MIndex} object returned by \Rfunction{matchPDict}. \end{enumerate} \begin{Ex} In this exercise we want to find the hits of all the probes of chip hgu95av2 in Human chromosome 1. For now, ``hit'' means ``location of an exact match''. \begin{enumerate} \item Load the {\it BSgenome data package} for Human (hg18 from UCSC). \item Load the \Rpackage{hgu95av2probe} package and put the probe sequences in a \Robject{DNAStringSet} object (let's call this object \Robject{dict0}). \item Preprocess \Robject{dict0} with the \Rfunction{PDict} constructor. \item Call \Rfunction{matchPDict} on this \Rclass{PDict} object (1st arg) and Human chromosome 1 (2nd arg). Put the result in a variable called \Robject{mindex} (\Rfunction{matchPDict} returns an \Rclass{MIndex} object). \item Use \Rfunction{countIndex} on \Robject{mindex} to extract the count of hits. \item Which probe has the highest number of hits? Display those hits as an \Rclass{XStringViews} object. \end{enumerate} \end{Ex} % --------------------------------------------------------------------------- \section{SNP injection} When the patterns to match against a genome of reference come from a sequencing experiment then they will contain the SNPs of the individual that was used for the experiment. In this case looking for exact matches will miss a lot of hits, not only because of the sequencing errors inherent to the sequencing technology, but because of the presence of SNPs. This second type of hit loss can be partially avoided by {\it injecting} the known SNPs in the reference genome, that is, by replacing the non-ambiguous letter by an IUPAC ambiguity letter at each SNP location in the reference genome. Then, when \Rfunction{matchPattern} or \Rfunction{matchPDict} are used on this modified genome and with \code{fixed=FALSE}, hits that contain a known SNP will be found too. This provides a way to detect known SNPs in the individual. Starting with Bioconductor 2.2, one {\it SNPlocs} package is available: the \Rpackage{SNPlocs.Hsapiens.dbSNP.20071016} package. It contains the locations of all known SNPs for Human (extracted from dbSNP) together with the alleles information (an IUPAC ambiguity letter for each SNP). It is aimed to be used in conjonction with the \Rpackage{BSgenome.Hsapiens.UCSC.hg18} package (more on this below). Note that the {\it SNPlocs} packages are a new type of annotation packages and more packages can be added in the future if they are found useful. The \Rfunction{available.SNPs} function (from the \Rpackage{BSgenome} software package) will allow you to check the list of {\it SNPlocs packages} that are available on the Bioconductor repositories for your version of R. It's easy to {\it inject} SNPs in the reference genome. For example, with the Human genome: <<>>= library(SNPlocs.Hsapiens.dbSNP.20071016) hg18snp <- injectSNPs(Hsapiens, 'SNPlocs.Hsapiens.dbSNP.20071016') @ The resulting \Robject{hg18snp} object is a modified version of the original genome (the \Robject{Hsapiens} object) where IUPAC ambiguity letters have been injected in the chromosome sequences at the SNP locations. You can display and use \Robject{hg18snp} exactly in the same way that you use \Robject{Hsapiens} (both are \Rclass{BSgenome} objects). \begin{Ex} In this exercise, we reuse the \Rclass{PDict} object obtained by preprocessing the probe sequences of the \Rpackage{hgu95av2probe} package. \begin{enumerate} \item Inject the SNPs from \Rpackage{SNPlocs.Hsapiens.dbSNP.20071016} in the hg18 genome and display the resulting \Rclass{BSgenome} object (note the additional line \code{"with SNPs injected from..."}). \item Load the modified sequence for chromosome 1 and look at its alphabet frequency (compare with the original chromosome 1). \item Mask the assembly gaps in this modified chromosome 1 and look at its alphabet frequency again. \item Use \Rfunction{countPDict} with \code{fixed="pattern"} to count the nb of hits for all the hgu95av2 probes. \item Try to ``see'' some hits with SNPs. \end{enumerate} \end{Ex} % --------------------------------------------------------------------------- \section{Genome-wide analysis} So far you've written code for performing single chromosome analyses. Most of the time, your analysis will need to be extended to the entire genome (plus and minus strands of all chromosomes) so your code will need to be adapted. For example here is some simple code that finds all the hits of the drosophila2 probes (Affymetrix microarray) in the Fruit Fly genome: <>= walkGenome <- function(bsgenome, seqnames, pdict, objname_suffix) { for (seqname in seqnames) { objname <- paste(seqname, objname_suffix, sep="") filename <- paste(objname, ".rda", sep="") cat("Walking ", seqname, " (will save hits ", "in ", filename, " file)... ", sep="") subject <- unmasked(bsgenome[[seqname]]) hits <- matchPDict(pdict, subject) rm(subject) unload(bsgenome, seqname) assign(objname, hits) save(list=objname, file=filename) rm(hits, list=objname) cat("OK\n") } } library(drosophila2probe) dict0 <- DNAStringSet(drosophila2probe$sequence) library(BSgenome.Dmelanogaster.UCSC.dm3) SEQNAMES <- seqnames(Dmelanogaster) ### Find hits in the + strands pdict <- PDict(dict0) walkGenome(Dmelanogaster, SEQNAMES, pdict, "phits") ### Find hits in the - strands rc_dict0 <- reverseComplement(dict0) pdict <- PDict(rc_dict0) walkGenome(Dmelanogaster, SEQNAMES, pdict, "mhits") @ Writing code that performs genome-wide analysis is described in the GenomeSearching vignette from the \Rpackage{BSgenome} software package. Note that this vignette is still a work-in-progress (feedback is always appreciated). \end{document}