\documentclass{article} \usepackage{mathptmx} \usepackage{amsmath} \usepackage{graphicx} \usepackage{float} \usepackage{epsfig} %% \usepackage{semcolor} \usepackage[pdftex,colorlinks=false,pdfborder=0 0 0,pageanchor=false,bookmarks=false,urlcolor=black]{hyperref} \pdfpagewidth=11truein \pdfpageheight=8.5truein \pdfhorigin=1truein % default value(?), but doesn't work without \pdfvorigin=1truein % default value(?), but doesn't work without %% \newcommand{\fs}{\fromSlide} %% \newcommand{\bs}{\begin{slide}} %% \newcommand{\es}{\end{slide}} %% \slideframe{none} \setcounter{secnumdepth}{0} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textsf{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\R}{\textsf{R}} \newcommand{\term}[1]{\emph{#1}} \newcommand{\mref}[2]{\htmladdnormallinkfoot{#2}{#1}} \begin{document} \author{H. Pag\`es \\ Gentleman Lab \\ Fred Hutchinson Cancer Research Center \\ Seattle, WA} \title{Biostrings Lab (BioC2007)} \date{6 August, 2007} \maketitle % --------------------------------------------------------------------------- \section{Introduction} The \Rpackage{Biostrings} package: provides the infrastructure for storing and manipulating large nucleotide and amino acid sequences (up to hundreds of millions of letters) in {\R}. We need {\R} version 2.6 and \Rpackage{Biostrings} version 2.5.19 for this lab: <<>>= library(Biostrings) sessionInfo() @ The \Rpackage{BSgenome.*} packages: data packages containing the full genome sequence for a given organism. Ex: \Rpackage{BSgenome.Celegans.UCSC.ce2}, \Rpackage{BSgenome.Hsapiens.UCSC.hg18}, etc... Use the \Rfunction{available.genomes()} function from the \Rpackage{BSgenome} package to see the list of available \Rpackage{BSgenome.*} packages (network access required): <<>>= library(BSgenome) available.genomes() @ Then install with \Rfunction{biocLite()}: <>= source("http://bioconductor.org/biocLite.R") biocLite("BSgenome.Dmelanogaster.FlyBase.r51") @ Note that those \Rpackage{BSgenome.*} packages are big (from a few Mb to more than 800 Mb for \Rpackage{BSgenome.Hsapiens.UCSC.hg18}) so this installation over the network can take a long time. % --------------------------------------------------------------------------- \section{\Rclass{BString} objects} The \Rclass{BString} class is the basic container for big sequences. Unlike standard character vectors in {\R} that can store an arbitrary number of strings, a \Rclass{BString} object can only contain 1 string. For sequences commonly found in biology, 3 variants of the \Rclass{BString} class are available: \begin{enumerate} \item The \Rclass{DNASstring} class for storing a DNA sequence. \item The \Rclass{RNAString} class for storing a RNA sequence. \item The \Rclass{AAString} class for storing a sequence of amino acids. \end{enumerate} For each of these classes, there is a constructor that allows to create an instance of the class from a character string: <<>>= s1 <- BString("hello") s2 <- DNAString("GACCCT") s3 <- AAString("MARKSLEMSIR") @ \subsection{Exercise 1} \begin{enumerate} \item Create some arbitrary \Rclass{BString}, \Rclass{DNAString} and \Rclass{AAString} instances. \item Apply \Rfunction{nchar} and \Rfunction{alphabet} on them. \item Apply \Rfunction{alphabetFrequency} and \Rfunction{reverseComplement} on the \Rclass{DNAString} objects. \item Extract substrings by using the subsetting operator~\Rmethod{[} and by using the \Rfunction{subBString} function. \end{enumerate} IMPORTANT: The \Rfunction{subBString} function should always be preferred to the subsetting operator~\Rmethod{[} for substring extraction. More on this later... % --------------------------------------------------------------------------- \section{\Rpackage{BSgenome.*} packages} A \Rpackage{BSgenome.*} package is a data package containing the full genome sequence for a given organism. Please refer to the introduction of this document for how to display the list of \Rpackage{BSgenome.*} packages currently available and how to install them over the network. The name of a \Rpackage{BSgenome.*} package is made of 4 parts separated by a dot (e.g. \Rpackage{BSgenome.Celegans.UCSC.ce2}). The 1st part is always {\tt BSgenome}, the 2nd part is the name of the organism (abbreviated), the 3rd part is the name of the organisation who assembled the genome and the 4th part is the release string or number used by this organisation for this assembly of the genome. A \Rpackage{BSgenome.*} package contains a single top level object whose name matches the second part of the package name: <<>>= library(BSgenome.Celegans.UCSC.ce2) Celegans @ Displaying this object shows some information about the provenance of this genome plus 2 indexes of sequences: the single sequences and the multiple sequences. The single sequences are \Rclass{DNAString} objects and the multiple sequences are \Rclass{BStringViews} objects (will be introduced soon). \subsection{Exercise 2} \begin{enumerate} \item Display several chromosomes of C. elegans. Do they contain IUPAC extended letters? \item Load \Rpackage{BSgenome.Dmelanogaster.FlyBase.r51} (fruit fly genome). Do the chromosomes contain IUPAC extended letters? \item Extract the last 10M bases from fruit fly chromosome 2L. Use the \Rfunction{system.time} function to compare the performance of \Rfunction{subBString} vs the subsetting operator~\Rmethod{[}. \end{enumerate} % --------------------------------------------------------------------------- \section{\Rclass{BStringViews} objects} A \Rclass{BStringViews} object contains a set of views on the same sequence called "the subject". Each view is defined by its start and end locations: both are integers such that start <= end. We use the \Rfunction{views} function to create a \Rclass{BStringViews} object for a given set of start and end locations. Here we create a \Rclass{BStringViews} object with 4 views: <<>>= v <- views(DNAString("TAATAATG"), 3:0, 5:8) v @ Note that two views can overlap and that a view can be "out of limits" i.e. it can start before the first letter of the subject or/and end after its last letter. The subject is the DNA sequence {\tt TAATAATG}: <<>>= subject(v) @ A \Rclass{BStringViews} object can be subsetted like a standard character vector: <<>>= length(v) v[4:2] @ The start/end/width (integer vectors) can be extracted with \Robject{start()}, \Robject{end()} and \Robject{width()}: <<>>= start(v) end(v) width(v) @ A given view can be extracted as a \Rclass{BString} (or derived) object with the \Rmethod{[[} operator: <<>>= v[[2]] @ Note this operator returns an instance of the same class as the subject. \subsection{Exercise 3} \begin{enumerate} \item Compare \Rfunction{nchar} vs \Rfunction{width} on \Robject{v}. \item Find a programatical way to remove "out of limits" views from a \Rclass{BStringViews} object. \end{enumerate} \subsection{Exercise 4} \begin{enumerate} \item Use the \Rfunction{mask} function to mask the {\tt N}s in fruit fly chromosome 2L. Apply \Rfunction{mask} again to the result (and without the second argument) to see the location of the {\tt N}s. Do we have isolated {\tt N}s or {\tt N}-blocks? \item Find the location of the {\tt N}s in C. elegans chromosome III and in H. sapiens chromosome 1. \item What's the longest {\tt A}-block in fruit fly chromosome 2L? \end{enumerate} \subsection{Exercise 5} The objective of this exercise is to "look at" the introns of a given eukaryote. For this we need: \begin{enumerate} \item The \Rpackage{BSgenome.*} package with the full genome sequence of the target organism. \item Annotations that give us the gene and exon locations of the organism. Note that it's important to make sure that these annotations are targetting exactly the version (build/assembly) of the genome that we are using. \item The \Rpackage{Biostrings} package. \end{enumerate} We do this analysis on the fruit fly genome: we use genome r5.1 from FlyBase (the \Rpackage{BSgenome.Dmelanogaster.FlyBase.r51} package) and a package of annotations for this genome, the \Rpackage{ann.Dmelanogaster.FlyBase.r51} package, made from the GFF files provided by FlyBase: <<>>= library(ann.Dmelanogaster.FlyBase.r51) @ The annotations in this package are organized as follow: \Robject{CHR\_SHORTNAMES} gives the list of chromosomes for which annotations are provided: <<>>= CHR_SHORTNAMES @ and for each chromosome, 2 data frames are provided: 1 for the genes and 1 for the exons: <<>>= colnames(getAnnGenes("2L")) colnames(getAnnExons("2L")) @ For our analysis, we only need the \Robject{"seqname"}, \Robject{"start"}, \Robject{"end"} and \Robject{"strand"}. <<>>= getAnnGenes("2L")[1:5, c("seqname", "start", "end", "strand")] getAnnExons("2L")[1:5, c("seqname", "start", "end", "strand")] @ The row names are gene ids for the former and exon ids for the later. Finally, the data object \Robject{EXON2GENE} (named character vector) provides the mappping between exon ids and gene ids (many-to-one relationship): <<>>= data(EXON2GENE) EXON2GENE["CG11023:1"] @ To get all the exons belonging to a given gene: <<>>= names(EXON2GENE)[EXON2GENE == "FBgn0031208"] @ \begin{enumerate} \item Extract the start and end position of the exons belonging to gene FBgn0025803 (chromosome 3R). \item Load \Rpackage{BSgenome.Dmelanogaster.FlyBase.r51} and create a \Rclass{BStringViews} object where the subject is chromosome 3R and the views are the exon locations found previously. \item Apply the \Rfunction{mask} function on this \Rclass{BStringViews} object. Is this new \Rclass{BStringViews} object representing the introns sequences for gene FBgn0025803? What needs to be adjusted? \item Are we discovering something? \end{enumerate} % --------------------------------------------------------------------------- \section{The \Rfunction{matchPattern} function} This function finds all matches of a given pattern in a sequence. Using \Rfunction{matchPattern} on a \Rclass{BString}, \Rclass{DNAString} or \Rclass{AAString} object is much faster than working with regular expressions, be it in R (on a standard character string) or in another language like Python. <<>>= library(BSgenome.Hsapiens.UCSC.hg18) chr1 <- Hsapiens$chr1 system.time(m <- matchPattern("AAAAAAAAAATT", chr1)) m @ Also, regular expressions can miss some matches: <<>>= length(gregexpr("ATA", "ATATA")) @ \Rfunction{matchPattern} returns all matches: <<>>= length(matchPattern("ATA", "ATATA")) @ \subsection{Exercise 6} \begin{enumerate} \item Find all the matches of an arbitrary nucleotide sequence in fruit fly chromosome 2L. \item In fact, if we don't take any special action, we only get the hits in the plus strand of the chromosome. Find the matches in the minus strand too. IMPORTANT: You should always avoid to \Rfunction{reverseComplement} an entire chromosome sequence (this is very inefficient). \end{enumerate} \subsection{Exercise 7} The objective of this exercise is to check that, for a given probe of a given chip, the hit location found by \Rfunction{matchPattern} is in agrement with the location reported in the annotations available for this chip. We use Affymetrix human chip {\tt hgu95av2} for this analysis. \begin{enumerate} \item Use \Rpackage{hgu95av2} to find the chromosome location of the gene targetted by probeset {\tt 1138\_at}. \item Use \Rpackage{hgu95av2probe} to find all the probe sequences for this probeset id (put the result in a character vector called \Robject{probes}). \item Use \Rfunction{matchPattern} to find the corresponding hit for each probe in the corresponding chromosome (use a \Robject{sapply(probes, function(probe) ...)} construction to put all the results in a single list, the returned list will have one element per probe in \Robject{probes}). Do we have a hit for every probe? Compare with \Robject{hgu95av2CHRLOC[["1138\_at"]]}. Do you see something wrong? \item Try again with \Robject{mismatch=1}. Do we have a hit for every probe? \item In fact, according to \Rpackage{hgu95av2probe}, probeset {\tt 1138\_at} should hit a gene in the antistrand. Does this look correct? \end{enumerate} \subsection{Exercise 8} IUPAC extended letters can be used to express ambiguities in the pattern or in the subject of the search. This is controlled via the \Robject{fixed} argument of the \Rfunction{matchPattern} function. If \Robject{fixed} is \Robject{TRUE} (the default), all letters in the pattern and the subject are interpreted litterally. If \Robject{fixed} is \Robject{FALSE}, IUPAC extended letters in the pattern and in the subject are interpreted as ambiguities e.g. {\tt M} will match {\tt A} or {\tt C} and {\tt N} will match any letter (the \Robject{IUPAC\_CODE\_MAP} named character vector gives the mapping between IUPAC letters and possible nucleotides they stand for). The most common use of this feature is to introduce wildcards in the pattern by replacing some of its letters with {\tt N}s. \begin{enumerate} \item Search pattern {\tt GAACTTTGCCACTC} in fruit fly chromosome 2L. \item Repeat but this time allow the 2nd {\tt T} in the pattern (6th letter) to match anything. Anything wrong? \item You can use the \Rfunction{mask} function or \Robject{fixed="subject"} to work around the previous problem. Try and compare. \end{enumerate} \subsection{Exercise 9} The objective of this exercise is to simulate a {\it PCR experiment} on the fruit fly genome. A {\it PCR experiment} uses primer pairs specifically designed to target the genome of a given organism to "amplify" certain regions of this genome. Each primer pair is made of a forward and a reverse primer. Both primers are short nucleotide sequences with 2 parts: a gene-specific part (the "probe") and a "tag" used for the amplification processus. We'll call the forward probe and the reverse probe the probe parts of the forward and reverse primer respectively. For each primer pair, the forward and reverse probes are designed to hit the same chromosome at nearby locations. The region between the 2 hits is called an amplicons: it is the region that gets "amplified" by the PCR processus. For our {\it computer-simulated PCR experiment}, we'll use the small library of primer pairs stored in the {\tt plate1\_probes.txt} file. This library can be loaded into a data frame with: <<>>= probes <- read.table("plate1_probes.txt", stringsAsFactors=FALSE) dim(probes) probes[1:5, ] @ We've put one primer pair per row in this data frame. The row names of the data frame contain the primer pair ids. Also note that, as suggested by the col names, we've only put the probe parts for each primer pair (the "tag" part has been removed, it can be ignored for the purpose of this exercise). The first goal of this exercise is to write a function that, given a probe pair and a subject (the PCR target), returns the amplicons associated with this probe pair. In order to acquire some expertise, we'll try to find the amplicon(s) associated with the following arbitrary probe pair: <<>>= Fprobe0 <- DNAString("AGCTCCGAGTT") Rprobe0 <- DNAString("CGTTGTTCACA") @ We choose short probes on purpose so the risk to have several hits in several chromosome is high. This is a good way to learn how to deal in a systematical way with the multi hit problem! \begin{enumerate} \item Find all \Robject{Fprobe0} and \Robject{Rprobe0} hits in chromosome 3R. Remember that nothing prevents a probe to hit the strand that it was not designed for! Put the start positions of all the "plus hits" (i.e. the hits of \Robject{Fprobe0} and \Robject{Rprobe0} in the plus strand) in the \Robject{Phits} object (integer vector). Put the end positions of all the "minus hits" in the \Robject{Mhits} object (integer vector). \item Ideally \Robject{Phits} and \Robject{Mhits} should be of length 1 but it is not always the case. However the fact that we get several "plus hits" and/or "minus hits" doesn't necessarily mean that our original primer pair has a bad design. In fact, some of these hits don't prevent the PCR experiment to work properly: this is the case for any "plus hit" that is followed (from left to right) by another "plus hit" with no "minus hit" between. And similarly, it is the case for a "minus hit" that is preceded by another "minus hit" with no "plus hit" between. All these hits can be ignored. Write a function that takes 2 arguments, \Robject{Phits} and \Robject{Mhits}, that removes all the above hits. The function will return a data frame with 2 columns, \Robject{Phits} and \Robject{Mhits}. Note that using a data frame now makes sense because after this removal, \Robject{Phits} and \Robject{Mhits} will always have the same length. \item Use the previous function to "clean" \Robject{Phits} and \Robject{Mhits}. Compare your result with the "theoretical amplicons" returned by \Robject{matchProbePair(Fprobe0, Rprobe0, Dmelanogaster[["3R"]])}. From now we will use the \Rfunction{matchProbePair} function to achive this. Are all these "theoretical amplicons" realistic (PCR amplification is supposed to work for regions up to 5k-10k bases). Would you still consider the \Robject{Fprobe0}/\Robject{Rprobe0} pair good for amplifying the region it was designed for? \item Another problem that can prevent the PCR amplification from working properly is when a given primer pair produces more that 1 valid amplicon (e.g. <= 20k) over the whole genome. What about \Robject{Fprobe0}/\Robject{Rprobe0}? \item Write a function that takes our probes library (the \Robject{probes} data frame) as input and produces a data frame similar to \Robject{probes} but with 3 additional columns \Robject{chr}, \Robject{start} and \Robject{end} that contain the chromosome name/start/end of the amplicon found for each primer pair or NAs if 0 or more than 1 amplicon were found. Having NAs in a row means that the design of the primer pair is bad. \end{enumerate} % --------------------------------------------------------------------------- \end{document}