%\VignetteIndexEntry{Getting Started DECIPHERing}
%\VignettePackage{DECIPHER}

\documentclass[10pt]{article}

\usepackage{times}
\usepackage{hyperref}
\usepackage{underscore}
\usepackage{enumerate}
\usepackage{graphics}

\textwidth=6.5in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=-.1in
\evensidemargin=-.1in
\headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}

\newcommand{\R}{{\textsf{R}}}
\newcommand{\C}{{\textsf{C}}}
\newcommand{\code}[1]{{\texttt{#1}}}
\newcommand{\term}[1]{{\emph{#1}}}
\newcommand{\Rpackage}[1]{\textsf{#1}}
\newcommand{\Rfunction}[1]{\texttt{#1}}
\newcommand{\Robject}[1]{\texttt{#1}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\textit{#1}}}
\newcommand{\Rfunarg}[1]{{\textit{#1}}}

\bibliographystyle{plainnat}

\begin{document}
%\setkeys{Gin}{width=0.55\textwidth}

\title{Getting Started DECIPHERing}
\author{Erik S. Wright \\
  University of Wisconsin \\
  Madison, WI}
\date{\today}
\maketitle

\newenvironment{centerfig}
{\begin{figure}[htp]\centering}
{\end{figure}}
\renewcommand{\indent}{\hspace*{\tindent}}

\DefineVerbatimEnvironment{Sinput}{Verbatim} {xleftmargin=2em}
\DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=2em}
\DefineVerbatimEnvironment{Scode}{Verbatim}{xleftmargin=2em}
\fvset{listparameters={\setlength{\topsep}{0pt}}}
\renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}}

\SweaveOpts{keep.source=TRUE}
<>=
options(continue=" ")
options(width=60)
@

\tableofcontents

%------------------------------------------------------------
\section{About DECIPHER}
%------------------------------------------------------------

\term{D}atabase \term{E}nabled \term{C}ode for \term{I}deal \term{P}robe \term{H}ybridization \term{E}mploying \term{R} (\Rpackage{DECIPHER}) is a software toolset that can be used for deciphering and managing DNA sequences efficiently using the \R{} statistical programming language.
The project originally sprang to life as a program for developing hybridization probes for a variety of applications using 16S rRNA sequences. Although the program's functionality has expanded since its conception, it still maintains the name \Rpackage{DECIPHER} to this day. \Rpackage{DECIPHER} is available under the terms of the GNU General Public License version 3 (\url{http://www.gnu.org/copyleft/gpl.html}).

%------------------------------------------------------------
\section{Design Philosophy}
%------------------------------------------------------------

\subsection{Curators Protect the Originals}

One of the core principles of \Rpackage{DECIPHER} is the idea of the non-destructive workflow. This revolves around the concept that the original sequence information should never be altered: sequences are exported looking identical to how they were when they were first imported. Essentially, the sequence information in the database is thought of as a backup of the original sequence file, and no function is able to directly alter the sequence data. All of the workflows simply \term{add} information to the database, which can be used to maintain, analyze, and decipher the sequences. When it comes time to export all or part of the sequences, they are preserved in their original state without alteration.

\subsection{Don't Reinvent the Wheel}

\Rpackage{DECIPHER} makes use of the \Rpackage{Biostrings} package, which is a core part of the Bioconductor suite (\url{http://www.bioconductor.org/}). This package contains numerous functions for common operations such as searching, aligning, and reverse complementing sequences. Furthermore, \Rpackage{DECIPHER} makes use of the \Rpackage{Biostrings} interface for handling DNA sequence data so that sequences are stored in a \Robject{DNAStringSet}. These objects are compatible with many useful packages in the Bioconductor suite.

A wide variety of user objectives necessitates that \Rpackage{DECIPHER} be extensible to customized projects.
\R{} provides a simple way to place the power of thousands of packages at your fingertips. Likewise, \R{} enables direct access to the speed and efficiency of the programming language \C{} while maintaining the utility of a scripting language. Therefore, minimal coding skill is required to solve complex new problems. Best of all, the \R{} statistical programming language is open source, and maintains a thriving user community so that direct collaboration with other \R{} users is available on several Internet forums (\url{https://stat.ethz.ch/mailman/listinfo}).

\subsection{That Which is the Most Difficult, Make Fastest}

A core objective of \Rpackage{DECIPHER} is to make massive tasks feasible in minimal time. To this end, many of the most time-consuming functions are parallelized to make use of multiple processors. For example, the function \Rfunction{DistanceMatrix} scales almost linearly with the number of processor cores, so a modern processor with 8 cores can run it close to eight times faster than a single core. Similar speedups can be achieved when clustering the resulting distance matrix using \Rfunction{IdClusters}. This is all made possible through the integration of OpenMP, which is currently supported by default on most major platforms except Windows (see Section~\ref{sec:Installation}).

Other time-consuming tasks are handled efficiently. The function \Rfunction{FindChimeras} can uncover sequence chimeras by searching through a reference database of over a million sequences for thousands of 30-mer fragments in a matter of minutes. This feat is accomplished by using the \Rclass{PDict} class provided by \Rpackage{Biostrings}. Similarly, the \Rfunction{SearchDB} function can obtain the one-in-a-million sequences that match a targeted query in a matter of seconds. Such high-speed functions enable the user to find solutions to problems that previously would have been extremely difficult or nearly impossible to solve using antiquated methods.
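
The dictionary-based search mentioned above can be sketched directly with \Rpackage{Biostrings}. This is only a minimal illustration of matching many fixed-width fragments against a set of references with \Rfunction{PDict} and \Rfunction{vcountPDict}; it is not the implementation used by \Rfunction{FindChimeras}. The fragments and reference sequences are made up for the example, and 6-mers stand in for the 30-mers described in the text:
\begin{Schunk}
\begin{Sinput}
library(Biostrings)
# hypothetical fixed-width fragments (all must share one width)
frags <- DNAStringSet(c("ACGTAC", "TTGACA", "GGGCCC"))
# preprocess the fragments into a pattern dictionary
pd <- PDict(frags)
# hypothetical reference sequences to search
refs <- DNAStringSet(c(ref1="AAACGTACAAATTGACA",
                       ref2="CCCCCCCCCC"))
# count exact matches: one row per fragment,
# one column per reference sequence
hits <- vcountPDict(pd, refs)
# flag the fragments found in at least one reference
found <- rowSums(hits) > 0
\end{Sinput}
\end{Schunk}
Because the dictionary is preprocessed once and then reused for every reference sequence, the cost of adding more fragments stays small compared with searching for each fragment separately.
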

\subsection{Stay Organized}

It is no longer necessary to store related data in several different files. \Rpackage{DECIPHER} is enabled by \Rpackage{RSQLite}, which is an \R{} interface to \term{SQLite} databases (\url{http://www.sqlite.org/}). \Rpackage{DECIPHER} creates an organized collection of sequences and their associated information known as a sequence database. \term{SQLite} databases are flat files, meaning they can be handled just like any other file. There is no setup required since \term{SQLite} does not require a server, unlike client-server database engines. These attributes of \term{SQLite} databases make sharing, backing up, and storing sequence databases relatively straightforward.

Separate projects can be stored in distinct tables in the same sequence database. Each new table is structured to include every sequence's description, identifier, and a unique key called the \term{row_name} all in one place. The sequences are referenced by their \term{row_names} or \term{identifier} throughout most functions in the package. New information created using \Rpackage{DECIPHER} functions is added as additional database columns alongside the respective sequences' \term{row_names}. To prevent the database from seeming like a black box, there is a function named \Rfunction{BrowseDB} that facilitates viewing of the database contents in a web browser. A similar function called \Rfunction{BrowseSequences} is available to view sequences.

The amount of DNA sequence information available is currently increasing at a phenomenal rate. \Rpackage{DECIPHER} stores individual sequences using \term{gzip} compression so that the database file takes up much less drive space than a standard text file of sequences. The compressed sequences are stored in a hidden table that is linked to the information table. For example, by default sequence information is stored in the table ``DNA'', and the associated sequences are stored in the table ``_DNA''.
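As a hedged illustration of this layout, the tables and columns of a freshly created sequence database can be listed with the generic \Rfunction{dbListTables} and \Rfunction{dbListFields} functions that \Rpackage{RSQLite} implements. The sketch assumes the example GenBank file shipped with the package and an in-memory database; table names may differ between package versions, so no particular output is assumed:
\begin{Schunk}
\begin{Sinput}
library(DECIPHER)
library(RSQLite)
# example sequence file included in the package
gen <- system.file("extdata", "Bacteria_175seqs.gen",
                   package="DECIPHER")
# create a temporary in-memory sequence database
dbConn <- dbConnect(SQLite(), ":memory:")
Seqs2DB(gen, "GenBank", dbConn, "Bacteria")
# list the information table and its hidden sequence table
tbls <- dbListTables(dbConn)
print(tbls)
# show the columns stored in each table
for (tbl in tbls)
  print(dbListFields(dbConn, tbl))
dbDisconnect(dbConn)
\end{Sinput}
\end{Schunk}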
Storing the sequences in a separate table greatly improves access speed when there is a large amount of sequence information. Separating projects into distinct tables further increases query speed over that of storing every project in a single table.

%------------------------------------------------------------
\section{Functionality}
%------------------------------------------------------------

The functions of \Rpackage{DECIPHER} can be grouped into several categories based on intended use:
\begin{enumerate}
\item Primary functions for interacting with a sequence database:
\begin{enumerate}
\item \Rfunction{Add2DB}
\item \Rfunction{DB2FASTA}
\item \Rfunction{SearchDB}
\item \Rfunction{Seqs2DB}
\end{enumerate}
\item Secondary functions for typical database tasks:
\begin{enumerate}
\item \Rfunction{IdConsensus}
\item \Rfunction{IdentifyByRank}
\item \Rfunction{IdLengths}
\end{enumerate}
\item Functions for phylogenetics with a set of DNA sequences:
\begin{enumerate}
\item \Rfunction{ConsensusSequence}
\item \Rfunction{DistanceMatrix}
\item \Rfunction{IdClusters}
\end{enumerate}
\item Functions for visualization with a web browser:
\begin{enumerate}
\item \Rfunction{BrowseDB}
\item \Rfunction{BrowseSequences}
\end{enumerate}
\item Functions for multiple sequence alignment:
\begin{enumerate}
\item \Rfunction{AlignProfiles}
\item \Rfunction{AlignSeqs}
\item \Rfunction{MaskAlignment}
\end{enumerate}
\item Functions related to chimeras:
\begin{enumerate}
\item \Rfunction{CreateChimeras}
\item \Rfunction{FindChimeras}
\item \Rfunction{FormGroups}
\end{enumerate}
\item Functions related to DNA microarrays:
\begin{enumerate}
\item \Rfunction{Array2Matrix}
\item \Rfunction{CalculateEfficiencyArray}
\item \Rfunction{DesignArray}
\item \Rfunction{NNLS}
\end{enumerate}
\item Functions related to probes for fluorescence \textit{in situ} hybridization (FISH):
\begin{enumerate}
\item \Rfunction{CalculateEfficiencyFISH}
\item \Rfunction{DesignProbes}
\item \Rfunction{TileSeqs}
\end{enumerate}
\item Functions related to primers for polymerase chain reaction (PCR):
\begin{enumerate}
\item \Rfunction{CalculateEfficiencyPCR}
\item \Rfunction{DesignPrimers}
\item \Rfunction{TileSeqs}
\end{enumerate}
\end{enumerate}

%------------------------------------------------------------
\section{Installation}
%------------------------------------------------------------

\newlength\tindent
\setlength{\tindent}{\parindent}
\setlength{\parindent}{0pt}

\label{sec:Installation}

\subsection{Typical Installation (recommended)}

\begin{enumerate}
\item Install \R{} (version $\geq$ 2.13.0) from \url{http://www.r-project.org/}.
\item Install \Rpackage{DECIPHER} in \R{} by entering:
\begin{Schunk}
\begin{Sinput}
> source("http://bioconductor.org/biocLite.R")
> biocLite("DECIPHER")
\end{Sinput}
\end{Schunk}
\end{enumerate}

\subsection{Update to the Latest Version}

\begin{enumerate}
\item Update \R{} to the latest version available from \url{http://www.r-project.org/}.
\item Reinstall \Rpackage{DECIPHER} in \R{} by entering:
\begin{Schunk}
\begin{Sinput}
> source("http://bioconductor.org/biocLite.R")
> biocLite("DECIPHER")
\end{Sinput}
\end{Schunk}
\end{enumerate}

\subsection{Manual Installation}

\subsubsection{All platforms}

\begin{enumerate}
\item Install \R{} (version $\geq$ 2.13.0) from \url{http://www.r-project.org/}.
\item Install \Rpackage{Biostrings} in \R{} by entering:
\begin{Schunk}
\begin{Sinput}
> source("http://bioconductor.org/biocLite.R")
> biocLite("Biostrings")
\end{Sinput}
\end{Schunk}
\item Install \Rpackage{RSQLite} in \R{} by entering:
\begin{Schunk}
\begin{Sinput}
> source("http://bioconductor.org/biocLite.R")
> biocLite("RSQLite")
\end{Sinput}
\end{Schunk}
\item Download \Rpackage{DECIPHER} from \url{http://DECIPHER.cee.wisc.edu}.
\end{enumerate}

\subsubsection{Mac OS X}

\begin{enumerate}
\item First Option (simplest but no parallelization):
\begin{Schunk}
\begin{Sinput}
> install.packages("<>", repos=NULL)
\end{Sinput}
\end{Schunk}
\item Second Option (more difficult but enables parallelization):
\begin{enumerate}[(a)]
\item Open the DECIPHER source folder. Remove the file named ``Makevars'' in the ``src'' folder. Then save a text file with the line ``CC=gcc-4.2 -std=gnu99'' to ``\textasciitilde{}/.R/Makevars''. This will force packages to be compiled with gcc-4.2 instead of the default llvm.
\item Open Terminal, and in the command prompt enter:
\begin{Schunk}
\begin{Sinput}
R CMD build --no-vignettes "<>"
R CMD INSTALL "<>"
\end{Sinput}
\end{Schunk}
\end{enumerate}
\end{enumerate}

\subsubsection{Linux}

In a shell enter:
\begin{Schunk}
\begin{Sinput}
R CMD build --no-vignettes "<>"
R CMD INSTALL "<>"
\end{Sinput}
\end{Schunk}

\subsubsection{Windows}

Two options are available: the first is simplest, but requires the pre-built binary (DECIPHER.zip).
\begin{enumerate}
\item First Option:
\begin{Schunk}
\begin{Sinput}
> install.packages("<>", repos=NULL)
\end{Sinput}
\end{Schunk}
\item Second Option (more difficult):
\begin{enumerate}[(a)]
\item Install Rtools from \url{http://cran.r-project.org/bin/windows/Rtools/}. Be sure to check the box that says edit PATH during installation.
\item Open an MS-DOS command prompt by clicking Start $\rightarrow$ All Programs $\rightarrow$ Accessories $\rightarrow$ Command Prompt.
\item In the command prompt enter:
\begin{Schunk}
\begin{Sinput}
R CMD build --no-vignettes "<>"
R CMD INSTALL "<>"
\end{Sinput}
\end{Schunk}
\end{enumerate}
\end{enumerate}

\section{Example Workflow}

To get started we need to load the \Rpackage{DECIPHER} package, which automatically loads several other required packages:
%
<>=
library(DECIPHER)
@

Help for any function can be accessed through a command such as:
\begin{Schunk}
\begin{Sinput}
> ? DECIPHER
\end{Sinput}
\end{Schunk}

To begin, we can import a GenBank file of sequences into a sequence database. We need to provide an arbitrary sequence \term{identifier} to \Rfunction{Seqs2DB}, which we will call ``Bacteria''. The \term{identifier} is used by many \Rpackage{DECIPHER} functions to reference a specific set of sequences in the database:
<>=
# access a sequence file included in the package:
gen <- system.file("extdata", "Bacteria_175seqs.gen", package="DECIPHER")
# connect to a database:
dbConn <- dbConnect(SQLite(), ":memory:")
# import the sequences into the sequence database
Seqs2DB(gen, "GenBank", dbConn, "Bacteria")
@

Now we can view the table of information we just added to the database in a web browser (Fig. 1):
<>=
BrowseDB(dbConn)
@

Suppose we wanted to count the number of bases in each sequence and add that information to the database:
<>=
l <- IdLengths(dbConn)
head(l)
Add2DB(l, dbConn)
BrowseDB(dbConn, maxChars=20)
@

Next let's identify our sequences by phylum and update this information in the database:
<>=
r <- IdentifyByRank(dbConn, add2tbl=TRUE)
BrowseDB(dbConn, maxChars=20)
@

\begin{figure}
\begin{center}
\includegraphics[width=1\textwidth]{BrowseDBOutput}
\caption{\label{f1} Database table shown in web browser}
\end{center}
\end{figure}

We can now look at only those sequences that belong to the phylum \term{Bacteroidetes} (Fig. 2):
<>=
dna <- SearchDB(dbConn, id="Bacteroidetes")
BrowseSequences(subseq(dna, 140, 240), colorBases=TRUE)
@

\begin{figure}
\begin{center}
\includegraphics[width=1\textwidth]{BrowseSequencesOutput}
\caption{\label{f2} Sequences shown in web browser}
\end{center}
\end{figure}

Let's construct a phylogenetic tree from the \term{Bacteroidetes} sequences (Fig. 3):
\begin{centerfig}
<>=
d <- DistanceMatrix(dna, correction="Jukes-Cantor", verbose=FALSE)
c <- IdClusters(d, method="ML", cutoff=.05, showPlot=TRUE, myDNAStringSet=dna, verbose=FALSE)
@
\caption{Maximum likelihood tree showing the relationships between sequences.}
\end{centerfig}

We could then use the command below to save the in-memory database to a file for long-term storage. Be sure to change the path names to those on your system by replacing all of the text inside quotes labeled ``$<<$path to ...$>>$'' with the actual path on your system.
\begin{Schunk}
\begin{Sinput}
> sqliteCopyDatabase(dbConn, "<>")
\end{Sinput}
\end{Schunk}

Finally, we should close the database connection. Since the sequence database was created in temporary memory, all of the information will be erased:
<>=
dbDisconnect(dbConn)
@

\section{Session Information}

All of the output in this vignette was produced under the following conditions:
<>=
toLatex(sessionInfo(), locale=FALSE)
@

\end{document}