%%% %%% Do not edit the .tex file. It is automatically generated %%% go to Work/vignette %%% %\documentclass[fleqn, letter, 10pt]{article} %\documentclass[article]{jss} \documentclass[nojss]{jss} %\usepackage[round,longnamesfirst]{natbib} %\usepackage[left=1.5in,top=1.5in,right=1.5in,bottom=1.5in,nohead]{geometry} %\usepackage{graphicx,keyval,thumbpdf,url} %\usepackage{hyperref} %\usepackage{Sweave} %\SweaveOpts{strip.white=TRUE, eps=FALSE} %\AtBeginDocument{\setkeys{Gin}{width=0.6\textwidth}} \usepackage[utf8]{inputenc} \usepackage[english]{babel} %% end of declarations %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \usepackage{amsmath} \usepackage{amsfonts} %\newcommand{\strong}[1]{{\normalfont\fontseries{b}\selectfont #1}} \newcommand{\class}[1]{\mbox{\textsf{#1}}} \newcommand{\func}[1]{\mbox{\texttt{#1()}}} %\newcommand{\code}[1]{\mbox{\texttt{#1}}} %\newcommand{\pkg}[1]{\strong{#1}} %\newcommand{\samp}[1]{`\mbox{\texttt{#1}}'} %\newcommand{\proglang}[1]{\textsf{#1}} \newcommand{\set}[1]{\mathcal{#1}} \newcommand{\vect}[1]{\mathbf{#1}} \newcommand{\mat}[1]{\mathbf{#1}} %\newcommand{\sQuote}[1]{`{#1}'} %\newcommand{\dQuote}[1]{``{#1}''} \newcommand\R{{\mathbb{R}}} %\DeclareMathOperator*{\argmin}{argmin} %\DeclareMathOperator*{\argmax}{argmax} %\setlength{\parindent}{0mm} %\setlength{\parskip}{3mm plus2mm minus2mm} %\VignetteIndexEntry{rRDP: Interface to the RDP Classifier} \author{ Michael Hahsler\\Southern Methodist University \And Anurag Nagar\\Southern Methodist University } \title{\pkg{rRDP}: Interface to the RDP Classifier} %% for pretty printing and a nice hypersummary also set: \Plainauthor{Michael Hahsler, Anurag Nagar} %% comma-separated \Plaintitle{rRDP: Interface to the RDP Classifier} %% without formatting \Shorttitle{Interface to the RDP Classifier}%% a short title (if necessary) %% an abstract and keywords \Abstract{ This package installs and interfaces the naive Bayesian classifier for 16S rRNA sequences developed by the Ribosomal Database Project (RDP). With this package the classifier trained with the standard training set can be used or a custom classifier can be trained. } \Keywords{bioinformatics, Bioconductor, Biostrings, sequence classification} \Plainkeywords{bioinformatics, Bioconductor, Biostrings, sequence classification} %% at least one keyword must be supplied %% publication information %% NOTE: Typically, this can be left commented and will be filled out by the technical editor %% \Volume{13} %% \Issue{9} %% \Month{September} %% \Year{2004} %% \Submitdate{2004-09-29} %% \Acceptdate{2004-09-29} %% The address of (at least) one author should be given %% in the following format: \Address{ Michael Hahsler\\ Engineering Management, Information, and Systems\\ Lyle School of Engineering\\ Southern Methodist University\\ P.O. Box 750123 \\ Dallas, TX 75275-0123\\ E-mail: \email{mhahsler@lyle.smu.edu}\\ URL: \url{http://lyle.smu.edu/~mhahsler} Anurag Nagar\\ Computer Science and Engineering\\ Lyle School of Engineering\\ Southern Methodist University\\ P.O. Box 750122 \\ Dallas, TX 75275-0122\\ E-mail: \email{anagar@smu.edu}\\ } %% It is also possible to add a telephone and fax number %% before the e-mail in the following format: %% Telephone: +43/1/31336-5053 %% Fax: +43/1/31336-734 %% for those who use Sweave please include the following line (with % symbols): %% need no \usepackage{Sweave.sty} %% end of declarations %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \begin{document} \sloppy %\abstract{} <>= options(width = 70, prompt="R> ", digits=4) ### for sampling set.seed(1234) @ \section{Classification with RDP} The RDP classifier was developed by the Ribosomal Database Project which provides various tools and services to the scientific community for data related to 16S rRNA sequences. The classifier uses a Naive Bayesian approach to quickly and accurately classify sequences. The classifier uses 8-mer counts as features \cite{RDP}. \subsection{Using the RDP classifier trained with the default training set} RDP is shipped trained with a 16S rRNA training set. The model data is available in the data package~\pkg{rRDPData}. For the following example we load some test sequences shipped with the package. <<>>= library(rRDP) seq <- readRNAStringSet(system.file("examples/RNA_example.fasta", package="rRDP")) seq @ Note that the name contains the annotation from the FASTA file. In this case the annotation contains the actual classification information and is encoded in Greengenes format. For convenience, we replace the annotation with just the sequence id. <<>>= annotation <- names(seq) names(seq) <- sapply(strsplit(names(seq), " "), "[", 1) seq @ Next, we apply RDP with the default training set. Note that the data package \pkg{rRDPDate} needs to be installed! <<>>= pred <- predict(rdp(), seq) pred @ The prediction confidence is supplied as the attribute \code{"confidence"}. <<>>= attr(pred, "confidence") @ To evaluate the classification accuracy we can compare the known classification with the predictions. The known classification was stored in the FASTA file and encoded in Greengenes format. We can decode the annotation using \code{decode_Greengenes()}. <<>>= actual <- decode_Greengenes(annotation) actual @ Now we can compare the prediction with the actual classification by creating a confusion table and calculating the classification accuracy. Here we do this at the Genus level. <<>>= confusionTable(actual, pred, rank="genus") accuracy(actual, pred, rank="genus") @ \subsection{Training a custom RDP classifier} RDP can be trained using \code{trainRDP()}. We use an example of training data that is shipped with the package. <<>>= trainingSequences <- readDNAStringSet( system.file("examples/trainingSequences.fasta", package="rRDP")) trainingSequences @ Note that the training data needs to have names in a specific RDP format: \centerline{\tt{" ;;;;;"}} In the following we show the name for the first sequence. We use here \code{sprintf} to display only the first 65~characters so the it fits into a single line. <<>>= sprintf(names(trainingSequences[1]), fmt="%.65s...") @ Now, we can train a the classifier. The model is stored in a directory specified by the parameter \code{dir}. <<>>= customRDP <- trainRDP(trainingSequences, dir = "myRDP") customRDP @ <<>>= testSequences <- readDNAStringSet( system.file("examples/testSequences.fasta", package="rRDP")) pred <- predict(customRDP, testSequences) pred @ Since the custom classifier is stored on disc it can be recalled anytime using \code{rdp()}. <<>>= customRDP <- rdp(dir = "myRDP") @ To permanently remove the classifier use \code{removeRDP()}. <<>>= removeRDP(customRDP) @ \section*{Acknowledgments} This research is supported by research grant no. R21HG005912 from the National Human Genome Research Institute (NHGRI / NIH). %\bibliographystyle{abbrvnat} \bibliography{rRDP,sequence} \end{document}