%\VignetteIndexEntry{Finding Chimeric Sequences}
%\VignettePackage{DECIPHER}

\documentclass[10pt]{article}

\usepackage{times}
\usepackage{hyperref}
\usepackage{underscore}

\textwidth=6.5in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=-.1in
\evensidemargin=-.1in
\headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}

\newcommand{\R}{{\textsf{R}}}
\newcommand{\code}[1]{{\texttt{#1}}}
\newcommand{\term}[1]{{\emph{#1}}}
\newcommand{\Rpackage}[1]{\textsf{#1}}
\newcommand{\Rfunction}[1]{\texttt{#1}}
\newcommand{\Robject}[1]{\texttt{#1}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\textit{#1}}}
\newcommand{\Rfunarg}[1]{{\textit{#1}}}

\bibliographystyle{plainnat}

\begin{document}
%\setkeys{Gin}{width=0.55\textwidth}

\title{Finding Chimeric Sequences}
\author{Erik S. Wright}
\date{\today}
\maketitle

\renewcommand{\indent}{\hspace*{\tindent}}
\DefineVerbatimEnvironment{Sinput}{Verbatim} {xleftmargin=2em} \DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=2em} \DefineVerbatimEnvironment{Scode}{Verbatim}{xleftmargin=2em} \fvset{listparameters={\setlength{\topsep}{0pt}}} \renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}}
\SweaveOpts{keep.source=TRUE}
<<echo=false>>=
options(continue=" ")
options(width=80)
@

\tableofcontents

%------------------------------------------------------------
\section{Introduction}
%------------------------------------------------------------

This document illustrates how to detect chimeric sequences using the \Rpackage{DECIPHER} package through the use of the \Rfunction{FindChimeras} function.  The function finds chimeric sequences present in a database (\Rfunarg{dbFile}) of sequences by making use of a reference database (\Rfunarg{dbFileReference}) of good sequences.  The examples in this document are directed towards finding chimeric 16S rRNA sequences, although a similar strategy could be used with any type of sequence.  For 16S sequences the functionality of \Rfunction{FindChimeras} is implemented as a web tool available at \underline{\url{http://DECIPHER.codes}}.

%------------------------------------------------------------
\section{Chimera Finding Problems}
%------------------------------------------------------------

Chimeras are a common type of sequence anomaly that result from the use of PCR amplification.  Chimeras masquerade as normal sequences, but actually are made up of several different sequence parts that do not exist together in reality.  Undetected chimeras can have insidious effects, such as causing incorrect assessments of sequence diversity or primer/probe specificity.  Chimeras are very difficult to detect, and several detections methods have been proposed to varying degrees of success.  The basic problem can be described as the search for one or more regions of a sequence that do not belong with the rest of the sequence.

The \Rfunction{FindChimeras} algorithm works by finding suspect sequence fragments that are uncommon in the group where the sequence belongs, but very common in another group where the sequence does not belong.  Each sequence in the \Rfunarg{dbFile} is tiled into short sequence segments called fragments.  If the fragments are infrequent in their respective group in the \Rfunarg{dbFileReference} then they are considered suspect.  If enough suspect fragments from a sequence meet the specified constraints then the sequence is flagged as a chimera.

%------------------------------------------------------------
\section{Getting Started}
%------------------------------------------------------------

\newlength\tindent
\setlength{\tindent}{\parindent}
\setlength{\parindent}{0pt}

To get started we need to load the \Rpackage{DECIPHER} package, which automatically loads several other required packages.

<<startup,results=hide>>=
library(DECIPHER)
@

Help for any \Rpackage{DECIPHER} function, such as \Rfunction{FindChimeras}, can be accessed through:

\begin{Schunk}
\begin{Sinput}
> ? FindChimeras
\end{Sinput}
\end{Schunk}

If \Rpackage{DECIPHER} is installed on your system, the code in each example can be obtained via:

\begin{Schunk}
\begin{Sinput}
> browseVignettes("DECIPHER")
\end{Sinput}
\end{Schunk}

%------------------------------------------------------------
\section{Obtaining/Building A Reference Database}
%------------------------------------------------------------

The first step necessary to use \Rfunction{FindChimeras} is to obtain a reference database (\Rfunarg{dbFileReference}) of good sequences.  In order for the algorithm to work accurately, it is crucial that the reference database is free of any chimeras.  The examples given in this document use 16S rRNA sequences, and a suitable reference database can be found at \underline{\url{http://DECIPHER.codes/Download.html}}.  The 16S reference database must be unzipped after download.

If a reference database is not present then it is feasible to create a one if a significantly large repository of sequences is available.  By using the input database as the reference database it is possible to remove chimeras from the database.  Iteratively repeating the process can eventually result in a clean reference database that is chimera free.

The reference database is required to have several fields of information that are not found in a newly created sequence database:
\begin{enumerate}
\item \term{bases} and \term{nonbases} as provided by the function \Rfunction{IdLengths}.
\item \term{origin} as provided by the function \Rfunction{FormGroups}.  The origin column contains the taxonomic rank above the level shared by the whole group.  For example, the origin of the group \textit{Nanoarchaeum} would be \textit{Root; Archaea; Nanoarchaeota}.  Without the taxonomic rank it would be possible to substitute numbers.  For example, the rank 3.1.4.2 with a group named 4 that included 4.1 and 4.2 would have origin 3.1.  These numbers could be generated using \Rfunction{IdClusters} with multiple \Rfunarg{cutoffs}.
\end{enumerate}

Below is an example of building a reference database \textit{de novo} from a FASTA file of LSU rRNA (23S) sequences downloaded from SILVA (\underline{\url{http://www.arb-silva.de/}}).  Of course, the reference database must match the type of the query sequences, so in the remainder of this vignette we use a comprehensive SSU rRNA (16S) database.  The information in this section is only given as an example of how to build a reference database from scratch.

\begin{Schunk}
\begin{Sinput}
> dbConn <- dbConnect(SQLite(), "SILVA-117 23S Reference Database.sqlite")
> Seqs2DB("SILVA 23S RefDB.fasta", "FASTA", dbConn, "")

50675 total sequences in the table DNA.
Time difference of 192.56 secs

[1] 50675
\end{Sinput}
\end{Schunk}

When importing a GenBank file the rank (classification) information is automatically imported from the ORGANISM field.  The classification is simply the levels of taxonomic rank separated by semicolons.  In this case the classification must be parsed from each sequence's description information following the ``>'' symbol.

\begin{Schunk}
\begin{Sinput}
> class <- dbGetQuery(dbConn, "select description from Seqs")
> f <- function (x) {paste(x[2:length(x)], collapse="")}
> class <- unlist(lapply(strsplit(class$description, " ", fixed=TRUE), f))
> Add2DB(data.frame(rank=class), dbConn)
Expression:  alter table DNA add column rank TEXT
Expression:  update or replace DNA set rank = :rank where row_names = :row_names 

Added to table DNA:  "rank".
Time difference of 0.88 secs

[1] TRUE
\end{Sinput}
\end{Schunk}

We can now screen the reference database for chimeras.  This process can be repeated until no new chimeras can be found in the reference database.  After completing this process the reference database will be ready for checking new sequences.

\begin{Schunk}
\begin{Sinput}
> lengths <- IdLengths(dbConn, add2tbl=TRUE)
  |===============================================================| 100%
Lengths counted for 50675 sequences.
Added to DNA:  "bases", "nonbases", and "width".

> groups <- FormGroups(dbConn, add2tbl=TRUE)
  |===============================================================| 100%
Updating column: "identifier"...
Updating column: "origin"...
Formed 72 distinct groups.
Added to table DNA: "identifier", "origin".
Time difference of 917.76 secs

> chimeras <- FindChimeras(dbConn, dbFileReference=dbConn, add2tbl=TRUE)
Found 102 possible chimeras.
Added to table DNA:  "chimera".
Time difference of 13387.34 secs

> dbDisconnect(dbConn)
\end{Sinput}
\end{Schunk}

Note that sequences in the reference database must be in the same orientation as the query sequences.  The function \Rfunction{OrientNucleotides} can be used to reorient the query sequences, if necessary.

%------------------------------------------------------------
\section{Steps to Find Chimeras}
%------------------------------------------------------------

\subsection{Assigning Sequences To A Reference Group}

\subsubsection{Classifying Sequences}

In order to assign new sequences to a group in the reference database, it is necessary to first classify the sequences.  There are a wide variety of classification methods that would work for this purpose.  Here we make use of the RDP Multiclassifier available from \underline{\url{http://rdp.cme.msu.edu/classifier/}}.  After downloading the multiclassifier.jar we need to use it to classify our sequences.  Be sure to change the path names to those on your system by replacing all of the text inside quotes labeled ``$<<$path to ...$>>$'' with the actual path on your system.

<<expr1,eval=FALSE>>=
output.file <- tempfile()
command <- paste("java -Xmx1g -jar '",
"<<path to multiclassifier.jar>>",
"' --hier_outfile=/dev/null --conf=0 --assign_outfile='",
output.file,
"' '",
"<<path to FASTA file>>",
"'",
sep="")
cat("\n\n", command, "\n\n", sep="")
@
\code{java -jar '/multiclassifier.jar' --hier_outfile=/dev/null --conf=0 --assign_outfile='/tmp/file1' '/seqs.fas'}

<<expr2,eval=FALSE>>=
system(command)
@

\subsubsection{Determining Reference Groups}

Next we need to determine which groups are present in the reference database and the number of sequences present in each group.

<<expr3,eval=FALSE>>=
dbRef <- dbConnect(SQLite(), '<<path to reference database>>')
t <- dbGetQuery(dbRef, "select distinct rank, identifier from Seqs group by rank")
y <- dbGetQuery(dbRef, paste("select distinct identifier,
	count(*) as count from Seqs", "where chimera is null group by identifier"))

t$count <- 0
for (i in 1:dim(t)[1])
t$count[i] <- y$count[which(y$identifier==t$identifier[i])]
@

\subsubsection{Assign to Groups}

We can now assign each of the classified sequences to a group in the reference database.

<<expr4,eval=FALSE>>=
r <- readLines(output.file)

z <- data.frame(identifier=I(character(length(r))),
	row.names=1:length(r))
confidence <- .51
for (i in 1:length(r)) {
	u <- unlist(strsplit(r[i], "\t"))
	
	confidenceLevel <- 0
	for (k in seq(5, length(u), 3)) {
		if (as.numeric(u[k]) < confidence) {
			confidenceLevel <- k
			break
		}
	}
	
	if (confidenceLevel == 0) { # full confidence
		r[i] <- paste(u[seq(3, length(u), 3)],
			collapse="; ")
	} else { # unclassified at some level
		r[i] <- paste(u[seq(3, (confidenceLevel - 3), 3)],
			collapse="; ")
		r[i] <- paste(r[i],
			"unclassified",
			sep="; ")
	}
	
	index <- grep(r[i],
		t$rank,
		fixed=TRUE)
	
	if (length(index) == 0)
		z$identifier[i] <- "unknown_lineage"
	else
		z$identifier[i] <- as.character(t$identifier[index[1]])
}
@

At this point we should have the group assignment for each sequence.
\begin{Schunk}
\begin{Sinput}
> head(z)
           identifier
1      Lysinibacillus
2      Staphylococcus
3      Staphylococcus
4   Carnobacteriaceae
5      Peptococcaceae
6 Erysipelotrichaceae
\end{Sinput}
\end{Schunk}

\subsection{Forming the Sequence Database}

\subsubsection{Importing Sequences}

Next, we need to load the sequences into a sequence database (\Rfunarg{dbFile}).  This database will be used directly by the \Rfunction{FindChimeras} function.  Here we can either set \code{dbConn} to a file location to write the database, or if there is enough memory simply create the database in memory.

<<expr5,eval=FALSE>>=
# specify a path for where to write the sequence database
dbConn <- "<<path to write sequence database>>"
# OR create the sequence database in memory
dbConn <- dbConnect(SQLite(), ":memory:")
Seqs2DB("<<path to FASTA file>>", "FASTA", dbConn, "unknown")
@

\begin{Schunk}
\begin{Sinput}
Reading FASTA file from line 0 to 99999 

175 total sequences in database.
Time difference of 0.2 secs

[1] 175
\end{Sinput}
\end{Schunk}

\subsubsection{Adding Group Names}

At this point we can add the group names that we determined previously to the sequence database we just created.

<<expr6,eval=FALSE>>=
Add2DB(z, dbConn)
@

\subsection{Finding Chimeras}

Finally, we can use the \Rfunction{FindChimeras} function to search for chimeras in our sequences.  Two sets of parameters are available for 16S sequences.  The defaults apply to sequences longer than 1,000 nucleotides, while the second command applies to sequences that are shorter in length.

<<expr7,eval=FALSE>>=
# exclude some identifiers from searches
exclude <- c("unclassified_Bacteria",
             "unclassified_Root",
             "unclassified_Archaea")
# full-length sequences
chimeras <- FindChimeras(dbConn, dbFileReference=dbRef, add2tbl=TRUE,
	maxGroupSize=1e4, excludeIDs=exclude)
# OR for short-length sequences
chimeras <- FindChimeras(dbConn, dbFileReference=dbRef, minLength=40,
	minSuspectFragments=2, add2tbl=TRUE, maxGroupSize=1e4,
        excludeIDs=exclude)
@

\begin{Schunk}
\begin{Sinput}
Group of Sequence (Number of Sequences):
  |===============================================================| 100%

Searching Other Groups:
  |===============================================================| 100%

Found 19 possible chimeras.
Added to table DNA:  "chimera".
Time difference of 1665.8 secs
\end{Sinput}
\end{Schunk}

If any chimeras were found then they are stored in the \code{chimeras data.frame}.  There are several ways to determine which sequences were found to be chimeric.

\begin{Schunk}
\begin{Sinput}
> # outputs the number of chimeras that were found
> cat("Number of chimeras =", dim(chimeras)[1])
Number of chimeras = 19 
> # get the ordering of chimeric sequences in the set
> row.names(chimeras)
 [1] "147" "89"  "87"  "90"  "88"  "132" "139" "97"  "5"   "120" "123" "13" 
[13] "118" "114" "173" "174" "175" "18"  "124"
> # allows viewing the chimera results in a web browser
> BrowseDB(dbConn)
[1] TRUE
> # returns the FASTA description of the chimeric sequences
> dbGetQuery(dbConn, "select description from DNA where chimera is not null")
1       uncultured bacterium; Fin_CL-100633_OTU-22.
2       uncultured bacterium; Fin_CL-030731_OTU-12.
3        uncultured bacterium; Fin_CL-050647_OTU-9.
4       uncultured bacterium; Pro_CL-100643_OTU-10.
5        uncultured bacterium; LO1_CL-100643_OTU-3.
6        uncultured bacterium; LO3_CL-100641_OTU-3.
7       uncultured bacterium; UWH_CL-010711_OTU-13.
8   uncultured bacterium; Chlplus_CL-120529_OTU-35.
9       uncultured bacterium; UWH_CL-110523_OTU-35.
10      uncultured bacterium; UWH_CL-100636_OTU-11.
11 uncultured bacterium; Chlminus_CL-090519_OTU-14.
12      uncultured bacterium; UWL_CL-080514_OTU-34.
13      uncultured bacterium; UWL_CL-080545_OTU-12.
14       uncultured bacterium; Pro_CL-100633_OTU-7.
15      uncultured bacterium; UWL_CL-110645_OTU-15.
16        uncultured bacterium; UWL_CL-09061_OTU-5.
17       uncultured bacterium; Fin_CL-03079_OTU-21.
18      uncultured bacterium; UWL_CL-110518_OTU-11.
19      uncultured bacterium; UWL_CL-110548_OTU-32.
\end{Sinput}
\end{Schunk}

\begin{Schunk}
\begin{Sinput}
> dbDisconnect(dbRef)
[1] TRUE
> dbDisconnect(dbConn)
[1] TRUE
> unlink(output.file)
\end{Sinput}
\end{Schunk}

\section{Session Information}
All of the output in this vignette was produced under the following
conditions:

<<sessinfo,echo=FALSE,results=tex>>=
toLatex(sessionInfo(), locale=FALSE)
@

\end{document}