\name{IdClusters}
\alias{IdClusters}
\title{
Cluster Sequences By Distance
}
\description{
Groups the sequences represented by a distance matrix into clusters of similarity.
}
\usage{
IdClusters(myDistMatrix,
           method = "UPGMA",
           cutoff = -Inf,
           showPlot = FALSE,
           asDendrogram = FALSE,
           myDNAStringSet = NULL,
           add2tbl = FALSE,
           dbFile = NULL,
           verbose = TRUE)
}
\arguments{
  \item{myDistMatrix}{
A symmetric \eqn{N} x \eqn{N} distance matrix with the values of dissimilarity between \eqn{N} sequences.
}
  \item{method}{
An agglomeration method to be used.  This should be (an unambiguous abbreviation of) one of \code{"complete"}, \code{"single"}, \code{"UPGMA"}, \code{"average"}, \code{"NJ"} or \code{"ML"}.
}
  \item{cutoff}{
A vector with the maximum branch length separating the sequences in the same cluster.  If \code{asDendrogram=TRUE} then only one cutoff may be specified.
}
  \item{showPlot}{
Logical specifying whether or not to plot the resulting dendrogram.
}
  \item{asDendrogram}{
Logical.  If \code{TRUE} the object returned is of class \code{dendrogram}.
}
  \item{myDNAStringSet}{
\code{DNAStringSet} used in the creation of \code{myDistMatrix}.  Only necessary if \code{method="ML"}.
}
  \item{add2tbl}{
Logical or a character string specifying the table name in which to add the result.
}
  \item{dbFile}{
A connection to a SQLite database or character string giving the path to the database file.  Only necessary if \code{add2tbl} is not \code{FALSE}.
}
  \item{verbose}{
Logical indicating whether to display progress.
}
}
\details{
Groups the input sequences into clusters using a set dissimilarities representing the distance between \eqn{N} sequences.  Initially a phylogenetic tree is formed using the specified \code{method}.  Then each leaf (sequence) of the tree is assigned to a cluster based on its branch lengths to the other leaves (sequences).

A number of different clustering methods are provided.  The method (\code{complete} assigns clusters using complete-linkage so that sequences in the same cluster are no more than \code{cutoff} percent apart.  The method \code{single} assigns clusters using single-linkage so that sequences in the same cluster are within \code{cutoff} of at least one other sequence in the same cluster.  \code{UPGMA} or \code{average} (the default) assigns clusters using average-linkage which is a compromise between the sensitivity of complete-linkage clustering to outliers and the tendency of single-linkage clustering to connect distant relatives that do not appear to be closely related.

\code{NJ} uses the Neighbor-Joining method proposed by Saitou and Nei that does not assume lineages evolve at the same rate (the molecular clock hypothesis).  The \code{NJ} method is typically the most phylogenetically accurate of the above distance based methods.  \code{ML} creates a neighbor-joining tree and then prints the negative log likelihood of the tree.  Presently \code{ML} does not adjust the neighbor joining tree to maximize its likelihood.

If a \code{add2tbl=TRUE} then the resulting data.frame is added/updated into column(s) of the default table "DNA" in \code{dbFile}.  If \code{add2tbl} is a character string then the result is added to the specified table name in \code{dbFile}.  The added/updated column names are printed if \code{verbose=TRUE}.
}
\value{
If \code{asDendrogram=FALSE} (the default), returns a data.frame with a column for each cutoff specified.  The row.names of the data.frame correspond to the dimnames of \code{myDistMatrix}.  Each one of \eqn{N} sequences is assigned to one of \eqn{M} clusters.
If \code{asDendrogram=TRUE}, returns an object of class \code{dendrogram} that can be used for further manipulation and plotting.  Leaves of the dendrogram are randomly colored by cluster number.
}
\references{
Felsenstein, J. (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. \emph{Journal of Molecular Evolution}, \bold{17(6)}, 368-376

Saitou, N. and Nei, M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. \emph{Molecular Biology and Evolution}, \bold{4(4)}, 406-425.
}
\author{
Erik Wright \email{DECIPHER@cae.wisc.edu}
}
\seealso{
\code{\link{DistanceMatrix}}, \code{\link{Add2DB}}
}
\examples{
# using the matrix from the original paper by Saitou and Nei
m <- matrix(0,8,8)
m[2:8,1] <- c(7, 8, 11, 13, 16, 13, 17)
m[3:8,2] <- c(5, 8, 10, 13, 10, 14)
m[4:8,3] <- c(5, 7, 10, 7, 11)
m[5:8,4] <- c(8, 11, 8, 12)
m[6:8,5] <- c(5, 6, 10)
m[7:8,6] <- c(9, 13)
m[8,7] <- c(8)

# returns an object of class "dendrogram"
myClusters <- IdClusters(m, cutoff=10, method="NJ", showPlot=TRUE, asDendrogram=TRUE)

# example of specifying a cutoff
# returns a data frame
IdClusters(m, cutoff=c(2,6,10,20))
}