\name{DBFMCL}
\alias{DBFMCL}

\title{The "Density Based Filtering and Markov CLustering" algorithm (DBF-MCL)}
\description{DBF-MCL is a three-step adaptive algorithm (\url{http://tagc.univ-mrs.fr/tbrowser/}) that \emph{(i)} finds elements located in dense areas (DBF), \emph{(ii)} uses the selected items to construct a graph, and \emph{(iii)} performs graph partitioning using the Markov CLustering algorithm (MCL).

This function requires the mcl program to be installed (\url{http://www.micans.org/mcl}). See the "Warnings" section for more information.
}


\usage{
DBFMCL(data = NULL, filename = NULL, path = ".", name = NULL, distance.method = c("pearson", "spearman", "euclidean", "spm", "spgm"), 
 clustering = TRUE, silent = FALSE, verbose = TRUE, k = 150, random = 3, memory.used = 1024, fdr = 10, inflation = 2.0, set.seed = 123, returnRank = FALSE)
}

\arguments{
  \item{data}{a \code{matrix}, \code{data.frame} or \code{ExpressionSet} object.}
  \item{filename}{a character string representing the file name.}
  \item{name}{a prefix for the names of the intermediary files created by DBF and MCL.}
  \item{path}{a character string representing the data directory where intermediary files are to be stored. Defaults to the current working directory.}
  \item{distance.method}{the method used to compute the distance to the k-th nearest neighbor. One of "pearson" (Pearson's correlation coefficient-based distance), "spearman" (Spearman's rho-based distance), "euclidean", "spm" or "spgm". The "spm" distance is the arithmetic mean of the two correlation-based distances, ("pearson" + "spearman")/2, whereas "spgm" is their geometric mean, sqrt("pearson" * "spearman").}
  \item{clustering}{indicates whether the partitioning step (MCL) should be applied to the data. If \code{clustering = FALSE}, the function returns a \code{DBFMCLresult} object that contains the informative elements (as detected by the DBF step) coerced into a single cluster.}
  \item{silent}{if set to \code{TRUE}, the progress of the distance matrix calculation is not displayed.}
  \item{verbose}{if set to \code{TRUE}, the function runs verbosely.}
  \item{k}{the neighborhood size.}
  \item{random}{the number of simulated distributions S to compute. By default \code{random = 3}.}
  \item{memory.used}{size of the memory used to store part of the distance matrix. The resulting sub-matrix is used to compute simulated distances to the k-th nearest neighbor (see the Details section).}
  \item{fdr}{an integer value corresponding to the false discovery rate (range: 0 to 100).}
  \item{inflation}{the main control of MCL. Inflation affects cluster granularity. It is usually chosen somewhere in the range \code{[1.2-5.0]}. \code{inflation = 5.0} will tend to result in fine-grained clusterings, whereas \code{inflation = 1.2} will tend to result in very coarse-grained clusterings. By default, \code{inflation = 2.0}. The default setting gives very good results for microarray data when k is set between 70 and 250.}
  \item{set.seed}{the seed for the random number generator.}
  \item{returnRank}{if \code{TRUE}, a rank matrix is returned in the \code{DBFMCLresult} object. The output files are processed using the \code{normalization} argument.}
}
\details{
When analyzing a noisy dataset, one is interested in isolating dense regions, as they are populated with genes/elements that display small distances to their nearest neighbors (i.e. strong profile similarities). To isolate these regions, DBF-MCL computes, for each gene/element, the distance to its k-th nearest neighbor (DKNN). In order to define a critical DKNN value that depends on the dataset, and below which a gene/element is considered to fall in a dense area, DBF-MCL computes simulated DKNN values using an empirical randomization procedure. Given a dataset containing n genes and p samples, a simulated DKNN value is obtained by sampling n distance values from the gene-gene distance matrix D and extracting the k-th smallest value. This procedure is repeated n times to obtain a set of simulated DKNN values S. The computed distributions of simulated DKNN values are used to compute an FDR value for each observed DKNN value. The critical value of DKNN is the one for which a user-defined FDR value (typically 10\%) is observed.

Genes with a DKNN value below this threshold are selected and used to construct a graph. In this graph, an edge is drawn between two genes (nodes) if one of them belongs to the k nearest neighbors of the other. Edges are weighted by the corresponding correlation coefficient (\emph{i.e.}, similarity), and the resulting graph is partitioned using the Markov CLustering algorithm (MCL).
}

\value{
a \code{DBFMCLresult} class object.
}

\section{Warnings}{
With the current implementation, this function works only on UNIX-like platforms.

MCL must be installed. The following commands can be used in a terminal:

\preformatted{
# Download the latest version of mcl (the script has been tested successfully with the 06-058 version).
wget http://micans.org/mcl/src/mcl-latest.tar.gz

# Uncompress and install mcl
tar xvfz mcl-latest.tar.gz
cd mcl-xx-xxx
./configure
make
sudo make install

# mcl should now be in your path
mcl -h
}
}


\references{
Lopez F., Textoris J., Bergon A., Didier G., Remy E., Granjeaud S., Imbert J., Nguyen C. and Puthier D. (2008) TranscriptomeBrowser: a powerful and
flexible toolbox to explore productively the transcriptional landscape of the Gene Expression Omnibus database. PLoS ONE, 3(12):e4001.

Van Dongen S. (2000) A cluster algorithm for graphs. National Research Institute for Mathematics and Computer Science in the Netherlands, Amsterdam. ISSN 1386-3681.
}

\author{Bergon A., Lopez F., Textoris J., Granjeaud S. and Puthier D.}
\seealso{\code{\link{createSignatures4TB}}}
\examples{
\dontrun{
## with an artificial dataset

m <- matrix(rnorm(80000), ncol = 20)
m[1:100, 1:10] <- m[1:100, 1:10] + 4
m[101:200, 11:20] <- m[101:200, 11:20] + 3
m[201:300, 5:15] <- m[201:300, 5:15] - 2

res <- DBFMCL(data = m, distance.method = "pearson", clustering = TRUE, k = 25)
plotGeneExpProfiles(res)
plotGeneExpProfiles(res,signatures=1)
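
## A minimal base-R sketch of the DBF step described in the Details section.
## This is an illustration only, not the code run by DBFMCL(): the object
## names (toy, k.toy, dknn.obs, dknn.sim, dense) are hypothetical, the
## distance is assumed to be 1 - Pearson correlation, and the cutoff is a
## crude stand-in for the FDR-based critical value.
set.seed(123)
toy <- matrix(rnorm(200 * 10), ncol = 10)
toy[1:40, 1:5] <- toy[1:40, 1:5] + 4     # plant one dense group of 40 rows
k.toy <- 10
n <- nrow(toy)

d <- 1 - cor(t(toy))                     # row-by-row distance matrix D
diag(d) <- NA                            # drop self-distances

# Observed DKNN: distance from each row to its k-th nearest neighbor
dknn.obs <- apply(d, 1, function(x) sort(x)[k.toy])

# Simulated DKNN: sample n distances from D and keep the k-th smallest
d.pool <- d[upper.tri(d)]
dknn.sim <- replicate(n, sort(sample(d.pool, n))[k.toy])

# Rows whose observed DKNN is below every simulated DKNN are kept as dense
dense <- which(dknn.obs < min(dknn.sim))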

## with a real dataset
library(ALL)
data(ALL)
sub <- exprs(ALL)[1:3000,]

# First, we normalize the data set using the doNormalScore function.
subNorm <- doNormalScore(sub)
res <- DBFMCL(subNorm, distance.method = "pearson", memory.used = 512)

# The results are stored in an instance of class DBFMCLresult.
class(res)
res

# The expression matrix is stored in the data slot. 
# This matrix contains only genes detected as informative (that is falling into a cluster).
head(res@data[,1:2])

# The partitioning results are stored in the cluster slot.
slotNames(res)

# Here, 3 TS were found.
res@size

# The following instruction can be used to get the expression matrix corresponding to the first TS.
res@data[res@cluster == 1, ]

# The high-level function plotGeneExpProfiles can be used to visualize,
# for instance, gene expression profiles corresponding to the first signature.
plotGeneExpProfiles(res, signatures = 1)

# To store the partitioning results on disk (as a tab-delimited file),
# use the writeDBFMCLresult function as shown below.
writeDBFMCLresult(res, filename.out = "ALL.sign.txt")

}
}

\keyword{manip}