%
% NOTE -- ONLY EDIT THE .Rnw FILE!!!  The .tex file is
% likely to be overwritten.
%
% \VignetteIndexEntry{gpls Tutorial}
% \VignetteDepends{}
% \VignetteKeywords{}
% \VignettePackage{gpls}
\documentclass[11pt]{article}

\usepackage{graphicx}
\usepackage[authoryear,round]{natbib}
\usepackage{hyperref}
\usepackage{url}

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}

<<echo=FALSE>>=
options(width=60)
@
%\parindent 0in

\bibliographystyle{plainnat}

\begin{document}
\title{ChemmineR: A Compound Mining Toolkit for Chemical Genomics in R}
\author{Yiqun Cao \\ Anna Charisi \\ Thomas Girke}
\maketitle

\section*{Introduction}

The \Rpackage{ChemmineR} package includes functions for calculating atom pair descriptors of chemical compounds  \citep{carhart1985apm, chen2002psm, Guha2007a}, structural similarity searching, clustering of compound libraries, and visualization of cluster results and chemical structures. These utilities are often important for the assembly of compound libraries and typical analysis routines of chemical genomics screening projects. An overview of all functions is illustrated in Fig. \ref{fig:overview}. The \Rfunction{cmp.parse} function accepts an SD file (SDF) of a whole library, parses it, and generates a descriptor database for all the compounds in the library. Similarly, the \Rfunction{cmp.parse1} function accepts an SDF for a single compound, parses it, and generates a descriptor vector for that compound. The \Rfunction{cmp.similarity} function computes atom pair-based similarities between two compounds using by default the Tanimoto coefficient as similarity measure. Searching for compounds in a library that are similar to a query structure can be accomplished with the \Rfunction{cmp.search} function. The \Rfunction{cmp.cluster} , \Rfunction{cluster.sizestat}, and \Rfunction{cluster.visualize} functions together allow binning clustering of compounds in a parsed library. The \Rfunction{sdf.subset} function provides utiliies for managing and subsetting libraries in SDF format, while the \Rfunction{sdf.visualize} function converts their compounds into images of chemical structures on HTML pages. \Rpackage{ChemmineR} integrates well with the online ChemMine \citep{girke2005ccm} portal which provides access to extensive compound annotations and web-based chemoinformatic tools.

\begin{figure}[bthp]
  \centering
   \includegraphics[width=\textwidth]{overview.pdf} 
   \caption{Overview of functions provided in the \Rpackage{ChemmineR}
   package.}
   \label{fig:overview}
\end{figure}

\section*{Installing the Package}
Users can download the package from the download page (\url{http://bioweb.ucr.edu/ChemMineV2/chemminer/download}) for the OS of their choice, and then use the \Rfunction{install.packages("...", repos=NULL)} command to install the package. More detailed instructions for Mac, Linux and Windows can be found in the online manual  (\url{http://bioweb.ucr.edu/ChemMineV2/chemminer/tutorial}).

\section*{Loading the Package and Its Documentation}
The package can be loaded with the standard \Rfunction{library} function:
<<>>=
library(ChemmineR)
@
The following command lists all functions available in the package:
<<eval=FALSE>>=
library(help=ChemmineR)
options(chemminer.max.upload=100)
@
The package contains a PDF manual, which can be loaded in the standard way:
<<eval=FALSE>>=
vignette("ChemmineR")
@

\section*{Compound Database Import from SDF}

The \Rfunction{cmp.parse} function imports SDFs containing the data for one or many compounds. It returns a searchable atom pair database, which can be used for structural similarity searching, clustering, and SDF manipulation. The only argument of this function is the path (or URL) to the file containing the SDF information. 


<<>>=
library(ChemmineR)
db <- cmp.parse("http://bioweb.ucr.edu/ChemMineV2/static/example_db.sdf", quiet=TRUE)
@
The database can be saved as an R-specific binary file for faster loading in the future. 
<<eval=FALSE>>=
save(db, file="compounddb.rda", compress=TRUE)
@
The \Rfunction{load} function loads the database back into R without repeating the time-consuming descriptor generation step. 
<<eval=FALSE>>=
load("compounddb.rda")
@

\section*{Single Compound Import from SDF}


 The \Rfunction{cmp.parse1} function will parse an SDF for a single compound. Similarly as before, the only argument required is the path (or URL) to the SDF.  
<<>>=
query.url <- "http://bioweb.ucr.edu/ChemMineV2/compound/Aurora/b32:NNQS2MBRHAZTI===/sdf"
query <- cmp.parse1(query.url)
@


\section*{Descriptor Database Content}

The descriptors of compounds are stored as numeric vectors in a list object along with available annotation information about the database. You may skip this section if you are not interested in internals of descriptor database.


The \Rfunction{cmp.parse1} function parses the SDF of a single compound, generates the descriptors and stores them in a numeric vector. Each entry of the vector is a descriptor for this compound.

<<>>=
class(query)
@

In contrast to this, the \Rfunction{cmp.parse} function generates a list object with four components.
<<>>=
names(db)
@

The \Robject{descdb} component is a list. Each entry of the list is a vector
of descriptors of one compound.
<<>>=
class(db$descdb)
class(db$descdb[[1]])
@

The \Rfunction{db.explain} function returns the descriptors in a human readable format. A single descriptor can be returned like this:
<<>>=
db.explain(query[1])
db.explain(db$descdb[[1]][1])
@
The same is possible for multiple descriptors at once.
<<>>=
db.explain(db$descdb[[1]][1:5])
@

\section*{Removing Duplicated Compounds}

The \Rfunction{cmp.duplicated}
function can be used to quickly identify and remove duplicated compounds in imported compound databases. In most cases the method will identify the duplicates correctly. However, users have to be aware that the atom pair algorithm will treat isomers, conformers and other smaller structural variants as identical compounds. If it is important to retain those variants in the data set then the function \Rfunction{cmp.duplicated} should not be used. The support of InChI stings will overcome this limitation in the future.  

The function takes a descriptor database as the only required argument and returns the duplication information as a logical vector.

To demo this feature on the imported sample data set, one can create a duplication with the following command.
<<>>=
db$descdb <- c(db$descdb, list(db$descdb[[1]])) 
@
In the next step the duplication is identified with the \Rfunction{cmp.duplicated} function. 
<<>>=
dup <- cmp.duplicated(db)
dup
@
The \texttt{TRUE} entry in the returned logical vector indicates the duplication. It can be easily removed with the standard R subsetting syntax.
<<>>=
db$descdb <- db$descdb[!dup]
@
In a real example one also needs to remove the duplications from the other database components.
<<eval=FALSE>>=
db$cids <- db$cids[!dup]
db$sdfsegs <- db$sdfsegs[!dup]
@

\section*{Pairwise Compound Comparisons}
The \Rfunction{cmp.similarity} function computes the atom pair similarity between two compounds using the Tanimoto coefficient as similarity measure. The Tanimoto coefficient between the atom pair descriptors of two compounds (CMP A and CMP B) is calculated here according to the following formula:

\begin{center}
Tanimoto coefficient = $c/(a + b + c)$ 

$a$ = count of atom pair descriptors in CMP A but not in CMP B  

$b$ = count of atom pair descriptors in CMP B but not in CMP A  

$c$ = count of atom pair descriptors shared by CMP A and CMP B  
\end{center}


<<>>=
similarity <- cmp.similarity(db$descdb[[1]], db$descdb[[2]])
cat(paste("The similarity between compound 1 and 2 is ", similarity, "\n", sep=""))
similarity <- cmp.similarity(db$descdb[[1]], query)
cat(paste("The similarity between compound 1 and the query is ", similarity, "\n", sep=""))
@

With the \Rfunction{cmp.similarity} function one can easily design custom search subroutines similar to the one introduced in the next section.

\section*{Similarity Searching}
The \Rfunction{cmp.search} function searches an atom pair database for compounds that are similar to a query compound.

<<>>=
cmp.search(db, query, cutoff=0.4, return.score=TRUE, quiet=TRUE)
@


The function returns a data frame where the rows are sorted by similarity score (best to worst). The first column contains the indices of the matching compounds in the database. The argument \Rfunarg{cutoff} can be a similarity cutoff, meaning only compounds with a similarity value larger than this cutoff will be returned; or it can be an integer value restricting how many compounds will be returned. If the argument \Rfunarg{return.score} is set to \texttt{FALSE}, then the function will return a vector of indices rather than a data frame. When supplying a cutoff of 0, the function will return the similarity values for every compound in the database.

<<eval=FALSE>>=
similarities <- cmp.search(db, db[[1]][[1]], cutoff =0, return.score=TRUE, quiet=TRUE) 
@

The \Rfunction{cmp.search} function allows to visualize the chemical structure images for the search results. A similar but more flexible chemical structure rendering function (\Rfunction{sdf.visualize}) is described later in this manual. By setting the \Rfunarg{visualize} argument in \Rfunction{cmp.search} to \texttt{TRUE}, the matching compounds and their scores can be visualized with a standard web browser on the online ChemMine interface. Depending on the \Rfunarg{visualize.browse} argument, an URL will be printed or a webpage will be opened showing the structures of the matching compounds along with their scores.
<<eval=FALSE>>=
similarities <- cmp.search(db, query, cutoff=10, quiet=TRUE, visualize=TRUE, visualize.browse=FALSE) 
@
Setting the \texttt{visualize.browse} argument to \texttt{TRUE} will automatically open the webpage in the default browser. 

The query structure can also be displayed on the visualization webpage by supplying the SDF of the query in a character string or providing its file name or URL. For example,
<<eval=FALSE>>=
similarities <- cmp.search(db, query, cutoff=10, quiet=TRUE, visualize=TRUE, visualize.browse=TRUE, visualize.query=query.url) 
@
This will read the SDF provided by \texttt{query.url}, and display it as a ``reference compound'' at the top of the page. Part of the screenshot of the resulting output is shown in Fig. \ref{fig:search}. A live demo is also available and linked from the online version of this manual (\url{http://bioweb.ucr.edu/ChemMineV2/chemminer/tutorial}).

\begin{figure}[bthp]
  \centering
   \includegraphics[width=\textwidth]{searchscreenshot.pdf} 
   \caption{\Rfunction{cmp.search} can automatically upload the structures and scores of matching compounds to ChemMine for visualization.}
   \label{fig:search}
\end{figure}


Any information uploaded to \textit{ChemMine} by \Rpackage{ChemmineR} is kept private and secure using a highly randomized URL. The visualization pages can be shared with colleagues by providing the corresponding URLs.

\section*{Rendering Chemical Structure Images}
Internally, the similarity search function uses the \Rfunarg{sdf.visualize} function to send compounds to ChemMine for structure visualization. The same function can be used to send any custom combination of compounds for visualization on ChemMine along with complex annotation and activity information. The function accepts a database and a vector of compound indices. The following example performs first a similarity search to obtain a vector of indices.

<<>>=
indices <- cmp.search(db, query, cutoff=10, quiet=TRUE)
@
<<eval=FALSE>>=
url <- sdf.visualize(db, indices, browse=TRUE, quiet=TRUE)
@
<<echo=FALSE>>=
url <- sdf.visualize(db, indices, browse=FALSE, quiet=TRUE)
@
<<>>=
url
@
The URL stored in the \Robject{url} object points to a webpage that shows the structures of the compounds. If the \Rfunarg{browse} argument is set to \texttt{TRUE}, then the default browser will open automatically.

In addition, one can display other information next to the structures using the \Rfunarg{extra} argument. In the following example, a vector of character strings is assigned to \Rfunarg{extra}, and its entries are displayed next to corresponding chemical structures.
<<>>=
extra.info <- paste("Matching compound #", 1:length(indices), sep="")
names(extra.info) <- rep("Note", length(indices))
@
<<eval=FALSE>>=
url <- sdf.visualize(db, indices, browse=TRUE, quiet=TRUE, extra=extra.info)
@
<<echo=FALSE>>=
url <- sdf.visualize(db, indices, browse=FALSE, quiet=TRUE, extra=extra.info)
<<>>=
url
@

The function also allows to list a reference compound at the top of the page. The user supplies the SDF of this reference compound in form of a character string or a file. Annotation information can also be displayed next to the reference structure. 

<<eval=FALSE>>=
url <- sdf.visualize(db, indices, browse=TRUE, quiet=TRUE, reference.sdf=query.url, reference.note="Reference Compound from Aurora Library")
@
<<echo=FALSE>>=
url <- sdf.visualize(db, indices, browse=FALSE, quiet=TRUE, reference.sdf=query.url, reference.note="Reference Compound from Aurora Library")
@
<<>>=
url
@

It is also possible to display more complex tabular data next to each compound by providing a list of data frames. To demonstrate this utility, the following example creates such a list of data frames via a similarity search. Each data frame is then displayed next to the corresponding compound. The screenshot of the resulting output is shown in Fig. \ref{fig:visualize}.
\begin{figure}[bthp]
  \centering
   \includegraphics[width=\textwidth]{visualizescreenshot.pdf} 
   \caption{Visualization webpage created by calling \Rfunction{sdf.visualize}. This page shows the table information properly rendered and displayed next to the compound structures.}
   \label{fig:visualize}
\end{figure}

To generate this output, first a similarity is performed using a cutoff of 0 to obtain the similarity values between the query compound and each of the compounds in the database. 
<<>>=
search.results <- cmp.search(db, query, cutoff=0, return.score=T, quiet=T)
@

The resulting data frame will be used as annotation table for the query compound. To provide a table name, one has to embed it into a list. If a table name is not required, then there is no need to generate the list object
<<>>=
note.q <- list(search.results)
names(note.q) <- "Similarities With All"
@

For each of the top 10 hits in the search result, we perform the same search to obtain the similarity values between the hit and all compounds in the database. This information will then be displayed next to the structures on the visualization page. 
<<>>=
indices <- search.results[1:10,1]
notes <- list()
for (i in indices) {
  search.results <- cmp.search(db, db$descdb[[i]], 0, return.score=T, quiet=T)
  notes <- c(notes, list(search.results))
}
names(notes) <- rep("Similarities With All", 10)
@
The following step displays the complex sample data set on ChemMine.
<<eval=FALSE>>=
url <- sdf.visualize(db, indices, extra=notes, browse=T, reference.sdf=query.url, reference.note=note.q)
@

Note: the \Rfunction{sdf.visualize} function depends on the original SDF file from which the descriptor database has been generated. If the SDF file has been moved or altered then this step cannot be used.

Any information uploaded to \textit{ChemMine} by \Rpackage{ChemmineR} is kept private and secure using a highly randomized URL. The visualization pages can be shared with colleagues by providing the corresponding URLs.


\section*{Subsetting SDF Batch Files}
After identifying a subset of interesting compounds, one can generate an SDF for this subset of compounds using the \Rfunction{sdf.subset} function.


For example, one can perform a similarity search, and use the top 10 results for subsetting.
<<>>=
indices <- cmp.search(db, query, cutoff=10, quiet=TRUE)
@
With the corresponding indices one can generate a custom SDF batch data set and store it in an external file. 
<<>>>=
sdf <- sdf.subset(db, indices)
cat(sdf, file="matching.sdf")
@

One may also create a sub-database from a descriptor database using the  related \Rfunction{db.subset} function.
<<>>>=
db_sub <- db.subset(db, indices)
@

Note: the \Rfunction{sdf.visualize} function depends on the original SDF file from which the descriptor database has been generated.


\section*{Binning Clustering}

Compound libraries can be clustered into discrete similarity groups with the binning clustering function \Rfunction{cmp.cluster}. 
This binning clustering method is optimized for the typical compound library analysis routines for high-throughput screening projects. The algorithm uses single-linkage clustering to join compounds into similarity groups, where every member in a cluster shares with at least one other member a similarity value above a user-specified threshold. The algorithm is optimized for speed and memory efficiency by avoiding the calculation of an all-against-all distance matrix. This is achieved by calculating on-the-fly only the distance values that are required in each clustering step. Because an optimum similarity threshold is often unknown, a series of binning clustering results can be calculated simultaneously for several user-specified thresholds. Cluster results for several thresholds can be calculated almost with the same speed as for a single threshold by issuing multiple clustering processes simultaneously, but calculating the required distances only once.

The function requires as input a descriptor database as well as a similarity threshold. The binning clustering result is returned in form of a data frame. Single linkage is used for cluster joining. The function calculates the required compound-to-compound distance information on the fly, while a memory-intensive distance matrix is only created upon user request via the \Rfunarg{save.distances} argument (see below).

<<>>=
clusters <- cmp.cluster(db, cutoff=0.65, quiet=TRUE)
@
The previous step clusters the compounds stored in \Rfunarg{db} with a similarity cutoff of 0.65. In other words, if two compounds share a similarity of 0.65 or above, then they will be joined into the same cluster. The first 10 rows of the result data frame are shown here:

<<>>=
clusters[1:10,]
@
The first column contains the compound IDs, the second the cluster size and third the cluster ID. The compound in cluster ID 1 can be returned with the following command: 
<<>>=
clusters[clusters[,3]==1,]
@
Similarly as above, one can visualize the chemical structures for a compound cluster of interest with the \Rfunction{sdf.visualize} function.  
<<eval=FALSE>>=
ids <- clusters[clusters[,3]==23, 1]
sdf.visualize(db, ids, browse=TRUE, quiet=TRUE)
@

\section*{Binning Clustering with Multiple Cutoffs}
Because an optimum similarity threshold is often not known, the
\Rfunction{cmp.cluster} function can calculate cluster results for multiple cutoffs  in one step with almost the same speed as for a single cutoff. The clustering results for the different cutoffs will be stored in one data frame.
<<>>=
clusters <- cmp.cluster(db, cutoff=c(0.65, 0.5), quiet=TRUE)
@
The first 10 rows of the generated cluster result data frame are:
<<>>=
clusters[1:10,]
@
Cluster 14 obtained by the cutoff 0.65 contains the following compounds: 
<<>>=
clusters[clusters[, "CLID_0.65"]==14,]
@
and cluster 14 from cutoff 0.5 contains:
<<>>=
clusters[clusters[, "CLID_0.5"]==14,]
@


One may force the \Rfunction{cmp.cluster} function to calculate and store the distance matrix by supplying a file name to the \Rfunarg{save.distances} argument. The generated distance matrix can be loaded and passed on to many other clustering methods available in R, such as the hierarchical clustering function \Rfunction{hclust} (see below).

If a distance matrix is available, it may also be supplied to \Rfunction{cmp.cluster} via the \Rfunarg{use.distances} argument. This is useful when one has a pre-computed distance matrix either from a previous call to \Rfunction{cmp.cluster} or from other distance calculation subroutines. 

\section*{Multi-Dimensional Scaling (MDS)}

To visualize and compare clustering results, the \Rfunction{cluster.visualize} function can be used. The function performs Multi-Dimensional Scaling (MDS) and visualizes the results in form of a scatter plot. It requires as input a descriptor database, a clustering result from \Rfunction{cmp.cluster}, and a cutoff for the minimum cluster size to consider in the plot.  To help determining a proper cutoff size, the \Rfunction{cluster.sizestat} function is provided to generate cluster size statistics.

The following example uses the clustering result obtained above using cutoff values 0.65 and 0.5. By default, the \Rfunction{cluster.sizestat} uses the first cutoff value:

<<>>=
cluster.sizestat(clusters)
@

Based on this size statistics, clusters of size 3 or larger will be included in the following MDS visualization step.

<<fig=TRUE>>=
coord <- cluster.visualize(db, clusters, 3, quiet=TRUE)
@

By default \Rfunction{cluster.visualize} will draw the scatter plot in the  R plotting device, and the user can interactively click a point to retrieve more information on the corresponding compounds. In the non-interactive mode (\Rfunarg{non.interactive}), it will save the plot to a specified file in EPS or PDF format.

A 3D MDS plot can be created with the following sequence of commands.

<<eval=FALSE>>=
coord <- cluster.visualize(db, clusters, 3, dimensions=3, quiet=TRUE)
library(scatterplot3d)
scatterplot3d(coord)
@
% drawing requires scatterplot3d, so we include pre-generated figure
\begin{figure}[bthp]
  \centering
   \includegraphics[width=.6\textwidth]{scatterplot.pdf} 
   \caption{3D MDS plot generated using coordinate information returned by \Rfunction{cluster.visualize} and the \Rpackage{scatterplot3d} package.}
   \label{fig:mds}
\end{figure}


The data returned by the \Rfunction{cluster.visualize} can also be inspected with the fully interactive \Rpackage{rggobi} data visulalization system. The GGobi software and its dependencies can be obtained from the GGobi project site (\url{http://www.ggobi.org/rggobi}). The following commands demonstrate the import of the generated MDS data set into \Rpackage{rggobi}.
<<eval=FALSE>>=
library(rggobi)
ggobi(coord)
@


\section*{Clustering with Other Packages}

\Rpackage{ChemmineR} allows the user to take advantage of the wide spectrum of clustering utilities available in R. An example on how to perform hierarchical clustering with the \Rfunction{hclust} function is given below.
<<>>=
dummy <- cmp.cluster(db, 0, save.distances="distmat.rda", quiet=T)
@
The \Rfunction{cmp.cluster} function is used with the \Rfunarg{save.distances="distmat.rda"} argument to generate a distance matrix. The matrix is saved to a file named \texttt{'distmat.rda'} and it needs to be loaded into R with the \Rfunction{load} function. This matrix can be directly passed on to \Rfunction{hclust}.
<<fig=TRUE>>=
load('distmat.rda')
hc <- hclust(as.dist(distmat), "single")
plot(as.dendrogram(hc), horiz=T)
@

\section*{Format Interconversions between SMILES and SDF}
This option will be provided in the future. At this point, SMILES strings can be imported into ChemmineR only indirectly by converting them into SDFs via ChemMine's online WorkBench (\url{http://bioweb.ucr.edu/ChemMineV2/work/smiles/}).

\section*{Calculation of Physicochemical Descriptors}
Several functions will be available in the near future for calculating physicochemical descriptors directly in ChemmineR. Currently, users can calculate 40 common physicochemical descriptors with the online descriptor prediction tool available on ChemMine's WorkBench (\url{http://bioweb.ucr.edu/ChemMineV2/work/sdf/}).

\bibliography{bibtex}
\end{document}