% Manual compile % Sweave("ChemmineR.Rnw"); system("pdflatex ChemmineR.tex; bibtex ChemmineR; pdflatex ChemmineR.tex; pdflatex ChemmineR.tex") % echo 'Sweave("ChemmineR.Rnw")' | R --slave; echo 'Stangle("ChemmineR.Rnw")' | R --slave; pdflatex ChemmineR.tex; bibtex ChemmineR; pdflatex ChemmineR.tex % NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. % % \VignetteIndexEntry{ChemmineR Tutorial} % \VignetteDepends{} % \VignetteKeywords{} % \VignettePackage{gpls} \documentclass{article} <>= BiocStyle::latex() @ \usepackage[authoryear,round]{natbib} \bibliographystyle{plainnat} \def\bibsection{\section{References}} \usepackage{graphicx} \usepackage{color} \usepackage{hyperref} \usepackage{url} %\newcommand{\comment}[1]{} %\newcommand{\Rfunction}[1]{{\texttt{#1}}} %\newcommand{\Robject}[1]{{\texttt{#1}}} %\newcommand{\Rpackage}[1]{{\textit{#1}}} %\newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} %\newcommand{\Rclass}[1]{{\textit{#1}}} % Define header and footer area with fandyhdr package (see: http://www.ctan.org/tex-archive/macros/latex/contrib/fancyhdr/fancyhdr.pdf) \usepackage{fancyhdr} \pagestyle{fancy} \fancyhead{} \fancyfoot{} \rhead{\nouppercase{\leftmark}} \lhead{\textit{ChemmineR Manual}} \rfoot{\thepage} <>= options(width=95) unlink("test.db") @ %\parindent 0in %\bibliographystyle{plainnat} \begin{document} \title{ChemmineR: Cheminformatics Toolkit for R} \author{Yiqun Cao, Tyler Backman, Kevin Horan, Thomas Girke \\ Email contact: thomas.girke@ucr.edu} \maketitle \section{Introduction} \Rpackage{ChemmineR} is a cheminformatics package for analyzing drug-like small molecule data in R. Its latest version contains functions for efficient processing of large numbers of small molecules, physicochemical/structural property predictions, structural similarity searching, classification and clustering of compound libraries with a wide spectrum of algorithms. \begin{figure}[bthp] \centering \includegraphics[width=0.75\textwidth]{overview.png} \caption{\Rpackage{ChemmineR} environment with its add-on packages and selected functionalities} \label{fig:overview} \end{figure} \noindent In addition, \Rpackage{ChemmineR} offers visualization functions for compound clustering results and chemical structures. The integration of chemoinformatic tools with the R programming environment has many advantages, such as easy access to a wide spectrum of statistical methods, machine learning algorithms and graphic utilities. The first version of this package was published in \cite{Cao2008c}. Since then many additional utilities and add-on packages have been added to the environment and many more are under development for future releases (Figure 2; \citep{Backman2011a, Wang2013a}. \tableofcontents \section{\textcolor{red}{Recently Added Features}} \begin{itemize} \item Improved SMILES support via new \Rclass{SMIset} object class and SMILES import/export functions \item Integration of a subset of OpenBabel functionalities via new \Rpackage{ChemmineOB} add-on package \citep{OBoyle2011a} \item Streaming functionality for processing millions of molecules on a laptop \item Mismatch tolerant maximum common substructure (MCS) search algorithm \item Fast and memory efficient fingerprint search support using atom pair or PubChem fingerprints \end{itemize} \section{Getting Started} \subsection{Installation} The R software for running ChemmineR can be downloaded from CRAN (\url{http://cran.at.r-project.org/}). The ChemmineR package can be installed from R using the \Rfunction{bioLite} install command. <>= source("http://bioconductor.org/biocLite.R") # Sources the biocLite.R installation script. biocLite("ChemmineR") # Installs the package. @ \subsection{Loading the Package and Documentation} <>= library("ChemmineR") # Loads the package @ <>= library(help="ChemmineR") # Lists all functions and classes vignette("ChemmineR") # Opens this PDF manual from R @ \subsection{Five Minute Tutorial} The following code gives an overview of the most important functionalities provided by \Rpackage{ChemmineR}. Copy and paste of the commands into the R console will demonstrate their utilities. \\ \noindent Create Instances of \Rclass{SDFset} class: <>= data(sdfsample) sdfset <- sdfsample sdfset # Returns summary of SDFset sdfset[1:4] # Subsetting of object sdfset[[1]] # Returns summarized content of one SDF <>= view(sdfset[1:4]) # Returns summarized content of many SDFs, not printed here as(sdfset[1:4], "list") # Returns complete content of many SDFs, not printed here @ \noindent An \Rclass{SDFset} is created during the import of an SD file: <>= sdfset <- read.SDFset("http://faculty.ucr.edu/~tgirke/Documents/ R_BioCond/Samples/sdfsample.sdf") @ \noindent Miscellaneous accessor methods for \Rclass{SDFset} container: <>= header(sdfset[1:4]) # Not printed here <>= header(sdfset[[1]]) <>= atomblock(sdfset[1:4]) # Not printed here <>= atomblock(sdfset[[1]])[1:4,] <>= bondblock(sdfset[1:4]) # Not printed here <>= bondblock(sdfset[[1]])[1:4,] <>= datablock(sdfset[1:4]) # Not printed here <>= datablock(sdfset[[1]])[1:4] @ \noindent Assigning compound IDs and keeping them unique: <>= cid(sdfset)[1:4] # Returns IDs from SDFset object sdfid(sdfset)[1:4] # Returns IDs from SD file header block unique_ids <- makeUnique(sdfid(sdfset)) cid(sdfset) <- unique_ids @ \noindent Converting the data blocks in an \Rclass{SDFset} to a matrix: <>= blockmatrix <- datablock2ma(datablocklist=datablock(sdfset)) # Converts data block to matrix numchar <- splitNumChar(blockmatrix=blockmatrix) # Splits to numeric and character matrix numchar[[1]][1:2,1:2] # Slice of numeric matrix numchar[[2]][1:2,10:11] # Slice of character matrix @ \noindent Compute atom frequency matrix, molecular weight and formula: <>= propma <- data.frame(MF=MF(sdfset), MW=MW(sdfset), atomcountMA(sdfset)) propma[1:4, ] @ \noindent Assign matrix data to data block: <>= datablock(sdfset) <- propma datablock(sdfset[1]) @ \noindent String searching in \Rclass{SDFset} (): <>= grepSDFset("650001", sdfset, field="datablock", mode="subset") # Returns summary view of matches. Not printed here. . @ <>= grepSDFset("650001", sdfset, field="datablock", mode="index") @ \noindent Export SDFset to SD file: <>= write.SDF(sdfset[1:4], file="sub.sdf", sig=TRUE) @ \noindent Plot molecule structure of one or many SDFs: <>= plot(sdfset[1:4], print=FALSE) # Plots structures to R graphics device @ <>= sdf.visualize(sdfset[1:4]) # Compound viewing in web browser @ \begin{figure}[bthp] \centering \includegraphics[width=\textwidth]{visualizescreenshot.png} \caption{Visualization webpage created by calling \Rfunction{sdf.visualize}.} \label{fig:visualize} \end{figure} \noindent Structure similarity searching and clustering: <>= apset <- sdf2ap(sdfset) # Generate atom pair descriptor database for searching . @ <>= data(apset) # Load sample apset data provided by library. cmp.search(apset, apset[1], type=3, cutoff = 0.3, quiet=TRUE) # Search apset database with single compound. cmp.cluster(db=apset, cutoff = c(0.65, 0.5), quiet=TRUE)[1:4,] # Binning clustering using variable similarity cutoffs. @ \section{OpenBabel Functions} \Rpackage{ChemmineR} integrates now a subset of cheminformatics functionalities implemented in the OpenBabel C++ library \citep{OBoyle2008a, OBoyle2011a}. These utilities can be accessed by installing the \Rpackage{ChemmineOB} package and the OpenBabel software itself. \Rpackage{ChemmineR} will automatically detect the availability of \Rpackage{ChemmineOB} and make use of the additional utilities. The following lists the functions and methods that make use of OpenBabel. References are included to locate the sections in the manual where the utility and usage of these functions is described. \\ \vspace{0.2cm} \noindent \textit{Structure format interconversions} (see Section \ref{sec:conversions}) \begin{itemize} \item \Rfunction{smiles2sdf}: converts from SMILES to SDF object \item \Rfunction{sdf2smiles}: converts from SDF to SMILES object \item \Rfunction{convertFormat}: converts strings between two formats \item \Rfunction{convertFormatFile}: converts files between two formats \item \Rfunction{propOB}: generates several compound properites \item \Rfunction{fingerprintOB}: generates fingerprints for compounds \end{itemize} \section{Overview of Classes and Functions} The following list gives an overview of the most important S4 classes, methods and functions available in the ChemmineR package. The help documents of the package provide much more detailed information on each utility. The standard R help documents for these utilities can be accessed with this syntax: \Rfunction{?function\_name} (\textit{e.g.} \Rfunction{?cid}) and \Rfunction{?class\_name-class} (\textit{e.g}. \Rfunction{?"SDFset-class"}). \subsection{Molecular Structure Data} \noindent \textit{Classes} \begin{itemize} \item \Rclass{SDFstr}: intermediate string class to facilitate SD file import; not important for end user \item \Rclass{SDF}: container for single molecule imported from an SD file \item \Rclass{SDFset}: container for many SDF objects; most important structure container for end user \item \Rclass{SMI}: container for a single SMILES string \item \Rclass{SMIset}: container for many SMILES strings \end{itemize} \noindent \textit{Functions/Methods (mainly for \Rclass{SDFset} container, \Rclass{SMIset} should be coerced with \Rfunction{smiles2sdf} to \Rclass{SDFset})} \begin{itemize} \item Accessor methods for \Rclass{SDF/SDFset} \begin{itemize} \item Object slots: \Rfunction{cid, header, atomblock, bondblock, datablock (sdfid, datablocktag)} \item Summary of \Rclass{SDFset}: \Rfunction{view} \item Matrix conversion of data block: \Rfunction{datablock2ma, splitNumChar} \item String search in SDFset: \Rfunction{grepSDFset} \end{itemize} \item Coerce one class to another \begin{itemize} \item Standard syntax \Rfunction{as(..., "...")} works in most cases. For details see R help with \Rclass{?"SDFset-class"}. \end{itemize} \item Utilities \begin{itemize} \item Atom frequencies: \Rfunction{atomcountMA, atomcount} \item Molecular weight: \Rfunction{MW} \item Molecular formula: \Rfunction{MF} \item ... \end{itemize} \item Compound structure depictions \begin{itemize} \item R graphics device: \Rfunction{plot, plotStruc} \item Online: \Rfunction{cmp.visualize} \end{itemize} \end{itemize} \subsection{Structure Descriptor Data} \noindent \textit{Classes} \begin{itemize} \item \Rclass{AP}: container for atom pair descriptors of a single molecule \item \Rclass{APset}: container for many AP objects; most important structure descriptor container for end user \item \Rclass{FP}: container for fingerprint of a single molecule \item \Rclass{FPset}: container for fingerprints of many molecules, most important structure descriptor container for end user \end{itemize} \noindent \textit{Functions/Methods} \begin{itemize} \item Create \Rclass{AP/APset} instances \begin{itemize} \item From \Rclass{SDFset}: \Rfunction{sdf2ap} \item From SD file: \Rfunction{cmp.parse} \item Summary of \Rclass{AP/APset}: \Rfunction{view, db.explain} \end{itemize} \item Accessor methods for AP/APset \begin{itemize} \item Object slots: \Rfunction{ap, cid} \end{itemize} \item Coerce one class to another \begin{itemize} \item Standard syntax \Rfunction{as(..., "...")} works in most cases. For details see R help with \Rclass{?"APset-class"}. \end{itemize} \item Structure Similarity comparisons and Searching \begin{itemize} \item Compute pairwise similarities : \Rfunction{cmp.similarity, fpSim} \item Search APset database: \Rfunction{cmp.search, fpSim} \end{itemize} \item AP-based Structure Similarity Clustering \begin{itemize} \item Single-linkage binning clustering: \Rfunction{cmp.cluster} \item Visualize clustering result with MDS: \Rfunction{cluster.visualize} \item Size distribution of clusters: \Rfunction{cluster.sizestat} \end{itemize} \end{itemize} \section{Import of Compounds} \subsection{SDF Import} The following gives an overview of the most important import/export functionalities for small molecules provided by \Rpackage{ChemmineR}. The given example creates an instance of the \Rclass{SDFset} class using as sample data set the first 100 compounds from this PubChem SD file (SDF): Compound\_00650001\_00675000.sdf.gz (\url{ftp://ftp.ncbi.nih.gov/pubchem/Compound/CURRENT-Full/SDF/}). \\ \noindent SDFs can be imported with the \Rfunction{read.SDFset} function: <>= sdfset <- read.SDFset("http://faculty.ucr.edu/~tgirke/Documents/ R_BioCond/Samples/sdfsample.sdf") @ <>= data(sdfsample) # Loads the same SDFset provided by the library sdfset <- sdfsample valid <- validSDF(sdfset) # Identifies invalid SDFs in SDFset objects sdfset <- sdfset[valid] # Removes invalid SDFs, if there are any @ \noindent Import SD file into \Rclass{SDFstr} container: <>= sdfstr <- read.SDFstr("http://faculty.ucr.edu/~tgirke/Documents/ R_BioCond/Samples/sdfsample.sdf") @ \noindent Create \Rclass{SDFset} from \Rclass{SDFstr} class: <>= sdfstr <- as(sdfset, "SDFstr") sdfstr as(sdfstr, "SDFset") @ \subsection{SMILES Import} The \Rfunction{read.SMIset} function imports one or many molecules from a SMILES file and stores them in a \Rclass{SMIset} container. The input file is expected to contain one SMILES string per row with tab-separated compound identifiers at the end of each line. The compound identifiers are optional. \\ \noindent Create sample SMILES file and then import it: <>= data(smisample); smiset <- smisample write.SMI(smiset[1:4], file="sub.smi") smiset <- read.SMIset("sub.smi") @ \noindent Inspect content of \Rclass{SMIset}: <>= data(smisample) # Loads the same SMIset provided by the library smiset <- smisample smiset view(smiset[1:2]) @ \noindent Accessor functions: <>= cid(smiset[1:4]) (smi <- as.character(smiset[1:2])) @ \noindent Create \Rclass{SMIset} from named character vector: <>= as(smi, "SMIset") @ \section{Export of Compounds} \subsection{SDF Export} \noindent Write objects of classes \Rclass{SDFset/SDFstr/SDF} to SD file: <>= write.SDF(sdfset[1:4], file="sub.sdf") @ \noindent Writing customized \Rclass{SDFset} to file containing \Rpackage{ChemmineR} signature, IDs from \Rclass{SDFset} and no data block: <>= write.SDF(sdfset[1:4], file="sub.sdf", sig=TRUE, cid=TRUE, db=NULL) @ \noindent Example for injecting a custom matrix/data frame into the data block of an \Rclass{SDFset} and then writing it to an SD file: <>= props <- data.frame(MF=MF(sdfset), MW=MW(sdfset), atomcountMA(sdfset)) datablock(sdfset) <- props write.SDF(sdfset[1:4], file="sub.sdf", sig=TRUE, cid=TRUE) @ \noindent Indirect export via \Rclass{SDFstr} object: <>= sdf2str(sdf=sdfset[[1]], sig=TRUE, cid=TRUE) # Uses default components sdf2str(sdf=sdfset[[1]], head=letters[1:4], db=NULL) # Uses custom components for header and data block @ \noindent Write \Rclass{SDF}, \Rclass{SDFset} or \Rclass{SDFstr} classes to file: <>= write.SDF(sdfset[1:4], file="sub.sdf", sig=TRUE, cid=TRUE, db=NULL) write.SDF(sdfstr[1:4], file="sub.sdf") cat(unlist(as(sdfstr[1:4], "list")), file="sub.sdf", sep="\n") @ \subsection{SMILES Export} \noindent Write objects of class \Rclass{SMIset} to SMILES file with and without compound identifiers: <>= data(smisample); smiset <- smisample # Sample data set write.SMI(smiset[1:4], file="sub.smi", cid=TRUE) write.SMI(smiset[1:4], file="sub.smi", cid=FALSE) @ \section{Format Interconversions}\label{sec:conversions} The \Rfunction{sdf2smiles} and \Rfunction{smiles2sdf} functions provide format interconversion between SMILES strings (Simplified Molecular Input Line Entry Specification) and \Rclass{SDFset} containers.\\ \noindent Convert an \Rclass{SDFset} container to a SMILES \Rclass{character} string: <>= data(sdfsample); sdfset <- sdfsample[1] smiles <- sdf2smiles(sdfset) smiles @ \noindent Convert a SMILES \Rclass{character} string to an \Rclass{SDFset} container: <>= sdf <- smiles2sdf("CC(=O)OC1=CC=CC=C1C(=O)O\tname") view(sdf) @ \noindent When the \Rpackage{ChemineOB} package is installed these conversions are performed with the OpenBabel Open Source Chemistry Toolbox. Otherwise the functions will fall back to using the ChemMine Tools web service for this operation. The latter will require internet connectivity and is limited to only the first compound given. \Rpackage{ChemmineOB} provides access to the compound format conversion functions of OpenBabel. Currently, over 160 formats are supported by OpenBabel. The functions \Rfunction{convertFormat} and \Rfunction{convertFormatFile} can be used to convert files or strings between any two formats supported by OpenBabel. For example, to convert a SMILES string to an SDF string, one can use the \Rfunction{convertFormat} function. <>= sdfStr <- convertFormat("SMI","SDF","CC(=O)OC1=CC=CC=C1C(=O)O\ttest_name") @ \noindent This will return the given compound as an SDF formatted string. 2D coordinates are also computed and included in the resulting SDF string. \noindent To convert a file with compounds encoded in one format to another format, the \Rfunction{convertFormatFile} function can be used instead. <>= convertFormatFile("SMI","SDF","test.smiles","test.sdf") @ \noindent To see the whole list of file formats supported by OpenBabel, one can run from the command-line ``obabel -L formats''. \section{Splitting SD Files} The following \Rfunction{write.SDFsplit} function allows to split SD Files into any number of smaller SD Files. This can become important when working with very big SD Files. Users should note that this function can output many files, thus one should run it in a dedicated directory! \\ \noindent Create sample SD File with 100 molecules: <>= write.SDF(sdfset, "test.sdf") @ \noindent Read in sample SD File. Note: reading file into SDFstr is much faster than into SDFset: <>= sdfstr <- read.SDFstr("test.sdf") @ \noindent Run export on \Rclass{SDFstr} object: <>= write.SDFsplit(x=sdfstr, filetag="myfile", nmol=10) # 'nmol' defines the number of molecules to write to each file @ \noindent Run export on \Rclass{SDFset} object: <>= write.SDFsplit(x=sdfset, filetag="myfile", nmol=10) @ \section{Streaming Through Large SD Files} The \Rfunction{sdfStream} function allows to stream through SD Files with millions of molecules without consuming much memory. During this process any set of descriptors, supported by \Rpackage{ChemmineR}, can be computed (\textit{e.g.} atom pairs, molecular properties, etc.), as long as they can be returned in tabular format. In addition to descriptor values, the function returns a line index that gives the start and end positions of each molecule in the source SD File. This line index can be used by the downstream \Rfunction{read.SDFindex} function to retrieve specific molecules of interest from the source SD File without reading the entire file into R. The following outlines the typical workflow of this streaming functionality in \Rpackage{ChemmineR}.\\ \noindent Create sample SD File with 100 molecules: <>= write.SDF(sdfset, "test.sdf") @ \noindent Define descriptor set in a simple function: <>= desc <- function(sdfset) { cbind(SDFID=sdfid(sdfset), # datablock2ma(datablocklist=datablock(sdfset)), MW=MW(sdfset), groups(sdfset), APFP=desc2fp(x=sdf2ap(sdfset), descnames=1024, type="character"), AP=sdf2ap(sdfset, type="character"), rings(sdfset, type="count", upper=6, arom=TRUE) ) } @ \noindent Run \Rfunction{sdfStream} with \Rfunction{desc} function and write results to a file called \Robject{matrix.xls}: <>= sdfStream(input="test.sdf", output="matrix.xls", fct=desc, Nlines=1000) # 'Nlines': number of lines to read from input SD File at a time @ \noindent One can also start reading from a specific line number in the SD file. The following example starts at line number 950. This is useful for restarting and debugging the process. With \Rfunarg{append=TRUE} the result can be appended to an existing file. <>= sdfStream(input="test.sdf", output="matrix2.xls", append=FALSE, fct=desc, Nlines=1000, startline=950) @ \noindent Select molecules meeting certain property criteria from SD File using line index generated by previous \Rfunction{sdfStream} step: <>= indexDF <- read.delim("matrix.xls", row.names=1)[,1:4] indexDFsub <- indexDF[indexDF$MW < 400, ] # Selects molecules with MW < 400 sdfset <- read.SDFindex(file="test.sdf", index=indexDFsub, type="SDFset") # Collects results in 'SDFset' container @ \noindent Write results directly to SD file without storing larger numbers of molecules in memory: <>= read.SDFindex(file="test.sdf", index=indexDFsub, type="file", outfile="sub.sdf") @ \noindent Read AP/APFP strings from file into \Robject{APset} or \Robject{FP} object: <>= apset <- read.AP(x="matrix.xls", type="ap", colid="AP") apfp <- read.AP(x="matrix.xls", type="fp", colid="APFP") @ \noindent Alternatively, one can provide the AP/APFP strings in a named character vector: <>= apset <- read.AP(x=sdf2ap(sdfset[1:20], type="character"), type="ap") fpchar <- desc2fp(sdf2ap(sdfset[1:20]), descnames=1024, type="character") fpset <- as(fpchar, "FPset") @ \section{Storing Compounds in an SQL Database} As an alternative to sdfStream, there is now also an option to store data in an SQL database, which then allows for fast queries and compound retrieval. This is still an experimental feature. The default database is SQLite, but any other SQL database should work with some minor modifications to the table definitions, which are stored in schema/compounds.SQLite under the ChemmineR package directory. Compounds are stored in their entirety in the databases so there is no need to keep any original data files. Users can define their own set of compound features to compute and store when loading new compounds. Each of these features will be stored in its own, indexed table. Searches can then be performed using these features to quickly find specific compounds. Compounds can always be retrieved quickly because of the database index, no need to scan a large compound file. In addition to user defined features, descriptors can also be computed and stored for each compound. A new database can be created with the \Rfunction{initDb} function. This takes either an existing database connection, or a filename. If a filename is given then an SQLite database connection is created. It then ensures that the required tables exist and creates them if not. The connection object is then returned. This function can be called safely on the same connection or database many times and will not delete any data. \subsection{Loading Data} The function \Rfunction{loadSdf} can be used to load \Robject{SDF} data, either from a file or \Rclass{SDFset} object. The \Rfunarg{fct} parameter should be a function to extract features from the data. It will be handed an \Rclass{SDFset} generated from the data being loaded. This may be done in batches, so there is no guarantee that the given SDFSset will contain the whole dataset. This function should return a data frame with a column for each feature and a row for each compound given. The order of the final data frame should be the same as that of the \Rclass{SDFset}. The column names will become the feature names. Each of these features will become a new, indexed, table in the database which can be used later to search for compounds. The \Rfunarg{descriptors} parameter can be a function which computes descriptors. This function will also be given an \Rclass{SDFset} object, which may be done in batches. It should return a data frame with the following two columns: ``descriptor'' and ``descriptor\_type''. The ``descriptor'' column should contain a string representation of the descriptor, and ``descriptor\_type'' is the type of the descriptor. Our convention for atom pair is ``ap'' and ``fp'' for finger print. The order should also be maintained. When the data has been loaded, \Rfunction{loadSdf} will return the compound id numbers of each compound loaded. These compound id numbers are computed by the database and are not extracted from the compound data itself. They can be used to quickly retrieve compounds later. New features can also be added using this function. However, all compounds must have all features so if new features are added to a new set of compounds, all existing features must be computable by the \Rfunarg{fct} function given. If new features are detected, all existing compounds will be run through \Rfunarg{fct} in order to compute the new features for them as well. For example, if dataset X is loaded with features F1 and F2, and then at a later time we load dataset Y with new feature F3, the \Rfunarg{fct} function used to load dataset Y must compute and return features F1, F2, and F3. \Rfunction{loadSdf} will call \Rfunarg{fct} with both datasets X and Y so that all features are available for all compounds. If any features are missing an error will be raised. If just new features are being added, but no new compounds, use the \Rfunction{addNewFeatures} function. In this example, we create a new database called ``test.db'' and load it with data from and \Rclass{SDFset}. We also define \Rfunarg{fct} to compute the molecular weight, ``MW'', and the number of rings and aromatic rings. The rings function actually returns a data frame with columns ``RINGS'' and ``AROMATIC'', which will be merged into the data frame being created which will also contain the ``MW'' column. These will be the names used for these features and must be used when searching with them. Finally, the new compound ids are returned and stored in the ``ids'' variable. <>= data(sdfsample) #create and initialize a new SQLite database conn <- initDb("test.db") # load data and compute 3 features: molecular weight, with the MW function, # and counts for RINGS and AROMATIC, as computed by rings, which # returns a data frame itself. ids<-loadSdf(conn,sdfsample, function(sdfset) data.frame(MW = MW(sdfset), rings(sdfset,type="count",upper=6, arom=TRUE)) ) @ \subsection{Searching} Compounds can be searched for using the \Rfunction{findCompounds} function. This function takes a connection object, a vector of feature names used in the tests, and finally, a vector of tests that must all pass for a compound to be included in the result set. Each test should be a boolean expression. For example: ``c("MW <= 400","RINGS > 3")'' would return all compounds with a molecular weight of 400 or less and more than 3 rings, assuming these features exist in the database. The syntax for each test is `` ''. If you know SQL you can go beyond this basic syntax. These tests will simply be concatenated together with ``AND'' in-between them and tacked on the end of a WHERE clause of an SQL statement. So any SQL that will work in that context is fine. The function will return a list of compound ids, the actual compounds can be fetched with \Rfunction{getCompounds}. If just the names are needed, the \Rfunction{getCompoundNames} function can be used. Compounds can also be fetched by name using the \Rfunction{findCompoundsByName} function. In this example we search for compounds with molecular weight less than 300. We then fetch the matching compounds and show their molecular weight. <>= lightIds <- findCompounds(conn,"MW",c("MW < 300")) MW(getCompounds(conn,lightIds)) #names of matching compounds: getCompoundNames(conn,lightIds) @ \section{Working with SDF/SDFset Classes} \noindent Several methods are available to return the different data components of \Rclass{SDF/SDFset} containers in batches. The following examples list the most important ones. To save space their content is not printed in the manual. <>= view(sdfset[1:4]) # Summary view of several molecules length(sdfset) # Returns number of molecules sdfset[[1]] # Returns single molecule from SDFset as SDF object sdfset[[1]][[2]] # Returns atom block from first compound as matrix sdfset[[1]][[2]][1:4,] c(sdfset[1:4], sdfset[5:8]) # Concatenation of several SDFsets @ \noindent The \Rfunction{grepSDFset} function allows string matching/searching on the different data components in \Rclass{SDFset}. By default the function returns a SDF summary of the matching entries. Alternatively, an index of the matches can be returned with the setting \emph{mode="index"}. <>= grepSDFset("650001", sdfset, field="datablock", mode="subset") # To return index, set mode="index") . @ \noindent Utilities to maintain unique compound IDs: <>= sdfid(sdfset[1:4]) # Retrieves CMP IDs from Molecule Name field in header block. cid(sdfset[1:4]) # Retrieves CMP IDs from ID slot in SDFset. unique_ids <- makeUnique(sdfid(sdfset)) # Creates unique IDs by appending a counter to duplicates. cid(sdfset) <- unique_ids # Assigns uniquified IDs to ID slot @ \noindent Subsetting by character, index and logical vectors: <>= view(sdfset[c("650001", "650012")]) view(sdfset[4:1]) mylog <- cid(sdfset) %in% c("650001", "650012") view(sdfset[mylog]) @ \noindent Accessing \Rclass{SDF/SDFset} components: header, atom, bond and data blocks: <>= atomblock(sdf); sdf[[2]]; sdf[["atomblock"]] # All three methods return the same component header(sdfset[1:4]) atomblock(sdfset[1:4]) bondblock(sdfset[1:4]) datablock(sdfset[1:4]) header(sdfset[[1]]) atomblock(sdfset[[1]]) bondblock(sdfset[[1]]) datablock(sdfset[[1]]) @ \noindent Replacement Methods: <>= sdfset[[1]][[2]][1,1] <- 999 atomblock(sdfset)[1] <- atomblock(sdfset)[2] datablock(sdfset)[1] <- datablock(sdfset)[2] @ \noindent Assign matrix data to data block: <>= datablock(sdfset) <- as.matrix(iris[1:100,]) view(sdfset[1:4]) @ \noindent Class coercions from \Rclass{SDFstr} to \Rclass{list}, \Rclass{SDF} and \Rclass{SDFset}: <>= as(sdfstr[1:2], "list") as(sdfstr[[1]], "SDF") as(sdfstr[1:2], "SDFset") @ \noindent Class coercions from \Rclass{SDF} to \Rclass{SDFstr}, \Rclass{SDFset}, list with SDF sub-components: <>= sdfcomplist <- as(sdf, "list") sdfcomplist <- as(sdfset[1:4], "list"); as(sdfcomplist[[1]], "SDF") sdflist <- as(sdfset[1:4], "SDF"); as(sdflist, "SDFset") as(sdfset[[1]], "SDFstr") as(sdfset[[1]], "SDFset") @ \noindent Class coercions from \Rclass{SDFset} to lists with components consisting of SDF or sub-components: <>= as(sdfset[1:4], "SDF") as(sdfset[1:4], "list") as(sdfset[1:4], "SDFstr") @ \section{Molecular Property Functions (Physicochemical Descriptors)} \noindent Several methods and functions are available to compute basic compound descriptors, such as molecular formula (MF), molecular weight (MW), and frequencies of atoms and functional groups. In many of these functions, it is important to set \Rfunarg{addH=TRUE} in order to include/add hydrogens that are often not specified in an SD file. <>= propma <- atomcountMA(sdfset, addH=FALSE) boxplot(propma, col="blue", main="Atom Frequency") @ <>= boxplot(rowSums(propma), main="All Atom Frequency") @ \noindent Data frame provided by library containing atom names, atom symbols, standard atomic weights, group and period numbers: <>= data(atomprop) atomprop[1:4,] @ \noindent Compute MW and formula: <>= MW(sdfset[1:4], addH=FALSE) MF(sdfset[1:4], addH=FALSE) @ \noindent Enumerate functional groups: <>= groups(sdfset[1:4], groups="fctgroup", type="countMA") @ \noindent Combine MW, MF, charges, atom counts, functional group counts and ring counts in one data frame: <>= propma <- data.frame(MF=MF(sdfset, addH=FALSE), MW=MW(sdfset, addH=FALSE), Ncharges=sapply(bonds(sdfset, type="charge"), length), atomcountMA(sdfset, addH=FALSE), groups(sdfset, type="countMA"), rings(sdfset, upper=6, type="count", arom=TRUE)) propma[1:4,] @ \noindent The following shows an example for assigning the values stored in a matrix (\textit{e.g.} property descriptors) to the data block components in an \Rclass{SDFset}. Each matrix row will be assigned to the corresponding slot position in the \Rclass{SDFset}. <>= datablock(sdfset) <- propma # Works with all SDF components datablock(sdfset)[1:4] test <- apply(propma[1:4,], 1, function(x) data.frame(col=colnames(propma), value=x)) sdf.visualize(sdfset[1:4], extra = test) @ \noindent The data blocks in SDFs contain often important annotation information about compounds. The \Rfunction{datablock2ma} function returns this information as matrix for all compounds stored in an \Rclass{SDFset} container. The \Rfunction{splitNumChar} function can then be used to organize all numeric columns in a \Rclass{numeric matrix} and the character columns in a \Rclass{character matrix} as components of a \Rclass{list} object. <>= datablocktag(sdfset, tag="PUBCHEM_NIST_INCHI") datablocktag(sdfset, tag="PUBCHEM_OPENEYE_CAN_SMILES") @ \noindent Convert entire data block to matrix: <>= blockmatrix <- datablock2ma(datablocklist=datablock(sdfset)) # Converts data block to matrix numchar <- splitNumChar(blockmatrix=blockmatrix) # Splits matrix to numeric matrix and character matrix numchar[[1]][1:4,]; numchar[[2]][1:4,] # Splits matrix to numeric matrix and character matrix . @ \section{Bond Matrices} \noindent Bond matrices provide an efficient data structure for many basic computations on small molecules. The function \Rfunction{conMA} creates this data structure from \Rclass{SDF} and \Rclass{SDFset} objects. The resulting bond matrix contains the atom labels in the row/column titles and the bond types in the data part. The labels are defined as follows: 0 is no connection, 1 is a single bond, 2 is a double bond and 3 is a triple bond. <>= conMA(sdfset[1:2], exclude=c("H")) # Create bond matrix for first two molecules in sdfset conMA(sdfset[[1]], exclude=c("H")) # Return bond matrix for first molecule plot(sdfset[1], atomnum = TRUE, noHbonds=FALSE , no_print_atoms = "", atomcex=0.8) # Plot its structure with atom numbering rowSums(conMA(sdfset[[1]], exclude=c("H"))) # Return number of non-H bonds for each atom . @ \section{Charges and Missing Hydrogens} \noindent The function \Rfunction{bonds} returns information about the number of bonds, charges and missing hydrogens in \Rclass{SDF} and \Rclass{SDFset} objects. It is used by many other functions (\textit{e.g.} \Rfunction{MW}, \Rfunction{MF}, \Rfunction{atomcount}, \Rfunction{atomcuntMA} and \Rfunction{plot}) to correct for missing hydrogens that are often not specified in SD files. <>= bonds(sdfset[[1]], type="bonds")[1:4,] bonds(sdfset[1:2], type="charge") bonds(sdfset[1:2], type="addNH") @ \section{Ring Perception and Aromaticity Assignment} \noindent The function \Rfunction{rings} identifies all possible rings in one or many molecules (here \Rclass{sdfset[1]}) using the exhaustive ring perception algorithm from \cite{Hanser1996a}. In addition, the function can return all smallest possible rings as well as aromaticity information. \vspace{0.3cm} \noindent The following example returns all possible rings in a \Rclass{list}. The argument \Rfunarg{upper} allows to specify an upper length limit for rings. Choosing smaller length limits will reduce the search space resulting in shortened compute times. Note: each ring is represented by a character vector of atom symbols that are numbered by their position in the atom block of the corresponding \Rclass{SDF/SDFset} object. <>= (ringatoms <- rings(sdfset[1], upper=Inf, type="all", arom=FALSE, inner=FALSE)) @ \noindent For visual inspection, the corresponding compound structure can be plotted with the ring bonds highlighted in color: <>= atomindex <- as.numeric(gsub(".*_", "", unique(unlist(ringatoms)))) plot(sdfset[1], print=FALSE, colbonds=atomindex) @ \noindent Alternatively, one can include the atom numbers in the plot: <>= plot(sdfset[1], print=FALSE, atomnum=TRUE, no_print_atoms="H") @ \noindent Aromaticity information of the rings can be returned in a logical vector by setting \Rfunarg{arom=TRUE}: <>= rings(sdfset[1], upper=Inf, type="all", arom=TRUE, inner=FALSE) @ \noindent Return rings with no more than 6 atoms that are also aromatic: <>= rings(sdfset[1], upper=6, type="arom", arom=TRUE, inner=FALSE) @ \noindent Count shortest possible rings and their aromaticity assignments by setting \Rfunarg{type=count} and \Rfunarg{inner=TRUE}. The inner (smallest possible) rings are identified by first computing all possible rings and then selecting only the inner rings. For more details, consult the help documentation with \Rfunction{?rings}. <>= rings(sdfset[1:4], upper=Inf, type="count", arom=TRUE, inner=TRUE) @ \section{Rendering Chemical Structure Images} \subsection{R Graphics Device} A new plotting function for compound structures has been added to the package recently. This function uses the native R graphics device for generating compound depictions. At this point this function is still in an experimental developmental stage but should become stable soon. \\ \noindent Plot compound Structures with R's graphics device: <>= data(sdfsample) sdfset <- sdfsample plot(sdfset[1:4], print=FALSE) # 'print=TRUE' returns SDF summaries @ \noindent Customized plots: <>= plot(sdfset[1:4], griddim=c(2,2), print_cid=letters[1:4], print=FALSE, noHbonds=FALSE) @ \noindent In the following plot, the atom block position numbers in the SDF are printed next to the atom symbols (\Rfunarg{atomnum = TRUE}). For more details, consult help documentation with \Rfunction{?plotStruc} or \Rfunction{?plot}. <>= plot(sdfset["CMP1"], atomnum = TRUE, noHbonds=F , no_print_atoms = "", atomcex=0.8, sub=paste("MW:", MW(sdfsample["CMP1"])), print=FALSE) @ \noindent Substructure highlighting by atom numbers: <>= plot(sdfset[1], print=FALSE, colbonds=c(22,26,25,3,28,27,2,23,21,18,8,19,20,24)) @ \subsection{Online with ChemMine Tools} Alternatively, one can visualize compound structures with a standard web browser using the online ChemMine Tools service. The service allows to display other information next to the structures using the extra argument of the \Rfunction{sdf.visualize} function. The following examples demonstrate, how one can plot and annotate structures by passing on extra data as vector of character strings, matrices or lists. \\ \noindent Plot structures using web service ChemMine Tools: <>= sdf.visualize(sdfset[1:4]) @ \begin{figure}[bthp] \centering \includegraphics[width=\textwidth]{visualizescreenshot.png} \caption{Visualization webpage created by calling \Rfunction{sdf.visualize}.} \label{fig:visualize} \end{figure} \noindent Add extra annotation as \Rclass{vector}: <>= sdf.visualize(sdfset[1:4], extra=month.name[1:4]) @ \noindent Add extra annotation as \Rclass{matrix}: <>= extra <- apply(propma[1:4,], 1, function(x) data.frame(Property=colnames(propma), Value=x)) sdf.visualize(sdfset[1:4], extra=extra) @ \noindent Add extra annotation as \Rclass{list}: <>= sdf.visualize(sdfset[1:4], extra=bondblock(sdfset[1:4])) @ \section{Similarity Comparisons and Searching} \subsection{Maximum Common Substructure (MCS) Searching} The ChemmineR add-on package \href{http://www.bioconductor.org/packages/devel/bioc/html/fmcsR.html}{{\textcolor{blue}{fmcsR}}} provides support for identifying maximum common substructures (MCSs) and flexible MCSs among compounds. The algorithm can be used for pairwise compound comparisons, structure similarity searching and clustering. The manual describing this functionality is available \href{http://www.bioconductor.org/packages/devel/bioc/vignettes/fmcsR/inst/doc/fmcsR.pdf}{{\textcolor{blue}{here}}} and the associated publication is available in \cite{Wang2013a}. The following gives a short preview of some functionalities provided by the \Rpackage{fmcsR} package. \\ <>= library(fmcsR) data(fmcstest) # Loads test sdfset object test <- fmcs(fmcstest[1], fmcstest[2], au=2, bu=1) # Searches for MCS with mismatches plotMCS(test) # Plots both query compounds with MCS in color @ \subsection{AP/APset Classes for Storing Atom Pair Descriptors} The function \Rfunction{sdf2ap} computes atom pair descriptors for one or many compounds \citep{carhart1985apm, chen2002psm}. It returns a searchable atom pair database stored in a container of class \Rclass{APset}, which can be used for structural similarity searching and clustering. As similarity measure, the Tanimoto coefficient or related coefficients can be used. An \Rclass{APset} object consists of one or many \Rclass{AP} entries each storing the atom pairs of a single compound. Note: the deprecated \Rfunction{cmp.parse} function is still available which also generates atom pair descriptor databases, but directly from an SD file. Since the latter function is less flexible it may be discontinued in the future. \\ \noindent Generate atom pair descriptor database for searching: <>= ap <- sdf2ap(sdfset[[1]]) # For single compound ap <>= apset <- sdf2ap(sdfset) # For many compounds. <>= view(apset[1:4]) @ \noindent Return main components of APset objects: <>= cid(apset[1:4]) # Compound IDs ap(apset[1:4]) # Atom pair descriptors db.explain(apset[1]) # Return atom pairs in human readable format @ \noindent Coerce APset to other objects: <>= apset2descdb(apset) # Returns old list-style AP database tmp <- as(apset, "list") # Returns list as(tmp, "APset") # Converts list back to APset @ \subsection{Large SDF and Atom Pair Databases} When working with large data sets it is often desirable to save the \Rclass{SDFset} and \Rclass{APset} containers as binary R objects to files for later use. This way they can be loaded very quickly into a new R session without recreating them every time from scratch. \\ \noindent Save and load of \Rclass{SDFset} and \Rclass{APset} containers: <>= save(sdfset, file = "sdfset.rda", compress = TRUE) load("sdfset.rda") save(apset, file = "apset.rda", compress = TRUE) load("apset.rda") @ \subsection{Pairwise Compound Comparisons with Atom Pairs} \noindent The \Rfunction{cmp.similarity} function computes the atom pair similarity between two compounds using the Tanimoto coefficient as similarity measure. The coefficient is defined as $c/(a+b+c)$, which is the proportion of the atom pairs shared among two compounds divided by their union. The variable $c$ is the number of atom pairs common in both compounds, while $a$ and $b$ are the numbers of their unique atom pairs. <>= cmp.similarity(apset[1], apset[2]) cmp.similarity(apset[1], apset[1]) @ \subsection{Similarity Searching with Atom Pairs} The \Rfunction{cmp.search} function searches an atom pair database for compounds that are similar to a query compound. The following example returns a data frame where the rows are sorted by the Tanimoto similarity score (best to worst). The first column contains the indices of the matching compounds in the database. The argument cutoff can be a similarity cutoff, meaning only compounds with a similarity value larger than this cutoff will be returned; or it can be an integer value restricting how many compounds will be returned. When supplying a cutoff of 0, the function will return the similarity values for every compound in the database. <>= cmp.search(apset, apset["650065"], type=3, cutoff = 0.3, quiet=TRUE) @ Alternatively, the function can return the matches in form of an index or a named vector if the \Rfunarg{type} argument is set to \Rfunarg{1} or \Rfunarg{2}, respectively. <>= cmp.search(apset, apset["650065"], type=1, cutoff = 0.3, quiet=TRUE) cmp.search(apset, apset["650065"], type=2, cutoff = 0.3, quiet=TRUE) @ \subsection{FP/FPset Classes for Storing Fingerprints} The \Rclass{FPset} class stores fingerprints of small molecules in a matrix-like representation where every molecule is encoded as a fingerprint of the same type and length. The \Rclass{FPset} container acts as a searchable database that contains the fingerprints of many molecules. The \Rclass{FP} container holds only one fingerprint. Several constructor and coerce methods are provided to populate \Rclass{FP/FPset} containers with fingerprints, while supporting any type and length of fingerprints. For instance, the function \Rfunction{desc2fp} generates fingerprints from an atom pair database stored in an \Rclass{APset}, and \Rfunction{as(matrix, "FPset")} and \Rfunction{as(character, "FPset")} construct an \Rclass{FPset} database from objects where the fingerprints are represented as \Rclass{matrix} or \Rclass{character} objects, respectively. \\ \noindent Show slots of \Rclass{FPset} class: <>= showClass("FPset") @ \noindent Instance of \Rclass{FPset} class: <>= data(apset) (fpset <- desc2fp(apset)) view(fpset[1:2]) @ \noindent \Rclass{FPset} class usage: <>= fpset[1:4] # behaves like a list fpset[[1]] # returns FP object length(fpset) # number of compounds cid(fpset) # returns compound ids fpset[10] <- 0 # replacement of 10th fingerprint to all zeros cid(fpset) <- 1:length(fpset) # replaces compound ids c(fpset[1:4], fpset[11:14]) # concatenation of several FPset objects @ \noindent Construct \Rclass{FPset} class form \Rclass{matrix}: <>= fpma <- as.matrix(fpset) # coerces FPset to matrix as(fpma, "FPset") @ \noindent Construct \Rclass{FPset} class form \Rclass{character vector}: <>= fpchar <- as.character(fpset) # coerces FPset to character strings as(fpchar, "FPset") # construction of FPset class from character vector @ \noindent Compound similarity searching with \Rclass{FPset}: <>= fpSim(fpset[1], fpset, method="Tanimoto", cutoff=0.4, top=4) @ \subsection{Atom Pair Fingerprints} Atom pairs can be converted into binary atom pair fingerprints of fixed length. Computations on this compact data structure are more time and memory efficient than on their relatively complex atom pair counterparts. The function \Rfunction{desc2fp} generates fingerprints from descriptor vectors of variable length such as atom pairs stored in \Rclass{APset} or \Rclass{list} containers. The obtained fingerprints can be used for structure similarity comparisons, searching and clustering. \\ \noindent Create atom pair sample data set: <>= data(sdfsample) sdfset <- sdfsample[1:10] apset <- sdf2ap(sdfset) @ \noindent Compute atom pair fingerprint database using internal atom pair selection containing the 4096 most common atom pairs identified in DrugBank's compound collection. For details see \Rclass{?apfp}. The following example uses from this set the 1024 most frequent atom pairs: <>= fpset <- desc2fp(apset, descnames=1024, type="FPset") @ \noindent Alternatively, one can provide any custom atom pair selection. Here, the 1024 most common ones in \Robject{apset}: <>= fpset1024 <- names(rev(sort(table(unlist(as(apset, "list")))))[1:1024]) fpset <- desc2fp(apset, descnames=fpset1024, type="FPset") @ \noindent A more compact way of storing fingerprints is as character values: <>= fpchar <- desc2fp(x=apset, descnames=1024, type="character") fpchar <- as.character(fpset) @ \noindent Converting a fingerprint database to a matrix and vice versa: <>= fpma <- as.matrix(fpset) fpset <- as(fpma, "FPset") @ \noindent Similarity searching and returning Tanimoto similarity coefficients: <>= fpSim(fpset[1], fpset, method="Tanimoto") @ \noindent Under \Rfunarg{method} one can choose from several predefined similarity measures including \Rfunarg{Tanimoto} (default), \Rfunarg{Euclidean}, \Rfunarg{Tversky} or \Rfunarg{Dice}. Alternatively, one can pass on custom similarity functions. <>= fpSim(fpset[1], fpset, method="Tversky", cutoff=0.4, top=4, alpha=0.5, beta=1) @ \noindent Example for using a custom similarity function: <>= myfct <- function(a, b, c, d) c/(a+b+c+d) fpSim(fpset[1], fpset, method=myfct) @ \noindent Clustering example: <>= simMAap <- sapply(cid(apfpset), function(x) fpSim(x=apfpset[x], apfpset, sorted=FALSE)) hc <- hclust(as.dist(1-simMAap), method="single") plot(as.dendrogram(hc), edgePar=list(col=4, lwd=2), horiz=TRUE) @ \subsection{Pairwise Compound Comparisons with PubChem Fingerprints} \noindent The \Rfunction{fpSim} function computes the similarity coefficients (\textit{e.g.} Tanimoto) for pairwise comparisons of binary fingerprints. For this data type, $c$ is the number of "on-bits" common in both compounds, and $a$ and $b$ are the numbers of their unique "on-bits". Currently, the PubChem fingerprints need to be provided (here PubChem's SD files) and cannot be computed from scratch in \Rpackage{ChemmineR}. The PubChem fingerprint specifications can be loaded with \Rfunction{data(pubchemFPencoding)}.\\ \noindent Convert base 64 encoded PubChem fingerprints to \Rclass{character} vector, \Rclass{matrix} or \Rclass{FPset} object: <>= cid(sdfset) <- sdfid(sdfset) fpset <- fp2bit(sdfset, type=1) fpset <- fp2bit(sdfset, type=2) fpset <- fp2bit(sdfset, type=3) fpset @ \noindent Pairwise compound structure comparisons: <>= fpSim(fpset[1], fpset[2]) @ \subsection{Similarity Searching with PubChem Fingerprints} Similarly, the \Rfunction{fpSim} function provides search functionality for PubChem fingerprints: <>= fpSim(fpset["650065"], fpset, method="Tanimoto", cutoff=0.6, top=6) @ \subsection{Visualize Similarity Search Results} The \Rfunction{cmp.search} function allows to visualize the chemical structures for the search results. Similar but more flexible chemical structure rendering functions are \Rfunction{plot} and \Rfunction{sdf.visualize} described above. By setting the visualize argument in \Rfunction{cmp.search} to \Rfunarg{TRUE}, the matching compounds and their scores can be visualized with a standard web browser. Depending on the \Rfunarg{visualize.browse} argument, an URL will be printed or a webpage will be opened showing the structures of the matching compounds along with their scores.\\ \noindent View similarity search results in R's graphics device: <>= cid(sdfset) <- cid(apset) # Assure compound name consistency among objects. plot(sdfset[names(cmp.search(apset, apset["650065"], type=2, cutoff=4, quiet=TRUE))], print=FALSE) @ \noindent View results online with Chemmine Tools: <>= similarities <- cmp.search(apset, apset[1], type=3, cutoff = 10) sdf.visualize(sdfset[similarities[,1]], extra=similarities[,3]) @ \section{Clustering} \subsection{Clustering Identical or Very Similar Compounds} Often it is of interest to identify very similar or identical compounds in a compound set. The \Rfunction{cmp.duplicated} function can be used to quickly identify very similar compounds in atom pair sets, which will be frequently, but not necessarily, identical compounds.\\ \noindent Identify compounds with identical AP sets: <>= cmp.duplicated(apset, type=1)[1:4] # Returns AP duplicates as logical vector cmp.duplicated(apset, type=2)[1:4,] # Returns AP duplicates as data frame @ \noindent Plot the structure of two pairs of duplicates: <>= plot(sdfset[c("650059","650060", "650065", "650066")], print=FALSE) @ \noindent Remove AP duplicates from SDFset and APset objects: <>= apdups <- cmp.duplicated(apset, type=1) sdfset[which(!apdups)]; apset[which(!apdups)] @ \noindent Alternatively, one can identify duplicates via other descriptor types if they are provided in the data block of an imported SD file. For instance, one can use here fingerprints, InChI, SMILES or other molecular representations. The following examples show how to enumerate by identical InChI strings, SMILES strings and molecular formula, respectively. <>= count <- table(datablocktag(sdfset, tag="PUBCHEM_NIST_INCHI")) count <- table(datablocktag(sdfset, tag="PUBCHEM_OPENEYE_CAN_SMILES")) count <- table(datablocktag(sdfset, tag="PUBCHEM_MOLECULAR_FORMULA")) count[1:4] @ \subsection{Binning Clustering} Compound libraries can be clustered into discrete similarity groups with the binning clustering function \Rfunction{cmp.cluster}. The function accepts as input an atom pair (\Robject{APset}) or a fingerprint (\Robject{FPset}) descriptor database as well as a similarity threshold. The binning clustering result is returned in form of a data frame. Single linkage is used for cluster joining. The function calculates the required compound-to-compound distance information on the fly, while a memory-intensive distance matrix is only created upon user request via the \Rfunarg{save.distances} argument (see below). Because an optimum similarity threshold is often not known, the \Rfunction{cmp.cluster} function can calculate cluster results for multiple cutoffs in one step with almost the same speed as for a single cutoff. This can be achieved by providing several cutoffs under the cutoff argument. The clustering results for the different cutoffs will be stored in one data frame. One may force the \Rfunction{cmp.cluster} function to calculate and store the distance matrix by supplying a file name to the \Rfunarg{save.distances} argument. The generated distance matrix can be loaded and passed on to many other clustering methods available in R, such as the hierarchical clustering function \Rfunction{hclust} (see below). If a distance matrix is available, it may also be supplied to \Rfunction{cmp.cluster} via the \Rfunarg{use.distances} argument. This is useful when one has a pre-computed distance matrix either from a previous call to \Rfunction{cmp.cluster} or from other distance calculation subroutines. \\ \noindent Single-linkage binning clustering with one or multiple cutoffs: <>= clusters <- cmp.cluster(db=apset, cutoff = c(0.7, 0.8, 0.9), quiet = TRUE) clusters[1:12,] @ \noindent Clustering of \Robject{FPset} objects with multiple cutoffs. This method allows to call various similarity methods provided by the \Rfunction{fpSim} function. For details consult \Rfunction{?fpSim}. <>= fpset <- desc2fp(apset) clusters2 <- cmp.cluster(fpset, cutoff=c(0.5, 0.7, 0.9), method="Tanimoto", quiet=TRUE) clusters2[1:12,] @ \noindent Sames as above, but using Tversky similarity measure: <>= clusters3 <- cmp.cluster(fpset, cutoff=c(0.5, 0.7, 0.9), method="Tversky", alpha=0.3, beta=0.7, quiet=TRUE) @ \noindent Return cluster size distributions for each cutoff: <>= cluster.sizestat(clusters, cluster.result=1) cluster.sizestat(clusters, cluster.result=2) cluster.sizestat(clusters, cluster.result=3) @ \noindent Enforce calculation of distance matrix: <>= clusters <- cmp.cluster(db=apset, cutoff = c(0.65, 0.5, 0.3), save.distances="distmat.rda") # Saves distance matrix to file "distmat.rda" in current working directory. load("distmat.rda") # Loads distance matrix. @ \subsection{Jarvis-Patrick Clustering} The Jarvis-Patrick clustering algorithm is widely used in cheminformatics \citep{Jarvis1973a}. It requires a nearest neighbor table, which consists of \Rfunarg{j} nearest neighbors for each item (\textit{e.g.} compound). The nearest neighbor table is then used to join items into clusters when they meet the following requirements: (a) they are contained in each other's neighbor list and (b) they share at least \Rfunarg{k} nearest neighbors. The values for \Rfunarg{j} and \Rfunarg{k} are user-defined parameters. The \Rfunction{jarvisPatrick} function implemented in \Rpackage{ChemmineR} takes a nearest neighbor table generated by \Rfunction{nearestNeighbors}, which works for \Rclass{APset} and \Rclass{FPset} objects. This function takes either the standard Jarvis-Patrick \Rfunarg{j} parameter (as the \Rfunarg{numNbrs} parameter), or else a \Rfunarg{cutoff} value, which is an extension to the basic algorithm that we have added. Given a cutoff value, the nearest neighbor table returned contains every neighbor with a similarity greater than the cutoff value, for each item. This allows one to generate tighter clusters and to minimize certain limitations of this method, such as false joins of completely unrelated items when operating on small data sets. The \Rfunction{trimNeighbors} function can also be used to take an existing nearest neighbor table and remove all neighbors whose similarity value is below a given cutoff value. This allows one to compute a very relaxed nearest neighbor table initially, and then quickly try different refinements later. In case an existing nearest neighbor matrix needs to be used, the \Rfunction{fromNNMatrix} function can be used to transform it into the list structure that \Rfunction{jarvisPatrick} requires. The input matrix must have a row for each compound, and each row should be the index values of the neighbors of compound represented by that row. The names of each compound can also be given through the \Rfunarg{names} argument. If not given, it will attempt to use the \Rfunction{rownames} of the given matrix. The \Rfunction{jarvisPatrick} function also allows one to relax some of the requirements of the algorithm through the \Rfunarg{mode} parameter. When set to ``a1a2b'', then all requirements are used. If set to ``a1b'', then (a) is relaxed to a unidirectional requirement. Lastly, if \Rfunarg{mode} is set to ``b'', then only requirement (b) is used, which means that all pairs of items will be checked to see if (b) is satisfied between them. The size of the clusters generated by the different methods increases in this order: ``a1a2b'' < ``a1b'' < ``b''. The run time of method ``a1a2b'' follows a close to linear relationship, while it is nearly quadratic for the much more exhaustive method ``b''. Only methods ``a1a2b'' and ``a1b'' are suitable for clustering very large data sets (e.g. >50,000 items) in a reasonable amount of time. An additional extension to the algorithm is the ability to set the linkage mode. The \Rfunarg{linkage} parameter can be one of ``single'', ``average'', or ``complete'', for single linkage, average linkage and complete linkage merge requirements, respectively. In the context of Jarvis-Patrick, average linkage means that at least half of the pairs between the clusters under consideration must meet requirement (b). Similarly, for complete linkage, all pairs must requirement (b). Single linkage is the normal case for Jarvis-Patrick and just means that at least one pair must meet requirement (b). The output is a cluster \Rclass{vector} with the item labels in the name slot and the cluster IDs in the data slot. There is a utility function called \Rfunction{byCluster}, which takes out cluster vector output by \Rfunction{jarvisPatrick} and transforms it into a list of vectors. Each slot of the list is named with a cluster id and the vector contains the cluster members. By default the function excludes singletons from the output, but they can be included by setting \Rfunarg{excludeSingletons}=FALSE. \noindent Load/create sample \Rclass{APset} and \Rclass{FPset}: <>= data(apset) fpset <- desc2fp(apset) @ \noindent Standard Jarvis-Patrick clustering on \Rclass{APset} and \Rclass{FPset} objects: <>= jarvisPatrick(nearestNeighbors(apset,numNbrs=6), k=5, mode="a1a2b") #Using "APset" jarvisPatrick(nearestNeighbors(fpset,numNbrs=6), k=5, mode="a1a2b") #Using "FPset" @ \noindent The following example runs Jarvis-Patrick clustering with a minimum similarity \Rfunarg{cutoff} value (here Tanimoto coefficient). In addition, it uses the much more exhaustive \Rfunarg{"b"} method that generates larger cluster sizes, but significantly increased the run time. For more details, consult the corresponding help file with \Rfunarg{?jarvisPatrick}. <>= cl<-jarvisPatrick(nearestNeighbors(fpset,cutoff=0.6, method="Tanimoto"), k=2 ,mode="b") byCluster(cl) @ \noindent Output nearest neighbor table (\Rclass{matrix}): <>= nnm <- nearestNeighbors(fpset,numNbrs=6) nnm$names[1:4] nnm$ids[1:4,] nnm$similarities[1:4,] @ \noindent Trim nearest neighbor table: <>= nnm <- trimNeighbors(nnm,cutoff=0.4) nnm$similarities[1:4,] @ \noindent Perform clustering on precomputed nearest neighbor table: <>= jarvisPatrick(nnm, k=5,mode="b") @ \noindent Using a user defined nearest neighbor matrix: <>= nn <- matrix(c(1,2,2,1),2,2,dimnames=list(c('one','two'))) nn byCluster(jarvisPatrick(fromNNMatrix(nn),k=1)) @ \subsection{Multi-Dimensional Scaling (MDS)} To visualize and compare clustering results, the \Rfunction{cluster.visualize} function can be used. The function performs Multi-Dimensional Scaling (MDS) and visualizes the results in form of a scatter plot. It requires as input an \Rclass{APset}, a clustering result from \Rfunction{cmp.cluster}, and a cutoff for the minimum cluster size to consider in the plot. To help determining a proper cutoff size, the \Rfunction{cluster.sizestat} function is provided to generate cluster size statistics. \noindent MDS clustering and scatter plot: <>= cluster.visualize(apset, clusters, size.cutoff=2, quiet = TRUE) # Color codes clusters with at least two members. cluster.visualize(apset, clusters, quiet = TRUE) # Plots all items. @ \noindent Create a 3D scatter plot of MDS result: <>= library(scatterplot3d) coord <- cluster.visualize(apset, clusters, size.cutoff=1, dimensions=3, quiet=TRUE) scatterplot3d(coord) @ \noindent Interactive 3D scatter plot with Open GL (graphics not evaluated here): <>= library(rgl) rgl.open(); offset <- 50; par3d(windowRect=c(offset, offset, 640+offset, 640+offset)) rm(offset); rgl.clear(); rgl.viewpoint(theta=45, phi=30, fov=60, zoom=1) spheres3d(coord[,1], coord[,2], coord[,3], radius=0.03, color=coord[,4], alpha=1, shininess=20); aspect3d(1, 1, 1) axes3d(col='black'); title3d("", "", "", "", "", col='black'); bg3d("white") # To save a snapshot of the graph, one can use the command rgl.snapshot("test.png"). @ \subsection{Clustering with Other Algorithms} \Rpackage{ChemmineR} allows the user to take advantage of the wide spectrum of clustering utilities available in R. An example on how to perform hierarchical clustering with the hclust function is given below. \\ \noindent Create atom pair distance matrix: <>= dummy <- cmp.cluster(db=apset, cutoff=0, save.distances="distmat.rda", quiet=TRUE) load("distmat.rda") @ \noindent Hierarchical clustering with \Rfunction{hclust}: <>= hc <- hclust(as.dist(distmat), method="single") hc[["labels"]] <- cid(apset) # Assign correct item labels plot(as.dendrogram(hc), edgePar=list(col=4, lwd=2), horiz=T) @ \noindent Instead of atom pairs one can use PubChem's fingerprints for clustering: <>= simMA <- sapply(cid(fpset), function(x) fpSim(fpset[x], fpset, sorted=FALSE)) hc <- hclust(as.dist(1-simMA), method="single") plot(as.dendrogram(hc), edgePar=list(col=4, lwd=2), horiz=TRUE) @ \noindent Plot dendrogram with heatmap (here similarity matrix): <>= library(gplots) heatmap.2(1-distmat, Rowv=as.dendrogram(hc), Colv=as.dendrogram(hc), col=colorpanel(40, "darkblue", "yellow", "white"), density.info="none", trace="none") @ \section{Searching PubChem} \subsection{Get Compounds from PubChem by Id} The function \Rfunction{getIds} accepts one or more numeric PubChem compound ids and downloads the corresponding compounds from PubChem Power User Gateway (PUG) returning results in an \Rclass{SDFset} container. The ChemMine Tools web service is used as an intermediate, to translate queries from plain HTTP POST to a PUG SOAP query.\\ \noindent Fetch 2 compounds from PubChem: <>= compounds <- getIds(c(111,123)) compounds @ \subsection{Search a SMILES Query in PubChem} The function \Rfunction{searchString} accepts one SMILES string (Simplified Molecular Input Line Entry Specification) and performs a >0.95 similarity PubChem fingerprint search, returning the hits in an \Rclass{SDFset} container. The ChemMine Tools web service is used as an intermediate, to translate queries from plain HTTP POST to a PubChem Power User Gateway (PUG) query.\\ \noindent Search a SMILES string on PubChem: <>= compounds <- searchString("CC(=O)OC1=CC=CC=C1C(=O)O") compounds @ \subsection{Search an SDF Query in PubChem} The function \Rfunction{searchSim} performs a PubChem similarity search just like \Rfunction{searchString}, but accepts a query in an \Rclass{SDFset} container. If the query contains more than one compound, only the first is searched.\\ \noindent Search an \Rclass{SDFset} container on PubChem: <>= data(sdfsample); sdfset <- sdfsample[1] compounds <- searchSim(sdfset) compounds @ \section{Version Information} <>= toLatex(sessionInfo()) @ \section{Funding} This software was developed with funding from the National Science Foundation: \href{http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=0957099}{{\textcolor{blue}{ABI-0957099}}}, 2010-0520325 and IGERT-0504249. %<<>>= %library(ChemmineR) %db <- cmp.parse("http://bioweb.ucr.edu/ChemMineV2/static/example_db.sdf", quiet=TRUE) %@ %The database can be saved as an R-specific binary file for faster loading in the future. %<>= %save(db, file="compounddb.rda", compress=TRUE) %@ %The \Rfunction{load} function loads the database back into R without repeating the time-consuming descriptor generation step. %<>= %load("compounddb.rda") %@ %\section*{Single Compound Import from SDF} % The \Rfunction{cmp.parse1} function will parse an SDF for a single compound. Similarly as before, the only argument required is the path (or URL) to the SDF. %<<>>= %query.url <- "http://bioweb.ucr.edu/ChemMineV2/compound/Aurora/b32:NNQS2MBRHAZTI===/sdf" %query <- cmp.parse1(query.url) %@ % \section*{Descriptor Database Content} % % The descriptors of compounds are stored as numeric vectors in a list object along with available annotation information about the database. You may skip this section if you are not interested in internals of descriptor database. % % % The \Rfunction{cmp.parse1} function parses the SDF of a single compound, generates the descriptors and stores them in a numeric vector. Each entry of the vector is a descriptor for this compound. % % %<<>>= % %class(query) % %@ % % In contrast to this, the \Rfunction{cmp.parse} function generates a list object with four components. % %<<>>= % %names(db) % %@ % % The \Robject{descdb} component is a list. Each entry of the list is a vector % of descriptors of one compound. % %<<>>= % %class(db$descdb) % %class(db$descdb[[1]]) % %@ % % The \Rfunction{db.explain} function returns the descriptors in a human readable format. A single descriptor can be returned like this: % %<<>>= % %db.explain(query[1]) % %db.explain(db$descdb[[1]][1]) % %@ % The same is possible for multiple descriptors at once. % %<<>>= % %db.explain(db$descdb[[1]][1:5]) % %@ % % % \section*{Removing Duplicated Compounds} % % The \Rfunction{cmp.duplicated} % function can be used to quickly identify and remove duplicated compounds in imported compound databases. It takes a descriptor database as the only required argument and returns the duplication information as a logical vector. % % To demo this feature on the imported sample data set, one can create a duplication with the following command. % %<<>>= % %db$descdb <- c(db$descdb, list(db$descdb[[1]])) % %@ % In the next step the duplication is identified with the \Rfunction{cmp.duplicated} function. % %<<>>= % %dup <- cmp.duplicated(db) % %dup % %@ % The \texttt{TRUE} entry in the returned logical vector indicates the duplication. It can be easily removed with the standard R subsetting syntax. % %<<>>= % %db$descdb <- db$descdb[!dup] % %@ % In a real example one also needs to remove the duplications from the other database components. % %<>= % %db$cids <- db$cids[!dup] % %db$sdfsegs <- db$sdfsegs[!dup] % %@ % % \section*{Pairwise Compound Comparisons} % The \Rfunction{cmp.similarity} computes the atom pair similarity between two compounds using the Tanimoto coefficient as similarity measure. % % %<<>>= % %similarity <- cmp.similarity(db$descdb[[1]], db$descdb[[2]]) % %cat(paste("The similarity between compound 1 and 2 is ", similarity, "\n", sep="")) % %similarity <- cmp.similarity(db$descdb[[1]], query) % %cat(paste("The similarity between compound 1 and the query is ", similarity, "\n", sep="")) % %@ % % With the \Rfunction{cmp.similarity} function one can easily design custom search subroutines similar to the one introduced in the next section. % % \section*{Similarity Searching} % The \Rfunction{cmp.search} function searches an atom pair database for compounds that are similar to a query compound. % % %<<>>= % %cmp.search(db, query, cutoff=0.4, return.score=TRUE, quiet=TRUE) % %@ % % % The function returns a data frame where the rows are sorted by similarity score (best to worst). The first column contains the indices of the matching compounds in the database. The argument \Rfunarg{cutoff} can be a similarity cutoff, meaning only compounds with a similarity value larger than this cutoff will be returned; or it can be an integer value restricting how many compounds will be returned. If the argument \Rfunarg{return.score} is set to \texttt{FALSE}, then the function will return a vector of indices rather than a data frame. When supplying a cutoff of 0, the function will return the similarity values for every compound in the database. % % %<>= % %similarities <- cmp.search(db, db[[1]][[1]], cutoff =0, return.score=TRUE, quiet=TRUE) % %@ % % The \Rfunction{cmp.search} function allows to visualize the chemical structure images for the search results. A similar but more flexible chemical structure rendering function (\Rfunction{sdf.visualize}) is described later in this manual. By setting the \Rfunarg{visualize} argument in \Rfunction{cmp.search} to \texttt{TRUE}, the matching compounds and their scores can be visualized with a standard web browser on the online ChemMine interface. Depending on the \Rfunarg{visualize.browse} argument, an URL will be printed or a webpage will be opened showing the structures of the matching compounds along with their scores. % %<>= % %similarities <- cmp.search(db, query, cutoff=10, quiet=TRUE, visualize=TRUE, visualize.browse=FALSE) % %@ % Setting the \texttt{visualize.browse} argument to \texttt{TRUE} will automatically open the webpage in the default browser. % % The query structure can also be displayed on the visualization webpage by supplying the SDF of the query in a character string or providing its file name or URL. For example, % %<>= % %similarities <- cmp.search(db, query, cutoff=10, quiet=TRUE, visualize=TRUE, visualize.browse=TRUE, visualize.query=query.url) % %@ % This will read the SDF provided by \texttt{query.url}, and display it as a ``reference compound'' at the top of the page. Part of the screenshot of the resulting output is shown in Fig. \ref{fig:search}. A live demo is also available and linked from the online version of this manual (\url{http://bioweb.ucr.edu/ChemMineV2/chemminer/tutorial}). % % \begin{figure}[bthp] % \centering % \includegraphics[width=\textwidth]{searchscreenshot.pdf} % \caption{\Rfunction{cmp.search} can automatically upload the structures and scores of matching compounds to ChemMine for visualization.} % \label{fig:search} % \end{figure} % % % Any information uploaded to \textit{ChemMine} by \Rpackage{ChemmineR} is kept private and secure using a highly randomized URL. The visualization pages can be shared with colleagues by providing the corresponding URLs. % % \section*{Rendering Chemical Structure Images} % Internally, the similarity search function uses the \Rfunarg{sdf.visualize} function to send compounds to ChemMine for structure visualization. The same function can be used to send any custom combination of compounds for visualization on ChemMine along with complex annotation and activity information. The function accepts a database and a vector of compound indices. The following example performs first a similarity search to obtain a vector of indices. % % %<<>>= % %indices <- cmp.search(db, query, cutoff=10, quiet=TRUE) % %@ % %<>= % %url <- sdf.visualize(db, indices, browse=TRUE, quiet=TRUE) % %@ % %<>= % %url <- sdf.visualize(db, indices, browse=FALSE, quiet=TRUE) % %@ % %<<>>= % %url % %@ % The URL stored in the \Robject{url} object points to a webpage that shows the structures of the compounds. If the \Rfunarg{browse} argument is set to \texttt{TRUE}, then the default browser will open automatically. % % In addition, one can display other information next to the structures using the \Rfunarg{extra} argument. In the following example, a vector of character strings is assigned to \Rfunarg{extra}, and its entries are displayed next to corresponding chemical structures. % %<<>>= % %extra.info <- paste("Matching compound #", 1:length(indices), sep="") % %names(extra.info) <- rep("Note", length(indices)) % %@ % %<>= % %url <- sdf.visualize(db, indices, browse=TRUE, quiet=TRUE, extra=extra.info) % %@ % %<>= % %url <- sdf.visualize(db, indices, browse=FALSE, quiet=TRUE, extra=extra.info) % %<<>>= % %url % %@ % % The function also allows to list a reference compound at the top of the page. The user supplies the SDF of this reference compound in form of a character string or a file. Annotation information can also be displayed next to the reference structure. % % %<>= % %url <- sdf.visualize(db, indices, browse=TRUE, quiet=TRUE, reference.sdf=query.url, reference.note="Reference Compound from Aurora Library") % %@ % %<>= % %url <- sdf.visualize(db, indices, browse=FALSE, quiet=TRUE, reference.sdf=query.url, reference.note="Reference Compound from Aurora Library") % %@ % %<<>>= % %url % %@ % % It is also possible to display more complex tabular data next to each compound by providing a list of data frames. To demonstrate this utility, the following example creates such a list of data frames via a similarity search. Each data frame is then displayed next to the corresponding compound. The screenshot of the resulting output is shown in Fig. \ref{fig:visualize}. % \begin{figure}[bthp] % \centering % \includegraphics[width=\textwidth]{visualizescreenshot.pdf} % \caption{Visualization webpage created by calling \Rfunction{sdf.visualize}. This page shows the table information properly rendered and displayed next to the compound structures.} % \label{fig:visualize} % \end{figure} % % To generate this output, first a similarity is performed using a cutoff of 0 to obtain the similarity values between the query compound and each of the compounds in the database. % %<<>>= % %search.results <- cmp.search(db, query, cutoff=0, return.score=T, quiet=T) % %@ % % The resulting data frame will be used as annotation table for the query compound. To provide a table name, one has to embed it into a list. If a table name is not required, then there is no need to generate the list object % %<<>>= % %note.q <- list(search.results) % %names(note.q) <- "Similarities With All" % %@ % % For each of the top 10 hits in the search result, we perform the same search to obtain the similarity values between the hit and all compounds in the database. This information will then be displayed next to the structures on the visualization page. % %<<>>= % %indices <- search.results[1:10,1] % %notes <- list() % %for (i in indices) { % % search.results <- cmp.search(db, db$descdb[[i]], 0, return.score=T, quiet=T) % % notes <- c(notes, list(search.results)) % %} % %names(notes) <- rep("Similarities With All", 10) % %@ % The following step displays the complex sample data set on ChemMine. % %<>= % %url <- sdf.visualize(db, indices, extra=notes, browse=T, reference.sdf=query.url, reference.note=note.q) % %@ % % Note: the \Rfunction{sdf.visualize} function depends on the original SDF file from which the descriptor database has been generated. If the SDF file has been moved or altered then this step cannot be used. % % Any information uploaded to \textit{ChemMine} by \Rpackage{ChemmineR} is kept private and secure using a highly randomized URL. The visualization pages can be shared with colleagues by providing the corresponding URLs. % % % \section*{Subsetting SDF Batch Files} % After identifying a subset of interesting compounds, one can generate an SDF for this subset of compounds using the \Rfunction{sdf.subset} function. % % % For example, one can perform a similarity search, and use the top 10 results for subsetting. % %<<>>= % %indices <- cmp.search(db, query, cutoff=10, quiet=TRUE) % %@ % With the corresponding indices one can generate a custom SDF batch data set and store it in an external file. % %<<>>>= % %sdf <- sdf.subset(db, indices) % %cat(sdf, file="matching.sdf") % %@ % % One may also create a sub-database from a descriptor database using the related \Rfunction{db.subset} function. % %<<>>>= % %db_sub <- db.subset(db, indices) % %@ % % Note: the \Rfunction{sdf.visualize} function depends on the original SDF file from which the descriptor database has been generated. % % % \section*{Binning Clustering} % % Compound libraries can be clustered into discrete similarity groups with the binning clustering function \Rfunction{cmp.cluster}. The function requires as input a descriptor database as well as a similarity threshold. The binning clustering result is returned in form of a data frame. Single linkage is used for cluster joining. The function calculates the required compound-to-compound distance information on the fly, while a memory-intensive distance matrix is only created upon user request via the \Rfunarg{save.distances} argument (see below). % % %<<>>= % %clusters <- cmp.cluster(db, cutoff=0.65, quiet=TRUE) % %@ % The previous step clusters the compounds stored in \Rfunarg{db} with a similarity cutoff of 0.65. In other words, if two compounds share a similarity of 0.65 or above, then they will be joined into the same cluster. The first 10 rows of the result data frame are shown here: % % %<<>>= % %clusters[1:10,] % %@ % The first column contains the compound IDs, the second the cluster size and third the cluster ID. The compound in cluster ID 1 can be returned with the following command: % %<<>>= % %clusters[clusters[,3]==1,] % %@ % Similarly as above, one can visualize the chemical structures for a compound cluster of interest with the \Rfunction{sdf.visualize} function. % %<>= % %ids <- clusters[clusters[,3]==23, 1] % %sdf.visualize(db, ids, browse=TRUE, quiet=TRUE) % %@ % % \section*{Binning Clustering with Multiple Cutoffs} % Because an optimum similarity threshold is often not known, the % \Rfunction{cmp.cluster} function can calculate cluster results for multiple cutoffs in one step with almost the same speed as for a single cutoff. The clustering results for the different cutoffs will be stored in one data frame. % %<<>>= % %clusters <- cmp.cluster(db, cutoff=c(0.65, 0.5), quiet=TRUE) % %@ % %The first 10 rows of the generated cluster result data frame are: % %<<>>= % %clusters[1:10,] % %@ % %Cluster 14 obtained by the cutoff 0.65 contains the following compounds: % %<<>>= % %clusters[clusters[, "CLID_0.65"]==14,] % %@ % %and cluster 14 from cutoff 0.5 contains: % %<<>>= % %clusters[clusters[, "CLID_0.5"]==14,] % %@ % % % One may force the \Rfunction{cmp.cluster} function to calculate and store the distance matrix by supplying a file name to the \Rfunarg{save.distances} argument. The generated distance matrix can be loaded and passed on to many other clustering methods available in R, such as the hierarchical clustering function \Rfunction{hclust} (see below). % % If a distance matrix is available, it may also be supplied to \Rfunction{cmp.cluster} via the \Rfunarg{use.distances} argument. This is useful when one has a pre-computed distance matrix either from a previous call to \Rfunction{cmp.cluster} or from other distance calculation subroutines. % % \section*{Multi-Dimensional Scaling (MDS)} % % To visualize and compare clustering results, the \Rfunction{cluster.visualize} function can be used. The function performs Multi-Dimensional Scaling (MDS) and visualizes the results in form of a scatter plot. It requires as input a descriptor database, a clustering result from \Rfunction{cmp.cluster}, and a cutoff for the minimum cluster size to consider in the plot. To help determining a proper cutoff size, the \Rfunction{cluster.sizestat} function is provided to generate cluster size statistics. % % The following example uses the clustering result obtained above using cutoff values 0.65 and 0.5. By default, the \Rfunction{cluster.sizestat} uses the first cutoff value: % % %<<>>= % %cluster.sizestat(clusters) % %@ % % Based on this size statistics, clusters of size 3 or larger will be included in the following MDS visualization step. % % %<>= % %coord <- cluster.visualize(db, clusters, 3, quiet=TRUE) % %@ % % By default \Rfunction{cluster.visualize} will draw the scatter plot in the R plotting device, and the user can interactively click a point to retrieve more information on the corresponding compounds. In the non-interactive mode (\Rfunarg{non.interactive}), it will save the plot to a specified file in EPS or PDF format. % % A 3D MDS plot can be created with the following sequence of commands. % % %<>= % %coord <- cluster.visualize(db, clusters, 3, dimensions=3, quiet=TRUE) % %library(scatterplot3d) % %scatterplot3d(coord) % %@ % %<> % %coord <- cluster.visualize(db, clusters, 3, dimensions=3, quiet=TRUE) % %if(suppressWarnings(require(require(scatterplot3d)))) scatterplot3d(coord) % %@ % % % %The data returned by the \Rfunction{cluster.visualize} can also be inspected with the fully interactive \Rpackage{rggobi} data visulalization system. The GGobi software and its dependencies can be obtained from the GGobi project site (\url{http://www.ggobi.org/rggobi}). The following commands demonstrate the import of the generated MDS data set into \Rpackage{rggobi}. % %<>= % %library(rggobi) % %ggobi(coord) % %@ % % % \section*{Clustering with Other Packages} % % \Rpackage{ChemmineR} allows the user to take advantage of the wide spectrum of clustering utilities available in R. An example on how to perform hierarchical clustering with the \Rfunction{hclust} function is given below. % %<<>>= % %dummy <- cmp.cluster(db, 0, save.distances="distmat.rda", quiet=T) % %@ % The \Rfunction{cmp.cluster} function is used with the \Rfunarg{save.distances="distmat.rda"} argument to generate a distance matrix. The matrix is saved to a file named \texttt{'distmat.rda'} and it needs to be loaded into R with the \Rfunction{load} function. This matrix can be directly passed on to \Rfunction{hclust}. % %<>= % %load('distmat.rda') % %hc <- hclust(as.dist(distmat), "single") % %plot(as.dendrogram(hc), horiz=T) % %@ % % \section*{Format Interconversions between SMILES and SDF} % This option will be provided in the future. At this point, SMILES strings can be imported into ChemmineR only indirectly by converting them into SDFs via ChemMine's online WorkBench (\url{http://bioweb.ucr.edu/ChemMineV2/work/smiles/}). % % \section*{Calculation of Physicochemical Descriptors} % Several functions will be available in the near future for calculating physicochemical descriptors directly in ChemmineR. Currently, users can calculate 40 common physicochemical descriptors with the online descriptor prediction tool available on ChemMine's WorkBench (\url{http://bioweb.ucr.edu/ChemMineV2/work/sdf/}). \bibliography{bibtex} \end{document}