% Manual compile % Sweave("fmcsR.Rnw"); system("pdflatex fmcsR.tex; bibtex fmcsR; pdflatex fmcsR.tex; pdflatex fmcsR.tex") % echo 'Sweave("fmcsR.Rnw")' | R --slave; echo 'Stangle("fmcsR.Rnw")' | R --slave; pdflatex fmcsR.tex; bibtex fmcsR; pdflatex fmcsR.tex % NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. % % \VignetteIndexEntry{gpls Tutorial} % \VignetteDepends{} % \VignetteKeywords{} % \VignettePackage{gpls} \documentclass{article} <>= BiocStyle::latex() @ \usepackage{hyperref} \usepackage{url} \usepackage[authoryear,round]{natbib} \usepackage{graphicx} \bibliographystyle{plainnat} \def\bibsection{\section{References}} \usepackage{graphicx} \usepackage{color} \usepackage{hyperref} \usepackage{url} %\newcommand{\comment}[1]{} %\newcommand{\Rfunction}[1]{{\texttt{#1}}} %\newcommand{\Robject}[1]{{\texttt{#1}}} %\newcommand{\Rpackage}[1]{{\textit{#1}}} %\newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} %\newcommand{\Rclass}[1]{{\textit{#1}}} % Define header and footer area with fandyhdr package (see: http://www.ctan.org/tex-archive/macros/latex/contrib/fancyhdr/fancyhdr.pdf) \usepackage{fancyhdr} \pagestyle{fancy} \fancyhead{} \fancyfoot{} \rhead{\nouppercase{\leftmark}} \lhead{\textit{fmcsR Manual}} \rfoot{\thepage} <>= options(width=95) unlink("test.db") @ %\parindent 0in \begin{document} \title{fmcsR: Mismatch Tolerant Maximum Common Substructure Detection for Advanced Compound Similarity Searching} \author{Yan Wang, Tyler Backman, Kevin Horan, Thomas Girke} \maketitle \tableofcontents \section{Introduction} Maximum common substructure (MCS) algorithms rank among the most sensitive and accurate methods for measuring structural similarities among small molecules. This utility is critical for many research areas in drug discovery and chemical genomics. The MCS problem is a graph-based similarity concept that is defined as the largest substructure (sub-graph) shared among two compounds \citep{Cao2008a, Wang2013a}. It fundamentally differs from the structural descriptor-based strategies like fingerprints or structural keys. Another strength of the MCS approach is the identification of the actual MCS that can be mapped back to the source compounds in order to pinpoint the common and unique features in their structures. This output is often more intuitive to interpret and chemically more meaningful than the purely numeric information returned by descriptor-based approaches. Because the MCS problem is NP-complete, an efficient algorithm is essential to minimize the compute time of its extremely complex search process. The \Rpackage{fmcsR} package implements an efficient backtracking algorithm that introduces a new flexible MCS (FMCS) matching strategy to identify MCSs among compounds containing atom and/or bond mismatches. In contrast to this, other MCS algorithms find only exact MCSs that are perfectly contained in two molecules. The details about the FMCS algorithm are described in the Supplementary Materials Section of the associated publication \citep{Wang2013a}. The package provides several utilities to use the FMCS algorithm for pairwise compound comparisons, structure similarity searching and clustering. To maximize performance, the time consuming computational steps of \Rpackage{fmcsR} are implemented in C++. Integration with the \Rpackage{ChemmineR} package provides visualization functionalities of MCSs and consistent structure and substructure data handling routines \citep{Cao2008c, Backman2011a}. The following gives an overview of the most important functionalities provided by \Rpackage{fmcsR}. \\ \section{Installation} The R software for running \Rpackage{fmcsR} and \Rpackage{ChemmineR} can be downloaded from CRAN (\url{http://cran.at.r-project.org/}). The \Rpackage{fmcsR} package can be installed from an open R session using the \Rfunction{biocLite} install command. <>= source("http://bioconductor.org/biocLite.R") biocLite("fmcsR") @ \section{Quick Overview} \noindent To demo the main functionality of the \Rpackage{fmcsR} package, one can load its sample data stored as \Rclass{SDFset} object. The generic \Rfunction{plot} function can be used to visualize the corresponding structures. <>= library(fmcsR) data(fmcstest) plot(fmcstest[1:3], print=FALSE) @ \begin{figure}[bthp] \centering \includegraphics[height=60mm]{fmcsR-quicktest1.pdf} \caption{Structures depictions of sample data.} \label{fig:quicktest1} \end{figure} \noindent The \Rfunction{fmcs} function computes the MCS/FMCS shared among two compounds, which can be highlighted in their structure with the \Rfunction{plotMCS} function. <>= test <- fmcs(fmcstest[1], fmcstest[2], au=2, bu=1) plotMCS(test) @ \begin{figure}[bthp] \centering \includegraphics[height=60mm]{fmcsR-quicktest2.pdf} \caption{The red bonds highlight the MCS shared among the two compounds.} \label{fig:quicktest2} \end{figure} \pagebreak \section{Documentation} <>= library("fmcsR") # Loads the package @ <>= library(help="fmcsR") # Lists functions/classes provided by fmcsR library(help="ChemmineR") # Lists functions/classes from ChemmineR vignette("fmcsR") # Opens this PDF manual vignette("ChemmineR") # Opens ChemmineR PDF manual @ \noindent The help documents for the different functions and container classes can be accessed with the standard R help syntax. <>= ?fmcs ?"MCS-class" ?"SDFset-class" @ \section{MCS of Two Compounds} \subsection{Data Import} \noindent The following loads the sample data set provided by the \Rpackage{fmcsR} package. It contains the SD file (SDF) of \Sexpr{length(fmcstest)} molecules stored in an \Rclass{SDFset} object. <>= data(fmcstest) sdfset <- fmcstest sdfset @ \noindent Custom compound data sets can be imported and exported with the \Rfunction{read.SDFset} and \Rfunction{write.SDF} functions, respectively. The following demonstrates this by exporting the \Rclass{sdfset} object to a file named sdfset.sdf. The latter is then reimported into R with the \Rfunction{read.SDFset} function. <>= write.SDF(sdfset, file="sdfset.sdf") mysdf <- read.SDFset(file="sdfset.sdf") @ \subsection{Compute MCS} \noindent The \Rfunction{fmcs} function accepts as input two molecules provided as \Rclass{SDF} or \Rclass{SDFset} objects. Its output is an S4 object of class \Rclass{MCS}. The default printing behavior summarizes the MCS result by providing the number of MCSs it found, the total number of atoms in the query compound $a$, the total number of atoms in the target compound $b$, the number of atoms in their MCS $c$ and the corresponding \textit{Tanimoto Coefficient}. The latter is a widely used similarity measure that is defined here as $c/(a+b-c)$. In addition, the \textit{Overlap Coefficient} is provided, which is defined as $c/min(a,b)$. This coefficient is often useful for detecting similarities among compounds with large size differences. <>= mcsa <- fmcs(sdfset[[1]], sdfset[[2]]) mcsa mcsb <- fmcs(sdfset[[1]], sdfset[[3]]) mcsb @ \noindent If \Rfunction{fmcs} is run with \Rfunarg{fast=TRUE} then it returns the numeric summary information in a named \Rclass{vector}. <>= fmcs(sdfset[1], sdfset[2], fast=TRUE) @ \subsection{\Rclass{MCS} Class Usage} \noindent The \Rclass{MCS} class contains three components named \Rclass{stats}, \Rclass{mcs1} and \Rclass{mcs2}. The \Rclass{stats} slot stores the numeric summary information, while the structural MCS information for the query and target structures is stored in the \Rclass{mcs1} and \Rclass{mcs2} slots, respectively. The latter two slots each contain a \Rclass{list} with two subcomponents: the original query/target structures as \Rclass{SDFset} objects as well as one or more numeric index vector(s) specifying the MCS information in form of the row positions in the atom block of the corresponding \Rclass{SDFset}. A call to \Rfunction{fmcs} will often return several index vectors. In those cases the algorithm has identified alternative MCSs of equal size. <>= slotNames(mcsa) @ \noindent Accessor methods are provided to return the different data components of the \Rclass{MCS} class. <>= stats(mcsa) # or mcsa[["stats"]] mcsa1 <- mcs1(mcsa) # or mcsa[["mcs1"]] mcsa2 <- mcs2(mcsa) # or mcsa[["mcs2"]] mcsa1[1] # returns SDFset component mcsa1[[2]][1:2] # return first two index vectors @ \noindent The \Rfunction{mcs2sdfset} function can be used to return the substructures stored in an \Rclass{MCS} instance as \Rclass{SDFset} object. If \Rfunarg{type="new"} new atom numbers will be assigned to the subsetted SDF, while \Rfunarg{type="old"} will maintain the atom numbers from its source. For details consult the help documents \Rclass{?mcs2sdfset} and \Rclass{?atomsubset}. <>= mcstosdfset <- mcs2sdfset(mcsa, type="new") plot(mcstosdfset[[1]], print=FALSE) @ \noindent To construct an \Rclass{MCS} object manually, one can provide the required data components in a \Rclass{list}. <>= mylist <- list(stats=stats(mcsa), mcs1=mcs1(mcsa), mcs2=mcs2(mcsa)) as(mylist, "MCS") @ \section{FMCS of Two Compounds} \noindent If \Rfunction{fmcs} is run with its default paramenters then it returns the MCS of two compounds, because the mismatch parameters are all set to zero. To identify FMCSs, one has to raise the number of upper bound atom mismates \Rfunarg{au} and/or bond mismatches \Rfunarg{bu} to interger values above zero. <>= plotMCS(fmcs(sdfset[1], sdfset[2], au=0, bu=0)) @ \begin{figure}[bthp] \centering \includegraphics[height=60mm]{fmcsR-au0bu0.pdf} \caption{MCS for \Rclass{sdfset[1]} and \Rclass{sdfset[2]} with \Rfunarg{au=0} and \Rfunarg{bu=0}} \label{fig:au0bu0} \end{figure} <>= plotMCS(fmcs(sdfset[1], sdfset[2], au=1, bu=1)) @ \begin{figure}[bthp] \centering \includegraphics[height=60mm]{fmcsR-au1bu1.pdf} \caption{FMCS for \Rclass{sdfset[1]} and \Rclass{sdfset[2]} with \Rfunarg{au=1} and \Rfunarg{bu=1}} \label{fig:au1bu1} \end{figure} <>= plotMCS(fmcs(sdfset[1], sdfset[2], au=2, bu=2)) @ \begin{figure}[bthp] \centering \includegraphics[height=60mm]{fmcsR-au2bu2.pdf} \caption{FMCS for \Rclass{sdfset[1]} and \Rclass{sdfset[2]} with \Rfunarg{au=2} and \Rfunarg{bu=2}} \label{fig:au2bu2} \end{figure} \pagebreak <>= plotMCS(fmcs(sdfset[1], sdfset[3], au=0, bu=0)) @ \begin{figure}[bthp] \centering \includegraphics[height=60mm]{fmcsR-au0bu013.pdf} \caption{MCS for \Rclass{sdfset[1]} and \Rclass{sdfset[3]} with \Rfunarg{au=0} and \Rfunarg{bu=0}} \label{fig:au2bu213} \end{figure} \pagebreak \section{FMCS Search Functionality} \noindent The \Rfunction{fmcsBatch} function provides FMCS search functionality for compound collections stored in \Rclass{SDFset} objects. <>= data(sdfsample) # Loads larger sample data set sdf <- sdfsample fmcsBatch(sdf[1], sdf[1:30], au=0, bu=0) @ \section{Clustering with FMCS} \noindent The \Rfunction{fmcsBatch} function can be used to compute a similarity matrix for clustering with various algorithms available in R. The following example uses the FMCS algorithm to compute a similarity matrix that is used for hierarchical clustering with the \Rfunction{hclust} function and the result is plotted in form of a dendrogram. <>= sdf <- sdf[1:7] d <- sapply(cid(sdf), function(x) fmcsBatch(sdf[x], sdf, au=0, bu=0, matching.mode="aromatic")[,"Overlap_Coefficient"]) d hc <- hclust(as.dist(1-d), method="complete") plot(as.dendrogram(hc), edgePar=list(col=4, lwd=2), horiz=TRUE) @ \begin{figure}[bthp] \centering \includegraphics[width=80mm]{fmcsR-tree.pdf} \caption{Hierarchical clustering result.} \label{fig:tree} \end{figure} \noindent The FMCS shared among compound pairs of interest can be visualized with \Rfunction{plotMCS}, here for the two most similar compounds from the previous tree: <>= plotMCS(fmcs(sdf[3], sdf[7], au=0, bu=0, matching.mode="aromatic")) @ \begin{figure}[bthp] \centering \includegraphics[height=60mm]{fmcsR-au0bu024.pdf} \caption{Most similar compounds from previous tree.} \label{fig:au2bu224} \end{figure} \section{Version Information} <>= toLatex(sessionInfo()) @ % \section{Supplementary Materials: Outline of FMCS Algorithm} % \markboth{Supplementary Materials: Outline of FMCS Algorithm}{} % Only required to print section title in header field without numbering. % \label{supplement:algorithm} % % \subsection{Background Knowledge} % % \subsubsection{Isomorphism} % To introduce the graph concepts used by the following parts in this outline, we assume two graphs $G_{query}$ and $G_{target}$. The two are said to be \emph{isomorphic} when there is a \emph{one-to-one correspondence} among their vertices, which means an edge exists in one graph if and only if an edge exists between the corresponding two vertices in the other graph. For instance, the two graphs (a) and (b) in Figure \ref{FIG:ISO} are isomorphic since all vertices have a one-to-one correspondence, but graph (a) and graph (c) are not isomorphic, since there is no \emph{one-to-one correspondence} among their vertices. % % % \begin{figure}[bthp] % \centering % %{\def\epsfsize#1#2{0.5#1}\FIG{iso}} % \includegraphics[width=10cm]{images/iso.pdf} % \caption{Illustration of isomorphism on three sample graphs.} % \label{FIG:ISO} % \end{figure} % % % % \begin{figure}[bthp] % \centering % %{\def\epsfsize#1#2{0.5#1}\FIG{sub}} % \includegraphics[width=8cm]{images/sub.pdf} % \caption{Three sample structures to illustrate induced substructures.} % \label{FIG:sub} % \end{figure} % % \subsubsection{Common Substructure} % A graph $G_{sub}(V_{sub}, E_{sub})$ is said to be a \emph{subgraph} of $G(V, E)$, when $V_{sub}$ is a subset of $V$ and $E_{sub}$ is a subset of $E$. A similar but distinct term is \emph{induced subgraph}. A graph $G_{ind}(V_{ind}, E_{ind})$ is said to be an \emph{induced subgraph} of $G(V, E)$, when $V_{sub}$ is a subset of $V$ and $E_{ind}$ is the set of all correspondence edges. In Figure \ref{FIG:sub}, graph (b) is a subgraph of graph (a), but not an \emph{induced subgraph}, because it lacks the correspondence edge $E_{BC}$. Graph (c) is an \emph{induced subgraph} of graph (a), since it contains all correspondence edges from graph (a). In this study all small molecule structures are treated as graphs, thus, every atom has a corresponding vertex and every chemical bond corresponds to an edge. Therefore, in the rest of this outline, we use the terms \emph{substructure} and \emph{subgraph} interchangeably, because they are equivalent. % % The term \emph{common substructure} is easy to understand. For instance, the induced substructures $G_{sub1}$ and $G_{sub2}$ from molecules $G_1$ and $G_2$, respectively, are common substructures, if $G_{sub1}$ and $G_{sub1}$ are isomorphic. Here we are aiming to find common substructures with the largest number of vertices, thus the term \emph{maximum common substructure} (MCS). % % \subsection{Overview of the algorithm} % To introduce our algorithm, the following illustrates its major steps for two simple structures (1) and (2) shown in Figure \ref{FIG:EMP}. For simplicity, both structures contain only one atom type (\textit{e.g.} carbon) and one bond type. % \begin{figure}[bthp] % \centering % %{\def\epsfsize#1#2{0.5#1}\FIG{emp}} % \includegraphics[width=10cm]{images/emp.pdf} % \caption{Two different structures.} % \label{FIG:EMP} % \end{figure} % % \begin{figure}[bthp] % \centering % %{\def\epsfsize#1#2{0.5#1}\FIG{tree}} % \includegraphics[width=14cm]{images/tree.pdf} % \caption{Sample search tree to identify the MCS shared among compound (1) and (2) in Figure \ref{FIG:EMP}.} % \label{FIG:TREE} % \end{figure} % If we take these two structures as input, our algorithm will built a search tree given in Figure \ref{FIG:TREE}. The first level on the top of the tree contains an empty \emph{candidate mapping set}. As the algorithm traverses at each level through every node of the tree, it will try adding a potential match to the \emph{candidate mapping set}. This results in one mapping on the second level. Because there is only one atom type in the given example, all combinations of atoms are valid matches. In the third level, the algorithm will try to add one more potential match. However, the set $\{(A,a), (C,c)\}$ contains two disconnected substructures, which is not desired, therefore, the search on this branch will be terminated. The next set ${(A,a), (B,b)}$ results in a larger candidate substructure because the new atom match $(B,b)$ is a valid match. All atoms in this set are connected, and the bonds between them have a one-to-one correspondence. Therefore, the algorithm will continue on this branch to traverse to the next level in the tree where $(C,c)$ will be added to the candidate set which is also a valid mapping. This process will continue until the entire search tree has been travered. The largest candidate set obtained at the end will be returned as the MCS shared among the two input structures. % % The pseudocode of our method is given in Algorithm \ref{ALG:recurrsion}. The following explains the most relevant parts of the algorithm. Important variables are \texttt{targetAtomSet}, \texttt{queryAtomSet}, \texttt{candidate MappingSet}, and \texttt{largestMappingSet}. The \texttt{targetAtomSet} contains atoms from the \emph{target} compound, and \texttt{queryAtomSet} contains atoms from the \emph{query} compounds which are the two input compounds, respectively. The \texttt{candidateMappingSet} will be used throughout the entire algorithm to store the candidate mappings between the target and query compound, while the \texttt{largestMappingSet} keeps track of the largest mapping set found at any given step. % % \begin{algorithmbis}[Pseudocode for the main recursive function.]\label{ALG:recurrsion} % \textsc{RecursiveSolving}\texttt{(targetAtomSet, queryAtomSet)} % \begin{algorithmic}[1] % % \STATE \texttt{potentialSize} $\gets$ \textsc{CalulatePotentialSize}\textsc{()} \label{LINE:bbStart} % \IF{\texttt{potentialSize} $<$ \texttt{largestMappingSet.size()}} \label{LINE:bbCheck} % \RETURN % \ENDIF \label{LINE:bbEnd} % % \WHILE{\TRUE} % \IF{\texttt{targetAtomList.empty()} \textbf{or} \texttt{queryAtomList.empty()}} % \STATE Update \texttt{largestMappingSet} \label{LINE:update} % \RETURN % \ENDIF % \STATE \texttt{targetAtom} $\gets$ \textsc{NextAtom}\texttt{(targetAtomSet)} % \STATE \texttt{targetAtomSet.erase(targetAtom)} % \FOR{each \texttt{queryAtom} in \texttt{queryAtomSet}} % \STATE \texttt{atomMismatchOccurred} $\gets $ \texttt{\FALSE} % \IF{\texttt{targetAtom.type()} $\ne$ \texttt{queryAtom.type()}} % \STATE \texttt{atomMismatchCounter} $\gets$ \texttt{atomMismatchCounter + 1} \label{LINE:aCounter} % \STATE \texttt{atomMismatchOccurred} $ \gets$ \texttt{\TRUE} % \ENDIF % % \IF{$not$ \textsc{RulesAllowed}\texttt{(targetAtom.type(), queryAtom.type())}} \label{LINE:ruleStart} % \RETURN % \ENDIF \label{LINE:ruleEnd} % % \IF{\texttt{atomMismatchCounter} $\leq$ \texttt{atomMismatchUpperBound}} % \IF{\textsc{Compatible}\texttt{(targetAtom, queryAtom)}} \label{LINE:checkCompat} % \STATE \texttt{bondMismatchCounter} $\gets$ \texttt{bondMismatchCounter + newBondMismatchCount} \label{LINE:bondMisUse1} % \IF{\texttt{bondMismatchCounter} $\leq$ \texttt{bondMismatchUpperBound}} % \IF{the new mapping atom does not connect to any atom in \texttt{candidateMappingSet}} \label{LINE:fragStart} % \STATE \texttt{totalFragments} $\gets$ \texttt{totalFragments + 1} % \STATE \texttt{newFragmentIntroduced} $\gets$ \texttt{\TRUE} % \ENDIF % \IF{\texttt{totalFragments} $\leq$ \texttt{fragmentLimit}} \label{LINE:fragEnd} % \STATE \texttt{candidateMappingSet.add(}\textsc{Pair}\texttt{(targetAtom, queryAtom))} % \STATE \texttt{queryAtomSet.erase(queryAtom)} % \STATE \textsc{RecursiveSolving}\texttt{(targetAtomSet, queryAtomSet)} % \STATE \texttt{queryAtomSet.add(queryAtom)} % \STATE \texttt{candidateMappingSet.erase(}\textsc{Pair}\texttt{(targetAtom, queryAtom))} % % \ENDIF % \IF{\texttt{newFragmentIntroduced = \TRUE}} % \STATE \texttt{totalFragments} $\gets$ \texttt{totalFragments - 1} % \STATE \texttt{newFragmentIntroduced} $\gets$ \texttt{\FALSE} % \ENDIF % \ENDIF % \STATE \texttt{bondMismatchCounter} $\gets$ \texttt{bondMismatchCounter - newBondMismatchCount} \label{LINE:bondMisUse2} % \ENDIF % \ENDIF % \IF{\texttt{atomMismatchOccurred = \TRUE}} % \STATE \texttt{atomMismatchCounter} $\gets$ \texttt{atomMismatchCounter - 1} % \ENDIF % \ENDFOR % \ENDWHILE % % %\algstore{my} % \end{algorithmic} % % %\caption{Pseudocode for the main recursive function.} % %\label{ALG:recurrsion} % \end{algorithmbis} % % % \subsection{Fragment Control} % % Our algorithm applies a heuristic to efficiently identifty MCSs among two compounds. During the search process, it evaluates every possible mapping solution. Thus, it will sometimes enter a branch containing a large number of small subsolutions that are unlikely to contribute to the final MCS. Because finding disconnected substructures is costy with respect to time performance and it is not the objective of this algorithm, we minimize these unnecessary computations as follows. When a search enters a branch where the number of subsolutions exceeds a user definable limit, the algorithm will stop searching this branch and move to the next one. This subroutine is defined in lines \ref{LINE:fragStart} - \ref{LINE:fragEnd} of Algorithm \ref{ALG:recurrsion}. According to experiments, this constraint dramatically reduces the search space, and makes the algorithm much more efficient. % % \subsection{Add Mappings} % % Line \ref{LINE:checkCompat} evaluates whether a mapping between two atoms is valid or not. According to the definition of an induced subgraph, all vertices between the two graphs must have a one-to-one correspondence. To determine whether such a correspondence exists between atoms of the \emph{target} and \emph{query} compound, the \texttt{targetAtom} and \texttt{queryAtom} are set to be two unmatched atoms in the \texttt{targetAtomSet} and \texttt{queryAtomSet}, respectively. Next, the algorithm will test whether \textsc{Pair}\texttt{(targetAtom, queryAtom)} is a feasible match and retrieve all matched neighbors of \texttt{targetAtom}, which are \texttt{targetNeighborAtoms} and all matched neighbors \texttt{queryNeighborAtoms} of \texttt{queryAtom}. If the atoms in \texttt{targetNeighborAtoms} and \texttt{queryNeighborAtoms} do have a one-to-one correspondence, then the correspondence \textsc{Pair}\texttt{(targetAtom, queryAtom)} will be added to the current match set. If there is no correspondence, then the current search branch will be discarded. This subroutine is defined in line \ref{LINE:checkCompat} of Algorithm \ref{ALG:recurrsion}. % % \begin{algorithm} % \textsc{Compatible}\texttt{(targetAtom, queryAtom)} % \begin{algorithmic}[1] % \STATE \texttt{targetNeighborAtoms} $\gets$ \texttt{targetAtom.neighborsWith(candidateMappingSet)} \label{LINE:targetMap} % \STATE \texttt{queryNeighborAtoms} $\gets$ \texttt{queryAtom.neighborsWith(candidateMappingSet)} \label{LINE:queryMap} % % \STATE \texttt{neighborMapping} $\gets$ \texttt{candidateMappingSet.getMappingOf(targetAtom)} \label{LINE:getMapping} % % \IF{$not$ \texttt{neighborMapping.queries().equals(queryNeighborAtoms)}} \label{LINE:checkEqual} % \RETURN \texttt{\FALSE} % \ENDIF % % \FOR{each \texttt{mapping} in \texttt{targetNeighborMapping}} \label{LINE:bondMisStart} % \IF{\textsc{Bond}\texttt{(mapping.target(), targetAtom)} $\ne$ \textsc{Bond}\texttt{(mapping.query(), queryAtom)}} % \STATE \texttt{newBondMismatchCount = newBondMismatchCount +1} \label{Line:bondMis} % \ENDIF % \ENDFOR \label{LINE:bondMisEnd} % % \STATE here to handle aromatic or ring bonds % % \RETURN \texttt{\TRUE} % % \end{algorithmic} % % \caption{Pseudo-code for the \textsc{Compatible} subroutine} % \label{ALG:compatible} % \end{algorithm} % % A pseudo-code is given in Algorithm \ref{ALG:compatible}. Line~\ref{LINE:targetMap} and line~\ref{LINE:queryMap} retrieve all matched neighbors of \texttt{targetAtom} and \texttt{queryAtom}, respectively. Line~\ref{LINE:getMapping} just retrieves a set of mapping atoms from \texttt{candidateMappingSet} based on \texttt{targetAtom}. The one-to-one correspondence checking is performed be line~\ref{LINE:checkEqual}, if not, simply return false. Line~\ref{LINE:bondMisStart} through ~\ref{LINE:bondMisEnd} handles bonds mismatches. The value of \texttt{newBondMismatchCount} in line~\ref{Line:bondMis} will be used in line~\ref{LINE:bondMisUse1} and line~\ref{LINE:bondMisUse2} in algorithm \ref{ALG:recurrsion}. % % % \subsection{Tolerance of Mismatches} % % A key requirement of our algorithm is to allow mis-matches among the atom-to-atom and bond-to-bond mappings of an MCS shared among two compounds. Specifically, this means a carbon could be matched to a nitrogen, or a double bond could be matched to a single bond. Especially, atom mismatches are not supported by perfect matching MCS algorithms. However, our Algorithm \ref{ALG:recurrsion} supports flexible matching by using a series of counters. The \texttt{atomMismatchCounter} (line \ref{LINE:aCounter}) and \texttt{bondMismatchCounter} (line \ref{LINE:bondMisUse1}) record the atom and bond mismatches, respectively, while mismatch tracking is achieved with a straightforward backtracking approach. When a mismatch occurs during searching, the algorithm first evaluates whether it is a valid mismatch given the chosen matching rules (line \ref{LINE:ruleStart} - \ref{LINE:ruleEnd}). If it is a valid mismatch, then the corresponding counter will be increased by 1. If one of the counters exceeds its upper limit or the mismatch is illegal, then the algorithm will stop searching that branch, and transition to the next node. The algorithm will check the counter's lower-bound when updating the \texttt{largestMappingSet} (line \ref{LINE:update}), if the counter does not reach the corresponding lower-bound limit, the mapping won't be recorded. % % \subsection{Cut Branches} % % To avoid an exhaustive search of the entire search tree, a branch and bound strategy is introduced to prune branches that are guaranteed not to lead to the optimal solution. This is achieved as follows. Every time the algorithm enters a node, it will estimate an upper bound of the size of the MCS generated by this branch. If this upper bound is below the size of the \texttt{largestMappingSet}, the current branch will be cut meaning the search on this branch will be terminated. This branch and bound strategy is formalized in lines \ref{LINE:bbStart} through \ref{LINE:bbEnd} of Algorithm \ref{ALG:recurrsion}. Line \ref{LINE:bbCheck} terminates searching if the upper bound is smaller than the size of the \texttt{largestMappingSet}. % % \begin{figure}[bthp] % \centering % %{\def\epsfsize#1#2{0.8#1}\FIG{bb}} % \includegraphics[width=8cm]{images/bb.pdf} % \caption{Method to calculate upper bond of potential matching.} % \label{FIG:bb} % \end{figure} % % The upper bound is estimated by using the definition of the induced subgraph heuristic. At some point when traversing the search tree (see Figure \ref{FIG:bb}), let $m$ be the size of \texttt{candidateMappingSet}, $T$ be the set of the unmatched vertices in \texttt{targetAtomSet} and $Q$ be the set of the unmatched vertices in \texttt{queryAtomSet}. For each atom in $T$, if it has not been marked as infeasible by the previous search, we apply the above heuristic to test whether it is allowed to match to any atoms in $Q$. If $n$ atoms of $T$ find potential matches in $Q$, then an upper bound on the sizes of the common subgraphs at the leaf nodes of the present branch is $m + n$. % % \begin{algorithm} % \textsc{CalulatePotentialSize}\textsc{()} % \begin{algorithmic}[1] % \FOR{each \texttt{targetAtom} in \texttt{targetAtomSet}} \label{LINE:for1} % \IF{$not$ \texttt{candidateMappingSet.contains(targetAtom)}} % \FOR{each \texttt{queryAtom} in \texttt{queryAtomSet}} \label{LINE:for2} % \IF{$not$ \texttt{candidateMappingSet.contains(queryAtom)}} % \IF{\texttt{targetAtom.bondsToMapping() = queryAtom.bondsToMapping()}} \label{LINE:check} % \STATE \texttt{n $\gets$ n + 1} % \STATE \texttt{queryAtomSet.erase(queryAtom)} % \ENDIF % \ENDIF % \ENDFOR % \ENDIF % \ENDFOR % \RETURN \texttt{candidateMappingSet.size() + n} \label{LINE:return} % \end{algorithmic} % % \caption{Pseudo-code for the \textsc{CalulatePotentialSize} subroutine.} % \label{ALG:bb} % \end{algorithm} % % The pseudocode of this subroutine is given in Algorithm \ref{ALG:bb}. The two \texttt{FOR} loops, starting at lines \ref{LINE:for1} and \ref{LINE:for2}, identify the unmatched atoms from \texttt{targetAtomSet} and \texttt{queryAtomSet}. Line \ref{LINE:check} determines the number of connections the two atoms have within \texttt{candidateMappingSet}. % % \subsection{Ordering Search Space} % \label{SEC:order} % A proper search order can dramatically accelerate the search process. When the algorithm is traversing the search tree, we are more interested in searching the branch that most likely contains the optimal solution first. Once we have a suboptimal solution which is very near to the real optimal one, we can cut much more branches using branch and bound. This dramatically reduces the complexity of the search process, and shortens the compute time. % % When traversing the search tree, to compute the MCS between the \emph{target} and \emph{query} compound, our algorithm chooses the next atom to match in a hierarchical manner. It always uses the atom in \texttt{targetAtomSet} with the most neighbors in \texttt{candidateMappingSet} (see Algorithm \ref{ALG:next}). Our performance tests show that this strategy results in an additional speedup of our algorithm. % % \begin{algorithm} % \textsc{NextAtom}\texttt{(atomSet)} % \begin{algorithmic}[1] % \FOR{each \texttt{atom} in \texttt{atomSet}} % \IF{\texttt{atom} has largest connection with \texttt{candidateMappingSet}} % \RETURN \texttt{atom} % \ENDIF % \ENDFOR % \end{algorithmic} % % \caption{Pseudo-code for the \textsc{NextAtom} subroutine.} % \label{ALG:next} % \end{algorithm} % % % \subsection{Rings and Aromaticity} % % Often it is not desired to compute MCSs that violate basic rules of organic chemistry. Those include mappings of non-aromatic bonds with aromatic ones or aliphatic with ring components. To avoid this, our FMCS can be run in a mode that is aware of these basic chemical principles by treating aromatic and ring bonds differently to compute chemically more reasonable MCSs. This is achieved by detecting in a preprocessing step all ring as well as aromatic bonds to differentiate them from aliphatic bonds. In the ring and aromaticity aware modes the number of mappings the main algorithm has to perform is usually reduced resulting in a shorter compute time when this mode is turned on. \bibliography{bibtex} \end{document}