\newcommand{\mod}{\mathop{\rm mod}\nolimits}
\newcommand\dtd{\acro{DTD}}
\newcommand\SGML{\acro{SGML}\xspace}
\newcommand\ISO{\acro{ISO}\xspace}
\title{Standard DTDs and scientific publishing}
\author{N. A. F. M. Poppelier (\texttt{n.poppelier@elsevier.nl}),\\ E. van Herwijnen (\texttt{eric@vanherwijnen.org}), and \\ C.A. Rowley (\texttt{C.A.Rowley@open.ac.uk})}
\date{7 August 1992}
\let\Tub\TUB
\begin{Article}

\section{Abstract}
This paper has two parts. In the first part we argue that scientific publishing needs \textsl{one} standard \dtd{} for each class of documents that is published: for example, one for all research papers and one for all books. In the second part we apply this reasoning to mathematical formulas, and we outline some design requirements for a document type definition for mathematical formulas. In the appendices we discuss and compare existing document type definitions for mathematical formulas.

\section{Introduction}
In the preface to \cite{one} Charles Goldfarb wrote that the Standard Generalized Markup Language can be described as many things, and that \SGML is all that -- and more. In the introduction to \cite{one} Yuri Rubinsky wrote:
\begin{quote}
\ISO~8879 never describes \SGML as a meta-language, but everything about its system of declarations and notations implies that a developer has the tools to build exactly what is required to indicate the internal structure of any type of information in a common tool independent manner.
\end{quote}
Indeed, a strong point of \SGML is that it can be regarded as a meta-language, a tool with which one can define the syntax of many languages, much as one can with context-free grammars. In \SGML terminology these `languages' are called \textsl{document type definitions}, or \textsl{\dtd{}s} for short. \dtd{}s can be written for any type of information: research papers, books, even music.
A \dtd{} can be used for many purposes, of which two important ones are storage and exchange of information coded according to this \dtd{}. The premise of this paper is that the exchange of information, if it is based on \SGML, needs a single common \dtd{}, agreed upon by all parties involved, for each class of documents that is exchanged. Suppose two parties, $A$ and~$B$, exchange information in the form of one class of documents, and that they each have a \dtd{}, $D(A)$ and $D(B)$, with $D(A)$ not identical to $D(B)$. If~$A$ sends a document to~$B$ then~$A$ can include the document type definition $D(A)$ for that document (instance) at the beginning of the document. This enables~$B$ to use an \SGML parser to check the validity of the document received. However, there is nothing more~$B$ can do with the document: the \dtd{} $D(A)$ contains no information about the meaning of the coding scheme that $D(A)$ defines, and a mapping of the document from $D(A)$ to $D(B)$ is a procedure that cannot be automated. The problem becomes even more difficult when a third party, $C$, is introduced, who accepts material from both~$A$ and~$B$. How is~$C$ going to handle material with two different coding schemes? This is where we encounter one of the weaknesses of \SGML \textsl{as it is being used currently}, namely that it enables every party involved in this process to define and use a different \dtd{}.

\section{Scientific publishing}\label{sci-pub}
In the rest of this paper we concentrate on the exchange of information that occurs in scientific publishing, in particular on the exchange of papers that contain mathematical formulas and are published in research journals. Recent developments in this area formed the main reason for writing this paper. A few standards for encoding of mathematical formulas have already emerged, of which a well-known one is the \acro{AAP} Standard or Electronic Manuscript Standard \cite{two}.
A \dtd{} for mathematical formulas accompanies this standard, but it is not part of it. Another standard for mathematical formulas is the one adopted by CALS \cite{three}, and others are under development \cite{four}, \cite{five}. The handling of mathematical formulas in scientific publishing is part of the bigger whole of information exchange within a (the) scientific community, with the publisher as intermediary, as is shown below: \begin{picture}(100,80)(-70,0) \put(40,50){\oval(80,40)} \put(30,60){$C$} \put(59,50){\oval(20,10)} \put(55,46){$P$} \put(65,50){\vector(-1,-2){20}} \put(40,10){\oval(20,10)} \put(36,6){$G$} \put(34,10){\vector(-1,2){20}} \end{picture} \noindent The authors of research papers are the providers, $P$. The publishers are the gatherers of information, $G$. They accept information from many providers, gather this in the form of a journal issue, and distribute this. In this process, the publisher provides a quality check via the system of peer reviewing, makes notation consistent, and in some cases improves the prose. The information is distributed to a group of consumers, $C$, with the set~$C$ a superset of the set~$P$. In this process, two sorts of information can be exchanged: \begin{itemize} \item material that is structured in the sense of being encoded according to, and checked against, some formal structural specification such as a \dtd{}; \item material that is not structured. \end{itemize} At present most of the material exchanged in the process of scientific publishing is of the unstructured type. We expect that this will remain the situation in the near future. As soon as authors get the possibility of using more sophisticated tools, we expect that publishers will receive increasing numbers of papers of the structured type. 
Several scientific publishers, among them Elsevier Science Publishers, have adopted \SGML as the future main tool for the process of publishing scientific articles \cite{six}, and several other publishers have made, or are expected to make, the same choice. The European Laboratory for Particle Physics (\acro{CERN}), a large community of information providers, is using \SGML to automate the loading of bibliographic information into its library's database \cite{seven}. For both authors and publishers it would be advantageous to agree on one \dtd{} for the encoding of research papers. There are several reasons for this:
\begin{itemize}
\item Most authors do not submit all their articles to one and the same publisher every time. At present they are confronted with `Instructions to Authors' that differ significantly from publisher to publisher.
\item A recent trend is that authors prepare their papers with text-processing software on some computer. This enables them to send the paper in electronic form (electronic manuscript or `compuscript') to the publisher. Publishers are confronted with a variety of text-processing software on a variety of computer systems \cite{eight}, \cite{nine}. Moreover, every field of science appears to have its own `Top Ten' of most used text-processing packages.
\item Bibliographic information about all research papers in all (or most) scientific journals is stored in bibliographic databases. In an ideal world, authors would still be able to use their favourite text-processing system, which would generate \SGML `behind the scenes', so to speak. All publishers would accept one standard \dtd{}, all text-processing systems would be able to generate documents prepared according to this \dtd{}, and all bibliographic databases would be able to store this material.
\end{itemize}
An example of activities towards achieving this ideal situation: the European Working Group on \SGML (\acro{EWS}) and the European Physical Society (\acro{EPS}) have taken the Electronic Manuscript Standard and are trying to develop it into a complete \dtd{}, which should be acceptable to information providers, information gatherers and information consumers. The Electronic Manuscript Standard is now a Draft International Standard, \ISO/\acro{DIS} 12083. The \acro{EWS} and \acro{EPS} hope that the final standard will include their work.

\section{Encoding of mathematical formulas}
In Annex A of \ISO~8879~\cite{ten} we find the following:
\begin{quotation}
Generalized markup is based on two novel postulates:
\begin{itemize}
\item Markup should describe a document's structure and other attributes rather than specify processing to be performed on it, as descriptive markup need be done only once and will suffice for all future processing.
\item Markup should be rigorous so that the techniques available for processing rigorously defined objects like programs and databases can be used for processing documents as well.
\end{itemize}
\end{quotation}
There is no reason why this should not be valid for mathematical formulas. We need to delimit the kind of mathematical formulas we are trying to describe if we want an unambiguous structure. The field of mathematics is so vast that it may be impossible to design a single \dtd{} that covers every kind of mathematical formula. If we concentrate on those sciences which use mathematics as a tool, for example physics, we see that the mathematics used in many physics papers can be described as ``advanced calculus''. This definition can be made more precise by referring to some standard textbooks containing these types of formulas, e.g.\ \textsl{Handbook of Mathematical Functions} \cite{eleven} and the \textsl{Table of integrals, series and products} \cite{twelve}.
If we aim for rigorous encoding of mathematical formulas (the second postulate), we must develop a system of descriptive markup of mathematical formulas that enables us to:
\begin{itemize}
\item convert the formulas between different word processors;
\item store the formulas in and extract them from a database;
\item allow programs to input or output formulas in descriptive markup.
\end{itemize}
An example of the first application would be the conversion of mathematical formulas coded in \LaTeX\ to, say, Word\footnote{Word is a registered trademark of Microsoft.} via \SGML. The benefits of using \SGML as an intermediate language for conversion are described in \cite{thirteen}. Note, for example, that the number of programs required for pairwise conversion between~$n$ languages is proportional to $n^2-n$ without an intermediate language, but to $2n$ with an intermediate language. An example of the second application would be encoding and storing the complete contents of the above-mentioned \textsl{Handbook of Mathematical Functions} \cite{eleven} and \textsl{Table of integrals, series and products} \cite{twelve} in a database, so that this information can be accessed on-line by, say, mathematicians and physicists. Many articles have mathematical formulas in their titles, so any program that extracts bibliographic data should be able to handle mathematics as well. An example of the third application would be the extraction of a formula and its subsequent use in a computer program, written in an ordinary programming language or, for example, in Mathematica.\footnote{Mathematica is a registered trademark of Wolfram Research.} At this point we come back to the ideal world for scientific publishing we sketched earlier. In this world, publishers would use one standard \dtd{} for scientific papers, which enables them to prepare a primary publication -- on paper and (or) in some electronic form -- and to store the information in databases for various secondary purposes.
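To make the count of conversion programs given earlier in this section concrete, here is a small worked example. It assumes, as a simplification of ours, that each converter is one-directional (it translates from one language into exactly one other) and that the intermediate language is not itself one of the $n$ languages. For $n=5$, direct pairwise conversion needs
\[ n^2-n = n(n-1) = 5\cdot 4 = 20 \]
programs, whereas routing every conversion through the intermediate language needs
\[ 2n = 2\cdot 5 = 10 \]
programs: one into and one out of the intermediate language for each of the five. The two counts coincide at $n=3$, since $3^2-3 = 6 = 2\cdot 3$, so on this simple count the intermediate language starts to pay off from $n=4$ onwards.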
The question now is: what should a \dtd{} for mathematical formulas look like, if it is going to be used for these purposes? There are two choices for a \dtd{} for mathematics:
\begin{itemize}
\item P-type: the \dtd{} reflects the Presentation or visual structure; examples of this type are discussed in the appendices.
\item S-type: the \dtd{} reflects the Semantics or logical structure; at present no \dtd{}s of this type exist.
\end{itemize}
The quotation from Annex~A of \ISO~8879 \cite{ten} indicates the preference of the creator(s) of \SGML: markup of a formula should be of S-type, that is, it should describe the logical structure of the formula rather than the way it is represented on a certain medium, say the page of a traditional (non-electronic) book. Let us suppose, for the sake of the argument, that an information gatherer, a publisher, chooses a \dtd{} of S-type. This raises two further questions:
\begin{enumerate}
\item Is descriptive markup of mathematical material possible?
\item If it is possible, who can use it and for which purposes?
\end{enumerate}
The second question needs some explanation. As discussed in section~\ref{sci-pub}, in the process of scientific publishing two sorts of information can be exchanged: mathematical material that is structured according to a formal structural specification, and material that is not structured. This means that there are two possible scenarios. Scenario 1: an author submits a paper in the form of a manuscript (paper), i.e.\ with unstructured formulas, or a compuscript with mathematical formulas in P-type notation (\TeX, WordPerfect, \dots). Scenario 2: an author submits a paper with mathematical formulas in S-type notation. In scenario 1 it is the task of the publisher to convert from paper or P-type notation to S-type notation. Before we discuss the feasibility of this conversion, we will first look at some characteristics of mathematical notation.

\subsection{Characteristics of mathematical notation}\label{character}
Mathematical notation is designed to create the correct ideas in the mind of the reader. It is \textsl{deliberately} ambiguous and incomplete: indeed, it is almost meaningless to all other readers. Or, more technically: the intrinsic information content of any mathematical formula is very low. A formula gets its meaning, i.e.\ its information content, only when used to communicate between two minds which share a large collection of concepts and assumptions, together with an agreed language for communicating the associated ideas. The ambiguity encountered in mathematical notation can be of two types \cite{fourteen}:
\begin{enumerate}
\item A generic notation uses the same symbols to represent similar but different functions, for example `$+$' or `$\times$'. In the case of addition this is not really a problem, but multiplication is a problem, since multiplication of numbers is commutative whereas matrix multiplication is non-commutative!
\item A more fundamental ambiguity is posed by the same notation being used in different fields in different ways. For example: $f'$ stands for the first derivative of~$f$ in calculus, but can mean `any other entity different from $f$' in other areas.
\end{enumerate}
More examples of ambiguity are:
\begin{itemize}
\item Does~$\bar x$ represent a mean, a conjugation or a negation?
\item Is~$i$ an integer variable, e.g.\ the index of a matrix, or is it $\sqrt{-1}$?
\item The other way around: is $\sqrt{-1}$ denoted by~$i$ or by~$j$?\footnote{There are examples of authors actually writing something like $[L_i,L_j] =\frac{i}{2}L_k$, where the first~$i$ is an index, and the second~$i$ stands for~$\sqrt{-1}$.}
\item What is the function of the~2 in $\textrm{SU}_2$, $\log_2x$, $x^2$, $T_2^2$?\footnote{In $\textrm{SU}_2$ it is the number of dimensions of the Lie group; in $\log_2x$ it is the base of the logarithm; if~$x$ is a vector, the~${}_2$ in~$x_2$ is an index; the~${}^2$ in~$x^2$ could be a power, but if~$T$ is a tensor, the~${}^2$ in~$T^2_2$ is a contravariant tensor index.}
\item Is $|X|$ the absolute value of a real (complex) number~$X$ or the polyhedron of a simplicial complex~$X$ \cite{fifteen}?
\end{itemize}
The inverse problem, which is equally common, arises when different typographical constructs have the same mathematical meaning. For example, the meanings of the following two lines would be coded identically
\begin{eqnarray*}
3 &+& 4 (\mod 5)\\
3 &+_5& 4
\end{eqnarray*}
and this would lead to great difficulty if an author wanted to write:
\begin{quote}
We shall often write, for example, $3 + 4 (\mod 5)$ in the shorter form $3 +_5 4$, or even as simply $3+4$ when this will not lead to confusion.
\end{quote}
Of course, natural languages are similarly ambiguous and incomplete, but no one we know is suggesting that in an \SGML document each word should be coded such that it reflects the full dictionary definition of the meaning which that particular use of the word is intended to have!

\subsection{Who performs the markup of math?}
How does one convert P-type mathematical material, which an author has produced, to S-type notation, which the publisher uses?
In \cite[p.~9]{one} Goldfarb gives a three-step model for document processing:
\begin{enumerate}
\item recognition of part of a document (adding a generic identifier for the appropriate element);\label{first}
\item mapping (associating a processing function with each element);\label{second}
\item processing (e.g.\ translating elements into word processor commands).\label{third}
\end{enumerate}
In the publishing of scientific papers and books steps~\ref{second} and~\ref{third} are the responsibility of the publisher. Traditionally, step~\ref{first} was also their responsibility: the technical editor adds markup signs in the margin of the manuscript, depending on the text and the visual representation that the house style dictates. It is, however, unlikely that a technical editor is capable of identifying the precise function of every part of a mathematical formula, for several reasons, most of which were discussed in the previous subsection, namely that mathematical notation:
\begin{itemize}
\item is not unambiguous,
\item is not completely standardized,
\item is not a closed system.
\end{itemize}
Even if the technical editor were capable of identifying every part of a formula, this would be too time-consuming -- and therefore too costly. However, under certain conditions \cite{sixteen}, automatic translation from visual structure to logical structure of mathematical material is simplified greatly. This, and what we discussed in section~\ref{character}, leads us to conclude the following. A publisher has no choice but to use a P-type \dtd{} for mathematical material that is submitted in unstructured form or in P-type notation. Even if S-type markup of a mathematical formula were possible, conversion from P-type to S-type would be difficult or even impossible. Conclusion: the tags for S-type markup should not be added by the information gatherer, but by the information providers, i.e.\ the authors, who should be able to identify each part of their formulas.
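As a rough illustration of this division of labour, the following sketch transposes the three-step model into \LaTeX\ terms. It is an analogy only: the macro name \verb|\deriv| is hypothetical, introduced here for the example, and is not taken from any of the \dtd{}s discussed in this paper. The author performs the recognition step by tagging the formula with its function; the publisher performs the mapping step by choosing a definition; processing is then left to the formatter.
\begin{verbatim}
\documentclass{article}
% Hypothetical logical macro: first derivative of #1 with respect to #2.
% Publisher's mapping, variant 1: prime notation.
\newcommand{\deriv}[2]{#1'(#2)}
% Publisher's mapping, variant 2: fraction notation.
% Uncomment to switch the house style for every tagged derivative.
% \renewcommand{\deriv}[2]{\frac{\mathrm{d}#1(#2)}{\mathrm{d}#2}}
\begin{document}
% Author's markup: names the function of the formula, not its appearance.
\[ \deriv{f}{x} \]
\end{document}
\end{verbatim}
With such a scheme, changing the presentation means changing one definition rather than every formula, but only the author can know that a given prime was meant as a derivative in the first place.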

\subsection{Feasibility of S-type notation}
In our second scenario, authors would submit papers with mathematical formulas in S-type notation. This would enable the publisher to `down translate'\footnote{`Down' because information is lost in the process; we borrowed the terminology of translating `up' and `down' from Exoterica OmniMark.} to any mathematics typesetting language (P-type notation). However, the same reasoning as in section~\ref{character} leads us to the following conjecture. \textsl{Conjecture.} It is impossible to create an S-type \dtd{} for all of mathematics. Representing the ``full meaning'' of a mathematical formula, if such a notion exists, will almost certainly lead to attempts to pack more and more unnecessary information into the representation until it becomes useless for any purpose. This is rather like Russell and Whitehead reducing ``simple arithmetic'' to logic and taking several pages of symbols to represent the ``true meaning of $2+2=4$''. Even if it were possible to define an S-type \dtd{} for a certain branch of mathematics, this would still give problems. Suppose an S-type \dtd{} contains an element for a ``derivative'' of a function. Since the S-type \dtd{} will not contain any presentational attributes, a decision will have to be made to represent the derivative of $f(x)$ on paper as $f'(x)$ or $\frac{\text{\fontfamily{cmr}\selectfont d}f(x)}{\text{\fontfamily{cmr}\selectfont d}x}$. There are, however, times (such as in this article) when both representations are required for the same semantic object, and when the author will need other notation in addition to that defined by the S-type \dtd{}. A likely reason for the belief that an S-type \dtd{} is possible is that many people in the worlds of document processing or computer science are convinced that each symbol has at most a few possible uses and that mathematical notation is as straightforward to analyse as, for example, a piece of code for a somewhat complicated programming language.
The reality is that mathematical notation is more akin to natural language: it is ambiguous and incomplete, as we pointed out earlier.

\subsection{Some problems with existing languages}
To show that it is not straightforward to capture mathematical syntax in a \dtd{}, let alone its semantics, consider the example of a limit
\[ \lim_{x\to a}f(x) \]
The syntactic structure of a limit is:
\begin{itemize}
\item The limit operator
\item The part containing the variable and its limit value
\item The expression of which the limit is to be taken
\end{itemize}
The first part could:
\begin{itemize}
\item always be ``lim'', in which case it is just a part of the presentation of the formula and it should be left out.
\item be one of a finite list of alternatives, indicating the type of limit ($\liminf$, $\sup$, $\max$, etc.). In this case it should be an attribute.
\item be any expression.
\item be any text.
\end{itemize}
We think the second possibility comes closest to the syntax of the limit construct. The second and third parts can be any mathematical expression. Now let us look at the way this formula is coded with the \dtd{}s from \ISO \acro{TR}~9573, \acro{AAP} math and Euromath respectively. Using the mathematics \dtd{} from \ISO \acro{TR}~9573 there are three possibilities:
\begin{itemize}
\item \verb|lim x → a f(x)|
\item \verb|limx ↓| \verb|a f(x)|
\item \verb|x →|\\ \verb|af(x)|
\end{itemize}
whereas with the Euromath \dtd{} we would have:
\begin{verbatim}
x\→ a f(x)
\end{verbatim}
We see that the \acro{AAP} and Euromath expressions are closest to the limit syntax. The best solution from \ISO \acro{TR}~9573 involves a more general ``plex'' construct, which can be used for integrals, sums, products, set unions, limits and others. When the plex construct contains the actual lower and upper bounds it may even give semantic information. Some mathematicians, however, are not satisfied with this solution \cite{seventeen}.
The plex operation is probably a notation for an iterated application of a binary operation (e.g.\ sums and products), while limits are of a different nature. In many cases only the `from' part will be used, and there the whole range of the bound variable will be indicated, as an interval or a more general set. How does one go about extracting the bound variable? This supports our conjecture from the previous section, namely that it is very hard to capture the semantics for all mathematics. It also suggests that some redundancy is required to select whichever notation is most appropriate in a certain context.

\section{Re-using mathematical formulas}
There are two important uses for a generically coded mathematical formula. The first one is in a mathematical manipulation -- or computer algebra -- system (\acro{MMS}), such as Mathematica \cite{eighteen} or Maple \cite{nineteen}. Computer programs for the numerical evaluation of formulas, for example written in \textsc{Fortran} or Modula-2, can also be regarded as mathematical manipulation programs. The second form of re-use is in a mathematical typesetting system, for formatting the formula on paper or on screen; examples of this are \TeX\ \cite{twenty} and eqn/troff \cite{twentyone}, \cite{twentytwo}. For computer algebra systems the notation for the formula should be such that a particular type of manipulation on a particular system is possible, given a `background' of concepts and assumptions that enables the system to interpret the input as a mathematical statement. The coding of a formula that is adequate for document formatting, for example the \TeX\ notation \verb|f^{(2)}(x)|, is very unlikely to contain much of the information required for a manipulation system to make use of it. However, for a limited field of discourse it is feasible to use the same coding for both types of system \cite{sixteen}.
Some examples: the square of $\sin x$ is typographically represented as $\sin^2x$, but a system like Mathematica or Maple would probably prefer something like $(\sin x)^2$ as input. Typesetting the inverse of $\sin x$ as $\sin^{-1}x$, however, could be confusing: does it mean $1/(\sin x)$ or $\arcsin x$? An \acro{MMS} would probably require the second derivative of a function~$f$ with respect to its argument~$x$ to be coded as $(D,x)((D,x)f(x))$ but on paper this would be represented as $f''(x)$, or $f^{(2)}(x)$, or $\frac{\displaystyle\text{\fontfamily{cmr}\selectfont d}^2f(x)}{\displaystyle\text{\fontfamily{cmr}\selectfont d}x^2}$. On the output side of an \acro{MMS} there are other problems, since some of the coding necessary for typographically acceptable output cannot be automatically derived by the system from the coding used by the \acro{MMS}. The Euromath view \cite{seventeen} is that a common interface should be designed together with the manufacturer of an \acro{MMS}. Perhaps an \acro{MMS}-type \dtd{} will be required.

\section{Related problems}
Another problem is, of course, that mathematics is by its nature extensible, so there will always be new types of manipulations to be done. Notations are changed or new notations are invented almost every day, figuratively speaking. Normally these new subjects will use existing typographic representations, but the computer algebra system will not know what formatting to use! Occasionally a new typographic convention will be needed. And although there is agreement on the notation for most mathematical concepts, authors of books on mathematics tend to introduce alternative notations, for instance when they feel this is necessary for didactic reasons. Mathematical notation is not standardized, and it is open -- anyone can use it, and add to it, in any way they wish. If we consider a given \dtd{} at any time, we have to ask ourselves: can an author add elements when the need for this arises?
Theoretically the answer is `Yes, he can' \cite[p.~71]{twentythree}, although it is not straightforward to include the new elements in the content models of existing elements. Are such modifications by the author desirable? A \dtd{} which is locally modified by an author will quickly give rise to the situation described in the introduction to this paper, and this should therefore probably be discouraged. Others, however, have also noticed a need for private elements, as described in \acro{EPSIG} News 3, no.~4: one of the challenging aspects of using \SGML being encountered by the Text Encoding Initiative is that the guidelines need to be extensible by researchers. They need to be able to extend the \dtd{} in a disciplined way. This problem, however, may not be a serious one. The collection of style elements is almost a closed set, since the number of fonts, symbols and ways to combine them is limited. In fact, most notation is not syntactically new, since the limited number of constructs works well as a notation. The multitude of notations is obtained by combinations of fonts, symbols and positions (left or right subscript, left or right superscript, atop, below, \dots), and by giving one notation more than one meaning. This again seems to support our view that only a P-type \dtd{} can be constructed for \emph{all} of mathematics. An \SGML \dtd{}, of whatever type, also does not solve the problems of new atomic or composite symbols, which occur frequently in mathematics. As with new elements, an author can add entities for these new symbols. There is no method to add the name of a new symbol, whether atomic or composite, to an existing set of entity definitions for symbols, other than to contact the owner of the set and wait for an update. Although there is now a standard method to describe that symbol's glyph (shape) \cite{twentyfive}, it is not practical for an author to include it.
A compromise solution seems to be to extend an existing set, such as the one from \ISO \cite{twentysix}, as much as possible, and try to standardize its use.

\section{Conclusions}
We have argued as follows:
\begin{itemize}
\item That a logical \dtd{}, in the sense of describing the structure of the mathematical meaning, is as impossible for maths as it is for natural language, and also that it is useless for formatting, since the same mathematical structure can be visually represented in many different ways. The correct one for any given occurrence of that structure cannot be determined automatically, but must be specified by the author.
\item That what needs to be encoded for formatting purposes is information that enables a particular set of detailed rules for maths typesetting to be applied. This could be described as a `generic-visual encoding' or `encoding the logic of the visual structure'. To establish exactly what these codes should be will require an expert analysis (probably involving expertise from mathematicians, particularly editors, and from typographers aware of the traditions of mathematical typesetting).
\item That this is different from what needs to be encoded for use in mathematical manipulation software. Since neither of these encodings can be deduced automatically from the other, a useful database will need to store both. Perhaps a separate \dtd{} will be required to enable this communication.
\end{itemize}
Possible solutions are:
\begin{itemize}
\item A \dtd{} based on a hybrid of visual structure and logical structure.
\item Two \dtd{}s, one for visual structure and one for logical structure, that are linked in some fashion.
\item Two concurrent \dtd{}s, one for visual structure and one for logical structure.
\end{itemize}
The simplest solution is probably to have a basic visual structure which is described as an \SGML entity, supplemented with a (redundant) logical structure, described by a second \SGML entity.
This solution avoids any special \SGML features and gives the user all flexibility for mixing and matching as required. We believe that similar reasoning can be applied to tables and chemical formulas, where the problem of separating form from content is just as complex, or even more so.
\begin{thebibliography}{10}
\bibitem{one} Charles Goldfarb. \newblock {\em The {\SGML} Handbook}. \newblock Oxford University Press, Oxford, 1990.
\bibitem{two} Standard for electronic manuscript preparation and markup version 2.0. \newblock Technical Report Z39.59-1988, {\acro{ANSI}/\acro{NISO}}, 1987.
\bibitem{three} Techniques for using {\SGML}. \newblock Technical Report 9573, {\ISO}, 1988.
\bibitem{four} American~Chemical Society. \newblock {\acro{ACS}} journal \dtd{}.
\bibitem{five} Bj{\"{o}}rn von Sydow. \newblock On the \texttt{math} type in {E}uromath.
\bibitem{six} N.~A. F.~M. Poppelier. \newblock {\SGML} and {\TeX} in scientific publishing. \newblock {\em \TUB}, 12:105--109, 1991.
\bibitem{seven} E.~van Herwijnen, N.~A. F.~M. Poppelier, and J.C. Sens. \newblock Using the electronic manuscript standard for document conversion. \newblock {\em EPSIG News}, 1(14), 1992.
\bibitem{eight} E.~van Herwijnen. \newblock The use of text interchange standards for submitting physics articles to journals. \newblock {\em Comp. Phys. Comm.}, 57:244--250, 1989.
\bibitem{nine} E.~van Herwijnen and J.C. Sens. \newblock Streamlining publishing procedures. \newblock {\em Europhysics News}, pages 171--174, November 1989.
\bibitem{ten} Standard generalized markup language ({\SGML}). \newblock Technical Report 8879, {\ISO}, 1986.
\bibitem{eleven} M.~Abramowitz and I.~Stegun. \newblock {\em Handbook of mathematical functions}. \newblock Dover, New York, 1972.
\bibitem{twelve} I.S. Gradshteyn and I.M. Ryzhik. \newblock {\em Table of integrals, series, and products}. \newblock Academic Press, New York, 1980.
\bibitem{thirteen} S.A. Mamrak, C.S. O'Connell, and J.~Barnes.
\newblock Technical documentation for the integrated chameleon architecture.
\newblock Technical report, March 1992.

\bibitem{fourteen}
Neil~M. Soiffer.
\newblock {\em The design of a user interface for computer algebra systems}.
\newblock PhD thesis, Computer Science Division ({\acro{EECS}}), University of California, Berkeley, 1991.
\newblock Report {\acro{UCB}/\acro{CSD}} 91/626.

\bibitem{fifteen}
M.~Nakahara.
\newblock {\em Geometry, Topology and Physics}.
\newblock Adam Hilger, Bristol, 1990.

\bibitem{sixteen}
Dennis~S. Arnon and Sandra~A. Mamrak.
\newblock On the logical structure of mathematical notation.
\newblock {\em \TUB}, 12:479--484, 1991.

\bibitem{seventeen}
Bj{\"{o}}rn von Sydow.
\newblock private communication to EvH.

\bibitem{eighteen}
Stephen Wolfram.
\newblock {\em Mathematica: a system for doing mathematics by computer}.
\newblock Addison-Wesley, Reading, 1991.

\bibitem{nineteen}
Bruce~W. Char, Keith~O. Geddes, Gaston~H. Gonnet, and Stephen~M. Watt.
\newblock {\em Maple User's Guide}.
\newblock \acro{WATCOM} Publications Ltd., Waterloo, 1985.

\bibitem{twenty}
Donald~E. Knuth.
\newblock {\em The {\TeX}book}.
\newblock Addison-Wesley, Reading, 1984.

\bibitem{twentyone}
Joseph~F. Ossanna.
\newblock Nroff/troff.
\newblock In {\em {UNIX} Programmer's Manual (2b)}. Bell Laboratories, 1978.

\bibitem{twentytwo}
Brian~W. Kernighan and Linda Cherry.
\newblock Typesetting mathematics.
\newblock In {\em {UNIX} Programmer's Manual (2b)}. Bell Laboratories, 1978.

\bibitem{twentythree}
E.~van Herwijnen.
\newblock {\em Practical {\SGML}}.
\newblock Kluwer Academic Publishers, Dordrecht, 1990.

\bibitem{twentyfive}
Font information interchange.
\newblock Technical Report 9541, \ISO, 1991.

\bibitem{twentysix}
Information processing -- {\SGML} support facilities -- techniques for using {\SGML} -- part 13.
\newblock Technical Report 9573, \ISO, 1991.
\newblock Proposed Draft Technical Report.

\end{thebibliography}

%\begin{tabular}{ll}
%N. A. F. M. Poppelier& E. van Herwijnen, \\
%Elsevier Science Publishers,&CERN,\\
%P.O. Box 2400,&1211-CH,\\
%1000 CK Amsterdam,&Geneva 23,\\
%the Netherlands&Switzerland\\
%\texttt{n.poppelier@elsevier.nl}&%???
%\end{tabular}
%\noindent\qquad and\\
%\begin{tabular}{l}
%C.A. Rowley\\\texttt{C.A.Rowley@open.ac.uk}
%\end{tabular}
\end{Article}
\endinput
\section{References}
\end{Article}
\endinput

A Existing mathematical notations

A.1 Comparison of existing \dtd{}s

In making comparisons between existing \dtd{}s we shall refer often to what is probably the best-known system for coding mathematical notation in documents. This is the version of \TeX{} coding used in \LaTeX{} [27] (which differs little from Knuth's Plain \TeX{} notation described in [20]), now a de facto standard in many areas. It is a mixture of visual and logical tagging, with a bias towards the visual which probably results from reasoning similar to that in this paper.

The following document type definitions for mathematical formulas were investigated for this paper: AAP [28], ISO [29] and Euromath [5]. We will try to give a few general characteristics of each of them:

AAP: This \dtd{} shows a hybrid of visual and logical tagging. It is quite similar to the mathematical notation of \TeX{} [20]. Integrals, sums and similar constructions have sub-elements tagged explicitly as lower limit, upper limit and integrand (summand, \dots). The same goes for fractions, roots, and limit-like constructions. All rectangular schemes of mathematical expressions, e.g.\ matrices and determinants, are tagged as `array' in this \dtd{}. The delimiters are not part of the construction, although matrices are usually indicated by ( ) or [ ], and determinants by $|\ |$. Alignment of rows, columns and cells is indicated by attributes, even though they have nothing to do with function, but are in fact processing information. This idea also appears in the array notation of \LaTeX~[27].
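
As a minimal illustration of the last two points (a sketch of the \LaTeX{} side only, not taken from any of the \dtd{}s under discussion), the fragment below typesets a rectangular scheme with the \texttt{array} environment. The column specification \texttt{lr} carries the alignment information, and the parentheses are supplied separately by \verb+\left(+ and \verb+\right)+. The entries are arbitrary placeholders chosen for this example.

\begin{verbatim}
\[
  \left(
  \begin{array}{lr}  % l = left-aligned column, r = right-aligned column
    a     & b + c \\
    d - e & f
  \end{array}
  \right)
\]
\end{verbatim}

Replacing the two delimiters by \verb+\left|+ and \verb+\right|+ gives the usual notation for a determinant without changing the array itself, which shows that the delimiters are not part of the construction.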