%
% Author: Horst Szillat
% eMail: szillat@berlin.snafu.de
%
% This article is meant to be published in the UK\TeX's journal
% ``Baskervil(l?)(e?)''.
% As I am a strict \LaTeX user and, as far as I know, a plain-\TeX-like
% style is used there, I tried a compromise.
% So the editors are asked to adapt this text (and to insert minor language
% corrections). The only problem should be the figure (good luck!).
%
% To: S. Rahtz (for publication)
% CC: C. Rowly, J. Schrod, J. Lammarsch, L. Burnard (for private/internal use)
%
\def\tabbox#1#2{%
  \vbox{%
    \offinterlineskip%
    \hrule%
    \halign{%
      \vrule\quad\hfil##\hfil\quad\vphantom{()}\vrule\cr%
      #1\cr%
      #2\cr%
    }%
    \hrule%
  }%
}
\title{SGML and \LaTeX}
\author[Horst Szillat]{Horst Szillat\\\texttt{szillat@berlin.snafu.de}}
\begin{Article}
SGML, the {\it Standard Generalized Markup Language}, is a formal language
for describing structured text documents. It is introduced here by
comparison with \TeX{} and \LaTeX.

It is instructive to look at how Donald~E. Knuth himself introduces \TeX{}
in the \TeX book. One begins by simply typing in the text, and \TeX{} mostly
does what one expects. Quite a number of more or less complex rules have
been implemented to achieve this. One example is the {\it space factor code}
({\verb|\sfcode|}), which allows \TeX{} to recognize most ends of sentences.
Moreover, \TeX{} is designed so that almost any kind of printing layout can
be programmed: one can write a macro which influences the layout at any
place. So \TeX{} is a layout-oriented system which formats texts for
printing and can do a bit more.

Although \LaTeX{} is simply \TeX{} as well and shares all these
characteristics, it introduces a new idea of how the text input is
represented. The basic idea is that a text is given in the form of
{\it embedded environments}. The layout of a portion of text depends on the
environment in which it is embedded.
Moreover, the layout of a whole environment may depend on the other
environments in which it is embedded. The user can define new environments
({\verb|\newenvironment|}) which realize a user-defined layout. But the main
point is that the author enters his text on a less technical and more
abstract level. In this way \LaTeX{} enforces the idea of separating the
text structure from the printing layout. Changing the layout in \LaTeX{}
only means replacing the existing style files. One could do the same
directly in plain \TeX{}, of course. One can do structured programming in
assembler, too, but assembler does not enforce it.

Put simply, SGML is \LaTeX{} without \TeX{}, written in a slightly different
manner. That is, SGML represents a text in its hierarchical structure
without any notion of layout. If the layout has been given up, there has to
be some gain on the side of text structuring, and indeed there is.

What \LaTeX{} calls environments are called {\it elements} in SGML. Within a
formal model one can define how the elements may be embedded in each other
and where text is allowed. The model fixes the number and the order of
embedded elements and text. Such a definition of a text structure is called
a {\it document type definition} ({\it DTD}). The best-known example of an
SGML document type definition is HTML ({\it Hypertext Markup Language}),
which is used for the World Wide Web. While processing a document, an SGML
parser is able to validate its structure against the given document type
definition. A simple example should illustrate this:
{\small
\begin{verbatim}
<!ELEMENT section    - - (paragraph?, subsection+)>
<!ELEMENT subsection - - (paragraph, paragraph+)>
<!ELEMENT paragraph  - - (#PCDATA)>
\end{verbatim}
}
These lines are to be read as follows: An environment/element called
{\tt section} consists of at most one {\tt paragraph} followed by at least
one {\tt subsection}, in this order. A {\tt subsection} consists of exactly
one {\tt paragraph} plus at least one more {\tt paragraph}, i.e.\ at least
two {\tt paragraph}s.
Finally, a {\tt paragraph} consists of plain characters. Here, unlike in
\LaTeX{}, it is no longer possible to put the first subsection before the
first section. One could define the \LaTeX{} environments with such control
structures, too. But again, \LaTeX{} is not designed for this purpose and
does not enforce it, while such validation is the very nature of SGML.

Another structural advantage over \LaTeX{} is the consistent distinction
between {\it parameter} and {\it data}. The lines
\begin{verbatim}
\label{Hallo!}
\section{Errors}
\unknown{whatever}
\end{verbatim}
\noindent show that in \LaTeX{} one can never be sure what is human-readable
text (data) and what is internal technical information (parameter). SGML, on
the other hand, is strict about this distinction. As long as the SGML
structures are not deliberately misused, it is possible to make this
distinction without even understanding the content. This is an important
condition for any computer-based data processing. An example will be given
later.

But even in the days of total computerization the final goal of representing
a text is to print it on paper. There are two projects/tools specially
designed for the printing of SGML documents: FOSI (Formatted Output
Specification Instance) and DSSSL (Document Style Semantics and
Specification Language). But why not use \LaTeX? \LaTeX{} has some
characteristics which make it the first choice.
\begin{itemize}
\item The structures of SGML and \LaTeX{} are very close, so documents are
  easy to convert.
\item \LaTeX{} is a programming language and can therefore realize a wide
  range of unforeseen layouts.
\item \LaTeX{} has been used for many years by a large number of people, so
  there is widespread experience with it.
\end{itemize}
In principle, the processing might look as shown in
Figure~\ref{szillat1}.
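
The part played by the style files in this scheme can be illustrated by a
minimal sketch. The environment name {\tt remark} and the layout chosen for
it are assumptions made for this illustration only; they are not taken from
any existing style file.
{\small
\begin{verbatim}
% style file (layout): hypothetical definition of a "remark"
\newenvironment{remark}%
  {\begin{quote}\small\itshape}%
  {\end{quote}}

% document (structure): states only that this text is a remark
\begin{remark}
Text of the remark.
\end{remark}
\end{verbatim}
}
Replacing the definition changes the layout of every {\tt remark} without
touching the document text.
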
\begin{figure*}
%\centering
%{\hfill%
% \halign{%
% \hfil#\hfil&\hfil#\hfil&\hfil#\hfil&\hfil#\hfil&\hfil#\hfil&\hfil#\hfil&\hfil#\hfil\cr%
% &&\tabbox{DTD}{(structure)}&&\tabbox{styles}{(layout)}&&\cr
% &&$\downarrow$&&$\downarrow$&&\cr
% \tabbox{concept of a}{document}&$\Rightarrow$&\tabbox{SGML-}{document}&$\Rightarrow$&\tabbox{\LaTeX-}{document}&$\Rightarrow$&\tabbox{printed}{document}\cr
% }%
%\hfill}
\begin{center}
\input{szpic}
\end{center}
\caption{Processing of an SGML document}\label{szillat1}
\end{figure*}

Unfortunately it is not sufficient to convert the elements into environments
and to write the necessary style files. As already mentioned, SGML and
\LaTeX{} have different ideas of what is data and what is a parameter. In
particular, it is necessary to transform some SGML data into \LaTeX{}
parameters so that \LaTeX{} can handle them more flexibly. A typical example
is the following:
\begin{verbatim}
<section id="main-section">
<title>section title</title>
section content
</section>
\end{verbatim}
What one would like to get is something like this:
\begin{verbatim}
\section{section title}\label{main-section}
section content
\end{verbatim}
One should note that {\tt main-section} is a parameter before as well as
after the conversion, while {\tt section title} changes from being data to
being a parameter.

The easiest way to solve this problem is to introduce additional braces
within the \LaTeX{} environment. Depending on the number of parameters
declared in the definition of the environment, the data is treated as a
parameter, or the last parameter is treated as data:
\begin{verbatim}
<name attribute="value">data</name>
\end{verbatim}
converts to
\begin{verbatim}
\Bsgml{name}{value}{%
data%
}\Esgml{name}
\end{verbatim}
With some (still to be defined) command \verb|\NewSgmlEnv{name}[n]{...}|
one gets:
\begin{itemize}
\item both {\tt value} and {\tt data} as parameters for $n=2$;
\item {\tt value} as a parameter and {\tt data} as data within the
  environment for $n=1$;
\item both {\tt value} and {\tt data} as data within the environment for
  $n=0$.
\end{itemize}
Note that this conversion can be done without any conversion parameters. All
programming, e.g.\ any replacement, is done in \LaTeX. This is a major
difference from the widely used SGML-to-whatever converter {\tt format},
which works with replacement tables.

But the real reason why I started to develop my own SGML to \LaTeX{}
converter is that I felt the need to manipulate the data within the
conversion process. The main questions are what information about the words
used is needed for typesetting and where this information comes from. Again,
this seems to be a typically non-English problem. In German there are two
related problems: hyphenation and (wrong) ligatures. Basically, the German
hyphenation rules are easy to adapt for pattern matching, and ligatures can
be applied. (Hyphenation is allowed before the last consonant of a group of
consonants.
There is no hyphenation within a group of consonants at the very beginning
or end of a word. Certain combinations of consonants count as one single
consonant. Easy, isn't it?) At present there is a problem with the umlauts,
but this problem should disappear with the {\tt dc} fonts.

The real problem arises with compound words, i.e.\ words which are composed
of several words but look like one. These words have to be hyphenated
between the elements of the compound. This defeats any pattern matching.
Moreover, there should not be any ligature in these places. The reason is
that one does not want to have less space ``between words''. An example of a
rather unsuspicious word is {\tt aufflammen}. One would guess the
hyphenation \verb|auff\-lam\-men|, which is wrong, of course. The English
translation gives a hint: {\it flame up}. In terms of {\tt german.sty} one
should write \verb.auf"|flam\-men., where \verb."|. means that hyphenation
is allowed but a ligature is not. The printed result is ``{%
%\fontfamily{cmr}\selectfont
au\mbox{f}\mbox{f}\mbox{l}ammen}'' instead of ``{%
%\fontfamily{cmr}\selectfont
aufflammen}''.

Unfortunately \TeX{} is unable to store this information either in the
hyphenation table or in the document preamble by means of
\verb|\hyphenation|. Maybe a successor of \TeX{} will be able to do so. So
far an author writing in \LaTeX{} has to put this information directly into
the document, provided he cares at all\ldots

When a document is converted from SGML to \LaTeX{}, the converter is the
right place to insert the additional hyphenation and ligature information.
The converter has to use two dictionaries: a standard dictionary and a
special dictionary. It is not unusual that special subjects need special
terms and consequently special dictionaries. But in German the problem is
that one can create new compound words ad hoc. These new compounds may be
specific to a particular document.
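
What the converter could write for such a compound can be sketched with the
word {\tt aufflammen} from above. The sketch assumes that the document is
typeset with {\tt german.sty}; the file skeleton around the two words is an
illustration only and not the output of any existing converter.
{\small
\begin{verbatim}
\documentclass{article}
\usepackage{german}  % assumed: provides the "| shorthand
\begin{document}
% as typed by the author: ligature across the compound boundary
aufflammen

% as the converter could emit it after the dictionary lookup
auf"|flam\-men
\end{document}
\end{verbatim}
}
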
So it would be a nice idea to ship this special dictionary as a structural
part of the document! In this way the author does not have to care about
every single hyphenation and ligature exception and, in addition, gets a
spell checker.

Unfortunately there is an even worse case which needs special treatment. It
is the word {\tt Baumast}, which can be \verb|Bau\-mast|
({\it mast used in building}) or \verb|Baum\-ast| ({\it bough of a tree}),
both made of wood, of course. This is one of the really rare cases in which
a word must be tagged with additional information at the place where it
occurs within the document. This information should state which word is
meant. One could do that in the form of explicit hyphenation information. In
SGML it could look like \verb|Baumast| (This example is simplified. It would
be more correct to use an SDATA entity so that the \LaTeX{} specifics are
hidden.)

Note that the hyphenation information for the words {\tt aufflammen} and
{\tt Baumast} are totally different things. The first is part of the layout
information (how to print?), while the second is a structural part of the
document (which word?).

In summary, one can state that SGML and \LaTeX{} are a good pair. Using the
specifics of both systems, one can do a lot of things correctly in an easier
way.

\subsection{Further reading}
\begin{itemize}
\item H. Szillat: {\it SGML -- Eine praktische Einf\"uhrung},
  ISBN 3-929821-75-3, Int. Thomson Publ.
\item ftp-server: {\tt ftp.ifi.uio.no}
\item news groups: comp.text.sgml, sgml-l
\end{itemize}
\end{Article}