\MakeShortVerb{\|}
% this needs the typehtml.sty, either from ctan or I'll send it you.
% It is exactly the stuff from the dtx file on ctan, except the
% examples cut down a bit to fit two column linewidth.
\makeatletter
\def\multispan{\omit\@multispan}
\def\@multispan#1{%
  \@multicnt#1\relax
  \loop\ifnum\@multicnt>\@ne \sp@n\repeat}
\def\sp@n{\span\omit\advance\@multicnt\m@ne}
\makeatother

\title{\textsf{typehtml}: A \LaTeX\ package to typeset HTML}
\author{David Carlisle}

\begin{Article}

\section{Introduction}

This package enables \LaTeX\ to process HTML markup. The \verb|\dohtml|
command allows fragments of HTML to be placed within a \LaTeX\ document,
\begin{verbatim}
\dohtml
<html>
 html markup ...
</html>
\end{verbatim}
The surrounding \verb|<html>|\ldots\verb|</html>| pair is \emph{required}.
(It is anyway a good idea to have these tags in an HTML document.)
The \verb|\htmlinput| command is similar, but takes a file name as
argument. In that case the file need not necessarily start and end with
\verb|<html>|\ldots\verb|</html>|.

This package covers most of the HTML2 DTD, together with the mathematics
extensions from HTML3.\footnote
{The draft specification of HTML3 has expired, and the W3C group are
currently devising a new proposed extension of HTML, so the mathematics
typesetting part of this package may need substantial revision once a
final specification of the HTML mathematics markup is agreed.}
The rest of HTML3 may be added at a later date. Its current incarnation
has not been extensively tested, having been thrown together during a
couple of weeks in response to a question on \texttt{comp.text.tex} about
the availability of such a package.

The package falls into three sections. Firstly the options section allows
a certain amount of customisation, and enabling of extensions. Not all
these options are fully operational at present. Secondly comes a section
that implements a kind of SGML parser. This is not a real conforming SGML
parser (not even a close approximation to such a thing!).
The assumption (sadly false in the anarchic WWW) is that any document
will have been validated by a conforming SGML parser before it ever gets
to the stage of being printed by this package. Finally there is a set of
declarations that essentially map the declarations of the HTML DTD into
\LaTeX\ constructs.

\section{Options}

\subsection{HTML Level}

The options \texttt{html2} (the default) and \texttt{html3} control the
HTML variant supported. Using the \texttt{html3} option will use up a lot
more memory to support the extra features and the math entity (symbol)
names. Against my better judgement there is also a \texttt{netscape}
option to allow some of the non-HTML tags accepted by that browser.

\subsection{Headings}

The six options \texttt{chapter}, \texttt{chapter*}, \texttt{section},
\texttt{section*}, \texttt{subsection} and \texttt{subsection*} determine
to which \LaTeX\ sectional command the HTML element \texttt{h1} is mapped.
(\texttt{h2}--\texttt{h6} will automatically follow suit.) The default is
\texttt{section*}.

\subsection{Double Quote Handling}

Most HTML pages use |"| as a quotation mark in text, for example:
\begin{verbatim}
quoted "like this" example
\end{verbatim}
This slot in the ISO Latin-1 encoding is for `straight' double quotes.
Unfortunately the standard \TeX\ fonts in the OT1 encoding do not have
such a character, only left and right quotes, ``like this''.

By default this package uses the \texttt{straightquotedbl} option, which
uses the \LaTeX\ command |\textquotedbl| to render |"|. If T1 encoded
fonts are in use (|\usepackage[T1]{fontenc}|) then the straight double
quote from the current font is used. With OT1 fonts, the double quote is
taken from the |\ttfamily| font, which looks
\texttt{\char'042}like this\texttt{\char'042}, which is fairly horrible,
but better than the alternative, which is ''like this''.

The \texttt{smartquotedbl} option redefines |"| so that it produces
alternately an opening double quote `` and then a closing ''.
As there is a chance of it becoming confused, it is reset to `` at the
beginning of every paragraph, whatever the current mode.

Neither of these options affects the use of |"| as part of the SGML
syntax to surround attribute values.

In principle the package ought to have similar options dealing with the
single quote, but there the situation is more complicated due to its dual
use as an apostrophe, so currently the package takes no special
precautions: all single quotes are treated as a closing quote/apostrophe.

Also the conventions of `open' and `close' quotes only really apply to
English. If someone wants to suggest what the package should do with |"|
in other languages\ldots

\subsection{Images}

The default option is \texttt{imgalt}. This means that all inline images
(the HTML \texttt{img} element) are replaced by the text specified by the
\texttt{alt} attribute, or \textsf{[image]} if no such attribute is
specified.

The \texttt{imggif} option\footnote{one day\dots} uses the
\verb|\includegraphics| command so that inline images appear as such in
the printed version.

The \texttt{imgps} option\footnotemark[9] is similar to \texttt{imggif}
but first replaces the extension \texttt{.gif} at the end of the source
file name by \texttt{.ps}. This will enable drivers that cannot include
GIF files to be used, as long as the user keeps the image in both
PostScript and GIF formats.

\subsection{Hyperref}

Several options control how the HTML anchor tag is treated.

The default \texttt{nohyperref} option ignores \texttt{name} anchors, and
typesets the body of \texttt{src} anchors using |\emph|.

The \texttt{ftnhyperref} option is similar to \texttt{nohyperref}, but
adds a footnote showing the destination address of each link, as
specified by the \texttt{src} attribute.

If the \texttt{hyperref} option is specified, the hypertext markup in the
HTML file will be replicated using the hypertext specials of the
Hyper\TeX\ group.
If in addition the \textsf{hyperref} package is loaded, the extra
features of that package may be used, for instance producing `native PDF'
specials for direct use by Adobe Distiller rather than producing the
specials of the Hyper\TeX\ conventions.

The \texttt{dviwindo} option converts the hypertext information in the
HTML into the |\special| conventions of Y\&Y's \emph{dviwindo} previewer
for Microsoft Windows.

\subsection{Big Integrals}

\LaTeX\ does not treat integral signs as variable sized symbols, in the
way that it treats delimiters such as brackets. In common with summation
signs and a few other operators, they come in just two fixed sizes, a
small version for inline mathematics, and a large version used in
displays. In fact by default \LaTeX\ always uses the same two sizes (from
the 10\,pt math extension font) even if the document class has been
specified with a size option such as \texttt{12pt}, or if a size command
such as |\large| has been used. The standard \textsf{exscale} package
loads the math extension font at larger sizes if the current font size is
larger than 10\,pt.

The HTML3 math description explicitly states that integral signs should
be treated like delimiters and stretch if applied to a large math
expression. By default this package ignores this advice and treats
integral signs in the standard way; however, the \texttt{bigint} option
does cause integral signs to `stretch' (or at least be taken from a
suitably large font). The standard Computer Modern fonts use a very
`sloped' integral, which means that they are not really suitable for
being stretched. Some other math fonts, for instance Lucida, have more
vertical integral signs, and one could imagine in those cases making an
integral sign with a `repeatable' vertical middle section so that it
could grow to an arbitrary size, in the way that brackets grow.
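
As a rough illustration of how the options described in this section
might be combined, a preamble along the following lines could be used.
This is only a sketch: the option names are the ones listed above, but
the document class and the file name \texttt{page.html} are hypothetical,
and this particular combination has not been tested.
\begin{verbatim}
\documentclass{article}
% T1 fonts supply a real straight double quote
\usepackage[T1]{fontenc}
% HTML3 mathematics, h1 mapped to \section*,
% stretchy integrals, link targets as footnotes
\usepackage[html3,section*,bigint,ftnhyperref]
           {typehtml}
\begin{document}
% page.html is a hypothetical file name
\htmlinput{page.html}
\end{document}
\end{verbatim}
Options left out of the list keep the defaults described above, so here
images would still be replaced by their \texttt{alt} text.
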
\section{Latin-1 characters}

The SGML character entities for the ISO Latin-1 characters such as
\texttt{\&eacute;} are recognised by this package, although as usual some
of them, such as the Icelandic thorn character \texttt{\&thorn;}
(\verb|\th|), produce an error if the old `OT1' encoded fonts are being
used. These characters will print correctly if `T1' encoded fonts are
used, for example by declaring \verb|\usepackage[T1]{fontenc}|.

HTML also allows direct 8-bit input of characters according to the
ISO Latin-1 encoding; to make use of this you need to enable Latin-1
input for \LaTeX\ with a declaration such as
\verb|\usepackage[latin1]{inputenc}|.

\section{Mathematics}

The HTML3 \texttt{math} element is fairly well supported, including the
\texttt{box} and \texttt{class} attributes. (Currently only the
\texttt{chem} value for \texttt{class} is supported, and as far as I can
see the \texttt{box} attribute is only in the report, not in the DTD.)

The super and subscripts are supported, including the shortref maps;
however, only the default right alignment is implemented so far. The
convention described in the draft report for using white space to
distinguish superscript positioning is fairly \emph{horrible}!

The documentation that I could find on HTML3 did not include a full list
of the entity names to be used for the symbols. This package currently
\emph{only} defines the following entities, which should be enough for
testing purposes at least.
\begin{itemize}
\item |gt| ($>$) |lt| ($<$) (Already in the HTML2 DTD)
\item Some Greek letters.
|alpha| ($\alpha$) |beta| ($\beta$) |gamma| ($\gamma$) |Gamma| ($\Gamma$)
\item Integral and Sum. $\int$ grows large if the \texttt{bigint} package option is given.
|int| ($\int$) |sum| ($\sum$)
\item Braces (The delimiters (\,)[\,] also stretch as expected in the
\texttt{box} element)
|lbrace| ($\lbrace$) |rbrace| ($\rbrace$)
\item A random collection of mathematical symbols:
|times| ($\times$) |cup| ($\cup$) |cap| ($\cap$) |vee| ($\vee$)
|wedge| ($\wedge$) |infty| ($\infty$) |oplus| ($\oplus$)
|ominus| ($\ominus$) |otimes| ($\otimes$)
\item A minimal set of trig functions:
|sin| ($\sin$) |cos| ($\cos$) |tan| ($\tan$)
\item Also, in the special context as attributes to \texttt{above} and
\texttt{below} elements, the entities:
|overbrace| ($\overbrace{\quad}$)
|underbrace| (\,\smash{$\underbrace{\quad}$}\,)
and any (\TeX) math accent name.
\end{itemize}

\section{SGML Minimisation features}

SGML (and hence HTML) supports various minimisation features that aim to
make it easier to enter the markup `by hand'. These features make the
kind of `casual' attempt at parsing SGML as implemented in this package
somewhat error prone. Two particular features are enabled in HTML.

The so called \texttt{shorttag} feature means that the name of a tag may
be omitted if it may be inferred from the context. Typically in HTML this
is used in examples like
\begin{verbatim}
<title>A Document Title</>
\end{verbatim}
The end tag is shortened to |</>| and the system infers that
\texttt{title} is the element to be closed.

The second form of minimisation enabled in HTML is the \texttt{omittag}
feature. Here a tag may be omitted altogether in certain circumstances.
A typical example is the HTML list, where each list item is started with
|<li>| but the closing |</li>| at the end of the item may be omitted and
inferred by the following |<li>| or |</ol>| tag.

This package is reasonably robust with respect to omitted tags. However
it only makes a half-hearted attempt at supporting the \texttt{shorttag}
feature. The \texttt{title} example above would work, but nested
elements with multiple levels of minimised end tags will probably break
this package.
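
One way to stay clear of the fragile cases is to write every end tag in
full. The following fragment is only a sketch with made-up content, and
it assumes that the \texttt{em} element is handled as part of the HTML2
support mentioned in the introduction; it relies on neither
\texttt{shorttag} nor \texttt{omittag} minimisation.
\begin{verbatim}
\dohtml
<html>
<ul>
<li><em>first</em> item</li>
<li>second item</li>
</ul>
</html>
\end{verbatim}
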
It would be possible to build a \LaTeX\ system that had full knowledge of
the HTML (or any other) DTD and in particular the `content model' of
every element. This would produce a more robust parsing system but would
take longer than I was prepared to spend\ldots\ If you need a fully
conforming SGML parser, it probably makes sense to use an existing one
(excellent parsers are freely available) and then convert the output of
the parser to a form suitable for \LaTeX. In that way all such concerns
about SGML syntax features such as minimisation will have been resolved
by the time \LaTeX\ sees the document.

\section{Examples}

\subsection{A section}

This document uses the \texttt{subsection*} option.
\begin{verbatim}
<h1>HTML and LaTeX</h1>
\end{verbatim}
\dohtml
<html>
<h1>HTML and LaTeX</h1>
</html>

\subsection{An itemised list}
\begin{verbatim}
<ul>
<li> something
<li> something else
</ul>
\end{verbatim}
\dohtml
<html>
<ul>
<li> something
<li> something else
</ul>
</html>

\subsection{Latin1 Characters}
\begin{verbatim}
&eacute; &ouml;
\end{verbatim}
\dohtml
<html>
&eacute; &ouml;
</html>

\subsection{Images}

Currently only the \texttt{alt} attribute is supported.
\begin{verbatim}
An image of me
<img alt="DPC" src="dpc.gif">
\end{verbatim}
\dohtml
<html>
This is an image of me <img alt="DPC" src="dpc.gif">
</html>

\subsection{A Form}
\begin{verbatim}
<form action=
"http://www.cogs/cgi-bin/ltxbugs2html"
method=get><hr>
You can search for all the bug reports about:
<select name="category">
<option>AMS LaTeX</option>
<option>Babel</option>
<option>Graphics and colour</option>
<option>LaTeX</option>
<option selected>Metafont fonts</option>
<option>PostScript fonts</option>
<option>Tools</option>
</select>
<hr>
</form>
\end{verbatim}
\dohtml
<html>
<form action="http://www.cogs.susx.ac.uk/cgi-bin/ltxbugs2html"
method=get><hr>
You can search for all the bug reports about:
<select name="category">
<option>AMS LaTeX</option>
<option>Babel</option>
<option>Graphics and colour</option>
<option>LaTeX</option>
<option selected>Metafont fonts</option>
<option>PostScript fonts</option>
<option>Tools</option>
</select>
<hr>
</form>
</html>

\subsection{Styles of Mathematics}
\begin{verbatim}
<math>
H_2_O + CO_2_
</math>
<math class=chem>
H_2_O + CO_2_
</math>
<math box>
H_2_O + CO_2_
</math>
<math class=chem box>
H_2_O + CO_2_
</math>
\end{verbatim}
\dohtml
<html>
<math>
H_2_O + CO_2_
</math>
<math class=chem>
H_2_O + CO_2_
</math>
<math box>
H_2_O + CO_2_
</math>
<math class=chem box>
H_2_O + CO_2_
</math>
</html>

\subsection{Integrals}

Stretchy integrals with the \texttt{bigint} option.
\begin{verbatim}
<math>
{&int;^1^_3_<left> 1 <over>
{x+{1<over>x+{2<over>x+
{3<over>x+{4<over>x}}}}}
<right>dx}
</math>
\end{verbatim}
\dohtml
<html>
<math>
{&int;^1^_3_<left> 1 <over>
{x+{1<over>x+{2<over>x+
{3<over>x+{4<over>x}}}}}
<right><t>d</t> x}
</math>
</html>

And the same integral with the standard integral sign.
\begingroup
\makeatletter
\let\HTML@bigint\int
\dohtml
<html>
<math>
{&int;^1^_3_<left> 1 <over>
{x+{1<over>x+{2<over>x+
{3<over>x+{4<over>x}}}}}
<right><t>d</t>x}
</math>
</html>
\endgroup

\subsection{Oversized delimiters}
\begin{verbatim}
<math>
<box>
(<left>1 <atop> 2 <right>)
</box>
<box size=large>
(<left>1 <atop> 2 <right>)
</box>
</math>
\end{verbatim}
\dohtml
<html>
<math>
<box>
(<left>1 <atop> 2 <right>)
</box>
<box size=large>
(<left>1 <atop> 2 <right>)
</box>
</math>
</html>

\subsection{Roots, Overbraces etc}
\begin{verbatim}
<math>
<above sym=overbrace> abc </above>
<sup>k</sup>
<root>3<of>x</root>
<sqrt>5</sqrt>
<below sym=underline> abc </below>
<above sym=widehat> abc </above>
</math>
\end{verbatim}
\dohtml
<html>
<math>
<above sym=overbrace> a bc </above>
<sup>k</sup>
<root>3<of>x</root>
<sqrt>5</sqrt>
<below sym=underline> abc </below>
<above sym=widehat> abc </above>
</math>
</html>

\subsection{Arrays}

Most of the array specification is supported. So far most of the effort
has gone into writing the HTML parser, so the column spacing is not yet
ideal, as may be seen in the following examples, but that is (hopefully!)
a small detail that can be corrected in a later release.
\begin{verbatim}
<math>
<array align=top>
<row><item><text>col 1</text>
<item><text>col 2</text>
<item><text>col 3</text>
<item><text>col 4</text>
<row><item><text>row 2</text>
<item> a_22_ <item>a_23_<item>a_24_
<row><item><text>row 3</text>
<item rowspan=3 colspan=2>a_32_-a_53_
<item>a_34_
<row><item><text>row 4</text>
<item>a_44_
<row><item><text>row 5</text>
<item>a_54_
<row><item><text>row 6</text>
<item align=left>al_62_
<item align=right>ar_63_
<item>a_64_
</array>
</math>
\end{verbatim}
\dohtml
<html>
<math>
<array align=top>
<row><item><text>col 1</text><item><text>col 2</text><item>
<text>col 3</text><item><text>col 4</text>
<row><item><text>row 2</text><item> a_22_ <item>a_23_<item>a_24_
<row><item><text>row 3</text><item rowspan=3 colspan=2>
a_32_-a_53_<item>a_34_
<row><item><text>row 4</text><item>a_44_
<row><item><text>row 5</text><item>a_54_
<row><item><text>row 6</text><item align=left>
al_62_<item align=right>ar_63_<item>a_64_
</array>
</math>
</html>

Repeat that element, but change the \texttt{array} attributes as follows:
\begin{verbatim}
<array ldelim="(" rdelim=")" labels>
\end{verbatim}
\dohtml
<html>
<math>
<array ldelim="(" rdelim=")" labels>
<row><item><text>col 1</text><item><text>col 2</text><item>
<text>col 3</text><item><text>col 4</text>
<row><item><text>row 2</text><item> a_22_ <item>a_23_<item>a_24_
<row><item><text>row 3</text><item rowspan=3 colspan=2>
a_32_-a_53_<item>a_34_
<row><item><text>row 4</text><item>a_44_
<row><item><text>row 5</text><item>a_54_
<row><item><text>row 6</text><item align=left>
al_62_<item align=right>ar_63_<item>a_64_
</array>
</math>
</html>

Finally, an example of \texttt{colspec}:
\begin{verbatim}
<math>
<array colspec="R+C=L">
<row><item>abc_11_<item>abc_12_
<item>abc_13_
<row><item>a_21_<item>a_22_<item>a_23_
<row><item>a_31_<item>a_32_<item>a_33_
</array>
</math>
\end{verbatim}
\dohtml
<html>
<math>
<array colspec="R+C=L">
<row><item>abc_11_<item>abc_12_<item>abc_13_
<row><item>a_21_<item>a_22_<item>a_23_
<row><item>a_31_<item>a_32_<item>a_33_
</array>
</math>
</html>

\subsection{Tables}

HTML3 tables are not yet supported, but there is a minimal amount of
support to catch simple cases.

\def\table[#1]{\noindent\begin{minipage}\linewidth\centering}
\def\endtable{\end{minipage}}
\begin{verbatim}
<table>
<caption>Simple Table</caption>
<tr><td>one <td> two
<tr><td>a <td> b
</table>
\end{verbatim}
\dohtml
<html>
<table>
<caption>Simple Table</caption>
<tr><td>one <td> two
<tr><td>a <td> b
</table>
</html>

\section{Concluding Remarks}

Some parts of this package are still rather `rough'. In particular some
of the spacing in the mathematics examples above is not perfect. I plan
to revise the package and improve such details when (if?) a mathematics
proposal for HTML to replace the HTML3 draft is published.

Considering that it started off as an example just to show that \TeX\ is
capable of processing markup languages that do not look like the
traditional `backslash' commands, the package has proved surprisingly
capable of handling a wide variety of `real world' HTML documents.

Of the core HTML language, the most noticeable feature not yet supported
is graphics inclusion. I plan to support that better in a future release.

A more difficult conceptual problem is that it is hard to linearise a
hypertext document automatically. A typical `document' will consist of
many HTML files interconnected by links. Currently one must invoke
|\dohtml| or |\htmlinput| separately on each of these files, and manually
order them into a page order for the typeset version. It would be nice to
develop heuristics to traverse the HTML document and build up the linear
typeset version automatically; however \TeX\ may not be the ideal
language for writing a web-crawler\ldots

\end{Article}