\def\latextohtml{\LaTeX{2}{\tt{HTML}} }
\def\htmladdnormallink#1#2{ #1\footnote{#2}}
\def\bull{$\bullet $ }

% MYFIG
% #1 - name without the .ps extension
% #2 - caption
% Frames the picture.
\newcommand{\myfig}[2]
{\begin{figure*}
\centerline{%
\epsfig{file=#1.ps,height=.9\textheight}}
\caption{#2}
\label{fig:#1}
\end{figure*}}

\title{Text to Hypertext Conversion with \latextohtml}
\author[Nikos Drakos]{Nikos Drakos,\\
Computer Based Learning Unit,\\
University of Leeds, Leeds LS2 9JT, UK.\\
email: {\tt nikos@cbl.leeds.ac.uk}\\
www: {\tt http://cbl.leeds.ac.uk/nikos/personal.html}\\
}

\begin{article}

\begin{abstract}
\latextohtml is a conversion tool that allows existing documents written
in \LaTeX\ to become part of a global multimedia system. This paper
presents some of the reasons for using such a system and describes the
basic conversion process.
\end{abstract}

\section{World Wide Web --- A Global Multimedia System}

\begin{quotation}
Imagine a system that links all the text, data, digital sounds, graphics
and video on all the world's computers into a single interlinked
hypermedia ``web''. This is the potential of the Internet-based World
Wide Web (WWW or W3) project \ldots \cite{levy:www}
\end{quotation}

The World Wide Web merges hypermedia techniques with networked document
retrieval to provide a global information system of linked documents.
These are traversed by ``clicking'' on textual or iconic active areas, or
searched via query mechanisms \cite{tbl:www}. Hypertext links may point
to a different location in the same document or to another document,
which may even be located on another continent! Documents are not limited
to textual information and may include high resolution images, audio and
video samples. WWW also encompasses most of the services currently
available on the Internet such as Usenet news, ftp, wais, archie, etc.
Access to these services, as well as the invocation of arbitrary computer
programs (e.g.\ a database access or a simulation), is completely
transparent to the user, who sees them all as part of some document and
interacts with them in a uniform and intuitive way.

Multimedia documents are written in a language designed specifically for
the World Wide Web called HTML (HyperText Markup Language), which is based
on SGML (Standard Generalized Markup Language). Documents are written by
information providers, who simply place them on the WWW using a
``server'' program. Anyone with access to the Internet can then use a
``client'' or ``browser'' program to access and view the available
documents. Clients and servers communicate via HTTP (HyperText Transfer
Protocol). Apart from navigation facilities, browsers also offer full
text searches, ``cut and paste'', text or audio annotations, personal
``hotlists'', and saving and printing in multiple formats, among other
features. Such browser and server programs are freely available for most
popular computer configurations.

With the explosive growth of the World Wide Web (500-fold since the first
graphical browsers were made available this year \cite{vern:www}), and a
potential audience of 15 million in more than 50 countries, providing
information via the WWW is becoming an extremely attractive proposition.

\section{\LaTeX\ to HTML Conversion: Why?}

HTML is quite a simple markup language to learn and use. It allows basic
formatting commands, bulleted lists, ``inlined'' images, and hypertext
links to other documents, multimedia sources, Internet services or
computer programs.
But despite (and because of) its simplicity, it has created a few
headaches for information providers:
\begin{itemize}
\item there are no intuitive authoring tools (yet);
\item yet another hypertext language has to be learned;
\item existing documents available in other formats have to be
reprocessed;
\item hypertext document ``webs'' are difficult to maintain;
\item it is difficult or impossible to create highly formatted documents
in HTML.
\end{itemize}

\latextohtml can be used to address these problems to a large degree. The
authoring problem simply disappears, existing documents can be reused
immediately, and a complex web of interlinked documents can be generated
from a single source document. The automatic inclusion of formatted
information such as tables or mathematical equations as inlined images
also bypasses another serious problem with HTML. An additional benefit is
that the paper-based version of a document can still be obtained from the
same source.

The utility of a conversion tool like \latextohtml can be seen from the
variety of contexts in which it has been applied. Some examples are
listed below.
\begin{itemize}
\item Electronic books, e.g.\ that produced by the Computational Science
Education Project\footnote{http://compsci.cas.vanderbilt.edu/csep.html},
which is sponsored by the US Department of Energy. This is one of the
most complex documents currently available via the WWW.
\item General reports (e.g.\ the annual report of the Institute of
Astronomy at
Cambridge\footnote{http://cast0.ast.cam.ac.uk/sub\-$\_$dir/cambridge/annual\-$\_$report/annual$\_$report.html}).
\item User manuals\footnote{http://cs.indiana.edu:80/elisp/w3/docs.html}.
\item System
documentation\footnote{http://archie.ac.il:8001/papers/papers.html}.
\item Scientific papers such as those on the MIT Transit
Project\footnote{http://www.ai.mit.edu/projects/transit/tn-cat.html}.
\item Electronic journals (e.g.\ Complexity
International\footnote{http://life.anu.edu.au/ci/ci.html}, a new
Australian electronic journal).
\end{itemize}

\section{\LaTeX\ to HTML Conversion: How?}

The basic conversion process relies on the ability to distinguish between
the {\em structure}, the {\em content} and the {\em formatting}
information in a \LaTeX\ document. On the basis of sectioning
information, a document is broken into separate parts, and an iconic
navigation mechanism is constructed in HTML which reflects this structure
and allows a user to ``jump'' between different parts. The
cross-references, citations, footnotes, the table of contents and the
lists of figures and tables are also translated into hypertext links.
Formatting information which has equivalent ``tags'' in HTML (lists,
quotes, paragraph breaks, type styles, etc.) is also converted
appropriately.

Although in most cases the loss of some formatting information (e.g.\
page margins or line widths) is harmless, there are occasions where the
format carries meaning, e.g.\ when dealing with tables or user-defined
environments. Another problem is the replication of mathematical
equations, which must retain both their precise format and any of the
predefined special mathematical symbols.

The solution in such cases relies on the ability of HTML browsers to
display inlined images inside the main text. Any part of a \LaTeX\
document for which it is not obvious how it should be translated directly
into HTML is extracted from the main document and passed through a
pipeline which converts it into an image. Each image is then placed at
the correct position in the final HTML document. Special care is taken to
preserve contextual information that may affect the contents of each
image (counter values, labels, references, active style files, etc.).

An example of a converted document can be seen in
Figure~\ref{fig:mosaic}.
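
As a rough illustration of the distinctions drawn in this section, the
fragment below annotates a small piece of \LaTeX\ source with the kind of
HTML construct each part would be mapped to. It is only a sketch: the
section title, labels and equation are invented for this example, and the
markup actually generated by the translator may differ in detail.

\begin{verbatim}
% Structure: sectioning starts a new part
% of the document with navigation links.
\section{Results}\label{sec:results}

% Formatting with an HTML equivalent:
% emphasis maps to <EM>...</EM>, and the
% list maps to <UL> with <LI> items.
Some {\em emphasised} text.
\begin{itemize}
\item first point
\item second point
\end{itemize}

% No direct HTML equivalent: the equation
% goes through the image pipeline and is
% inlined with an <IMG> tag.
\begin{equation}
E = mc^2 \label{eq:energy}
\end{equation}

% Cross-reference: translated into a
% hypertext link (<A HREF=...>).
See Equation~\ref{eq:energy}.
\end{verbatim}

The same source still produces the usual paper-based output when
processed by \LaTeX\ directly, since the comments are ignored.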
\myfig{mosaic}{A converted document displayed using Mosaic}

\section{Hypermedia Extensions to \LaTeX}

Apart from the obvious hypertext links within a \LaTeX\ document (e.g.\
navigation between sections, cross-references and citations), it is also
possible to take full advantage of HTML links to arbitrary multimedia
sources (e.g.\ audio or video), electronic forms, and other remote
documents or Internet services. This can be done with some new commands
defined in a separate style file ({\tt html.sty}) which are processed in
a special way by the \latextohtml translator.

This style file defines commands for embedding external hypertext links,
for extending the basic \verb|\ref|--\verb|\label| mechanism to operate
between remote documents, and for specifying that some text should appear
only in the paper-based version or only in the HTML document. In most
cases these commands have no effect when the document is processed by
\LaTeX\ in the conventional way.

Another command allows the inclusion of arbitrary HTML markup directly in
a \LaTeX\ document. This can be used to take advantage of new HTML
facilities as soon as they become available (HTML is currently evolving
towards a new specification called HTML+). A particularly good use of
this feature is the creation of interactive electronic forms from within
a \LaTeX\ document.

\section{Concluding Remarks}

Conversion tools like \latextohtml provide an easy migration path from
familiar concepts towards authoring complex and format-rich hypermedia
documents. In this way, familiarity with a system like \LaTeX\ makes it
possible to contribute to and benefit from a rapidly expanding global
hypermedia network.

\bibliographystyle{plain}
\begin{thebibliography}{1}

\bibitem{tbl:www}
T.~Berners-Lee, R.~Cailliau, J.~Groff, and B.~Pollermann.
\newblock World-wide web: The information universe.
\newblock {\em Electronic Networking: Research, Applications and Policy},
  (1), 1992.

\bibitem{levy:www}
Joe Levy.
\newblock The world in a web.
\newblock {\em The Guardian}, page~19, November 11, 1993.

\bibitem{vern:www}
Vern Paxson.
\newblock Growth trends in wide-area {TCP} connections.
\newblock {\em IEEE Network}, to appear, 1993.
\newblock Available at
  ftp://ftp.ee.lbl.gov/WAN-TCP-growth-trends.revised.ps.Z.

\end{thebibliography}

\appendix
\section{Further Information}

\latextohtml is written in Perl and requires freely available software.
\htmladdnormallink{More information on how to get, install and use it is
available via the
WWW}{http://cbl.leeds.ac.uk/nikos/\-tex2html/doc/latex2html/\-latex2html.html}
or using anonymous ftp from {\tt ftp.tex.ac.uk} in
{\tt pub/archive/support/latex2html}. A new release is planned for early
December 1993.

Several computers on the Internet have public access World Wide Web
clients accessible by telnet, e.g.\ \\
\bull telnet info.cern.ch (direct connection; no username or password
required) \\
\bull telnet ukanaix.cc.ukans.edu (``Lynx'' requires a vt100 terminal.
Log in as www.)

Information on the World Wide Web is also available via anonymous ftp
from {\tt ftp.germany.eu.net} in {\tt pub/infosystems/www}. The Mosaic
clients are in the directory {\tt /pub/infosystems/www/ncsa/Web}.

\end{article}