\hyphenation{compu-script}

\iffalse
  Elements of SGML
  by
  Jonathan Fine          11 October 1994

  203 Coldhams Lane
  Cambridge
  CB1 3HY
  Tel:0223 215389
  Email: J.Fine@pmms.cam.ac.uk

This is the LaTeX file for the first of a series of articles which
will appear in "Baskerville", which is the journal of the UK TeX
User's Group.  Comments are welcome.
\fi

\title{Elements of SGML}
\author[Jonathan Fine]{Jonathan Fine\\
  203 Coldhams Lane\\
  Cambridge CB1 3HY\\
  \texttt{J.Fine@pmms.cam.ac.uk}}

\begin{Article}

\noindent
This is the first of a series of articles on various aspects of SGML.
It is intended to be a general introduction.
Subsequent articles will discuss the SGML concepts of document elements
and their content, attributes, entities, markup minimization and data
notation.
The author welcomes queries and requests.
He is also available for professional consultation.

\subsection{Introduction}

Here is the opening paragraph of the SGML standard:
\begin{quotation}
\noindent
This International Standard specifies a language for document
representation referred to as the ``Standard Generalized Markup
Language'' (SGML).
SGML can be used for publishing in its broadest definition, ranging
from single medium conventional publishing to multi-media data base
publishing.
SGML can also be used in office document processing when the benefits
of human readability and interchange with publishing systems are
required.

\raggedleft from \em Clause 0, Introduction, ISO 8879, October 1986
\end{quotation}

Necessarily, this article will give an incomplete picture of SGML.

Here are five professional activities involved in modern publishing.
It is the \emph{author} whose words are published.
Most of the time we assume for simplicity that it is the author who
keys the manuscript into a computer, creating what I shall call a
{\em compuscript}, or \emph{script} for short.
This assumption is of course not true for Shakespeare and many other
authors.
The {\em designer} will establish the structure of the author's
work---perhaps retrospectively---and establish standards for its
printed representation.
The \emph{typesetter} or application programmer will use software tools
to produce, from the author's script, printed pages or other output
that meets the designer's requirements.
These software tools will have been created by an \emph{implementor} or
systems programmer.
The \emph{publisher} will have an overall responsibility for, and
financial interest in, the whole process.

This quintet---author, designer, typesetter, implementor, and
publisher---are all involved in the production of a book before it goes
to printer and binder.
This production process is generally of little concern (except when it
goes wrong) to the sixth and final party, the \emph{reader}.
However, SGML can be used to offer the reader new electronically
published products.

In this article I shall show you, in its entirety, an extremely simple
SGML document.
By and large I will take the author's point of view, partly in the hope
of alleviating any distrust of technology there may be in the
humanities.
I hope that the more technically minded will bear with me during this
apparently pedestrian exposition.
They may wish to reflect on how SGML allows cooperation and division of
responsibility within the production process.

\section{Field of Application}

It is useful to know what SGML can be used for, and what lies outside
its province.
\begin{quotation}
\noindent
The Standard Generalized Markup Language can be used for documents that
are processed by any text processing or word processing system.
It is particularly applicable to:
\begin{list}{}{}
\item a) Documents that are interchanged among systems with differing
  text processing languages.
\item b) Documents that are processed in more than one way, even when
  the procedures use the same text processing language.
\end{list}
Documents that exist solely in final imaged form are not within the
field of application of this International Standard.

\raggedleft \em Clause 2, Field of Application, ISO 8879
\end{quotation}

This is the whole of Clause~2 of the Standard, which altogether has
15~clauses and 9~annexes.
Unfortunately, they are all much harder and longer than this clause.
This clause may be all that the publisher needs to know.

It is well worth noting that SGML is first of all a standard for
{\em documents}, of particular use when documents are interchanged
among systems (the author sends the compuscript to the publisher who
sends it on to the typesetter) or processed in several ways (for paper
or electronic publication, in a journal or a book, or extracts in a
secondary journal, or even just a second edition).

Just as the ASCII character codes provide a standard for the expression
and thus interchange of sequences of characters, so SGML is to provide
a standard for the expression of structured or marked-up documents.

However (note to Clause~1) the SGML International Standard does not
specify standard document types, or standard SGML applications, nor
the implementation or architecture of either the application or the
electronic storage representation of the documents.
It is an abstract standard for documents, deliberately indifferent to
the specifics of application and implementation.

\section{Hello world!}

It is traditional to begin the explanation of a computer language with
the code that will print the words:
\begin{verbatim}
Hello world!
\end{verbatim}
say on the user's screen.
An author might object to using a computer language to write the great
English novel.
But SGML is quite unlike other computer languages, and anyway, all word
processors use their own computer language to represent your documents.
Do you know what they are doing to your words?
With SGML you do, or at least can if you wish to.

You don't need to be a programmer to write an SGML document.
SGML is not even a programming language---it is a document structure
and markup language.
Moreover, SGML does not understand the English language, and so will
not correct your spelling or criticize your plot.

Instead, with SGML you make statements, which are called
{\em declarations}.
Most other computer languages are concerned with giving instructions to
the computer.
However, SGML does little more than record your declarations.
It also checks that you are doing only what you allowed yourself to do.
Though this may sound onerous, according to Hegel ``one who will do
something great must learn to limit oneself.''
It can be a healthy discipline.

Back to our example.
We wish to express
\begin{verbatim}
Hello world!
\end{verbatim}
in SGML.
If this is too trivial for you, replace this text by Shakespeare's
sonnet
\begin{verbatim}
Shall I compare thee to a summer's day?
\end{verbatim}
or even the collection of his sonnets, complete with editorial and
critical apparatus, and publishing information.
It is to such that we wish to add markup, which is defined in the
Standard as \emph{text that is added to the data of a document in order
to convey information about it.}

We shall mark up our message so that it is an SGML document.
What sort of document is it?
It is a message.
So we mark it up as
\begin{verbatim}
<message>Hello world!</message>
\end{verbatim}
where the added text is markup, sharing with the computer our knowledge
that the original text is a message.
The \emph{content} of the message \emph{element} is our original text,
which lies between the \emph{start-tag} and the \emph{end-tag}.

Although the author may be satisfied by this, the programmer or
typesetter will not.
This person would like to know, without reading the whole document
(which might be quite long and perhaps still in progress), what
elements (tags) are to be found in it.
And the author might like to be told when a tag name (called for
historic reasons the \emph{generic identifier}) has been misspelt.
Before we use a tag---or any other textual markup---we must first
declare that it exists to be used.
Although it may appear to be overly fussy, it is arguably to everyone's
benefit that all declarations required for a document should appear
before any of its text.
In any case, SGML insists that a document consists of a \emph{prolog}
(which contains all the markup declarations) followed by the
\emph{document instance}, which is the author's text, marked up in
conformity to the declarations in the prolog.

This is important.
To repeat: an SGML document consists of the prolog followed by the
conforming document instance.
It is this requirement which allows documents to be interchanged among
systems.

Before our marked up document instance
\begin{verbatim}
<message>Hello world!</message>
\end{verbatim}
can be allowed, we must create a prolog which declares the markup
construction(s) that can be used.
The text to do this is here needlessly verbose, except that it will
later enable the powerful CONCUR and LINK capabilities of SGML.
Here is our message, marked up as an SGML document.
\begin{verbatim}
<!DOCTYPE message [
<!ELEMENT message - - ANY>
]>
<message>Hello world!</message>
\end{verbatim}

There are four occurrences of the character string \verb"message".
The first tells us that the document instance to follow is to consist
of a \verb"message" element.
The second tells us that there are no restrictions on what may appear
in a \verb"message" element: \verb"ANY" words or characters,
\verb"ANY" elements, and \verb"ANY" other markup constructions,
repeated as often as one likes, and in \verb"ANY" order.
The third and fourth delimit the content of the \verb"message" element,
which the first occurrence of \verb"message" had promised.

There are 55 \emph{reserved names} such as \verb"DOCTYPE",
\verb"ELEMENT" and \verb"ANY" in SGML, which have special r\^oles in
the prolog, and also some special character strings.
Neither the author nor the publisher needs to know what they are or
what they do.
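
To see what this checking offers the author, here is a sketch of the
same document with the tag name misspelt.
It assumes the minimal prolog described above, in which
\verb"message" is the only declared element.
The misspelt name \verb"mesage" is invented for illustration, and the
\verb"- -" in the declaration says that neither tag may be omitted.
\begin{verbatim}
<!DOCTYPE message [
<!ELEMENT message - - ANY>
]>
<mesage>Hello world!</mesage>
\end{verbatim}
A validating SGML parser should reject this instance, since
\verb"mesage" has not been declared and the promised \verb"message"
element never appears.
The exact wording of the report will depend on the parser used.
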
For more complicated documents, particularly those that are to conform
to a house style, the prolog---which declares the elements and
structure of the document---cannot be left to chance.
In particular, the author should be relieved of responsibility for the
prolog, and not given the impression that it is something that he or
she could change, if they so wished.
(This last remark is directed to the publisher.)

The designer (or someone else) will create a set of declarations which
the author is able to invoke simply by placing a line such as:
\begin{verbatim}
\end{verbatim}
at the top of the compuscript.

The publisher (or someone else) should supply the author with guidance
and examples as to how the document structure declared by the designer
is to be used.
The author should not need to consult the invoked prolog.
Ideally, the publisher's tag set and its description should, together
with general SGML guidance, be enough to allow the author to mark up
the document instance.
(A proviso.
Specialised \emph{data content notations}, such as for mathematics, may
require additional non-SGML

\section{But wait, there's more}

The designers of SGML wanted a scheme able to encode the most
complicated document structures, which was at the same time easy to
learn and implement for the simpler documents.
They did this by giving SGML a number of parameters and optional
features, to be set even before the prolog was read.

For example, the powerful CONCUR feature allows a single document to
support two independent tagging schemes!
For a historic printed (or manuscript) Bible or other text one might
wish to record not only the division of the text into books and verses,
but also into pages and lines!

Each SGML document should, to be really official, begin with an SGML
declaration, whose purpose is to state which of the optional features
are in use, that \verb"<" and \verb"