\documentclass{ltxdoc} % \usepackage{tgschola,url} \usepackage{url} \usepackage[english]{babel} \usepackage{hyperref} \usepackage{luacode} \usepackage{framed} % Version is defined in the makefile, use default values when compiled directly \ifdefined\version\else \def\version{v0.2b} \let\gitdate\date \fi \newcommand\modulename[1]{\subsection{#1}\label{sec:#1}} \newcommand\modulesummary[1]{#1\\} \newcommand\moduleclass[1]{\subsubsection{Class: #1}} \newcommand\functionname[2]{\par\noindent\textbf{#1(#2)}\\} \newcommand\functionsummary[1]{#1\\\textbf{Parameters:}\\} \newcommand\functionparam[2]{\texttt{#1}: #2\\} \newcommand\functionreturn[1]{\textbf{Return: }\\#1\\} \usepackage[default]{luaxml} \begin{document} \title{The \textsc{LuaXML} library} \author{Paul Chakravarti \and Michal Hoftich} \date{Version \version\\\gitdate} \maketitle \tableofcontents \section{Introduction} |LuaXML| is pure lua library for processing and serializing of the |xml| files. The base code code has been written by Paul Chakravarti, with minor changes which brings Lua 5.3 or HTML 5 support. On top of that, new modules for accessing the |xml| files using |DOM| like methods or |CSS| selectors\footnote{Thanks to Leaf Corcoran for |CSS selector| parsing code.} have been added. The documentation is divided to three parts -- first part deals with the |DOM| library, second part describes the low-level libraries and the third part is original documentation by Paul Chakravarti. % Current release is aimed mainly as support for the odsfile package. % In first release it was included with the odsfile package, % but as it is general library which can be used also with other packages, % I decided to distribute it as separate library. \section{The \texttt{DOM\_Object} library} This library can process a |xml| sources using |DOM| like functions. To load it, you need to require |luaxml-domobject.lua| file. The |parse| function provided by the library creates \texttt{DOM\_Object} object, which provides several methods for processing the |xml| tree. \begin{verbatim} local dom = require "luaxml-domobject" local document = [[ sample

test

hello

]] -- dom.parse returns the DOM_Object local obj = dom.parse(document) -- it is possible to call methods on the object local root_node = obj:root_node() for _, x in ipairs(root_node:get_children()) do print(x:get_element_name()) end \end{verbatim} The details about available methods can be found in the API docs, section \ref{sec:luaxml-domobject}. The above code will load a |xml| document, it will get the ROOT element and print all it's children element names. The \verb|DOM_Object:get_children| function returns Lua table, so it is possible to loop over it using standard table functions. \begin{framed} \begin{luacode*} dom = require "luaxml-domobject" local document = [[ sample

test

hello

]] -- dom.parse returns the DOM_Object obj = dom.parse(document) -- it is possible to call methods on the object local root_node = obj:root_node() for _, x in ipairs(root_node:get_children()) do tex.print(x:get_element_name().. "\\par") end \end{luacode*} \end{framed} \subsection{HTML parsing} You can parse HTML documents using the \verb|DOM_Object.html_parse| function. This parser is slower than the default XML parser, but it can load files that would cause errors in the XML mode. It can handle wrongly nested HTML tags, inline JavaScript and CSS styles, and other HTML features that would cause XML errors. \begin{verbatim} dom = require "luaxml-domobject" local document = [[ sample

test

hello

another paragraph

first
second

]] -- dom.html_parse returns the DOM_Object obj = dom.html_parse(document) -- print names of all elements contained in body for _, x in ipairs(obj:query_selector("body *")) do tex.print(x:get_element_name().. "\\par") end \end{verbatim} \begin{framed} \begin{luacode*} dom = require "luaxml-domobject" local document = [[ sample

test

hello

another paragraph

first
second

]] -- dom.html_parse returns the DOM_Object obj = dom.html_parse(document) -- print names of all elements contained in body for _, x in ipairs(obj:query_selector("body *")) do tex.print(x:get_element_name().. "\\par") end \end{luacode*} \end{framed} \subsection{Void elements} The \verb|DOM_Object.parse| function tries to support the HTML void elements, such as \verb|| or \verb|

|. They cannot have closing tags, a parse error occurs when the closing tags are used. It is possible to define a different set of void elements using the second parameter for \verb|DOM_Object.parse|: \begin{verbatim} obj = dom.parse(document, {custom_void = true}) \end{verbatim} An empty table will disable all void elements. This setting is recommended for common |xml| documents. \subsection{Node selection methods} There are some other methods for element retrieving. \subsubsection{The \texttt{DOM\_Object:get\_path} method} If you want to print text content of all child elements of the body element, you can use \verb|DOM_Object:get_path|: \begin{verbatim} local path = obj:get_path("html body") for _, el in ipairs(path[1]:get_children()) do print(el:get_text()) end \end{verbatim} The \verb|DOM_Object:get_path| function always return array with all elements which match the requested path, even it there is only one such element. In this case, it is possible to use standard Lua table indexing to get the first and only one matched element and get it's children using \verb|DOM_Object:get_children| method. It the children node is an element, it's text content is printed using \verb|DOM_Object:get_text|. \begin{framed} \begin{luacode*} local path = obj:get_path("html body") for _, el in ipairs(path[1]:get_children()) do if el:is_element() then tex.print(el:get_text().."\\par") end end \end{luacode*} \end{framed} \subsubsection{The \texttt{DOM\_Object:query\_selector} method} This method uses |CSS selector| syntax to select elements, similarly to JavaScript \textit{jQuery} library. \begin{verbatim} for _, el in ipairs(obj:query_selector("h1,p")) do print(el:get_text()) end \end{verbatim} \begin{framed} \begin{luacode*} for _, el in ipairs(obj:query_selector("h1,p")) do tex.print(el:get_text().."\\par") end \end{luacode*} \end{framed} It supports also |XML| namespaces, using \verb_namespace|element_ syntax. \subsubsection{Supported CSS selectors}\label{sec:css_selectors} The \verb|query_selector| method supports following CSS selectors: \begin{description} \item[Universal selector -- \texttt{*}] -- select any element. \item[Type selector -- \texttt{elementname}] -- Selects all elements that have the given node name. \item[Class selector -- \texttt{.classname}] -- Selects all elements that have the given class attribute. \item[ID selector -- \texttt{\#idname}] -- Selects an element based on the value of its id attribute. \item[Attribute selector -- \texttt{[attrname='value']}] -- Selects all elements that have the given attribute. It can have the following variants: \texttt{[attrname]} -- elements that contain given attribute, \texttt{[attr\string|=value]} -- attribute text is exactly the value, with optional hyphen at the end, \verb|[attr~=value]| -- attribute name of attr whose value is a whitespace-separated list of words, one of which is exactly value, \verb|[attr^=value]| -- attribute text starts with value, \texttt{[attr\$=value]} -- attribute text ends with value. \item[Grouping selector -- \texttt{,}] -- This is a grouping method, it selects all the matching nodes. \end{description} \bigskip \noindent It is also possible to combine selectors using \textit{combinators} to make more specific searches. Supported combinators: \begin{description} \item[Descendant combinator -- \texttt{A B}] -- match all B elements that are inside A elements. \item[Child combinator -- \verb|A > B|] -- match B elements that are nested directly inside a A element. \item[General sibling combinator -- \texttt{A \char`\~ ~B}] -- the second element follows the first (though not necessarily immediately), and both share the same parent. \item[Adjacent sibling combinator -- \texttt{A + B}] -- the second element directly follows the first, and both share the same parent. \end{description} \bigskip \noindent LuaXML also supports some CSS pseudo-classes. A pseudo-class is a keyword added to a selector that specifies a special state of the selected element. The following are supported: \begin{description} \item[:first-child] -- matches an element that is the first of its siblings. \item[:first-of-type] -- matches an element that is the first of its siblings, and also matches a certain type selector. \item[:last-child] -- matches an element that is the last of its siblings. \item[:last-of-type] -- matches an element that is the last of its siblings, and also matches a certain type selector. \item[:nth-child] -- matches elements based on their position in a group of siblings. It can be used like this: \verb|li:nth-child(2)|. \end{description} \subsection{Element traversing} \subsubsection{The \texttt{DOM\_Object:traverse\_elements} method} It may be useful to traverse over all elements and apply a function on all of them. \begin{verbatim} obj:traverse_elements(function(node) print(node:get_text()) end) \end{verbatim} \begin{framed} \begin{luacode*} obj:traverse_elements(function(node) tex.print(node:get_text().."\\par") end) \end{luacode*} \end{framed} The \verb|get_text| method gets text from all children elements, so the first line shows all text contained in the \verb|| element, the second one in \verb|| element and so on. \subsection{DOM modifications} It is possible to add new elements, text nodes, or to remove them. \begin{verbatim} local headers = obj:query_selector("h1") for _, header in ipairs(headers) do header:remove_node() end -- query selector returns array, we must retrieve the first element -- to get the actual body element local body = obj:query_selector("body")[1] local paragraph = body:create_element("p", {}) body:add_child_node(paragraph) paragraph:add_child_node(paragraph:create_text_node("This is a second paragraph")) for _, el in ipairs(body:get_children()) do if el:is_element() then print(el:get_element_name().. ": ".. el:get_text()) end end \end{verbatim} In this example, \verb|

| element is being removed from the sample document, and new paragraph is added. Two paragraphs should be shown in the output: \begin{framed} \begin{luacode} local headers = obj:query_selector("h1") -- query selector returns array, we must retrieve the first element -- to get the actual body element local body = obj:query_selector("body")[1] local oldbody = body:copy_node() for _, header in ipairs(headers) do header:remove_node() end local paragraph = body:create_element("p", {}) body:add_child_node(paragraph) paragraph:add_child_node(paragraph:create_text_node("This is a second paragraph")) for _, el in ipairs(body:get_children()) do if el:is_element() then tex.print(el:get_element_name().. ": ".. el:get_text() .. "\\par") end end body:replace_node(oldbody) \end{luacode} \end{framed} \subsubsection{Adding raw XML and HTML string} You can also set XML or HTML markup from a string to an element using the \texttt{DOM\_Object:inner\_html} function. Pass true as the second argument to parse string as XML, it is parsed as HTML otherwise. \begin{verbatim} local document = [[
hello
]] local tree = dom.html_parse(document) local p = tree:query_selector("p")[1] -- insert inner_html as XML p:inner_html("hello this should be the new content") print(tree:serialize()) \end{verbatim} In this example, we replace contents of the first \verb|
| element by new content. \begin{framed} \ttfamily \begin{luacode} local document = [[
hello
]] local tree = dom.html_parse(document) local p = tree:query_selector("p")[1] -- insert inner_html as XML p:inner_html("hello this should be the new content") tex.print(tree:serialize()) \end{luacode} \end{framed} There are more variants of raw string methods that add the new content at specific places in the element instead of replacing contents of the element: \begin{description} \item[\texttt{DOM\_Object:insert\_before\_begin}] -- before element. \item[\texttt{DOM\_Object:insert\_after\_begin}] -- just inside the element, before its first child. \item[\texttt{DOM\_Object:insert\_before\_end}] -- just inside the element, after its last child. \item[\texttt{DOM\_Object:insert\_after\_end}] -- after the element. \end{description} \section{The \texttt{CssQuery} library} \label{sec:cssquery_library} This library serves mainly as a support for the \texttt{DOM\_Object:query\_selector} function. It also supports adding information to the DOM tree. \subsection{Example usage} \begin{verbatim} local cssobj = require "luaxml-cssquery" local domobj = require "luaxml-domobject" local xmltext = [[

Header

Some text, italics

Header

Some text, italics

]] local dom = domobj.parse(xmltext) local css = cssobj() css:add_selector("h1", function(obj) print("header found: " .. obj:get_text()) end) css:add_selector("p", function(obj) print("paragraph found: " .. obj:get_text()) end) css:add_selector("i", function(obj) print("found italics: " .. obj:get_text()) end) dom:traverse_elements(function(el) -- find selectors that match the current element local querylist = css:match_querylist(el) -- add templates to the element css:apply_querylist(el,querylist) end) \end{luacode*} \end{framed} More complete example may be found in the \texttt{examples} directory in the \texttt{LuaXML} source code repository\footnote{\url{https://github.com/michal-h21/LuaXML/blob/master/examples/xmltotex.lua}}. \section{The \texttt{luaxml-transform} library} This library is still a bit experimental. It enables XML transformation based on CSS selector templates. It isn't nearly as powerful as XSLT, but it may suffice for simpler tasks. \subsection{Basic example} \begin{verbatim} local transform = require "luaxml-transform" local transformer = transform.new() local xml_text = [[

hello world

]] -- transformatio rules transformer:add_action("section", "\\section{@<.>}") transformer:add_action("b", "\\textbf{@<.>}") -- transform and print the result to the document local result = transformer:parse_xml(xml_text) transform.print_tex("\\verb|" .. result .. "|") \end{luacode*} \end{framed} \subsection{The Transform object } The \texttt{luaxml-transform} library provides several functions. Most important of them is \verb|new()|. It returns a Transform object, that can be used for the transformations. It is possible to transform XML using text templates, or Lua functions. In both cases, actions for elements are selected using CSS selectors. If there is no action for an element, it's text content and text from transformed child elements, is placed to the output string. There are two methods for action specification, \verb|add_action| for text templates, and \verb|add_custom_action| for Lua functions. \subsubsection{Transforming using templates}\label{sec:transform-templates} Template actions can be added using the \verb|add_action| method: \begin{verbatim} transformer:add_action("CSS selector", "template", {parameters table}) \end{verbatim} For details about CSS selectors, see the \texttt{CssQuery} library (see page~\pageref{sec:cssquery_library}). Templates can contain arbitrary text, with special instructions that can insert transformed text contents of the element, contents of specific element, or element's attributes. \noindent\textbf{Instruction syntax:} \begin{description} \item[\verb|@\{attribute name\}|] insert value of an attribute \item[\verb|@<.>|] insert transformed content of the element \item[\texttt{\%s}] insert transformed content of the element. Shortcut for \verb|@<.>|. \item[\verb|@|] insert transformed content of the child element selected by it's number in the list of children \item[\verb|@|] insert transformed content of the named child element \end{description} \noindent\textbf{Parameters} The parameters table can hold following values: \begin{description} \item[verbatim] -- used for source code listings and similar texts, that should keep their original formatting. Special characters are not escaped, so you will want to transform the elements into verbatim or listings environment. \item[separator] -- when you select element by names (\verb|@|), you can use this parameter set the separator between possible multiple instances of the child element. \end{description} \noindent\textbf{Examples:} \noindent\textbf{Process children} \begin{verbatim} local transformer = transform.new() transformer:add_action("a", "@<.>") -- ignore element transformer:add_action("b", "") local result = transformer:parse_xml("hello, world") transform.print_tex(result) \end{verbatim} \begin{framed} \begin{luacode*} local transform = require "luaxml-transform" local transformer = transform.new() transformer:add_action("a", "@<.>") -- ignore element transformer:add_action("b", "") local result = transformer:parse_xml("hello, world") transform.print_tex(result) \end{luacode*} \end{framed} \noindent\textbf{Select elements by their position} \begin{verbatim} local transformer = transform.new() -- swap child elements transformer:add_action("x", "@<2>, @<1>") local result = transformer:parse_xml("world, hello") transform.print_tex(result) \end{verbatim} \begin{framed} \begin{luacode*} local transform = require "luaxml-transform" local transformer = transform.new() transformer:add_action("x", "@<2>, @<1>") local result = transformer:parse_xml("world, hello") transform.print_tex(result) \end{luacode*} \end{framed} \noindent\textbf{Select elements by name} \begin{verbatim} local transformer = transform.new() transformer:add_action("x", "@") local result = transformer:parse_xml("hello, world") transform.print_tex(result) \end{verbatim} \begin{framed} \begin{luacode*} local transform = require "luaxml-transform" local transformer = transform.new() transformer:add_action("x", "@") local result = transformer:parse_xml("hello, world") transform.print_tex(result) \end{luacode*} \end{framed} \noindent\textbf{Select attributes} \begin{verbatim} local transformer = transform.new() transformer:add_action("b", "\\textbf{@<.>}") -- this will select only elements with "style" attribute transformer:add_action("b[style]", "\\textcolor{@{style}}{\\textbf{@<.>}}") local text = 'hello world' local result = transformer:parse_xml(text) transform.print_tex(result) \end{verbatim} \begin{framed} \begin{luacode*} local transform = require "luaxml-transform" local transformer = transform.new() transformer:add_action("b", "\\textbf{@<.>}") -- this will select only elements with "style" attribute transformer:add_action("b[style]", "\\textcolor{@{style}}{\\textbf{@<.>}}") local text = 'hello world' local result = transformer:parse_xml(text) transform.print_tex("\\verb|" .. result .. "|") \end{luacode*} \end{framed} \subsubsection{Transforming using Lua functions} You can use Lua functions for more complex transformations where simple templates don't suffice. \begin{verbatim} transformer:add_custom_action("CSS selector", function) \end{verbatim} \noindent\textbf{Example} \begin{verbatim} local transformer = transform.new() local xml_text = "worldhello, " -- load helper functions local get_child_element = transform.get_child_element local process_children = transform.process_children -- define custom action transformer:add_custom_action("x", function(el) -- it basically just swaps child elements, -- like in the template @<2>@<1> local first = process_children(get_child_element(el, 1)) local second = process_children(get_child_element(el, 2)) return second .. first end) local result = transformer:parse_xml(xml_text) transform.print_tex(result) \end{verbatim} \begin{framed} \begin{luacode*} local transform = require "luaxml-transform" local transformer = transform.new() local xml_text = "worldhello, " -- load helper functions local get_child_element = transform.get_child_element local process_children = transform.process_children -- define custom action transformer:add_custom_action("x", function(el) -- it basically just swaps child elements, -- like in the template @<2>@<1> local first = process_children(get_child_element(el, 1)) local second = process_children(get_child_element(el, 2)) return second .. first end) local result = transformer:parse_xml(xml_text) transform.print_tex(result) \end{luacode*} \end{framed} \subsubsection{Character handling} You may want to escape certain characters, or replace them with \LaTeX\ commands. You can use the \texttt{unicodes} table contained in the Transform object: \begin{verbatim} local transformer = transform.new() -- you must use the Unicode character code transformer.unicodes[124] = "\\textbar" local text = '|' local result = transformer:parse_xml(text) transform.print_tex(result) \end{verbatim} \begin{framed} \begin{luacode*} local transform = require "luaxml-transform" local transformer = transform.new() -- you must use the Unicode character code transformer.unicodes[124] = "\\textbar" local text = '|' local result = transformer:parse_xml(text) transform.print_tex("\\verb|" .. result .. "|") \end{luacode*} \end{framed} \section{Character sets handling} The \texttt{luaxml-encodings} library provides functions to convert texts in legacy 8-bit encodings such as WINDOWS-1250 or ISO-8859-2 to UTF-8. This can be useful in fixing document encoding before HTML parsing using the \texttt{luaxml-mod-html} library. \subsection{Example} \begin{verbatim} kpse.set_program_name "luatex" local encodings = require "luaxml-encodings" --read HTML page from the standard input local text = io.read("*all") -- find the character encoding in HTML metadata local enc = encodings.find_html_encoding(text) if enc then -- local conversion table for the found encoding local mapping = encodings.load_mapping(enc) if mapping then -- if the mapping exists, recode the HTML input and print it local converted = encodings.recode(text, mapping) print(converted) end end \end{verbatim} \section{The \texttt{luaxml.sty} Package} The \texttt{luaxml.sty} package is designed to provide an interface for defining transformation rules for XML and HTML documents using Lua and \LaTeX\ commands. It allows users to declare transformation objects, apply transformation rules based on CSS selectors, and process XML or HTML from files or code snippets within \LaTeX\ documents. XML and HTML documents can be inserted from files or directly via commands and environments. All commands and environments intended for code input have two variants: with an asterisk for inputting HTML documents and without an asterisk for inputting XML documents. \subsection{Package Options} \begin{description} \item[default] -- load HTML templates. They will be available as \verb|html| option in \verb|luaxml.sty| commands and environments. \end{description} \begin{verbatim} \usepackage[default]{luaxml} ... \begin{LXMLCode*}{html}
Hello world and some text in italics.
\end{LXMLCode*} \end{verbatim} \begin{framed} \begin{LXMLCode*}{html}
Hello world and some text in italics.
\end{LXMLCode*} \end{framed} \subsection{Example of Transformation Using \LaTeX\ Commands} \begin{verbatim} \LXMLRule[sample]{h1}|\par\noindent{\large\bfseries %s\par}| \LXMLRule[sample]{p}|%s\par| \LXMLRule[sample]{a[href]}|\href{@{href}}{%s}| %% process HTML code \begin{LXMLCode*}{sample}
Hello

Here is a link to TeX.sx
\end{LXMLCode*} \end{verbatim} \begin{framed} \LXMLRule[sample]{h1}|\par\noindent{\large\bfseries %s\par}| \LXMLRule[sample]{p}|%s\par| \LXMLRule[sample]{a[href]}|\href{@{href}}{%s}| % process HTML code \begin{LXMLCode*}{sample}
Hello

Here is a link to TeX.sx
\end{LXMLCode*} \end{framed} \subsection{Declaring Transformation Rules} \begin{verbatim} \LXMLRule[]\{\}|| \end{verbatim} \noindent Defines a transformation rule for the current transformer. The transformation is applied to elements matching the given CSS selector. You can define multiple transformers, for example if you want to support multiple XML syntaxes and HTML at the same time. \medskip \noindent The \texttt{} parameter can include: \begin{itemize} \item \texttt{verbatim}: Whether to process the rule in verbatim mode. \item \texttt{transformer}: Specifies a transformer. \end{itemize} Any unknown key acts as a name of the transformer. In the following code, both examples add a rule to a transformer named \texttt{sample}. \begin{verbatim} \LXMLRule[transformer=sample]{b}|\textbf{%s}| \LXMLRule[sample]{i}|\textit{%s}| \end{verbatim} If you want to support only one syntax though, you don't need to specify the transformer name at all, a default object will be used. By default, spaces are collapsed. If you want to support elements where white spaces should be preserved, such as HTML \verb|
| element, use the \verb|verbatim| option: \begin{verbatim} \LXMLRule[verbatim]{pre}|\begin{verbatim} %s \end{verbatim} % trick to print \end{verbatim}| \verb+\end{verbatim}|+ \bigskip The \texttt{transformation rule} must be delimited by a pair of characters that are not used in the text of the rule. We use \verb+|+ in our examples, but you can use other characters if you like. This is similar to how the \verb|\verb| command works. You can use the syntax shown in the section~\ref{sec:transform-templates} (page~\pageref{sec:transform-templates}). The following code defines rule that transforms the \verb|
| element to a \verb|\section| command, and \verb|| element which has a \verb|href| attribute to \verb|\href|. URL of the link is used thanks to the \verb|@{href}| rule. \begin{verbatim} \LXMLRule{h1}|{\section{%s}| \LXMLRule{a[href]}|\href{@{href}}{%s}| \end{verbatim} \subsection{Content Transformation} \begin{verbatim} \LXMLSnippet[]{} \LXMLSnippet*[]{} \end{verbatim} \noindent The \verb|\LXMLSnippet| command processes a code snippet as XML or HTML. Use the starred variant for HTML input. The \texttt{} argument specifies the transformer object to apply (default is used if empty). The code to be transformed is passed in the second argument. \medskip \noindent{XML snippet transformation:} \begin{verbatim} \LXMLRule[xmlsnippet]{title}|title: %s| \LXMLSnippet{Hello} \end{verbatim} \begin{framed} \LXMLRule[xmlsnippet]{title}|title: %s| \LXMLSnippet[xmlsnippet]{Hello} \end{framed} \noindent{HTML snippet transformation:} \begin{verbatim} \LXMLRule[htmlsnippet]{h1}|title: %s| \LXMLSnippet*[htmlsnippet]{
Header
} \end{verbatim} \begin{framed} \LXMLRule[htmlsnippet]{h1}|title: %s| \LXMLSnippet*[htmlsnippet]{
Header
} \end{framed} \vtop\bgroup \begin{verbatim} \LXMLInputFile[]{} \LXMLInputFile*[]{} \end{verbatim} \noindent Processes a file as XML or HTML. Use the starred variant for HTML input. The \texttt{} specifies the transformer object to apply (default is used if empty). The file path is passed in the second argument. \egroup \noindent\textbf{Environments} \medskip \noindent \textbf{\texttt{\textbackslash begin\{LXMLCode\}\{\}} ... \texttt{\textbackslash end\{LXMLCode\}}} \noindent Processes XML code inside the environment. The \texttt{} specifies the transformer object to apply (default is used if empty). \begin{verbatim} \LXMLRule[xmlenv]{element}|hello: %s| \begin{LXMLCode}{xmlenv} Some content \end{LXMLCode} \end{verbatim} \begin{framed} \LXMLRule[xmlenv]{element}|hello: %s| \begin{LXMLCode}{xmlenv} Some content \end{LXMLCode} \end{framed} \medskip \noindent\textbf{\texttt{\textbackslash begin\{LXMLCode*\}\{\}} ... \texttt{\textbackslash end\{LXMLCode*\}}} \noindent Processes HTML code inside the environment. The \texttt{} specifies the transformer object to apply (default is used if empty). \begin{verbatim} \LXMLRule[htmlenv]{p}|paragraph: %s| \begin{LXMLCode*}{htmlenv}

Some HTML content

\end{LXMLCode*} \end{verbatim} \begin{framed} \LXMLRule[htmlenv]{p}|paragraph: %s| \begin{LXMLCode*}{htmlenv}

Some HTML content

\end{LXMLCode*} \end{framed} \clearpage \section{The API documentation} \input{doc/api.tex} \section{Low-level functions usage} % The processing is done with several handlers, their usage will be shown in the % following section. Full description of handlers is given in the original % documentation in section \ref{sec:handlers}. % \subsection{Usage examples} The original |LuaXML| library provides some low-level functions for |XML| handling. First of all, we need to load the libraries: \begin{verbatim} xml = require('luaxml-mod-xml') handler = require('luaxml-mod-handler') \end{verbatim} The |luaxml-mod-xml| file contains the xml parser and also the serializer. In |luaxml-mod-handler|, various handlers for dealing with xml data are defined. Handlers transforms the |xml| file to data structures which can be handled from the Lua code. More information about handlers can be found in the original documentation, section \ref{sec:handlers}. \subsection{The simpleTreeHandler} \begin{verbatim} sample = [[ hello world. another ]] treehandler = handler.simpleTreeHandler() x = xml.xmlParser(treehandler) x:parse(sample) \end{verbatim} You have to create handler object, using |handler.simpleTreeHandler()| and xml parser object using |xml.xmlParser(handler object)|. |simpleTreehandler| creates simple table hierarchy, with top root node in |treehandler.root| \begin{verbatim} -- pretty printing function function printable(tb, level) level = level or 1 local spaces = string.rep(' ', level*2) for k,v in pairs(tb) do if type(v) ~= "table" then print(spaces .. k..'='..v) else print(spaces .. k) level = level + 1 printable(v, level) end end end -- print table printable(treehandler.root) -- print xml serialization of table print(xml.serialize(treehandler.root)) -- direct access to the element print(treehandler.root["a"]["b"][1]) \end{verbatim} This code produces the following output: \begin{verbatim} output: a d=hello b 1=world. 2 1=another _attr at=Hi hello world. another world. \end{verbatim} First part is pretty-printed dump of Lua table structure contained in the handler, the second part is |xml| serialized from that table and the last part demonstrates direct access to particular elements. Note that |simpleTreeHandler| creates tables that can be easily accessed using standard lua functions, but if the xml document is of mixed-content type\footnote{% This means that element may contain both children elements and text.}: \begin{verbatim} hello world \end{verbatim} \noindent then it produces wrong results. It is useful mostly for data |xml| files, not for text formats like |xhtml|. \subsection{The domHandler} % For complex xml documents with mixed content, |domHandler| is capable of representing any valid XML document: For complex xml documents, it is best to use the |domHandler|, which creates object which contains all information from the |xml| document. \begin{verbatim} -- file dom-sample.lua -- next line enables scripts called with texlua to use luatex libraries --kpse.set_program_name("luatex") function traverseDom(current,level) local level = level or 0 local spaces = string.rep(" ",level) local root= current or current.root local name = root._name or "unnamed" local xtype = root._type or "untyped" local attributes = root._attr or {} if xtype == "TEXT" then print(spaces .."TEXT : " .. root._text) else print(spaces .. xtype .. " : " .. name) end for k, v in pairs(attributes) do print(spaces .. " ".. k.."="..v) end local children = root._children or {} for _, child in ipairs(children) do traverseDom(child, level + 1) end end local xml = require('luaxml-mod-xml') local handler = require('luaxml-mod-handler') local x = '
hello world, how are you?
' local domHandler = handler.domHandler() local parser = xml.xmlParser(domHandler) parser:parse(x) traverseDom(domHandler.root) \end{verbatim} The ROOT element is stored in |domHandler.root| table, it's child nodes are stored in |_children| tables. Node type is saved in |_type| field, if the node type is |ELEMENT|, then |_name| field contains element name, |_attr| table contains element attributes. |TEXT| node contains text content in |_text| field. The previous code produces following output in the terminal: % after command % |texlua dom-sample.lua| running: \begin{verbatim} ROOT : unnamed ELEMENT : p TEXT : hello ELEMENT : a href=http://world.com/ TEXT : world TEXT : , how are you? \end{verbatim} % With \verb|domHandler|, you can process documents with mixed content, like % \verb|xhtml|, so it is a most powerful handler. \clearpage \part{Original \texttt{LuaXML} documentation by Paul Chakravarti} \medskip \noindent This document was created automatically from the original source code comments using Pandoc\footnote{\url{http://johnmacfarlane.net/pandoc/}} \section{Overview} This module provides a non-validating XML stream parser in Lua. \section{Features} \begin{itemize} \item Tokenises well-formed XML (relatively robustly) \item Flexible handler based event api (see below) \item Parses all XML Infoset elements - ie. \begin{itemize} \item Tags \item Text \item Comments \item CDATA \item XML Decl \item Processing Instructions \item DOCTYPE declarations \end{itemize} \item Provides limited well-formedness checking (checks for basic syntax \& balanced tags only) \item Flexible whitespace handling (selectable) \item Entity Handling (selectable) \end{itemize} \section{Limitations} \begin{itemize} \item Non-validating \item No charset handling \item No namespace support \item Shallow well-formedness checking only (fails to detect most semantic errors) \end{itemize} \section{API} The parser provides a partially object-oriented API with functionality split into tokeniser and hanlder components. The handler instance is passed to the tokeniser and receives callbacks for each XML element processed (if a suitable handler function is defined). The API is conceptually similar to the SAX API but implemented differently. The following events are generated by the tokeniser \begin{verbatim} handler:starttag - Start Tag handler:endtag - End Tag handler:text - Text handler:decl - XML Declaration handler:pi - Processing Instruction handler:comment - Comment handler:dtd - DOCTYPE definition handler:cdata - CDATA \end{verbatim} The function prototype for all the callback functions is \begin{verbatim} callback(val,attrs,start,end) \end{verbatim} where attrs is a table and val/attrs are overloaded for specific callbacks - ie. \begin{tabular}{llp{5cm}} Callback & val & attrs (table)\\ \hline starttag & name & |{ attributes (name=val).. }|\\ endtag & name & nil\\ text & || & nil\\ cdata & | | & nil\\ decl & "xml" & |{ attributes (name=val).. }|\\ pi & pi name & \begin{verbatim}{ attributes (if present).. _text = }\end{verbatim}\\ comment & || & nil\\ dtd & root element & \begin{verbatim}{ _root = , _type = SYSTEM|PUBLIC, _name = , _uri = , _internal = }\end{verbatim}\\ \end{tabular} (starttag \& endtag provide the character positions of the start/end of the element) XML data is passed to the parser instance through the `parse' method (Note: must be passed as single string currently) \section{Options} Parser options are controlled through the `self.options' table. Available options are - \begin{itemize} \item stripWS Strip non-significant whitespace (leading/trailing) and do not generate events for empty text elements \item expandEntities Expand entities (standard entities + single char numeric entities only currently - could be extended at runtime if suitable DTD parser added elements to table (see obj.\_ENTITIES). May also be possible to expand multibyre entities for UTF--8 only \item errorHandler Custom error handler function \end{itemize} NOTE: Boolean options must be set to `nil' not `0' \section{Usage} Create a handler instance - \begin{verbatim} h = { starttag = function(t,a,s,e) .... end, endtag = function(t,a,s,e) .... end, text = function(t,a,s,e) .... end, cdata = text } \end{verbatim} (or use predefined handler - see luaxml-mod-handler.lua) Create parser instance - \begin{verbatim} p = xmlParser(h) \end{verbatim} Set options - \begin{verbatim} p.options.xxxx = nil \end{verbatim} Parse XML data - \begin{verbatim} xmlParser:parse(", _type = ROOT|ELEMENT|TEXT|COMMENT|PI|DECL|DTD, _attr = { Node attributes - see callback API }, _parent = _children = { List of child nodes - ROOT/NODE only } } \end{verbatim} \subsubsection{simpleTreeHandler} simpleTreeHandler is a simplified handler which attempts to generate a more `natural' table based structure which supports many common XML formats. The XML tree structure is mapped directly into a recursive table structure with node names as keys and child elements as either a table of values or directly as a string value for text. Where there is only a single child element this is inserted as a named key - if there are multiple elements these are inserted as a vector (in some cases it may be preferable to always insert elements as a vector which can be specified on a per element basis in the options). Attributes are inserted as a child element with a key of `\_attr'. Only Tag/Text \& CDATA elements are processed - all others are ignored. This format has some limitations - primarily \begin{itemize} \item Mixed-Content behaves unpredictably - the relationship between text elements and embedded tags is lost and multiple levels of mixed content does not work \item If a leaf element has both a text element and attributes then the text must be accessed through a vector (to provide a container for the attribute) \end{itemize} In general however this format is relatively useful. \subsection{Options} \begin{verbatim} simpleTreeHandler.options.noReduce = { = bool,.. } - Nodes not to reduce children vector even if only one child domHandler.options.(comment|pi|dtd|decl)Node = bool - Include/exclude given node types \end{verbatim} \subsection{Usage} Pased as delegate in xmlParser constructor and called as callback by xmlParser:parse(xml) method. \section{History} This library is fork of LuaXML library originaly created by Paul Chakravarti. Some files not needed for use with luatex were droped from the distribution. Documentation was converted from original comments in the source code. \section{License} This code is freely distributable under the terms of the Lua license (\url{http://www.lua.org/copyright.html}) \end{document}