\documentclass{ltxdoc}
% \usepackage{tgschola,url}
\usepackage{url}
\usepackage[english]{babel}
\usepackage{hyperref}
\usepackage{luacode}
\usepackage{framed}
% Version is defined in the makefile, use default values when compiled directly
\ifdefined\version\else
\def\version{v0.2b}
\let\gitdate\date
\fi
\newcommand\modulename[1]{\subsection{#1}\label{sec:#1}}
\newcommand\modulesummary[1]{#1\\}
\newcommand\moduleclass[1]{\subsubsection{Class: #1}}
\newcommand\functionname[2]{\par\noindent\textbf{#1(#2)}\\}
\newcommand\functionsummary[1]{#1\\\textbf{Parameters:}\\}
\newcommand\functionparam[2]{\texttt{#1}: #2\\}
\newcommand\functionreturn[1]{\textbf{Return: }\\#1\\}
\usepackage[default]{luaxml}
\begin{document}
\title{The \textsc{LuaXML} library}
\author{Paul Chakravarti \and Michal Hoftich}
\date{Version \version\\\gitdate}
\maketitle
\tableofcontents
\section{Introduction}

|LuaXML| is a pure Lua library for processing and serializing |xml| files.
The base code was written by Paul Chakravarti; minor changes have since added
support for Lua 5.3 and HTML 5. On top of that, new modules for accessing
|xml| files using |DOM|-like methods or |CSS| selectors\footnote{Thanks to Leaf Corcoran for
|CSS selector| parsing code.} have been added.

The documentation is divided into three parts: the first part deals with the |DOM|
library, the second part describes the low-level libraries, and the third part is the
original documentation by Paul Chakravarti.

% Current release is aimed mainly as support for the odsfile package.
% In first release it was included with the odsfile package,
% but as it is general library which can be used also with other packages,
% I decided to distribute it as separate library.

\section{The \texttt{DOM\_Object} library}

This library can process |xml| sources using |DOM|-like functions.
To load it, you need to require the |luaxml-domobject.lua| file.
The |parse| function provided by the library creates a \texttt{DOM\_Object} object,
which provides several methods for processing the |xml| tree.

\begin{verbatim}
local dom = require "luaxml-domobject"
local document = [[
hello
]]

-- dom.parse returns the DOM_Object
local obj = dom.parse(document)
-- it is possible to call methods on the object
local root_node = obj:root_node()
for _, x in ipairs(root_node:get_children()) do
  print(x:get_element_name())
end
\end{verbatim}

The details about available methods can be found in the API docs, section \ref{sec:luaxml-domobject}.
The above code loads an |xml| document, gets the ROOT element, and prints the
names of all its child elements.
The \verb|DOM_Object:get_children| function returns a Lua table, so it is possible to
loop over it using standard table functions.

\begin{framed}
\begin{luacode*}
dom = require "luaxml-domobject"
local document = [[hello
]]

-- dom.parse returns the DOM_Object
obj = dom.parse(document)
-- it is possible to call methods on the object
local root_node = obj:root_node()
for _, x in ipairs(root_node:get_children()) do
  tex.print(x:get_element_name().. "\\par")
end
\end{luacode*}
\end{framed}

\subsection{HTML parsing}

You can parse HTML documents using the \verb|DOM_Object.html_parse| function. This parser is slower than
the default XML parser, but it can load files that would cause errors in the XML mode.
It can handle wrongly nested HTML tags, inline JavaScript and CSS styles, and other HTML features
that would cause XML errors.

\begin{verbatim}
dom = require "luaxml-domobject"
local document = [[hello
another paragraph
hello
another paragraph
hello
]]

local tree = dom.html_parse(document)
local p = tree:query_selector("p")[1]

-- insert inner_html as XML
p:inner_html("hello this should be the new content")
print(tree:serialize())
\end{verbatim}

In this example, we replace the contents of the first \verb|<p>| element with new content.

\begin{framed}
\ttfamily
\begin{luacode*}
local document = [[
hello
]]

local tree = dom.html_parse(document)
local p = tree:query_selector("p")[1]

-- insert inner_html as XML
p:inner_html("hello this should be the new content")
tex.print(tree:serialize())
\end{luacode*}
\end{framed}

There are more variants of the raw string methods. Instead of replacing the contents of
the element, they add the new content at specific places:

\begin{description}
  \item[\texttt{DOM\_Object:insert\_before\_begin}] -- before the element.
  \item[\texttt{DOM\_Object:insert\_after\_begin}] -- just inside the element, before its first child.
  \item[\texttt{DOM\_Object:insert\_before\_end}] -- just inside the element, after its last child.
  \item[\texttt{DOM\_Object:insert\_after\_end}] -- after the element.
\end{description}

\section{The \texttt{CssQuery} library}
\label{sec:cssquery_library}

This library serves mainly as support for the
\texttt{DOM\_Object:query\_selector} function. It also supports adding
information to the DOM tree.

\subsection{Example usage}

\begin{verbatim}
local cssobj = require "luaxml-cssquery"
local domobj = require "luaxml-domobject"

local xmltext = [[Some text, italics
]]

local dom = domobj.parse(xmltext)
local css = cssobj()

css:add_selector("h1", function(obj)
  print("header found: " .. obj:get_text())
end)

css:add_selector("p", function(obj)
  print("paragraph found: " .. obj:get_text())
end)

css:add_selector("i", function(obj)
  print("found italics: " .. obj:get_text())
end)

dom:traverse_elements(function(el)
  -- find selectors that match the current element
  local querylist = css:match_querylist(el)
  -- add templates to the element
  css:apply_querylist(el,querylist)
end)
\end{verbatim}

\begin{framed}
\begin{luacode*}
local cssobj = require "luaxml-cssquery"
local domobj = require "luaxml-domobject"
local print = function(s) tex.print(s .. "\\par") end

local xmltext = [[Some text, italics
]]

local dom = domobj.parse(xmltext)
local css = cssobj()

css:add_selector("h1", function(obj)
  print("header found: " .. obj:get_text())
end)

css:add_selector("p", function(obj)
  print("paragraph found: " .. obj:get_text())
end)

css:add_selector("i", function(obj)
  print("found italics: " .. obj:get_text())
end)

dom:traverse_elements(function(el)
  -- find selectors that match the current element
  local querylist = css:match_querylist(el)
  -- add templates to the element
  css:apply_querylist(el,querylist)
end)
\end{luacode*}
\end{framed}

A more complete example may be found in the \texttt{examples} directory in the
\texttt{LuaXML} source code
repository\footnote{\url{https://github.com/michal-h21/LuaXML/blob/master/examples/xmltotex.lua}}.

\section{The \texttt{luaxml-transform} library}

This library is still somewhat experimental. It enables XML transformation based on
CSS selector templates. It is not nearly as powerful as XSLT, but it may suffice for
simpler tasks.

\subsection{Basic example}

\begin{verbatim}
local transform = require "luaxml-transform"
local transformer = transform.new()
local xml_text = [[Hello world and some text in italics.
\end{LXMLCode*} \end{verbatim} \begin{framed} \begin{LXMLCode*}{html}Hello world and some text in italics.
\end{LXMLCode*}
\end{framed}

\subsection{Example of Transformation Using \LaTeX\ Commands}

\begin{verbatim}
\LXMLRule[sample]{h1}|\par\noindent{\large\bfseries %s\par}|
\LXMLRule[sample]{p}|%s\par|
\LXMLRule[sample]{a[href]}|\href{@{href}}{%s}|

%% process HTML code
\begin{LXMLCode*}{sample}Here is a link to TeX.sx
\end{LXMLCode*}
\end{verbatim}

\begin{framed}
\LXMLRule[sample]{h1}|\par\noindent{\large\bfseries %s\par}|
\LXMLRule[sample]{p}|%s\par|
\LXMLRule[sample]{a[href]}|\href{@{href}}{%s}|
% process HTML code
\begin{LXMLCode*}{sample}Here is a link to TeX.sx
\end{LXMLCode*}
\end{framed}

\subsection{Declaring Transformation Rules}

\begin{verbatim} \LXMLRule[| element, use the \verb|verbatim| option:

\begin{verbatim}
\LXMLRule[verbatim]{pre}|\begin{verbatim}
%s
\end{verbatim}
% trick to print \end{verbatim}|
\verb+\end{verbatim}|+

\bigskip
The \texttt{transformation rule} must be delimited by a pair of characters that
are not used in the text of the rule. We use \verb+|+ in our examples, but you
can use other characters if you like. This is similar to how the \verb|\verb| command works.
You can use the syntax shown in section~\ref{sec:transform-templates} (page~\pageref{sec:transform-templates}).

The following code defines a rule that transforms the \verb|<h1>| element to a
\verb|\section| command, and an \verb|<a>| element that has a \verb|href| attribute to
\verb|\href|. The URL of the link is taken from the attribute via \verb|@{href}|.

\begin{verbatim}
\LXMLRule{h1}|\section{%s}|
\LXMLRule{a[href]}|\href{@{href}}{%s}|
\end{verbatim}

\subsection{Content Transformation}

\begin{verbatim}
\LXMLSnippet[
]{ } \LXMLSnippet*[ ]{} \end{verbatim} \noindent The \verb|\LXMLSnippet| command processes a code snippet as XML or HTML. Use the starred variant for HTML input. The \texttt{ } argument specifies the transformer object to apply (default is used if empty). The code to be transformed is passed in the second argument. \medskip \noindent{XML snippet transformation:} \begin{verbatim} \LXMLRule[xmlsnippet]{title}|title: %s| \LXMLSnippet{ } \end{verbatim} \begin{framed} \LXMLRule[xmlsnippet]{title}|title: %s| \LXMLSnippet[xmlsnippet]{ Hello } \end{framed} \noindent{HTML snippet transformation:} \begin{verbatim} \LXMLRule[htmlsnippet]{h1}|title: %s| \LXMLSnippet*[htmlsnippet]{ Hello } \end{verbatim} \begin{framed} \LXMLRule[htmlsnippet]{h1}|title: %s| \LXMLSnippet*[htmlsnippet]{Header
} \end{framed} \vtop\bgroup \begin{verbatim} \LXMLInputFile[Header
]{ } \LXMLInputFile*[ ]{} \end{verbatim} \noindent Processes a file as XML or HTML. Use the starred variant for HTML input. The \texttt{ } specifies the transformer object to apply (default is used if empty). The file path is passed in the second argument. \egroup \noindent\textbf{Environments} \medskip \noindent \textbf{\texttt{\textbackslash begin\{LXMLCode\}\{ \}} ... \texttt{\textbackslash end\{LXMLCode\}}} \noindent Processes XML code inside the environment. The \texttt{ } specifies the transformer object to apply (default is used if empty). \begin{verbatim} \LXMLRule[xmlenv]{element}|hello: %s| \begin{LXMLCode}{xmlenv} \end{LXMLCode} \end{verbatim} \begin{framed} \LXMLRule[xmlenv]{element}|hello: %s| \begin{LXMLCode}{xmlenv} Some content \end{LXMLCode} \end{framed} \medskip \noindent\textbf{\texttt{\textbackslash begin\{LXMLCode*\}\{ Some content \}} ... \texttt{\textbackslash end\{LXMLCode*\}}} \noindent Processes HTML code inside the environment. The \texttt{ } specifies the transformer object to apply (default is used if empty). \begin{verbatim} \LXMLRule[htmlenv]{p}|paragraph: %s| \begin{LXMLCode*}{htmlenv} \end{LXMLCode*} \end{verbatim} \begin{framed} \LXMLRule[htmlenv]{p}|paragraph: %s| \begin{LXMLCode*}{htmlenv}Some HTML content
\end{LXMLCode*}
\end{framed}

\clearpage
\section{The API documentation}
\input{doc/api.tex}

\section{Low-level functions usage}

% The processing is done with several handlers, their usage will be shown in the
% following section. Full description of handlers is given in the original
% documentation in section \ref{sec:handlers}.

% \subsection{Usage examples}

The original |LuaXML| library provides some low-level functions for |XML| handling.
First of all, we need to load the libraries:

\begin{verbatim}
xml = require('luaxml-mod-xml')
handler = require('luaxml-mod-handler')
\end{verbatim}

The |luaxml-mod-xml| file contains the xml parser and also the serializer. In
|luaxml-mod-handler|, various handlers for dealing with xml data are defined.
Handlers transform the |xml| file into data structures which can be handled from
Lua code. More information about handlers can be found in the original
documentation, section \ref{sec:handlers}.

\subsection{The simpleTreeHandler}

\begin{verbatim}
sample = [[
<a>
  <d>hello</d>
  <b>world.</b>
  <b at="Hi">another</b>
</a>]]
treehandler = handler.simpleTreeHandler()
x = xml.xmlParser(treehandler)
x:parse(sample)
\end{verbatim}

You have to create a handler object using |handler.simpleTreeHandler()| and an xml parser
object using |xml.xmlParser(handler object)|. |simpleTreeHandler| creates a simple
table hierarchy, with the top root node in |treehandler.root|.

\begin{verbatim}
-- pretty printing function
function printable(tb, level)
  level = level or 1
  local spaces = string.rep(' ', level*2)
  for k,v in pairs(tb) do
    if type(v) ~= "table" then
      print(spaces .. k..'='..v)
    else
      print(spaces .. k)
      -- recurse one level deeper without changing this level's indentation
      printable(v, level + 1)
    end
  end
end

-- print table
printable(treehandler.root)
-- print xml serialization of table
print(xml.serialize(treehandler.root))
-- direct access to the element
print(treehandler.root["a"]["b"][1])
\end{verbatim}

This code produces the following output:

\begin{verbatim}
output:
  a
    d=hello
    b
      1=world.
      2
        1=another
        _attr
          at=Hi
<a>
  <d>hello</d>
  <b>world.</b>
  <b at="Hi">
    another
  </b>
</a>
world.
\end{verbatim}

The first part is a pretty-printed dump of the Lua table structure contained in the handler, the
second part is the |xml| serialized from that table, and the last part demonstrates
direct access to particular elements.

Note that |simpleTreeHandler| creates tables that can be easily accessed using
standard Lua functions, but if the xml document is of mixed-content type\footnote{%
This means that an element may contain both child elements and text.}:

\begin{verbatim}
<a>hello
  <b>world</b>
</a>
\end{verbatim}

\noindent then it produces wrong results. It is useful mostly for data |xml| files, not for
text formats like |xhtml|.

\subsection{The domHandler}

% For complex xml documents with mixed content, |domHandler| is capable of representing any valid XML document:
For complex xml documents, it is best to use the |domHandler|, which creates an object
that contains all information from the |xml| document.
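
The tree built by |domHandler| consists of plain Lua tables, so it can be processed with
ordinary Lua code. The following minimal sketch does not call the parser at all: it builds a
small tree by hand, assuming the node fields |_type|, |_name|, |_attr|, |_children| and |_text|
that |domHandler| uses, and walks it with a helper named |collect_text|. That helper is
hypothetical and is not part of |LuaXML|.

\begin{verbatim}
-- hand-built stand-in for a tree produced by domHandler (assumed layout)
local tree = {
  _type = "ROOT",
  _children = {
    { _type = "ELEMENT", _name = "p", _attr = {},
      _children = {
        { _type = "TEXT", _text = "hello " },
        { _type = "ELEMENT", _name = "a",
          _attr = { href = "http://world.com/" },
          _children = { { _type = "TEXT", _text = "world" } } },
      } },
  },
}

-- collect_text is a hypothetical helper, not part of LuaXML:
-- it concatenates the text of all TEXT nodes below the given node
local function collect_text(node)
  if node._type == "TEXT" then
    return node._text
  end
  local parts = {}
  for _, child in ipairs(node._children or {}) do
    parts[#parts + 1] = collect_text(child)
  end
  return table.concat(parts)
end

print(collect_text(tree)) -- prints: hello world
\end{verbatim}

With a real document, the hand-built table should be replaceable by |domHandler.root|
once the parser has run.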
\begin{verbatim}
-- file dom-sample.lua
-- next line enables scripts called with texlua to use luatex libraries
--kpse.set_program_name("luatex")
function traverseDom(current,level)
  local level = level or 0
  local spaces = string.rep(" ",level)
  local root= current or current.root
  local name = root._name or "unnamed"
  local xtype = root._type or "untyped"
  local attributes = root._attr or {}
  if xtype == "TEXT" then
    print(spaces .."TEXT : " .. root._text)
  else
    print(spaces .. xtype .. " : " .. name)
  end
  for k, v in pairs(attributes) do
    print(spaces .. " ".. k.."="..v)
  end
  local children = root._children or {}
  for _, child in ipairs(children) do
    traverseDom(child, level + 1)
  end
end

local xml = require('luaxml-mod-xml')
local handler = require('luaxml-mod-handler')
local x = '<p>hello <a href="http://world.com/">world</a>, how are you?</p>'
local domHandler = handler.domHandler()
local parser = xml.xmlParser(domHandler)
parser:parse(x)
traverseDom(domHandler.root)
\end{verbatim}

The ROOT element is stored in the |domHandler.root| table, and its child nodes are stored in |_children| tables.
The node type is saved in the |_type| field. If the node type is |ELEMENT|, then the |_name| field contains the element
name and the |_attr| table contains the element attributes. A |TEXT| node contains its text content in the |_text| field.

The previous code produces the following output in the terminal: % after command
% |texlua dom-sample.lua| running:

\begin{verbatim}
ROOT : unnamed
 ELEMENT : p
  TEXT : hello
  ELEMENT : a
   href=http://world.com/
   TEXT : world
  TEXT : , how are you?
\end{verbatim}

% With \verb|domHandler|, you can process documents with mixed content, like
% \verb|xhtml|, so it is a most powerful handler.

\clearpage
\part{Original \texttt{LuaXML} documentation by Paul Chakravarti}
\medskip
\noindent This document was created automatically from the original source code comments using Pandoc\footnote{\url{http://johnmacfarlane.net/pandoc/}}.

\section{Overview}

This module provides a non-validating XML stream parser in Lua.

\section{Features}

\begin{itemize}
\item Tokenises well-formed XML (relatively robustly)
\item Flexible handler based event api (see below)
\item Parses all XML Infoset elements - ie.
\begin{itemize}
\item Tags
\item Text
\item Comments
\item CDATA
\item XML Decl
\item Processing Instructions
\item DOCTYPE declarations
\end{itemize}
\item Provides limited well-formedness checking (checks for basic syntax \& balanced tags only)
\item Flexible whitespace handling (selectable)
\item Entity Handling (selectable)
\end{itemize}

\section{Limitations}

\begin{itemize}
\item Non-validating
\item No charset handling
\item No namespace support
\item Shallow well-formedness checking only (fails to detect most semantic errors)
\end{itemize}

\section{API}

The parser provides a partially object-oriented API with functionality split into tokeniser and handler components.

The handler instance is passed to the tokeniser and receives callbacks for each XML element processed (if a suitable handler function is defined). The API is conceptually similar to the SAX API but implemented differently.

The following events are generated by the tokeniser:

\begin{verbatim}
handler:starttag   - Start Tag
handler:endtag     - End Tag
handler:text       - Text
handler:decl       - XML Declaration
handler:pi         - Processing Instruction
handler:comment    - Comment
handler:dtd        - DOCTYPE definition
handler:cdata      - CDATA
\end{verbatim}

The function prototype for all the callback functions is

\begin{verbatim}
callback(val,attrs,start,end)
\end{verbatim}

where attrs is a table and val/attrs are overloaded for specific callbacks - ie.

\begin{tabular}{llp{5cm}}
Callback & val & attrs (table)\\
\hline
starttag & name & |{ attributes (name=val).. }|\\
endtag & name & nil\\
text & || & nil\\
cdata & | | & nil\\
decl & "xml" & |{ attributes (name=val).. }|\\
pi & pi name & \begin{verbatim}{ attributes (if present).. 
_text = }\end{verbatim}\\
comment & | | & nil\\
dtd & root element & \begin{verbatim}{ _root = , _type = SYSTEM|PUBLIC, _name = , _uri = , _internal = }\end{verbatim}\\
\end{tabular}

(starttag \& endtag provide the character positions of the start/end of the element)

XML data is passed to the parser instance through the `parse' method (Note: must be passed as a single string currently).

\section{Options}

Parser options are controlled through the `self.options' table. Available options are -

\begin{itemize}
\item stripWS

Strip non-significant whitespace (leading/trailing) and do not generate events for empty text elements

\item expandEntities

Expand entities (standard entities and single-character numeric entities only, currently). The set could be extended at runtime if a suitable DTD parser added elements to the table (see obj.\_ENTITIES). It may also be possible to expand multibyte entities for UTF--8 only.

\item errorHandler

Custom error handler function
\end{itemize}

NOTE: Boolean options must be set to `nil' not `0'

\section{Usage}

Create a handler instance -

\begin{verbatim}
h = { starttag = function(t,a,s,e) .... end,
      endtag = function(t,a,s,e) .... end,
      text = function(t,a,s,e) .... end,
      cdata = text }
\end{verbatim}

(or use predefined handler - see luaxml-mod-handler.lua)

Create parser instance -

\begin{verbatim}
p = xmlParser(h)
\end{verbatim}

Set options -

\begin{verbatim}
p.options.xxxx = nil
\end{verbatim}

Parse XML data -

\begin{verbatim}
xmlParser:parse(", _type = ROOT|ELEMENT|TEXT|COMMENT|PI|DECL|DTD, _attr = { Node attributes - see callback API }, _parent = _children = { List of child nodes - ROOT/NODE only } }
\end{verbatim}

\subsubsection{simpleTreeHandler}

simpleTreeHandler is a simplified handler which attempts to generate a more `natural' table based structure which supports many common XML formats.
The XML tree structure is mapped directly into a recursive table structure with node names as keys and child elements as either a table of values or directly as a string value for text. Where there is only a single child element this is inserted as a named key - if there are multiple elements these are inserted as a vector (in some cases it may be preferable to always insert elements as a vector which can be specified on a per element basis in the options). Attributes are inserted as a child element with a key of `\_attr'.

Only Tag/Text \& CDATA elements are processed - all others are ignored.

This format has some limitations - primarily

\begin{itemize}
\item Mixed-Content behaves unpredictably - the relationship between text elements and embedded tags is lost and multiple levels of mixed content do not work
\item If a leaf element has both a text element and attributes then the text must be accessed through a vector (to provide a container for the attribute)
\end{itemize}

In general however this format is relatively useful.

\subsection{Options}

\begin{verbatim}
simpleTreeHandler.options.noReduce = { = bool,.. }

        - Nodes not to reduce children vector even if only
          one child

domHandler.options.(comment|pi|dtd|decl)Node = bool

        - Include/exclude given node types
\end{verbatim}

\subsection{Usage}

Passed as delegate in xmlParser constructor and called as callback by xmlParser:parse(xml) method.

\section{History}

This library is a fork of the LuaXML library originally created by Paul Chakravarti.
Some files not needed for use with luatex were dropped from the distribution.
Documentation was converted from original comments in the source code.

\section{License}

This code is freely distributable under the terms of the Lua license
(\url{http://www.lua.org/copyright.html}).

\end{document}