Xerces.pm: The Perl API to the Apache Xerces XML parser

       $Id: README,v 1.19 2002/04/25 06:05:04 jasons Exp $

LEGAL HOOP JUMPING:
===================
This code is distributed under the terms of the Apache Software
License, Version 1.1. See the file LICENSE for details.

DESCRIPTION:
============
XML::Xerces is the Perl API to the Apache project's Xerces XML
parser. It is implemented using the Xerces C++ API, and it provides
access to *most* of the C++ API from Perl.

Because it is based on the Xerces-C parser, XML::Xerces provides a
validating XML parser written in a portable subset of C++. Xerces-C
makes it easy to give your application the ability to read and write
XML data. A shared library is provided for parsing, generating,
manipulating, and validating XML documents. Xerces-C is faithful to
the XML 1.0 recommendation and associated standards (DOM 1.0, DOM
2.0, SAX 1.0, SAX 2.0, Namespaces, and partial support for W3C XML
Schema). The parser provides high performance, modularity, and
scalability. It also provides full support for Unicode.

XML::Xerces implements the vast majority of the Xerces-C API (if you
notice any discrepancies, please mail the list,
xerces-p-dev@xml.apache.org). The exception is that some functions in
the C++ API which have been overloaded to accept different arguments
may currently have only a single version in the Perl API. This is a
simple fix and most of the overloaded functions are finished, but it
will take time to catch them all. Also, there are
some functions in the C++ API which either have better Perl
counterparts (such as file I/O) or which manipulate internal
C++ information that serves no useful role in the Perl module.

The majority of the API is created automatically using the amazing,
wonderful Simplified Wrapper and Interface Generator (SWIG,
http://www.swig.org/). Care has been taken to make most method
invocations natural to Perl programmers, so a number of rough C++
edges have been smoothed over (see the 'Special Perl API Features'
section).

AVAILABLE PLATFORMS:
====================

See the INSTALL file for a list of supported platforms.
  
BUILD REQUIREMENTS:
===================

1.  An ANSI C++ compiler.  Builds are known to work with the GNU
    compiler.  Ports to other compilers such as MSVC++ (the Microsoft
    Visual C++ compiler and development environment) are in the works.
    Contributions in this area are always welcome :-).

2.  Perl5 

    ### NOTE ####

      Required version: 5.6.0
  
      XML::Xerces now supports Unicode. Since Unicode support wasn't
      added to Perl until 5.6.0, you will need to upgrade in order to
      use this and future versions of XML::Xerces. Upgrading to the
      latest stable release, 5.6.1, is recommended, but if you
      already have 5.6.0 installed it will work fine.
  
      If you plan on using Unicode, I *strongly* recommend upgrading to
      Perl-5.7.2, the latest development version. There have been
      significant improvements to Perl's Unicode support.

    ### NOTE ####
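
The version requirement above can be checked from Perl itself. This is
only a sketch: the helper name unicode_support() is made up for
illustration and is not part of XML::Xerces. It classifies a version
number in the format of Perl's built-in $] variable against the 5.6.0
minimum and the 5.6.1 recommendation given above.

```perl
use strict;

# Hypothetical helper (not part of XML::Xerces): classify a Perl
# version number given in the numeric format of $], e.g. 5.006001.
sub unicode_support {
  my $version = shift;
  return 'too old'     if $version < 5.006;      # before 5.6.0
  return 'recommended' if $version >= 5.006001;  # 5.6.1 or later
  return 'sufficient';                           # exactly 5.6.0
}

# $] holds the version of the running interpreter
print "perl $]: ", unicode_support($]), "\n";
```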

3.  The Apache Xerces C++ XML Parser 

    ### NOTE ####

      Required version: 1.7.0
  
      Available at:
  
         http://xml.apache.org/dist/xerces-c/stable/
  
      Without this version you CANNOT COMPILE XML::Xerces

    ### NOTE ###

    You'll need both the library and header files, and to set up any
    environment variables that will direct the XML::Xerces build to the
    directories where these reside.

OPTIONAL COMPONENTS
===================

1.  SWIG - (Simplified Wrapper and Interface Generator) An open source
    tool by David Beazley of the University of Chicago for
    automatically generating Perl wrappers for C and C++ libraries
    (i.e. *.a or *.so for UNIX, *.dll for Windows).  You can get the
    source from www.swig.org and then build it for your platform.

    ### NOTE ###

      You will only need this if the included Xerces.C and XML::Xerces
      files do not work for your Perl distribution. The pre-generated
      files have been created by SWIG 1.3 and work under perl-5.6.

    ### NOTE ###

   This port will only work with versions 1.3.12 and later of SWIG.

   If you're planning to use SWIG, you can set the environment variable
   SWIG to the full path to the SWIG executable before running 'perl
   Makefile.PL'. For example:
	
	export SWIG=/usr/bin/swig

   This is only necessary if it isn't in your path or you have more
   than one version installed.

PREPARE FOR THE BUILD:
======================

1.  Download the release and its digital signature from
    http://xml.apache.org/dist/xerces-p/stable 

2.  Optionally verify the release using the supplied digital signature
    (see http://xml.apache.org/xerces-p/download.html for details)

3.  Unpack the archive in a directory of your choice.  Example (for
    UNIX):

    tar zxvf XML-Xerces-1.7.x_y.tar.gz
    cd XML-Xerces-1.7.x_y

4.  Examine the Perl script "Makefile.PL".  You shouldn't
    need to change any of the information unless you are
    attempting to build on a platform other than UNIX, in which
    case, you will probably have to.

    Also, you may want to edit the path to the swig executable
    ($SWIG) if you're planning on regenerating Xerces.C and XML::Xerces
    in order to add new features to XML::Xerces.

5.  If the Xerces-C library and header files are installed on your
    system directly, e.g. via an rpm or deb package, proceed to the
    build.

    Otherwise, you must download Xerces-C from xml.apache.org and build
    it.  To build XML::Xerces in this case, make sure the value of your
    XERCESCROOT environment variable is the top-level directory of
    your xerces distribution (i.e. the same value it needs to be to
    build Xerces-C).

    If you have installed Xerces-C on your system, you should only need
    to set the XERCES_INCLUDE and XERCES_LIB environment
    variables. For example:

        export XERCES_INCLUDE=/usr/include/xerces
        export XERCES_LIB=/usr/lib

    If you have built Xerces-C yourself and want to work directly from
    the build directory, then you should only need to set the
    XERCESCROOT environment variable.
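
Before running Makefile.PL it can help to see which of the
configurations above your environment matches. The script below is a
sketch only. The function name xerces_config() is invented for this
example, and the order in which the variables are checked is an
assumption; Makefile.PL may resolve them differently.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical helper (not part of the distribution): report which of
# the configurations described above a set of variables matches.
sub xerces_config {
  my %env = @_;
  # Assumption: an explicit include/lib pair is looked at first.
  if ($env{XERCES_INCLUDE} and $env{XERCES_LIB}) {
    return "installed: headers in $env{XERCES_INCLUDE}, "
         . "library in $env{XERCES_LIB}";
  }
  if ($env{XERCESCROOT}) {
    return "build tree: $env{XERCESCROOT}";
  }
  return "no variables set: Xerces-C must be in a standard system location";
}

# Report on the current environment
print xerces_config(%ENV), "\n";
```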


BUILD XML::Xerces:
==================

1. Go to the XML-Xerces-1.7.x_y directory.

2. Build XML::Xerces as you would any perl package that you might get
   from CPAN: 

    perl Makefile.PL
    make
    make test
    make install

USING XML::Xerces:
==================

XML::Xerces implements the vast majority of the Xerces-C API (if you
notice any discrepancies please mail the list). Documentation for this
API is sadly not available in POD format, but the Xerces-C HTML
documentation is available at:

    http://xml.apache.org/xerces-c/apiDocs/index.html

I agree that this is criminal negligence and I should be flogged for
this. I have recently discovered that doxygen, the documentation
system used by Xerces-C, will output XML. I am planning on transforming
this XML into DocBook and from there into POD. Expect the beginnings
of this as soon as possible.

For more information, see the example scripts in the samples/
directory, or the test scripts located in the t/ directory (especially
the TestUtils.pm module).


Special Perl API Features:
==========================

Even though XML::Xerces is based on the C++ API, it has been modified in
a few ways to make it more accessible to typical Perl usage, primarily
in the handling of:
* strings (XMLCh arrays and Perl strings)
* lists   (DOM_NodeList and Perl lists)
* hashes  (DOM_NamedNodeMap and Perl hashes)
* serialization (DOMParse.pm, for serializing a DOM tree)
* Perl handlers for C++ event callbacks
* C++ exceptions ({XML,DOM,SAX}Exception's)

* DOM vs. IDOM #### Incompatible Change ####
   
Handling of XMLCh Arrays
------------------------

Any functions in the C++ API that return XMLCh arrays will return
vanilla perl-strings in XML::Xerces.  This obviates calls to "transcode"
(in fact, it makes them entirely invalid).

Handling of DOM_NodeList's
--------------------------

Any function that returns a DOM_NodeList in the C++ API
(getChildNodes() and getElementsByTagName(), for example) will return
a different type depending on whether it is called in a list context
or a scalar context. In a scalar context, these functions return a
reference to an XML::Xerces::DOM_NodeList, just like in the C++ API.
However, in a list context they will return a Perl list of
XML::Xerces::DOM_Node references. For example:

  # returns a reference to a XML::Xerces::DOM_NodeList
  my $node_list_ref = $doc->getElementsByTagName('foo');

  # returns a list of XML::Xerces::DOM_Node's
  my @node_list = $doc->getElementsByTagName('foo');
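
For readers unfamiliar with context-sensitive return values, the
following self-contained sketch shows the general Perl technique using
the built-in wantarray(). It does not use XML::Xerces: the class
My::NodeList and the function get_nodes() are hypothetical stand-ins,
and the real module's internals may be implemented differently.

```perl
use strict;

# Hypothetical stand-in for a node list class; NOT the XML::Xerces code.
package My::NodeList;
sub new       { my ($class, @nodes) = @_; return bless [@nodes], $class; }
sub getLength { return scalar @{$_[0]}; }
sub item      { return $_[0][$_[1]]; }

package main;

# Hypothetical stand-in for a method such as getElementsByTagName()
sub get_nodes {
  my $list = My::NodeList->new(qw(foo1 foo2 foo3));
  # wantarray() is true in list context and false in scalar context
  return wantarray ? @$list : $list;
}

my $ref   = get_nodes();   # scalar context: a My::NodeList object
my @nodes = get_nodes();   # list context: a plain Perl list

print ref($ref), " with ", $ref->getLength(), " items\n";
print scalar(@nodes), " nodes in the list\n";
```

The list form is convenient for foreach loops, while the scalar form
keeps the object interface (getLength(), item()) for code ported from
C++.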

Handling of DOM_NamedNodeMap's
------------------------------

Any function that returns a DOM_NamedNodeMap in the C++ API
(getEntities() and getAttributes(), for example) will return a
different type depending on whether it is called in a list context or
a scalar context. In a scalar context, these functions return a
reference to an XML::Xerces::DOM_NamedNodeMap, just like in the C++
API. However, in a list context they will return a Perl hash.

  # returns a reference to a XML::Xerces::DOM_NamedNodeMap
  my $attr_map_ref = $element_node->getAttributes();

  # returns a hash of the attributes
  my %attrs = $element_node->getAttributes();

Using XML::Xerces::DOMParse to print a DOM Tree
-----------------------------------------------

DOMParse.pm implements a generic serializer API for DOM Trees. See the
samples/DOMPrint.pl script for an example of using this API.

For less complex usage, just use the serialize() method defined for
all DOM_Node subclasses.

Implementing {Document,Content,Error}Handlers from Perl
---------------------------------------------------------

Thanks to suggestions from Duncan Cameron, XML::Xerces now has a handler
API that matches the currently used semantics of other Perl XML
APIs. There are three classes available for application writers:
* PerlErrorHandler    (SAX 1/2 and DOM 1)
* PerlDocumentHandler (SAX 1)
* PerlContentHandler  (SAX 2)

Using these classes is as simple as creating a perl subclass of the
needed class, and redefining any needed methods. For example, to
override the default fatal_error() method of the PerlErrorHandler
class we can include this piece of code within our application:

  use XML::Xerces;

  package MyErrorHandler;
  use vars qw(@ISA);
  @ISA = qw(XML::Xerces::PerlErrorHandler);
  # replaces the default handler for fatal parse errors
  sub fatal_error {die "Oops, I got an error\n";}
  
  package main;
  my $dom = XML::Xerces::DOMParser->new();
  $dom->setErrorHandler(MyErrorHandler->new());

Handling exceptions ({XML,DOM,SAX}Exception's)
---------------------------------------------

Some errors occur outside parsing and are not caught by the parser's
ErrorHandler. XML::Xerces provides a way for catching these errors using
Perl's standard eval-based exception mechanism. Any method that can
throw an exception should be wrapped in an eval{...} block, and the
contents of $@ should be checked:

  eval {
    $parser->parse (XML::Xerces::LocalFileInputSource->new($file));
  };
  if ($@) {
    if (ref $@) {
      die $@->getMessage();
    } else {
      die $@;
    }
  }

XML::Xerces will catch C++ exceptions and call die() after setting $@
to the C++ exception object. If ref($@) is true, $@ is an exception
object; if false, it is a standard Perl string. To make this very
common check easier to use, XML::Xerces provides a utility function,
error(), that will do this for you, so the above could be written:

  eval {
    $parser->parse (XML::Xerces::LocalFileInputSource->new($file));
  };
  XML::Xerces::error($@) if $@;
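
The object-or-string check can be tried without a parser. The sketch
below runs on its own and is not the implementation of
XML::Xerces::error(): My::Exception is a hypothetical class standing in
for a wrapped C++ exception, and error_message() is an invented helper
that returns the message instead of dying.

```perl
use strict;

# Hypothetical class standing in for a wrapped C++ exception object
package My::Exception;
sub new        { my ($class, $msg) = @_; return bless {msg => $msg}, $class; }
sub getMessage { return $_[0]{msg}; }

package main;

# Invented helper: turn either kind of $@ into a message string
sub error_message {
  my $err = shift;
  return ref($err) ? $err->getMessage() : $err;
}

# die() with an object: $@ holds the object itself
eval { die My::Exception->new("invalid document\n") };
print "object: ", error_message($@) if $@;

# die() with a string: $@ holds a plain Perl string
eval { die "plain Perl failure\n" };
print "string: ", error_message($@) if $@;
```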

To know which methods are capable of throwing exceptions, check the
Xerces-C API documentation.

DOM vs. IDOM
------------

** Incompatible Change **

Since Xerces-C-1.5 there has been an experimental DOM implementation
(IDOM) that is much more efficient than the old DOM implementation. As
of XML-Xerces-1.7 all DOM methods have been switched to the IDOM
implementation, and the old DOM implementation is no longer available. 

This has made the codebase *much* smaller and more efficient, but there
are some important issues to watch out for, and some code written to
use the old DOM implementation may not work:

* DOM_Node::isNull(): is no longer available. In the old DOM API, you
  could receive a valid DOM_Node that was really just a wrapper for a
  NULL pointer, so before you did anything, you always had to check it
  using the isNull() method. Using the new DOM, you will get undef
  instead of an object, so you would instead check using defined().

* DOMParser::setToCreateXMLDeclTypeNode(): the new DOM API follows the
  W3C specification more closely than the old one did, so this method
  is no longer available.

* DOM_Document::createDocument(): the new DOM API follows the W3C
  specification more closely than the old one did, so this method is
  no longer available. Use DOM_DOMImplementation::createDocument()
  instead.
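
The isNull() change in the first item above is the one most likely to
affect existing code. The sketch below shows the new-style check with
defined(). It does not load XML::Xerces: My::Node is a hypothetical
stand-in whose getFirstChild() returns undef when there is no child,
which is the behavior described above for the new DOM API.

```perl
use strict;

# Hypothetical stand-in node class; NOT an XML::Xerces class
package My::Node;
sub new {
  my ($class, $name, $child) = @_;
  return bless {name => $name, child => $child}, $class;
}
sub getNodeName   { return $_[0]{name}; }
sub getFirstChild { return $_[0]{child}; }   # undef if there is no child

package main;

sub first_child_name {
  my $node  = shift;
  my $child = $node->getFirstChild();
  # old DOM style was:  return 'none' if $child->isNull();
  return defined($child) ? $child->getNodeName() : 'none';
}

my $leaf = My::Node->new('leaf');
my $root = My::Node->new('root', $leaf);

print first_child_name($root), "\n";   # prints "leaf"
print first_child_name($leaf), "\n";   # prints "none"
```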

More examples
-------------

See the applications in samples/ for more details of how to create
perl event handlers.

BUGS
====

Please send the output of 'perl -V' and a description of your problem
to xerces-p-dev@xml.apache.org. Including a *minimal* example script,
XML file, and/or DTD is helpful. The more time you spend making those
files minimal, the more likely we will be able to help solve your
problem.

AUTHORS
=======

  Jason Stewart: Xerces 1.4 through 1.7 ports
  Harmon Nine: Xerces 1.3 DOM port
  Fredrick Paul Eisele: Xerces 1.3 DOM port
  Tom Watson: Xerces 1.1 DOM port

This list is incomplete. If you feel you were left out please send 
a note to the list (xerces-p-dev@xml.apache.org).