Author: | MB and Christopher Diggins(original) |
---|---|
Contact: | mb2act@yahoo.co.jp |
License: | Distributed under the Boost Software License, Version 1.0 |
Version: | 0.9xx |
I was looking for a light and unstrict xml parser. Boost.Spirit and Boost.Xpressive were big and not good at compile-time performance. I suspected that they were not static, but YARD written by Christopher Diggins was really static, small and fast. In time, I noticed that the lazy instantiation of templates could allow us to write recursive grammars, and I found that YARD and the finite state machine found at C++ Template Metaprogramming, could be binded. It was named biscuit.
biscuit is an object-oriented recursive-descent parser generator framework implemented using class templates. The templates allow us to author Extended Backus-Normal Form (EBNF) in C++. Technical informations are available at YARD.
A simple EBNF grammar snippet:
group ::= '(' expression ')' factor ::= integer | group term ::= factor (('*' factor) | ('/' factor))* expression ::= term (('+' term) | ('-' term))*
is approximated using biscuit's facilities as seen in this code snippet:
struct expression ; struct group : seq< str<'('>, expression, str<')'> > { }; struct factor : or_< integer, group > { }; struct term : seq< factor, star< or_< seq< str<'*'>, factor >, seq< str<'/'>, factor > > > > { }; struct expression : seq< term, star< or_< seq< str<'+'>, term >, seq< str<'-'>, term > > > > { };
Through the magic of the lazy template instantiation, they are the perfectly valid types. The production rule expression is in fact a type that has a static member function parse. As parse will be instantiated later by algorithms, all you have to do is not to define but to declare a type:
struct element; // forward declaration struct content : seq< opt<CharData>, star< seq< or_< element, // the magical recursion! Reference, CDSect, PI, Comment >, opt<CharData> > > > { }; struct element : or_< EmptyElemTag, seq<STag, content, ETag> > { };
Get and install the latest release of the Boost C++ Libraries. (biscuit uses only their headers.)
Include headers of biscuit:
#include "biscuit/algorithm.hpp" #include "biscuit/parser.hpp" using namespace biscuit;
Define your own parser type:
typedef seq< str<'/','*'>, star_until< any, str<'*','/'> > > c_comment;
Call algorithms:
if (match<c_comment>("/* hello, biscuit */")) { //... }
If you build test project, it is required to build Boost.Test:
bjam -sTOOLS=vc-7_1 --with-test install
A parser is any type that has the static member function:
template< class State, class UserState > static bool parse(State& s, UserState& us);
As a parser is a type, it can't have any runtime-state. But you can pass any UserState object to algorithms, and the object is passed to the parse.
Any type.
A Forward Range is a concept similar to the STL Container concept. A further document is available at Boost.Range.
A semantic action class can be any class of Function Object that has the member function:
void operator()(ForwardIter first, ForwardIter last, UserState& us);
Some parser templates are predefined as a means for parser composition and embedding.
The table below lists EBNF and their equivalents in biscuit.
EBNF (or Perl) biscuit Meaning . any any object A | B or_<A, B> alternation of A and B A B seq<A, B> sequence of A and B A* star<A> zero or more times, greedy A+ plus<A> one or more times, greedy A? opt<A> zero or one time, greedy A - B minus<A, B> match A, but the sub-match of A doesn't match B A{n,m} repeat<A, n, m> between n and m times, greedy A*? B star_until<A, B> zero or more As and B "Diggins" str<'D','i','g','g','i','n','s'> string ^ begin beginning of sequence $ end end of sequence \n eol end of line \d digit a digit \D not_<digit> not a digit \s space a space \S not_<space> not a space [0-9] char_range<'0','9'> characters in range '0' through '9' [abc] char_set<'a','b','c'> characters 'a','b', or 'c' [0-9abc] or_< char_range<'0','9'>, char_set<'a','b','c'> > characters 'a','b','c' or in range '0' though '9' [^abc] not_< char_set<'a','b','c'> > not characters 'a','b', or 'c' (?=A) before<A> positive look-ahead assertion (?!A) not_< before<A> > negative look-ahead assertion
YARD and biscuit have no back-tracking on star operations. The maximum supported arity of parsers is now twenty.
actor is a parser that triggers a semantic action class object:
struct decorate_action { template< class ForwardIter, class Stream > void operator()(ForwardIter first, ForwardIter last, Stream& out) { out << "["; out << std::string(first, last); out << "]"; } }; struct xml_comment : seq< str<'<','!','-','-'>, star< or_< minus< any, str<'-'> >, seq< str<'-'>, minus< any, str<'-'> > > > >, str<'-','-','>'> > { }; struct xml_comment_action : actor< xml_comment, decorate_action > { }; std::stringstream out; std::string s0("<!-- xml comment no.1 -->"); match<xml_comment_action>(s0, out); BOOST_CHECK( std::string("[<!-- xml comment no.1 -->]") == out.str() );
You can pass a semantic action class to actor, but cannot pass a function pointer. This trouble is fixed by grammar below.
Directives are also parsers, some ports of Boost.Spirit's directives.
Boost.Spirit biscuit Meaning lexeme_d[A] impossible turn off white space skipping as_lower_d[A] as_lower<A> convert inputs to lower-case no_actions[A] no_actions<A> all semantic actions not fire ??? definitive_actions<A> parse twice and suppress non-intended actions longest_d[A|B] longest<A, B> choose the longest match shortest_d[A|B] shortest<A, B> choose the shortest match limit_d[A] requires<A, PredicateClass> ensure the result of a parser is constrained ??? transform<A, FunctorClass> convert inputs using functor
Algorithms of biscuit work with Forward Range. Bear in mind that parsers don't know value_type of the range. For instance, a parser str works fine if value_type of the range is comparable with char.
match returns true if a parser run through the range; otherwise false:
BOOST_CHECK( match<xml_comment>("<!-- hello, xml comment -->") ); BOOST_CHECK( !match<xml_comment>("<!-- not well-formed comment -- -->") );
search returns the first sub matched boost::iterator_range; otherwise an empty range:
std::string s0(" /* c comment no.1 */x int i; /* c comment no.2 */ i = 1; /* c comment no.3 */ ++i; "); boost::sub_range<std::string> sr = search<c_comment>(s0); BOOST_CHECK( std::string(boost::begin(sr), boost::end(sr)) == "/* c comment no.1 */" );
filter_range is a filtered boost::iterator_range by a parser:
typedef filter_range< c_comment, std::string > fr_t; std::string s0(" /* c comment no.1 */ int i; /* c comment no.2 */ i = 1; /* c comment no.3 */ ++i; "); fr_t fr(s0); BOOST_CHECK(( match< repeat< c_comment, 3 > >(fr) ));
There is no reason why chains of filter_range do not work:
BOOST_CHECK(( match< str<'x','y','z'> >( make_filter_range< alpha >( make_filter_range< not_<space> >( make_filter_range< not_<digit> >("x & 4 y . 125 % z") ) ) ) ));
match_results is a Forward Range of boost::iterator_range:
typedef match_results<c_comment, std::string> mrs_t; typedef boost::range_const_iterator<mrs_t>::type iter_t; std::string s0(" /* c comment no.1 */int i; /* c comment no.2 */i = 1; /* c comment no.3 */ ++i; "); mrs_t mrs(s0); for (iter_t it = boost::const_begin(mrs); it != boost::const_end(mrs); ++it) { std::cout << std::string(boost::begin(*it), boost::end(*it)) << std::endl; }
Outputs:
/* c comment no.1 */ /* c comment no.2 */ /* c comment no.3 */
As parsers are just types, they has no runtime-state. Nontype template argument is farely limited. If value_type of Forward Range is not Integral Constants like char, what can we do? But C++ Template Metaprogramming says that member function pointers are available. They can bind templates and objects. Boost.Spirit makes a expression templates object from expression templates objects, but you can make expression type from expression templates using biscuit.
grammar binds parsers and objects:
struct my_grammar : grammar< my_grammar, std::vector<std::string> > { std::string text0() { return "hello"; }; std::string text1() { return "grammar"; }; std::string text2() { return "value"; }; struct start : seq< value_<&my_grammar::text0>, value_<&my_grammar::text1>, value_<&my_grammar::text2> > { }; }; void grm_value_test() { std::cout << "grm_value_test ing..." << std::endl; std::vector<std::string> texts; texts.push_back(std::string("hello")); texts.push_back(std::string("grammar")); texts.push_back(std::string("value")); my_grammar the_grammar; BOOST_CHECK( match< typename grammar_start<my_grammar>::type >(texts, the_grammar) ); }
Now that biscuit has no limitation of value_type of Forward Range parsed. As std::vector<std::string> is a Forward Range of std::string, it works. Keep in mind that UserState object is now your grammar object.
Here is a port of Boost.Spirit's calculator:
struct calculator : grammar< calculator, std::string > { void do_int(iterator_type str, iterator_type end) { std::string s(str, end); std::cout << "PUSH(" << s << ')' << std::endl; } void do_add(iterator_type, iterator_type) { std::cout << "ADD\n"; } void do_subt(iterator_type, iterator_type) { std::cout << "SUBTRACT\n"; } void do_mult(iterator_type, iterator_type) { std::cout << "MULTIPLY\n"; } void do_div(iterator_type, iterator_type) { std::cout << "DIVIDE\n"; } void do_neg(iterator_type, iterator_type) { std::cout << "NEGATE\n"; } struct expression; struct term; struct factor; struct expression : seq< term, star< or_< actor_< seq< str<'+'>, term >, &calculator::do_add >, actor_< seq< str<'-'>, term >, &calculator::do_subt > > > > { }; struct term : seq< factor, star< or_< actor_< seq< str<'*'>, factor >, &calculator::do_mult >, actor_< seq< str<'/'>, factor >, &calculator::do_div > > > > { }; struct factor : or_< actor_< plus<digit>, &calculator::do_int >, seq< str<'('>, expression, str<')'> >, actor_< seq< str<'-'>, factor >, &calculator::do_neg >, seq< str<'+'>, factor > > { }; struct start : expression { }; };
actor_ makes actor from the member function pointer. Enjoy the simplicity, compile-time performance and smaller-size of the executable.
biscuit emulates Boost.Spirit's debugging:
struct start_tag { }; struct integer_tag { }; struct factor_tag { }; struct term_tag { }; struct expression_tag { }; template< class ForwardRange > struct calculator_debug : grammar< calculator_debug, ForwardRange >, private boost::noncopyable { calculator_debug(std::stack<long>& eval_) : eval(eval_) { } void push_int(iterator_type first, iterator_type last) { std::string s(first, last); long n = boost::lexical_cast<long>(s); // std::strtol(str, 0, 10); eval.push(n); std::cout << "push\t" << long(n) << std::endl; } // ... struct integer; struct factor; struct term; struct expression; // struct start : debugger<start, // also ok, but long... struct start : debugger<start_tag, expression > { }; struct integer : debugger<integer_tag, actor_< plus<digit>, &calculator_debug::push_int > > { }; // ...
debugger uses type-name of the first argument for outputs. If your grammar is a class template like above, type-name can be very long. So I think that you want to define start_tag etc. Well, debugger automatically disappears on release-compile.
Outputs:
1 + 2 struct start_tag: "1+2" struct expression_tag: "1+2" struct term_tag: "1+2" struct factor_tag: "1+2" struct integer_tag: "1+2" push 1 /struct integer_tag: "+2" /struct factor_tag: "+2" /struct term_tag: "+2" struct term_tag: "2" struct factor_tag: "2" struct integer_tag: "2" push 2 /struct integer_tag: "" /struct factor_tag: "" /struct term_tag: "" popped 1 and 2 from the stack. pushing 3 onto the stack. /struct expression_tag: "" /struct start_tag: "" ------------------------- Parsing succeeded result = 3 -------------------------
You can find the idea of composing inlined algorithms from Boost.MPL TODO list. YARD and biscuit seem to be the example of it. By the way, this article was the hopeless war, vs Boost.Spirit. But don't you think another war will break out?
A snippet:
using namespace cranberry; int x = 7; int a = apply< plus< _1, _2 > >(x)(8); int b = apply< plus< _1, _2 > >()(7, 8); int c = apply< plus< _1, _2 > >(7, 8)(); int d = apply< plus< _1, int_<8> > >(7)(); int e = apply< bind< _1, _2, _3 > >(std::plus<int>())(7, 8); int f = apply< bind< _1, _2, _3 > >(&free_plus)(7, 8); int ar[5] = {1,2,3,4,5}; std::transform(ar, ar+5, ar, apply< plus< _1, _2 >, int const& >(x)); std::transform(ar, ar+5, ar, apply< bind< _1, _2, _3 > >(std::plus<int>(), x)); std::transform(ar, ar+5, ar, apply< bind< _1, _2, int_<7> > >(&free_plus)); std::for_each(ar, ar+5, apply< std_cout< _1 > >());