Re: seeking an OO or functional parser generator for Java
Seth Gordon wrote:
> I need to write a lexer and parser in Java. I've looked at JFlex/CUP
> and ANTLR, but I'm wondering if anyone has made a decent parser
> generator in Java that uses Java objects, not text files with snippets
> of Java code, to describe the grammars.
I've always ended up using ANTLR to write lexers/parsers in Java. But at
one point I needed a parser for a grammar that could be extended at
runtime (this is exactly the scenario where ANTLR falls down... like
most parser generators).
I ended up making a simple top-down OO parser library. It wasn't
particularly fast. Consider a simple set of lexer rules like:
INT = ('0'..'9')+
IDENT = ('a'..'z')+ ('a'..'z' | '0'..'9' | '_')*
The code for this in my library was:
Rule digits = Rule.inRange('0', '9');
Rule atoz = Rule.inRange('a', 'z');
Rule INT = digits.plus();
Rule IDENT = atoz.plus().thenZeroOrMore(
    atoz.or(digits).or('_')
);
The Rule class (and its subclasses) served two purposes:
1) it had an abstract match(buffer) method that actually performed the
parsing
2) it had a set of convenience "construction/factory" methods for taking
some rule and constructing another rule (like or(), plus(),
thenZeroOrMore(), etc.)
There was also a type of Rule that could execute a custom action.
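To make that concrete, here is a minimal sketch of what such a Rule class might look like. The method names mirror the ones above, but the implementation details (match() returning the number of elements consumed, or -1 on failure) are my own guess, not the original library:

```java
import java.util.function.IntPredicate;

abstract class Rule {
    /** Try to match at pos; return number of elements consumed, or -1. */
    abstract int match(int[] buf, int pos);

    static Rule inRange(int lo, int hi) {
        return satisfy(c -> c >= lo && c <= hi);
    }

    static Rule satisfy(IntPredicate p) {
        return new Rule() {
            int match(int[] buf, int pos) {
                return pos < buf.length && p.test(buf[pos]) ? 1 : -1;
            }
        };
    }

    /** One or more repetitions of this rule. */
    Rule plus() {
        Rule self = this;
        return new Rule() {
            int match(int[] buf, int pos) {
                int first = self.match(buf, pos);
                if (first < 0) return -1;
                int total = first;
                for (int n; (n = self.match(buf, pos + total)) > 0; ) total += n;
                return total;
            }
        };
    }

    /** This rule followed by zero or more repetitions of r. */
    Rule thenZeroOrMore(Rule r) {
        Rule self = this;
        return new Rule() {
            int match(int[] buf, int pos) {
                int head = self.match(buf, pos);
                if (head < 0) return -1;
                int total = head;
                for (int n; (n = r.match(buf, pos + total)) > 0; ) total += n;
                return total;
            }
        };
    }

    /** Ordered choice: this rule, else r. */
    Rule or(Rule r) {
        Rule self = this;
        return new Rule() {
            int match(int[] buf, int pos) {
                int n = self.match(buf, pos);
                return n >= 0 ? n : r.match(buf, pos);
            }
        };
    }

    /** Convenience overload for a single literal character. */
    Rule or(char c) {
        return or(satisfy(ch -> ch == c));
    }
}
```

With this sketch, the INT and IDENT definitions above compose exactly as written, and the whole "generator" is just ordinary Java object construction.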
I ended up getting it nominally working, though it remained somewhat
incomplete. It was not particularly efficient, but efficiency was not my goal.
Before discarding that project to move on to more pressing things, I
remember thinking of a re-design that separated out the two purposes
above (and allowed "optimization" when going from one to the other).
> It seems like taking such an approach would follow "the grain of the
> language" more than the lex-and-yacc approach. But maybe other people
> have tried to write a parser this way and discovered good reasons for
> sticking with the old ways.
To answer that: I have gone down this path before and saw some
interesting things along the way. Implementing an inefficient version of
this yourself should not take more than a couple of days. And I saw no
good reasons why this would be a bad idea.
=Matt
PS: other quick useful things I can remember:
- the input to Rule.match() is a "buffer" of integers. The only alphabet
supported was integers (but that is a superset of Unicode, so I was happy)
- the "buffer" was necessary because of lookahead and backtracking
- Buffers were allowed to support "extended" alphabets of
(integer,Object) tuples. This means you could capture extra information
in a lexer (like the actual text matching an IDENT), and have that info
available to a parser.
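Putting those three bullets together, a buffer along those lines might look like the following sketch (again my own reconstruction, not the original code): each element is an integer code plus an optional payload, and mark()/reset() give the lookahead and backtracking support mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

/** One element of the "extended" alphabet: an (integer, Object) tuple. */
final class Symbol {
    final int code;       // the integer "character": a codepoint or token type
    final Object payload;  // extra info, e.g. the source text of an IDENT
    Symbol(int code, Object payload) { this.code = code; this.payload = payload; }
}

final class Buffer {
    private final List<Symbol> symbols = new ArrayList<>();
    private int pos = 0;

    void add(int code, Object payload) { symbols.add(new Symbol(code, payload)); }

    /** Look at the next integer code without consuming it; -1 at end. */
    int peek() { return pos < symbols.size() ? symbols.get(pos).code : -1; }

    /** Consume and return the next symbol, payload included. */
    Symbol next() { return symbols.get(pos++); }

    // Lookahead/backtracking support: save and restore the read position.
    int mark() { return pos; }
    void reset(int mark) { pos = mark; }
}
```

The same Buffer type can then sit on either side: the lexer reads a Buffer of codepoints (payloads null) and emits a Buffer of tokens whose payloads carry the matched text for the parser.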