
Re: seeking an OO or functional parser generator for Java

Seth Gordon wrote:
> I need to write a lexer and parser in Java.  I've looked at JFlex/CUP 
> and ANTLR, but I'm wondering if anyone has made a decent parser 
> generator in Java that uses Java objects, not text files with snippets 
> of Java code, to describe the grammars.

I've always ended up using ANTLR to write lexers/parsers in Java. But at 
one point I needed a parser for a grammar that could be extended at 
runtime (this is exactly the scenario where ANTLR falls down... like 
most parser generators).

I ended up making a simple top-down OO parser library. It wasn't 
particularly fast. Consider a simple set of lexer rules like:

INT = ('0'..'9')+
IDENT = ('a'..'z')+ ('a'..'z' | '0'..'9' | '_')*

The code for this in my library was:

Rule digits = Rule.inRange('0','9');
Rule atoz = Rule.inRange('a','z');
Rule INT = digits.plus();
Rule IDENT = atoz.plus().thenZeroOrMore(atoz.or(digits).or(Rule.inRange('_','_')));

The Rule class (and its subclasses) served two purposes:
1) it had an abstract match(buffer) method that actually performed the 
matching
2) it had a set of convenience "construction/factory" methods for taking 
some rule and constructing another rule (like or(), plus(), 
thenZeroOrMore(), etc.)

There was also a type of Rule that could execute a custom action.
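To make that concrete, here is a minimal self-contained sketch in the same 
spirit. The names (Rule, match, inRange, plus, or, thenZeroOrMore) follow 
the examples above, but the bodies are my reconstruction, not the original 
library; here match() returns the position after a successful match, or -1 
on failure, and the action-rule variant is omitted.

```java
// Minimal reconstruction of the combinator style described above.
// All method bodies are guesses at the shape of such a library.
abstract class Rule {
    // Try to match starting at pos in input; return the position after
    // the match on success, or -1 on failure.
    abstract int match(int[] input, int pos);

    static Rule inRange(char lo, char hi) {
        return new Rule() {
            int match(int[] in, int pos) {
                return (pos < in.length && in[pos] >= lo && in[pos] <= hi)
                        ? pos + 1 : -1;
            }
        };
    }

    static Rule ch(char c) { return inRange(c, c); }

    Rule plus() {                       // one or more repetitions
        Rule self = this;
        return new Rule() {
            int match(int[] in, int pos) {
                int p = self.match(in, pos);
                if (p < 0) return -1;
                for (int q = self.match(in, p); q >= 0; q = self.match(in, p)) p = q;
                return p;
            }
        };
    }

    Rule or(Rule other) {               // ordered choice: this, else other
        Rule self = this;
        return new Rule() {
            int match(int[] in, int pos) {
                int p = self.match(in, pos);
                return p >= 0 ? p : other.match(in, pos);
            }
        };
    }

    Rule thenZeroOrMore(Rule other) {   // this, then other repeated
        Rule self = this;
        return new Rule() {
            int match(int[] in, int pos) {
                int p = self.match(in, pos);
                if (p < 0) return -1;
                for (int q = other.match(in, p); q >= 0; q = other.match(in, p)) p = q;
                return p;
            }
        };
    }
}

public class RuleDemo {
    public static void main(String[] args) {
        Rule digits = Rule.inRange('0', '9');
        Rule atoz = Rule.inRange('a', 'z');
        Rule INT = digits.plus();
        Rule IDENT = atoz.plus().thenZeroOrMore(atoz.or(digits).or(Rule.ch('_')));

        System.out.println(INT.match("123".codePoints().toArray(), 0));     // 3
        System.out.println(IDENT.match("ab1_c".codePoints().toArray(), 0)); // 5
        System.out.println(IDENT.match("1abc".codePoints().toArray(), 0));  // -1
    }
}
```

Because every combinator just wraps another match() call, an extended 
grammar is only ever a new Rule object, which is what made this approach 
attractive for runtime-extensible grammars.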

I ended up getting it nominally working, though it remained somewhat 
incomplete. It was not particularly efficient, but that was not my goal. 
Before shelving that project to move on to more pressing things, I 
remember sketching a re-design that separated out the two purposes 
above (and allowed "optimization" when going from one to the other).
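For what it's worth, that re-design might look something like the 
following: rule *descriptions* become inert data, and a compile() step 
turns a description into an executable matcher, giving you a place to 
inspect and rewrite the tree first. All of the names here (Spec, Matcher, 
Compiler) are hypothetical, not the original code.

```java
// Hypothetical separation of "rule description" from "matcher".
interface Matcher { int match(int[] in, int pos); }  // -1 means "no match"

sealed interface Spec permits Range, Seq, Alt {}
record Range(int lo, int hi) implements Spec {}      // one symbol in [lo, hi]
record Seq(Spec a, Spec b) implements Spec {}        // a followed by b
record Alt(Spec a, Spec b) implements Spec {}        // a, or else b

final class Compiler {
    // A real compile() could rewrite the Spec tree before translating,
    // e.g. merging adjacent Ranges into a lookup table; this version
    // just translates each node directly.
    static Matcher compile(Spec s) {
        if (s instanceof Range r) {
            return (in, pos) ->
                (pos < in.length && in[pos] >= r.lo() && in[pos] <= r.hi())
                    ? pos + 1 : -1;
        } else if (s instanceof Seq q) {
            Matcher a = compile(q.a()), b = compile(q.b());
            return (in, pos) -> {
                int p = a.match(in, pos);
                return p < 0 ? -1 : b.match(in, p);
            };
        } else {
            Alt t = (Alt) s;
            Matcher a = compile(t.a()), b = compile(t.b());
            return (in, pos) -> {
                int p = a.match(in, pos);
                return p >= 0 ? p : b.match(in, pos);
            };
        }
    }
}
```

The payoff is that the Spec tree stays cheap to build and modify at 
runtime, while any cost of optimization is paid once, at compile().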

> It seems like taking such an approach would follow "the grain of the 
> language" more than the lex-and-yacc approach.  But maybe other people 
> have tried to write a parser this way and discovered good reasons for 
> sticking with the old ways.

To answer that: I have gone down this path before, and saw some 
interesting things along the way. Implementing an inefficient version of 
this yourself should not take more than a couple of days, and I saw no 
good reasons why it would be a bad idea.


PS: other quick useful things I can remember:
- the input to Rule.match() is a "buffer" of integers. The only alphabet 
supported was integers (but that is a superset of Unicode, so I was happy)
- the "buffer" was necessary because of lookahead and backtracking
- Buffers were allowed to support "extended" alphabets of 
(integer,Object) tuples. This means you could capture extra information 
in a lexer (like the actual text matching an IDENT), and have that info 
available to a parser.
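A minimal version of such a buffer might look like this; the Sym record 
and the mark()/rewind() names are my own invention, sketching the 
lookahead/backtracking and (integer, Object) ideas above.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the buffer described above, under assumed names: symbols are
// ints with an optional Object payload, and mark()/rewind() give rules
// the lookahead and backtracking they need.
class SymbolBuffer {
    record Sym(int code, Object info) {}   // "extended alphabet" element

    private final List<Sym> syms = new ArrayList<>();
    private int pos = 0;

    void add(int code, Object info) { syms.add(new Sym(code, info)); }

    boolean atEnd()  { return pos >= syms.size(); }
    Sym peek()       { return syms.get(pos); }    // lookahead, no consume
    Sym next()       { return syms.get(pos++); }  // consume one symbol

    int mark()            { return pos; }         // remember where we are...
    void rewind(int mark) { pos = mark; }         // ...and backtrack there
}
```

A parser reading token symbols would then see, say, an IDENT code and pull 
the matched text out of info, rewinding to a saved mark whenever an 
alternative fails.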