
Re: text processing as *the* problem



"Dan Sugalski" <dan@sidhe.org> writes:

> I've been thinking that regexes would be really useful if they could be
> extended out past the character level to the token level, so I could write
> something like:
>
>     if (/(\{Adj})*\{Noun}/) {
>       ...
>     }
>
> to match a string that's got zero or more adjectives preceding a noun. (And
> yes, I'm painfully aware of how difficult the classification of words into
> parts of speech is--it's an example easily understandable by us but not
> necessarily easy to implement) When contemplating how to make Parrot's
> parser, I've really wanted to be able to do regular expressions against
> streams of tokens rather than streams of characters.

I've got a Python library that lets you do this: you can write
  if gre.match("adj* noun", tokens): ...
to match against a sequence of objects that have 'type' attributes with
values such as 'adj' or 'noun'.  (You can also compile an "adj* noun"
pattern string into a DFA and use it later, or build a chart parser out of
a set of patterns, etc.)
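
For the curious, here's a minimal sketch of the idea (this isn't the library's
actual code; the Token class and match() function are just illustrative): map
each token type to a single character, translate the pattern the same way, and
let Python's re module do the matching.

```python
import re

class Token:
    """A token carrying a part-of-speech 'type' attribute."""
    def __init__(self, type, text):
        self.type = type
        self.text = text

def match(pattern, tokens):
    """Match a pattern like 'adj* noun' against a token sequence by
    encoding token types as single characters and reusing re."""
    # Collect every type name that appears in the pattern or the tokens,
    # and assign each one a unique letter.
    names = sorted({t.type for t in tokens} | set(re.findall(r"\w+", pattern)))
    char = {name: chr(ord('a') + i) for i, name in enumerate(names)}
    # Translate 'adj* noun' into e.g. 'a*b'; regex metacharacters
    # (*, +, ?, |, parentheses) pass through untouched.
    pieces = re.findall(r"\w+|\S", pattern.replace(" ", ""))
    encoded_pattern = "".join(char.get(p, p) for p in pieces)
    # Encode the token sequence the same way and match the whole thing.
    encoded_tokens = "".join(char[t.type] for t in tokens)
    return re.fullmatch(encoded_pattern, encoded_tokens) is not None

tokens = [Token('adj', 'big'), Token('adj', 'red'), Token('noun', 'dog')]
```

With those tokens, match("adj* noun", tokens) succeeds while
match("noun adj", tokens) fails. (A real implementation would build a
proper DFA rather than lean on character encoding, but the principle is
the same.)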

If anyone's interested in beta testing this, let me know; that'll give me
an incentive to dust it off and finish documenting it.