Re: text processing as *the* problem
"Dan Sugalski" <dan@sidhe.org> writes:
> I've been thinking that regexes would be really useful if they could be
> extended out past the character level to the token level, so I could write
> something like:
>
> if (/(\{Adj})*\{Noun}/) {
> ...
> }
>
> to match a string that's got zero or more adjectives preceding a noun. (And
> yes, I'm painfully aware of how difficult the classification of words into
> parts of speech is--it's an example easily understandable by us but not
> necessarily easy to implement) When contemplating how to make Parrot's
> parser, I've really wanted to be able to do regular expressions against
> streams of tokens rather than streams of characters.
I've got a Python library that lets you do this: you can write
if gre.match("adj* noun", tokens): ...
to match against a sequence of objects that have 'type' attributes with
values such as 'adj' or 'noun'. (You can also compile an "adj* noun"
pattern string into a DFA and use it later, or build a chart parser out of
a set of patterns, etc.)
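For the curious, here's a minimal sketch of one way such token-level matching
can be done (this is NOT the gre implementation, just an illustration of the
idea): map each token type to a single character and reuse Python's ordinary
character-level re engine. The names match_tokens and Tok below are made up
for the example.

```python
import re

def match_tokens(pattern, tokens, types=("adj", "noun", "verb")):
    """Return True if the token sequence starts with a match for `pattern`.

    `pattern` is whitespace-separated type names, each optionally suffixed
    with *, +, or ?; `tokens` is a sequence of objects with a `type`
    attribute (as in the library described above).
    """
    # Assign one character per known token type, e.g. adj->a, noun->b.
    char = {t: chr(ord("a") + i) for i, t in enumerate(types)}
    # Translate a pattern like "adj* noun" into the character regex "a*b".
    parts = []
    for word in pattern.split():
        suffix = ""
        if word[-1] in "*+?":
            word, suffix = word[:-1], word[-1]
        parts.append(char[word] + suffix)
    # Encode the token stream as a string and match character-wise.
    text = "".join(char[tok.type] for tok in tokens)
    return re.match("".join(parts), text) is not None

class Tok:
    """Toy token carrying only a part-of-speech type."""
    def __init__(self, type):
        self.type = type
```

With that, match_tokens("adj* noun", [Tok("adj"), Tok("adj"), Tok("noun")])
returns True. A real implementation would compile patterns once and handle
grouping, alternation, and unknown types, but the character-encoding trick is
the core of it.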
If anyone's interested in beta testing this, let me know; that would give me
an incentive to dust it off and finish the documentation.