
Re: text processing as *the* problem



At 03:08 PM 11/28/2001 -0500, Guy Steele - Sun Microsystems Labs wrote:
>    At 11:42 AM 11/28/2001 -0800, KELLEHER,KEVIN (Non-HP-Roseville,ex1) wrote:
>    >Are there any languages, even big languages, that were *built* with
>    >text processing in mind?  Are there approaches that are not limited
>    >to an implementation of regular-expression matching?
>
>    Well, there's always SNOBOL... I'm not sure what there is past
>    regexes--they're pretty much the next logical step past state-machinish
>    things.
>
>Well, one obvious step past regexes is parenthesis-balanced
>strings, which in general cannot be represented by regexes.

Yup, that's handy for one class of structured text. (Perl's regex engine 
can now handle this, but it requires embedding pieces of executable code 
in the regex to count parens.)
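
To make the "count parens" part concrete, here is a minimal sketch in 
Python rather than Perl. The function name is_balanced is made up for 
illustration; the point is only that the check needs a counter, which is 
the piece a plain regex has no place to keep.

```python
def is_balanced(text: str) -> bool:
    """Return True if every '(' in text is closed by a later ')'."""
    depth = 0  # number of currently open parentheses
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                # A ')' appeared with no '(' left to close.
                return False
    # Balanced only if nothing is left open at the end.
    return depth == 0
```

The counter can grow without bound, which is why no fixed-size state 
machine can do the same job for arbitrarily deep nesting.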

I've been thinking that regexes would be really useful if they could be 
extended out past the character level to the token level, so I could write 
something like:

    if (/(\{Adj})*\{Noun}/) {
      ...
    }

to match a string that's got zero or more adjectives preceding a noun. (And 
yes, I'm painfully aware of how difficult the classification of words into 
parts of speech is--it's an example that's easy for us to understand but 
not necessarily easy to implement.) When contemplating how to build 
Parrot's parser, I've really wanted to be able to do regular expressions 
against streams of tokens rather than streams of characters.
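
One cheap way to approximate token-level matching today is to encode each 
token as a single character and run an ordinary regex over the encoded 
string. The sketch below is in Python, not Perl, and everything in it is 
an assumption for illustration: the tiny hand-written LEXICON, the 
one-letter tags, and the function names. It is not how Parrot does 
anything.

```python
import re

# Hypothetical lexicon: word -> one-letter tag (A = adjective, N = noun,
# V = verb, D = determiner). Real part-of-speech tagging is far harder.
LEXICON = {
    "big": "A", "red": "A", "shiny": "A",
    "dog": "N", "ball": "N",
    "runs": "V", "the": "D",
}

def tag_string(words):
    """Encode each word as exactly one character ('?' if unknown)."""
    return "".join(LEXICON.get(w.lower(), "?") for w in words)

def find_noun_phrases(sentence):
    """Return each run of zero or more adjectives followed by a noun."""
    words = sentence.split()
    tags = tag_string(words)
    # One character per token, so regex offsets are token indices.
    return [
        " ".join(words[m.start():m.end()])
        for m in re.finditer(r"A*N", tags)
    ]
```

Here the pattern A*N plays the role of (\{Adj})*\{Noun}. The trick only 
works while the tag alphabet fits in single characters and each token has 
one tag, which is exactly where a real token-level engine would earn its 
keep.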

>Small surprise that SNOBOL had a built-in primitive for
>matching a parenthesis-balanced string.  Once you have that,
>you have Lisp (sort of); you can build string patterns that
>will take CAR and CDR.  Regexes alone don't quite get you there.
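
For what it's worth, here is roughly what "CAR and CDR on strings" looks 
like once you can pick out balanced pieces. This is a Python sketch, not 
SNOBOL; the names elements, car and cdr are mine, and it assumes list 
elements are separated by whitespace.

```python
def elements(expr: str):
    """Split '(a (b c) d)' into top-level elements ['a', '(b c)', 'd']."""
    expr = expr.strip()
    if not (expr.startswith("(") and expr.endswith(")")):
        raise ValueError("expected a parenthesized list")
    items, current, depth = [], [], 0
    for ch in expr[1:-1]:  # walk the text between the outer parens
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced parentheses")
        if ch.isspace() and depth == 0:
            # Whitespace outside any nested list ends the current element.
            if current:
                items.append("".join(current))
                current = []
        else:
            current.append(ch)
    if depth != 0:
        raise ValueError("unbalanced parentheses")
    if current:
        items.append("".join(current))
    return items

def car(expr: str) -> str:
    """First top-level element of the list string."""
    return elements(expr)[0]

def cdr(expr: str) -> str:
    """The list string with its first top-level element removed."""
    return "(" + " ".join(elements(expr)[1:]) + ")"
```

The depth counter in elements is standing in for a balanced-string 
primitive; everything else is plain string slicing.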

Nope, definitely not. Which means regexes aren't really Turing complete, 
which is probably a good thing. (I'd put forth the horror which would be 
GCC re-implemented as a Perl regular expression, but that'd be cruel... :)

Still, the question remains--what besides plain regexes would be 
interesting to have when doing string processing? Dealing with tightly 
structured nested data structures (like XML or Lisp) is one, but I'm sure 
there are others.

					Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
dan@sidhe.org                         have teddy bears and even
                                      teddy bears get drunk