
RE: text processing as *the* problem



(I'm sorry I had to miss the conference, but I'm enjoying the discussion.)

re text processing and regexps

Does this group agree "lightweight" is an interaction between "lifter" and
"liftee"?

When lifters are humans, there is nothing lightweight about regex's arcane
and cryptic syntax.  (This is one reason Python feels so "light".)
The best "lightweight" regex "alternative" I know is Ka-Ping Yee's rxb for
Python.
(See http://web.lfw.org/python/rxb15.py and the excerpt below.)

------------------------------------------------------------
This module encapsulates the construction and functionality of regular
expressions in a class named 'Pattern'...

You can concatenate 'Pattern' instances using the '+' operator or repeat
them using the '*' operator with a number.

The available functions are:

    exactly(<literal string>)               :: exactly the given string
    anybut(<literal string>)                :: text not containing the string
    member(<literal>, <literal>, ...)       :: any single char mentioned
    nonmember(<literal>, <literal>, ...)    :: any single char not mentioned

    maybe(<pattern>)                        :: zero or one occurrence
    some(<pattern>)                         :: one or more occurrences
    any(<pattern>)                          :: zero or more occurrences
    either(<pattern>, <pattern>, ...)       :: one of the alternatives
    label(<name>, <pattern>)                :: label a subgroup for later

    followedby(<pattern>)                   :: positive lookahead assertion
    notfollowedby(<pattern>)                :: negative lookahead assertion

For 'label' you can also use the alternate, more concise syntax

    label.<name>(<pattern>)
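
To give a feel for how a combinator interface like the one described above
can sit on top of ordinary regexes, here is a minimal sketch. It is NOT
rxb's actual code: the class body, the helper implementations, and the phone
number example are all assumptions made for illustration, reusing only a few
of the function names from the list above. It is written for current Python
and its standard 're' module.

```python
import re


class Pattern:
    """Hypothetical stand-in for a regex-building class (not rxb's code)."""

    def __init__(self, source):
        self.source = source  # the regex source text built so far

    def __add__(self, other):
        # '+' concatenates: match self, then other.
        return Pattern(self.source + other.source)

    def __mul__(self, count):
        # '*' with a number repeats the pattern exactly that many times.
        return Pattern("(?:%s){%d}" % (self.source, count))

    def search(self, text):
        return re.search(self.source, text)


def exactly(literal):
    # Escape the literal so no character is treated as regex syntax.
    return Pattern(re.escape(literal))


def member(*chars):
    # Any single character among those given.
    return Pattern("[" + "".join(re.escape(c) for c in chars) + "]")


def maybe(pattern):
    return Pattern("(?:%s)?" % pattern.source)


def some(pattern):
    return Pattern("(?:%s)+" % pattern.source)


def either(*patterns):
    return Pattern("(?:" + "|".join(p.source for p in patterns) + ")")


def label(name, pattern):
    # Named group, retrievable later with match.group(name).
    return Pattern("(?P<%s>%s)" % (name, pattern.source))


# Example: a made-up phone number format, built without writing regex syntax.
digit = member(*"0123456789")
phone = (label("area", digit * 3) + exactly("-")
         + label("rest", digit * 3 + exactly("-") + digit * 4))
m = phone.search("call 555-123-4567 today")
print(m.group("area"), m.group("rest"))
```

The point of the sketch is only that '+' and '*' can be ordinary operator
overloads on a small wrapper class, so the reader never has to see the
generated regex source.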




------------------------------------------
Jonathan Schull, Ph.D.
Founder and President
SoftLock.com dba Digital Goods
Schull@DigitalGoods.com
recent bio:
http://conferences.oreillynet.com/cs/p2pweb2001/view/e_spkr/1017
978-764-1058 cell and v-mail
978-568-9916 landline
978-246-0487 fax
------------------------------------------


-----Original Message-----
From: owner-ll1-discuss@ai.mit.edu
[mailto:owner-ll1-discuss@ai.mit.edu]On Behalf Of Dan Sugalski
Sent: Wednesday, November 28, 2001 10:12 PM
To: Guy Steele - Sun Microsystems Labs; kevin_kelleher@non.hp.com;
ll1-discuss@ai.mit.edu
Subject: Re: text processing as *the* problem


At 03:08 PM 11/28/2001 -0500, Guy Steele - Sun Microsystems Labs wrote:
>    At 11:42 AM 11/28/2001 -0800, KELLEHER,KEVIN (Non-HP-Roseville,ex1) wrote:
>    >Are there any languages, even big languages, that were *built* with
>    >text processing in mind?  Are there approaches that are not limited
>    >to an implementation of regular-expression matching?
>
>    Well, there's always SNOBOL... I'm not sure what there is past
>    regexes--they're pretty much the next logical step past state-machinish
>    things.
>
>Well, one obvious step past regexes is parenthesis-balanced
>strings, which in general cannot be represented by regexes.

Yup, that's handy for one set of structured text. (Perl's regex engine is
now capable of handling this, but it does require embedding pieces of
executable code in the regex to count parens.)

I've been thinking that regexes would be really useful if they could be
extended out past the character level to the token level, so I could write
something like:

    if (/(\{Adj})*\{Noun}/) {
      ...
    }

to match a string that's got zero or more adjectives preceding a noun. (And
yes, I'm painfully aware of how difficult the classification of words into
parts of speech is--it's an example that's easy for us to understand but not
necessarily easy to implement.) When contemplating how to make Parrot's
parser, I've really wanted to be able to do regular expressions against
streams of tokens rather than streams of characters.
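
One cheap way to approximate token-level matching with an ordinary
character-level engine is to encode each token's class as a single
character, run the regex over that encoded string, and map the match span
back onto the tokens. The sketch below does this in Python. Everything in it
is an assumption for illustration: the toy lexicon, the one-letter tag
codes, and the function names are invented, and a real tagger or lexer would
replace the dictionary lookup.

```python
import re

# Toy lexicon (hypothetical). Real part-of-speech tagging is far harder.
LEXICON = {
    "big": "Adj", "red": "Adj", "shiny": "Adj",
    "ball": "Noun", "dog": "Noun",
    "the": "Det", "bounced": "Verb",
}

# One character per token class, so each regex "character" is one token.
TAG_CODES = {"Adj": "A", "Noun": "N", "Det": "D", "Verb": "V", "Other": "O"}


def tag_string(tokens):
    """Encode a token list as a string with one tag character per token."""
    return "".join(TAG_CODES[LEXICON.get(t.lower(), "Other")] for t in tokens)


def find_token_pattern(tag_regex, text):
    """Return the first run of tokens whose tags match tag_regex, or None."""
    tokens = text.split()
    m = re.search(tag_regex, tag_string(tokens))
    if m is None:
        return None
    # Offsets in the tag string are token indexes, since each tag is 1 char.
    return tokens[m.start():m.end()]


# Zero or more adjectives followed by a noun, i.e. the (Adj)*Noun idea above.
phrase = find_token_pattern(r"A*N", "the big red ball bounced")
print(phrase)
```

This only works while the set of token classes is small enough to give each
one its own character, and it loses access to the token text inside the
pattern; a real token-level engine would not have either limit.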

>Small surprise that SNOBOL had a built-in primitive for
>matching a parenthesis-balanced string.  Once you have that,
>you have Lisp (sort of); you can build string patterns that
>will take CAR and CDR.  Regexes alone don't quite get you there.
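
To make the CAR/CDR remark concrete, here is a small Python sketch that
treats a parenthesized string as a list and takes its head and tail using
nothing but a balanced-item scanner. This is not SNOBOL and not anyone's
actual implementation; the function names (item_end, car, cdr) and the
whitespace-separated atom syntax are assumptions for illustration. The
depth counter in item_end is exactly the part a plain regex cannot express.

```python
def skip_spaces(text, i):
    """Return the index of the first non-space character at or after i."""
    while i < len(text) and text[i].isspace():
        i += 1
    return i


def item_end(text, i):
    """Return the index just past one item starting at text[i].

    An item is either an atom (a run of characters other than parentheses
    and whitespace) or a parenthesis-balanced group.
    """
    if text[i] != "(":
        while i < len(text) and text[i] not in "() \t\n":
            i += 1
        return i
    depth = 0
    while i < len(text):
        if text[i] == "(":
            depth += 1
        elif text[i] == ")":
            depth -= 1
            if depth == 0:
                return i + 1
        i += 1
    raise ValueError("unbalanced parentheses")


def _first_item_span(lst):
    """Validate that lst is one balanced list; return (lst, start, end)."""
    lst = lst.strip()
    if not lst.startswith("(") or item_end(lst, 0) != len(lst):
        raise ValueError("not a single balanced list: %r" % lst)
    start = skip_spaces(lst, 1)
    if lst[start] == ")":
        raise ValueError("empty list has no first item")
    return lst, start, item_end(lst, start)


def car(lst):
    """First item of a parenthesized list string."""
    lst, start, end = _first_item_span(lst)
    return lst[start:end]


def cdr(lst):
    """Everything after the first item, re-wrapped as a list string."""
    lst, start, end = _first_item_span(lst)
    return "(" + lst[end:-1].strip() + ")"


expr = "(a (b c) d)"
print(car(expr))
print(cdr(expr))
print(car(cdr(expr)))
```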

Nope, definitely not. Which means regexes aren't really Turing-complete,
which is probably a good thing. (I'd put forth the horror which would be
GCC re-implemented as a Perl regular expression, but that'd be cruel... :)

Still, the question remains--what besides plain regexes would be
interesting to have when doing string processing? Dealing with tightly
structured nested data structures (like XML or Lisp) is one, but I'm sure
there are others.

					Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
dan@sidhe.org                         have teddy bears and even
                                      teddy bears get drunk