
text processing as the problem

To: "KELLEHER,KEVIN (Non-HP-Roseville,ex1)" <address@hidden>
Subject: text processing as *the* problem
From: Shriram Krishnamurthi <address@hidden>
Date: Sat, 1 Dec 2001 00:17:03 -0500 (EST)
Cc: "'address@hidden'" <address@hidden>
In-reply-to: <028FE7141C79D511B65100D0B74FE875CAF22C@xrose01.rose.hp.com>
References: <028FE7141C79D511B65100D0B74FE875CAF22C@xrose01.rose.hp.com>
Sender: address@hidden
Xref: oroboros.ai.mit.edu ll1-discuss:182

Kevin,

Thanks for your interesting post.

Let me offer a slightly different response than some others have
offered.

I find that a good many of the uses of regular expressions come about
precisely because we're trying to recapture structure that we had in
the first place -- but lost.  One lesson we should have known, but
even now don't seem to convey very well, is that structure lost is
very hard to reconstruct.  In some sense, this is a failure of
computer science.

For instance, I think of the amount of energy expended in parsing the
output of Unix system commands.  Olin Shivers gave one good example of
this in his LL1 talk (roughly: using ls | wc -l to count the number of
files, but getting an incorrect count because there's a newline in one
of the filenames).  Granted, that's an extreme example. But just the
work of having to deal with the regexps, the formats, the columns
... to sum up the space used by a directory, I used to pipe the output
of ls -l through an awk script that added up the numbers.  Then I
noticed that some other script I'd written atop this began to fail
mysteriously.  Why?  Because I shifted from SunOS to Linux; the darn
space count was in a different column, and awk silently added up
nonsense (non-numeric strings) to give me a total of 0.
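
Both failure modes can be sketched in a few lines of shell.  This is
only an illustration under stated assumptions, not the scripts referred
to above: count_entries and sum_column are hypothetical helper names,
and the two listing lines are made up to mimic a layout change that
moves the size from column 4 to column 5.  On systems where ls writes
names verbatim to a pipe, the naive count comes out one too high.

```shell
# Count directory entries without parsing text: the shell expands the glob
# into one argument per name, so a newline inside a name cannot split it.
count_entries() {
  set -- "$1"/*
  # An unmatched glob is left as the literal pattern; report 0 in that case.
  [ -e "$1" ] || [ -L "$1" ] || { echo 0; return; }
  echo "$#"
}

# Sum one whitespace-separated column, but refuse to treat a non-numeric
# field as zero. Lines with fewer than `col` fields (such as the "total"
# line that ls -l prints first) are skipped.
sum_column() {
  awk -v col="$1" '
    NF >= col && $col !~ /^[0-9]+$/ {
      printf "line %d: field %d is not a number: %s\n", NR, col, $col > "/dev/stderr"
      failed = 1
      exit 1
    }
    NF >= col { total += $col }
    END { if (!failed) print total + 0 }
  '
}

dir=$(mktemp -d)
touch "$dir/plain" "$dir/two
lines"
naive=$(( $(ls "$dir" | wc -l) ))   # counts newline characters in the output
robust=$(count_entries "$dir")      # counts names
echo "naive=$naive robust=$robust"
rm -rf "$dir"

# Made-up listing lines: a layout with the size in column 4, and a layout
# with an extra group column that pushes the size to column 5.
old_layout='-rw-r--r-- 1 alice 1024 Dec  1 00:17 notes'
new_layout='-rw-r--r-- 1 alice staff 1024 Dec  1 00:17 notes'
printf '%s\n' "$old_layout" | sum_column 4
printf '%s\n' "$new_layout" | sum_column 4 || echo "refused to sum column 4"
```

The check does not make column-counting a good idea; it only turns a
silent total of 0 into a visible error when the layout shifts.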

Would it really have been so darn difficult to offer a -M option to ls 
so that it provided a machine-readable format?  (M, by the way, is
one of the few ls option letters still left untaken.)  Maybe someone
will eventually create XML formats for these things.  And if I had
that, most of my needs for parsing outputs would go away.
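
As a rough sketch of what such a format buys, here is a hypothetical
list_sizes helper (not an existing ls option) that prints one record per
regular file, with the size always in the first tab-separated field.
It assumes file names contain no tabs or newlines, and it measures size
by reading each file with wc -c, which is fine for an illustration but
slow for large files.

```shell
# Hypothetical helper: print "<size-in-bytes><TAB><name>" for each regular
# file in a directory. The size always comes first, so a consumer never
# has to guess which column holds it.
list_sizes() {
  for f in "$1"/*; do
    [ -f "$f" ] || continue
    size=$(( $(wc -c < "$f") ))   # arithmetic expansion strips any padding
    printf '%d\t%s\n' "$size" "${f##*/}"
  done
}

dir=$(mktemp -d)
printf 'to be' > "$dir/hamlet"             # 5 bytes
printf 'or not to be' > "$dir/act three"   # 12 bytes, name contains a space

# The consumer splits on tabs only and reads field 1.
total=$(list_sizes "$dir" | awk -F '\t' '{ sum += $1 } END { print sum + 0 }')
echo "$total"
rm -rf "$dir"
```

Because the producer fixes the field order and the separator, the space
in "act three" does not disturb the sum.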

My sense is that Lispers and Schemers -- certainly younger ones like
me -- have always scratched their heads at most uses of regexps,
wondering why dot-files, structured outputs, etc. couldn't just be
written in more sensible syntaxes in the first place.

> I've also encountered problems in parsing XML that have required some
> "heavy lifting" in terms of string comparisons that I have had to implement
> in C.  All the while something inside cries out that it shouldn't be so hard
> to do.

I'm sure your parsing problems are a lot harder than this example,
but here's my approach to XML.  Suppose we begin with the file

  <doc>
   <actor name="hamlet">
     <or>
      <to action="be"/>
      <negation>
       <to action="be"/>
      </negation>
     </or>
   </actor>
  </doc>

stored in the file "2b".  Then loading

  (require (lib "xml.ss" "xml"))

[which I could auto-load at start-up] lets me run the following:

  (read-xml (open-input-file "2b"))

That gives me a full internal structure parse of this document, and

  (xml->xexpr (document-element (read-xml (open-input-file "2b"))))

gives me a Scheme representation of this.  I can turn on a powerful
filter to eliminate whitespace.  Using it at full tilt, I get

  (doc ()
   (actor ((name "hamlet"))
    (or ()
     (to ((action "be")))
     (negation ()
      (to ((action "be")))))))

which is a Scheme s-expression, ready to be manipulated with a myriad
of library functions.  (Notice that it preserves the element/attribute
distinction, without violating the essential Lispishness of the data.)

Using Oleg Kiselyov's amazing SXML library, I can do MUCH more.

Shriram
