
Re: What is a lightweight language



Quoth Paul Prescod:
> XML's tag-redundancy provides much better error recovery 

For every error you show me that is fixed by this, I'll show you a place
where I went to change an <h1> to an <h3>, or an <ol> to a <ul>, or some
more complicated change, and forgot to change the matching close tag. :P

> and readability (at a cost of verbosity).

The verbosity actually hurts the readability quite a bit, imo.  When I
have heavily nested text, it's a pain to have to sort through all the
extra crud.  So much so that even though I know it's "proper" to always
use a close tag, I skip it whenever I can get away with it (i.e. in most
HTML).

> S-expressions-as-we-know-them also use the worst possible character
> for grouping because parens happen so often in ordinary text.

True enough.

> XML is not ASCII text, but Unicode, whereas the Common Lisp and Scheme
> languages predate Unicode. 

So what?  S-expressions are not the exclusive domain of these languages,
and it's a no-brainer to imagine S-expressions that allow Unicode text
instead.  In fact, if you have access to Unicode, you could further
imagine using a non-ASCII bracketing character to avoid the
abovementioned paren problem.
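
To make that less hypothetical, here is a rough sketch in Python of a
reader for such a notation.  Everything in it is an assumption made up
for illustration: the brackets U+27E6 and U+27E7 are an arbitrary pick,
and `read` is not an existing library function.  Ordinary parens pass
through as plain text, and the text needs no quotes.

```python
import re

# Arbitrary choice for this sketch: mathematical white square brackets.
OPEN, CLOSE = "\u27e6", "\u27e7"


def read(text):
    """Parse OPEN/CLOSE-bracketed text into nested lists of text chunks."""
    # A capturing group makes re.split keep the brackets as tokens.
    tokens = re.split(f"([{OPEN}{CLOSE}])", text)
    stack = [[]]                      # stack[0] collects top-level items
    for tok in tokens:
        if tok == OPEN:
            stack.append([])          # start a new nested group
        elif tok == CLOSE:
            if len(stack) == 1:
                raise ValueError("unmatched closing bracket")
            done = stack.pop()
            stack[-1].append(done)    # attach finished group to its parent
        elif tok.strip():
            stack[-1].append(tok.strip())   # plain text, parens and all
    if len(stack) != 1:
        raise ValueError("unmatched opening bracket")
    return stack[0]
```

Feeding it a string like "\u27e6p Parens (like these) are just text\u27e7"
gives back one group whose only member is the text, parens untouched.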

> XML allows text directly as content whereas s-expressions require them
> in quotes.

Well, Scheme and LISP s-expressions do.  In my day-to-day research (in
natural language processing), I work with the Penn treebank, which
stores its parse trees in sexps such as the following:

( (S (NP-SBJ (NN Trouble))
   (VP (AUX is)
    (, ,)
    (SBAR-PRD (S (NP-SBJ (PRP she))
               (VP (AUX has)
                (VP (VBN lost)
                 (NP (PRP it))
                 (ADVP-MNR (RB just) (RB as) (RB quickly)))))))
   (. .)))

There is a very small faction in the NLP community that keeps trying to get
everyone to convert to using XML-format treebank files.  Like:

<S1><S><NP GRAMM="SBJ"><NN>Trouble</NN></NP>
     <VP><AUX>is</AUX>
      <COMMA>,</COMMA>
      <SBAR GRAMM="PRD"><S><NP GRAMM="SBJ"><PRP>she</PRP></NP>
                         <VP><AUX>has</AUX>
                          <VP><VBN>lost</VBN>
                           <NP><PRP>it</PRP></NP>
                           <ADVP GRAMM="MNR"><RB>just</RB>
                                             <RB>as</RB>
                                             <RB>quickly</RB>
                           </ADVP>
                          </VP>
                         </VP>
                        </S>
      </SBAR>
     </VP>
     <PRD>.</PRD>
    </S>
</S1>

The verbosity is the *least* of the problems here.  The underlying text
is entirely lost under the weight of the markup; it is not visually
offset from the markup at all, either.  The length of the close tags
means that to be visually balanced they need to come on their own lines;
this causes the eye to entirely lose the period in the XML version.  And
perhaps most importantly, this example is a short sentence by treebank
standards (less than half the average length); yet already the XML
version takes nearly the full width of the screen and more than a third
of a page.  Rare is the treebank sentence that, in sexp form, takes more
than half a page, and virtually none take a full page.  But the XML
format would clearly be sending many sentences into pagewrap-land.

And I haven't even gotten started on how much easier sexps are to parse
than XML.  They're a simple READ statement in Scheme, of course, but
even in C, a page of lex and yacc will do the trick.
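
For a sense of scale, here is a minimal sketch of such a reader in
Python (neither Scheme nor lex/yacc, just something short to run).  The
names `tokenize`, `read_sexp`, `parse` and `leaves` are invented for this
example, and it assumes the treebank convention that atoms are
whitespace-delimited and never contain a literal paren; it has no quoting
or escape handling.

```python
def tokenize(text):
    """Split bracketed text into '(', ')' and whitespace-delimited atoms."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def read_sexp(tokens, pos=0):
    """Read one expression at tokens[pos]; return (value, next_pos)."""
    if pos >= len(tokens):
        raise ValueError("unexpected end of input")
    tok = tokens[pos]
    if tok == ")":
        raise ValueError("unexpected ')'")
    if tok != "(":
        return tok, pos + 1           # a bare atom such as NP-SBJ or Trouble
    items, pos = [], pos + 1
    while pos < len(tokens) and tokens[pos] != ")":
        item, pos = read_sexp(tokens, pos)
        items.append(item)
    if pos >= len(tokens):
        raise ValueError("missing ')'")
    return items, pos + 1             # step past the closing paren


def parse(text):
    """Parse exactly one expression and reject anything left over."""
    tokens = tokenize(text)
    tree, pos = read_sexp(tokens)
    if pos != len(tokens):
        raise ValueError("trailing tokens after expression")
    return tree


def leaves(tree):
    """Yield the words, i.e. the second element of every (TAG word) pair."""
    if isinstance(tree, str):
        return
    if len(tree) == 2 and all(isinstance(x, str) for x in tree):
        yield tree[1]
        return
    for child in tree:
        yield from leaves(child)
```

Calling `parse` on a treebank-style string returns nested Python lists,
and `" ".join(leaves(tree))` recovers the underlying sentence.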

Hm, that turned into a bit more of a rant than I meant it to be; I just
got a little cheesed off at the oft-repeated (but false) assertion that
XML is better due to "better error recovery and readability".  :P  Being
easier for both computers *and* humans to read, sexps are definitely the
way to go, although I don't actually expect that they'll dominate
anytime soon.

-- 
-=-Don Blaheta-=-=-dpb@cs.brown.edu-=-=-<http://www.cs.brown.edu/~dpb/>-=-
"Letting strings denote degrees of nobility may be inadvisable.  It does
not prevent spurious degrees like `butcher' and `madman'."
					--_ML for the working programmer_