Re: Questions for a language designer
On Tue, May 27, 2003 at 04:59:25PM -0400, Kevin Kelleher wrote:
> I may have mentioned this before, but there is a good set of pages
> about language design:
A constraint absent from computer languages is the necessity to
pronounce language productions. For natural languages, the written
form is a late addition. For designed natural languages, the spoken
structure and the written structure are best designed at once; and
that gives some consistency compared to the natural languages
I know (French, English).
But for computer language productions, the representation constraints
are not often explicitly laid out. Let's give it a try and analyse
some of the consequences.
We often implicitly assume that the preferred representation is a
succession of lines of characters or even a mere sequence of characters,
newline (or crlf or whatever) being a space like any other. Except for
a few languages or special constructs like here-documents or literal
strings, free form is the norm. There are noteworthy exceptions like
Note: Personally I believe in space as syntax even if I don't agree
with the way Python does it. This subject would deserve a whole
post. I will develop only one argument. One can indent a program in
a way different from the way it is parsed; this is evil. I know there are
beautifiers and indenters that cure this discrepancy, but that
does not mean that the problem does not exist to start with.
Also some form of ASCII or (now) UNICODE alphabet is implied.
Our input device, the keyboard, has hardly changed in years but our
display screens are now a far cry from a teletype. Resolution is
now adequate and font rendering is of decent quality.
Mathematicians are used to many alphabets, many fonts and
typesetting conventions like matrices, indices, exponents. When I
started in the field, I naively assumed that programming would do the
same, without thinking about what it would entail to implement.
As a matter of fact, math-oriented languages now have a typeset
representation. In the proprietary field, Mathematica comes to mind.
In the Open Source turf (or GNU turf for that matter, if RMS is around)
TeXmacs (http://www.texmacs.org) is such an interface.
So a language can have many concrete syntaxes, here a typeset one and a
"sequential" ASCII or UNICODE one. That brings a new set of problems.
What is the reference "source" representation? For example, TeXmacs hides
the sequential representation pretty well. Mathematica permits easily to
The current versioning tools handle archived programs in their
concrete syntax form. These tools are line oriented and disturbed by
things as stupid as a change from tabs to spaces, even if that does not
change the meaning of the program.
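
To make the tab/space point concrete, here is a minimal sketch in
Python, using the standard ast module as a stand-in for "the abstract
syntax". The two source strings and the helper names
(same_abstract_syntax, differing_lines) are made up for illustration;
a line-oriented tool sees changed lines where the abstract syntax
sees none.

```python
import ast

# The same function twice: once indented with spaces, once with tabs.
SPACES = "def f(x):\n    if x:\n        return 1\n    return 2\n"
TABS = "def f(x):\n\tif x:\n\t\treturn 1\n\treturn 2\n"


def same_abstract_syntax(a: str, b: str) -> bool:
    """Compare two sources by their parsed trees, ignoring layout."""
    return ast.dump(ast.parse(a)) == ast.dump(ast.parse(b))


def differing_lines(a: str, b: str) -> int:
    """Count the lines a naive line-oriented comparison would flag."""
    return sum(1 for x, y in zip(a.splitlines(), b.splitlines()) if x != y)


if __name__ == "__main__":
    print("lines flagged as changed:", differing_lines(SPACES, TABS))
    print("same abstract syntax:", same_abstract_syntax(SPACES, TABS))
```

This only shows the comparison side; a versioning tool built on that
idea would also have to store and merge trees, which is not sketched
here.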
Some considerations of that kind led me to think it would be better to
use some form of abstract syntax as the reference representation. Lisp
folk would say: we already do that. But I don't want to lose the
convenience of a real expressive concrete syntax.
The user would be free to use the concrete representation of his
choice to edit the program but it would be saved in its abstract
representation. The program that would translate from abstract to
concrete would be a highly configurable "unparser". Using a
correctly chosen abstract representation would make programs more
future-proof because a future unparser will be able to identify patterns
and unparse them distinctively.
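
As a rough illustration of one abstract representation with several
concrete syntaxes, here is a small Python sketch. The node types (Num,
Add, Mul) and the two unparsers are hypothetical and cover only toy
arithmetic; the point is that the stored tree does not change when the
reader picks another notation.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Num:
    value: int


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


@dataclass(frozen=True)
class Mul:
    left: "Expr"
    right: "Expr"


Expr = Union[Num, Add, Mul]


def _operator(e: Expr) -> str:
    # Map a binary node to its operator symbol.
    return "+" if isinstance(e, Add) else "*"


def unparse_infix(e: Expr) -> str:
    """Fully parenthesized infix notation, as in most languages."""
    if isinstance(e, Num):
        return str(e.value)
    return f"({unparse_infix(e.left)} {_operator(e)} {unparse_infix(e.right)})"


def unparse_prefix(e: Expr) -> str:
    """Lisp-like prefix notation for the very same tree."""
    if isinstance(e, Num):
        return str(e.value)
    return f"({_operator(e)} {unparse_prefix(e.left)} {unparse_prefix(e.right)})"


# The abstract form is what would be stored.
tree = Add(Num(1), Mul(Num(2), Num(3)))

if __name__ == "__main__":
    print(unparse_infix(tree))   # (1 + (2 * 3))
    print(unparse_prefix(tree))  # (+ 1 (* 2 3))
```

A real unparser would be driven by user preferences (precedence-aware
parenthesizing, layout, typeset output); this sketch hard-codes two
styles to keep it short.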
Compare that with languages that try to fit more and more patterns
while keeping backward compatibility. They are constrained by past
concrete syntax choices. If the reference form of programs was
stored in their abstract syntax, we could benefit from better choices
of concrete syntax, like C++ with a better identifier declaration
syntax as proposed by Damian Conway.
This is just a (not so) random example. The point is that using
concrete syntax as the reference representation limits us prematurely.
A litmus test for that "linguistic design" is the renaming of an
identifier. Sooner or later, an identifier represents something
different from what it did originally. Or the original name was
inappropriate to start with. Currently, renaming an identifier is a
daunting task: one has to find all the occurrences of the identifier
without confusing it with different identifiers with the same name. If
the identifier is part of a public API (it is the name of a public
method for example), then you are out of luck. But the name of an
identifier is a concrete property and not an abstract one. Ideally, in
the abstract representation each identifier would be a unique
number. The reference representation of a program would include a table
that maps abstract identifiers (numbers) to concrete ones (names). So
with my scheme renaming a variable would be done in one place and
propagated everywhere by the magic of unparsing.
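
The identifier table can be sketched in a few lines of Python. All
names here (Ref, PROGRAM, NAMES, unparse, rename) are invented for the
example, and the "program" is a flat token list rather than a real
tree; it is only meant to show that a rename touches the table and
nothing else.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class Ref:
    """A use of an identifier, stored as a number rather than a name."""
    ident: int


Token = Union[str, Ref]

# Stored form: keywords and punctuation as text, identifiers as numbers.
PROGRAM: List[Token] = [
    "def", Ref(0), "(", Ref(1), ")", ":", "return", Ref(1), "*", Ref(1),
]

# The table mapping abstract identifiers (numbers) to concrete names.
NAMES: Dict[int, str] = {0: "square", 1: "x"}


def unparse(program: List[Token], names: Dict[int, str]) -> str:
    """Produce a concrete text form by looking every Ref up in the table."""
    return " ".join(names[t.ident] if isinstance(t, Ref) else t for t in program)


def rename(names: Dict[int, str], ident: int, new_name: str) -> Dict[int, str]:
    """Return a new table with a single entry changed; the program is untouched."""
    updated = dict(names)
    updated[ident] = new_name
    return updated


if __name__ == "__main__":
    print(unparse(PROGRAM, NAMES))                  # def square ( x ) : return x * x
    print(unparse(PROGRAM, rename(NAMES, 1, "n")))  # def square ( n ) : return n * n
```

Since uses are tied to numbers, two different identifiers that happen
to share a name can never be confused by a rename.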
Certainly the consequences of such a design are multiple and affect the
whole programming ecosystem. Note that the idea of names as unique
integers (in some space) would eventually affect linkers. Not
surprisingly, it would solve as a side effect the C++ name mangling
hack by making it altogether irrelevant. It is a very good sign when
an unanticipated side effect is to simplify things.
Current tools like emacs and vim are inappropriate because they don't
know much about the language being edited. But can we do without
literate tools when legacy unix tools fail simple tasks like variable
In current tools, syntax highlighting is really only lexical
highlighting. An appropriate tool will blend with the parsing/unparsing
loop that I hinted at (when speaking of the unparsing part).
In a sense, what I am proposing is not so radical or so new. Chomsky
introduced the notion of surface and deep structure because the
deep structure better exposed the algebraic properties of languages
while the surface one was warped by various surface constraints.
Also some literate environments with some form of parsing/unparsing
loop have been proposed in the past, like the "synthesizer generator"
http://www.allbookstores.com/book/0387969101. I have heard that ABC
came with a specialized environment too. Apparently, these
environments have been deemed too cumbersome. But some good ideas may
need time to mature or to be adapted to a new context to flourish.
An example: Leonardo da Vinci anticipated the helicopter. His helix was
apparently patterned after an Archimedes screw and inappropriate because
the viscosity of air is not that of water. And he had no
My point is not to compare myself to Leonardo (even if I am left-handed
too :) but that an idea is not necessarily bad because it has failed.
More specifically, the ideas behind the synthesizer generator should
probably be revisited.
I am no historian, so I can't tell if the dichotomy between
abstract and concrete syntax has been a conscious mirroring of
the concepts of deep and surface structure.
If you are interested, I have a related post on perlmonks: