[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: How to Design (Declarative) Programming Languages



At 10:44 AM -0500 11/17/03, Jenkins, Steven wrote:
>I have a friend who is struggling to develop a DSL, and I've been giving
>him advice.  However, I'm getting a bit tired of basically teaching him
>how to design a language, so I started to point him to Essentials of
>Programming Languages. 
>
>After reviewing the TOC, though, I'm not convinced it's the right book
>to hand him.  Are there any recommendations as for a reference for
>designing a language?  I'd leaning toward Watt's Programming Language
>Syntax and Semantics, except that he's designing a declarative language,
>so perhaps Watt's Concepts and Paradigms book would be a better choice?
>
>Thanks,
>Steven


About 8 or 9 months ago, there was a thread on "how to
design a language".  I offered some thought, other people
chipped in, and I volunteered to collate the results.
This is as good an excuse as any to post what's there so
far.  I never really finished it, so it's unintentionally
a bit folksy...

--------------------
Here are some questions a prospective language designer should ask
himself when starting the designing a programming language.

  - What need are you trying to fill?  Don't fall into the trap of "a
    scripting language", because they always turn into general-purpose
    languages.

  - What's the metaphor?  Even though you might not be trying to build a
    "pure" language, it's worth having a model for the core language,
    such as "imperative, block-structured" (C), "object oriented"
    (Smalltalk), "generic object orientation" (Lisp), "functional" (ML),
    "lazy" (Haskell), "logic" (Prolog), "production system" (OPS5), etc.
    These different core models influence the "natural" styles of program
    development in different languages even if the set of available
    facilities is similar.  They also help define which late-arriving
    features will "fit" and which will be warts.
    [Jerry Jackson]

  - How many programming paradigms does your language support?  How tightly
    are they integrated?  Which other paradigms can you integrate with the
    built-in facilities?  How natural is the syntax of user-defined
    extensions?  Many problems are much better suited to some non-standard
    programming model than to the usual object-oriented/functional
    approaches.  For example, constraint languages allow a very concise
    description (and efficient solution) of many optimization problems.

    Dylan supports functional and object-oriented programming in a tightly
    integrated manner, but it offers no support for non-deterministic
    programming, constraint-solving, etc., and not much support to add them
    to the language.  If you have first-class continuations in the language
    you can add one additional programming model that requires non-standard
    control flow, but, in general, different extensions based on call/cc
    don't work together.
    [Matthias Holzl]

  - Is high performance an issue?  This says something about whether you
    want to implement an interpreted, a VM-based, or a natively compiled
    language.

  - Is high programmer productivity an issue?  How important is this with
    respect to performance?  This decision can affect how you store
    values and do function calling.

  - How portable across platforms do you want the language to be?  This
    will relates to whether you want to compile to a VM or to machine
    code, and to how well you support native libraries.  It will also
    affect library design for such things as graphics and GUI tools.
    [Anton van Straaten]

  - Do you want easily distributed executable code, i.e., do you you want
    to allow code to be easily transmitted across networks and run
    elsewhere, as Java does?  Do you want to provide built-in support for
    remote execution, like RPC/CORBA/RMI?  If you are writing for a VM,
    this can simplify some of these issue considerably.
    [Anton van Straaten]

  - What about debuggability?  If you plan to compile it, you need to
    think about how to store debugging information.

  - How do you want to bootstrap it?  This, too, says something about
    what kind of back-end you might build.  Perhaps you build a tiny VM
    in C, then compile to C.  This way, you avoid fun but time-consuming
    work on code generation for modern super-scalar hardware, register
    allocation, etc.

  - Do you want to be able to catch type errors early or late?  That says
    something about your type system (whether you require that all types
    be statically declared at compile-time, or allow them to be dynamic,
    or have a hybrid scheme like Dylan does).  In addition to the obvious
    effect on performance, this decision will affects your memory model
    in that completely static systems do not require tags or boxing.

  - Will variables be associated with explicit type declarations?
    - If yes, will these type declarations be required or optional?
    - If optional, will the language use inferencing to supply unspecified
      types, or simply use an all-purpose type (like Object or 'any')?
    [Anton van Straaten]

  - Will the language have any run-time type discrimination/checking at
    all, or will types be completely statically determined?  Some
    languages considered statically typed still do some run-time checking,
    such as Java.
    [Anton van Straaten]

  - Will any type checking happen at compile-time?  Some languages with
    explicit type declarations don't always check types at compile-time,
    such as old Visual Basic.
    [Anton van Straaten]

  - If you allow type declarations, you will want to think about whether
    you want parameterized types.  If you go whole hog with, say, F-bounded
    polymorphism, you can get performance *and* type safety *and* ease of
    use, but it's hard to get this exactly right.

  - What about namespaces?  Do you want to have a simple scheme as in
    Java, where classes, namespaces, and files are roughly equivalent?
    Lisp-style packages?  Dylan-style modules and libraries?  Within a
    single first-class namespace, how many second-class namespaces are
    there?  Java has 7 or 8: class names, function names, local variable
    names, slot names, etc.  Common Lisp has at least 3 (function,
    variable, and class names).  Dylan and Scheme have one, which greatly
    simplifies things at a small loss of generality which can usually be
    worked around with name conventions.

  - What about encapsulation?  Do you want to do information-hiding on a
    per-class basis as in C++ and Java, or on a "module" basis as in Dylan?

  - Is your language a functional language (that is, without side-effects)?
    If so, is it an almost-functional language or a true pure functional
    language?  Or is there a functional core with some sort of machinery
    for isolating side-effects, like monads do in Haskell?

  - What kind of evaluation semantics does the language have?  Eager as
    in most languages, or lazy as in Haskell?

  - Is your language purely lexical or do you offer dynamic variables (or,
    more generally, access to the dynamic environment) as well?  Dynamic
    binding allows you to introduce local state for the duration of a
    computation without side effects and without adding additional
    parameters.
    [Matthias Holzl]

  - Are there different semantics for "pointer-ish" and "non-pointer-ish"
    values, like in C?  Or is everything a first-class object reference,
    like in Lisp?  Having multiple ways of referencing values can make the
    user mode much more complicated.  On the other hand, making everything
    be object references can require boxing and/or tagging schemes that
    make your compiler and FFI more complex.

  - How do you want to pass arguments to functions?  By name as in Algol?
    By value or by reference as in C?  By object reference like Lisp does?
    Is there more than one convention in the language?

  - Do you want first-class functions?  What about lexical closures?
    First-class continuations?  The answer to those questions will tell
    you things about heap- and stack-allocation, and will also tell you
    how important it might be to do a continuation-based compiler.  It
    also tells you how hard your compiler has to work to avoid consing
    environments unnecessarily.  Lots of sophisticated language designers
    go with simple closures and avoid full continuations, because
    full-scale environment capture is hard to do well.

  - Does your language have an unwind-protect like facility?  When you
    design a new language it is tempting to include call/cc because it
    allows you to do define many common (and uncommon) control structures.
    On the other hand you want to have a facility that allows you to
    reliably relinquish resources after you are done.  If you simply try to
    combine call/cc and unwind-protect, you immediately get the
    "impenetrable shield vs. unstoppable force" problem in your language.
    Possible solutions include: no call/cc, weakened unwind-protect,
    different semantics for call/cc.
    [Matthias Holzl]

  - How do you handle conditions/errors?  Return codes or signalling?  Do
    you have an unwinding-only model like C++/Java or do you allow restarts
    like Dylan/CL?  If you do the latter do you separate conditions and
    restarts like Common Lisp or do unify them like Dylan?  These questions
    are important, because every programming language has to deal with error
    conditions, and in many cases the unwinding model is used simply because
    the language designer is not aware of any other possibilities.
    [Matthias Holzl]

  - Do you want the language to be "object-oriented" at all, given a
    broad definition of OO that includes the spectrum from single
    inheritance single receiver languages as in Java to multiple
    inheritance multiple receiver languages as in CLOS?  Do you want
    to provide genericity through some sort of template scheme?

    Here is how Jonathan Rees has characterized the very fuzzy term "OO".
    1. Encapsulation -- the ability to hide the implementation of a type
    2. Protection -- the inability of the client of a type to detect its
       implementation, guaranteeing that any changes to an implementation
       that preserve the behavior of the interface will not break any
       clients.  This also gives some measure of "security", because things
       like passwords can't leak out.
    3. Ad hoc polymorphism -- functions and data structures with parameters
       that can take on values of many different types.
    4. Parametric polymorphism -- functions and data structures that
       parameterize over arbitrary values, such as "a list of anything").  ML
       and Lisp both have this.  Java doesn't quite because of its non-Object
       primitive types.
    5. Everything is an object -- all values are objects.  True in Dylan, but
       not in Java because of its primitive types.
    6. "All you can do is send a message" (AYCDISAM) = Actors model -- there
       is no direct manipulation of objects, only communication with (or
       invocation of) them.  The presence of fields in Java violates this.
    7. Specification inheritance = subtyping -- there are distinct types
       known to the language with the property that a value of one type is as
       good as a value of another for the purposes of type correctness.  An
       example is Java interface inheritance.
    8. Implementation inheritance/reuse -- having written one pile of code, a
       similar pile (such as a superset) can be generated in a controlled
       manner, that is the code doesn't have to be copied and edited.  A
       limited and peculiar kind of abstraction. (E.g. Java class
       inheritance.)
    9. Sum-of-product-of-function pattern -- objects are, in effect,
       restricted to be functions that take as first argument a distinguished
       method key argument that is drawn from a finite set of simple names.
    Some people say Lisp is OO, meaning {3,4,5,7}.  Some people say Java is
    OO, meaning {1,2,3,7,8,9}.  E is supposed to be more OO than Java
    because it has {1,2,3,4,5,7,9} and almost has 6; 8 (subclassing) is seen
    as antagonistic to E's goals and not necessary for OO.  The conventional
    Simula 67-like pattern of class and instance will get you {1,3,7,9},
    which many people take as a definition for OO.
    [Jonathan Rees]

  - If the language is object-oriented, do you want it to be class-based
    or prototype-based?
    [Steve Dekorte]

  - If you've got an object system, do you want it to have first-class
    objects that exist in the run-time?  Should the object system extend
    to include all the way to the primitive types, or do you want to
    special-case those like Java does?  Do you want a Smalltalk/Java-style
    single receiver object orientation, or a CLOS-style multi-method
    generic function dispatch?  If the former, do you need some sort of
    static overloading like C++ has?  If the latter and performance is
    important, do you need some sort of Dylan-style "sealing" so that you
    can do some compile-time optimizations?  Do you want single
    inheritance, single inheritance with interfaces, multiple
    inheritance, or a hybrid single inheritance with mixins?  If you've
    got a more static type system, you'll need to deal with casts.  Do
    you additionally want auto-conversion?

  - If you've got an object system, how much of a meta-object system do
    you want to expose?  Do you want it to be purely reflective, or more
    than that?  In Dylan, we separated 'make' from 'initialize', which
    was a good idea, but do you also want to separate out 'allocate', so
    that you have control over where an object is created, e.g., in a
    "persistent memory" pool that might be back-ended by a database?

  - Do you need hairy CLOS-style method combination, or is a simpler
    style like we did in Dylan enough?  Do you care about what Gregor
    Kiczales calls "aspects", which might change your decision?

  - A more general question that relates to the object system, the
    meta-object system, and a different dimension of the bootstrapping
    question is: do you want to implement a language which provides a
    bunch of predefined and fixed constructs (such as an object system)
    or do you want to provide a layered language that implements such
    constructs in terms of lower-level features in the language?  The
    former is probably easier, but the latter can allow very flexible
    customization, which tends to be traded off against standardization.
    Note that even a language with a powerful built-in meta-object system
    won't necessarily allow you to replace that object system with
    something else, for example, unless the language supports that sort
    of thing.
    [Anton van Straaten]

  - How do you want to do memory management, manual or automatic (GC)?

  - Do you want to support threading?  Do you want to roll your own
    threads or use OS threads?  Do you want to support massive
    concurrency like Erlang does?  The answers to those questions will
    tell you about aspects of the run-time, memory allocation/GC, and
    performance.  Oh yeah -- it also tells you if you can actually take
    advantage of the multiple processors sitting in most of the machines
    we all have.  Do you want Java-style synchronization where it is
    built in to objects, or should that be handled orthogonally?

  - If you have threads and continuations, how do they relate to each
    other?

  - How well do you want to be able to integrate with native libraries?
    This decision affects your memory model, how you plan to represent
    run-time type info, how function call/return works, how signalling
    works, etc.  By "memory model", I also mean to include what sorts of
    objects are boxed or tagged.  (Opinion: the Harlqn/FunO Dylan
    compiler got it wrong -- I think we should have boxed everything, and
    then concentrated our efforts on box/unbox optimizations.  This would
    have *hugely* simplified FFI issues.)  Good integration with native
    code probably means that you will end up using a conservative
    collector, and that will effect the semantics of "finalization" (if
    you have it).

  - Do you want to be able to return multiple values?  How about &rest
    arguments?  These affect function call/return, tail-call elimination,
    and stack vs. heap allocation optimizations.

  - What's your order of evaluation in expressions?  This affects what
    sort of optimizations can be safely done.

  - What compilation model do you want?  Lots of include files like
    C[++]?  Lots of "packages" like Java?  Whole-worlds like Lisp?
    Separate libraries like Dylan?  This affects a lot of things, not
    least of which is the ability to deliver small applications.  It also
    informs the design of your core run-time.

  - Is the core run-time tiny like Scheme's?  Small like Dylan's?  Huge
    like Common Lisp's?  If you like the Common Lisp model, it's worth
    looking at EuLisp to see how to re-package it in a more layered way.

  - Even in a small run-time, you need to get the basic types right.  Are
    your numeric types "closed" (that is, do they include reals --
    rationals and irrationals -- and complex numbers)?  Are your string
    and character types rich enough to model Unicode?

  - Think hard about collections.  How do the following relate to each
    other: sets, tables, vectors, arrays, lists, sequences, ranges?  In
    Dylan, we decided too late having the tail of a list be a "cons" was
    maybe not such a great idea; what about that?  How do your
    collections interact with your threading model?

  - Think hard about iteration, especially over collections.  If all
    collections obey a uniform iteration protocol, it means that you can
    do things like 'for e in c ...'.  Note that if iterators are done in
    a first-class way, this has performance implications that your
    compiler needs to worry about.

  - Do you want some sort of security model built into the language?
    What sort of model do you want to use?  A simple "checker" like the
    Java VM uses, or a more sophisticated capability-based model.

  - What syntax do you want?  Parentheses unaccountably give lots of
    people hives, but S-expressions make a lot of things much simpler.
    Infix syntax is quite nice when it's done well, but you've got to get
    the "kernel" of that exactly right if you want your infix macro
    system ever to be usable.  If you decide on S-expressions, should
    they be represented as lists and conses, or do you want a first-class
    object for that?

  - Do you want to allow syntactic extensions (macros)?  Lisp-style
    macros?  Dylan-style pattern-matching non-procedural hygienic macros?
    Scheme-style 'syntax-case' pattern-matching procedural hygienic
    macros?  This says a lot about the syntax of your language, and it
    also says a lot about the model you choose for compile-time
    evaluation environments.

--------------------
Contributors

Andrew Cooke
Steve Dekorte (*)
Matthias Holzl (*)
Jerry Jackson (*)
Jonathan Rees (*)
Anton van Straaten (*)

(*) Contributions marked above