
mapping data



One of the barriers to using lightweight languages for many
low-level tasks is the difficulty in interfacing with data from an
outside source.  C is pretty good at this since you can define a
data structure (including C's union types) and map it to memory that
came from anywhere.  I'm curious about other solutions to this
problem, particularly ones that integrate with the type systems of
high-level languages.
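
To make that concrete, here's roughly what I mean in C.  The record
format is invented, and I'm ignoring alignment and byte-order
issues:

    #include <stdint.h>

    /* An invented wire format: a tag plus a union, laid over
       whatever bytes you happen to have. */
    struct record {
        uint32_t tag;              /* discriminates the union */
        union {
            uint32_t count;        /* tag == 0 */
            char     name[16];     /* tag == 1 */
        } u;
    };

    /* buf can be a read() buffer, an mmapped file, or a shared
       memory segment -- the struct is just a view of the bytes,
       with no copying and no translation step. */
    uint32_t get_count(const void *buf)
    {
        const struct record *r = buf;
        return r->tag == 0 ? r->u.count : 0;
    }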

As an example, consider mmapping a structure into memory from disk
and manipulating it (a btree or a big on-disk hash table for
example).  Perl gives you pack and unpack to make the job possible,
but they can be quite cumbersome, and they require copying the data
each time.  Most other languages I'm aware of (particularly those
with garbage collection) take a similar approach: you can copy the
data into a form the runtime understands (with type guarantees or
type annotations as appropriate for the language), but there isn't
much flexibility for working with the data in-place, short of
treating it as one big string/byte array.
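
For contrast, here's how direct the C version of the mmap case is.
The on-disk layout is made up and most error handling is omitted:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* A made-up on-disk hash table: a small header followed by
       an array of fixed-size buckets, all in one file. */
    struct bucket { uint64_t key; uint64_t value; };
    struct htab {
        uint32_t      magic;
        uint32_t      nbuckets;
        struct bucket buckets[];   /* the rest of the file */
    };

    int main(void)
    {
        int fd = open("table.db", O_RDWR);
        struct stat st;
        if (fd < 0 || fstat(fd, &st) < 0)
            return 1;

        struct htab *h = mmap(NULL, st.st_size,
                              PROT_READ | PROT_WRITE,
                              MAP_SHARED, fd, 0);
        if (h == MAP_FAILED)
            return 1;

        /* In-place access: only the pages actually touched get
           faulted in, and a store through h goes straight back
           to the file.  No pack/unpack, no copies. */
        struct bucket *b = &h->buckets[42 % h->nbuckets];
        printf("key %llu -> %llu\n",
               (unsigned long long)b->key,
               (unsigned long long)b->value);

        munmap(h, st.st_size);
        close(fd);
        return 0;
    }

The interesting part is what's missing: there is no translation
step between the file and the program's view of it.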

Judging from CPAN and related repositories, it seems that this is a
common motivator for various bindings in the form of C interfaces
that massage the data into a form more suitable for the host
language.  Once the data has been copied and annotated and the
wrapper code has given the secret handshake, the data is welcomed
into the club of trusted data and can be manipulated in whatever way
is natural for the language in question.  To commit changes back to
the original form, the reverse process happens.

Many systems can serialize and deserialize (or pickle and unpickle)
their own data, but this still involves copying the data from its
working format to the transmission/archival format.  Taking data
from a foreign source is a harder problem, and what I'm interested
in is using it in its native format, wherever it came from.
Do any existing systems take a novel approach to this problem?

It would be cool to be able to define a grammar for the data
(probably with functions to validate and perform unavoidable
transformations) that could interact directly with the runtime
system.  When a network packet comes in, you select the appropriate
grammar to interpret it and then use it like any other structure in
the system, with the runtime system applying transformations as
dictated by the grammar on-the-fly and the garbage collector knowing
not to mess things up.  There would be some runtime overhead, but it
would be more convenient and probably less error-prone.  For sparse
access to large on-disk data structures pulled in with mmap it could
be really handy.  Think of a shared memory region with a legacy C
program manipulating the same data.
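
I don't have a design for this, but even in C you can sketch the
crudest form of the idea: replace the struct with a table of field
descriptors and let a generic accessor apply the transformations on
the fly.  All the names below are invented:

    #include <stddef.h>
    #include <stdint.h>

    /* A "grammar" boiled down to almost nothing: each field is
       an offset, a width, and a flag saying whether the bytes
       need swapping.  A real system would generate this from a
       declarative description and let the runtime's type system
       in on the secret. */
    struct field {
        size_t offset;
        size_t width;       /* 1, 2, 4, or 8 bytes */
        int    big_endian;  /* transformation applied on access */
    };

    /* e.g. a 16-bit big-endian length at byte 2 of a packet */
    static const struct field packet_len = { 2, 2, 1 };

    uint64_t get_field(const void *buf, const struct field *f)
    {
        const unsigned char *p = (const unsigned char *)buf
                                 + f->offset;
        uint64_t v = 0;
        for (size_t i = 0; i < f->width; i++)
            v = f->big_endian
                ? (v << 8) | p[i]
                : v | ((uint64_t)p[i] << (8 * i));
        return v;   /* use: get_field(pkt, &packet_len) */
    }

The point of pushing this into the runtime rather than writing it
by hand is that the accesses could be type-checked and inlined, and
the collector could be told which regions are foreign.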

The closest thing I can think of is something like Perl's tied
hashes (providing a familiar interface to arbitrary code), or more
generally, using different implementations of an interface/type
class to hide the grunt work going on underneath.  But surely a
more declarative approach, implemented once by the language runtime
rather than by every application each time a new binding is
required, would be less painful and less error-prone.
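
In C clothing the tied-hash trick is just a struct of function
pointers, and every binding ends up hand-writing some variant of
it -- which is exactly the duplication I'd like the runtime to
absorb:

    #include <stdint.h>

    /* A hand-rolled "interface": callers only ever see get and
       set, while the implementation underneath can unpack a
       foreign buffer, walk an mmapped file, or call into a
       legacy library. */
    struct store_ops {
        void    *self;
        uint64_t (*get)(void *self, uint64_t key);
        void     (*set)(void *self, uint64_t key, uint64_t value);
    };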

Does this make sense?  Does anyone else think we could do better
than what current languages offer?  Are there any interesting
approaches out there that I just don't know about?  I'd love to be
able to use a high-level language (ML, Haskell, Lisp, etc.) as
naturally with incoming network packets and mmapped files as with
data produced in or parsed into the language.

- Russ