This is a compilation of reviews put together for Matt Welsh's Modern
Distributed Systems class, Fall 2003.  Please email me any comments or
questions. 

RON (Resilient Overlay Networks), David Andersen (SOSP 2001)

- Straw man: Andersen shows how well you could perform
  application-specific routing if you had essentially perfect
  information.
- BGP has problems: slow route convergence; only generic, primitive
policies; and no way to differentiate between down links and merely slow
links.

- BGP is the protocol that ASes (autonomous systems) use to communicate
with each other about paths in the Internet.  BGP, because its messages
aggregate a huge amount of information, is both extremely scalable and
light on the details.  This is where RON comes in.  A RON is limited to
a small number of hosts, up to about 50, that exchange detailed
information about the several paths between them.  Because they have
this good, detailed information, RON nodes are able to quickly route
around failures and find better routes than BGP.

- Results: RON is able to route around failures in about 19 seconds,
as opposed to several minutes with BGP.  It also makes minor
improvements in throughput and loss on some transfers.

- The paper has an extremely thorough results section.  An additional
thought this brings up:

- There is a traditional CS mantra that one should be able to do
something not at all, one way, or an unlimited number of ways.  BGP
gives you one way and is therefore very scalable.  I wonder what kind
of improvement BGP would get if it had two ways (along Mitzenmacher's
Power of Two Choices lines).
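
To make that concrete, a toy sketch (entirely mine, not from the paper)
of two-choice path selection: sample two candidate paths and keep the
one that currently measures better.

    import random

    def two_choice_route(paths, latency):
        # Toy sketch, my own illustration: pick two random candidate
        # paths and keep the one with the lower latency estimate.
        a, b = random.sample(paths, 2)
        return a if latency[a] <= latency[b] else b

    # e.g.: the direct BGP path plus a couple of one-hop alternates
    paths = ["direct", "via-mit", "via-utah"]
    latency = {"direct": 80.0, "via-mit": 45.0, "via-utah": 120.0}
    print(two_choice_route(paths, latency))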

Problems:
- They argue that you only need one node as an alternate route, but it
doesn't seem to work in cases where you could use more than one node
(retroactive goal)
- RON only works when you are multi-homed.  However, in most cases, it
is the last link (the last mile) that is most likely to fail.  In this
case, RON gets you nothing.
- Quadratic (n^2) probing overhead, so it doesn't scale beyond about 50 nodes
- They never turn the biggest knob: the number of probe messages per
second.

i3 (Internet Indirection Infrastructure):

- Cute idea.  The authors use Chord as a register-your-interest,
  then-sit-back-and-receive-packets mechanism.  To use i3, you say
  "I'm interested in key x," but only the upper bits really matter,
  because they direct the message of your interest to the session
  manager, which they call the rendezvous point.  This is the join
  operation (and is called inserting a trigger).  Sometime after that,
  the sender sends packets sequentially starting at key x.  Because
  these all map to the same server (the rendezvous point), the server
  knows whom to send them to.
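
To make the mechanism concrete, here is a toy sketch of how I picture a
single rendezvous node (the names and structure are mine; real i3
matches on identifier prefixes and finds the responsible node via
Chord):

    def deliver(addr, packet):
        print("deliver to", addr, ":", packet)   # stand-in for a real send

    class RendezvousNode:
        # Toy sketch, not i3's actual API: one server holding triggers
        # and forwarding matching packets.
        def __init__(self):
            self.triggers = {}            # identifier -> receiver address

        def insert_trigger(self, ident, addr):
            # Receiver registers interest (soft state: in the real
            # system this expires unless periodically reinserted).
            self.triggers[ident] = addr

        def send(self, ident, packet):
            # Senders address packets to identifiers, not to hosts.
            addr = self.triggers.get(ident)
            if addr is not None:
                deliver(addr, packet)

    node = RendezvousNode()
    node.insert_trigger("key-x", "receiver:5000")
    node.send("key-x", "hello")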

- Purpose.  There are lots of different ways to achieve mobility,
multicast, and anycast, but no one solution that can handle all of
them.  Until now, of course.

- Robustness.  They essentially skirt the robustness issue by having
soft-state (triggers must be periodically reinserted) and redundancy
(there is a primary and backup trigger).

- They talk about routing efficiency being poor.  It's poor because
they are using naive Chord.  Proximity is a solved problem in p2p
networks (maybe it wasn't when they wrote the paper).

- Nodes use public triggers, which are based on hashes of DNS names, to
find keys in the first place.  Private triggers are shorter lived and
presumably must be sent via an out-of-band channel.


Brewer's Lessons from Giant-Scale Services (2001)

- Interesting experience paper based on his work at Inktomi.
- Useful metrics:
  - uptime = (MTBF-MTTR)/MTBF
    where MTBF is mean time between failures and MTTR is mean time to
    recover.  He argues that, since you can't make parts fail less
    frequently and since it is hard to measure MTBF, you should focus
    on making repair faster.
  - yield = queries completed/queries offered
  In other words, not all seconds have equal value: it is more
  important to be up when you are getting a lot of requests.
  - harvest = data available/complete data
  High harvest means that every query has access to the entire database.

  DQ Principle: Data per query x queries per second = constant
  where this constant is limited by your physical parameters; ideally,
  DQ scales linearly with number of nodes.

  Replication: AB1 AB2, AB2 goes down
  harvest stays the same because all data is available
  yield is cut in half because it can now handle half as many queries.

  Partitioning: A1 B2, B2 goes down
  harvest is cut in half because B is not available
  yield is the same because queries offered to A still complete

  Conclusion: replicate data beyond what the assumed load requires so
  that harvest stays intact if machines die and yield stays above the
  required threshold.
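
  A quick sketch of the arithmetic behind that contrast (my own toy
  numbers: two nodes, one failure):

      # Two nodes, each able to serve Q queries/sec, data set {A, B}.
      Q = 100

      # Replication: both nodes hold all of {A, B}; one node dies.
      rep_harvest = 1.0          # all data still reachable
      rep_yield   = Q / (2 * Q)  # half the capacity is gone -> 0.5

      # Partitioning: node 1 holds A, node 2 holds B; node 2 dies.
      part_harvest = 0.5         # B is gone
      part_yield   = 1.0         # queries still complete (with partial data)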

- Upgrade methods: fast reboot, rolling upgrade, *big flip* - upgrade
  half the site at a time.
  

SNS/TACC (SOSP 1997)
Fox, Gribble, Chawathe, Brewer, Gauthier

- This paper offers one solution to managing and building for
clusters.  The authors' solution provides the user (a programmer) with
a series of reusable building blocks to harness the advantages of
clusters and ameliorate the difficulties in shared state, load
balancing, and monitoring.  They concentrate on the read-dominated,
user-oriented side of the Internet, which allows them to reduce the
concurrency problem inherent in distributed design: if a worker
temporarily uses stale data, that is acceptable under their semantics.

- BASE (basically available, soft state, eventual consistency)

- Hierarchical API, which allows users to use parts at the top
(Service) level for most projects, but then delve into the TACC layer
if they need more customization.

- They concentrated on the burstiness of most Internet activity by
allowing an administrator to designate overflow machines to
temporarily handle high workloads.  The lowest layer, SNS (Scalable
Network Service), spreads onto these extra machines when needed.

- Question/Problem: what if you need ACID semantics and need to be
this scalable?

- The authors do not do any experimental comparison of their work with
any other cluster-based solution.


SEDA (2001)
Welsh, Culler, Brewer

- SEDA is an interesting concept, but seems fundamentally flawed if
  one assumes the presence of asynchronous IO and, particularly, if
  working on only one CPU.  SEDA stands for staged event-driven
  architecture.  Intuitively, this means that instead of running a
  single event-driven thread that continuously takes events off one
  queue and processes them, you have one or more threads per event
  queue, with one event queue per particular processing stage, e.g.,
  read the web page from disk.
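
For concreteness, a minimal sketch of the staged structure (mine, and
heavily simplified -- it ignores SEDA's dynamic thread-pool controllers
and batching):

    import queue, threading, time

    def make_stage(name, handler, next_q=None):
        # One stage: its own queue, drained by its own worker thread.
        q = queue.Queue()
        def worker():
            while True:
                event = q.get()
                result = handler(event)
                if next_q is not None:
                    next_q.put(result)   # hand-off = synchronization per hop
        threading.Thread(target=worker, daemon=True, name=name).start()
        return q

    # e.g. parse -> read -> send, one queue (and thread) per stage
    send_q  = make_stage("send",  lambda e: print("sent:", e))
    read_q  = make_stage("read",  lambda e: e + " [body]", send_q)
    parse_q = make_stage("parse", lambda e: e.strip(), read_q)
    parse_q.put("  GET /index.html  ")
    time.sleep(0.2)                      # let the pipeline drain

The per-hop queue hand-off is exactly where the synchronization cost in
the first problem below comes from.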

Problems:

- SEDA argues that threads are bad because of lock contention and poor
  cache performance.  What does it do for busy queues?  It
  uses threads!  Each time objects are moved between stages,
  synchronization must occur.  Increasing the number of threads buys
  you nothing on a single processor.

- The problem with stages is that you have less control over what is
  actually executing at a given time.  With a single queue, events can
  be sorted according to an application-specific priority or dropped.
  Having multiple queues relinquishes this control.  For example, to
  gain a better instruction cache hit ratio, you may want to run
  several events of the same type immediately following one another.
  You cannot do this with multiple threads looking at multiple queues.

Other items:

- Multiple queues only appear to have merit when multiple CPUs exist.

- The decision on when to shed load needs to take place well
  before the request gets to the web server.  It can take place at the
  load-balancing router sitting in front of the servers, if such a
  machine exists, or in the kernel/network card of the server machine.
  Once the request has made it this far -- all the way to the serving
  application -- it is almost too late, because so much work has
  already been performed on the request.  Performing this in the
  network card appears in some related work that I can't remember.


Capriccio (SOSP 2003)
von Behren, Condit, Zhou, Necula, Brewer

- The authors provide a very scalable threads package that assumes
  cooperative scheduling and asynchronous IO.

- Linked Stack Management is their compiler-based method to allow
  thousands of threads without allocating thousands of full thread
  stacks.  Through compile-time analysis, they heuristically guess at a
  good size for the memory "stack chunks" they allocate.  They insert
  small pieces of code to determine whether the current chunk has
  enough space or whether a new one should be allocated.  All stack
  chunks for a given program are the same size.  They don't discuss
  page alignment.
  Linked stacks are based on the premise that sometimes a thread will
  need a deep stack and sometimes not, and when it doesn't that memory
  space can be given to a different thread.  Very cool idea.
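
A rough simulation of the idea (the real thing is compiler-inserted C;
the chunk size and checkpoint placement here are placeholders of mine):

    CHUNK_SIZE = 4096          # every chunk the same size, as in the paper

    class LinkedStack:
        def __init__(self):
            self.used = [0]    # bytes used in each linked chunk

        def enter_frame(self, frame_bytes):
            # The compiler-inserted checkpoint: if the current chunk
            # can't fit this call's frame, link in a fresh chunk.
            if self.used[-1] + frame_bytes > CHUNK_SIZE:
                self.used.append(0)
            self.used[-1] += frame_bytes

        def leave_frame(self, frame_bytes):
            self.used[-1] -= frame_bytes
            if self.used[-1] == 0 and len(self.used) > 1:
                self.used.pop()   # chunk freed for some other thread

    s = LinkedStack()
    s.enter_frame(3000); s.enter_frame(3000)   # second frame spills over
    print(len(s.used))                          # 2 chunks in use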

- Resource-aware scheduling can be viewed as one policy for deciding
  the ordering of events on an event queue.  They estimate maximum
  resource limits and try to only schedule threads that will have
  available resources.  They currently track CPU, memory, and file
  descriptors.  Along the same lines -- it being one specific policy --
  it would be interesting to have a plug-in event-queue scheduler that
  did the same thing.


Problems:

- Their performance section was thin and lacked good results.  Apache
  performed 15% better in terms of bandwidth with Capriccio, but it
  was still outperformed by an extremely lean web server that they
  themselves had written.

- Personally, I find an event-based model the simplest to think about
  and write, if only because there is less concurrency to worry
  about.  So I do not buy their general premise that having an
  extremely scalable threads package is all that necessary.

- They do not discuss the interaction of compiler optimizations with
  their insertions of checkpoints.  These could affect their estimates
  of the code size.

- Evaluation section:
  - They use SPEC Web99, which is a static workload.
  - Key problem: They only show throughput, not latency -- they do not
  seem to be giving the whole story.


Model-Based Resource Provisioning in a Web Service Utility
Ron Doyle, Jeff Chase, Omer Asad, Wei Jin, and Amin Vahdat, USITS'03.

- This paper attempts to come up with a way to guess how much of a
  resource a service will need in order to meet a service-level
  agreement (SLA).  It is a reasonable and probably real problem:
  you have a contract to provide some kind of service, but you don't
  know how much of each resource is really necessary to fulfill this
  service.  They mention, but do not examine, "assignment" of
  differentiated services; that is, putting different services on
  different machines after determining how much resource each needs.

- Their approach is thoroughly queuing-theory based.  Essentially,
  they take in a bunch of parameters that describe a static web
  workload and their hardware, and spit out (most importantly) the
  average total response time.  With a simulated workload from
  fstress, their predictions match up well with reality.
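
I don't remember their exact equations, but the flavor is classic
queuing theory; here is an M/M/1 stand-in (my choice of model, not
necessarily theirs) that shows the response-time blowup near
saturation:

    def mm1_response_time(arrival_rate, service_rate):
        # M/M/1 mean response time R = 1 / (mu - lambda); only defined
        # below saturation -- past that we're in the "overload
        # pathologies" regime where model-based provisioning stops
        # applying.  (My stand-in, not the paper's exact model.)
        assert arrival_rate < service_rate, "overloaded"
        return 1.0 / (service_rate - arrival_rate)

    # a server completing 100 req/s, offered 90 req/s:
    print(mm1_response_time(90.0, 100.0))   # 0.1 s; explodes as 90 -> 100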

Problems:

- It seems a serious oversimplification to describe storage purely in
  peak IO operations per second.  By the time the tail end of the
  Zipf-distributed workload hits the disk, it is going to be pretty
  random and you aren't going to get anywhere near peak IOPS.

- I didn't understand what Figures 10 and 13 were about.  Shouldn't
  axes have labels and units?!

- If the request rate is steadily increasing in Figure 9, why does the
  cache size drop off?

- In the conclusion: "A utility can avoid overload by expanding
  resource slices and recruiting additional servers as traffic
  increases."  This is just plain wrong -- why do they even bother to
  say this when they have already talked about how their model-based
  approach doesn't work under "overload pathologies" (Section 3.5)?


Resource Overbooking and Application Profiling in Shared Hosting
Platforms
Bhuvan Urgaonkar, Prashant Shenoy, and Timothy Roscoe, OSDI'02.

- Idea: Southwest Airlines meets resource and application allocation.
  They (1) run applications in isolation to determine their resource
  requirements, (2) find an x-th percentile of these requirements,
  where x is 99 or 95, yielding 1% or 5% overbooking, respectively,
  and (3) stick the applications on machines, allocating them as if
  they were only using this 95th or 99th resource percentile.  Very
  cool, new idea.
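
A sketch of step (2) (my own code; it assumes you already have a
resource-usage time series from the solo profiling run):

    def overbooked_allocation(usage_samples, overbook_fraction):
        # Allocate at the (1 - overbook_fraction) percentile of observed
        # usage: 0.05 -> 95th percentile, 0.01 -> 99th.
        ordered = sorted(usage_samples)
        idx = int((1.0 - overbook_fraction) * (len(ordered) - 1))
        return ordered[idx]

    cpu_profile = [0.12, 0.15, 0.11, 0.45, 0.13, 0.14, 0.95, 0.12]  # made up
    print(overbooked_allocation(cpu_profile, 0.05))   # 0.45: the 0.95 spike is ignored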

- I'd like to know what a "leaky bucket regulator for the network
  interface" is.

- Apparently Intel Research people like Brent Chun are using
  overbooking in PlanetLab now.

Problems:

- As part of their motivation, they say "widespread deployment of
  shared hosting platforms has been hampered by the lack of effective
  resource management mechanisms."  While I believe their work has a
  somewhat reasonable motivation, machines are cheap and rack space is
  kind of cheap, but people are expensive.  So if running their scheme
  is more expensive than the amount of money it saves in terms of
  people and machines, it's not worth it.

- Relatedly, they "automatically" derive QoS requirements by running
  the application solo.  Setting something like this up and measuring
  it can often take a long time (and money).  While this does appear
  to get good results, their idea would be stronger if this step
  didn't have to take place.

- Figure 7c shows the "number of applications supported."  But it does
  not show how the overbooked mixture of applications actually
  performs.  That would have been real validation of their idea.

- Figure 5 has so many different axes the numbers are incomparable.


Scalable, Distributed Data Structures for Internet Service
Construction
Steven D. Gribble, Eric A. Brewer, Joseph M. Hellerstein, and David
Culler
OSDI'00.

- They argue that the hash table interface is universal: put(), get(),
remove().  In a sense this is true, but a hash table may not be the
best data structure in all cases.

- Load balancing: they argue that you could just use mod of the number
of servers, but this makes it hard to add a new server and isn't
scalable.  Instead, they use the 'data partitioning map (dpmap)' to
add new servers.  It is organized as a trie, similar to a binary
tree.  This allows you to find a replica group membership list.  Given
this list, you know whom you must contact for an update or the set of
possible nodes that you could contact for a get.
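
A toy version of the partitioning idea (mine; the real dpmap is a trie
over key bits with replica groups at the leaves):

    # Leaves hold replica-group membership lists; internal nodes split
    # on the next key bit.  Splitting a leaf is how a new server joins,
    # without the global rehash that mod-N would force.
    dpmap = {0: ["brick1", "brick2"],
             1: {0: ["brick3", "brick4"],
                 1: ["brick5", "brick6"]}}

    def replica_group(node, key_bits):
        for bit in key_bits:
            if isinstance(node, list):
                break              # hit a leaf before exhausting the key
            node = node[bit]
        return node                # contact all for puts, any one for gets

    print(replica_group(dpmap, [1, 0, 1, 1]))   # ['brick3', 'brick4']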

- Bricks are CPUs with disks.  They run one brick per CPU and run on
four-way machines.

- One-copy serializability: The DDSLib runs as the transaction manager
and the bricks are clients.  They rely on the fact that user clients
(over the WAN) can retry failed transactions.  One can construct a
case where a reader will either get the new value or a "no, busy"
depending on which brick they contact (so is this really serialized?).
They call this an "optimistic two phase commit protocol."

- Making some assumptions makes their analysis much simpler.  They
assume machines are physically secure, cannot exhibit network
partitions, have uninterruptable power, and are fail-stop.



Manageability, Availability, and Performance in Porcupine: A Highly
Scalable, Cluster-Based Mail Service
Yasushi Saito, Brian N. Bershad, and Henry M. Levy

- As opposed to Fox's SNS/TACC, they want to look at an application
that has lots of reading _and_ writing; mail is not read-dominated.
They are filling out the spectrum between ACID and BASE.

- This requires that some state be stored on disk ("hard" state).
They contrast this with "soft" state, which can always be regenerated
from hard state.  They go into detail about which data structures need
to be of which type.

- "Functional homogeneity" means "any node can execute part or all of
any transaction, e.g., for the delivery or retrieval of mail."  Their
goal is for system administrators to be able to just plug in a new box
or remove an old one and have it all just work.  Pretty powerful idea.

- The Three Round Membership Protocol allows them to detect membership
changes.  It is unclear to me what happens when people ask for their
mail while this protocol is running.

Problems:
- Motivation: while they say that large-scale commercial services need
to handle x (some huge number of) messages per day, they don't show (or
even comment) that existing services don't do the job.  In Section 2.5
they do address this somewhat.

- "Porcupine, on the other hand, tolerates any number of node failures
and continues to serve users after network partitions by relaxing
data consistency guarantees."  "Any number"?  What about all?


On the Feasibility of Peer-to-Peer Web Indexing and Search
Jinyang Li, Boon Thau Loo, Joe Hellerstein, Frans Kaashoek, David
R. Karger, Robert Morris
IPTPS'03

This workshop paper is essentially a long, back-of-the-envelope
calculation showing that it is infeasible to run Google on a p2p
network.  

They distinguish between partition-by-document, where each peer makes
an index of the documents stored on it, and partition-by-keyword,
where each peer keeps a list of documents for keywords that hash to
its domain.  Because they are used to skirt copyright, Gnutella and
Kazaa use partition-by-document; partition-by-keyword would allow you
to easily find the storage location of copyrighted material.  One of
their arguments that p2p would be useful for Web search is that it is
censorship resistant, but focusing on partition-by-keyword, as they
do, seems to contradict this goal.

Strangely, the communication constraint they come up with is a
percentage of wide-area Internet capacity, which, I believe, is
largely untapped.  What is often saturated are last mile links, which
they do not mention.

Partition-by-keyword does not allow for word-closeness measures
without an exponential increase in the number of keys stored
(e.g. "the" "who" are 5 words apart, 6 words apart, ...)


Querying the Internet with PIER
Ryan Huebsch, Joseph M. Hellerstein, Nick Lanham, Boon Thau Loo, Scott
Shenker, and Ion Stoica
VLDB'03.

PIER stands for "Peer-to-Peer Information Exchange and Retrieval"

This paper is a proof-of-concept: you can run a DB on top of a DHT.
Their motivation is that data can be generated in a standard way in
many locations and is not amenable to centralized "warehouse"-type
solutions because: (1) you can get live data this way, (2) you don't
need to support a data center, (3) you don't need to worry about
social/political concerns about responsibility (Do I buy this?  Aren't
you responsible just by participating in the entire system then?)

They build PIER on top of CAN, although they built a second version on
Chord and say it required minimal integration.

Applications use the "Provider API": get, put, renew, multicast,
lscan, and newData.  Get, put, and renew take a namespace and a
resource as inputs, corresponding to the relation name and a key.  In
general the key is the primary key, but they use another input,
instance, to allow for secondary keys.  This allows get to return
multiple tuples.  The use of "lifetime" and renew gives them lease
functionality.

They do a join by creating a new temporary relation in the DHT into
which matching tuples are added.  As tuples arrive, the newData
callback is called.

To save bandwidth during joins, they join subsets of the tables (the
keys) and then go back and fetch the relevant data.  They do this with
the keys themselves in the symmetric semi-join or with summaries of
the keys (Bloom filters).
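
Roughly, the Bloom-filter variant as I understand it (toy code, not
PIER's actual implementation):

    import hashlib

    class Bloom:
        # Toy Bloom filter: a compact, lossy summary of a key set.
        def __init__(self, nbits=1024, nhashes=3):
            self.nbits, self.nhashes, self.bits = nbits, nhashes, 0
        def _positions(self, key):
            for i in range(self.nhashes):
                digest = hashlib.sha1(("%d:%s" % (i, key)).encode()).digest()
                yield int.from_bytes(digest[:4], "big") % self.nbits
        def add(self, key):
            for p in self._positions(key):
                self.bits |= 1 << p
        def might_contain(self, key):
            return all((self.bits >> p) & 1 for p in self._positions(key))

    # Site R ships only a summary of its join keys; site S sends back
    # only tuples that pass the filter.  False positives cost a little
    # extra bandwidth; false negatives cannot happen.
    r_keys = Bloom()
    for k in ["k1", "k7", "k9"]:
        r_keys.add(k)
    s_tuples = {"k2": "...", "k7": "...", "k9": "..."}
    print(sorted(k for k in s_tuples if r_keys.might_contain(k)))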

They get terrible performance.  It seems that they would have more
success if the relations were not broken into tuples, but instead were
stored as large chunks.  I'm sure there are a bunch of optimizations
they could make, but as they note: "Such improvements would come at
the expense of 'dirtying' DHT APIs with PIER-specific features -- a
design approach we tried to avoid in our initial implementation."


Pond: the OceanStore Prototype
Sean Rhea, Patrick Eaton, Dennis Geels, Hakim Weatherspoon, Ben Zhao,
and John Kubiatowicz
FAST'03.

I admire them for having OceanStore as their goal and then going ahead
and building a prototype.  Saying "let's just build it; I'm sure we'll
learn something from doing so" is a good thing.

OceanStore has two tiers: (1) powerful, well-connected hosts that
serialize changes and (2) less powerful machines that provide the
primary storage.  In terms of the CAP theorem, they provide
consistency and availability, and assume that the powerful machines do
not become disconnected (no network partitions).  OceanStore allows
write-sharing and provides full ACID semantics.

Stored files are called "data objects" and are never deleted.  Data is
stored in a Merkle tree to have history without wasting space.
Erasure coded blocks are stored in Tapestry.  Caching is done by
recovering the block, decoding it, and storing it this way
(post-erasure decoding).

Cryptography: they modify the Castro-Liskov Byzantine agreement
protocol to use MACs for all communication in the inner ring (powerful
machines) and public-key crypto for the outer ring (storage machines).
To keep keys valid in the long-running inner ring, they use proactive
threshold signatures, which allow a signature to be formed using some
n of m key shares (n < m).

Here are my terse notes from the conference presentation: 
"All resources are virtualized."  They use encryption for 1) privacy
and 2) integrity.  Erasure coded blocks put into Tapestry.  Archival
storage is for durability -- they nicely distinguish between
durability and ease of access.  In their experiments, they use an NFS
loopback to translate into Andrew/NFS.  They chose to use public key
threshold signatures from Castro/Liskov paper instead of symmetric
(even though they take longer).  They are not sure why signing takes a
long time for short writes.
Q: Why do a web cache over Pond? 


Storage management and caching in PAST, a large-scale, persistent
peer-to-peer storage utility
Antony Rowstron and Peter Druschel
SOSP'01

PAST's goal is to provide a write-once, read-many backup file system.
A user stores a file with insert(); this returns a unique fileId.  The
user can then read his own file or give the fileId (and possibly a
decryption key) to someone else so she can read the file.  There is no
search capability.  Files are looked up with lookup(fileId).  The owner
can delete a file with reclaim(fileId), which lazily deletes it.

This PAST paper is not concerned with bandwidth usage.  Strangely,
they claim that nodes are homogeneous, but then spend much of the
paper handling different disk capacities.

Pastry can route to a target using different paths, and they use this
capability to handle malicious nodes in the network.  E.g., if we
suspect something is wrong, we can try a different route.  I believe
this has been built into the quasi-standardized DHT API (by Dabek et
al.).

On re-reading this paper, I recalled that much of it is not about
what the focus of p2p systems research has become.  For example, their
experimentation section is exclusively about their two methods for
having disks fill up at the same rate -- there is nothing about churn!
In fact, this paper contains a large percentage of presumably
unimplemented and unsubstantiated conjecture.  For example: they
assume everyone has a smart card so that documents can be signed, but
this is never shown or even discussed in the experiments section.

PAST stores whole files.  They compare this with a CFS-style block
mechanism that divides files up before storing them.  Their two
methods for dealing with storage imbalance are:
- replica diversion: a node doesn't have enough room to store a
replica, so it points to another node that stores it on its behalf.
- file diversion: it is determined that most of the nodes on which the
replicas would reside are full, so the file is rehashed and put in an
entirely different location in the namespace.
 

SplitStream: High-bandwidth content distribution in cooperative
environments
Miguel Castro, Peter Druschel, Anne-Marie Kermarrec, Animesh Nandi,
Antony Rowstron and Atul Singh
SOSP 2003

I ended up reading the SOSP paper instead of the HotOS one.

- Problem being addressed: in tree-based multicast, interior nodes
carry the load.  Solution: many trees (a forest), where nodes are
usually in the interior of only one tree.

- Usage: Pastry has this value b (often equal to 4) that is the
branching factor.  They take a file or stream, encode it with either
some stationary erasure encoding or stream encoding, and send one
stripe down each of the 2^b channels.  This gives them, e.g., 16 trees
instead of one.  Nodes whose IDs match a channel's ID are in the
interior of that tree; otherwise they are leaves.

- They go into some detail about whether a forest can be formed (is
feasible).  They give two properties that guarantee w.h.p. that it is
feasible:
(1) if, in aggregate, nodes want more incoming stripes than they are
willing to provide outgoing, then the system is not feasible (cannot
be constructed).
(2) a sufficient condition for construction is:
   (a) there exist at least as many outgoing streams as incoming, and
   (b) nodes whose outgoing capacity is greater than their incoming
       demand must either receive (and forward) or originate all k
       stripes.
The probability of success increases with spare capacity (outgoing -
incoming) and the minimum number of incoming stripes.
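
Condition (1) is simple enough to write down (my sketch; each node has
a desired indegree and an offered outdegree, in stripes):

    def forest_maybe_feasible(nodes):
        # Necessary (not sufficient) check from condition (1): total
        # forwarding capacity must cover total desired indegree.
        # nodes = list of (wants_in, offers_out) pairs.
        demand = sum(w for w, o in nodes)
        supply = sum(o for w, o in nodes)
        return supply >= demand

    # everyone wants all 16 stripes but will forward only 8: infeasible
    print(forest_maybe_feasible([(16, 8)] * 100))   # False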

- Indegree is limited by network topology (they say, but actually
this is false in practice); out-degree is limited by the Scribe "push
down" mechanism.  Then they have all these silly ways to have nodes
join other nodes' children if their children's children's slots are
full.  If this doesn't succeed, they look in the "spare capacity group."

- Their evaluation section is pretty comprehensive, including churn
experiments.


Palimpsest: Soft-Capacity Storage for Planetary-Scale Services
Timothy Roscoe and Steven Hand
HotOS'03

- The problem they are trying to solve is that data that is
  short-lived, but important, like checkpoint state, is either stored
  locally on disk or via ACID-semantics off-site.  The former is not
  reliably backed up and the latter is too bothersome, they argue.
  Instead, we want storage that is briefly very reliable.  Palimpsest
  is wide-area storage that provides short-term soft-capacity.
  Because data is spread widely with erasure codes, they argue that it
  will still be very reliable.

- Usage: a writer uses Rabin's Information Dispersal Algorithm to
divide up a file into blocks.  The blocks are then encrypted using
Offset Codebook Mode.  Each block is given an effectively random key:
the writer comes up with an initialization vector to prevent fileId
guessing.

- They assume that all storage servers have the same amount of
storage.  Therefore, they all should have the same value tau, which is
the time a block should last in the system.

- The most interesting thing to me in this paper is the value tau.
They say "the value of tau acts as a single, simple system-wide
metric."  But I would really like a few other variables, like the rate
at which tau is changing and N, the number of nodes.  Of course, the
situation becomes much more complicated when nodes have different
storage capacities.  I have been trying to work out how accurate tau
would be with their sampling method versus gossiping.  No answer as
yet.

- For some reason, they believe that storage will be a scarce
commodity, as opposed to bandwidth.  For this reason, they ration
storage with a congestion-pricing scheme.

- This reminds me of what David Black of EMC said at the last FAST
  conference: if it doesn't make things more reliable, we aren't
  really interested.  Maybe if tau were reliable it would.

Here are Margo's notes from the conference:
        Storage for planetary scale apps
        There are classes of persistent data, they are targeting
                data for which: 
                * local disk is no good
                * DFS no good
                * data needs to last reliably for a short time, but
                  then it times out and can go away.
                * data may be quite large

        Want ephemeral storage service, so that's what they built
                * looks like NFS, but isn't
                * stores file fragments
                * use IDA
                * use ring-based DHT
                * storage is a queue
                (This sounds vaguely like India)

                * all files die, so they may need to be refreshed
                  if their lifetime exceeds the "timeout" value.
                * interesting cost model: only bill for writing (i.e.,
                  not for the amount of storage and the length of time
                  the data is going to be stored).
                * As a result, you don't get DDoS because you have to
                  pay to write.
        Interestingly enough, he actually presented an economic model
        for utility pricing.



Systems directions for pervasive computing, Robert Grimm, Janet Davis,
Ben Hendrickson, Eric Lemar, Adam MacBeth, Steven Swanson, Tom
Anderson, Brian Bershad, Gaetano Borriello, Steven Gribble, and David
Wetherall.
HotOS'01.

- This paper sketches out problems with the current mechanisms to
program in "pervasive computing environments."  What is pervasive
computing?  "The basic idea behind pervasive computing is to deploy a
wide variety of smart devices throughout our working and living
spaces."

- They note three problems with current approaches and state that
one.world would solve them:
  (1) application data and functionality need to be kept separate.  In
  other words, Java is not the answer.  Instead data is represented as
  tuples, functionality by "components."  Tuples, components, and
  other environments live inside environments.
  (2) applications must take changing resources into account and
  cannot be programmed using RPC.  Instead of Jini's limited leases,
  they argue to lease everything.  It is unclear what they intend to
  lease beyond what Jini already does.  one.world builds in
  checkpointing and migration.
  (3) a common platform is needed.  For some reason, they argue that
  Java's bytecodes and libraries, specifically designed for running on
  small devices (e.g., JavaME), don't cut it.

- They compare themselves to Jini/Java and Legion.  The Java
comparison makes sense: Java was designed to run on small devices and
for portability.  Legion, however, is part of the Grid.  It is unclear
how focused their definition of pervasive computing is if it includes
Legion.

- You just have to love the irony that one.world is implemented mostly
in Java.


The Design and Implementation of an Intentional Naming System
William Adjie-Winoto, Elliot Schwartz, Hari Balakrishnan, and Jeremy
Lilley
SOSP'99.

- INS is about decoupling names from object locations.  You describe
what you want and the system late-binds your request to the object
that best fits it at the time of the request.  The environment they
are imagining is a medium-scale network with at most a few thousands
of devices.

- There are two main modes of service: anycast and multicast.  In
anycast, you get the device that most closely matches your request.
In multicast, your request gets sent out to all of the devices that
match.  Match strings are written hierarchically and can have
wildcards.

- INS works by setting up Intentional Name Resolvers (INRs), which
themselves form a spanning tree.

- They show that it works with three sample applications: floorplan,
camera, and printer.

- Conclusion: interesting idea, but entering a crowded space,
including commercial implementations like Jini.


Chimera: A Virtual Data System for Representing, Querying and
Automating Data Derivation
I. Foster, J. Voeckler, M. Wilde, and Y. Zhao
SSDBM'02.

Chimera can be thought of as a shared lab notebook for physicists, or
a Makefile for computer scientists.  A scientist running an experiment
creates an original set of data through some process and then takes
the data through a series of transformations to get the final result.
Chimera acts to record all of the steps of an experiment so that the
results can be recreated by the original experimenter or by someone
else, and so that all sets of data are available (or can be recreated).
In Makefile fashion, Chimera keeps intermediary objects around and,
transparently, will only regenerate what new data is needed.
Hypothetically, it is also smart enough to figure out when it would be
cheaper to regenerate data locally instead of shipping it from some
remote site.

The entities in Chimera are:
- a transformation, an executable that takes some data in and spits
some out.
- a derivation, a specific instance of a transformation, e.g., I ran
transformation x on date y with parameters a, b, c and it produced
output file z.
- a data object, the input into or the output from a derivation.

It implements this abstraction with a data catalog, schema,
interpreter, and storage (which all have the adjective "virtual" in
front of them, as if this makes their use more clear).

To show that it works, they ran two stripped down experiments with
it.

Chimera seems like a great idea for physicists who are dealing with
huge amounts of data that they freely share at no cost and who prefer
computer programs recording their activity over a lab notebook.
Unfortunately, there is no way to see how useful it really is to this
small but active group from this paper.  My intuition is that even
though this paper has no real evaluation, it is probably good enough
for quite a few scientists -- a quick search on Google shows that the
tool is in active use.


On Death, Taxes, and the Convergence of Peer-to-Peer and Grid
Computing
Ian Foster and Adriana Iamnitchi
IPTPS'03. 

Scooped, Again
Jonathan Ledlie, Jeff Shneidman, Margo Seltzer, and John Huth
IPTPS'03.

(Compare and contrast)

These two workshop papers cover similar ground in discussing the
relationship between the Grid and p2p.  Where they differ is in their
endings: "Scooped" argues that the Systems community will miss an
opportunity like the web (again), whereas "Convergence" argues that
the two fields will converge precisely because they have the same
goals.

The goal: "the pooling and coordinated use of large sets of
distributed resources."

- p2p tries to take advantage of resources at the edge of the
  Internet; the Grid works the middle and edge.
- Grid environments employ at least limited trust and are frequently
  centralized.  Much p2p work focuses on functioning without trust, or on
  mechanisms that enable trust.
- p2p and Grid have different target communities: scientists, file
  sharers, anonymizing services.
- p2p systems are vertically integrated and each version obsoletes the
  previous.  The Grid has standards.
- Even though p2p research often focuses on distributed storage and
  file systems, Grids deal with much larger quantities of data.
- Grids have thousands of users; p2p has millions.

"Convergence" uses p2p examples that are currently in use, e.g.,
SETI@home.  It says that we should learn about failure from p2p and
infrastructure from the Grid.

"Scooped" begins with the point that p2p research does not have a
driving user base behind it and, therefore, is not required to build
something stable.  The Grid, however, has a demanding user base of
scientists/physicists who want the infrastructure "Convergence" talks
about.  "Scooped" ends with a "Call to Action" for p2p researchers to
go and talk with nearby scientists to find out what the real problems
are that need solving.

System architecture directions for networked sensors
Jason Hill, Robert Szewczyk, Alec Woo, Seth Hollar, David Culler, and
Kristofer Pister
ASPLOS'00

The authors justify various design decisions in TinyOS, an OS that
runs on fairly small, fairly simple sensors, and I agree with their
choices for the most part:
- The system is made up of a two-level scheduler and components.
- The scheduler calls event handlers, which primarily respond to
hardware interrupts and are non-preemptable, and calls tasks, which are
only preemptable by event handlers, not other tasks (see the sketch
after this list).  Thus, tasks are for "long"-running computation, like
finding an average, and an event handler is for putting a new sensor
reading into a queue.
- Components have fixed-size frames of memory and expose which events
and tasks they export.  What this allows for is shifting software
components into hardware without changing their interface -- good
idea!
- They argue that this type of scheduling is necessary because sensor
work is "concurrency-intensive:" while processing a task, you may
receive inputs that must be handled immediately.  Frequently this is
because something must be copied out of a buffer, for example, before
it is overwritten by a new input.
- By having cleanly modularized components, they are able to compile
the system down to exactly what is needed for a particular mote.
- Interesting comparison: "The problem we must tackle is strikingly
similar to that of building efficient network interfaces, which also
must maintain a large number of concurrent flows and juggle numerous
outstanding events."
- The scheduler itself compiles down to 178 bytes.  A simple
application weighs in at about 3KB (code size) and 226 bytes (data
size).
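
Here is roughly how I picture the two-level scheduler mentioned above
(a Python caricature of what is really tiny C; in this single-threaded
sketch "preemption" just means events always drain first):

    from collections import deque

    event_q = deque()    # filled by "hardware interrupts" in the real thing
    task_q  = deque()    # long-running work posted by handlers

    def run_scheduler():
        while event_q or task_q:
            while event_q:               # events always drain first;
                event_q.popleft()()      # in real TinyOS they preempt tasks
            if task_q:
                task_q.popleft()()       # a task runs to completion

    # an "interrupt" queues a reading; the handler posts an averaging task
    event_q.append(lambda: task_q.append(lambda: print("averaging readings")))
    run_scheduler()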

I believe this is a pretty early picture of TinyOS, given that Hill's
thesis was published three years later.  The paper doesn't go into much
experimentation and is primarily about the design of the system.

I'd like to know about "radios that can automatically wake themselves
at the start of an incoming transmission."  Why doesn't this take care
of the need to synchronize transmissions to save battery power?

Wireless Sensor Networks for Habitat Monitoring
Alan Mainwaring, Joseph Polastre, Robert Szewczyk, David Culler, and
John Anderson
WSNA'02.

This is the Great Duck Island paper: what happens when a bunch of
computer scientists try to build a network for a bunch of scientists
who have something they want to research (birds).  This is opposed to
the next paper, which is what happens when a bunch of engineers want
to apply sensors to one particular problem case, albeit one that they
are highly familiar with.  The computer scientists are trying to solve
the problem generally and look for areas of future research to be
approached.  The building engineers are trying to solve their one
problem very efficiently.

The Habitat Monitoring paper comes up with a tree-based approach to
move all of the sensor data from the individual burrows on the island
up to a recording station, which then routes the data to a permanent
data service via the Internet.  They include the concept of local
reprogramming and interrogation that PDAs nearby individual motes can
perform.  The transition point between a sensor patch and the Internet
connected base station is called a sensor network gateway.

They discuss a benefit of sensors being that they do not disturb
animals or plants, like human recorders do.  However, it seems that
the motes are quite large relative to the burrows, because the motes
were unable to fit into the burrows with their protective layer, so
they may well be a little disturbing.  Also, we have no idea what
disturbance the radio signals they produce cause.

They note that a simple antenna installed on the mote used two orders
of magnitude less power for the same level of reception than a
CerfCube that used 802.11b and TCP/IP.

They discuss two potential methods for motes to relay their sensor
data up the tree to a base station.  In the first, nodes at the
deepest leaves start in parallel and each level of communication
proceeds in step.  The second has each branch compute its level in
turn (DFS).


Two-Tiered Wireless Sensor Network Architecture for Structural Health
Monitoring
Venkata A. Kottapalli, Anne S. Kiremidjian, Jerome P. Lynch, Ed
Carryer, Thomas W. Kenny, Kincho H. Law, Ying Lei
SPIE'03. 

Wired networks in buildings are bad because of all of the extra wiring
that has to go in, and because earthquakes and rodents can break the
links.  The solution is to go wireless, but now the problem is battery
lifetime.  Their solution: have a two-tiered approach where the last
mile is battery-powered, low-range sensors; these sensors send all of
their data to outlet-powered nodes with longer-range radios.  Data
moves from the sensors to these "local site masters" and on to a
single "central site master," which gathers all of the data.

They say their design is inspired by LEACH and Bluetooth because it
uses a battery-saving approach for the sensors and a data-blasting one
for the local site masters.  At least, that's my understanding of it.

Interestingly, they need two accelerometers, one to measure the
day-to-day small scale building changes and one for earthquakes.  The
earthquake one must be always ready to wake up and record because
earthquakes only last for a few seconds.

As they note, at least in the bridge architecture where local site
masters are laid out linearly on the bridge, the local site masters
are single points of failure.  In addition they show that the
batteries will die after 18 months of use -- this means that somebody
has got to replace them no matter where they are on a bridge or inside
a building!  Taking a pointer from the other paper, they might be able
to replace their sending protocol (802.11b) with something simpler for
a large battery savings.  Of course, batteries leak power no matter
what you do -- they must be somehow replaceable.


Energy-efficient Communication Protocols for Wireless Microsensor
Networks
Wendi Rabiner Heinzelman, Anantha Chandrakasan, Hari Balakrishnan
HICSS'00. 

LEACH is one of those good ideas that, as soon as you read it, you
think: I could have thought of that.  This doesn't make it any less of
a good idea; it is a straightforward next-step to a given problem.
The problem domain is this: you have a field full of identical,
low-power sensors that are all collecting data; what is the best way
for them all to get their data, or a compressed/aggregated version of
it, to a base station that is far away (but reachable by any sensor
at, perhaps, a large energy expense)?
protocol, they look at two straw men: a "minimum energy" routing
protocol, which routes all data via the best hop to the base station,
and direct transmission, where each node just blasts its data at the
base station.  Via a MATLAB simulation which incorporates their simple
model of energy expenditure, they show that just blasting the data
isn't so bad.  The "minimum energy" route actually causes the nodes
near the base station to relay a lot of data, draining their
batteries.

After these straw men, they talk about LEACH, short for "low energy
adaptive clustering hierarchy."  In a nutshell, LEACH works in
rounds.  At each round:
- a collection of nodes elect themselves cluster leaders.  They give
an algorithm that produces a kind of round-robin of usage, but in a
decentralized way (sketched below, after this list).  This algorithm
could be useful in other contexts.
- nodes that are not cluster leaders listen for the nearest leader.
- cluster leaders work out a TDMA schedule for communication
- cluster leaders aggregate the data and blast it to the base station
After each round, a new node elects itself to be leader; thus, the
nodes doing the most battery expenditure are continuously changing.

They show that battery usage is minimized when there are very few
cluster leaders at a given time (5%).  The number of cluster leaders
is decided a priori.
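
The self-election rule, as I remember it from the paper (each node
flips a weighted coin each round, with the threshold rising for nodes
that haven't led recently, which gives the decentralized round-robin):

    import random

    def elects_itself(P, r, led_recently):
        # My reconstruction of the LEACH threshold, not verified
        # against the paper.  P: target fraction of heads (their 5%);
        # r: round number; led_recently: led within the last 1/P rounds.
        if led_recently:
            return False     # sit out until everyone has had a turn
        threshold = P / (1 - P * (r % int(1 / P)))
        return random.random() < threshold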

There are several problems with this paper:
- Their Direct vs. MTE comparison does not take into account data
collisions, something the authors of the next paper had a great deal
of trouble with.  Similarly, there must be a huge number of collisions
during their CSMA cluster set-up phase.
- They make no effort to justify their energy expenditure model.  I
have no way to know how valid it is.
- They do not say what happens to data that is coming in in between
rounds.  Presumably the node could store it, but for how long?
- They make this comparison with Direct and MTE about energy usage,
showing that LEACH uses much less energy.  However, it appears that
LEACH is compressing/aggregating the data at each cluster head and the
other algorithms are not.  In other words, more bits of sensor data
are getting to the base station in Direct and MTE, so to some extent
we are comparing apples and oranges.
- The next paper also found that communication success rates were not
symmetric, meaning that if you can hear your cluster leader, it may
not be able to hear you.


Building Efficient Wireless Sensor Networks with Low-Level Naming
John Heidemann, Fabio Silva, Chalermek Intanagonwiwat, Ramesh
Govindan, Deborah Estrin, and Deepak Ganesan
SOSP'01.

This paper takes the "Directed Diffusion" work one step further --
into an implementation -- and shows that it has some advantages and
some work to do.  It spends most of its time talking about how
directed diffusion works.
In directed diffusion, a base station registers an interest with all
of the nodes in the system by broadcasting out this interest.  Each
node remembers this interest and who sent it, and then is able to
relay information in the right direction when it comes across
information that matches this interest.  What directed diffusion also
allows is for intermediary nodes to cache data and, if it is on a
common path, to aggregate it, ameliorating the problem of nodes near
the root having to propagate too much data.  Nodes send information of
interest along "gradients."

What they emphasize in this paper is the naming aspect provided by
directed diffusion.  Because you just send descriptions of what you
are interested in throughout the network, there does not need to be
any lower level routing layer, like IP.  This is what distinguishes it
from many related projects, like INS.

Instead of just providing equality and wildcard matching, they have a
fairly powerful, but simple, set of matching rules: EQ, NE, LT, GT,
LE, GE, and IS.  They are also able to see if two elements match in
both directions (called a "complete match").  The matching rules give
them the ability to name rectangular (spatial) regions.
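
A toy version of one-way matching with those operators (mine; it
represents attributes as (key, op, value) tuples):

    import operator

    OPS = {"EQ": operator.eq, "NE": operator.ne, "LT": operator.lt,
           "GT": operator.gt, "LE": operator.le, "GE": operator.ge}

    def one_way_match(query, data):
        # "IS" marks an actual value; everything else is a constraint.
        # One-way match: every constraint in 'query' is satisfied by an
        # actual value in 'data'.  (A "complete match" checks both ways.)
        actuals = {k: v for k, op, v in data if op == "IS"}
        return all(k in actuals and OPS[op](actuals[k], v)
                   for k, op, v in query if op != "IS")

    # naming a rectangular region: 10 <= x <= 20, 0 <= y <= 5
    interest = [("x", "GE", 10), ("x", "LE", 20),
                ("y", "GE", 0),  ("y", "LE", 5)]
    reading  = [("x", "IS", 12), ("y", "IS", 3), ("temp", "IS", 17)]
    print(one_way_match(interest, reading))   # True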

Interestingly, they talk about an aggregator node computing a
confidence rating in the data.  This seems like a good idea when
sensors may be giving faulty data when their batteries start to run
low.

They implement directed diffusion on several sets of hardware; in
particular, they have "micro-diffusion" running on a mote.  Their
experimental section is full of remarks about improving the MAC and
the problem of data colliding -- stuff that you only find out about
with a real implementation.


Topology Management for Sensor Networks: Exploiting Latency and Density
Curt Schurgers, Vlasios Tsiatsis, Saurabh Ganeriwal, and Mani
Srivastava
MobiHoc'02

Problem domain: you have a field of sensors that will send data in
response to an event, but until that event occurs, they should stay
asleep as much as possible.  In other words, continuous data sensing
is not the problem domain.  Their solution was centered on three
premises: (1) a multihop approach is better than direct data blasting
(although this is unsubstantiated and counters the LEACH paper;
however, the LEACH paper focused on continuous data), (2) that full
sleep mode uses much less power than just being idle (I didn't realize
this before, but they do substantiate this with data from the
manufacturer), and (3) that it is possible to have more than one radio
per sensor.  This last assumption seems pretty reasonable, given that
some sensors already come with two radios.

Their approach is to use one radio for control messages and one for
data.  Control messages use one frequency, and, although I didn't see
this in the paper, one could assume that point-to-point data could be
set to use a particular frequency, so there wouldn't be collisions on
the data channel.  They call the control channel the "wakeup plane"
and the other the "data plane."

For most of the time, both of the sensors' radios are off and their
low-power-using sensors (e.g., temperature) are on, trying to detect events.
Periodically every node turns on its control radio and listens to see
if it should switch on its data radio for incoming data from a
particular node.  The way a node tells a particular node to switch on its
radio (so that it can relay data back to the base station) is that it
includes that node's ID in the control broadcast.  How nodes establish
the right path back to the base station (and those nodes IDs) is left
as an exercise for the reader.  Also left unstated is what happens if
that node does not respond.  In the section combining STEM and GAF,
they mention node death for the first time -- oops!

The main problem with their work is that I just don't believe it: the
numbers out of their simulator are so close to their theoretical
analysis that it makes me believe their simulator must be incredibly
simple.  The primary simplification they do make is the either/or 40m
data radius, which just does not occur in real life.

The AUTHOR responded to this review:

1. Firstly, regarding the claim of not trusting results due to close
correspondence between simulation and analysis: the reason is simple -
STEM by its very nature almost removes from consideration the single
most hard to analyze consideration, which is MAC level collision. As
an aside, this is one of the few papers whose results have been
independently replicated and re-analyzed by many well-regarded
researchers (Nitin Vaidya, Michele Zorzi, to name a few) as they
created competing (and improved) schemes.  However, it is indeed true
that STEM (and follow-on work) all use the disk model for radio, but
in this case the primary impact of this assumption will be on absolute
energy numbers as opposed to relative improvement (which is the focus
of STEM).

2. STEM is not about multihop vs. point-to-point and therefore doesn't
counter anything about LEACH or had any relationship to it. Indeed, if
LEACH's assertion that single hop is better than multihop is valid in
a specific network, that is fine: STEM will still improve it (e.g.
Microsoft's work on Wake-on-wireless @ Mobicom 2002 or 2003 is the
special single-hop subcase of STEM).

3. The data channels operate on single frequency as stated in the
paper, and not cheating as the review speculates on different
collision free frequencies [which can't be done in practice anyways as
forwarding nodes will have a problem of switching frequencies and thus
adding latency, and frequency allocation will be a messy task].

4. Routing is orthogonal to STEM, and *any* routing scheme will work
with it (although some may interact better with STEM-driven wakeup and
others not).  For results in the paper we just had
shortest-path-routing to gateway sink, but in subsequent work we have
used others such as geographical forwarding and DSR. Incidentally,
this orthogonality to routing is also exhibited by other such schemes
(the S-MAC MAC protocol from USC/ISI which does a very similar wakeup
management although not a data-control separation, and also the newer
TinyOS MAC where the receiver polls and sender sends a long packet to
ensure rendezvous in a fashion which is similar to STEM's control
channel).



ESRT : Event-to-Sink Reliable Transport in Wireless Sensor Networks
Yogesh Sankarasubramaniam, Ozgur Akan, Ian Akyildiz
MobiHoc'03. 

Problem domain: again, these authors are concerned with sensors being
used to track a particular event, not being used to continuously send
data to some sink.  What they notice is that when an event is
detected, it is most likely detected by several sensors at once and
they all try to send their data back to the base station at once.  The
base station wants some amount of data R, but if the sensors send too
much data at once, the data packets will collide with one another in
the network and the base will get fewer than R.  In fact, there are
several states the system can be in, based on the amount of sense data
that is being received (immediately perceptible by the base station)
and whether the amount received is low because the sensors are not
emitting enough or because they are emitting too much and there are
collisions.  They propose a way to detect your state and move to the
optimal state, where the sensors are sending out the amount of data
you want to receive and there aren't collisions going on in the
network.

As I see it, the primary difficulty with their algorithm is that there
is no attempt to discover whether this optimal balance is achievable
at all: what if there is no way to get the R the application wants?
The authors don't even seem to have considered that, never mind put a
solution for it into their algorithm.

The way that congestion is detected is that if a relay node sees that
its queue is about to become full, it sets a congestion bit in the
packet header.  Periodically, the data sink, which has the capability
to broadcast to all of the nodes at once, determines a new rate at
which the sensors should emit data and broadcasts this to them.  Of
course, this is all shown to work brilliantly using, again, a
simulation.
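
The feedback loop as described, sketched in Python (the threshold and
the rate updates are my placeholders; the paper's per-state rules are
more involved):

    def forward(packet, queue_len, queue_cap):
        # A relay marks packets when its queue is close to full.
        if queue_len >= 0.9 * queue_cap:     # 0.9 is my placeholder
            packet["congestion"] = True
        return packet

    def new_report_rate(rate, received, wanted, congested):
        # Sink broadcasts a fresh reporting rate each decision interval.
        # These update rules are illustrative, not the paper's.
        if congested:
            return rate * received / max(wanted, 1)    # back off
        if received < wanted:
            return rate * wanted / max(received, 1)    # ask for more
        return rate                                    # on target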

DSR: The Dynamic Source Routing Protocol for Multi-Hop Wireless Ad Hoc
Networks
David B. Johnson, David A. Maltz, and Josh Broch
in Ad Hoc Networking, edited by Charles E. Perkins
2001.

This paper is a nice retrospective on a protocol that people at CMU
have been working on for about five years.  DSR itself is a simple
idea; they cover it and some of the details that one might not think
of that they have come up with over time.  The basic idea is for each
sender to find and cache the route to each receiver independently and
in full.

DSR itself divides into two main actions: Route Discovery and
Route Maintenance.  In discovery, a node does not know how to reach a
destination: they are not in its routing table.  It learns the route
by broadcasting a discovery request.  If no intermediary knows and the
destination is reachable, the message eventually gets to the receiver
and the receiver responds.  Maintenance occurs when a node thinks it
knows how to route to a destination, but some hop along the way
doesn't work.  The node at the broken hop responds to the originating
node, telling it which link failed.
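
A toy flood-based Route Discovery (mine; real DSR adds request IDs,
route caches, and learning from overheard packets):

    def route_discovery(neighbors, src, dst):
        # neighbors: dict node -> list of nodes within radio range.
        frontier, seen = [[src]], {src}     # each entry is a route record
        while frontier:
            nxt = []
            for route in frontier:
                for n in neighbors[route[-1]]:
                    if n == dst:
                        return route + [n]  # target replies with the route
                    if n not in seen:
                        seen.add(n)
                        nxt.append(route + [n])
            frontier = nxt
        return None                         # destination unreachable

    net = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
    print(route_discovery(net, "A", "D"))   # ['A', 'B', 'C', 'D']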

Three interesting themes are:
- bi-directional vs. uni-directional communication.  Sometimes they
are able to perform optimizations if they know or assume they are
running on a network that enforces bi-directional links (e.g.,
802.11).  For example, a destination knows it has a valid route back
to the host.  Incorporating this from the beginning was a refreshing
change from the silly theoretical papers that assume fixed radii.
- They introduce a nice tradeoff between the benefits of running in
promiscuous mode and not.  They actually don't mention battery
savings, but this seems to be the main savings in not being
promiscuous.  One cool benefit was passive acknowledgement, where a
node A sending to B would hear B forwarding to C and therefore
wouldn't need an explicit ACK from B.
- Because DSR can rediscover paths, mobility works.

They discuss how DSR would work with heterogeneous radios (e.g., two
nodes with different power radios).

I'm not sure what to expect for a retrospective paper, but their
evaluation was pretty thin.  It's hard to find something interesting
in graphs with only one line.  They ran their simulations using ns-2.
It was unclear what exactly the routing overhead was: if it was
per-node, etc.

They ran a five node implementation with DSR running on laptops in
cars.  They do not have any numbers for this experiment.

Nodes were required to have unique IDs and this is how a node would
specify destinations (i.e., not geometrically).

Big question: how much memory does this use and how could it be
reduced for motes?


Geometric Spanner for Routing in Mobile Networks
Jie Gao, Leonidas J. Guibas, John Hershberger, Li Zhang, and An Zhu
MobiHoc'01

A lot of proofs and no comprehensive or comprehensible evaluation:
maybe it works but they haven't shown it to me.

Basic idea: Brad Karp and H.T. Kung proposed Greedy Perimeter
Stateless Routing (GPSR).  This maintains a planar subgraph of node
connectivity and guarantees the delivery of a packet if a route
exists.  The two underlying graphs that the original paper examines
are the relative neighborhood graph (RNG) and the Gabriel graph (GG).
Where these authors come in is in proposing a new underlying graph for
GPSR: they propose using a restricted Delaunay graph (RDG) as the
planar subgraph.

The main benefits of RDG are that the Euclidean and topological
lengths are only a constant factor greater than optimal and that it
can be constructed in a distributed fashion.

A little graph theory:
- a Delaunay graph is constructed by creating a Voronoi diagram and
connecting the points whose cells are adjacent.
- Delaunay graphs are good spanners.
- a restricted Delaunay graph is a subgraph of the Delaunay graph
containing only the edges shorter than 1 (some unit distance).  Here 1
is the radio propagation radius, which they model as all-or-nothing
reception.  (A sketch follows this list.)
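
A centralized sketch of that definition (the paper's actual
contribution is a distributed construction; scipy here is just for the
Delaunay triangulation, and the unit radius is a parameter):

    import numpy as np
    from scipy.spatial import Delaunay

    def restricted_delaunay_edges(points, radius=1.0):
        """Delaunay edges no longer than the radio radius."""
        tri = Delaunay(points)
        edges = set()
        for simplex in tri.simplices:      # each row is one triangle
            for i in range(3):
                a, b = sorted((simplex[i], simplex[(i + 1) % 3]))
                if np.linalg.norm(points[a] - points[b]) <= radius:
                    edges.add((a, b))
        return edges

    # e.g., 20 random nodes in a 2x2 field with radio radius 1:
    print(restricted_delaunay_edges(np.random.rand(20, 2) * 2))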

I didn't understand the advantage in being planar -- does it minimize
congestion?  (Presumably planarity is what lets GPSR's perimeter mode
walk faces with the right-hand rule and still guarantee delivery.)

As noted, their evaluation is lacking.  There is no comparison with
GG and no measurement of construction or maintenance costs.  They also
do not consider what actually happens when node clusterheads rotate.


The Cricket Location-Support System
Nissanka B. Priyantha, Anit Chakraborty, and Hari Balakrishnan
MobiCom'00

Main idea: in a ubiquitous computing space, e.g., indoor room
environments, objects need to know where they are currently located,
especially if they frequently move around.  Cricket addresses this
problem by attaching a "listener" to every object that needs to know
its own location and putting "beacons" in every discrete zone.
Listeners then determine which beacon they are nearest and infer their
location from that.

Communication works by listeners stockpiling RF and ultrasound
signals from beacons and then deducing their location.  Beacons
transmit an RF signal and an ultrasound signal at the same instant, at
random intervals to avoid collisions with other beacons.  Listeners
continuously listen for these incoming signals.  When a listener
starts receiving an RF signal, it listens for the ultrasound pulse
from the same beacon and approximates the distance to the beacon from
the difference in arrival times.  This appears to have grown out of
earlier work that tried to use RSSI for the same purpose and got poor
results.  One interesting point is that the listeners are purely
receivers and the beacons are purely senders: there is no
communication back to the beacons.  This makes things simple -- no
time synchronization is needed and listeners never have to transmit --
but continuously listening wouldn't do much for your battery
consumption.
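
The distance arithmetic is just the arrival-time gap times the speed
of sound, since RF crosses a room effectively instantaneously.  A
sketch with made-up numbers:

    SPEED_OF_SOUND = 343.0   # m/s at room temperature, roughly

    def distance_from_beacon(rf_arrival_s, ultrasound_arrival_s):
        # The gap between the two arrivals is essentially the
        # ultrasound pulse's time of flight.
        return (ultrasound_arrival_s - rf_arrival_s) * SPEED_OF_SOUND

    print(distance_from_beacon(0.0, 0.0146))   # 14.6 ms gap -> ~5 m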

Listeners thus accumulate estimates of the distance between
themselves and one or more beacons.  Using these, they determine their
actual position in one of three ways:
- majority: pick the beacon heard from most often, ignoring distance
entirely.
- minmean: for each beacon, average the distance estimates; pick the
beacon with the smallest mean.
- minmode: discretize each beacon's samples into buckets (ten inches
wide in their experiments) and take as that beacon's distance the
bucket with the most entries (the mode); pick the beacon with the
minimum mode.
These different methods are worth mentioning because the last is
resistant to outliers and gave them good results (and it might not be
the most intuitive).  A sketch of all three follows.
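
A sketch of the three estimators; the per-beacon sample layout and the
0.25 m (roughly ten inch) bucket are my assumptions:

    from collections import Counter

    def majority(samples):   # samples: {beacon: [distance, ...]}
        return max(samples, key=lambda b: len(samples[b]))

    def minmean(samples):
        return min(samples,
                   key=lambda b: sum(samples[b]) / len(samples[b]))

    def minmode(samples, bucket=0.25):
        def mode_distance(ds):
            counts = Counter(int(d / bucket) for d in ds)
            fullest = counts.most_common(1)[0][0]
            return (fullest + 0.5) * bucket   # center of the bucket
        return min(samples, key=lambda b: mode_distance(samples[b]))

    data = {'A': [2.0, 2.1, 2.0, 9.5], 'B': [3.0, 3.1, 3.0]}
    print(minmode(data))   # 'A': the 9.5 outlier can't move the mode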

They throw in a little something about user privacy, but I'm not sure
what that's about (presumably that, because listeners compute their
own location passively, the infrastructure never learns where a user
is).

They have results from actual experiments, which is a refreshing
change.  The results show that the system works fine -- they just
aren't very exciting.


Localization from Mere Connectivity
Yi Shang, Wheeler Ruml, Ying Zhang, and Markus Fromherz
MobiHoc'03

I thought this would be a cool paper because Wheeler just graduated
from Harvard and I wanted to see what people were up to at Xerox PARC
(we had a choice of papers this week).  I couldn't really get over the
one big hump: see below.

Main idea: nodes need to know their locations when they have just
been dropped in arbitrary places.  How can they come to some kind of
consensus about where they are relative to each other and, if a
handful of nodes have GPS, where they are absolutely?  They propose a
CENTRALIZED algorithm to solve this problem.  The algorithm is roughly
(a sketch follows the list):
- find connectivity information via broadcasts and RSSI (method
unstated)
- run an all-pairs shortest-paths algorithm to construct a distance
matrix between nodes
- generate a relative map of internode locations using
multidimensional scaling (MDS)
- using the nodes that know their absolute locations via GPS,
transform the relative positions into absolute ones.
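
A sketch of steps 2-4 (minus the final anchor-based transform) using
Floyd-Warshall and classical MDS; this is my own rendering, assuming a
connected network, not their code:

    import numpy as np

    def relative_positions(hop_costs):
        """hop_costs: n x n matrix of one-hop distances (np.inf where
        nodes are not connected, 0 on the diagonal)."""
        d = hop_costs.astype(float).copy()
        n = len(d)
        for k in range(n):              # all-pairs shortest paths
            d = np.minimum(d, d[:, [k]] + d[[k], :])
        # Classical MDS: double-center the squared distances, then
        # take the top two eigenvectors as 2-D coordinates.
        j = np.eye(n) - np.ones((n, n)) / n
        b = -0.5 * j @ (d ** 2) @ j
        vals, vecs = np.linalg.eigh(b)  # ascending eigenvalues
        top = np.argsort(vals)[-2:]
        return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))

The output is a relative map; step 4 would rotate, translate, and
possibly flip it to best fit the GPS-equipped anchors.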

They reference a different paper from the Cricket one that uses Time
of Arrival measurements and ultrasound (from a later conference).

One positive aspect of this paper was that the authors weren't as
boastful as many of the papers we've read: they said what they were
good at and acknowledged the ways other algorithms were better when
they were.

They suggest as future work that their algorithm could be
decentralized by dividing the network into smaller zones, with each
zone performing the algorithm independently.  I suppose that this
could be done in a heterogeneous scenario with motes and SCAN devices
(if they could find the distance information).


Fine-Grained Network Time Synchronization using Reference Broadcasts
Jeremy Elson, Lewis Girod and Deborah Estrin
OSDI'02.

Please see review I wrote for Sensor Paper Day, January 2003.
http://www.eecs.harvard.edu/~jonathan/timesynchreview.pdf


Optimal and Global Time Synchronization in Sensornets
Richard Karp, Jeremy Elson, Deborah Estrin, and Scott Shenker
CENS Technical Report 0012.

Main idea: RBS (from the previous paper) computes a pairwise
translation from one node's clock to another node's clock, including
skew: t_j \approx a_{ij} t_i + b_{ij}, where a_{ij} is the slope of
the conversion line (the relative clock skew) and b_{ij} is the clock
offset.  Given a set of such translations, there need not be
transitive agreement among them.  This is a problem if you want
globally consistent clocks.  They show that the most precise set of
pairwise synchronizations is globally consistent.

They first ignore skew and concentrate solely on offset.  They do
this by taking the maximum-likelihood set of offset assignments, which
is guaranteed to be globally consistent, and then showing that this
set also achieves minimum variance.  (A least-squares sketch of that
step follows.)
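
Skew aside, that maximum-likelihood assignment is an ordinary
least-squares problem: pick per-node offsets o_i minimizing the
squared error against every measured pairwise offset b_ij.  A sketch
(my formulation, assuming i.i.d. Gaussian measurement error):

    import numpy as np

    def global_offsets(n, measurements):
        """measurements: list of (i, j, b_ij) with b_ij the measured
        offset o_j - o_i.  Node 0 is pinned as the reference clock,
        since offsets only matter up to a global constant."""
        rows, rhs = [], []
        for i, j, b_ij in measurements:
            row = np.zeros(n)
            row[j], row[i] = 1.0, -1.0
            rows.append(row)
            rhs.append(b_ij)
        row0 = np.zeros(n)
        row0[0] = 1.0                   # pin o_0 = 0
        rows.append(row0)
        rhs.append(0.0)
        o, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs),
                                rcond=None)
        return o

    # Mutually inconsistent raw measurements (0->1: 1.0, 1->2: 1.0,
    # 0->2: 2.5) get smoothed into one consistent set of clocks:
    print(global_offsets(3, [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 2.5)]))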

Toward the end of the paper, they add clock skew back in.  It would
have been interesting to see how frequently some kind of global
synchronization would need to be run based on the rate of skew.

Hopefully this paper will be a little clearer in its published
version.

The Design of an Acquisitional Query Processor for Sensor Networks,
Samuel R. Madden, Michael J. Franklin, Joseph M. Hellerstein, and Wei
Hong
SIGMOD'03.  

Key observation: traditional DB designs, even when they consider
streaming data, don't consider interacting with the data sources to
change how the data is physically acquired.  The spin they put on this
paper is that with relatively smart sensors like motes, the DB has a
whole new lever to pull in deciding how to meet a user's request.
They boil this down to: "When should samples for a particular query be
taken?"

Notes on the TAG language (an example query follows the list):
- queries can be set to start only after an event has occurred, e.g.,
ON EVENT bird-detect, start query foo.
- epochs are specified with SAMPLE INTERVAL x s FOR y s [s->sec].
- a LIFETIME clause specifies a method for acquiring as much data as
possible over the course of many days, based on anticipated battery
usage (pretty cool!)
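
Putting those clauses together, a query might look something like
this (my reconstruction from the notes above, shown as a string; the
syntax details are an assumption, not quoted from the paper):

    # A TinyDB-style acquisitional query: when a bird is detected,
    # sample nearby light and temperature every 2 s for 30 s.
    query = """
        ON EVENT bird-detect(loc):
          SELECT AVG(light), AVG(temp)
          FROM sensors AS s
          WHERE dist(s.loc, event.loc) < 10m
          SAMPLE INTERVAL 2 s FOR 30 s
    """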

They discuss query optimization based around choosing the query plan
that will use the least power, as this "subsumes issues of processing
cost and radio communication."  But what if I want the data faster and
I'm willing to pay the battery cost?  In any case, metadata about how
expensive different operations are is kept in a centralized table that
the optimizer can draw from.  They also discuss the possibility of
seeing multiple queries over the same set of data and rewriting the
queries to account for this.

The most interesting section of the paper is the discussion of
Semantic Routing Trees.  The basic idea is that there may be several
candidate parents to choose from when building a tree from the root
downward.  One way to choose a parent is to go solely on the estimated
distance (by radio strength).  They suggest also looking at the types
of queries that will occur and factoring that into parent selection,
so that the trees that form allow for better aggregation than a
randomly selected tree would (see the sketch below).
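
The payoff of an SRT shows up at query time: a node forwards a query
only to children whose advertised value ranges overlap it.  A sketch
(data layout assumed):

    def children_to_query(children, lo, hi):
        """children: {child_id: (min, max)} -- the range of the
        indexed attribute (e.g., x position) covered by each child's
        subtree."""
        return [c for c, (cmin, cmax) in children.items()
                if cmin <= hi and lo <= cmax]

    # A query over x in [10, 20] skips the subtree that can't match:
    print(children_to_query({'a': (0, 5), 'b': (12, 30)}, 10, 20))
    # -> ['b']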

Another interesting idea here was prioritizing data delivery: as each
node sends values toward the root and its packet queues start
overflowing, how does it decide which packets to keep and which to
throw away?  They look at three schemes: FIFO, winavg (which averages
the first two entries), and delta.  Delta "relies on the intuition
that the largest changes are probably interesting" and tries to keep
those in the queue over packets that tell you what you already know.
For fast-changing sensor data, this method worked particularly well.
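
A sketch of delta-style eviction under my assumptions about the queue
(scalar readings, one value already delivered to the root):

    def enqueue_delta(queue, reading, last_sent, capacity):
        """When full, drop the queued reading that differs least from
        what the root already knows -- the least 'interesting' one."""
        queue.append(reading)
        if len(queue) > capacity:
            dullest = min(queue, key=lambda r: abs(r - last_sent))
            queue.remove(dullest)
        return queue

    print(enqueue_delta([20.1, 25.0], 20.2, last_sent=20.0,
                        capacity=2))
    # -> [25.0, 20.2]: the big jump to 25.0 survives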

 
Beyond Average: Towards Sophisticated Sensing with Queries
Joseph M. Hellerstein, Wei Hong, Samuel Madden, and Kyle Stanek
IPSN'03

This paper is a work in progress that examines topographic mapping
using isobars, compression, and vehicle tracking.  I found the section
on using wavelets particularly interesting and will read more about
this topic (I've printed out their main reference).

Their motivation is that TAG can support simple queries, but that many
people have asked them to include the ability to perform more complex
queries.  There is no real theme to the paper other than updating the
reader on the ways they are currently looking at answering more
complex queries.

The aggregates in TinyDB are composed of three functions: a merging
function f, an initializer i, and an evaluator e.  i initializes a
partial state record from a single reading (e.g., what are the sum and
count of the values seen so far?).  f takes two partial state records
and merges them into one.  e performs the final evaluation, e.g.,
taking the average.
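
For AVERAGE, the three functions look like this (a standard rendering
of the decomposition; the partial state record is a (sum, count)
pair):

    def init(value):          # i: partial state from a single reading
        return (value, 1)

    def merge(a, b):          # f: combine two partial state records
        return (a[0] + b[0], a[1] + b[1])

    def evaluate(state):      # e: final answer at the root
        total, count = state
        return total / count

    # The average of 3, 5, 10, computed hierarchically up a tree:
    print(evaluate(merge(merge(init(3), init(5)), init(10))))   # 6.0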

Storage points are temporary tables that are "accessible for read or
write from any node in the network; its exact location within the
network is not fixed -- that is, it can be moved as an optimization."
I did not see how this movement happens or how a storage point is
found once it has moved.


GHT: A Geographic Hash Table for Data-Centric Storage
S. Ratnasamy, B. Karp, L. Yin, F. Yu, D. Estrin, R. Govindan, and S. Shenker
WSNA'02.  

Main idea: for sensor networks with huge numbers of nodes (e.g.,
10^6), queries may be best supported by storing the data at well-known
locations in the network, instead of locally at the node that did the
sensing or at some exit point that all data is shipped to.  They get
all data pertaining to a particular event to the same node by hashing
the event name (e.g., "elephant sightings") to a geographic location
within the sensornet and then using GPSR to route it there; the node
that actually stores the data is the last hop on the GPSR route.  Data
is backed up on physically nearby nodes (i.e., within three hops
[which seems pretty far to me]).  The glaring omission from this paper
is that there is no mention -- none! -- of the problem with hashing
names.  How do users performing queries know the exact names of events
a priori?  They had better, because otherwise there is no way to find
the data.  So, I can believe that their scheme might be a good idea in
the extremely-large-network case, but the problem of finding the data
remains.  (A sketch of the hashing step follows.)
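
The hashing step itself is tiny: map the event name to an (x, y)
point in the field, then let GPSR do the rest.  A sketch (the hash
choice and field dimensions are my assumptions):

    import hashlib

    def event_location(event_name, width_m, height_m):
        """Map an event name to a point in a width x height field."""
        h = hashlib.sha1(event_name.encode()).digest()
        x = int.from_bytes(h[:4], 'big') / 2**32 * width_m
        y = int.from_bytes(h[4:8], 'big') / 2**32 * height_m
        return (x, y)

    home = event_location("elephant sightings", 1000, 1000)
    # GPSR routes the reading toward `home`; the last hop on that
    # route becomes the storage node, with replicas on its neighbors.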

A debatable point is whether sensor networks would ever scale to this
size without having some gateway into more provisioned nodes.  My
contention is no, but I'd be willing to hear an argument against
this.  If the answer is no, GHT becomes unnecessary.

A counterargument to their data-centric vs. node-centric point is
that while it's true you may not care which node gathered the data,
you probably do care about that node's location.  So one way or
another, the node id or the location of the sensed observation needs
to be attached to the data.  This counterargument doesn't seem so
strong if the nodes are moving around, which makes the actual location
of the observation more important.

Their section on approximating communication costs was overly simple.
They do not adequately cover in-network summarization for external or
local storage.  Also, I don't see why communication between any node
and the gateway should take sqrt(n) hops when, in general, there will
be a tree to the gateway, leading to log(n) hops (though a tree's
depth is still bounded by the network diameter, which is on the order
of sqrt(n) hops in a planar deployment, so log(n) would require
long-range links).

A strange item in their evaluation was including failing _and
restarting_ nodes.  Why would sensor nodes restart?  This isn't a p2p
network.

 
An evaluation of multi-resolution search and storage in
resource-constrained sensor networks
Deepak Ganesan, Ben Greenstein, Denis Perelyubskiy, Deborah Estrin,
and John Heidemann
SenSys'03

Main idea: nodes create a time-series summary of their observations
using a temporally-organized wavelet; that is, the most recent
observations experience the least compression and loss.  These
time-series wavelets are combined at each level toward the gateway to
provide spatial summarization, giving a more compressed and less
accurate picture at each level.  However, the old sensed data is still
available in the network, so a user can drill down to an interesting
event and get at that data if he or she wants.  (A toy sketch of the
temporal aging follows.)
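
A toy stand-in for the temporal part (pairwise averaging rather than
their actual wavelet codec): each summarization pass keeps the newest
readings exact and halves the resolution of everything older.

    def summarize(readings, raw_window=8):
        """Keep the newest raw_window readings exact; replace each
        older pair with its average (older data -> more loss)."""
        old, recent = readings[:-raw_window], readings[-raw_window:]
        coarse = [sum(old[i:i + 2]) / len(old[i:i + 2])
                  for i in range(0, len(old), 2)]
        return coarse + recent

    print(summarize(list(range(16))))
    # -> [0.5, 2.5, 4.5, 6.5, 8, 9, 10, 11, 12, 13, 14, 15]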

Even though the authors are designing this system for the case where
what the user wants is not known ahead of time, the user (somehow)
constructs an aging function for the temporal data on the sensing
nodes.