This is a compilation of reviews put together for Matt Welsh's Modern Distributed Systems class, Fall 2003. Please email me any comments or questions.

RON, David Andersen (SOSP 2001)
- Straw man: Andersen shows how well you could perform application-specific routing if you had essentially perfect information.
- BGP has problems: slow rerouting, generic and primitive policies, and no differentiation between down links and merely slow links.
- BGP is the protocol ASes use to communicate with each other about paths in the Internet. Because its messages aggregate a huge amount of information, BGP is both extremely scalable and light on detail. This is where RON comes in. A RON is limited to several hosts, up to about 50, that exchange detailed information about the paths between them. Because they have this good, detailed information, RON nodes are able to quickly route around failures and find better routes than BGP.
- Results: RON is able to route around failures in about 19 seconds, as opposed to several minutes with BGP. It also made minor improvements in throughput and loss in some transfers.
- The paper has an extremely thorough results section.
An additional thought this brings up:
- There is a traditional CS mantra that one should be able to do something not at all, one way, or an unlimited number of ways. BGP gives you one way and is therefore very scalable. I wonder what kind of improvement BGP would get if it had two ways (along the lines of Mitzenmacher's Power of Two Choices).
Problems:
- They argue that you only need one node as an alternate route, but this doesn't seem to work in cases where you could use more than one node (a retroactive goal).
- RON only works when you are multi-homed. However, in most cases, it is the last link (the last mile) that is most likely to fail. In this case, RON gets you nothing.
- The probing is quadratic (n^2), so it doesn't scale beyond about 50 nodes.
- The biggest knob, the number of probe messages per second, is one they never turn.

i3:
- Cute idea. The authors use Chord as a register-your-interest, then sit back and receive packets mechanism. To use i3, you say "I'm interested in key x," but only the upper bits really matter, because they direct the message of your interest to the session manager, which they call the rendezvous point. This is the join operation (and is called inserting a trigger). Sometime after that, the sender sends packets sequentially starting at key x. Because these all map to the same server (the rendezvous point), the server knows whom to send them to. (A toy sketch of this mechanism follows below.)
- Purpose. There are lots of different ways to achieve mobility, multicast, and anycast, but no one solution that can handle all of them. Until now, of course.
- Robustness. They essentially skirt the robustness issue by having soft state (triggers must be periodically reinserted) and redundancy (there is a primary and a backup trigger).
- They talk about routing efficiency being poor. It's poor because they are using naive Chord. Proximity is a solved problem in p2p networks (maybe it wasn't when they wrote the paper).
- Nodes use public triggers, which are based on hashes of DNS names, to find keys in the first place. Private triggers are shorter lived and presumably must be sent via an out-of-band channel.
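As a concreteness check, here is a toy sketch (mine, not the authors' code) of the trigger mechanism: one dict stands in for the Chord node that key x's upper bits map to, and deliver() is an assumed network-send primitive.

    # A toy sketch of i3-style rendezvous, not the paper's implementation.
    class RendezvousNode:
        def __init__(self):
            self.triggers = {}   # id -> set of receiver addresses

        def insert_trigger(self, packet_id, receiver_addr):
            # Receivers register interest; triggers are soft state, so a
            # real node would also timestamp and expire these entries.
            self.triggers.setdefault(packet_id, set()).add(receiver_addr)

        def send(self, packet_id, data):
            # The sender knows only the id; the rendezvous node forwards
            # the packet to every receiver holding a matching trigger.
            for addr in self.triggers.get(packet_id, ()):
                deliver(addr, data)

    def deliver(addr, data):     # assumed network primitive
        print(f"deliver {data!r} to {addr}")

    node = RendezvousNode()
    node.insert_trigger("key-x", "10.0.0.5")
    node.send("key-x", b"hello")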
Brewer's Lessons (2001)
- Interesting experience paper based on his work at Inktomi.
- Useful metrics:
  - uptime = (MTBF - MTTR)/MTBF, where MTBF is mean time between failures and MTTR is mean time to recover. He argues that, since you can't make parts fail less frequently and since it is hard to measure MTBF, you should focus on making repair faster.
  - yield = queries completed/queries offered. In other words, not all seconds have equal value: it is more important to be up when you are getting a lot of requests.
  - harvest = data available/complete data. High harvest means that every query has access to the entire database.
- DQ Principle: data per query x queries per second = constant, where this constant is limited by your physical parameters; ideally, DQ scales linearly with the number of nodes.
- Replication: AB1, AB2; AB2 goes down. Harvest stays the same because all data is still available; yield is cut in half because the system can now handle only half as many queries.
- Partitioning: A1, B2; B2 goes down. Harvest is cut in half because B is not available; yield stays the same because offered queries still complete.
- Conclusion: replicate data beyond the assumed load so that harvest stays OK if machines die and yield is still above the required threshold. (A small numeric sketch of these metrics follows below.)
- Upgrade methods: fast reboot, rolling upgrade, *big flip* -- upgrade half the site at a time.
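To make the metrics concrete, here is a tiny numeric sketch of the definitions above; all numbers are hypothetical, and the last two lines replay the replication vs. partitioning example.

    # Tiny arithmetic versions of Brewer's metrics; numbers are made up.
    def uptime(mtbf, mttr):
        return (mtbf - mttr) / mtbf

    def query_yield(completed, offered):       # "yield" is a Python keyword
        return completed / offered

    def harvest(data_available, complete_data):
        return data_available / complete_data

    print(uptime(mtbf=1000.0, mttr=10.0))      # 0.99
    # One failure out of two nodes; total DQ halves either way, and the
    # two schemes differ only in which factor absorbs the loss:
    print(harvest(1.0, 1.0), query_yield(50, 100))    # replication: full data, half the queries
    print(harvest(0.5, 1.0), query_yield(100, 100))   # partitioning: half the data, every query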
SNS/TACC (1997) Fox, Gribble, Chawathe, Brewer, Gauthier
- This paper offers one solution to managing and building for clusters. The authors' solution provides the user (a programmer) with a series of reusable building blocks to harness the advantages of clusters and ameliorate the difficulties in shared state, load balancing, and monitoring. They concentrate on the read-dominated, user-oriented side of the Internet, which allows them to reduce the concurrency problem inherent in distributed design; if a worker temporarily uses stale data, that is OK in their semantics.
- BASE (basically available, soft state, eventual consistency).
- Hierarchical API, which allows users to use parts at the top (Service) level for most projects, but then delve into the TACC layer if they need more customization.
- They concentrated on the burstiness of most Internet activity by allowing an administrator to designate overflow machines to temporarily handle high workloads. The lowest layer, SNS (Scalable Network Service), spreads onto these extra machines when needed.
- Question/Problem: what if you need ACID semantics and need to be this scalable?
- The authors do not do any experimental comparison of their work with any other cluster-based solution.

SEDA (2001) Welsh, Culler, Brewer
- SEDA is an interesting concept, but it seems fundamentally flawed if one assumes the presence of asynchronous IO and, particularly, if working on only one CPU. SEDA stands for staged event-driven architecture. Intuitively, this means that instead of running a single event-driven thread that continuously takes events off one queue and processes them, you have one or more threads per event queue, with one event queue per processing stage, e.g., reading the web page from disk.
Problems:
- SEDA argues that threads are bad because of lock contention and poor cache performance. What does it do for busy queues? It uses threads! Each time objects are moved between stages, synchronization must occur. Increasing the number of threads buys you nothing on a single processor.
- The problem with stages is that you have less control over what is actually executing at a given time. With a single queue, events can be sorted according to an application-specific priority or dropped. Having multiple queues relinquishes this control. For example, to gain a better instruction cache hit ratio, you may want to run several events of the same type immediately following one another. You cannot do this with multiple threads looking at multiple queues.
Other items:
- Multiple queues only appear to have merit when multiple CPUs exist.
- The decision on when to shed load needs to take place well before the request gets to the web server. It can take place at the load-balancing router sitting in front of the servers, if such a machine exists, or in the kernel/network card of the server machine. Once the request has made it this far -- all the way to the serving application -- it is almost too late, because so much work has already been performed on it. Performing this in the network card appears in some related work that I can't remember.

Capriccio (SOSP 2003) von Behren, Condit, Zhou, Necula, Brewer
- The authors provide a very scalable threads package that assumes cooperative scheduling and asynchronous IO.
- Linked stack management is their compiler-based method for allowing thousands of threads without allocating thousands of full thread stacks. Through compile-time analysis, they heuristically guess at a good size for the memory "stack chunks" they allocate. They insert small pieces of code to determine whether the current chunk has enough space or whether a new one should be allocated (a minimal sketch of such a check follows below). All stack chunks for a given program are the same size. They don't discuss page alignment. Linked stacks are based on the premise that sometimes a thread will need a deep stack and sometimes not, and when it doesn't, that memory space can be given to a different thread. Very cool idea.
- Resource-aware scheduling can be viewed as one policy for deciding the ordering of events on an event queue. They make estimates of maximum resource limits and try to schedule only threads that will have available resources. They currently track CPU, memory, and file descriptors. Along the same lines as it being a specific policy, it would be interesting to have a plug-in event-queue scheduler that did the same thing.
Problems:
- Their performance section was thin and lacked good results. Apache performed 15% better in terms of bandwidth with Capriccio, but it was still outperformed by an extremely lean web server that they themselves had written.
- Personally, I find an event-based model the simplest to think about and write, if only because there is less concurrency to worry about. So I do not buy their general premise that having an extremely scalable threads package is all that necessary.
- They do not discuss the interaction of compiler optimizations with their inserted checkpoints. These could affect their estimates of the code size.
- Evaluation section:
  - They use SPECweb99, which is a static workload.
  - Key problem: they only show throughput, not latency -- they do not seem to be giving the whole story.
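Here is a conceptual sketch of the inserted check, with Python standing in for the compiler-generated C; the chunk and frame sizes are made up, where the real ones come from their compile-time analysis.

    # A conceptual sketch of Capriccio-style linked stacks, not their code.
    CHUNK_SIZE = 4096                        # made-up chunk size

    class LinkedStack:
        def __init__(self):
            self.used = [0]                  # bytes used in each linked chunk

        def enter_frame(self, frame_size):
            # the inserted check: does the current chunk have room?
            if self.used[-1] + frame_size > CHUNK_SIZE:
                self.used.append(0)          # link in a fresh chunk
            self.used[-1] += frame_size

        def leave_frame(self, frame_size):
            self.used[-1] -= frame_size
            if self.used[-1] == 0 and len(self.used) > 1:
                self.used.pop()              # chunk empty: hand it back for reuse

    stack = LinkedStack()
    stack.enter_frame(3000)
    stack.enter_frame(2000)                  # doesn't fit: a second chunk is linked
    print(len(stack.used))                   # 2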
Model-Based Resource Provisioning in a Web Service Utility
Ron Doyle, Jeff Chase, Omer Asad, Wei Jin, and Amin Vahdat USITS'03.
- This paper attempts to come up with a way to guess how much of a resource a service will need in order to meet a service level agreement (SLA). It is a reasonable and probably real problem: you have a contract to provide some kind of service, but you don't know how many resources are really necessary to fulfill it. They mention, but do not examine, "assignment" of differentiated services; that is, putting different services on different machines after determining how much resource they need.
- Their approach is thoroughly queuing-theory based. Essentially, they take in a bunch of parameters that describe a static web workload and their hardware, and spit out (most importantly) the average total response time. With a simulated workload from fstress, their predictions match up well with reality.
Problems:
- It seems a serious oversimplification to describe storage purely in peak IO operations per second. By the time the tail end of the Zipf-distributed workload hits the disk, it is going to be pretty random and you aren't going to get anywhere near peak IOPS.
- I didn't understand what Figures 10 and 13 were about. Shouldn't axes have labels and units?!
- If the request rate is steadily increasing in Figure 9, why does the cache size drop off?
- From the conclusion: "A utility can avoid overload by expanding resource slices and recruiting additional servers as traffic increases." This is just plain wrong -- why do they even bother to say this when they have already talked about how their model-based approach doesn't work in "overload pathologies" (Section 3.5)?

Resource Overbooking and Application Profiling in Shared Hosting Platforms
Bhuvan Urgaonkar, Prashant Shenoy, and Timothy Roscoe OSDI'02.
- Idea: Southwest Airlines meets resource and application allocation. They (1) run applications in isolation to determine their resource requirements, (2) find the xth percentile of these requirements, where x is 99 or 95, yielding 1% or 5% overbooking, respectively, and (3) stick the applications on machines, allocating them as if they were only using this 95th or 99th resource percentile. Very cool, new idea. (A minimal sketch of this placement step follows after this review.)
- I'd like to know what a "leaky bucket regulator for the network interface" is.
- Apparently Intel Research people like Brent Chun are using overbooking in PlanetLab now.
Problems:
- As part of their motivation, they say "widespread deployment of shared hosting platforms has been hampered by the lack of effective resource management mechanisms." While I believe their work has a somewhat reasonable motivation, machines are cheap, rack space is kind of cheap, but people are expensive. So if running their scheme is more expensive than the amount of money it saves in terms of people and machines, it's not worth it.
- Relatedly, they "automatically" derive QoS requirements by running the application solo. Often, setting something like this up and measuring it can take a long time (and money). While it does appear that this would get good results, their idea would be stronger if this step didn't have to take place.
- Figure 7c shows the "number of applications supported." But it does not show how the overbooked mixture of applications actually performs. This would have been real validation of their idea.
- Figure 5 has so many different axes that the numbers are incomparable.
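The profiling-and-placement step is simple enough to sketch. This is my reading of it, not their code: usage traces are expressed as fractions of one machine, and the first-fit placement is a made-up stand-in for their placement algorithm.

    # A minimal sketch of percentile-based overbooking.
    def percentile(samples, pct):
        ordered = sorted(samples)
        idx = min(len(ordered) - 1, int(pct / 100.0 * len(ordered)))
        return ordered[idx]

    def place(apps, pct=95):
        machines = []                  # each machine: [capacity_left, app names]
        for name, trace in apps.items():
            need = percentile(trace, pct)    # overbooked demand estimate
            for m in machines:
                if m[0] >= need:             # first fit
                    m[0] -= need
                    m[1].append(name)
                    break
            else:
                machines.append([1.0 - need, [name]])
        return machines

    apps = {"a": [0.2, 0.3, 0.9], "b": [0.1] * 99 + [1.0], "c": [0.4, 0.5]}
    print(place(apps))   # "b" bursts to 1.0 but is booked at its 95th percentile, 0.1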
Scalable, Distributed Data Structures for Internet Service Construction
Steven D. Gribble, Eric A. Brewer, Joseph M. Hellerstein, and David Culler OSDI'00.
- They argue that the hash table interface is universal: put(), get(), remove(). In a sense this is true, but a hash table may not be the best data structure in all cases.
- Load balancing: they argue that you could just use mod of the number of servers, but this makes it hard to add a new server and isn't scalable. Instead, they use the 'data partitioning map (dpmap)' to add new servers. It is organized as a trie, similar to a binary tree. This allows you to find a replica group membership list. Given this list, you know whom you must contact for an update, or the set of possible nodes you could contact for a get.
- Bricks are CPUs with disks. They run one brick per CPU and run on four-way machines.
- One-copy serializability: the DDSLib runs as the transaction manager and the bricks are clients. They rely on the fact that user clients (over the WAN) can retry failed transactions. One can construct a case where a reader will either get the new value or a "no, busy" depending on which brick it contacts (so is this really serialized?). They call this an "optimistic two-phase commit protocol."
- Making some assumptions makes their analysis much simpler. They assume machines are physically secure, cannot exhibit network partitions, have uninterruptible power, and are fail-stop.

Manageability, Availability, and Performance in Porcupine: A Highly Scalable, Cluster-Based Mail Service
Yasushi Saito, Brian N. Bershad, and Henry M. Levy
- As opposed to Fox's SNS/TACC, they want to look at an application that has lots of reading _and_ writing; mail is not read-dominated. They are filling out the spectrum between ACID and BASE.
- This requires that some state be stored on disk ("hard" state). They contrast this with "soft" state, which can always be regenerated from hard state. They go into detail over which data structures need to be of which type.
- "Functional homogeneity" means "any node can execute part or all of any transaction, e.g., for the delivery or retrieval of mail." Their goal is for system administrators to be able to just plug in a new box or remove an old one and have it all just work. Pretty powerful idea.
- The Three Round Membership Protocol allows them to detect membership changes. It is unclear to me what happens when people ask for their mail while this protocol is going on.
Problems:
- Motivation: while they say that large-scale commercial services need to handle x (some huge number of) messages per day, they don't show (or even comment) that existing services don't do the job. In Section 2.5 they do address this somewhat.
- "Porcupine, on the other hand, tolerates any number of node failures and continues to serve users after network partitions by relaxing data consistency guarantees." "Any number"? What about all?

On the Feasibility of Peer-to-Peer Web Indexing and Search
Jinyang Li, Boon Thau Loo, Joe Hellerstein, Frans Kaashoek, David R. Karger, Robert Morris IPTPS'03
This workshop paper is essentially a long, back-of-the-envelope calculation showing that it is infeasible to run Google on a p2p network. They distinguish between partition-by-document, where each peer makes an index of the documents stored on it, and partition-by-keyword, where each peer keeps a list of documents for the keywords that hash to its domain. Because they are used to skirt copyright, Gnutella and Kazaa use partition-by-document; partition-by-keyword would allow you to easily find the storage location of copyrighted material. One of their arguments that p2p would be useful for Web search is that it is censorship resistant, but focusing on partition-by-keyword, as they do, seems to contradict this goal. Strangely, the communication constraint they come up with is a percentage of wide-area Internet capacity, which, I believe, is largely untapped. What is often saturated are last-mile links, which they do not mention. Partition-by-keyword also does not allow for word-closeness measures without an exponential increase in the number of keys stored (e.g., "the" and "who" are 5 words apart, 6 words apart, ...). (A toy sketch of keyword partitioning follows below.)
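Here is a toy sketch of partition-by-keyword, with a made-up peer count and hash; the point is that each query word costs a round-trip to the peer the word hashes to, and multi-word queries must ship or intersect posting lists.

    # A toy keyword-partitioned index, not the paper's code.
    import hashlib

    NUM_PEERS = 8
    peers = [dict() for _ in range(NUM_PEERS)]   # keyword -> set of doc ids

    def home(keyword):
        digest = hashlib.sha1(keyword.encode()).digest()
        return int.from_bytes(digest[:4], "big") % NUM_PEERS

    def index(doc_id, text):
        for word in set(text.split()):
            peers[home(word)].setdefault(word, set()).add(doc_id)

    def query(*words):
        # one network round-trip per word in the real system
        lists = [peers[home(w)].get(w, set()) for w in words]
        return set.intersection(*lists) if lists else set()

    index("d1", "the who played tommy")
    index("d2", "the kinks played waterloo sunset")
    print(query("the", "played"))   # {'d1', 'd2'}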
Querying the Internet with PIER
Ryan Huebsch, Joseph M. Hellerstein, Nick Lanham, Boon Thau Loo, Scott Shenker, and Ion Stoica VLDB'03.
PIER stands for "Peer-to-Peer Information Exchange and Retrieval." This paper is a proof of concept: you can run a DB on top of a DHT. Their motivation is that data can be generated in a standard way in many locations and is not amenable to centralized "warehouse"-type solutions because: (1) you can get live data this way, (2) you don't need to support a data center, and (3) you don't need to worry about social/political concerns about responsibility (do I buy this? Aren't you responsible just by participating in the entire system then?). They build PIER on top of CAN, although they built a second version on Chord and say it required minimal integration. Applications use the "Provider API": get, put, renew, multicast, lscan, and newData. Get, put, and renew take namespace and resource as inputs, corresponding to the relation name and a key. In general the key is the primary key, but they use another input, instance, to allow for secondary keys. This allows get to return multiple tuples. The use of "lifetime" and renew gives them lease functionality. They do a join by creating a new temporary relation in the DHT into which matching tuples are added; as tuples arrive, the newData callback is called (a toy sketch follows below). To save bandwidth during joins, they join subsets of the tables (the keys) and then go back and fetch the relevant data. They do this with the keys themselves in the symmetric semi-join, or with summaries of the keys (Bloom filters). They get terrible performance. It seems that they would have more success if the relations were not broken into tuples but instead were stored as large chunks. I'm sure there are a bunch of optimizations they could make, but as they note: "Such improvements would come at the expense of 'dirtying' DHT APIs with PIER-specific features -- a design approach we tried to avoid in our initial implementation."
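Here is my toy reading of that join, with one dict standing in for the DHT and the function names mine, not PIER's API: both relations rehash candidate tuples into a temporary namespace keyed on the join attribute, and matches are paired up in a newData-style callback.

    # A toy DHT-rehash join in the spirit of PIER, not its implementation.
    temp = {}       # (namespace, key) -> list of (relation, tuple)
    results = []

    def put(namespace, key, relation, tup):
        bucket = temp.setdefault((namespace, key), [])
        # newData callback: join the arriving tuple against the other relation
        for rel, other in bucket:
            if rel != relation:
                results.append((other, tup) if rel == "R" else (tup, other))
        bucket.append((relation, tup))

    for r in [("r1", "alice"), ("r2", "bob")]:          # R tuples: (id, name)
        put("tmp-join", r[1], "R", r)
    for s in [("alice", "cs262"), ("alice", "cs263")]:  # S tuples: (name, course)
        put("tmp-join", s[0], "S", s)
    print(results)   # two matches for "alice", none for "bob"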
Pond: the OceanStore Prototype
Sean Rhea, Patrick Eaton, Dennis Geels, Hakim Weatherspoon, Ben Zhao, and John Kubiatowicz FAST'03.
I admire them for having OceanStore as their goal and then going ahead and building a prototype. Saying "let's just build it; I'm sure we'll learn something from doing so" is a good thing. OceanStore has two tiers: (1) powerful, well-connected hosts that serialize changes, and (2) less powerful machines that provide the primary storage. In terms of the CAP theorem, they provide consistency and availability, and assume that the powerful machines do not become disconnected (no network partitions). OceanStore allows write-sharing and provides full ACID semantics. Stored files are called "data objects" and are never deleted. Data is stored in a Merkle tree to provide history without wasting space. Erasure-coded blocks are stored in Tapestry. Caching is done by recovering the block, decoding it, and storing it this way (post-erasure decoding). Cryptography: they modify the Castro-Liskov Byzantine agreement protocol to use MACs for all communication in the inner ring (the powerful machines) and public-key crypto for the outer ring (the storage machines). To keep keys valid in the long-running inner ring, they use proactive threshold signatures, which allow a key to be formed using some n of m key shares (n < m). Here are my terse notes from the conference presentation: "All resources are virtualized." They use encryption for 1) privacy and 2) integrity. Erasure-coded blocks are put into Tapestry. Archival storage is for durability -- they nicely distinguish between durability and ease of access. In their experiments, they use an NFS loopback to translate into Andrew/NFS. They chose to use public-key threshold signatures from the Castro/Liskov paper instead of symmetric ones (even though they take longer). They are not sure why signing takes a long time for short writes. Q: Why do a web cache over Pond?

Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility
Antony Rowstron and Peter Druschel SOSP'01
PAST's goal is to provide a write-once, read-many backup file system. A user stores a file with insert(); this returns a unique fileId. The user can then read his own file, or give the fileId (and possibly a decryption key) to someone else so she can read the file. There is no search capability. Files are looked up with lookup(fileId). The owner can delete a file with reclaim(fileId), which lazily deletes it. (A toy sketch of this interface follows below.) This PAST paper is not concerned with bandwidth usage. Strangely, they claim that nodes are homogeneous, but then spend much of the paper handling different disk capacities. Pastry can route to a target using different paths, and they use this capability to handle malicious nodes in the network; e.g., if we suspect something is wrong, we can try a different route. I believe this has been built into the quasi-standardized DHT API (by Dabek et al.). On re-reading this paper, I recalled that much of it is not about what the focus of p2p systems research has become. For example, their experimentation section is exclusively about their two methods for having disks fill up at the same rate -- there is nothing about churn! In fact, this paper contains a large percentage of presumably unimplemented and unsubstantiated conjecture. For example, they assume everyone has a smart card so that documents can be signed, but this is never shown or even discussed in the experiments section. PAST stores whole files. They compare this with CFS's block-type mechanism, which divides files up before storing them. Their two methods for dealing with storage imbalance are:
- replica diversion: a node that should store a replica doesn't have enough room, so it diverts the replica to another node and keeps a pointer to it.
- file diversion: it is determined that most of the nodes on which the replicas would reside are full, so the file is rehashed and put in an entirely different location in the namespace.
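A toy sketch of that interface under my assumptions -- hash-based fileIds, the k numerically closest node ids storing whole-file replicas, and dicts standing in for Pastry routing:

    # A toy PAST-style store; diversion and lazy reclaim are elided.
    import hashlib

    K = 3
    nodes = {i * 1000: {} for i in range(10)}   # nodeId -> {fileId: bytes}

    def closest(file_id, k=K):
        return sorted(nodes, key=lambda n: abs(n - file_id))[:k]

    def insert(data, salt=b""):
        file_id = int.from_bytes(hashlib.sha1(data + salt).digest()[:2], "big")
        for n in closest(file_id):
            nodes[n][file_id] = data        # whole files, not blocks
        return file_id

    def lookup(file_id):
        for n in closest(file_id):
            if file_id in nodes[n]:
                return nodes[n][file_id]

    def reclaim(file_id):
        for n in closest(file_id):
            nodes[n].pop(file_id, None)     # lazy in PAST; eager here

    fid = insert(b"backup bits")
    assert lookup(fid) == b"backup bits"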
SplitStream: High-bandwidth content distribution in cooperative environments
Miguel Castro, Peter Druschel, Anne-Marie Kermarrec, Animesh Nandi, Antony Rowstron and Atul Singh SOSP 2003
I ended up reading the SOSP paper instead of the HotOS one.
- Problem being addressed: in tree-based multicast, interior nodes carry the load. Solution: many trees (a forest), where nodes are usually in the interior of only one tree.
- Usage: Pastry has this value b (often equal to 4) that is the branching factor. They take a file or stream, encode it with either some stationary erasure encoding or stream encoding, and send one stripe down each of the 2^b channels. This gives them, e.g., 16 trees instead of one. Nodes whose IDs match a channel's ID are in the interior of that tree; otherwise they are leaves.
- They go into some detail about whether a forest can be formed (is feasible). They have two properties that guarantee w.h.p. that it is feasible: (1) if (altogether) nodes want more incoming stripes than they can provide outgoing, then the system is not feasible (cannot be constructed); (2) a sufficient condition for construction is (a) that there exist at least as many outgoing stripes as incoming and (b) that nodes whose outgoing count is greater than their incoming count either receive (and forward) or originate all k stripes. The probability of success increases with spare capacity (outgoing - incoming) and the minimum number of incoming stripes.
- In-degree is limited by network topology (they said; actually this is false in practice); out-degree is limited by the Scribe "push down" mechanism. Then they have all these silly ways to have nodes join other nodes' children if their children's children's slots are full. If this doesn't succeed, they look in the "spare capacity group."
- Their evaluation section is pretty comprehensive, including churn experiments.

Palimpsest: Soft-Capacity Storage for Planetary-Scale Services
Timothy Roscoe and Steven Hand HotOS'03
- The problem they are trying to solve is that data that is short-lived but important, like checkpoint state, is either stored locally on disk or off-site with ACID semantics. The former is not reliably backed up and the latter is too bothersome, they argue. Instead, we want storage that is briefly very reliable. Palimpsest is wide-area storage that provides short-term soft capacity. Because data is spread widely with erasure codes, they argue that it will still be very reliable.
- Usage: a writer uses Rabin's Information Dispersal Algorithm to divide a file into blocks, which are then encrypted using Offset Codebook Mode. Each block is given a random key, with the writer coming up with an initialization vector to prevent fileId guessing.
- They assume that all storage servers have the same amount of storage. Therefore, they should all have the same value tau, which is the time a block should last in the system.
- The most interesting thing to me in this paper is the value tau. They say "the value of tau acts as a single, simple system-wide metric." But I would really like a few other variables, like the rate at which tau is changing and N, the number of nodes. Of course, the situation becomes much more complicated when nodes have different storage capacities. I have been trying to work out how accurate tau would be with their sampling method versus gossiping. No answer as yet. (A back-of-the-envelope sketch of tau follows below.)
- For some reason, they believe that storage will be a scarce commodity, as opposed to bandwidth. For this reason, they ration storage with a congestion-pricing scheme.
- This reminds me of what David Black of EMC said at the last FAST conference: if it doesn't make things more reliable, we aren't really interested. Maybe if tau were reliable it would.
Here are Margo's notes from the conference:
Storage for planetary-scale apps. There are classes of persistent data; they are targeting data for which:
* local disk is no good
* a DFS is no good
* data needs to last reliably for a short time, but then it times out and can go away
* data may be quite large
Want an ephemeral storage service, so that's what they built:
* looks like NFS, but isn't
* stores file fragments
* uses IDA
* uses a ring-based DHT
* storage is a queue (this sounds vaguely like India)
* all files die, so they may need to be refreshed if their lifetime exceeds the "timeout" value
* interesting cost model: only bill for writing (i.e., not for the amount of storage and the length of time the data is going to be stored)
* as a result, you don't get DDoS, because you have to pay to write
Interestingly enough, he actually presented an economic model for utility pricing.
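A back-of-the-envelope sketch of tau under a simplifying assumption the paper also leans on -- identical servers evicting in FIFO order, so a block lasts until the whole store turns over once. All numbers here are hypothetical.

    # tau ~ how long a block survives before being overwritten.
    def tau(total_capacity_blocks, aggregate_write_rate_bps, block_size):
        writes_per_sec = aggregate_write_rate_bps / block_size
        return total_capacity_blocks / writes_per_sec   # seconds

    # 1000 servers * 1e6 blocks each, 8 KB blocks, 10 GB/s of writes system-wide
    print(tau(1000 * 1_000_000, 10e9, 8192) / 3600, "hours")

A writer that needs data to outlive tau refreshes it, which is where the rate of change of tau (the variable I wish they exposed) would matter.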
Systems directions for pervasive computing
Robert Grimm, Janet Davis, Ben Hendrickson, Eric Lemar, Adam MacBeth, Steven Swanson, Tom Anderson, Brian Bershad, Gaetano Borriello, Steven Gribble, and David Wetherall HotOS'01.
- This paper sketches out problems with the current mechanisms for programming in "pervasive computing environments." What is pervasive computing? "The basic idea behind pervasive computing is to deploy a wide variety of smart devices throughout our working and living spaces."
- They note three problems with current approaches and state that one.world would solve them: (1) application data and functionality need to be kept separate. In other words, Java is not the answer. Instead, data is represented as tuples and functionality by "components." Tuples, components, and other environments live inside environments. (2) applications must take changing resources into account and cannot be programmed using RPC. Instead of Jini's limited leases, they argue for leasing everything. It is unclear what they intend to lease beyond what Jini already does. one.world builds in checkpointing and migration. (3) a common platform is needed. For some reason, they argue that Java's bytecodes and libraries, specifically designed for running on small devices (e.g., JavaME), don't cut it.
- They compare themselves to Jini/Java and Legion. The Java comparison makes sense: Java was designed to run on small devices and for portability. Legion, however, is part of the Grid. It is unclear how focused their definition of pervasive computing is if it includes Legion.
- You just have to love the irony that one.world is implemented mostly in Java.

The Design and Implementation of an Intentional Naming System
William Adjie-Winoto, Elliot Schwartz, Hari Balakrishnan, and Jeremy Lilley SOSP'99.
- INS is about decoupling names from object locations. You describe what you want, and the system late-binds your request to the object that best fits it at the time of the request. The environment they are imagining is a medium-scale network with at most a few thousand devices.
- There are two main modes of service: anycast and multicast. In anycast, you get the device that most closely matches your request. In multicast, your request gets sent out to all of the devices that match. Match strings are written hierarchically and can have wildcards. (A toy sketch of this matching follows below.)
- INS works by setting up Intentional Name Resolvers (INRs), which themselves form a spanning tree.
- They show that it works with three sample applications: floorplan, camera, and printer.
- Conclusion: interesting idea, but it is entering a crowded space, including commercial implementations like Jini.
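A toy sketch of the matching and the two service modes; the flat attribute-value form and the load-based tie-break are my simplification of INS's hierarchical name-specifiers.

    # A toy intentional-naming resolver, not INS's implementation.
    services = [
        {"service": "printer", "building": "maxwell", "floor": "2"},
        {"service": "printer", "building": "maxwell", "floor": "3"},
        {"service": "camera",  "building": "pierce",  "floor": "1"},
    ]

    def matches(query, name):
        return all(v == "*" or name.get(k) == v for k, v in query.items())

    def anycast(query, load=lambda s: 0):
        # late binding: resolve at request time, pick the least-loaded match
        found = [s for s in services if matches(query, s)]
        return min(found, key=load) if found else None

    def multicast(query):
        return [s for s in services if matches(query, s)]

    print(anycast({"service": "printer", "building": "maxwell", "floor": "*"}))
    print(multicast({"service": "printer", "building": "*", "floor": "*"}))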
Chimera: A Virtual Data System for Representing, Querying and Automating Data Derivation
I. Foster, J. Voeckler, M. Wilde, and Y. Zhao SSDM'02.
Chimera can be thought of as a shared lab notebook for physicists, or a Makefile for computer scientists. A scientist running an experiment creates an original set of data through some process and then takes the data through a series of transformations to get the final result. Chimera acts to record all of the steps of an experiment, so that the results can be recreated by the original experimenter or by someone else, and so that all sets of data are available (or can be recreated). In Makefile fashion, Chimera keeps intermediary objects around and, transparently, will regenerate only the data that is needed. Hypothetically, it is also smart enough to figure out when it would be cheaper to regenerate data locally instead of shipping it from some remote site. The entities in Chimera are:
- a transformation, an executable that takes some data in and spits some out.
- a derivation, a specific instance of a transformation, e.g., I ran transformation x on date y with parameters a, b, c and it produced output file z.
- a data object, the input into or the output from a derivation.
It implements this abstraction with a data catalog, schema, interpreter, and storage (which all have the adjective "virtual" in front of them, as if this makes their use more clear). (A make-style sketch of the regeneration idea follows below.) To show that it works, they ran two stripped-down experiments with it. Chimera seems like a great idea for physicists who are dealing with huge amounts of data that they freely share at no cost and who prefer computer programs recording their activity to a lab notebook. Unfortunately, there is no way to see from this paper how useful it really is to this small but active group. My intuition is that even though this paper has no real evaluation, it is probably good enough for quite a few scientists -- a quick search on Google shows that the tool is in active use.
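A make-style sketch of the regeneration idea with my own toy names, not Chimera's catalog: a derivation records which transformation and inputs produced an object, and we recompute only when the inputs change.

    # A toy derivation graph with memoized regeneration.
    derivations = {}   # output name -> (transformation fn, input names)
    store = {}         # data object name -> value
    built_from = {}    # output name -> repr of inputs it was built from

    def declare(output, transformation, inputs):
        derivations[output] = (transformation, inputs)

    def materialize(name):
        if name not in derivations:
            return store[name]                 # raw input data
        transformation, inputs = derivations[name]
        values = [materialize(i) for i in inputs]
        key = repr(values)
        if built_from.get(name) != key:        # regenerate only if inputs changed
            store[name] = transformation(*values)
            built_from[name] = key
        return store[name]

    store["raw"] = [3, 1, 2]
    declare("sorted", sorted, ["raw"])
    declare("top", lambda xs: xs[-1], ["sorted"])
    print(materialize("top"))                  # 3; cached until "raw" changes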
On Death, Taxes, and the Convergence of Peer-to-Peer and Grid Computing
Ian Foster and Adriana Iamnitchi IPTPS'03.
Scooped, Again
Jonathan Ledlie, Jeff Shneidman, Margo Seltzer, and John Huth IPTPS'03.
(Compare and contrast.) These two workshop papers cover similar ground in discussing the relationship between the Grid and p2p. Where they differ is in their endings: "Scooped" argues that the systems community will miss an opportunity like the web (again), whereas "Convergence" argues that the two fields will converge precisely because they have the same goal: "the pooling and coordinated use of large sets of distributed resources."
- p2p tries to take advantage of resources at the edge of the Internet; the Grid works the middle and the edge.
- Grid environments employ at least limited trust and are frequently centralized. Much p2p work focuses on functioning without trust, or on mechanisms that enable trust.
- p2p and the Grid have different target communities: scientists on one side; file sharers and anonymizing services on the other.
- p2p systems are vertically integrated, and each version obsoletes the previous. The Grid has standards.
- Even though p2p research often focuses on distributed storage and file systems, Grids deal with much larger quantities of data.
- Grids have thousands of users; p2p has millions.
"Convergence" uses p2p examples that are currently in use, e.g., SETI@home. It says that we should learn about failure from p2p and about infrastructure from the Grid. "Scooped" begins with the point that p2p research does not have a driving user base behind it and, therefore, is not required to build something stable. The Grid, however, has a demanding user base of scientists/physicists who want the infrastructure "Convergence" talks about. "Scooped" ends with a "Call to Action" for p2p researchers to go and talk with nearby scientists to find out what the real problems are that need solving.

System architecture directions for networked sensors
Jason Hill, Robert Szewczyk, Alec Woo, Seth Hollar, David Culler, and Kristofer Pister ASPLOS'00
The authors justify various design decisions in TinyOS, an OS that runs on fairly small, fairly simple sensors, and I agree with their choices for the most part:
- The system is made up of a two-level scheduler and components.
- The scheduler calls event handlers, which primarily respond to hardware interrupts and are non-preemptable, and tasks, which are preemptable only by event handlers, not by other tasks. Thus, tasks are for "long"-running computation, like finding an average, and an event handler is for putting a new sensor reading into a queue. (A toy sketch of this discipline follows below.)
- Components have fixed-size frames of memory and expose which events and tasks they export. This allows software components to be shifted into hardware without changing their interface -- good idea!
- They argue that this type of scheduling is necessary because sensor work is "concurrency-intensive": while processing a task, you may receive inputs that must be handled immediately. Frequently this is because something must be copied out of a buffer, for example, before it is overwritten by a new input.
- By having cleanly modularized components, they are able to compile the system down to exactly what is needed for a particular mote.
- Interesting comparison: "The problem we must tackle is strikingly similar to that of building efficient network interfaces, which also must maintain a large number of concurrent flows and juggle numerous outstanding events."
- The scheduler itself compiles down to 178 bytes. A simple application weighs in at about 3KB of code and 226 bytes of data.
I believe this is a pretty early picture of TinyOS, given that Hill's thesis was published three years later. It doesn't go into much experimentation; the paper is primarily about the design of the system. I'd like to know about "radios that can automatically wake themselves at the start of an incoming transmission." Why doesn't this take care of the need to synchronize transmissions to save battery power?
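A conceptual sketch of that two-level discipline, with Python simulating what the hardware interrupt controller does for free: events always run before the next task, and a task runs to completion.

    # A toy two-level scheduler in the spirit of TinyOS, not its code.
    from collections import deque

    tasks = deque()
    pending_events = deque()

    def post(task):
        tasks.append(task)            # tasks: longer work, deferred

    def on_event(handler, *args):
        pending_events.append((handler, args))

    def run():
        while pending_events or tasks:
            while pending_events:     # events always go first
                handler, args = pending_events.popleft()
                handler(*args)        # short: e.g., copy a reading out of a buffer
            if tasks:
                tasks.popleft()()     # one task runs to completion

    readings = []
    def sample_ready(value):          # event handler: stash the new reading
        readings.append(value)
        if len(readings) == 3:
            post(lambda: print("avg", sum(readings) / len(readings)))

    for v in (10, 20, 30):
        on_event(sample_ready, v)
    run()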
Wireless Sensor Networks for Habitat Monitoring
Alan Mainwaring, Joseph Polastre, Robert Szewczyk, David Culler, and John Anderson WSNA'02.
This is the Great Duck Island paper: what happens when a bunch of computer scientists try to build a network for a bunch of scientists who have something they want to research (birds). This is as opposed to the next paper, which is what happens when a bunch of engineers want to apply sensors to one particular problem, albeit one they are highly familiar with. The computer scientists are trying to solve the problem generally and look for areas of future research; the building engineers are trying to solve their one problem very efficiently. The habitat monitoring paper comes up with a tree-based approach to move all of the sensor data from the individual burrows on the island up to a recording station, which then routes the data to a permanent data service via the Internet. They include the concept of local reprogramming and interrogation, which PDAs near individual motes can perform. The transition point between a sensor patch and the Internet-connected base station is called a sensor network gateway. They discuss one benefit of sensors being that they do not disturb animals or plants the way human recorders do. However, it seems that the motes are quite large relative to the burrows -- with their protective layer, the motes were unable to fit inside -- so they may well be a little disturbing. Also, we have no idea what disturbance the radio signals they produce cause. They note that a simple antenna installed on the mote used two orders of magnitude less power for the same level of reception than a CerfCube that used 802.11b and TCP/IP. They discuss two potential methods for motes to relay their sensor data up the tree to a base station. In the first, nodes at the deepest leaves start in parallel and each level of communication proceeds in step. The second has each branch compute its level in turn (DFS).

Two-Tiered Wireless Sensor Network Architecture for Structural Health Monitoring
Venkata A. Kottapalli, Anne S. Kiremidjian, Jerome P. Lynch, Ed Carryer, Thomas W. Kenny, Kincho H. Law, Ying Lei SPIE'03.
Wired networks in buildings are bad because of all of the extra wiring that has to go in, and because earthquakes and rodents can break the links. The solution is to go wireless, but now the problem is battery lifetime. Their solution: a two-tiered approach where the last mile is covered by battery-powered, low-range sensors; these sensors send all of their data to outlet-powered nodes with longer-range radios. Data moves from the sensors to these "local site masters" and along to a single "central site master," which gathers all of the data. They say their design is inspired by LEACH and Bluetooth because it uses a battery-saving approach for the sensors and a data-blasting one for the local site masters. At least, that's my understanding of it. Interestingly, they need two accelerometers: one to measure the day-to-day, small-scale building changes and one for earthquakes. The earthquake one must always be ready to wake up and record, because earthquakes last only a few seconds. As they note, at least in the bridge architecture, where local site masters are laid out linearly along the bridge, the local site masters are single points of failure. In addition, they show that the batteries will die after 18 months of use -- this means that somebody has got to replace them, no matter where they are on a bridge or inside a building! Taking a pointer from the other paper, they might be able to replace their sending protocol (802.11b) with something simpler for a large battery savings. Of course, batteries leak power no matter what you do -- they must be somehow replaceable.
Energy-efficient Communication Protocols for Wireless Microsensor Networks
Wendi Rabiner Heinzelman, Anantha Chandrakasan, Hari Balakrishnan HICSS'00.
LEACH is one of those good ideas that, as soon as you read it, you think: I could have thought of that. This doesn't make it any less of a good idea; it is a straightforward next step to a given problem. The problem domain is this: you have a field full of identical, low-power sensors, all collecting data. What is the best way for them all to get their data, or a compressed/aggregated version of it, to a base station that is far away (but reachable by any sensor at, perhaps, a large energy expense)? Before going into their own protocol, they look at two straw men: a "minimum transmission energy" (MTE) routing protocol, which routes all data via the best hops to the base station, and direct transmission, where each node just blasts its data at the base station. Via a MATLAB simulation that incorporates their simple model of energy expenditure, they show that just blasting the data isn't so bad: the minimum-energy route actually causes the nodes near the base station to relay a lot of data, draining their batteries. After these straw men, they talk about LEACH, short for "low energy adaptive clustering hierarchy." In a nutshell, LEACH works in rounds. At each round:
- a collection of nodes elect themselves cluster leaders. They give an algorithm that produces a kind of round-robin of usage, but in a decentralized way (a sketch follows this review). This algorithm could be useful in other contexts.
- nodes that are not cluster leaders listen for the nearest leader.
- cluster leaders work out a TDMA schedule for communication.
- cluster leaders aggregate the data and blast it to the base station.
After each round, new nodes elect themselves leaders; thus, the nodes doing the most battery expenditure are continuously changing. They show that battery usage is minimized when there are very few cluster leaders at a given time (5%). The number of cluster leaders is decided a priori.
There are several problems with this paper:
- Their Direct vs. MTE comparison does not take into account data collisions, something the authors of the next paper had a great deal of trouble with. Similarly, there must be a huge number of collisions during their CSMA cluster set-up phase.
- They make no effort to justify their energy expenditure model. I have no way to know how valid it is.
- They do not say what happens to data that comes in between rounds. Presumably the node could store it, but for how long?
- They make this comparison with Direct and MTE about energy usage, showing that LEACH uses much less energy. However, it appears that LEACH is compressing/aggregating the data at each cluster head and the other algorithms are not. In other words, more bits of sensor data are getting to the base station in Direct and MTE, so to some extent we are comparing apples and oranges.
- The next paper also found that communication success rates were not symmetric, meaning that if you can hear your cluster leader, it may not be able to hear you.
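The election threshold, as I read the paper: in round r, a node that has not served as cluster head in the last 1/P rounds elects itself with probability T = P / (1 - P * (r mod 1/P)), which drives every node to serve exactly once per cycle of 1/P rounds.

    # LEACH's decentralized round-robin election (my transcription).
    import random

    P = 0.05                      # desired fraction of cluster heads

    def threshold(r, was_head_recently):
        if was_head_recently:     # ineligible until the cycle completes
            return 0.0
        return P / (1 - P * (r % int(1 / P)))   # reaches 1.0 at cycle's end

    def elect(r, last_head_round):
        recently = last_head_round is not None and r - last_head_round < 1 / P
        return random.random() < threshold(r, recently)

    # Simulate one node across a full cycle of 1/P = 20 rounds.
    last = None
    for r in range(20):
        if elect(r, last):
            last = r
    print("served as head in round", last)   # guaranteed to serve once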
Building Efficient Wireless Sensor Networks with Low-Level Naming
John Heidemann, Fabio Silva, Chalermek Intanagonwiwat, Ramesh Govindan, Deborah Estrin, and Deepak Ganesan SOSP'01.
This paper takes the "Directed Diffusion" work one step further -- into an implementation -- and shows that it's got some advantages and some work to do. It spends most of its time talking about how directed diffusion works. In directed diffusion, a base station registers an interest with all of the nodes in the system by broadcasting this interest. Each node remembers the interest and who sent it, and is then able to relay information in the right direction when it comes across information that matches the interest. Directed diffusion also allows intermediary nodes to cache data and, if they are on a common path, to aggregate it, ameliorating the problem of nodes near the root having to propagate too much data. Nodes send information of interest along "gradients." What they emphasize in this paper is the naming aspect provided by directed diffusion. Because you just send descriptions of what you are interested in throughout the network, there does not need to be any lower-level routing layer, like IP. This is what distinguishes it from many related projects, like INS. Instead of just providing equality and wildcard matching, they have a fairly powerful but simple set of matching rules: EQ, NE, LT, GT, LE, GE, and IS. They are also able to see if two elements match in both directions (called a "complete match"). The matching rules give them the ability to name rectangular (spatial) regions. Interestingly, they talk about an aggregator node computing a confidence rating for the data. This seems like a good idea when sensors may be giving faulty data as their batteries run low. They implement directed diffusion on several sets of hardware; in particular, they have "micro-diffusion" running on a mote. Their experimental section is full of remarks about improving the MAC and the problem of data colliding -- stuff that you only find out about with a real implementation.

Topology Management for Sensor Networks: Exploiting Latency and Density
Curt Schurgers, Vlasios Tsiatsis, Saurabh Ganeriwal, and Mani Srivastava MobiHoc'02
Problem domain: you have a field of sensors that will send data in response to an event, but until that event occurs, they should stay asleep as much as possible. In other words, continuous data sensing is not the problem domain. Their solution was centered on three premises: (1) a multihop approach is better than direct data blasting (although this is unsubstantiated and counters the LEACH paper; however, the LEACH paper focused on continuous data), (2) full sleep mode uses much less power than just being idle (I didn't realize this before, but they do substantiate it with data from the manufacturer), and (3) it is possible to have more than one radio per sensor. This last assumption seems pretty reasonable, given that some sensors already come with two radios. Their approach is to use one radio for control messages and one for data. Control messages use one frequency, and, although I didn't see this in the paper, one could assume that point-to-point data could be set to use a particular frequency, so there wouldn't be collisions on the data channel. They call the control channel the "wakeup plane" and the other the "data plane." Most of the time, both of a sensor's radios are off and its low-power sensors (e.g., temperature) are on, trying to detect events. Periodically, every node turns on its control radio and listens to see whether it should switch on its data radio for incoming data from a particular node. The way a node tells a particular node to switch on its radio (so that it can relay data back to the base station) is by including that node's ID in the control broadcast. How nodes establish the right path back to the base station (and those nodes' IDs) is left as an exercise for the reader. Also left unstated is what happens if that node does not respond. In the section combining STEM and GAF, they mention node death for the first time -- oops! The main problem with their work is that I just don't believe it: the numbers out of their simulator are so close to their theoretical analysis that it makes me believe their simulator must be incredibly simple. The primary simplification they do make is the either/or 40m data radius, which just does not occur in real life.
The AUTHOR responded to this review:
1. Firstly, regarding the claim of not trusting the results due to close correspondence between simulation and analysis: the reason is simple -- STEM, by its very nature, almost removes from consideration the single hardest-to-analyze factor, which is MAC-level collision. As an aside, this is one of the few papers whose results have been independently replicated and re-analyzed by many well-regarded researchers (Nitin Vaidya and Michele Zorzi, to name a few) as they created competing (and improved) schemes. However, it is indeed true that STEM (and follow-on work) all use the disk model for radio, but in this case the primary impact of this assumption is on absolute energy numbers, as opposed to relative improvement (which is the focus of STEM).
2. STEM is not about multihop vs. point-to-point and therefore doesn't counter anything about LEACH or have any relationship to it.
Indeed, if LEACH's assertion that single-hop is better than multihop is valid in a specific network, that is fine: STEM will still improve it (e.g., Microsoft's work on Wake-on-Wireless at MobiCom 2002 or 2003 is the special single-hop subcase of STEM).
3. The data channel operates on a single frequency, as stated in the paper; there is no cheating with different collision-free frequencies, as the review speculates [which can't be done in practice anyway, since forwarding nodes would have the problem of switching frequencies, adding latency, and frequency allocation would be a messy task].
4. Routing is orthogonal to STEM, and *any* routing scheme will work with it (although some may interact better with STEM-driven wakeup than others). For the results in the paper we just had shortest-path routing to the gateway sink, but in subsequent work we have used others, such as geographic forwarding and DSR. Incidentally, this orthogonality to routing is also exhibited by other such schemes (the S-MAC protocol from USC/ISI, which does very similar wakeup management, although without the data/control separation, and also the newer TinyOS MAC, where the receiver polls and the sender sends a long packet to ensure rendezvous in a fashion similar to STEM's control channel).

ESRT: Event-to-Sink Reliable Transport in Wireless Sensor Networks
Yogesh Sankarasubramaniam, Ozgur Akan, Ian Akyildiz MobiHoc'03.
Problem domain: again, these authors are concerned with sensors being used to track a particular event, not sensors continuously sending data to some sink. What they notice is that when an event is detected, it is most likely detected by several sensors at once, and they all try to send their data back to the base station at once. The base station wants some amount of data R, but if the sensors send too much data at once, the packets will collide with one another in the network and the base station will get fewer than R. In fact, there are several states the system can be in, based on the amount of sensed data being received (immediately perceptible by the base station) and whether a shortfall is due to the sensors not emitting enough or to their emitting too much and causing collisions. They propose a way to detect your state and move to the optimal state, where the sensors are sending out the amount of data you want to receive and there aren't collisions going on in the network. (A cartoon of this control loop follows below.) As I see it, the primary difficulty with their algorithm is that there is no attempt to discover whether this optimal balance is achievable at all: what if there is no way to get the R the application wants? The authors don't even seem to have considered that, never mind put a solution for it into their algorithm. The way congestion is detected is that if a relay node sees that its queue is about to become full, it sets a congestion bit in the packet header. Periodically, the data sink, which has the capability to broadcast to all of the nodes at once, determines a new rate at which the sensors should emit data and broadcasts this to them. Of course, this is all shown to work brilliantly using, again, a simulation.
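My cartoon of the control loop -- these are not the paper's exact update rules, just the shape of them: the sink compares observed to desired reliability and, together with the network's congestion bit, nudges the sensors' common reporting frequency f.

    # A cartoon of ESRT-style rate control, not the published equations.
    def next_frequency(f, received, desired, congested):
        reliability = received / desired
        if congested:
            return f / 2 if reliability >= 1 else f * reliability  # back off
        if reliability < 1:
            return f / reliability            # no congestion, under-delivering
        return f * (1 + 1 / reliability) / 2  # ease down toward just-enough

    f = 10.0
    for received, congested in [(40, False), (90, False), (140, True)]:
        f = next_frequency(f, received, desired=100, congested=congested)
        print(round(f, 1))                    # 25.0, 27.8, 13.9

Note that nothing in this loop (or, as far as I can tell, in theirs) detects that the desired R is simply unreachable.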
DSR: The Dynamic Source Routing Protocol for Multi-Hop Wireless Ad Hoc Networks
David B. Johnson, David A. Maltz, and Josh Broch, in Ad Hoc Networking, edited by Charles E. Perkins, 2001.
This paper is a nice retrospective on a protocol that people at CMU have been working on for about five years. DSR itself is a simple idea; they cover it and some of the details one might not think of that they have come up with over time. The basic idea is for each sender to find and cache the route to each receiver independently and in full. DSR divides into two main actions: Route Discovery and Route Maintenance. In discovery, a node does not know how to reach a destination: the destination is not in its routing table. It learns the route by broadcasting a discovery request. If no intermediary knows the route and the destination is reachable, the message eventually gets to the receiver, and the receiver responds. Maintenance occurs when a node thinks it knows how to route to a destination, but some hop along the way doesn't work. The node at the broken hop responds to the originating node, telling it which link failed. (A toy sketch of both actions follows below.) Three interesting themes are:
- Bi-directional vs. uni-directional communication. Sometimes they are able to perform optimizations if they know or assume they are running on a network that enforces bi-directional links (e.g., 802.11). For example, a destination knows it has a valid route back to the source. Incorporating this from the beginning was a refreshing change from the silly theoretical papers that assume fixed radii.
- They introduce a nice tradeoff between the benefits of running in promiscuous mode and not. They actually don't mention battery savings, but this seems to be the main savings in not being promiscuous. One cool benefit was passive acknowledgement, where a node A sending to B would hear B sending to C and therefore wouldn't need an ACK from B.
- Because DSR can rediscover paths, mobility works. They discuss how DSR would work with heterogeneous radios (e.g., two nodes with different-power radios).
I'm not sure what to expect from a retrospective paper, but their evaluation was pretty thin. It's hard to find something interesting in graphs with only one line. They ran their simulations using ns-2. It was unclear what exactly the routing overhead was: whether it was per-node, etc. They ran a five-node implementation with DSR running on laptops in cars. They do not have any numbers for this experiment. Nodes are required to have unique IDs, and this is how a node specifies destinations (i.e., not geometrically). Big question: how much memory does this use, and how could it be reduced for motes?
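A toy sketch of the two actions over a dict topology; the flooding, caching, and error handling are heavily simplified and this is not the protocol spec (e.g., the no-route case isn't handled).

    # Toy DSR-style discovery and maintenance.
    links = {"A": {"B", "C"}, "B": {"A", "D"}, "C": {"A", "D"}, "D": {"B", "C"}}
    route_cache = {}                    # (src, dst) -> full source route

    def discover(src, dst):
        # Route Discovery: flood a request that accumulates the route so far;
        # the destination answers with a Route Reply carrying the full path.
        frontier = [[src]]
        while frontier:
            route = frontier.pop(0)
            if route[-1] == dst:
                route_cache[(src, dst)] = route
                return route
            for nbr in sorted(links[route[-1]]):
                if nbr not in route:
                    frontier.append(route + [nbr])
        return None

    def send(src, dst):
        route = route_cache.get((src, dst)) or discover(src, dst)
        for a, b in zip(route, route[1:]):
            if b not in links[a]:             # Route Maintenance: a hop broke;
                del route_cache[(src, dst)]   # a Route Error flows back and
                return send(src, dst)         # we rediscover
        return route

    print(send("A", "D"))               # ['A', 'B', 'D']
    links["B"].discard("D")             # the B->D link breaks
    print(send("A", "D"))               # rediscovered: ['A', 'C', 'D']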
Geometric Spanner for Routing in Mobile Networks
Jie Gao, Leonidas J. Guibas, John Hershberger, Li Zhang, and An Zhu MobiHoc'01
A lot of proofs and no comprehensive or comprehensible evaluation: maybe it works, but they haven't shown it to me. Basic idea: Brad Karp and H.T. Kung proposed Greedy Perimeter Stateless Routing (GPSR). This maintains a planar subgraph of node connectivity and guarantees the delivery of a packet if a route exists. The two underlying graphs that the original paper examines are the relative neighborhood graph (RNG) and the Gabriel graph (GG). Where these authors come in is in proposing a new underlying graph for GPSR: they propose using a restricted Delaunay graph (RDG) as the planar subgraph. The main benefits of the RDG are that the Euclidean and topological lengths are only a constant factor greater than optimal, and that it can be constructed in a distributed fashion. A little graph theory:
- a Delaunay graph is constructed by creating a Voronoi diagram and connecting the points whose cells are adjacent.
- Delaunay graphs are good spanners.
- a restricted Delaunay graph is a subgraph of the Delaunay graph in which edges are shorter than 1 (some unit distance). Here 1 is equal to the radio propagation radius (which they assume is either/or reception).
I didn't understand the advantage of being planar -- does it minimize congestion? As noted, their evaluation is lacking. There is no comparison with GG and no measurement of construction or maintenance costs. They also do not consider what actually happens when node clusterheads rotate.

The Cricket Location-Support System
Nissanka B. Priyantha, Anit Chakraborty, and Hari Balakrishnan MobiCom'00
Main idea: in a ubiquitous computing space, e.g., an indoor room environment, objects need to know where they are currently located, especially if they frequently move around. Cricket addresses this problem by attaching a "listener" to every object that needs to know its own location and putting "beacons" in every discrete zone. Listeners then determine which beacon they are nearest and know their location from that. Communication works by listeners stockpiling RF and ultrasound signals from beacons and then deducing their location. Beacons transmit an RF signal and an ultrasound pulse at the same instant, at random intervals to avoid conflicts with other beacons. Listeners continuously listen for these incoming signals. When they start receiving an RF signal, they listen for the ultrasound pulse from the same beacon and are able to approximate the distance to the beacon from the difference in arrival times. This appears to have grown out of earlier work that tried to use RSSI for the same purpose and got poor results. One interesting point is that the listeners are purely receivers and the beacons are purely senders: there is no communication back to the beacons. This makes things simple -- no time synchronization is needed and nodes don't need to coordinate turning their radios on and off -- but it doesn't do much for your battery consumption. Listeners thus have collections of data about the distance between themselves and one or more beacons. Using this, they determine their actual position in one of three ways:
- majority: pick the beacon you have heard from the most, ignoring distance entirely.
- minmean: find the average distance to each beacon and pick the nearest.
- minmode: discretize the samples into buckets (ten inches in their experiments) and select as each beacon's distance the bucket with the most entries (the mode). Pick the beacon with the minimum mode.
These different methods are worth mentioning because the last is resistant to outliers and gave them good results (and it might not be the most intuitive); a small sketch follows below. They throw in a little something about user privacy, but I'm not sure what that's about. They have results from actual experiments, which is a refreshing change. The results show that the system works fine -- they aren't very exciting.
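The three estimators are easy to sketch over a per-beacon bag of distance samples (made-up numbers, in inches); the ten-inch buckets for minmode are from the paper.

    # The three Cricket position estimators over sample data I invented.
    from collections import Counter

    samples = {
        "beacon-A": [41, 39, 40, 42, 40, 160],   # 160 is a reflected pulse
        "beacon-B": [55, 57, 54, 56, 55],
    }

    def majority(samples):                 # most packets heard, distance ignored
        return max(samples, key=lambda b: len(samples[b]))

    def minmean(samples):                  # nearest by average distance
        return min(samples, key=lambda b: sum(samples[b]) / len(samples[b]))

    def minmode(samples, bucket=10):       # nearest by most-common bucket
        def mode_distance(vals):
            counts = Counter(v // bucket for v in vals)
            return counts.most_common(1)[0][0] * bucket
        return min(samples, key=lambda b: mode_distance(samples[b]))

    print(minmean(samples))   # beacon-B: the reflection skews A's mean
    print(minmode(samples))   # beacon-A: the mode shrugs off the outlier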
Localization from Mere Connectivity
Yi Shang, Wheeler Ruml, Ying Zhang, and Markus Fromherz
MobiHoc'03

I thought this would be a cool paper because Wheeler just graduated from Harvard and I wanted to see what people were up to at Xerox PARC (we had a choice of papers this week). I couldn't really get over the one big hump: see below.

Main idea: nodes need to know their locations when they have just been dropped in arbitrary locations. How can they come to some kind of consensus about where they are relative to each other and, if there exists a handful of nodes with GPS, where they are absolutely? They propose a CENTRALIZED algorithm to solve this problem. The algorithm is roughly:
- find connectivity information via broadcasts and RSSI (method unstated)
- run an all-pairs shortest-paths algorithm to construct a distance matrix between nodes
- generate a relative mapping of internode locations using multidimensional scaling (MDS)
- using the nodes that know their absolute location via GPS, transform the relative positions into absolute ones.

They reference a different paper from the Cricket one that uses Time of Arrival measurements and ultrasound (from a later conference). One positive aspect of this paper was that the authors weren't as boastful as those of many of the papers we've read: they said what they were good at and acknowledged the ways other algorithms were better when they were. They suggest as future work that their algorithm could be decentralized by dividing the network into smaller zones, with each zone performing the algorithm independently. I suppose that this could be performed in a heterogeneous scenario with motes and SCAN devices (if they could find the distance information).

Fine-Grained Network Time Synchronization using Reference Broadcasts
Jeremy Elson, Lewis Girod and Deborah Estrin
OSDI'02

Please see the review I wrote for Sensor Paper Day, January 2003: http://www.eecs.harvard.edu/~jonathan/timesynchreview.pdf

Optimal and Global Time Synchronization in Sensornets
Richard Karp, Jeremy Elson, Deborah Estrin, and Scott Shenker
CENS Technical Report 0012

Main idea: RBS (from the previous paper) computes a pairwise translation from one node's clock to another node's clock, including skew: t_j \approx a_{ij} t_i + b_{ij}, where a_{ij} is the relative clock skew (the slope of the conversion line) and b_{ij} is the clock offset. When you have a series of translations, there need not be transitive agreement between them: composing i->j and j->k gives t_k \approx a_{jk} a_{ij} t_i + (a_{jk} b_{ij} + b_{jk}), which need not match the directly estimated a_{ik} and b_{ik}. This is a problem if you want globally consistent clocks.

They show that the most precise set of pairwise synchronizations is globally consistent. They first ignore skew and concentrate solely on offset. They do this by taking the maximally likely set of offset assignments, which is guaranteed to be globally consistent, and then showing that this set produces minimum variance. Toward the end of the paper, they add clock skew back in. It would have been interesting to see how frequently some kind of global synchronization would need to be run, based on the rate of skew. Hopefully this paper will be a little clearer in its published version.

The Design of an Acquisitional Query Processor for Sensor Networks
Samuel R. Madden, Michael J. Franklin, Joseph M. Hellerstein, and Wei Hong
SIGMOD'03

Key observation: traditional DB designs, even when they consider streaming data, don't consider interacting with the data sources to change how the data is physically acquired. The spin they put on this paper is that with relatively smart sensors like motes, the DB has a whole new lever to pull in deciding what to do to meet a user's request. They boil this down to: "When should samples for a particular query be taken?"

Notes on the TAG language:
- You can have queries that only start after an event has occurred, e.g., ON EVENT bird-detect, start query foo.
- Epochs are specified with SAMPLE INTERVAL x s FOR y s (s = seconds).
- A LIFETIME clause specifies a method for acquiring as much data as possible over the course of many days, based on anticipated battery usage (pretty cool!); a back-of-the-envelope sketch of this computation follows these notes.
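To make the LIFETIME idea concrete, here is a back-of-the-envelope sketch (my own construction, with invented parameters; the real optimizer draws operation costs from its metadata table): pick the fastest sample rate the energy budget can sustain over the requested lifetime.

    def sample_interval(lifetime_days, battery_joules, joules_per_sample,
                        idle_watts=0.0):
        """Return the shortest sample interval (in seconds) that lets a
        node survive the requested lifetime, assuming a fixed energy
        cost per sample plus a constant idle draw."""
        lifetime_secs = lifetime_days * 24 * 3600
        budget = battery_joules - idle_watts * lifetime_secs
        if budget <= 0:
            raise ValueError("battery cannot sustain even the idle load")
        max_samples = budget / joules_per_sample
        return lifetime_secs / max_samples

    # E.g., a 30-day lifetime on a 10 kJ battery at 5 mJ per sample
    # yields an interval of about 1.3 seconds:
    #   sample_interval(30, 10_000, 0.005)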
They discuss query optimization based around choosing the query plan that will use the least amount of power, as this "subsumes issues of processing cost and radio communication." But what if I want the data faster and I'm willing to pay the battery cost? In any case, metadata about how expensive different operations are to perform is kept in a centralized table that the optimizer can draw from. They also discuss the possibility of seeing multiple queries for the same set of data and being able to rewrite the queries to account for this.

The most interesting section of the paper is the discussion of Semantic Routing Trees. The basic idea is that there may be several candidate parents to choose from when building a tree from the root downward. One way to choose one's parent is to go solely on estimated distance (by radio strength). They suggest instead looking at the types of queries that will occur and adding this to the consideration when selecting a parent. That way, trees will form that allow for better aggregation than a randomly selected tree would.

Another interesting idea here was prioritizing data delivery: as each node sends values to the root and its packet queues start overflowing, how can it decide which packets to keep and which to throw away? They look at three schemes: FIFO, winavg (which averages the first two entries in the queue), and delta. Delta "relies on the intuition that the largest changes are probably interesting" and tries to keep those in the queue over packets that tell you what you already know. For fast-changing sensor data, this method worked particularly well.

Beyond Average: Towards Sophisticated Sensing with Queries
Joseph M. Hellerstein, Wei Hong, Samuel Madden, and Kyle Stanek
IPSN'03

This paper is a work in progress that examines topographic mapping using isobars, compression, and vehicle tracking. I found the section on using wavelets particularly interesting and will read more about this topic (I've printed out their main reference). Their motivation is that TAG can support simple queries, but many people have asked them to include the ability to perform more complex ones. There is no real theme to the paper other than updating the reader on the ways they are currently looking at answering more complex queries.

The aggregates in TinyDB are composed of three functions: a merging function f, an initializer i, and an evaluator e. i initializes a partial state record from a single reading (e.g., the sum and count of the values seen so far); f takes two partial state records and merges them into a third; and e performs the final evaluation, e.g., dividing sum by count to produce the average.

Storage points are temporary tables that are "accessible for read or write from any node in the network; its exact location within the network is not fixed -- that is, it can be moved as an optimization." I did not see how this happens or how storage points are found once they move.

GHT: A Geographic Hash Table for Data-Centric Storage
S. Ratnasamy, B. Karp, L. Yin, F. Yu, D. Estrin, R. Govindan, and S. Shenker
WSNA'02

Main idea: for sensor networks with huge numbers of nodes (e.g., 10^6), queries may be best supported by storing data at well-known locations in the network, instead of locally at the node that did the sensing or by sending all data to some exit point. They send all data pertaining to a particular event to the same node by hashing the event name (e.g., "elephant sightings") to a geographic location within the sensornet and then using GPSR to route it there; a sketch of the hashing step follows.
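A minimal sketch of that hashing step, assuming the deployment covers a known bounding box (the field dimensions and hash choice are mine, not the paper's):

    import hashlib

    # Assumed deployment area: a 1000 m x 1000 m field.
    FIELD_W, FIELD_H = 1000.0, 1000.0

    def event_to_location(event_name):
        """Hash an event name (e.g., 'elephant sightings') to a fixed
        (x, y) coordinate inside the field. Every node computes the same
        location, so all reports and queries for that event meet there."""
        digest = hashlib.sha1(event_name.encode()).digest()
        hx = int.from_bytes(digest[:4], "big") / 2**32   # uniform in [0, 1)
        hy = int.from_bytes(digest[4:8], "big") / 2**32
        return (hx * FIELD_W, hy * FIELD_H)

    # A node sensing an elephant hands the packet to GPSR, addressed to
    # event_to_location("elephant sightings"); the node nearest that
    # point ends up storing the data.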
The node that actually stores the data is the last hop on the GPSR route. Data is backed up on physically nearby nodes (i.e., within three hops [which seems pretty far to me]).

The glaring omission from this paper is that there is no mention -- none! -- of the problem with hashing names. How do users performing queries know the exact names of events a priori? They had better, because otherwise there is no way to find the data. So, I can believe that their scheme might be a good idea in the extremely large sensor network case, but the problem of finding the data remains.

A debatable point is whether sensor networks would ever scale to this size without having some gateway into more-provisioned nodes. My contention is no, but I'd be willing to hear an argument against this. If the answer is no, GHT becomes unnecessary. A counterargument to their data-centric vs. node-centric point is that, while it is true you may not care which node gathered the data, you probably do care about the location of that node. So one way or another, the node id or the location of the sensed observation needs to be attached to the data. This counterargument doesn't seem so strong if the nodes are moving around, making the actual location of the observation more important.

Their section on approximating communication costs was overly simple. They do not adequately cover summarization in the network for external or local storage. Also, I don't see why communication between any node and the gateway should take sqrt(n) hops when, in general, there will be a tree to the gateway, leading to log(n) hops. A strange item in their evaluation was including failing _and restarting_ nodes. Why would sensor nodes restart? This isn't a p2p network.

An evaluation of multi-resolution search and storage in resource-constrained sensor networks
Deepak Ganesan, Ben Greenstein, Denis Perelyubskiy, Deborah Estrin and John Heidemann
SenSys'03

Main idea: nodes create a time-series summary of their observations using a temporally organized wavelet; that is, the most recent observations experience the least compression and loss. These time-series wavelets are combined at each level toward the gateway to provide spatial summarization, giving a more compressed and less accurate picture at each level. However, the old sensed data is still available in the network, so a user can drill down to an interesting event and get at that data if he or she wants. Even though the authors are designing this system for the case where what the user wants is not known ahead of time, the user (somehow) constructs an aging function for the temporal data on the sensing nodes; a toy example follows.
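To make the aging-function idea concrete, here is a toy version (entirely my own construction; the paper leaves the function to the user): the fraction of full-resolution storage retained, as a step function of data age.

    def aging_function(age_days):
        """Fraction of full-resolution storage kept for data of a given
        age. The breakpoints are invented for illustration."""
        if age_days < 1:
            return 1.0   # freshest data: kept lossless
        elif age_days < 7:
            return 0.5   # keep half the wavelet coefficients
        elif age_days < 30:
            return 0.1
        else:
            return 0.02  # very old data: coarse summary only

    # A node holding a 512-sample buffer of week-old data would keep
    # roughly 512 * aging_function(7) = 51 wavelet coefficients.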