NOTES from OSDI 2004
December 5-8, 2004
Jonathan Ledlie

Best talks and/or most interesting ideas, in my opinion:
- Recovering Device Drivers
- Chain Replication
- MapReduce
- One-hop Source Routing
- Failure-Oblivious Computing
- Automated Worm Fingerprinting
- Sean Rhea's suggestion for a demo session at conferences, and the panel on "Important, But Unpublishable" at WORLDS in general
- Also, Jeff Mogul's proposal to have a conference where a standard performance section is outlawed.

######################################################################
Topic: Dependability and Recovery

Recovering Device Drivers
Michael M. Swift, Muthukaruppan Annamalai, Brian N. Bershad, and Henry M. Levy, University of Washington
This continues where Nooks (SOSP 2003) leaves off. Nooks gives the kernel the ability to restart device drivers that have gotten into a bad state. However, this usually crashes the apps using the device unless they are programmed very defensively. This work uses a "shadow driver" to hide from the app and the kernel that the driver has faulted, reboots the driver, and sets things back into a decent state. In the meantime, it acts like a "spare tire," masking the failure. Instead of needing to implement a shadow driver for each driver, they found that one would work for a class of drivers, like sound or network, because the interfaces were the same. A good Q&A point was that their system could be used to log faults, which could then be sent to the driver writer, as is frequently done now with other faults. So, this would not be a disincentive to build solid drivers. A very good idea and one of the best talks of the conference.

Unmodified Device Driver Reuse and Improved System Dependability via Virtual Machines
Joshua LeVasseur, Volkmar Uhlig, Jan Stoess, and Stefan Gotz, University of Karlsruhe, Germany
Like part of the motivation for the FluxOS, it would be nice to have binary-compatible device drivers across OSes to eliminate the problem of porting them. With 70% of the Linux 2.4 kernel being device driver code, the desire to cut down on porting is understandable. The Flux approach to this is to write "glue" code between the driver and the core OS. Instead, their solution is to start each driver in its own VM, and to do this recursively with devices that other devices refer to (e.g., the PCI card). Their experiments showed that performance wasn't terrible with all this indirection. Being from the land where the microkernel just won't die, they used L4Linux and, I believe, all of the VMs were run as user-level processes. I can see the motivation for this work, although I don't really agree with it. Also, their approach seems pretty heavyweight.

Microreboot--A Technique for Cheap Recovery
George Candea, Shinichi Kawamoto, Yuichi Fujiki, Greg Friedman, and Armando Fox, Stanford University
This was one of those "duh" ideas that obviously makes sense once you think of it. They just thought of it and showed it worked :) Their starting point is that many failures in conventional three-tier architectures are caused by problems at the application layer (40%) and that many of these are curable through application reboot. The two approaches thus far seen are rebooting the whole machine or whole VM (and we have seen papers on fast VM rebooting) and rebooting the application, which in this case would be rebooting the JVM. Their approach, instead, is to make the unit of reboot a small set of components within the JVM (ones that are dependent on each other).
They implemented this within J2EE, so their unit of microreboot was an EJB (or a set of beans). They tested it by talking to users, finding out what the common faults were, and injecting these into their system. They had two main axes for showing their system worked well:
- Microreboots were 30-50x faster than rebooting the whole JVM.
- User-perceived latency could be cut to 1-2 seconds, below the threshold of about 8 seconds where a person starts to think there is a problem. Also, only the component that was being rebooted would experience the latency.
They also showed that, even with a very high false positive rate that a problem was occurring, preemptively rebooting faulting components caused less of an interruption than rebooting the JVM. He used the terms "goodput" (which I'd heard before) and "badput" (which I hadn't) in his talk.
Q: How are program invariants not messed up, e.g., holding a lock? Are there components that you don't or can't detect? A: This isn't a problem given the way J2EE apps are currently written.
Q: How often do you need to reboot large groups? A: He says there is no standard as to how coupled components are.

######################################################################
Topic: Automated Management I

Automated Worm Fingerprinting
Sumeet Singh, Cristian Estan, George Varghese, and Stefan Savage, University of California, San Diego
In the discussion in the break afterward, some people felt this work wasn't that great because morphing worms could still get through. It seems to me, however, that it closes some pretty big holes and, therefore, makes it that much harder to write a good worm. Motivation: because worms can spread to a huge number of machines very quickly, we need mechanisms that can perceive and, perhaps, quell new worms even faster. What they've built is a worm detector that operates at line speed; that is, it does deep packet inspection without significantly slowing down packet transfer. In addition, they have it running at UCSD and at a local ISP, and it has identified all known worms thus far. Its main drawback is that it only works for worms with some fixed signature. They rely on two main attributes common to most worms: (1) if it's a fast-spreading worm, they will see the same payload frequently, and (2) the worm will be trying to spread itself, so this same payload will be going to a wide spectrum of addresses. Here's the basic algorithm and components (a toy sketch appears below, after the next talk's notes):
- Use a prevalence table (for substrings) and a dispersion table (for addresses)
- Index all contiguous substrings of a fixed length S (not disjoint)
- Hash each of these (adding some CPU cost) using an incremental function (Rabin fingerprints) [not sure about this bit]
- Because this might still be too much state, they use "value sampling" to sample ~randomly based on the hash (to decide which substrings should be added to the tables)
- Using multi-stage filters (estan02), they implement a high-pass filter to track only strings that are highly repeating.
- They use "scalable bitmap counters" to keep track of sources and destinations compactly.
Q: How do you detect encrypted or polymorphic worms? A: Might have to be done at end-hosts.

Understanding and Dealing with Operator Mistakes in Internet Services
Kiran Nagaraja, Fabio Oliveira, Ricardo Bianchini, Richard P. Martin, and Thu D. Nguyen, Rutgers University
I appear not to have much in the way of notes on this talk. The main idea is to allow operators of Internet services to test actions before taking them.
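Looping back to the worm-fingerprinting algorithm above: here is a minimal sketch of the prevalence/dispersion bookkeeping it describes. This is purely my own illustration, not the authors' system; it substitutes Python's hash() and plain sets for the Rabin fingerprints, multi-stage filters, and scalable bitmap counters the real detector uses at line speed.

```python
# Toy content-sifting sketch (my illustration, not the OSDI system's code).
from collections import defaultdict

SUBSTRING_LEN = 40          # fixed substring length S
SAMPLE_MASK = 0x3F          # value sampling: keep roughly 1 in 64 substrings
PREVALENCE_THRESHOLD = 3    # repetitions before we start tracking dispersion
DISPERSION_THRESHOLD = 30   # distinct sources/destinations before flagging

prevalence = defaultdict(int)                      # fingerprint -> count
dispersion = defaultdict(lambda: (set(), set()))   # fingerprint -> (srcs, dsts)

def process_packet(payload: bytes, src: str, dst: str):
    """Return fingerprints that look both highly repeated and widely spread."""
    suspects = []
    for i in range(len(payload) - SUBSTRING_LEN + 1):
        fp = hash(payload[i:i + SUBSTRING_LEN])    # stand-in for a Rabin fingerprint
        if fp & SAMPLE_MASK:                       # value sampling on the hash
            continue
        prevalence[fp] += 1                        # stand-in for multi-stage filters
        if prevalence[fp] < PREVALENCE_THRESHOLD:
            continue                               # high-pass filter on repetition
        srcs, dsts = dispersion[fp]                # stand-in for scalable bitmaps
        srcs.add(src)
        dsts.add(dst)
        if len(srcs) > DISPERSION_THRESHOLD and len(dsts) > DISPERSION_THRESHOLD:
            suspects.append(fp)                    # repeated payload, wide spread
    return suspects
```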
The PASS group should give the operator-mistakes paper a skim to see how they performed experiments with people.

Configuration Debugging as Search: Finding the Needle in the Haystack
Andrew Whitaker, Richard S. Cox, and Steven D. Gribble, University of Washington
This work can be thought of as a debugger for computer configuration errors. For example, you have installed a 3rd-party tool into Mozilla and now it hangs. The list of possible bugs you might find with Google is huge, and help tools often don't supply the answers you're looking for. He noted that because Mozilla stores these tools in a separate place, even reinstalling it wouldn't clean the state. He argues that this is a general class of errors called "WYNOT" (the system worked yesterday, not today -- why not?), and that this class often relies on human administrators to get fixed. The goal of their work is to (help) automate the diagnosis of WYNOT failures. They use Chronus (which they've built) to find out when the system failed, and suggest combining that with external analysis tools to figure out why. Chronus proactively logs system state. Once you crash, you can then use it to reactively debug by creating test environments with Denali VMs.

######################################################################
Topic: File and Storage Systems I

Chain Replication for Supporting High Throughput and Availability
Robbert van Renesse* and Fred B. Schneider, Cornell University
He began by saying: "We're still at Cornell, in spite of the recent election results." This is a pretty cool, obvious idea (once you think of it :). In primary/backup replication, you access the primary for everything until it falls over. Instead, they propose "chain replication," which you can think of as primary/backup, but where the head is the primary for updates and the tail is the primary for queries. It mainly helps to minimize the effect of failures in a replicated system. How it works:
- Take all of the replicas and organize them in a chain.
- Send updates to the head; updates are applied to all nodes along the chain, and the tail sends back the ACK. Queries go to the tail.
- Chain invariant (absent in primary/backup): a predecessor knows more than its successor. This makes recovery easier.
- If someone in the middle of the chain fails, the head of the chain must re-transmit the update as part of the recovery process.
- New replicas must be added to the end of the chain.
It has an important requirement that client requests be idempotent. I wonder how this would fare as a DB primitive vs. two-phase commit. Sim results: with 5-10% of the workload being updates, weak primary/backup, chain, and primary/backup converge in terms of throughput. He talked about load balancing via hashing like in a DHT, but it's unclear to me how you quickly find the tail to do the query. He compared this to the Google file system's "random" (opportunistic) placement, which gives better performance in simulations. Future work:
- automatic reconfiguration without a master (which random placement needs).
Brewer notes that an update followed by a query might not reflect the update. But it will if you wait for the update.

Boxwood: Abstractions as the Foundation for Storage Infrastructure
John MacCormick, Nick Murphy, Marc Najork, Chandramohan A. Thekkath, and Lidong Zhou, Microsoft Research Silicon Valley
This talk made an interesting juxtaposition to WORLDS, where people kept referring to the problems with distributed computing and Deutsch's fallacies.
Their goal is to make distributed, enterprise storage apps easy to design, build, and deploy, and to make them high speed and trusted. Their main contribution is to bring two sets of abstractions into distributed storage (masking over any failures, somehow):
- a single, huge virtual disk, which maintains consistency during updates.
- exporting useful data structures other than just files: lists, hash tables, b-trees, chunks. This seems like a good idea, as it is in BDB.
Their architecture is to provide the services of locking, logging, and consensus (using the Paxos protocol) off to the side, and to have the chunk store and b-tree sit on top of a "reliable" block interface. They give you a malloc-like interface to chunks and, under the covers, take care of free space management. One of their slides made this anti-Waldo statement:
1. Design the application for local storage
2. Adapt the design for a distributed storage infrastructure
This is in such juxtaposition to my philosophy on distributed systems; I can't believe they won't run into some really weird problems. None of the experiments shown looked at failure or reliability. Maybe they should try running it on PL.

Secure Untrusted Data Repository (SUNDR)
Jinyuan Li, Maxwell Krohn, David Mazieres, and Dennis Shasha, New York University
Here's what I have:
- What do we do in a world where we don't trust our distributed storage?
- SUNDR is a FS designed for running on an untrusted or compromised server.
- It places trust in users, not server admins. :(

######################################################################
Topic: Distributed Systems

MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat, Google, Inc.
In a model inspired by functional language primitives, they propose writing (some, many, but not all) distributed programs with two functions: map and reduce. That is, you put the work you want done in parallel into those functions, and your cluster can go off and run them, restarting parts of the job when necessary. The interfaces are:
- map(key1, value1) -> list(key2, value2)
- reduce(key2, list(value2)) -> list(value3)
They show in the paper how this can be used for counting the number of occurrences of each word in a large collection of documents (see the toy sketch a bit further down), and give several other examples. Proof-of-concept: Google's production indexing system uses MapReduce. He stated in the talk that they were trying to solve a narrow problem well, and the Q&A reflected that.
Q: What tasks can't be expressed this way? A: Joins of different data types.
Q: Looks like iterators you might find in a DB; what are the differences? A: Iterators give a wider interface.

FUSE: Lightweight Guaranteed Distributed Failure Notification
John Dunagan, Microsoft Research; Nicholas J. A. Harvey, Massachusetts Institute of Technology; Michael B. Jones, Microsoft Research; Dejan Kostic*, Duke University; Marvin Theimer and Alec Wolman, Microsoft Research
As part of their development of a distributed publish/subscribe system, they've created an abstraction for defining the path from the root to the leaves that notifies all members when the path is broken. The path can be broken by something occurring within the network or by one of the participants (via unsubscribe). While his presentation was a little this-is-the-best-thing-ever, it does seem like a useful abstraction. He placed his work in the context of distributed abstractions: xacts, consensus (BFT, Paxos), and membership services (FUSE is this one).
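Back to MapReduce for a moment: here is the word-count example mentioned above, as a toy sketch of the two interfaces. This is my own illustration of the programming model, not Google's implementation -- their C++ library is what actually handles splitting the input, shuffling intermediate pairs across workers, and re-executing failed pieces.

```python
# Toy illustration of the MapReduce programming model (not Google's library):
# word count expressed as the two user-supplied functions, plus a trivial
# sequential "framework" that groups intermediate pairs by key.
from collections import defaultdict

def map_fn(doc_name, contents):
    # map(key1, value1) -> list(key2, value2)
    return [(word, 1) for word in contents.split()]

def reduce_fn(word, counts):
    # reduce(key2, list(value2)) -> list(value3)
    return [sum(counts)]

def run_mapreduce(inputs, map_fn, reduce_fn):
    intermediate = defaultdict(list)
    for key, value in inputs:                 # normally split across workers
        for k2, v2 in map_fn(key, value):
            intermediate[k2].append(v2)       # the "shuffle" step
    return {k2: reduce_fn(k2, v2s) for k2, v2s in intermediate.items()}

docs = [("a.txt", "the quick brown fox"), ("b.txt", "the lazy dog")]
print(run_mapreduce(docs, map_fn, reduce_fn))   # {'the': [2], 'quick': [1], ...}
```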
The way FUSE is used is that the list of members is immutable (once the fuse is created), and either FUSE or the app can terminate the group. FUSE needs to be told about partitions via some other service, but this seems reasonable because this functionality is already taking place in the overlay they are using (and in most overlays). They create a fuse group for every member of every tree, which also seems reasonable because they have many different trees and, therefore, many roots. He mentioned two pieces of related work that are examples of p2p storage (Margo was looking for recent examples of this): TotalRecall, OM. There was a question about the effect of transient partitions on consistency, but the speaker didn't quite understand the question.

PlanetSeer: Internet Path Failure Monitoring and Characterization in Wide-Area Services
Ming Zhang, Chi Zhang, Vivek Pai, Larry Peterson, and Randy Wang, Princeton University
Last talk of the day, limited notes/brain. What I got out of this was that they had set up a monitoring service on PL to look for routing anomalies. They focused on finding asymmetric anomalies, which are particularly hard to discern because info about them tends not to propagate. Their key observation was to combine passive monitoring with active probing.

######################################################################
Topic: Network Architecture

Improving the Reliability of Internet Paths with One-hop Source Routing
Krishna P. Gummadi, Harsha V. Madhyastha, Steven D. Gribble, Henry M. Levy, and David Wetherall, University of Washington
After a week-long measurement study of Internet path failures, they found that many links break briefly about once a day. They argue that this is probably often enough that it's worth fixing, but rare enough that you wouldn't want a heavyweight scheme like RON or a structured overlay. Based on where links tend to fail, they show what the best possible improvement would be for any one-hop scheme that routes around failures. Next, they showed that, if you were using a PL node as the failover, picking a random sample of any four of them would be about as good as you could do, and that if it didn't help immediately, it probably wouldn't anytime soon. What's cool about this is that it gives you resilient routing without requiring ongoing state maintenance (except, I suppose, a set of secondary routers). Finally, they showed their mechanism had little impact on improving web browsing because most errors occur above the routing level, but then argued that resilient routing might be useful in other contexts, which seems reasonable. Some details:
- Called scalable one-hop source routing (SOSR).
- How they chose destinations: ~400 popular web servers, ~1000 broadband hosts, ~1500 random IPs (used as a sanity check to see that web and broadband aren't that different).
- Gave a reasonable description of their gathering technique (more so than the Gnutella traces, for example).
- Found that ~85% of all paths experienced a failure per week.
- Where do failures happen? Core (tier 1), middle, last hop. Failures occur at each of these with the same order of magnitude for servers, but most broadband hosts tend to fail at the last hop.
- How long do they last? The majority of failures are 1-2 minutes, but with a long tail.
- The bad news is that path diversity tends to be missing at the ends, so broadband end loss probably can't be fixed.
- Potential benefit of any one-hop scheme: the best possible improvement is 66% for servers and 39% for broadband.
- [Note: it's unclear why you would want to fix the last hop for broadband when that host might not even be up.]
- If you pick four nodes at random, you have a very good chance of picking a node that will fix the broken link, if one exists. They point out that this requires no ongoing state.
- What kinds of benefits would a user see in terms of web browsing? They show that there is limited improvement for web browsing, because there are so few opportunities for improvement: the majority of failures are at the application level (e.g., TCP and the web server).
Krishna has put out a series of excellent papers where he combines traces with simulation and analysis. This paper is especially neat because it was compelling even with an essentially negative result.

CoDNS: Improving DNS Performance and Reliability via Cooperative Lookups
KyoungSoo Park, Vivek S. Pai, Larry Peterson, and Zhe Wang, Princeton University
Good motivation; a we've-got-the-PL-hammer-let's-find-a-nail design. Their motivation is to reduce latency to local DNS (LDNS) servers. These servers are occasionally overloaded, leading to client timeouts and unpredictable service (I've seen this). Usually you've got a 5-second timeout. The problem is that the DNS info is usually there in the local cache (hit rates of 80-90%), but the server just doesn't respond. I also know that I've written programs that would break when hit with this hiccup, so the motivation seems reasonable. They did a nice breakdown of why DNS hiccups happen. All of these contribute: dropped packets (DNS uses UDP), overload, cron jobs on the DNS box, and maintenance (misconfiguration and natural failures). The solution: hash the DNS name and contact a _remote_ CoDNS peer, based on the hash. Yikes. As brought up in Q&A, this would work the same as the existing system if you just entered a remote tertiary nameserver into your client. In addition, it has worse security than the current system. In a question about having all clients run a local DNS server, the speaker pointed out that this can consume a lot of memory.

Middleboxes No Longer Considered Harmful
Michael Walfish, Jeremy Stribling, Maxwell Krohn, Hari Balakrishnan, and Robert Morris, MIT Computer Science and Artificial Intelligence Laboratory; Scott Shenker, University of California, Berkeley, and ICSI
Basic idea: extend the Internet architecture to make middleboxes (NATs, firewalls) first-class citizens. Why?
- Handing off of principals (introduction) is hard when not every entity in the system has a unique identity.
- The current setup is a barrier to innovation.
To do this, he advocated a delegation-oriented architecture (DOA):
- restore globally unique IDs, independent of topology, via entity (or endpoint) IDs (EIDs)
- a delegation primitive: allow senders and receivers to explicitly add or remove somebody on the path between them.
The IDs were created with public keys. The path delegation primitive was done by adding another header to IP packets, containing the hash of (sender, receiver), i.e.:
IP header; (source EID, destination EID list); transport hdr

######################################################################
Topic: Automated Management II

Correlating Instrumentation Data to System States: A Building Block for Automated Diagnosis and Control
Ira Cohen, Hewlett-Packard Laboratories; Jeffrey S. Chase, Duke University; Moises Goldszmidt, Terence Kelly, and Julie Symons, Hewlett-Packard Laboratories
This speaker was enthusiastic enough about his work to make you believe that there just might be something here, even without (me having) the math background to properly evaluate it.
Problem: you have a bunch of time-synchronized metrics (50-200) and a binary condition that you are or are not achieving a desired level of service (as in a service level objective (SLO)). Their goal is to "turn metrics into information." From these many metrics, you want to know which set of them is really the bottleneck: e.g., am I having memory pressure on my middle tier? Should I add more CPUs to my database? Their approach is to use statistical pattern recognition, which allows you to look at sets of metrics in combination. Given all metrics M, find a subset of metrics M* such that f(M*) predicts SLO violations. Method: create a matrix
[m1, m2, m3, ..., mN, yes/no overload]
[next time unit]
[...]
Use "tree-augmented naive Bayes" (TAN) to estimate how often a sample function is correct, minimizing false alarms. This works when:
- There are patterns to find.
- There is bad behavior (yes/no).
I think he said that metrics need to be weighted. He noted this is called "classification" (finding a boundary) instead of "regression" (fitting a function).
Q: Have you looked at a dynamic Bayesian network? A: No.

Automatic Misconfiguration Troubleshooting with PeerPressure
Helen J. Wang, John C. Platt, Yu Chen, Ruyun Zhang, and Yi-Min Wang, Microsoft Research
At the beginning of her talk, she gave a little demo of a Windows configuration bug where, by not doing anything, apparently, you get a change. Excellent, and it got a good laugh. You've got lots of variables in your system registry in Windows, and similar variables in config files in Unix. Automatically diagnosing which variable might be the cause of a problem is their goal. Their technique assumes that a single guilty entry causes the malfunction. The technique is:
- All machines periodically send their registries to a central location.
- If you've got a bug, they try to see how you don't conform to the other registries in the DB.
- They use Bayesian inference to rank the potential guilt of all "entries of interest" (EOI).
- This returns the entry that you probably want to fix and says how likely the misconfiguration is.
Interestingly, larger samples didn't necessarily give better accuracy.

Using Magpie for Request Extraction and Workload Modelling
Paul Barham, Austin Donnelly, Rebecca Isaacs, and Richard Mortier, Microsoft Research, Cambridge, UK
Similar to the talk given at EW SIGOPS 2004.

######################################################################
Topic: Bugs

Using Model Checking to Find Serious File System Errors
Junfeng Yang, Paul Twohey, and Dawson Engler, Stanford University; Madanlal Musuvathi, Microsoft Research
They used their earlier work on CMC to develop FiSC, a file system model checker. In an idealized sense, it goes through these steps:
- start off with an empty disk
- try all possible operations
- check validity at each state
- prune this tree (mod minor differences)
Instead of doing randomized testing, they use a user-guided search to prune the state space. To try it out on each FS, they had to write a custom block driver for each, which took them 1-2 weeks per FS. This group, as usual, found a fair number of bugs in various FSs. Future work: try it on databases, RAID, consensus algorithms.
Q: Would model checking scale to distributed file systems? A: They can check basic levels of parallelism.
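The search loop FiSC performs is easy to picture. Here is a minimal sketch of that style of explicit-state exploration, purely my own illustration: the disk states, the operation set, the invariant check (fsck-like), and the canonicalization used for pruning are all hypothetical stand-ins, not the real system's.

```python
# Minimal explicit-state exploration sketch (not FiSC itself): start from an
# empty disk, apply every operation at each state, check an invariant, and
# prune states already seen (modulo "minor differences").
from collections import deque

def model_check(initial_state, operations, check_invariant, canonicalize):
    """Breadth-first search over file-system states; returns bad states found."""
    seen = set()
    queue = deque([initial_state])
    errors = []
    while queue:
        state = queue.popleft()
        key = canonicalize(state)      # prune modulo minor differences
        if key in seen:
            continue
        seen.add(key)
        if not check_invariant(state): # e.g., an fsck-style consistency check
            errors.append(state)
            continue
        for op in operations:          # e.g., mkdir, write, unlink, crash
            queue.append(op(state))
    return errors
```

In the real system the search is user-guided rather than exhaustive breadth-first, precisely because the state space explodes.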
CP-Miner: A Tool for Finding Copy-paste and Related Bugs in Operating System Code
Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou, University of Illinois at Urbana-Champaign
The problem space is that programmers copy-and-paste code, but then often don't correct all of the code that they should. Copy-pasted code is surprisingly (or unsurprisingly) common: 12% of Linux and 19% of X Windows is copy-pasted (known before), and they found that 35-45% of Linux and FreeBSD is copy-pasted code (allowing at most one change). They do code analysis to find examples of copy-paste using "frequent sequence mining," which finds frequent subsequences. CP-Miner is token-based, as opposed to parse-tree-based or string-based.

Enhancing Server Availability and Security Through Failure-Oblivious Computing
Martin Rinard, Cristian Cadar, Daniel Dumitran, Daniel M. Roy, Tudor Leu, and William S. Beebee, Jr., Massachusetts Institute of Technology
Very funny talk. Incorrect pointer arithmetic causes bounds violations in lots of programs. There are two current approaches to this: (1) return the wrong data (read) or corrupt data (write), i.e., use the bad pointer, or (2) catch it and crash the program. Their approach: just make up some data and give it back to the program (!). As he noted, this sounds crazy, but let's empirically try it to see if it really causes problems. He showed that, in five programs, returning made-up data was fine from a user's perspective. The made-up data returned actually followed a pattern that would hopefully cause loop exits, etc. He noted the places where this is or isn't appropriate; it's mainly appropriate when it eliminates corruption of global data structures (e.g., the call stack). He also noted that you could use this in Java, but I didn't think you had the capability to do this kind of pointer arithmetic there.
In Q&A:
- "We haven't done any work; we've done a little bit of thinking."
- Developers and users have different needs; developers would not want to use this, but users probably would.

######################################################################
Topic: WiPs

pDNS: Parallelizing DNS Lookups to Improve Performance
Ben Leong* and Barbara Liskov, MIT
Goal: speed up the 10% of DNS queries that take >2 seconds. Mechanism: cache NS records. How: cooperative caching using an overlay network. The main difference between this work and the work this morning is that you send out requests in parallel, so it doesn't lower security relative to current DNS.

Trickles: A Stateless Transport Protocol
Alan Shieh*, Andrew Myers, and Emin Gun Sirer, Cornell
On the server side, make TCP and below stateless. Migrate all state to the client and recreate state at the server via continuations. This allows transparent failover to new servers.

Surviving Internet Catastrophes
Flavio Junqueira*, Ranjita Bhagwan, Alejandro Hevia, Keith Marzullo, and Geoffrey M. Voelker, UC San Diego
Idea: back up your data on a host with a different OS or on hosts with different sets of vulnerabilities.

Honeycomb: Enabling Structured DHTs to Support High Performance Applications
Venugopalan Ramasubramanian*, Yee Jiun Song, and Emin Gun Sirer, Cornell
Generalizes Beehive (NSDI 2004) to support any popularity distribution, objects of different sizes, and object update rates.

PRACTI Replication for Large-Scale Systems
Mike Dahlin, Lei Gao, Amol Nayate*, Arun Venkataramani, Praveen Yalagandula, and Jiandan Zheng, UT Austin
Appears to suffer from trying to overcome the CAP theorem.
Shruti: Dynamic Adaptation of Aggregation Aggressiveness
Praveen Yalagandula and Mike Dahlin, UT Austin
Detailed views of nearby info and fuzzy views of far-away info. Looking at choosing different aggregation strategies based on different read-write patterns. Shruti monitors reads and writes and dynamically sets the strategy.

MOAT: A Multi-Object Assignment Toolkit
Haifeng Yu* and Phillip B. Gibbons, Intel Research Pittsburgh
Talking about the availability of single objects can be misleading when you need to access all of them. MOAT tries to assign objects intelligently for more reliable access to groups.

Causeway: Operating Systems Support for Distributed Resource Management, Performance Analysis and Security
Anupam Chanda*, Khaled Elmeleegy, Nathan Froyd, Alan L. Cox, and John Mellor-Crummey, Rice; Willy Zwaenepoel, EPFL
Causality tracking in multi-tier systems, like Magpie.

PLuSH: A Tool for Remote Deployment, Management, and Debugging
Christopher Tuttle*, Jeannie Albrecht, Alex C. Snoeren, and Amin Vahdat, UC San Diego
What are the right abstractions for large-scale experiments? An abstract description language, discovery, allocation, monitoring, etc. Their tool does all of this; Chris Small comments that there are cluster tools like this. There are also Grid tools, the speaker notes.

Using Inferred Emergent Behavior to Automate Resource Management
Patrick Reynolds*, Duke; Janet Wiener, HP Labs; Amin Vahdat, UC San Diego; Jeff Mogul and Marcos Aguilera, HP Labs

Using Access Logs To Detect Application-Level Failures
Peter Bodik*, UC Berkeley; Greg Friedman, Lukas Biewald, and HT Levine, Ebates.com; George Candea, Stanford
Users know when there's a problem, but they don't usually tell you. Based on anomalies in users' behavior, guess that certain pages are causing problems or have bugs. They are applying statistical tests. This sounds really hard.

A Trust-based Model for Collaborative Intrusion Response
Kapil Singh and Norman C. Hutchinson*, U British Columbia
This guy's student was denied a visa, so he had to come give the presentation. When a host on the Internet is attacked, why not attack back? Their idea is to send traffic back along the path, along with a proof that the host is being attacked.

The Ghost of Intrusions Past
Ashlesha Joshi* and Peter M. Chen, U Michigan
You are vulnerable for a certain amount of time before you apply a patch. They try to figure out how you can know whether you ever triggered the vulnerability.

SoftwarePot: A Secure Software Circulation System
Yoshihiro Oyama*, U Tokyo, and Kazuhiko Kato, U Tsukuba

Implementing an OS Scheduler for Multithreaded Chip Multiprocessors
Alexandra Fedorova*, Harvard and Sun Microsystems; Margo Seltzer, Harvard; Christopher Small, Sun Microsystems; Daniel Nussbaum, Sun Microsystems
[Definitely the most put-together and prepared WIP :) Really.]

Charon: A Framework for Automated Kernel Specialization
Mohan Rajagopalan* and Saumya K. Debray, U Arizona; Matti A. Hiltunen and Rick D. Schlichting, AT&T Labs Research

Java In The Small: Enabling Standard Java on Embedded Devices Through Customization
Alexandre Courbot* and Gilles Grimaud, LIFL; Jean-Jacques Vandewalle, Gemplus Research Labs
Small versions of Java have removed lots of interesting parts. I think the solution proposed is to run your app, see what parts of Java it's using, and only put those on the device.
Singularity: Software Systems as Dependable, Self-Describing Artifacts
Galen Hunt*, Jim Larus, Paul Barham, Richard Black, John DeTreville, Manuel Fahndrich, Wolfgang Grieskamp, Ulfar Erlingsson, Orion Hodson, Rebecca Isaacs, Mike Jones, Rustan Leino, Steven Levi, Qunyan Mangus, Dushyanth Narayanan, Sriram Rajamani, Jakob Rehof, Wolfram Schulte, Dan Simon, Bjarne Steensgaard, David Tarditi, Ted Wobber, Brian Zill, and Ben Zorn, Microsoft Research; and Martin Abadi, UC Santa Cruz
Another demo!

######################################################################
Topic: Kernel Networking

Deploying Safe User-Level Network Services with icTCP
Haryadi S. Gunawi, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau, University of Wisconsin, Madison
Interesting work along similar microkernel/infokernel veins to the rest of the InfoKernel project. The basic idea is that it would be interesting and useful to be able to deploy modifications to TCP at user level, for the standard motivational reasons:
- kernels are complex
- vendors dislike change
- chicken-and-egg problem with introducing a new standard
Benefits:
- new services are deployable
- flexible
- composable (stackable)
(This is pretty cool.) From this, they ask what information and control need to be made available to an app outside the kernel for it to be able to run TCP with its own extensions (e.g., Vegas). They are able to determine what info needs to be exported just by looking at the TCP spec, plus a bit more to allow newer extensions. What control is necessary:
- set internal TCP variables in a safe manner
- they make variables safe by having the user-level variables be "virtual" copies of the real variables, such that each virtual variable can only be set to a safe level (e.g., less than) relative to the real variable.
They only added 316 lines of code to the Linux kernel. They show an overhead of a 12% loss of throughput with 96 connections; the overhead increases with more connections.
Q&A: How to use this for UDP flows is in progress.

ksniffer: Determining the Remote Client Perceived Response Time from Live Packet Streams
David P. Olshefski, Columbia University and IBM T.J. Watson Research; Jason Nieh, Columbia University; Erich Nahum, IBM T.J. Watson Research
The problem they are trying to tackle is a good one, and I think they are on track for a reasonable approach, but it seems to have a big flaw. They observe that web serving companies would like to be able to accurately measure user-perceived latency: how long does it take to download not only a containing web page, but also all of its components? Existing methods to do this are:
- have JavaScript running on the client send a message when all parts have been downloaded.
- perform a packet capture at the web site; they say this does not scale and is not real-time.
Given this, they propose "ksniffer," an online packet capturing system. Running in the kernel of the web server, it monitors outgoing traffic and correlates when the whole page has been sent to the client. He went into a bit of detail on how to tell when the last embedded object was downloaded. Also, they use the referrer field to associate components with containing pages. The main interesting point in their experiments was that Apache's perceived latency was an order of magnitude off (lower) from the actual latency, and that their tool was very close. The main flaw is that they assumed all of the objects existed on a single web server.
At least one web site where I worked, nytimes.com, did not do it this way and instead segmented embedded graphics onto different machines with different DNS names so that the objects would be cached in memory. Their correlation "at line speed" doesn't appear to be possible without extremely well-synchronized clocks. Somebody asked a related question about CDN caching.

FFPF: Fairly Fast Packet Filters
Herbert Bos and Willem de Bruijn, Vrije Universiteit Amsterdam, The Netherlands; Mihai Cristea, Trung Nguyen, and Georgios Portokalidis, Universiteit Leiden, The Netherlands
There are several good motivations for network monitoring. They believe that existing tools are not sufficient because the speed of the network has increased at a far greater rate than processing power. A few motivations are:
- monitoring at border gateways; e.g., they showed that 70% of the outbound traffic at Wisconsin was unknown (because applications are using dynamic ports).
- Dan Ellard's FS work, where a significant percentage of the packets were dropped because they couldn't be processed fast enough.
The way you describe what you want to monitor is with "flows": e.g., all UDP packets, or all packets going to eth0; and a user might be interested in byte counts or other arbitrary user criteria. Given these pipe-like inputs, they create graphs of nodes where each node contains two components: an array for scratch space and a circular buffer; these point to another circular buffer which contains the actual packets. Also, they implemented this on the network card (and could also do it in the kernel and in user space), which was pretty cool. E.g.:
((device,eth0)|(device,eth1))->(sample,2)->(FPL-2,"..")|(BPF,"..")->(bytecount)
They manage the big buffer with either a "slow" read preference or a "fast" one. They implemented pcap on top of this. They noted that the main win is if multiple applications are in the same flow group.

######################################################################
Topic: File and Storage Systems II

Energy-Efficiency and Storage Flexibility in the Blue File System
Edmund B. Nightingale and Jason Flinn, University of Michigan
This work is motivated by the goal of ubiquitous data access. If you have a laptop at work, some portable memory device, and you also work at home, you ought to be able to access your data wherever it is. You should be able to access it from the fastest place currently available, given that this access will be energy-aware and not stale. You should also be able to glue files to devices if, for example, you always want to carry around your presentations with you (on your memory stick).
- They use an "adaptive cache hierarchy" to pick the best device to read data from (e.g., remotely if your disk is already spun down).
- He went into a bit of detail on how you would access the first few blocks of a file from the network while your disk is spinning up.
- They support read-from-any, write-to-many semantics -- you write your data to all devices.
- As noted, you can "glue" files to devices.
- Because devices have differing amounts of storage, they use an LRU eviction policy with each device.
- They use Coda's cache consistency and AFS's callbacks, with several modifications which he went into.
Q: Does this work if you are disconnected? A: No, "All bets are off."

Life or Death at Block-Level
Muthian Sivathanu, Lakshmi N. Bairavasundaram, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau, University of Wisconsin, Madison
They start from the viewpoint that a very common interface between a machine and its storage is SCSI; given that we are stuck with SCSI for the foreseeable future and that we have smart storage, this storage should be able to figure out better what the file system is trying to do. Their main motivating example, which seems reasonable, is that programs that try to scrub (securely delete) a block often do so above the file system, but, with modern storage, you don't even know if your re-writes are hitting the same disk blocks. Another motivation for this work is that having a disk know whether a block has been deleted enables a bunch of optimizations: AutoRAID, eager writing, faster recovery (FAST 04), rotational replication (OSDI 2000) (although I imagine that at least AutoRAID worked before this). What this forces them to do is guess what a given FS is doing above them. They generalize from the liveness of a single block to sets of blocks via "generations." Interestingly, they show how extending the SCSI interface to say when a block is alive or dead while retaining the same semantics would be quite difficult. He also noted that their approach requires FS-specific knowledge, but many modern storage systems already come with this info (what do they do with this info?).

Program-Counter-Based Pattern Classification in Buffer Caching
Chris Gniady, Ali R. Butt, and Y. Charlie Hu, Purdue University
They propose a new buffer caching mechanism based on the machine learning technique of program-counter-based pattern classification (PCC). They related their work to other types of replacement policies:
- frequency/recency based: LRU, LFU, FIFO, ARC all have pathological cases
- hint based: the program gives hints about file reference patterns; complex
- pattern-based classification: detect reference patterns
- state of the art: UBM (Kim 2000)
UBM, like the work here, divides accesses to a file into sequential, looping, or other. The main drawback of both approaches is that you need training for each new file. The assumption on which this work is based is that each time you access files from within a given code stack, you are probably going to access them in the same way. It needs a learning period for a particular bit of code, but it is able to carry over that knowledge if you access multiple files with it. In addition, this is able to incorporate the fact that you might have an I/O subroutine at the bottom of your code -- that's why they use the whole stack. They compared against the frequency-based policies in a simulator and against UBM in an implementation. In their paper, they say that the pathological case for ARC (and LRU) is when a loop doesn't fit into memory. I was trying to think of when this was, as ARC seems more "state of the art" than UBM, which they compared against. In Q&A, someone pointed out that you may need to reserve more than one block for sequential access if more than one app is doing this. Doesn't seem like a major flaw, however.

############################################################
NOTES from WORLDS 2004
December 5, 2004
Jonathan Ledlie
############################################################

Platforms
Session Chair: Steve Gribble

Global-scale Service Deployment in the XenoServer Platform
Evangelos Kotsovinos, Tim Moreton, Ian Pratt, Russ Ross, Keir Fraser, and Steven Hand, University of Cambridge Computer Laboratory; Tim Harris, Microsoft Research, Cambridge, UK
They've broken the Xeno VM model into four parts: clients, servers, search, and "corp."
Search is for discovery, and is something I should look into more. "Corp" is used for authentication and handles charging payments (it functions as a trusted 3rd party). Clients contact corp, then search for potential servers, then deploy their services on servers. At startup a Xen VMM runs on the hardware, which kicks off one Xeno daemon. Clients use this daemon to start up each of their services. Why do we need a new model for service instantiation? Because existing models don't give what they're looking for. Existing models are: grids (no efficient migration), Frisbee (LAN only), and PL and co-deploy [didn't write down their problems]; others don't give parallel deployment.
A simple, sensible, and cool idea for making setup of a service fast on a node: they store a set of template FSes locally on each (server) node. Each VM then builds on one of these. The part that is built/used is the combination of patches the client provides and what is migrated. They call this "stacking." [Of course, this leaves open the problem of where to draw the line on what common stuff to put in the templates.] They use AFS for remote access of patches.
Evaluation: installing Apache as a service; 99% of the state is stored locally and 1% is shipped via AFS; starting a service from scratch took 31 seconds: 30 to ship, 1 second to start the service. They note that service installation can be done in parallel.
Q: What do you do about shipping the IP address and other nice OS/lower state when relocating services? A: They don't do it.
Q: How are you going to handle customization? E.g., in terms of handing out shares based on the amount of resources given out locally. A: [not sure, but I don't think they do this yet]
Q: [by David Oppenheimer] Why did you use AFS vs. one of these newer systems based on, e.g., DHTs? (Jay Lepreau whispers loudly, "It works.") A: We didn't evaluate them, and maybe we should have.

Safely Harnessing Wide Area Surrogate Computing or How to Avoid Building the Perfect Platform for Network Attacks
Sachin Goyal and John Carter, School of Computing, University of Utah
Problem: enable useful apps, block malicious ones. Let's say someone wants to access SkyServer data. In order to do their computation they might want to set up a service on a nearby "surrogate" compute machine, to avoid data shipping. But isn't this kind of platform wide open to interesting attacks? [Note: this speaker was a little hard to understand.] I think the basic idea is for a client who wants to use a nearby (or just some other) machine for surrogate access to use a pre-existing trust relationship with the surrogate. This relationship might be set up on the fly by contacting a third party. Fair enough.

Implementing the Emulab-PlanetLab Portal: Experience and Lessons Learned
Kirk Webb, Mike Hibler, Robert Ricci, Austin Clements, and Jay Lepreau, School of Computing, University of Utah
Goal: provide an easy-to-use interface for PL users. [Seems to give a pretty simple way to set up slices. Maybe we should use this.] They monitor all-pairs ping data and link characteristics. [Because P. Pietzuch wants this data, I talked with the speaker, Kirk, later about this.] A difficulty in using the interface, perhaps, is that the different user communities have different models for their experiments:
- Emulab aims at rapid-cycle experiments
- PL at long-running services.
He distinguished between the goals of Emulab and PL as another way of expressing that very few people were using their service to set up experiments.
One of the problems they grappled with was trying to provide consistent identities for nodes when the nodes kept changing their internal information (e.g., IP, slice name). Metrics running on PL that they are accessing: trumpet, ganglia, emulab watchdog.
Q: David Culler comments that the Emulab portal's best use is for the typical experiment (small trials and then on to large ones); isn't that what they are seeing? A: They just aren't seeing a lot of usage at all.
Dave Culler quoted someone: "In theory, theory and practice are the same, but in practice they are different." :)

############################################################

Monitoring
Session Chair: Steve Hand

Design Considerations for Information Planes
Brent Chun, Joseph M. Hellerstein, Ryan Huebsch, Petros Maniatis, and Timothy Roscoe, Intel Research, Berkeley
** Something for HG/SBON developers to really think about; see also Deutsch's principles below. Also lesson #5 here.
Contention: any big distributed system uses an information plane. Idea: factor out this plane into a discrete unit. What would this plane do? Route info around the system intelligently. How can you send it intelligently? Filtering, aggregation, etc.
Related work (things that look like info planes):
- sophia, astrolabe, irislog, smims (all geared toward info management)
- sword, trumpet, ganglia (RD for PL)
- pluto (routing underlay service)
- pier (dist query processing [speaking about the network intrusion aspect])
His group has been focusing on PIER. PIER has been running continuously for 14 months, accessing SNORT, which is an intrusion detection system. They also use it to query Bamboo routing tables. He asks: what should an info plane look like, and what lessons have we learned through building PIER? [He mentioned wanting to use PIER to do recursive queries, but he didn't say why.]
Lessons:
* Emulate at multiple levels; this allows you to find bugs at each level.
* How do you even validate what you are getting back? Answer: trace the lineage of each tuple on its route through the system.
* Avoid layer interactions.
* Queue as late as possible (don't serialize an object that you are going to need to pull off a queue to look at).
* Are DHTs useful? The abstraction is very useful, but there is no control over where the data goes and which machines are processing which data; this argues for more flexible overlay construction (more network knowledge, richer interface?). ** Idea with potential **
Future work on PIER:
- What's the right way to fold info about failures into the result set?
- Ensuring integrity with buggy nodes.
- Focus on protocol design (see #5).
Q: Doug Terry asks if users will know enough in order to add network info to their queries. A: They probably will.

A Shared Global Event Propagation System to Enable Next Generation Distributed Services
Paul Brett, Rob Knauerhase, Mic Bowman, Robert Adams, Aroon Nataraj, Jeff Sedayao, and Michael Spindel, Intel Labs, Intel Corporation
Users can listen on IM channels to find out about resource availability (e.g., what is the load on this node). Because this info is useful historically, they stick it in a DB. It's unclear if this DB is centralized or residing on each PL node.
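On the tuple-lineage lesson from the info-planes talk above: the idea is simply that every hop a result takes through the system gets recorded on the result itself, so an answer can be audited later. A tiny sketch of what that tagging might look like (my own illustration, not PIER code; the node names and tuple fields are made up):

```python
# Tiny lineage-tagging sketch (my illustration, not PIER): each hop appends
# itself to the tuple's lineage so the result's route can be validated later.
def forward(tup, node_id):
    tup.setdefault("lineage", []).append(node_id)
    return tup

t = {"key": "snort_alert_42", "value": 7}
for hop in ["nodeA", "nodeC", "nodeF"]:   # hypothetical route through the overlay
    t = forward(t, hop)
print(t["lineage"])                        # ['nodeA', 'nodeC', 'nodeF']
```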
Distributed Resource Discovery on PlanetLab with SWORD
David Oppenheimer, EECS Computer Science Division, University of California Berkeley; Jeannie Albrecht, Department of Computer Science and Engineering, University of California San Diego; David Patterson, EECS Computer Science Division, University of California Berkeley; Amin Vahdat, Department of Computer Science and Engineering, University of California San Diego
This talk was really a few talks in one: an intro to SWORD, one graph which shows that maybe storing their measurements in a DHT isn't such a good idea after all, and a discussion of the tradeoffs between different evaluation environments (e.g., sim, ModelNet, PL). The first couple of slides were straight from the SWORD TR, which is an interesting read. SWORD's motivation is to provide access to and run queries on PL node attributes. He divides attributes into two groups:
- rapidly changing attributes, both per-node and inter-node
- slowly changing ones
(and they optimize for each differently). Currently, SWORD provides integrated resource discovery and service placement. Existing monitors populate the query processor with measurements. They use Ganglia, Trumpet, slicestat (via CoTop), and Vivaldi.
Evaluation:
- centralized vs. distributed: showed that splitting the central QP across (just) two nodes outperformed the DHT; the centralized solution works fine (better) for networks of this size.
Comparison of evaluation environments:
- Has a chart on sim vs. emulator vs. PL.
- They found a max of 1000 on a 40-node cluster using ModelNet.
Then he went into a pretty interesting comparison of evaluation environments, although nothing too surprising. One idea was to derive traces from your experiment on PL and then use them to drive simulators. Seems similar to the problems FS designers have been encountering for decades.
Q&A:
- Might be useful if there were a more direct bridge from PL to simulation.
- Culler brought up giving statistically significant results, especially when we say results on PL are not reproducible.
[This was the best-rehearsed talk of the morning, if not the day. Many of them were not well rehearsed.]

############################################################

Good talk with Hakim Weatherspoon over lunch, whom I met at EW SIGOPS 2002. We talked about the parallel between the NASD approach of separating metadata and data and the fact that all of these "distributed" systems, at least those based at Berkeley that he knew about, perhaps hypocritically, require a centralized infrastructure to keep them running. SWORD, Bamboo, and OceanStore all use a centralized Postgres database for management. Every minute, every node that is up sends a message to this central point, asking if it needs to do anything different. We started talking about the parallel between this, Napster, and NASD, where you use a central point for metadata but distribute data transport. This metadata-centralized/data-decentralized split appears to be orthogonal to the decision to use objects or blocks (OceanStore operates in the latter).
- Open question: does this model work across federations?
- Another supporting example of this is a project called Maze, based in China, which is a centralized, Napster-like p2p file-sharing network.
An assumption that some people appear to be making in talks and Q&A is that CDNs (or similar) will start offering something like PL in the future; that is, boxes on which you will start up services/utilities, and that these services will migrate, have monetary costs, etc.
This seems reasonable to me, but I have no idea if it makes sense from a business perspective.

############################################################

IT Happens
Session Chair: Vivek Pai

The Seven Deadly Sins of Distributed Systems
Steve Muir, Department of Computer Science, Princeton University
He highlighted several problems that developers might well run into in doing development on PL. A few were:
- Time can go backwards; plan for it.
- He encourages failsafe behavior.
- All kinds of bugs crop up when you are running on enough nodes; that is, you see what you think should be rare behavior.
- Overutilization of PL nodes is the norm.
See Peter Deutsch's eight fallacies of distributed computing. As a response to several people pointing out "overutilization" of PL nodes, others have said that the norm in any business workload is for machines to have 10-20% utilization. So it's unclear if this maxing out of resources is just an artifact of the environment or if it would actually happen whenever you get shared boxes.

Beyond Availability: Towards a Deeper Understanding of Machine Failure Characteristics in Large Distributed Systems
Praveen Yalagandula, The University of Texas at Austin; Suman Nath, Carnegie Mellon University; Haifeng Yu, Intel Research Pittsburgh and Carnegie Mellon University; Phillip B. Gibbons, Intel Research Pittsburgh; Srinivasan Seshan, Carnegie Mellon University
He took three uptime workloads, cleaned them up a bit (see below), and looked at the correlation between characteristics like availability, MTTF, MTTR, and number of failures. These were his findings:
- High availability does not imply higher MTTF and lower MTTR.
- You can predict MTTR and MTTF from node history.
- A system should not rely on predictability of failure.
- Interesting related work: a Harchol paper from SIG___ 1996 on Pareto lifetimes showed that P(TTF >= 2T | TTF >= T) = c.
- Occurrence of failures was not independent (there was a correlation). Interesting because most theoretical results rely on failures being independent.
Q: How should one model correlated failures? A: Ongoing work at Intel Pittsburgh.
Q: Sean Rhea asks if the ping data was scrubbed, because there was a bug in the tool that made it look like PL went down once an hour. A: Yes, they corrected for this bug.
Q: Jay Lepreau comments that ping is a very weak measure of availability, because to him "availability" means that he can set up a slice on that node, which is more difficult.

Lessons from E-speak
Alan H. Karp, Hewlett-Packard Laboratories
The speaker talked about why his product was shut down at HP.

############################################################

Panel: Important, But Unpublishable
Session Chairs: Timothy Roscoe and Brad Karp
Panelists: David Mazieres, New York University; Robert Morris, Massachusetts Institute of Technology; Sean Rhea, UC Berkeley; Ben Zhao, UC Santa Barbara

Brad: It's hard to publish "engineering" solutions.
** Sean Rhea: reminds us of Rob Pike's "Systems Software Research is Irrelevant"; when did you last see an exciting non-commercial demo? What can we do to encourage more demos? His suggestion: find a way to have a demo session at each SOSP (etc.), complete with a prize. I think, and so did several of the Q&A people, that this is a great idea.
Mazieres: Referred to "The First Seven Years of Multics": through building, you understand. Publishing on real systems is a lot more labor intensive. With more systems conferences, to publish more papers you need to write less code.
Getting your code up to the point where it can be open sourced gives it much higher potential for impact.
Ben Zhao: less quantity, more quality; mentions the Dagstuhl seminar model, where there is no publication requirement to enter the workshop, but a paper is produced at the end which contains a summary of the discussions from it.
Robert Morris: He gave a very funny, self-deprecating talk. Good news: the research community is very sympathetic to real, built systems. Bad news: not sure if sympathy gets you very far. Methods for wasting time:
- Solve the wrong problem: e.g., Ivy; nobody needs it.
- Focus on performance: e.g., Click, where they wasted years on performance that nobody needs.
- Give up too early: you've put a bunch of work into a project and then have a hard time seeing its real application. E.g., RON follow-on systems; think harder to come up with a better application for your system.
- Write badly: e.g., writing about the system, not the ideas; you think bottom up, but need to write top down.
Q&A:
- If you look at the list of people who did demos, they are actually doing quite well, so maybe it's not such a waste of time.
** Jeff Mogul: proposes banning performance sections at a conference and just allowing numbers on usability and other hard-to-measure qualities. Force people to come up with other metrics.
- Alan Karp: In experimental physics, you use a bunch of grad students to each build part of a larger system, and each of these parts is a thesis (this didn't get much traction).
- Andrew Tanenbaum: the fixation on performance is just wrong; focus on new ideas!

Talked with Jay Lepreau and others from Utah about the current state of Emulab and the possibility of using it for experiments on disconnection in HG. The way to do it would be to be sure to ask for your maximum allocation of resources at the beginning of the experiment, then partition, then reconnect. He said this shouldn't be a problem (i.e., with link fidelity).

Deployment of a Large-scale Peer-to-Peer Social Network
Mao Yang and Hua Chen, Peking University, Beijing, China; Ben Y. Zhao, U. C. Santa Barbara, Santa Barbara, CA; Yafei Dai, Peking University, Beijing, China; Zheng Zhang, Microsoft Research Asia, Beijing, China
I don't think this talk occurred, because I don't have any notes on it.

############################################################

Distributed Information
Session Chair: Lucy Cherkasova

Deploying Large File Transfer on an HTTP Content Distribution Network
KyoungSoo Park and Vivek S. Pai, Department of Computer Science, Princeton University
[It was kind of unclear what this talk was about.]

Towards a Deployable IP Anycast Service
Hitesh Ballani and Paul Francis, Cornell University
As the speaker noted, this paper should have been in HotNets; this was the wrong venue. Relevant to my work on resource discovery in terms of the pros/cons of IP Anycast. More related work: YIORD (sp?).
- The paper discusses the pros and cons of IP Anycast.
- "Let's imagine all of the limitations I listed on the previous slide disappear; let's call this IP Anycast*; now, let's look at how useful this is." Interesting approach.
- Discussed how IP Anycast could be made more scalable (instead of just using app-level anycast).
DISC: A System for Distributed Data Intensive Scientific Computing
George Kola, Tevfik Kosar, Jaime Frey, and Miron Livny, Computer Sciences Department, University of Wisconsin-Madison; Robert Brunner, Department of Astronomy and NCSA, University of Illinois at Urbana-Champaign; Michael Remijan, NCSA, University of Illinois at Urbana-Champaign
Motivation: resources are idle a lot. Current solutions (versions of Condor) only work well when data requirements are minimal. Problems with current solutions for accessing remote data:
- remote I/O, like NFS, leads to high latency.
- staging (but you need to know what files are going to be accessed).
He had a proposed solution for moving data around more efficiently, but I'm not sure what it was, other than that it involved DAGs. Related to Grid work in general; not super-closely related to PASS.

############################################################

Panel: Perspectives on the Future Internet: Grid vs. Networked Systems
Session Chairs: David Culler and Ian Foster
Panelists: Carl Kesselman, Greg Papadopoulos (CTO of Sun), and Larry Peterson

My take on the panel: PL is a nice, naive playground for CS people to come up with clean solutions to clean problems. Eventually, these clean solutions might (probably will) be worked into reality. So, it's good this environment exists so that this can happen. The Grid folks are happily setting up complete solutions (including ideas we like to ignore, like security). They (via Carl) believe that once these solutions and their interfaces are built, we can go back and refine them, as one normally does behind a relatively fixed API. This obviously has the problem of changing APIs if we decide they are wrong later, but he seemed OK with glossing over that. Greg, instead of acting like a moderator, pushed a third ("Jim Waldo") view that there are fundamental distributed CS problems out there that still need to be solved. He came at this from the point of view that Web Services, which is advocated by the Grid community, is a fundamentally flawed idea in terms of Deutsch's fallacies. He also said that the Grid and commercial meanings and goals for "utility computing" are diverging, although he didn't go into much detail on this.