Design Considerations for High Fan-in Systems: The HiFi Approach
Michael Franklin et al. (Berkeley)
CIDR, January 2005

This is preliminary work describing the group's aim to distribute their TelegraphCQ system across a hierarchical network, with a data warehouse at the root. Their main motivating scenario is speeding up supply chain management by providing more timely information about what is where in the supply chain. They consider RFID tags attached to products passing through loading dock doors at the low levels of a hierarchy, with this information aggregated at regional levels and finally centralized (e.g., in Bentonville, AR).

Although they hardly need to motivate using a hierarchy for this kind of work, they do point out that hierarchies can trim out redundant data, reducing the size of data streams, and that many queries will be regional or local, as in "how many widgets are in my store?" I'm not sure how novel it is (though it doesn't appear to be how things are done now), but they argue that the same general format of streaming queries can be used all the way up the hierarchy; they call this cascading streams. As data flows up the hierarchy, it passes through five stages: clean, smooth, arbitrate, validate, and analyze (CSAVA, pronounced "cassava"). This seems like a useful idea and perhaps generalizes beyond supply chain management.

They discuss the problems of topological fluidity and query planning that we have been dealing with in Hourglass. They combine static and dynamic approaches to deal with network topologies, although they don't really go into how the dynamic part would work. They do discuss how mobile nodes may become disconnected or need to switch parents en route, and the need to support disconnected operation. They state that the goal of query planning is "to reduce system bandwidth usage by pushing operators down the hierarchy" and mention that to exploit shared processing it "may be advantageous to pull the operators up."
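To make the bandwidth argument concrete, here is a minimal sketch of a two-level hierarchy in which each dock node drops duplicate tag reads and forwards only per-product counts to a regional node. The function names (`dock_summary`, `regional_summary`) and the `(tag_id, product)` tuple layout are hypothetical, chosen for illustration; nothing here is taken from the paper or from TelegraphCQ.

```python
from collections import Counter

def dock_summary(readings):
    """Collapse raw (tag_id, product) reads at one dock into per-product counts.

    Repeated reads of the same tag are redundant, so each tag counts once.
    """
    seen = {}
    for tag_id, product in readings:
        seen[tag_id] = product  # a duplicate read overwrites instead of adding
    return Counter(seen.values())

def regional_summary(dock_summaries):
    """Merge the per-product counts forwarded by child dock nodes."""
    total = Counter()
    for summary in dock_summaries:
        total.update(summary)  # Counter.update adds counts together
    return total

# Hypothetical raw reads: the same tag is often seen more than once.
dock_a = [("t1", "widget"), ("t1", "widget"), ("t2", "widget"), ("t3", "gadget")]
dock_b = [("t4", "widget"), ("t4", "widget"), ("t5", "gadget")]

region = regional_summary([dock_summary(dock_a), dock_summary(dock_b)])
print(dict(region))  # {'widget': 3, 'gadget': 2}
```

In this toy case seven raw reads turn into two small summaries sent upward, and a local query such as "how many widgets are in my store?" can be answered from `dock_summary` alone without contacting the parent.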
They also mention the problem of converting simple RFID information into complex events, such as recognizing that shoplifting has just occurred.

One nice idea they describe is the use of "virtual devices." "A virtual device interfaces with multiple raw receptors that are in close proximity, processing and fusing their data streams to produce more useful, higher-quality data." Thus, instead of raw devices forming the first level of their hierarchy, the data has already been cleaned up a little and put into stream format before it enters the system. They mention that the lineage of each data item is tracked, although I'm not sure how this can happen at the virtual device level (it would appear to need to happen higher up). The device also exports other properties of the data, such as its quality.

Their architecture appears fairly preliminary, but was enough to produce a demo at VLDB 2004. It's divided into three components:

- The metadata repository is the one that seems most preliminary, but they have given some thought to what it needs to hold and how often each kind of data in it would change. It would store a very infrequently changing "mediated schema," which would contain entries for sensor and RFID data in the supply chain example. The repository would also contain the "views" exported by each node in the system (and the node's location); this essentially captures the schema of each ongoing query. The repository also contains "system information," which includes "node capabilities, authorization and privacy controls, and information relating to organizational boundaries and administrative domains" and "maintains runtime information, such as the current set of queries running on each node, current network usage, and unavailable/unreachable nodes to help guide and optimize system behavior." They are currently investigating how to build this.
- The data stream processing component of each node is where they are going to plug in TelegraphCQ.
- The "HiFi Glue" is the rest of the goop needed to make this work. They mention several global services: query placement, a physical query planner (which maps logical queries to physical ones based on changing topologies), and control for overall management.

They mention HG in the related work.

Other items:

- RFID tags often have a standard 96-bit Electronic Product Code (EPC), which might be scanned 10 to 15 times through the supply chain.
- Windows are specified by range and slide parameters (typically expressed in time or tuples).
- Sarbanes-Oxley, www.sarbanes-oxley.com
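The range and slide window parameters noted in the list above can be illustrated with a small sketch. This is only my reading of the general idea: the function name `sliding_windows`, the half-open interval convention, anchoring the first window at the first timestamp, and emitting trailing partial windows are all assumptions, not the semantics defined by the paper or by TelegraphCQ.

```python
def sliding_windows(tuples, range_, slide):
    """Yield (start, values) pairs for time-based windows.

    tuples: list of (timestamp, value) pairs sorted by timestamp.
    range_: width of each window, in time units.
    slide:  distance between the starts of consecutive windows.
    Assumed convention: a window covers start <= t < start + range_.
    """
    if slide <= 0:
        raise ValueError("slide must be positive")
    if not tuples:
        return
    start = tuples[0][0]
    last = tuples[-1][0]
    while start <= last:
        values = [v for t, v in tuples if start <= t < start + range_]
        yield start, values
        start += slide

# Hypothetical stream of (timestamp, value) reads.
reads = [(0, "a"), (1, "b"), (2, "c"), (3, "d"), (4, "e"), (5, "f")]
for start, values in sliding_windows(reads, range_=4, slide=2):
    print(start, values)
# 0 ['a', 'b', 'c', 'd']
# 2 ['c', 'd', 'e', 'f']
# 4 ['e', 'f']
```

With slide smaller than range, consecutive windows overlap, as here; with slide equal to range, each tuple falls into exactly one window.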