I came across a very interesting paper that was published just a few days ago at the SAN workshop. It's called "Implementation and Evaluation of an Active Storage System Prototype". This paper is relevant to the idea we have been discussing in our vino research meetings. It describes an implementation of a storage system that offloads computation to the disk and requires very little modification to the operating system. I found their experimental results very interesting: they give us a clue about which workloads we should target if we decide to build a system with a similar purpose.

This paper is not yet on the web. If you want a copy, come to my office or let me know where your mailbox is, and I will drop it off for you there. If you don't really want to read the paper, but want to know what it is about, read my brief review below. (Well, it was going to be brief, but it ended up being 2 pages.)

********************************************************************************

Implementation and Evaluation of an Active Storage System Prototype
(Xiaonan Ma, A.L. Narasimha Reddy)

The problem:

The authors wanted to create a network-attached storage system in which computation can be offloaded from the host machine to the network-attached disk. The novelty of their approach is that they implement this storage system with minimal modifications to the host operating system, with most of the file system code remaining in the host file system. They claim that previous approaches required either porting a file system to the device (which is hard) or making significant changes to the operating system, and thus were not very practical.

Architecture:

The key point of their approach is the use of "virtual files". A virtual file is a combination of a physical file and some filtering or processing applied to that file. For example, a physical MPEG file /bar.mpg can have virtual versions /vd1/bar.mpg and /vd2/bar.mpg corresponding to different levels of quality. Virtual files are exported by the multi-view file system (MVFS), a stackable file system based on the vnode structure, which can therefore interact with the kernel through the standard vfs/vnode interface. Communication between MVFS and the storage device that involves any "smart" activity (e.g. requests for filtering a file) happens by writing to addresses located outside of the device's address range. The device has enough intelligence to parse these "out-of-band" requests, perform the requested filtering, and return the filtered data.
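To make the out-of-band mechanism concrete, here is a rough C sketch of what the host side of such a request might look like. This is my own illustration, not code from the paper: the request layout, OOB_BASE, and all the names are invented, and a stock block device would of course reject a write past its last block, which is exactly what their modified disk machines do not do.

    /* Sketch of the "out-of-band" request idea: the host writes a
     * command record at an offset beyond the device's data range,
     * and the device-side software parses it as a command instead
     * of storing it.  The struct layout, OOB_BASE, and all names
     * are my invention -- the paper does not give a request format. */
    #include <stdint.h>
    #include <unistd.h>

    #define DEV_BYTES  (1ULL << 33)   /* assumed 8 GB device          */
    #define OOB_BASE   DEV_BYTES      /* first byte past the end      */
    #define OOB_MAGIC  0xA5A5A5A5u    /* arbitrary command marker     */

    struct oob_request {
        uint32_t magic;      /* marks this write as a command, not data */
        uint32_t filter_id;  /* which filter, e.g. an MPEG quality level */
        uint64_t file_id;    /* physical file backing the virtual file   */
        uint64_t offset;     /* byte range being requested               */
        uint64_t length;
    };

    /* Issue a filter request by writing it past the device's address
     * range; the filtered data then comes back over the normal path. */
    int send_filter_request(int dev_fd, const struct oob_request *req)
    {
        ssize_t n = pwrite(dev_fd, req, sizeof(*req), (off_t)OOB_BASE);
        return n == (ssize_t)sizeof(*req) ? 0 : -1;
    }

Presumably MVFS would generate a request like this when an application reads a virtual file such as /vd1/bar.mpg, and then read the filtered bytes back through the ordinary data path.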
Prototype implementation:

All experiments were performed on a testbed consisting of several "disk machines" connected to the host machine through a 10/100 Mbps Ethernet switch. The on-disk computation was not actually performed on the disk, but on the CPUs of the disk machines, which were just regular Pentium machines with slow (166 MHz) CPUs. The host machine had a CPU speed of 233 MHz. (Note that the host CPU is only slightly faster than the "on-disk" CPU. Is this a realistic case? If not, we have to account for that when interpreting their results.)

Experimental results:

The applications used were the following: file encryption, MPEG QoS filtering, and median filtering (explained later).

Encryption:

This experiment involved encrypting a file before writing it to the disk, and decrypting it before giving it to the application. The authors compare two scenarios: the host machine doing the encryption/decryption itself, and the disk doing the encryption/decryption before or after the data exchange with the host. The experiment was performed over fast (100 Mbps) and slow (10 Mbps) links.

Note that in this scenario offloading the computation to the disk frees up CPU cycles on the host, but does not reduce the amount of data transferred between the host and the disk. So in this case one can only see a benefit from on-disk computation if the host can somehow use the freed-up cycles. In their encryption experiments the host had nothing to do while the disk was encrypting the file, and the spare cycles were wasted. Consequently, their results show that the benefit of offloading the computation to the disk appears only if several disks perform the computation in parallel; otherwise it is much faster for the host CPU to do the computation itself. When they ran this experiment over the slow (10 Mbps) link, both configurations (disk performing the computation and host performing the computation) performed comparably, because the network link was the bottleneck and no reduction in data traffic was achieved.

MPEG filtering:

This experiment involves filtering an MPEG file to produce different quality levels for the MPEG video. Again, they compare the case when the filtering is done on the host and on the disk. Note that this set-up reduces the amount of data transferred from the disk to the host by a factor of 3 when the disk does the filtering. The results show that over the fast link, the on-disk case outperforms the host case only when more than one disk is used. When only one disk is used, there is no performance gain if the network link is fast, but there is a gain when the link is slow, because then the network, not the CPU, is the bottleneck, and reducing the data traffic helps. The point to take away from this experiment is that reducing the amount of data transferred pays off only when the communication link is the bottleneck. When the CPU is the bottleneck, one sees a difference only when the combined CPU power of the disks is higher than that of the host. Again, it would be interesting to look at a scenario where the host has some way to use the cycles it frees up when the processing is done on the disk. But surprisingly, the authors don't consider such a scenario.

Median filtering:

Median filtering is a way of reducing noise in an image (the algorithm involves finding medians of subsets of pixels). The authors implemented their algorithm such that the computation is split between the host and the disk CPUs. Their algorithm divides the work between the host and the disk in a not-so-simple way (hint: we should look for applications where the computation can be naturally divided between the host and the disk; "embarrassingly parallel" is the right term, I think). Their experiment shows how throughput changes as the proportion of computation allocated to the disk is increased. They perform this experiment with different numbers of disks (from 1 to 5). The performance curves have the same shape for any number of disks: throughput rises as more computation is given to the disk, but only up to a certain point; then it drops as the disk does more computation. This happens because if you offload too much computation to the disk, the host CPU sits idle and wastes cycles. Peak performance is achieved when both CPUs are kept busy at all times. Unfortunately, they offer no performance comparison with the very interesting case where the host does all the computation alone.
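The shape of these curves drops out of a simple back-of-the-envelope model. Here is a small C sketch (mine, not the paper's) that treats the host and the disks as working concurrently on their shares of the job. The CPU "speeds" are stand-ins loosely suggested by their 233 MHz host and 166 MHz disk machines, and transfer time is ignored.

    /* Toy model of the median-filtering experiment: a fraction f of
     * the work runs on n disk CPUs, the rest on the host CPU, and
     * both sides run concurrently.  The speeds are stand-ins, not
     * measurements, and network transfer time is ignored. */
    #include <stdio.h>

    #define HOST_SPEED 233.0   /* host work units per second (stand-in) */
    #define DISK_SPEED 166.0   /* per-disk work units per second        */

    /* Completion time for one unit of total work: both sides compute
     * in parallel, so the slower side determines the finish time. */
    static double completion_time(double f, int n_disks)
    {
        double host_time = (1.0 - f) / HOST_SPEED;
        double disk_time = f / (n_disks * DISK_SPEED);
        return host_time > disk_time ? host_time : disk_time;
    }

    int main(void)
    {
        for (int n = 1; n <= 5; n++) {
            /* Throughput peaks where host and disks finish together:
             * (1-f)/HOST = f/(n*DISK)  =>  f* = n*DISK/(HOST+n*DISK) */
            double f_star = n * DISK_SPEED / (HOST_SPEED + n * DISK_SPEED);
            printf("disks=%d  best fraction on disk=%.2f  throughput=%.0f\n",
                   n, f_star, 1.0 / completion_time(f_star, n));
        }
        return 0;
    }

For f below f* the host is the slower side, so throughput rises as work moves to the disks; above f* the disks are the slower side, so it falls. Under this toy model the peak throughput with n disks is simply HOST_SPEED + n*DISK_SPEED, so even one disk would beat the host-alone case, though the real curves would be dampened by the transfer costs the model ignores.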
-----------------------------

The authors also performed some experiments in which the host used the smart disk both as an active disk and as a regular disk, with threads making "active" requests running in parallel with threads making "normal" requests. They call such workloads "mixed workloads". From their experiments on mixed workloads they concluded that more work needs to be done on scheduling, because the way the "active" and "normal" threads are scheduled affects their relative performance.

--------------------------------------------------------------------------------

Conclusion (this is my conclusion, not the conclusion of the paper):

Since the CPU on the disk is likely to be slower than the one on the host, we should focus on cases where a) the interconnect is the bottleneck and we can reduce the amount of data sent, or b) the host can do something else while the disk is doing the computation.

-- Sasha