Disco: Running Commodity Operating Systems on Scalable Multiprocessors
E. Bugnion, S. Devine, and M. Rosenblum (CSL, Stanford)

In this paper Bugnion et al. describe Disco, a system designed to make it easier to extend "commodity" operating systems to scalable multiprocessor machines. Disco allows systems programmers to reuse existing operating systems by inserting a virtual machine abstraction between them and the hardware. As such, it is nothing new -- a great deal of work was done on virtualization in the 1970s (e.g. IBM's VM/370 and early hosted versions of Cray's Unicos). The new twist the authors bring to the subject is that their system is designed to handle scalability, fault isolation, and non-uniform memory access times without a complete rewrite of a traditional operating system for the new hardware. By minimizing the work required to get an existing system running on new hardware, the reliability of the resulting system is increased and the time taken to produce a working system is reduced.

The virtual machine approach has several benefits, including the ability to share memory across VM boundaries with relatively small changes to existing system software, and the ability to run multiple operating systems on the same physical machine. The latter is particularly useful for migrating to a new system and for supporting special-purpose operating systems for tasks such as scientific computation. Virtualization is not a panacea, however: its costs include (1) the overhead of virtualizing hardware resources (CPUs, disks, etc.), (2) resource management, and (3) communication among processors.

Disco attacks the problem by running multiple independent VMs simultaneously on the same hardware. It virtualizes the kernel address space, uses dynamic page migration and replication to hide the non-uniformity of memory access times, and virtualizes I/O devices, providing a special abstraction for SCSI and network device interfaces. To achieve reasonable performance, Disco uses direct execution for most operations. The difficult and expensive part, however, is detecting and emulating the services that cannot safely be exported in raw form. For instance, to virtualize memory, Disco maintains a set of physical-to-machine address mappings (a pmap per virtual machine) and performs the necessary translations by entering the remapped entries into the MIPS processor's software-managed TLB. The trouble with this approach is two-fold: (1) some kernel segments on the MIPS are traditionally direct-mapped and bypass the TLB entirely, and (2) TLB misses become both more frequent and more expensive. The first problem is addressed by modifying and relinking the client operating system so that its kernel runs in mapped space; the second by maintaining a second-level software cache of TLB entries.
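To make that translation path concrete, here is a minimal sketch in C of how a monitor in the style of Disco might service a guest TLB miss. This is our illustration, not code from the paper: it collapses Disco's actual protocol (forward the miss to the guest's own refill handler, then intercept the guest's privileged TLB write and remap its physical address to a machine address) into a single routine, and every name, along with the direct-mapped organization of the second-level TLB, is an assumption made for the example.

#include <stdint.h>

#define L2TLB_SIZE 4096               /* entries in the software TLB cache */

typedef struct {
    int      vm_id;                   /* which VM this entry belongs to */
    uint64_t guest_vpn;               /* guest virtual page number */
    uint64_t machine_pfn;             /* machine page frame number */
    int      valid;
} l2tlb_entry_t;

static l2tlb_entry_t l2tlb[L2TLB_SIZE];

/* pmap: per-VM table mapping guest-physical pages to machine pages. */
extern uint64_t pmap_lookup(int vm_id, uint64_t guest_pfn);

/* Emulate the guest's own translation to get a guest-physical page. */
extern uint64_t guest_translate(int vm_id, uint64_t guest_vpn);

/* Write an entry into the hardware TLB (software-managed on MIPS). */
extern void tlb_write_random(uint64_t vpn, uint64_t mpfn);

void handle_tlb_miss(int vm_id, uint64_t guest_vpn)
{
    l2tlb_entry_t *e = &l2tlb[guest_vpn % L2TLB_SIZE];

    /* Fast path: the second-level software TLB already holds the
     * virtual-to-machine mapping, so no emulation is needed. */
    if (e->valid && e->vm_id == vm_id && e->guest_vpn == guest_vpn) {
        tlb_write_random(guest_vpn, e->machine_pfn);
        return;
    }

    /* Slow path: obtain the guest's physical address, then remap it
     * to a machine address through the pmap before inserting it. */
    uint64_t guest_pfn   = guest_translate(vm_id, guest_vpn);
    uint64_t machine_pfn = pmap_lookup(vm_id, guest_pfn);

    e->vm_id       = vm_id;
    e->guest_vpn   = guest_vpn;
    e->machine_pfn = machine_pfn;
    e->valid       = 1;

    tlb_write_random(guest_vpn, machine_pfn);
}

The point of the second-level structure is visible in the fast path: a re-miss on a recently used page costs one array probe instead of a full emulation of the guest's refill sequence.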
The authors also briefly describe their NUMA memory-management scheme, which attempts to hide the unusual aspects of the architecture from clients running in a Disco VM. In addition to memory and CPU virtualization, Disco provides virtual DMA, network devices, and disks.

In the last two sections of the paper, the authors present their experimental results and related work. The results were produced by running the system on the SimOS machine simulator rather than on real hardware, and the overhead of simulation forced the experiments to be smaller in duration and scope than would otherwise have been possible. That said, Bugnion et al. provide reasonably good performance numbers, and on the basis of these results they conclude that the overhead due to virtualization is acceptable for many applications (the range they report is 3 to 16%, depending upon the application).


Disco: Running Commodity Operating Systems on Scalable Multiprocessors (Stanford, 1997)
Jonathan Ledlie
cs736 Operating Systems
February 4, 2000

After one member of our reading group (Brian Forney) explained what ccNUMA is and how it differs from SMP/UMA, the ideas in this paper made much more sense and seemed like a practical solution to a difficult problem. The problem is that the hardware jocks keep coming out with new hardware that they believe is better, but their empirical tests are limited by the fact that no operating system (and hence no benchmarking software) runs on the brand-new machines; they must often wait years for an OS to take advantage of their hardware. This Stanford group's pragmatic solution to the dilemma is to coat the new hardware with a thin veneer of an OS, which simulates older, familiar hardware to the layers above. They call this base coat Disco.

Disco then allows several operating systems to run on top of it simultaneously, each using the hardware in the way it knows how. If some OS knows how to deal with the ccNUMA hardware, it is given access to it; otherwise Disco presents the more traditional hardware view that the OS knows how to handle. Particular examples of this are locality of reference (when one CPU repeatedly asks for a page that is far away, the page is replicated locally; a sketch of such a policy follows this review) and the fact that memory references do not all take the same amount of time, varying with the number of hops to where the data actually lives (the ccNUMA hardware presents a single flat address space, so nothing warns the OS about the difference). Using virtual machines is not a new idea, but one key, unstated point of the paper is that ideas we dropped in the 1970s, like virtual machines, may have new applications today.

One difficulty we had with this paper's concept is that even though OSs may be moving to a HAL (hardware abstraction layer) to let them port and add new devices more easily, running on top of Disco still requires some tinkering in the HAL: "powerful software companies" must still be convinced "that running on their hardware is worth the effort of the port." We also found the idea of starting up a commodity OS just to run some piece of software and then shutting that OS down troubling, because afterwards either the remaining OSs would not be using all the resources of the machine, or Disco would have to hand them the newly released resources on the fly (try plugging more RAM into your computer while it is running), and most OSs are built to account for their resources only at boot time.
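To close, the page-placement policy both reviews touch on is easy to sketch. The C fragment below is our own illustration of the general idea, not the algorithm from the paper; the thresholds and the helpers migrate_page and replicate_page are hypothetical. The shape of the policy, though, follows the paper's description: a hot page used mostly by one remote node is migrated there, a hot read-shared page is replicated so each reader gets a local copy, and write-shared pages are left where they are.

#include <stdint.h>

#define NODES         8               /* nodes in the ccNUMA machine */
#define HOT_THRESHOLD 256             /* misses before we act on a page */

typedef struct {
    uint32_t miss_count[NODES];       /* per-node cache-miss counters */
    int      home_node;               /* node currently holding the page */
    int      write_shared;            /* written from more than one node? */
} page_stats_t;

extern void migrate_page(uint64_t mpfn, int node);   /* move the page   */
extern void replicate_page(uint64_t mpfn, int node); /* read-only copy  */

void rebalance_page(uint64_t mpfn, page_stats_t *ps)
{
    uint64_t total = 0;
    int hottest = 0;

    for (int n = 0; n < NODES; n++) {
        total += ps->miss_count[n];
        if (ps->miss_count[n] > ps->miss_count[hottest])
            hottest = n;
    }

    if (total < HOT_THRESHOLD || hottest == ps->home_node)
        return;                        /* cold, or already local */
    if (ps->write_shared)
        return;                        /* keeping copies coherent would
                                          cost more than the remote misses */

    if (ps->miss_count[hottest] > (3 * total) / 4)
        migrate_page(mpfn, hottest);   /* one dominant remote consumer */
    else
        replicate_page(mpfn, hottest); /* read-shared: give the hottest
                                          node its own local copy */
}

Because the guests never see machine addresses, Disco can apply a policy like this underneath an unmodified (or nearly unmodified) OS, which is exactly the flat-memory-image point made above.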