***********************************************************************
Cluster-based scalable network services
***********************************************************************
This paper describes architecture for a cluster of workstations and
describes the implementation of this architecture for two
services. The main ideas of the architecture presented are layered
design, centralized load balancing and distributed request
handling. Layered design: In this design, parsing of requests is
separated from load distribution, data transformation services, and
data retrieval services. This allows specialization of nodes for a
particular type of operation and makes implementation of each layer
simple. Specialization also helps to improve performance of
instruction and data caches, thereby increasing the
throughput. Centralized load balancing: The load balancing is
performed by a centralized manager that has global information about
the state of the cluster. The experimental results showed that one
manager can support a cluster of reasonable size. Distributed request
handling: The incoming requests are handled by a number of front-ends
that are equipped with manager stubs. Manager stubs contact the
centralized manager, who makes load distribution decisions based on
the load information that it has and based on use profile data that
can be looked up in a database. The manager chooses a particular
worker to handle the request. There is also a number of nodes in the
cluster that serve as caches. I found this paper difficult to evaluate
for the following reasons: 1. It described a cluster-based solution
for one specific service - TranSend.  Their solution is also used by
HotBot, but in a very limited way - it does not use some of the main
features, such as automatic load distribution (the database is
partitioned statically). 2. The description of the system was very
high-level. The authors talk about automatically choosing workers for
requests, but they do not go into enough detail of how particularly
the workers are chosen. 3. The authors do not do any experimental
comparison of their work with any other cluster-based solution.  This
is in part understandable, because it would be very difficult to port
TranSend onto another cluster architecture. But they could at least
run some synthetic benchmarks to present rough comparison
numbers. There are a few ideas in the paper that I found
interesting. 1. This system has a robust way of handling
fault-tolerance: worker units, distillers and managers cooperate to
detect failures and restart each other in the event of a
failure. 2. The idea of using idle user workstations as a pool of
overflow machines is interesting - these machines are used only when
the load is critically high, so the users are not disturbed all the
time. But this overflow pool can serve as a nice cushion, which can
make transition to a more powerful cluster very smooth. Even though
the description of the system together with the presented results did
not make me confident that this would actually work, the fact that it
is actually used by real companies was convincing.  Also, I think that
there is some value in sharing experience with a complex system like
this one. However, this is not a research paper as I see it, because
it does not provide a well-studied solution to a more-or-less general
problem, but describes a specific implementation of a particular
system.


Cluster-Based Scalable Network Services (Berkely, 1997)
Jonathan Ledlie CS 736 March 29, 2000

Clusters are to workstations as RAID is to JBOD.  Clusters and RAID
offer more power and scalability at less cost than their more
hardware-advanced counterparts, namely SMP and expensive, large
disks. Both, however, are hard to manage effectively.  This paper
offers one solution to managing and building for clusters.  The
authors' solution provides the user (a programmer) with a series of
reusable building blocks to harness the advantages of clusters and
ameliorate the difficulties in shared state, load balancing, and
monitoring.  They concentrate on the read-dominated, user-oriented
side of the Internet, which allows them to reduce the concurrency
problem inherient in distributed design; if a worker temporarily uses
stale data, it is ok in their semantics.  They call this reduction
BASE (basically available, soft state, eventual consistency).  In
addition to this simplification, they provide a nice, hierarchical
API, which allows users to use parts at the top (Service) level for
most projects, but then delve into the TACC layer if they need more
customization. Finally, they concentrated on the burstiness of most
Internet activity, by allowing an administrator to designate overflow
machines to temporarily handle high workloads.  The lowest layer, SNS
(Scalable Network Service), spreads on to these extra machines when
needed.  I find three major flaws in their work.  First, there is no
mention of security, particularly in the overflow situations which
appear to run on other people's workstations.  Second "the front end
spends more than 70% of its time in the kernel" -- not a good
sign. Third, and most significantly, more and more Internet operations
iarei requiring scalable ACID semantics, B2B virtual markets for
example, where having stale data does not work.  We would need a new
set of building blocks for this growing number of situations. NOTES
FROM CS261 What do you need in an Internet cluster? 1) incremental
scalability 2) fault-tolerant (24x7) 3) cheap Terminology: - rathole -
embarrisingly parallel - forklift upgrade