Monday, July 21, 2003
www.technologyreview.com
When Rebooting is Not an Option
Q&A: MIT computer scientist Larry Rudolph
explains why rebooting will not solve our digital problems in the future—and
how we can avoid a nightmare scenario in a world of ubiquitous technology.
July 16, 2003
Larry Rudolph is sitting in the back seat of a cab on his way to Carnegie Mellon University in Pittsburgh when the driver suddenly stops the vehicle and says: “Excuse me, I need to reboot my taxi.” The driver shuts off the car, counts to ten, and turns it back on. The digital speedometer, which had been reading zero despite the car’s movement, is working again. Rudolph says it was the first time he had seen someone reboot a car. It was strange—but it worked.

Rebooting has become routine for many computer users. They know that when their PCs crash or a piece of software doesn’t work, the technical support person’s first question is usually, “Have you tried rebooting your system?” Indeed, restarting a machine—be it a car or a computer—can often fix the problem. But according to Rudolph, a principal researcher at MIT’s Computer Science and Artificial Intelligence Laboratory, we are moving toward a day when all sorts of electronic equipment—computers, PDAs, cell phones, TV sets, MP3 players, perhaps even the kitchen microwave—will be exchanging data and trying to make our lives easier and more comfortable. And in this emerging technoscape of pervasive computing, the “turn it off, turn it on” solution will be of no help.
In this world of autonomous interacting machines, Rudolph says, it is easy to envision a nightmare scenario in which a small failure starts a chain reaction, leaving a whole menagerie of devices ringing, beeping, and generally misbehaving. With so many elements talking to each other—and no central server to control them—it will be nearly hopeless to find the malfunctioning one when a major failure breaks out. Rudolph says humans are relatively good at tracking down errors; the problem is that today’s equipment gives us no clue as to where the problem might be. That situation would change completely if electronic devices began to carry systems capable of detecting anomalous behavior—and reporting that information to the user.
Rudolph is leading a team of researchers to create such failure-detecting systems and test them in a pervasive computing environment. The work is part of Project Oxygen, a five-year, multimillion-dollar partnership between the lab and six major corporations. Since the components of Oxygen are being built from scratch, now is the time to think about methods to overcome systemic failures, Rudolph says. He spoke with Technology Review editorial intern Erico Guizzo.
TR: Why is rebooting becoming less practical?
RUDOLPH: The classic answer when something goes wrong is
to reset or reboot and start it again. This solution has worked so far. But
very soon it’s not going to work for, say, devices in our houses. It won’t
be clear what to reboot. The vision of Oxygen is that there’s no PC at the center, no central server to which everything is connected.
In a house, the main computer may be in the study, while all the other devices
may be in the living room, the dining room, and the kitchen. Everything is
spread around and soon they will start talking to each other. There will be
times when you touch something in one room and it adversely affects something
in another room.
TR: And then the on-off switch is of no help?
RUDOLPH: That’s right. Imagine that a battery runs low,
causing some unusual behavior, which, through a series of unexpected events, causes some other part of the system to fail—the telephone won’t stop ringing,
for example. But since the system is no longer in one place, how do we find
the root cause? Or, if we don’t care about the cause, how does one stop the
telephone from ringing? We can reboot the telephone, but if that doesn’t solve
the problem, then what? Also, future robust computer systems will likely be
“fault tolerant” so that if one computer fails, the computation will automatically
continue on some other computer. In that case, shutting down the main computer
may not stop the phone from ringing. Should you reboot the living room? Or
maybe the house? And if that doesn’t work, should you try rebooting the whole
neighborhood?
TR: What is the solution?
RUDOLPH: Before pervasive computing, I had been working
on parallel processing, where it is well known that debugging is a nightmare.
In a parallel computing system you have to handle ten, twenty, a hundred,
a thousand, ten thousand processors. But those processors are all the same.
In pervasive computing, on the other hand, we’re talking about lots and lots
of pieces that are all different—different technologies, different
generations, different software. How do I debug that? IBM has been pushing
something they call autonomic computing—techniques to do self-reflection,
self-healing, and other things with beautiful names. The computer automatically
finds what’s wrong and fixes it. I think we’re really far from that. But the
one thing we do know is that when something was working and suddenly stops working, it’s because something has changed. So our systems should at least give the human a chance to find the problem. It is easier to tell what
has recently changed than to decide if that change is right or wrong.
TR: Can you make a machine send the user a message saying it’s going to fail?
RUDOLPH: Things change all the time and mostly it is normal
behavior. We are trying to develop systems that can figure out typical patterns
of behavior for individual system components and communication links. These
systems can learn the common patterns in communication connections, and the
typical patterns of input and output values of certain processes.
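In code, such a monitor might look something like the following minimal sketch (a hypothetical illustration, not Oxygen’s actual software; the class and link names are invented). It learns a running baseline of message rates on each communication link and flags a rate that strays far from the learned pattern.

    from collections import defaultdict
    import math

    class LinkMonitor:
        """Learns the typical message rate on each link and reports anomalies."""

        def __init__(self, threshold=3.0):
            self.threshold = threshold  # how many standard deviations count as unusual
            self.stats = defaultdict(lambda: {"n": 0, "mean": 0.0, "m2": 0.0})

        def observe(self, link, rate):
            """Update the running mean and variance (Welford's method) for one link."""
            s = self.stats[link]
            s["n"] += 1
            delta = rate - s["mean"]
            s["mean"] += delta / s["n"]
            s["m2"] += delta * (rate - s["mean"])

        def is_anomalous(self, link, rate):
            """Return True if the observed rate is far from the learned pattern."""
            s = self.stats[link]
            if s["n"] < 10:  # not enough history to judge yet
                return False
            std = math.sqrt(s["m2"] / (s["n"] - 1))
            return std > 0 and abs(rate - s["mean"]) / std > self.threshold

    monitor = LinkMonitor()
    for rate in [48, 52, 50, 49, 51, 50, 47, 53, 50, 49]:  # normal traffic, messages per second
        monitor.observe("phone->tv", rate)

    print(monitor.is_anomalous("phone->tv", 51))  # False: within the usual range
    print(monitor.is_anomalous("phone->tv", 5))   # True: something has changed

Note that the sketch never decides whether a change is right or wrong; it only surfaces the change so that a person can investigate, which is the division of labor Rudolph describes.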
TR: Can you give an example of this?
RUDOLPH: Suppose my computer music system starts having
annoying pauses. It might be due to a network congestion problem because I
started up a Web browser. Or, if I’m listening to a CD, it might be due to
a scratch on the disc. In the first case, the system will notice a change
in the communication rates, whereas in the second case it might notice a change
in the values of the audio stream itself. So in the first case—starting up
a browser—the system may recognize that this is typical behavior and the user
should just wait for the connection to get better. But if the cause is a
scratch, then the user should be told to examine the disc and the CD player.
The user has a hope—sometimes a very slim hope—of knowing where to look for the problem.
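That diagnostic step could be as simple as a rule over which signal changed. The sketch below is purely illustrative; the function and its messages are invented for this article, not taken from Oxygen.

    def suggest_cause(network_rate_changed, audio_values_changed):
        """Map the observed change to a hint the user can act on."""
        if network_rate_changed and not audio_values_changed:
            return "Network congestion (a browser may have just started); wait for the connection to recover."
        if audio_values_changed and not network_rate_changed:
            return "The audio stream itself looks wrong; examine the disc and the CD player."
        if network_rate_changed and audio_values_changed:
            return "Several things changed at once; check the network first, then the disc."
        return "No recent change detected; the pause may be normal behavior."

    print(suggest_cause(network_rate_changed=True, audio_values_changed=False))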
TR: Does the fact that devices are going wireless make things more difficult?
RUDOLPH: Yes. Imagine a television set that can answer my telephone. I’m watching TV and the telephone rings. I answer the call using the TV, which activates my TiVo digital video recorder. Suppose that suddenly the TV starts ringing nonstop, and I want to disconnect the telephone from the TV. How do I do that? If I’m lucky, there’s a cord running from the telephone outlet in the wall into the TV, and I can simply unplug it. Very soon, though, we’re not going to have wires anymore. The
communication will all be wireless—802.11, Bluetooth, whatever. I might have
to stand in front of an annoying, ringing TV fumbling with buttons trying
to disconnect the telephone.
TR: Why didn’t engineers think about failure-detecting systems before?
RUDOLPH: Before the Internet, people built systems that
were very well engineered—the telephone network, for example. AT&T understood
its behavior—and owned the whole system. Then things like the Internet came
around. Now no one owns the whole thing—it’s too big, it’s too distributed.
We are no longer able to engineer the whole world. We can’t rebuild the Internet.
What’s great about the Oxygen experience is that we’re building new systems,
so we can try to do something right from the start, without the pressure of having to meet release dates. Universities have time to do something right.
TR: Is that why MIT is doing this kind of development work, instead of the companies that will sell the products?
RUDOLPH: That’s right. Academia has an important role here
in that we are helping to figure out how to build systems defensively. Nokia
is a partner of Oxygen, and Nokia cares a lot about security and privacy. But how much is Nokia willing to spend on security and privacy research when it knows that teenage girls dominate the cell phone market and are not worried about privacy and security? They care about color and style and games and other features. So if Nokia spends a lot of research money on security and privacy while some other company spends its research money on the finicky tastes of teenagers, Nokia is going to lose market share. On
the other hand, if MIT, Stanford, Boston University, or any university can
develop a solid system with security and privacy and make it public, it would
be much easier for Nokia to incorporate that technology.
TR: What have you done so far?
RUDOLPH: We’re just starting. We talked about the example of a telephone talking to a TV. But how can grandma use this system? When
there’s no wire, how does grandma know that the phone is talking to the computer?
And how does she stop it? Does she find the IP address of the telephone and
delete it? No, she’s not going to do that. One possible solution: you hold
up a handheld device with a camera and have it view the room. Whenever it sees devices, it can figure out what they are, so it knows, for instance, that it is pointing at a TV or a telephone. Then it consults a database and concludes,
“that telephone is talking to that TV.” So now we can give feedback to grandma,
probably visually. You can use the image of the room and overlay a blue line
connecting the telephone and the TV. Then, by touching the screen, you can choose to break that connection.
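The registry lookup behind that picture might look something like the sketch below. It is an illustration only; the class, device names, and methods are invented for this article, not part of Oxygen.

    class ConnectionRegistry:
        """Tracks which devices are talking to each other, so a handheld can show and break links."""

        def __init__(self):
            self.connections = set()  # each entry is an unordered pair of device names

        def connect(self, a, b):
            self.connections.add(frozenset((a, b)))

        def disconnect(self, a, b):
            self.connections.discard(frozenset((a, b)))

        def links_among(self, visible_devices):
            """Return the connections whose endpoints are both in view of the camera."""
            visible = set(visible_devices)
            return [tuple(pair) for pair in self.connections if pair <= visible]

    registry = ConnectionRegistry()
    registry.connect("telephone", "tv")
    registry.connect("tv", "tivo")

    # The camera recognizes a TV and a telephone in the room...
    in_view = ["tv", "telephone"]
    for a, b in registry.links_among(in_view):
        print(f"overlay a line between the {a} and the {b}")  # "that telephone is talking to that TV"

    # ...and touching the overlaid line breaks the link.
    registry.disconnect("telephone", "tv")
    print(registry.links_among(in_view))  # []: the connection is gone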
TR: How do you plan to simulate failures in the systems you are developing?
RUDOLPH: We don’t have to. They just happen!