In our zeal to create technologies for developing high performance knowledge bases, we should not overlook the fact that the raw material for the largest, broadest-based, fastest-growing knowledge base in the world is, and has been, under construction for decades: the vast collection of information in digitized form, ranging from text, to flat-file databases, to sophisticated relational databases, to pictures, audio, video, and various other modes of multimedia information.
The good news is that this stock of information has been growing
exponentially, particularly since the advent of the Web. The bad news is that
we can make next to no use of it. Why not? Several problems stand in our way.
So what can we do? We can do what any good researcher does when faced with an imposingly large, somewhat unstructured collection of information: find someone who can be smart about where to look. When entering the Library of Congress, which would you rather have available: Alta Vista or a good reference librarian who's been there a while?
What then stands in the way of creating a smart reference librarian for the Web? Many of the same problems listed above. But we have not come full circle, for we can change the definition of our task in a small but crucial way: the reference librarian's task is not to understand the details of the material he locates for us, but only to understand it well enough to know that it is relevant.
Hence we deal with the breadth of coverage of the information on the Web by attempting to capture not the details of what it contains, but more general knowledge about that knowledge: when it is relevant, to whom, and for what. We wish to attach such descriptive information to everything available on the Web.
Size and speed of growth would seem to make this impossible. But the key is to use the same distribution of effort: get everyone involved. We do this, in turn, by making the creation of annotations less work than it is worth, which we accomplish by letting contributors write those annotations in a knowledge representation language everyone already knows: natural language.
But putting thousands of people to work building up knowledge about knowledge will inevitably produce conflicting knowledge. A second part of our proposal therefore deals with automated technology not only for resolving these conflicts, but also for using them as added impetus for knowledge acquisition.
By allowing thousands of people to offer small amounts of knowledge about what information is relevant to a particular need, we believe we can create a very high performance knowledge base of an interesting form. The Web will still be built, at its base, of ``opaque'' knowledge: text, maps, charts, audio, video, etc.; but attached to each of these will be annotations that allow intelligent and appropriate access to the information. The high performance knowledge base will consist in knowing what to find and where to find it. And because performing that task will retrieve relevant information that the human can further interpret and understand, the effect will be to turn the Web into a form of high performance knowledge base.
These conclusions are inspired, in part, by our success with the START system, which uses English sentences and phrases as information-describing annotations, rendering the annotated information easily and universally accessible.
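To make the annotation idea concrete, here is a minimal sketch, in Python, of retrieval mediated by English annotations attached to otherwise opaque resources. It is not the START implementation; the class names, the crude word-overlap matcher, and the example resources are hypothetical stand-ins for START's actual linguistic machinery.

\begin{verbatim}
# A minimal sketch (not the actual START implementation) of retrieval
# mediated by English annotations attached to otherwise opaque resources.
# All names, and the crude word-overlap matcher, are hypothetical.
from dataclasses import dataclass, field

@dataclass
class AnnotatedSegment:
    """An opaque piece of information (text, image, video, ...) plus
    English annotations describing what it is relevant to."""
    resource: str                       # e.g. a URL
    annotations: list = field(default_factory=list)

def relevance(question, annotation):
    """Word overlap standing in for real linguistic matching."""
    q, a = set(question.lower().split()), set(annotation.lower().split())
    return len(q & a) / max(len(a), 1)

def retrieve(question, segments, threshold=0.3):
    """Return resources whose annotations best match the question."""
    return [s.resource for s in segments
            if any(relevance(question, ann) >= threshold
                   for ann in s.annotations)]

if __name__ == "__main__":
    kb = [AnnotatedSegment("http://example.org/etna.mpg",
                           ["this video shows Mount Etna erupting"]),
          AnnotatedSegment("http://example.org/sicily-map.gif",
                           ["this map shows the roads of Sicily"])]
    print(retrieve("show me a video of a volcano erupting", kb))
\end{verbatim}

The point of the sketch is only this: the matcher need not understand the video itself; it need only understand the annotation well enough to judge relevance.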
During the past three years, thousands of users have turned to START to access multimedia knowledge via the Web using natural language questions. We ourselves use START routinely as a medium for entering knowledge.
In this program, we will build on START's success by developing systems that collect the sort of information START needs, both from armies of Web users and from existing databases.
These mechanisms will enable us to build HAWK (for Help Accumulate the World's Knowledge), which will be capable of interacting with thousands of users anywhere in the world so as to accumulate the knowledge required to access text, rules, procedures, know-how, images, maps, videos, briefing charts, or any other kind of knowledge.
To support our work, we intend to make use of research both previously done and ongoing. In particular, we plan to use the following:
The Intelligent Room figures into our plans because we mean to exploit the higher-bandwidth interactions between people and computers that are enabled by progress in HCI. We anticipate exciting, synergistic research interaction.
Next, we expect Davis's Conflict Resolution ideas to be useful because, with many people contributing knowledge to the same knowledge base, gratuitous differences are bound to arise from differing choices of vocabulary, syntax, and inheritance-tree arrangement. CARTER is the natural tool for dealing with such differences.
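The following toy illustration, in Python rather than CARTER's own machinery, shows the flavor of such a gratuitous conflict and of its resolution by mapping contributors' terms onto a shared vocabulary; the contributors, terms, and assertions are invented for the example.

\begin{verbatim}
# A toy illustration (not CARTER) of a "gratuitous" conflict arising
# purely from vocabulary choice, and its resolution via a synonym table.
# The contributors, terms, and assertions below are all invented.

contributor_a = {"kind": "automobile", "colour": "red"}
contributor_b = {"type": "car",        "color": "red"}

synonyms = {"automobile": "car", "colour": "color", "kind": "type"}

def normalize(assertion):
    """Rewrite keys and values into the shared vocabulary."""
    return {synonyms.get(k, k): synonyms.get(v, v)
            for k, v in assertion.items()}

def genuine_conflicts(a, b):
    """Attributes on which the normalized assertions disagree;
    an empty result means the conflict was purely terminological."""
    na, nb = normalize(a), normalize(b)
    return {k: (na[k], nb[k])
            for k in na.keys() & nb.keys() if na[k] != nb[k]}

print(genuine_conflicts(contributor_a, contributor_b))   # {}
\end{verbatim}

A genuine disagreement, by contrast, would survive normalization and could then be fed back to contributors as added impetus for knowledge acquisition.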
Finally, the Protocol of Inference technology figures into our plans because we foresee the need for a kind of switchyard, lying behind START, that directs particular questions to appropriate reasoners lying further beyond. In particular, JOSHUA uses object-oriented programming tools to let new and old reasoners work together: integrating a reasoner is simply a matter of writing methods that trap particular questions and shunt them off to that reasoner. Thus, integration means writing new methods, not altering existing code.
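As a rough sketch of the switchyard idea, the following Python fragment (written in Python rather than JOSHUA's own substrate; the registry, trap decorator, and toy reasoners are hypothetical analogues) shows how a new reasoner can be integrated by adding a trap rather than by editing the dispatcher.

\begin{verbatim}
# A sketch of the "switchyard" idea, in Python rather than JOSHUA's own
# substrate; the registry, trap decorator, and toy reasoners are all
# hypothetical analogues of the mechanism described above.

_traps = []    # (predicate-on-questions, reasoner) pairs

def trap(matches):
    """Register a reasoner for the questions satisfying `matches`."""
    def register(reasoner):
        _traps.append((matches, reasoner))
        return reasoner
    return register

def dispatch(question):
    """Shunt the question to the first reasoner whose trap fires."""
    for matches, reasoner in _traps:
        if matches(question):
            return reasoner(question)
    raise LookupError("no reasoner handles %r" % (question,))

# Integrating an arithmetic reasoner is just a new trap ...
@trap(lambda q: q["predicate"] == "sum")
def arithmetic_reasoner(q):
    return sum(q["args"])

# ... and so is a (stubbed) spatial reasoner, with dispatch() untouched.
@trap(lambda q: q["predicate"] == "distance")
def spatial_reasoner(q):
    return "route from %s to %s" % q["args"]

print(dispatch({"predicate": "sum", "args": (2, 3, 4)}))            # 9
print(dispatch({"predicate": "distance", "args": ("Boston", "NYC")}))
\end{verbatim}

Adding a third reasoner would mean writing one more trap, leaving both the dispatcher and the existing reasoners untouched.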