
Natural Language Annotations

The discussion so far has centered on the analysis of single natural language sentences. We believe that given a sophisticated grammar, a large lexicon enhanced by advances in Lexical Semantics (such as class-membership information) and an inference engine (such as S-rules), it is possible to build a natural language system with satisfactory sentence-level performance. At the same time, however, it is becoming increasingly clear that a robust full-text natural language question-answering system cannot realistically be expected any time soon. Numerous problems, such as intersentential reference and paraphrasing, summarization, common sense implication, and many more, will take a long time to solve to everybody's satisfaction. In the meantime, we need a mechanism that will let us bridge the gap between our ability to analyze natural language sentences and our appetite for processing huge amounts of natural language text.

The START system attempts to bridge this gap by employing natural language annotations (Katz and Winston [1994]). Annotations are computer-analyzable collections of natural language sentences and phrases that describe the contents of various information segments. START analyzes these annotations in the same fashion as any other sentences, but in addition to creating the required representational structures, the system also produces special pointers from these representational structures to the information segments summarized by the annotations.

Suppose, for example, that a user wishes to retrieve the following text fragment related to the discovery of Neptune:

Neptune was discovered using mathematics. Before 1845, Uranus was widely believed to be the most distant planet. However, astronomers observed that Uranus was not always in the position predicted for it. The astronomers concluded that the gravitational attraction of a more distant planet was disturbing the orbit of Uranus.
In 1845, John Adams, an English astronomer, calculated the location of this more distant planet. Urbain Leverrier, a French mathematician, independently did similar calculations. In 1846, John G. Galle and Heinrich d'Arrest of the Urania Observatory in Berlin, looked for the planet where Leverrier and Adams predicted it would be located. They saw the planet, which was later named Neptune, on September 23, 1846. Galileo, the noted Italian astronomer, may have been the first to see Neptune in 1613. However, Galileo did not recognize what he saw as a planet.

Let us assume that sentence (24) below serves as one of the annotations to this text fragment:

(24) John Adams discovered Neptune using mathematics.

This means that START has analyzed sentence (24) and incorporated it into the knowledge base along with a pointer to the text fragment. Now suppose the user asks one of the following questions:
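The bookkeeping involved in storing an annotation can be pictured with a minimal Python sketch. It is not START's implementation: the names TExpression, KnowledgeBase, annotate and the identifier "neptune-discovery" are hypothetical, and the T-expression for (24) is written by hand (by analogy with the query form shown later, and ignoring the "using mathematics" adjunct) rather than produced by a parser.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TExpression:
    """Hypothetical <subject relation object> triple."""
    subject: str
    relation: str
    obj: str


@dataclass
class KnowledgeBase:
    """Hypothetical store mapping each T-expression to segment pointers."""
    pointers: dict = field(default_factory=dict)

    def annotate(self, texpr, segment_id):
        # Record that this T-expression summarizes the given segment.
        self.pointers.setdefault(texpr, []).append(segment_id)


# The information segment itself lives outside the knowledge base;
# only its identifier (the "pointer") is stored with the T-expression.
segments = {
    "neptune-discovery": "Neptune was discovered using mathematics.",  # first sentence only
}

kb = KnowledgeBase()
# Hand-built stand-in for the analysis of annotation (24).
kb.annotate(TExpression("Adams", "discover", "Neptune"), "neptune-discovery")
```

The point of the sketch is only that the annotation is indexed like any other sentence, while the pointer ties it to text that the system never has to analyze.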

(25) Who discovered Neptune?

Did Adams discover Neptune?

How was Neptune discovered?

Was Neptune discovered using mathematics?

Tell me about Neptune's discovery.

START begins answering any of the questions in (25) in the regular fashion described in previous sections. Namely, after undoing the effects of the question and passive transformations, START creates a T-expression such as (26) for querying the knowledge base:

(26) <who discover Neptune>

In the next step, T-expression (26) is matched against the knowledge base. It is important to emphasize that the full power of sentence-level natural language processing is brought to bear on the matching process. START's matcher works both at the word level (using, if appropriate, additional lexical information about synonyms, hyponyms, IS-A trees, etc.) and at the structure level (utilizing necessary S-rules, information on verb-class membership, nominalization, etc.), although in the case of a very simple interaction such as (24) and (25) most of this machinery is not utilized.
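As a rough illustration of word-level matching only, the Python sketch below treats a question word as a wildcard and consults a small synonym table before comparing words. Everything in it is assumed for the example: the tuple encoding of T-expressions, the names WILDCARDS, SYNONYMS, matches and lookup, and the entry treating "find" as a synonym of "discover". Structure-level matching with S-rules is not modeled.

```python
# Question words that may match any constituent (assumed for this sketch).
WILDCARDS = {"who", "what"}

# Hypothetical lexical entry mapping a synonym to a canonical verb.
SYNONYMS = {"find": "discover"}


def normalize(word):
    """Lower-case a word and map it to its canonical synonym, if any."""
    lowered = word.lower()
    return SYNONYMS.get(lowered, lowered)


def matches(query, fact):
    """True if every query slot is a wildcard or equals the fact slot."""
    return all(
        q.lower() in WILDCARDS or normalize(q) == normalize(f)
        for q, f in zip(query, fact)
    )


# Each entry pairs a stored T-expression with a pointer to a segment.
knowledge_base = [
    (("Adams", "discover", "Neptune"), "neptune-discovery"),
]


def lookup(query):
    """Return the pointers of all stored T-expressions matching the query."""
    return [pointer for fact, pointer in knowledge_base if matches(query, fact)]


# Query (26), <who discover Neptune>, finds the annotated segment.
hits = lookup(("who", "discover", "Neptune"))
```

With the assumed synonym entry, lookup(("who", "find", "Neptune")) returns the same pointer, whereas a query about Uranus returns an empty list.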

Since the representational structure returned by the matcher contains a special pointer to the annotated text fragment, START's familiar sentence-level question-answering strategy is modified. Instead of passing the representational structure to the language generation system and asking it to produce an English sentence such as (24), START simply follows the pointer and presents the text fragment (HTML-marked, as appropriate) to the user.

This last presentation step can be thought of as a general procedure to be executed after a successful matching process. As a result, the natural language annotation technique easily generalizes to the indexing and retrieval of all types of information, whether or not it is based on text. Using START, one can access text, images, sound, video, web pages, and more (Katz and Winston [1995]).
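One way to picture presentation as a general procedure is a dispatch on the kind of segment a pointer leads to. The Python sketch below is an assumption-laden illustration, not a description of START: the Segment class, its kind and payload fields, the present function, the HTML produced, and the file name neptune.gif are all hypothetical.

```python
import html
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical information segment reached by following a pointer."""
    kind: str     # e.g. "text", "image", "sound", "video", "web page"
    payload: str  # the text itself, or a location for non-text media


def present(segment):
    """Return an HTML string presenting the segment, whatever its type."""
    if segment.kind == "text":
        return "<p>" + html.escape(segment.payload) + "</p>"
    if segment.kind == "image":
        return '<img src="' + html.escape(segment.payload, quote=True) + '">'
    if segment.kind in ("sound", "video", "web page"):
        location = html.escape(segment.payload, quote=True)
        return '<a href="' + location + '">' + segment.kind + "</a>"
    raise ValueError("unknown segment kind: " + segment.kind)


# The same call serves a text fragment and an image alike.
text_html = present(Segment("text", "Neptune was discovered using mathematics."))
image_html = present(Segment("image", "neptune.gif"))
```

Because the matcher only returns a pointer, supporting a new medium in such a design would mean adding a presentation case; the annotations and the matching process would stay as they are.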


Boris Katz
Thu Feb 27 15:34:49 EST 1997