6 Lecture Index

The lecture index is a Java class library that encapsulates the management of all lecture information. It uses two data stores: Berkeley DB Java Edition for lecture meta-information, and the Apache Lucene text indexer for text. The index also parses segmentation files, implements the Java beans used by the server, and provides the interface for indexing new lectures.

6.1 Meta-Information Index

The package edu.mit.csail.sls.lectures.index holds all the classes related to meta-information, Java beans, querying, and indexing. Users of the index access the text index indirectly through the meta-information index.

6.1.1 Accessing

The main index class is Index, which can be used directly by programs like the bulk indexer. A secondary class, LectureIndex, is better suited to web applications, since it supports a bean-like interface for starting and ending transactions and performing a few other web tasks. The listener class for the Lecture Server creates a LectureIndex.

The Index must be initialized with the directory that holds the index. That directory has two subdirectories: lucene, for the Lucene text index, and file, for the lecture meta-information.
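The directory layout can be pictured with a short sketch. The class IndexLayout and its field names are hypothetical and are not part of the library; only the subdirectory names lucene and file come from the description above.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: resolves the two subdirectories of an index directory.
// It does not open or create anything on disk.
final class IndexLayout {
    final Path root;
    final Path luceneDir; // Lucene text index
    final Path metaDir;   // lecture meta-information

    IndexLayout(String indexDirectory) {
        root = Paths.get(indexDirectory);
        luceneDir = root.resolve("lucene");
        metaDir = root.resolve("file");
    }
}
```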

6.1.2 Berkeley DB

Berkeley DB Java Edition provides data storage at a lower level than that provided by a relational database. A database is a persistent mapping between keys and values, each of which is a variable-length byte vector. The keys are kept in a tree, so the data can be traversed in key order.
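The key/value model can be illustrated with an in-memory stand-in. The sketch below is not the Berkeley DB API: it uses a java.util.TreeMap over byte arrays, assumes keys compare as unsigned bytes, and keeps nothing on disk. The class name OrderedByteStore is invented for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

// In-memory stand-in for a key/value database (assumption: not persistent,
// not the Berkeley DB API).
final class OrderedByteStore {
    // Keys and values are byte arrays; keys are ordered lexicographically
    // as unsigned bytes.
    private final TreeMap<byte[], byte[]> tree =
            new TreeMap<byte[], byte[]>((a, b) -> Arrays.compareUnsigned(a, b));

    void put(String key, String value) {
        tree.put(key.getBytes(StandardCharsets.UTF_8),
                 value.getBytes(StandardCharsets.UTF_8));
    }

    String get(String key) {
        byte[] value = tree.get(key.getBytes(StandardCharsets.UTF_8));
        return value == null ? null : new String(value, StandardCharsets.UTF_8);
    }

    // Walks the tree and returns every key in key order.
    List<String> keysInOrder() {
        List<String> keys = new ArrayList<>();
        for (byte[] key : tree.keySet()) {
            keys.add(new String(key, StandardCharsets.UTF_8));
        }
        return keys;
    }
}
```

Because the tree is ordered, records whose keys share a prefix sit next to each other, which is what makes range scans cheap.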

In older versions of Berkeley DB, including the version available when the Lecture Server first adopted it, applications were responsible for packing and unpacking the keys and values themselves. In the current version, Java class definitions can be annotated to automate the packing and unpacking. Converting existing code requires some retrofitting, but the Eclipse Java development environment eases the process considerably, so some of the old-style packing and unpacking has already been removed. In particular, the class DataAccess manages the categories and courses, which use the new style, while the remaining data uses the old style.
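To give a feel for what old-style packing involves, here is a minimal sketch that converts a record to and from key and value byte arrays by hand. The record SegRecord, its fields, and the byte layout are assumptions made for illustration; the actual Seg fields and the layout used by the Lecture Server are not described here.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical record; field names are illustrative only.
final class SegRecord {
    final int lectureId;
    final int segmentNumber;
    final String title;

    SegRecord(int lectureId, int segmentNumber, String title) {
        this.lectureId = lectureId;
        this.segmentNumber = segmentNumber;
        this.title = title;
    }
}

// Hand-written packing and unpacking, in the spirit of the old style.
final class SegPacking {
    // Key: lectureId then segmentNumber, big-endian (the ByteBuffer default),
    // so byte order matches numeric order for non-negative values.
    static byte[] packKey(SegRecord s) {
        return ByteBuffer.allocate(8)
                .putInt(s.lectureId)
                .putInt(s.segmentNumber)
                .array();
    }

    // Value: a length-prefixed UTF-8 string.
    static byte[] packValue(SegRecord s) {
        byte[] title = s.title.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(4 + title.length)
                .putInt(title.length)
                .put(title)
                .array();
    }

    static SegRecord unpack(byte[] key, byte[] value) {
        ByteBuffer k = ByteBuffer.wrap(key);
        int lectureId = k.getInt();
        int segmentNumber = k.getInt();
        ByteBuffer v = ByteBuffer.wrap(value);
        byte[] title = new byte[v.getInt()];
        v.get(title);
        return new SegRecord(lectureId, segmentNumber,
                new String(title, StandardCharsets.UTF_8));
    }
}
```

Every field added to the record means another change to both the packing and the unpacking code, which is the maintenance cost the annotation-based style removes.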

One of the differences between Berkeley DB and a relational database is that you are somewhat on your own for joins. For example, in a relational database, you would use a join to combine data from categories and lectures. In Berkeley DB, you need to maintain a separate table that pairs the two. Some of this could be simplified by using the newer annotation-based facilities.
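One common way to maintain such a table is to store composite keys in an ordered map and answer the join with a prefix scan. The sketch below shows the idea in memory; the class CategoryLectureTable, the key format, and the integer identifiers are assumptions, not the Lecture Server's actual layout.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

// In-memory stand-in for a hand-maintained join table (hypothetical names).
final class CategoryLectureTable {
    // Each entry is a composite key "categoryId/lectureId". Fixed-width,
    // zero-padded numbers keep string order equal to numeric order for
    // non-negative identifiers.
    private final TreeSet<String> pairs = new TreeSet<>();

    private static String prefix(int categoryId) {
        return String.format(Locale.ROOT, "%08d/", categoryId);
    }

    void link(int categoryId, int lectureId) {
        pairs.add(prefix(categoryId) + String.format(Locale.ROOT, "%08d", lectureId));
    }

    // The "join": scan the keys that start with the category prefix.
    List<Integer> lecturesIn(int categoryId) {
        String prefix = prefix(categoryId);
        List<Integer> lectureIds = new ArrayList<>();
        for (String key : pairs.tailSet(prefix, true)) {
            if (!key.startsWith(prefix)) {
                break; // left the range belonging to this category
            }
            lectureIds.add(Integer.parseInt(key.substring(prefix.length())));
        }
        return lectureIds;
    }
}
```

The application must update this table whenever a lecture is added to or removed from a category; nothing does it automatically.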

The database supports transactions. While a program is making changes to the data, such as adding a lecture, the data can temporarily be in an inconsistent state. For example, a lecture may not yet have all its segments. If another program were to start reading the data at that point, the inconsistent state would cause problems. Transactions keep the updater and reader isolated from each other. The updater and reader each start their own transactions. The reader will not see any changes made to the database since the beginning of its transaction, and the updater will only see its own changes. If something goes wrong with the update, all the updater's changes are thrown away. If all goes well, the updater commits, and any future transactions will see all the changes.
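The isolation behavior described above can be modeled with a toy in-memory store. This is only a sketch of the visible semantics, assuming a single updater at a time; it is not how Berkeley DB implements transactions, and the names SnapshotStore and Update are invented for the example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Toy model of transaction isolation (assumption: one updater at a time).
final class SnapshotStore {
    private final AtomicReference<Map<String, String>> committed =
            new AtomicReference<>(Collections.<String, String>emptyMap());

    // A reader keeps the state that was committed when it began.
    Map<String, String> beginRead() {
        return committed.get();
    }

    Update beginUpdate() {
        return new Update();
    }

    final class Update {
        // Private working copy: the updater sees only its own changes.
        private final Map<String, String> working = new HashMap<>(committed.get());

        void put(String key, String value) {
            working.put(key, value);
        }

        String get(String key) {
            return working.get(key);
        }

        // Publishes all changes at once; only later transactions see them.
        void commit() {
            committed.set(Collections.unmodifiableMap(new HashMap<>(working)));
        }

        // Throws all changes away; the committed state is untouched.
        void abort() {
            working.clear();
        }
    }
}
```

A reader that began before the commit keeps its old snapshot, so it never observes a lecture with only some of its segments.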

6.1.3 Data Classes

Raw data, corresponding to a database record, has an unadorned class name, such as Seg. In old-style database code, there will also be a class with DB appended, as in SegDB. This class handles packing and unpacking, as well as key generation. There is often another class, with Hit appended, which is a handle to retrieved data. The class SegHit is an example. Where the raw data might have an identifier, the hit would have a pointer to the identified object, or an iterator for a group of objects. The raw classes and the -Hit classes are Java bean classes.
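The difference between a raw class and its -Hit counterpart can be sketched as follows. All class names, fields, and the resolver below are hypothetical; only the split between a raw record that stores an identifier and a hit that points to the identified object is taken from the description above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical lecture bean.
final class LectureSketch {
    private final int id;
    private final String title;

    LectureSketch(int id, String title) {
        this.id = id;
        this.title = title;
    }

    public int getId() { return id; }
    public String getTitle() { return title; }
}

// Raw record: refers to its lecture by identifier only.
final class RawSeg {
    private final int lectureId;
    private final String text;

    RawSeg(int lectureId, String text) {
        this.lectureId = lectureId;
        this.text = text;
    }

    public int getLectureId() { return lectureId; }
    public String getText() { return text; }
}

// Hit: a handle to retrieved data that exposes the lecture object itself.
final class RawSegHit {
    private final RawSeg seg;
    private final LectureSketch lecture;

    RawSegHit(RawSeg seg, LectureSketch lecture) {
        this.seg = seg;
        this.lecture = lecture;
    }

    public String getText() { return seg.getText(); }
    public LectureSketch getLecture() { return lecture; }
}

// Stand-in for the lookup that turns identifiers into objects.
final class HitResolver {
    private final Map<Integer, LectureSketch> lectures = new HashMap<>();

    void addLecture(LectureSketch lecture) {
        lectures.put(lecture.getId(), lecture);
    }

    RawSegHit resolve(RawSeg seg) {
        return new RawSegHit(seg, lectures.get(seg.getLectureId()));
    }
}
```

Because both kinds of class expose bean-style getters, a web page can navigate from a hit to its lecture without knowing about identifiers or lookups.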

There are also class names formed by concatenating two data class names, as in CategoryLecture. These provide the data for the join operation alluded to previously. They may also include -DB versions, as well as -Iterator versions, which are iterator interfaces.

6.1.4 Queries

The class QueryBean provides “higher-level” access to queries than those provided by the Index class through a bean interface. In general, queries should be performed with the QueryBean, since it encapsulates the Index implementation. A QueryBean can be obtained with the Index method getQueryBean().

6.1.5 Updates

The class IndexWriter provides the interface for programs that need to add lectures. An IndexWriter is obtained with the getIndexWriter() method of Index.

6.2 Text Index

The package edu.mit.csail.sls.lectures.trans uses Apache Lucene to manage the text index. The TransIndexer class adds data to the index, while the TransQuery class retrieves data.