This section proposes changes to the lecture browser system. These changes fall into several broad categories:
The automatic transcription process must be available, either by packaging it (which would involve licensing and compute-power issues) or by offering it as a service in a way that could be integrated with the administration process.
The remainder of this section will describe changes in the implementation that are required to support the new capabilities. In many cases, multiple capabilities require the same implementation changes.
Changes need to be made in the way lecture information is managed to support several of the proposed changes. This section describes the problems with the current implementation and proposes a new way to manage the information.
Lecture content and metadata are currently managed by hand in the filesystem, and the information is cached in the index during the indexing process. The server supports adding information to the index and changing cached information, but it cannot propagate updates beyond the cache, so changes made via the server do not survive index reinitialization.3
Attempting to propagate changes from the cache back to the file system would be problematic, since there are no controls on the file system. If the file system is modified by hand, the cached version will not agree. Furthermore, manually editing XML files is error-prone, which could lead to problems when the server tried to merge a change into an invalid file.
Adding bookmarks would create a new problem: in any situation where semi-permanent names are created, such as the URLs for bookmarks, the implementation of the content associated with the names should be isolated from the names. In the current system, lectures are identified by their lecture identifier, which is an artifact of the indexing process. If a new lecture is added to the XML files where it fits logically, rather than at the end, then the lecture identifiers will change.
Each lecture is associated with a unique URL for its media, but this is not a good long-term identifier, since these URLs are derived from the externally maintained layouts of MIT's Open Courseware and MIT World web sites and could change at any time. Furthermore, the media itself sometimes changes. For example, for copyright reasons, portions of lectures are sometimes blacked out. We have also seen media editing, which changes the time-alignment.
The problems with lecture information can be solved by moving the lecture information that is currently stored in the filesystem and modified by text editors into a managed database that is modified by lecture browser administration programs. The combination of a database and gated access to the information solves the multiple copies, concurrency, and formatting problems associated with the direct editing of files.
The lecture database must be carefully organized to meet the unique needs of the lecture data. The basic bookmarked entity is the lecture, not the media. If someone views a video of a lecture and bookmarks a location, they want the bookmark to correspond to that location in the lecture, not to a particular offset in a particular edit of the lecture in a particular format. For example, if they made the bookmark on a cell-phone browser that was using a low-bandwidth video stream and later wanted to view it on a desktop computer with a high-bandwidth stream, they would like the bookmark to “just work” for their new configuration.
Although the basic bookmarked entity is the lecture, the bookmark needs to include information about the media, or at least the edit version of the media. When a user marks the point where the lecturer says “Here we see...”, which occurs at 31:23.242 in the version of the lecture they are watching, they will want playback to start at “Here we see...” three weeks later, when an edit to the lecture has moved that point to 32:46.126 in the media. Since the edits occur independently of the lecture browser system, we need to bookmark with time positions in a version of the media, and then realize the bookmark by converting it to a semantic position and finding that semantic position in the new version of the media.
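As an illustration, realizing a time bookmark against a newer edit might look roughly like the sketch below. The transcript representation (a list of (time, word) pairs) and both function names are assumptions made for the sketch, not the system's actual interfaces:

```python
# Illustrative sketch of bookmark realization across media edits: a
# bookmark stores a time in a specific media version; to realize it
# against a newer version, convert the time to a semantic position
# (the words spoken there) and look those words up in the new
# transcription.  All data structures here are assumptions.

def words_near(transcript, t, n=3):
    """transcript: list of (start_time, word) pairs, in time order.
    Return the n words spoken at or just after time t."""
    idx = next((i for i, (ts, _) in enumerate(transcript) if ts >= t),
               max(len(transcript) - n, 0))
    return [w for _, w in transcript[idx:idx + n]]

def realize_bookmark(old_transcript, new_transcript, t):
    """Map time t in the old media to a time in the new media by
    finding the same word sequence in the new transcription."""
    target = words_near(old_transcript, t)
    words = [w for _, w in new_transcript]
    for i in range(len(words) - len(target) + 1):
        if words[i:i + len(target)] == target:
            return new_transcript[i][0]     # start time in new media
    return None                             # words were edited out
```

A production version would tolerate small transcription differences rather than require an exact word match, but the sketch shows the time-to-semantic-position-to-time round trip.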
There should be an entity that corresponds to a lecture, and most meta-data should be associated with that entity. Each lecture must have a unique permanent identifier, generated when the administrator creates an entry for the lecture.
Media is referenced in the database by parameterizable URL expressions. Since lecture browser administrators do not know when lectures change, a group of changes to the lecture database is assigned a generation number. When a media URL is first associated with a lecture, a unique identifier for the media is generated and associated with the URL expression, the current generation, and a hash of the media (such as an MD5 sum). When a new set of media, possibly identical, is installed, the hash of the new media is computed and, if it differs, the media has changed. In this case, a new media identifier is generated and associated with the URL expression, the current generation, and the new hash. The lecture is flagged as needing to be re-transcribed. It is assumed that if several media files for a lecture change, they all correspond to the same media edit, i.e., they are time-aligned with each other.
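The change-detection step described above can be sketched as follows. The table layout and the names `register_media`, `media_table`, and `file_hash` are illustrative assumptions, not the actual schema:

```python
# Sketch of the media-change check: hash the newly installed file and
# mint a new media identifier only when the hash differs from the one
# recorded for that URL expression.
import hashlib
import uuid

def file_hash(path):
    """MD5 of the media file, read in chunks to bound memory use."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def register_media(media_table, url_expr, path, generation):
    """Record (media_id, generation, hash) for a URL expression.

    Returns (media_id, changed); `changed` is what would flag the
    lecture for re-transcription.
    """
    new_hash = file_hash(path)
    entry = media_table.get(url_expr)
    if entry is not None and entry["hash"] == new_hash:
        return entry["media_id"], False        # same media, nothing to do
    media_id = uuid.uuid4().hex                # new media identifier
    media_table[url_expr] = {"media_id": media_id,
                             "generation": generation,
                             "hash": new_hash}
    return media_id, True
```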
Lecture transcriptions are associated with the lecture, the media generation, and a transcription specification, since all of these affect the results of the transcription. A transcription specification is parameterized by things like the speaker, auxiliary data, the language model, and the version of the transcription process. This makes it possible to experiment with changes to the transcription tools and analyze the results without needing to worry about a change hurting browser users (since the indexing can be configured not to show the experimental transcriptions).
Every automatically transcribable lecture needs a media file for the waveform; when a video generation changes, a new wave file must be extracted or provided that corresponds to the new generation.
We receive media from MIT Open Courseware and MIT World either on a disk as a copy of the web site, or from the actual web site. As lectures are added or media edited, this information changes, but we do not know when the changes occur. The term “media refresh” refers to a scan of the web site to find new lectures and changed media.
After a media refresh, the administrator would be presented with a list of all the detected media and anything already known about it. They could choose to ignore, transcribe, or update each potential lecture. For example, if the Open Courseware web site dump had been reorganized, this would make it relatively easy for the administrator to find the new locations of the media, as well as to identify new media.
After new media is transcribed, it could be indexed. The index could also be searched for similar media, so as to identify lectures whose media may have been edited slightly. Dynamic programming could be used to provide a map from time offsets in the old media to those in the new media, so that existing user bookmarks and other annotations could be preserved.
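One way to build such a map is a standard edit-distance alignment of the old and new word sequences, pairing the times of matched words. This is a sketch under assumed data structures (timed transcripts as (time, word) lists), not the system's actual implementation:

```python
# Dynamic-programming time map: align the old and new word sequences
# with the classic edit-distance recurrence, then backtrace and pair
# up the times of exactly matched words.

def time_map(old, new):
    """old, new: lists of (time, word).  Return (old_time, new_time)
    pairs for words matched by the alignment, in time order."""
    n, m = len(old), len(new)
    # dp[i][j] = edit distance between old[:i] and new[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if old[i - 1][1] == new[j - 1][1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,      # delete from old
                           dp[i][j - 1] + 1,      # insert from new
                           dp[i - 1][j - 1] + cost)
    # Backtrace, collecting the times of exactly matched words.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if old[i - 1][1] == new[j - 1][1] and dp[i][j] == dp[i - 1][j - 1]:
            pairs.append((old[i - 1][0], new[j - 1][0]))
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j - 1] + 1:
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]
```

A bookmark at an unmatched time would then be mapped through the nearest matched pair on either side.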
Since metadata is separated from the index, the index can be recreated from the metadata at any time. This provides a number of benefits: The same metadata can have more than one index, so that index development could safely share the same lecture data as a deployed system, or so that multiple copies of the index could be maintained to be used by multiple servers to achieve better scaling.
Normally, new lectures are automatically transcribed. We expect that some users will want to manually correct parts of the automatic transcriptions by replacing a portion of the transcription with their own transcription. The new transcription would be time-aligned (forced-alignment) with just the portion of the lecture whose transcription was replaced.
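Splicing a hand-corrected passage into a timed transcript might look like the following. A real system would run forced alignment of the new words against the audio of the replaced span; as a stand-in, this sketch simply spaces the new words evenly, and all names and structures are assumptions:

```python
# Sketch of replacing a span of an automatic transcription with a
# user's corrected text.  Forced alignment is stubbed out by spacing
# the new words evenly across the replaced interval.

def splice(transcript, t1, t2, new_words):
    """transcript: list of (time, word).  Replace words whose times
    fall in [t1, t2) with new_words, timed evenly across the span."""
    before = [(t, w) for t, w in transcript if t < t1]
    after = [(t, w) for t, w in transcript if t >= t2]
    step = (t2 - t1) / max(len(new_words), 1)   # placeholder timing
    middle = [(t1 + i * step, w) for i, w in enumerate(new_words)]
    return before + middle + after
```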
Since we also continue to work on improving the automatic transcriptions, we want to be able to update the automatic transcriptions associated with lectures. If users have edited or otherwise annotated a transcription or lecture, it is important that their work is not lost during a retranscription. Much as with source control systems, there needs to be a process by which they can merge in changes committed by other users or by an improved automatic transcription. Furthermore, correct hand-transcriptions provide valuable statistical information about the types of mistakes made by the automatic transcription process, and could be used for both evaluation and training purposes.
The user-visible enhancements require changes to the JavaScript that runs in the browser. Some changes, such as making the “Back” and “Forward” buttons and history work, are difficult or impossible because browsers do not currently support this functionality for AJAX applications.4 The browser buttons and history can be made to work by making more requests go to the server for HTML, and by encoding the browser state in the URLs. This would also simplify bookmarks. There would be a slight performance penalty, but it would probably not be noticeable, since we already need enough bandwidth for the videos.
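The URL-encoding idea amounts to making the view state round-trip through a query string. The actual client code would be JavaScript; this Python sketch just shows the scheme, and the parameter names (`id`, `t`, `q`) are illustrative assumptions:

```python
# Sketch of encoding browser state in a URL so that Back/Forward,
# history, and bookmarks all carry the full view state.
from urllib.parse import urlencode, parse_qs

def state_to_url(base, state):
    """Serialize the view state (lecture, time, query) into a URL."""
    return base + "?" + urlencode(state)

def url_to_state(url):
    """Recover the view state from a bookmarked or history URL."""
    query = url.split("?", 1)[1]
    return {k: v[0] for k, v in parse_qs(query).items()}

url = state_to_url("/lecture", {"id": "phys-801-03",
                                "t": "1883.242",
                                "q": "gamma ray"})
```

Because every state has a URL, a bookmark is just the current URL, and the server can render the corresponding HTML directly when the URL is revisited.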
Query results do not work properly, since any word mentioned in the query counts as a hit within the lecture. For example, if “gamma ray” were searched for, only lectures containing “gamma ray” would be returned; but within a lecture, in addition to fragments like “The gamma rays were deadly,” fragments like “The manta ray disturbed the detector.” and “Gamma was 3.4 in this test” would also be shown.
Lucene allows a text stream to be tokenized and broken into fragments that are then filtered by a query. To make use of this capability, we need to be able to preserve the timing information through the filtering.
One way to do this would be to index words by character text position in the document since tokens can include this information. Alternatively, the document could be pre-fragmented, and fragments could be indexed by text character position. The highlighter would identify the fragments, and the character position of the start token in the fragment would link back to the timing information. Pre-fragmenting would also allow for statistical fragmenting based on content, rather than simply looking for timing gaps.
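The pre-fragmenting variant can be sketched as follows: split the transcript at timing gaps, index each fragment with the character position of its first word in the concatenated document text, and map a hit's character offset back to the fragment's start time. The gap rule, data layout, and function names are assumptions for the sketch:

```python
# Sketch of pre-fragmenting a timed transcript and linking character
# positions in the indexed text back to timing information.
import bisect

def fragment(transcript, gap=1.0):
    """transcript: list of (time, word).  Split at pauses longer than
    `gap` seconds.  Return fragments as (char_start, start_time, text),
    with char positions in the space-joined document text."""
    frags, cur, pos = [], [], 0
    cur_start_pos = cur_start_t = 0
    for t, w in transcript:
        if cur and t - cur[-1][0] > gap:
            frags.append((cur_start_pos, cur_start_t,
                          " ".join(w2 for _, w2 in cur)))
            cur = []
        if not cur:
            cur_start_pos, cur_start_t = pos, t
        cur.append((t, w))
        pos += len(w) + 1          # word plus separating space
    if cur:
        frags.append((cur_start_pos, cur_start_t,
                      " ".join(w2 for _, w2 in cur)))
    return frags

def time_of_hit(frags, char_pos):
    """Map a hit's character offset back to its fragment's start time."""
    starts = [c for c, _, _ in frags]
    i = bisect.bisect_right(starts, char_pos) - 1
    return frags[i][1]
```

In the real system the highlighter would supply the character offsets; the point is only that a character position is enough to recover the playback time.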