4 Bulk Population

This section briefly describes how lectures are transcribed automatically, then explains how information about the lectures is prepared for indexing and how the indexing is performed.

4.1 Getting Transcriptions

We receive lecture videos in RealPlayer format. The audio portion of each video is extracted manually to produce an accompanying 16 kHz RIFF-format .wav audio file. These audio files are automatically transcribed and then automatically segmented to produce a “.seg” file, as shown in Figure 19. The file consists of one section for each segment of the lecture and, within each segment, one triple of start time, end time, and text for each word.

<?xml version="1.0" encoding="UTF-8"?>
<document fileName="...">
<lecture title="NA" keywords="combination equation...">
<segment id="1" title="equation y zero minus ...">
6323 6534 hi
6768 6973 this
6973 7138 is
7138 7258 the
...
</segment>
<segment id="2" title="minus ...">
...
</segment>
</lecture>
</document>

Figure 19: Excerpt of Segmented Transcription
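
The layout in Figure 19 can be read with a few lines of code. The following Python sketch is illustrative only and is not part of our tool chain: the function name parse_seg and the sample file name example.seg are hypothetical, and we assume that each word occupies exactly one whitespace-separated line of the form "start end word". Time values are kept as integers because the unit is not specified here.

```python
import xml.etree.ElementTree as ET

# Small sample in the layout of Figure 19; "example.seg" is a made-up name.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<document fileName="example.seg">
<lecture title="NA" keywords="combination equation">
<segment id="1" title="equation y zero minus">
6323 6534 hi
6768 6973 this
6973 7138 is
7138 7258 the
</segment>
<segment id="2" title="minus">
</segment>
</lecture>
</document>
"""


def parse_seg(xml_bytes):
    """Return one dict per <segment>, each with a list of (start, end, word)."""
    root = ET.fromstring(xml_bytes)
    segments = []
    for seg in root.iter("segment"):
        words = []
        for line in (seg.text or "").splitlines():
            parts = line.split()
            if len(parts) != 3:
                continue  # skip blank lines and anything that is not a triple
            start, end, word = parts
            words.append((int(start), int(end), word))
        segments.append(
            {"id": seg.get("id"), "title": seg.get("title"), "words": words}
        )
    return segments


segments = parse_seg(SAMPLE.encode("utf-8"))
for seg in segments:
    print(seg["id"], seg["title"], len(seg["words"]))
```

Lines that do not split into exactly three fields are skipped rather than reported, which is a simplification; a production reader would probably want to flag them.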

4.2 Preparation for Indexing

Before a group of lectures can be indexed, a separate XML file describing the group must be prepared manually. This is a tedious process; it could be automated through better organization and integration of the tools, plus a layer of infrastructure for managing the lecture meta-information.

There are two types of descriptions: those for seminars and those for courses. One file can consist entirely of seminars, or of all the lectures in a course.

<?xml version='1.0'?>
<seminars
 group="mitworld" institution = "MIT"
 media_root="12800/rm/mitworld.download.akamai.com/12800/"
 wav_root="MIT-World">
 <seminarseries
  series = "Back to the Classroom" host = "MIT Sloan School">
  <lecture
   title="The Future of Work" lecturer="Thomas Malone"
   date = "June 5, 2004"
   media="mitw-sloan-bttc-04-malone-work-05jun2004-220k.rm"
   segfile="MIT-World-2004-Sloan-BTTC-Malone-Work-05Jun2004.seg" 
   timescale = "1.0">
   <category>Business and Economics</category>
   <category>Technology and Innovation</category> 
  </lecture>     
  <lecture
   title="Innovation: Are You A Predator or Are You Prey?"
   lecturer="James Utterback" date = "June 7, 2003"
   media="mitw-sloan-backtoclass-utterback-07jun03-220k.rm"
   segfile="MIT-World-2003-Sloan-BackToClass-Utterback-01Jun2003.seg" 
   timescale = "1.0">
   <category>Business and Economics</category>
   <category>Technology and Innovation</category> 
  </lecture>     
 </seminarseries>
 <lecture
  title="Software Breakthroughs: Solving the Toughest..."
  lecturer="Bill Gates" date = "February 26, 2004"
  media="mitw-eecs-bill-gates-microsoft-26feb2004-220k.rm"
  host = "MIT Electrical Engineering and Computer Science Department"
  segfile="MIT-World-2004-EECS-Bill-Gates-Microsoft-26Feb2004.seg" 
  timescale = "1.0">
  <category>Technology and Innovation</category>
  <category>Engineering</category>
 </lecture>      
</seminars>

Figure 20: Excerpt of Description of Seminar Lectures

Figure 20 shows an excerpt from a file describing seminars. The values of the attributes in the seminars tag are shared by all the child nodes. There is one seminar series with two lectures, and one lecture that is not part of a series. Each of the lectures is included in two categories. The complete list of categories is derived from the set of all category names that appear in the descriptions, so a spelling mistake will result in a new category.
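
To make the attribute sharing and the category behavior concrete, the following Python sketch flattens a description file into one record per lecture and collects the category set. It is an illustration under stated assumptions, not the indexer's actual code: the function name flatten is hypothetical, we assume that an attribute on a child node overrides the same attribute on its parent, and the second lecture in the sample (with its misspelled category) is invented for the example.

```python
import xml.etree.ElementTree as ET

# Cut-down sample modeled on Figure 20. The second lecture is hypothetical
# and deliberately misspells "Innovation".
SAMPLE = """<seminars group="mitworld" institution="MIT">
 <seminarseries series="Back to the Classroom" host="MIT Sloan School">
  <lecture title="The Future of Work" lecturer="Thomas Malone">
   <category>Business and Economics</category>
   <category>Technology and Innovation</category>
  </lecture>
 </seminarseries>
 <lecture title="Hypothetical Talk" lecturer="A. Speaker">
  <category>Technology and Inovation</category>
 </lecture>
</seminars>
"""


def flatten(root):
    """Return one dict per <lecture>, merged with all ancestor attributes."""
    records = []

    def walk(node, inherited):
        # Assumption: attributes on a child override those of its ancestors.
        attrs = {**inherited, **node.attrib}
        if node.tag == "lecture":
            attrs["categories"] = [
                (c.text or "").strip() for c in node.findall("category")
            ]
            records.append(attrs)
        else:
            for child in node:
                walk(child, attrs)

    walk(root, {})
    return records


records = flatten(ET.fromstring(SAMPLE))
# The category list is just the set of names seen, so a typo adds an entry.
categories = sorted({c for r in records for c in r["categories"]})
print(categories)
```

In this sample the misspelled name appears as a third category alongside the two intended ones, which is the failure mode described above.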

The media_root and media attributes are concatenated to form an unrooted path for the media. As mentioned in the server configuration section, a configuration parameter for the web application supplies the root for the media pathnames. The segmentation is provided as an unrooted file name. A parameter to the indexing program supplies the remainder of this pathname. This makes it easier to set up a lecture server on another file system, such as a stand-alone Windows version.
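
The path handling can be sketched as follows in Python. The function names and the root /srv/media are hypothetical; the segmentation root is the one used in Figure 22. We assume POSIX-style separators and that media_root already ends in a slash, as it does in Figures 20 and 21.

```python
import posixpath


def media_path(web_media_root, media_root, media):
    """Concatenate media_root and media, then place the result under the
    root supplied by the web application's configuration."""
    unrooted = media_root + media  # plain concatenation, as described above
    return posixpath.join(web_media_root, unrooted)


def seg_path(segroot, segfile):
    """Place the unrooted segmentation file name under the indexer's root."""
    return posixpath.join(segroot, segfile)


# Values taken from Figure 21; "/srv/media" is a made-up configuration value.
video = media_path(
    "/srv/media",
    "7870/rm/mitworld.download.akamai.com/7870/",
    "18/18.06/videolectures/strang-1806-lec01-26aug1999-220k.rm",
)
seg = seg_path("/s/lectures/mit-museum/transcripts/", "18.06-1999-L01.seg")
print(video)
print(seg)
```

Because only the two roots are machine-specific, moving the server to another file system means changing those two values rather than editing every lecture description.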

The timescale attribute is a work-around for a problem that sometimes occurs during the waveform extraction. In some cases, for unknown reasons, there will be a slight linear skew between the extracted waveform sample rate and the video sample rate, which causes the transcription to be off by several seconds at the end of a lecture. The timescale parameter is a compensation mechanism.
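
One plausible reading of this compensation, given that the skew is linear, is that every transcript time is multiplied by the timescale value. The Python sketch below illustrates that reading only; how the factor is actually applied is not specified here, and the function names and the numbers in the example are hypothetical.

```python
def estimate_timescale(transcript_time, video_time):
    """Estimate the factor from one late reference point: the time the
    transcript assigns to a word and the time it is heard in the video."""
    return video_time / transcript_time


def rescale(words, timescale):
    """Apply the linear correction to a list of (start, end, word) triples."""
    return [(start * timescale, end * timescale, word)
            for start, end, word in words]


# Hypothetical case: a word heard at 3600 s in the video is placed at 3603 s
# by the transcript, i.e. three seconds of drift after an hour.
timescale = estimate_timescale(3603.0, 3600.0)
corrected = rescale([(1801.5, 1802.0, "half"), (3603.0, 3603.4, "end")],
                    timescale)
print(timescale, corrected)
```

A value of 1.0, as in Figures 20 and 21, leaves the times unchanged.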

<?xml version='1.0'?>
<course
 group="ocw" institution = "MIT" department = "Mathematics"
 number = "18.06" name = "Linear Algebra" year = "1999"
 media_root="7870/rm/mitworld.download.akamai.com/7870/"
 wav_root="18.06-1999">
 <lecture
  number="1" lecturer="Gilbert Strang"
  title="The Geometry of Linear Equations"
  media="18/18.06/videolectures/strang-1806-lec01-26aug1999-220k.rm"
  segfile="18.06-1999-L01.seg" 
  timescale = "1.0">
  <category>Mathematics</category>
  <category>Linear Algebra</category>
 </lecture>       
 <lecture
  number="2" lecturer="Gilbert Strang"
  title="Elimination with Matrices"
  media="18/18.06/videolectures/strang-1806-lec02-10sep1999-220k.rm"
  segfile="18.06-1999-L02.seg" 
  timescale = "1.0">
  <category>Mathematics</category>
  <category>Linear Algebra</category>
 </lecture>
 ...
</course>

Figure 21: Excerpt of Description of Course Lectures

Figure 21 shows the description for some lectures in a course. The description is similar to the one for seminar series, but tailored towards the lectures in a course.

4.3 Indexing

The Java program segindexer.jar is used to index lectures. Its arguments are an index directory, a root directory for segmentation files, and zero or more XML files that describe lectures. If the index directory does not contain an index, a new one is created. Otherwise, the lectures are added to the existing index. Figure 22 contains a script for indexing three groups of lectures.

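#!/bin/bash
# Location of the indexing program
indexer=~/workspace/segindexer.jar
# Index directory; a new index is created if none exists there
index=/scratch/segindex
# Root directory prepended to each segfile name
segroot=/s/lectures/mit-museum/transcripts/
# Directory holding the XML lecture descriptions
xmlroot=/s/lectures/mit-museum/xml-info

# Stop if the description directory is missing
cd "$xmlroot" || exit 1
for f in Intro.xml 18.06-1999.info.xml 18.085-2001.info.xml ...
do
  echo "$f"
  # Add the lectures described in $f to the index
  java -jar "$indexer" -add "$index" -segroot "$segroot" "$f"
done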

Figure 22: Bash Script for Indexing