This section describes how a client, such as the Lecture Browser, can use the lecture server. The next section will describe the implementation of the lecture server.
The lecture server is a web application that can answer queries about lecture content, access media, authenticate a user, modify a transcription, and index a new lecture. All queries are in the form of an HTTP GET or POST request, and results are in the form of values embedded in XML. Queries can be for the list of lecture categories, or for sets of lectures or parts of lectures. Since GET requests append their parameters to the URL, they can easily be performed from a browser.
Category queries have no parameters and return the complete list of category names and their identifiers. The lecture browser uses this query to fill in the menu item with the list of categories. Example output for http://web.sls.csail.mit.edu/lectures/categories.jsp is shown in Figure 1. If you type the URL into your web browser address area, your browser will show you the current list of categories. Each item in the list has a name, which the client shows in a menu, and an identifier, which the client can include in lecture queries to restrict the query to a particular category.
<categories>
  <category name="Applied Mathematics" categoryid="201"/>
  <category name="Architecture" categoryid="511"/>
  <category name="Arts and Humanities" categoryid="503"/>
  <category name="Astronomy" categoryid="303"/>
  <category name="Biology" categoryid="505"/>
  <category name="Business and Economics" categoryid="501"/>
  <category name="Classical Mechanics" categoryid="302"/>
  <category name="Cognitive Science" categoryid="509"/>
  <category name="Education" categoryid="507"/>
  <category name="Electricity and Magnetism" categoryid="401"/>
  <category name="Engineering" categoryid="506"/>
  <category name="History and Political Science" categoryid="504"/>
  <category name="Linear Algebra" categoryid="102"/>
  <category name="MIT Culture and History" categoryid="508"/>
  <category name="Mathematics" categoryid="101"/>
  <category name="Media" categoryid="510"/>
  <category name="Physics" categoryid="301"/>
  <category name="Speech Processing" categoryid="1"/>
  <category name="Technology and Innovation" categoryid="502"/>
  <category name="Vibrations and Waves" categoryid="402"/>
</categories>
Figure 1: categories.jsp
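As a rough illustration of how a client might turn the category list into menu entries, the following Python sketch parses an abbreviated copy of the Figure 1 output. The function name parse_categories and the embedded sample are illustrative only; a real client would fetch the text from categories.jsp instead.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the categories.jsp output from Figure 1.
SAMPLE = """<categories>
  <category name="Applied Mathematics" categoryid="201"/>
  <category name="Architecture" categoryid="511"/>
  <category name="Speech Processing" categoryid="1"/>
</categories>"""


def parse_categories(xml_text):
    """Return (name, categoryid) pairs in document order (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return [(c.get("name"), c.get("categoryid")) for c in root.iter("category")]


categories = parse_categories(SAMPLE)
for name, categoryid in categories:
    # The name goes in the menu; the identifier is kept for later lecture queries.
    print(f"{categoryid:>4}  {name}")
```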
Lecture queries match text and structural attributes to return a list of “hits.” It is easiest to describe the query by working backwards from the general result, shown in Figure 2.
<results>
  <course/>
  <seminarseries/>
  <lecture>
    <segment>
      <fragment>
        <word/>
      </fragment>
    </segment>
  </lecture>
</results>
Figure 2: General Form of Lecture Results
The results are always enclosed in a results tag. Zero or more course tags may follow, then zero or more seminarseries tags. The course and seminarseries tags are only included when the results contain lectures associated with the courses or seminar series. After any course and seminar series tags there are zero or more lecture tags. Each lecture may have zero or more segment tags, each segment zero or more fragment tags, and each fragment zero or more word tags.
The client often only wants a list of lectures or segments, so the parameter depth can be used to control how much detail is returned. The default value is 1, which means only the lectures are returned, so their segments, fragments, and words are omitted. A depth of 2 will also return the segments, 3 the fragments, and 4 the words.
Another important parameter is query, which is a text query suitable for the Apache Lucene text index. Figure 3 shows the results of a query for the text “hacks”, requested as http://web.sls.csail.mit.edu/lectures/lectures.jsp?query=hacks.
<results query="hacks">
  <lecture rpmurl="http://web.sls.csail.mit.edu/lectures/lecturerpm.jsp?lectureid=192" count="6" lectureid="192" keywords="cow board bridge tetazoo hack p dome hate floor art" date="October 20, 2005" name="Where the Sun Shines, There Hack They" number="" lecturer="Samuel Jay Keyser" duration="3642530">
  </lecture>
</results>
Figure 3: lectures.jsp?query=hacks
This query returned a single lecture which was not part of a seminar series or course. Not all lecture attributes are listed; some, such as the course and seminar series identifiers, are elided if they are not applicable. Figure 4 describes the various lecture attributes.
- count
- The number of hits in the lecture. This query retrieved only the lectures, but at a depth of 3, for fragments, the result would have included six fragments for this lecture.
- courseid
- If present, the course identifier for the lecture.
- date
- The date of the lecture.
- duration
- The length of the lecture in milliseconds.
- keywords
- Words that the statistical segmenter identified as occurring more often than normal in this lecture.
- lectureid
- The identifier for the lecture, which can be used in queries to restrict the query to a specific lecture.
- lecturer
- The person giving the lecture.
- name
- The name or title of the lecture.
- number
- In a course, the lecture number indicates which lecture in the course this lecture corresponds to, e.g. 1, 2, and so on.
- rpmurl
- The URL for the media description required by RealPlayer. RealPlayer requires that the actual media be described with a short amount of text, which will be returned by a GET request to this URL.
- seriesid
- If present, the series identifier for the lecture.
Figure 4: Lecture Attributes
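The attributes in Figure 4 map naturally onto a small record per lecture. The sketch below, which assumes nothing beyond the Figure 3 response, reads those attributes with Python's standard XML parser and converts the millisecond duration to h:mm:ss. The helper names parse_lectures and format_duration are hypothetical.

```python
import xml.etree.ElementTree as ET

# The lectures.jsp?query=hacks response from Figure 3.
SAMPLE = """<results query="hacks">
  <lecture rpmurl="http://web.sls.csail.mit.edu/lectures/lecturerpm.jsp?lectureid=192"
           count="6" lectureid="192"
           keywords="cow board bridge tetazoo hack p dome hate floor art"
           date="October 20, 2005" name="Where the Sun Shines, There Hack They"
           number="" lecturer="Samuel Jay Keyser" duration="3642530">
  </lecture>
</results>"""


def format_duration(milliseconds):
    """Convert a duration in milliseconds to an h:mm:ss string."""
    total_seconds = milliseconds // 1000
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


def parse_lectures(xml_text):
    """Return one dictionary per lecture tag (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    lectures = []
    for node in root.iter("lecture"):
        lectures.append({
            "lectureid": node.get("lectureid"),
            "name": node.get("name"),
            "lecturer": node.get("lecturer"),
            "count": int(node.get("count", "0")),
            "duration": format_duration(int(node.get("duration", "0"))),
            "keywords": node.get("keywords", "").split(),
            # The server omits these when they do not apply, so they may be None.
            "courseid": node.get("courseid"),
            "seriesid": node.get("seriesid"),
        })
    return lectures


lectures = parse_lectures(SAMPLE)
for lecture in lectures:
    print(lecture["name"], lecture["duration"], lecture["count"])
```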
<results query="jupiter">
  <course courseid="201" institution="MIT" department="Physics" number="8.01" name="Physics I: Classical Mechanics" year="1999" term=""/>
  <course courseid="401" institution="MIT" department="Physics" number="8.03" name="Physics III: Vibrations and Waves" year="2004" term=""/>
  <seminarseries seriesid="4" institution="MIT" name="Poetry@MIT" host="MIT Program in Writing and Humanistic Studies"/>
  <lecture rpmurl="http://web.sls.csail.mit.edu/...ctureid=168" count="1" lectureid="168" keywords="shop train poet kevin sofa parish w skylight local frogs" seriesid="4" date="October 17, 2002" name="A Reading by Seamus Heaney" number="" lecturer="Seamus Heaney" duration="3402053">
  </lecture>
  <lecture rpmurl="http://web.sls.csail.mit.edu/...ctureid=189" count="1" lectureid="189" keywords="rocks apollo moon object ... layers erupt" date="April 2, 2003" name="The Quest for Mars: Scientific and Human Destiny?" number="" lecturer="Jim Garvin" duration="5621425">
  </lecture>
  ...
</results>
Figure 5: lectures.jsp?query=jupiter
Figure 5 shows the partially elided results of a query to http://web.sls.csail.mit.edu/lectures/lectures.jsp?query=jupiter that returns more lectures. With this query, there were two lectures from courses and one from a seminar series. Figure 6 describes the course attributes, and Figure 7 describes the seminar series attributes.
- courseid
- The identifier of the course.
- department
- The department that offered the course.
- institution
- Where the course was given.
- name
- The name of the course.
- number
- The course's institutional number.
- term
- The term of the course, e.g. Fall.
- year
- The year of the course.
Figure 6: Course Attributes
- seriesid
- The identifier for the seminar series.
- host
- The sponsor of the seminar series.
- institution
- Where the seminar series was held.
- name
- The name of the seminar series.
Figure 7: Seminar Series Attributes
When browsing, a user typically starts with an initial text query and then clicks on a lecture to see it in more detail. The client can handle this by performing a second text query restricted, through the lectureid parameter, to the lecture the user clicked on. Setting the depth parameter to 2 returns the segments, and setting the parameter fillLecture to True causes all the segments to be fetched instead of just the ones with hits, so the client can show every segment and highlight the ones with hits. Figure 8 shows the result, and Figure 9 describes the attributes.
<results query="jupiter">
  ...
  <lecture ...>
    <segment summary="frequency object omega pi ..." count="0" score="1.0" beginTime="1571" endTime="384448">
    </segment>
    <segment summary="push table pool non ..." count="0" score="1.0" beginTime="385321" endTime="578013">
    </segment>
    <segment summary="planets sun model string orbits..." count="1" score="1.0" beginTime="578259" endTime="1337630">
    </segment>
    <segment summary="particles gravity direction..." count="0" score="1.0" beginTime="1340219" endTime="2234896">
    </segment>
    <segment summary="salt nitrate table..." count="0" score="1.0" beginTime="2235345" endTime="2420363">
    </segment>
    <segment summary="bucket string gravity sense..." count="0" score="1.0" beginTime="2420907" endTime="3043149">
    </segment>
  </lecture>
</results>
Figure 8: lectures.jsp?query=jupiter&lectureid=73&depth=2&fillLecture=True
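One way a client might use a filled depth-2 result is to list every segment and flag those whose count is nonzero. The sketch below works on three segments copied from Figure 8. The lecture tag is reduced to its lectureid, taken from the figure caption, and the helper name segment_rows is hypothetical.

```python
import xml.etree.ElementTree as ET

# Three segments copied from Figure 8; the other lecture attributes are omitted.
SAMPLE = """<results query="jupiter">
  <lecture lectureid="73">
    <segment summary="push table pool non ..." count="0" score="1.0" beginTime="385321" endTime="578013"></segment>
    <segment summary="planets sun model string orbits..." count="1" score="1.0" beginTime="578259" endTime="1337630"></segment>
    <segment summary="particles gravity direction..." count="0" score="1.0" beginTime="1340219" endTime="2234896"></segment>
  </lecture>
</results>"""


def segment_rows(xml_text):
    """Return one row per segment, noting whether it contains a hit."""
    root = ET.fromstring(xml_text)
    rows = []
    for seg in root.iter("segment"):
        rows.append({
            "begin_ms": int(seg.get("beginTime")),
            "end_ms": int(seg.get("endTime")),
            "has_hit": int(seg.get("count", "0")) > 0,
            "summary": seg.get("summary", ""),
        })
    return rows


rows = segment_rows(SAMPLE)
for row in rows:
    marker = "*" if row["has_hit"] else " "  # "*" marks segments to highlight
    print(f'{marker} {row["begin_ms"] // 1000:>5}s  {row["summary"]}')
```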
The descriptions of the fragments and words, returned in results of depth 3 and 4 respectively, are in Figure 10 and Figure 11.
- beginTime
- The time in milliseconds in the lecture when the segment begins.
- count
- How many hits are in the segment.
- endTime
- The time in milliseconds in the lecture when the segment ends.
- score
- Sometimes related to the hit score.
- summary
- The keywords that the statistical segmenter determined happened more frequently than normal.
Figure 9: Segment Attributes
- beginTime
- The starting time of the fragment, in milliseconds.
- count
- The number of hits in the fragment.
- endTime
- The ending time of the fragment, in milliseconds.
Figure 10: Fragment Attributes
- beginTime
- The starting time of the word, in milliseconds.
- endTime
- The ending time of the word, in milliseconds.
- term
- Whether or not the word is a term to be highlighted.
- text
- The text of the word.
Figure 11: Word Attributes
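To make the fragment and word attributes concrete, here is a minimal sketch that rebuilds a line of transcript from word tags and marks the highlighted terms. No depth-4 output is reproduced in this section, so the sample is hypothetical: the attribute names come from Figures 10 and 11, but the times, the words, and the "true"/"false" encoding of term are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample. Attribute names follow Figures 10 and 11; the values,
# including the "true"/"false" encoding of term, are invented for illustration.
SAMPLE = """<results query="jupiter">
  <fragment beginTime="600000" endTime="602400" count="1">
    <word beginTime="600000" endTime="600400" term="false" text="the"/>
    <word beginTime="600400" endTime="601000" term="false" text="orbit"/>
    <word beginTime="601000" endTime="601300" term="false" text="of"/>
    <word beginTime="601300" endTime="602400" term="true" text="jupiter"/>
  </fragment>
</results>"""


def render_fragment(fragment, mark="*"):
    """Join the words of one fragment, wrapping highlighted terms in `mark`."""
    parts = []
    for word in fragment.iter("word"):
        text = word.get("text", "")
        is_term = word.get("term", "").lower() == "true"  # assumed encoding
        parts.append(f"{mark}{text}{mark}" if is_term else text)
    return " ".join(parts)


root = ET.fromstring(SAMPLE)
# iter() finds fragments at any nesting level under the results tag.
lines = [render_fragment(f) for f in root.iter("fragment")]
for line in lines:
    print(line)
```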
Figure 12 shows the lectures.jsp query parameters.
- beginTime
- Only return hits whose fragments start at or beyond this time.
- categoryid
- Only return hits whose lecture includes the specified categoryid.
- courseid
- Only return hits for lectures with the specified courseid.
- depth
- The depth of the lecture tree to return. 1 is lectures, 2 is segments, 3 is fragments, and 4 is words. The default is 1.
- endTime
- Only return hits whose fragments end at or before this time.
- fillLecture
- If True, retrieve lecture segments even if the query did not match. This is useful in drilling down into a lecture since it returns timing information about all of the lecture segments, rather than just those with hits.
- highlighter
- A query of words that should be marked as terms in the result. Ideally, the highlighter would mark only the words that actually match, but right now it marks every word that appears in the query. For example, if the highlighter query were “bright light”, all instances of the word “bright” and all instances of the word “light” within the search results would be marked as terms, not just those places where the sequence “bright light” occurred. Furthermore, “NOT gravity” would highlight the word “gravity” instead of everything else. With some work, it is possible to make the Lucene text retrieval do the right thing.
- lectureid
- Only return hits for lectures with the specified lectureid.
- maxHits
- The maximum number of hits (fragments) to be returned.
- query
- A text query.
- seriesid
- Only return hits for lectures with the specified seminar seriesid.
- startHit
- If a subset of the hits is to be returned, this indicates the offset of the first fragment within the set. This might not be implemented.
Figure 12: lectures.jsp Query Parameters
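Since every parameter in Figure 12 travels in the query string of a GET request, a client can assemble requests with a small helper. The following sketch uses Python's urllib.parse. The function name lecture_query_url and its validation rules are assumptions made for illustration, not part of the server.

```python
from urllib.parse import urlencode

BASE_URL = "http://web.sls.csail.mit.edu/lectures/lectures.jsp"

# Parameter names taken from Figure 12.
KNOWN_PARAMS = {
    "beginTime", "categoryid", "courseid", "depth", "endTime", "fillLecture",
    "highlighter", "lectureid", "maxHits", "query", "seriesid", "startHit",
}


def lecture_query_url(**params):
    """Build a lectures.jsp URL from keyword arguments (hypothetical helper)."""
    unknown = set(params) - KNOWN_PARAMS
    if unknown:
        raise ValueError(f"unknown parameter(s): {sorted(unknown)}")
    depth = params.get("depth")
    if depth is not None and depth not in (1, 2, 3, 4):
        raise ValueError("depth must be 1 (lectures) through 4 (words)")
    # Leave out unset parameters so the server applies its own defaults.
    pairs = [(key, str(value)) for key, value in params.items() if value is not None]
    return BASE_URL + "?" + urlencode(pairs)


# The drill-down request from the caption of Figure 8.
url = lecture_query_url(query="jupiter", lectureid=73, depth=2, fillLecture=True)
print(url)
```

Because urlencode escapes each value, a multi-word query such as "bright light" can be passed as is.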
Times queries retrieve a subset of a time-aligned transcription based on start and end times within a lecture. These are used by the lecture browser when it presents the synchronized transcription. Figure 13 lists the times.jsp query parameters. The results are in the form of a results tag containing fragment tags, which in turn contain word tags, as described in Figure 10 and Figure 11.
- beginTime
- The starting time in milliseconds for words in the lecture.
- endTime
- The ending time in milliseconds for words in the lecture.
- highlighter
- The highlighter expression used for marking words as terms.
- lectureid
- The lecture identifier for the lecture.
Figure 13: times.jsp query parameters
The RealPlayer browser plugin requires that the location of media be provided indirectly. The page lecturerpm.jsp is responsible for this indirection. The page takes one parameter, lectureid, and returns a result of type audio/x-pn-realaudio-plugin containing the text of the actual URL of the media. We have found that we get the best results by letting an Apache server send the data, rather than using a streaming server or Akamai cached locations.
The lecture server includes support for user login. The server allows authenticated users to perform additional tasks, such as submitting changes to transcriptions. At this time, the users and passwords are hard-coded, but the login and capabilities facilities in the server are relatively complete.
Currently, each user has two roles, login and edit. The login role allows the user to log in, while the edit role allows the user to edit a transcription.
In general, when a user account is created, the user enters a password. Many users are not very good at password generation, so the system also generates a random string called the salt. The salt and the user password are concatenated, and an MD5 sum is computed and associated with the user identifier and the salt. Thus, the password is not itself stored, and any dictionary of common passwords used for attacks will need to be expanded by the number of salt values, which makes the target less desirable.
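The account-creation step can be sketched as follows. The text fixes only the ingredients (a random salt, the concatenation of salt and password, and an MD5 sum), so the salt length, the salt-then-password order, the UTF-8 encoding, and the function names are all assumptions. MD5 appears here only because the text names it.

```python
import hashlib
import secrets


def new_salt():
    """Random salt; the length and hex encoding are arbitrary choices here."""
    return secrets.token_hex(8)


def password_digest(salt, password):
    """MD5 over salt followed by password (order and encoding are assumptions)."""
    return hashlib.md5((salt + password).encode("utf-8")).digest()


def create_account(userid, password):
    salt = new_salt()
    # Only the salt and the digest are kept; the password itself is not stored.
    return {"userid": userid, "salt": salt, "digest": password_digest(salt, password)}


# Two users who pick the same weak password still get different stored digests.
alice = create_account("alice", "letmein")
bob = create_account("bob", "letmein")
print(alice["digest"].hex())
print(bob["digest"].hex())
```

Because each account has its own salt, an attacker with a dictionary of common passwords has to hash the whole dictionary again for every salt value.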
The first step in logging in is to get a randomly generated nonce and the salt for the user from nonce.jsp, which takes userid as a parameter. The salt is concatenated with the password the user types into the client and an MD5 sum is computed. The sum is converted to a base-64 string and concatenated with the nonce. An MD5 sum of that is converted to a base-64 string and sent to the login.jsp page as the clientHash parameter.
The login.jsp page returns an XML login tag with attributes of loginOK and editOK to indicate whether or not login was successful, and whether or not the user is permitted to edit. For the client, this information is only informative; it is the server's session state that determines what the user can do.
NOTE: Unless SSL is used, which we do not use, someone monitoring network traffic would be able to pretend to be in the same session. SSL could be added relatively easily were it deemed necessary. The login procedure does avoid the need to store passwords on the server and does make the passwords themselves relatively secure.
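The client-side computation described above can be sketched in a few lines of Python. The text does not specify the order of concatenation or the character encoding, so this sketch assumes salt before password, the base-64 string before the nonce, and UTF-8 throughout. The function names and the sample salt and nonce are hypothetical stand-ins for what nonce.jsp would return.

```python
import base64
import hashlib


def md5_base64(text):
    """MD5 of the UTF-8 bytes of text, returned as a base-64 string."""
    digest = hashlib.md5(text.encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")


def client_hash(salt, password, nonce):
    """Compute the value sent to login.jsp as the clientHash parameter."""
    inner = md5_base64(salt + password)  # base-64 of MD5(salt + password)
    return md5_base64(inner + nonce)     # base-64 of MD5(inner + nonce)


# Hypothetical values standing in for the nonce.jsp response.
salt, nonce = "a1b2c3", "n-0001"
value = client_hash(salt, "letmein", nonce)
print(value)
```

Under these assumptions the value sent changes whenever the nonce changes, and a server holding only the stored MD5 sum of salt and password could recompute the same value for comparison without ever seeing the password.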