-- KarenLivescu - 15 Dec 2005
Directory Structure
See AudioVisualRecAvicar for a description of the transcriptions, grammar, dictionaries, and so on.
There are three base kinds of feature files:
- data/AVICAR/plp - PLP audio feature data, in HTK format
- data/AVICAR/vid - Video feature data, in HTK format
- data/AVICAR/mix - mixed audiovisual features, in HTK format
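Since all three kinds of feature files are in HTK format, a short reader may be useful. The following is a minimal sketch, not the project's own tooling: it assumes the standard HTK parameter file layout (a 12-byte big-endian header followed by uncompressed 4-byte floats), and the function name read_htk is hypothetical.

```python
import struct


def read_htk(path):
    """Read an uncompressed HTK parameter file.

    Returns (frames, sample_period, parm_kind), where frames is a list of
    tuples of floats, one tuple per frame.
    """
    with open(path, "rb") as f:
        # HTK header: nSamples (int32), sampPeriod in 100 ns units (int32),
        # sampSize in bytes per frame (int16), parmKind (int16), big-endian.
        n_frames, sample_period, frame_bytes, parm_kind = struct.unpack(
            ">iihh", f.read(12)
        )
        # Assumption: parameters are stored as 4-byte floats (no compression).
        dim = frame_bytes // 4
        frames = [
            struct.unpack(">%df" % dim, f.read(frame_bytes))
            for _ in range(n_frames)
        ]
    return frames, sample_period, parm_kind
```

With a 10 ms frame skip, the sample period field would be 100000 (in 100 ns units). Files written with HTK compression or with a different byte order would need extra handling that this sketch leaves out.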
The AVICAR directory also contains some non-feature data:
- data/AVICAR/audio - raw audio waveforms for all microphones
- data/AVICAR/scripts - scripts for the phone number and sentence utterances
- data/AVICAR/video_transcr - start frame and end frame of each utterance in the per-noise-condition video files
Every AVICAR feature file has the following name format:
- data/AVICAR/${type}/${type}${proc}/${talker}_${condition}_${utt}.${type}${proc}
e.g.,
- data/AVICAR/plp/plpM4N/AF1_55D_P9_C2.plpM4N
where
- ${type} in { plp, vid, mix }
- ${proc} denotes noise reduction processing; see below
- ${talker} = ${script}${gender}${num}, e.g., the talker named "AF1" is the first female talker who was asked to read the A script. There are ten scripts (A..J), with five talkers of each gender reading each script (1..5), for a total of 100 talkers. Video features have been generated for most of the talkers reading scripts A through E.
- ${condition} in { 35D, 35U, 55D, 55U, IDL }. For most talkers, video features exist only for the extreme noise conditions: IDL (as quiet as it gets), and 55D (as noisy as it gets).
- ${utt} = ${utt_type}${seq}_C${take}. ${utt_type} (distinct from the feature ${type} above) is in { D=digits, P=phone number, L=isolated letter, S=TIMIT sentence }. ${seq} and ${take} have different meanings for different values of ${utt_type}; for phone numbers, P1_C1 is not the same as P1_C2.
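The name format above can be checked mechanically. The sketch below splits a feature file path into its fields with a regular expression. It makes some assumptions that the description does not confirm: the male gender letter is "M", the sequence and take fields are decimal digits, and the processing code consists of letters and digits only. The function name parse_feature_path is hypothetical.

```python
import re

# One named group per field of the name format described above.
FEATURE_PATH = re.compile(
    r"data/AVICAR/(?P<type>plp|vid|mix)/(?P=type)(?P<proc>\w+)/"
    r"(?P<script>[A-J])(?P<gender>[FM])(?P<num>[1-5])_"
    r"(?P<condition>35D|35U|55D|55U|IDL)_"
    r"(?P<utt_type>[DPLS])(?P<seq>\d+)_C(?P<take>\d+)"
    r"\.(?P=type)(?P=proc)$"
)


def parse_feature_path(path):
    """Return a dict of the fields in an AVICAR feature file path."""
    match = FEATURE_PATH.search(path)
    if match is None:
        raise ValueError("not an AVICAR feature file path: %r" % path)
    return match.groupdict()


info = parse_feature_path("data/AVICAR/plp/plpM4N/AF1_55D_P9_C2.plpM4N")
print(info["script"], info["gender"], info["num"], info["condition"])
```

The backreferences require the directory name and the file extension to agree, so a file placed in the wrong processing directory is rejected rather than silently accepted.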
Audio Feature Files
All PLP features are currently computed using 25 ms windows with a 10 ms skip: 12 PLP cepstral coefficients plus HTK-normalized energy, plus deltas and delta-deltas, for a total of 39 features per frame. None of these files has been mean/variance normalized yet; normalization should be computed over the entire noise condition. Current pre-processing types for PLP features are as follows. plp/plpM3N is produced from the unprocessed waveform.
- Beamformer options
- plp/plpM3? - Calculated from just one mic signal (the middle mic, M3)
- plp/plpDS? - Calculated from delay-and-sum beamformer
- plp/plpFS? - Calculated from "filter-and-sum" beamformer (currently just a one-tap MVDR)
- Noise reduction filter options
- plp/plp??N - No noise post-processing filter
- plp/plp??S - Nonlinear spectral subtraction
- plp/plp??W - Wiener filter, with the current frame used as the signal estimate
- plp/plp??A - MMSE spectral amplitude estimation
- plp/plp??L - MMSE log spectral amplitude estimation
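The three-character PLP processing code can be decoded with two lookup tables: the first two characters select the beamformer and the third selects the noise reduction filter. This is a small sketch built only from the options listed above; the function name describe_plp is hypothetical, and codes not listed here raise an error.

```python
# Beamformer options (first two characters of the processing code).
BEAMFORMER = {
    "M3": "single microphone (middle mic, M3)",
    "DS": "delay-and-sum beamformer",
    "FS": "filter-and-sum beamformer (one-tap MVDR)",
}

# Noise reduction filter options (third character of the processing code).
NOISE_FILTER = {
    "N": "no noise post-processing filter",
    "S": "nonlinear spectral subtraction",
    "W": "Wiener filter",
    "A": "MMSE spectral amplitude estimation",
    "L": "MMSE log spectral amplitude estimation",
}


def describe_plp(name):
    """Map a directory name such as 'plpDSW' to (beamformer, noise filter)."""
    if len(name) != 6 or not name.startswith("plp"):
        raise ValueError("expected 'plp' plus a three-character code: %r" % name)
    try:
        return BEAMFORMER[name[3:5]], NOISE_FILTER[name[5]]
    except KeyError:
        raise ValueError("unlisted processing code: %r" % name[3:])


print(describe_plp("plpDSW"))
```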
Video Feature Files
Video pre-processing types are currently being developed. All video features are based on vid/vidraw, which contains 708 features per frame: 4 sub-images X 177 features. The 177 features per sub-image are the lip rectangle (upper left x, y, width, height), the face rectangle (upper left x, y, width, height), and the 13x13 DCT of pixels in the lip rectangle in column order (4 + 4 + 169 = 177). The last stage of pre-processing (PCA) compresses this very large vector into just 13 static coefficients; see below.
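The arithmetic behind the 708-feature vidraw frame can be sketched as follows. The ordering used here (sub-image by sub-image, with lip rectangle, then face rectangle, then DCT coefficients inside each sub-image) is an assumption read off the description above, not a confirmed file layout, and split_vidraw_frame is a hypothetical helper.

```python
N_SUBIMAGES = 4
LIP_BOX = 4          # upper left x, y, width, height of the lip rectangle
FACE_BOX = 4         # upper left x, y, width, height of the face rectangle
DCT = 13 * 13        # 13x13 DCT of the lip rectangle pixels
PER_SUBIMAGE = LIP_BOX + FACE_BOX + DCT   # 177
TOTAL = N_SUBIMAGES * PER_SUBIMAGE        # 708


def split_vidraw_frame(frame):
    """Split one 708-feature frame into per-sub-image parts.

    Assumes sub-images are concatenated, each ordered lip box, face box, DCT.
    """
    if len(frame) != TOTAL:
        raise ValueError("expected %d features, got %d" % (TOTAL, len(frame)))
    parts = []
    for i in range(N_SUBIMAGES):
        sub = frame[i * PER_SUBIMAGE:(i + 1) * PER_SUBIMAGE]
        parts.append({
            "lip_box": sub[:LIP_BOX],
            "face_box": sub[LIP_BOX:LIP_BOX + FACE_BOX],
            "dct": sub[LIP_BOX + FACE_BOX:],
        })
    return parts
```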
- Variance normalization options
- vid/vidN?? or vid/vid??N - No variance normalization at the specified stage (variance normalization can be applied at two points: before and after the facial pose correction)
- vid/vid1?? or vid/vid??1 - Variance of each feature, computed for all files by a particular talker in a particular noise condition, normalized to 1.0
- vid/vidG?? or vid/vid??G - Variance of each feature, computed for all files by a particular talker in a particular noise condition, normalized to equal the global average variance of that feature
- Facial pose correction
- vid/vid?N? - No head-position correction
- vid/vid?1? - Lip width, height, and DCTs have been orthogonalized to the vector (lipx, lipy, facex, facey, face_wid, face_ht) using a single linear regression matrix estimated across all four sub-images
- vid/vid?4? - Same as vid?1?, except that the linear regression matrix is estimated separately for each of the four sub-images
- vid/vid?P? - The face log(width) and log(height) measures from all four sub-images are PCA-compressed to a five-dimensional PCA vector (using covariance matrix estimated only from features in this talker and noise condition). The lip log(width), lip log(height), and log(abs(DCT)) of pixels in the lip rectangle are all orthogonalized to this PCA vector, then the log is inverted to get normalized lip width and height.
- vid/vid?A? - The face yaw and pitch angles are calculated explicitly, based on the position of the lip rectangle within the face rectangle, corrected for the view angle of each camera, and averaged across the four cameras. Lip width and height in each sub-image are corrected for inter-camera pitch and yaw differences, then averaged to compute the MMSE estimate of the true lip width and height.
- PCA compression
- vid/vid???V or mix/mix???V - The vector of lip widths, lip heights, and DCT coefficients (4 sub-images X 171 coefficients = 684 coefficients) is PCA compressed to a vector of 13 coefficients. PCA transform is computed as the eigenvectors of the average, across all talkers and noise conditions, of the within-talker-within-condition covariance matrix, thus the PCA should (hopefully) emphasize features that vary a lot between words, while paying no particular attention to features that vary between talkers. The features in the output "vid" file include 13 PCA coefficients, and their deltas and delta-deltas. The features in an output "mix" file include 39 audio coefficients, and 39 video coefficients, with file type "MFCC_D."
- mix/mix???M - Same as above, but after the video PCA, an additional PCA compresses the static audio and video coefficients together into a single 13-dimensional coefficient vector. The output file contains 13 static audio/video coefficients, and their deltas and delta-deltas.
- vid/vid???N - No PCA is performed. The output video features are the lip (width,height), and 11 of the DCT coefficients ((0:2,0:2),(3,0), and (0,3)), averaged across all four sub-images.
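The PCA described for the "V" option (eigenvectors of the averaged within-talker-within-condition covariance matrix) can be sketched with NumPy as below. This is an illustration under stated assumptions rather than the code used to produce the files: the covariance matrices are averaged with equal weight per talker/condition group, no mean is removed before projection, and the function names are hypothetical.

```python
import numpy as np


def within_group_pca(groups, n_components=13):
    """Estimate a PCA basis from within-group covariance.

    groups: list of (n_frames, n_features) arrays, one per
            talker/noise-condition pair.
    Returns an (n_features, n_components) matrix of eigenvectors.
    """
    # np.cov removes each group's own mean, so differences between talkers
    # or conditions do not contribute to the covariance.
    covs = [np.cov(g, rowvar=False) for g in groups]
    # Assumption: unweighted average across groups.
    avg_cov = np.mean(covs, axis=0)
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(avg_cov)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order]


def project(frames, basis):
    """Compress (n_frames, n_features) frames to (n_frames, n_components)."""
    return frames @ basis
```

For the features described above, each group would hold 684-dimensional frames and the projection would yield the 13 static coefficients; deltas and delta-deltas would then be computed from the projected output.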
-- MarkHasegawaJohnson - 14 Jul 2006