Audio and Video Signals

(Source http://www.idiap.ch/amicorpus/documentations/audio-and-video-signals)

Description of file formats, file naming conventions, and mappings between mic/camera numbers and roles, etc.
Details about audio and video pre-processing, file formats, and the mappings between meeting participants and audio/video channel IDs are presented below. For information about file naming conventions, see AMI Corpus Meeting IDs Explained.

Pre-processing
Audio
Audio was downsampled from 48kHz to 16kHz, and made available as WAV files — 24 for each meeting (one per audio channel). Data were also encoded in RealMedia format for streaming from the media file server. To aid in the transcription of meetings, a simple energy-based technique was used to provide a speech/silence segmentation for each person in a meeting from the corresponding lapel mic recordings (see the paper by Lathoud, McCowan, and Odobez). At each time frame, the lapel with the most energy was selected, and an automatic energy threshold — derived from EM training of a bi-Gaussian model on log energy values — was applied to classify the frame as speech or silence. For each meeting participant, the segmentation output (speech and silence segments) was then smoothed with a low-pass filter. The output is a valid XML file in the correct format for use in Channeltrans, the software used for transcribing the meetings, as described in Transcription.
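The segmentation recipe above (per-frame log energy, a bi-Gaussian model trained with EM, thresholding, then smoothing) can be sketched roughly as follows. The frame size, the EM details, and the use of the midpoint of the two component means as the decision threshold are illustrative assumptions, not the exact AMI implementation:

```python
import numpy as np

def frame_log_energy(samples, rate=16000, frame_ms=32):
    """Log energy per non-overlapping frame (frame size is an assumption)."""
    n = int(rate * frame_ms / 1000)
    frames = samples[: len(samples) // n * n].reshape(-1, n)
    return np.log((frames ** 2).sum(axis=1) + 1e-10)

def bi_gaussian_threshold(x, iters=50):
    """Fit a 2-component 1-D Gaussian mixture to log energies with EM and
    return a speech/silence decision threshold."""
    mu = np.array([x.min(), x.max()], dtype=float)   # one mode per class
    var = np.full(2, x.var()) + 1e-6
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibility of each component for each frame
        p = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        r = p / p.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances
        nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
        pi = nk / len(x)
    return mu.mean()   # midpoint of the two means (an assumed decision rule)

def speech_mask(log_e, thr, smooth=5):
    """Threshold, then smooth with a short moving average (a crude stand-in
    for the low-pass filtering of segment boundaries)."""
    mask = (log_e > thr).astype(float)
    return np.convolve(mask, np.ones(smooth) / smooth, mode="same") > 0.5
```

Note that the AMI pipeline additionally picks, at each frame, the lapel channel with the most energy before classifying; that cross-channel selection step is omitted here for brevity.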

Video
Digital video tapes from the recordings at Edinburgh and IDIAP were digitized via FireWire and stored to disk using the DivX 5.2.1 AVI codec. The AVIs were encoded at a bitrate of 2300Kbps with a maximum interval of 25 frames between two consecutive MPEG keyframes.

No separate processing stage was required at TNO since encoding was part of the video capture process. The start of video collected at TNO was either trimmed or padded to synchronize the first video frame with the first audio sample. Unfortunately, the encoding process resulted in dropped frames in the video signals, which caused the video to fall out of sync with the audio as the meeting progressed. To overcome this, the video signals were repaired: dropped frames were replaced with copies of the previous frame, restoring synchronization for the entire meeting.
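The dropped-frame repair can be illustrated with a small sketch: given decoded frames and their timestamps, each missing slot on the nominal frame grid is filled with a copy of the previous frame. The helper name and the slot-rounding logic are assumptions for illustration; the actual repair was performed on the encoded video streams:

```python
def repair_dropped_frames(frames, timestamps, fps=25.0):
    """Rebuild a constant-frame-rate sequence: any slot on the nominal
    frame grid with no decoded frame is filled with a copy of the
    previous frame (hypothetical helper)."""
    period = 1.0 / fps
    t0 = timestamps[0]
    repaired = []
    for frame, ts in zip(frames, timestamps):
        slot = round((ts - t0) / period)     # nearest nominal frame slot
        while len(repaired) < slot:          # skipped slots = dropped frames
            repaired.append(repaired[-1])    # duplicate the previous frame
        repaired.append(frame)
    return repaired
```

For example, at 25 fps a frame timestamped 0.12s belongs in slot 3; if only slots 0, 1, and 3 arrived, slot 2 is filled with a duplicate of the slot-1 frame, keeping later frames aligned with the audio.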

The image resolution of the videos is high (720x576), sufficient for person location and facial feature analysis. To save storage and download time, all videos have also been encoded at a lower bitrate and made available at a smaller size (350x280) for research purposes that do not require high-resolution video. Low bitrate (50Kbps) RealMedia files of the combined audio and video are also available.

File formats and channel mappings for scenario meetings

If you download the annotations, you will find this information in machine-readable form as corpusResources/meetings.xml. The audio and video files follow this naming convention:
<meeting id>.<media type>.<extension>
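A minimal parser for this convention might look like the following; the example file name in the test below is an assumption for illustration, since media-type strings vary per signal:

```python
def parse_media_filename(name):
    """Split an AMI media file name of the form
    <meeting id>.<media type>.<extension> into its three parts.
    Splitting from the right keeps any meeting-ID conventions intact."""
    meeting_id, media_type, ext = name.rsplit(".", 2)
    return {"meeting": meeting_id, "media": media_type, "ext": ext}
```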

Pen, whiteboard and slide data

Details about the collection, pre-processing, and file formats of auxiliary data, i.e. the output from handwritten notes, the whiteboard, and PowerPoint slides.

Collection
To capture auxiliary data, a separate capture PC is used, which reads the MIDI Time Code (MTC) generated by the MTP-AV to accurately time-stamp the data and ensure it is synchronized with the audio and video streams.

Handwritten notes
Each participant has access to a Logitech I/O digital pen throughout the scenario. The pen stores the time-stamped x-y co-ordinates of any pen strokes made on special paper that contains an embedded 2D bar code. The pen strokes are then downloaded to the auxiliary capture PC as XML files for subsequent processing. The pens are synchronized to the auxiliary capture PC at the beginning and end of each meeting. Since they are not connected to the synchronization equipment during the meeting, precise calibration cannot be guaranteed. In practice, the pens' internal clocks do not drift by more than a few seconds during each meeting, resulting in sufficiently calibrated data.

Whiteboard
An eBeam System 2 digital whiteboard is used to capture any pen strokes the participants make. These are stored in XML format as time stamped x-y coordinates of the pen.

Beamer
Any slides presented on the beamer are captured via a VisionRGB-Pro VGA capture card, installed in the auxiliary capture PC and stored as JPEG images. Each image is time-stamped using the MTC for accurate integration with other data streams.

Pre-processing
The XML files generated by the eBeam whiteboard capture system and the Logitech I/O pens are converted into JPEG images showing what the participants wrote. A DivX movie of these files, synchronized with the audio and video recordings, is also produced. This shows when the strokes were made, and whether they were made on the whiteboard or in a participant's notebook. A transcription of the written data is also generated using an optical character recognition technique based on Liwicki and Bunke (2005). Optical character recognition based on the technique described in Chen et al. (2004) is used to produce transcriptions of the captured slides. This, along with the captured JPEG images and the HTML pages to which they are attached, forms the pen and whiteboard data component of the database.

File Formats
For an explanation of AMI meeting IDs, see AMI Corpus Meeting IDs Explained.

Pen data
When available, data from participants' handwritten notes are stored alongside the meeting's audio and video files in the /pens subdirectory. For each page of written notes, there should be three files, with AVI, JPEG, and PEN extensions. File IDs take the form [meetingID].pen[1-4]-Page[0-9][0-9][0-9]_[0-9][0-9].[0-9][0-9].200[0-9]. The digits after ‘Page’ give the notebook page number; this is followed by an underscore and the date the meeting took place, e.g. ‘03.10.2005’. AVI files are video clips that redraw all pen stroke sequences for a given page (though not in real time). Still images showing a page of notes are stored in JPEG files. Raw pen data may be found in XML files with the PEN extension.
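The file-ID pattern above can be checked with a small regular expression. The three-digit page field and the date layout follow the printed pattern literally, the meeting-ID part is matched loosely (anything without a dot), and the example ID in the test is illustrative:

```python
import re

# Transcription of the pen file-ID convention:
# [meetingID].pen[1-4]-Page[0-9][0-9][0-9]_[0-9][0-9].[0-9][0-9].200[0-9]
PEN_ID = re.compile(
    r"^(?P<meeting>[^.]+)\.pen(?P<pen>[1-4])-Page(?P<page>\d{3})"
    r"_(?P<date>\d{2}\.\d{2}\.200\d)$"
)

def parse_pen_id(file_id):
    """Return the parsed fields of a pen file ID, or None if the ID
    does not match the convention."""
    m = PEN_ID.match(file_id)
    return m.groupdict() if m else None
```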

Whiteboard data
Whiteboard data can be found in a meeting's /whiteboard subdirectory, and also consist of a set of AVI, JPEG, and XML files. File IDs are very straightforward, taking the form [meetingID].strokes. As described above for pen data, AVI files encode short video clips in which pen stroke sequences are redrawn. Still images of whiteboard content are stored in JPEG files, and raw whiteboard data are in XML files.

Slide data
The "slides" directory contains JPEG files that are screenshots of the automatically captured projection. Software integrated into the capture system automatically detects slide changes. The associated plain text files contain the automatic optical character recognition output. A set of HTML files allows the slides to be browsed quickly in a web browser; two XML files are for use in the JFerret browser.


Copyright © 2006 by AMI project, All Rights Reserved.