Description of file formats, file naming conventions, and mappings between microphone/camera numbers and participant roles.
Details
Details about audio and video pre-processing, file formats, and mappings
between meeting participants and audio/video channel IDs are presented
below. For information about file naming conventions, see AMI
Corpus Meeting IDs Explained.
Pre-processing
Audio
Audio was downsampled from 48 kHz to 16 kHz and made available as WAV
files, 24 for each meeting (one per audio channel). Data were also
encoded in RealMedia format for streaming from the media file server.
To aid the transcription of meetings, a simple energy-based technique
was used to derive a speech/silence segmentation for each person in a
meeting from the corresponding lapel microphone recordings. (See the
paper by Lathoud, McCowan, and Odobez.)
At each time frame, the lapel with the most energy was selected, and an
automatic energy threshold, derived by EM training of a bi-Gaussian
model on log energy values, was applied to classify the frame as speech
or silence. For each meeting participant, the segmentation output
(speech and silence segments) was then smoothed with a low-pass filter.
The output is a valid XML file in the correct format for use in
Channeltrans, the software used for transcribing the meetings, as
described in Transcription.
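As a concrete illustration, the following Python sketch implements a segmentation of this general shape: per-frame log energies are computed for each lapel channel, a two-component Gaussian mixture is fitted to them by EM, the midpoint of the two component means stands in for the automatic threshold, and the resulting speech/silence labels are smoothed. All names, frame sizes, and the exact threshold rule are illustrative assumptions, not the actual AMI tooling.

    import numpy as np

    def frame_log_energies(signal, frame_len=400, hop=160):
        """Per-frame log energy (25 ms frames, 10 ms hop at 16 kHz).
        `signal` is a 1-D float array."""
        n_frames = 1 + (len(signal) - frame_len) // hop
        frames = np.stack([signal[i * hop : i * hop + frame_len]
                           for i in range(n_frames)])
        return np.log((frames ** 2).sum(axis=1) + 1e-12)

    def fit_bigaussian(x, n_iter=50):
        """EM for a two-component 1-D Gaussian mixture on log energies."""
        mu = np.percentile(x, [25.0, 75.0])        # silence / speech initialisation
        var = np.full(2, x.var())
        w = np.full(2, 0.5)
        for _ in range(n_iter):
            # E-step: posterior responsibility of each component per frame
            lik = (w * np.exp(-((x[:, None] - mu) ** 2) / (2 * var))
                   / np.sqrt(2 * np.pi * var))
            resp = lik / (lik.sum(axis=1, keepdims=True) + 1e-300)
            # M-step: re-estimate weights, means, and variances
            nk = resp.sum(axis=0)
            w = nk / len(x)
            mu = (resp * x[:, None]).sum(axis=0) / nk
            var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        return mu, var, w

    def segment(channels):
        """channels: equal-length 1-D arrays, one lapel signal per participant.
        Returns one boolean speech/silence track per participant."""
        log_e = np.stack([frame_log_energies(c) for c in channels])
        loudest = log_e.argmax(axis=0)             # lapel with most energy per frame
        tracks = []
        for ch, e in enumerate(log_e):
            mu, _, _ = fit_bigaussian(e)
            thr = mu.mean()                        # stand-in for the exact crossing point
            speech = (e > thr) & (loudest == ch)
            # crude low-pass smoothing: 11-frame moving average, re-thresholded
            smooth = np.convolve(speech.astype(float),
                                 np.ones(11) / 11, "same") > 0.5
            tracks.append(smooth)
        return tracks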
Video
Digital video tapes from the recordings at Edinburgh and IDIAP were
digitized via Firewire and stored to disk using version 5.2.1 of the
DivX AVI codec. The AVIs were encoded at a bitrate of 2300 kbps with a
maximum interval of 25 frames between consecutive MPEG keyframes.
No separate processing stage was required at TNO, since encoding was
part of the video capture process. The start of the video collected at
TNO was either trimmed or padded to synchronize the first video frame
with the first audio sample. Unfortunately, the encoding process
resulted in dropped frames in the video signals, which caused the video
to fall out of sync with the audio as the meeting progressed. To
overcome this, the video signals were repaired by replacing each
dropped frame with a copy of the previous frame, thereby restoring
synchronization for the entire meeting.
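The repair step can be pictured with a minimal sketch along the following lines, assuming decoded frames arrive as (timestamp, image) pairs at a nominal 25 fps; the function name and interface are hypothetical:

    def repair_dropped_frames(frames, fps=25.0):
        """frames: list of (timestamp_seconds, image) pairs from the decoder.
        Returns a constant-frame-rate stream in which every dropped frame is
        replaced by a copy of the frame that preceded it."""
        period = 1.0 / fps
        repaired = []
        t_expected = frames[0][0]
        prev_image = frames[0][1]
        for t, image in frames:
            # emit duplicates of the previous frame for any slots that were skipped
            while t - t_expected > period / 2:
                repaired.append((t_expected, prev_image))
                t_expected += period
            repaired.append((t_expected, image))
            t_expected += period
            prev_image = image
        return repaired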
The image resolution of the videos is high (720x576), sufficient for person location and facial feature analysis. To save storage and download time, all videos have also been encoded at a lower bitrate and made available at a smaller size (350x280) for research purposes that do not require high-resolution video. Low-bitrate (50 kbps) RealMedia files of the combined audio and video are also available.
File formats and channel mappings for scenario meetings
Details about the collection, pre-processing, and file formats of auxiliary data, i.e. the output from handwritten notes, the whiteboard, and PowerPoint slides, are presented below.
Collection
To capture auxiliary data, a separate capture PC is used. It reads the
MIDI Time Code (MTC) generated by the MTP-AV to accurately time-stamp
the data and ensure it is synchronized with the audio and video
streams.
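MIDI Time Code carries a SMPTE-style hours:minutes:seconds:frames timestamp, so relating it to the other streams amounts to a conversion like the following sketch (the 25 fps PAL timebase is an assumption, not a documented fact):

    def mtc_to_seconds(hours, minutes, seconds, frames, fps=25):
        """Convert a SMPTE-style hh:mm:ss:ff timestamp carried by MIDI Time
        Code into seconds; 25 fps (PAL) is an assumed, not documented, rate."""
        return hours * 3600 + minutes * 60 + seconds + frames / fps

    # e.g. mtc_to_seconds(0, 12, 34, 10) == 754.4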
Handwritten notes
Each participant has access to a Logitech I/O digital pen throughout
the scenario. The pen stores the time-stamped x-y coordinates of any
pen strokes made on special paper that contains an embedded 2D bar
code. The pen strokes are then downloaded to the auxiliary capture PC
as XML files for subsequent processing. The pens are synchronized with
the auxiliary capture PC at the beginning and end of each meeting.
Since they are not connected to the synchronization equipment during
the meeting, precise calibration cannot be guaranteed. In practice, the
pens' internal clocks drift by no more than a few seconds during each
meeting, resulting in sufficiently well-calibrated data.
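A sketch of how such data might be read and drift-corrected is given below. The XML element and attribute names are hypothetical, as the actual pen schema is not documented here; the drift correction simply maps pen time linearly onto the reference clock using the two synchronization points:

    import xml.etree.ElementTree as ET

    def load_strokes(path):
        """Parse time-stamped pen strokes. The <stroke>/<point> element and
        attribute names are hypothetical; check them against the real files."""
        strokes = []
        for stroke in ET.parse(path).getroot().iter("stroke"):
            strokes.append([(float(p.get("x")), float(p.get("y")),
                             float(p.get("time")))
                            for p in stroke.iter("point")])
        return strokes

    def correct_drift(t, t0_pen, t0_ref, t1_pen, t1_ref):
        """Map a pen timestamp linearly onto the reference clock using the
        synchronisation points taken at the start and end of the meeting."""
        scale = (t1_ref - t0_ref) / (t1_pen - t0_pen)
        return t0_ref + (t - t0_pen) * scale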
Whiteboard
An eBeam System 2 digital whiteboard is used to capture any pen strokes
the participants make. These are stored in XML format as time-stamped
x-y coordinates of the pen.
Beamer
Any slides presented on the beamer are captured via a VisionRGB-Pro VGA
capture card installed in the auxiliary capture PC, and stored as JPEG
images. Each image is time-stamped using the MTC for accurate
integration with the other data streams.
Pre-processing
The XML files generated by the eBeam whiteboard capture system and the
Logitech I/O pens are converted into JPEG images showing what the
participants wrote. A DivX movie of these files, synchronized with the
audio and video recordings, is also produced. This shows when the
strokes were made, and whether they were made on the whiteboard or in a
participant's notebook. A transcription of the written data is also
generated using a handwriting recognition technique based on Liwicki
and Bunke (2005). Optical character recognition based on the technique
described in Chen et al. (2004) is used to produce transcriptions of
the captured slides. This, along with the captured JPEG images and the
HTML pages to which they are attached, forms the pen and whiteboard
data component of the database.
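As an illustration of the stroke-to-image conversion, the following sketch rasterizes parsed strokes into a still image using the Pillow library; the page size, coordinate scaling, and function names are assumptions:

    from PIL import Image, ImageDraw

    def render_page(strokes, size=(800, 1100), path="page.jpg"):
        """Rasterize one page of strokes (point lists as returned by a parser
        like load_strokes above) into a JPEG still. Coordinates are assumed
        to be pre-scaled to pixels."""
        img = Image.new("L", size, 255)            # white page, greyscale
        draw = ImageDraw.Draw(img)
        for points in strokes:
            draw.line([(x, y) for x, y, _ in points], fill=0, width=2)
        img.save(path, "JPEG")

The accompanying movie rendering would replay the same strokes in timestamp order rather than drawing them all at once.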
File Formats
For an explanation of AMI meeting IDs, see AMI Corpus Meeting IDs
Explained.
Pen data
When available, data from participants' handwritten notes are stored
alongside the meeting's audio and video files in the /pens
subdirectory. For each page of written notes, there should be three
files, with AVI, JPEG, and PEN extensions. File IDs take the form [meetingID].pen[1-4]-Page[0-9][0-9][0-9]_[0-9][0-9].[0-9][0-9].200[0-9].
The digits after ‘Page’ give the notebook page number, followed by an
underscore and the date the meeting took place, e.g. ‘03.10.2005’.
AVI files are video clips that show a redrawing of all pen stroke
sequences for a given page (although not at the speed of the original
writing). Still images showing a page of notes are stored in JPEG
files. Raw pen data may be found in XML files with the PEN extension.
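For example, a file ID of this form can be decomposed with a regular expression such as the following (the sample ID is illustrative only):

    import re

    # Hypothetical helper that splits a pen file ID into its documented parts.
    PEN_ID = re.compile(r"(?P<meeting>.+)\.pen(?P<pen>[1-4])"
                        r"-Page(?P<page>[0-9]{3})"
                        r"_(?P<date>[0-9]{2}\.[0-9]{2}\.200[0-9])")

    m = PEN_ID.match("ES2008a.pen1-Page003_03.10.2005")  # illustrative ID
    if m:
        print(m.group("meeting"), m.group("pen"),
              m.group("page"), m.group("date"))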
Whiteboard data
Whiteboard data can be found in a meeting's /whiteboard
subdirectory, and also consist of a set of AVI, JPEG, and XML files.
File IDs are very straightforward, taking the form [meetingID].strokes.
As described above for pen data, AVI files encode short video clips in
which pen stroke sequences are redrawn. Still images of whiteboard
content are stored in JPEG files, and raw whiteboard data are in XML
files.
Slide data
The "slides" directory contains JPEG files that are screenshots of the
automatically captured projection. A software was integrated into the
system to automatically detect the slide changes. Associated plain text
files are the automatic optical character recognition outputs. A set of
HTML files allows to quickly browse the slides on a web browser; two
XML files are for use in JFerret browser.
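The corpus documentation does not specify the detection method; a simple stand-in is frame differencing on the captured VGA frames, sketched below with an illustrative name and threshold:

    import numpy as np

    def detect_slide_changes(frames, threshold=0.02):
        """Flag captured frames whose mean absolute pixel difference from the
        previous frame exceeds `threshold` (as a fraction of full scale).
        `frames` is a list of uint8 image arrays of equal shape."""
        changes = []
        prev = frames[0].astype(np.float32) / 255.0
        for i, f in enumerate(frames[1:], start=1):
            cur = f.astype(np.float32) / 255.0
            if np.abs(cur - prev).mean() > threshold:
                changes.append(i)                  # a new slide starts here
            prev = cur
        return changes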