SYNTAX OF THE CAVIAR XML GROUND TRUTH FILES
VN: Oct 7, 2005
================================================================
The grammar and meaning of the ground truth files are described below.
First, some details about the video sequences and their labelling.
* There are 78 sequences, totalling about 90K frames, with between 300 and
2500 frames each, all hand-labelled.
* The labelling was done by hand using a JAVA-based interactive tool.
* Each sequence is labelled frame-by-frame. The frames within a
sequence are linked by the tracked individuals and groups, which have
a persistent and unique identifier within each sequence.
* Each video frame has a set of tracked objects visible in that
frame. The object is tightly surrounded by a bounding box, which
also has a dominant orientation. Groups also list which individual
boxes participate in the group.
* Normally, each tracked individual or group will have a single
context label, which is recorded in every frame in which the
person appears. The context involves the person in a
sequence of situations, which are also labelled in each frame.
The person will also have labelling about how much s/he is moving
(inactive, active, walking, running).
There are some additional labels as listed below.
* Some areas of the sequences were not labelled (the reception desk in
the lobby sequences, the minor corridor view in the shopping centre
sequences) because these areas were observed at too poor a quality or for
too short an interval.
* Variability of the ground truth: Since the labeling was done by humans,
there is a natural variation in both the parameters and occurrence of the
labels, e.g. the positions and sizes of the bounding boxes, or when the box
or activity starts. Knowing the range of human variation will help with
comparison to automatic calculations of the statistics.
To help assess this variation, one of the datasets has three labelings by
different individuals. For an analysis of these labelings, see:
T. List, J. Bins, J. Vazquez, R. B. Fisher, "Performance Evaluating the
Evaluator", Proc. 2nd Joint IEEE Int. Workshop on Visual Surveillance and
Performance Evaluation of Tracking and Surveillance,
(VS-PETS), pp ***-***, Beijing, Oct 2005.
* We have taken the position of an omniscient labeler, so all contexts are
labeled as they actually are, although the system may not be able to
correctly label the context until many frames in the future.
The main labeling difficulty is one of timing: when does one situation
or context change into another?
We have assumed that differences in this will be the sort of natural
variation assessed as described above.
The labeling of the roles/situations/contexts was problematic.
It was often unclear how each of the labels was to be used.
We attempted to keep the labeling at least internally consistent by
having a single person coordinate and review all labels.
Therefore, the symbolic labeling is based on a best-guess representation
of the final activity model.
* We have attempted to define a group as a set of individuals that are
reacting to each other. This means that individuals may pass each other,
e.g. one behind the other, without interacting and thus not forming a group.
The human labelers can usually make this judgment, but it is less
likely that an automatic labeler will be able to distinguish
all instances of interaction. Thus, there are probably going to be a lot
of false alarms on group box detection (i.e. individuals who are really
not interacting, but just passing closely).
Similarly, we grouped interacting individuals regardless of the
distance between them, starting from the frame in which they
first seemed to react to each other. For example, if two people wave while
still quite distant and then turn to approach each other, the group box and
labeling starts in the frame where the two noticed each other and initiated
the waving.
* Multiple versus unique labels: Should an individual (or a group) have more
than one role label and participate in more than one situation and context
at the same time? In labeling, we decided that only a single
classification applies in each frame.
* In the lobby sequences, there are about 100 frames of a person
hand-signalling the topic of the sequence at the start. These were
intended to be removed but later we decided that they were useful data, too.
* Because of JPEG compression, there are various compression artifacts
in the decompressed image data. While this is an annoyance, it is typical
of current stored image sequences, so we decided to work with the
slightly degraded sequences.
* Some tracked targets in 17 of the sequences have had additional information
about their head, gaze direction, hand, feet and shoulder positions added.
(Not all sequences were annotated, as we had no more time or funding at the moment.)
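Since box positions and sizes vary between labelers (see the variability
note above), the agreement between two labelings, or between a labeling and
an automatic tracker, is usually scored with an intersection-over-union
overlap measure. A minimal sketch, assuming boxes are given as centre
coordinates plus width and height (the exact box convention should be
checked against the files):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (xc, yc, w, h),
    i.e. centre coordinates plus width and height (an assumed convention).
    Returns a value in [0, 1]: 1 for identical boxes, 0 for disjoint ones."""
    # Convert centre/size to corner coordinates.
    ax0, ay0 = a[0] - a[2] / 2.0, a[1] - a[3] / 2.0
    ax1, ay1 = a[0] + a[2] / 2.0, a[1] + a[3] / 2.0
    bx0, by0 = b[0] - b[2] / 2.0, b[1] - b[3] / 2.0
    bx1, by1 = b[0] + b[2] / 2.0, b[1] + b[3] / 2.0
    # Overlap extents, clamped to zero when the boxes do not intersect.
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

# Two labelers placing slightly different boxes around the same person:
print(iou((100, 120, 20, 50), (102, 121, 22, 48)))
```

A threshold on this overlap (commonly around 0.5) can then decide whether
two boxes count as the same detection when comparing labelings.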
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% XML Description
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
The hierarchy:

<frame number="...">
  <objectlist>
    <object id="...">
      <orientation>ORIENTATION</orientation>
      <box h="..." w="..." xc="..." yc="..."/>
      <appearance>APPEARANCE_FIELD</appearance>
      <hypothesislist>
        <hypothesis>
          <movement>MOVEMENT_FIELD</movement>
          <role>ROLE_FIELD</role>
          <situation>SIT_FIELD</situation>
          <context>CONTEXT_FIELD</context>
        </hypothesis>
        ......... Other hypotheses for this individual object
      </hypothesislist>
    </object>
    ......... Other individual objects
  </objectlist>
  <grouplist>
    <group id="...">
      <orientation>ORIENTATION</orientation>
      <box h="..." w="..." xc="..." yc="..."/>
      <appearance>GAPPEARANCE_FIELD</appearance>
      <members>MEMBER_LIST</members>
      <hypothesislist>
        <hypothesis>
          <movement>GMOVEMENT_FIELD</movement>
          <role>GROLE_FIELD</role>
          <situation>GSIT_FIELD</situation>
          <context>GCONTEXT_FIELD</context>
        </hypothesis>
      </hypothesislist>
    </group>
    ......... Other group objects
  </grouplist>
</frame>
The fields in upper case above are numbers or specific text strings, and
are defined below.
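As a sketch of how such a file might be read, the fragment below uses the
Python standard library. The tag and attribute names used here (frame,
objectlist, object, box, hypothesislist, movement, and so on) are
assumptions and should be verified against an actual ground-truth file:

```python
# Sketch: reading tracked individuals from one CAVIAR-style ground-truth
# file.  Tag and attribute names are assumptions -- check a real file.
import xml.etree.ElementTree as ET

SAMPLE = """\
<dataset name="example">
  <frame number="1">
    <objectlist>
      <object id="0">
        <orientation>90</orientation>
        <box h="50" w="20" xc="100" yc="120"/>
        <hypothesislist>
          <hypothesis id="1" evaluation="1.0">
            <movement evaluation="1.0">walking</movement>
            <role evaluation="1.0">walker</role>
            <situation evaluation="1.0">moving</situation>
            <context evaluation="1.0">walking</context>
          </hypothesis>
        </hypothesislist>
      </object>
    </objectlist>
  </frame>
</dataset>
"""

def tracked_boxes(xml_text):
    """Return (frame number, object id, box attributes, movement label)
    tuples for every individual object in the file."""
    root = ET.fromstring(xml_text)
    rows = []
    for frame in root.iter("frame"):
        fnum = int(frame.get("number"))
        for obj in frame.iter("object"):
            box = dict(obj.find("box").attrib)
            move = obj.find(".//movement")  # label from the first hypothesis
            rows.append((fnum, int(obj.get("id")), box,
                         move.text if move is not None else None))
    return rows

print(tracked_boxes(SAMPLE))
```

For a real file, `ET.parse(path).getroot()` would replace `ET.fromstring`,
and groups would be read analogously from the group elements.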
The file contains two kinds of objects