Training, Test and Evaluation Sets for the AMI Corpus

People working on automatic annotation of the AMI corpus should, where possible, use the same designations to ensure comparability of their results, in particular for a single annotation type.

On this page you will find our (strongly encouraged) suggestion for dividing the corpus into training, development and test sets, as well as splits into 5 and 10 parts for use in cross-validation. Please note that we present three data partitions that are similar but not identical: Scenario-only, Full-corpus and Full-corpus-ASR. Make sure you state which partition you use in any publication. If a publication does not state which was used, you should assume the scenario-only partition. Full-corpus is simply a superset of scenario-only that includes non-scenario data. Full-corpus-ASR also contains scenario and non-scenario meetings, but uses a smaller proportion of the data for testing and a larger proportion for training.

Motivation

In WP5, we are doing a number of content abstraction tasks. For each task, there may be several groups using algorithms of varying computational complexity and with different data requirements. We need to be able to compare the results of systems that perform the same task but use, for instance, ten-fold cross-validation, five-fold cross-validation, one simple training set, or a dev set and a training set. For this reason, we are holding out a portion of the data that no system will see and that will be used only for evaluation.

Main division into seen and unseen data (in hours):


                          seen data        unseen eval data
----------------------    -------------    ----------------
scenario (S)              rest (61-ish)    11
non-scenario (N and B)    rest (28-ish)    2.5-3

We divide the seen data into training and development data for systems that require that distinction. The data is divided (in hours) as follows:


                train            dev
------------    -------------    -----
scenario        rest (50-ish)    11
non-scenario    rest (25-ish)    2.5-3

Scenario-only Partition of meetings

  • SA (TRAINING PART OF SEEN DATA): ES2002, ES2005, ES2006, ES2007, ES2008, ES2009, ES2010, ES2012, ES2013, ES2015, ES2016; IS1000, IS1001, IS1002 (no a), IS1003, IS1004, IS1005 (no d), IS1006, IS1007; TS3005, TS3008, TS3009, TS3010, TS3011, TS3012 (25 sets, 25*4-2 = 98 meetings)

  • SB (DEV PART OF SEEN DATA): ES2003, ES2011, IS1008, TS3004, TS3006 (5 sets, 5*4 = 20 meetings)

  • SC (UNSEEN DATA FOR EVALUATION): ES2004, ES2014, IS1009, TS3003, TS3007 (5 sets, 5*4 = 20 meetings)
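The meeting counts quoted in the list above can be checked with a short sketch. It assumes that every scenario set consists of four meetings with the suffixes a, b, c and d, apart from the two exceptions the list marks as "no a" and "no d". The names `SCENARIO_ONLY`, `MISSING` and `expand` are illustrative only and are not part of the corpus distribution.

```python
# Scenario-only partition, transcribed from the lists above.
SCENARIO_ONLY = {
    "SA": [
        "ES2002", "ES2005", "ES2006", "ES2007", "ES2008", "ES2009",
        "ES2010", "ES2012", "ES2013", "ES2015", "ES2016",
        "IS1000", "IS1001", "IS1002", "IS1003", "IS1004", "IS1005",
        "IS1006", "IS1007",
        "TS3005", "TS3008", "TS3009", "TS3010", "TS3011", "TS3012",
    ],
    "SB": ["ES2003", "ES2011", "IS1008", "TS3004", "TS3006"],
    "SC": ["ES2004", "ES2014", "IS1009", "TS3003", "TS3007"],
}

# Meetings the list marks as absent: IS1002 has no "a", IS1005 has no "d".
MISSING = {"IS1002a", "IS1005d"}


def expand(meeting_sets):
    """Expand set IDs (e.g. ES2002) into meeting IDs (ES2002a..ES2002d).

    Assumption: each set has meetings suffixed a-d unless listed in MISSING.
    """
    return [
        set_id + suffix
        for set_id in meeting_sets
        for suffix in "abcd"
        if set_id + suffix not in MISSING
    ]


meetings = {name: expand(sets) for name, sets in SCENARIO_ONLY.items()}

if __name__ == "__main__":
    for name, ids in meetings.items():
        print(name, len(SCENARIO_ONLY[name]), "sets,", len(ids), "meetings")
```

Under these assumptions the sketch reproduces the 25, 5 and 5 sets and the 98, 20 and 20 meetings given in the list.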

Look at the bottom of the page for annotation statistics, and the script used to generate them.

Scenario-only K-fold Cross-Validation

Each line (for k=10) or each pair of lines (for k=5) is one part of the partition. For each fold, you use one part for testing, the other nine (four) for training. The columns indicate from which set of the tripartition the meetings are taken - e.g. if you want to run CV using no unseen data, you'd only use meetings from the SA and SB columns.

SA                        SB        SC        k=10    k=5
----------------------    ------    ------    ----    ---
ES2002, IS1000            TS3004    ES2004    1       1
ES2007, IS1001, TS3005    -         -         2       1
ES2005, IS1002            TS3006    ES2014    3       2
ES2006, IS1003, TS3008    -         -         4       2
TS3009, IS1004            ES2003    IS1009    5       3
ES2008, ES2013, TS3010    -         -         6       3
TS3011, IS1006            ES2011    TS3003    7       4
ES2010, IS1007, ES2015    -         -         8       4
ES2009, TS3012            IS1008    TS3007    9       5
ES2012, IS1005, ES2016    -         -         10      5
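One possible way to turn the table above into train/test folds is sketched below. The row data is transcribed from the table; the helper names `parts` and `folds` are hypothetical. The sketch assumes, as the description of line pairs implies, that the k=5 part number is obtained by pairing consecutive k=10 parts (1 and 2 form part 1, 3 and 4 form part 2, and so on).

```python
# Rows of the table above: (SA sets, SB sets, SC sets).
# Row i (counting from 1) is part i of the k=10 partition.
ROWS = [
    (["ES2002", "IS1000"], ["TS3004"], ["ES2004"]),
    (["ES2007", "IS1001", "TS3005"], [], []),
    (["ES2005", "IS1002"], ["TS3006"], ["ES2014"]),
    (["ES2006", "IS1003", "TS3008"], [], []),
    (["TS3009", "IS1004"], ["ES2003"], ["IS1009"]),
    (["ES2008", "ES2013", "TS3010"], [], []),
    (["TS3011", "IS1006"], ["ES2011"], ["TS3003"]),
    (["ES2010", "IS1007", "ES2015"], [], []),
    (["ES2009", "TS3012"], ["IS1008"], ["TS3007"]),
    (["ES2012", "IS1005", "ES2016"], [], []),
]
COLUMNS = ("SA", "SB", "SC")


def parts(k, columns=COLUMNS):
    """Return {part number: [set IDs]} for k=10 or k=5, restricted to `columns`."""
    if k not in (5, 10):
        raise ValueError("k must be 5 or 10")
    result = {}
    for k10, row in enumerate(ROWS, start=1):
        # For k=5, consecutive pairs of k=10 parts are merged.
        part = k10 if k == 10 else (k10 + 1) // 2
        bucket = result.setdefault(part, [])
        for name, sets in zip(COLUMNS, row):
            if name in columns:
                bucket.extend(sets)
    return result


def folds(k, columns=COLUMNS):
    """Yield (test part number, training sets, test sets) for each fold."""
    by_part = parts(k, columns)
    for test_part in sorted(by_part):
        train = [s for p in sorted(by_part) if p != test_part for s in by_part[p]]
        yield test_part, train, by_part[test_part]


if __name__ == "__main__":
    # Cross-validation without any unseen data: use only the SA and SB columns.
    for fold, train, test in folds(5, columns=("SA", "SB")):
        print(fold, len(train), "training sets,", test)
```

Passing `columns=("SA", "SB")` corresponds to running cross-validation on seen data only, as described in the text before the table.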


Representation in NXT format

This information is encoded in NXT format in the following manner. The corpus resource file meetings.xml contains information about all meetings in the corpus. There are four relevant attributes on the 'meeting' elements:
  • visibility - values are either 'seen' or 'unseen'. 'seen' means the meeting is in the training or development set; 'unseen' marks the meetings in the evaluation set.
  • seen_type - if the visibility is 'seen', this attribute contains either 'training' or 'development'.
  • k10 - numerical value that divides the corpus into 10 parts for 10-fold cross-validation.
  • k5 - numerical value that divides the corpus into 5 parts for 5-fold cross-validation.
As with any elements and attributes in the NXT corpus, these can be queried using the NXT Query Language.
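Outside NXT, the same four attributes can also be read with a generic XML parser. The following is a minimal sketch using Python's standard library. The sample XML is a hypothetical excerpt: the root element name and the `observation` attribute used to identify a meeting are assumptions made for illustration, and the real meetings.xml may differ (for example, it may use namespaces). Only the attributes visibility, seen_type, k10 and k5 on 'meeting' elements are taken from the description above.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt of meetings.xml. The <meetings> root and the
# `observation` attribute are assumptions, not the documented file layout.
SAMPLE = """
<meetings>
  <meeting observation="ES2002a" visibility="seen" seen_type="training" k10="1" k5="1"/>
  <meeting observation="TS3004a" visibility="seen" seen_type="development" k10="1" k5="1"/>
  <meeting observation="ES2004a" visibility="unseen" k10="1" k5="1"/>
  <meeting observation="ES2007a" visibility="seen" seen_type="training" k10="2" k5="1"/>
</meetings>
"""


def load_partition(xml_text, id_attribute="observation"):
    """Return one dict per 'meeting' element with its partition attributes."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for meeting in root.iter("meeting"):
        rows.append({
            "id": meeting.get(id_attribute),
            "visibility": meeting.get("visibility"),
            "seen_type": meeting.get("seen_type"),  # None for unseen meetings
            "k10": int(meeting.get("k10")),
            "k5": int(meeting.get("k5")),
        })
    return rows


if __name__ == "__main__":
    rows = load_partition(SAMPLE)
    training = [r["id"] for r in rows if r["seen_type"] == "training"]
    unseen = [r["id"] for r in rows if r["visibility"] == "unseen"]
    print("training:", training)
    print("unseen:", unseen)
```

To run this against the real resource file, read the file contents and pass the correct identifying attribute name as `id_attribute`.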

Full-corpus partition of meetings

This is a strict superset of the scenario-only partition. We redundantly include all that information here:

  • SA (TRAINING PART OF SEEN DATA): ES2002, ES2005, ES2006, ES2007, ES2008, ES2009, ES2010, ES2012, ES2013, ES2015, ES2016; IS1000, IS1001, IS1002 (no a), IS1003, IS1004, IS1005 (no d), IS1006, IS1007; TS3005, TS3008, TS3009, TS3010, TS3011, TS3012, EN2001, EN2003, EN2004a, EN2005a, EN2006, EN2009, IN1001, IN1002, IN1005, IN1007, IN1008, IN1009, IN1012, IN1013, IN1014, IN1016

  • SB (DEV PART OF SEEN DATA): ES2003, ES2011, IS1008, TS3004, TS3006, IB4001, IB4002, IB4003, IB4004, IB4010, IB4011

  • SC (UNSEEN DATA FOR EVALUATION): ES2004, ES2014, IS1009, TS3003, TS3007, EN2002

Note that IB4005 does not appear because it has speakers in common with two sets of data.

Full-corpus-ASR partition of meetings

The only difference between this partition and the full-corpus partition above is that there is more training data: we move ES2014 and TS3007 from the evaluation set (SC) to training, and ES2003 and TS3006 from dev (SB) to training. This leaves us with about 9 hours of meetings for the dev and eval (SB and SC) sets, and about 81 hours for training. This is deemed more suitable for a speech recognition task.

  • SA (TRAINING PART OF SEEN DATA): ES2002, ES2003, ES2005, ES2006, ES2007, ES2008, ES2009, ES2010, ES2012, ES2013, ES2014, ES2015, ES2016; IS1000, IS1001, IS1002 (no a), IS1003, IS1004, IS1005 (no d), IS1006, IS1007; TS3005, TS3006, TS3007, TS3008, TS3009, TS3010, TS3011, TS3012, EN2001, EN2003, EN2004a, EN2005a, EN2006, EN2009, IN1001, IN1002, IN1005, IN1007, IN1008, IN1009, IN1012, IN1013, IN1014, IN1016

  • SB (DEV PART OF SEEN DATA): ES2011, IS1008, TS3004, IB4001, IB4002, IB4003, IB4004, IB4010, IB4011

  • SC (UNSEEN DATA FOR EVALUATION): ES2004, IS1009, TS3003, EN2002
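The relationship between the full-corpus and full-corpus-ASR partitions can be sketched as a small transformation: start from the full-corpus lists and move the four named sets into training. The names `FULL`, `MOVES` and `derive_asr` are illustrative only; the lists are transcribed from this page.

```python
# Full-corpus partition, transcribed from the lists on this page.
FULL = {
    "SA": [
        "ES2002", "ES2005", "ES2006", "ES2007", "ES2008", "ES2009",
        "ES2010", "ES2012", "ES2013", "ES2015", "ES2016",
        "IS1000", "IS1001", "IS1002", "IS1003", "IS1004", "IS1005",
        "IS1006", "IS1007",
        "TS3005", "TS3008", "TS3009", "TS3010", "TS3011", "TS3012",
        "EN2001", "EN2003", "EN2004a", "EN2005a", "EN2006", "EN2009",
        "IN1001", "IN1002", "IN1005", "IN1007", "IN1008", "IN1009",
        "IN1012", "IN1013", "IN1014", "IN1016",
    ],
    "SB": ["ES2003", "ES2011", "IS1008", "TS3004", "TS3006",
           "IB4001", "IB4002", "IB4003", "IB4004", "IB4010", "IB4011"],
    "SC": ["ES2004", "ES2014", "IS1009", "TS3003", "TS3007", "EN2002"],
}

# Sets moved into training, mapped to the set they are taken from.
MOVES = {"ES2014": "SC", "TS3007": "SC", "ES2003": "SB", "TS3006": "SB"}


def derive_asr(full):
    """Return a copy of `full` with the sets in MOVES moved into SA."""
    asr = {name: list(sets) for name, sets in full.items()}
    for set_id, source in MOVES.items():
        asr[source].remove(set_id)  # raises ValueError if the set is not there
        asr["SA"].append(set_id)
    return asr


if __name__ == "__main__":
    asr = derive_asr(FULL)
    for name in ("SA", "SB", "SC"):
        print(name, len(asr[name]), "entries")
    # The three sets must remain pairwise disjoint after the move.
    assert not set(asr["SA"]) & set(asr["SB"])
    assert not set(asr["SA"]) & set(asr["SC"])
    assert not set(asr["SB"]) & set(asr["SC"])
```

The resulting SB and SC lists match the full-corpus-ASR lists above, and SA contains the same entries in a different order.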