Training, Test and Evaluation Sets for the AMI Corpus

People working on automatic annotation of the AMI corpus should, where possible, use the same designations to ensure comparability of their results, in particular for a single annotation type.

On this page you will find our (strongly encouraged) suggestion for a division of the corpus into training, development and test sets, as well as a split into 5 and 10 parts for use in cross-validation. Please note that we present three data partitions which are similar but not identical: Scenario-only, Full-corpus and Full-corpus-ASR. Please state which partition you use in any publication. If a publication does not state which is used, you should assume the scenario-only partition. Full-corpus is a superset of Scenario-only that also includes the non-scenario data. Full-corpus-ASR contains scenario and non-scenario meetings too, but has a smaller proportion of the data as test data and a larger proportion for training.

Motivation

In WP5, we are doing a number of content abstraction tasks. For each task, there may be several groups using algorithms of varying computational complexity and with different data requirements. We need to be able to compare, for instance, the results of systems for doing the same task that use ten-fold cross-validation, five-fold cross-validation, one simple training set, or a dev set and a training set. For this reason, we are holding out a portion of the data that no system will see but that will be used only for evaluation.

Main division into seen and unseen data (in hours):

	seen data	unseen eval data
scenario (S)	rest (61-ish)	11
non-scenario (N and B)	rest (28-ish)	2.5-3

We divide the seen data into training and development data for systems that require that distinction. The data is divided (in hours) as follows:

	train	dev
scenario	rest (50-ish)	11
non-scenario	rest (25-ish)	2.5-3

Scenario-only Partition of meetings

SA (TRAINING PART OF SEEN DATA): ES2002, ES2005, ES2006, ES2007, ES2008, ES2009, ES2010, ES2012, ES2013, ES2015, ES2016; IS1000, IS1001, IS1002 (no a), IS1003, IS1004, IS1005 (no d), IS1006, IS1007; TS3005, TS3008, TS3009, TS3010, TS3011, TS3012 (25 sets, 25*4-2 = 98 meetings)
SB (DEV PART OF SEEN DATA): ES2003, ES2011, IS1008, TS3004, TS3006 (5 sets, 5*4 = 20 meetings)
SC (UNSEEN DATA FOR EVALUATION): ES2004, ES2014, IS1009, TS3003, TS3007 (5 sets, 5*4 = 20 meetings)
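
The meeting counts quoted above can be checked with a short sketch. It assumes that each series consists of four meetings with the suffixes a to d (so ES2002 expands to ES2002a ... ES2002d), which is implied by the "no a" and "no d" notes but not stated outright; the names SCENARIO_ONLY, MISSING and expand are illustrative only.

```python
# Meeting series in each part of the scenario-only partition, as listed above.
SCENARIO_ONLY = {
    "SA": [
        "ES2002", "ES2005", "ES2006", "ES2007", "ES2008", "ES2009",
        "ES2010", "ES2012", "ES2013", "ES2015", "ES2016",
        "IS1000", "IS1001", "IS1002", "IS1003", "IS1004", "IS1005",
        "IS1006", "IS1007",
        "TS3005", "TS3008", "TS3009", "TS3010", "TS3011", "TS3012",
    ],
    "SB": ["ES2003", "ES2011", "IS1008", "TS3004", "TS3006"],
    "SC": ["ES2004", "ES2014", "IS1009", "TS3003", "TS3007"],
}

# Meetings excluded by the "(no a)" and "(no d)" notes.
MISSING = {"IS1002a", "IS1005d"}


def expand(series):
    """Expand series names into meeting IDs (assumed suffixes a-d)."""
    return [
        name + suffix
        for name in series
        for suffix in "abcd"
        if name + suffix not in MISSING
    ]


meetings = {part: expand(series) for part, series in SCENARIO_ONLY.items()}
counts = {part: len(ids) for part, ids in meetings.items()}
print(counts)
```

Under that assumption the counts come out as 98, 20 and 20 meetings for SA, SB and SC, matching the figures given in the lists.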

Look at the bottom of the page for annotation statistics, and the script used to generate them.

Scenario-only K-fold Cross-Validation

Each line (for k=10) or each pair of lines (for k=5) is one part of the partition. For each fold, use one part for testing and the other nine (for k=10) or four (for k=5) for training. The columns indicate which set of the tripartition (SA, SB or SC) the meetings are taken from; for example, if you want to run cross-validation using no unseen data, use only the meetings in the SA and SB columns.

SA			SB	SC	k=10	k=5
ES2002	IS1000		TS3004	ES2004	1	1
ES2007	IS1001	TS3005			2
ES2005	IS1002		TS3006	ES2014	3	2
ES2006	IS1003	TS3008			4
TS3009	IS1004		ES2003	IS1009	5	3
ES2008	ES2013	TS3010			6
TS3011	IS1006		ES2011	TS3003	7	4
ES2010	IS1007	ES2015			8
ES2009	TS3012		IS1008	TS3007	9	5
ES2012	IS1005	ES2016			10
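
As a rough illustration of how the table can be used, the sketch below transcribes the ten rows and generates the train/test splits for k=10 and k=5. The function names parts and folds and the include_unseen flag are hypothetical, not part of any AMI tooling; entries are series names, not individual meetings.

```python
# One entry per row of the table above (k=10 part number -> series in that row).
K10_PARTS = {
    1: ["ES2002", "IS1000", "TS3004", "ES2004"],
    2: ["ES2007", "IS1001", "TS3005"],
    3: ["ES2005", "IS1002", "TS3006", "ES2014"],
    4: ["ES2006", "IS1003", "TS3008"],
    5: ["TS3009", "IS1004", "ES2003", "IS1009"],
    6: ["ES2008", "ES2013", "TS3010"],
    7: ["TS3011", "IS1006", "ES2011", "TS3003"],
    8: ["ES2010", "IS1007", "ES2015"],
    9: ["ES2009", "TS3012", "IS1008", "TS3007"],
    10: ["ES2012", "IS1005", "ES2016"],
}

# The SC column: unseen series held out for evaluation.
UNSEEN = {"ES2004", "ES2014", "IS1009", "TS3003", "TS3007"}


def parts(k, include_unseen=True):
    """Return {part number: [series, ...]} for k=10 or k=5."""
    if k == 10:
        grouped = {i: list(K10_PARTS[i]) for i in K10_PARTS}
    elif k == 5:
        # Each k=5 part is a pair of consecutive rows of the table.
        grouped = {j: K10_PARTS[2 * j - 1] + K10_PARTS[2 * j] for j in range(1, 6)}
    else:
        raise ValueError("only k=5 and k=10 are defined")
    if not include_unseen:
        grouped = {p: [s for s in ss if s not in UNSEEN] for p, ss in grouped.items()}
    return grouped


def folds(k, include_unseen=True):
    """Yield (train, test) lists of series, one pair per fold."""
    grouped = parts(k, include_unseen)
    for test_part in sorted(grouped):
        train = [s for p in sorted(grouped) if p != test_part for s in grouped[p]]
        yield train, grouped[test_part]


for train, test in folds(5, include_unseen=False):
    print(len(train), "training series; test on", test)
```

Passing include_unseen=False drops the SC column, which corresponds to running cross-validation on the seen data (SA and SB) only.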

Representation in NXT format

This information is encoded in NXT format in the following manner. The corpus resource file meetings.xml contains information about all meetings in the corpus. There are four relevant attributes on the 'meeting' elements:

visibility - values are either 'seen' or 'unseen'. 'seen' means the meeting is in the training or development set; 'unseen' means the meeting is in the evaluation set.
seen_type - if the visibility is 'seen', this attribute contains either 'training' or 'development'.
k10 - numerical value that divides the corpus into 10 parts for 10-fold cross-validation.
k5 - numerical value that divides the corpus into 5 parts for 5-fold cross-validation.

As with any elements and attributes in the NXT corpus, these can be queried using the NXT Query Language.
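
Outside NXT, the same attributes can be read with any XML parser. The following is a minimal sketch using Python's standard library rather than the NXT Query Language. The sample document is invented for illustration: the root element name and the "observation" attribute used as the meeting identifier are assumptions, and only the 'meeting' element and the four attributes described above come from this page.

```python
import xml.etree.ElementTree as ET

# Invented stand-in for meetings.xml; root element name and the
# "observation" attribute are assumptions, not taken from the corpus.
SAMPLE_XML = """<meetings>
  <meeting observation="ES2002a" visibility="seen" seen_type="training" k10="1" k5="1"/>
  <meeting observation="TS3004a" visibility="seen" seen_type="development" k10="1" k5="1"/>
  <meeting observation="ES2004a" visibility="unseen" k10="1" k5="1"/>
  <meeting observation="ES2007a" visibility="seen" seen_type="training" k10="2" k5="1"/>
</meetings>"""


def select_meetings(root, **wanted):
    """Return IDs of 'meeting' elements whose attributes equal all of `wanted`."""
    return [
        m.get("observation")
        for m in root.iter("meeting")
        if all(m.get(name) == value for name, value in wanted.items())
    ]


root = ET.fromstring(SAMPLE_XML)
training = select_meetings(root, visibility="seen", seen_type="training")
unseen = select_meetings(root, visibility="unseen")
part_one = select_meetings(root, k10="1")
print(training, unseen, part_one)
```

Attribute values are compared as strings, so the cross-validation part numbers are passed as "1", "2" and so on.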

Full-corpus partition of meetings

This is a strict superset of the scenario-only partition. We redundantly include all of that information here.

SA (TRAINING PART OF SEEN DATA): ES2002, ES2005, ES2006, ES2007, ES2008, ES2009, ES2010, ES2012, ES2013, ES2015, ES2016; IS1000, IS1001, IS1002 (no a), IS1003, IS1004, IS1005 (no d), IS1006, IS1007; TS3005, TS3008, TS3009, TS3010, TS3011, TS3012, EN2001, EN2003, EN2004a, EN2005a, EN2006, EN2009, IN1001, IN1002, IN1005, IN1007, IN1008, IN1009, IN1012, IN1013, IN1014, IN1016
SB (DEV PART OF SEEN DATA): ES2003, ES2011, IS1008, TS3004, TS3006, IB4001, IB4002, IB4003, IB4004, IB4010, IB4011
SC (UNSEEN DATA FOR EVALUATION): ES2004, ES2014, IS1009, TS3003, TS3007, EN2002

Note that IB4005 does not appear because it has speakers in common with two sets of data.

Full-corpus-ASR partition of meetings

The only difference between this partition and the Full-corpus partition above is that there is more training data: we move ES2014 and TS3007 from the evaluation set (SC) to training (SA), and ES2003 and TS3006 from the development set (SB) to training (SA). This leaves us with about 9 hours of meetings for the dev and eval (SB and SC) sets, and about 81 hours for training. This is deemed more suitable for a speech recognition task.

SA (TRAINING PART OF SEEN DATA): ES2002, ES2003, ES2005, ES2006, ES2007, ES2008, ES2009, ES2010, ES2012, ES2013, ES2014, ES2015, ES2016; IS1000, IS1001, IS1002 (no a), IS1003, IS1004, IS1005 (no d), IS1006, IS1007; TS3005, TS3006, TS3007, TS3008, TS3009, TS3010, TS3011, TS3012, EN2001, EN2003, EN2004a, EN2005a, EN2006, EN2009, IN1001, IN1002, IN1005, IN1007, IN1008, IN1009, IN1012, IN1013, IN1014, IN1016
SB (DEV PART OF SEEN DATA): ES2011, IS1008, TS3004, IB4001, IB4002, IB4003, IB4004, IB4010, IB4011
SC (UNSEEN DATA FOR EVALUATION): ES2004, IS1009, TS3003, EN2002
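
The relationship between the Full-corpus and Full-corpus-ASR partitions can be expressed as a small sketch: start from the Full-corpus sets and move the four named series into SA. The variable and function names (FULL_CORPUS, MOVES_TO_TRAINING, derive_asr) are illustrative only, and entries are kept exactly as they appear in the lists on this page.

```python
# Full-corpus partition, as listed on this page.
FULL_CORPUS = {
    "SA": {
        "ES2002", "ES2005", "ES2006", "ES2007", "ES2008", "ES2009",
        "ES2010", "ES2012", "ES2013", "ES2015", "ES2016",
        "IS1000", "IS1001", "IS1002", "IS1003", "IS1004", "IS1005",
        "IS1006", "IS1007",
        "TS3005", "TS3008", "TS3009", "TS3010", "TS3011", "TS3012",
        "EN2001", "EN2003", "EN2004a", "EN2005a", "EN2006", "EN2009",
        "IN1001", "IN1002", "IN1005", "IN1007", "IN1008", "IN1009",
        "IN1012", "IN1013", "IN1014", "IN1016",
    },
    "SB": {
        "ES2003", "ES2011", "IS1008", "TS3004", "TS3006",
        "IB4001", "IB4002", "IB4003", "IB4004", "IB4010", "IB4011",
    },
    "SC": {"ES2004", "ES2014", "IS1009", "TS3003", "TS3007", "EN2002"},
}

# Series moved into training for the ASR variant, keyed by their original set.
MOVES_TO_TRAINING = {
    "SC": {"ES2014", "TS3007"},
    "SB": {"ES2003", "TS3006"},
}


def derive_asr(full, moves):
    """Return a new partition with the listed series moved into SA."""
    asr = {part: set(series) for part, series in full.items()}
    for source, moved in moves.items():
        asr[source] -= moved
        asr["SA"] |= moved
    return asr


asr = derive_asr(FULL_CORPUS, MOVES_TO_TRAINING)
print({part: len(series) for part, series in asr.items()})
```

The resulting SB and SC sets match the Full-corpus-ASR lists above, and SA grows by exactly the four moved series.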