Training, Test and Evaluation Sets for the AMI Corpus
People working on automatic annotation of the AMI corpus should, where possible, use the same designations to ensure comparability of their results, in particular for a single annotation type.
On this page you will find our (strongly encouraged) suggestion for a division of the corpus into training, development and test sets, as well as a split into 5 and 10 parts for use in cross-validation. Please note that we present three different data partitions, which are similar but not identical. Please state which partition you use in any publication. If a publication does not state which partition was used, you should assume the scenario-only partition. The partitions are: Scenario-only, Full-corpus and Full-corpus-ASR. Full-corpus is simply a superset of Scenario-only that also includes non-scenario data. Full-corpus-ASR contains scenario and non-scenario meetings but uses a smaller proportion of the data for testing and a larger proportion for training.
Motivation
In WP5, we are doing a number of content abstraction tasks. For each task, there may be several groups using algorithms of varying computational complexity and with different data requirements. We need to be able to compare, for instance, the results of systems for doing the same task that use ten-fold cross-validation, five-fold cross-validation, one simple training set, or a dev set and a training set. For this reason, we are holding out a portion of the data that no system will see but that will be used only for evaluation.
Main division into seen and unseen data (in hours):
| | seen data | unseen eval data |
| scenario (S) | rest (61-ish) | 11 |
| non-scenario (N and B) | rest (28-ish) | 2.5-3 |
We divide the seen data into training and development data for systems that require that distinction. The data is divided (in hours) as follows:
| | train | dev |
| scenario | rest (50-ish) | 11 |
| non-scenario | rest (25-ish) | 2.5-3 |
Scenario-only Partition of meetings
- SA (TRAINING PART OF SEEN DATA): ES2002, ES2005, ES2006, ES2007, ES2008, ES2009, ES2010, ES2012, ES2013, ES2015, ES2016; IS1000, IS1001, IS1002 (no a), IS1003, IS1004, IS1005 (no d), IS1006, IS1007; TS3005, TS3008, TS3009, TS3010, TS3011, TS3012 (25 sets, 25*4-2 = 98 meetings)
- SB (DEV PART OF SEEN DATA): ES2003, ES2011, IS1008, TS3004, TS3006 (5 sets, 5*4 = 20 meetings)
- SC (UNSEEN DATA FOR EVALUATION): ES2004, ES2014, IS1009, TS3003, TS3007 (5 sets, 5*4 = 20 meetings)
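The set and meeting counts given in the lists above can be checked with a short sketch. It assumes that each set consists of up to four meetings whose IDs are the set name followed by a letter a-d, and that the notes "(no a)" and "(no d)" mean that IS1002a and IS1005d are absent; the names `MISSING` and `expand` are purely illustrative.

```python
# Scenario-only sets, copied from the lists above.
SA = ("ES2002 ES2005 ES2006 ES2007 ES2008 ES2009 ES2010 ES2012 ES2013 ES2015 ES2016 "
      "IS1000 IS1001 IS1002 IS1003 IS1004 IS1005 IS1006 IS1007 "
      "TS3005 TS3008 TS3009 TS3010 TS3011 TS3012").split()
SB = "ES2003 ES2011 IS1008 TS3004 TS3006".split()
SC = "ES2004 ES2014 IS1009 TS3003 TS3007".split()

# Assumption: "(no a)" and "(no d)" refer to these two meeting IDs.
MISSING = {"IS1002a", "IS1005d"}


def expand(sets):
    """Return meeting IDs (set name + letter a-d), skipping the missing ones."""
    return [s + letter for s in sets for letter in "abcd" if s + letter not in MISSING]


for name, sets in (("SA", SA), ("SB", SB), ("SC", SC)):
    print(name, len(sets), "sets,", len(expand(sets)), "meetings")
```

Under these assumptions the script reports 25 sets and 98 meetings for SA, and 5 sets and 20 meetings each for SB and SC.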
Look at the bottom of the page for annotation statistics, and the script used to generate them.
Scenario-only K-fold Cross-Validation
Each line (for k=10) or each pair of lines (for k=5) is one part of the partition. For each fold, you use one part for testing, the other nine (four) for training. The columns indicate from which set of the tripartition the meetings are taken - e.g. if you want to run CV using no unseen data, you'd only use meetings from the SA and SB columns.
| SA | SB | SC | k=10 | k=5 |
| ES2002, IS1000 | TS3004 | ES2004 | 1 | 1 |
| ES2007, IS1001, TS3005 | | | 2 | 1 |
| ES2005, IS1002 | TS3006 | ES2014 | 3 | 2 |
| ES2006, IS1003, TS3008 | | | 4 | 2 |
| TS3009, IS1004 | ES2003 | IS1009 | 5 | 3 |
| ES2008, ES2013, TS3010 | | | 6 | 3 |
| TS3011, IS1006 | ES2011 | TS3003 | 7 | 4 |
| ES2010, IS1007, ES2015 | | | 8 | 4 |
| ES2009, TS3012 | IS1008 | TS3007 | 9 | 5 |
| ES2012, IS1005, ES2016 | | | 10 | 5 |
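As an illustration of how the table above might be turned into folds, here is a minimal sketch. The dictionary is transcribed from the table; the function names `parts` and `folds` and the `columns` parameter are hypothetical and not part of any AMI tool.

```python
# Part number for k=10 -> (SA sets, SB sets, SC sets), copied from the table above.
K10_PARTS = {
    1: (["ES2002", "IS1000"], ["TS3004"], ["ES2004"]),
    2: (["ES2007", "IS1001", "TS3005"], [], []),
    3: (["ES2005", "IS1002"], ["TS3006"], ["ES2014"]),
    4: (["ES2006", "IS1003", "TS3008"], [], []),
    5: (["TS3009", "IS1004"], ["ES2003"], ["IS1009"]),
    6: (["ES2008", "ES2013", "TS3010"], [], []),
    7: (["TS3011", "IS1006"], ["ES2011"], ["TS3003"]),
    8: (["ES2010", "IS1007", "ES2015"], [], []),
    9: (["ES2009", "TS3012"], ["IS1008"], ["TS3007"]),
    10: (["ES2012", "IS1005", "ES2016"], [], []),
}
COLUMNS = ("SA", "SB", "SC")


def parts(k, columns=COLUMNS):
    """Group meeting sets into k parts (k is 10 or 5), using only the given columns."""
    if k not in (5, 10):
        raise ValueError("k must be 5 or 10")
    wanted = [COLUMNS.index(c) for c in columns]
    grouped = {}
    for k10, row in K10_PARTS.items():
        # Each pair of consecutive k=10 lines forms one k=5 part.
        part = k10 if k == 10 else (k10 + 1) // 2
        grouped.setdefault(part, []).extend(s for i in wanted for s in row[i])
    return grouped


def folds(k, columns=COLUMNS):
    """Yield (test_part, train_sets, test_sets) for each of the k folds."""
    grouped = parts(k, columns)
    for test_part in sorted(grouped):
        train = [s for p, sets in grouped.items() if p != test_part for s in sets]
        yield test_part, train, grouped[test_part]


# Example: 5-fold cross-validation that never touches the unseen (SC) data.
for test_part, train, test in folds(5, columns=("SA", "SB")):
    print(test_part, len(train), "training sets,", len(test), "test sets")
```

Restricting `columns` to SA and SB corresponds to running cross-validation on seen data only, as described in the paragraph before the table.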
Representation in NXT format
This information is encoded in NXT format in the following manner. The corpus resource file meetings.xml contains information about all meetings in the corpus. There are four relevant attributes on the 'meeting' elements:
- visibility - values are either 'seen' or 'unseen'. 'seen' means the meeting is in the training or development set; 'unseen' means the meeting is in the evaluation set.
- seen_type - if the visibility is 'seen', this attribute contains either the word 'training' or the word 'development'.
- k10 - numerical value that divides the corpus into 10 parts for 10-fold cross-validation.
- k5 - numerical value that divides the corpus into 5 parts for 5-fold cross-validation.
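The following sketch shows one way these attributes might be read with Python's standard library. The attribute names visibility, seen_type, k10 and k5 come from the description above; the sample XML, its root element name and the `observation` attribute used for the meeting ID are assumptions made for illustration and may not match the real meetings.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt: the root element and the "observation" attribute
# are assumptions; only visibility, seen_type, k10 and k5 are documented above.
SAMPLE = """<meetings>
  <meeting observation="ES2002a" visibility="seen" seen_type="training" k10="1" k5="1"/>
  <meeting observation="TS3004a" visibility="seen" seen_type="development" k10="1" k5="1"/>
  <meeting observation="ES2004a" visibility="unseen" k10="1" k5="1"/>
  <meeting observation="ES2007a" visibility="seen" seen_type="training" k10="2" k5="1"/>
</meetings>"""


def load_partition(root):
    """Map 'training' / 'development' / 'evaluation' to lists of meeting IDs."""
    split = {"training": [], "development": [], "evaluation": []}
    for m in root.iter("meeting"):
        key = "evaluation" if m.get("visibility") == "unseen" else m.get("seen_type")
        split.setdefault(key, []).append(m.get("observation"))
    return split


def fold_members(root, attr, part):
    """Meeting IDs whose k10 or k5 attribute equals the given part number."""
    return [m.get("observation") for m in root.iter("meeting")
            if m.get(attr) == str(part)]


root = ET.fromstring(SAMPLE)  # for a real file: ET.parse("meetings.xml").getroot()
print(load_partition(root))
print(fold_members(root, "k10", 1))
```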
Full-corpus partition of meetings
This is a strict superset of the scenario-only partition. We redundantly include all of that information here.
- SA (TRAINING PART OF SEEN DATA): ES2002, ES2005, ES2006, ES2007, ES2008, ES2009, ES2010, ES2012, ES2013, ES2015, ES2016; IS1000, IS1001, IS1002 (no a), IS1003, IS1004, IS1005 (no d), IS1006, IS1007; TS3005, TS3008, TS3009, TS3010, TS3011, TS3012, EN2001, EN2003, EN2004a, EN2005a, EN2006, EN2009, IN1001, IN1002, IN1005, IN1007, IN1008, IN1009, IN1012, IN1013, IN1014, IN1016
- SB (DEV PART OF SEEN DATA): ES2003, ES2011, IS1008, TS3004, TS3006, IB4001, IB4002, IB4003, IB4004, IB4010, IB4011
- SC (UNSEEN DATA FOR EVALUATION): ES2004, ES2014, IS1009, TS3003, TS3007, EN2002
Full-corpus-ASR partition of meetings
The only difference between this partition and the Full-corpus partition above is that there is more training data: we move ES2014 and TS3007 from the evaluation set (SC) to training (SA), and ES2003 and TS3006 from the dev set (SB) to training (SA). This leaves us with about 9 hours of meetings for the dev and eval (SB and SC) sets, and about 81 hours for training. This is deemed more suitable for a speech recognition task.
- SA (TRAINING PART OF SEEN DATA): ES2002, ES2003, ES2005, ES2006, ES2007, ES2008, ES2009, ES2010, ES2012, ES2013, ES2014, ES2015, ES2016; IS1000, IS1001, IS1002 (no a), IS1003, IS1004, IS1005 (no d), IS1006, IS1007; TS3005, TS3006, TS3007, TS3008, TS3009, TS3010, TS3011, TS3012, EN2001, EN2003, EN2004a, EN2005a, EN2006, EN2009, IN1001, IN1002, IN1005, IN1007, IN1008, IN1009, IN1012, IN1013, IN1014, IN1016
- SB (DEV PART OF SEEN DATA): ES2011, IS1008, TS3004, IB4001, IB4002, IB4003, IB4004, IB4010, IB4011
- SC (UNSEEN DATA FOR EVALUATION): ES2004, IS1009, TS3003, EN2002
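The relationship between the Full-corpus and Full-corpus-ASR partitions can be expressed as a few set operations. This is only a sketch: the set contents are copied from the lists on this page, and the variable names are illustrative.

```python
# Full-corpus partition, copied from the lists on this page.
FULL_SA = set(
    "ES2002 ES2005 ES2006 ES2007 ES2008 ES2009 ES2010 ES2012 ES2013 ES2015 ES2016 "
    "IS1000 IS1001 IS1002 IS1003 IS1004 IS1005 IS1006 IS1007 "
    "TS3005 TS3008 TS3009 TS3010 TS3011 TS3012 "
    "EN2001 EN2003 EN2004a EN2005a EN2006 EN2009 "
    "IN1001 IN1002 IN1005 IN1007 IN1008 IN1009 IN1012 IN1013 IN1014 IN1016".split()
)
FULL_SB = set("ES2003 ES2011 IS1008 TS3004 TS3006 "
              "IB4001 IB4002 IB4003 IB4004 IB4010 IB4011".split())
FULL_SC = set("ES2004 ES2014 IS1009 TS3003 TS3007 EN2002".split())

# Sets moved into training for the ASR partition.
FROM_EVAL = {"ES2014", "TS3007"}   # SC -> SA
FROM_DEV = {"ES2003", "TS3006"}    # SB -> SA

# The moved sets must actually come from where the text says they do.
assert FROM_EVAL <= FULL_SC and FROM_DEV <= FULL_SB

ASR_SA = FULL_SA | FROM_EVAL | FROM_DEV
ASR_SB = FULL_SB - FROM_DEV
ASR_SC = FULL_SC - FROM_EVAL

print("SA:", len(ASR_SA), "SB:", len(ASR_SB), "SC:", len(ASR_SC))
print("dev:", sorted(ASR_SB))
print("eval:", sorted(ASR_SC))
```

The resulting dev and eval sets can be compared by eye against the SB and SC lists of the Full-corpus-ASR partition.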