Introducing EdAcc

Ramon Sanabria - 03/04/22

Is English Automatic Speech Recognition Solved?

“English Automatic Speech Recognition is solved.” If I had collected a pound from researchers (particularly those outside the speech recognition field) each time I came across this statement, I'd undoubtedly be wealthy, or at the very least possess a sizable sum of pounds. While I do not subscribe to this line of thought, I would like to start by analyzing the reasons behind it. Word error rate (WER), the error metric used to evaluate automatic speech recognition (ASR) systems, has indeed been on a steady decline year after year on established benchmarks. This progress is showcased in the graphs below, where we show the results of the best-performing models from the last seven years on LibriSpeech and Switchboard, two of the most popular datasets in the speech recognition community. These graphs motivated mainstream media to eagerly report that ASR has achieved human or even superhuman performance, leading many to believe that the challenge of ASR has been conquered.
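For readers less familiar with the metric: WER is simply the word-level edit distance between the reference transcript and the system's hypothesis, normalised by the number of reference words. Here is a minimal sketch in plain Python (no external libraries; the function name and example strings are mine, for illustration):

```python
# Word error rate: word-level Levenshtein distance between a reference
# transcript and an ASR hypothesis, divided by the reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # 0.167 (1 error / 6 words)
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is one reason it is reported as an error rate rather than an accuracy.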



LibriSpeech and Switchboard are two of the most common datasets used to evaluate the accuracy of ASR systems. For LibriSpeech, the authors collected large amounts of public audiobook recordings along with their transcriptions. While this provides a substantial amount of data, the variety in speech aspects such as prosody, speaking rate, and acoustic environment is somewhat limited. You can listen to a LibriSpeech sample below. To appreciate this lack of variety, compare it with the sample next to it, a natural conversation between two people who know each other well: you should notice natural volume dynamics, a non-standard English variety, hesitations, and other phenomena that are clearly missing in LibriSpeech.

[Audio: LibriSpeech sample]
[Audio: Natural conversation]
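If you want to poke around LibriSpeech yourself, here is a minimal sketch using torchaudio's built-in loader (assuming torch and torchaudio are installed; the test-clean split is a download of roughly 350 MB, and the paths are just placeholders):

```python
# Minimal sketch: fetch the LibriSpeech test-clean split and inspect one
# read-speech sample. Assumes torchaudio is installed; paths are placeholders.
import torchaudio

dataset = torchaudio.datasets.LIBRISPEECH(
    root="./data", url="test-clean", download=True
)

# Each item is (waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id).
waveform, sample_rate, transcript, speaker_id, *_ = dataset[0]

print(f"Speaker {speaker_id} at {sample_rate} Hz: {transcript}")
torchaudio.save("librispeech_sample.wav", waveform, sample_rate)  # save it to listen
```

Listening to a handful of utterances this way makes the point above audible: the recordings are clean, single-speaker, and read with fairly flat prosody.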


On the other hand, Switchboard offers a more realistic domain (i.e., closer to our recording above): telephone conversations. Conversational speech is how people naturally interact, full of pauses, hesitations, and changes in tone, just as in the previous recording. While this creates a tougher setup for ASR systems, the dataset has a few drawbacks: 1) participants don't know each other beforehand, leading to somewhat artificial conversations, and 2) the domain, telephone conversations, is outdated. Plus, both datasets focus solely on American English speakers, which, we'd argue, doesn't capture the rich diversity of the English language. Altogether, this shows a lack of corpora that accurately represent English in conversational settings, and it makes the results in the graphs above unrepresentative of the actual performance of ASR systems.

English is Rich and Diverse!

Language is a living organism that evolves just as we do, and English is no exception. English, in particular, has undergone significant changes throughout its history, in part due to its widespread use and exposure to other languages and cultures; David Crystal has written many pieces about this (here I leave a short one and a long one). Take as an example the recording above, where my partner and I have a casual conversation. Both of us are Catalan and Spanish speakers who have been living for a while in an English-speaking country. We can say that we speak English, because many of you would hopefully understand our conversations. So what is different? We speak a variety of English, although this variety is not formally defined (we will discuss this later). The reasons why this happens are hard to track, but we (and many linguists) might agree that our linguistic background, culture, and experiences have a lot to do with it. This is just a small example of English variation, but there are many more.

The Edinburgh International Accents of English Corpus: Representing the Richness of the English Language

Statistics

Important note: we acknowledge that despite efforts to design an accessible and fair data curation process, we are only representing a small section of all English speakers. We hope to increase the diversity of the dataset in future iterations.

Acknowledgments


I want to give a big thank-you to everyone who has helped make EdAcc happen. The ICASSP template didn't let me thank everyone properly, and it's been a long journey.

First off, a huge thanks to the Institute for Language, Cognition and Computation at The University of Edinburgh, which funded this project. In particular, to Alex Lascarides, who managed the funding and provided essential guidance. To all my co-authors for the amazing work, with an extra thanks to Ondrej Klejch for going above and beyond during the evaluation phase. I also appreciate every reviewer (anonymous or otherwise) who took the time to give us feedback on early and late versions of the paper. Your input made our project so much better. Last but not least, thanks to the EdAcc speakers who shared their accents with us.

More personally, I (Ramon) want to give a special shoutout to Júlia Nueno for always encouraging me to keep pushing forward, even when things did not look good.