How does a “correct” utterance sound?
Written by: Sofia Strömbergsson, Karolinska Institutet
An important feature of the speech training app developed in TEFLON is that it provides children with feedback on their speech production. This feedback is presented as stars, with five stars representing “correct” pronunciation; the fewer the stars, the further the utterance is from “correct”. But how does the app know how many stars to present for a given utterance? In fact, this is something that the app (or rather, the acoustic model within the app) has to learn from how humans have evaluated children’s utterances.
But even for humans, the task of evaluating the correctness of children’s utterances is not trivial. For one, there are very many ways for utterances to be “correct”. (And that is a good thing when it comes to human perception in daily life! It means that we can “tolerate” a lot of variation in speech production without communication being too easily disrupted.) But also, there are even more ways that utterances can be “incorrect”. For example, is the utterance still intelligible as the intended word? And for intelligible utterances – are one or more speech sounds affected? Are different types of speech errors more severe than others? In TEFLON, the different research teams have tackled this challenge in slightly different ways.
At Karolinska Institutet, the human evaluators have used the same 1-5 scale that will be used in the app. In an effort to ensure consistency in the evaluations, the following rating key was specified:
- not at all identifiable as the target word
- not identifiable as the target word
- slight phonemic error (e.g., the target word “kollision” is pronounced “kolliton”)
- subphonemic error/“unexpected variant” (e.g., the /r/-sound in “ros” is not quite produced as you’d expect)
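To make the scale concrete, the rating key can be sketched as a simple lookup that also renders the in-app star feedback. This is a hypothetical illustration: the assignment of each description to a specific number (and of 5 to a fully correct production) is an assumption, since the text lists the descriptions without their exact numeric values.

```python
# Hypothetical sketch of the 1-5 rating key used by human evaluators.
# The numeric assignments below are assumptions for illustration only;
# the source lists the descriptions but not their exact numbers.
RATING_KEY = {
    1: "not at all identifiable as the target word",
    2: "not identifiable as the target word",
    3: "slight phonemic error (e.g., 'kollision' pronounced 'kolliton')",
    4: "subphonemic error / unexpected variant",
    5: "correct pronunciation",
}

def stars_for(rating: int) -> str:
    """Render a 1-5 rating as the star feedback shown in the app."""
    if rating not in RATING_KEY:
        raise ValueError("rating must be between 1 and 5")
    return "★" * rating + "☆" * (5 - rating)

print(stars_for(3))  # ★★★☆☆
print(stars_for(5))  # ★★★★★
```

In the app itself this mapping would be produced by the acoustic model rather than looked up directly; the sketch only shows how a numeric rating translates into the star display.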
But even with this rating key, the evaluation decisions are not always easy. The researchers are currently running listening experiments with more listeners – experts (in this context: speech-language pathologists) and non-experts – to explore listener behaviors more systematically. For this task, the listeners are instructed to rate the utterances from 1 to 5 as they think the app should rate them (i.e., without being provided with a rating key). Through these experiments, the researchers hope to learn more about whether experts and non-experts differ in their ratings, and whether ratings are more consistent for listeners who have access to “reference samples” when conducting their evaluations. The researchers aim to present their findings later in 2023.