Workshop AAAS-2025

Important Dates

Submission deadline: 16 December 2024; NEW deadline: 20 December 2024 (both papers and abstracts)
Notification of acceptance: 24 January 2025
Camera-ready deadline: 3 February 2025
Workshop: 5 March 2025 (full day)
All deadlines are 11:55 PM UTC-12:00 (“anywhere on Earth”).

Overview

Automatic Assessment of Atypical Speech (AAAS) explores the assessment of pronunciation and speaking skills of children, language learners, and people with speech sound disorders, as well as methods for providing automatic rating and feedback using automatic speech recognition (ASR) and large language models (LLMs). Automatic speaking assessment (ASA) is a rapidly growing field that responds to the need for AI tools for self-practising second and foreign language skills. It is not limited to pronunciation assessment: AI tools can also provide more complex feedback on the fluency, vocabulary and grammar of recorded speech. ASA is also highly relevant for the detection and quantification of speech disorders and for developing speech exercises that can be performed independently of time and place. Important applications of non-standard speech processing also include interfaces for children and elderly speakers as an alternative to text input and output. The topic is timely because the latest large speech models now allow us to develop ASR and classification methods for low-resource data, such as atypical speech, for which annotated training datasets are rarely available and are expensive and difficult to produce and share. The goal of this workshop is to present the latest results in ASA and to discuss future work and collaboration between researchers in the Nordic and Baltic countries.

Topics of Interest

In particular, we would like to invite students, researchers, and other experts and stakeholders to contribute papers and/or join the discussion on the following (and related) topics:

  • Automatic speaking assessment (ASA) for L2 (second or foreign language) pronunciation
  • ASA for spoken L2 proficiency
  • ASA for speech sound disorders (SSD)
  • Automatic speech recognition (ASR) for L2 learners
  • ASR for children and young L2 learners
  • ASA and ASR for Nordic and other low-resource languages and tasks
  • Spoken L2 learning and speech therapy using games
  • Automatic generation of verbal feedback for spoken L2 learners using LLMs

AAAS-2025 Workshop Program, 5 March 2025

09:00 Mikko Kurimo: opening

09:10 Invited Speaker – Nina Benway: What is so hard about AI Speech Therapy? Evidence from Efficacy Trials.

Nina R Benway, PhD CCC-SLP, is a Postdoctoral Fellow in Electrical and Computer Engineering with Dr. Carol Espy-Wilson. Nina completed her doctoral training in speech-language pathology (concentration: neuroscience) with Dr. Jonathan Preston at Syracuse University, focusing on clinical trials in children with chronic rhotic speech sound disorders. The three studies of her dissertation resulted in the curation of an open-access 175,000-utterance speech corpus, the engineering of audio classification algorithms predicting speech-language pathologist perception of rhotic speech errors, and the clinical trial validation of an artificial intelligence tool that fully automates a speech sound treatment session. Nina’s doctoral training builds upon her undergraduate training in linguistics (acoustic phonetics) at Cornell University, graduate clinical training at The College of Saint Rose, and six years of clinical practice. Through these experiences Nina has refined a multidisciplinary skill set in speech science, speech signal processing, natural language processing, corpus phonetics, machine learning/artificial intelligence (AI), user interface development, cognitive frameworks of learning, and neurocomputational frameworks of speech production.

10:00 Coffee break

10:30 Invited Speaker – Ari Huhta: Automatic assessment of second/foreign language speaking: Review of developments for examination and teaching/learning purposes.

Ari Huhta is a Professor of Language Assessment at the Centre for Applied Language Studies, University of Jyväskylä, Finland. His research interests include diagnostic foreign/second language (L2) assessment, computerised assessment, self-assessment, as well as the development of reading, writing and vocabulary knowledge in L2. He was involved in developing the large-scale multilingual DIALANG online assessment and feedback system in the early 2000s and since then he has specialised in assessments that support language learning. Although his research has focused on learning and assessing reading and writing, he has been involved in designing several rating scales for speaking and in evaluating rating quality and studying rater behavior. Recently, he has participated in research projects that are developing ASR and automated assessment of L2 speaking, as well as using LLMs to evaluate Finnish L2 learners’ proficiency level.

11:10 Team TEFLON: Developing and testing the Pop2Talk Nordic language learning game for children practising Nordic languages

This presentation describes our experiences and results from running game-based learning experiments with children learning Nordic languages, building the games for them, and collecting several corpora of children's speech to train speech models for the games. The aim of the project is to develop and evaluate computer-assisted pronunciation assessment systems both for non-native children learning a Nordic language (L2) and for L1 children with speech sound disorders (SSD). In this presentation we discuss the challenges encountered in recording and annotating the data, building the models and games, running the learning experiments, and reporting the results, as well as the ethical considerations related to making our data publicly available. We hope that sharing these experiences will encourage others to collect similar data and run similar experiments for other languages.

12:00 Raili Hilden and Anna von Zansen. Digital tool for L2 speaking assessment

Project Digitala, funded by the Academy of Finland, arose from a range of acknowledged practical and theoretical gaps. The nationwide high-stakes exam at the end of upper secondary education (the Matriculation Examination) has for a long time been at odds with the communicative language syllabi it is intended to measure, since there is no speaking section in this test. The reasons are primarily practical: issues of implementation and a lack of assessment resources, namely time and rater salaries. The digitalization of the exam opened new prospects for utilizing advances in speech technology to solve these problems.

The scientific problems addressed by the project were:

RQ1: Which features of L2 Finnish and Swedish are most amenable/problematic to automated analysis?

RQ2: How can the automated analysis of specific features of speech be integrated into tasks used in practicing and assessing oral skills in Finnish and Swedish as L2 for formative and summative purposes?

RQ3: What features of speech communication in the two languages are important for human raters, and how could ASR help in measuring and standardizing these features?

RQ4: How do the computerized and human ratings relate to each other?

RQ5: What technical and acoustic factors have the greatest impact on the obtained scores (facilities, recording device, background noise, etc.)?

RQ6: What features of L2 Finnish and Swedish speech are connected with specific CEFR levels in the automated rating?

RQ7: What attitudes and beliefs do students and teachers hold towards the emerging pedagogic innovation?

RQ8: Under what conditions can we use adaptation to create personalized L2 phoneme models that improve ASR and speech rating for L2 learners?

A consortium of three universities, combining expertise in language education and assessment (University of Helsinki), phonetics (University of Jyväskylä) and speech technology (Aalto University), set out to develop a prototype tool with a dual aim. First, the tool would provide formative support for practicing oral skills; in the long run, it could be refined into a technical resource supporting summative assessment of speaking in high-stakes contexts.

The tool is based on speech recognition and operates in several phases: capturing the spoken signal, scoring and evaluating the samples, feeding in training data, and finally testing the functionality of the model. Languages with fewer learners face challenges due to the scarcity of training data, but recent advances in machine learning have made it possible to develop systems with a limited amount of data from the target domain. To this end, we propose automatic speech evaluation systems for spontaneous L2 speech in Finnish and Finnish-Swedish, each consisting of six machine learning models, and report their performance in terms of statistical evaluation criteria.

The results suggested that (RQ1) the features most suitable for automated analysis are quantifiable phonetic features and vocabulary range. Task completion, by contrast, was deemed problematic, because the very short utterances required for mechanical evaluation do not allow speakers to demonstrate high-level skills, even when they possess them.

(RQ2) Read-aloud tasks and short samples of free speech worked best in automatic assessment. Automatic assessment of extended free speech, on the other hand, is difficult, because human raters target their assessments at different points and features in it, which lowers reliability.

(RQ3) Pronunciation and fluency, task completion, and linguistic range were considered by the assessors to be the key quality features of speaking. The ratings of pronunciation and fluency were the most consistent, while accuracy divided the raters' opinions more.

(RQ4) The human assessors were most unanimous on the overall skill level, linguistic range and accuracy; there was least consensus on pronunciation and task completion. Machine and human estimates were most consistent for overall skill level, fluency, and task completion, while pronunciation, linguistic range, and accuracy produced the lowest correlations between machine and human ratings. There were certain differences in the agreement rates between the Finnish and Swedish datasets.

(RQ5) Experiences from the automatic evaluation pilot were mostly positive, although technical problems and network instability were annoying, as has been observed in many previous studies.

(RQ6) In the automatic assessment, the skill level was most closely tracked by the quantitative characteristics of speech: pronunciation, fluency and linguistic range.

(RQ7) The stakeholders (learners and teachers) agreed on certain benefits of automated assessment: it is a tireless rater, gives immediate feedback, enables self-regulated learning, saves time and money, and applies the same criteria to all speakers. Potential challenges include environments in which others are listening to one's performance or disturbing it with their simultaneous speech. Some concern was attributed to inexperience and data protection issues.

(RQ8) We obtained good results for both Finnish and Swedish by adapting a large pre-trained wav2vec 2.0 speech model that was finetuned first with a larger L1 dataset and then with an L2 dataset gathered in the project.
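As a rough illustration of this two-stage recipe (a minimal sketch, not the project's actual code), one could use the Hugging Face transformers API as follows; the checkpoint name, the `l1_dataset`/`l2_dataset` objects and the `collator` are placeholders, not resources from the project:

```python
# Minimal sketch of two-stage wav2vec 2.0 adaptation: CTC fine-tuning
# first on a larger L1 corpus, then on the smaller L2 learner corpus.
# All checkpoint and dataset names below are illustrative placeholders.
from transformers import (Trainer, TrainingArguments,
                          Wav2Vec2ForCTC, Wav2Vec2Processor)

def finetune(checkpoint, dataset, output_dir, learning_rate, data_collator):
    """One fine-tuning stage; `dataset` is assumed to already provide
    `input_values` and `labels` as expected by Wav2Vec2ForCTC."""
    processor = Wav2Vec2Processor.from_pretrained(checkpoint)
    model = Wav2Vec2ForCTC.from_pretrained(checkpoint)
    model.freeze_feature_encoder()  # keep the convolutional front end fixed
    args = TrainingArguments(output_dir=output_dir,
                             per_device_train_batch_size=8,
                             num_train_epochs=10,
                             learning_rate=learning_rate)
    Trainer(model=model, args=args, train_dataset=dataset,
            data_collator=data_collator).train()
    model.save_pretrained(output_dir)
    processor.save_pretrained(output_dir)
    return output_dir

# Stage 1: adapt the pre-trained model to the target language with L1 data.
l1_dir = finetune("our-org/wav2vec2-pretrained",  # hypothetical checkpoint
                  l1_dataset, "wav2vec2-l1", 3e-4, collator)
# Stage 2: continue from the L1 model with the smaller L2 learner dataset,
# using a lower learning rate to preserve the L1 adaptation.
l2_dir = finetune(l1_dir, l2_dataset, "wav2vec2-l2", 1e-4, collator)
```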

The Digitala prototype will be further developed, finetuned and adjusted to address a new user group: adults preparing for the final test of integration training in L2 Finnish. New project funding was granted by the Research Council of Finland for 2025-2026 to the same consortium, which will work in close collaboration with the Finnish National Agency for Education, the provider of integration training in Finland.

12:15 Anna von Zansen and Raili Hilden. The AASIS research project: automatically assessing spoken interaction in L2 Finnish

AASIS builds on the product of the Digitala project, a prototype tool for assessing and providing feedback on monologue task performances. AASIS is funded by the Academy of Finland (09/2023–2027) with the same consortium of three universities as its predecessor Digitala. The consortium brings together researchers from speech and language processing, language education and phonetics to investigate adult L2 Finnish learners' spoken interaction in dialogue speaking tasks. The scope of the tool will be expanded to measure aspects of interactional competence (gaze behavior, facial expressions, body movements and proxemics).

The theoretical background and the justification for addressing interactional speech performance are grounded in the customarily narrow construct coverage of automatic speaking assessment: automatic speech recognition and automatically measurable speech features are often limited to individual speakers and monologue speech elicited with read-aloud and production tasks. Moreover, the assessment of spoken interaction lacks non-verbal features. Although non-verbal behaviors are important in real-life interaction, features related to non-verbal communication, such as gaze, facial expressions, gestures, intonational cues and turn-taking, are traditionally not present in rating scales. Nevertheless, teachers and raters are known to pay attention to non-verbal communication, and its impact on ratings and the consequent assessment decisions remains unconscious and variable. Expanding ASR-based speaking assessment to cover L2 learners' spoken interaction would support human raters and promote L2 speaking practice.

AASIS addresses the following research questions:

RQ1: Which verbal and non-verbal features of interactional competence in L2 Finnish dialogues affect human raters' assessments of L2 interaction skills?

RQ2: How could the above features be measured and utilized in ASR-based assessment of interactional skills?

The participants are adult learners of L2 Finnish at the tertiary level of education. They have given their consent in accordance with the GDPR and the ethical guidelines of the three universities.

The data comprise learners' multimodal performances, background information, self-assessments, views and feedback. Rater scores, background information, and questionnaire responses on views and feedback are stored for scrutiny. The data files, including ELAN records of non-verbal behavior and gaze-tracking accounts, are compiled and stored on secured servers.

The project relies on a variety of methods and techniques. The verbal data are investigated with methods drawing on interactional phonetics, such as measuring speaking time, silence, overlapping speech and turn-taking. Rater behavior and consistency, as well as indicators of task difficulty and scale functionality, are subjected to many-facet Rasch measurement. Less traditional techniques include the use of a joystick for capturing non-verbal cues and eye-tracking glasses for detecting gaze behavior.

Preliminary results support the initial hypotheses regarding the association between certain interactional features (both verbal and non-verbal) and the level ratings of paired performances.

12:30 Lunch

14:00 Aleksei Žavoronkov and Tanel Alumäe. Phoneme-level Estonian Pronunciation Assessment: Initial Results

This paper describes the dataset, model and initial results for an Estonian pronunciation scoring task. The model is developed for a mobile application that gives language learners feedback on their pronunciation of prompted individual words and sentences. For model development, 5 hours of Estonian speech from non-native speakers has been manually annotated at the phoneme level with pronunciation scores on a 3-point Likert-type scale. We use an end-to-end pronunciation scoring model that relies on a self-supervised model that is first finetuned on a large corpus of phoneme-labeled speech and then extended for pronunciation scoring. The paper focuses on comparing different pretraining methods and reports initial results.
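As a rough sketch of such an architecture (our illustration under assumed names and dimensions, not the authors' implementation), a frame-level scoring head can be stacked on a phoneme-recognition backbone:

```python
# Illustrative sketch (not the authors' code) of extending a
# phoneme-recognition backbone with a pronunciation-scoring head that
# emits a 3-point Likert score distribution per frame.
import torch
import torch.nn as nn

class PhonemeScorer(nn.Module):
    def __init__(self, backbone: nn.Module, hidden_dim: int = 768,
                 num_scores: int = 3):
        super().__init__()
        # e.g. a self-supervised encoder finetuned on phoneme-labeled speech
        self.backbone = backbone
        self.score_head = nn.Linear(hidden_dim, num_scores)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # backbone maps audio to frame-level hidden states:
        # shape (batch, frames, hidden_dim)
        hidden = self.backbone(waveform)
        # one score distribution per frame; frame scores would then be
        # pooled over each phoneme's aligned span to yield the
        # phoneme-level pronunciation scores
        return self.score_head(hidden).log_softmax(dim=-1)
```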

14:15 Ekaterina Voskoboinik, Nhan Phan, Tamás Grósz and Mikko Kurimo. Leveraging Uncertainty for Finnish L2 Speech Scoring with LLMs

Automatic speech assessment (ASA) supports learning but often requires extensive data, which is scarce for languages with fewer learners. Recent research shows that Large Language Models (LLMs) can generalize to new tasks with minimal training data using in-context learning (ICL). We find LLMs to be effective in estimating the proficiency of individuals learning Finnish as a second language (L2) when given a few examples of human expert grading. The proficiency grades produced by the model, when evaluating verbatim transcripts from an automatic speech recognition (ASR) system, agree with human ratings at a level comparable to the agreement between the human raters. Our experiments reveal that adding more grading demonstrations in ICL improves the model's accuracy but, counterintuitively, increases its uncertainty when selecting an appropriate proficiency level. We show that this uncertainty can be leveraged further by creating soft labels: instead of assigning the most probable level (hard label), we aggregate the model's confidence across all possible levels, resulting in noticeable performance improvements. Further analysis reveals that the sources of model uncertainty differ across ICL settings: in the zero-shot setting, uncertainty stems from intrinsic response properties, such as proficiency level; in the few-shot setting, it is driven by the relationship between the sample and the demonstrations.
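The soft-label idea can be illustrated with a small sketch (our paraphrase, not the paper's code): rather than taking the single most probable level, the model's probability mass over all levels is aggregated into an expected score. The numeric level labels below are assumptions for illustration:

```python
# Sketch of soft labels for LLM-based proficiency scoring: aggregate the
# model's confidence over all levels instead of committing to the argmax.
import math

def soft_label(level_logprobs: dict[str, float]) -> float:
    """level_logprobs maps each proficiency level (assumed numeric,
    e.g. "1".."N") to the LLM's log-probability of answering that level."""
    probs = {lvl: math.exp(lp) for lvl, lp in level_logprobs.items()}
    total = sum(probs.values())
    # probability-weighted average of the numeric levels
    return sum(float(lvl) * p for lvl, p in probs.items()) / total

# Example: a model split between levels 3 and 4 yields a soft score of
# about 3.65 rather than a hard label of 4.
print(soft_label({"3": math.log(0.4), "4": math.log(0.55),
                  "5": math.log(0.05)}))
```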

14:35 Lingyun Gao, Cristian Tejedor-García, Catia Cucchiarini and Helmer Strik. Investigating Further Fine-tuning Wav2vec2.0 in Low Resource Settings for Enhancing Children Speech Recognition and Word-level Reading Diagnosis

Automatic reading diagnosis systems can significantly enhance teachers' efficiency in scoring reading exercises and provide students with easier access to reading practice and feedback. However, few studies have focused on developing Automatic Speech Recognition (ASR)-based reading diagnosis systems, mainly due to the scarcity of data. This study explores the effectiveness and robustness of further fine-tuning the Wav2vec2.0 large model in low-resource settings for recognizing children's speech and detecting reading miscues, using target-domain and similar out-of-domain data. Our results show a word error rate (WER) of 10.9% and an F1 score of 0.49 for reading miscue detection achieved by our best fine-tuned model trained with target-domain data, while using similar out-of-domain non-native read speech can enhance model performance for unseen speakers and noisy settings. The analyses provide insights into the robustness of further fine-tuning strategies for the Wav2vec2.0 model.
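One common way to implement word-level miscue detection (a generic illustration on our part; the paper may use a different procedure) is to align the ASR hypothesis against the prompt text and flag the non-matching words:

```python
# Generic sketch of word-level reading-miscue detection by aligning the
# ASR hypothesis to the prompt text; not necessarily the paper's method.
import difflib

def detect_miscues(prompt: str, hypothesis: str) -> list[tuple[str, str]]:
    """Return (kind, word) pairs: 'miscue' marks a prompt word that was
    skipped or misread, 'extra' a produced word that is not in the prompt."""
    ref, hyp = prompt.lower().split(), hypothesis.lower().split()
    miscues = []
    matcher = difflib.SequenceMatcher(a=ref, b=hyp)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op in ("replace", "delete"):
            miscues.extend(("miscue", w) for w in ref[i1:i2])
        if op in ("replace", "insert"):
            miscues.extend(("extra", w) for w in hyp[j1:j2])
    return miscues

# Example: misreadings of "on" and "mat" are flagged together with the
# words the reader actually produced.
print(detect_miscues("the cat sat on the mat", "the cat sat in the hat"))
# [('miscue', 'on'), ('extra', 'in'), ('miscue', 'mat'), ('extra', 'hat')]
```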

15:00 Coffee break

15:30 Open slot for one more talk

15:45 Panel discussion: Nina Benway, Ari Huhta, Torbjörn Svendsen, organizers

16:45 Mikko Kurimo: closing

17:00 End of workshop

Organizers

  • Mikko Kurimo (chair), Aalto University, mikko.kurimo@aalto.fi
  • Giampiero Salvi, NTNU
  • Sofia Strömbergsson, Karolinska Institutet
  • Sari Ylinen, Tampere University
  • Minna Lehtonen, University of Turku
  • Tamás Grósz, Aalto University
  • Ekaterina Voskoboinik, Aalto University
  • Yaroslav Getman, Aalto University
  • Nhan Phan, Aalto University

This workshop is supported by the NordForsk project no. 103893, “Technology-enhanced foreign and second-language learning of Nordic languages (TEFLON)”, https://teflon.aalto.fi/.

Contact Information

For questions and comments, please email mikko.kurimo@aalto.fi

Program committee

Adriana Stan, Anna Smolander, Catia Cucchiarini, Ekaterina Voskoboinik, Geza Nemeth, Giampiero Salvi, Helmer Strik, Horia Cucu, Hugo Van Hamme, Katalin Mády, Klaus Zechner, Mathew Magimai Doss, Mikko Kurimo (chair), Mikko Kuronen, Minna Lehtonen, Mittul Singh, Nhan Phan, Paavo Alku, Pablo Riera, Ragheb Al-Ghezi, Raili Hilden, Riikka Ullakonoja, Sari Ylinen, Shahin Amiriparian, Simon King (co-chair), Sneha Das, Sofia Strömbergsson, Tamás Grósz (co-chair), Tanel Alumäe, Yaroslav Getman