Challenges collecting and sharing speech data from children

The research and development that we carry out in the Teflon project relies on collections of speech recordings from foreign children who learn to speak the Nordic languages and from Swedish children with speech sound disorder. This kind of data is unique for several reasons: firstly, no data sets of children speaking Nordic languages are publicly available, as opposed to adult data that is far from scarce. Secondly, we are interested in second language (L2) speakers of Nordic languages, because we want to study how to detect and characterize their mispronunciations. Thirdly, there is no publicly available data of Nordic children with speech sound disorder. Consequently, substantial effort during the project was devoted to collecting such data. This was done both simulating the situation where children play the Pop2Talk game, and, later, recording the game sessions with the students.

The intention when collecting this data was to make all the corpora publicly available, in the spirit of the open science principles. However, we soon realized that the rules defined by the General Data Protection Regulation (GDPR) are interpreted very differently in the different Nordic countries (and, possibly in the other European countries). Those rules have the goal of protecting the privacy of European citizens and refer to any information that may identify each individual. In collections of text, complying with these rules may just mean removing any personal information such as names, telephone numbers, user IDs, or addresses. For speech the situation is more complicated. The main issue regarding speech, is related to the question if the voice is sufficient in order to identify the speaker, and if this, in turns, constitutes sufficient grounds to forbid public sharing of the recordings.

Our experience in Sweden is that different lawyers may interpret the law differently. The lawyers at KI, for example, think it is fine to share recordings of isolated words. But some disagree. When consulting the lawyers at Språkbanken Tal (the main channel for sharing speech data in Sweden), however, their interpretation of the law was much more strict. A similar stricter interpretation is now used in Finland (although publication of speech data from individuals over 16 years has been permitted in the past). In contrast, in Norway, the rules are interpreted to only apply to what is spoken, and not to the voice that speaks. As a result, of the many corpora that we recorded for Finnish, Swedish and Norwegian as target languages, we will only be able to publish the Norwegian corpus.

The different responses to our requests to the ethical committees in the different countries illustrate the complexity of speech data sharing. It is useful to stress that the L2 data sets for Swedish, Norwegian and Finnish were completely equivalent in all respects: age of participants, characteristics of the participants, content of the recordings, anonymization of the participants in the metadata, sharing conditions requested, to name a few. It is also important to point out that the characteristics of child voice, that may allow the identification of an individual, change very quickly with age. This means that the participants will not be identifiable by their voice in just a few months or years after the recordings. We hope that, in the future, uniform interpretations of regulations and practices in handling speech data will be introduced. We believe that the availability of open-access speech corpora with child speech is of essential value for scientific research and speech technology development, and they will bring about great advantages for society.

We have submitted a paper describing the data collected in the Teflon project in more detail to the LREC2024 conference that will take place in Turin, Italy, on May 20-25. Please visit the LREC2024 website for more information. 

Aalto’s participation in the ACM Multimedia 2023 Computational Paralinguistics and Multimodal Sentiment Analysis Challenge

The Computational Paralinguistics Challenge (ComParE) was organized as part of the ACM Multimedia 2023 conference. The annual challenge deals with states and traits of individuals as manifested in their speech and further signals’ properties. Each year, the organizers introduce tasks, along with data, for the participants to develop solutions.

Our team, consisting of PhD students: Dejan Porjazovski, and Yaroslav Getman, as well as the Research Fellow Tamás Grósz and Professor Mikko Kurimo, participated in the two challenges provided by the organizers:

Requests and Emotion Share.

The Requests sub-challenge involves real interactions in French between call centre agents and customers calling to resolve an issue. The task is further divided into two sub-tasks: Determine whether the customer call concerns a complaint or not, and whether the call concerns membership issues or a process, such as affiliation.

The multilingual Emotion Share task, involving speakers from the USA, South Africa, and Venezuela, tackles a regression problem of recognising the intensity of 9 emotions present in the dataset. The intensities of the emotions that need to be recognized are anger, boredom, calmness, concentration, determination, excitement, interest, sadness, and tiredness.

Our team tackled these issues by utilizing the state-of-the-art wav2vec2 model, along with a Bayesian linear layer. The choice for the appropriate wav2vec2 model, along with the best-performing transformer layer, and the Bayesian linear layer, led to our team winning the Emotion Share sub-challenge:

For more technical details about the work, see our paper:

Simultaneously, we also participated in the 4th Multimodal Sentiment Analysis Challenge. The team consisted of PhD students Anja Virkkunen and Dejan Porjazovski, together with Research Fellow Tamás Grósz and Professor Mikko Kurimo.

We chose to tackle two extremely hard problems: Humour and Mimicked Emotions detection using videos. Our solution focused on large, pre-trained models, and we developed a method that could identify relevant parts outputs of these foundation models, making them a bit more transparent. The empirical results demonstrate that these large AIs have smaller specialized outputs that contain relevant information for detecting jokes and emotions of the speakers based on visual and audible cues.

For more details, check out our paper:



Etsimme osallistujia yhteispohjoismaiseen tieteelliseen tutkimukseen, jossa tutkitaan lasten kielen oppimista oppimispelien avulla. Tampereen yliopistossa ja Aalto-yliopistossa tutkitaan pelisovelluksia, joilla voi harjoitella vierasta kieltä tai matematiikkaa. Tämä tutkimus on jatkoa vuonna 2016 alkaneelle lasten digitaalisen oppimisen tutkimukselle, jossa on aiemmin tutkittu mm. lasten englannin kielen oppimista äänipohjaisten sovellusten avulla.

Tutkimuksen mittaukset tapahtuvat Helsingissä.

Kutsumme mukaan tutkimukseen ekaluokkalaisia ja esikoulussa olevia lapsia, joiden äidinkieli on suomi, ja jotka eivät vielä puhu sujuvasti ruotsia tai englantia.
Osa lapsista pelaa kielenoppimispeliä ja osa puolestaan pelaa DragonBox Numbers -matematiikkapeliä.

Pyydämme kiinnostuneita ottamaan yhteyttä sähköpostitse:

Lämpimästi tervetuloa mukaan!