Challenges collecting and sharing speech data from children

The research and development that we carry out in the Teflon project relies on collections of speech recordings from foreign children who are learning to speak the Nordic languages and from Swedish children with speech sound disorder. This kind of data is unique for several reasons. Firstly, no data sets of children speaking Nordic languages are publicly available, whereas adult data is far from scarce. Secondly, we are interested in second language (L2) speakers of Nordic languages, because we want to study how to detect and characterize their mispronunciations. Thirdly, there is no publicly available data from Nordic children with speech sound disorder. Consequently, substantial effort during the project was devoted to collecting such data. This was done both by simulating the situation where children play the Pop2Talk game and, later, by recording the game sessions with the students.

Our intention when collecting this data was to make all the corpora publicly available, in the spirit of open science principles. However, we soon realized that the rules defined by the General Data Protection Regulation (GDPR) are interpreted very differently in the different Nordic countries (and possibly in other European countries). These rules aim to protect the privacy of European citizens and cover any information that may identify an individual. In collections of text, complying with these rules may simply mean removing personal information such as names, telephone numbers, user IDs, or addresses. For speech, the situation is more complicated. The main question is whether the voice alone is sufficient to identify the speaker and whether this, in turn, constitutes sufficient grounds to forbid public sharing of the recordings.

Our experience in Sweden is that different lawyers may interpret the law differently. The lawyers at KI, for example, think it is fine to share recordings of isolated words. Others disagree: when we consulted the lawyers at Språkbanken Tal (the main channel for sharing speech data in Sweden), their interpretation of the law was much stricter. A similarly strict interpretation is now used in Finland (although publication of speech data from individuals over 16 years of age has been permitted in the past). In Norway, by contrast, the rules are interpreted as applying only to what is spoken, and not to the voice that speaks. As a result, of the many corpora that we recorded with Finnish, Swedish and Norwegian as target languages, we will only be able to publish the Norwegian corpus.

The different responses to our requests to the ethical committees in the different countries illustrate the complexity of sharing speech data. It is worth stressing that the L2 data sets for Swedish, Norwegian and Finnish were equivalent in all respects: age and characteristics of the participants, content of the recordings, anonymization of the participants in the metadata, and the sharing conditions requested, to name a few. It is also important to point out that the characteristics of a child's voice that may allow an individual to be identified change very quickly with age. This means that the participants will no longer be identifiable by their voice just a few months or years after the recordings. We hope that, in the future, uniform interpretations of the regulations and uniform practices for handling speech data will be introduced. We believe that open-access corpora of child speech are of essential value for scientific research and speech technology development, and that they will bring great advantages for society.

We have submitted a paper describing the data collected in the Teflon project in more detail to the LREC2024 conference, which will take place in Turin, Italy, on May 20-25. Please visit the LREC2024 website for more information.