I was in Cape Town (SA) this week to attend and give a keynote at SLTU 2012, the third workshop on Spoken Language Technologies for Under-resourced Languages. In a few words, it is the community working on how to develop Text-To-Speech (TTS) and Speech Recognition (SR) engines for languages that don’t have enough speakers, or that are not well known enough, to be handled by the traditional approaches that were developed for e.g. French, Spanish or English.
TTS and SR are essential modules for delivering voice-based services, and in that regard it is a topic we are working on in all our projects, in particular VBAT and VOICES. While voice-based services can be built in any language of the world by recording audio prompts, TTS and SR are essential for building more advanced, dynamic, multi-channel (web and voice) services. While I’m not at all an expert in this domain, I was interested to learn how other people and organizations are addressing the same problems. I was also interested to see the types of use cases that the community is working on.
In the end, I’m amazed and really surprised. First of all, I realized that I had absolutely no idea how SR and TTS engines are built. A big part of the talks was incomprehensible to me. The domain, like all domains, has lots of specific words, acronyms, etc. that make no sense to a newbie like me. In our services, we are customers of these modules and have a relatively good understanding of the functional requirements, but how to develop these modules is a totally different story. So it was very interesting to get this insider’s view, and from now on I will be much quieter on these topics in my future talks, as I now know that I have no idea what I’m talking about.
The second major discovery was about the use cases. One of my expectations was to find people doing work similar to ours. I was totally wrong! Except for one or two organizations interested in applications and focusing on services for under-privileged communities, most of the people in the room were not focusing on any application. They are developing these modules for all kinds of reasons, but mostly to resolve research issues, without any focus on applications. This is surprising to me. The reason I came was exactly this point: trying to push people to focus on useful applications for speech technologies. I hope that what I presented was easier to follow than the other talks were for me, and that it will have an impact on some of the participants.
African Conference Room: fireplace
The last very interesting point was the second keynote, delivered by Pedro Moreno, who leads the speech technologies department at Google. Most of our work in this domain focuses on using voice technologies to bring the benefits of the Web to those who are currently excluded. Google has the opposite approach: how to make the Web friendlier for those who are already users. The role of the speech technologies department is to offer a voice-based search service: using the Google search engine through voice instead of text. All their work and algorithms are aimed at supporting as many languages as possible in their voice search, as long as there is enough content available on the Web in a given language and there are enough people searching for this content through Google. So they analyze the query strings, which lets them identify where there is enough content, and then develop the languages they can support. With their current methodology they would not be able (as Pedro answered when I asked) to develop support for languages that are not yet present on the Web. This is another very interesting approach, but it reinforced my feeling that we have to push hard to extend the frontier of the Web and bring on board people who are not yet Web citizens, to ensure that the technology evolves in a way that facilitates their interaction.
All in all, I learnt tons of things about languages (e.g. the different types of languages, the fact that languages whose writing system is less than 100 years old are mostly phonetic, the fact that tonal languages lose their tonal aspect if the number of speakers increases a lot, etc.). It was a real eye-opener, and it was really worth the trip.