How AI is helping indigenous languages survive and thrive
By Erin Kalejs
Language matters: it is primarily how we communicate and connect with others. According to a study done by Stanford University, there are roughly 7,000 languages spoken worldwide. While languages have a lot in common, each one is unique, both in its composition and in the way it reflects the culture of the people who speak it.
Unfortunately, many of the world’s languages are under threat. Indigenous languages such as Māori saw a major decline between 1920 and 1960. Now, the fluent population within many indigenous groups is both decreasing and aging, leaving both the languages and their cultures endangered.
In a recent AI for Good webinar, Peter-Lucas Jones, Chief Executive Officer of Te Hiku Media, a small non-profit media organization based in New Zealand, and Keoni Mahelona, the organization’s Chief Technology Officer, discussed how Artificial Intelligence (AI) is helping to preserve te reo Māori.
Korero Māori: teaching computers indigenous languages
Jones began to see Māori speech recognition as a necessity after they digitized the massive audio collection Te Hiku Media had gathered over 30 years of radio broadcasting. Te Hiku Media was originally a radio station founded in 1990, born out of the Māori rights movement in order to fight against language loss and cultural decline. The organization’s ultimate goal now is to stop the language from dying out by giving their community access to Māori content.
The aim behind this speech-to-text project, “Korero Māori”, was to give access to the stories they had collected from their elders via radio interviews done over the last 30 years. Amazingly, they began to transcribe interviews by hand and tag phrases. Te Hiku Media wanted to transform these interviews with native speakers of te reo Māori so that second-language learners could hear these high-quality examples of proficient speakers.
However, as Jones points out, “to transcribe this level of speaker you need to be pretty good at speaking and hearing the language. Unfortunately, we had way more interviews than people who were able to transcribe. That’s really what started us thinking about speech to text.” It prompted the team to ask, “how could we accelerate the way in which we could transcribe our archives, making them useful for educational purposes?” That was the moment they decided to take on the challenge and teach computers how to speak Māori.
The tools for building speech recognition technology are quite accessible, thanks to open-source projects such as Mozilla’s DeepSpeech. The obstacle for indigenous communities is a lack of labelled data to work with, and collecting it from scratch is no easy task.
To obtain large quantities of labelled data, it was imperative for Te Hiku Media to engage with their community. Through a competition they held, they managed to gather 316 hours of data in 10 days. It was an astonishing start and gave them enough initial data to build a speech-to-text engine. This project, Jones says, would not have been possible without their community’s support, which is why it is very important for “projects like this to be community led and when you understand the community and you come from the community, there’s a level of trust that the community have in you. This was a clear example that our community supported us to do this piece of work,” said Jones.
During the webinar, Jones also spoke about the impact English has had, and is still having, on the pronunciation of second-language Māori speakers, and about the importance of restoring the natural sound of te reo Māori.
“We want to look at how to speak native sound back into our future, so we’ve been looking at the way in which our people speak Māori and how they sound when they speak Māori. We’re restoring that native sound because it’s such an important part of our place and our identity. As people that have been colonized by others and of course having language and land loss and all of our resources pretty much stolen from us we understood the difference between natural evolution and assimilation,” said Jones.
This is precisely why Te Hiku Media recently launched their new Māori pronunciation app, “Rongo”. The app uses their machine learning model to assess te reo Māori pronunciation, providing an oral experience without text. This completely oral method is a new and exciting way to learn Māori. As Mahelona puts it, “there’s learning the language then there’s speaking it correctly.”
Mahelona adds, “we’re hoping an app like this along with machine learning tools can help us to detect those mistakes at scale and correct them at scale.”
The organization has also gone on to build a web-based transcription tool called “Kaituhi”, in which the user can upload audio that is then broken into chunks and automatically transcribed using their speech recognition API. This has opened the door to their next project: training a bilingual speech recognition model, as their current model only detects te reo Māori and not English.
As Mahelona explains, “for languages like te reo Māori where there is a history of colonization there is a lot of bilingual. If you want speech tools that are practical and can be used day to day by everybody, they absolutely have to be bilingual.”
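For readers curious how a chunk-then-transcribe flow like the one described for Kaituhi might look, here is a minimal Python sketch. It is an illustration only, not Te Hiku Media’s implementation: the function names, the fixed 10-second chunk length and the `transcribe_chunk` callback that stands in for a speech recognition API are all assumptions made for this example.

```python
from typing import Callable, List, Sequence


def split_into_chunks(
    samples: Sequence[float], sample_rate: int, chunk_seconds: float
) -> List[Sequence[float]]:
    """Split raw audio samples into consecutive fixed-length chunks.

    The last chunk may be shorter than the others.
    """
    chunk_size = int(sample_rate * chunk_seconds)
    if chunk_size <= 0:
        raise ValueError("sample_rate * chunk_seconds must be at least 1 sample")
    return [samples[i:i + chunk_size] for i in range(0, len(samples), chunk_size)]


def transcribe_audio(
    samples: Sequence[float],
    sample_rate: int,
    transcribe_chunk: Callable[[Sequence[float]], str],
    chunk_seconds: float = 10.0,  # hypothetical default, chosen for illustration
) -> str:
    """Transcribe audio chunk by chunk and join the pieces in order.

    `transcribe_chunk` is a placeholder for whatever speech recognition
    service is used; it takes one chunk and returns its text.
    """
    chunks = split_into_chunks(samples, sample_rate, chunk_seconds)
    texts = [transcribe_chunk(chunk) for chunk in chunks]
    # Skip empty results (for example, silent chunks) when joining.
    return " ".join(text for text in texts if text)
```

A real tool would more likely cut on pauses than on fixed intervals, so that words are not split across chunks, but the overall shape (split, transcribe each piece, stitch the text back together) stays the same.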
Digital sovereignty was a major theme of the AI for Good session, as Jones and Mahelona are using AI not only to advance the understanding of Māori language and culture but also to safeguard and maintain sovereignty over the community’s data.
“We want to ensure that our cultural knowledge, languages and digital communities aren’t damaged in the same way that colonization has damaged us prior,” explains Mahelona.
The organization’s “Kaitiakitanga License” upholds a set of principles that build Māori capability, ensure accountability and benefits flow back to the community from which the data was derived, and help maintain sovereignty overall. It also prohibits the use of data, software and tools for applications that discriminate and persecute.
Jones explained why data sovereignty is so crucial to indigenous communities: “We’re a Māori charity and so we look at ways to maintain sovereignty because we know what it means to lose it. We know that data is the new land and having had our land taken from us we take data sovereignty very seriously.”
“Whilst we’re reclaiming and re-emerging now and re-imagining our future, we’re very mindful of our past and so our approach is one of affirmative action for data software and the way in which we develop digitally,” Jones added.
Mahelona also discussed the challenges that come with maintaining data sovereignty: “If we want our indigenous languages and our cultures to survive then they need to exist in the digital domain. The challenge is when the digital domain is dominated by a select few companies. It makes it challenging because it says that if we want our languages and cultures to exist digitally that means they have to exist in these dominating platforms. For us as an organization and indigenous people we don’t want our stories in a platform that will also show racist things that undermines the content that we share.”
This is why, Mahelona says, it is important that they promote their own languages and culture from their own worldview: “that is the best way to tell the stories of our people.”
Mahelona also mentioned the pros and cons of using open-source tools and how, in some cases, open source does not work for indigenous data: “If we put our data and tools into open source, for example GitHub, it’s very unlikely that our own people will have the tools to access that data.”
More and more non-indigenous companies are requesting access to the speech tools Te Hiku Media have developed. As Mahelona clarifies, “we’re not surprised by this, but we want to change that, we want more Māori to be innovating and building digital apps and things like that. In terms of access to our API we want to give preference to young up-and-coming Māori developers to access our tools before some big company accesses it.”
An ethical future for AI
Instead of big tech companies misusing indigenous data and selling it back to them, Mahelona believes the answer lies in empowering communities to create their own platforms and solutions to help move their people forward. Te Hiku Media is a wonderful example of this. As Mahelona says, “we’ve shown that as an organization in today’s day and age with the technology available and with the right skill set you can actually do some impactful things.”
What Te Hiku Media have achieved so far as a non-profit in terms of machine learning initiatives is outstanding, including an incredibly fast te reo Māori ASR (automatic speech recognition) system, a real-time pronunciation feedback model, automatic part-of-speech (POS) tagging, and new and improved te reo Māori speech synthesis.
However, out of all of Te Hiku Media’s accomplishments, attaining global recognition for their Kaitiakitanga License within the AI industry is the one they are proudest of because “it is not the tools we are building, it’s how we’re looking after the data and the tools that we’re building,” says Mahelona.
Jones concluded the session by reminding everyone that there is hope for a better future for AI and indigenous languages: “It’s not just about teaching a computer to speak our language, it’s not just about machine learning, it’s about the survival and the future of our culture and our language for our future generations. That’s the AI for good, for us.”