Deep Mind In 125 Languages

Blog Credit : Trupti Thakur

Image Courtesy : Google

Google DeepMind’s India unit is leading an AI project called “Morni.” The project aims to build AI that can understand and work with 125 Indian languages and dialects. India is highly multilingual, so AI that works across these languages would let far more people benefit from it, whatever language they speak.

Why is the Morni Project Important?

India officially recognizes 22 scheduled languages, but the Morni project is aiming much higher, targeting over 100 languages. Many people in India speak languages that are not on the scheduled list but are still widely used. According to Google DeepMind, 60 Indian languages are spoken by more than a billion people in total, and over 125 languages have more than a lakh (100,000) speakers each. A large number of people are therefore not fully represented in current AI technologies.

Challenges in Language Data

One big challenge that Manish Gupta, a director at Google DeepMind in India, has pointed out is the lack of digital data for many Indian languages. For 73 of the 125 languages targeted by the project, no digital data is available at all, which makes it very hard to develop AI that can understand and work with them. Even Hindi, which is spoken by close to 10% of the world’s population, makes up only 0.1% of the text available on the internet.

How is Google Addressing These Challenges?

To solve the problem of not having enough language data, Google started Project Vaani in collaboration with the Indian Institute of Science (IISc) and ARTPARK. The goal of this project is to create a large, open-source database of speech data from different Indian languages. This data will help the Morni project build AI that can understand and respond in these languages.

Progress and Future Goals

So far, Project Vaani has completed its first phase, where it collected over 14,000 hours of speech data from 58 languages. This involved 80,000 people from 80 different districts across India. The ultimate goal is to collect 154,000 hours of speech data from all 773 districts in the country. The project is currently in its second phase, where it is focusing on covering all states by collecting data from 160 districts.
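To put these figures in perspective, the short Python sketch below works out how far phase one has come relative to the stated goals. It uses only the numbers quoted above; the helper name `coverage_pct` is made up for this illustration and is not part of Project Vaani.

```python
def coverage_pct(done: float, target: float) -> float:
    """Return `done` as a percentage of `target` (hypothetical helper)."""
    return 100.0 * done / target

# Figures as quoted in this post.
hours_phase_one = 14_000      # "over 14,000 hours", so this is a lower bound
hours_goal = 154_000
districts_phase_one = 80
districts_phase_two = 160
districts_total = 773

print(f"Speech hours collected: {coverage_pct(hours_phase_one, hours_goal):.1f}% of goal")
print(f"Districts after phase one: {coverage_pct(districts_phase_one, districts_total):.1f}%")
print(f"Districts after phase two: {coverage_pct(districts_phase_two, districts_total):.1f}%")
```

In other words, phase one delivered roughly a tenth of the target, both in hours and in districts, and phase two would bring district coverage to about a fifth.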

What is Google DeepMind?

Google DeepMind, founded in 2010, is a leading AI research company. It became famous in 2016 when its AI program, AlphaGo, beat a world champion in the game of Go. The team also developed AlphaFold, which made a major advance in predicting how proteins fold, a long-standing problem in biology. DeepMind’s AI has been used in healthcare, where it can predict when a patient’s health might get worse and help diagnose eye diseases. The company also created AI that reduced the energy used for cooling Google’s data centers by up to 40%. DeepMind says it is committed to making AI that is safe, fair, and ethical.

Google DeepMind’s India unit is working on an Indic language artificial intelligence project called Morni (Multimodal Representation for India), with an aim to cover 125 Indian languages and dialects. “So, India has 22 scheduled languages, which are viewed as official languages. But in our work, we are targeting over 100 Indian languages, because we find that there are 60 Indian languages which have over a billion speakers and over 125 languages that have over a lakh speakers each,” said Manish Gupta, director at Google DeepMind, Google India. He was speaking at the Global Fintech Fest in Mumbai on Thursday. According to him, 73 of these 125 languages had zero corpus of digital data available. Even for a language like Hindi, which is now spoken by close to 10% of the world’s population, the share of text on the internet is 0.1%.

To address the challenge of sourcing data for these languages, Google’s research lab launched Project Vaani, a collaboration among Google, the Indian Institute of Science and ARTPARK (Artificial Intelligence & Robotics Technology Park).

The project has completed its first phase to create an open-source database of over 14,000 hours of speech data in 58 languages, collected from 80,000 speakers in 80 districts, Gupta said.

First announced in December 2022, Project Vaani aims to collect and transcribe 154,000 hours of open-source anonymised speech data from all districts of India. Gupta said they are now in the middle of phase two that will cover 160 districts, spread across all states.

Recently, in its largest ever expansion of language coverage for Google Translate, the company added 110 new languages worldwide, out of which five were Indian.

“We extended PaLM-2 (transformer model) to understand over 1,500 languages of the world…110 new languages were added in one shot a few months ago, covering some 600 million plus people whose language is now covered by Google Translate,” Gupta said.

Besides these, Google is also building a digital agri-stack, which could unlock use cases such as loans and credit for farmers and reasonably priced crop insurance, and could support various subsidy programmes that the government currently runs in a non-data-driven manner.
