According to the best estimates, more than 7,000 languages are spoken globally today, and roughly 400 of them have more than a million speakers. Because a handful of languages, most notably English, dominate the digital sphere, there is a tremendous need for tools that can work across different languages and carry out diverse tasks.
Artificial intelligence and natural language processing (NLP), a branch of computer science, have been harnessed for decades to develop tools that can do just that. Over the past few years, numerous tools have emerged based on multilingual NLP models. These models serve as a Rosetta Stone for the information age, allowing computers to move seamlessly between languages. They not only provide translation but also support a variety of applications, such as sentiment and content analysis.
Multilingual NLP, therefore, has a vital role to play in the future. It can power machine translation, or analyze social media posts in different languages to gauge sentiment and inform marketing strategies or customer service. It can also underpin content recommendations on streaming services, make customer support available in multiple languages, drive news content analysis, or enable health records translation at scale. In short, many tasks that might once have seemed impossible, such as translating the health records of a French hospital into English, are now possible with Multilingual NLP.
Some also see the rise of Multilingual NLP as a force for the democratization of data, making content and services that were once available in just a few languages accessible to everybody. And Multilingual NLP continues to develop, even incorporating non-textual data.
Of man and machine: Recent advancements in multilingual model architectures
Natural language processing has deep roots. The English mathematician and computer scientist Alan Turing described the potential for computers to generate natural language in his seminal 1950 essay “Computing Machinery and Intelligence.” NLP developed steadily in the ensuing decades, and Multilingual NLP began to develop quickly in the 2000s. However, some of the most significant advancements in multilingual model architectures have occurred over the past decade.
Some of these models are familiar to almost anyone who has translated text online. DeepL, for example, developed by the Cologne, Germany-based DeepL SE, pairs its own algorithms with convolutional neural networks to offer translation between 33 languages and dialects. First launched in 2017, it is a well-known example of a multilingual NLP tool.
Of course, there is also ChatGPT, launched by San Francisco-based OpenAI and built on its Generative Pre-trained Transformer (GPT) foundation models, first GPT-3.5 and later GPT-4. These are among the largest language models available; training on massive datasets allows them to survey huge amounts of text, capture complex patterns in language, and output high-quality text. Both models were also made accessible to developers via an API.
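As a rough illustration, a translation request through the OpenAI API might look like the following minimal sketch; the model name and prompt wording here are illustrative choices, not a prescribed configuration.

```python
# A minimal sketch of translation via the OpenAI Python client.
# Assumes an API key is set in the OPENAI_API_KEY environment variable;
# the model name and prompt are illustrative choices.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a translation assistant."},
        {"role": "user", "content": "Translate into German: The meeting is postponed until Friday."},
    ],
)
print(response.choices[0].message.content)
```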
These models have been adopted en masse for language translation, sentiment analysis, and many other purposes.

In 2018, researchers at Google introduced a language model called Bidirectional Encoder Representations from Transformers (BERT). The model uses a transformer encoder architecture and is employed by the company to make better sense of searches on its platform and to return more relevant results. It is trained via masked token prediction and next-sentence prediction.
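Masked token prediction is easy to see in action. The sketch below assumes the Hugging Face transformers library and the publicly available bert-base-multilingual-cased checkpoint; it illustrates the pretraining objective, not Google's production search setup.

```python
# A minimal sketch of BERT's masked token prediction, using the Hugging Face
# transformers library and a multilingual BERT checkpoint.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")

# BERT is trained to recover the hidden token; the same model handles
# many languages with the same [MASK] mechanism.
for prediction in fill_mask("Paris is the capital of [MASK].")[:3]:
    print(prediction["token_str"], round(prediction["score"], 3))
```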
Various related models have built on BERT, such as RoBERTa, which modifies key hyperparameters, removes the next-sentence pretraining objective, and trains with larger mini-batches.
Not to be outdone, Facebook AI published a model called XLM-R in 2019, training the aforementioned RoBERTa architecture on a multilingual dataset drawn from CommonCrawl and covering about a hundred languages.
The scientists describing the tool noted its ability to perform well in languages with smaller datasets, such as Swahili and Urdu, both of which have tens of millions of speakers. They also noted its performance in cross-lingual understanding, where a model is trained on one language and then used with another without needing more training data.
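That cross-lingual workflow is straightforward to sketch: one model and one tokenizer cover every supported language, so a classifier head fine-tuned on English data can then be applied directly to Swahili or Urdu input. The snippet below, a minimal sketch assuming the Hugging Face transformers library and the public xlm-roberta-base checkpoint, shows the shared-representation half of that story.

```python
# A minimal sketch of XLM-R's shared multilingual representation, using the
# Hugging Face transformers library and the public xlm-roberta-base checkpoint.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")

# One tokenizer and one encoder handle every supported language.
sentences = {
    "English": "The weather is beautiful today.",
    "Swahili": "Hali ya hewa ni nzuri leo.",
    "Urdu": "آج موسم بہت خوشگوار ہے۔",
}

for lang, text in sentences.items():
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model(**inputs)
    # last_hidden_state holds one contextual vector per subword token,
    # in the same embedding space regardless of language.
    print(lang, tuple(outputs.last_hidden_state.shape))
```

Fine-tuning a classification head on top of these shared representations in one language is what makes transfer to the others possible without additional labeled data.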
Ongoing challenges and proposed solutions
While Multilingual NLP has developed at a breakneck pace over the past few years, it does have to contend with various obstacles. One is simply linguistic diversity.
Creating such models is not only about providing seamless translations. Languages vary regionally, differ in how heavily they rely on context, and constantly accumulate new slang. That means NLP models must be continuously updated to stay relevant.
Moreover, some languages are simply not well represented in digital content, and the larger the available dataset, the easier it is to train a model. Smaller communities, particularly those whose languages use non-Latin scripts, are especially left out.
A third and rather intriguing challenge involves code-switching, where speakers switch between languages within a single text or conversation. Think of an English poet who suddenly quotes something extensively in French, or a Japanese writer who spices up his prose with English references. If a model recognizes the language as Japanese, how does it handle the English segments in the text?
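A small sketch makes the problem concrete. The langdetect library, a common Python language-identification package, assigns one label per document, so mixed text has to be segmented before the switch even becomes visible; this is an illustration of the difficulty, not how production systems handle code-switching.

```python
# A minimal sketch of why code-switching trips up document-level language
# detection, using the langdetect library (pip install langdetect).
from langdetect import detect

mixed = "今日はとても忙しかった。 Anyway, see you tomorrow at the usual place!"

# A single document-level label hides the embedded English entirely.
print(detect(mixed))

# A naive per-sentence pass at least recovers the switch point; real
# systems use token-level language identification instead.
for segment in mixed.split("。"):
    segment = segment.strip()
    if segment:
        print(detect(segment), "->", segment)
```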
There are also issues around access to resources and bias. Given the computational wherewithal required to train multilingual NLP models, are only the world's most powerful companies going to be able to muster the resources to create them? Or is there a way to make them more accessible to researchers and organizations? And if datasets favor larger languages or communities, how can one ensure that speakers of smaller languages are well represented?
Finally, there is the ever-present issue of bad data. Researchers must contend with the possibility that their source data for some languages is inaccurate, leading to skewed output.
Solutions across the board revolve around investing more time in research and in cooperation. Researchers must work to obtain better data for underrepresented languages while improving their models. Some have already employed zero-shot and few-shot learning approaches to handle situations where little data is available for a language, as sketched below.
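Here is one way that looks in practice, as a minimal sketch assuming the Hugging Face transformers library and an XLM-R checkpoint fine-tuned on the multilingual XNLI dataset; the specific checkpoint name is an assumption, and any comparable model should behave similarly.

```python
# A minimal sketch of zero-shot multilingual classification: no
# task-specific training data in the input language is required.
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="joeddav/xlm-roberta-large-xnli",  # assumed public checkpoint
)

# Classify a Spanish review against English candidate labels.
result = classifier(
    "El servicio al cliente fue excelente y muy rápido.",
    candidate_labels=["positive", "negative", "neutral"],
)
print(result["labels"][0], round(result["scores"][0], 3))
```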
To reduce bias, they are also working to create more diverse training datasets and to develop metrics that ensure fairness. Developers are likewise aware that content in one language might become offensive or inappropriate if poorly rendered in another, and they are addressing the issue.
In terms of accessibility, smaller-scale models have emerged to address the resource issue. Some of these smaller models include Microsoft’s Orca 2 and Phi 2, EleutherAI’s GPT-J and GPT-Neo, and T5 Small, a slimmed-down version of Google’s Text-to-Text Transfer Transformer (T5).
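T5 Small in particular is light enough to run on modest hardware. The following is a minimal sketch assuming the Hugging Face transformers library and the public t5-small checkpoint, which was pretrained with task prefixes such as "translate English to German:".

```python
# A minimal sketch of translation with T5 Small, a compact model that
# runs comfortably without specialized hardware.
from transformers import pipeline

translator = pipeline("translation_en_to_de", model="t5-small")

result = translator("Multilingual NLP makes content accessible to more people.")
print(result[0]["translation_text"])
```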
The future of Multilingual NLP
Even as developers seek solutions to the challenges facing current-generation models, innovation is underway that is expanding what these models can do.
Multimodal Multilingual NLP will do just that by processing other kinds of data, such as images, audio, and video, alongside text. It could potentially analyze facial expressions or tone of voice, for example, which could improve machine translation or sentiment analysis by adding new dimensions of data to the processing pipeline.
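One building block for this already exists in multilingual image-text embedding models. The sketch below assumes the sentence-transformers library and two of its public checkpoints, clip-ViT-B-32 for images and clip-ViT-B-32-multilingual-v1 for text in roughly 50 languages; it illustrates the idea rather than a full multimodal pipeline.

```python
# A minimal sketch of multimodal multilingual matching: captions in any
# supported language are embedded into the same vector space as images.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

img_model = SentenceTransformer("clip-ViT-B-32")                   # image encoder
text_model = SentenceTransformer("clip-ViT-B-32-multilingual-v1")  # text encoder

# The image path is a placeholder for any local photo.
img_emb = img_model.encode(Image.open("photo_of_a_dog.jpg"))
captions = [
    "A dog playing in the park",      # English
    "Ein Hund spielt im Park",        # German
    "Un perro jugando en el parque",  # Spanish
]
txt_emb = text_model.encode(captions)

# Cosine similarity scores the match between the image and each caption,
# regardless of the caption's language.
print(util.cos_sim(img_emb, txt_emb))
```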
Innovation is also underway to improve existing voice assistants and multilingual chatbots. Apple's Siri voice assistant can currently respond to queries in about 25 languages and dialects, while Amazon's Alexa is available in nine. By using Multilingual NLP, these voice assistants could be made accessible to millions more people worldwide.
Chatbots and virtual agents can likewise be improved, not only in terms of content but also by making their responses more contextual and specific to the person's query, which will, in turn, improve the user experience.
As the technology evolves, Multilingual NLP will expand beyond translation, sentiment analysis, and other current uses to broader-scale applications. For instance, online education tools could be more readily available in various languages.
Companies can improve their research, reach more clients, and serve local markets better than they do today, all with the help of Multilingual NLP. In short, it's still early days for Multilingual NLP. Given the speed of developments, the future will be here soon enough.