In a globalized world, language differences represent some of the final barriers to information transfer. Although software like Google Translate has recently emerged to narrow those differences, anyone who has used machine-mediated translation will acknowledge that the conversion of text from one language to another remains imperfect.
For a machine to perform translation effectively, it must be able to map the vocabulary and grammatical rules of one language onto those of another. This requires a technique known as transfer learning. “Early transfer learning algorithms focused on homogeneous domain adaptation, which assumes that the source domain has very similar features to the target domain. While this approach has been useful for understanding texts in the same language, it is inefficient for cross-language classification,” explained Joey Zhou, Group Leader at A*STAR’s Institute of High Performance Computing (IHPC).
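To make that distinction concrete, here is a minimal sketch (a hypothetical illustration, not the team’s code) of the two settings: in homogeneous adaptation the source and target documents share one feature space, while in heterogeneous adaptation the target features live in a different space and must first be projected into the source space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Homogeneous adaptation: source and target share the same 300 features,
# so a model trained on the source can read the target directly.
X_source = rng.normal(size=(100, 300))        # e.g., English documents
X_target_same = rng.normal(size=(50, 300))    # same feature space

# Heterogeneous adaptation: the target language has its own vocabulary
# and hence its own feature space, so it must be aligned to the source
# space before a source-trained model can be applied.
X_target_other = rng.normal(size=(50, 500))   # e.g., 500 Spanish features
P = rng.normal(size=(500, 300))               # projection; learned in practice
X_target_aligned = X_target_other @ P         # now comparable to X_source
```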
Compounding the problem is the fact that although there are extensive and well-annotated datasets for the English language, the same cannot be said for other languages, such as Spanish or Vietnamese. To deal with the disparity in features between two languages, as well as imbalances in the availability of annotated language datasets, a heterogeneous domain adaptation approach for transfer learning is needed.
Zhou’s team thus developed an algorithm that explores the underlying structures of a source and a target language, then matches each foreign word with just a few English words, reducing the complexity of mapping features between the two languages.
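One plausible way to realize such sparse matching, sketched below with random vectors standing in for real word embeddings (the function name and cosine-similarity scoring are illustrative assumptions, not the team’s published method), is to score every foreign–English word pair and keep only each foreign word’s few strongest English matches:

```python
import numpy as np

def sparse_word_matches(foreign_vecs, english_vecs, k=3):
    """For each foreign word vector, keep only its top-k English matches.

    foreign_vecs: (n_foreign, d) array; english_vecs: (n_english, d) array.
    Returns an (n_foreign, n_english) weight matrix, mostly zeros.
    """
    # Cosine similarity between every foreign/English word pair
    f = foreign_vecs / np.linalg.norm(foreign_vecs, axis=1, keepdims=True)
    e = english_vecs / np.linalg.norm(english_vecs, axis=1, keepdims=True)
    sim = f @ e.T

    # Zero out everything except the k strongest matches per foreign word
    weights = np.zeros_like(sim)
    topk = np.argpartition(sim, -k, axis=1)[:, -k:]
    rows = np.arange(sim.shape[0])[:, None]
    weights[rows, topk] = sim[rows, topk]
    return weights

# Toy usage with random stand-ins for real word embeddings
rng = np.random.default_rng(42)
W = sparse_word_matches(rng.normal(size=(5, 50)), rng.normal(size=(20, 50)))
print((W != 0).sum(axis=1))  # each foreign word maps to just 3 English words
```

Keeping only a handful of matches per word means the cross-language mapping stays sparse, which is what keeps the feature-matching problem tractable.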
Next, the researchers built in a constraint that streamlines learning by having the algorithm ignore less informative features. They also used error-correcting output codes, which allow the algorithm to rectify individual prediction errors and arrive at accurate final word matches, making cross-language classification more robust.
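Using off-the-shelf scikit-learn pieces rather than the team’s own implementation, the sketch below shows how these two ideas might fit together: an L1 penalty drives the weights of uninformative features to zero, while the error-correcting output code wrapper assigns each class a binary codeword, so that even if a few of the underlying binary classifiers misfire, decoding to the nearest codeword can still recover the correct label.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OutputCodeClassifier

# Synthetic stand-in for document features; only 10 of 100 are informative
X, y = make_classification(n_samples=600, n_features=100, n_informative=10,
                           n_classes=6, n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# L1 penalty -> sparse weights, effectively ignoring unimportant features
base = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)

# Error-correcting output codes: each class gets a binary codeword, and
# predictions are decoded to the nearest codeword, absorbing small errors
ecoc = OutputCodeClassifier(base, code_size=2.0, random_state=0)
ecoc.fit(X_train, y_train)
print("test accuracy:", ecoc.score(X_test, y_test))
```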
The team then applied their algorithms to real-world translation scenarios. “For example, our algorithms could analyze sentiments in actual product reviews and classify document topics in either English, German, French or Japanese,” Zhou said.
Beyond the realm of translation, the researchers’ technique can also be used to categorize text content. “Our algorithms outperformed six state-of-the-art baseline artificial intelligence methods in correctly classifying a collection of BBC News articles into six pre-defined topics, with the best algorithm typically exceeding 70 percent accuracy even when working on a different language from the one it was trained on,” he shared.
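As a rough picture of that cross-language evaluation protocol (with random arrays standing in for real document features, and a random placeholder where a learned projection would go), a model trained on English documents is scored directly on projected documents from another language:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OutputCodeClassifier

rng = np.random.default_rng(7)

# Random stand-ins: English training documents (6 topics) and German
# test documents that live in a different, larger feature space
X_en, y_en = rng.normal(size=(300, 200)), rng.integers(0, 6, 300)
X_de, y_de = rng.normal(size=(100, 350)), rng.integers(0, 6, 100)

# German-to-English feature projection; random here, learned in practice
P = rng.normal(size=(350, 200))

clf = OutputCodeClassifier(LogisticRegression(max_iter=200),
                           code_size=2.0, random_state=0).fit(X_en, y_en)

# Evaluate the English-trained model on projected German documents
# (near chance on random data; real features and a real projection
# are what lift this toward the reported accuracies)
print("cross-language accuracy:", clf.score(X_de @ P, y_de))
```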
Moving forward, the team intends to integrate deep learning features from state-of-the-art language models into their algorithms, further improving their speed and performance. With these developments, seamless computerized translation could become a reality sooner rather than later.
The A*STAR-affiliated researcher contributing to this research is from the Institute of High Performance Computing (IHPC).