Did you ever share revision notes with friends in school? If you’ve ever had to study at the last minute for an important exam, you’ll know how useful it can be to share and compare notes with other students. As it turns out, the same is true for machine learning algorithms.
Effectively training a machine learning algorithm requires huge amounts of labelled data: raw data tagged with meaningful labels that give the algorithm context. To reduce the effort and cost of manual labelling, computer scientists have developed a process called domain adaptation, which allows machine learning algorithms to reuse existing labelled data from slightly different but still relevant datasets.
For example, labelled data collected in sunny Singapore could be used to train a vehicle identification algorithm to recognise vehicles in Iceland, despite the stark differences in weather, vehicle types, and road conditions.
Impressive though this is, current domain adaptation techniques are far from perfect. They often transfer irrelevant data that slows or even harms learning. Now, however, A*STAR researchers from the Institute for Infocomm Research (I2R), in collaboration with a team from Nanyang Technological University, have developed data selection software that automatically chooses the most relevant data from a well-labelled source and excludes irrelevant samples that might hinder learning.
“The most exciting thing is that the superiority of the proposed method becomes more noticeable when dealing with more complex datasets. Here, it continuously outperforms all the baseline methods on almost all tasks and improves the accuracy by a large margin,” said Keyu Wu, first author of the research paper and a scientist at I2R.
Moreover, the researchers’ data selector can also be used with partial domain adaptation (PDA) techniques, which apply when the target domain needs only part of the dataset from the source domain. This setting is more practical, as most real-world AI applications are trained on customised datasets. For example, a medical imaging dataset may cover five diseases, while a customised real-world task may only require data from three of them.
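To make the five-disease example concrete, here is a minimal Python sketch of the general idea behind partial domain adaptation. It is not the researchers' data selector, whose internals this article does not describe. It assumes a simple stand-in heuristic: each unlabelled target sample votes for the nearest source class centroid, and only source classes that collect enough votes are kept. The function names, the `min_share` threshold, and the toy two-dimensional data are all hypothetical.

```python
from collections import Counter, defaultdict
from math import dist


def class_centroids(source):
    """Average the feature vectors of each labelled source class."""
    grouped = defaultdict(list)
    for features, label in source:
        grouped[label].append(features)
    return {
        label: tuple(sum(axis) / len(rows) for axis in zip(*rows))
        for label, rows in grouped.items()
    }


def select_relevant_source(source, target, min_share=0.1):
    """Keep source samples whose class appears to be present in the target.

    Assumed heuristic for illustration only: nearest-centroid voting.
    """
    centroids = class_centroids(source)
    # Each unlabelled target sample votes for its nearest source class.
    votes = Counter(
        min(centroids, key=lambda label: dist(centroids[label], x))
        for x in target
    )
    keep = {label for label, n in votes.items() if n / len(target) >= min_share}
    selected = [(x, y) for x, y in source if y in keep]
    return selected, keep


# Toy source domain: five "diseases" (A to E), two labelled samples each.
source = [
    ((0.0, 0.0), "A"), ((1.0, 0.0), "A"),
    ((10.0, 0.0), "B"), ((11.0, 0.0), "B"),
    ((0.0, 10.0), "C"), ((0.0, 11.0), "C"),
    ((10.0, 10.0), "D"), ((11.0, 11.0), "D"),
    ((20.0, 20.0), "E"), ((21.0, 20.0), "E"),
]

# Toy target domain: unlabelled samples that resemble only A, B and C.
target = [
    (0.5, 0.2), (0.3, -0.1),
    (10.4, 0.3), (10.6, -0.2),
    (0.1, 10.5), (-0.2, 10.4),
]

selected, kept = select_relevant_source(source, target)
print(sorted(kept))  # ['A', 'B', 'C']
```

In this toy run, the samples for diseases D and E are dropped before training, which is the kind of filtering a partial domain adaptation pipeline relies on. A real selector would have to cope with noisy, high-dimensional features, where a simple nearest-centroid rule is unlikely to be enough.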
While the data selector can currently be integrated into any existing domain adaptation or PDA model, the researchers still aim to improve their data selector further. “In the next two to three years, we plan to achieve better performance in both partial domain adaption and domain adaption tasks,” Wu said.
The A*STAR-affiliated researchers contributing to this research are from the Institute for Infocomm Research (I2R).