Thinking outside the sample

19 Apr 2016

Helping computers learn to tackle big-data problems outside their comfort zones

A*STAR researchers present a new machine learning framework to solve big-data problems.

Imagine combing through thousands of mugshots desperately looking for a match. If time is of the essence, the faster you can do this, the better. A*STAR researchers have developed a framework that could help computers learn how to process and identify these images both faster and more accurately.

Peng Xi of the A*STAR Institute for Infocomm Research notes that the framework can be used for numerous applications, including image segmentation, motion segmentation, data clustering, hybrid system identification and image representation.

A conventional way that computers process data is called representation learning. This involves identifying a feature that allows the program to quickly extract relevant information from the dataset and categorize it — a bit like a shortcut. Supervised and unsupervised learning are two of the main methods used in representation learning. Unlike supervised learning, which relies on costly labeling of data prior to processing, unsupervised learning involves grouping or ‘clustering’ data in a similar manner to our brains, explains Peng.

Subspace clustering is a form of unsupervised learning that seeks to fit each data point into a low-dimensional subspace to find an intrinsic simplicity that makes complex, real-world data tractable. Existing subspace clustering methods struggle to handle ‘out-of-sample’, or unknown, data points and the large datasets that are common today.

“One of the challenges of the big-data era is to organize out-of-sample data using a machine learning model based on ‘in-sample’, or known, observational data,” explains Peng, who, with his colleagues, has proposed three methods as part of a unified framework to tackle this issue. The methods differ in how they implement representation learning: one focuses on sparsity, another on low rank and the third on grouping effects. “By solving the large-scale data and out-of-sample clustering problems, our method makes big-data clustering and online learning possible,” notes Peng.

The framework devised by the team first splits the input data into ‘in-sample’ and ‘out-of-sample’ data in an initial ‘sampling’ step. Next, the in-sample data is grouped into subspaces during the ‘clustering’ step, after which each out-of-sample point is assigned to its nearest subspace and designated a member of the corresponding cluster.
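The sample, cluster and assign steps described above can be sketched in a few lines of Python. This is a toy illustration only, not the team's implementation: every function name and parameter here is hypothetical, each subspace is assumed to be a line through the origin, and a simple K-subspaces loop stands in for the representation-based (sparse, low-rank or grouping-effect) clustering used in the actual work.

```python
import math
import random


def unit(v):
    """Scale v to unit length (a zero vector is returned unchanged)."""
    n = math.sqrt(sum(a * a for a in v)) or 1.0
    return [a / n for a in v]


def residual(x, u):
    """Distance from x to the line through the origin with unit direction u."""
    dot = sum(a * b for a, b in zip(x, u))
    return math.sqrt(max(0.0, sum(a * a for a in x) - dot * dot))


def nearest(x, dirs):
    """Index of the line in dirs that lies closest to x."""
    return min(range(len(dirs)), key=lambda j: residual(x, dirs[j]))


def fit_direction(points, iters=50):
    """Dominant direction of points, by power iteration on their scatter matrix."""
    u = unit(max(points, key=lambda p: sum(a * a for a in p)))
    for _ in range(iters):
        v = [0.0] * len(u)
        for p in points:
            dot = sum(a * b for a, b in zip(p, u))
            v = [vi + dot * pi for vi, pi in zip(v, p)]
        u = unit(v)
    return u


def cluster_in_sample(points, k, iters=20):
    """K-subspaces stand-in for the clustering step (assumption, see lead-in)."""
    # Farthest-first initialisation: start from one point, then keep adding
    # the point whose direction is worst explained by the lines chosen so far.
    dirs = [unit(points[0])]
    while len(dirs) < k:
        far = max(points, key=lambda p: min(residual(unit(p), u) for u in dirs))
        dirs.append(unit(far))
    for _ in range(iters):
        labels = [nearest(p, dirs) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                dirs[j] = fit_direction(members)
    return dirs, [nearest(p, dirs) for p in points]


def cluster_large_dataset(data, k, n_in_sample, seed=0):
    """Sample, cluster the in-sample points, then assign everything else."""
    rng = random.Random(seed)
    in_idx = sorted(rng.sample(range(len(data)), n_in_sample))  # sampling step
    dirs, in_labels = cluster_in_sample([data[i] for i in in_idx], k)  # clustering step
    labels = [None] * len(data)
    for i, lab in zip(in_idx, in_labels):
        labels[i] = lab
    for i, x in enumerate(data):
        if labels[i] is None:  # out-of-sample point: nearest subspace wins
            labels[i] = nearest(x, dirs)
    return labels


# Toy data: 20 points near the x-axis followed by 20 points near the y-axis.
data = [[t, 0.05 * (-1) ** t] for t in range(1, 21)]
data += [[0.05 * (-1) ** t, t] for t in range(1, 21)]
labels = cluster_large_dataset(data, k=2, n_in_sample=24)
print(labels[:3], labels[-3:])
```

Only the in-sample points go through the iterative clustering loop; every other point costs a single nearest-subspace lookup, which is the source of the savings on large datasets that the article describes.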

Researchers from the A*STAR Institute for Infocomm Research; Peng Xi is third from the left.

The team tested their approach on a range of datasets covering different types of information, from facial images and text (both handwritten and digital) to poker hands and forest coverage. They found that their methods outperformed existing algorithms and reduced the computational complexity, and hence the running time, of the task while still ensuring cluster quality.

The A*STAR-affiliated researchers contributing to this research are from the Institute for Infocomm Research.

References

Peng, X., Tang, H., Zhang, L., Yi, Z. & Xiao, S. A unified framework for representation-based subspace clustering of out-of-sample and large-scale data. IEEE Transactions on Neural Networks and Learning Systems advance online publication, 29 October 2015 (doi: 10.1109/TNNLS.2015.2490080).

This article was made for A*STAR Research by Nature Research Custom Media, part of Springer Nature
