Go Big and Go Deep: Mining Data at Scale

Jun 14, 2016

Pravin Kakar

Scientist

Institute for Infocomm Research (I²R)

The Institute for Infocomm Research (I²R) is a member of the Agency for Science, Technology and Research (A*STAR) family and is Singapore’s largest ICT research institute. Established in 2002, our vision is to power a vibrant and strong infocomm ecosystem in Singapore. We seek to foster world-class infocomm and media research and develop a deep talent pool of infocomm professionals to power a vibrant knowledge-based Singapore.

The phrase ‘big data’ is thrown about a lot these days. You may have heard it in reference to disease outbreaks, insurance assessment, banking and e-commerce. Big data mining refers to leveraging huge volumes of data from various sources and unearthing patterns that can explain or predict phenomena. Indeed, the phones we carry in our pockets, the credit cards we use, the EZ-Link cards that we tap on public transport and the Fitbits we wear are a treasure trove that can help us understand who we are. Now multiply the amount of data you generate by the number of people with similar data sources, and you can visualize the scale of data that is available for analysis.

My research here in the Institute for Infocomm Research’s Data Analytics Department focuses on ‘deep learning’ algorithms to utilize this big data in a rather elegant way. Deep learning refers to using ‘deep’ neural network algorithms that consist of several layers of artificial neurons intended to mimic the way a human brain functions. While still a far cry from the actual complexity of the human brain, the fundamental idea holds: stacking layers of fairly simple computations can allow really complex problems to be solved. The field of neural networks itself has had a long and checkered history, from great optimism in the ‘50s and ‘60s to the Artificial Intelligence (AI) winter of the ‘70s, from the ground-breaking discovery and application of techniques like backpropagation to neural networks in the ‘80s, to the steady increase in data and computational power through the ‘90s and into the 21st century.

I became interested in working with neural networks during my undergraduate studies, when I designed a system that could take my hand-drawn electronic circuit diagrams and, using neural networks and some image processing techniques, convert them to a format that I could simulate on a computer. I was introduced to deep learning techniques when I read about their breakthrough performance in image recognition tasks back in 2012. The proposed neural network, AlexNet, was able to reduce a classification error metric by over 40 per cent, at a scale of over a million training images spread across a thousand classes. Since then, the entire field of visual computing has dived into deep learning, experimenting with outrageously bigger and more complex networks and coming close to human-level performance in image recognition, a task that even half a decade ago would have seemed impossible.

A neural network basically takes some data as an input, passes it through multiple internal layers of mathematical transformations and produces an output corresponding to some objective. This also means that it can be trained to perform a complex task by providing it with several examples of inputs and desired outputs, and letting it figure out how best to optimize its internal transformations. One major problem with neural networks is that, to perform significantly complex tasks, they unsurprisingly need to stack many of these simple transformations. However, with several layers, they need new computational tricks to be trained successfully, greater computational power and, last but not least, more data. The last factor is a bit subtle, but it stems from the fact that neural networks tend to be a bit too eager to reproduce the data they are presented with during the training process. This ‘overfitting’ can be detrimental to generalization to new data. As an analogy, consider the case of a child learning multiplication tables. While he may be able to learn many such tables by rote, unless he understands what multiplication actually is, he will not be able to multiply any two arbitrary numbers.
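To make the training process concrete, here is a minimal sketch in Python with NumPy (my choice for illustration; the article does not specify any particular tools): a tiny network with one hidden layer learns the XOR function by backpropagation from a handful of input/output examples. Real deep networks stack many more layers and train on far larger datasets.

```python
import numpy as np

# A minimal sketch, not a description of the author's actual models:
# a two-layer neural network trained by backpropagation to learn XOR
# from a few examples of inputs and desired outputs.

rng = np.random.default_rng(0)

# Training examples: inputs and desired outputs (XOR).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Randomly initialised weights for two stacked simple transformations.
W1 = rng.normal(size=(2, 8))
b1 = np.zeros((1, 8))
W2 = rng.normal(size=(8, 1))
b2 = np.zeros((1, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 1.0
for step in range(5000):
    # Forward pass: input -> hidden layer -> output.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward pass: gradients of the squared error w.r.t. each weight.
    err = out - y
    d_out = err * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)

    # Update the internal transformations to better match the examples.
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(np.round(out, 2))  # approaches [0, 1, 1, 0] as training proceeds
```

With only four training examples and eight hidden units, a network like this can easily memorize its data rather than learn the underlying rule, which is exactly the overfitting concern described above and why larger networks demand much more data.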

One of the easiest solutions is to simply use more data, making it difficult for the network to perfectly reproduce the training data and forcing it to learn the relationship between the input and output rather than the data itself. In our analogy above, this would be equivalent to presenting the child with, say, a thousand multiplication tables, at which point it is simpler for him to learn how to multiply any two numbers than to memorize them all. Until quite recently, we did not have many sources of large amounts of data for real-world problems. With the ability to crawl the internet for data and to label it cheaply via crowd-sourcing, it has become possible to build truly massive datasets, which has gone a long way towards solving the overfitting problem for deep neural networks.

My research involves taking the lessons learned from deep learning in visual and speech recognition and seeing how they can be adapted to other sources of data, such as sensor data and regular text. The Data Analytics team has had some masterstrokes of creativity in applying these lessons, including creating image-like structures from non-image data and applying neural networks intended for images to them. I work on creating and modifying other neural network architectures for such data analysis. I also examine scalability issues in deep learning, such as how incorrect training data affects performance and what can be done to mitigate it. Finally, I look into dynamic neural network architectures that are increasingly showing promise in areas like sequence labelling and creating captions for images.
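As a hedged sketch of the ‘image-like structures from non-image data’ idea, the PyTorch snippet below (my choice of framework; the team’s actual architectures and datasets are not described here) arranges a window of multi-channel sensor readings as a 2D grid of channels versus time steps and passes it through a small convolutional network of the kind normally used for images. All shapes and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Illustrative assumptions: 9 sensor channels, 128 readings per window,
# 6 output classes (e.g. activity labels).
n_sensors, n_timesteps, n_classes = 9, 128, 6

# Fake batch of sensor windows, shaped (batch, 1, sensors, time steps),
# i.e. each window treated as a single-channel grey-scale "image".
x = torch.randn(32, 1, n_sensors, n_timesteps)

model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=(3, 5), padding=(1, 2)),  # convolve across sensors and time
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=(1, 2)),                      # pool along the time axis only
    nn.Conv2d(16, 32, kernel_size=(3, 5), padding=(1, 2)),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d((1, 1)),                          # collapse each feature map to one value
    nn.Flatten(),
    nn.Linear(32, n_classes),                              # class scores for each window
)

logits = model(x)
print(logits.shape)  # torch.Size([32, 6])
```

The design choice being illustrated is simply that, once the sensor window is laid out as a grid, the same convolutional machinery that works so well on images can be reused unchanged.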

Deep learning is still a nascent field, and there are lots of competing ideas and frameworks to experiment with. We have not come close to the limits of the technology, yet we already have incredible results from industry giants like Google, Facebook and Microsoft. It is an exciting time to be riding this wave of big data, and I am looking forward to seeing how far it takes us!

Dr. Pravin Kakar is a research scientist in the Data Analytics Department at the Institute for Infocomm Research (I²R). He obtained his PhD from the School of Computer Engineering at Nanyang Technological University, Singapore, where he researched techniques for passive image forensics. Prior to joining I²R, he was the lead algorithm developer at Graymatics Singapore, a computer vision start-up. His research interests lie in machine learning, data analytics and computer vision.