In brief

While humans may easily understand the content and context of a movie scene, machines have a hard time inferring meaning accurately from video footage and subtitle text.

© Pixabay

Helping machines get the plot

30 Jan 2020

A*STAR scientists have devised a learning framework to enable machines to integrate visual, auditory and text data.

In the 1994 American comedy-drama Forrest Gump, the titular character played by Tom Hanks decides to undertake a three-year marathon. For someone who has watched the movie, the motivation for Forrest’s impulse is clear: upset that his love interest, Jenny, has left him, he decides to go for a run one morning and just keeps running.

But ask a machine “Why did Forrest Gump embark on his three-year marathon?” and it will probably be stumped. The context and plotlines of the movie can only be inferred by combining visual and text information present in the video—no easy feat for a machine.

“Current machine learning algorithms do not effectively integrate different types, or modalities, of information,” said Chuan Sheng Foo, a Scientist at A*STAR’s Institute for Infocomm Research (I2R).

To overcome this problem, Foo’s team developed a machine learning framework that processes individual frames in videos as images, combines that data with subtitle texts, then uses that collective information to answer questions based on movie clips. Called the Holistic Multi-modal Memory Network (HMMN) framework, the technique involves the use of a bank of questions—and their answers—in the early stage of training the information-processing algorithms.

“The use of answers at the start of the inference process, before the answer prediction stage, helps identify relevant cues in the multi-modal data,” said Foo. This is akin to a student taking a reading comprehension test and being able to focus on the parts of the passage that matter.
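To make the idea concrete, here is a minimal sketch of answer-aware attention over a shared memory of frame and subtitle features. It is not the HMMN implementation: the feature vectors are hand-made toy values, and the function names (`attend`, `score_answers`, `pick_answer`), the way question and answer vectors are summed into a query, and the dot-product scoring are all assumptions chosen for illustration. It only shows how letting each candidate answer shape the query can change which memory slots receive attention before any answer is chosen.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, memory):
    """Weight each memory slot by its similarity to the query.

    Returns the weighted-sum context vector and the attention weights.
    """
    weights = softmax([dot(query, slot) for slot in memory])
    context = [
        sum(w * slot[i] for w, slot in zip(weights, memory))
        for i in range(len(query))
    ]
    return context, weights


def score_answers(question, answers, frame_feats, subtitle_feats):
    """Score each candidate answer against a joint visual + text memory.

    Assumption: the query is simply question + answer, so the candidate
    answer influences which frames and subtitles are attended to.
    """
    memory = frame_feats + subtitle_feats  # one memory for both modalities
    scores = []
    for answer in answers:
        query = [q + a for q, a in zip(question, answer)]
        context, _ = attend(query, memory)
        scores.append(dot(context, answer))
    return scores


def pick_answer(question, answers, frame_feats, subtitle_feats):
    """Return the index of the highest-scoring candidate answer."""
    scores = score_answers(question, answers, frame_feats, subtitle_feats)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy 3-dimensional features (hypothetical values, not real embeddings).
frame_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # two video frames
subtitle_feats = [[0.0, 0.0, 1.0]]                # one subtitle line
question = [0.0, 0.0, 0.5]
answers = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]      # two candidate answers

best = pick_answer(question, answers, frame_feats, subtitle_feats)
print("Chosen answer index:", best)
```

In a real system the vectors would come from learned image and text encoders, and the attention and scoring functions would be trained rather than fixed.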

The HMMN framework was evaluated for accuracy in answering questions from two benchmark video datasets (MovieQA and TVQA) comprising video clips and subtitles from 140 movies and six popular American TV shows. More than 100,000 questions were used for training, with another 15,000 used to validate and test the framework.

“HMMN outperformed competing methods on the MovieQA dataset and produced more accurate answers when combined with the state-of-the-art system on TVQA. This indicates that our framework is more effective at leveraging the available information in videos to answer questions,” said Foo.

He added that HMMN could be useful for interactive exploration and querying of complex multi-modal databases. For example, HMMN could help to find related videos about performing maintenance on factory machinery, or respond to queries about broadcast videos.

Moving ahead, the team is exploring how contextual information, such as knowledge graphs describing relationships between words and spatial relationships between images, can be incorporated into the model to improve its reasoning over textual and visual semantics.

The A*STAR-affiliated researchers contributing to this research are from the Institute for Infocomm Research (I2R).

Wang, A., Luu, AT., Foo, CS., Zhu, H., Tay, Y. et al. Holistic Multi-modal Memory Network for Movie Question Answering. IEEE Transactions on Image Processing (2019)

About the Researcher

Chuan Sheng Foo

Programme Head, Precision Medicine

Institute for Infocomm Research

Chuan Sheng Foo graduated with a PhD degree in computer science from Stanford University in 2017. He is currently Programme Head, Precision Medicine, and a Scientist in the Deep Learning and Healthcare departments at the Institute for Infocomm Research (I2R), A*STAR. His research revolves around the development of deep learning algorithms that can learn from less labeled data, inspired by applications in healthcare and medicine where collecting large, well-annotated datasets is often time- and cost-prohibitive due to the need for careful expert labeling.

This article was made for A*STAR Research by Wildtype Media Group