In the 1994 American comedy-drama Forrest Gump, the title character, played by Tom Hanks, sets off on a three-year marathon. For anyone who has watched the movie, Forrest’s motivation is clear: upset that his love interest, Jenny, has left him, he goes for a run one morning and just keeps running.
But ask a machine “Why did Forrest Gump embark on his three-year marathon?” and it will probably be stumped. The context and plotlines of the movie can only be inferred by combining the visual and textual information in the video, which is no easy feat for a machine.
“Current machine learning algorithms do not effectively integrate different types, or modalities, of information,” said Chuan Sheng Foo, a Scientist at A*STAR’s Institute for Infocomm Research (I2R).
To overcome this problem, Foo’s team developed a machine learning framework that processes individual video frames as images, combines that data with the subtitle text, then uses the combined information to answer questions about movie clips. Called the Holistic Multi-modal Memory Network (HMMN), the framework draws on a bank of questions, together with their answers, in the early stage of training the information-processing algorithms.
“The use of answers at the start of the inference process, before the answer prediction stage, helps identify relevant cues in the multi-modal data,” said Foo. This is akin to a student taking a reading comprehension test and being able to focus on the parts of the passage that matter.
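The idea of letting the answers guide where the model looks can be illustrated with a minimal sketch. It assumes that frames, subtitles, the question and each candidate answer have already been embedded as vectors of the same size. The function names (`softmax`, `score_answers`, `predict`), the element-wise sum used to fuse frames with subtitles, and the dot-product scoring are all illustrative assumptions, not the HMMN architecture as built by Foo’s team.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def score_answers(frames, subtitles, question, answers):
    """Score each candidate answer against a fused frame+subtitle memory.

    frames, subtitles: (n_slots, d) arrays, one row per time slot (assumed aligned).
    question: (d,) array. answers: (n_answers, d) array.
    """
    # Hypothetical fusion step: element-wise sum of the two modalities.
    memory = frames + subtitles
    scores = []
    for answer in answers:
        # The candidate answer shapes the query before any prediction is made.
        query = question + answer
        # Attention weights over memory slots, driven by question and answer.
        weights = softmax(memory @ query)
        # Weighted summary of the slots most relevant to this candidate.
        context = weights @ memory
        # How strongly the attended cues support this candidate.
        scores.append(context @ answer)
    return np.array(scores)

def predict(frames, subtitles, question, answers):
    """Return the index of the highest-scoring candidate answer."""
    return int(np.argmax(score_answers(frames, subtitles, question, answers)))
```

Because each candidate answer enters the query before attention is computed, a different part of the clip can be highlighted for each candidate, much like the student rereading the passage with the answer options in hand.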
The HMMN framework was evaluated for accuracy in answering questions from two benchmark video datasets, MovieQA and TVQA, which comprise video clips and subtitles from 140 movies and six popular American TV shows. More than 100,000 questions were used for training, with another 15,000 used for validating and testing the framework.
“HMMN outperformed competing methods on MovieQA datasets and produced more accurate answers upon combination with the state-of-the-art system on TVQA. This indicates that our framework is more effective at leveraging the available information in videos to answer questions,” said Foo.
He added that HMMN could be useful for interactive exploration and querying of complex multi-modal databases. For example, HMMN could help to find related videos about performing maintenance on factory machinery, or respond to queries about broadcast video.
Moving ahead, the team is exploring how contextual information, such as knowledge graphs describing relationships between words and spatial relationships between images, can be incorporated into their model to improve its reasoning over textual and visual semantics.
The A*STAR-affiliated researchers contributing to this research are from the Institute for Infocomm Research (I2R).