Endlessly scrolling for what to watch on streaming platforms? With the glut of exciting video content online today, finding the perfect gem can be daunting. To decide what we feel like watching, we often look for short video summaries: a dramatic action-packed trailer, a funny montage, or an emotional cliffhanger that leaves us wanting more.
To grab viewers' attention, most video previews feature carefully selected snippets of standout moments. These summaries are often crafted by human editors to capture the content’s narrative or emotional beats.
Software engineers have built algorithms to automate the video summarisation process. However, many machine learning (ML) models either need large, thoroughly labelled datasets, or struggle to pick video segments from unlabelled data that would interest human viewers.
“It’s hard to define ‘interesting’ elements in a video; a segment that interests one person might bore another,” said Wai Cheong Lew, an A*STAR Postgraduate Scholarship recipient in computer science.
To build an ML model that can better account for human emotional responses when creating video summaries, Lew and colleagues at A*STAR’s Institute for Infocomm Research (I2R), Nanyang Technological University (NTU), and Singapore Management University (SMU) proposed a novel training method. Instead of training models on video datasets manually annotated for ‘interesting’ segments, a process that can be subjective and costly, they hypothesised that a dataset of viewers’ brain signals recorded while watching videos could serve the same purpose.
The team used a publicly available electroencephalography (EEG) dataset containing brain signals non-invasively recorded from volunteers as they watched videos, reflecting their emotional responses to specific scenes. After linking each EEG reading to the video segment that induced it, the researchers fed the paired data to an unsupervised machine learning model for training.
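To make the idea concrete, the sketch below shows one simple way such EEG-video pairs could be assembled from time-stamped recordings. It is illustrative only and is not the authors' code; all names and data structures here are hypothetical.

```python
# Illustrative sketch: aligning EEG windows with the video segments that were
# on screen when they were recorded, so the pairs can stand in for manual
# "interesting" labels during training. All names here are hypothetical.

from dataclasses import dataclass

@dataclass
class TrainingPair:
    video_segment_id: int   # index of the video segment shown to the viewer
    eeg_window: list        # EEG samples (channels x time) recorded during that segment

def build_pairs(eeg_windows, segment_boundaries):
    """Match each EEG window to the video segment playing at its timestamp."""
    pairs = []
    for start_time, eeg_window in eeg_windows:
        for seg_id, (seg_start, seg_end) in enumerate(segment_boundaries):
            if seg_start <= start_time < seg_end:
                pairs.append(TrainingPair(seg_id, eeg_window))
                break
    return pairs
```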
“EEG signals can be an alternative to manually-created labels as another form of human annotation,” explained Lew.
The study was a collaboration with Joo-Hwee Lim, I2R Senior Principal Scientist III; Kai Keng Ang, I2R Senior Principal Scientist I; and colleagues from NTU and SMU.
The diversity of viewer tastes and preferences created some subjective bias in the researchers’ training datasets. “We found it challenging to introduce EEG signals into the reinforcement learning framework, as they tend to be noisy and can bring disturbance into the training process, resulting in ineffective summaries,” said Lew.
To overcome this, the researchers implemented a deep learning model called the EEG Linear Attention Network (ELAN). ELAN draws connections between signals at different timepoints and different brain areas, selectively considering only those consistent across all volunteers.
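The sketch below illustrates the general idea of a linear-attention block applied to EEG features, relating signals across timepoints and damping volunteer-specific noise by averaging across viewers. It is an approximation of the concept described above, assuming hypothetical weights and inputs, and is not the published ELAN architecture.

```python
# Illustrative sketch of linear attention over EEG features.
# x has shape (timepoints, channels); w_q, w_k, w_v are learned projections.
import numpy as np

def linear_attention(x, w_q, w_k, w_v):
    """Attention whose cost grows linearly with the number of timepoints."""
    q = np.maximum(x @ w_q, 0) + 1e-6   # non-negative feature maps
    k = np.maximum(x @ w_k, 0) + 1e-6
    v = x @ w_v
    kv = k.T @ v                         # (d, d) summary of keys and values
    norm = q @ k.sum(axis=0)             # per-timepoint normalisation
    return (q @ kv) / norm[:, None]

def consensus_features(eeg_per_volunteer, w_q, w_k, w_v):
    """Average attention outputs across volunteers (assumed to share the same
    recording length) so that only patterns consistent across viewers remain."""
    outputs = [linear_attention(x, w_q, w_k, w_v) for x in eeg_per_volunteer]
    return np.mean(outputs, axis=0)
```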
Combining ELAN with standard ML models used in video processing, the researchers built the EEG-Video Emotion-based Summarization (EVES) model. By focusing on emotion-evoking scenes, EVES extracts higher-level meaning from videos, producing summaries that correlate more closely with their emotional content. In statistical tests against other published models, the team found that EVES both outperformed traditional unsupervised models and matched the performance of supervised models trained on painstakingly labelled data.
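Once each video segment has an emotion score, assembling a summary can be as simple as keeping the highest-scoring segments within a length budget, as the sketch below shows. The scoring is assumed to come from elsewhere (for example, EEG-informed features); this greedy selection is a stand-in for illustration, not the EVES model itself.

```python
# Illustrative sketch: build a summary from the most emotion-evoking segments.
# Each segment is a dict with "start" and "end" times in seconds.

def summarise(segments, scores, budget_seconds):
    """Greedily keep the highest-scoring segments until the budget is used."""
    ranked = sorted(zip(scores, segments), key=lambda p: p[0], reverse=True)
    summary, used = [], 0.0
    for score, seg in ranked:
        duration = seg["end"] - seg["start"]
        if used + duration <= budget_seconds:
            summary.append(seg)
            used += duration
    # Restore chronological order so the summary plays coherently
    return sorted(summary, key=lambda seg: seg["start"])
```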
The team also tested EVES-generated video summaries on a cohort of viewers. In terms of coherence and emotional content, the audience reported a preference for EVES summaries over those from other state-of-the-art models.
Lew hopes that this and other breakthroughs in automated video summarisation will spur demand for paired EEG-video datasets.
The A*STAR-affiliated researchers contributing to this research are from the Institute for Infocomm Research (I2R).
