We don’t solely rely on words to communicate. Body language and tone also contribute to how thoughts and feelings are expressed, and these vary from one situation to another.
With so many intertwining factors, it’s not surprising that computers struggle to decipher our emotions in a conversation. Communication platforms powered by artificial intelligence (AI), such as customer service chatbots, need to pick up on our feelings to create a realistic and rewarding experience.
However, we’re not likely to pour our hearts out to chatbots: conversations tend to consist of short, concise responses without words or phrases that carry explicit emotional cues. The same is often true of human conversation. “In these cases, AI models must rely on various contextual clues to recognise emotions even when they are not explicitly stated,” noted Donovan Ong, a senior research engineer at A*STAR’s Institute for Infocomm Research (I2R).
The same series of words can convey very different feelings, said Ong, citing the example sentence “We’ve got to say it to him.” In one context, the statement could convey frustration from someone who disagrees with what was just said; in another, it might carry a sympathetic, compassionate undertone. With these nuances in mind, Ong and his colleagues from the Natural Language Processing (NLP) Group, led by Jian Su, the study’s principal investigator, created a new AI model that picks up on textual discourse cues in conversations to predict human emotions.
Several challenges stood in the way. Existing conversational datasets for training machine learning models lack discourse role labels, that is, identifiers of whether an utterance poses a question or responds to a previous comment. Furthermore, discourse roles are dynamic and context-dependent: the model would need to capture dependencies such as an answer role typically following a question role.
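To make the idea of discourse roles concrete, here is a small, purely hypothetical snippet of an annotated conversation; the utterances, role tags and emotion tags are invented for illustration and are not drawn from the team’s datasets or label scheme.

```python
# Hypothetical illustration only: a short conversation annotated with
# discourse roles and emotions. All labels and utterances are invented.
conversation = [
    {"speaker": "A", "text": "Why hasn't he replied yet?",       "role": "question",  "emotion": "anxious"},
    {"speaker": "B", "text": "He's probably still in a meeting.", "role": "answer",    "emotion": "neutral"},
    {"speaker": "A", "text": "We've got to say it to him.",       "role": "statement", "emotion": "frustrated"},
]

# A discourse-aware model can exploit dependencies between consecutive
# roles, e.g. an "answer" typically follows a "question".
for prev, curr in zip(conversation, conversation[1:]):
    print(f"{prev['role']} -> {curr['role']}")
```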
The team took an integrated-model approach to build these variables into existing AI models. To label utterances with discourse roles, they deployed a variational autoencoder (VAE) that reads consecutive lines of a conversation and assigns roles. To account for the sequential nature of discourse roles, the VAE’s output was fed into a recurrent neural network (RNN), a model designed to capture temporal relationships.
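As a rough illustration of this kind of pipeline, the sketch below pairs a toy VAE with a GRU-based recurrent network in PyTorch. It is a minimal sketch under stated assumptions: the input is assumed to be pre-computed utterance embeddings, and the layer sizes, latent dimension and number of emotion classes are placeholders rather than the team’s published architecture.

```python
import torch
import torch.nn as nn

class RoleVAE(nn.Module):
    """Toy VAE that encodes an utterance embedding into a latent
    discourse-role representation (illustrative, not the published model)."""
    def __init__(self, utt_dim=128, latent_dim=16):
        super().__init__()
        self.enc = nn.Linear(utt_dim, 64)
        self.mu = nn.Linear(64, latent_dim)
        self.logvar = nn.Linear(64, latent_dim)
        self.dec = nn.Linear(latent_dim, utt_dim)

    def forward(self, utt):
        h = torch.relu(self.enc(utt))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterisation trick: sample a latent role vector per utterance.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return z, self.dec(z), mu, logvar

class EmotionRNN(nn.Module):
    """GRU over the sequence of latent role representations, followed by
    a per-utterance emotion classifier."""
    def __init__(self, latent_dim=16, hidden=32, n_emotions=6):
        super().__init__()
        self.rnn = nn.GRU(latent_dim, hidden, batch_first=True)
        self.clf = nn.Linear(hidden, n_emotions)

    def forward(self, z_seq):                 # z_seq: (batch, turns, latent_dim)
        out, _ = self.rnn(z_seq)              # tracks how roles evolve across turns
        return self.clf(out)                  # emotion logits for each utterance

# Toy forward pass: one conversation of three utterance embeddings.
utts = torch.randn(1, 3, 128)
vae, rnn = RoleVAE(), EmotionRNN()
z, recon, mu, logvar = vae(utts)
logits = rnn(z)
print(logits.shape)   # torch.Size([1, 3, 6])
```

The design choice mirrored here is simply that role inference (the VAE) and sequence modelling (the RNN) are separate stages, with the second consuming the first’s output.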
The results of a validation test exceeded the researchers’ expectations. “Our model achieved the best performance across three public datasets for emotion recognition in conversations,” Ong commented, adding that this is a green light for future industrial applications. The team’s new model will likely be applied to other language tasks such as summarisation, translation, and dialogue generation.
Ong and colleagues plan to build on the model to recognise a broader range of emotional cues. To achieve this, they will incorporate more detailed linguistic, audio and visual information to create a more complete training dataset.
The A*STAR-affiliated researchers contributing to this research are from the Institute for Infocomm Research (I2R).