How AI can Exploit Body Language to Understand Social Behaviour in the Wild

by Hayley Hung (Delft University of Technology)

Researchers at the Socially Perceptive Computing Lab at Delft University of Technology in the Netherlands investigate novel ways to measure our quality of experience in social encounters. Their aim is to understand the difference between good and bad social encounters, from speed-dates, to professional networking events, to long-term simulated space missions. This paves the way for the development of technologies to understand and ultimately improve our social encounters.

Have you ever been anxious about attending a social networking event, unsure whether you would meet the right people or whether the whole thing would be a waste of time? We spend many hours of our lives in face-to-face conversations, yet most of the time we have no control over how well that time is used. The MINGLE project (Modelling Group Dynamics in Complex Conversational Scenes from Non-Verbal Behaviour) has been developing new approaches to automatically analyse conversational interactions in large social gatherings such as professional-networking events. Research has shown that attending such events contributes greatly to career and personal success. However, it can often feel as though too much is left to serendipity, or that how a single conversation goes with a new acquaintance determines whether a future relationship will be fruitful. Once machines can automatically interpret what is happening, they can start to help us navigate our social experiences better. While much progress has been made in developing automated analysis tools for small pre-arranged meetings, scaling up robustly to settings outside of meetings, such as professional-networking events, has remained an open and challenging research problem.

Moving Out of the Lab and into the Wild
One of the problems is that studying social behaviour at scale outside of lab conditions greatly changes the nature of the machine inference problem. In earlier lab-based settings, smart meeting rooms (see Figure 2) were equipped with a microphone per person, providing clean audio for each participant in the meeting, and a camera capturing each person’s face, gaze, and body behaviour (which is feasible because participants are mostly seated). The lighting can be easily controlled, and we know beforehand that there is a single conversation occurring and who the participants are. We also have access to a rich body of social science literature that can help to inform and inspire the design of appropriate machine learning strategies to automatically estimate the role of participants, who is dominant, what someone’s personality is, who is disagreeing with whom, or even how cohesive the team appears.

Figure 1: The MINGLE Midge Smart ID badge (left). Demonstration of how it is worn with a black lanyard around the neck (right). It contains a 9-degree-of-freedom inertial measurement unit, Bluetooth transmitter and receiver for measuring proximity to other sensors, and the possibility of recording high or low frequency audio for assisting in the training of body-language-based models. The design is open source. More details and access can be found at [1] and [L1].

Figure 2: Snapshot taken from the AMI meeting corpus. This shows example footage taken from an instrumented meeting room where more controlled lab experiments can occur.

Scaling up to mingling settings and moving outside of lab conditions means that crowd density can increase, and multiple simultaneous conversations can occur in the same room (see Figure 3). Audio is more likely to contain cocktail party noise, and video is more likely to suffer from high levels of occlusion, with people visually hidden behind other people. Determining who is talking with whom is more challenging because groups can split and merge at will. Finally, in such mingling settings, people are there for their own reasons, so recording sensitive information in private conversations may not be appropriate. Moreover, recording spontaneous facial expressions can feel quite invasive. A solution to this would require an unconventional approach – what if a single wearable sensor capturing your body movements was enough?

Finding Unconventional Solutions Guided by Social Science
Although the task seems impossible, there are some tantalising hints from social science that may enable us to reimagine how machines might perceive the world. Rather than using the human senses (sight, hearing) as the primary inspiration for artificial social intelligence, the MINGLE project exploited findings from social science on the role of body language in all aspects of social communication. The key was to investigate how body movements could be captured, even in environments that would classically be considered extremely challenging for computer vision or speech processing. Fortunately, at conferences or professional-networking events, participants often wear an ID badge on a lanyard around their neck. Imagine if this ID badge were smart and equipped with sensors that could capture your body language. This idea led to the design of a smart ID badge: the open-source Midge Badge [1], [L1] (see Figure 1).

Social Interpretation of Conversations using Wearable Accelerometers
One of the most fundamental indicators of conversational involvement is the pattern of speaking turns among the participants in a conversation. The more frequent the turns, the more engaged the group is; the more equally the turns are spread, the more likely it is that everyone is able to express their opinions. Without access to the audio, how does one estimate whether people are speaking? Fortunately, social scientists have long reported a relationship between speaking and head and hand gesturing. The project took this as inspiration, showing that just a single accelerometer embedded inside the Midge Badge was enough to capture when someone was speaking [1]. Combining this with body language represented by skeletons extracted from body key points led to an improvement in the estimation of speaking [1]. Going one step further, we also built an approach to measure the quality of a conversation based on the joint movement patterns of its participants [2], and to detect different intensities of laughter [3].
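To make the two turn-taking indicators mentioned above concrete, the following is a minimal sketch and not the method used in the MINGLE project. It assumes that a per-person sequence of binary speaking estimates is already available (for example, from some speaking detector), that a turn is simply a run of consecutive "speaking" samples, and that normalised entropy of turn shares is an acceptable measure of how equally turns are spread. The function names `count_turns` and `turn_statistics` are hypothetical.

```python
import math


def count_turns(speaking):
    """Count speaking turns, assumed here to be runs of consecutive True samples."""
    turns = 0
    previous = False
    for sample in speaking:
        if sample and not previous:
            turns += 1  # a new run of speaking starts here
        previous = bool(sample)
    return turns


def turn_statistics(status, duration_s):
    """Summarise turn-taking for one conversation.

    status: dict mapping a person id to a sequence of booleans (speaking or not).
    duration_s: length of the conversation in seconds (must be positive).
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")

    turns = {person: count_turns(seq) for person, seq in status.items()}
    total = sum(turns.values())

    # Turn frequency: total number of turns per minute across the group.
    turns_per_minute = total / (duration_s / 60.0)

    # Turn equality (an assumption of this sketch): entropy of each person's
    # share of the turns, divided by its maximum, so 1.0 means perfectly even.
    n_people = len(turns)
    if total == 0 or n_people < 2:
        equality = 0.0
    else:
        shares = [t / total for t in turns.values() if t > 0]
        entropy = -sum(p * math.log(p) for p in shares)
        equality = entropy / math.log(n_people)

    return {
        "turns": turns,
        "turns_per_minute": turns_per_minute,
        "equality": equality,
    }


# Toy example: three people, eight samples at an assumed rate of one per second.
example = {
    "A": [True, True, False, False, True, False, False, False],
    "B": [False, False, True, True, False, False, True, False],
    "C": [False, False, False, False, False, True, False, True],
}
print(turn_statistics(example, duration_s=8.0))
```

Under these assumptions, a group whose turns are shared evenly scores close to 1.0 on equality, while a conversation dominated by one person scores close to 0.0. Real data would also need a decision on how to treat very short runs, which this sketch counts as full turns.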

Figure 3: Example snapshots of data we used for capturing and training machine learning models to detect laughter from a real-life professional-networking event [3]. Faces have been blurred to protect the identity of the participants.

The Remaining Open Challenges
So far, the speaking behaviours that our systems have been trained to detect have been labelled by observing the videos only. We need to capture data with high-quality audio to train the machine learning systems with higher-quality labels. We are currently missing important information about shorter and more subtle speaking turns, such as backchannels, which signal, for example, that someone is actively listening to a speaker. To truly understand the ebb and flow of conversations, it is crucial to have systems that can accurately capture both short and long speaking turns. Further research will be required to see whether these backchannels can be detected better using multi-modal sensor data, and then whether combining longer and shorter turns, laughter, and coordination behaviours can improve estimates of conversation quality.

The work described in this article was carried out in collaboration with Stephanie Tan, Jose Vargas Quiros, Chirag Raman, Ekin Gedik, Ashraful Islam, Navin Raj Prabhu, Catharine Oertel, and Laura Cabrera Quiros. It was funded by the Netherlands Organization for Scientific Research (NWO) under project number 639.022.606 with associated Aspasia Grant.

Links:
[L1] https://github.com/TUDelft-SPC-Lab/spcl_midge_hardware

References:
[1] C. Raman, et al., “ConfLab: A Data Collection Concept, Dataset, and Benchmark for Machine Analysis of Free-Standing Social Interactions in the Wild”, in Proc. of the Neural Information Processing Conf., 2022.
[2] C. Raman, N. Raj Prabhu, and H. Hung, “Perceived Conversation Quality in Spontaneous Interactions”, arXiv preprint, 2022. https://arxiv.org/abs/2207.05791
[3] J. Vargas-Quiros, et al., “Impact of annotation modality on label quality and model performance in the automatic assessment of laughter in-the-wild”, arXiv preprint, 2022. https://arxiv.org/abs/2211.00794

Please contact:
Hayley Hung, Delft University of Technology, The Netherlands