by François Bremond, Vania Bogorny, Luis Patino, Serhan Cosar, Guido Pusiol and Giuseppe Donatiello

Activity discovery within transportation systems (for example, subways and roads) and home-care monitoring, based on cognitive vision and data-mining technologies, are the core activities of a project at Inria.

It is well known that video cameras provide one of the richest and most promising sources of information about people’s movements. New technologies that combine video understanding and data mining can analyse people’s behaviour efficiently by extracting their trajectories and identifying the main movement flows within a scene equipped with video cameras. For instance, we are designing an activity recognition framework that can monitor people’s behaviour in an unsupervised manner. For each observed person, the framework extracts a set of space-time trajectory features describing his/her global position within the monitored scene and the motion of his/her body parts. Based on trajectory clustering, this information is gathered into a new feature that we call a Perceptual Feature Chunk (PFC). The set of PFCs is used to automatically learn the particular regions of a given scene where important activities occur; we call this set of learned scene regions the topology. Using a k-means algorithm, a clustering procedure over the PFCs constructs three topology layers, organized from coarsest to finest. Using the topologies and PFCs, we are able to break the video into a set of small events, or primitive events (PE), each of which has a semantic meaning. The sequences of PEs and the three layers of topology are then used to construct a hierarchical model with three granularity levels of activity.
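The pipeline above can be sketched in a few lines of code. This is a minimal illustration, not the authors’ implementation: the PFC features are reduced to synthetic 2D scene positions, the layer sizes (2, 4, 8) are invented for the example, and the `kmeans` helper is a plain textbook k-means rather than the project’s actual clustering procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for Perceptual Feature Chunks (PFCs): each row is a
# 2D scene position summarising a short trajectory segment. Real PFCs
# also carry body-motion features; positions suffice to show the idea.
pfcs = np.vstack([rng.normal(loc=c, scale=0.3, size=(60, 2))
                  for c in [(1, 1), (4, 1), (1, 4), (4, 4)]])

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means: returns (centres, labels)."""
    r = np.random.default_rng(seed)
    centres = points[r.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # Assign each PFC to its nearest centre.
        d = np.linalg.norm(points[:, None] - centres[None], axis=2)
        labels = d.argmin(axis=1)
        # Move each centre to the mean of its assigned PFCs.
        for j in range(k):
            if (labels == j).any():
                centres[j] = points[labels == j].mean(axis=0)
    return centres, labels

# Three topology layers, from coarsest to finest: each layer's cluster
# centres are the learned scene regions at that granularity.
topology = {k: kmeans(pfcs, k) for k in (2, 4, 8)}

# A primitive event (PE) can then be read off as a transition between
# regions of a layer: the labels of two consecutive PFCs.
_, labels = topology[4]
primitive_events = list(zip(labels[:-1], labels[1:]))
```

Sequences of such primitive events, observed at the three granularities, are the raw material from which the hierarchical activity model is built.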

Figure 1: Extracted trajectories of a person at home during the preparation and eating of a meal (left) and of people in a metro station buying tickets (right).

The proposed approach has been evaluated in collaboration with Nice Hospital within the FP7 European project DEM@CARE, collecting datasets to monitor patients suffering from dementia. These datasets show older adults performing everyday activities in a hospital room equipped with a monocular camera and an RGB-D camera (resolution: 640x480 pixels). The activities considered include “watching TV”, “preparing tea”, “answering the phone”, “reading a newspaper/magazine”, “watering a plant”, “organizing the prescribed drugs”, “writing a check at the desk” and “checking bus routes on a map”. The monocular-camera dataset consists of 41 videos and the RGB-D dataset contains 27 videos; for each person, the video lasts approximately 15 minutes. For the monocular camera, person detection is performed using an extension of the Gaussian Mixture Model algorithm for background subtraction. For the RGB-D camera, we used a person detection algorithm that detects people’s heads and shoulders. Trajectories of people in the scene are obtained using a multi-feature tracking algorithm that combines features such as 2D size, 3D displacement, colour histogram, dominant colour and covariance descriptors.
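To give a flavour of the background-subtraction step, the sketch below maintains a single running Gaussian per pixel and flags pixels that deviate from it. This is a deliberately simplified illustration: the project uses an extension of the full Gaussian *Mixture* Model (several Gaussians per pixel), and the parameter values here (`alpha`, `k`, the variance floor) are invented for the example.

```python
import numpy as np

class RunningGaussianBG:
    """Per-pixel background model with a single running Gaussian.

    Simplified stand-in for GMM-based background subtraction: a pixel is
    foreground when it lies more than k standard deviations from the
    learned background mean; background statistics are updated only where
    the pixel matched the model.
    """

    def __init__(self, alpha=0.05, k=2.5, var_floor=4.0):
        self.alpha = alpha          # learning rate for background updates
        self.k = k                  # foreground threshold, in std. devs.
        self.var_floor = var_floor  # keeps the variance from collapsing
        self.mean = None

    def apply(self, frame):
        frame = frame.astype(float)
        if self.mean is None:       # first frame initialises the model
            self.mean = frame.copy()
            self.var = np.full(frame.shape, 15.0)
            return np.zeros(frame.shape, dtype=bool)
        dist = np.abs(frame - self.mean)
        fg = dist > self.k * np.sqrt(self.var)   # foreground mask
        bg = ~fg                                 # update matched pixels only
        self.mean[bg] += self.alpha * (frame - self.mean)[bg]
        self.var[bg] += self.alpha * (dist[bg] ** 2 - self.var[bg])
        self.var = np.maximum(self.var, self.var_floor)
        return fg

# Static 640x480 background, then a bright blob enters the scene.
model = RunningGaussianBG()
static = np.full((480, 640), 50.0)
for _ in range(30):                 # let the model settle on the background
    model.apply(static)
frame = static.copy()
frame[100:120, 200:220] = 200.0     # a "person" of 20x20 pixels
mask = model.apply(frame)           # True where motion is detected
```

The connected foreground regions produced by such a mask are what the person detector and multi-feature tracker then turn into trajectories.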

Experiments on both datasets show that the framework achieves a high rate of true positives and a low rate of false negatives. In total, 99% of the activities performed in real life are recognized by the framework [1]. Furthermore, the duration of the recognized activities matches the ground-truth activities with more than 80% accuracy, which means that the system can not only count the number of activity instances but also detect their durations fairly accurately. Although a few activities were missed owing to a failure to detect finer motions, the experimental results show that this framework can be used to automatically discover, learn and recognize Activities of Daily Living (ADLs). In addition, the framework is useful in medical applications for supporting the early diagnosis of Alzheimer’s disease or dementia in older adults: it can successfully distinguish people suffering from Alzheimer’s disease from those with Mild Cognitive Impairment and from normal controls.

This framework can also be used in many other fields, such as the video surveillance of metros and roads. We have applied a variant of it in two scenarios within the FP7 European project VANAHEIM. The first scenario concerns the monitoring of activities in different locations (e.g., the entrance concourse) of the Paris and Turin metro stations. The results show which zones have the most intense activity (visualized as a heat map). In the observed scenes, rare or uncommon behaviours such as “jumping over the barrier”, “fainting” and “loitering”, as well as frequent behaviours such as “buying tickets”, were identified. In the second scenario, a road lane reserved for buses was monitored. Again, we were able to learn the topology of the scene and to distinguish normal activities (i.e., the passage of a bus through the zone) from abnormal ones (i.e., the passage of other vehicles into the reserved lane). When the results were compared against the ground truth, a high level of recall and an acceptable degree of precision were obtained [2, 3].
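At its simplest, such a heat map is a 2D histogram of observed trajectory positions over the scene. The sketch below illustrates this with synthetic data; the scene size, bin count and cluster placement are all invented for the example and do not come from the VANAHEIM deployments.

```python
import numpy as np

# Synthetic trajectory points (x, y) in a 10 m x 10 m scene; a dense
# cluster mimics an intensely used zone such as a bank of ticket machines.
rng = np.random.default_rng(1)
points = np.vstack([
    rng.normal(loc=(2.25, 3.25), scale=0.4, size=(400, 2)),  # busy zone
    rng.uniform(0, 10, size=(100, 2)),                       # through traffic
])

# The heat map is a 2D histogram of observed trajectory positions.
heat, x_edges, y_edges = np.histogram2d(
    points[:, 0], points[:, 1], bins=20, range=[[0, 10], [0, 10]])

# The hottest cell marks the zone with the most intense activity.
hottest = np.unravel_index(heat.argmax(), heat.shape)
```

Cells with persistently low or zero counts are equally informative: a trajectory passing through them (a vehicle in the reserved bus lane, for instance) is a candidate anomaly.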

A variety of domains can benefit from smart video analysis. In the near future, more than two billion people will be over 65 years old, and video analysis has the potential to help ageing adults in their daily lives through smart home environments, and to provide doctors with activity observations that can be used to detect possible anomalies for disease prevention. Movement analysis in crowded areas, such as metro stations, can detect anomalous behaviour and raise alerts, identifying anything from simple issues, such as a broken ticket machine, to more complex events, such as a robbery or a terrorist act. In supermarkets, the analysis of customer movements can provide information on how to enhance the shopping experience. However, many challenges remain, such as maintaining the accuracy of activity recognition within huge amounts of noisy metadata.

[1] G. Pusiol, F. Bremond, M. Thonnat: “Unsupervised Discovery, Modeling and Analysis of Long Term Activities”, in Proc. of ICVS 2011, Sophia Antipolis, France, 2011.
[2] L. Patino, F. Bremond, M. Thonnat: “Online Learning of Activities from Video”, in Proc. of AVSS 2012, Beijing, China, 2012.
[3] L. Patino, H. Benhadda, F. Bremond: “Data Mining in a Video Database”, in “Intelligent Video Surveillance Systems”, Wiley Online Library, 2013, DOI: 10.1002/9781118577851.ch14.

Please contact:
François Bremond
Inria, France
