Speech and Gesture Command Recognition to Improve Human-Robot Interaction in Manual Assembly Lines

by Mario Vento, Antonio Greco and Vincenzo Carletti (University of Salerno)

UNISA is working on more natural human-robot interaction and cooperation in manual assembly lines through speech and gesture commands. Where are we with the FELICE project?

FELICE [L1] is an ambitious project, started at the beginning of 2021, which has received funding from the European Union’s Horizon 2020 Research and Innovation programme. The project aims to deliver a modular platform that integrates and harmonises different technologies and methodologies, from robotics to computer vision and ergonomics, in order to increase agility and productivity in manual assembly lines, ensure the safety of factory workers who interact with robots, and improve the workers’ physical and mental well-being.

To pursue these goals, FELICE involves different academic and research institutions (FHOOE, FORTH, ICCS, IFADO, IML, PRO, TUD, UNISA), one of the largest automotive manufacturers in Europe, i.e., Centro Ricerche Fiat (CRF) belonging to Stellantis, three SMEs with significant commercial expertise in engineering and IT (ACCREA, AEGIS, CALTEK), and a legal consultancy service provider (EUN).

The Department of Information and Electrical Engineering and Applied Mathematics (DIEM) at the University of Salerno (UNISA) [L2] has long-standing experience in the fields of artificial intelligence, computer vision, cognitive robotics and autonomous vehicle navigation. DIEM is involved in the project to provide its competencies to improve the interaction between robots and humans in manual assembly lines. Such interaction usually takes place through hardware interfaces like keyboards, buttons, and displays. Even though this is the traditional and commonly adopted way to interact with robots, it is not very natural for humans, who are used to communicating with each other using speech and gestures. This is where the recent outcomes of artificial intelligence and cognitive robotics can make the interaction between workers and robots smoother, by providing the latter with the ability to understand speech and gesture commands and react coherently to them.

In the past two years, several steps have been taken towards realising a more natural human-machine interaction. The first step was the formalisation of the tasks to be performed through artificial intelligence methodologies and the selection of the gestures and speech commands of interest. The recognition of speech commands was first analysed in the most general scenario possible, namely conventional Spoken Language Understanding (SLU), where a speaker can produce complex, non-predefined sentences that may or may not contain a command. This scenario requires an acoustic model that translates speech to text and a linguistic model that extracts the intent of the speaker from the text; the intent is then associated with one of the available commands. Subsequently, a set of short, predefined commands was identified and the freedom of the speaker was limited to short sentences and slight variants of the commands, in order to deal with practical problems such as the limited quantity of speech-command data and the limited computational capabilities of the hardware embedded in the robot. Under these assumptions, the speech-recognition system does not have to extract semantic information from sentences, so it does not require the acoustic model for the translation of speech into text. This allows the use of an end-to-end model in which the audio input is directly associated with one of the defined commands of interest. The system achieved an average accuracy of more than 93% in tests performed in the real working environment at different noise levels.
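The difference between the two speech-recognition designs described above can be illustrated with a minimal Python sketch. It is not the FELICE implementation: the command names, the toy models, and the `min_confidence` rejection threshold are all hypothetical placeholders chosen only to make the control flow visible.

```python
import math
from typing import Callable, Optional, Sequence

# Hypothetical command set; the article does not list the actual commands.
COMMANDS = ("start", "stop", "come_here")


def conventional_slu(audio: bytes,
                     acoustic_model: Callable[[bytes], str],
                     linguistic_model: Callable[[str], Optional[str]]) -> Optional[str]:
    """Two-model pipeline: speech -> text -> intent -> command (or None)."""
    text = acoustic_model(audio)
    intent = linguistic_model(text)
    return intent if intent in COMMANDS else None


def softmax(scores: Sequence[float]) -> list:
    """Numerically stable softmax over raw classifier scores."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def end_to_end(audio: bytes,
               classifier: Callable[[bytes], Sequence[float]],
               min_confidence: float = 0.5) -> Optional[str]:
    """Single model: audio -> one score per command, with no intermediate text.

    The confidence threshold is an assumption added for illustration.
    """
    probs = softmax(classifier(audio))
    best = max(range(len(COMMANDS)), key=lambda i: probs[i])
    return COMMANDS[best] if probs[best] >= min_confidence else None


# Toy stand-ins for trained models (placeholders, not real networks).
def toy_acoustic_model(audio: bytes) -> str:
    return "please stop the robot now"


def toy_linguistic_model(text: str) -> Optional[str]:
    for command in COMMANDS:
        if command.replace("_", " ") in text:
            return command
    return None


def toy_classifier(audio: bytes) -> Sequence[float]:
    return [0.1, 3.0, 0.2]  # one raw score per entry in COMMANDS


if __name__ == "__main__":
    print(conventional_slu(b"", toy_acoustic_model, toy_linguistic_model))
    print(end_to_end(b"", toy_classifier))
```

In the end-to-end variant a single model is evaluated per utterance, which is consistent with the constraint on embedded hardware mentioned above.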

For the gesture-recognition task, an experimental analysis highlighted the difficulty of recognising gestures in real time with state-of-the-art approaches, due to their computational demands [1]. The solution was to exclude dynamic gestures, which require analysis across the duration of an action, and to focus on static gestures, which can be recognised using convolutional neural networks (CNNs). Therefore, once the set of gestures to be recognised had been defined, the gesture-command-recognition system was realised using a two-stage approach: first, the hand is detected in the image; then, the gesture is classified by analysing the pose of the hand. The effectiveness of this solution has also been tested in the wild, where the system achieved an accuracy of 95.4% while classifying among 17 different hand poses.
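The two-stage structure described above (hand detection followed by pose classification on the cropped region) can be sketched as follows. This is only an illustration of the control flow: the detector and classifier here are trivial placeholders standing in for the CNNs, and the gesture labels, image representation, and function names are assumptions rather than details from the project.

```python
from typing import Callable, List, Optional, Tuple

Image = List[List[int]]            # greyscale image as rows of pixel values
Box = Tuple[int, int, int, int]    # x, y, width, height


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image covered by the bounding box."""
    x, y, width, height = box
    return [row[x:x + width] for row in image[y:y + height]]


def recognise_gesture(image: Image,
                      detect_hand: Callable[[Image], Optional[Box]],
                      classify_pose: Callable[[Image], str]) -> Optional[str]:
    """Stage 1: locate the hand. Stage 2: classify the pose of the cropped hand."""
    box = detect_hand(image)
    if box is None:
        return None  # no hand in the frame, so no gesture command
    return classify_pose(crop(image, box))


# Placeholder stages (hypothetical); a real system would use trained CNNs here.
def toy_detect_hand(image: Image) -> Optional[Box]:
    """Bounding box of all non-zero pixels, or None if the image is empty."""
    points = [(x, y) for y, row in enumerate(image)
              for x, value in enumerate(row) if value > 0]
    if not points:
        return None
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs) - min(xs) + 1, max(ys) - min(ys) + 1)


def toy_classify_pose(hand: Image) -> str:
    """Label a crop by its aspect ratio; the labels are purely illustrative."""
    height, width = len(hand), len(hand[0])
    return "open_palm" if height > width else "fist"


if __name__ == "__main__":
    frame = [[0, 0, 0, 0],
             [0, 9, 0, 0],
             [0, 9, 0, 0],
             [0, 9, 0, 0]]
    print(recognise_gesture(frame, toy_detect_hand, toy_classify_pose))
```

Separating the stages means the classifier only sees the hand region, so it can stay small; when no hand is detected, the second stage is skipped entirely.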

The two command-recognition systems have been installed on the device embedded in the robot, and a batch of tests recently performed in operative conditions has confirmed the effectiveness of the proposed approaches. The next steps will be: (i) extending the datasets for both the speech- and gesture-recognition systems by acquiring more data in real-world environments; (ii) exploring other approaches and improving those adopted so far using the data collected; and (iii) optimising the recognition systems for NVIDIA-equipped devices using frameworks like TensorRT.

Figure 1: The partners of the FELICE project in the test environment with the robot.

Links:
[L1] https://www.felice-project.eu
[L2] https://mivia.unisa.it

References:
[1] S. Bini, et al., “Benchmarking deep neural networks for gesture recognition on embedded devices”, 31st IEEE Int. Conf. on Robot and Human Interactive Communication (RO-MAN), 2022, pp. 1285-1290. https://doi.org/10.1109/RO-MAN53752.2022.9900705

Please contact:
Vincenzo Carletti, University of Salerno, Italy

Antonio Greco, University of Salerno, Italy