Phonetic Search in Audio and Video Recordings

by Ioannis Dologlou and Stelios Bakamidis (RC ATHENA)

A new system uses advanced speech recognition technology to easily and efficiently retrieve information from audio/video recordings just by using keywords.

The massive amount of information produced by today’s media (radio, television, etc.) and telecommunications (fixed and mobile telephony, satellite communications, etc.) necessitates automatic management strategies. Useful information can be retrieved from audio/video files by keyword, in the same way as from text files: the system automatically searches the files using a state-of-the-art speech recognition engine. This enables valuable information in broadcast news or telephone conversations to be retrieved easily, quickly and accurately.

This system was developed by Voice-In SA, a spin-off company of the Greek Research Centre RC ATHENA [L1]. The research started in 2008 and the first system was delivered two years later. Research is ongoing to improve the performance and speed of the algorithms involved.

The proposed system implements large-vocabulary, continuous, speaker-independent speech recognition. It converts the statistical models of the speech recognition system and adapts them so that the information provided by the keywords is handled more flexibly and efficiently. In addition, the new approach includes a scoring algorithm that automatically detects the words or phrases closest to the user’s query.
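The article does not describe the scoring algorithm itself. As a rough illustration of how a "closest to the query" score could work, the following minimal sketch compares phoneme sequences with a normalised Levenshtein (edit) distance. The choice of edit distance, the phoneme symbols and the function names are all assumptions made for this example, not the authors' method.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences (lists of symbols)."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(query, candidate):
    """Score in [0, 1]; 1.0 means the two sequences are identical."""
    longest = max(len(query), len(candidate))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(query, candidate) / longest


def closest(query, candidates):
    """Return the (label, phonemes) pair whose phonemes best match the query."""
    return max(candidates, key=lambda item: similarity(query, item[1]))


if __name__ == "__main__":
    # Hypothetical phoneme transcriptions, for illustration only.
    query = ["m", "ey", "d", "ey"]
    candidates = [
        ("mayday", ["m", "ey", "d", "ey"]),
        ("maybe", ["m", "ey", "b", "iy"]),
        ("today", ["t", "ah", "d", "ey"]),
    ]
    print(closest(query, candidates)[0])
```

A score of this kind needs no word lexicon: any query that can be written as a phoneme sequence can be compared against recognised speech.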

The system consists of two subsystems. The first pre-processes each new item of archive material (audio recordings or video files) and creates a file containing specific information about it. The second is the core of the system: it implements the new search and retrieval algorithms, exploiting the previously stored information.
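The split between a one-off pre-processing step and a later search step can be sketched as follows. The article does not say what the stored file contains; this example assumes, purely for illustration, that it holds a time-stamped phoneme sequence serialised as JSON. The file layout and the names `build_index` and `load_index` are hypothetical.

```python
import json
from pathlib import Path


def build_index(recording_name, timed_phonemes, index_dir):
    """Subsystem 1 (sketch): store per-recording information once, at archiving time.

    timed_phonemes is a list of (symbol, start_seconds, end_seconds) tuples,
    assumed here to come from a phoneme recogniser that is not shown.
    """
    index_dir = Path(index_dir)
    index_dir.mkdir(parents=True, exist_ok=True)
    index_path = index_dir / (Path(recording_name).stem + ".index.json")
    payload = {
        "recording": recording_name,
        "phonemes": [
            {"symbol": s, "start": t0, "end": t1} for s, t0, t1 in timed_phonemes
        ],
    }
    index_path.write_text(json.dumps(payload), encoding="utf-8")
    return index_path


def load_index(index_path):
    """Subsystem 2 entry point (sketch): read the stored information back.

    The search step works from this file and never reprocesses the audio.
    """
    payload = json.loads(Path(index_path).read_text(encoding="utf-8"))
    phonemes = [(p["symbol"], p["start"], p["end"]) for p in payload["phonemes"]]
    return payload["recording"], phonemes
```

Keeping the expensive recognition pass in the first step is what allows each later keyword query to run quickly against the stored file.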

The input to the system is a set of audio or video files together with the keywords that the user wants to locate in them. After very fast processing of the input data, the system reports whether the keywords are present in those files. If the search is positive, the audio or video segments that were found are extracted and supplied to the user, together with their exact timing information and confidence level.
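To make the shape of the output concrete, the sketch below slides a keyword's phoneme sequence over a time-stamped phoneme stream and reports each match with its start time, end time and a confidence value. It is an illustration under stated assumptions: the `Hit` record, the fraction-of-matching-phonemes confidence and the 0.75 threshold are invented for this example and are not taken from the system described here.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    keyword: str
    start: float       # seconds from the beginning of the recording
    end: float         # seconds from the beginning of the recording
    confidence: float  # fraction of query phonemes matched, in [0, 1]


def find_keyword(keyword, query_phonemes, timed_phonemes, min_confidence=0.75):
    """Return every window of the stream that matches the query well enough.

    timed_phonemes is a list of (symbol, start_seconds, end_seconds) tuples.
    Overlapping hits are not merged in this simplified version.
    """
    n = len(query_phonemes)
    hits = []
    if n == 0:
        return hits
    for i in range(len(timed_phonemes) - n + 1):
        window = timed_phonemes[i:i + n]
        matches = sum(
            1 for q, (symbol, _, _) in zip(query_phonemes, window) if q == symbol
        )
        confidence = matches / n
        if confidence >= min_confidence:
            hits.append(Hit(keyword, window[0][1], window[-1][2], confidence))
    return hits


if __name__ == "__main__":
    # Hypothetical recogniser output for a short recording.
    stream = [
        ("h", 0.0, 0.1), ("eh", 0.1, 0.2), ("l", 0.2, 0.3), ("p", 0.3, 0.4),
        ("m", 1.0, 1.1), ("ey", 1.1, 1.3), ("d", 1.3, 1.4), ("ey", 1.4, 1.6),
    ]
    for hit in find_keyword("mayday", ["m", "ey", "d", "ey"], stream):
        print(hit)
```

An empty result corresponds to the "keyword not present" answer; otherwise the start and end times identify the segment to play back to the user.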

The major advantage of the new approach over existing solutions is that it operates on any subject without a prior learning phase. Consequently, it requires no overhead for customisation, installation or maintenance. More precisely, the system does not use any lexicon or database that would become outdated and need updating. Furthermore, it can cope with all kinds of words and terminology regardless of their frequency of use. The system has a user-friendly interface with a comprehensive menu suitable even for non-professionals. It can process the following formats: wav, wmv, wma, mp3, mpg, asf and avi.

The new system is useful for several applications in various domains and activities including:

Automated registration, classification, indexing and efficient, fast and inexpensive recovery of information from audiovisual media.
Easy access to information produced by state institutions (parliament, government departments, municipalities, communities, etc.).
Information retrieval from audiovisual recordings of meetings, general board meetings, corporate bodies, etc.
Forensic applications, i.e., helping to locate people suspected of being involved in illegal activities through automatic monitoring of telephone calls, video recordings, etc.
Automatic monitoring of air and maritime frequencies in real time to detect incidents, such as mayday, for a prompt response.

Future activities
Future plans focus on improving the performance of the algorithms in terms of both accuracy and speed. Improving accuracy involves creating a better speech recognition system with a large variety of acoustic models for many different environments (noisy conditions, the cocktail party effect, etc.). Faster search algorithms are also needed to handle the stored information created by the first subsystem.

Link:
[L1] https://kwz.me/hmy

Please contact:
Ioannis Dologlou
RC ATHENA, Greece, +302106875306
