by Karla Markert (Fraunhofer AISEC)
Automatic speech recognition systems are designed to transcribe audio data to text. The increasing use of such technologies makes them an attractive target for cyberattacks, for example via “adversarial examples”. This article provides a short introduction to adversarial examples in speech recognition and some background on current challenges in research.
As their name suggests, neural networks are inspired by the human brain: just as children learn the abstract notion of “an apple” from a collection of objects referred to as “apples”, neural networks automatically “learn” underlying structures and patterns from data. In classification tasks, this learning process consists of feeding a network labelled training data, with the aim of finding a function that describes the relation between data and label. For example, a speech recognition system presented with audio files and transcriptions can learn how to transcribe new, previously unprocessed speech data. In recent decades, research has paved the way to teaching neural networks how to recognise faces, classify street signs, and produce texts. However, just as humans can be fooled by optical or acoustic illusions [L0], neural networks can also be tricked into misclassifying input data, even when a human would classify that data correctly with ease.
Here we discuss two popular network architectures for speech recognition, explain how they mimic human speech perception, and how they can be tricked by data manipulations inaudible to a human listener. We discuss why designing neural networks that are robust to such aberrations is hard, and which research questions might help us make improvements.
Speech recognition can be based on different statistical models. Nowadays neural networks are a very common approach. Automatic speech recognition (ASR) models can be realised end-to-end by one single neural network or in a hybrid fashion. In the latter case, deep neural networks are combined with hidden Markov models.
When training an end-to-end model like DeepSpeech [L1] or Lingvo [L2], the system is provided only with the audio data and its final transcription. Internally, the audio files are often pre-processed into “Mel-frequency cepstral coefficients”. In this case, the audio signal is cut into pieces, called frames, which are in turn decomposed into frequency bins approximately summing up to the original input. This pre-processing step reflects how the inner ear transmits sounds to the brain. A recurrent neural network then transcribes every frame to a character without any additional information. Using a specific loss function, sequences of frame-wise transcriptions are turned into words. This process mimics our brain understanding words from a series of sounds.
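The two steps described above, cutting audio into frames with frequency bins and turning frame-wise characters into a word, can be illustrated with a minimal sketch. It makes several assumptions that are not taken from the systems cited here: the spectrum is a naive discrete Fourier transform that stops short of the mel filterbank, logarithm and cosine transform of real Mel-frequency cepstral coefficients, and the collapsing rule (merge repeated labels, then drop a blank symbol) is the one associated with connectionist temporal classification, which this article does not name explicitly. All function names and the frame-wise output string are hypothetical.

```python
import cmath
import math


def frame_signal(samples, frame_len, hop):
    """Cut a 1-D signal into overlapping frames of frame_len samples."""
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop)]


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for the non-negative frequency bins of one frame."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]


def greedy_collapse(frame_labels, blank="_"):
    """Merge repeated frame-wise labels, then drop the blank symbol."""
    out = []
    previous = None
    for label in frame_labels:
        if label != previous and label != blank:
            out.append(label)
        previous = label
    return "".join(out)


# A pure tone with 4 cycles per 16-sample frame peaks in frequency bin 4.
tone = [math.sin(2 * math.pi * 4 * t / 16) for t in range(64)]
frames = frame_signal(tone, frame_len=16, hop=8)
spectrum = magnitude_spectrum(frames[0])
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)

# Hypothetical frame-wise network output for the word "good".
decoded = greedy_collapse(list("gg_oo_oodd_"))
```

The blank symbol is what lets the double “o” in “good” survive the merging of repeated labels: without a blank between the two runs of “o”, they would collapse into a single character.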
On the other hand, hybrid models like Kaldi [L3] consist of different submodels that are trained separately and require additional expert knowledge. The audio data is pre-processed in a similar way as described above. The acoustic model consists of a neural network that turns these audio features into phonemes (provided by an expert). Subsequently, the phonemes are turned into words by a language model that also accounts for language-specific information such as word distributions and probabilities.
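The division of labour in a hybrid system can be sketched with a toy example. It assumes that the acoustic model has already produced phoneme groups, and it uses a hand-written pronunciation lexicon and a handful of bigram probabilities to choose between words that sound alike. The lexicon, the phoneme symbols, the probabilities and the greedy word-by-word search are all illustrative assumptions; this is not how Kaldi decodes, and real systems search over many competing hypotheses at once.

```python
# Hypothetical pronunciation lexicon: phoneme sequence -> candidate words.
LEXICON = {
    ("G", "UH", "D"): ["good"],
    ("M", "AO", "R", "N", "IH", "NG"): ["mourning", "morning"],
}

# Hypothetical bigram probabilities P(word | previous word).
BIGRAM = {
    ("<s>", "good"): 0.6,
    ("good", "morning"): 0.5,
    ("good", "mourning"): 0.01,
}


def decode(phoneme_groups, floor=1e-4):
    """For each phoneme group, pick the candidate word the bigram model prefers."""
    words = []
    previous = "<s>"  # sentence-start marker
    for group in phoneme_groups:
        candidates = LEXICON[tuple(group)]
        # Unseen word pairs fall back to a small floor probability.
        best = max(candidates, key=lambda w: BIGRAM.get((previous, w), floor))
        words.append(best)
        previous = best
    return words


transcript = decode([["G", "UH", "D"], ["M", "AO", "R", "N", "IH", "NG"]])
```

In this sketch the acoustic evidence alone cannot separate “morning” from “mourning”; only the language-specific word statistics do.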
Despite all their differences in the mathematical setup, both approaches are susceptible to adversarial examples. An adversarial example is an audio file manipulated in such a way that it can fool the recognition system, with the manipulation being inaudible to a human listener, as depicted in Figure 1. It can thus happen that, for a given audio file, a human hears “good morning” while the ASR model understands “delete all data”. Clearly, this is particularly threatening in sensitive environments like connected industries or smart homes. In these settings, the voice assistant can be misused to control all connected devices without the owner’s awareness. Current techniques enable hackers to craft nearly imperceptible adversarial examples, some of which remain effective even when played over the air.
Figure 1: Adversarial examples can fool speech recognition systems while being imperceptible to humans. This enables far-reaching attacks on all connected devices.
Adversarial attacks are constructed per model and can be designed in a white-box setting, where the model is known to the attacker, as well as in a black-box setting, where the attacker can only observe the model’s output (the transcription, in the case of ASR systems). There are different explanations for why these attacks are possible and why one can manipulate input data to be classified as a chosen target: the neural network over- or underfits the data, or it just learns data attributes that are imperceptible to humans. With respect to images, it has been shown that some adversarial examples are even transferable between different models trained to perform the same task (e.g., object classification). In contrast, adversarial examples in the speech domain exhibit far less transferability.
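To make the white-box setting more concrete, the following sketch applies the idea behind the fast gradient sign method of Goodfellow et al. (listed in the references) to a toy linear classifier. The weights, bias and input features are hypothetical, and the model is far simpler than any ASR system: real audio attacks typically optimise the perturbation iteratively and shape it with psychoacoustic models. The point is only that an attacker who knows the model can change each input feature by a small, bounded amount and still flip the decision.

```python
def sign(v):
    """Return -1, 0 or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)


def score(weights, bias, x):
    """Linear classifier: positive score -> class 1, otherwise class 0."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def fgsm_perturb(weights, x, epsilon, towards_positive):
    """Shift every feature by at most epsilon in the direction that moves the score."""
    direction = 1 if towards_positive else -1
    # For a linear model, the gradient of the score w.r.t. the input is `weights`.
    return [xi + direction * epsilon * sign(w) for w, xi in zip(weights, x)]


weights = [0.5, -1.0, 0.25, 2.0]   # hypothetical trained parameters
bias = -0.1
x = [0.2, 0.3, 0.1, 0.05]          # hypothetical input features

clean_score = score(weights, bias, x)                      # negative: class 0
x_adv = fgsm_perturb(weights, x, epsilon=0.1, towards_positive=True)
adv_score = score(weights, bias, x_adv)                    # positive: class 1
max_change = max(abs(a - b) for a, b in zip(x_adv, x))     # bounded by epsilon
```

In a black-box setting the attacker has no access to `weights` and would have to estimate a useful direction from the model’s outputs alone, which is one reason such attacks are harder to construct.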
Over the years, different mitigation approaches have been proposed. So far, the state-of-the-art method of defending against adversarial attacks is to include adversarial examples in the training data set [L4]. However, this requires a lot of computation, because the cycle of computing adversarial examples and retraining the model has to be repeated again and again. Current research addresses the interpretability of speech recognition systems, the transferability of audio adversarial examples between different models, and the design of detection methods for adversarial examples and of new tools to measure and improve robustness, especially in the language domain. Here, the feature set (audio), the human perception (psychoacoustics), and the learning models (recurrent neural networks) differ from the image domain, on which most research has focussed so far.
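The cost of adversarial training is easiest to see in a loop. The sketch below trains a toy logistic regression and, in every epoch, first recomputes adversarial examples against the current model and then updates the model on clean and perturbed data together. The data set, step size, number of epochs and perturbation budget are hypothetical, and the single-step attack is a simplification; the sketch is meant to show the repeated compute-and-retrain cycle, not a defence for a real speech recogniser.

```python
import math


def sign(v):
    """Return -1, 0 or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)


def predict(w, b, x):
    """Logistic regression: probability that x belongs to class 1."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def fgsm(w, b, x, y, eps):
    """Perturb x by at most eps per feature in the direction that increases the loss."""
    # For the logistic loss, d(loss)/dx = (p - y) * w.
    p = predict(w, b, x)
    return [xi + eps * sign((p - y) * wi) for wi, xi in zip(w, x)]


def adversarial_training(data, eps, epochs=200, lr=0.5):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        # Step 1: recompute adversarial examples against the current model.
        batch = list(data) + [(fgsm(w, b, x, y, eps), y) for x, y in data]
        # Step 2: one gradient step on clean and adversarial data together.
        gw, gb = [0.0, 0.0], 0.0
        for x, y in batch:
            err = predict(w, b, x) - y
            gw = [g + err * xi for g, xi in zip(gw, x)]
            gb += err
        w = [wi - lr * g / len(batch) for wi, g in zip(w, gw)]
        b -= lr * gb / len(batch)
    return w, b


def robust_accuracy(w, b, data, eps):
    """Fraction of points still classified correctly after the attack."""
    hits = sum((predict(w, b, fgsm(w, b, x, y, eps)) > 0.5) == (y == 1)
               for x, y in data)
    return hits / len(data)


# Hypothetical, well-separated two-class data set.
positives = [[1.0, 1.1], [1.2, 0.9], [0.8, 1.0]]
DATA = [(x, 1) for x in positives] + [([-v for v in x], 0) for x in positives]

w, b = adversarial_training(DATA, eps=0.3)
accuracy_under_attack = robust_accuracy(w, b, DATA, eps=0.3)
```

Every epoch runs the attack once per training example before the usual update, so the batch doubles in size and the attack cost is added on top. For a large neural network and an iterative attack, this overhead grows accordingly.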
In a way, acoustic illusions and audio adversarial examples are similar: the perception of the human or the neural network, respectively, is fooled. Interestingly, very rarely can the human and the machine both be fooled at the same time. Rather, even when the ASR system is very accurate in mimicking human understanding, it is still susceptible to manipulations that elude the human ear. Fortunately, however, such attacks still need careful crafting and can only work under suitable conditions. Thus, currently, “good morning” still means “good morning” in most cases.
[L5] Lessons Learned from Evaluating the Robustness of Neural Networks to Adversarial Examples. USENIX Security (invited talk), 2019. https://www.youtube.com/watch?v=ZncTqqkFipE
I. J. Goodfellow, J. Shlens, C. Szegedy: “Explaining and harnessing adversarial examples”, arXiv preprint arXiv:1412.6572, 2014.
H. Abdullah et al.: “SoK: The Faults in our ASRs: An Overview of Attacks against Automatic Speech Recognition and Speaker Identification Systems”, arXiv preprint arXiv:2007.06622, 2020.
G. F. Elsayed et al.: “Adversarial examples that fool both computer vision and time-limited humans”, arXiv preprint arXiv:1802.08195, 2018.
Fraunhofer Institute for Applied and Integrated Security AISEC, Germany