Machine Learning Based Audio Synthesis: Blessing and Curse?

by Nicolas Müller (Fraunhofer AISEC)

Machine learning based audio synthesis has made significant progress in recent years. Current techniques make it possible to clone any voice with deceptive authenticity, based on just a few seconds of reference recording. However, is this new technique more of a curse than a blessing? Its use in medicine (restoring the voice of the mute), in grief counselling or in the film and video game industry contrasts with its enormous potential for misuse (deep fakes). How can our society deal with a technology that has the potential to erode trust in (audio) media?

Owing to groundbreaking successes in various disciplines, artificial intelligence (AI) is on everyone's lips, both figuratively and literally: using deep neural networks, researchers have recently developed a system that can reproduce any voice in a deceptively realistic way. From only a few seconds of reference audio, the AI recognises the characteristics of the speaker, reproduces the voice accordingly and can thus put any sentence in that person's mouth.

This is definitely a technical masterpiece with a variety of use cases. For example, it would allow people who have lost their voice due to accident or illness to communicate with a replica of their natural voice via a human-computer interface. Similar scenarios are also imaginable in grief counselling. There are also great opportunities for artists and cultural workers and for the film and video game industry. Finally, this technology is a cornerstone for a new, more human form of artificial intelligence that will accompany us in our daily lives even more closely than current systems.

This technology is already having an impact: scientists have developed a system that can exchange single words in any sentence. Like its well-known image editing namesake, “Photoshop for Voice” allows free editing of audio [L1]. During a demonstration [L2], for example, the sentence “I kissed my wife” was changed to “I kissed Jordan”.

It is also possible to create entire sentences, resulting in a perfect illusion, provided the lip movements are synchronised accordingly. For example, in a well-known deepfake video clip [L3], Boris Johnson supports his political opponent Jeremy Corbyn and even recommends him as Prime Minister of the United Kingdom. These examples were created without malicious intent, but at the same time, they illustrate how easily this technology can be misused.

As a result, it is obvious that in the future we will not be able to trust audio and video material unconditionally. Fake news, i.e., deliberately created false information, is continually improving, making it even easier to deceive us. In the 2016 US presidential election, which Donald Trump won, voters were massively manipulated by fake news [L4]. This influence is likely to increase, and machine learning based audio synthesis may be used to defame political opponents.

Facing this challenge will be a central goal for the coming years. How are we supposed to deal with it? Banning the technology is simply not feasible. Since basic source code [L5] and technical specifications [L6] are already public, such AI systems will be widely available for anyone to use. It will not be a technology dominated by a few.

What remains, then? One way to strengthen the trustworthiness of digital media could be a second AI: one that is an expert in distinguishing between generated and real audio material. Such a system could issue certificates of authenticity for genuine material or warnings for counterfeit material. The Fraunhofer Institute for Applied and Integrated Security AISEC is currently researching AI systems that can, as experts, differentiate between genuine and fake media content. Such systems learn this distinction from a training data set containing a large number of both “real” and “fake” audio examples. In this way, the expert AI learns to detect subtle irregularities in the fakes that are not noticeable to humans but are nonetheless present, and these irregularities allow deep fakes to be detected.
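The idea of learning from labelled “real” and “fake” examples can be illustrated with a toy sketch. It is not the system under research at Fraunhofer AISEC: the two-dimensional feature vectors are randomly generated stand-ins for acoustic features, the size of the “artifact” shift is an arbitrary assumption, and logistic regression replaces the deep neural network a real detector would use. All function names are hypothetical.

```python
import math
import random


def make_example(is_fake, rng):
    """Return a hypothetical 2-dim feature vector for one audio clip.

    Assumption: fakes differ from real clips by a small shift in both
    features, a synthetic stand-in for a subtle synthesis artifact.
    """
    shift = 1.5 if is_fake else -1.5
    return [rng.gauss(shift, 1.0), rng.gauss(shift, 1.0)]


def make_dataset(n, rng):
    """Build n labelled examples, alternating real (0) and fake (1)."""
    data = []
    for i in range(n):
        label = i % 2
        data.append((make_example(label == 1, rng), label))
    return data


def predict_proba(weights, bias, features):
    """Estimated probability that the clip is fake."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def train(data, epochs=50, lr=0.1):
    """Fit logistic regression with plain stochastic gradient descent."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for features, label in data:
            error = predict_proba(weights, bias, features) - label
            weights = [w - lr * error * x for w, x in zip(weights, features)]
            bias -= lr * error
    return weights, bias


def accuracy(weights, bias, data):
    """Fraction of clips whose predicted class matches the label."""
    correct = sum(
        (predict_proba(weights, bias, f) >= 0.5) == (label == 1)
        for f, label in data
    )
    return correct / len(data)


rng = random.Random(0)  # fixed seed so the run is repeatable
train_set = make_dataset(400, rng)
test_set = make_dataset(200, rng)  # held out, never seen in training
weights, bias = train(train_set)
print(f"held-out accuracy: {accuracy(weights, bias, test_set):.2f}")
```

Accuracy is measured on held-out clips for the same reason it matters in practice: a detector is only useful if it recognises fakes it has not seen during training.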

Yet we will also have to rethink our habits: just as we should not trust every article published on the Internet, we must learn to question every audio or video clip, because it is not a question of “if”, but of “when” fake material will affect us on a large scale. When the time comes, we should be prepared both technically and socially.

Links:
[L1] https://futurism.com/adobes-new-audio-tool-can-edit-anyones-speech
[L2] https://www.youtube.com/watch?v=I3l4XLZ59iw
[L3] https://www.youtube.com/watch?v=30NvDC1zcL8
[L4] https://www.nature.com/articles/s41467-018-07761-2
[L5] https://github.com/CorentinJ/Real-Time-Voice-Cloning
[L6] https://arxiv.org/abs/1806.04558

Please contact:
Nicolas Müller, Fraunhofer AISEC, Germany
+49 89 3229986 197
