Whether you’re describing the sound of a broken car engine or howling like the neighbor’s cat, imitating a sound with your voice can help convey a concept when words don’t do it justice.
Vocal imitation is the sonic equivalent of scribbling a quick picture to convey something you see. Instead of using a pencil to depict an image, you use your vocal tract to depict a sound. This may seem difficult, but it’s something everyone does intuitively: to experience it for yourself, try recreating the sound of an ambulance siren, a crow, or a bell being struck with your own voice.
Inspired by the cognitive science of how we communicate, researchers at the Massachusetts Institute of Technology’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have developed an AI system that can produce human-like vocal imitations without any training, and without ever having “heard” a human vocal impression before.
To accomplish this, the researchers designed a system that can produce and interpret sounds much like we do. They started by building a model of the human vocal tract that simulates how vibrations from the voice box are shaped by the throat, tongue, and lips. They then used a cognitively inspired AI algorithm to control this vocal tract model and make it produce imitations, taking into account the context-specific ways humans choose to communicate sound.
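To make the vocal tract idea concrete, here is a minimal source-filter sketch in Python. This illustrates the general technique, not the CSAIL team’s implementation, and every parameter value (pitch, formant frequencies, bandwidths) is a hypothetical stand-in: a pulse train plays the role of the vibrating voice box, and a cascade of resonant filters plays the role of the throat, tongue, and lips.

```python
# Minimal source-filter vocal tract sketch (illustrative only, not the
# paper's model). Requires numpy and scipy.
import numpy as np
from scipy.signal import lfilter

SR = 16000  # sample rate in Hz

def glottal_source(f0, duration):
    """Impulse train at pitch f0, standing in for voice-box vibration."""
    signal = np.zeros(int(SR * duration))
    signal[::int(SR / f0)] = 1.0
    return signal

def formant_filter(signal, freq, bandwidth):
    """Second-order resonator: one vocal-tract resonance (formant)."""
    r = np.exp(-np.pi * bandwidth / SR)
    theta = 2 * np.pi * freq / SR
    a = [1.0, -2 * r * np.cos(theta), r ** 2]  # resonant poles
    return lfilter([1.0 - r], a, signal)

def synthesize(f0, formants, duration=0.5):
    """Shape the glottal source with a cascade of formants, the way the
    throat, tongue, and lips shape the voice."""
    y = glottal_source(f0, duration)
    for freq, bw in formants:
        y = formant_filter(y, freq, bw)
    return y / np.max(np.abs(y))

# Hypothetical control settings: an 'ah'-like vowel at 120 Hz pitch.
audio = synthesize(f0=120, formants=[(700, 80), (1200, 90), (2600, 120)])
```

In a setup like this, a controller searches over articulatory parameters such as the pitch and formant trajectories rather than over raw audio samples, which is what makes the output sound like a voice rather than a recording.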
The model can take many sounds from the world and generate a human-like imitation of them, including noises like leaves rustling, a snake hissing, and an approaching ambulance siren. And just as some computer vision systems can recover high-quality images from sketches, the model can also be run in reverse to infer real-world sounds from human vocal imitations. For example, the model can correctly distinguish a human imitating a cat’s “meow” from one imitating its “hiss.”
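Running a model like this “in reverse” can be pictured as inference by synthesis: to decide what a heard imitation refers to, ask which real-world sound’s best imitation comes closest to it. The sketch below is an assumed formulation, not the paper’s algorithm, and `imitate` and `distance` are hypothetical stand-ins for a forward imitation model and a perceptual distance.

```python
# Inference-by-synthesis sketch (assumed formulation, not the paper's
# algorithm): which real sound best explains the imitation we heard?
def infer_sound(heard_imitation, candidate_sounds, imitate, distance):
    """
    heard_imitation: audio of a human vocal imitation
    candidate_sounds: dict mapping labels ('meow', 'hiss', ...) to audio
    imitate: hypothetical forward model, real audio -> best imitation
    distance: hypothetical perceptual distance between audio signals
    """
    scores = {
        label: distance(heard_imitation, imitate(sound))
        for label, sound in candidate_sounds.items()
    }
    return min(scores, key=scores.get)  # the most plausible referent
```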
In the future, this model could lead to more intuitive “imitation-based” interfaces for sound designers, more human-like AI characters in virtual reality, and even methods that help students learn new languages.
Co-lead authors Kartik Chandra SM ’23 and Karima Ma, MIT CSAIL doctoral students, and undergraduate researcher Matthew Caren note that computer graphics researchers have long recognized that realism is rarely the ultimate goal of visual expression. For example, an abstract painting or a child’s crayon doodle can be just as expressive as a photograph.
“Over the past few decades, advances in sketching algorithms have led to new tools for artists, advances in AI and computer vision, and a deeper understanding of human cognition,” Chandra says. “Just as a sketch is an abstract, non-photorealistic representation of an image, our method captures the abstract, non-phonorealistic ways humans express the sounds they hear. This teaches us about the process of auditory abstraction.”
“The goal of this project is to understand and computationally model vocal imitation, which we take to be the auditory equivalent of sketching in the visual domain,” says Caren.
The art of imitation, in three parts
The team developed three increasingly nuanced versions of the model and compared them to human vocal imitations. First, they created a baseline model that simply aimed to produce imitations as acoustically similar to real-world sounds as possible. However, this model did not match human behavior well.
Next, the researchers designed a second, “communicative” model. According to Caren, this model considers what is distinctive about a sound to a listener. For instance, to imitate a motorboat you would likely mimic the rumble of its engine, since it is the sound’s most distinctive auditory feature, even though it is not its loudest aspect (compared with, say, the water splashing). This second model produced better imitations than the baseline, but the team wanted to improve it further.
To take their method a step further, the researchers added a final layer of reasoning to the model. “Vocal imitations can sound different depending on the amount of effort you put into them. It costs time and energy to produce a perfectly accurate sound,” says Chandra. The researchers’ full model accounts for this by avoiding utterances that are very rapid, loud, or high- or low-pitched, which people are less likely to use in conversation. The result is a more human-like imitation that closely matches many of the decisions humans make when imitating the same sounds.
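One way to picture these three layers is as terms in a single score that a candidate imitation is optimized against: plain acoustic similarity, a bonus for being distinctive rather than merely loud, and a penalty for effortful speech. The weights and helper functions below are hypothetical stand-ins, not the paper’s actual objective.

```python
# Hypothetical composite score reflecting the article's three model
# layers; `similarity` and `effort` are illustrative stand-in functions.
def imitation_score(imitation, target, other_sounds,
                    similarity, effort, w_comm=1.0, w_eff=0.5):
    # Layer 1 (baseline): plain acoustic similarity to the target sound.
    base = similarity(imitation, target)

    # Layer 2 (communicative): reward resembling the target more than any
    # other sound a listener might confuse it with, so distinctive features
    # (a motorboat's engine rumble) beat merely loud ones (splashing).
    confusability = max(similarity(imitation, s) for s in other_sounds)
    distinctiveness = base - confusability

    # Layer 3 (effort): penalize very fast, loud, or extreme-pitch
    # utterances, which people rarely use in conversation.
    return base + w_comm * distinctiveness - w_eff * effort(imitation)
```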
After building this model, the team conducted a behavioral experiment to see whether human judges perceived the AI- or human-generated vocal imitations as better. Notably, participants favored the AI model 25 percent of the time overall, and as much as 75 percent of the time for an imitation of a motorboat and 50 percent of the time for an imitation of a gunshot.
Aiming for more expressive sound technology
Caren, who is passionate about technology for music and art, envisions the model helping artists better communicate sounds to computational systems, and assisting filmmakers and other content creators in generating AI sounds that are more nuanced and tailored to a specific context. It could also enable a musician to rapidly search a sound database by imitating a noise that is difficult to describe in, say, a text prompt.
Meanwhile, Caren, Chandra, and Ma are examining the model’s implications in other domains, including the development of language, how infants learn to talk, and even imitative behaviors in birds such as parrots and songbirds.
The team still has work to do on the current iteration of the model. It struggles with some consonants, such as “z,” which led to inaccurate impressions of some sounds, like the buzzing of bees. It also can’t yet replicate how humans imitate speech, music, or sounds that are imitated differently across languages, such as a heartbeat.
Robert Hawkins, a professor of linguistics at Stanford University, notes that language is full of onomatopoeia: words that mimic, but don’t fully replicate, the things they describe, like the “meow” sound, which only inexactly approximates the sound cats make. “The processes that get us from the sound of a real cat to a word like ‘meow’ reveal a lot about the intricate interplay between physiology, social reasoning, and communication in the evolution of language,” says Hawkins, who wasn’t involved in the CSAIL research. “This model presents an exciting step toward formalizing and testing theories of those processes, demonstrating that both physical constraints from the human vocal tract and social pressures from communication are needed to explain the distribution of vocal imitations.”
Caren, Chandra, and Ma wrote the paper with two other CSAIL affiliates: Jonathan Ragan-Kelley, an associate professor in the MIT Department of Electrical Engineering and Computer Science, and Joshua Tenenbaum, an MIT professor of brain and cognitive sciences and a member of the Center for Brains, Minds, and Machines. Their research was supported in part by the Hertz Foundation and the National Science Foundation, and was presented at SIGGRAPH Asia in early December.