A team of generative AI researchers has created a Swiss Army knife for sound that allows users to control audio output simply using text.
Some AI models can compose a song or modify a voice, but none have the dexterity of the new tool.
The tool, called Fugatto (short for Foundational Generative Audio Transformer Opus 1), uses any combination of text and audio files to generate or transform any mix of music, voices, and sounds described in the prompts.
For example, it can create a music snippet based on a text prompt, remove or add instruments in an existing song, change the accent or emotion in a voice, or even produce sounds never heard before.
“This is wild,” says Ido Zmishlany, multi-platinum producer and songwriter and co-founder of One Take Audio, a member of the NVIDIA Inception program for cutting-edge startups. “Sound is my inspiration. It’s what drives me to make music. The idea of being able to instantly create completely new sounds in the studio is incredible.”
Understand audio by sound
“We wanted to create a model that understands and generates sound the way humans do,” said Rafael Valle, a manager of applied audio research at NVIDIA and one of the dozen or so people behind Fugatto, as well as an orchestral conductor and composer.
Supporting numerous audio generation and transformation tasks, Fugatto is the first foundational generative AI model to showcase emergent properties (capabilities that arise from the interaction of its various trained abilities) along with the ability to combine free-form instructions.
“Fugatto is our first step toward a future where unsupervised multitask learning in audio synthesis and transformation emerges from data and model scale,” said Valle.
Use case sample playlist
For example, music producers could use Fugatto to quickly prototype or edit an idea for a song, trying out different styles, voices, and instruments. They could also add effects and improve the overall audio quality of an existing track.
“The history of music is also the history of technology. The electric guitar gave the world rock ‘n’ roll. When the sampler came along, hip-hop was born,” Zmishlany said. “With AI, we are writing the next chapter of music. We have new instruments, new tools to make music, and it’s so exciting.”
An advertising agency could apply Fugatto to quickly adapt an existing campaign for multiple regions or situations, applying different accents and emotions to voiceovers.
Language learning tools could be personalized to use any voice a speaker chooses. Imagine an online course spoken in the voice of a family member or friend.
Video game developers could use the model to modify prerecorded assets in their titles to fit the changing action as users play the game. Or they could create new assets on the fly from text instructions and optional audio input.
Make fun noises
“One of the model’s capabilities that we are particularly proud of is what we call the avocado chair,” Valle said, referring to a novel visual created by a generative AI model for imaging.
For example, Fugatto can make a trumpet bark or a saxophone meow. Whatever users can describe, the model can create.
The researchers found that with fine-tuning and a small amount of singing data, the model can handle tasks it was not pretrained on, such as generating a high-quality singing voice from a text prompt.
Users gain artistic control
Several of Fugatto’s capabilities add to its novelty.
During inference, the model uses a technique called ComposableART to combine instructions that were only seen separately during training. For example, a combination of prompts could ask for text spoken with a sad feeling in a French accent.
The model’s ability to interpolate between instructions gives the user fine-grained control over the text instructions (in this case, the weight of the accent and the degree of sadness).
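The idea of weighting one attribute against another can be illustrated with a toy example. The article does not describe how ComposableART works internally, so the following is only a minimal sketch under stated assumptions: each instruction is represented by a made-up numeric vector, and the function `blend_instructions`, the vector values, and the instruction names are all hypothetical and are not part of Fugatto or any NVIDIA API.

```python
from typing import Dict, List


def blend_instructions(
    vectors: Dict[str, List[float]], weights: Dict[str, float]
) -> List[float]:
    """Return the weighted average of per-instruction vectors.

    Hypothetical helper: it only illustrates user-chosen attribute weights,
    not the actual ComposableART technique.
    """
    total = sum(weights[name] for name in vectors)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    dim = len(next(iter(vectors.values())))
    blended = [0.0] * dim
    for name, vec in vectors.items():
        if len(vec) != dim:
            raise ValueError("all instruction vectors must have the same length")
        share = weights[name] / total  # normalize so the shares sum to 1
        for i, value in enumerate(vec):
            blended[i] += share * value
    return blended


# Made-up toy vectors standing in for two instructions (assumption).
instruction_vectors = {
    "french_accent": [1.0, 0.0, 0.0],
    "sad": [0.0, 1.0, 0.0],
}

# Heavier accent, lighter sadness.
mostly_accent = blend_instructions(
    instruction_vectors, {"french_accent": 0.75, "sad": 0.25}
)
print(mostly_accent)  # -> [0.75, 0.25, 0.0]
```

Changing the two weights moves the result smoothly between the "accent" direction and the "sad" direction, which is the kind of fine-grained control the paragraph above describes.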
“We wanted to allow users to combine attributes in a subjective or artistic way, choosing how much weight to place on each attribute,” said Rohan Badlani, the AI researcher who designed these aspects of the model.
“Tests often yield surprising results, and even though I’m a computer scientist, I feel a bit like an artist,” said Badlani, who earned a master’s degree in computer science with an emphasis on AI from Stanford University.
The model also produces sounds that change over time, a feature he calls temporal interpolation. For example, it can create the sound of a thunderstorm passing through an area, intensifying and then slowly fading into the distance. It also gives users fine-grained control over how the soundscape evolves.
Additionally, unlike most models, which can only reproduce the training data they have been exposed to, Fugatto lets users create soundscapes it has never encountered before, such as a thunderstorm fading into dawn with birdsong.
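One way to picture a sound that changes over time is as a weight that rises and falls along a timeline. The sketch below is an illustration only, not Fugatto's implementation: the function `weight_schedule`, the keyframe format, and the timings are all assumptions made for this example. It linearly interpolates the strength of a hypothetical "thunderstorm" instruction between user-chosen keyframes.

```python
from typing import List, Tuple


def weight_schedule(
    keyframes: List[Tuple[float, float]], times: List[float]
) -> List[float]:
    """Linearly interpolate a weight between (time, weight) keyframes.

    Hypothetical helper for illustration; times outside the keyframe
    range are clamped to the first or last keyframe weight.
    """
    points = sorted(keyframes)
    for (t0, _), (t1, _) in zip(points, points[1:]):
        if t1 <= t0:
            raise ValueError("keyframe times must be strictly increasing")

    result = []
    for t in times:
        if t <= points[0][0]:
            result.append(points[0][1])
            continue
        if t >= points[-1][0]:
            result.append(points[-1][1])
            continue
        # Find the segment containing t and interpolate within it.
        for (t0, w0), (t1, w1) in zip(points, points[1:]):
            if t0 <= t <= t1:
                frac = (t - t0) / (t1 - t0)
                result.append(w0 + frac * (w1 - w0))
                break
    return result


# Assumed timeline in seconds: silent at 0 s, loudest at 4 s, gone by 12 s.
storm_keyframes = [(0.0, 0.0), (4.0, 1.0), (12.0, 0.0)]
storm_weights = weight_schedule(storm_keyframes, [0.0, 2.0, 4.0, 8.0, 12.0])
print(storm_weights)  # -> [0.0, 0.5, 1.0, 0.5, 0.0]
```

The storm weight climbs quickly to its peak and then decays over a longer stretch, mirroring a storm that intensifies and then slowly fades into the distance.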
Inside view
Fugatto is a foundational generative transformer model that builds on the team’s previous work in areas such as speech modeling, audio vocoding, and audio understanding.
The full version uses 2.5 billion parameters and was trained on a set of NVIDIA DGX systems with 32 NVIDIA H100 Tensor Core GPUs.
Fugatto was created by a diverse group of people from around the world, including India, Brazil, China, Jordan, and South Korea. Their collaboration strengthened Fugatto’s multi-accent and multilingual capabilities.
One of the most challenging parts of the effort was generating a blended dataset containing millions of audio samples for training. The team employed a multifaceted strategy to generate data and instructions that significantly expanded the range of tasks the model could perform, while achieving more accurate performance and enabling new tasks without the need for additional data.
They also scrutinized existing datasets to uncover new relationships among the data. The overall work took more than a year.
Valle remembers two moments when he felt his team was on to something. “We were blown away when the prompts first generated music,” he said.
Later, the team demonstrated Fugatto responding to a prompt by creating electronic music with dogs barking in time to the beat.
“When the group broke up laughing, it really warmed my heart.”
Hear what Fugatto can do: