The Global Impact of Nvidia’s AI Sound Model: Fugatto

By Kitty Wheeler

November 26, 2024

Nvidia introduces the ‘world’s most flexible sound machine’

Nvidia's AI model revolutionises audio generation across multiple industries using ComposableART technology powered by Nvidia H100 GPUs

The rise of Gen AI has transformed how content is created, enabling artists, producers and developers to push the boundaries of creativity and efficiency.

To push Gen AI further, Nvidia has introduced Fugatto, a Gen AI model designed to create and modify audio using text and audio prompts.

This technology allows users to generate a wide range of sounds, including music and voice alterations, marking a significant advancement in audio production capabilities.

As the demand for versatile audio solutions grows globally, Fugatto positions itself as a powerful tool for professionals in various creative industries across the world.

A new era in audio generation

Fugatto, short for Foundational Generative Audio Transformer Opus 1, represents a significant leap in Gen AI technology. Unlike previous models, which could compose music or modify voices only as separate tasks, Fugatto combines these functions in a single, flexible model.

Users can generate music snippets from text prompts, adjust existing tracks by adding or removing instruments, and even change vocal accents or emotions.

The model can also create entirely new sounds that have never been heard before.

Manager of Applied Audio Research at Nvidia, Rafael Valle

Rafael Valle, a Manager of Applied Audio Research at Nvidia, stated: “We wanted to create a model that understands and generates sound like humans do.”

This ambition has led to the development of Fugatto as the first foundational Gen AI model capable of showcasing emergent properties—abilities that arise from the interaction of its various trained skills.

The model's design allows it to interpret complex instructions and produce audio outputs that reflect nuanced human expressions.

Fugatto also employs a technique called ComposableART, allowing users to combine multiple instructions during the generation process.

For instance, users could request a voice prompt spoken with sadness in a French accent.

This level of control allows for artistic expression and fine-tuning of audio outputs, making it easier for creators to achieve their desired results.
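The article does not describe how ComposableART works internally. One common way for generative models to combine several conditions is to start from an unconditional output and add a weighted push towards each instruction. The sketch below illustrates that general idea only; it is not Nvidia's implementation, and the function name `combine_instructions`, the toy vectors and the weights are all hypothetical.

```python
from typing import Dict, List


def combine_instructions(
    unconditional: List[float],
    conditional: Dict[str, List[float]],
    weights: Dict[str, float],
) -> List[float]:
    """Blend per-instruction outputs around an unconditional baseline.

    Hypothetical illustration: each instruction contributes its difference
    from the unconditional output, scaled by a user-chosen weight.
    """
    combined = list(unconditional)
    for name, output in conditional.items():
        weight = weights.get(name, 0.0)  # instructions without a weight are ignored
        for i, (cond_value, base_value) in enumerate(zip(output, unconditional)):
            combined[i] += weight * (cond_value - base_value)
    return combined


# Toy stand-ins for model outputs (assumed values, not real audio features).
baseline = [0.0, 0.0, 0.0]
outputs = {
    "sad": [1.0, 0.0, 0.0],
    "french_accent": [0.0, 2.0, 0.0],
}
# A mild amount of sadness and a full-strength accent.
weights = {"sad": 0.5, "french_accent": 1.0}

blended = combine_instructions(baseline, outputs, weights)
print(blended)
```

Raising or lowering a weight in this sketch changes how strongly that instruction shapes the result, which is the kind of fine-grained control the sadness-plus-French-accent example suggests.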

Fugatto’s practical applications across industries

The versatility of Fugatto opens up numerous applications across different sectors.

Music

Music producers can use the model to quickly prototype ideas, experimenting with various styles and arrangements without extensive manual input.

“The history of music is also a history of technology. The electric guitar gave the world rock and roll. When the sampler showed up, hip-hop was born,” said Ido Zmishlany, a multi-platinum producer and songwriter — and cofounder of One Take Audio, a member of the Nvidia Inception programme for cutting-edge startups.

Member of the Nvidia Inception programme, Ido Zmishlany (image credit: Spotify for Artists)

“The idea that I can create entirely new sounds on the fly in the studio is incredible,” Ido adds.

This capability not only accelerates the creative process but also enhances collaboration among artists.

Advertising

In advertising, agencies can leverage Fugatto to tailor voiceovers for specific regions by modifying accents and emotions based on target demographics.

This adaptability ensures that marketing messages resonate more effectively with diverse audiences.

Language learning platforms could personalise lessons by using voices familiar to learners, enhancing engagement through relatable audio content.

Video gaming

Video game developers also stand to benefit from Fugatto’s capabilities.

They can modify pre-recorded sound assets in real time as gameplay changes, or generate new audio on the fly from text instructions.

This adaptability allows for a more immersive gaming experience as soundscapes evolve with player actions.

Audio production

Moreover, Fugatto's ability to generate sounds that change over time adds another layer of realism to audio production.

For example, it can simulate the gradual transition of a thunderstorm moving through an area, complete with crescendos of thunder that fade into the distance.

This feature enhances the realism of soundscapes created by users and opens up new creative possibilities.

Under the hood: technology and training

Fugatto is built on a foundational generative transformer architecture that incorporates advancements in speech modelling and audio understanding.

The full version utilises 2.5 billion parameters and was trained on Nvidia DGX systems equipped with 32 Nvidia H100 Tensor Core GPUs.

This powerful hardware enables the model to process vast amounts of audio data efficiently.
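To put the 2.5 billion parameter figure in perspective, the short calculation below estimates how much memory the model weights alone would occupy. The storage precisions are assumptions, since the article does not say which numeric format Fugatto uses, and the estimate excludes activations, optimiser state and training data.

```python
PARAMETERS = 2_500_000_000  # parameter count quoted in the article

# Assumed storage precisions (not stated in the article).
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weights_size_gb(num_params: int, bytes_per_param: int) -> float:
    """Return the size of the raw weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


sizes = {
    precision: weights_size_gb(PARAMETERS, width)
    for precision, width in BYTES_PER_PARAM.items()
}

for precision, size in sizes.items():
    print(f"{precision}: {size:.1f} GB of weights")
```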

The development team behind Fugatto comprises researchers from diverse backgrounds around the world, including India, Brazil, China, Jordan and South Korea.

Their collaboration has strengthened the model's multilingual and multi-accent capabilities.

The challenges of building Fugatto

One of the challenges faced during development was creating a blended dataset containing millions of audio samples for training purposes.

The team employed innovative strategies to generate data that expanded the range of tasks Fugatto could perform while maintaining high accuracy.

Rafael recalls two pivotal moments during development: “The first time it generated music from a prompt, it blew our minds.”

Later demonstrations showcased Fugatto responding to prompts with unexpected yet delightful results, such as creating electronic music featuring barking dogs in rhythm with the beat.

Fugatto’s unique capabilities extend beyond traditional sound generation; it can create entirely new soundscapes based on user descriptions.

For instance, it can produce sounds like a trumpet barking or a saxophone meowing—demonstrating its ability to interpret imaginative prompts creatively.

The potential of Gen AI applications like this is vast and varied, from enhancing music production processes to revolutionising advertising strategies and enriching video game experiences.

Nvidia’s Fugatto represents not just an advancement in technology but also an opportunity for creators across industries to explore new frontiers in sound design.

As Rafael noted about the future possibilities: “Fugatto is our first step toward a future where unsupervised multitask learning in audio synthesis and transformation emerges from data and model scale.”

Technology Magazine is a BizClik brand