AudioGen-Omni: A Unified Multimodal Diffusion Transformer for Video-Synchronized Audio, Speech, and Song Generation

Le Wang1,2, Jun Wang2, Feng Deng2, Chen Zhang2, Di Zhang2, Kun Gai2
1China University of Mining and Technology, 2Kuaishou Technology

Abstract

We present AudioGen-Omni, a unified approach based on multimodal diffusion transformers (MMDiT) that generates high-fidelity audio, speech, and songs coherently synchronized with the input video. AudioGen-Omni introduces a novel joint training paradigm that integrates large-scale video-text-audio corpora, yielding a model that generates semantically rich, acoustically diverse audio conditioned on multimodal inputs and adapts to a wide range of audio generation tasks. It employs a unified lyrics-transcription encoder that encodes graphemes and phonemes from both sung and spoken inputs into dense frame-level representations. These representations are fused using an AdaLN-based joint attention mechanism enhanced with phase-aligned anisotropic positional infusion (PAAPI), wherein RoPE is selectively applied to temporally structured modalities to ensure precise and robust cross-modal alignment. By unfreezing all modalities and masking missing inputs, AudioGen-Omni mitigates the semantic constraints of text-frozen paradigms, enabling effective cross-modal conditioning. This joint training approach improves audio quality, semantic alignment, and lip-sync accuracy, and also delivers high-quality results on Text-to-Audio/Speech/Song tasks. With an inference time of 1.91 seconds for 8 seconds of audio, it offers substantial improvements in both efficiency and generality.
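
To put the reported speed in more familiar terms, the short sketch below converts the two figures from the abstract (1.91 seconds of inference for 8 seconds of audio) into a real-time factor. The function name `real_time_factor` is illustrative only, and the hardware behind the measurement is not stated in this section, so the result should be read as a ratio rather than a benchmark.

```python
# Figures taken from the abstract: 1.91 s of compute for 8 s of generated audio.
INFERENCE_SECONDS = 1.91
AUDIO_SECONDS = 8.0


def real_time_factor(inference_seconds: float, audio_seconds: float) -> float:
    """Compute time spent per second of generated audio (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return inference_seconds / audio_seconds


rtf = real_time_factor(INFERENCE_SECONDS, AUDIO_SECONDS)
speedup = 1.0 / rtf  # how many seconds of audio are produced per second of compute

print(f"real-time factor: {rtf:.3f}")
print(f"speed relative to real time: {speedup:.2f}x")
```

A real-time factor below 1 means generation runs faster than playback; here roughly a quarter of a second of compute is spent per second of audio.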

Method

Architecture

To the best of our knowledge, AudioGen-Omni is the first unified framework capable of generating diverse audio types, including general audio, speech, and song, under flexible multimodal conditions while maintaining precise text-audio-visual alignment. Text, video, lyrics, and transcription are used as conditional inputs. A lightweight module maps raw grapheme or phoneme sequences into dense, frame-aligned representations without requiring phoneme duration supervision. To ensure text-audio-visual synchrony, Rotary Positional Embeddings (RoPE) are applied not only to the visual and audio streams but also to temporally aligned text inputs such as lyrics and transcription, which improves temporal consistency across modalities.
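
The selective use of RoPE described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the modality table, the frame rates, and the shared clock `SHARED_RATE` are hypothetical values chosen for the example, and treating the text caption as non-temporal is an assumption based on the statement that RoPE is applied to temporally aligned inputs. The idea shown is that tokens from different streams that describe the same instant receive the same rotation, while a stream without temporal order is left unrotated.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate consecutive feature pairs of `vec` by position-dependent angles."""
    dim = len(vec)
    if dim % 2:
        raise ValueError("RoPE needs an even feature dimension")
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


# Hypothetical modality table: (is_temporal, frames_per_second). The rates are
# illustrative placeholders, not values reported for AudioGen-Omni.
MODALITIES = {
    "audio": (True, 40.0),
    "video": (True, 8.0),
    "lyrics_or_transcription": (True, 40.0),
    "text_caption": (False, None),
}
SHARED_RATE = 40.0  # assumed common clock, in positions per second


def positioned(tokens, modality):
    """Apply RoPE only to temporal modalities, on a shared time axis."""
    is_temporal, fps = MODALITIES[modality]
    if not is_temporal:
        return [list(t) for t in tokens]  # no temporal order: leave untouched
    # frame index -> seconds -> position on the shared clock
    return [rope(t, (idx / fps) * SHARED_RATE) for idx, t in enumerate(tokens)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


token = [1.0, 0.0, 0.5, -0.5]
video = positioned([token] * 3, "video")
audio = positioned([token] * 11, "audio")
caption = positioned([token] * 2, "text_caption")

# Video frame 2 (8 fps) and audio frame 10 (40 fps) both fall at 0.25 s.
same_instant = max(abs(v - a) for v, a in zip(video[2], audio[10]))
print("video frame 2 vs audio frame 10 difference:", same_instant)
print("caption unchanged:", caption[1] == token)
```

Because RoPE rotations compose, the dot product between a rotated query and a rotated key depends only on the difference of their positions, which is what lets attention scores reflect relative timing across streams that share a clock.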

AudioGen-Omni Demo Gallery

Video To Audio

A bonfire burns intensely in the snowy terrain.

A helicopter hovers at low altitude near massive flames.

The audio clip features the sound of a sniper rifle being fired.

The audio features the sound of rain falling gently on lotus flowers.

A polar bear plays guitar quickly while wearing headphones, with the sound of a large waterfall in the background.

You can hear someone chewing, accompanied by smacking sounds in this audio.

Video To Speech

Ten babies lost to abortion. College students aren't huge fans of pro-life displays. Remember this Texas state guy who engaged.

Detailed brush that I like to use that's a little bit skinnier has some really really firm brushes right here So let me bring you guys in and show you exactly how I'm gonna super.

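那你就先了解清楚情况再说,不要张口就来。 (English: Then get a clear picture of the situation first; don't just say whatever comes to mind.)

你即刻去通知那个人,此番出手一击不中,现在宫中风口又紧。 (English: Go and notify that person at once: this attempt failed to hit its mark, and scrutiny in the palace is tight again right now.)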

The classic look gets a bit undercut by these buttons on the side. If you take a look, it's got this kind of cheap plastic look.

For today's show, if you've got a moment, hop on to Twitter. How much are you using customized keyboards on your digital devices? Tweet us at GlobeNow. I'm Afan Chaudhry.

Video To Song

Frack juice is ready now. Na, na, na, na, na, na, na. I want a glass of dino juice. What does dino juice mean?

Quack, quack, quack, quack. But only four little ducks.

Keep our Jake, she was a bad bad member list calling it Kris now, baby I' a Rick crash out my face.

I should just walk away, but I can't move my feet more than I know you.

But in my darkest hour, I know that you are there.

Everything I'll do this way man, anybody good I swear I'm, I'm chilling, I'm good.

Joint Video–Audio Generation Pipeline

Here, we demonstrate that AudioGen-Omni can be extended to generate joint audio-video content from image and text inputs simply by adding a video latent stream to the original video feature branch. The provided image serves as the first frame of the video, and the model then generates the corresponding video and audio from the textual description. Built upon the native MMDiT audio-video framework, this approach enables unified image-and-text-to-audio-video generation. Below are some internal evaluation examples showcasing the model’s performance across various scenarios. More details and application cases will be shared soon.
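
The section does not say how the provided image is injected as the first frame, so the following is only a sketch of one common way to do image-conditioned video sampling, not a description of the AudioGen-Omni pipeline. It assumes the first frame's latent is kept clean while later frames start from Gaussian noise, and that a mask prevents the sampler from modifying the conditioned frame. The names `init_video_latents` and `apply_update` and the toy latent size are hypothetical.

```python
import random


def init_video_latents(image_latent, num_frames, seed=0):
    """Return (latents, denoise_mask) for image-conditioned video sampling.

    Frame 0 is the clean latent of the provided image; every later frame starts
    as Gaussian noise. The mask is False where a frame must be kept fixed.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    rng = random.Random(seed)
    dim = len(image_latent)
    latents = [list(image_latent)]
    for _ in range(num_frames - 1):
        latents.append([rng.gauss(0.0, 1.0) for _ in range(dim)])
    mask = [False] + [True] * (num_frames - 1)
    return latents, mask


def apply_update(latents, update, mask):
    """Apply a sampler update only to frames the mask marks as free."""
    return [
        [x + u for x, u in zip(frame, upd)] if free else list(frame)
        for frame, upd, free in zip(latents, update, mask)
    ]


# Toy example: a 2-dimensional "image latent" and a 4-frame clip.
latents, mask = init_video_latents([0.1, 0.2], num_frames=4)
step = [[1.0, 1.0]] * 4  # stand-in for one denoising update
updated = apply_update(latents, step, mask)

print("first frame preserved:", updated[0] == [0.1, 0.2])
print("mask:", mask)
```

In a full system the update would come from the diffusion transformer and the audio latents would be denoised alongside the video latents; the sketch isolates only the first-frame constraint.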

Joint Video–Audio Generation Pipeline: Internal Evaluation Gallery

Text & Image To Video & Audio

A man wearing a black fitted long-sleeve shirt and glasses speaks in front of a background of brick walls and cabinets, saying: 'because then you see people from across the world coming out. And when you become friends with certain people, like different people, you learn about what they believe in. And it just helps someone be more aware.'

A man in a black shirt sits on a couch with a black-and-white patterned backrest, speaking happily with expressive hand gestures before a light-colored wall, saying: 'Where are my words guiding me? Where are my words guiding me? Are they guiding you to be a person that people look at and say, man, they are an encourager, they are loving, they are caring.'

Two people sit before a wooden plank backdrop, the left in a white shirt with short light brown hair, the right in a gray polo over a white undershirt; in warm indoor lighting, a single young male voice (16–25) speaks angrily over continuous wind howling, saying: 'You do it this way. It just seems like you get so much more glory going this route with it. Or, I mean, that's the game Habakkuk was playing. He was like, how could you let your people practice injustice?'

In an indoor setting with hunting-themed decor, including mounted deer heads and antlers, a man in a blue shirt and glasses speaks in a neutral tone, with no background sounds, saying: 'Brad has filmed about 15 of those. We have those already. And then Lisa, she will be giving us instructions on how.'

Text To Speech

Prompt: In this audio, you can hear a young woman around 30 years old speaking English without any regional accent.
Transcription: This mixture was formulated to wash two pillows.

Prompt: In the audio scene, there is one individuals, likely male, speaking in English without any apparent mood or physical indicators.
Transcription: In front of the chamber is our calibration equipment. Behind is the tiers instrument.

Prompt: A female speaker, aged between 16 and 25, is speaking in English.
Transcription: There will be another video detailing these mods, but in a nutshell, these are pretty convenient.

Prompt: An adult male is speaking, and the language used is English.
Transcription: Loan sharks, cutthroat interest rates and unmanageable debt are a huge problem all over the country.

Text To Music

Prompt: A lively jazz piece with a saxophone solo over an upbeat piano melody, electric bass guitar, and syncopated acoustic drum beat. Suitable for swing dancing or a classy coffee shop atmosphere.
Lyrics: ""

Prompt: A hip-hop beat perfect for a summer setting, featuring drums, bass, piano, synths, and occasional string accents, creating a laid-back yet rhythmic vibe.
Lyrics: It's all slow jams in the back of the T skyline in her eyes and a start to wonder that this time is my time to be shiny.

Prompt: A lively pop-rock track with drums, bass, electric guitar, and acoustic guitar. The genre is rock, including elements of alternative and pop. The mood is cheerful. This piece could be used as background music for a shopping mal.
Lyrics: Go a leave behind one drinking at not to again.

Prompt: A lively pop-rock track with drums, bass, electric guitar, and acoustic guitar.
Lyrics: My friends said you want bad man, I should't listen to them back.

Text To Sound Effects

Prompt:

You can hear the sound of eating fruit in the audio, with sounds of chewing and occasional movement.

Prompt:

In the audio, the tractor's engine rumbles loudly.

Prompt:

In the audio, a group of birds can be heard chirping.

Prompt:

A woman walking with the sound of her footsteps on a hard surface.