AudioGen-Omni: A Unified Multimodal Diffusion Transformer for Video-Synchronized Audio, Speech, and Song Generation

Le Wang1,2, Jun Wang2, Feng Deng2, Chen Zhang2, Di Zhang2, Kun Gai2
1China University of Mining and Technology, 2Kuaishou Technology

Abstract

We present AudioGen-Omni, a unified approach based on multimodal diffusion transformers (MMDiT) that generates high-fidelity audio, speech, and songs coherently synchronized with the input video. AudioGen-Omni introduces a novel joint training paradigm that integrates large-scale video-text-audio corpora, yielding a model that generates semantically rich, acoustically diverse audio conditioned on multimodal inputs and that adapts to a wide range of audio generation tasks. AudioGen-Omni employs a unified lyrics-transcription encoder that encodes graphemes and phonemes from both sung and spoken inputs into dense frame-level representations. These representations are fused using an AdaLN-based joint attention mechanism enhanced with phase-aligned anisotropic positional infusion (PAAPI), wherein RoPE is selectively applied to temporally structured modalities to ensure precise and robust cross-modal alignment. By unfreezing all modalities and masking missing inputs, AudioGen-Omni mitigates the semantic constraints of text-frozen paradigms, enabling effective cross-modal conditioning. This joint training approach improves audio quality, semantic alignment, and lip-sync accuracy, while also delivering high-quality results on Text-to-Audio/Speech/Song tasks. With an inference time of 1.91 seconds for 8 seconds of audio (about 0.24x real time), it offers substantial improvements in both efficiency and generality.
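The idea of masking missing inputs can be illustrated with a small sketch. It is not the authors' implementation: the function name `build_conditions`, the feature width `dim`, and the all-zero placeholder are assumptions made for illustration (a trained model would more plausibly use a learned "empty" embedding). The four condition names follow the inputs listed in the architecture description (text, video, lyrics, transcription).

```python
# Hypothetical sketch: fill in absent modalities so that one model can be
# trained jointly on samples that carry different subsets of conditions.
MODALITIES = ("text", "video", "lyrics", "transcription")


def build_conditions(sample, dim=4):
    """Return (conditions, present) for one training sample.

    `sample` maps a modality name to a list of feature vectors, or to None
    (or omits the key) when that modality is missing.
    """
    conditions, present = {}, {}
    for name in MODALITIES:
        feats = sample.get(name)
        if feats is None:
            # Assumption: a single all-zero token stands in for the
            # missing modality; `present` records that it was masked.
            conditions[name] = [[0.0] * dim]
            present[name] = False
        else:
            conditions[name] = feats
            present[name] = True
    return conditions, present


# A text-to-audio sample: only the caption is available.
sample = {"text": [[1.0, 2.0, 3.0, 4.0]], "video": None}
conds, present = build_conditions(sample)
```

With this layout, a video-to-speech sample and a text-to-music sample pass through the same code path; only the `present` flags differ.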

Method

Architecture

To the best of our knowledge, AudioGen-Omni is the first unified framework capable of generating diverse audio types, including general audio, speech, and song, under flexible multimodal conditions with precise text-audio-visual alignment. Text, video, lyrics, and transcription are used as conditional inputs. A lightweight module maps raw grapheme or phoneme sequences into dense, frame-aligned representations without requiring phoneme duration supervision. To ensure text-audio-visual synchrony, Rotary Positional Embeddings (RoPE) are applied not only to the visual and audio streams but also to temporally aligned text inputs such as lyrics and transcription, enhancing temporal consistency across modalities.
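A minimal sketch of applying RoPE only to temporally structured streams follows. It is an illustration under stated assumptions, not the PAAPI formulation itself, which this page does not spell out. Here positions are expressed in seconds so that tokens from streams with different frame rates receive the same rotation when they refer to the same instant, and a non-temporal stream (for example a caption) is left unrotated. The names `rope_rotate` and `apply_positions`, and the rates of 8 and 32 tokens per second, are hypothetical.

```python
import math


def rope_rotate(vec, t_seconds, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles proportional to time."""
    d = len(vec)
    assert d % 2 == 0, "RoPE needs an even feature dimension"
    out = []
    for i in range(0, d, 2):
        freq = base ** (-i / d)          # standard RoPE frequency schedule
        angle = t_seconds * freq
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def apply_positions(tokens, rate_hz):
    """Apply RoPE to a token stream; rate_hz=None marks a non-temporal stream."""
    if rate_hz is None:
        return [list(t) for t in tokens]  # no rotation for non-temporal input
    return [rope_rotate(t, idx / rate_hz) for idx, t in enumerate(tokens)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 0.0, 0.5, -0.5]
k = [0.3, 0.7, -0.2, 0.1]
video = apply_positions([q] * 3, rate_hz=8.0)    # hypothetical 8 frames/s
audio = apply_positions([q] * 9, rate_hz=32.0)   # hypothetical 32 latents/s
caption = apply_positions([q], rate_hz=None)     # left unrotated

# Attention scores depend only on the time offset between query and key.
score_a = dot(rope_rotate(q, 0.50), rope_rotate(k, 0.25))
score_b = dot(rope_rotate(q, 1.00), rope_rotate(k, 0.75))
```

In this sketch video frame 2 and audio latent 8 both sit at 0.25 s and therefore share one rotation, which is the property that lets joint attention compare tokens across streams by their time offset.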

AudioGen-Omni Demo Gallery

Video To Audio

A bonfire burns intensely in the snowy terrain.

A helicopter hovers at low altitude near massive flames.

The audio clip features the sound of a sniper rifle being fired.

The audio features the sound of rain falling gently on lotus flowers.

A polar bear plays guitar quickly while wearing headphones, with the sound of a large waterfall in the background.

You can hear someone chewing, accompanied by smacking sounds in this audio.

Video To Speech

Ten babies lost to abortion. College students aren't huge fans of pro-life displays. Remember this Texas state guy who engaged.

Detailed brush that I like to use that's a little bit skinnier has some really really firm brushes right here So let me bring you guys in and show you exactly how I'm gonna super.

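那你就先了解清楚情况再说,不要张口就来。 (English: Then find out what is actually going on before you speak; don't just say whatever comes to mind.)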

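你即刻去通知那个人,此番出手一击不中,现在宫中风口又紧。 (English: Go and notify that person at once: this attempt failed to hit its mark, and security in the palace is tight right now.)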

The classic look gets a bit undercut by these buttons on the side. If you take a look, it's got this kind of cheap plastic look.

For today's show, if you've got a moment, hop on to Twitter. How much are you using customized keyboards on your digital devices? Tweet us at GlobeNow. I'm Afan Chaudhry.

Video To Song

Frack juice is ready now. Na, na, na, na, na, na, na. I want a glass of dino juice. What does dino juice mean?

Quack, quack, quack, quack. But only four little ducks.

Keep our Jake, she was a bad bad member list calling it Kris now, baby I' a Rick crash out my face.

I should just walk away, but I can't move my feet more than I know you.

But in my darkest hour, I know that you are there.

Everything I'll do this way man, anybody good I swear I'm, I'm chilling, I'm good.

Text To Speech

Prompt: In this audio, you can hear a young woman around 30 years old speaking English without any regional accent.
Transcription: This mixture was formulated to wash two pillows.

Prompt: In the audio scene, there is one individuals, likely male, speaking in English without any apparent mood or physical indicators.
Transcription: In front of the chamber is our calibration equipment. Behind is the tiers instrument.

Prompt: A female speaker, aged between 16 and 25, is speaking in English.
Transcription: There will be another video detailing these mods, but in a nutshell, these are pretty convenient.

Prompt: An adult male is speaking, and the language used is English.
Transcription: Loan sharks, cutthroat interest rates and unmanageable debt are a huge problem all over the country.

Text To Music

Prompt: A lively jazz piece with a saxophone solo over an upbeat piano melody, electric bass guitar, and syncopated acoustic drum beat. Suitable for swing dancing or a classy coffee shop atmosphere.
Lyrics: ""

Prompt: A hip-hop beat perfect for a summer setting, featuring drums, bass, piano, synths, and occasional string accents, creating a laid-back yet rhythmic vibe.
Lyrics: It's all slow jams in the back of the T skyline in her eyes and a start to wonder that this time is my time to be shiny.

Prompt: A lively pop-rock track with drums, bass, electric guitar, and acoustic guitar. The genre is rock, including elements of alternative and pop. The mood is cheerful. This piece could be used as background music for a shopping mal.
Lyrics: Go a leave behind one drinking at not to again.

Prompt: A lively pop-rock track with drums, bass, electric guitar, and acoustic guitar.
Lyrics: My friends said you want bad man, I should't listen to them back.

Text To Sound Effects

Prompt:

You can hear the sound of eating fruit in the audio, with sounds of chewing and occasional movement.

Prompt:

In the audio, the tractor's engine rumbles loudly.

Prompt:

In the audio, a group of birds can be heard chirping.

Prompt:

A woman walking with the sound of her footsteps on a hard surface.