SoundReactor: Frame-level Online Video-to-Audio Generation
Abstract
Prevailing Video-to-Audio (V2A) generation models operate offline, assuming an entire video sequence or chunks of frames are available beforehand. This critically limits their use in interactive applications such as live content creation and emerging generative world models. To address this gap, we introduce the novel task of frame-level online V2A generation, where a model autoregressively generates audio from video without access to future video frames. Furthermore, we propose SoundReactor, which, to the best of our knowledge, is the first simple yet effective framework explicitly tailored for this task. Our design enforces end-to-end causality and targets low per-frame latency with audio-visual synchronization. Our model's backbone is a decoder-only causal transformer over continuous audio latents. For vision conditioning, it leverages grid (patch) features extracted from the smallest variant of the DINOv2 vision encoder, which are aggregated into a single token per frame to maintain end-to-end causality and efficiency. The model is trained via diffusion pre-training followed by consistency fine-tuning to accelerate diffusion-head decoding. On a benchmark of diverse gameplay videos from AAA titles, our model successfully generates semantically and temporally aligned, high-quality full-band stereo audio, validated by both objective and human evaluations. Moreover, our model achieves low per-frame waveform-level latency (26.3 ms with head NFE=1, 31.5 ms with NFE=4) on 30 FPS, 480p videos using a single H100.
Our Scope and Framework

SoundReactor targets frame-level online V2A generation: at each step, the model autoregressively generates the audio for the current video frame using only the current and past frames. This contrasts with conventional offline V2A, where an entire sequence or chunks of frames are available in advance.

Framework overview: (a) video token modeling, (b) audio token modeling, (c) multimodal AR transformer with a diffusion/ECT head.
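To make the interleaving concrete, the following PyTorch sketch (not the authors' code) shows a backbone in this style: DINOv2 patch (grid) features for each frame, 384-dimensional for the ViT-S/14 variant, are pooled into a single vision token, interleaved with continuous audio-latent tokens, and processed by a decoder-only causal transformer whose per-frame hidden states condition the diffusion/ECT head. The pooling scheme (learned attention pooling), the dimensions, and all module names are illustrative assumptions.

```python
import torch
import torch.nn as nn


class FramePooler(nn.Module):
    """Aggregate DINOv2 patch (grid) features of one frame into a single token."""

    def __init__(self, feat_dim: int, model_dim: int):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, feat_dim))
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.proj = nn.Linear(feat_dim, model_dim)

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (B*T, P, feat_dim) -> pooled token: (B*T, model_dim)
        q = self.query.expand(patches.size(0), -1, -1)
        pooled, _ = self.attn(q, patches, patches)
        return self.proj(pooled.squeeze(1))


class CausalAVBackbone(nn.Module):
    """Decoder-only transformer over interleaved [vision, audio] tokens."""

    def __init__(self, model_dim: int = 512, audio_latent_dim: int = 64,
                 vision_feat_dim: int = 384, num_layers: int = 8):
        super().__init__()
        self.pool = FramePooler(vision_feat_dim, model_dim)
        self.audio_in = nn.Linear(audio_latent_dim, model_dim)
        layer = nn.TransformerEncoderLayer(model_dim, nhead=8,
                                           batch_first=True, norm_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers)

    def forward(self, patch_feats: torch.Tensor, audio_latents: torch.Tensor):
        # patch_feats:   (B, T, P, vision_feat_dim) DINOv2 grid features per frame
        # audio_latents: (B, T, audio_latent_dim)   continuous audio latents
        B, T, P, D = patch_feats.shape
        v = self.pool(patch_feats.reshape(B * T, P, D)).reshape(B, T, -1)
        a = self.audio_in(audio_latents)
        # Interleave so frame t's vision token precedes its audio token:
        # v_0, a_0, v_1, a_1, ...
        x = torch.stack([v, a], dim=2).reshape(B, 2 * T, -1)
        mask = nn.Transformer.generate_square_subsequent_mask(2 * T).to(x.device)
        h = self.transformer(x, mask=mask)
        # Hidden state at each vision position conditions the head that predicts
        # that frame's audio latent (it cannot attend to it or to any future token).
        return h[:, 0::2, :]
```

Pooling the grid features into one token per frame keeps the interleaved sequence at two tokens per video frame in this sketch, which is consistent with the paper's goal of maintaining efficiency while staying fully causal.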
Demo Samples on OGameData[1]
All models are trained on the OGameData dataset.
V-AURA[2]†, the baseline, is the state-of-the-art offline AR V2A model.
To the best of our knowledge, our models are the first frame-level online V2A models.
Among our models, Ours-ECT (NFE=4) performs best.
† Although V-AURA is the most analogous model to our setup, it is not directly applicable to the frame-level online V2A task (see the manuscript for details).
Training, Sampling, and Latency
The backbone and head are first trained with a diffusion objective; the head is then fine-tuned with a consistency objective (the ECT head) so that each audio latent can be decoded in only a few function evaluations (NFE=1 or 4). At inference time, generation proceeds frame by frame: the model consumes the latest video frame, samples the next continuous audio latent with the few-step head, and decodes it to waveform, reaching per-frame waveform-level latency of 26.3 ms (NFE=1) and 31.5 ms (NFE=4) on 30 FPS, 480p videos with a single H100.
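Below is a hedged sketch of what such a frame-level online sampling loop could look like under the assumptions above; `encode_frame`, `head_sample`, and `decode_chunk` are hypothetical stand-ins for the DINOv2 encoder, the few-step diffusion/ECT head, and the audio latent decoder, and the loop naively recomputes the full sequence each step rather than caching keys/values as a real low-latency implementation would.

```python
import time

import torch


@torch.no_grad()
def stream_v2a(frames, encode_frame, backbone, head_sample, decode_chunk,
               nfe: int = 4, audio_latent_dim: int = 64):
    """Frame-level online V2A: sample one audio latent per incoming video frame,
    conditioning only on current/past frames and past latents (end-to-end causal).

    Assumed interfaces (hypothetical helpers, not the released API):
      encode_frame(frame)        -> (P, D) DINOv2 grid features of one frame
      backbone(patches, latents) -> (1, T, model_dim) per-frame conditioning
      head_sample(cond, nfe)     -> (audio_latent_dim,) few-step head sample
      decode_chunk(latent)       -> waveform samples covering one frame duration
    """
    patch_feats, latents, audio_out, per_frame_ms = [], [], [], []
    for frame in frames:
        tic = time.perf_counter()
        patch_feats.append(encode_frame(frame))
        pf = torch.stack(patch_feats).unsqueeze(0)                # (1, T, P, D)
        # Zero placeholder for this frame's not-yet-generated latent; the causal
        # mask keeps it from influencing the current prediction.
        al = torch.stack(latents + [torch.zeros(audio_latent_dim)]).unsqueeze(0)
        cond = backbone(pf, al)[:, -1].squeeze(0)                 # latest frame's state
        new_latent = head_sample(cond, nfe=nfe)                   # NFE = 1 or 4
        latents.append(new_latent)
        audio_out.append(decode_chunk(new_latent))
        per_frame_ms.append((time.perf_counter() - tic) * 1e3)
    return torch.cat(audio_out, dim=-1), per_frame_ms
```

The per-frame wall-clock time measured in a loop of this shape is what the reported waveform-level latency refers to (26.3 ms at NFE=1, 31.5 ms at NFE=4 per frame on a single H100 for 30 FPS, 480p video).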
BibTeX
@article{Saito2025SoundReactor,
  title={SoundReactor: Frame-level Online Video-to-Audio Generation},
  author={Koichi Saito and Julian Tanke and Christian Simon and Masato Ishii and Kazuki Shimada and Zachary Novack and Zhi Zhong and Akio Hayakawa and Takashi Shibuya and Yuki Mitsufuji},
  year={2025},
  eprint={2510.02110},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  journal={arXiv preprint arXiv:2510.02110},
  url={https://arxiv.org/abs/2510.02110},
}
References
- [1] H. Che, X. He, Q. Liu, C. Jin and H. Chen, "GameGen-X: Interactive Open-world Game Video Generation," ICLR 2025.
- [2] I. Viertola, V. Iashin and E. Rahtu, "Temporally Aligned Audio for Video with Autoregression," ICASSP 2025.