
Vocal dereverberation samples from GibbsDDRM and DiffDereverb

Our proposed methods are DiffDereverb w/ GibbsDDRM [1, ICML 2023], DiffDereverb, and DiffDereverb+ [2, ICASSP 2023].

The pre-trained diffusion models used in the proposed methods are trained only on dry vocals. Reverb conversion (RC) [3] and music enhancement (ME) [4] are trained on pairs of wet and dry vocals.

Audio samples are provided for each method below, for Sample 1 and Sample 2, each with long and short reverb:

- Wet (Unprocessed)
- DiffDereverb w/ GibbsDDRM [1, ICASSP 2023] (Unsupervised)
- DiffDereverb [2, ICASSP 2023] (Unsupervised)
- DiffDereverb+ [2, ICASSP 2023] (Unsupervised)
- RC [3] (Supervised)
- ME [4] (Supervised)
- WPE [5] (Unsupervised)


Note that the audio samples of ME are at 16 kHz, since ME is trained on 16 kHz vocals. All other audio samples are at 44.1 kHz.

References

[1] N. Murata, K. Saito, C.-H. Lai, Y. Takida, T. Uesaka, Y. Mitsufuji, and S. Ermon, “GibbsDDRM: A Partially Collapsed Gibbs Sampler for Solving Blind Inverse Problems with Denoising Diffusion Restoration Models,” in Proceedings of the 40th International Conference on Machine Learning, 2023.

[2] K. Saito, N. Murata, T. Uesaka, C.-H. Lai, Y. Takida, T. Fukui, and Y. Mitsufuji, “Unsupervised Vocal Dereverberation with Diffusion-Based Generative Models,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2023, pp. 1–5.

[3] J. Koo, S. Paik, and K. Lee, “Reverb Conversion of Mixed Vocal Tracks Using an End-to-End Convolutional Deep Neural Network,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2021, pp. 81–85.

[4] N. Kandpal, O. Nieto, and Z. Jin, “Music Enhancement via Image Translation and Vocoding,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2022, pp. 3124–3128.

[5] T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, and B.-H. Juang, “Speech Dereverberation Based on Variance-Normalized Delayed Linear Prediction,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 7, pp. 1717–1731, 2010.