Vocal dereverberation samples from GibbsDDRM and DiffDereverb
DiffDereverb w/ GibbsDDRM [1, ICML 2023], DiffDereverb, and DiffDereverb+ [2, ICASSP 2023] are our proposed methods.
The pre-trained diffusion models used in the proposed methods are trained only on dry vocals. Reverb conversion (RC) [3] and music enhancement (ME) [4] are trained on pairs of wet and dry vocals.
| Methods | Sample 1 w/ long reverb | Sample 2 w/ long reverb | Sample 1 w/ short reverb | Sample 2 w/ short reverb |
| --- | --- | --- | --- | --- |
| Wet (Unprocessed) | (audio) | (audio) | (audio) | (audio) |
| DiffDereverb w/ GibbsDDRM [1, ICML 2023] (Unsupervised) | (audio) | (audio) | (audio) | (audio) |
| DiffDereverb [2, ICASSP 2023] (Unsupervised) | (audio) | (audio) | (audio) | (audio) |
| DiffDereverb+ [2, ICASSP 2023] (Unsupervised) | (audio) | (audio) | (audio) | (audio) |
| RC [3] (Supervised) | (audio) | (audio) | (audio) | (audio) |
| ME [4] (Supervised) | (audio) | (audio) | (audio) | (audio) |
| WPE [5] (Unsupervised) | (audio) | (audio) | (audio) | (audio) |
Note that the ME audio samples are at 16 kHz since ME is trained on 16 kHz vocals; all other audio samples are at 44.1 kHz.
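For listeners who want to compare the ME outputs against the 44.1 kHz samples at a matched sample rate, the sketch below shows one way to upsample a 16 kHz clip with librosa and soundfile. This is only an illustration; the file names are hypothetical and the libraries are assumed to be installed.

```python
import librosa
import soundfile as sf

# Hypothetical file names: a 16 kHz ME output and a 44.1 kHz reference clip.
y_me, sr_me = librosa.load("me_sample1_long.wav", sr=None)    # keep native 16 kHz
y_ref, sr_ref = librosa.load("wet_sample1_long.wav", sr=None) # native 44.1 kHz

# Upsample the ME output so both clips share the 44.1 kHz rate
# before side-by-side listening or any objective comparison.
y_me_44k = librosa.resample(y_me, orig_sr=sr_me, target_sr=sr_ref)

sf.write("me_sample1_long_44k.wav", y_me_44k, sr_ref)
```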
References
[1] N. Murata, K. Saito, C.-H. Lai, Y. Takida, T. Uesaka, Y. Mitsufuji, and S. Ermon, “GibbsDDRM: A Partially Collapsed Gibbs Sampler for Solving Blind Inverse Problems with Denoising Diffusion Restoration,” in Proceedings of the 40th International Conference on Machine Learning, 2023.
[2] K. Saito, N. Murata, T. Uesaka, C.-H. Lai, Y. Takida, T. Fukui, and Y. Mitsufuji, “Unsupervised Vocal Dereverberation with Diffusion-Based Generative Models,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2023, pp. 1–5.
[3] J. Koo, S. Paik, and K. Lee, “Reverb conversion of mixed vocal tracks using an end-to-end convolutional deep neural network,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2021, pp. 81–85.
[4] N. Kandpal, O. Nieto, and Z. Jin, “Music enhancement via image translation and vocoding,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2022, pp. 3124–3128.
[5] T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, and B.-H. Juang, “Speech dereverberation based on variance-normalized delayed linear prediction,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, pp. 1717–1731, 2010.