
Audio samples

Reverb conversion (RC) [1] and music enhancement (ME) [2] are supervised methods trained with pairs of wet and dry vocals. Weighted prediction error (WPE) [3], Proposed, and Proposed+ are unsupervised methods; the pre-trained model used by the proposed methods is trained only on dry vocals.
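For context on the WPE baseline [3], its core idea is variance-normalized delayed linear prediction in the STFT domain: late reverberation is predicted from delayed past frames and subtracted. Below is a minimal single-channel numpy sketch for illustration only (it is not the authors' implementation, and the tap/delay/iteration values are arbitrary defaults):

```python
import numpy as np

def wpe_single_channel(Y, taps=10, delay=3, iterations=3, eps=1e-8):
    """Illustrative single-channel WPE dereverberation.

    Y: complex STFT of the wet signal, shape (n_freq, n_frames).
    Returns a dereverberated STFT of the same shape.
    """
    n_freq, n_frames = Y.shape
    X = Y.copy()
    for f in range(n_freq):
        y = Y[f]
        # Stack delayed frames: row k holds y shifted right by (delay + k).
        Ytil = np.zeros((taps, n_frames), dtype=complex)
        for k in range(taps):
            shift = delay + k
            Ytil[k, shift:] = y[:n_frames - shift]
        x = y.copy()
        for _ in range(iterations):
            # Time-varying variance estimate of the (current) dry signal.
            lam = np.maximum(np.abs(x) ** 2, eps)
            Yn = Ytil / lam                          # variance normalization
            R = Yn @ Ytil.conj().T                   # (taps, taps) correlation
            r = Yn @ y.conj()                        # (taps,) cross-correlation
            g = np.linalg.solve(R + eps * np.eye(taps), r)
            # Subtract the predicted late reverberation.
            x = y - g.conj() @ Ytil
        X[f] = x
    return X
```

In practice one would use an optimized implementation (e.g. a published WPE package) rather than this per-bin loop.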

[Table: embedded audio players for Sample 1 and Sample 2, each with short and long reverb, for six methods: Wet (unprocessed), RC [1] (supervised), ME [2] (supervised), WPE [3] (unsupervised), Proposed (unsupervised), and Proposed+ (unsupervised).]


Note that the ME audio samples are at 16 kHz, since ME is trained on 16 kHz vocals. All other audio samples are at 44.1 kHz.
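For side-by-side listening, the 16 kHz ME outputs can be brought to 44.1 kHz with polyphase resampling; a short sketch using scipy (this is an assumption for convenience, not part of the released demo):

```python
import numpy as np
from scipy.signal import resample_poly

def to_44k1(x_16k):
    """Upsample a 16 kHz signal to 44.1 kHz.

    44100 / 16000 reduces to 441 / 160, so a single
    polyphase up/down stage performs the conversion.
    """
    return resample_poly(x_16k, up=441, down=160)
```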

References

[1] J. Koo, S. Paik, and K. Lee, “Reverb conversion of mixed vocal tracks using an end-to-end convolutional deep neural network,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2021, pp. 81–85.

[2] N. Kandpal, O. Nieto, and Z. Jin, “Music enhancement via image translation and vocoding,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2022, pp. 3124–3128.

[3] T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, and B.-H. Juang, “Speech dereverberation based on variance-normalized delayed linear prediction,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, pp. 1717–1731, 2010.