Audio samples
Reverb conversion (RC) [1] and Music enhancement (ME) [2] are supervised methods trained on pairs of wet and dry vocals. Weighted prediction error (WPE) [3], Proposed, and Proposed+ are unsupervised methods; the pre-trained model used by the proposed methods is trained on dry vocals only.
| Methods | Sample 1 w/ short reverb | Sample 2 w/ short reverb | Sample 1 w/ long reverb | Sample 2 w/ long reverb |
|---|---|---|---|---|
| Wet (Unprocessed) | (audio) | (audio) | (audio) | (audio) |
| RC [1] (Supervised) | (audio) | (audio) | (audio) | (audio) |
| ME [2] (Supervised) | (audio) | (audio) | (audio) | (audio) |
| WPE [3] (Unsupervised) | (audio) | (audio) | (audio) | (audio) |
| Proposed (Unsupervised) | (audio) | (audio) | (audio) | (audio) |
| Proposed+ (Unsupervised) | (audio) | (audio) | (audio) | (audio) |
Note that the audio samples of ME are at 16 kHz, since ME is trained on 16 kHz vocals; all other audio samples are at 44.1 kHz.
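For reference, the WPE [3] baseline performs variance-normalized delayed linear prediction in each STFT frequency bin: late reverberation is predicted from delayed past frames and subtracted. Below is a minimal single-channel, single-bin numpy sketch of that idea, not the exact implementation used for these samples; the function name and the `taps`/`delay`/`iterations` parameters are illustrative defaults.

```python
import numpy as np

def wpe_one_bin(y, taps=10, delay=3, iterations=3, eps=1e-8):
    """Sketch of WPE-style dereverberation for one frequency bin.

    y : complex STFT frames, shape (T,), for a single channel and bin.
    Returns the dereverberated frames, shape (T,).
    """
    T = len(y)
    x = y.copy()
    for _ in range(iterations):
        # Variance estimate of the (current) dry signal, used as the
        # per-frame normalization weight.
        lam = np.maximum(np.abs(x) ** 2, eps)
        # Delayed frame matrix: row t holds [y[t-delay], ..., y[t-delay-taps+1]].
        Y = np.zeros((T, taps), dtype=complex)
        for k in range(taps):
            shift = delay + k
            Y[shift:, k] = y[: T - shift]
        # Weighted normal equations (each frame weighted by 1 / lam[t]).
        R = (Y.conj().T / lam) @ Y
        r = (Y.conj().T / lam) @ y
        g = np.linalg.solve(R + eps * np.eye(taps), r)
        # Subtract the predicted late reverberation.
        x = y - Y @ g
    return x
```

In the full method this runs independently per frequency bin (and jointly over microphones in the multichannel case), with the variance re-estimated from the dereverberated output on each iteration.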
References
[1] J. Koo, S. Paik, and K. Lee, “Reverb conversion of mixed vocal tracks using an end-to-end convolutional deep neural network,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2021, pp. 81–85.
[2] N. Kandpal, O. Nieto, and Z. Jin, “Music enhancement via image translation and vocoding,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2022, pp. 3124–3128.
[3] T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, and B.-H. Juang, “Speech dereverberation based on variance-normalized delayed linear prediction,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, pp. 1717–1731, 2010.