Artificial Intelligence-driven Speech Signal Extraction and Generation Based on Wave RNN Models

Yanze Wang1, Xuming Han1
1School of Information Science and Technology, Jinan University, Guangzhou, Guangdong, 510632, China

Abstract

In the era of artificial intelligence, speech conversion technology has developed rapidly and has become a hot topic in speech processing research. This paper explores speech signal extraction and generation based on the Wave RNN model and constructs an artificial intelligence-driven speech conversion and generation model. First, the short-time Fourier transform is used to convert the speech signal into the time-frequency domain and preprocess it. Second, a stepwise speech enhancement model is proposed to enhance the perceived strength of the speech signal. Then, a speech generation model based on an improved self-attention mechanism and an RNN is designed to generate speech signals. Finally, the model is evaluated in application. Time-frequency domain features, which mix time-domain and frequency-domain features, capture the characteristics of speech signals more comprehensively than time-domain or frequency-domain features alone, yielding higher recognition accuracy and lower training loss. Meanwhile, after speech enhancement, the average speech recognition accuracy of models A~D improves by 19%~25%, indicating that the stepwise speech enhancement model used in this paper can substantially enhance the perceived strength of speech signals. In addition, the speech conversion model in this paper outperforms other speech conversion models in both MCD and RMSE; its advantage in prosody mapping is clear, and the pitch of the output speech is more accurate and natural. The model has high practical value in speech signal generation and conversion.

Keywords: Wave RNN; short-time Fourier transform; stepwise speech enhancement; speech conversion
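As a minimal illustration of the short-time Fourier transform preprocessing step mentioned in the abstract, the sketch below computes an STFT with NumPy. The window length, hop size, and sample rate here are illustrative assumptions, not the settings used in this paper.

```python
import numpy as np

def stft(signal, frame_len=512, hop=128):
    """Short-time Fourier transform: slide a Hann window over the
    signal and take the real FFT of each frame.

    Returns an array of shape (n_frames, frame_len // 2 + 1),
    i.e. the complex time-frequency representation of the signal.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

# One second of a 440 Hz tone sampled at 16 kHz (illustrative values).
sr = 16000
t = np.arange(sr) / sr
spec = stft(np.sin(2 * np.pi * 440 * t))

# With frame_len = 512, the bin spacing is sr / 512 = 31.25 Hz,
# so the magnitude spectrum of each frame peaks near bin 440 / 31.25 ≈ 14.
peak_bin = int(np.abs(spec[0]).argmax())
```

The magnitude of `spec` gives the spectrogram on which time-frequency features can be computed; keeping the complex values preserves the phase needed to invert the transform during generation.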