Music is, in essence, a carrier of human emotional expression, and as research in music science and technology deepens, achieving more accurate music emotion recognition has become a focus of public attention. This paper constructs a music emotion recognition model based on a discrete emotion space (WLDNN_SAGAN). After the collected audio data of vocal performances is pre-processed, an attention mechanism is introduced to weight and fuse the extracted low-level and mid/high-level music emotion features, and the fused feature information is then input into the WLDNN_SAGAN network to classify music emotions. The experimental results show that the proposed model improves recognition accuracy across different emotions: compared with the baseline models, its accuracy reaches 60% or above on three different datasets. The emotional vein of Chinese folk song performance identified by the model runs from lightness toward sadness and sacredness, which is consistent with the historical facts of Chinese folk song creation. In conclusion, the emotional expression of vocal performance can be enhanced by understanding cultural connotations and by applying singing techniques and body language.
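The attention-weighted fusion of low-level and mid/high-level features mentioned above can be illustrated with a minimal sketch. The abstract does not specify the scoring function or dimensions, so the linear scorer, the parameters `w` and `b`, and the toy feature vectors below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def attention_fuse(feature_groups, w, b):
    """Fuse feature vectors via softmax attention weights.

    feature_groups: list of 1-D arrays of equal dimension d
                    (e.g. a low-level and a mid/high-level
                    emotion feature vector) -- hypothetical inputs.
    w, b: parameters of an assumed linear scorer s_i = w . f_i + b.
    Returns the softmax-weighted sum of the feature vectors.
    """
    F = np.stack(feature_groups)          # shape (k, d)
    scores = F @ w + b                    # one scalar score per group
    scores = scores - scores.max()        # subtract max for stability
    alpha = np.exp(scores) / np.exp(scores).sum()  # attention weights
    return alpha @ F                      # convex combination, shape (d,)

# Toy usage: fuse a "low-level" and a "mid/high-level" feature vector.
low = np.array([0.2, 0.8, 0.1, 0.5])
high = np.array([0.9, 0.3, 0.7, 0.4])
fused = attention_fuse([low, high], w=np.ones(4), b=0.0)
print(fused.shape)  # a single fused vector of the same dimension
```

Because the attention weights are non-negative and sum to one, the fused vector is a convex combination of the input feature vectors; the subsequent classifier then operates on this single fused representation.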