Music conductors rely on the visual impact of gesture and emotion to interpret and express musical works. In this paper, we build a conductor recognition model on a spatio-temporal two-stream convolutional neural network, replacing the original VGG-16 backbone with the deeper ResNet-34 to support the improvement of conducting skill. Dropout optimization is applied in the fully connected layer to reduce overfitting, and the network is redesigned to fuse the temporal and spatial feature maps at an earlier stage, addressing two weaknesses of the standard two-stream architecture: its shallow structure and the fact that the temporal and spatial streams do not learn the correlation between temporal and spatial information. After construction, the model is applied in the teaching of a music college. Compared with other existing methods, the proposed spatio-temporal information fusion convolutional neural network learns more effectively and achieves better emotion and action recognition, reaching the highest accuracy of 74.3% on the CoST dataset. Conducting students in the experimental class outperform the control class on every dimension of music perception ability, with the pitch and intensity dimensions leading by more than 20%, which shows that the proposed model better promotes the development of conducting students' music perception.
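To make the described design concrete, the following is a minimal sketch, assuming a PyTorch implementation: two ResNet-34 backbones serve as the spatial (RGB) and temporal (optical-flow) streams, their feature maps are concatenated before the classifier rather than fused at the score level, and Dropout is applied in the fully connected layer. The class name, the flow-stack length, the exact fusion point, and all hyperparameters are illustrative assumptions, not the authors' configuration.

```python
# Minimal sketch of a two-stream ResNet-34 network with feature-map fusion.
# All sizes and names below are illustrative assumptions, not the paper's exact setup.
import torch
import torch.nn as nn
from torchvision.models import resnet34


def resnet34_backbone(in_channels: int) -> nn.Sequential:
    """ResNet-34 trunk up to the last conv stage (512-channel feature maps)."""
    net = resnet34(weights=None)
    if in_channels != 3:
        # Adapt the stem for stacked optical-flow inputs (2 channels per frame).
        net.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7,
                              stride=2, padding=3, bias=False)
    return nn.Sequential(*list(net.children())[:-2])  # drop avgpool and fc


class TwoStreamFusionNet(nn.Module):
    """Two-stream network that fuses spatial and temporal feature maps
    before classification, with Dropout in the fully connected layer."""

    def __init__(self, num_classes: int = 10, flow_stack: int = 10,
                 dropout: float = 0.5):
        super().__init__()
        self.spatial = resnet34_backbone(in_channels=3)                 # RGB frame
        self.temporal = resnet34_backbone(in_channels=2 * flow_stack)   # flow stack
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Sequential(
            nn.Dropout(dropout),              # reduces overfitting in the FC layer
            nn.Linear(512 * 2, num_classes),  # fused (concatenated) features
        )

    def forward(self, rgb: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
        f_s = self.spatial(rgb)            # (N, 512, H/32, W/32)
        f_t = self.temporal(flow)          # (N, 512, H/32, W/32)
        fused = torch.cat([f_s, f_t], 1)   # feature-map-level fusion of the streams
        return self.classifier(self.pool(fused).flatten(1))


if __name__ == "__main__":
    model = TwoStreamFusionNet(num_classes=10, flow_stack=10)
    rgb = torch.randn(2, 3, 224, 224)      # batch of RGB frames
    flow = torch.randn(2, 20, 224, 224)    # stacked optical-flow fields
    print(model(rgb, flow).shape)          # torch.Size([2, 10])
```

Concatenating the 512-channel feature maps of both streams before the pooled, Dropout-regularized classifier is one plausible reading of "fusing the temporal and spatial networks at the feature-map level"; score-level (late) fusion of two separate softmax outputs would be the alternative it replaces.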