With the rapid development of artificial intelligence, big data, machine learning, and related technologies, society is becoming increasingly intelligent, and human-machine interaction is increasingly common. In this paper, image processing operations are added to Kinect’s original acquisition of gong dance images, which reduces the influence of external light, background, and other factors and substantially improves the efficiency of human body capture. A spatio-temporal graph is then constructed from the continuous human posture key point data, describing the distribution of the key points across different dataset types. To address the problems of the traditional spatio-temporal graph convolutional network, a multi-dimensional attention mechanism is designed to guide the model to allocate weights reasonably along three dimensions: space, time, and channel. Experiments on the NTU-RGB+D, Kinect skeleton, and Taiji datasets show that the proposed AGCN-STC achieves better recognition performance on all three datasets, with recognition accuracy 0.9 percentage points higher than that of AM-GCN. Two actors are used as samples for visual measurement and quantitative analysis to compare the differences between their performance gestures. Finally, based on the results of the study, we propose a transmission path for the Guanzhong gong dance, which serves as a reference for its cultural transmission.