Virtual reality (VR) technology is increasingly used in teaching. This study uses VR to create cross-cultural teaching contexts and constructs a speech recognition model. It also builds an ecological model of language learning based on VR, develops a cross-cultural context teaching/learning VR system on top of that model, and introduces the system into language learning. After testing the system’s speech recognition performance, the study examines in depth students’ learning interactions in the cross-cultural communication context that the system provides. The effectiveness of the system in language learning is then assessed by comparing pre- and post-test data from an experimental group and a control group. The system’s speech recognition efficiency and accuracy are 99.7% and 99.5%, respectively, indicating excellent recognition performance. In the cross-cultural communication classroom, teacher and student talk occur in comparable proportions, and the teacher’s indirect influence increases while the direct influence gradually decreases. The pre-test English proficiency of the experimental and control groups was similar. After the experiment, the experimental group’s mean scores on every dimension of English proficiency and on the total score were higher than those of the control group, and the difference in post-test English proficiency between the two groups was statistically significant (p < 0.05). The cross-cultural context teaching/learning VR system therefore has a significant positive effect on users’ language learning outcomes.