Abstract: To address the low efficiency and accuracy of traditional oral English scoring, a multimodal attention fusion network architecture is proposed that improves both the training efficiency of the model and the accuracy of oral English scoring. Network robustness is improved by jointly considering the prosodic and acoustic features of the spoken pronunciation and the text of the answer to the question. In simulation, the proposed model is compared with LSTM, BiLSTM and GRU network models; its score estimation accuracy is 96.8%, which is significantly higher than that of the other methods. The simulation results also show that the proposed method significantly reduces scoring time and improves scoring efficiency.
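The abstract does not give the layer-level design, so the following is only a minimal sketch of what attention-based fusion of two modalities could look like, not the proposed network itself. The feature sizes (40 acoustic/prosodic dimensions, 128 text dimensions), the shared dimension `d`, the random weights, and the names `attention_fusion`, `w_audio`, `w_text`, `v` and `w_out` are all assumptions made for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def attention_fusion(audio_feat, text_feat, w_audio, w_text, v):
    """Fuse one acoustic vector and one text vector with learned attention.

    Hypothetical sketch: each modality is projected into a shared space,
    scored against a context vector v, and combined by a weighted sum.
    """
    h_audio = np.tanh(audio_feat @ w_audio)   # shape (d,)
    h_text = np.tanh(text_feat @ w_text)      # shape (d,)
    hidden = np.stack([h_audio, h_text])      # shape (2, d)
    weights = softmax(hidden @ v)             # one weight per modality
    fused = weights @ hidden                  # shape (d,)
    return fused, weights


# Assumed dimensions and random parameters, for illustration only.
rng = np.random.default_rng(0)
d = 16
audio_feat = rng.normal(size=40)    # stand-in for prosodic/acoustic features
text_feat = rng.normal(size=128)    # stand-in for an answer-text embedding
w_audio = rng.normal(size=(40, d)) * 0.1
w_text = rng.normal(size=(128, d)) * 0.1
v = rng.normal(size=d)
w_out = rng.normal(size=d)

fused, weights = attention_fusion(audio_feat, text_feat, w_audio, w_text, v)
score = float(fused @ w_out)        # untrained linear scoring head
print("modality weights:", weights)
print("raw score:", score)
```

In a trained model the projection matrices, the context vector and the scoring head would be learned from scored responses; here they are random, so the printed score carries no meaning beyond showing the data flow.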