A Computational Approach to the Classification of Chinese Syntactic Structures Using a Large-Scale Corpus

Song Wang¹
¹School of Chinese Language and Literature, Xinyang College, Xinyang, Henan, 464000, China

Abstract

Syntactic analysis is a fundamental task in natural language processing that examines the syntactic structures of sentences and the relations among them. This paper first outlines the basic approach to syntactic analysis and, starting from large-scale corpus construction, explores a computational method for classifying Chinese syntactic structures. A grid-based model for constructing and distributing a large-scale corpus is then built. BERT is used as the pre-trained language model; the semantic features it captures are fed into a Bi-LSTM model to extract bidirectional contextual sequence information, and a Conditional Random Field (CRF) layer produces the final Chinese syntactic structure classification. After manual proofreading and confidence-level calculation, the average accuracy of syntactic structure classification on the final Chinese canonical corpus rises from 94.21% to 99.06%, an improvement of 4.85 percentage points. The BERT-Bi-LSTM-CRF1 and BERT-Bi-LSTM-CRF2 models, with the “complement structure” and “object structure”, achieve higher classification accuracy than the BERT model, the Bi-LSTM-CRF model, and the BERT-Bi-LSTM-CRF3 model with all syntactic structures. Meanwhile, annotation that uses the BERT-Bi-LSTM-CRF model combined with manual annotation differs in accuracy from fully manual annotation by only 0.66%, while the average time spent is reduced by 37.04%. This lowers the annotators’ workload and improves annotation efficiency, supporting the validity and practicality of the proposed model for the automatic classification of Chinese syntactic structures.

Keywords: large-scale corpus, BERT, Bi-LSTM, CRF, syntactic structure classification
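
The final CRF step of the pipeline described in the abstract can be illustrated with a minimal sketch of Viterbi decoding, the standard way a linear-chain CRF picks the best label sequence. This is not the paper's implementation: the function name, the label set, and the hand-written emission and transition scores are all hypothetical, and the emission scores stand in for what the BERT and Bi-LSTM layers would produce for each token.

```python
from typing import List, Sequence


def viterbi_decode(
    emissions: Sequence[Sequence[float]],
    transitions: Sequence[Sequence[float]],
) -> List[int]:
    """Return the highest-scoring label index sequence for one sentence.

    emissions[t][j]   : score of label j at token t (assumed to come from
                        an upstream encoder such as BERT + Bi-LSTM).
    transitions[i][j] : score of moving from label i to label j.
    """
    n_labels = len(transitions)
    # score[j] is the best score of any path ending in label j so far.
    score = list(emissions[0])
    backpointers: List[List[int]] = []

    for emit in emissions[1:]:
        new_score = []
        pointers = []
        for j in range(n_labels):
            # Best previous label for reaching label j at this token.
            best_i = max(range(n_labels), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)

    # Trace back from the best final label to recover the full path.
    best_last = max(range(n_labels), key=lambda j: score[j])
    path = [best_last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical labels and scores, for illustration only.
LABELS = ["SUBJ", "PRED", "OBJ"]
emissions = [[2.0, 0.5, 0.1], [0.3, 1.5, 1.4], [0.2, 0.4, 1.0]]
transitions = [[-1.0, 1.0, 0.0], [0.0, -1.0, 1.0], [0.5, 0.0, -0.5]]

decoded = [LABELS[k] for k in viterbi_decode(emissions, transitions)]
print(decoded)
```

In this toy input the second token scores almost equally for PRED and OBJ, and the transition scores decide between them; modeling such label-to-label dependencies is the usual reason for placing a CRF layer after a Bi-LSTM encoder.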