Abstract: To address the data complexity and low prediction accuracy of current Internet topic recognition, this paper proposes a sensitive word recognition model based on an improved weighted latent Dirichlet allocation (LDA) model. A corpus of sensitive words in a specific field is established. To improve the efficiency of identifying sensitive information topics, a coarse-grained text classification is applied to the corpus. A weighting model is then proposed that increases the distribution weight of words with low co-occurrence frequency but obvious sensitive characteristics, so that more words with low-frequency implicit relations can be found. The proposed model is verified on data crawled from mainstream news websites. The results show that it can identify and extract more detailed sensitive information topics from each text category, and the simulation results further verify its effectiveness and accuracy.