
Text classification with improved word embedding and adaptive segmentation.

Authors :
Sun, Guoying
Cheng, Yanan
Zhang, Zhaoxin
Tong, Xiaojun
Chai, Tingting
Source :
Expert Systems with Applications, Mar 2024, Part B, Vol. 238.
Publication Year :
2024

Abstract

Text classification first requires converting text into embedding vectors. Static word embedding models such as Word2vec do not account for a word's position or for the different roles a word plays in different documents, while dynamic word embedding models such as BERT consume a large amount of time. An improved word embedding model based on pre-trained Word2vec is therefore proposed, which achieves better classification accuracy and much lower classification time than BERT. First, the concept of Term Document Frequency (TDF) is introduced on the basis of TF-IDF, and the TF-IDF-TDF score of each word in each document is calculated. Then, positional encoding is added. Finally, to reduce the misleading effect of words with low importance, a filter is designed that sets low-importance embedding vectors to zero. Because the sequence length a deep learning model can handle is limited, and any text exceeding the model's Maximum Sequence Length (MSL) is directly truncated and discarded, an adaptive segmentation model is proposed that selects a segmentation strategy for each text according to its length and the MSL. To maintain the continuity of adjacent text after segmentation, an adjacent-segment-vector-attended co-attention network is designed. In addition, multi-channel convolution and a capsule network are designed to further extract deep hidden features. Extensive comparative experiments show that the proposed model achieves the best Accuracy and Micro-F1 on five long-text baseline datasets and six short-text baseline datasets. Moreover, when the MSL is not set too large compared with the document lengths in the dataset, the classification results of the proposed model are not affected by it.

• A pre-trained-Word2vec-based embedding model is proposed.
• An adaptive segmented text classification model is proposed.
• An adjacent-segment-vector-attended co-attention network is designed.
• Multi-channel convolution and a capsule network are used to extract deep hidden features.

[ABSTRACT FROM AUTHOR]
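
The abstract describes a pipeline of weighting pre-trained Word2vec vectors with a TF-IDF-TDF score, adding positional encoding, zeroing out low-importance vectors, and segmenting texts that exceed the MSL rather than truncating them. The paper's exact TDF definition, filter threshold, and segmentation strategy are not given in the abstract, so the sketch below is only an illustration under stated assumptions: the `tdf` approximation, the `importance_threshold`, the toy embedding lookup, and the 50%-overlap segmentation heuristic are all hypothetical choices, not the authors' formulas.

```python
import math
from typing import Dict, List

import numpy as np

# Stand-in for pre-trained Word2vec vectors; in practice these would be
# loaded from a pre-trained model rather than generated randomly.
EMB_DIM = 8
rng = np.random.default_rng(0)
word2vec: Dict[str, np.ndarray] = {}


def lookup(word: str) -> np.ndarray:
    """Return (and lazily create) a toy embedding vector for a word."""
    if word not in word2vec:
        word2vec[word] = rng.normal(size=EMB_DIM)
    return word2vec[word]


def tf_idf_tdf(word: str, doc: List[str], corpus: List[List[str]]) -> float:
    """Hypothetical TF-IDF-TDF weight.

    TF and IDF follow the standard definitions; TDF is approximated here as
    the word's in-document count relative to its corpus-wide count, which is
    an assumption -- the paper's TDF formula is not given in the abstract.
    """
    tf = doc.count(word) / len(doc)
    df = sum(1 for d in corpus if word in d)
    idf = math.log(len(corpus) / (1 + df)) + 1.0
    tdf = doc.count(word) / (1 + sum(d.count(word) for d in corpus))
    return tf * idf * tdf


def positional_encoding(pos: int, dim: int) -> np.ndarray:
    """Standard sinusoidal positional encoding (Transformer-style)."""
    pe = np.zeros(dim)
    for i in range(0, dim, 2):
        angle = pos / (10000 ** (i / dim))
        pe[i] = math.sin(angle)
        if i + 1 < dim:
            pe[i + 1] = math.cos(angle)
    return pe


def embed_document(doc: List[str], corpus: List[List[str]],
                   importance_threshold: float = 0.05) -> np.ndarray:
    """Weight word vectors by TF-IDF-TDF, add positional encoding, and zero
    out words whose importance falls below a (hypothetical) threshold."""
    vectors = []
    for pos, word in enumerate(doc):
        w = tf_idf_tdf(word, doc, corpus)
        vec = w * lookup(word) + positional_encoding(pos, EMB_DIM)
        if w < importance_threshold:          # low-importance filter
            vec = np.zeros(EMB_DIM)
        vectors.append(vec)
    return np.stack(vectors)


def adaptive_segments(doc: List[str], msl: int) -> List[List[str]]:
    """Split a document into segments of at most MSL tokens instead of
    truncating; the 50%-overlap stride is an illustrative choice only."""
    if len(doc) <= msl:
        return [doc]
    stride = max(1, msl // 2)
    return [doc[i:i + msl] for i in range(0, len(doc), stride)]


if __name__ == "__main__":
    corpus = [
        "the capsule network extracts deep hidden features".split(),
        "word embedding models encode the position of each word".split(),
    ]
    doc = corpus[1]
    print(embed_document(doc, corpus).shape)          # (len(doc), EMB_DIM)
    print([len(s) for s in adaptive_segments(doc, msl=4)])
```

In this toy run, a frequent function word such as "the" receives a low TF-IDF-TDF weight and is zeroed by the filter, while content words are kept, which is the behaviour the abstract attributes to the low-importance filter; the adjacent-segment co-attention network and the multi-channel convolution/capsule classifier described in the abstract are not reproduced here.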

Details

Language :
English
ISSN :
09574174
Volume :
238
Database :
Academic Search Index
Journal :
Expert Systems with Applications
Publication Type :
Academic Journal
Accession number :
173707445
Full Text :
https://doi.org/10.1016/j.eswa.2023.121852