1. Data labeling through the centralities of co-reference networks improves the classification accuracy of scientific papers.
- Author
- Xie, Zheng, Lv, Yiqin, Song, Yiping, and Wang, Qi
- Subjects
CITATION networks, CLASSIFICATION, NEIGHBORHOODS
- Abstract
Labeled data are fed to classification models so that they can learn to classify unlabeled data. Large numbers of papers are linked by citations to a few influential papers, a set much smaller than the total, which, if labeled, would spread label information to most of the papers. We used the co-reference relationship between papers, because the references cited by the papers in a dataset are usually not all contained in that dataset. We formulated optimal paper labeling as the problem of picking a given fraction of nodes from a co-reference network so as to maximize the number of their neighbors, which is a submodular maximization problem with a cardinality constraint and is NP-hard for general networks. We solved it approximately by picking nodes according to their ranks under specific network centralities. We further proved that labeling papers by degree rank, the lowest-complexity centrality, gives a near-optimal solution under specific constraints on the maximum degree of the co-reference network and on the labeling proportion. Experimental results show that our method brings a significant improvement in classification accuracy.
• A model-agnostic optimization objective is proposed to improve paper classification accuracy.
• The optimization objective is maximizing a paper set's neighborhood in a co-reference network.
• Labeling high-centrality nodes is a labor-saving, approximate way to solve the maximization problem.
[ABSTRACT FROM AUTHOR]
- Published
- 2024
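The degree-rank labeling described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that two papers are linked in the co-reference network when they cite at least one common reference, and that the objective counts neighbors outside the chosen set; the abstract does not spell out either detail. The function names and the toy dataset are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_coreference_network(references):
    """Link two papers when they share at least one reference (assumed definition)."""
    cited_by = defaultdict(set)
    for paper, refs in references.items():
        for ref in refs:
            cited_by[ref].add(paper)
    neighbors = {paper: set() for paper in references}
    for papers in cited_by.values():
        for a, b in combinations(sorted(papers), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


def label_by_degree(neighbors, fraction):
    """Pick the top `fraction` of nodes by degree; ties are broken by node name."""
    k = max(1, int(fraction * len(neighbors)))
    ranked = sorted(neighbors, key=lambda n: (-len(neighbors[n]), n))
    return ranked[:k]


def neighborhood_size(neighbors, chosen):
    """Count nodes adjacent to the chosen set, excluding the chosen nodes (assumed objective)."""
    chosen = set(chosen)
    covered = set()
    for node in chosen:
        covered |= neighbors[node]
    return len(covered - chosen)


# Hypothetical toy dataset: paper -> set of references it cites.
toy_references = {
    "p1": {"r1", "r2"},
    "p2": {"r1"},
    "p3": {"r2"},
    "p4": {"r2", "r3"},
    "p5": {"r3"},
    "p6": {"r4"},
}

network = build_coreference_network(toy_references)
chosen = label_by_degree(network, fraction=0.2)
print(chosen, neighborhood_size(network, chosen))  # prints ['p1'] 3
```

Ranking by degree needs only one pass over the adjacency sets plus a sort, which is consistent with the abstract's description of degree as the lowest-complexity centrality. Other centralities could be swapped in by changing the sort key.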