A Data-Driven Approach for Extracting Representative Information From Large Datasets With Mixed Attributes
- Source :
- IEEE Transactions on Engineering Management. 69:1806-1822
- Publication Year :
- 2022
- Publisher :
- Institute of Electrical and Electronics Engineers (IEEE), 2022.
Abstract
- The rapid growth of information technology and Internet applications has provided users with an explosion of information. Extracting representative information from this abundance is of great interest for mobile e-commerce applications and web search engines. However, the items extracted by several existing methods, such as top-k, are often quite similar to one another, which makes it difficult to meet users' demand for diversified information. To increase the diversity of representative information, this article proposes a data-driven approach that automatically identifies a subset of the original dataset covering more themes and content. The approach consists of two stages. First, a new unified similarity measure is proposed for handling datasets with categorical and numeric attributes. External knowledge and attribute interactions are injected into the similarity learning process to improve the accuracy of similarity estimation between data objects. Second, an enhanced density peaks clustering algorithm based on shared nearest neighbors is developed to automatically identify representative objects according to the previously estimated similarity. The enhanced density peaks algorithm takes the local structure of the entire data space into consideration, which makes the proposed approach relatively insensitive to variations in dataset density and dimensionality. Theoretical analysis shows that the time complexity of the proposed approach is $O(N \log N)$ in the best case. Extensive comparison experiments were conducted on artificial and real-world datasets. The experimental results demonstrate the effectiveness and robustness of the proposed approach.
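The two-stage idea in the abstract (estimate similarity over mixed attributes, then pick representatives with a density peaks procedure) can be loosely illustrated as follows. This is only a minimal sketch under stated assumptions, not the article's method: it substitutes a simple Gower-style distance for the learned similarity measure, uses the standard density peaks scores (local density and distance to the nearest denser object) rather than the shared-nearest-neighbor enhancement, and runs in quadratic time rather than $O(N \log N)$. The function names `mixed_distance` and `density_peaks_representatives`, the `cutoff` parameter, and the toy data are all hypothetical.

```python
def mixed_distance(a, b, numeric_ranges):
    """Gower-style distance (an assumption, not the article's measure):
    range-normalised absolute difference for numeric attributes,
    0/1 mismatch for categorical attributes, averaged over attributes."""
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        if i in numeric_ranges:
            span = numeric_ranges[i]
            total += abs(x - y) / span if span > 0 else 0.0
        else:
            total += 0.0 if x == y else 1.0
    return total / len(a)


def density_peaks_representatives(data, numeric_cols, k, cutoff):
    """Return the indices of k representatives chosen by the standard
    density peaks score rho * delta (plain O(N^2) version)."""
    n = len(data)
    ranges = {
        c: max(row[c] for row in data) - min(row[c] for row in data)
        for c in numeric_cols
    }
    dist = [
        [mixed_distance(data[i], data[j], ranges) for j in range(n)]
        for i in range(n)
    ]
    # rho: number of other objects closer than the cutoff distance.
    rho = [
        sum(1 for j in range(n) if j != i and dist[i][j] < cutoff)
        for i in range(n)
    ]
    # delta: distance to the nearest object of higher density
    # (density ties are broken in favour of the lower index).
    delta = []
    for i in range(n):
        higher = [
            dist[i][j] for j in range(n) if (rho[j], -j) > (rho[i], -i)
        ]
        delta.append(min(higher) if higher else max(dist[i]))
    # Objects that are both dense and far from denser objects rank first.
    ranked = sorted(range(n), key=lambda i: rho[i] * delta[i], reverse=True)
    return ranked[:k]


# Toy dataset with one numeric and one categorical attribute.
toy = [
    (1.0, "red"), (1.1, "red"), (0.9, "red"),
    (5.0, "blue"), (5.1, "blue"), (4.9, "blue"),
]
reps = density_peaks_representatives(toy, numeric_cols=[0], k=2, cutoff=0.1)
print([toy[i] for i in reps])
```

On this toy data the two selected objects come from different groups, which is the kind of diversity that a top-k ranking by a single score would not guarantee.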
- Subjects :
- Computer science
Strategy and Management
Similarity measure
Similarity (network science)
Content (measure theory)
Data mining
Electrical and Electronic Engineering
Cluster analysis
Categorical variable
Time complexity
Similarity learning
Curse of dimensionality
Details
- ISSN :
- 1558-0040 and 0018-9391
- Volume :
- 69
- Database :
- OpenAIRE
- Journal :
- IEEE Transactions on Engineering Management
- Accession number :
- edsair.doi...........afd198cda5a3a762b8d14cb42fd13b21
- Full Text :
- https://doi.org/10.1109/tem.2019.2934485