14 results for "Yunming Ye"
Search Results
2. Object-Extraction-Based Hidden Web Information Retrieval
- Author
-
Hui Song, Ling Zhang, Yunming Ye, and Fanyuan Ma
- Published
- 2002
3. The Author-Topic-Community Model: A Generative Model Relating Authors’ Interests and Their Community Structure
- Author
-
Yunming Ye, Xiaofeng Zhang, William W. L. Cheung, and Chunshan Li
- Subjects
Topic model, Generative model, Computer science, Robustness (computer science), User modeling, Inference, Graphical model, Citation, Data science, Synthetic data
- Abstract
In this paper, we introduce a generative model named the Author-Topic-Community (ATC) model, which can infer authors' interests and their community structure at the same time based on the contents and citation information of a document corpus. Via the mutual promotion between the author topics and the author community structure introduced in the ATC model, the robustness of the model in cases with sparse citation information is enhanced. Variational inference is adopted to estimate the model parameters of ATC. We performed evaluation using both synthetic data and a real dataset containing SIGKDD and SIGMOD papers published over ten years. By contrasting the performance of ATC with some state-of-the-art methods which model authors' interests and their community structure separately, our experimental results show that 1) the ATC model, with the inference of the authors' interests and the community structure integrated, can improve the accuracy of both author topic modeling and author community discovery; and 2) more in-depth analysis of the authors' influence can be readily supported.
- Published
- 2012
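The ATC abstract describes a generative model in which an author's community shapes their topics and the topics emit words. As a purely illustrative sketch (not the paper's actual model, whose parameters are estimated with variational inference), a toy forward simulation of such a generative process might look like this; the function name and the miniature input data are hypothetical:

```python
import random

def generate_corpus(author_community, community_topics, topic_words,
                    n_docs=5, doc_len=20, seed=0):
    """Toy ATC-style generative process: an author's community constrains
    which topics they write about, and topics emit words."""
    rng = random.Random(seed)
    authors = list(author_community)
    docs = []
    for _ in range(n_docs):
        author = rng.choice(authors)           # pick a document's author
        community = author_community[author]   # community drives topics
        words = [rng.choice(topic_words[rng.choice(community_topics[community])])
                 for _ in range(doc_len)]
        docs.append({"author": author, "words": words})
    return docs

# hypothetical miniature inputs
author_community = {"alice": "databases", "bob": "learning"}
community_topics = {"databases": ["storage"], "learning": ["models"]}
topic_words = {"storage": ["index", "query"], "models": ["train", "infer"]}
corpus = generate_corpus(author_community, community_topics, topic_words)
```

Inference in the real model runs in the opposite direction: given only documents and citations, it recovers the topic and community assignments jointly.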
4. Hybrid Random Forests: Advantages of Mixed Trees in Classifying Text Data
- Author
-
Graham J. Williams, Baoxun Xu, Yunming Ye, Mark Junjie Li, and Joshua Zhexue Huang
- Subjects
Incremental decision tree, Computer science, Decision tree learning, Decision tree, Machine learning, CHAID, Random forest, Alternating decision tree, Data mining, Artificial intelligence
- Abstract
Random forests are a popular classification method based on an ensemble of a single type of decision tree. In the literature, there are many different types of decision tree algorithms, including C4.5, CART and CHAID. Each type of decision tree algorithm may capture different information and structure. In this paper, we propose a novel random forest algorithm, called a hybrid random forest, which ensembles multiple types of decision trees into a random forest and exploits the diversity of the trees to enhance the resulting model. We conducted a series of experiments on six text classification datasets to compare our method with traditional random forest methods and some other text categorization methods. The results show that our method consistently outperforms these compared methods.
- Published
- 2012
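A minimal sketch of the hybrid-forest idea, assuming one-level trees (stumps) as stand-ins for the full C4.5/CART-style trees: trees trained with different split criteria are mixed in one bootstrap ensemble and vote. All names and the tiny dataset are illustrative, not from the paper:

```python
import math
import random
from collections import Counter

def impurity(labels, criterion):
    """Gini (CART-style) or entropy (C4.5-style) impurity of a label list."""
    n = len(labels)
    if n == 0:
        return 0.0
    probs = [c / n for c in Counter(labels).values()]
    if criterion == "gini":
        return 1.0 - sum(p * p for p in probs)
    return -sum(p * math.log2(p) for p in probs)

def fit_stump(X, y, criterion):
    """One-level tree: the (feature, threshold) split minimising the
    weighted impurity under the given criterion."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [y[i] for i, row in enumerate(X) if row[f] <= t]
            right = [y[i] for i, row in enumerate(X) if row[f] > t]
            score = (len(left) * impurity(left, criterion)
                     + len(right) * impurity(right, criterion)) / len(y)
            if best is None or score < best[0]:
                l_maj = Counter(left).most_common(1)[0][0] if left else y[0]
                r_maj = Counter(right).most_common(1)[0][0] if right else y[0]
                best = (score, f, t, l_maj, r_maj)
    _, f, t, l_maj, r_maj = best
    return lambda row: l_maj if row[f] <= t else r_maj

def hybrid_forest(X, y, n_trees=10, seed=0):
    """Bootstrap an ensemble that alternates between the two criteria,
    mimicking a forest of mixed tree types, then vote."""
    rng = random.Random(seed)
    trees = []
    for i in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]
        trees.append(fit_stump([X[j] for j in idx], [y[j] for j in idx],
                               ("gini", "entropy")[i % 2]))
    return lambda row: Counter(t(row) for t in trees).most_common(1)[0][0]

# toy one-feature dataset, linearly separable at 3.5
X = [[i] for i in range(8)]
y = ["a"] * 4 + ["b"] * 4
predict = hybrid_forest(X, y)
```

The diversity here comes only from the criterion and the bootstrap; the paper's forests additionally draw on full tree-growing algorithms.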
5. Exploiting Word Cluster Information for Unsupervised Feature Selection
- Author
-
Michael K. Ng, Hanjing Su, Yunming Ye, Joshua Huang, and Qingyao Wu
- Subjects
Computer science, Feature vector, Feature selection, Pattern recognition, Machine learning, Discriminative model, Feature (computer vision), Metric (mathematics), Benchmark (computing), Artificial intelligence, Cluster analysis, Word (computer architecture)
- Abstract
This paper presents an approach to integrating word clustering information into the process of unsupervised feature selection. In our scheme, the words in the whole feature space are clustered into groups based on the co-occurrence statistics of words. The resulting word clustering information and the bag-of-words information are combined to measure the goodness of each word, which is our basic metric for selecting discriminative features. By exploiting word cluster information, we extend three well-known unsupervised feature selection methods and propose three new methods. A series of experiments are performed on three benchmark text data sets (20 Newsgroups, Reuters-21578 and CLASSIC3). The experimental results show that the new unsupervised feature selection methods select more discriminative features and, in turn, improve clustering performance.
- Published
- 2010
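The scheme above has two recoverable ingredients: cluster words by co-occurrence, then score each word by combining its own statistics with its cluster's. A rough sketch under simplifying assumptions (greedy union-find clustering, document frequency as the "goodness" signal; the paper's actual metrics differ):

```python
from collections import defaultdict
from itertools import combinations

def cluster_words(docs, min_cooc=2):
    """Greedy co-occurrence clustering: words that co-occur in at least
    `min_cooc` documents are merged into the same cluster (union-find)."""
    cooc = defaultdict(int)
    vocab = sorted({w for d in docs for w in d})
    parent = {w: w for w in vocab}
    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w
    for d in docs:
        for a, b in combinations(sorted(set(d)), 2):
            cooc[(a, b)] += 1
    for (a, b), c in cooc.items():
        if c >= min_cooc:
            parent[find(a)] = find(b)
    clusters = defaultdict(set)
    for w in vocab:
        clusters[find(w)].add(w)
    return list(clusters.values())

def select_features(docs, k, min_cooc=2):
    """Score each word by its document frequency times the summed document
    frequency of its whole cluster, then keep the top-k words."""
    df = defaultdict(int)
    for d in docs:
        for w in set(d):
            df[w] += 1
    cluster_of = {w: cl for cl in cluster_words(docs, min_cooc) for w in cl}
    def score(w):
        return df[w] * sum(df[u] for u in cluster_of[w])
    return sorted(df, key=score, reverse=True)[:k]

# hypothetical toy corpus (each doc is a list of words)
docs = [["apple", "fruit"], ["apple", "fruit"],
        ["car", "road"], ["car", "road"], ["apple"]]
```

A word backed by a strong cluster outranks an equally frequent isolated word, which is the intended effect of mixing cluster and bag-of-words evidence.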
6. A Survey of Open Source Data Mining Systems
- Author
-
Yunming Ye, Xiaofei Xu, Xiaojun Chen, and Graham J. Williams
- Subjects
Source code, Database, Computer science, Data stream mining, Usability, Software, Documentation, Scalability, The Internet, Small and medium-sized enterprises
- Abstract
Open source data mining software represents a new trend in data mining research, education and industrial applications, especially in small and medium enterprises (SMEs). With open source software an enterprise can easily initiate a data mining project using the most current technology. Often the software is available at no cost, allowing the enterprise to instead focus on ensuring that its staff can freely learn the data mining techniques and methods. Open source ensures that staff can understand exactly how the algorithms work by examining the source code, if they so desire, and can also fine-tune the algorithms to suit the specific purposes of the enterprise. However, diversity, instability, scalability and poor documentation can be major concerns in using open source data mining systems. In this paper, we survey open source data mining systems currently available on the Internet. We compare 12 open source systems along several dimensions, such as general characteristics, data source accessibility, data mining functionality, and usability, and discuss the advantages and disadvantages of these systems.
- Published
- 2007
7. PAKDD 2007 Industrial Track Workshop
- Author
-
Yunming Ye and Joshua Zhexue Huang
- Subjects
Engineering, Government, Data cleansing, Acceptance rate, Data science, Presentation, Facility management, Knowledge extraction
- Abstract
The PAKDD 2007 Industrial Track Workshop was held on the 22nd of May 2007 in conjunction with the 11th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2007) at the Mandarin Garden Hotel in Nanjing, China. Following the PAKDD conference tradition of promoting industry applications of new data mining techniques, methodologies and systems, the theme of this workshop was defined as "Data Mining for Practitioners". Industrially oriented papers on innovative applications of data mining technology to solve real-world problems were solicited, with emphasis on the following areas:
- New data mining methodologies
- Data mining application systems
- Data cleansing and transformation in data mining
- New application problems and data mining solutions
- Data mining in government applications
- Innovative data mining case studies
The workshop received 82 submissions from 9 countries and regions in Asia, Europe and North America. The topics of these submissions covered new techniques, algorithms, systems, traditional applications such as banking and finance, and new applications in manufacturing, facility management, environment and government services. After a rigorous review by the committee members, 13 papers were selected for presentation at the workshop. The acceptance rate was about 16%; such a low acceptance rate was unprecedented for industrial track papers in the PAKDD conference series.
- Published
- 2007
8. MFCRank: A Web Ranking Algorithm Based on Correlation of Multiple Features
- Author
-
Yunming Ye, Xiaojun Chen, Yan Li, Joshua Huang, and Xiaofei Xu
- Subjects
Anchor text, Information retrieval, Computer science, Hyperlink, Ranking (information retrieval), Search engine, Ranking, PageRank, Ranking SVM, Web page, Data mining, Algorithm, Link analysis
- Abstract
This paper presents MFCRank, a new ranking algorithm for topic-specific Web search systems. The basic idea is to correlate two types of similarity information in a unified link analysis model, so that the rich content and link features in Web collections can be exploited efficiently to improve ranking performance. First, a new surfer model, JBC, is proposed, under which the topic similarity information among neighborhood pages is used to weight the surfer's jumping probabilities and to direct its surfing activities. Second, as the JBC surfer model is still query-independent, a correlation between the query and JBC is essential. This is implemented through the definition of the MFCRank score, a linear combination of the JBC score and the similarity value between the query and the matched pages. Through these two correlation steps, the features contained in the plain text, link structure, anchor text and user query can be smoothly correlated in a single ranking model. Ranking experiments were carried out on a set of topic-specific Web page collections. Experimental results show that our algorithm achieves a substantial improvement in ranking precision.
- Published
- 2006
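The two correlation steps above can be caricatured in code: a PageRank-style iteration whose link-following probabilities are weighted by a topic-similarity score, then a linear blend with query similarity. This is a loose sketch, not the paper's JBC model; all graph data and parameter values are made up:

```python
def topic_weighted_rank(links, topic_sim, damping=0.85, iters=50):
    """PageRank-style power iteration where the surfer follows a link with
    probability proportional to the target page's topic similarity."""
    pages = sorted(links)
    n = len(pages)
    score = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            if not outs:
                continue
            total = sum(topic_sim[q] for q in outs)
            for q in outs:
                new[q] += damping * score[p] * topic_sim[q] / total
        score = new
    return score

def mfc_score(links, topic_sim, query_sim, alpha=0.5):
    """Final score: a linear combination of the link-based score and the
    query-page similarity, mirroring the second correlation step."""
    jbc = topic_weighted_rank(links, topic_sim)
    return {p: alpha * jbc[p] + (1 - alpha) * query_sim[p] for p in links}

# toy two-page graph with uniform topic similarity
links = {"a": ["b"], "b": ["a"]}
sim = {"a": 1.0, "b": 1.0}
rank = topic_weighted_rank(links, sim)
```

With uniform topic similarity the iteration reduces to ordinary PageRank; skewing `topic_sim` pulls probability mass toward on-topic pages.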
9. Neighborhood Density Method for Selecting Initial Cluster Centers in K-Means Clustering
- Author
-
Xiaofei Xu, Shuigeng Zhou, Yunming Ye, Joshua Zhexue Huang, Graham J. Williams, and Xiaojun Chen
- Subjects
DBSCAN, Clustering high-dimensional data, Fuzzy clustering, Computer science, Correlation clustering, Single-linkage clustering, k-means clustering, Data stream clustering, Search algorithm, CURE data clustering algorithm, Canopy clustering algorithm, FLAME clustering, Data mining, Cluster analysis, k-medians clustering
- Abstract
This paper presents a new method for effectively selecting initial cluster centers in k-means clustering. This method identifies the high density neighborhoods from the data first and then selects the central points of the neighborhoods as initial centers. The recently published Neighborhood-Based Clustering (NBC) algorithm is used to search for high density neighborhoods. The new clustering algorithm NK-means integrates NBC into the k-means clustering process to improve the performance of the k-means algorithm while preserving the k-means efficiency. NBC is enhanced with a new cell-based neighborhood search method to accelerate the search for initial cluster centers. A merging method is employed to filter out insignificant initial centers to avoid too many clusters being generated. Experimental results on synthetic data sets have shown significant improvements in clustering accuracy in comparison with the random k-means and the refinement k-means algorithms.
- Published
- 2006
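A compact sketch of the idea of seeding k-means from high-density neighborhoods. The paper uses the NBC algorithm with a cell-based neighborhood search; here a simple radius-count density with a greedy separation rule stands in for it, so treat this as an assumption-laden illustration:

```python
import math

def dense_centers(points, k, radius):
    """Pick initial centers greedily: highest neighbourhood density first,
    skipping candidates within `radius` of an already chosen center."""
    density = [sum(1 for q in points if math.dist(p, q) <= radius)
               for p in points]
    centers = []
    for i in sorted(range(len(points)), key=lambda i: -density[i]):
        if len(centers) == k:
            break
        if all(math.dist(points[i], c) > radius for c in centers):
            centers.append(points[i])
    return centers

def kmeans(points, centers, iters=10):
    """Plain Lloyd iterations started from the density-based centers."""
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            groups[j].append(p)
        centers = [tuple(sum(v) / len(g) for v in zip(*g)) if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers

# two well-separated toy blobs
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
init = dense_centers(points, k=2, radius=2.0)
final = sorted(kmeans(points, init))
```

Because the seeds already sit in distinct dense regions, Lloyd's iterations converge immediately, which is the motivation for density-based initialization over random starts.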
10. IglooG: A Distributed Web Crawler Based on Grid Service
- Author
-
Fei Liu, Yunming Ye, Minglu Li, Jia-di Yu, and Fanyuan Ma
- Subjects
Service (systems architecture), Database, Computer science, Overlay network, Focused crawler, Grid, Scalability, Information system, Web crawler, Latent semantic indexing
- Abstract
A Web crawler is a program used to download documents from Web sites. This paper presents the design of a distributed Web crawler on a grid platform, based on our previous work, Igloo. Each crawler is deployed as a grid service to improve the scalability of the system. Information services, also deployed as grid services and organized as a peer-to-peer overlay network, are in charge of distributing URLs to balance the loads of the crawlers. According to the ID of a crawler and the semantic vector of a crawled page, computed by Latent Semantic Indexing, a crawler can decide whether to transmit a URL to an information service or keep it for itself. We present an implementation of the distributed crawler based on Igloo and simulate a Grid environment to evaluate load balancing across the crawlers and crawl speed. Both theoretical analysis and experimental results show that our system is high-performance and reliable.
- Published
- 2005
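The core decision in the abstract, whether a crawler keeps a URL or transmits it to the node responsible for it, requires all nodes to agree on ownership. The paper does this with crawler IDs and LSI semantic vectors over a P2P overlay; as a loose stand-in, rendezvous (highest-random-weight) hashing gives the same agreement property. Node names and URLs below are hypothetical:

```python
import hashlib

def node_for(url, nodes):
    """Rendezvous hashing: every node deterministically scores the URL and
    the highest scorer owns it, so all crawlers agree without coordination."""
    def weight(node):
        return int(hashlib.sha256((node + "|" + url).encode()).hexdigest(), 16)
    return max(nodes, key=weight)

def dispatch(url, self_node, nodes):
    """A crawler keeps a discovered URL if it owns it, otherwise forwards
    it towards the owning node."""
    owner = node_for(url, nodes)
    return ("keep", url) if owner == self_node else ("forward", owner, url)

nodes = ["crawler-1", "crawler-2", "crawler-3"]
owner = node_for("http://example.com/page", nodes)
```

Unlike plain modulo hashing, adding or removing a crawler only reassigns the URLs owned by that node, which matters for a dynamic grid deployment.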
11. iSurfer: A Focused Web Crawler Based on Incremental Learning from Positive Samples
- Author
-
Yiming Lu, Fanyuan Ma, Joshua Zhexue Huang, Yunming Ye, and Matthew Chiu
- Subjects
Computer science, Sample (statistics), Focused crawler, Crawling, Machine learning, Web page, Incremental learning, Information system, The Internet, Artificial intelligence, Web crawler
- Abstract
This paper presents iSurfer, a focused Web crawling system for information retrieval from the Web. Unlike other focused crawlers, iSurfer uses an incremental method to learn a page classification model and a link prediction model. It employs an online sample detector to incrementally distill new samples from crawled Web pages for online updating of the learned model. Other focused crawling systems use classifiers that are built from initial positive and negative samples and cannot learn incrementally. The performance of such classifiers depends on the topical coverage of the initial samples; however, initial samples, particularly negative ones, with good coverage of the target topics are difficult to find. iSurfer's incremental learning strategy therefore has an advantage: it starts from a few positive samples and gains more integrated knowledge about the target topics over time. Our experiments on various topics demonstrate that the incremental learning method can improve the harvest rate with only a few initial samples.
- Published
- 2004
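The "online sample detector feeding back into the model" loop can be sketched with a toy incremental scorer. This is an illustrative stand-in, not iSurfer's actual classifier; all class and function names are invented:

```python
import math
from collections import defaultdict

class IncrementalClassifier:
    """Toy incremental topic scorer: word counts for the target topic and
    for background text, updatable one page at a time."""
    def __init__(self):
        self.pos, self.bg = defaultdict(int), defaultdict(int)
        self.pos_total = self.bg_total = 0

    def update(self, words, positive=True):
        target = self.pos if positive else self.bg
        for w in words:
            target[w] += 1
        if positive:
            self.pos_total += len(words)
        else:
            self.bg_total += len(words)

    def score(self, words):
        """Log-likelihood ratio with add-one smoothing."""
        v = len(set(self.pos) | set(self.bg)) or 1
        return sum(math.log((self.pos.get(w, 0) + 1) / (self.pos_total + v))
                   - math.log((self.bg.get(w, 0) + 1) / (self.bg_total + v))
                   for w in words)

def crawl_step(clf, page_words, threshold=0.0):
    """Online sample detector: a page scoring above the threshold is
    distilled as a new positive sample and fed back into the model."""
    if clf.score(page_words) > threshold:
        clf.update(page_words, positive=True)
        return True
    return False

# seed with a few positive samples only, as the abstract describes
clf = IncrementalClassifier()
clf.update(["machine", "learning", "model"], positive=True)
clf.update(["football", "goal", "match"], positive=False)
```

Each accepted page broadens the positive model, which is how the crawler's topical coverage grows beyond the initial seed samples.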
12. Improved Email Classification through Enriched Feature Space
- Author
-
Joshua Zhexue Huang, Yunming Ye, Fanyuan Ma, and Hongqiang Rong
- Subjects
Computer science, Feature vector, Pattern recognition, Machine learning, Semantics, Electronic mail, Support vector machine, Statistical classification, Naive Bayes classifier, Information system, Artificial intelligence
- Abstract
This paper presents a novel feature space enriching (FSE) technique to address the problem of sparse and noisy feature spaces in email classification. The FSE technique employs two semantic knowledge bases to enrich the original sparse feature space, resulting in semantically richer features from which classification algorithms can learn improved classifiers. Naive Bayes and support vector machines are selected as the classification algorithms. Experiments on an enterprise email dataset show that the FSE technique is effective in improving email classification performance.
- Published
- 2004
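The enriching step itself is simple to illustrate: expand a sparse bag of words with related terms drawn from a knowledge base. The paper uses two real semantic knowledge bases; the miniature dictionary below is made up purely for demonstration:

```python
# hypothetical miniature "semantic knowledge base"
SYNONYMS = {
    "meeting": ["appointment", "schedule"],
    "invoice": ["payment", "bill"],
}

def enrich(features, kb=SYNONYMS):
    """Feature space enriching: extend a sparse bag of words with related
    terms from the knowledge base, so short emails share more features
    with other emails of their class."""
    enriched = list(features)
    for w in features:
        enriched.extend(kb.get(w, []))
    return enriched
```

A downstream classifier (naive Bayes or an SVM, as in the paper) is then trained on the enriched bags instead of the original sparse ones.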
13. Enhanced Email Classification Based on Feature Space Enriching
- Author
-
Joshua Zhexue Huang, Hongqiang Rong, Yunming Ye, and Fanyuan Ma
- Subjects
Computer science, Feature vector, Pattern recognition, Machine learning, Semantics, Electronic mail, Support vector machine, Statistical classification, Naive Bayes classifier, Knowledge base, Domain knowledge, Artificial intelligence
- Abstract
Email classification is challenging due to its sparse and noisy feature space. To address this problem, a novel feature space enriching (FSE) technique based on two semantic knowledge bases is proposed in this paper. The basic idea of FSE is to select, from the semantic knowledge bases, related semantic features that increase the global information available to learning algorithms, and to use them to enrich the original sparse feature space. The resulting feature space provides semantically richer features from which classification algorithms can learn improved classifiers. Naive Bayes and support vector machines are selected as the classification algorithms. Experiments on a bilingual enterprise email dataset show that: (1) the FSE technique can improve email classification accuracy, especially for the sparse classes; (2) the SVM classifier benefits more from FSE than the naive Bayes classifier; and (3) with the support of domain knowledge, the FSE technique can be made more effective.
- Published
- 2004
14. A New Multivariate Decision Tree Construction Algorithm Based on Variable Precision Rough Set
- Author
-
Liang Zhang, Fanyuan Ma, Shui Yu, and Yunming Ye
- Subjects
Incremental decision tree, Multivariate statistics, Computer science, Generalization, Decision tree learning, Dominance-based rough set approach, ID3 algorithm, Decision tree, Machine learning, Equivalence relation, Rough set, Artificial intelligence, Algorithm
- Abstract
In this paper we extend previous research and present a novel approach to constructing multivariate decision trees that is, to some extent, fault tolerant, by employing a development of rough set theory (RST), namely the variable precision rough set (VPRS) model. Based on variable precision rough set theory, a new concept, the generalization of one equivalence relation with respect to another with precision β, is introduced and used for the construction of multivariate decision trees. Experimental results show that the approach is well suited to building multivariate decision trees from noisy data.
- Published
- 2003
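The VPRS ingredient the abstract relies on is majority inclusion: an equivalence class counts toward a concept's lower approximation when at least a fraction β of its elements belongs to the concept. A minimal sketch, with a made-up partition and decision class rather than data from the paper:

```python
def beta_lower_approximation(classes, target, beta):
    """VPRS beta-lower approximation: keep an equivalence class when at
    least a fraction beta of its elements lies in the target concept."""
    lower = set()
    for eq in classes:
        if len(eq & target) / len(eq) >= beta:
            lower |= eq
    return lower

# hypothetical partition induced by a condition attribute,
# and one decision class as the target concept
classes = [{1, 2, 3, 4}, {5, 6}, {7, 8, 9}]
target = {1, 2, 3, 7, 8, 9}
```

With β = 1 this reduces to the classical rough-set lower approximation; relaxing β admits classes that are only mostly consistent with the concept, which is the source of the tree's tolerance to noisy records.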