14 results for "Yunming Ye"
Search Results
2. Object-Extraction-Based Hidden Web Information Retrieval
- Author
-
Hui Song, Ling Zhang, Yunming Ye, and Fanyuan Ma
- Published
- 2002
3. The Author-Topic-Community Model: A Generative Model Relating Authors’ Interests and Their Community Structure
- Author
-
Yunming Ye, Xiaofeng Zhang, William W. L. Cheung, and Chunshan Li
- Subjects
Topic model, Generative model, Computer science, Robustness (computer science), User modeling, Inference, Graphical model, Citation, Data science, Synthetic data
- Abstract
In this paper, we introduce a generative model named the Author-Topic-Community (ATC) model, which can infer authors' interests and their community structure at the same time based on the contents and citation information of a document corpus. Via the mutual promotion between the author topics and the author community structure introduced in the ATC model, the robustness of the model in cases with sparse citation information is enhanced. Variational inference is adopted to estimate the model parameters of ATC. We performed evaluation using both synthetic data and a real dataset containing SIGKDD and SIGMOD papers published over ten years. By contrasting the performance of ATC with some state-of-the-art methods which model authors' interests and their community structure separately, our experimental results show that 1) the ATC model, with the inference of the authors' interests and the community structure integrated, can improve the accuracy of both author topic modeling and author community discovery; and 2) more in-depth analysis of the authors' influence can be readily supported.
- Published
- 2012
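The ATC abstract describes a generative model in which an author's community shapes their topics and the topics emit words. As a purely illustrative sketch (not the paper's actual model, whose parameters are estimated with variational inference), a toy forward simulation of such a generative process might look like this; the function name and the miniature input data are hypothetical:

```python
import random

def generate_corpus(author_community, community_topics, topic_words,
                    n_docs=5, doc_len=20, seed=0):
    """Toy ATC-style generative process: an author's community constrains
    which topics they write about, and topics emit words."""
    rng = random.Random(seed)
    authors = list(author_community)
    docs = []
    for _ in range(n_docs):
        author = rng.choice(authors)           # pick a document's author
        community = author_community[author]   # community drives topics
        words = [rng.choice(topic_words[rng.choice(community_topics[community])])
                 for _ in range(doc_len)]
        docs.append({"author": author, "words": words})
    return docs

# hypothetical miniature inputs
author_community = {"alice": "databases", "bob": "learning"}
community_topics = {"databases": ["storage"], "learning": ["models"]}
topic_words = {"storage": ["index", "query"], "models": ["train", "infer"]}
corpus = generate_corpus(author_community, community_topics, topic_words)
```

Inference in the real model runs in the opposite direction: given only documents and citations, it recovers the topic and community assignments jointly.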
4. Hybrid Random Forests: Advantages of Mixed Trees in Classifying Text Data
- Author
-
Graham J. Williams, Baoxun Xu, Yunming Ye, Mark Junjie Li, and Joshua Zhexue Huang
- Subjects
Incremental decision tree, Computer science, Decision tree learning, Decision tree, Machine learning, CHAID, Random forest, Alternating decision tree, Data mining, Artificial intelligence
- Abstract
Random forests are a popular classification method based on an ensemble of a single type of decision tree. In the literature, there are many different types of decision tree algorithms, including C4.5, CART and CHAID. Each type of decision tree algorithm may capture different information and structure. In this paper, we propose a novel random forest algorithm, called a hybrid random forest, which ensembles multiple types of decision trees into a random forest and exploits the diversity of the trees to enhance the resulting model. We conducted a series of experiments on six text classification datasets to compare our method with traditional random forest methods and some other text categorization methods. The results show that our method consistently outperforms these compared methods.
- Published
- 2012
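A minimal sketch of the hybrid-forest idea, assuming one-level trees (stumps) as stand-ins for the full C4.5/CART-style trees: trees trained with different split criteria are mixed in one bootstrap ensemble and vote. All names and the tiny dataset are illustrative, not from the paper:

```python
import math
import random
from collections import Counter

def impurity(labels, criterion):
    """Gini (CART-style) or entropy (C4.5-style) impurity of a label list."""
    n = len(labels)
    if n == 0:
        return 0.0
    probs = [c / n for c in Counter(labels).values()]
    if criterion == "gini":
        return 1.0 - sum(p * p for p in probs)
    return -sum(p * math.log2(p) for p in probs)

def fit_stump(X, y, criterion):
    """One-level tree: the (feature, threshold) split minimising the
    weighted impurity under the given criterion."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [y[i] for i, row in enumerate(X) if row[f] <= t]
            right = [y[i] for i, row in enumerate(X) if row[f] > t]
            score = (len(left) * impurity(left, criterion)
                     + len(right) * impurity(right, criterion)) / len(y)
            if best is None or score < best[0]:
                l_maj = Counter(left).most_common(1)[0][0] if left else y[0]
                r_maj = Counter(right).most_common(1)[0][0] if right else y[0]
                best = (score, f, t, l_maj, r_maj)
    _, f, t, l_maj, r_maj = best
    return lambda row: l_maj if row[f] <= t else r_maj

def hybrid_forest(X, y, n_trees=10, seed=0):
    """Bootstrap an ensemble that alternates between the two criteria,
    mimicking a forest of mixed tree types, then vote."""
    rng = random.Random(seed)
    trees = []
    for i in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]
        trees.append(fit_stump([X[j] for j in idx], [y[j] for j in idx],
                               ("gini", "entropy")[i % 2]))
    return lambda row: Counter(t(row) for t in trees).most_common(1)[0][0]

# toy one-feature dataset, linearly separable at 3.5
X = [[i] for i in range(8)]
y = ["a"] * 4 + ["b"] * 4
predict = hybrid_forest(X, y)
```

The diversity here comes only from the criterion and the bootstrap; the paper's forests additionally draw on full tree-growing algorithms.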
5. Exploiting Word Cluster Information for Unsupervised Feature Selection
- Author
-
Michael K. Ng, Hanjing Su, Yunming Ye, Joshua Huang, and Qingyao Wu
- Subjects
Computer science, Feature vector, Feature selection, Pattern recognition, Machine learning, Discriminative model, Feature (computer vision), Metric (mathematics), Benchmark (computing), Artificial intelligence, Cluster analysis, Word (computer architecture)
- Abstract
This paper presents an approach to integrating word clustering information into the process of unsupervised feature selection. In our scheme, the words in the whole feature space are clustered into groups based on the co-occurrence statistics of words. The resulting word clustering information and the bag-of-words information are combined to measure the goodness of each word, which is our basic metric for selecting discriminative features. By exploiting word cluster information, we extend three well-known unsupervised feature selection methods and propose three new methods. A series of experiments are performed on three benchmark text data sets (20 Newsgroups, Reuters-21578 and CLASSIC3). The experimental results show that the new unsupervised feature selection methods select more discriminative features and, in turn, improve clustering performance.
- Published
- 2010
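The scheme above has two recoverable ingredients: cluster words by co-occurrence, then score each word by combining its own statistics with its cluster's. A rough sketch under simplifying assumptions (greedy union-find clustering, document frequency as the "goodness" signal; the paper's actual metrics differ):

```python
from collections import defaultdict
from itertools import combinations

def cluster_words(docs, min_cooc=2):
    """Greedy co-occurrence clustering: words that co-occur in at least
    `min_cooc` documents are merged into the same cluster (union-find)."""
    cooc = defaultdict(int)
    vocab = sorted({w for d in docs for w in d})
    parent = {w: w for w in vocab}
    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w
    for d in docs:
        for a, b in combinations(sorted(set(d)), 2):
            cooc[(a, b)] += 1
    for (a, b), c in cooc.items():
        if c >= min_cooc:
            parent[find(a)] = find(b)
    clusters = defaultdict(set)
    for w in vocab:
        clusters[find(w)].add(w)
    return list(clusters.values())

def select_features(docs, k, min_cooc=2):
    """Score each word by its document frequency times the summed document
    frequency of its whole cluster, then keep the top-k words."""
    df = defaultdict(int)
    for d in docs:
        for w in set(d):
            df[w] += 1
    cluster_of = {w: cl for cl in cluster_words(docs, min_cooc) for w in cl}
    def score(w):
        return df[w] * sum(df[u] for u in cluster_of[w])
    return sorted(df, key=score, reverse=True)[:k]

# hypothetical toy corpus (each doc is a list of words)
docs = [["apple", "fruit"], ["apple", "fruit"],
        ["car", "road"], ["car", "road"], ["apple"]]
```

A word backed by a strong cluster outranks an equally frequent isolated word, which is the intended effect of mixing cluster and bag-of-words evidence.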
6. A Survey of Open Source Data Mining Systems
- Author
-
Yunming Ye, Xiaofei Xu, Xiaojun Chen, and Graham J. Williams
- Subjects
Source code, Database, Computer science, Data stream mining, Usability, Software, Documentation, Scalability, The Internet, Small and medium-sized enterprises
- Abstract
Open source data mining software represents a new trend in data mining research, education and industrial applications, especially in small and medium enterprises (SMEs). With open source software an enterprise can easily initiate a data mining project using the most current technology. Often the software is available at no cost, allowing the enterprise to instead focus on ensuring that its staff can freely learn the data mining techniques and methods. Open source ensures that staff can understand exactly how the algorithms work by examining the source code, if they so desire, and can also fine-tune the algorithms to suit the specific purposes of the enterprise. However, diversity, instability, scalability and poor documentation can be major concerns in using open source data mining systems. In this paper, we survey open source data mining systems currently available on the Internet. We compare 12 open source systems along several dimensions, such as general characteristics, data source accessibility, data mining functionality, and usability, and discuss the advantages and disadvantages of these systems.
- Published
- 2007
7. PAKDD 2007 Industrial Track Workshop
- Author
-
Yunming Ye and Joshua Zhexue Huang
- Subjects
Engineering, Government, Data cleansing, Acceptance rate, Data science, Presentation, Facility management, Knowledge extraction
- Abstract
The PAKDD 2007 Industrial Track Workshop was held on the 22nd of May 2007 in conjunction with the 11th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2007) at the Mandarin Garden Hotel in Nanjing, China. Following the PAKDD conference tradition of promoting industry applications of new data mining techniques, methodologies and systems, the theme of this workshop was defined as "Data Mining for Practitioners". Industrially oriented papers on innovative applications of data mining technology to solve real-world problems were solicited, with emphasis on the following areas:
- New data mining methodologies
- Data mining application systems
- Data cleansing and transformation in data mining
- New application problems and data mining solutions
- Data mining in government applications
- Innovative data mining case studies
The workshop received 82 submissions from 9 countries and regions in Asia, Europe and North America. The topics of these submissions covered new techniques, algorithms, systems, traditional applications such as banking and finance, and new applications in manufacturing, facility management, environment and government services. After a rigorous review by the committee members, 13 papers were selected for presentation at the workshop. The acceptance rate was about 16%; such a low acceptance rate was unprecedented for industrial track papers in the PAKDD conference series.
- Published
- 2007
8. MFCRank: A Web Ranking Algorithm Based on Correlation of Multiple Features
- Author
-
Yunming Ye, Xiaojun Chen, Yan Li, Joshua Huang, and Xiaofei Xu
- Subjects
Anchor text, Information retrieval, Computer science, Hyperlink, Ranking (information retrieval), Search engine, Ranking, PageRank, Ranking SVM, Web page, Data mining, Algorithm, Link analysis
- Abstract
This paper presents MFCRank, a new ranking algorithm for topic-specific Web search systems. The basic idea is to correlate two types of similarity information in a unified link analysis model, so that the rich content and link features in Web collections can be exploited efficiently to improve ranking performance. First, a new surfer model, JBC, is proposed, under which the topic similarity information among neighborhood pages is used to weight the surfer's jumping probabilities and to direct its surfing activities. Second, as the JBC surfer model is still query-independent, a correlation between the query and JBC is essential. This is implemented through the definition of the MFCRank score, a linear combination of the JBC score and the similarity value between the query and the matched pages. Through these two correlation steps, the features contained in the plain text, link structure, anchor text and user query can be smoothly correlated in a single ranking model. Ranking experiments were carried out on a set of topic-specific Web page collections. Experimental results show that our algorithm achieves a substantial improvement in ranking precision.
- Published
- 2006
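The two correlation steps above can be caricatured in code: a PageRank-style iteration whose link-following probabilities are weighted by a topic-similarity score, then a linear blend with query similarity. This is a loose sketch, not the paper's JBC model; all graph data and parameter values are made up:

```python
def topic_weighted_rank(links, topic_sim, damping=0.85, iters=50):
    """PageRank-style power iteration where the surfer follows a link with
    probability proportional to the target page's topic similarity."""
    pages = sorted(links)
    n = len(pages)
    score = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            if not outs:
                continue
            total = sum(topic_sim[q] for q in outs)
            for q in outs:
                new[q] += damping * score[p] * topic_sim[q] / total
        score = new
    return score

def mfc_score(links, topic_sim, query_sim, alpha=0.5):
    """Final score: a linear combination of the link-based score and the
    query-page similarity, mirroring the second correlation step."""
    jbc = topic_weighted_rank(links, topic_sim)
    return {p: alpha * jbc[p] + (1 - alpha) * query_sim[p] for p in links}

# toy two-page graph with uniform topic similarity
links = {"a": ["b"], "b": ["a"]}
sim = {"a": 1.0, "b": 1.0}
rank = topic_weighted_rank(links, sim)
```

With uniform topic similarity the iteration reduces to ordinary PageRank; skewing `topic_sim` pulls probability mass toward on-topic pages.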
9. Neighborhood Density Method for Selecting Initial Cluster Centers in K-Means Clustering
- Author
-
Xiaofei Xu, Shuigeng Zhou, Yunming Ye, Joshua Zhexue Huang, Graham J. Williams, and Xiaojun Chen
- Subjects
DBSCAN, Clustering high-dimensional data, Fuzzy clustering, Computer science, Correlation clustering, Single-linkage clustering, k-means clustering, Data stream clustering, Search algorithm, CURE data clustering algorithm, Canopy clustering algorithm, FLAME clustering, Data mining, Cluster analysis, k-medians clustering
- Abstract
This paper presents a new method for effectively selecting initial cluster centers in k-means clustering. This method identifies the high density neighborhoods from the data first and then selects the central points of the neighborhoods as initial centers. The recently published Neighborhood-Based Clustering (NBC) algorithm is used to search for high density neighborhoods. The new clustering algorithm NK-means integrates NBC into the k-means clustering process to improve the performance of the k-means algorithm while preserving the k-means efficiency. NBC is enhanced with a new cell-based neighborhood search method to accelerate the search for initial cluster centers. A merging method is employed to filter out insignificant initial centers to avoid too many clusters being generated. Experimental results on synthetic data sets have shown significant improvements in clustering accuracy in comparison with the random k-means and the refinement k-means algorithms.
- Published
- 2006
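A compact sketch of the idea of seeding k-means from high-density neighborhoods. The paper uses the NBC algorithm with a cell-based neighborhood search; here a simple radius-count density with a greedy separation rule stands in for it, so treat this as an assumption-laden illustration:

```python
import math

def dense_centers(points, k, radius):
    """Pick initial centers greedily: highest neighbourhood density first,
    skipping candidates within `radius` of an already chosen center."""
    density = [sum(1 for q in points if math.dist(p, q) <= radius)
               for p in points]
    centers = []
    for i in sorted(range(len(points)), key=lambda i: -density[i]):
        if len(centers) == k:
            break
        if all(math.dist(points[i], c) > radius for c in centers):
            centers.append(points[i])
    return centers

def kmeans(points, centers, iters=10):
    """Plain Lloyd iterations started from the density-based centers."""
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            groups[j].append(p)
        centers = [tuple(sum(v) / len(g) for v in zip(*g)) if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers

# two well-separated toy blobs
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
init = dense_centers(points, k=2, radius=2.0)
final = sorted(kmeans(points, init))
```

Because the seeds already sit in distinct dense regions, Lloyd's iterations converge immediately, which is the motivation for density-based initialization over random starts.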
10. IglooG: A Distributed Web Crawler Based on Grid Service
- Author
-
Fei Liu, Yunming Ye, Minglu Li, Jia-di Yu, and Fanyuan Ma
- Subjects
Service (systems architecture), Database, Computer science, Overlay network, Focused crawler, Grid, Scalability, Information system, Web crawler, Latent semantic indexing
- Abstract
A Web crawler is a program used to download documents from Web sites. This paper presents the design of a distributed Web crawler on a grid platform, based on our previous work, Igloo. Each crawler is deployed as a grid service to improve the scalability of the system. Information services, also deployed as grid services and organized as a peer-to-peer overlay network, are in charge of distributing URLs to balance the loads of the crawlers. According to the ID of a crawler and the semantic vector of a crawled page, computed by Latent Semantic Indexing, a crawler can decide whether to transmit a URL to an information service or keep it for itself. We present an implementation of the distributed crawler based on Igloo and simulate a Grid environment to evaluate load balancing across the crawlers and crawl speed. Both theoretical analysis and experimental results show that our system is high-performance and reliable.
- Published
- 2005
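The core decision in the abstract, whether a crawler keeps a URL or transmits it to the node responsible for it, requires all nodes to agree on ownership. The paper does this with crawler IDs and LSI semantic vectors over a P2P overlay; as a loose stand-in, rendezvous (highest-random-weight) hashing gives the same agreement property. Node names and URLs below are hypothetical:

```python
import hashlib

def node_for(url, nodes):
    """Rendezvous hashing: every node deterministically scores the URL and
    the highest scorer owns it, so all crawlers agree without coordination."""
    def weight(node):
        return int(hashlib.sha256((node + "|" + url).encode()).hexdigest(), 16)
    return max(nodes, key=weight)

def dispatch(url, self_node, nodes):
    """A crawler keeps a discovered URL if it owns it, otherwise forwards
    it towards the owning node."""
    owner = node_for(url, nodes)
    return ("keep", url) if owner == self_node else ("forward", owner, url)

nodes = ["crawler-1", "crawler-2", "crawler-3"]
owner = node_for("http://example.com/page", nodes)
```

Unlike plain modulo hashing, adding or removing a crawler only reassigns the URLs owned by that node, which matters for a dynamic grid deployment.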
11. iSurfer: A Focused Web Crawler Based on Incremental Learning from Positive Samples
- Author
-
Yiming Lu, Fanyuan Ma, Joshua Zhexue Huang, Yunming Ye, and Matthew Chiu
- Subjects
Computer science, Sample (statistics), Focused crawler, Crawling, Machine learning, Web page, Incremental learning, Information system, The Internet, Artificial intelligence, Web crawler
- Abstract
This paper presents iSurfer, a focused Web crawling system for information retrieval from the Web. Unlike other focused crawlers, iSurfer uses an incremental method to learn a page classification model and a link prediction model. It employs an online sample detector to incrementally distill new samples from crawled Web pages for online updating of the learned model. Other focused crawling systems use classifiers that are built from initial positive and negative samples and cannot learn incrementally. The performance of such classifiers depends on the topical coverage of the initial samples; however, initial samples, particularly negative ones, with good coverage of the target topics are difficult to find. iSurfer's incremental learning strategy therefore has an advantage: it starts from a few positive samples and gains more integrated knowledge about the target topics over time. Our experiments on various topics demonstrate that the incremental learning method can improve the harvest rate with only a few initial samples.
- Published
- 2004
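The "online sample detector feeding back into the model" loop can be sketched with a toy incremental scorer. This is an illustrative stand-in, not iSurfer's actual classifier; all class and function names are invented:

```python
import math
from collections import defaultdict

class IncrementalClassifier:
    """Toy incremental topic scorer: word counts for the target topic and
    for background text, updatable one page at a time."""
    def __init__(self):
        self.pos, self.bg = defaultdict(int), defaultdict(int)
        self.pos_total = self.bg_total = 0

    def update(self, words, positive=True):
        target = self.pos if positive else self.bg
        for w in words:
            target[w] += 1
        if positive:
            self.pos_total += len(words)
        else:
            self.bg_total += len(words)

    def score(self, words):
        """Log-likelihood ratio with add-one smoothing."""
        v = len(set(self.pos) | set(self.bg)) or 1
        return sum(math.log((self.pos.get(w, 0) + 1) / (self.pos_total + v))
                   - math.log((self.bg.get(w, 0) + 1) / (self.bg_total + v))
                   for w in words)

def crawl_step(clf, page_words, threshold=0.0):
    """Online sample detector: a page scoring above the threshold is
    distilled as a new positive sample and fed back into the model."""
    if clf.score(page_words) > threshold:
        clf.update(page_words, positive=True)
        return True
    return False

# seed with a few positive samples only, as the abstract describes
clf = IncrementalClassifier()
clf.update(["machine", "learning", "model"], positive=True)
clf.update(["football", "goal", "match"], positive=False)
```

Each accepted page broadens the positive model, which is how the crawler's topical coverage grows beyond the initial seed samples.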
12. Improved Email Classification through Enriched Feature Space
- Author
-
Joshua Zhexue Huang, Yunming Ye, Fanyuan Ma, and Hongqiang Rong
- Subjects
Computer science, Feature vector, Pattern recognition, Machine learning, Semantics, Electronic mail, Support vector machine, Statistical classification, Naive Bayes classifier, Information system, Artificial intelligence
- Abstract
This paper presents a novel feature space enriching (FSE) technique to address the problem of sparse and noisy feature spaces in email classification. The FSE technique employs two semantic knowledge bases to enrich the original sparse feature space, resulting in semantically richer features from which classification algorithms can learn improved classifiers. Naive Bayes and support vector machines are selected as the classification algorithms. Experiments on an enterprise email dataset show that the FSE technique is effective in improving email classification performance.
- Published
- 2004
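The enriching step itself is simple to illustrate: expand a sparse bag of words with related terms drawn from a knowledge base. The paper uses two real semantic knowledge bases; the miniature dictionary below is made up purely for demonstration:

```python
# hypothetical miniature "semantic knowledge base"
SYNONYMS = {
    "meeting": ["appointment", "schedule"],
    "invoice": ["payment", "bill"],
}

def enrich(features, kb=SYNONYMS):
    """Feature space enriching: extend a sparse bag of words with related
    terms from the knowledge base, so short emails share more features
    with other emails of their class."""
    enriched = list(features)
    for w in features:
        enriched.extend(kb.get(w, []))
    return enriched
```

A downstream classifier (naive Bayes or an SVM, as in the paper) is then trained on the enriched bags instead of the original sparse ones.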
13. Enhanced Email Classification Based on Feature Space Enriching
- Author
-
Joshua Zhexue Huang, Hongqiang Rong, Yunming Ye, and Fanyuan Ma
- Subjects
Computer science, Feature vector, Pattern recognition, Machine learning, Semantics, Electronic mail, Support vector machine, Statistical classification, Naive Bayes classifier, Knowledge base, Domain knowledge, Artificial intelligence
- Abstract
Email classification is challenging due to its sparse and noisy feature space. To address this problem, a novel feature space enriching (FSE) technique based on two semantic knowledge bases is proposed in this paper. The basic idea of FSE is to select, from the semantic knowledge bases, related semantic features that increase the global information available to learning algorithms, and to use them to enrich the original sparse feature space. The resulting feature space provides semantically richer features from which classification algorithms can learn improved classifiers. Naive Bayes and support vector machines are selected as the classification algorithms. Experiments on a bilingual enterprise email dataset show that: (1) the FSE technique can improve email classification accuracy, especially for the sparse classes; (2) the SVM classifier benefits more from FSE than the naive Bayes classifier; and (3) with the support of domain knowledge, the FSE technique can be made more effective.
- Published
- 2004
14. A New Multivariate Decision Tree Construction Algorithm Based on Variable Precision Rough Set
- Author
-
Liang Zhang, Fanyuan Ma, Shui Yu, and Yunming Ye
- Subjects
Incremental decision tree, Multivariate statistics, Computer science, Generalization, Decision tree learning, Dominance-based rough set approach, ID3 algorithm, Decision tree, Machine learning, Equivalence relation, Rough set, Artificial intelligence, Algorithm
- Abstract
In this paper we extend previous research and present a novel approach to constructing multivariate decision trees that is, to some extent, fault tolerant, by employing a development of rough set theory (RST), namely the variable precision rough set (VPRS) model. Based on variable precision rough set theory, a new concept, the generalization of one equivalence relation with respect to another with precision β, is introduced and used for the construction of multivariate decision trees. Experimental results show that the approach is well suited to building multivariate decision trees from noisy data.
- Published
- 2003
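The VPRS ingredient the abstract relies on is majority inclusion: an equivalence class counts toward a concept's lower approximation when at least a fraction β of its elements belongs to the concept. A minimal sketch, with a made-up partition and decision class rather than data from the paper:

```python
def beta_lower_approximation(classes, target, beta):
    """VPRS beta-lower approximation: keep an equivalence class when at
    least a fraction beta of its elements lies in the target concept."""
    lower = set()
    for eq in classes:
        if len(eq & target) / len(eq) >= beta:
            lower |= eq
    return lower

# hypothetical partition induced by a condition attribute,
# and one decision class as the target concept
classes = [{1, 2, 3, 4}, {5, 6}, {7, 8, 9}]
target = {1, 2, 3, 7, 8, 9}
```

With β = 1 this reduces to the classical rough-set lower approximation; relaxing β admits classes that are only mostly consistent with the concept, which is the source of the tree's tolerance to noisy records.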