745 results for "Arabie, Ph."
Search Results
52. Building Symbolic Objects from Data Streams.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Opitz, O., Ritter, G., Schader, M., Weihs, C., Brito, Paula, and Cucumel, Guy
- Abstract
With the increase of computer use in all sectors of activity, more and more data are available as streams of structured records, so that it is not possible to store all the data before analyzing them from a data mining perspective. New data management systems have been studied to handle such data streams, and new algorithms have been developed to perform stream mining. In this paper, we propose approaches to extend the construction of symbolic objects to data streams: symbolic objects are built and maintained as a representation of a complete stream or of a sliding window on the stream. [ABSTRACT FROM AUTHOR] (An illustrative sketch follows this record.)
- Published
- 2007
- Full Text
- View/download PDF
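A minimal sketch of the kind of construction the entry above describes, assuming the simplest case: a symbolic description of a sliding window that is just the interval [min, max] of one numeric stream variable. The class name, window size, and values are illustrative; the chapter's symbolic objects cover richer variable types and maintenance schemes.

```python
from collections import deque

class SlidingIntervalSummary:
    """Track, for one numeric stream variable, the interval [min, max]
    over the last `size` records (a hypothetical, simplified stand-in
    for a symbolic description of a sliding window)."""

    def __init__(self, size):
        self.window = deque(maxlen=size)  # old records fall out automatically

    def update(self, value):
        self.window.append(value)
        return min(self.window), max(self.window)

summary = SlidingIntervalSummary(size=3)
for value in [5.0, 7.5, 6.2, 9.1]:
    print(summary.update(value))  # interval describing the current window
```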
53. Clustering and Validation of Interval Data.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Opitz, O., Ritter, G., Schader, M., Weihs, C., Brito, Paula, and Cucumel, Guy
- Abstract
The paper addresses the problem of assessing the validity of the clusters found by a clustering algorithm. The determination of the "true" number of "natural" clusters has often been considered as the central problem of cluster validation. Many different stopping rules have been proposed in the research literature but most of them are applicable only to classical data (qualitative or quantitative). In this paper we investigate the problem of the determination of the number of clusters for symbolic objects described by interval variables. We consider five classical methods and two hypothesis tests based on the Poisson point process. We extend these methods to interval data. We apply them to the meteorological stations data set. [ABSTRACT FROM AUTHOR]
- Published
- 2007
- Full Text
- View/download PDF
54. 3WaySym-Scal: Three-Way Symbolic Multidimensional Scaling.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Opitz, O., Ritter, G., Schader, M., Weihs, C., Brito, Paula, and Cucumel, Guy
- Abstract
Multidimensional scaling aims at reconstructing dissimilarities between pairs of objects by distances in a low-dimensional space. However, in some cases the dissimilarity itself is not known, but the range, or a histogram, of the dissimilarities is given. This type of data falls into the wider class of symbolic data (see Bock and Diday (2000)). We model three-way two-mode data consisting of an interval of dissimilarities for each object pair from each of K sources by a set of intervals of the distances, defined as the minimum and maximum distance between two sets of embedded rectangles representing the objects. In this paper, we provide a new algorithm called 3WaySym-Scal using iterative majorization, which is based on the algorithm I-Scal developed for the two-way case where the dissimilarities are given by a range of values, i.e. an interval (see Groenen et al. (2006)). The advantage of iterative majorization is that each iteration is guaranteed to improve the solution until no improvement is possible. We present results on an empirical data set on synthetic musical tones. [ABSTRACT FROM AUTHOR] (An illustrative sketch follows this record.)
- Published
- 2007
- Full Text
- View/download PDF
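Since the entry above represents objects as embedded rectangles and works with the minimum and maximum distance between them, here is a small geometric sketch of those two bounds for axis-parallel rectangles (products of intervals). The function name and example boxes are assumptions for illustration; the 3WaySym-Scal majorization itself is not reproduced.

```python
import numpy as np

def box_distance_bounds(box1, box2):
    """Minimum and maximum Euclidean distance between two axis-parallel
    hyperrectangles, each given as a list of (lower, upper) bounds per dimension."""
    gap_sq, far_sq = 0.0, 0.0
    for (l1, u1), (l2, u2) in zip(box1, box2):
        gap = max(0.0, l1 - u2, l2 - u1)       # 0 whenever the intervals overlap
        far = max(abs(u1 - l2), abs(u2 - l1))  # farthest pair of endpoints
        gap_sq += gap ** 2
        far_sq += far ** 2
    return np.sqrt(gap_sq), np.sqrt(far_sq)

# Two objects represented as rectangles in a two-dimensional configuration.
print(box_distance_bounds([(0, 1), (0, 2)], [(3, 4), (1, 3)]))  # (2.0, 5.0)
```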
55. An Agglomerative Hierarchical Clustering Algorithm for Improving Symbolic Object Retrieval.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Opitz, O., Ritter, G., Schader, M., Weihs, C., Brito, Paula, and Cucumel, Guy
- Abstract
One of the main novelties of symbolic data analysis is the introduction of symbolic objects (SOs): "aggregated data" that synthesize information concerning a group of individuals of a population. SOs are particularly suitable for representing (and managing) census data, which require the availability of aggregated information. This paper proposes a new (conceptual) hierarchical agglomerative clustering algorithm whose output is a "tree" of progressively more general SO descriptions. Such a tree can be used effectively to improve the resource retrieval task, specifically for finding the SO to which an individual belongs and/or for determining a more general representation of a given SO (i.e. finding a more general segment of information to which an SO belongs). [ABSTRACT FROM AUTHOR]
- Published
- 2007
- Full Text
- View/download PDF
56. A Clustering Algorithm for Symbolic Interval Data Based on a Single Adaptive Hausdorff Distance.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Opitz, O., Ritter, G., Schader, M., Weihs, C., Brito, Paula, and Cucumel, Guy
- Abstract
This paper introduces a dynamic clustering method for partitioning symbolic interval data. The method furnishes a partition and a prototype for each cluster by optimizing an adequacy criterion that measures the fit between the clusters and their representatives. To compare symbolic interval data, the method uses a single adaptive Hausdorff distance that changes at each iteration but is the same for all clusters. Experiments with real and synthetic symbolic interval data sets showed the usefulness of the proposed method. [ABSTRACT FROM AUTHOR] (An illustrative sketch follows this record.)
- Published
- 2007
- Full Text
- View/download PDF
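For the entry above, a sketch of a weighted Hausdorff-type distance between two objects described by interval variables: the per-interval Hausdorff distance is the larger of the differences of the lower and upper bounds. The weight vector stands in for the single adaptive distance shared by all clusters; its update rule belongs to the paper's algorithm and is not shown here.

```python
def hausdorff_interval(a, b):
    """Hausdorff distance between intervals a = (a_lo, a_hi) and b = (b_lo, b_hi)."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))

def weighted_hausdorff(x, y, weights):
    """City-block combination of per-variable Hausdorff distances; `weights`
    plays the role of a single weight vector shared by all clusters."""
    return sum(w * hausdorff_interval(xi, yi) for w, xi, yi in zip(weights, x, y))

# Two objects described by two interval variables each (toy values).
x = [(1.0, 3.0), (10.0, 12.0)]
y = [(2.0, 5.0), (9.0, 11.5)]
print(weighted_hausdorff(x, y, weights=[1.0, 1.0]))  # 2.0 + 1.0 = 3.0
```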
57. Symbolic Analysis to Learn Evolving CyberTraffic.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Opitz, O., Ritter, G., Schader, M., Weihs, C., Brito, Paula, and Cucumel, Guy
- Abstract
Monitoring Internet traffic in order to both dynamically tune network resources and ensure service continuity is a big challenge. Two main research issues characterize the analysis of the huge amount of data generated by Internet traffic: 1) learning an adaptive model of normal behavior which must be able to detect anomalies, and 2) computational efficiency of the learning algorithm so that it works properly on-line. In this chapter, we propose a methodology which returns a set of symbolic objects representing an adaptive model of 'normal' daily network traffic. The model can then be used to discover traffic anomalies of interest to the network administrator. [ABSTRACT FROM AUTHOR]
- Published
- 2007
- Full Text
- View/download PDF
58. On the Analysis of Symbolic Data.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Opitz, O., Ritter, G., Schader, M., Weihs, C., Cucumel, Guy, and Bertrand, Patrice
- Abstract
Symbolic data extend the classical tabular model, where each individual takes exactly one value for each variable, by allowing multiple, possibly weighted, values for each variable. New variable types - interval-valued, categorical multi-valued and modal variables - have been introduced, which allow representing the variability and/or uncertainty inherent in the data. But are we still in the same framework when we allow variables to take multiple values? Are the definitions of basic notions still so straightforward? What properties remain valid? In this paper we discuss some issues that arise when trying to apply classical data analysis techniques to symbolic data. The central question of the measurement of dispersion, and the consequences of different possible choices in the design of multivariate methods, will be addressed. [ABSTRACT FROM AUTHOR]
- Published
- 2007
- Full Text
- View/download PDF
59. Dependencies and Variation Components of Symbolic Interval-Valued Data.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Opitz, O., Ritter, G., Schader, M., Weihs, C., Brito, Paula, and Cucumel, Guy
- Abstract
In 1987, Diday added a new dimension to data analysis with his fundamental paper introducing the notions of symbolic data and their analyses. He and his colleagues, among others, have developed innumerable techniques to analyse symbolic data; yet even more is waiting to be done. One area that has seen much activity in recent years involves the search for a measure of dependence between two symbolic random variables. This paper presents a covariance function for interval-valued data. It also discusses how the total, between-interval, and within-interval variations relate; in particular, this relationship shows that a covariance function based only on interval midpoints does not capture all the variation in the data. While important in its own right, the covariance function plays a central role in many multivariate methods. [ABSTRACT FROM AUTHOR] (An illustrative sketch follows this record.)
- Published
- 2007
- Full Text
- View/download PDF
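To make the point of the entry above concrete, the following sketch contrasts a covariance computed from interval midpoints with the within-interval variation it ignores, using the elementary fact that a value spread uniformly over [a, b] has variance (b - a)^2 / 12. The data and the decomposition shown are illustrative assumptions, not the covariance function proposed in the paper.

```python
import numpy as np

# Hypothetical interval-valued sample: each row is one observation,
# holding the (lower, upper) bounds of variables X and Y respectively.
X = np.array([[1.0, 3.0], [2.0, 6.0], [0.5, 1.5]])
Y = np.array([[10.0, 12.0], [11.0, 15.0], [9.0, 10.0]])

x_mid, y_mid = X.mean(axis=1), Y.mean(axis=1)                        # midpoints
x_rng, y_rng = np.diff(X, axis=1).ravel(), np.diff(Y, axis=1).ravel()  # widths

# Covariance of the midpoints alone: the "between-interval" part only.
cov_mid = np.cov(x_mid, y_mid, bias=True)[0, 1]

# Within-interval variation, invisible to the midpoint covariance
# (uniform spread over an interval of width r contributes r**2 / 12).
within_var_x = np.mean(x_rng ** 2) / 12.0
within_var_y = np.mean(y_rng ** 2) / 12.0

print(cov_mid, within_var_x, within_var_y)
```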
60. Data Mining in Higher Education.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Optiz, O., Ritter, G., Schader, M., Weihs, C., Decker, Reinhold, and Lenz, Hans -J.
- Abstract
The aim of this paper is a critical discussion of different data mining methods in the context of the demand-oriented development of bachelor and master study courses at German universities. The starting point of the investigation was the question of to what extent knowledge concerning the selection of the so-called "Fachkernkombinationen" (combinations of major fields of study) at the Fakultät Wirtschaftswissenschaften of the Technische Universität Dresden could be used to provide new, important and therefore demand-oriented impulses for the development of new bachelor and master courses. To identify these combinations, it is natural to examine the combinations of major fields of study by means of different data mining methods. Special attention is paid to association analysis, which is classically used in the areas of retail (basket analysis) or e-business (web content and web usage mining); an application in higher education management has been missing until now. [ABSTRACT FROM AUTHOR]
- Published
- 2007
- Full Text
- View/download PDF
61. Attribute Aware Anonymous Recommender Systems.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Optiz, O., Ritter, G., Schader, M., Weihs, C., Decker, Reinhold, and Lenz, Hans -J.
- Abstract
Anonymous recommender systems are the electronic counterpart of vendors who ask customers a few questions and subsequently recommend products based on the answers. In this article we propose attribute-aware classifier-based approaches for such a system and compare them to classifier-based approaches that only make use of the product IDs and to an existing real-life knowledge-based system. We show that the attribute-based model is very robust against noise and provides good results in a learning-over-time experiment. [ABSTRACT FROM AUTHOR]
- Published
- 2007
- Full Text
- View/download PDF
62. Rescaling Proximity Matrix Using Entropy Analyzed by INDSCAL.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Optiz, O., Ritter, G., Schader, M., Weihs, C., Decker, Reinhold, and Lenz, Hans -J.
- Abstract
[Yokoyama and Okada (2005, in press)] suggested a new method that rescales a brand-switching data matrix using entropy. These studies applied the rescaling method to car-switching data, and the configuration derived by Kruskal's multidimensional scaling was interpreted as the circumplex. In the present paper, we apply that rescaling method to intergenerational occupational mobility data for four years and analyze the results by Kruskal's multidimensional scaling. The resulting configurations are also interpreted as the circumplex. Furthermore, we find that the result of analyzing these rescaled data by INDSCAL is interpreted as the circumplex as well. [ABSTRACT FROM AUTHOR]
- Published
- 2007
- Full Text
- View/download PDF
63. Where Did I See You Before... A Holistic Method to Compare and Find Archaeological Artifacts.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Optiz, O., Ritter, G., Schader, M., Weihs, C., Decker, Reinhold, and Lenz, Hans -J.
- Abstract
This paper describes the Secanto (Section Analysis Tool) computer program, designed to find look-alikes of archaeological objects by comparing their shapes (sections, profiles). The current database contains low-resolution images of about 1000 profiles of handmade Iron Age ceramic vessels from The Netherlands and Northern Germany, taken from 14 'classic' publications. A point-and-click data entry screen enables users to enter their own profile, and within 2 minutes the best look-alikes (best according to a calculated similarity parameter) are retrieved from the database. The images, essentially treated as two-dimensional information carriers, are directly compared by measuring their surface curvatures. The differences between these curvatures are expressed in a similarity parameter, which can also be interpreted as a 'distance' between them. The method looks very promising, also for other types of artifacts such as stone tools and coins. [ABSTRACT FROM AUTHOR]
- Published
- 2007
- Full Text
- View/download PDF
64. Uncovering the Internal Structure of the Roman Brick and Tile Making in Frankfurt-Nied by Cluster Validation.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Optiz, O., Ritter, G., Schader, M., Weihs, C., Decker, Reinhold, and Lenz, Hans -J.
- Abstract
During the past few years, a complex model of history and relations of Roman brick and tile production in south-west Germany has been developed by archaeologists. However, open questions remain concerning the brickyard of Frankfurt-Nied. From the statistical point of view the set of bricks and tiles of this location is divided into two clusters. These clusters can be confirmed by cluster validation. As a result of these validations, archaeologists can now modify and consolidate their ideas about the internal structures of Roman brick and tile making in Frankfurt-Nied. [ABSTRACT FROM AUTHOR]
- Published
- 2007
- Full Text
- View/download PDF
65. Vowel Classification by a Neurophysiologically Parameterized Auditory Model.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Optiz, O., Ritter, G., Schader, M., Weihs, C., Decker, Reinhold, and Lenz, Hans -J.
- Abstract
Meaningful feature extraction is a very important challenge and indispensable for good classification results. In Automatic Speech Recognition, human performance is still superior to technical solutions. In this paper a feature extraction for sound data is presented that is perceptually motivated by the signal processing of the human auditory system. The physiological mechanisms of signal transduction in the human ear and its neural representation are described. The generated pulse spike trains of the inner hair cells are connected to a feed-forward timing artificial Hubel-Wiesel network, which is a structured computational map for higher cognitive functions such as vowel recognition. According to the theory of Greenberg, a signal triggers a set of delay trajectories. In the paper this is shown for the classification of different vowels from several speakers. [ABSTRACT FROM AUTHOR]
- Published
- 2007
- Full Text
- View/download PDF
66. Using MCMC as a Stochastic Optimization Procedure for Monophonic and Polyphonic Sound.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Optiz, O., Ritter, G., Schader, M., Weihs, C., Decker, Reinhold, and Lenz, Hans -J.
- Abstract
Based on a model of Davy and Godsill (2002), we describe a general model for time series from monophonic and polyphonic musical sound in order to estimate the pitch. The model is a hierarchical Bayes model which is estimated with MCMC methods. For parameter estimation, an MCMC-based stochastic optimization is introduced. A comparative study illustrates the usefulness of the MCMC algorithm. [ABSTRACT FROM AUTHOR]
- Published
- 2007
- Full Text
- View/download PDF
67. A Probabilistic Framework for Audio-Based Tonal Key and Chord Recognition.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Optiz, O., Ritter, G., Schader, M., Weihs, C., Decker, Reinhold, and Lenz, Hans -J.
- Abstract
A unified probabilistic framework for audio-based chord and tonal key recognition is described and evaluated. The proposed framework embodies an acoustic observation likelihood model and key & chord transition models. It is shown how to conceive these models and how to use music theory to link key/chord transition probabilities to perceptual similarities between keys/chords. The advantage of a theory based model is that it does not require any training, and consequently, that its performance is not affected by the quality of the available training data. [ABSTRACT FROM AUTHOR]
- Published
- 2007
- Full Text
- View/download PDF
68. Part-of-Speech Discovery by Clustering Contextual Features.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Optiz, O., Ritter, G., Schader, M., Weihs, C., Decker, Reinhold, and Lenz, Hans -J.
- Abstract
An unsupervised method for part-of-speech discovery is presented whose aim is to induce a system of word classes by looking at the distributional properties of words in raw text. Our assumption is that the word pair consisting of the left and right neighbors of a particular token is characteristic of the part of speech to be selected at this position. Based on this observation, we cluster all such word pairs according to the patterns of their middle words. This gives us centroid vectors that are useful for the induction of a system of word classes and for the correct classification of ambiguous words. [ABSTRACT FROM AUTHOR]
- Published
- 2007
- Full Text
- View/download PDF
69. Comparing the Stability of Different Clustering Results of Dialect Data.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Optiz, O., Ritter, G., Schader, M., Weihs, C., Decker, Reinhold, and Lenz, Hans -J.
- Abstract
[Mucha and Haimerl (2005)] proposed an algorithm to determine the stability of clusters found in hierarchical cluster analysis (HCA) and to calculate the rate of recovery by which an element can be reassigned to the same cluster in successive classifications of bootstrap samples. As proof of the concept this algorithm was applied to quantitative linguistics data. These investigations used only HCA algorithms. This paper will take a broader look at the stability of clustering results, and it will take different cluster algorithms into account; e.g. we compare the stability values of partitions from HCA with results from partitioning algorithms. To ease the comparison, the same data set - from dialect research of Northern Italy, as in [Mucha and Haimerl (2005)] - will be used here. [ABSTRACT FROM AUTHOR]
- Published
- 2007
- Full Text
- View/download PDF
70. The Relationship of Word Length and Sentence Length: The Inter-Textual Perspective.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Optiz, O., Ritter, G., Schader, M., Weihs, C., Decker, Reinhold, and Lenz, Hans -J.
- Abstract
The present study concentrates on the relation between sentence length (SL) and word length (WL) as a possible factor in text classification. The dependence of WL and SL is discussed in terms of general system theory and synergetics; the results achieved thus are relevant not only for linguistic studies of text classification, but for the study of other complex systems, as well. [ABSTRACT FROM AUTHOR]
- Published
- 2007
- Full Text
- View/download PDF
71. Classifying German Questions According to Ontology-Based Answer Types.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Optiz, O., Ritter, G., Schader, M., Weihs, C., Decker, Reinhold, and Lenz, Hans -J.
- Abstract
In this paper we describe the evaluation of three machine learning algorithms that assign ontology based answer types to questions in a question-answering task. We used shallow and syntactical features to classify about 1400 German questions with a Decision Tree, a k-nearest Neighbor, and a Naïve Bayes algorithm. [ABSTRACT FROM AUTHOR]
- Published
- 2007
- Full Text
- View/download PDF
72. Clustering of Polysemic Words.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Optiz, O., Ritter, G., Schader, M., Weihs, C., Decker, Reinhold, and Lenz, Hans -J.
- Abstract
In this paper, we propose an approach for constructing clusters of related terms that may be used for deriving formal conceptual structures in a later stage. In contrast to previous approaches in this direction, we explicitly take into account the fact that words can have different, possibly even unrelated, meanings. To account for such ambiguities in word meaning, we consider two alternative soft clustering techniques, namely Overlapping Pole-Based Clustering (PoBOC) and Clustering by Committees (CBC). These soft clustering algorithms are used to detect different contexts of the clustered words, resulting in possibly more than one cluster membership per word. We report on initial experiments conducted on textual data from the tourism domain. [ABSTRACT FROM AUTHOR]
- Published
- 2007
- Full Text
- View/download PDF
73. Unsupervised Decision Trees Structured by Gene Ontology (GO-UDTs) for the Interpretation of Microarray Data.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Optiz, O., Ritter, G., Schader, M., Weihs, C., Decker, Reinhold, and Lenz, Hans -J.
- Abstract
Unsupervised data mining of microarray gene expression data is a standard approach for finding relevant groups of genes as well as samples. Clustering of samples is important for finding e.g. disease subtypes or related treatments. Unfortunately, most sample-wise clustering methods do not facilitate the biological interpretation of the results. We propose a novel approach for microarray sample-wise clustering that computes dendrograms with Gene Ontology terms annotated to each node. These dendrograms resemble decision trees with simple rules which can help to find biologically meaningful differences between the sample groups. We have applied our method to a gene expression data set from a study of prostate cancer. The original clustering which contains clinically relevant features is well reproduced, but in addition our unsupervised decision tree rules give hints for a biological explanation of the clusters. [ABSTRACT FROM AUTHOR]
- Published
- 2007
- Full Text
- View/download PDF
74. Joint Analysis of In-situ Hybridization and Gene Expression Data.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Optiz, O., Ritter, G., Schader, M., Weihs, C., Decker, Reinhold, and Lenz, Hans -J.
- Abstract
To understand transcriptional regulation during development a detailed analysis of gene expression is needed. In-situ hybridization experiments measure the spatial distribution of mRNA-molecules and thus complement DNA-microarray experiments. This is of very high biological relevance, as co-location is a necessary condition for possible molecular interactions. We use publicly available in-situ data from embryonal development of Drosophila and derive a co-location index for pairs of genes. Our image processing pipeline for in-situ images provides a simpler alternative for the image processing part at comparable performance compared to published prior work. We formulate a mixture model which can use the pair-wise co-location indices as constraints in a mixture estimation on gene expression time-courses. [ABSTRACT FROM AUTHOR]
- Published
- 2007
- Full Text
- View/download PDF
75. Discovering Biomarkers for Myocardial Infarction from SELDI-TOF Spectra.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Optiz, O., Ritter, G., Schader, M., Weihs, C., Decker, Reinhold, and Lenz, Hans -J.
- Abstract
We describe a three-step procedure to separate patients with myocardial infarction from a control group based on SELDI-TOF mass spectra. The procedure returns features ("biomarkers") that are strongly present in one of the two groups. These features should allow future subjects to be classified as at-risk of myocardial infarction. The algorithm uses morphological operations to reduce noise in the input data as well as for performing baseline correction. In contrast to previous approaches on SELDI-TOF spectra, we avoid black-box machine learning procedures and use only features (protein masses) that are easy to interpret. [ABSTRACT FROM AUTHOR]
- Published
- 2007
- Full Text
- View/download PDF
76. Enhancing Bluejay with Scalability, Genome Comparison and Microarray Visualization.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Optiz, O., Ritter, G., Schader, M., Weihs, C., Decker, Reinhold, and Lenz, Hans -J.
- Abstract
The Bluejay genome browser (Browser for Linear Units in Java™) is a flexible visualization environment for biological sequences, which is capable of producing high-quality graphical outputs (http://bluejay.ucalgary.ca). We have recently added functionalities to Bluejay to realize the true potential of 2D bioinformatics visualization. We describe the three major new functionalities that will be of added value to the user: (i) exploration of large genomes using level-of-detail management; (ii) comparative visualization of multiple genomes; (iii) visualization of microarray data in a genomic context. [ABSTRACT FROM AUTHOR]
- Published
- 2007
- Full Text
- View/download PDF
77. The Influence of Specific Information on the Credit Risk Level.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Optiz, O., Ritter, G., Schader, M., Weihs, C., Decker, Reinhold, and Lenz, Hans -J.
- Abstract
The paper presents the influence of specific information on the credit risk level. The effect of a particular piece of information can sometimes be expected, but in other cases it can be truly surprising. This depends on the exact content, the history of previous information, and the general standing of the company. [ABSTRACT FROM AUTHOR]
- Published
- 2007
- Full Text
- View/download PDF
78. Foreign Exchange Trading with Support Vector Machines.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Optiz, O., Ritter, G., Schader, M., Weihs, C., Decker, Reinhold, and Lenz, Hans -J.
- Abstract
This paper analyzes and examines the general ability of Support Vector Machine (SVM) models to correctly predict and trade daily EUR exchange rate directions. Seven models with varying kernel functions are considered. Each SVM model is benchmarked against traditional forecasting techniques in order to ascertain its potential value as out-of-sample forecasting and quantitative trading tool. It is found that hyperbolic SVMs perform well in terms of forecasting accuracy and trading results via a simulated strategy. This supports the idea that SVMs are promising learning systems for coping with nonlinear classification tasks in the field of financial time series applications. [ABSTRACT FROM AUTHOR]
- Published
- 2007
- Full Text
- View/download PDF
79. Credit Risk of Collaterals: Examining the Systematic Linkage between Insolvencies and Physical Assets in Germany.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Optiz, O., Ritter, G., Schader, M., Weihs, C., Decker, Reinhold, and Lenz, Hans -J.
- Abstract
According to the new capital adequacy framework (Basel II), the Basel Committee on Banking Supervision (BCBS) strongly advises banks to investigate the relationship between default rates and the values of collaterals of secured loan portfolios. This is because the values of collaterals are expected to decline as defaults rise. However, the literature on modelling and examining this effect is rather sparse. Therefore, we present a framework based on the Internal Ratings Based (IRB) approach of Basel II in order to examine such dependencies using standard econometric tests. We apply it to insolvency rates and empirical data for physical assets. [ABSTRACT FROM AUTHOR]
- Published
- 2007
- Full Text
- View/download PDF
80. On Goal Reaching Time Distributions Estimated from DAX Stock Index Investments.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Optiz, O., Ritter, G., Schader, M., Weihs, C., Decker, Reinhold, and Lenz, Hans -J.
- Abstract
This research paper analyzes the distributional properties of stock index time series data from a new perspective, that is, time optimal decision making building on the conceptual foundation of the time optimal approach to portfolio selection introduced by Burkhardt. In this approach, the investor's goal is to reach a predefined level of wealth as soon as possible. We investigate the empirical properties of the goal reaching times for DAX stock index investments for various levels of aspired wealth, compare the observed properties to those expected by the Inverse Gaussian distributional model, investigate the use of overlapping instead of independent goal reaching times, and highlight some methodological issues involved in the empirical analysis. The results are of immediate interest to investors. [ABSTRACT FROM AUTHOR]
- Published
- 2007
- Full Text
- View/download PDF
81. A Model of Rational Choice Among Distributions of Goal Reaching Times.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Optiz, O., Ritter, G., Schader, M., Weihs, C., Decker, Reinhold, and Lenz, Hans -J.
- Abstract
This research note develops a theory of rational choice among distributions of goal reaching times. A motivation for the choice problem considered here is the time optimal approach to portfolio selection by Burkhardt, which considers an investor who is interested in reaching a predefined level of wealth and whose preferences can be defined on the feasible probability distributions of the time at which this goal is reached for the first time. Here a more general choice problem is considered, called time optimal decision making. The decision maker is faced with a set of mutually exclusive actions, each of which provides a known distribution of goal reaching times. It is shown that the axiomatic approach to rational choice of von Neumann/Morgenstern can be applied to obtain, once again, an expected utility representation for the preferences under consideration. This result provides a rational foundation not only for time optimal decision making, and particularly time optimal portfolio selection, but also for the analysis of time preferences in a stochastic setting, an approach which, to the best of the author's knowledge, is completely new to the literature. Prime areas of application are decision analysis, portfolio selection, the analysis of saving plans and the development of new financial products. [ABSTRACT FROM AUTHOR]
- Published
- 2007
- Full Text
- View/download PDF
82. On the Notions and Properties of Risk and Risk Aversion in the Time Optimal Approach to Decision Making.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Optiz, O., Ritter, G., Schader, M., Weihs, C., Decker, Reinhold, and Lenz, Hans -J.
- Abstract
This research proposes and discusses proper notions of risk and risk aversion for risks in the dimension of time, which are suitable for the analysis of time optimal decision making according to Burkhardt. The time optimal approach assumes a decision maker with a given goal, e.g. a given future wealth level, that he would like to reach as early as possible. To reach the goal, he can choose from a set of mutually exclusive actions, e.g. risky investments, for which the respective probability distributions of the goal reaching times are known. Our notions of risk and risk aversion are new and derived based on a rational model of choice. They yield intuitively appealing results which are exemplified by an application to an insurance problem. Furthermore, we investigate the choice implications of positive, zero, and negative risk aversion by means of a new St. Petersburg Game. The results indicate that in the time optimal approach nonnegative risk aversion would generally result in counterintuitive choices, whereas negative risk aversion shows the potential to imply plausible choices. [ABSTRACT FROM AUTHOR]
- Published
- 2007
- Full Text
- View/download PDF
83. Multilevel Dimensions of Consumer Relationships in the Healthcare Service Market: M-L IRT vs. M-L SEM Approach.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Optiz, O., Ritter, G., Schader, M., Weihs, C., Decker, Reinhold, and Lenz, Hans -J.
- Abstract
The aim of the paper is to compare two measurement models, a multilevel IRT model and a multilevel SEM model, of patient-physician relationships. These relationships are nested in the institutional context of healthcare units. A Likert-type scale was developed and the nature of the constructs discussed. This scale was applied at the individual (patient) as well as the institutional (unit) level, along with a between-level variable that describes cluster-specific characteristics. CTT and IRT multilevel random intercept models are discussed. [ABSTRACT FROM AUTHOR]
- Published
- 2007
- Full Text
- View/download PDF
84. Women's Occupational Mobility and Segregation in the Labour Market: Asymmetric Multidimensional Scaling.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Optiz, O., Ritter, G., Schader, M., Weihs, C., Decker, Reinhold, and Lenz, Hans -J.
- Abstract
The aim of this paper is to examine career dynamics among women using asymmetric multidimensional scaling. Based upon a national sample in Japan in 1995, we analyze the occupational mobility tables of 1405 women among nine occupational categories obtained at various time periods. We find that asymmetric career changes within similar occupational categories frequently take place in one's 30s or later. Women, unlike men, appear to change their occupational status in mid-career. [ABSTRACT FROM AUTHOR]
- Published
- 2007
- Full Text
- View/download PDF
85. Balanced Scorecard Simulator — A Tool for Stochastic Business Figures.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Optiz, O., Ritter, G., Schader, M., Weihs, C., Decker, Reinhold, and Lenz, Hans -J.
- Abstract
Long-term business success is highly dependent on how fast a business reacts to changes in the market situation. Those who want to be successful need relevant, timely and accurate information. The Balanced Scorecard Simulator is a management tool that can be used efficiently in the processes of planning, decision making and controlling. Based on the Balanced Scorecard concept, the program combines imprecise data on business figures with forward and backward computation. It is also possible to find out whether or not the data are consistent with the BSC model. The visualization of the simulation results is done with a Kiviat diagram. The aim of the design is a software tool that is based on a BSC model and MCMC methods but is still easy to handle. [ABSTRACT FROM AUTHOR]
- Published
- 2007
- Full Text
- View/download PDF
86. Integration of Customer Value into Revenue Management.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Optiz, O., Ritter, G., Schader, M., Weihs, C., Decker, Reinhold, and Lenz, Hans -J.
- Abstract
This paper studies how information related to the customer value can be incorporated into the decision on the acceptance of booking requests. Information requirements are derived from the shortcomings of transaction-oriented revenue management and sources of this information are identified in the booking process. Afterwards, information related to customer value is integrated into the network approach of inventory control. [ABSTRACT FROM AUTHOR]
- Published
- 2007
- Full Text
- View/download PDF
87. Building on the Arules Infrastructure for Analyzing Transaction Data with R.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Optiz, O., Ritter, G., Schader, M., Weihs, C., Decker, Reinhold, and Lenz, Hans -J.
- Abstract
The free and extensible statistical computing environment R, with its enormous number of extension packages, already provides many state-of-the-art techniques for data analysis. Support for association rule mining, a popular exploratory method which can be used, among other purposes, for uncovering cross-selling opportunities in market baskets, has recently become available with the R extension package arules. After a brief introduction to transaction data and association rules, we present the formal framework implemented in arules and demonstrate how clustering and association rule mining can be applied together using a market basket data set from a typical retailer. This paper shows that implementing a basic infrastructure with formal classes in R provides an extensible basis which can be employed very efficiently for developing new applications (such as clustering transactions) in addition to association rule mining. [ABSTRACT FROM AUTHOR] (An illustrative sketch follows this record.)
- Published
- 2007
- Full Text
- View/download PDF
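The entry above is about the R package arules; as a language-neutral illustration of the quantities it mines, here is a plain-Python sketch that enumerates single-item association rules with their support and confidence over a toy basket data set. The transactions and thresholds are invented, and the arules API itself is deliberately not imitated.

```python
from itertools import combinations

# Toy market-basket data: each transaction is the set of items bought together.
transactions = [{"bread", "butter"}, {"bread", "milk"},
                {"bread", "butter", "milk"}, {"milk", "butter"}]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Enumerate rules {a} -> {b} between single items and report the two
# standard measures (support and confidence) above arbitrary thresholds.
items = sorted(set().union(*transactions))
for a, b in combinations(items, 2):
    for lhs, rhs in ((a, b), (b, a)):
        supp = support({lhs, rhs})
        conf = supp / support({lhs})
        if supp >= 0.5 and conf >= 0.6:
            print(f"{{{lhs}}} -> {{{rhs}}}  support={supp:.2f}  confidence={conf:.2f}")
```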
88. Pricing Energy in a Multi-Utility Market.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Optiz, O., Ritter, G., Schader, M., Weihs, C., Decker, Reinhold, and Lenz, Hans -J.
- Abstract
We present a solution to the problem of tariff design for an energy supplier (utility). The tariffs for electricity and — optionally — heat created with our pricing model are optimal in terms of the utility's profit and take into account the consumers' predicted behavior, their load curve, the utility's generation prices, and prices for trading electricity on a day-ahead market like the European Energy Exchange (EEX). Furthermore, we analyze the repercussions of different assumptions about consumer behavior on a simulated market with four competing utilities. Consumer demand is modeled using an attraction model that reflects consumer inertia. Consumers will not always change their supplier, even if the total energy bill could be reduced by doing so: First, motivation to search for lower prices and to actually switch one's supplier is low, given the small possible savings. Second, legal constraints may demand a minimal contract duration in some countries. The resulting nonlinear profit optimization problem of the suppliers is solved with a genetic algorithm. By varying the attraction parameters and thus representing different degrees of inertia, we observe different developments of the market. [ABSTRACT FROM AUTHOR]
- Published
- 2007
- Full Text
- View/download PDF
89. Disproportionate Samples in Hierarchical Bayes CBC Analysis.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Optiz, O., Ritter, G., Schader, M., Weihs, C., Decker, Reinhold, and Lenz, Hans -J.
- Abstract
Empirical surveys frequently make use of conjoint data records in which respondents can be split up into segments of different sizes. A lack of knowledge of how to handle such random samples when using Hierarchical Bayes regression gave cause for a more detailed examination of the precision of the estimation results. The study at hand comprises a survey of the effects of disproportionate random samples on the calculation of part-worths in choice-based conjoint analyses. An explorative simulation using artificial data demonstrates that disproportionate segment sizes have mostly negative effects on the goodness of part-worth estimation when applying Hierarchical Bayes regression. These effects vary depending on the degree of disproportion. This finding was obtained by introducing a quality criterion designed to compare true and estimated part-worths, which is applied to a flexible range of sample structures. Following the simulation, recommendations are given on how best to handle disproportionate data samples. [ABSTRACT FROM AUTHOR]
- Published
- 2007
- Full Text
- View/download PDF
90. Classification in Marketing Research by Means of LEM2-generated Rules.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Optiz, O., Ritter, G., Schader, M., Weihs, C., Lenz, Hans -J., and Decker, Reinhold
- Abstract
The vagueness and uncertainty of data are a frequent problem in marketing research. Since rough sets have already proven their usefulness in dealing with such data in important domains like medicine and image processing, the question arises whether they are a useful concept for marketing as well. Against this background we investigate the rough set theory-based LEM2 algorithm as a classification tool for marketing research. Its performance is demonstrated by means of synthetic as well as real-world marketing data. Our empirical results provide evidence that the LEM2 algorithm deserves more attention in marketing research than it has received so far. [ABSTRACT FROM AUTHOR]
- Published
- 2007
- Full Text
- View/download PDF
91. Adaptive Conjoint Analysis for Pricing Music Downloads.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Optiz, O., Ritter, G., Schader, M., Weihs, C., Decker, Reinhold, and Lenz, Hans -J.
- Abstract
Finding the right pricing for music downloads is of great importance to the recording industry and music download service providers. For the recently introduced music downloads, reference prices are still developing, and finding a revenue-maximizing pricing scheme is a challenging task. The most commonly used approach is to employ linear pricing (e.g., iTunes, musicload). Lately, subscription models have emerged, offering their customers unlimited access to streaming music for a monthly fee (e.g., Napster, RealNetworks). However, other pricing strategies could also be used, such as quantity rebates starting at certain download volumes. Research has been done in this field, and Buxmann et al. (2005) have shown that price cuts can improve revenue. In this paper we apply different approaches to estimate consumers' willingness to pay (WTP) for music downloads and compare our findings with the pricing strategies currently used in the market. To make informed decisions about pricing, knowledge about the consumers' WTP is essential. Three approaches based on adaptive conjoint analysis to estimate the WTP for bundles of music downloads are compared. Two of the approaches are based on a status-quo product (at market price and, alternatively, at an individually self-stated price); the third approach uses a linear model assuming a fixed utility per title. All three methods seem to be robust and deliver reasonable estimates of the respondents' WTPs. However, all but the linear model need an externally set price for the status-quo product, which can introduce a bias. [ABSTRACT FROM AUTHOR]
- Published
- 2007
- Full Text
- View/download PDF
92. Improving the Probabilistic Modeling of Market Basket Data.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Optiz, O., Ritter, G., Schader, M., Weihs, C., Decker, Reinhold, and Lenz, Hans -J.
- Abstract
Current approaches to market basket simulation neglect the fact that empty transactions are typically not recorded and therefore should not occur in simulated data. This paper suggests how the simulation framework without associations can be extended to avoid empty transactions and explores the possible consequences for several measures of interestingness used in association rule filtering. [ABSTRACT FROM AUTHOR] (An illustrative sketch follows this record.)
- Published
- 2007
- Full Text
- View/download PDF
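A sketch of the effect discussed in the entry above: if per-item purchases are simulated independently but empty baskets are discarded (mirroring the fact that empty transactions are never recorded), the observed item supports are inflated relative to the underlying probabilities. The probabilities and sample size are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.05, 0.10, 0.02])  # hypothetical independent purchase probabilities

def simulate_nonempty_basket(probabilities):
    """Draw one basket; redraw until it is non-empty, since empty
    transactions would never show up in recorded point-of-sale data."""
    while True:
        basket = rng.random(len(probabilities)) < probabilities
        if basket.any():
            return basket

baskets = np.array([simulate_nonempty_basket(p) for _ in range(10_000)])
print(baskets.mean(axis=0))      # observed supports, inflated relative to p
print(p / (1 - np.prod(1 - p)))  # what conditioning on non-empty baskets predicts
```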
93. Classification of Reference Models.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Optiz, O., Ritter, G., Schader, M., Weihs, C., Decker, Reinhold, and Lenz, Hans -J.
- Abstract
The usefulness of classifications for reuse, especially for the selection of reference models, is emphasised in the literature. Nevertheless, an empirical classification of reference models using formal cluster analysis methods is still an open issue. In this paper a cluster analysis is applied to the latest and largest freely available reference model catalogue. As a result, based on at least 9 selected variables, three different clusters of reference models could be identified (practitioner reference models, scientific business process reference models and scientific multi-view reference models). Important implications of the result are: better documentation is generally needed to improve the reusability of reference models, and there is a gap between scientific insights (regarding the usefulness of multi-view reference models and regarding the usefulness of concepts for reuse and customisation) and their application as well as tool support in practice. [ABSTRACT FROM AUTHOR]
- Published
- 2007
- Full Text
- View/download PDF
94. Heterogeneity in Preferences for Odd Prices.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Optiz, O., Ritter, G., Schader, M., Weihs, C., Decker, Reinhold, and Lenz, Hans -J.
- Abstract
The topic of odd pricing has attracted many researchers in marketing. Most empirical applications in this field have been conducted on the aggregate level, thereby assuming homogeneity in consumer response to odd pricing. This paper provides the first empirical study to measure odd pricing effects at the individual consumer level. We use a Hierarchical Bayes mixture of normals model to estimate individual part-worths in a conjoint experiment, and demonstrate that preferences of consumers for odd and even prices can be very heterogeneous. Methodologically, our study offers new insights concerning the selection of the appropriate number of components in a continuous mixture model. [ABSTRACT FROM AUTHOR]
- Published
- 2007
- Full Text
- View/download PDF
95. Investigating Unstructured Texts with Latent Semantic Analysis.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Optiz, O., Ritter, G., Schader, M., Weihs, C., Decker, Reinhold, and Lenz, Hans -J.
- Abstract
Latent semantic analysis (LSA) is an algorithm applied to approximate the meaning of texts, thereby exposing semantic structure to computation. LSA combines the classical vector-space model, well known in computational linguistics, with a singular value decomposition (SVD), a two-mode factor analysis. Thus, bag-of-words representations of texts can be mapped into a modified vector space that is assumed to reflect semantic structure. In this contribution the authors describe the lsa package for the statistical language and environment R and illustrate its proper use through examples from the areas of automated essay scoring and knowledge representation. [ABSTRACT FROM AUTHOR] (An illustrative sketch follows this record.)
- Published
- 2007
- Full Text
- View/download PDF
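A compact sketch of the LSA pipeline described in the entry above, using scikit-learn instead of the R lsa package discussed in the chapter: a bag-of-words term-document matrix is reduced by a truncated SVD, mapping documents into a low-dimensional latent space. The corpus and the number of components are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "stocks fell as markets tumbled",
        "markets rallied and stocks rose"]

# Bag-of-words document-term matrix (the classical vector-space model) ...
X = CountVectorizer().fit_transform(docs)

# ... reduced by a truncated SVD: the core step of latent semantic analysis.
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_coords = lsa.fit_transform(X)  # documents in the latent semantic space
print(doc_coords)
```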
96. Collaborative Filtering Based on User Trends.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Optiz, O., Ritter, G., Schader, M., Weihs, C., Decker, Reinhold, and Lenz, Hans -J.
- Abstract
Recommender systems base their operation on past user ratings over a collection of items, for instance, books, CDs, etc. Collaborative Filtering (CF) is a successful recommendation technique. User ratings are not expected to be independent, as users follow trends of similar rating behavior. In terms of Text Mining, this is analogous to the formation of higher-level concepts from plain terms. In this paper, we propose a novel CF algorithm which uses Latent Semantic Indexing (LSI) to detect rating trends and performs recommendations according to them. Our results indicate its superiority over existing CF algorithms. [ABSTRACT FROM AUTHOR]
- Published
- 2007
- Full Text
- View/download PDF
97. Putting Successor Variety Stemming to Work.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Optiz, O., Ritter, G., Schader, M., Weihs, C., Decker, Reinhold, and Lenz, Hans -J.
- Abstract
Stemming algorithms find canonical forms for inflected words, e.g. for declined nouns or conjugated verbs. Since such a unification of words with respect to gender, number, tense, and case is a language-specific issue, stemming algorithms operationalize a set of linguistically motivated rules for the language in question. The most well-known rule-based algorithm for the English language is from [Porter (1980)]. The paper presents a statistical stemming approach which is based on the analysis of the distribution of word prefixes in a document collection and which is thus largely language-independent. In particular, our approach tackles the problem of index construction for multi-lingual documents. Related work on statistical stemming either focuses on stemming quality (such as [Bachin et al. (2002) or Bordag (2005)]) or investigates runtime performance ([Mayfield and McNamee (2003)] for example), but neither provides a reasonable tradeoff between the two. For selected retrieval tasks under vector-based document models we report new results on stemming quality and collection-size dependency. Interestingly, successor variety stemming has neither been investigated under similarity concerns for index construction nor is it applied as a technology in current retrieval applications. The results show that this disregard is not justified. [ABSTRACT FROM AUTHOR] (An illustrative sketch follows this record.)
- Published
- 2007
- Full Text
- View/download PDF
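For the entry above, a minimal sketch of the successor variety statistic itself: for each prefix of a word, count the distinct letters that follow it across a word list; peaks in this count suggest morpheme boundaries at which a stem may be cut. The toy lexicon is an assumption, and the paper's index construction and evaluation are not reproduced.

```python
lexicon = ["read", "reads", "reader", "readers", "reading", "red", "real"]

def successor_variety(word, corpus):
    """For every prefix of `word`, count how many distinct letters follow that
    prefix somewhere in `corpus`; a peak hints at a stem boundary."""
    varieties = []
    for i in range(1, len(word)):
        prefix = word[:i]
        successors = {w[i] for w in corpus if w.startswith(prefix) and len(w) > i}
        varieties.append((prefix, len(successors)))
    return varieties

# The count peaks after the prefix "read", suggesting "read" as the stem.
print(successor_variety("readers", lexicon))
```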
98. Plagiarism Detection Without Reference Collections.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Optiz, O., Ritter, G., Schader, M., Weihs, C., Decker, Reinhold, and Lenz, Hans -J.
- Abstract
Current research in the field of automatic plagiarism detection for text documents focuses on the development of algorithms that compare suspicious documents against potential original documents. Although recent approaches perform well in identifying copied or even modified passages ([Brin et al. (1995), Stein (2005)]), they assume a closed world where a reference collection must be given (Finkel (2002)). Recall that a human reader can identify suspicious passages within a document without having a library of potential original documents in mind. This raises the question of whether plagiarized passages within a document can be detected automatically if no reference is given, e.g. if the plagiarized passages stem from a book that is not available in digital form. This paper contributes exactly here: it proposes a method to identify potentially plagiarized passages by analyzing a single document with respect to changes in writing style. Such passages can then be used as a starting point for an Internet search for potential sources. In addition, such passages can be preselected for inspection by a human referee. Among other things, we present new style features that can be computed efficiently and which provide highly discriminative information. Our experiments, which are based on a test corpus that will be published, show encouraging results. [ABSTRACT FROM AUTHOR]
- Published
- 2007
- Full Text
- View/download PDF
99. Applying Clickstream Data Mining to Real-Time Web Crawler Detection and Containment Using ClickTips Platform.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Optiz, O., Ritter, G., Schader, M., Weihs, C., Decker, Reinhold, and Lenz, Hans -J.
- Abstract
The uncontrolled spread of Web crawlers has led to undesired situations of server overload and content misuse. Most programs still have legitimate and useful goals, but standard detection heuristics have not evolved along with Web crawling technology and are now unable to identify most of today's programs. In this paper, we propose an integrated approach to the problem that ensures the generation of up-to-date decision models, targeting both monitoring and clickstream differentiation. The ClickTips platform sustains Web crawler detection and containment mechanisms, and its data webhousing system is responsible for clickstream processing and further data mining. Web crawler detection and monitoring help preserve Web server performance and Web site privacy, and differentiated clickstream analysis provides focused reporting and interpretation of navigational patterns. The generation of up-to-date detection models is based on clickstream data mining and targets not only well-known Web crawlers but also camouflaged and previously unknown programs. Experiments with different real-world Web sites are encouraging, showing that the approach is not only feasible but also adequate. [ABSTRACT FROM AUTHOR]
- Published
- 2007
- Full Text
- View/download PDF
100. Canonical Forms for Frequent Graph Mining.
- Author
-
Bock, H. -H., Gaul, W., Vichi, M., Arabie, Ph., Baier, D., Critchley, F., Decker, R., Diday, E., Greenacre, M., Lauro, C., Meulman, J., Monari, P., Nishisato, S., Ohsumi, N., Optiz, O., Ritter, G., Schader, M., Weihs, C., Decker, Reinhold, and Lenz, Hans -J.
- Abstract
A core problem of approaches to frequent graph mining, which are based on growing subgraphs into a set of graphs, is how to avoid redundant search. A powerful technique for this is a canonical description of a graph, which uniquely identifies it, and a corresponding test. I introduce a family of canonical forms based on systematic ways to construct spanning trees. I show that the canonical form used in gSpan ([Yan and Han (2002)]) is a member of this family, and that MoSS/MoFa ([Borgelt and Berthold (2002), Borgelt et al. (2005)]) is implicitly based on a different member, which I make explicit and exploit in the same way. [ABSTRACT FROM AUTHOR]
- Published
- 2007
- Full Text
- View/download PDF