9 results on '"Klaus Pommerening"'
Search Results
2. Controlling false match rates in record linkage using extreme value theory
- Author
- Andreas Borg, Murat Sariyar, and Klaus Pommerening
- Subjects
Data cleansing; Biomedical Research; Databases, Factual; Calibration (statistics); Computer science; Health Informatics; computer.software_genre; Plot (graphics); Mean excess plot; Statistics; Registries; Extreme value theory; Linkage (software); Models, Statistical; Computational Biology; Fellegi–Sunter model; Mixture model; Generalized Pareto distribution; Computer Science Applications; Data quality; Statistics of extreme values; Database Management Systems; Medical Record Linkage; Data mining; computer; Algorithms; Medical Informatics; Record linkage
- Abstract
Cleansing data from synonyms and homonyms is a relevant task in fields where high quality of data is crucial, for example in disease registries and medical research networks. Record linkage provides methods for minimizing synonym and homonym errors, thereby improving data quality. We focus our attention on the case of homonym errors (in the following denoted as ‘false matches’), in which records belonging to different entities are wrongly classified as equal. Synonym errors (‘false non-matches’) occur when a single entity maps to multiple records in the linkage result. They are not considered in this study because in our application domain they are not as crucial as false matches. False match rates are frequently computed manually through a clerical review, that is, without modelling the distribution of the false match rates a priori. An exception is the work of Belin and Rubin (1995) [4]. They propose to estimate the false match rate by means of a normal mixture model that needs training data for a calibration process. In this paper we present a new approach for estimating the false match rate within the framework of Fellegi and Sunter by methods of Extreme Value Theory (EVT). This approach needs no training data for determining the threshold for matches and therefore leads to a significant cost reduction. After giving two different definitions of the false match rate, we present the tools of EVT used in this paper: the generalized Pareto distribution and the mean excess plot. Our experiments with real data show that the model works well, with only slightly lower accuracy compared to a procedure that has information about the match status and that maximizes the accuracy.
- Published
- 2011
- Full Text
- View/download PDF
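The abstract of entry 2 names the mean excess plot as one of its EVT tools. The following is a minimal sketch of the textbook empirical mean excess function, not the authors' implementation; the function names and the sample weights in the usage note are hypothetical. For data following a generalized Pareto distribution with shape parameter below 1, the mean excess is linear in the threshold, which is why a roughly linear stretch of the plot is commonly used to pick a threshold.

```python
from typing import List, Sequence, Tuple


def mean_excess(values: Sequence[float], threshold: float) -> float:
    """Empirical mean excess e(u): average of (x - u) over all x > u."""
    excesses = [x - threshold for x in values if x > threshold]
    if not excesses:
        raise ValueError("no observations exceed the threshold")
    return sum(excesses) / len(excesses)


def mean_excess_curve(values: Sequence[float]) -> List[Tuple[float, float]]:
    """Points (u, e(u)) of the mean excess plot.

    Thresholds run over the distinct observed values except the largest,
    so that at least one observation exceeds every threshold.
    """
    thresholds = sorted(set(values))[:-1]
    return [(u, mean_excess(values, u)) for u in thresholds]
```

Calling `mean_excess_curve` on a list of matching weights (for instance the made-up list `[1.0, 2.0, 3.0, 4.0, 10.0]`) yields the points one would plot and inspect for linearity.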
3. Leitfaden zum Datenschutz in medizinischen Forschungsprojekten: Generische Lösungen der TMF 2.0
- Author
- Johannes Drepper, Klaus Pommerening, Thomas Ganslandt, and Krister Helbing
- Subjects
Gynecology; Clinical study; 03 medical and health sciences; medicine.medical_specialty; 0302 clinical medicine; Political science; medicine; 030212 general & internal medicine; 030210 environmental & occupational health
- Abstract
The trust of patients and study participants is an indispensable prerequisite for the success of medical research projects, which cannot be carried out without the collection, long-term storage and analysis of clinical data and samples. Medical research today is predominantly networked and takes place in increasingly large research consortia. Accordingly, the importance of data protection and data security continues to grow. The TMF first published generic data protection concepts for medical research networks in 2003. On this basis, numerous research projects were able to develop and agree on their data protection concepts more quickly. The experience gained has been incorporated into the fundamental revision of the generic concepts. The new concept accommodates the complexity of medical research processes through a modular structure and has also been embedded in a comprehensive guideline.
- Published
- 2014
- Full Text
- View/download PDF
4. Missing values in deduplication of electronic patient data
- Author
- Andreas Borg, Murat Sariyar, and Klaus Pommerening
- Subjects
Computer science; media_common.quotation_subject; Inference; Health Informatics; Ambiguity; Patient data; Missing data; computer.software_genre; Research and Applications; Regression; Neoplasms; Statistics; Data deduplication; Electronic Health Records; Humans; Data mining; Imputation (statistics); Medical Record Linkage; Registries; computer; Record linkage; media_common
- Abstract
Data deduplication refers to the process in which records referring to the same real-world entities are detected in datasets such that duplicated records can be eliminated. The term ‘record linkage’ is used here for the same problem [1]. A typical application is the deduplication of medical registry data [2, 3]. Medical registries are institutions that collect medical and personal data in a standardized and comprehensive way. The primary aims are the creation of a pool of patients eligible for clinical or epidemiological studies and the computation of certain indices such as the incidence in order to oversee the development of diseases. The latter task in particular requires a database in which synonyms and homonyms do not distort the measures. For instance, synonyms would lead to an overestimation of the incidence and thereby possibly to false resource allocations. The record linkage procedure must itself be reliable and of high quality in order to achieve clean data (for measures regarding the quality of record linkage methods see also Christen and Goiser [4]). A number of other important works have also investigated record linkage [5–16]. Missing values in record linkage applications constitute serious problems in addition to the difficulties introduced by them in areas in which there is no necessity for computing comparison patterns. In settings such as survey analysis, missing values emerge, for example, due to missing responses or missing knowledge of the participants. Analyses based on the data gathered can be biased in this case because of unfilled fields; for example, higher wages are less likely to be revealed than lower ones. Papers that deal with missing values in survey analysis are, for example, those of Acock [17] and King et al. [18]. In contrast, in record linkage of electronic health records using personal data, the impact of missing values is augmented because they occur in comparison fields if any of the underlying fields has a missing value.
Therefore, missingness in record linkage applications with a significant number of NA values is not ignorable, i.e., not random. This non-randomness can also occur when blocking is applied in order to reduce the number of resulting record pairs: one or more features are selected as grouping variables and only pairs with agreement in these variables are considered. A comprehensive survey regarding blocking is given by Christen [19]. The distinction into missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR) of Little and Rubin [20] is only relevant as a starting point. An introduction to missing values in clinical trials based on these distinctions is given by Molenberghs and Kenward [21]. Ding and Simonoff [22] show that the Little/Rubin distinctions are unrelated to the accuracy of different missing-value treatments when classification trees are used at prediction time and the missingness is independent of the class value. This holds for three of the four evaluated datasets in our study (see next section). We give a short overview of the notions in Little and Rubin [20]: MCAR applies when the probability that a value of a variable is missing (NA) does not depend on the values of other observed or unobserved variables o and u, that is, P(NA | o, u) = P(NA); MAR is present when the probability of NA depends only on (the values of other) observed variables, that is, P(NA | o, u) = P(NA | o); MNAR means that P(NA | o, u) cannot be quantified without additional assumptions. The most used technique for dealing with missing values seems to be imputation, which means replacing every NA by a value estimated from the data available. Imputation can be point based or distribution based. In the latter case the (conditional) distribution of the missing value is calculated and predictions are based on this estimated distribution.
Multiple (or repeated) imputation generates several complete versions of the data that are combined for final inference in a statistical setting. For further information on this variant we refer to Little and Rubin [20]. There is no internationally published systematic approach to missing values in record linkage, as far as we know. Works such as the ones by McGlincy [23] or James [24] do not, as their titles might suggest, deal with the missing values in the matching attributes but with predicting matches as such. The former paper states that the ‘problem of missing links is similar to the problem of non-response in surveys’, which leaves missing values in matching attributes out of consideration. Our paper is meant to serve as the base for future work regarding missing values in record linkage. Relevant papers regarding classification trees with missing values are the papers of Ding and Simonoff [22] and Saar-Tsechansky and Provost [25]. The former work investigates six different approaches to missing values (probabilistic split, complete case method, grand mode/mean imputation, separate class, surrogate split, and complete variable method) and concludes that treating missing values as a separate class (in this paper: imputation with unique value 0.5) performs best when missingness is related to the response variable; otherwise results exhibit more ambiguity. The authors use real datasets and simulated datasets in which missing values are increased based on MCAR, MAR and MNAR sampling. Among others, they use a classification-tree induction algorithm that is used in this paper (i.e., classification and regression trees (CART); see Methods section). In the article by Saar-Tsechansky and Provost [25] a set of C4.5 classification trees induced on reduced sets of attributes (i.e., reduced-model classification) exhibits the best results.
For further information regarding the classification-tree induction approach C4.5 we refer to Salzberg [26]. This reduced-model classification is compared with predictive value imputation (e.g., surrogate-split mechanism in CART; see Methods section) and distribution-based imputation (e.g., sample-based induction; see Methods section) used by C4.5. Datasets with ‘naturally occurring’ missing values and with increased numbers of missing values (chosen at random: MCAR) were considered. The authors explicitly deal solely with missingness at prediction time. We want to tackle the induction time as well. This paper empirically studies the effect of different approaches for missing values on the accuracy in a record linkage setting in which classification trees are used for the classification of record pairs as match or non-match. Our main aim is to determine the best record linkage strategy on a large amount of real-world data as well as on data based on them in which NA values are manually increased. The number of data items considered in the evaluation is above five million, which is unusually large for classification-tree settings: datasets in Saar-Tsechansky and Provost [25] have at most 21 000 items, and Ding and Simonoff [22] perform classification with CART with at most 100 000 items (their implementation of CART cannot cope with more data at prediction time).
- Published
- 2011
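The abstract of entry 4 states that a comparison field is missing whenever either underlying field is missing, and mentions treating missing values as a separate class via imputation with the unique value 0.5. A minimal sketch of both ideas follows, assuming exact-equality comparison of string fields; the record layout, field names and function names are hypothetical and not taken from the paper.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def comparison_pattern(a: Record, b: Record, fields: List[str]) -> List[Optional[float]]:
    """Binary comparison pattern for a record pair.

    An entry is 1.0 on agreement, 0.0 on disagreement, and None (NA)
    as soon as either record lacks a value for that field.
    """
    pattern: List[Optional[float]] = []
    for field in fields:
        x, y = a.get(field), b.get(field)
        if x is None or y is None:
            pattern.append(None)
        else:
            pattern.append(1.0 if x == y else 0.0)
    return pattern


def impute_separate_class(pattern: List[Optional[float]], value: float = 0.5) -> List[float]:
    """Replace every NA by one fixed value so that NA acts as its own class."""
    return [value if entry is None else entry for entry in pattern]
```

A single missing birthday in one record is enough to make the birthday comparison NA, which illustrates why missingness accumulates in comparison patterns.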
5. Active learning strategies for the deduplication of electronic patient data using classification trees
- Author
- Murat Sariyar, Klaus Pommerening, and Andreas Borg
- Subjects
Active learning; Computer science; Active learning (machine learning); Information Storage and Retrieval; Context (language use); Health Informatics; Semi-supervised learning; Machine learning; computer.software_genre; Set (abstract data type); Artificial Intelligence; Bagging; Data deduplication; Electronic Health Records; Humans; business.industry; String (computer science); Decision Trees; Online machine learning; Computer Science Applications; Data mining; Artificial intelligence; Medical Record Linkage; String metric; business; computer; Algorithms
- Abstract
Highlights: Active learning for medical record linkage is used on a large data set. We compare a simple active learning strategy with a more sophisticated variant. The active learning method of Sarawagi and Bhamidipaty (2002) [6] is extended. We deliver insights into the variations of the results due to random sampling in the active learning strategies.
Introduction: Supervised record linkage methods often require a clerical review to gain informative training data. Active learning means to actively prompt the user to label data with special characteristics in order to minimise the review costs. We conducted an empirical evaluation to investigate whether a simple active learning strategy using binary comparison patterns is sufficient or if string metrics together with a more sophisticated algorithm are necessary to achieve high accuracies with a small training set.
Material and Methods: Based on medical registry data with different numbers of attributes, we used active learning to acquire training sets for classification trees, which were then used to classify the remaining data. Active learning for binary patterns means that every distinct comparison pattern represents a stratum from which one item is sampled. Active learning for patterns consisting of the Levenshtein string metric values uses an iterative process where the most informative and representative examples are added to the training set. In this context, we extended the active learning strategy by Sarawagi and Bhamidipaty (2002) [6].
Results: On the original data set, active learning based on binary comparison patterns leads to the best results. When dropping four or six attributes, using string metrics leads to better results. In both cases, not more than 200 manually reviewed training examples are necessary.
Conclusions: In record linkage applications where only forename, name and birthday are available as attributes, we suggest the sophisticated active learning strategy based on string metrics in order to achieve highly accurate results. We recommend the simple strategy if more attributes are available, as in our study. In both cases, active learning significantly reduces the amount of manual involvement in training data selection compared to usual record linkage settings.
- Published
- 2011
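The abstract of entry 5 describes the simple strategy as treating every distinct binary comparison pattern as a stratum from which one item is sampled for manual review. A minimal sketch of that sampling step, assuming uniform random choice within each stratum; the function name and the `seed` parameter are hypothetical, and the paper's actual procedure may differ in detail.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def sample_one_per_pattern(patterns: Sequence[Sequence[int]], seed: int = 0) -> List[int]:
    """Return the indices of one randomly chosen record pair per distinct pattern."""
    # Group the indices of all record pairs by their comparison pattern.
    strata: Dict[Tuple[int, ...], List[int]] = defaultdict(list)
    for index, pattern in enumerate(patterns):
        strata[tuple(pattern)].append(index)
    # Draw one index uniformly at random from every stratum.
    rng = random.Random(seed)
    return sorted(rng.choice(members) for members in strata.values())
```

The returned indices identify the record pairs to hand to a reviewer; the size of the training set then equals the number of distinct patterns in the data.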
6. THEMPO: a knowledge-based system for therapy planning in pediatric oncology
- Author
- M. Sergl, Klaus Pommerening, U. Nauerth, D. Schoppe, H.-M. Dittrich, and Robert Müller
- Subjects
Descriptive knowledge; medicine.medical_specialty; Medical Records Systems, Computerized; Bioinformatics, Medicine, Therapy planning, Electronic patient record, Protocol-directed care; Health Informatics; Expert Systems; Semantic network; Patient Care Planning; Knowledge-based systems; Neuroblastoma; Rule-based machine translation; Artificial Intelligence; Neoplasms; Antineoplastic Combined Chemotherapy Protocols; Computer Graphics; Medicine; Humans; Medical physics; Child; business.industry; Medical record; Precursor Cell Lymphoblastic Leukemia-Lymphoma; Combined Modality Therapy; Computer Science Applications; Treatment Outcome; Knowledge base; Therapy, Computer-Assisted; Systems architecture; Graph (abstract data type); Radiotherapy, Adjuvant; Artificial intelligence; ddc:004; business; Software
- Abstract
This article describes the knowledge-based system THEMPO (Therapy Management in Pediatric Oncology), which supports protocol-directed therapy planning and configuration in pediatric oncology. THEMPO provides a semantic network controlled by graph grammars to cover the different types of knowledge relevant in the domain, and offers a suite of acquisition tools for knowledge base authoring. Medical problem solvers, operating on the oncological network, reason about adequate therapeutic and diagnostic timetables for a patient. Furthermore, a corresponding patient record, also based on semantic networks and graph grammars, has been implemented to represent the course of therapy of an oncological patient.
- Published
- 1997
7. Observable radizielle Untergruppen von halbeinfachen algebraischen Gruppen
- Author
- Klaus Pommerening
- Subjects
510 Mathematics; General Mathematics; Humanities; Mathematics
- Abstract
Let G be an affine algebraic group defined over an algebraically closed field k of arbitrary characteristic. The observable subgroups of G are the subgroups that occur as stabilizers in rational representations of G. They were discussed in detail in [2] and [5]. It turned out that in general it is probably very difficult to decide whether a subgroup is observable. It is therefore worthwhile to find criteria. In [9] Sukhanov gave, for characteristic 0, a necessary and sufficient criterion for a “radizielle Untergruppe” of a semisimple algebraic group to be observable; here, by a “radizielle Untergruppe” I mean a connected closed subgroup that is normalized by a maximal torus. Sukhanov's criterion can easily be carried over to arbitrary characteristic and brought into a very convenient form (2.4). The really new result of this article is contained in § […]: an exact characterization of the smallest observable subgroup that contains a given “radizielle Untergruppe”. As an application, one can determine those homogeneous spaces of type “semisimple group modulo radizielle Untergruppe” that have only constant global functions. The notation follows [6].
- Published
- 1979
- Full Text
- View/download PDF
8. Über die unipotenten Klassen reduktiver Gruppen
- Author
- Klaus Pommerening
- Subjects
Algebra and Number Theory; Humanities; Mathematics
- Abstract
This paper is the continuation of [9]. It is proved that the classification theorem of Bala-Carter [1, 2] for the nilpotent elements of the Lie algebra of a semisimple group holds in arbitrary good characteristic. The notation and conventions of [9] remain in force, but reductive groups need not be connected.
- Published
- 1977
- Full Text
- View/download PDF
9. Ordered sets with the standardizing property and straightening laws for algebras of invariants
- Author
- Klaus Pommerening
- Subjects
Combinatorics; Mathematics(all); Class (set theory); Group action; Property (philosophy); General Mathematics; Law; Ordered set; Unipotent; Mathematics
- Abstract
In Math. Z. (176 (1981), 359–374) I explicitly determined the invariants of a certain class of unipotent group actions, and obtained a positive partial answer to Hilbert's 14th problem for nonreductive groups. The class of groups for which the method worked remained quite obscure. Theorem (4.2) of the present paper gives a precise description of the cases where the algebras of invariants are spanned by standard bitableaux, hence have a straightening law. The unipotent groups in question (“radizielle Untergruppen” of GLn) correspond, up to conjugation, to finite (partially) ordered sets. The promised description is done by properties of the ordered sets that are easy to test. This is another example where combinatorial methods are important for the theory of invariants.
- Published
- 1987
- Full Text
- View/download PDF
Catalog
Discovery Service for Jio Institute Digital Library