1. Machine learning for cross-gazetteer matching of natural features
- Author
- Michele Volpi, Ross S. Purves, Elise Acheson, University of Zurich
- Subjects
Matching (statistics); Geospatial analysis; Geography, Planning and Development; Feature extraction; 0211 other engineering and technologies; 0507 social and economic geography; 02 engineering and technology; Library and Information Sciences; 1710 Information Systems; Machine learning; Domain (software engineering); Task (project management); 3305 Geography, Planning and Development; ontology; gazetteer matching; 910 Geography & travel; natural features; 021101 geological & geomatics engineering; Geography; record linking; 05 social sciences; Pipeline (software); Random forest; 10122 Institute of Geography; Feature (computer vision); feature types; 3309 Library and Information Sciences; Artificial intelligence; 050703 geography; Information Systems
- Abstract
Defining and identifying duplicate records in a dataset is a challenging task which grows more complex when the modeled entities themselves are hard to delineate. In the geospatial domain, it may not be clear where a mountain, stream, or valley ends and begins, a problem carried over when such entities are catalogued in gazetteers. In this paper, we take two gazetteers, GeoNames and SwissNames3D, and perform matching – identifying records in each that are about the same entity – across a sample of natural feature records. We first perform rule-based matching, establishing competitive results, then apply machine learning using Random Forests, a method well-suited to the matching task. We report on the performance of a wider array of matching features than has been previously studied, including domain-specific ones such as feature type, land cover class, and elevation. Our results show an increase in performance using machine learning over rules, with a notable performance gain from considering feature types, but negligible gains from other specialized matching features. We argue that future work in this area should strive to be more reproducible and report results on a realistic testing pipeline including candidate selection, feature extraction, and classification.
- Published
- 2019