1. Improving the Accuracy of Automated Occupation Coding at Any Production Rate
- Author
Michael Blohm, Hyukjun Gweon, Stefan H. Steiner, Matthias Schonlau, and Lars Kaczmirek
- Subjects
Official statistics, Statistical learning, Computer science, Machine learning, k-nearest neighbors algorithm, Respondent, Artificial intelligence, Data mining, Coding (social sciences), Production rate
- Abstract
Occupation coding, an important task in official statistics, refers to coding a respondent's text answer into one of many hundreds of occupation codes. To date, occupation coding is still at least partially conducted manually, at great expense. We propose two new methods for automatic coding: a hybrid method that combines a rule-based approach based on duplicates with a statistical learning algorithm, and a modified nearest neighbor approach. Using data from the German General Social Survey (ALLBUS), we show that both methods improve on the coding accuracy of the underlying statistical learning algorithm and, where duplicates exist, on the coding accuracy of duplicates. We also find that statistical learning is improved by combining separate models for the detailed occupation codes and for aggregate occupation codes. Further, we find that defining duplicates based on n-gram variables (a concept from text mining) is preferable to defining them based on exact string matches.
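To make the idea of n-gram-based duplicates concrete, the following is a minimal sketch, not the authors' implementation. It assumes word unigrams as the n-gram variables, a majority vote among duplicates, and a placeholder `fallback` callable standing in for the statistical learning algorithm; the function names, the example answers, and the codes are all hypothetical.

```python
import re
from collections import Counter, defaultdict


def ngram_key(text, n=1):
    """Return the set of word n-grams in a text answer (case-insensitive).

    Two answers with the same set count as duplicates, even if the
    raw strings differ in case, punctuation, or (for n=1) word order.
    """
    tokens = re.findall(r"\w+", text.lower())
    return frozenset(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def build_duplicate_index(answers, codes, key=ngram_key):
    """Map each n-gram key to a tally of the codes seen in training data."""
    index = defaultdict(Counter)
    for answer, code in zip(answers, codes):
        index[key(answer)][code] += 1
    return index


def hybrid_code(answer, index, fallback, key=ngram_key):
    """Use the most frequent code among duplicates if any exist;
    otherwise defer to the fallback model (assumed, not specified here)."""
    votes = index.get(key(answer))
    if votes:
        return votes.most_common(1)[0][0]
    return fallback(answer)


# Hypothetical training answers and codes, for illustration only.
train_answers = ["registered nurse", "Nurse, registered", "truck driver"]
train_codes = ["A1", "A1", "B7"]
index = build_duplicate_index(train_answers, train_codes)

# An exact string match would miss this answer; the n-gram key finds it.
print(hybrid_code("REGISTERED nurse", index, fallback=lambda a: "uncoded"))
print(hybrid_code("airline pilot", index, fallback=lambda a: "uncoded"))
```

In this sketch, an answer without any duplicate in the training data is passed to the fallback, which is where a trained classifier would plug in. How the paper actually combines the duplicate rule with the learner may differ from this simple either/or scheme.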
- Published
- 2016