
Token-based spelling variant detection in Middle Low German texts.

Authors :
Barteld, Fabian
Biemann, Chris
Zinsmeister, Heike
Source :
Language Resources & Evaluation; Dec 2019, Vol. 53, Issue 4, pp. 677-706 (30 pages)
Publication Year :
2019

Abstract

In this paper we present a pipeline for the detection of spelling variants, i.e., different spellings that represent the same word, in non-standard texts. For example, in Middle Low German texts "in" and "ihn" (among others) are potential spellings of a single word, the personal pronoun 'him'. Spelling variation is usually addressed by normalization, in which non-standard variants are mapped to a corresponding standard variant, e.g. the Modern German word "ihn" in the case of "in". However, the approach to spelling variant detection presented here does not need such a reference to a standard variant and can therefore be applied to data for which a standard variant is missing. The pipeline we present first generates spelling variants for a given word using rewrite rules and surface similarity. Afterwards, the generated types are filtered. We present a new filter that works on the token level, i.e., taking the context of a word into account. Through this mechanism ambiguities on the type level can be resolved. For instance, the Middle Low German word "in" can not only be the personal pronoun 'him', but also the preposition 'in', and each of these has different variants. The detected spelling variants can be used in two settings for Digital Humanities research: On the one hand, they can be used to facilitate searching in non-standard texts. On the other hand, they can be used to improve the performance of natural language processing tools on the data by reducing the number of unknown words. To evaluate the utility of the pipeline in both applications, we present two evaluation settings and evaluate the pipeline on Middle Low German texts. We were able to improve the F1 score compared with previous work from 0.39 to 0.52 for the search setting and from 0.23 to 0.30 when detecting spelling variants of unknown words. [ABSTRACT FROM AUTHOR]
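
The two-stage idea in the abstract (generate candidate variants at the type level, then filter them at the token level using context) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the rewrite rules are omitted, surface similarity is approximated with the standard library's difflib ratio, and the token-level filter is replaced by a simple context-overlap check. The function names, the threshold, the made-up type "inn", and the placeholder context tokens "w1", "w2", "w9" are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Surface similarity in [0, 1] (difflib ratio, a stand-in measure)."""
    return SequenceMatcher(None, a, b).ratio()


def generate_candidates(word: str, vocabulary: set[str],
                        threshold: float = 0.6) -> list[str]:
    """Type level: return vocabulary types that look similar to `word`."""
    return sorted(
        t for t in vocabulary
        if t != word and similarity(word, t) >= threshold
    )


def filter_by_context(candidates: list[str], context: list[str],
                      contexts_by_type: dict[str, set[str]]) -> list[str]:
    """Token level: keep candidates that share a context word with the token.

    `contexts_by_type` maps each type to words observed next to it.
    This overlap test is a hypothetical stand-in for a real filter.
    """
    ctx = set(context)
    return [c for c in candidates if ctx & contexts_by_type.get(c, set())]


# Toy data: "ihn" comes from the abstract; "inn" and the w* tokens are invented.
vocabulary = {"in", "ihn", "inn", "unde"}
contexts_by_type = {"ihn": {"w1"}, "inn": {"w9"}}

candidates = generate_candidates("in", vocabulary)
kept = filter_by_context(candidates, ["w1", "w2"], contexts_by_type)
print(candidates, kept)
```

In this toy setup the type-level step proposes both similar types for "in", and the context of the specific token then narrows the set, which mirrors how a token-level filter can resolve an ambiguity that is invisible at the type level.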

Details

Language :
English
ISSN :
1574-020X
Volume :
53
Issue :
4
Database :
Complementary Index
Journal :
Language Resources & Evaluation
Publication Type :
Academic Journal
Accession number :
139882003
Full Text :
https://doi.org/10.1007/s10579-018-09441-5