Automatically extracted parallel corpora enriched with highly useful metadata? A Wikipedia case study combining machine learning and social technology.
- Source :
- Digital Scholarship in the Humanities. Apr 2021, Vol. 36, Issue 1, p1-15. 15p.
- Publication Year :
- 2021
Abstract
- The extraction of large amounts of multilingual parallel text from web resources is a widely used technique in natural language processing. However, automatically collected parallel corpora usually lack precise metadata, which are crucial to accurate data analysis and interpretation. The combination of automated extraction procedures and manual metadata enrichment may help address this issue. Wikipedia is a promising candidate for the exploration of the potential of said combination of methods because it is a rich source of translations in a large number of language pairs and because its open and collaborative nature makes it possible to identify and contact the users who produce translations. This article tests to what extent translated texts automatically extracted from Wikipedia by means of neural networks can be enriched with pertinent metadata through a self-submission-based user survey. Special emphasis is placed on data usefulness, defined in terms of a catalogue of previously established assessment criteria, most prominently metadata quality. The results suggest that from a quantitative perspective, the proposed methodology is capable of capturing metadata otherwise not available. At the same time, the crowd-based collection of data and metadata may face important technical and social limitations. [ABSTRACT FROM AUTHOR]
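The abstract summarizes, rather than details, the extraction step. Purely as an illustration of what neural parallel-sentence mining from Wikipedia language versions can look like, the Python sketch below pairs sentences from two language editions of an article using a multilingual sentence encoder. It is not the authors' pipeline: the LaBSE model, the cosine-similarity threshold, and the greedy one-best pairing are assumptions made for the example.

```python
# Illustrative sketch only: mining candidate parallel sentences from two
# language versions of a Wikipedia article with a multilingual sentence
# encoder. Model choice, threshold, and pairing strategy are assumptions,
# not the method described in the article.
from sentence_transformers import SentenceTransformer, util

# LaBSE is one commonly used multilingual encoder for bitext mining.
model = SentenceTransformer("sentence-transformers/LaBSE")

def mine_parallel_sentences(src_sentences, tgt_sentences, threshold=0.8):
    """Return (source, target, score) triples whose embeddings are
    cosine-similar above `threshold` (an arbitrary placeholder value)."""
    src_emb = model.encode(src_sentences, convert_to_tensor=True)
    tgt_emb = model.encode(tgt_sentences, convert_to_tensor=True)
    scores = util.cos_sim(src_emb, tgt_emb)  # source x target similarity matrix

    pairs = []
    for i, row in enumerate(scores):
        j = int(row.argmax())                # best-matching target for source i
        if float(row[j]) >= threshold:
            pairs.append((src_sentences[i], tgt_sentences[j], float(row[j])))
    return pairs

# Toy example: sentences from the English and German versions of one article.
english = ["Wikipedia is a free online encyclopedia.",
           "It is edited collaboratively by volunteers."]
german = ["Wikipedia ist eine freie Online-Enzyklopädie.",
          "Sie wird von Freiwilligen gemeinschaftlich bearbeitet."]

for src, tgt, score in mine_parallel_sentences(english, german):
    print(f"{score:.2f}\t{src}\t{tgt}")
```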
Details
- Language :
- English
- ISSN :
- 2055-768X
- Volume :
- 36
- Issue :
- 1
- Database :
- Academic Search Index
- Journal :
- Digital Scholarship in the Humanities
- Publication Type :
- Academic Journal
- Accession number :
- 150091611
- Full Text :
- https://doi.org/10.1093/llc/fqaa002