1. Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
- Author
- Kreutzer, Julia, Caswell, Isaac, Wang, Lisa, Wahab, Ahsan, van Esch, Daan, Ulzii-Orshikh, Nasanbayar, Tapo, Allahsera, Subramani, Nishant, Sokolov, Artem, Sikasote, Claytone, Setyawan, Monang, Sarin, Supheakmungkol, Samb, Sokhar, Sagot, Benoît, Rivera, Clara, Rios, Annette, Papadimitriou, Isabel, Osei, Salomey, Suarez, Pedro Ortiz, Orife, Iroro, Ogueji, Kelechi, Rubungo, Andre Niyongabo, Nguyen, Toan Q., Müller, Mathias, Müller, André, Muhammad, Shamsuddeen Hassan, Muhammad, Nanda, Mnyakeni, Ayanda, Mirzakhalov, Jamshidbek, Matangira, Tapiwanashe, Leong, Colin, Lawson, Nze, Kudugunta, Sneha, Jernite, Yacine, Jenny, Mathias, Firat, Orhan, Dossou, Bonaventure F. P., Dlamini, Sakhile, de Silva, Nisansa, Ballı, Sakine Çabuk, Biderman, Stella, Battisti, Alessia, Baruwa, Ahmed, Bapna, Ankur, Baljekar, Pallavi, Azime, Israel Abebe, Awokoya, Ayodele, Ataman, Duygu, Ahia, Orevaoghene, Ahia, Oghenefego, Agrawal, Sweta, Adeyemi, Mofetoluwa, Google Inc., Masakhane NLP, Turkic Interlingua (TIL), Computer Science Department [Haverford], Haverford College, RobotsMali, Intel Labs Berkeley, Intel Corporation [USA], University of Zambia [Lusaka] (UNZA), African Institute for Mathematical Sciences (AIMS), Automatic Language Modelling and ANAlysis & Computational Humanities (ALMAnaCH), Inria de Paris, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria), Universität Zürich [Zürich] = University of Zurich (UZH), Stanford University, Kwame Nkrumah University of Science and Technology (KNUST), Sorbonne Université (SU), University of Waterloo [Waterloo], University of Electronic Science and Technology of China [Chengdu] (UESTC), University of Notre Dame [Indiana] (UND), Bayero University Kano (BUK), University of South Florida [Tampa] (USF), Hugging Face, Jacobs University [Bremen], University of Moratuwa, EleutherAI, Obafemi Awolowo University (OAU), University of Ibadan, InstaDeep, University of Maryland [Baltimore], Defence Space Administration [Abuja], ANR-18-CE38-0003,BASNUM,Numérisation et analyse du Dictionnaire universel de Basnage de Beauval: lexicographie et réseaux scientifiques(2018), Kwame Nkrumah University of Science and Technology [GHANA] (KNUST), University of Electronic Science and Technology of China (UESTC), and University of Moratuwa [Moratuwa]
- Subjects
FOS: Computer and information sciences, Human-Computer Interaction, Linguistics and Language, Computer Science - Computation and Language, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Artificial Intelligence, Communication, ComputingMethodologies_DOCUMENTANDTEXTPROCESSING, Computation and Language (cs.CL), [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL], Computer Science Applications
- Abstract
With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data releases.
Comment: Accepted at TACL; pre-MIT Press publication version
- Published
- 2022