Author: "Masoud, Maraim" / Topic: applications - Searchworks@Jio Institute Digital Library Search Results

1. Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources

Author: McMillan-Major, Angelina; Alyafeai, Zaid; Biderman, Stella; Chen, Kimbo; De Toni, Francesco; Dupont, Gérard; Elsahar, Hady; Emezue, Chris; Aji, Alham Fikri; Ilić, Suzana; Khamis, Nurulaqilla; Leong, Colin; Masoud, Maraim; Soroa, Aitor; Suarez, Pedro Ortiz; Talat, Zeerak; van Strien, Daniel; Jernite, Yacine
Affiliations: Hugging Face; University of Washington [Seattle]; King Fahd University of Petroleum and Minerals (KFUPM); Booz Allen Hamilton Inc; EleutherAI; Independent researcher; The University of Western Australia (UWA); University of the Basque Country/Euskal Herriko Unibertsitatea (UPV/EHU); Automatic Language Modelling and ANAlysis & Computational Humanities (ALMAnaCH); Inria de Paris; Institut National de Recherche en Informatique et en Automatique (Inria); Sorbonne Université (SU); British Library
Funding: ANR-18-CE38-0003, BASNUM, Numérisation et analyse du Dictionnaire universel de Basnage de Beauval: lexicographie et réseaux scientifiques (2018)
Subjects: Tools, FOS: Computer and information sciences, Computer Science - Computation and Language, Computer Science - Databases, LR Infrastructures and Architectures, Applications, Systems, Databases (cs.DB), Collaborative Resource Construction & Crowdsourcing, Computation and Language (cs.CL), [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]
Abstract: In recent years, large-scale data collection efforts have prioritized the amount of data collected in order to improve the modeling capabilities of large language models. This prioritization, however, has resulted in concerns with respect to the rights of data subjects represented in data collections, particularly when considering the difficulty in interrogating these collections due to insufficient documentation and tools for analysis. Mindful of these pitfalls, we present our methodology for a documentation-first, human-centered data collection project as part of the BigScience initiative. We identified a geographically diverse set of target language groups (Arabic, Basque, Chinese, Catalan, English, French, Indic languages, Indonesian, Niger-Congo languages, Portuguese, Spanish, and Vietnamese, as well as programming languages) for which to collect metadata on potential data sources. To structure this effort, we developed our online catalogue as a supporting tool for gathering metadata through organized public hackathons. We present our development process; analyses of the resulting resource metadata, including distributions over languages, regions, and resource types; and our lessons learned in this endeavor.
Comment: 8 pages plus appendix and references
Published: 2022
