
Collecting Tweets to Investigate Regional Variation in Canadian English

Authors :
Miletic, Filip
Przewozny-Desriaux, Anne
Tanguy, Ludovic
Cognition, Langues, Langage, Ergonomie (CLLE-ERSS)
École pratique des hautes études (EPHE)
Université Paris sciences et lettres (PSL)-Université Toulouse - Jean Jaurès (UT2J)-Université Bordeaux Montaigne-Centre National de la Recherche Scientifique (CNRS)
Source :
12th Conference on Language Resources and Evaluation (LREC 2020), May 2020, Marseille, France. pp. 6255-6264
Publication Year :
2020
Publisher :
HAL CCSD, 2020.

Abstract

We present a 78.8-million-tweet, 1.3-billion-word corpus aimed at studying regional variation in Canadian English, with a specific focus on the dialect regions of Toronto, Montreal, and Vancouver. Our data collection and filtering pipeline reflects complex design criteria, which aim to allow for both data-intensive modeling methods and user-level variationist sociolinguistic analysis. The pipeline consists of four steps: identifying Twitter users from the three cities, crawling their entire timelines, filtering the collected data by user location and tweet language, and automatically excluding near-duplicate content. The resulting corpus mirrors national and regional specificities of Canadian English, provides sufficient aggregate and user-level data, and maintains a reasonably balanced distribution of content across regions and users. The utility of this dataset is illustrated by two example applications: the detection of regional lexical and topical variation, and the identification of contact-induced semantic shifts using vector space models. In accordance with Twitter's developer policy, the corpus will be publicly released in the form of tweet IDs.
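
The abstract does not say how near-duplicate content is detected. As an illustration only, the sketch below shows one common way to do it: compare tweets by the Jaccard similarity of their word trigrams after stripping URLs and mentions. The function names, the trigram size, and the 0.8 threshold are all assumptions made for this example and are not taken from the paper.

```python
import re


def normalize(text):
    """Lowercase and strip URLs, @mentions, and extra whitespace."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    text = re.sub(r"@\w+", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def shingles(text, n=3):
    """Return the set of word n-grams; texts shorter than n words yield one shingle."""
    words = text.split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def drop_near_duplicates(tweets, threshold=0.8):
    """Keep a tweet only if it is not too similar to any tweet kept so far.

    The threshold is a hypothetical value chosen for illustration.
    """
    kept, kept_shingles = [], []
    for tweet in tweets:
        s = shingles(normalize(tweet))
        if all(jaccard(s, other) < threshold for other in kept_shingles):
            kept.append(tweet)
            kept_shingles.append(s)
    return kept


tweets = [
    "Big snowstorm hitting Montreal tonight, stay safe everyone https://t.co/abc",
    "Big snowstorm hitting Montreal tonight, stay safe everyone https://t.co/xyz",
    "Loving the seawall in Vancouver this morning",
]
filtered = drop_near_duplicates(tweets)
```

This pairwise comparison is quadratic in the number of tweets, so it only suits small samples; at the scale of tens of millions of tweets, an approximate scheme such as hashing-based candidate selection would be needed.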

Details

Language :
English
Database :
OpenAIRE
Journal :
12th Conference on Language Resources and Evaluation (LREC 2020), May 2020, Marseille, France. pp. 6255-6264
Accession number :
edsair.dedup.wf.001..b294456814499cbcf9b71e5dfded6c06