
Microblog language identification: overcoming the limitations of short, unedited and idiomatic text.

Authors :: Carter, Simon
Weerkamp, Wouter
Tsagkias, Manos
Source :: Language Resources & Evaluation. Mar 2013, Vol. 47 Issue 1, p195-215. 21p.
Publication Year :: 2013
Abstract: Multilingual posts can potentially affect the outcomes of content analysis on microblog platforms. To this end, language identification can provide a monolingual set of content for analysis. We find the unedited and idiomatic language of microblogs to be challenging for state-of-the-art language identification methods. To account for this, we identify five microblog characteristics that can help in language identification: the language profile of the blogger (blogger), the content of an attached hyperlink (link), the language profile of other users mentioned (mention) in the post, the language profile of a tag (tag), and the language of the original post (conversation), if the post we examine is a reply. Further, we present methods that combine these priors in a post-dependent and post-independent way. We present test results on 1,000 posts from five languages (Dutch, English, French, German, and Spanish), which show that our priors improve accuracy by 5% over a domain-specific baseline, and show that post-dependent combination of the priors achieves the best performance. When suitable training data does not exist, our methods still outperform a domain-unspecific baseline. We conclude with an examination of the language distribution of a million tweets, along with temporal analysis, the usage of Twitter features across languages, and a correlation study between classifications made and geo-location and language metadata fields. [ABSTRACT FROM AUTHOR]

Subjects :: *LINGUISTIC identity
*MICROBLOGS
*INSTANT messaging
*ONLINE social networks
*CONTENT analysis

Details

Language :: English
ISSN :: 1574020X
Volume :: 47
Issue :: 1
Database :: Academic Search Index
Journal :: Language Resources & Evaluation
Publication Type :: Academic Journal
Accession number :: 85873232
Full Text :: https://doi.org/10.1007/s10579-012-9195-y
