Semi-automatic generation of multilingual datasets for stance detection in Twitter.

Authors :: Zotova, Elena
Agerri, Rodrigo
Rigau, German
Source :: Expert Systems with Applications. May2021, Vol. 170, pN.PAG-N.PAG. 1p.
Publication Year :: 2021
Abstract:
• New method to semi-automatically build labeled stance detection datasets from Twitter.
• Translation strategies outperform zero-shot approaches when data is translated to a high-resourced language.
• User-based information helps to label individual tweets.
• Our method can quickly and cheaply generate labeled Twitter-based data.

Popular social media networks provide the perfect environment to study the opinions and attitudes expressed by users. While interactions in social media such as Twitter occur in many natural languages, research on stance detection (the position or attitude expressed with respect to a specific topic) within the Natural Language Processing field has largely been done for English. Although some efforts have recently been made to develop annotated data in other languages, there is a telling lack of resources to facilitate multilingual and crosslingual research on stance detection. This is partially due to the fact that manually annotating a corpus of social media texts is a difficult, slow and costly process. Furthermore, as stance is a highly domain- and topic-specific phenomenon, the need for annotated data is especially demanding. As a result, most of the manually labeled resources are hindered by their relatively small size and skewed class distribution.

This paper presents a method to obtain multilingual datasets for stance detection in Twitter. Instead of manually annotating on a per-tweet basis, we leverage user-based information to semi-automatically label large amounts of tweets. Empirical monolingual and cross-lingual experimentation and qualitative analysis show that our method helps to overcome the aforementioned difficulties to build large, balanced and multilingual labeled corpora. We believe that our method can be easily adapted to generate labeled social media data for other Natural Language Processing tasks and domains. [ABSTRACT FROM AUTHOR]

Subjects :: *NATURAL language processing
*SKEWNESS (Probability theory)
*NATURAL languages
*SOCIAL interaction
*SOCIAL media

Details

Language :: English
ISSN :: 09574174
Volume :: 170
Database :: Academic Search Index
Journal :: Expert Systems with Applications
Publication Type :: Academic Journal
Accession number :: 148986708
Full Text :: https://doi.org/10.1016/j.eswa.2020.114547

