The CACAPO Dataset: A Multilingual, Multi-Domain Dataset for Neural Pipeline and End-to-End Data-to-Text Generation

Authors :: van der Lee, Chris
Emmery, Chris
Wubben, Sander
Krahmer, Emiel
Language, Communication and Cognition
Cognitive Science & AI
Source :: Proceedings of The 13th International Conference on Natural Language Generation, 68-79
Publication Year :: 2020
Abstract :: This paper describes the CACAPO dataset, built for training both neural pipeline and end-to-end data-to-text language generation systems. The dataset is multilingual (Dutch and English) and contains almost 10,000 sentences from human-written news texts in the sports, weather, stocks, and incidents domains, together with aligned attribute-value paired data. The dataset is unique in that the linguistic variation and indirect ways of expressing data in these texts reflect the challenges of real-world NLG tasks.

Language :: English
Database :: OpenAIRE
Journal :: Proceedings of The 13th International Conference on Natural Language Generation, 68-79
Accession number :: edsair.narcis........fac43e94b9275d7bc9002c9a3af6e2e6
