WebBrain: Learning to Generate Factually Correct Articles for Queries by Grounding on Large Web Corpus

Authors :
Qian, Hongjing
Zhu, Yutao
Dou, Zhicheng
Gu, Haoqi
Zhang, Xinyu
Liu, Zheng
Lai, Ruofei
Cao, Zhao
Nie, Jian-Yun
Wen, Ji-Rong
Publication Year :
2023

Abstract

In this paper, we introduce a new NLP task: generating short factual articles with references for queries by mining supporting evidence from the Web. In this task, called WebBrain, the ultimate goal is to generate a fluent, informative, and factually correct short article (e.g., a Wikipedia article) for a factual query unseen in Wikipedia. To enable experiments on WebBrain, we construct a large-scale dataset, WebBrain-Raw, by extracting English Wikipedia articles and their crawlable Wikipedia references. WebBrain-Raw is ten times larger than the previous biggest peer dataset, which can greatly benefit the research community. From WebBrain-Raw, we construct two task-specific datasets, WebBrain-R and WebBrain-G, which are used to train an in-domain retriever and generator, respectively. We also empirically analyze the performance of current state-of-the-art NLP techniques on WebBrain and introduce a new framework, ReGen, which improves the factual correctness of generation through improved evidence retrieval and task-specific pre-training for generation. Experimental results show that ReGen outperforms all baselines in both automatic and human evaluations.

Comment: Code is available at https://github.com/qhjqhj00/WebBrain

Details

Database :
arXiv
Publication Type :
Report
Accession number :
edsarx.2304.04358
Document Type :
Working Paper