How to Evaluate Single-Round Dialogues Like Humans: An Information-Oriented Metric

Authors :: Yan Liu; Sheng-hua Zhong; Peiqi Liu; Zhong Ming
Source :: IEEE/ACM Transactions on Audio, Speech, and Language Processing. 28:2211-2223
Publication Year :: 2020
Publisher :: Institute of Electrical and Electronics Engineers (IEEE), 2020.
Abstract: Developing a dialogue response generation system is one of the important topics in natural language processing, but many obstacles remain before automatically generated dialogues of human-like quality become possible. A good evaluation method will help narrow the gap between machines and humans in dialogue generation. Unfortunately, the existing automatic evaluation methods are biased and correlate very poorly with human judgments of response quality. Such methods are incapable of assessing whether a dialogue response generation system can produce high-quality, knowledge-related, and informative dialogues. In response to this challenge, we design an information-oriented framework to simulate human subjective evaluation. Using this framework, we implement a learning-based metric to evaluate the quality of a dialogue. An experimental validation demonstrates the effectiveness of our proposed metric in dialogue selection and model evaluation on a Twitter dataset (in English) and a Weibo dataset (in Chinese). In addition, the metric is more closely related to human subjective judgment than the existing methods of dialogue evaluation.
