
Synthetic Alone: Exploring the Dark Side of Synthetic Data for Grammatical Error Correction

Authors :
Park, Chanjun
Koo, Seonmin
Lee, Seolhwa
Seo, Jaehyung
Eo, Sugyeong
Moon, Hyeonseok
Lim, Heuiseok
Publication Year :
2023

Abstract

The data-centric AI approach aims to improve model performance without modifying the model itself, and it has been shown to do so. Although synthetic data has recently drawn attention within data-centric AI for its potential to improve performance, data-centric AI has so far been validated exclusively on real-world data and publicly available benchmark datasets. As a result, data-centric AI still depends heavily on real-world data, and models trained on synthetic data have not yet been thoroughly verified. Given these challenges, we ask: does data quality control (noise injection and balanced data), a data-centric AI methodology reported to have a positive impact, have the same positive impact on models trained solely on synthetic data? To address this question, we conducted comparative analyses of models trained on synthetic data and on real-world data, using the grammatical error correction (GEC) task. Our experimental results reveal that data quality control has a positive impact on models trained with real-world data, as reported in existing studies, whereas it has a negative impact on models trained solely on synthetic data.
Accepted for the Data-centric Machine Learning Research (DMLR) Workshop at ICML 2023.

Details

Language :
English
Database :
OpenAIRE
Accession number :
edsair.doi.dedup.....a6389cc5aeb61c997ef20eeecd22405c