1. Exploiting Sentence Order in Document Alignment
- Author
- Brian Thompson and Philipp Koehn
- Subjects
- FOS: Computer and information sciences; 050101 languages & linguistics; Computer Science - Computation and Language; Computer science; business.industry; 05 social sciences; 02 engineering and technology; computer.software_genre; Task (project management); Reduction (complexity); Order (business); ComputingMethodologies_DOCUMENTANDTEXTPROCESSING; 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; 0501 psychology and cognitive sciences; Artificial intelligence; business; Computation and Language (cs.CL); computer; Natural language processing; Sentence
- Abstract
- We present a simple document alignment method that incorporates sentence order information in both candidate generation and candidate re-scoring. Our method results in a 61% relative reduction in error compared to the best previously published result on the WMT16 document alignment shared task. Our method improves downstream MT performance on web-scraped Sinhala–English documents from ParaCrawl, outperforming the document alignment method used in the most recent ParaCrawl release. It also outperforms a comparable corpora method that uses the same multilingual embeddings, demonstrating that exploiting sentence order is beneficial even if the end goal is sentence-level bitext. (EMNLP 2020)
- Published
- 2020