A factory of comparable corpora from Wikipedia

Authors :: Universitat Politècnica de Catalunya. Departament de Ciències de la Computació
Universitat Politècnica de Catalunya. GPLN - Grup de Processament del Llenguatge Natural
Barrón-Cedeño, Alberto
España Bonet, Cristina
Boldoba Trapote, Josu
Márquez Villodre, Luís
Publication Year :: 2015
Abstract: Multiple approaches to gathering comparable data from the Web have been developed to date. Nevertheless, coming up with a high-quality comparable corpus on a specific topic is not straightforward. We present a model for the automatic extraction of comparable texts in multiple languages and on specific topics from Wikipedia. To demonstrate the value of the model, we automatically extract parallel sentences from the comparable collections and use them to train statistical machine translation engines for specific domains. Our experiments on the English–Spanish pair in the domains of Computer Science, Science, and Sports show that our in-domain translator performs significantly better than a generic one when translating in-domain Wikipedia articles. Moreover, we show that these corpora can help when translating out-of-domain texts.
Peer Reviewed
Postprint (published version)

Details

Database :: OAIster
Notes :: 11 p., application/pdf, English
Publication Type :: Electronic Resource
Accession number :: edsoai.ocn927098159
Document Type :: Electronic Resource
