60 results on '"Mazo, Hélène"'
Search Results
2. Evaluating Pre-training Objectives for Low-Resource Translation into Morphologically Rich Languages
- Author
-
Dhar, Prajit, Bisazza, Arianna, van Noord, Gertjan, Calzolari, Nicoletta, Béchet, Frédéric, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Odijk, Jan, Piperidis, Stelios, and Computational Linguistics (CL)
- Abstract
The scarcity of parallel data is a major limitation for Neural Machine Translation (NMT) systems, in particular for translation into morphologically rich languages (MRLs). An important way to overcome the lack of parallel data is to leverage target monolingual data, which is typically more abundant and easier to collect. We evaluate a number of techniques to achieve this, ranging from back-translation to random token masking, on the challenging task of translating English into four typologically diverse MRLs, under low-resource settings. Additionally, we introduce Inflection Pre-Training (or PT-Inflect), a novel pre-training objective whereby the NMT system is pre-trained on the task of re-inflecting lemmatized target sentences before being trained on standard source-to-target language translation. We conduct our evaluation on four typologically diverse target MRLs, and find that PT-Inflect surpasses NMT systems trained only on parallel data. While PT-Inflect is outperformed by back-translation overall, combining the two techniques leads to gains in some of the evaluated language pairs.
- Published
- 2022
3. AGILe: The First Lemmatizer for Ancient Greek Inscriptions
- Author
-
de Graaf, Evelien, Stopponi, Silvia, Bos, Jasper, Peels-Matthey, Saskia, Nissim, Malvina, Calzolari, Nicoletta, Béchet, Frédéric, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Odijk, Jan, Piperidis, Stelios, Computational Linguistics (CL), Theoretical and Empirical Linguistics (TEL), and Research Centre for Historical Studies (CHS)
- Subjects
ancient Greek, lemmatizer, digital classics
- Abstract
To facilitate corpus searches by classicists as well as to reduce data sparsity when training models, we focus on the automatic lemmatization of ancient Greek inscriptions, which have not received as much attention in this sense as literary text data has. We show that existing lemmatizers for ancient Greek, trained on literary data, are not performant on epigraphic data, due to major language differences between the two types of texts. We thus train the first inscription-specific lemmatizer achieving above 80% accuracy, and make both the models and the lemmatized data available to the community. We also provide a detailed error analysis highlighting peculiarities of inscriptions which again highlights the importance of a lemmatizer dedicated to inscriptions.
- Published
- 2022
4. Proceedings of the Language Resources and Evaluation Conference
- Author
-
LS OZ Taal en spraaktechnologie, ILS LLI, Calzolari, Nicoletta, Béchet, Frédéric, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Odijk, Jan, and Piperidis, Stelios
- Published
- 2022
5. The Index Thomisticus Treebank as Linked Data in the LiLa Knowledge Base
- Author
-
Calzolari, Nicoletta, Béchet, Frédéric, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Odijk, Jan, Piperidis, Stelios, Mambrini, Francesco, Passarotti, Marco, Moretti, Giovanni, Pellegrini, Matteo, Mambrini Francesco (ORCID:0000-0003-0834-7562), Passarotti Marco (ORCID:0000-0002-9806-7187), and Pellegrini Matteo (ORCID:0000-0003-4378-5824)
- Abstract
Although the Universal Dependencies initiative today allows for cross-linguistically consistent annotation of morphology and syntax in treebanks for several languages, syntactically annotated corpora are not yet interoperable with many lexical resources that describe properties of the words that occur therein. In order to cope with such limitation, we propose to adopt the principles of the Linguistic Linked Open Data community, to describe and publish dependency treebanks as LLOD. In particular, this paper illustrates the approach pursued in the LiLa Knowledge Base, which enables interoperability between corpora and lexical resources for Latin, to publish as Linguistic Linked Open Data the annotation layers of two versions of a Medieval Latin treebank (the Index Thomisticus Treebank).
- Published
- 2022
6. Odi et Amo. Creating, Evaluating and Extending Sentiment Lexicons for Latin
- Author
-
Calzolari, Nicoletta, Béchet, Frédéric, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Moreno, Asuncion, Odijk, Jan, Piperidis, Stelios, Sprugnoli, Rachele, Passarotti, Marco Carlo, Corbetta, Daniela, Peverelli, Andrea, Sprugnoli Rachele (ORCID:0000-0001-6861-5595), and Passarotti Marco (ORCID:0000-0002-9806-7187)
- Abstract
Sentiment lexicons are essential for developing automatic sentiment analysis systems, but the resources currently available mostly cover modern languages. Lexicons for ancient languages are few and not evaluated with high-quality gold standards. However, the study of attitudes and emotions in ancient texts is a growing field of research which poses specific issues (e.g., lack of native speakers, limited amount of data, unusual textual genres for the sentiment analysis task, such as philosophical or documentary texts) and can have an impact on the work of scholars coming from several disciplines besides computational linguistics, e.g. historians and philologists. The work presented in this paper aims at providing the research community with a set of sentiment lexicons built by taking advantage of manually-curated resources belonging to the long tradition of Latin corpora and lexicons creation. Our interdisciplinary approach led us to release: i) two automatically generated sentiment lexicons; ii) a Gold Standard developed by two Latin language and culture experts; iii) a Silver Standard in which semantic and derivational relations are exploited so to extend the list of lexical items of the Gold Standard. In addition, the evaluation procedure is described together with a first application of the lexicons to a Latin tragedy.
- Published
- 2020
7. A New Latin Treebank for Universal Dependencies: Charters between Ancient Latin and Romance Languages
- Author
-
Calzolari, Nicoletta, Béchet, Frédéric, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Moreno, Asuncion, Odijk, Jan, Piperidis, Stelios, Cecchini, Flavio Massimiliano, Korkiakangas, Timo, Passarotti, Marco, Cecchini Flavio Massimiliano (ORCID:0000-0001-9029-1822), and Passarotti Marco (ORCID:0000-0002-9806-7187)
- Abstract
The present work introduces a new Latin treebank that follows the Universal Dependencies (UD) annotation standard. The treebank is obtained from the automated conversion of the Late Latin Charter Treebank 2 (LLCT2), originally in the Prague Dependency Treebank (PDT) style. As this treebank consists of Early Medieval legal documents, its language variety differs considerably from both the Classical and Medieval learned varieties prevalent in the other currently available UD Latin treebanks. Consequently, besides significant phenomena from the perspective of diachronic linguistics, this treebank also poses several challenging technical issues for the current and future syntactic annotation of Latin in the UD framework. Some of the most relevant cases are discussed in depth, with comparisons between the original PDT and the resulting UD annotations. Additionally, an overview of the UD-style structure of the treebank is given, and some diachronic aspects of the transition from Latin to Romance languages are highlighted.
- Published
- 2020
8. Sustainable Language Data Sharing to Support Language Equality in Multilingual Europe - Why Language Data Matters:ELRC White Paper
- Author
-
Berzins, Aivars, Choukri, Khalid, Giagkou, Maria, Lösch, Andrea, Mazo, Helene, Piperidis, Stelios, Rigault, Mickaël, Schnur, Eileen, Small, Lilli, Genabith, Josef van, Vasiljevs, Andrejs, Adamson, Andero, Anastasiou, Dimitra, Avraamides-Haratsi, Natassa, Bel, Núria, Bódi, Zoltán, Branco, António, Budin, Gerhard, Dadurkevicius, Virginijus, Smeytere, Stijn de, Dobreva, Hrístina, Domeij, Rickard, Dunne, Jane, Eide, Kristine, Foti, Claudia, Gavriilidou, Maria, Grouas, Thibault, Gruzitis, Normund, Hajic, Jan, Heinisch, Barbara, Hoste, Verónique, Jönsson, Arne, Kakoyianni-Doa, Fryni, Kirchmeier, Sabine, Koeva, Svetla, Konturová, Lucia, Kotzian, Jürgen, Krek, Simon, Kristmannsson, Gauti, Kuhmonen, Kaisamari, Lindén, Krister, Lynn, Teresa, Magone, Armands, Mazo, Hélène, Melero, Maite, Mihailescu, Laura, Montemagni, Simonetta, Õ Conaire, Micheál, Odijk, Jan, Ogrodniczuk, Maciej, Pecina, Pavel, Olsen, Jon Arild, Pedersen, Bolette Sandford, Perez, David, Repar, Andras, Terryn, Ayla Rigouts, Rögnvaldsson, Eirikur, Rosner, Mike, Routzouni, Nancy, Soria, Claudia, Soska, Alexandra, Spiteri, Donatienne, Tadic, Marko, Tiberius, Carole, Tufis, Dan, Utka, Andrius, Vale, Paolo, van den Berg, Piet, Váradi, Tamás, Vare, Kadri, Witt, Andreas, Yvon, Francois, Ziedins, Janis, and Zumrik, Miroslav
- Published
- 2019
9. A Multilingual Wikified Data Set of Educational Material
- Author
-
Hendrickx, Iris, Takoulidou, Eirini, Naskos, Thanasis, Kermanidis, Katia Lida, Sosoni, Vilelmini, Vos, Hugo De, Stasimioti, Maria, Zaanen, Menno Van, Georgakopoulou, Panayota, Kordoni, Valia, Popovic, Maja, Egg, Markus, Bosch, Antal Van den, Calzolari, Nicoletta (Conference chair), Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Hasida, Koiti, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Moreno, Asuncion, Odijk, Jan, Piperidis, Stelios, Tokunaga, Takenobu, and Cognitive Science & AI
- Published
- 2018
10. Translation Crowdsourcing: Creating a Multilingual Corpus of Online Educational Content
- Author
-
Sosoni, Vilelmini, Kermanidis, Katia Lida, Stasimioti, Maria, Naskos, Thanasis, Takoulidou, Eirini, Zaanen, Menno Van, Castilho, Sheila, Georgakopoulou, Panayota, Kordoni, Valia, Egg, Markus, Calzolari, Nicoletta (Conference chair), Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Hasida, Koiti, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Moreno, Asuncion, Odijk, Jan, Piperidis, Stelios, Tokunaga, Takenobu, and Cognitive Science & AI
- Published
- 2018
11. Improving Machine Translation of Educational Content via Crowdsourcing
- Author
-
Behnke, Maximiliana, Barone, Antonio Valerio Miceli, Sennrich, Rico, Sosoni, Vilelmini, Naskos, Thanasis, Takoulidou, Eirini, Stasimioti, Maria, Zaanen, Menno Van, Castilho, Sheila, Gaspari, Federico, Georgakopoulou, Panayota, Kordoni, Valia, Egg, Markus, Kermanidis, Katia Lida, Calzolari, Nicoletta (Conference chair), Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Hasida, Koiti, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Moreno, Asuncion, Odijk, Jan, Piperidis, Stelios, Tokunaga, Takenobu, and Cognitive Science & AI
- Published
- 2018
12. The AnnCor CHILDES Treebank
- Author
-
Odijk, Jan, Dimitriadis, Alexis, Klis, Martijn Van der, Koppen, Marjo Van, Otten, Meie, Veen, Remco van der, Calzolari, Nicoletta, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Hasida, Koiti, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Moreno, Asuncion, Piperidis, Stelios, Tokunaga, Takenobu, LS OZ Taal en spraaktechnologie, LS Psycholinguistiek, LS Franse Taalkunde, LS BZ Variatielinguistiek vh Nederlands, and ILS LLI
- Subjects
treebank, treebank querying, CHILDES, Dutch, GrETEL, Language and Linguistics, Computer Science (all)
- Abstract
This paper (1) presents the first partially manually verified treebank for Dutch CHILDES corpora, the AnnCor CHILDES Treebank; (2) argues explicitly that it is useful to assign adult grammar syntactic structures to utterances of children who are still in the process of acquiring the language; (3) argues that human annotation and automatic checks on this annotation must go hand in hand; (4) argues that explicit annotation guidelines and conventions must be developed and adhered to and emphasises consistency of the annotations as an important desirable property for annotations. It also describes the tools used for annotation and automated checks on edited syntactic structures, as well as extensions to an existing treebank query application (GrETEL) and the multiple formats in which the resources will be made available.
- Published
- 2018
13. The AnnCor CHILDES Treebank
- Author
-
LS OZ Taal en spraaktechnologie, LS Psycholinguistiek, LS Franse Taalkunde, LS BZ Variatielinguistiek vh Nederlands, UiL OTS LLI, Odijk, Jan, Dimitriadis, Alexis, Klis, Martijn Van der, Koppen, Marjo Van, Otten, Meie, Veen, Remco van der, Calzolari, Nicoletta, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Hasida, Koiti, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Moreno, Asuncion, Piperidis, Stelios, and Tokunaga, Takenobu
- Published
- 2018
14. SMILE Swiss German Sign Language Dataset
- Author
-
Calzolari, Nicoletta, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Hasida, Koiti, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Moreno, Asuncion, Odijk, Jan, Piperidis, Stelios, Tokunaga, Takenobu, Calzolari, N ( Nicoletta ), Choukri, K ( Khalid ), Cieri, C ( Christopher ), Declerck, T ( Thierry ), Goggi, S ( Sara ), Hasida, K ( Koiti ), Isahara, H ( Hitoshi ), Maegaard, B ( Bente ), Mariani, J ( Joseph ), Mazo, H ( Hélène ), Moreno, A ( Asuncion ), Odijk, J ( Jan ), Piperidis, S ( Stelios ), Tokunaga, T ( Takenobu ), Ebling, Sarah; https://orcid.org/0000-0001-6511-5085, Camgöz, Necati Cihan, Boyes Braem, Penny, Tissi, Katja, Sidler-Miserez, Sandra, Stoll, Stephanie, Hadfield, Simon, Haug, Tobias, Bowden, Richard, Tornay, Sandrine, Razavi, Marzieh, and Magimai-Doss, Mathew
- Published
- 2018
15. CLARIN: towards FAIR and responsible data science using language resources
- Author
-
Calzolari, Nicoletta, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Hasida, Koiti, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Moreno, Asuncion, Odijk, Jan, Piperidis, Stelios, Tokunaga, Takenobu, de Jong, Franciska, de Smedt, Koenraad, Fiser, Darja, and van Uytvanck, Dieter
- Published
- 2018
16. Modelling multi-issue bargaining dialogues: Data collection, annotation design and corpus
- Author
-
Petukhova, Volha, Stevens, Christopher, de Weerd, Hermanes, Taatgen, Niels, Cnossen, Fokeltje, Malchanau, Andrei, Calzolari, Nicoletta, Choukri, Khalid, Declerck, Thierry, Goggi, Sara, Grobelnik, Marko, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Moreno, Asunción, Odijk, Jan, Piperidis, Stelios, and Artificial Intelligence
- Subjects
ISO 24617-2 dialogue act annotation scheme extension, dialogue modelling, negotiation corpus collection, dialogue act annotation
- Abstract
The paper describes experimental dialogue data collection activities, as well as the creation of a semantically annotated corpus, undertaken within the EU-funded METALOGUE project. The project aims to develop a dialogue system with flexible dialogue management to enable the system's adaptive, reactive, interactive and proactive dialogue behaviour in setting goals, choosing appropriate strategies and monitoring numerous parallel interpretation and management processes. To achieve these goals, a negotiation (or more precisely multi-issue bargaining) scenario has been considered as the specific setting and application domain. The dialogue corpus forms the basis for the design of task and interaction models of participants' negotiation behaviour, and subsequently for the development of a dialogue system capable of replacing one of the negotiators. The METALOGUE corpus will be released to the community for research purposes.
- Published
- 2016
17. ArchiMob - A Corpus of Spoken Swiss German
- Author
-
Calzolari, Nicoletta, Choukri, Khalid, Declerck, Thierry, Goggi, Sara, Grobelnik, Marko, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Moreno, Asunción, Odijk, Jan, Piperidis, Stelios, Calzolari, N ( Nicoletta ), Choukri, K ( Khalid ), Declerck, T ( Thierry ), Goggi, S ( Sara ), Grobelnik, M ( Marko ), Maegaard, B ( Bente ), Mariani, J ( Joseph ), Mazo, H ( Hélène ), Moreno, A ( Asunción ), Odijk, J ( Jan ), Piperidis, S ( Stelios ), Samardžić, Tanja, Scherrer, Yves, and Glaser, Elvira; https://orcid.org/0000-0002-9620-3851
- Published
- 2016
18. Khresmoi Professional: Multilingual Semantic Search for Medical Professionals
- Author
-
Peychev, Deyan, Tamchyna, Aleš, Dědek, Jan, Kelly, Liadh, Georgiev, Georgi, Choukri, Khalid, Urešová, Zdeňka, Foncubierta, Antonio, Schneller, Priscille, Gaudinat, Arnaud, Gobeill, Julien, Dungs, Sebastian, Boyer, Célia, Dušek, Ondřej, Jones, Gareth, Langs, Georg, Pecina, Pavel, Lawson, Nolan, Gschwandtner, Manfred, Herrera, Alba, Vishnyakova, Dina, Hajič, Jan, Bystroň, Jakub, Tinte, Miguel, Jordan, Matthias, Markonis, Dimitrios, Samwald, Matthias, Kainberger, Franz, Leveling, Johannes, Petrak, Johann, Müller, Henning, Hanbury, Allan, Cunningham, Hamish, Funk, Adam, Kaderk, Klemens, Mareček, David, Aswani, Niraj, Masselot, Alexandre, Hlaváčová, Jaroslava, Roberts, Angus, Kritz, Marlene, Birngruber, Erich, Rosa, Rudolf, Holzer, Markus, Pletneva, Natalia, Martínez, Iván, Donner, René, Gomez, Paz, Greenwood, Mark, Cruchet, Sarah, Pentchev, Konstantin, Mazo, Hélène, Jordán, Blanca, Dolamic, Ljiljana, Palotti, João, Beckers, Thomas, Fuhr, Norbert, Stefanov, Veronika, Novák, Michal, Pottecher, Diana, Vargas, Alejandro, Kriewel, Sascha, Popel, Martin, Momtchev, Vassil, Sachs, Alexander, Burner, Andreas, Eggel, Ivan, Ruch, Patrick, and Goeuriot, Lorraine
- Abstract
There is increasing interest in and need for innovative solutions to medical search. In this paper we present the EU-funded Khresmoi medical search and access system, currently in year 3 of 4 of development across 12 partners. The Khresmoi system uses a component-based architecture housed in the cloud to allow for the development of several innovative applications to support target users' medical information needs. The Khresmoi search systems based on this architecture have been designed to support the multilingual and multimodal information needs of three target groups: the general public, general practitioners and consultant radiologists. In this paper we focus on the presentation of the systems to support the latter two groups using semantic, multilingual text and image-based (including 2D and 3D radiology images) search.
- Published
- 2013
19. Khresmoi - multilingual semantic search of medical text and images
- Author
-
Peychev, Deyan, Dědek, Jan, Kelly, Liadh, Georgiev, Georgi, Choukri, Khalid, Urešová, Zdeňka, Foncubierta, Antonio, Schneller, Priscille, Gaudinat, Arnaud, Gobeill, Julien, Dungs, Sebastian, Boyer, Célia, Jones, Gareth, Müller, Henning, Langs, Georg, Pecina, Pavel, Lawson, Nolan, Gschwandtner, Manfred, Herrera, Alba, Vishnyakova, Dina, Hajič, Jan, Bystroň, Jakub, Tinte, Miguel, Jordan, Matthias, Markonis, Dimitrios, Samwald, Matthias, Kainberger, Franz, Roberts, Angus, Hanbury, Allan, Cunningham, Hamish, Funk, Adam, Kaderk, Klemens, Ruch, Patrick, Aswani, Niraj, Masselot, Alexandre, Hlaváčová, Jaroslava, Kritz, Marlene, Birngruber, Erich, Holzer, Markus, Pletneva, Natalia, Martínez, Iván, Donner, René, Gomez, Paz, Greenwood, Mark, Cruchet, Sarah, Pentchev, Konstantin, Mazo, Hélène, Jordán, Blanca, Dolamic, Ljiljana, Beckers, Thomas, Fuhr, Norbert, Stefanov, Veronika, Pottecher, Diana, Vargas, Alejandro, Kriewel, Sascha, Momtchev, Vassil, Burner, Andreas, Eggel, Ivan, and Goeuriot, Lorraine
- Abstract
The Khresmoi project is developing a multilingual multimodal search and access system for medical and health information and documents. This scientific demonstration presents the current state of the Khresmoi integrated system, which includes components for text and image annotation, semantic search, search by image similarity and machine translation. The flexibility in adapting the system to varying requirements for different types of medical information search is demonstrated through two instantiations of the system, one aimed at medical professionals in general and the second aimed at radiologists. The key innovations of the Khresmoi system are the integration of multiple software components in a flexible scalable medical search system, the use of annotation cycles including manual correction to improve semantic search, and the possibility to do large scale visual similarity search on 2D and 3D (CT, MR) medical images.
- Published
- 2013
20. Khresmoi: Multimodal Multilingual Medical Information Search
- Author
-
Peychev, Deyan, Dědek, Jan, Kelly, Liadh, Georgiev, Georgi, Choukri, Khalid, Urešová, Zdeňka, Foncubierta, Antonio, Gaudinat, Arnaud, Masselot, Alexandre, Gobeill, Julien, Dungs, Sebastian, Boyer, Célia, Jones, Gareth, Müller, Henning, Langs, Georg, Pecina, Pavel, Lawson, Nolan, Gschwandtner, Manfred, Herrera, Alba, Vishnyakova, Dina, Hajič, Jan, Bystroň, Jakub, Tinte, Miguel, Jordan, Matthias, Markonis, Dimitrios, Samwald, Matthias, Kainberger, Franz, Roberts, Angus, Hanbury, Allan, Cunningham, Hamish, Funk, Adam, Kaderk, Klemens, Ruch, Patrick, Aswani, Niraj, Kriewel, Sascha, Hlaváčová, Jaroslava, Kritz, Marlene, Birngruber, Erich, Holzer, Markus, Pletneva, Natalia, Martínez, Iván, Donner, René, Gomez, Paz, Greenwood, Mark, Cruchet, Sarah, Pentchev, Konstantin, Mazo, Hélène, Jordán, Blanca, Dolamic, Ljiljana, Beckers, Thomas, Fuhr, Norbert, Stefanov, Veronika, Pottecher, Diana, Vargas, Alejandro, Schneller, Priscille, Momtchev, Vassil, Burner, Andreas, Eggel, Ivan, and Goeuriot, Lorraine
- Abstract
Khresmoi is a European Integrated Project developing a multilingual multimodal search and access system for medical and health information and documents. It addresses the challenges of searching through huge amounts of medical data, including general medical information available on the internet, as well as radiology data in hospital archives. It is developing novel semantic search and visual search techniques for the medical domain. At the MIE Village of the Future, Khresmoi proposes to have two interactive demonstrations of the system under development, as well as an overview oral presentation and potentially some poster presentations.
- Published
- 2012
21. The EuroPat Corpus: A Parallel Corpus of European Patent Data
- Author
-
Heafield, Kenneth, Farrow, Elaine, Van Der Linde, Jelmer, Ramírez-Sánchez, Gema, Wiggins, Dion, Calzolari, Nicoletta, Béchet, Frédéric, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Odijk, Jan, and Piperidis, Stelios
- Subjects
Technical translation, Parallel data, Patent, Corpus, Legal
- Abstract
We present the EuroPat corpus of patent-specific parallel data for 6 official European languages paired with English: German, Spanish, French, Croatian, Norwegian, and Polish. The filtered parallel corpora range in size from 51 million sentences (Spanish-English) to 154k sentences (Croatian-English), with the unfiltered (raw) corpora being up to 2 times larger. Access to clean, high quality, parallel data in technical domains such as science, engineering, and medicine is needed for training neural machine translation systems for tasks like online dispute resolution and eProcurement. Our evaluation found that the addition of EuroPat data to a generic baseline improved the performance of machine translation systems on in-domain test data in German, Spanish, French, and Polish; and in translating patent data from Croatian to English. The corpus has been released under Creative Commons Zero, and is expected to be widely useful for training high-quality machine translation systems, and particularly for those targeting technical documents such as patents and contracts.
- Published
- 2022
22. Dynamic Human Evaluation for Relative Model Comparisons
- Author
-
Thórhildur Thorleiksdóttir, Cedric Renggli, Nora Hollenstein, Ce Zhang, Calzolari, Nicoletta, Béchet, Frédéric, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Odijk, Jan, and Piperidis, Stelios
- Subjects
FOS: Computer and information sciences, Computer Science - Computation and Language, Natural Language Generation, Relative Model Comparison, Human Evaluation, Crowdsourcing, Computation and Language (cs.CL)
- Abstract
Collecting human judgements is currently the most reliable evaluation method for natural language generation systems. Automatic metrics have reported flaws when applied to measure quality aspects of generated text and have been shown to correlate poorly with human judgements. However, human evaluation is time and cost-intensive, and we lack consensus on designing and conducting human evaluation experiments. Thus there is a need for streamlined approaches for efficient collection of human judgements when evaluating natural language generation systems. Therefore, we present a dynamic approach to measure the required number of human annotations when evaluating generated outputs in relative comparison settings. We propose an agent-based framework of human evaluation to assess multiple labelling strategies and methods to decide the better model in a simulation and a crowdsourcing case study. The main results indicate that a decision about the superior model can be made with high probability across different labelling strategies, where assigning a single random worker per task requires the least overall labelling effort and thus the least cost. Proceedings of the Thirteenth Language Resources and Evaluation Conference, ISBN: 979-10-95546-72-6
- Published
- 2022
23. TallVocabL2Fi : A Tall Dataset of 15 Finnish L2 Learners’ Vocabulary
- Author
-
Frankie Robertson, Chang Li-Hsin, Söyrinki Sini, Calzolari, Nicoletta, Béchet, Frédéric, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Odijk, Jan, and Piperidis, Stelios
- Subjects
second language, learning, measurement, Finnish, word knowledge, mental lexicon, measurement methods, machine learning, vocabulary, data, word response data, words, language learning, assessment, learner data
- Abstract
Previous work concerning measurement of second language learners has tended to focus on the knowledge of small numbers of words, often geared towards measuring vocabulary size. This paper presents a “tall” dataset containing information about a few learners’ knowledge of many words, suitable for evaluating Vocabulary Inventory Prediction (VIP) techniques, including those based on Computerised Adaptive Testing (CAT). In comparison to previous comparable datasets, the learners are from varied backgrounds, so as to reduce the risk of overfitting when used for machine learning based VIP. The dataset contains both a self-rating test and a translation test, used to derive a measure of reliability for learner responses. The dataset creation process is documented, and the relationship between variables concerning the participants, such as their completion time, their language ability level, and the triangulated reliability of their self-assessment responses, are analysed. The word list is constructed by taking into account the extensive derivation morphology of Finnish, and infrequent words are included in order to account for explanatory variables beyond word frequency. Peer reviewed.
- Published
- 2022
24. Aspect-based emotion analysis and multimodal coreference : a case study of customer comments on Adidas instagram posts
- Author
-
De Bruyne, Luna, Karimi, Akbar, De Clercq, Orphée, Prati, Andrea, Hoste, Veronique, Calzolari, Nicoletta, Béchet, Frédéric, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Odijk, Jan, and Piperidis, Stelios
- Subjects
Emotion Detection, ABEA, lt3, Sentiment Analysis, ABSA, Languages and Literatures, Multimodal Coreference
- Abstract
While aspect-based sentiment analysis of user-generated content has received a lot of attention in the past years, emotion detection at the aspect level has been relatively unexplored. Moreover, given the rise of more visual content on social media platforms, we want to meet the ever-growing share of multimodal content. In this paper, we present a multimodal dataset for Aspect-Based Emotion Analysis (ABEA). Additionally, we take the first steps in investigating the utility of multimodal coreference resolution in an ABEA framework. The presented dataset consists of 4,900 comments on 175 images and is annotated with aspect and emotion categories and the emotional dimensions of valence and arousal. Our preliminary experiments suggest that ABEA does not benefit from multimodal coreference resolution, and that aspect and emotion classification only requires textual information. However, when more specific information about the aspects is desired, image recognition could be essential.
- Published
- 2022
25. Recognizing semantic relations by combining transformers and fully connected models
- Author
-
Roussinov, Dmitri, Sharoff, Serge, Puchnina, Nadezhda, Calzolari, Nicoletta, Béchet, Frédéric, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Moreno, Asuncion, Odijk, Jan, and Piperidis, Stelios
- Subjects
P1
- Abstract
Automatically recognizing an existing semantic relation (e.g. "is a", "part of", "property of", "opposite of" etc.) between two words (phrases, concepts, etc.) is an important task affecting many NLP applications and has been the subject of extensive experimentation and modeling. Current approaches to automatically telling if a relation exists between two given concepts X and Y can be grouped into two types: 1) those modeling word-paths connecting X and Y in text and 2) those modeling distributional properties of X and Y separately, not necessarily in proximity to each other. Here, we investigate how both types can be improved and combined. We suggest a distributional approach that is based on an attention-based transformer. We have also developed a novel word path model that combines useful properties of a convolutional network with a fully connected language model. While our transformer-based approach works better, both our models significantly outperform the state-of-the-art within their classes of approaches. We also demonstrate that combining the two approaches results in additional gains since they use somewhat different data sources.
- Published
- 2020
26. The FISKMÖ Project : Resources and Tools for Finnish-Swedish Machine Translation and Cross-Linguistic Research
- Author
-
Tiedemann, Jörg, Nieminen, Tommi, Aulamo, Mikko, Kanerva, Jenna, Leino, Akseli, Ginter, Filip, Papula, Niko, Calzolari, Nicoletta, Béchet, Frédéric, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Moreno, Asuncion, Odijk, Jan, Piperidis, Stelios, Language Technology, Department of Digital Humanities, and Mind and Matter
- Subjects
6121 Languages, 113 Computer and information sciences
- Abstract
This paper presents FISKMÖ, a project that focuses on the development of resources and tools for cross-linguistic research and machine translation between Finnish and Swedish. The goal of the project is the compilation of a massive parallel corpus out of translated material collected from web sources, public and private organisations and language service providers in Finland with its two official languages. The project also aims at the development of open and freely accessible translation services for those two languages for the general purpose and for domain-specific use. We have released new data sets with over 3 million translation units, a benchmark test set for MT development, pre-trained neural MT models with high coverage and competitive performance and a self-contained MT plugin for a popular CAT tool. The latter enables offline translation without dependencies on external services making it possible to work with highly sensitive data without compromising security concerns.
- Published
- 2020
27. Language-Independent Tokenisation Rivals Language-Specific Tokenisation for Word Similarity Prediction
- Author
-
Bollegala, Danushka, Kiryo, Ryuichi, Tsujino, Kosuke, Yukawa, Haruki, Calzolari, Nicoletta, Béchet, Frédéric, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Moreno, Asunción, Odijk, Jan, and Piperidis, Stelios
- Subjects
FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computation and Language, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computation and Language (cs.CL), Machine Learning (cs.LG)
- Abstract
Language-independent tokenisation (LIT) methods that do not require labelled language resources or lexicons have recently gained popularity because of their applicability in resource-poor languages. Moreover, they compactly represent a language using a fixed size vocabulary and can efficiently handle unseen or rare words. On the other hand, language-specific tokenisation (LST) methods have a long and established history, and are developed using carefully created lexicons and training resources. Unlike subtokens produced by LIT methods, LST methods produce valid morphological subwords. Despite the contrasting trade-offs between LIT and LST methods, their performance on downstream NLP tasks remains unclear. In this paper, we empirically compare the two approaches using semantic similarity measurement as an evaluation task across a diverse set of languages. Our experimental results covering eight languages show that LST consistently outperforms LIT when the vocabulary size is large, but LIT can produce comparable or better results than LST in many languages with comparatively smaller (i.e. less than 100K words) vocabulary sizes, encouraging the use of LIT when language-specific resources are unavailable, incomplete or a smaller model is required. Moreover, we find smoothed inverse frequency (SIF) to be an accurate method to create word embeddings from subword embeddings for multilingual semantic similarity prediction tasks. Further analysis of the nearest neighbours of tokens shows that semantically and syntactically related tokens are closely embedded in subword embedding spaces. Comment: To appear in the 12th Language Resources and Evaluation (LREC 2020) Conference
- Published
- 2020
28. ReSiPC: a Tool for Complex Searches in Parallel Corpora
- Author
-
Oliver, A., Bojana Mikelenić, Calzolari, Nicoletta, Béchet, Frédéric, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Moreno, Asuncion, Odijk, Jan, and Piperidis, Stelios
- Subjects
InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL, parallel corpora, regular expressions, contrastive linguistics
- Abstract
In this paper, a tool specifically designed to allow for complex searches in large parallel corpora is presented. The formalism for the queries is very powerful as it uses standard regular expressions that allow for complex queries combining word forms, lemmata and POS-tags. As queries are performed over POS-tags, at least one of the languages in the parallel corpus should be POS-tagged. Searches can be performed in one of the languages or in both languages at the same time. The program is able to POS-tag the corpora using the Freeling analyzer through its Python API. ReSiPC is developed in Python version 3 and it is distributed under a free license (GNU GPL). The tool can be used to provide data for contrastive linguistics research and an example of use in a Spanish-Croatian parallel corpus is presented. ReSiPC is designed for queries in POS-tagged corpora, but it can be easily adapted for querying corpora containing other kinds of information.
- Published
- 2020
29. ZuCo 2.0: A Dataset of Physiological Recordings During Natural Reading and Annotation
- Author
-
Hollenstein, Nora, Troendle, Marius, Zhang, Ce, Langer, Nicolas, Calzolari, Nicoletta, Béchet, Frédéric, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Moreno, Asuncion, Odijk, Jan, and Piperidis, Stelios
- Subjects
FOS: Computer and information sciences, Computer Science - Computation and Language, Annotation, Computer Science - Human-Computer Interaction, Physiological data, Corpus, Human-Computer Interaction (cs.HC), Cognitive methods, EEG, Eye-tracking, Human language processing, Naturalistic reading, Computation and Language (cs.CL)
- Abstract
We recorded and preprocessed ZuCo 2.0, a new dataset of simultaneous eye-tracking and electroencephalography during natural reading and during annotation. This corpus contains gaze and brain activity data of 739 English sentences, 349 in a normal reading paradigm and 390 in a task-specific paradigm, in which the 18 participants actively search for a semantic relation type in the given sentences as a linguistic annotation task. This new dataset complements ZuCo 1.0 by providing experiments designed to analyze the differences in cognitive processing between natural reading and annotation. The data is freely available here: https://osf.io/2urht/. Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), ISBN: 979-10-95546-34-4
- Published
- 2020
30. The MARCELL Legislative Corpus
- Author
-
Váradi, Tamás, Koeva, Svetla, Yamalov, Martin, Tadić, Marko, Sass, Bálint, Nitoń, Bartłomiej, Ogrodniczuk, Maciej, Pęzik, Piotr, Barbu Mititelu, Verginica, Ion, Radu, Irimia, Elena, Mitrofan, Maria, Păiș, Vasile, Tufiș, Dan, Garabík, Radovan, Krek, Simon, Repar, Andraz, Rihtar, Matjaž, Brank, Janez, Calzolari, Nicoletta, Béchet, Frédéric, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Moreno, Asuncion, Odijk, Jan, and Piperidis, Stelios
- Subjects
law corpus, comparable corpus, under-resourced languages
- Abstract
This article presents the current outcomes of the MARCELL CEF Telecom project aiming to collect and deeply annotate a large comparable corpus of legal documents. The MARCELL corpus includes 7 monolingual sub-corpora (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian) containing the total body of respective national legislative documents. These sub-corpora are automatically sentence split, tokenized, lemmatized and morphologically and syntactically annotated. The monolingual sub-corpora are complemented by a thematically related parallel corpus (Croatian-English). The metadata and the annotations are uniformly provided for each language specific sub-corpus. Besides the standard morphosyntactic analysis plus named entity and dependency and/or noun phrase annotation, the corpus is enriched with the IATE and EuroVoc labels. The file format is CoNLL-U Plus Format, containing the ten columns specific to the CoNLL-U format and four extra columns specific to our corpora. The MARCELL corpora represent a rich and valuable source for further studies and developments in machine learning, cross-lingual terminological data extraction and classification.
- Published
- 2020
31. The European Language Technology Landscape in 2020: Language-Centric and Human-Centric AI for Cross-Cultural Communication in Multilingual Europe
- Author
-
Rehm, Georg, Marheinecke, Katrin, Hegele, Stefanie, Piperidis, Stelios, Bontcheva, Kalina, Hajic, Jan, Choukri, Khalid, Vasiļjevs, Andrejs, Backfried, Gerhard, Prinz, Christoph, Gomez-Perez, Jose Manuel, Meertens, Luc, Lukowicz, Paul, van Genabith, Josef, Lösch, Andrea, Slusallek, Philipp, Irgens, Morten, Gatellier, Patrick, Köhler, Joachim, Le Bars, Laure, Anastasiou, Dimitra, Auksoriūtė, Albina, Bel, Núria, Branco, António, Budin, Gerhard, Daelemans, Walter, De Smedt, Koenraad, Garabík, Radovan, Gavriilidou, Maria, Gromann, Dagmar, Koeva, Svetla, Krek, Simon, Krstev, Cvetana, Lindén, Krister, Magnini, Bernardo, Odijk, Jan, Ogrodniczuk, Maciej, Rögnvaldsson, Eiríkur, Rosner, Mike, Pedersen, Bolette, Skadina, Inguna, Tadić, Marko, Tufiș, Dan, Váradi, Tamás, Vider, Kadri, Way, Andy, Yvon, François, Calzolari, Nicoletta, Béchet, Frédéric, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Moreno, Asuncion, Odijk, Jan, and Piperidis, Stelios
- Subjects
national and international projects, infrastructural issues, policy issues, infrastructures, multilingualism
- Abstract
Multilingualism is a cultural cornerstone of Europe and firmly anchored in the European treaties including full language equality. However, language barriers impacting business, cross-lingual and cross-cultural communication are still omnipresent. Language Technologies (LTs) are a powerful means to break down these barriers. While the last decade has seen various initiatives that created a multitude of approaches and technologies tailored to Europe’s specific needs, there is still an immense level of fragmentation. At the same time, AI has become an increasingly important concept in the European Information and Communication Technology area. For a few years now, AI – including many opportunities, synergies but also misconceptions – has been overshadowing every other topic. We present an overview of the European LT landscape, describing funding programmes, activities, actions and challenges in the different countries with regard to LT, including the current state of play in industry and the LT market. We present a brief overview of the main LT-related activities on the EU level in the last ten years and develop strategic guidance with regard to four key dimensions.
- Published
- 2020
32. MISA: Multilingual 'ISA' extraction from corpora
- Author
-
Stefano Faralli, Lefever, E., Ponzetto, S. P., Calzolari, Nicoletta, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Hasida, Koiti, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Moreno, Asuncion, Odijk, Jan, Piperidis, Stelios, and Tokunaga, Takenobu
- Subjects
lt3, Framework, Hearst patterns, Hypernym extraction, Multilinguality, Languages and Literatures
- Published
- 2019
33. Abstractive Document Summarization without Parallel Data
- Author
-
Nikolov, Nikola I, Hahnloser, Richard H R, Calzolari, Nicoletta, Béchet, Frédéric, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Moreno, Asuncion, Odijk, Jan, Piperidis, Stelios, and University of Zurich
- Subjects
FOS: Computer and information sciences, 3310 Linguistics and Language, Computer Science - Computation and Language, 570 Life sciences, biology, 3309 Library and Information Sciences, Computation and Language (cs.CL), 1203 Language and Linguistics, 10194 Institute of Neuroinformatics, 3304 Education
- Abstract
Abstractive summarization typically relies on large collections of paired articles and summaries. However, in many cases, parallel data is scarce and costly to obtain. We develop an abstractive summarization system that relies only on large collections of example summaries and non-matching articles. Our approach consists of an unsupervised sentence extractor that selects salient sentences to include in the final summary, as well as a sentence abstractor that is trained on pseudo-parallel and synthetic data, that paraphrases each of the extracted sentences. We perform an extensive evaluation of our method: on the CNN/DailyMail benchmark, on which we compare our approach to fully supervised baselines, as well as on the novel task of automatically generating a press release from a scientific journal article, which is well suited for our system. We show promising performance on both tasks, without relying on any article-summary pairs. Comment: LREC 2020
- Published
- 2019
34. The Metalogue Debate Trainee Corpus: Data Collection and Annotation
- Author
-
Petukhova, Volha, Malchanau, Andrei, Oualil, Youssef, Klakow, Dietrich, Luz, Saturnino, Haider, Fasih, Campbell, Nick, Koryzis, Dimitris, Spiliotopoulos, Dimitris, Albert, Pierre, Linz, Nicklas, Alexandersson, Jan, Calzolari, Nicoletta (Conference chair), Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Hasida, Koiti, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Moreno, Asuncion, Odijk, Jan, Piperidis, Stelios, and Tokunaga, Takenobu
- Abstract
This paper describes the Metalogue Debate Trainee Corpus (DTC). DTC has been collected and annotated in order to facilitate the design of instructional and interactive models for Virtual Debate Coach application - an intelligent tutoring system used by young parliamentarians to train their debate skills. The training is concerned with the use of appropriate multimodal rhetorical devices in order to improve (1) the organization of arguments, (2) arguments’ content selection, and (3) argument delivery techniques. DTC contains tracking data from motion and speech capturing devices and semantic annotations - dialogue acts - as defined in ISO 24617-2 and discourse relations as defined in ISO 24617-8. The corpus comes with a manual describing the data collection process, annotation activities including an overview of basic concepts and their definitions including annotation schemes and guidelines on how to apply them, tools and other resources. DTC will be released in the ELRA catalogue in second half of 2018.
- Published
- 2018
35. Crowdsourcing Regional Variation Data and Automatic Geolocalisation of Speakers of European French
- Author
-
Jean-Philippe Goldman, Yves Scherrer, Julie Glikman, Mathieu Avanzi, Christophe Benzitoun, Philippe Boula de Mareüil, Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur (LIMSI), Université Paris Saclay (COmUE)-Centre National de la Recherche Scientifique (CNRS)-Sorbonne Université - UFR d'Ingénierie (UFR 919), Sorbonne Université (SU)-Sorbonne Université (SU)-Université Paris-Saclay-Université Paris-Sud - Paris 11 (UP11), European Language Resources Association (ELRA), Calzolari, Nicoletta, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Hasida, Koiti, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Moreno, Asuncion, Odijk, Jan, Piperidis, Stelios, Tokunaga, Takenobu, Department of Digital Humanities, and Language Technology
- Subjects
regionalism, linguistic geography, geolocalisation, 6121 Languages, crowdsourcing, cartography, [INFO]Computer Science [cs], 113 Computer and information sciences, language variation, linguistique, [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]
- Abstract
We present the crowdsourcing platform Donnez Votre Français à la Science (DFS, or “Give your French to Science”), which aims to collect linguistic data and document language use, with a special focus on regional variation in European French. The activities not only gather data that is useful for scientific studies, but they also provide feedback to the general public; this is important in order to reward participants, to encourage them to follow future surveys, and to foster interaction with the scientific community. The two main activities described here are 1) a linguistic survey on lexical variation with immediate feedback and 2) a speaker geolocalisation system; i.e., a quiz that guesses the linguistic origin of the participant by comparing their answers with previously gathered linguistic data. For the geolocalisation activity, we set up a simulation framework to optimise predictions. Three classification algorithms are compared: the first one uses clustering and shibboleth detection, whereas the other two rely on feature elimination techniques with Support Vector Machines and Maximum Entropy models as underlying base classifiers. The best-performing system uses a selection of 17 questions and reaches a localisation accuracy of 66%, extending the prediction from the one-best area (one among 109 base areas) to its first-order and second-order neighbouring areas.
- Published
- 2018
36. A fine-grained error analysis of NMT, PBMT and RBMT output for English-to-Dutch
- Author
-
Van Brussel, Laura, Tezcan, Arda, Macken, Lieve, Calzolari, Nicoletta, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Hasida, Koiti, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Moreno, Asuncion, Odijk, Jan, Piperidis, Stelios, and Tokunaga, Takenobu
- Subjects
LT3, Languages and Literatures
- Published
- 2018
37. Joint Learning of Sense and Word Embeddings
- Author
-
Mohammed Alsuhaibani, Bollegala, D., Calzolari, Nicoletta, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Hasida, Kôiti, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Moreno, Asunción, Odijk, Jan, Piperidis, Stelios, and Tokunaga, Takenobu
- Published
- 2018
38. Open Subtitles Paraphrase Corpus for Six Languages
- Author
-
Mathias Johan Philip Creutz, Calzolari, Nicoletta, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Hasida, Koiti, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Moreno, Asuncion, Odijk, Jan, Piperidis, Stelios, Tokunaga, Takenobu, and University of Helsinki, Department of Digital Humanities
- Subjects
FOS: Computer and information sciences, Computer Science - Computation and Language, 6121 Languages, Computation and Language (cs.CL)
- Abstract
This paper accompanies the release of Opusparcus, a new paraphrase corpus for six European languages: German, English, Finnish, French, Russian, and Swedish. The corpus consists of paraphrases, that is, pairs of sentences in the same language that mean approximately the same thing. The paraphrases are extracted from the OpenSubtitles2016 corpus, which contains subtitles from movies and TV shows. The informal and colloquial genre that occurs in subtitles makes such data a very interesting language resource, for instance, from the perspective of computer assisted language learning. For each target language, the Opusparcus data have been partitioned into three types of data sets: training, development and test sets. The training sets are large, consisting of millions of sentence pairs, and have been compiled automatically, with the help of probabilistic ranking functions. The development and test sets consist of sentence pairs that have been checked manually; each set contains approximately 1000 sentence pairs that have been verified to be acceptable paraphrases by two annotators.
- Published
- 2018
39. Croatian error-annotated corpus of non-professional written language
- Author
-
Štefanec, Vanja, Ljubešić, Nikola, Kuvač Kraljević, Jelena, Calzolari, Nicoletta, Choukri, Khalid, Declerck, Thierry, Goggi, Sara, Grobelnik, Marko, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Moreno, Asunción, Odijk, Jan, and Piperidis, Stelios
- Subjects
error corpus, language disorders, Croatian
- Abstract
In the paper, the authors will present the Croatian corpus of non-professional written language. With its two subcorpora, i.e. the clinical subcorpus, consisting of written texts produced by speakers with various types of language disorders, and the healthy speakers subcorpus, and with its levels of annotation, it offers an opportunity for different lines of research. The authors will present the corpus structure, describe the sampling methodology, explain the levels of annotation, and give some very basic statistics. On the basis of data from the corpus, existing language technologies for Croatian will be adapted in order to be implemented in a platform facilitating text production for speakers with language disorders. In this respect, several analyses of the corpus data will be presented.
- Published
- 2016
40. Compilation of an Arabic children’s corpus
- Author
-
Al-Sulaiti, Latifa, Abbas, Noorhan, Brierley, Claire, Atwell, Eric, Alghamdi, Ayman, Calzolari, Nicoletta, Choukri, Khalid, Declerck, Thierry, Goggi, Sara, Grobelnik, Marko, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Moreno, Asunción, Odijk, Jan, and Piperidis, Stelios
- Subjects
Z51, Z503, Z512
- Abstract
Inspired by the Oxford Children's Corpus, we have developed a prototype corpus of Arabic texts written and/or selected for children. Our Arabic Children's Corpus of 2950 documents and nearly 2 million words has been collected manually from the web during a 3-month project. It is of high quality, and contains a range of different children's genres based on sources located, including classic tales from The Arabian Nights, and popular fictional characters such as Goha. We anticipate that the current and subsequent versions of our corpus will lead to interesting studies in text classification, language use, and ideology in children's texts.
- Published
- 2016
41. ArchiMob - A Corpus of Spoken Swiss German
- Author
-
Samardžić, Tanja, Scherrer, Yves, Glaser, Elvira, University of Zurich, Calzolari, Nicoletta, Choukri, Khalid, Declerck, Thierry, Goggi, Sara, Grobelnik, Marko, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Moreno, Asunción, Odijk, Jan, and Piperidis, Stelios
- Subjects
3310 Linguistics and Language, UFSP13-3 Language and Space, ddc:410, 430 German & related languages, 10096 Institute of German Studies, 3309 Library and Information Sciences, 1203 Language and Linguistics, 3304 Education
- Abstract
Swiss dialects of German are, unlike most dialects of well standardised languages, widely used in everyday communication. Despite this fact, automatic processing of Swiss German is still a considerable challenge due to the fact that it is mostly a spoken variety rarely recorded and that it is subject to considerable regional variation. This paper presents a freely available general-purpose corpus of spoken Swiss German suitable for linguistic research, but also for training automatic tools. The corpus is a result of a long design process, intensive manual work and specially adapted computational processing. We first describe how the documents were transcribed, segmented and aligned with the sound source, and how inconsistent transcriptions were unified through an additional normalisation layer. We then present a bootstrapping approach to automatic normalisation using different machine-translation-inspired methods. Furthermore, we evaluate the performance of part-of-speech taggers on our data and show how the same bootstrapping approach improves part-of-speech tagging by 10% over four rounds. Finally, we present the modalities of access of the corpus as well as the data format.
- Published
- 2016
42. Acoustic Features of Different Types of Laughter in North Sami Conversational Speech
- Author
-
Hiovain, Katri, Jokinen, Päivi Kristiina, Calzolari, Nicoletta, Choukri, Khalid, Declerck, Thierry, Goggi, Sara, Grobelnik, Marko, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Moreno, Asunción, Odijk, Jan, Piperidis, Stelios, Department of Modern Languages 2010-2017, and Department of Education
- Subjects
6164 Speech communication, 6161 Phonetics, education, 6121 Languages
- Published
- 2016
43. Introducing the CURLICAT Corpora: Seven-language Domain Specific Annotated Corpora from Curated Sources
- Author
-
Váradi, Tamás, Nyéki, Bence, Koeva, Svetla, Tadić, Marko, Štefanec, Vanja, Ogrodniczuk, Maciej, Nitoń, Bartłomiej, Pęzik, Piotr, Barbu Mititelu, Verginica, Irimia, Elena, Mitrofan, Maria, Tufiș, Dan, Garabík, Radovan, Krek, Simon, Repar, Andraž, Calzolari, Nicoletta, Béchet, Frédéric, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Odijk, Jan, and Piperidis, Stelios
- Subjects
national corpora, comparable corpora, domain corpora
- Abstract
This article presents the current outcomes of the CURLICAT CEF Telecom project, which aims to collect and deeply annotate a set of large corpora from selected domains. The CURLICAT corpus includes 7 monolingual corpora (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian) containing selected samples from respective national corpora. These corpora are automatically tokenized, lemmatized and morphologically analysed and the named entities annotated. The annotations are uniformly provided for each language specific corpus while the common metadata schema is harmonised across the languages. Additionally, the corpora are annotated for IATE terms in all languages. The file format is CoNLL-U Plus format, containing the ten columns specific to the CoNLL-U format and three extra columns specific to our corpora as defined by Váradi et al. (2020). The CURLICAT corpora represent a rich and valuable source not just for training NMT models, but also for further studies and developments in machine learning, cross-lingual terminological data extraction and classification.
44. Paraphrase Generation and Evaluation on Colloquial-Style Sentences
- Author
-
Eetu Sjöblom, Mathias Johan Philip Creutz, Yves Scherrer, Calzolari, Nicoletta, Béchet, Frédéric, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Moreno, Asuncion, Odijk, Jan, Piperidis, Stelios, Department of Digital Humanities, and Language Technology
- Subjects
6121 Languages, 113 Computer and information sciences
- Abstract
In this paper, we investigate paraphrase generation in the colloquial domain. We use state-of-the-art neural machine translation models trained on the Opusparcus corpus to generate paraphrases in six languages: German, English, Finnish, French, Russian, and Swedish. We perform experiments to understand how data selection and filtering for diverse paraphrase pairs affects the generated paraphrases. We compare two different model architectures, an RNN and a Transformer model, and find that the Transformer does not generally outperform the RNN. We also conduct human evaluation on five of the six languages and compare the results to the automatic evaluation metrics BLEU and the recently proposed BERTScore. The results advance our understanding of the trade-offs between the quality and novelty of generated paraphrases, affected by the data selection method.
In addition, our comparison of the evaluation methods shows that while BLEU correlates well with human judgments at the corpus level, BERTScore outperforms BLEU in both corpus and sentence-level evaluation.
45. Building the Spanish-Croatian parallel corpus
- Author
-
Mikelenić, Bojana, Tadić, Marko, Calzolari, Nicoletta, Béchet, Frédéric, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Moreno, Asuncion, Odijk, Jan, and Piperidis, Stelios
- Subjects
written corpus ,parallel corpus ,Spanish ,Croatian
- Abstract
This paper describes the building of the first Spanish-Croatian unidirectional parallel corpus, which has been constructed at the Faculty of Humanities and Social Sciences of the University of Zagreb. The corpus comprises eleven Spanish novels and their translations into Croatian by six different professional translators. All the texts were published between 1999 and 2012. The corpus has more than 2 Mw, with approximately 1 Mw for each language. It was automatically sentence-segmented and aligned, as well as manually post-corrected, and contains 71,778 translation units. In order to protect the copyright and to make the corpus available under a permissive CC-BY licence, the aligned translation units are shuffled. This limits the usability of the corpus to research on language units at the sentence level and below. There are two versions of the corpus in TMX format that will be available for download through the META-SHARE and CLARIN ERIC infrastructures. The former contains plain TMX, while the latter is lemmatised and POS-tagged and stored in the aTMX format.
46. A real-world data resource of complex sensitive sentences based on documents from the Monsanto trial
- Author
-
Jan Neerbek, Morten Eskildsen, Peter Dolog, Ira Assent, Calzolari, Nicoletta, Béchet, Frédéric, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Moreno, Asunción, Odijk, Jan, and Piperidis, Stelios
- Subjects
Document Classification ,Statistical and Machine Learning Methods ,Corpus (Creation Annotation Etc.) ,Text categorisation
- Abstract
In this work we present a corpus for the evaluation of sensitive information detection approaches that addresses the need for real-world sensitive information for empirical studies. Our sentence corpus contains different notions of complex sensitive information that correspond to different aspects of concern in a current trial of the Monsanto company. This paper describes the annotation process, in which we both employ human annotators and create automatically inferred labels regarding technical, legal and informal communication within and with employees of Monsanto, drawing on a classification of documents by lawyers involved in the Monsanto court case. We release a corpus of high-quality sentences and parse trees with these two types of labels at the sentence level. We characterize the sensitive information via several representative sensitive information detection models, in particular both keyword-based (n-gram) approaches and recent deep learning models, namely, recurrent neural networks (LSTM) and recursive neural networks (RecNN). Data and code are made publicly available.
47. Evaluating language tools for fifteen EU-official under-resourced languages
- Author
-
Alves, D., Gaurish Thakkar, Tadic, M., Calzolari, Nicoletta, Béchet, Frédéric, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Moreno, Asuncion, Odijk, Jan, and Piperidis, Stelios
- Subjects
FOS: Computer and information sciences ,Computer Science - Computation and Language ,language processing chains ,under-resourced languages ,evaluation ,Computation and Language (cs.CL)
- Abstract
This article presents the results of the evaluation campaign of language tools available for fifteen EU-official under-resourced languages. The evaluation was conducted within the MSC ITN CLEOPATRA action that aims at building cross-lingual event-centric knowledge processing on top of the application of linguistic processing chains (LPCs) for at least 24 EU-official languages. In this campaign, we concentrated on three existing NLP platforms (Stanford CoreNLP, NLP Cube, UDPipe) that all provide models for under-resourced languages, and in this first run we covered 15 under-resourced languages for which the models were available. We describe the design of the evaluation campaign, then present and discuss the results. We considered a difference of less than one percentage point between the reported results and our tested results to be within acceptable tolerance, and thus treat such results as reproducible. However, for a number of languages, the results are below what was reported in the literature, and in some cases, our testing results are even better than the ones reported previously. Particularly problematic was the evaluation of NERC systems. One of the reasons is the absence of a universally or cross-lingually applicable named entity classification scheme that would serve the NERC task in different languages, analogous to the Universal Dependency scheme in the parsing task. Building such a scheme has become one of our future research directions.
48. What comes first: Combining motion capture and eye tracking data to study the order of articulators in constructed action in sign language narratives
- Author
-
Tommi Jantunen, Puupponen, A., Burger, B., Calzolari, Nicoletta, Béchet, Frédéric, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Moreno, Asuncion, Odijk, Jan, and Piperidis, Stelios
- Subjects
narration ,constructed action ,viittomakieli ,eye tracking ,sign language ,motion capture ,katseenseuranta ,suomalainen viittomakieli ,liikkeenkaappaus
- Abstract
We use synchronized 120 fps motion capture and 50 fps eye tracking data from two native signers to investigate the temporal order in which the dominant hand, the head, the chest and the eyes start producing overt constructed action from regular narration in seven short Finnish Sign Language stories. From the material, we derive a sample of ten instances of regular narration to overt constructed action transfers in ELAN which we then further process and analyze in Matlab. The results indicate that the temporal order of articulators shows both contextual and individual variation but that there are also repeated patterns which are similar across all the analyzed sequences and signers. Most notably, when the discourse strategy changes from regular narration to overt constructed action, the head and the eyes tend to take the leading role, and the chest and the dominant hand tend to start acting last. Consequences of the findings are discussed.
49. Language Resources for Historical Newspapers: the Impresso Collection
- Author
-
Maud Ehrmann, Matteo Romanello, Simon Clematide, Philipp Ströbel, Raphaël Barman, Calzolari, Nicoletta, Béchet, Frédéric, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Moreno, Asuncion, Odijk, Jan, Piperidis, Stelios, University of Zurich, and Ehrmann, Maud
- Subjects
historical document processing ,multi-layered historical semantic annotations ,topic modeling ,410 Linguistics ,historical newspapers ,000 Computer science, knowledge & systems ,language resources ,OCR ,named entity processing ,10105 Institute of Computational Linguistics ,historical texts ,text reuse ,natural language processing ,digital humanities ,historical and multilingual language resources
- Abstract
Following decades of massive digitization, an unprecedented amount of historical document facsimiles can now be retrieved and accessed via cultural heritage online portals. If this represents a huge step forward in terms of preservation and accessibility, the next fundamental challenge, and real promise of digitization, is to exploit the contents of these digital assets, and therefore to adapt and develop appropriate language technologies to search and retrieve information from this 'Big Data of the Past'. Yet, the application of text processing tools on historical documents in general, and historical newspapers in particular, poses new challenges, and crucially requires appropriate language resources. In this context, this paper presents a collection of historical newspaper data sets composed of text and image resources, curated and published within the context of the 'impresso - Media Monitoring of the Past' project. With corpora, benchmarks, semantic annotations and language models in French, German and Luxembourgish covering ca. 200 years, the objective of the impresso resource collection is to contribute to historical language resources, and thereby strengthen the robustness of approaches to non-standard inputs and foster efficient processing of historical documents.
50. On Practical Realisation of Autosegmental Representations in Lexical Transducers of Tonal Bantu Languages
- Author
-
Anssi Yli-Jyrä, Adda, Gilles, Choukri, Khalid, Kasinskaite-Buddeberg, Irmgarda, Mariani, Joseph, Mazo, Hélène, Sakriani, Sakti, and Language Technology
- Subjects
lexical transducers ,compositionality ,6121 Languages ,origin correspondence ,113 Computer and information sciences ,tonal languages ,autosegmental phonology
- Abstract
A lexical transducer is a language technology resource that is typically used to predict the orthographic word forms and to model the relation between the lexical and the surface word forms of a morphologically complex language. This paper motivates the construction of tone-enhanced lexical transducers for tonal languages and gives two supporting arguments for the feasibility of finite-state compilation of autosegmental derivations. According to the Common Timeline Argument, adding a common timeline to autosegmental representations is crucial for their computational processing. According to the Compilation Argument, the compilation of autosegmental grammars requires combining code-theoretic and model-theoretic research lines.