6 results on '"Díaz de Ilarraza, Arantza"'
Search Results
2. EUSKOR: End-to-end coreference resolution system for Basque.
- Author
-
Soraluze A, Arregi O, Arregi X, and Díaz de Ilarraza A
- Subjects
- Humans, Pattern Recognition, Automated, Semantics, Spain, Language, Natural Language Processing
- Abstract
This paper describes the process of adapting the Stanford Coreference resolution module to the Basque language, taking into account the characteristics of the language. The module has been integrated in a linguistic analysis pipeline obtaining an end-to-end coreference resolution system for the Basque language. The adaptation process explained can benefit and facilitate other languages with similar characteristics in the implementation of their coreference resolution systems. During the experimentation phase, we have demonstrated that language-specific features have a noteworthy effect on coreference resolution, obtaining a gain in CoNLL score of 7.07 with respect to the baseline system. We have also analysed the effect that preprocessing has in coreference resolution, comparing the results obtained with automatic mentions versus gold mentions. When gold mentions are provided, the results increase 11.5 points in CoNLL score in comparison with results obtained when automatic mentions are used. The contribution of each sieve is analysed concluding that morphology is essential for agglutinative languages to obtain good performance in coreference resolution. Finally, an error analysis of the coreference resolution system is presented which have revealed our system's weak points and help to determine the improvements of the system. As a result of the error analysis, we have enriched the Basque coreference resolution adding new two sieves, obtaining an improvement of 0.24 points in CoNLL F1 when automatic mentions are used and of 0.39 points when the gold mentions are provided., Competing Interests: The authors have declared that no competing interests exist.
- Published
- 2019
- Full Text
- View/download PDF
3. Coreferential Relations in Basque: The Annotation Process.
- Author
-
Ceberio K, Aduriz I, Díaz de Ilarraza A, and Garcia-Azkoaga I
- Subjects
- Humans, Language, Spain, Linguistics, Natural Language Processing
- Abstract
In this paper we present the coreferential tagging of part of the EPEC Corpus of Basque. Although coreference is a pragmatic linguistic phenomenon highly dependent on the situational context, it shows some language-specific patterns that vary according to the features of each language. Due to the fact that Basque is not an Indo-European language, it differs considerably in grammar from the languages spoken in surrounding areas. We will explain these features and the decisions made in each case. After describing the criteria defined for coreferential tagging in Basque, the annotation process will be explained. Our annotation is based on a morphologically and syntactically annotated corpus that provides us with a manageable environment, in which the specific structures that are part of a reference chain can be more easily identified. A part of the corpus was tagged by two annotators who marked up the same text independently, and by another annotator that acted as judge, solving problems in case of disagreement. All this process has been automatized as a result of previous studies carried out in this field. The automatic detection of mentions (Soraluze et al., in: Proceedings of Konvens, 2012) has provided us with a better working environment, and given us the possibility to build a first significant corpus for a later computational treatment of automatic coreferential resolution.
- Published
- 2018
- Full Text
- View/download PDF
4. Ebaluatoia: crowd evaluation for English-Basque machine translation.
- Author
-
Aranberri, Nora, Labaka, Gorka, Díaz de Ilarraza, Arantza, and Sarasola, Kepa
- Subjects
LANGUAGE & languages ,ENGLISH language ,BASQUE language ,LINGUISTICS ,NATURAL language processing ,PAIRED comparisons (Mathematics) - Abstract
This work explores the feasibility of a crowd-based pair-wise comparison evaluation to get feedback on machine translation progress for under-resourced languages. Specifically, we propose a task based on simple work units to compare the outputs of five English-to-Basque systems, which we implement in a web application. In our design, we put forward two key aspects that we believe community collaboration initiatives should consider in order to attract and maintain participants, that is, providing both a community challenge and a personal challenge. We describe how these aspects can comply with a strict methodology to ensure research validity. In particular, we consider the evaluation set size and the characteristics of the test sentences, the number of evaluators per comparison pair, and a mechanism to identify dishonest participation (or participants with insufficient linguistic knowledge). We also describe our dissemination effort, which targeted both general users and interest groups. Over 500 people participated actively in the Ebaluatoia campaign and we were able to collect over 35,000 evaluations in a short period of 10 days. From the results, we complete the ranking of the systems under evaluation and establish whether the difference in quality between the systems is significant. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
5. Improving mention detection for Basque based on a deep error analysis.
- Author
-
SORALUZE, ANDER, ARREGI, OLATZ, ARREGI, XABIER, and DÍAZ DE ILARRAZA, ARANTZA
- Subjects
BASQUE language ,ERROR detection (Information theory) ,NATURAL language processing ,COMPUTER network protocols ,MATCHING theory - Abstract
This paper presents the improvement process of a mention detector for Basque. The system is rule-based and takes into account the characteristics of mentions in Basque. A classification of error types is proposed based on the errors that occur during mention detection. A deep error analysis distinguishing error types and causes is presented and improvements are proposed. At the final stage, the system obtains an F-measure of 74.57% under the Exact Matching protocol and of 80.57% under Lenient Matching. We also show the performance of the mention detector with gold standard data as input, in order to omit errors caused by the previous stages of linguistic processing. In this scenario, we obtain an F-measure of 85.89% with Strict Matching and of 89.06% with Lenient Matching, i.e., a difference of 11.32 and 8.49 percentage points, respectively. Finally, how improvements in mention detection affect coreference resolution is analysed. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
6. Computer aided classification of diagnostic terms in spanish.
- Author
-
Pérez, Alicia, Gojenola, Koldo, Casillas, Arantza, Oronoz, Maite, and Díaz de Ilarraza, Arantza
- Subjects
- *
COMPUTER-aided design , *MEDICAL records , *SPANISH language , *FINITE state machines ,INTERNATIONAL Statistical Classification of Diseases & Related Health Problems - Abstract
The goal of this paper is to classify Medical Records (MRs) by their diagnostic terms (DTs) according to the International Classification of Diseases Clinical Modification (ICD-9-CM). The challenge we face is twofold: (i) to treat the natural and non-standard language in which doctors express their diagnostics and (ii) to perform a large-scale classification problem. We propose the use of Finite-State Transducers (FSTs) that, for their underlying topology, constrain the allowed input DT string while synchronously produce the output ICD-9-CM class. It is outstanding their versatility to efficiently implement soft-matching operations between terms expressed in natural language to standard terms and, hence, to the final ICD-9-CM code. The FSTs were built up from a corpora and standard resources such as the ICD-9-CM and SNOMED CT amongst others. Our corpora count on a big-data comprising more than 20,000 DTs from MRs from the Basque Hospital System so as to model natural language in this domain. An F1-measure of 91.2 was achieved on a test-set of 2850 randomly selected DTs, and a random 5-fold cross validation on a training set served to double-check the stability of the provided results. Real MRs were of much help to adapt the system to natural language. Misspellings, colloquial and specific language and abbreviations made the classification process difficult. The FSTs were proven efficient in this large-scale classification task. Moreover, the composition operation for FSTs made it easy the addition of new features to the system. [ABSTRACT FROM AUTHOR]
- Published
- 2015
- Full Text
- View/download PDF
Catalog
Discovery Service for Jio Institute Digital Library
For full access to our library's resources, please sign in.