Descriptor: "Morphological segmentation" / Topic: computer.software_genre - Searchworks@Jio Institute Digital Library Search Results

1. Neural machine translation with a polysynthetic low resource language

Author: John Ortega, Richard Alexander Castro Mamani, and Kyunghyun Cho
Subjects: Linguistics and Language, Machine translation, Low resource, Computer science, business.industry, 02 engineering and technology, computer.software_genre, Language and Linguistics, Recurrent neural network, Artificial Intelligence, Morpheme, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Artificial intelligence, Suffix, Computational linguistics, business, Baseline (configuration management), computer, Morphological segmentation, Software, Natural language processing
Abstract: Low-resource languages (LRL) with complex morphology are known to be more difficult to translate in an automatic way. Some LRLs are particularly more difficult to translate than others due to the lack of research interest or collaboration. In this article, we experiment with a specific LRL, Quechua, that is spoken by millions of people in South America yet has not undertaken a neural approach for translation until now. We improve the latest published results with baseline BLEU scores using the state-of-the-art recurrent neural network approaches for translation. Additionally, we experiment with several morphological segmentation techniques and introduce a new one in order to decompose the language’s suffix-based morphemes. We extend our work to other high-resource languages (HRL) like Finnish and Spanish to show that Quechua, for qualitative purposes, can be considered compatible with and translatable into other major European languages with measurements comparable to the state-of-the-art HRLs at this time. We finalize our work by making our best two Quechua–Spanish translation engines available on-line.
Published: 2020
Full Text: View/download PDF

2. Morfeus+: Word parsing in Basque beyond morphological segmentation

Author: Koldo Gojenola, Xabier Artola, Itziar Aduriz, Zuhaitz Beloki, Jose Maria Arriola, and Nerea Ezeiza
Subjects: Agglutinative language, Linguistics and Language, Parsing, Grammar, business.industry, Computer science, media_common.quotation_subject, 02 engineering and technology, computer.software_genre, Language and Linguistics, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Word structure, 020201 artificial intelligence & image processing, Artificial intelligence, business, computer, Morphological segmentation, Word (computer architecture), Natural language processing, media_common
Abstract: This work describes the formalization of a word structure grammar that represents the complex morphological and morphosyntactic information embedded within the word forms of an agglutinative language (Basque), giving a comprehensive linguistic description of the main morphological phenomena, such as affixation, derivation, and composition, and also taking into account the modeling of both standard and non-standard words. We have identified the relevant issues to be addressed in the representation of such a grammar. We also present the development of Morfeus+, a tool for the analysis of unrestricted texts, testing its applicability and showing that its coverage is wide and robust, allowing the efficient processing of big volumes of data. This paper describes a mature system that has required several person/years and that tries to integrate a rigorous linguistic specification together with more practical implementation matters, such as the appropriate treatment of unknown words in unrestricted texts.
Published: 2020
Full Text: View/download PDF

3. Morphological Segmentation based Image Processing Approaches for Agro IoT System for Crop Production

Author: J.A.Adlin Layola, Niha K, and S. Amutha
Subjects: business.industry, Computer science, Image processing, Image segmentation, computer.software_genre, Field (computer science), Variety (cybernetics), Histogram, Data mining, Internet of Things, business, Image resolution, Morphological segmentation, computer
Abstract: IoT along with Image segmentation is applied in a variety of applications separately. This character of application in agriculture field is still existing and attained positive approaches in success, though the combinations of these tools are unreal. This paper illustrates a methodology for combining IoT along with image segmentation for resolving the ecological factor as well as man-made factor (pesticides/fertilizers). Here a variety of sensing devices which is used for measuring the vital environmental factors and growth of plant is captured and segmented using morphological segmentation to the leaf lattice which is evaluated with the help of histogram analysis.
Published: 2021
Full Text: View/download PDF

4. Should we find another model?: Improving Neural Machine Translation Performance with ONE-Piece Tokenization Method without Model Modification

Author: Hyeonseok Moon, Heuiseok Lim, Sugyeong Eo, and Chanjun Park
Subjects: Structure (mathematical logic), Vocabulary, Machine translation, business.industry, Computer science, media_common.quotation_subject, 02 engineering and technology, 010501 environmental sciences, computer.software_genre, Application software, 01 natural sciences, Tokenization (data security), 0202 electrical engineering, electronic engineering, information engineering, Slow speed, Model modification, 020201 artificial intelligence & image processing, Artificial intelligence, business, computer, Morphological segmentation, Natural language processing, 0105 earth and related environmental sciences, media_common
Abstract: Most of the recent Natural Language Processing (NLP) studies are based on the Pretrain-Finetuning Approach (PFA), but in small and medium-sized enterprises or companies with insufficient hardware there are many limitations to servicing NLP application software using such technology due to slow speed and insufficient memory. The latest PFA technologies require large amounts of data, especially for low-resource languages, making them much more difficult to work with. We propose a new tokenization method, ONE-Piece, to address this limitation that combines the morphology-considered subword tokenization method and the vocabulary method used after probing for an existing method that has not been carefully considered before. Our proposed method can also be used without modifying the model structure. We experiment by applying ONE-Piece to Korean, a morphologically-rich and low-resource language. We derive an optimal subword tokenization result for Korean-English machine translation by conducting a case study that combines the subword tokenization method, morphological segmentation, and vocabulary method. Through comparative experiments with all the tokenization methods currently used in NLP research, ONE-Piece achieves performance comparable to the current Korean-English machine translation state-of-the-art model.
Published: 2021
Full Text: View/download PDF

5. Morphological Segmentation for Seneca

Author: Emily Prud'hommeaux, Zoey Liu, and Robert Jimerson
Subjects: Grammar, Computer science, business.industry, media_common.quotation_subject, Model selection, computer.software_genre, Task (project management), Domain (software engineering), Set (abstract data type), Labeled data, Artificial intelligence, Architecture, business, computer, Morphological segmentation, Natural language processing, media_common
Abstract: This study takes up the task of low-resource morphological segmentation for Seneca, a critically endangered and morphologically complex Native American language primarily spoken in what is now New York State and Ontario. The labeled data in our experiments comes from two sources: one digitized from a publicly available grammar book and the other collected from informal sources. We treat these two sources as distinct domains and investigate different evaluation designs for model selection. The first design abides by standard practices and evaluates models with the in-domain development set, while the second one carries out evaluation using a development domain, or the out-of-domain development set. Across a series of monolingual and crosslinguistic training settings, our results demonstrate the utility of neural encoder-decoder architecture when coupled with multi-task learning.
Published: 2021
Full Text: View/download PDF

6. Towards a First Automatic Unsupervised Morphological Segmentation for Inuinnaqtun

Author: Ngoc Tan Le and Fatiha Sadat
Subjects: Rule-based machine translation, Machine translation, Computer science, business.industry, Polysynthetic language, Morphological analysis, Artificial intelligence, Language family, business, computer.software_genre, Morphological segmentation, computer, Natural language processing
Abstract: Low-resource polysynthetic languages pose many challenges in NLP tasks, such as morphological analysis and Machine Translation, due to the limited available resources and tools and to their morphological complexity. This research focuses on morphological segmentation, adapting an unsupervised approach based on Adaptor Grammars to a low-resource setting. Experiments and evaluations on Inuinnaqtun, a member of the Inuit language family spoken in Northern Canada and considered a language that will be extinct in less than two generations, have shown promising results.
Published: 2021
Full Text: View/download PDF

7. Morphology Model and Segmentation for Old Turkic Language

Author: Ualsher Tukeyev and Dinara Zhanabergenova
Subjects: Machine translation, Computer science, Turkish, language, Segmentation, Morphology (biology), Linguistic distance, Turkic languages, computer.software_genre, computer, Morphological segmentation, Linguistics, language.human_language
Abstract: Old Turkic language is the basis of all modern Turkic languages. Its study is very important for Turkic peoples who possess modern Turkic languages. This is important both from a historical point of view and for the study of modern issues of neural machine translation, issues of the linguistic distance of modern Turkic languages from their progenitor. This paper proposes the development of a computational model of the morphology of Old Turkic language based on the CSE (Complete Set of Endings) – model of morphology and a study on this basis of the issue of morphological segmentation of the texts of Old Turkic language, which will subsequently be used for neural machine translation of Old Turkic language into modern Turkic languages. Since most of the modern Turkic languages, except for the Turkish language, belong to low-resource languages, the issues of developing computational models of morphology, developing models, algorithms and software for processing Turkic languages are relevant.
Published: 2021
Full Text: View/download PDF

8. Minimally-Supervised Morphological Segmentation using Adaptor Grammars with Linguistic Priors

Author: Judith L. Klavans, Sujay Khandagale, Cass Lowry, Ramy Eskander, Francesca Callejas, Smaranda Muresan, and Maria Polinsky
Subjects: Rule-based machine translation, business.industry, Computer science, Prior probability, Artificial intelligence, business, computer.software_genre, Morphological segmentation, computer, Natural language processing
Published: 2021
Full Text: View/download PDF

9. Development of Morphological Segmentation for the Kyrgyz Language on Complete Set of Endings

Author: Ualsher Tukeyev, Nella Israilova, and Aigerim Toleush
Subjects: Artificial neural network, business.industry, Computer science, Text segmentation, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, computer.software_genre, Set (abstract data type), Development (topology), Data model, ComputingMethodologies_DOCUMENTANDTEXTPROCESSING, Artificial intelligence, business, Morphological segmentation, computer, Natural language processing
Abstract: The problem of word segmentation of source texts in the training of neural network models is one of the actual problems of natural language processing. A new model of the morphology of the Kyrgyz language based on a complete set of endings (CSE) is developed. Based on the developed CSE-model of the morphology of the Kyrgyz language, a computational data model, algorithm and a program for morphological segmentation are developed. Experiments on morphological segmentation of Kyrgyz language texts showed 82% accuracy of text segmentation by the proposed method.
Published: 2021
Full Text: View/download PDF

10. Semi-supervised Induction of Morpheme Boundaries in Czech Using a Word-Formation Network

Author: Jan Bodnár, Magda Ševčíková, and Zdeněk Žabokrtský
Subjects: Czech, business.industry, Computer science, Word formation, computer.software_genre, Base (topology), language.human_language, Market segmentation, Morpheme, language, Segmentation, Artificial intelligence, business, computer, Morphological segmentation, Natural language processing
Abstract: This paper deals with automatic morphological segmentation of Czech lemmas contained in the word-formation network DeriNet. Capturing derivational relations between base and derived lemmas, and segmenting lemmas into sequences of morphemes are two closely related formal models of how words come into existence. Thus we propose a novel segmentation method that benefits from the existence of the network; our solution constitutes new state-of-the-art for the Czech language.
Published: 2020
Full Text: View/download PDF

11. MorphoLex-FR: A derivational morphological database for 38,840 French words

Author: Claudia Sánchez-Gutiérrez, Joël Macoir, Hugo Mailhot, Maximiliano A. Wilson, and S. Hélène Deacon
Subjects: Databases, Factual, Computer science, Experimental and Cognitive Psychology, computer.software_genre, Lexicon, 050105 experimental psychology, 03 medical and health sciences, 0302 clinical medicine, Cognition, Arts and Humanities (miscellaneous), Noun, Developmental and Educational Psychology, Lexical decision task, 0501 psychology and cognitive sciences, General Psychology, Language, Database, 05 social sciences, Multilevel model, Linguistics, Prefix, Variable (computer science), Psychology (miscellaneous), Suffix, computer, Morphological segmentation, 030217 neurology & neurosurgery
Abstract: Studies on morphological processing in French, as in other languages, have shown disparate results. We argue that a critical and long-overlooked factor that could underlie these diverging results is the methodological differences in the calculation of morphological variables across studies. To address the need for a common morphological database, we present MorphoLex-FR, a sizeable and freely available database with 12 variables for prefixes, roots, and suffixes for the 38,840 words of the French Lexicon Project. MorphoLex-FR constitutes a first step to render future studies addressing morphological processing in French comparable. The procedure we used for morphological segmentation and variable computation is effectively the same as that in MorphoLex, an English morphological database. This will allow for cross-linguistic comparisons of future studies in French and English that will contribute to our understanding of how morphologically complex words are processed. To validate these variables, we explored their influence on lexical decision latencies for morphologically complex nouns in a series of hierarchical regression models. The results indicated that only morphological variables related to the suffix explained lexical decision latencies. The frequency and family size of the suffix exerted facilitatory effects, whereas the percentage of more frequent words in the morphological family of the suffix was inhibitory. Our results are in line with previous studies conducted in French and in English. In conclusion, this database represents a valuable resource for studies on the effect of morphology in visual word processing in French.
Published: 2019

12. Script Independent Morphological Segmentation for Arabic Maghrebi Dialects: An Application to Machine Translation

Author: Salima Harrat, Karima Meftouh, and Kamel Smaïli; affiliations: École normale supérieure - Bouzaréah-Alger (ENS Bouzaréah-Alger); Statistical Machine Translation and Speech Modelization and Text (SMarT), Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS); Laboratoire de Recherche en Informatique (LRI-ANNABA), Université Badji Mokhtar Annaba (UBMA)
Subjects: General Computer Science, Machine translation, business.industry, Computer science, Arabic dialects, Text segmentation, computer.software_genre, [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL], Focus (linguistics), Tree (data structure), Morphological segmentation, Scripting language, Segmentation, Artificial intelligence, business, computer, Arabic script, Word (computer architecture), Natural language processing
Abstract: This research deals with resources creation for under-resourced languages. We try to adapt existing resources for other resourced-languages to process less-resourced ones. We focus on Arabic dialects of the Maghreb, namely Algerian, Moroccan and Tunisian. We first adapt a well-known statistical word segmenter to segment Algerian dialect texts written in both Arabic and Latin scripts. We demonstrate that unsupervised morphological segmentation could be applied to Arabic dialects regardless of the script used. Next, we use this kind of segmentation to improve statistical machine translation scores between the three Maghrebi dialects and French. We use a parallel multidialectal corpus that includes six Arabic dialects in addition to MSA and French. We achieved interesting results. With regard to word segmentation, the rate of correctly segmented words reached 70% for those written in Latin script and 79% for those written in Arabic script. For machine translation, the unsupervised morphological segmentation helped to decrease out-of-vocabulary word rates by a minimum of 35%.
Published: 2019
Full Text: View/download PDF

13. Improving Automatically Induced Lexicons for Highly Agglutinating Languages Using Data-Driven Morphological Segmentation

Author: Wiehan Agenbag and Thomas Niesler
Subjects: Computer science, business.industry, Artificial intelligence, business, computer.software_genre, Morphological segmentation, computer, Natural language processing, Data-driven
Published: 2019
Full Text: View/download PDF

14. Unsupervised Morphological Segmentation for Low-Resource Polysynthetic Languages

Author: Judith L. Klavans, Ramy Eskander, and Smaranda Muresan
Subjects: Rule-based machine translation, Low resource, business.industry, Computer science, Morphological analysis, Polysynthetic language, Artificial intelligence, computer.software_genre, Part of speech, business, computer, Morphological segmentation, Natural language processing
Abstract: Polysynthetic languages pose a challenge for morphological analysis due to the root-morpheme complexity and to the word class “squish”. In addition, many of these polysynthetic languages are low-resource. We propose unsupervised approaches for morphological segmentation of low-resource polysynthetic languages based on Adaptor Grammars (AG) (Eskander et al., 2016). We experiment with four languages from the Uto-Aztecan family. Our AG-based approaches outperform other unsupervised approaches and show promise when compared to supervised methods, outperforming them on two of the four languages.
Published: 2019
Full Text: View/download PDF

15. Fortification of Neural Morphological Segmentation Models for Polysynthetic Minimal-Resource Languages

Author: Jesús Manuel Mager Hois, Katharina Kann, Ivan Meza, and Hinrich Schütze
Subjects: FOS: Computer and information sciences, Computer science, 02 engineering and technology, computer.software_genre, 03 medical and health sciences, 0302 clinical medicine, Resource (project management), Morpheme, Polysynthetic language, 0202 electrical engineering, electronic engineering, information engineering, Baseline (configuration management), Nahuatl, Computer Science - Computation and Language, business.industry, language.human_language, 030221 ophthalmology & optometry, language, 020201 artificial intelligence & image processing, Artificial intelligence, State (computer science), business, Computation and Language (cs.CL), Morphological segmentation, computer, Natural language processing, Word (computer architecture)
Abstract: Morphological segmentation for polysynthetic languages is challenging, because a word may consist of many individual morphemes and training data can be extremely scarce. Since neural sequence-to-sequence (seq2seq) models define the state of the art for morphological segmentation in high-resource settings and for (mostly) European languages, we first show that they also obtain competitive performance for Mexican polysynthetic languages in minimal-resource settings. We then propose two novel multi-task training approaches (one with, one without need for external unlabeled resources) and two corresponding data augmentation methods, improving over the neural baseline for all languages. Finally, we explore cross-lingual transfer as a third way to fortify our neural model and show that we can train one single multi-lingual model for related languages while maintaining comparable or even improved performance, thus reducing the amount of parameters by close to 75%. We provide our morphological segmentation datasets for Mexicanero, Nahuatl, Wixarika and Yorem Nokki for future research. (Comment: Long Paper, 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.)
Published: 2018
Full Text: View/download PDF

16. Automatically Tailoring Unsupervised Morphological Segmentation to the Language

Author: Ramy Eskander, Owen Rambow, and Smaranda Muresan
Subjects: Set (abstract data type), ComputingMethodologies_PATTERNRECOGNITION, Rule-based machine translation, Computer science, business.industry, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Artificial intelligence, business, computer.software_genre, Morphological segmentation, computer, Natural language processing, Selection (genetic algorithm)
Abstract: Morphological segmentation is beneficial for several natural language processing tasks dealing with large vocabularies. Unsupervised methods for morphological segmentation are essential for handling a diverse set of languages, including low-resource languages. Eskander et al. (2016) introduced a Language Independent Morphological Segmenter (LIMS) using Adaptor Grammars (AG) based on the best-on-average performing AG configuration. However, while LIMS worked best on average and outperforms other state-of-the-art unsupervised morphological segmentation approaches, it did not provide the optimal AG configuration for five out of the six languages. We propose two language-independent classifiers that enable the selection of the optimal or nearly-optimal configuration for the morphological segmentation of unseen languages.
Published: 2018
Full Text: View/download PDF

17. Unsupervised Learning of Morphology with Graph Sampling

Author: Maciej Sumalvico
Subjects: Computer science, business.industry, 05 social sciences, Computer Science::Computation and Language (Computational Linguistics and Natural Language and Speech Processing), Pattern recognition, Statistical model, 02 engineering and technology, computer.software_genre, 050105 experimental psychology, Graph, Graph sampling, Morphological analysis, 0202 electrical engineering, electronic engineering, information engineering, Unsupervised learning, Probability distribution, 020201 artificial intelligence & image processing, 0501 psychology and cognitive sciences, Segmentation, Artificial intelligence, business, computer, Morphological segmentation, Computer Science::Formal Languages and Automata Theory, Natural language processing
Abstract: We introduce a language-independent, graph-based probabilistic model of morphology, which uses transformation rules operating on whole words instead of the traditional morphological segmentation. The morphological analysis of a set of words is expressed through a graph having words as vertices and structural relationships between words as edges. We define a probability distribution over such graphs and develop a sampler based on the Metropolis-Hastings algorithm. The sampling is applied in order to determine the strength of morphological relationships between words, filter out accidental similarities and reduce the set of rules necessary to explain the data. The model is evaluated on the task of finding pairs of morphologically similar words, as well as generating new words. The results are compared to a state-of-the-art segmentation-based approach.
Published: 2017
Full Text: View/download PDF

18. Morphological segmentation method for Turkic language neural machine translation

Author: Aidana Karibayeva, Ualsher Tukeyev, and Zhandos Zhumanov
Subjects: kyrgyz, 0209 industrial biotechnology, General Computer Science, Machine translation, Computer science, 020209 energy, General Chemical Engineering, 02 engineering and technology, Kazakh, morphological segmentation, computer.software_genre, uzbek, 020901 industrial engineering & automation, 0202 electrical engineering, electronic engineering, information engineering, kazakh, turkic languages, business.industry, General Engineering, Engineering (General). Civil engineering (General), Turkic languages, neural machine translation, language.human_language, Uzbek, language, Artificial intelligence, TA1-2040, business, computer, Morphological segmentation, Natural language processing
Abstract: Dictionaries play an important role in neural machine translation (NMT). However, a large dictionary requires a significant amount of memory, which limits the application of NMT and can cause a memory error. This limitation can be solved by segmenting each word into morphemes in parallel source corpora. Therefore, this study introduces a new morphological segmentation approach for Turkic languages based on the complete set of endings (CSE), which reduces the vocabulary volume of the source corpora. Herein, we demonstrate the proposed CSE-based morphological segmentation method for the Kazakh, Kyrgyz, and Uzbek languages and present the results of computational NMT experiments for the Kazakh language. The NMT experiment results show that in comparison with byte-pair encoding (BPE)-based segmentation, the proposed CSE-based segmentation increases the bilingual evaluation understudy (BLEU) score by 0.5 and 0.2 points on average for Kazakh–English and English–Kazakh pairs, respectively. Furthermore, in comparison with the BPE-based segmentation, the proposed CSE-based segmentation approach reduced the vocabulary size in NMT by more than a factor of two. This feature of the proposed segmentation approach will be crucial for NMT as the size of the source corpora is increased to improve translation quality.
Published: 2020
Full Text: View/download PDF
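
Illustration for record 18: the ending-based segmentation idea summarized in the abstract above can be sketched, very loosely, as splitting each word into a stem and the longest ending found in a predefined set. The ending list, the `segment` function, the `min_stem` parameter, and the English toy words below are all assumptions made for illustration; this is not the authors' CSE model, data, or implementation.

```python
# Toy set of endings (hypothetical). A real complete set of endings (CSE)
# would be language-specific and far larger.
ENDINGS = {"s", "er", "ers", "ing", "ings"}


def segment(word, endings=ENDINGS, min_stem=2):
    """Split word into (stem, ending) using the longest matching ending.

    Returns (word, "") when no ending matches or when the remaining stem
    would be shorter than min_stem characters.
    """
    # Try longer endings first so that "ers" is preferred over "s".
    for ending in sorted(endings, key=len, reverse=True):
        stem = word[: len(word) - len(ending)]
        if word.endswith(ending) and len(stem) >= min_stem:
            return stem, ending
    return word, ""


for w in ["readers", "singing", "book"]:
    print(w, segment(w))
```

Splitting words this way lets many surface forms share one stem token and a small closed set of ending tokens, which is the kind of vocabulary reduction the abstract reports; the actual gains depend on the language and on the ending inventory used.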

19. Automatically identifying blend splinters that are morpheme candidates

Author: David Correia Saavedra
Subjects: 060201 languages & linguistics, Linguistics and Language, business.industry, Computer science, 06 humanities and the arts, Word formation, computer.software_genre, Language and Linguistics, Computer Science Applications, Morpheme, 0602 languages and literature, Artificial intelligence, business, Morphological segmentation, computer, Word (computer architecture), Natural language processing, Information Systems, Automated method
Abstract: Forms such as -topia in privatopia or -ercise in dancercise are known as blend splinters: they might not be morphemes, but they are clearly involved in word formation. This article offers an automated method that can highlight blend splinters which have the potential to become morphemes in their own right. For instance, the word alcoholic has given rise to a large number of blends such as workaholic or rageaholic, so that the splinter -holic is now recognized as a morpheme in the Oxford English Dictionary Online. Because of the sheer number of newly coined blends, it is difficult to identify splinters that are turning into morphemes on the sole basis of human observation. It would therefore be desirable to have an automated method that could process large amounts of data and identify such elements. This article develops such a method, relying on unsupervised morphological segmentation (Harris, 1955). A custom blend database was established for this purpose. The method is able to detect splinters mentioned in previous research, such as -tainment, -ercise, and cyber-, but in addition, it also detects elements that have not been discussed so far, including -tastic, -sumer, and -verse.
Published: 2014
Full Text: View/download PDF
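
Illustration for record 19: unsupervised segmentation in the tradition of Harris (1955) is commonly described in terms of letter successor variety, that is, the number of distinct letters that can follow a given word prefix in a word list, with a peak suggesting a possible boundary. The sketch below shows that counting step only. The word list and the function names are assumptions made for illustration; it does not reproduce the article's method or its blend database.

```python
from collections import defaultdict


def successor_variety(words):
    """Map each proper prefix to the set of distinct letters that follow it."""
    successors = defaultdict(set)
    for word in words:
        for i in range(1, len(word)):
            successors[word[:i]].add(word[i])
    return successors


def variety_profile(word, successors):
    """Successor variety after each proper prefix of word, left to right."""
    return [len(successors.get(word[:i], ())) for i in range(1, len(word))]


# Hypothetical toy word list.
WORDS = ["workaholic", "worker", "working", "works", "workshop"]

profile = variety_profile("workaholic", successor_variety(WORDS))
boundary = profile.index(max(profile)) + 1  # length of the prefix before the peak
print(profile, "workaholic"[:boundary], "workaholic"[boundary:])
```

With this toy list the only peak falls after "work", because that prefix is followed by several different letters while every other prefix has a single continuation. A list containing many -holic blends would be needed before a comparable count taken from the end of the word could single out a splinter.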

20. Minimally-Supervised Morphological Segmentation using Adaptor Grammars

Author: Kairit Sirts and Sharon Goldwater
Subjects: Linguistics and Language, Computer science, business.industry, Communication, Model selection, Pattern recognition, Machine learning, computer.software_genre, Training methods, Computer Science Applications, Human-Computer Interaction, Data set, ComputingMethodologies_PATTERNRECOGNITION, Rule-based machine translation, Artificial Intelligence, Nonparametric bayesian, Artificial intelligence, business, computer, Morphological segmentation
Abstract: This paper explores the use of Adaptor Grammars, a nonparametric Bayesian modelling framework, for minimally supervised morphological segmentation. We compare three training methods: unsupervised training, semi-supervised training, and a novel model selection method. In the model selection method, we train unsupervised Adaptor Grammars using an over-articulated metagrammar, then use a small labelled data set to select which potential morph boundaries identified by the metagrammar should be returned in the final output. We evaluate on five languages and show that semi-supervised training provides a boost over unsupervised training, while the model selection method yields the best average results over all languages and is competitive with state-of-the-art semi-supervised systems. Moreover, this method provides the potential to tune performance according to different evaluation metrics or downstream tasks.
Published: 2013
Full Text: View/download PDF

21. Morphological Analysis of the Dravidian Language Family

Author: Antoni Oliver, Arun Kumar, Lluís Padró, and Ryan Cotterell; affiliations: Universitat Politècnica de Catalunya, Departament de Ciències de la Computació; Universitat Politècnica de Catalunya, GPLN - Grup de Processament del Llenguatge Natural
Subjects: Exploit, Computer science, business.industry, Speech recognition, Dravidian languages, 02 engineering and technology, computer.software_genre, ComputingMethodologies_ARTIFICIALINTELLIGENCE, Set (abstract data type), 03 medical and health sciences, 0302 clinical medicine, Natural language processing (Computer science), Morphological analysis, 030221 ophthalmology & optometry, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Segmentation, Artificial intelligence, Informàtica::Aspectes socials [Àrees temàtiques de la UPC], Tractament del llenguatge natural (Informàtica), business, Morphological segmentation, computer, Natural language processing
Abstract: The Dravidian family is one of the most widely spoken sets of languages in the world, yet there are very few annotated resources available to NLP researchers. To remedy this, we create DravMorph, a corpus annotated for morphological segmentation and part-of-speech. Also, we exploit novel features and higher-order models to achieve promising results on these corpora on both tasks, beating techniques proposed in the literature by as much as 4 points in segmentation F1.
Published: 2017

22. Phonemic representations in morphological segmentation of written English words

Author: Robin K. Morris and Cintia S. Widmann
Subjects: Visual word recognition, Linguistics and Language, business.industry, Cognitive Neuroscience, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Phonology, computer.software_genre, Language and Linguistics, Linguistics, Lexical decision task, Artificial intelligence, Psychology, business, computer, Morphological segmentation, Priming (psychology), Natural language processing, Orthography
Abstract: We addressed the issue of the kinds of representations involved in morphological segmentation during visual word recognition. Specifically, we asked whether morphological segmentation operates on phonemic representations. The results of two masked priming experiments indicated that words with the appearance of morphologically complex structure (ponder) primed their apparent embedded roots (POND) as much as actual morphologically complex words (dreamer) primed their actual embedded roots (DREAM). However, the effect was significantly reduced in naming and it became inhibitory in lexical decision for primes (caper) whose phonemic representations did not completely overlap with those of their potential roots (CAP) but whose orthographic representations did. This suggests that morphological segmentation is not restricted to orthographic representations, but that it also engages phonemic representations.
Published: 2009
Full Text: View/download PDF

23. Abu-MaTran at WMT 2016 Translation Task: Deep Learning, Morphological Segmentation and Tuning on Character Sequences

Author: Antonio Toral and Víctor M. Sánchez-Cartagena
Subjects: Computer science, business.industry, Character (computing), Speech recognition, Deep learning, computer.software_genre, Translation (geometry), Task (project management), Rule-based machine translation, Machine translation system, Language model, Artificial intelligence, business, computer, Morphological segmentation, Natural language processing
Abstract: This paper presents the systems submitted by the Abu-MaTran project to the English-to-Finnish language pair at the WMT 2016 news translation task. We applied morphological segmentation and deep learning in order to address (i) the data scarcity problem caused by the lack of in-domain parallel data in the constrained task and (ii) the complex morphology of Finnish. We submitted a neural machine translation system, a statistical machine translation system reranked with a neural language model and the combination of their outputs tuned on character sequences. The combination and the neural system were ranked first and second respectively according to automatic evaluation metrics and tied for the first place in the human evaluation.
Published: 2016
Full Text: View/download PDF

24. An unsupervised approach for morphological segmentation of highly agglutinative Tamil language

Author: Ananthi Sheshasaayee and V. R. Angela Deepa
Subjects: Agglutinative language, Computer science, business.industry, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Pattern recognition, computer.software_genre, language.human_language, ComputingMethodologies_PATTERNRECOGNITION, Tamil, language, Identification (biology), Segmentation, Artificial intelligence, Suffix, business, Morphological segmentation, computer, Natural language processing
Abstract: Morphological learning through unsupervised means enables the automatic identification of affixes, morphological segmentation of words followed by the generation of paradigms incorporating the list of affixes with the combined list of stems for a particular language. For segmenting the words into stems and affixes various unsupervised approaches have been deployed. But for highly agglutinative languages like Tamil very little computational work has been done in this direction. This paper mainly portrays a morphology acquisition framework based on an unsupervised approach for the morphological segmentation of the highly agglutinative Tamil language.
Published: 2015
Full Text: View/download PDF

25. To Split or Not, and If so, Where? Theoretical and Empirical Aspects of Unsupervised Morphological Segmentation

Author: Amit Kirschenbaum
Subjects: Multiple sequence alignment, Computer science, business.industry, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Scale-space segmentation, computer.software_genre, Range (mathematics), Similarity (psychology), Segmentation, Artificial intelligence, Computational linguistics, Focus (optics), business, computer, Morphological segmentation, Natural language processing
Abstract: The purpose of this paper is twofold: First, it offers an overview of challenges encountered by unsupervised, knowledge free methods when analysing language data (with focus on morphology). Second, it presents a system for unsupervised morphological segmentation comprising two complementary methods that can handle a broad range of morphological processes. The first method collects words which share distributional and form similarity and applies Multiple Sequence Alignment to derive segmentation of these words. The second method then analyses less frequent words utilizing the segmentation results of the first method. The challenges presented in the theoretical part are demonstrated exemplarily on the workings and output of the introduced unsupervised system and accompanied by suggestions for how to address them in future work.
Published: 2015
Full Text: View/download PDF

26. Abu-MaTran at WMT 2015 Translation Task: Morphological Segmentation and Web Crawling

Author: Raphael Rubino, Tommi A. Pirinen, Miquel Esplà-Gomis, Prokopis Prokopidis, Vassilis Papavassiliou, Sergio Ortiz Rojas, Antonio Toral, and Nikola Ljubešić
Subjects: Machine translation, business.industry, Computer science, Crawling, Translation (geometry), computer.software_genre, Machine learning, Task (project management), Rule-based machine translation, Artificial intelligence, Evaluation of machine translation, business, Web crawler, computer, Morphological segmentation, Natural language processing
Abstract: This paper presents the machine translation systems submitted by the Abu-MaTran project for the Finnish‐English language pair at the WMT 2015 translation task. We tackle the lack of resources and complex morphology of the Finnish language by (i) crawling parallel and monolingual data from the Web and (ii) applying rule-based and unsupervised methods for morphological segmentation. Several statistical machine translation approaches are evaluated and then combined to obtain our final submissions, which are the top performing English-to-Finnish unconstrained (all automatic metrics) and constrained (BLEU), and Finnish-to-English constrained (TER) systems.
Published: 2015
Full Text: View/download PDF

27. Suffix Sequences Based Morphological Segmentation for Afaan Oromo

Author: Massimo Melucci, Solomon Teferra, and Getachew Mamo Wegari
Subjects: Machine translation, Computer science, business.industry, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Pattern recognition, computer.software_genre, Hierarchical clustering, Tree (data structure), Morpheme, Artificial intelligence, Suffix, business, computer, Morphological segmentation, Natural language processing
Abstract: This paper reports on a morphological segmentation model for Afaan Oromo based on a suffix-sequence approach. Understanding and identifying the suffix sequences of a language allows us to detect morpheme boundaries of many words of Afaan Oromo. Morphological segmentation models can be used in many Natural Language Processing applications such as machine translation, speech recognition, information retrieval and part-of-speech tagging. A divisive hierarchical clustering and frequency distribution were used to build a tree of candidate stems from which segmented suffix sequences can be modeled. The proposed morphological segmentation model was evaluated with test word-lists. The accuracy obtained by our morphological segmentation model is encouraging.
Published: 2015

28. Tuning Phrase-Based Segmented Translation for a Morphologically Complex Target Language

Author: Mikko Kurimo, Stig-Arne Grönroos, and Sami Virpioja
Subjects: Phrase, Machine translation, Computer science, business.industry, Speech recognition, Artificial intelligence, Translation (geometry), computer.software_genre, business, Morphological segmentation, computer, Natural language processing, Task (project management)
Abstract: This article describes the Aalto University entry to the English-to-Finnish shared translation task in WMT 2015. The system participates in the constrained condition, but in addition we impose some further constraints, using no language-specific resources beyond those provided in the task. We use a morphological segmenter, Morfessor FlatCat, but train and tune it in an unsupervised manner. The system could thus be used for another language pair with a morphologically complex target language, without needing modification or additional resources.
Published: 2015
Full Text: View/download PDF

29. Towards a psycholinguistic computational model for morphological parsing

Author: R. Harald Baayen and Robert Schreuder
Subjects: Visual word recognition, Morphological parsing, Mental lexicon, Computer science, business.industry, General Mathematics, General Engineering, General Physics and Astronomy, Lexicon, computer.software_genre, Word lists by frequency, Segmentation, Artificial intelligence, Spurious relationship, business, Morphological segmentation, computer, Natural language processing
Abstract: Psycholinguistic experiments on visual word recognition in Dutch and other languages show ubiquitous effects of word frequency for regular complex words. The present study presents a simulation experiment with a computational model for morphological segmentation that is designed on psycholinguistic principles. Results suggest that these principles, in combination with the presence of form and frequency information for complex words in the lexicon, protect the system against spurious segmentations and substantially enhance segmentation accuracy.
Published: 2000
Full Text: View/download PDF

30. Morfessor 2.0: Toolkit for statistical morphological segmentation

Author: Mikko Kurimo, Sami Virpioja, Stig-Arne Grönroos, and Peter Smit
Subjects: Software, business.industry, Computer science, Interface (Java), Probabilistic logic, Artificial intelligence, business, computer.software_genre, computer, Morphological segmentation, Natural language processing
Abstract: Morfessor is a family of probabilistic machine learning methods for finding the morphological segmentation from raw text data. Recent developments include the development of semi-supervised methods for utilizing annotated data. Morfessor 2.0 is a rewrite of the original, widely-used Morfessor 1.0 software, with well documented command-line tools and library interface. It includes new features such as semi-supervised learning, online training, and integrated evaluation code.
Published: 2014
Full Text: View/download PDF

31. Unsupervised Segmentation for Different Types of Morphological Processes Using Multiple Sequence Alignment

Author: Amit Kirschenbaum
Subjects: Multiple sequence alignment, Computer science, business.industry, Morphological type, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Unsupervised segmentation, Pattern recognition, computer.software_genre, ComputingMethodologies_PATTERNRECOGNITION, Similarity (psychology), Distributional similarity, Segmentation, Identification (biology), Artificial intelligence, business, computer, Morphological segmentation, Natural language processing, ComputingMethodologies_COMPUTERGRAPHICS
Abstract: The aim of unsupervised and knowledge free morphological segmentation is the identification of boundaries between morphs in words of a given language without relying on any knowledge source about that language. This paper describes a segmentation method that draws on previous approaches based both on semantic and orthographical similarity to identify morphologically related words. Using a version of Multiple Sequence Alignment originally applied in bioinformatics, the method extracts both concatenative and non-concatenative (e.g. introflection and circumfixation) morphological patterns and can thus handle languages of different morphological types as well as non-dominant morphological processes within languages of a particular predominant morphological type.
Published: 2013
Full Text: View/download PDF

32. Extrinsic Evaluation on Automatic Summarization Tasks: Testing Affixality Measurements for Statistical Word Stemming

Author: Gerardo Sierra, Juan-Manuel Torres-Moreno, Carlos-Francisco Méndez-Cruz, and Alfonso Medina-Urrea
Subjects: Computer science, business.industry, Pattern recognition, Artificial intelligence, Stemming, computer.software_genre, business, Morphological segmentation, Automatic summarization, computer, Natural language processing, Word (computer architecture)
Abstract: This paper presents some experiments of evaluation of a statistical stemming algorithm based on morphological segmentation. The method estimates affixality of word fragments. It combines three indexes associated to possible cuts. This unsupervised and language-independent method has been easily adapted to generate an effective morphological stemmer. This stemmer has been coupled with Cortex, an automatic summarization system, in order to generate summaries in English, Spanish and French. Summaries have been evaluated using ROUGE. The results of this extrinsic evaluation show that our stemming algorithm outperforms several classical systems.
Published: 2013
Full Text: View/download PDF

33. Machine Learning in Morphological Segmentation

Author: A. Elmotataz, Christophe Charrier, M. Lecluse, G. Lebrun, Olivier Lezoray, and Cyril Meurie
Subjects: Support vector machine, Watershed, Computer science, business.industry, Model selection, Scale-space segmentation, Artificial intelligence, business, Machine learning, computer.software_genre, Morphological segmentation, computer
Abstract: The segmentation of microscopic images is a challenging task with numerous applications ranging from prognosis to diagnosis. Mathematical morphology is a very well established theory to process images. Segmentation by morphological means is based on watershed that considers an image as a topographic surface. Watershed requires an input image and a marker image. The user can provide the latter but far more relevant results can be obtained for watershed segmentation if marker extraction relies on prior knowledge. Because the parameters governing marker extraction vary from image to image, machine learning approaches are of interest for robust extraction of markers. We review different strategies for extracting markers by machine learning: single classifier, multiple classifier, single classifier optimized by model selection.
Published: 2012
Full Text: View/download PDF

34. Computational strategies for reducing annotation effort in language documentation

Author: Katrin Erk, Alexis Palmer, Eric W. Campbell, Jason Baldridge, Taesun Moon, and Telma Can
Subjects: Computer science, Active learning (machine learning), business.industry, Process (engineering), Language documentation, Temporal annotation, computer.software_genre, Annotation, Mayan languages, Artificial intelligence, Computational linguistics, business, computer, Morphological segmentation, Natural language processing
Abstract: With the urgent need to document the world's dying languages, it is important to explore ways to speed up language documentation efforts. One promising avenue is to use techniques from computational linguistics to automate some of the process. Here we consider unsupervised morphological segmentation and active learning for creating interlinear glossed text (IGT) for the Mayan language Uspanteko. The practical goal is to produce a totally annotated corpus that is as accurate as possible given limited time for manual annotation. We discuss results from several experiments that suggest there is indeed much promise in these methods but also show that further development is necessary to make them robustly useful for a wide range of conditions and tasks. We also provide a detailed discussion of how two documentary linguists perceived machine support in IGT production and how their annotation performance varied with different levels of machine support.
Published: 2010
Full Text: View/download PDF

35. What’s in a word?: On integrating recent approaches to secondary associations, submorphemic units and morphological segmentation

Author: Garry W. Davis
Subjects: Linguistics and Language, Computer science, Position (vector), business.industry, Morpheme, Artificial intelligence, computer.software_genre, business, computer, Morphological segmentation, Language and Linguistics, Word (computer architecture), Natural language processing
Abstract: Submorphemic units occupy a dubious position between the phoneme and the morpheme in traditional (structuralist-generative) analyses. The present study places submorphemic units within their proper...
Published: 1992
Full Text: View/download PDF

36. Unsupervised morphological segmentation and clustering with document boundaries

Author: Taesun Moon, Jason Baldridge, and Katrin Erk
Subjects: business.industry, Computer science, Pattern recognition, Word stem, computer.software_genre, Character (mathematics), Simple (abstract algebra), Mayan languages, Key (cryptography), Benchmark (computing), Artificial intelligence, business, Cluster analysis, computer, Morphological segmentation, Natural language processing, Word (computer architecture)
Abstract: Many approaches to unsupervised morphology acquisition incorporate the frequency of character sequences with respect to each other to identify word stems and affixes. This typically involves heuristic search procedures and calibrating multiple arbitrary thresholds. We present a simple approach that uses no thresholds other than those involved in standard application of χ² significance testing. A key part of our approach is using document boundaries to constrain generation of candidate stems and affixes and clustering morphological variants of a given word stem. We evaluate our model on English and the Mayan language Uspanteko; it compares favorably to two benchmark systems which use considerably more complex strategies and rely more on experimentally chosen threshold values.
Published: 2009
Full Text: View/download PDF

37. Robust extraction of characters from color scene image using mathematical morphology

Author: Robert M. Haralick, Lixu Gu, N. Tanaka, and T. Kaneko
Subjects: Computer science, business.industry, Feature extraction, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Pattern recognition, Image segmentation, Optical character recognition, Mathematical morphology, computer.software_genre, Robustness (computer science), Computer vision, Artificial intelligence, business, Morphological segmentation, computer
Abstract: Current character extraction systems for scene images are not robust for most real-world applications. In contrast, the system presented here achieves robust performance by using morphological segmentation. This paper describes a new morphological segmentation algorithm, differential top-hats (DTT). In addition, a complete system for extraction of characters from color scene images is presented. The system was verified through experiments on sequences of outdoor color images with varying external conditions. A high average extraction rate of 95% is obtained.
Published: 2002
Full Text: View/download PDF

38. Morphological Segmentation of Objects for Thick-Layered Manufacturing

Author: J. J. Broek, Joris S. M. Vergeest, Zoltán Rusák, György Kuczogi, and Imre Horváth
Subjects: Rapid prototyping, Conceptual design, business.industry, Computer science, Computer Aided Design, Computer vision, Artificial intelligence, Image segmentation, business, computer.software_genre, Morphological segmentation, computer, Design for manufacturability
Abstract: Physical concept modeling (PCM) is a rapidly maturing subfield of rapid prototyping. It aims at producing physically editable materialized models to support functional and shape synthesis during conceptual design. PCM raises unprecedented design methodological and technological issues. Numerous technological problems stem from (i) the free-form shape and large size of the fabricated objects, (ii) the required surface finish and decoration, (iii) the need for on-line manufacturing with short production time and minimal costs, and (iv) the requested trade-off in terms of the efforts and the improvement of the product. The FF-TLOM technology has been earlier proposed by the research team of the authors to solve these problems. It makes it possible to convert early CAD models into a materialized model even for large-sized objects that cannot be fabricated by conventional layer depositioning or high-speed milling. This paper presents algorithms for preliminary morphological segmentation of the CAD model to improve the performance characteristics of the FF-TLOM process. The authors deal with only free-form objects without zero- and first-order shape singularities. The segmentation algorithms are based on normal vector analysis and feature point recognition. Pre-defined segmentation features are used to guide the process. The actual extent of the segments is found by dimensional and geometric analyses that are supported by a segmentation feature graph. The developed algorithms are able to define the segments in such a way that at least one planar surface is provided for the further orthogonal slicing. The segments are sliced into layers of standard thickness based on a higher order approximation of the free-form shapes.
Published: 1999
Full Text: View/download PDF

39. Broad coverage automatic morphological segmentation of German words

Author: O. Mertineit, Rudolf Schmidt, T. Pachunke, and Klaus Wothke
Subjects: Syntax (programming languages), Computer science, business.industry, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, computer.software_genre, Syntax, GeneralLiterature_MISCELLANEOUS, language.human_language, German, Set (abstract data type), Rule-based machine translation, language, Artificial intelligence, business, Morphological segmentation, computer, Natural language processing, Word (group theory), ComputingMethodologies_COMPUTERGRAPHICS
Abstract: A system for the automatic segmentation of German words into morphs was developed. The main linguistic knowledge sources used by the system are a word syntax and a morph dictionary. The syntax is written in the formalism of right linear regular grammars and comprises approximately 1,400 rules describing the set of those sequences of morph classes which underlie syntactically well formed words. The morph dictionary contains almost 11,000 morphs. Each morph is assigned to up to 6 morph classes. Statistical evaluations with 6000 text words showed that more than 99% of the segmented words got a correct segmentation.
Published: 1992
Full Text: View/download PDF

40. Towards an automatic morphological segmentation

Author: Josse de Kock and Walter Bossaert
Subjects: Computer science, business.industry, Segmentation, Artificial intelligence, computer.software_genre, business, Morphological segmentation, computer, Natural language processing
Abstract: Experience tends to prove that it is possible to achieve such a segmentation by means of exclusively formal criteria. It consists in establishing and grading these criteria and formulating them mathematically. The use of the computer guarantees objectiveness up to a significant linguistic level; the computer is an instrument of research for new rules; it guarantees the control of established rules.
Published: 1969
Full Text: View/download PDF

41. Morphological Segmentation and OPUS for Finnish-English Machine Translation

Author: Filip Ginter, Jörg Tiedemann, and Jenna Kanerva
Subjects: Training set, Machine translation, business.industry, Computer science, Speech recognition, Opus, computer.software_genre, Stress (linguistics), Artificial intelligence, business, Baseline (configuration management), computer, Morphological segmentation, Natural language processing
Abstract: This paper describes baseline systems for Finnish-English and English-Finnish machine translation using standard phrase-based and factored models including morphological features. We experiment with compound splitting and morphological segmentation and study the effect of adding noisy out-of-domain data to the parallel and the monolingual training data. Our results stress the importance of training data and demonstrate the effectiveness of morphological pre-processing of Finnish.

42. The Universitat d’Alacant Submissions to the English-to-Kazakh News Translation Task at WMT 2019

Author: Víctor M. Sánchez-Cartagena, Juan Antonio Pérez-Ortiz, and Felipe Sánchez-Martínez
Subjects: Computer science, business.industry, 4. Education, 020209 energy, 02 engineering and technology, Kazakh, 010501 environmental sciences, computer.software_genre, Translation (geometry), 01 natural sciences, language.human_language, Task (project management), Rule-based machine translation, 0202 electrical engineering, electronic engineering, information engineering, language, Machine translation system, Artificial intelligence, business, Transfer of learning, Morphological segmentation, computer, Natural language processing, 0105 earth and related environmental sciences
Abstract: This paper describes the two submissions of Universitat d’Alacant to the English-to-Kazakh news translation task at WMT 2019. Our submissions take advantage of monolingual data and parallel data from other language pairs by means of iterative backtranslation, pivot backtranslation and transfer learning. They also use linguistic information in two ways: morphological segmentation of Kazakh text, and integration of the output of a rule-based machine translation system. Our systems were ranked second in terms of chrF++ despite being built from an ensemble of only 2 independent training runs.
Full Text: View/download PDF

43. Cognate-aware morphological segmentation for multilingual neural translation

Author: Mikko Kurimo, Sami Virpioja, Stig-Arne Grönroos, Centre of Excellence in Computational Inference, COIN, Dept Signal Process and Acoust, Aalto-yliopisto, and Aalto University
Subjects: FOS: Computer and information sciences, Computer science, 02 engineering and technology, 010501 environmental sciences, computer.software_genre, 01 natural sciences, cognate, morphology, 0202 electrical engineering, electronic engineering, information engineering, Proper noun, Cognate, 0105 earth and related environmental sciences, Transformer (machine learning model), Computer Science - Computation and Language, business.industry, Estonian, language.human_language, neural machine translation, language, 020201 artificial intelligence & image processing, Artificial intelligence, multilingual, business, Computation and Language (cs.CL), computer, Morphological segmentation, Natural language processing
Abstract: This article describes the Aalto University entry to the WMT18 News Translation Shared Task. We participate in the multilingual subtrack with a system trained under the constrained condition to translate from English to both Finnish and Estonian. The system is based on the Transformer model. We focus on improving the consistency of morphological segmentation for words that are similar orthographically, semantically, and distributionally; such words include etymological cognates, loan words, and proper names. For this, we introduce Cognate Morfessor, a multilingual variant of the Morfessor method. We show that our approach improves the translation quality particularly for Estonian, which has fewer resources for training the translation model. To appear in WMT18.

43 results on '"Morphological segmentation"'
