Descriptor: "Parsing" / Publication Type: Electronic Resources - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Parsing"' showing total 616 results

Start Over Descriptor "Parsing" Publication Type Electronic Resources

616 results on '"Parsing"'

1. Understanding and generating language with Abstract Meaning Representation

Author: Damonte, Marco, Cohen, Shay, and Lopez, Adam
Subjects: 006.3, natural language processing, NLP, Abstract Meaning Representation, AMR, algorithms, reentrancies, parsing
Abstract: Abstract Meaning Representation (AMR) is a semantic representation for natural language that encompasses annotations related to traditional tasks such as Named Entity Recognition (NER), Semantic Role Labeling (SRL), word sense disambiguation (WSD), and Coreference Resolution. AMR represents sentences as graphs, where nodes represent concepts and edges represent semantic relations between them. Sentences are represented as graphs and not trees because nodes can have multiple incoming edges, called reentrancies. This thesis investigates the impact of reentrancies for parsing (from text to AMR) and generation (from AMR to text). For the parsing task, we showed that it is possible to use techniques from tree parsing and adapt them to deal with reentrancies. To better analyze the quality of AMR parsers, we developed a set of fine-grained metrics and found that state-of-the-art parsers predict reentrancies poorly. Hence we provided a classification of linguistic phenomena causing reentrancies, categorized the type of errors parsers do with respect to reentrancies, and proved that correcting these errors can lead to significant improvements. For the generation task, we showed that neural encoders that have access to reentrancies outperform those who do not, demonstrating the importance of reentrancies also for generation. This thesis also discusses the problem of using AMR for languages other than English. Annotating new AMR datasets for other languages is an expensive process and requires defining annotation guidelines for each new language. It is therefore reasonable to ask whether we can share AMR annotations across languages. We provided evidence that AMR datasets for English can be successfully transferred to other languages: we trained parsers for Italian, Spanish, German, and Chinese to investigate the cross-linguality of AMR. We showed cases where translational divergences between languages pose a problem and cases where they do not. In summary, this thesis demonstrates the impact of reentrancies in AMR as well as providing insights on AMR for languages that do not yet have AMR datasets.
Published: 2020
Full Text: View/download PDF

2. Structural analysis of Arabic tweets

Author: Albogamy, Fahad and Zhao, Liping
Subjects: 006.3, Parsing, Treebank, POS tagging, Arabic tweets, NLP
Abstract: This thesis explores the task of analysing the linguistic structure of Arabic tweets. Arabic tweets raise many challenges that make Natural Language Processing (NLP) tasks difficult. We are faced with the same linguistic issues that any ordinary language has as well as more genre-specific problems. Tweets are difficult to manipulate because they do not always maintain formal grammar and correct spelling, and abbreviations are often used to overcome length restrictions. Arabic tweets also exhibit linguistic phenomena such as usage of different dialects, Romanised Arabic and borrowing of foreign words. All these characteristics of the microblogging genre make NLP tasks on Twitter very different from their counterparts in more formal texts. Within most NLP systems there are several early stages such as tagging, stemming and parsing that may need to be redesigned to take into account characteristics of tweets in order to be able to extract their important linguistic features. To fulfil this need, three of the most fundamental parts of the linguistic pipeline, namely POS tagging, stemming and parsing have been revisited for Arabic tweets. To the best of our knowledge, this is the first attempt to carry out this task for Arabic tweets. We investigate the challenges of processing Arabic tweets, studying a number of standard Arabic processing tools and highlighting their limitations when manipulating Arabic tweets. We make three state-of-the-art POS taggers for Modern Standard Arabic (MSA) robust towards noise when applied to the Arabic tweets. We develop the first fast and robust POS tagger for Arabic tweets and create the first POS-tagged corpus of Arabic tweets. Also, we develop two approaches to stemming Arabic tweet words: a heavy stemmer and a light stemmer, and we find that the light stemmer provides the most suitable approach for stemming Arabic tweets words because it does not use dictionaries, is fast, and yields greater accuracy compared with the heavy stemmer and MSA stemmers. We are able to automatically create the first dependency treebank from unlabelled tweets by using two approaches: using a rule-based parser only and using a rule-based parser and a data-driven parser in a bootstrapping technique. Then, we train a data-driven parsing base model on them to parse Arabic tweets. The findings are encouraging. We are able to improve POS tagging accuracy from 49% to 74.0% on Arabic tweets. Experimental results show that the light stemmer achieves 77.9% accuracy. It outperforms three well-known stemmers for Arabic. Our parser reaches 71.0% accuracy which is better than the performance of French parsing for social media data and it is not far behind English parsing for tweets.
Published: 2018

3. From Tokens to Trees: Mapping Syntactic Structures in the Deserts of Data-Scarce Languages

Author: Vilares, David, Muñoz Ortiz, Alberto, Vilares, David, and Muñoz Ortiz, Alberto
Abstract: [Abstract]: Low-resource learning in natural language processing focuses on developing effective resources, tools, and technologies for languages that are less popular within the industry and academia. This effort is crucial for several reasons, including ensuring that as many languages as possible are represented digitally, and enhancing access to language technologies for native speakers of minority languages. In this context, this paper outlines the motivation, research lines, and results from a Leonardo Grant - by FBBVA - on low-resource languages and parsing as sequence labeling. The project’s primary aim was to devise fast and accurate methods for low-resource syntactic parsing and to examine evaluation strategies as well as strengths and weaknesses in comparison to alternative parsing strategies.
Published: 2024

4. Dancing in the Syntax Forest: Fast, Accurate and Explainable Sentiment Analysis with SALSA

Author: Gómez-Rodríguez, Carlos, Imran, Muhammad, Vilares, David, Solera, Elena, Kellert, Olga, Gómez-Rodríguez, Carlos, Imran, Muhammad, Vilares, David, Solera, Elena, and Kellert, Olga
Abstract: [Abstract]: Sentiment analysis is a key technology for companies and institutions to gauge public opinion on products, services or events. However, for large-scale sentiment analysis to be accessible to entities with modest computational resources, it needs to be performed in a resource-efficient way. While some efficient sentiment analysis systems exist, they tend to apply shallow heuristics, which do not take into account syntactic phenomena that can radically change sentiment. Conversely, alternatives that take syntax into account are computationally expensive. The SALSA project, funded by the European Research Council under a Proof-of-Concept Grant, aims to leverage recently-developed fast syntactic parsing techniques to build sentiment analysis systems that are lightweight and efficient, while still providing accuracy and explainability through the explicit use of syntax. We intend our approaches to be the backbone of a working product of interest for SMEs to use in production.
Published: 2024

5. Grammar Assistance Using Syntactic Structures (GAUSS)

Author: Zamaraeva, Olga, Suárez Allegue, Lorena, Gómez-Rodríguez, Carlos, Alonso-Ramos, Margarita, Ogneva, Anastasiia, Zamaraeva, Olga, Suárez Allegue, Lorena, Gómez-Rodríguez, Carlos, Alonso-Ramos, Margarita, and Ogneva, Anastasiia
Abstract: [Abstract]: Automatic grammar coaching serves an important purpose of advising on standard grammar varieties while not imposing social pressures or reinforcing established social roles. Such systems already exist but most of them are for English and few of them offer meaningful feedback. Furthermore, they typically rely completely on neural methods and require huge computational resources which most of the world cannot afford. We propose a grammar coaching system for Spanish that relies on (i) a rich linguistic formalism capable of giving informative feedback; and (ii) a faster parsing algorithm which makes using this formalism practical in a real-world application. The approach is feasible for any language for which there is a computerized grammar and is less reliant on expensive and environmentally costly neural methods. We seek to contribute to Greener AI and to address global education challenges by raising the standards of inclusivity and engagement in grammar coaching.
Published: 2024

6. On the Ziv-Merhav Theorem beyond Markovianity I

Author: Barnfield, N, Grondin, R, Pozzoli, G, Raquépas, R, Barnfield, N, Grondin, R, Pozzoli, G, and Raquépas, R
Abstract: We generalize to a broader class of decoupled measures a result of Ziv and Merhav on universal estimation of the specific cross (or relative) entropy, originally for a pair of multi-level Markov measures. Our generalization focuses on abstract decoupling conditions and covers pairs of suitably regular g-measures and pairs of equilibrium measures arising from the “small space of interactions” in mathematical statistical mechanics.
Published: 2024

7. On the rapid use of verb-control information in sentence processing

Author: Psicologia, Universitat Rovira i Virgili, Demestre, J, Psicologia, Universitat Rovira i Virgili, and Demestre, J
Abstract: A central topic in psycholinguistics is the study of how and when the parser assigns an antecedent to referentially-dependent elements. One such referentially-dependent element is the null subject of non-finite clauses. The aim of the present study was to examine the role of verb control information in the assignment of an antecedent to such a null subject. The results so far are inconclusive. Some authors argue that verb control information has a late influence, whereas others argue that such verb-specific information has a very rapid influence. We report a self-paced reading study in Spanish in which verb type (subject vs. object control) and grammaticality (grammatical vs. ungrammatical) were manipulated. The grammaticality manipulation was carried out by introducing a person anomaly at the infinitive itself, and not at a later word (e.g., "Te prometi/aconseje adelgazarme/adelgazarte cinco quilos en un mes." Literal translation, "I to you promised/advised to losemyself/yourself five kilos in a month"). With such a manipulation we can examine whether at the first possible point (i.e., the infinitive) verb control information was used to assign the correct antecedent (i.e., the subject in sentences with a subject-control verb, and the object in sentences with an object-control verb) to PRO. The results showed that at the infinitive there was a main effect of grammaticality, meaning that the correct antecedent has already been assigned to PRO. The present findings are consistent with models that assume that verb-specific information plays an important role in the initial stages of sentence processing.
Published: 2024

8. Incremental generative models for syntactic and semantic natural language processing

Author: Buys, Jan Moolman and Blunsom, Phil
Subjects: 006.3, Natural Language Processing, Computer Science, Machine Learning, Computational Linguistics, Parsing, Deep Learning, Generative Models, Language Modelling, Semantic Parsing, Dependency Parsing, Incremental Parsing
Abstract: This thesis investigates the role of linguistically-motivated generative models of syntax and semantic structure in natural language processing (NLP). Syntactic well-formedness is crucial in language generation, but most statistical models do not account for the hierarchical structure of sentences. Many applications exhibiting natural language understanding rely on structured semantic representations to enable querying, inference and reasoning. Yet most semantic parsers produce domain-specific or inadequately expressive representations. We propose a series of generative transition-based models for dependency syntax which can be applied as both parsers and language models while being amenable to supervised or unsupervised learning. Two models are based on Markov assumptions commonly made in NLP: The first is a Bayesian model with hierarchical smoothing, the second is parameterised by feed-forward neural networks. The Bayesian model enables careful analysis of the structure of the conditioning contexts required for generative parsers, but the neural network is more accurate. As a language model the syntactic neural model outperforms both the Bayesian model and n-gram neural networks, pointing to the complementary nature of distributed and structured representations for syntactic prediction. We propose approximate inference methods based on particle filtering. The third model is parameterised by recurrent neural networks (RNNs), dropping the Markov assumptions. Exact inference with dynamic programming is made tractable here by simplifying the structure of the conditioning contexts. We then shift the focus to semantics and propose models for parsing sentences to labelled semantic graphs. We introduce a transition-based parser which incrementally predicts graph nodes (predicates) and edges (arguments). This approach is contrasted against predicting top-down graph traversals. RNNs and pointer networks are key components in approaching graph parsing as an incremental prediction problem. The RNN architecture is augmented to condition the model explicitly on the transition system configuration. We develop a robust parser for Minimal Recursion Semantics, a linguistically-expressive framework for compositional semantics which has previously been parsed only with grammar-based approaches. Our parser is much faster than the grammar-based model, while the same approach improves the accuracy of neural Abstract Meaning Representation parsing.
Published: 2017

9. Adapting compiler front ends for generalised parsing

Author: Walsh, Robert Michael
Subjects: 006.3, parsing, lexical analysis, disambiguation
Abstract: Traditional compiler front-ends are designed around near-deterministic parsing algorithms, restricting the form of the parsing grammar. The corresponding lexical analysers are required to provide only a single partitioning of an input string into tokens. This requires the lexical analyser to apply lexical disambiguation techniques. A generalised parser can in principle generate all derivations from a grammar whose tokens are the characters of the underlying alphabet (so called character-level parsing) but this results in a very large data structure. This thesis investigates a form of lexical analysis that can offer all tokenisations of a string to a parser, as well as a corresponding multiple input parsing algorithm, MGLL, which can simultaneously construct derivations over all tokenisations. The goal is to provide equivalent generality to a character-level parser, but with signicantly improved performance in terms of both the space required to represent the derivations and the time required to construct them. In addition, lexical-level disambiguation constructs are provided which may be used to model traditional lexical disambiguation strategies under the new framework. The thesis will also discuss the translation of derivation trees into abstract syntax forms suitable for use in structural semantics, by annotating a grammar with a small set of local operations called GIFT operators. Some preliminary work on how to specify ambiguity reduction at syntactic level shall also be described. The thesis concludes with two case studies. One demonstrating the application of this new form of lexical analysis to the C# 2.0 language specication. The other drawn from the PLanCompS project, which shows how to generate derivations in an appropriate abstract syntax for the C# 1.2 language, from a parser which uses the concrete grammar from the C# 1.2 language standard specication.
Published: 2016

10. The application of constraint rules to data-driven parsing

Author: Jaf, Sardar
Subjects: 006.3, parsing, natural language parsing, dependency parsing, data-driven parsing, constraint rules, constraint rules for parsing, grammar extraction, syntactic parsing, parsing Arabic, non-projective parsing
Abstract: The process of determining the structural relationships between words in both natural and machine languages is known as parsing. Parsers are used as core components in a number of Natural Language Processing (NLP) applications such as online tutoring applications, dialogue-based systems and textual entailment systems. They have been used widely in the development of machine languages. In order to understand the way parsers work, we will investigate and describe a number of widely used parsing algorithms. These algorithms have been utilised in a range of different contexts such as dependency frameworks and phrase structure frameworks. We will investigate and describe some of the fundamental aspects of each of these frameworks, which can function in various ways including grammar-driven approaches and data-driven approaches. Grammar-driven approaches use a set of grammatical rules for determining the syntactic structures of sentences during parsing. Data-driven approaches use a set of parsed data to generate a parse model which is used for guiding the parser during the processing of new sentences. A number of state-of-the-art parsers have been developed that use such frameworks and approaches. We will briefly highlight some of these in this thesis. There are three specific important features that it is important to integrate into the development of parsers. These are efficiency, accuracy, and robustness. Efficiency is concerned with the use of as little time and computing resources as possible when processing natural language text. Accuracy involves maximising the correctness of the analyses that a parser produces. Robustness is a measure of a parser’s ability to cope with grammatically complex sentences and produce analyses of a large proportion of a set of sentences. In this thesis, we present a parser that can efficiently, accurately, and robustly parse a set of natural language sentences. Additionally, the implementation of the parser presented here allows for some trading-off between different levels of parsing performance. For example, some NLP applications may emphasise efficiency/robustness over accuracy while some other NLP systems may require a greater focus on accuracy. In dialogue-based systems, it may be preferable to produce a correct grammatical analysis of a question, rather than incorrectly analysing the grammatical structure of a question or quickly producing a grammatically incorrect answer for a question. Alternatively, it may be desirable that document translation systems translate a document into a different language quickly but less accurately, rather than slowly but highly accurately, because users may be able to correct grammatically incorrect sentences manually if necessary. The parser presented here is based on data-driven approaches but we will allow for the application of constraint rules to it in order to improve its performance.
Published: 2015

11. Scalable semi-supervised grammar induction using cross-linguistically parameterized syntactic prototypes

Author: Boonkwan, Prachya and Steedman, Mark
Subjects: 006.3, grammar induction, parsing, language parameters, semi-supervised learning
Abstract: This thesis is about the task of unsupervised parser induction: automatically learning grammars and parsing models from raw text. We endeavor to induce such parsers by observing sequences of terminal symbols. We focus on overcoming the problem of frequent collocation that is a major source of error in grammar induction. For example, since a verb and a determiner tend to co-occur in a verb phrase, the probability of attaching the determiner to the verb is sometimes higher than that of attaching the core noun to the verb, resulting in erroneous attachment *((Verb Det) Noun) instead of (Verb (Det Noun)). Although frequent collocation is the heart of grammar induction, it is precariously capable of distorting the grammar distribution. Natural language grammars follow a Zipfian (power law) distribution, where the frequency of any grammar rule is inversely proportional to its rank in the frequency table. We believe that covering the most frequent grammar rules in grammar induction will have a strong impact on accuracy. We propose an efficient approach to grammar induction guided by cross-linguistic language parameters. Our language parameters consist of 33 parameters of frequent basic word orders, which are easy to be elicited from grammar compendiums or short interviews with naïve language informants. These parameters are designed to capture frequent word orders in the Zipfian distribution of natural language grammars, while the rest of the grammar including exceptions can be automatically induced from unlabeled data. The language parameters shrink the search space of the grammar induction problem by exploiting both word order information and predefined attachment directions. The contribution of this thesis is three-fold. (1) We show that the language parameters are adequately generalizable cross-linguistically, as our grammar induction experiments will be carried out on 14 languages on top of a simple unsupervised grammar induction system. (2) Our specification of language parameters improves the accuracy of unsupervised parsing even when the parser is exposed to much less frequent linguistic phenomena in longer sentences when the accuracy decreases within 10%. (3) We investigate the prevalent factors of errors in grammar induction which will provide room for accuracy improvement. The proposed language parameters efficiently cope with the most frequent grammar rules in natural languages. With only 10 man-hours for preparing syntactic prototypes, it improves the accuracy of directed dependency recovery over the state-ofthe- art Gillenwater et al.’s (2010) completely unsupervised parser in: (1) Chinese by 30.32% (2) Swedish by 28.96% (3) Portuguese by 37.64% (4) Dutch by 15.17% (5) German by 14.21% (6) Spanish by 13.53% (7) Japanese by 13.13% (8) English by 12.41% (9) Czech by 9.16% (10) Slovene by 7.24% (11) Turkish by 6.72% and (12) Bulgarian by 5.96%. It is noted that although the directed dependency accuracies of some languages are below 60%, their TEDEVAL scores are still satisfactory (approximately 80%). This suggests us that our parsed trees are, in fact, closely related to the gold-standard trees despite the discrepancy of annotation schemes. We perform an error analysis of over- and under-generation analysis. We found three prevalent problems that cause errors in the experiments: (1) PP attachment (2) discrepancies of dependency annotation schemes and (3) rich morphology. The methods presented in this thesis were originally presented in Boonkwan and Steedman (2011). The thesis presents a great deal more detail in the design of crosslinguistic language parameters, the algorithm of lexicon inventory construction, experiment results, and error analysis.
Published: 2014

12. On repairing sentences : an experimental and computational analysis of recovery from unexpected syntactic disambiguation in sentence parsing

Author: Green, Matthew James, Mitchell, Don, and Monsell, Stephen
Subjects: 150, sentence processing, parsing, syntactic disambiguation, eye tracking
Abstract: This thesis contends that the human parser has a repair mechanism. It is further contended that the human parser uses this mechanism to alter previously built structure in the case of unexpected disambiguation of temporary syntactic ambiguity. This position stands in opposition to the claim that unexpected disambiguation of temporary syntactic ambiguity is accomplished by the usual first pass parsing routines, a claim that arises from the relatively extraordinary capabilities of computational parsers, capabilities which have recently been extended by hypothesis to be available to the human sentence processing mechanism. The thesis argues that, while these capabilities have been demonstrated in computational parsers, the human parser is best explained in the terms of a repair based framework, and that this argument is demonstrated by examining eye movement behaviour in reading. In support of the thesis, evidence is provided from a set of eye tracking studies of reading. It is argued that these studies show that eye movement behaviours at disambiguation include purposeful visual search for linguistically relevant material, and that the form and structure of these searches vary reliably according to the nature of the repairs that the sentences necessitate.
Published: 2013

13. Probabilistic grammar induction from sentences and structured meanings

Author: Kwiatkowski, Thomas Mieczyslaw, Steedman, Mark., and Goldwater, Sharon
Subjects: 410, semantics, parsing, natural language understanding
Abstract: The meanings of natural language sentences may be represented as compositional logical-forms. Each word or lexicalised multiword-element has an associated logicalform representing its meaning. Full sentential logical-forms are then composed from these word logical-forms via a syntactic parse of the sentence. This thesis develops two computational systems that learn both the word-meanings and parsing model required to map sentences onto logical-forms from an example corpus of (sentence, logical-form) pairs. One of these systems is designed to provide a general purpose method of inducing semantic parsers for multiple languages and logical meaning representations. Semantic parsers map sentences onto logical representations of their meanings and may form an important part of any computational task that needs to interpret the meanings of sentences. The other system is designed to model the way in which a child learns the semantics and syntax of their first language. Here, logical-forms are used to represent the potentially ambiguous context in which childdirected utterances are spoken and a psycholinguistically plausible training algorithm learns a probabilistic grammar that describes the target language. This computational modelling task is important as it can provide evidence for or against competing theories of how children learn their first language. Both of the systems presented here are based upon two working hypotheses. First, that the correct parse of any sentence in any language is contained in a set of possible parses defined in terms of the sentence itself, the sentence’s logical-form and a small set of combinatory rule schemata. The second working hypothesis is that, given a corpus of (sentence, logical-form) pairs that each support a large number of possible parses according to the schemata mentioned above, it is possible to learn a probabilistic parsing model that accurately describes the target language. The algorithm for semantic parser induction learns Combinatory Categorial Grammar (CCG) lexicons and discriminative probabilistic parsing models from corpora of (sentence, logical-form) pairs. This system is shown to achieve at or near state of the art performance across multiple languages, logical meaning representations and domains. As the approach is not tied to any single natural or logical language, this system represents an important step towards widely applicable black-box methods for semantic parser induction. This thesis also develops an efficient representation of the CCG lexicon that separately stores language specific syntactic regularities and domain specific semantic knowledge. This factorised lexical representation improves the performance of CCG based semantic parsers in sparse domains and also provides a potential basis for lexical expansion and domain adaptation for semantic parsers. The algorithm for modelling child language acquisition learns a generative probabilistic model of CCG parses from sentences paired with a context set of potential logical-forms containing one correct entry and a number of distractors. The online learning algorithm used is intended to be psycholinguistically plausible and to assume as little information specific to the task of language learning as is possible. It is shown that this algorithm learns an accurate parsing model despite making very few initial assumptions. It is also shown that the manner in which both word-meanings and syntactic rules are learnt is in accordance with observations of both of these learning tasks in children, supporting a theory of language acquisition that builds upon the two working hypotheses stated above.
Published: 2012

14. Integrated supertagging and parsing

Author: Auli, Michael, Koehn, Philipp, and Lopez, Adam
Subjects: 415.0285, natural language processing, parsing, Combinatory Categorial Grammar, Supertagging
Abstract: Parsing is the task of assigning syntactic or semantic structure to a natural language sentence. This thesis focuses on syntactic parsing with Combinatory Categorial Grammar (CCG; Steedman 2000). CCG allows incremental processing, which is essential for speech recognition and some machine translation models, and it can build semantic structure in tandem with syntactic parsing. Supertagging solves a subset of the parsing task by assigning lexical types to words in a sentence using a sequence model. It has emerged as a way to improve the efficiency of full CCG parsing (Clark and Curran, 2007) by reducing the parser’s search space. This has been very successful and it is the central theme of this thesis. We begin by an analysis of how efficiency is being traded for accuracy in supertagging. Pruning the search space by supertagging is inherently approximate and to contrast this we include A* in our analysis, a classic exact search technique. Interestingly, we find that combining the two methods improves efficiency but we also demonstrate that excessive pruning by a supertagger significantly lowers the upper bound on accuracy of a CCG parser. Inspired by this analysis, we design a single integrated model with both supertagging and parsing features, rather than separating them into distinct models chained together in a pipeline. To overcome the resulting complexity, we experiment with both loopy belief propagation and dual decomposition approaches to inference, the first empirical comparison of these algorithms that we are aware of on a structured natural language processing problem. Finally, we address training the integrated model. We adopt the idea of optimising directly for a task-specific metric such as is common in other areas like statistical machine translation. We demonstrate how a novel dynamic programming algorithm enables us to optimise for F-measure, our task-specific evaluation metric, and experiment with approximations, which prove to be excellent substitutions. Each of the presented methods improves over the state-of-the-art in CCG parsing. Moreover, the improvements are additive, achieving a labelled/unlabelled dependency F-measure on CCGbank of 89.3%/94.0% with gold part-of-speech tags, and 87.2%/92.8% with automatic part-of-speech tags, the best reported results for this task to date. Our techniques are general and we expect them to apply to other parsing problems, including lexicalised tree adjoining grammar and context-free grammar parsing.
Published: 2012

15. Rightward movement phenomena in human language

Author: Kamada, Kohji, Ackema, Peter, Hurford, Jim., Heycock, Caroline., and Kirby, Simon
Subjects: 410, postverbal constructions, locality, parsing
Abstract: The aim of my thesis is to show that some properties of rightward movement constructions (a cover term referring to sentences where an element appears to be “displaced” to the right) may be derived from syntactic principles and interface conditions within the framework of the minimalist program, and also to claim that properties which have up to now been dealt with purely in syntax receive a better account in terms of language processing. I develop a nonmovement approach to the Japanese Post-Verbal Construction (JPVC) by claiming that a postverbal phrase is adjoined to an element by External Merge, and that it is permitted as a syntactic object by a licensing condition which allows it to be construed as an argument or a modifier by interpretive rules at the interface level (SEM/LF). Many syntactic properties of the JPVC are accounted for in terms of independently motivated interface conditions and syntactic principles. I assume that the parser is a system that can make use of UG principles as well as language particular rules, and that the parser should be universal. The interaction of syntactic principles with parsing strategies makes it possible to cope with elusive problems concerning scope ambiguity as well as locality effects observed in the JPVC. This interaction may also account for the Right Roof Constraint effect displayed by the rightward movement constructions in English (i.e., Heavy "P Shift (H"PS), Extraposition from "P, and Right Dislocation). Furthermore, it predicts that languages fall into three types with respect to the possibility of the HNPS construction: (i) both subjects and objects can appear in postverbal position (e.g., Italian, Japanese, Turkish); (ii) subjects cannot do so (e.g., English); (iii) neither subjects nor objects can appear in postverbal position (e.g., Dutch, German). The claim that there is a parsing strategy relating to linear distance is supported by an experiment designed as a test for the effect of the length of intervening elements on acceptability of the JPVC, with the data obtained using Magnitude Estimation, a technique used in psychophysics to measure judgements of sensory stimuli.
Published: 2009

16. Wide-coverage parsing for Turkish

Author: Çakici, Ruket, Steedman, Mark., and Osborne, Miles
Subjects: 300.285, combinatory categorial grammar, CCG, parsing, natural language processing, morphology, syntax
Abstract: Wide-coverage parsing is an area that attracts much attention in natural language processing research. This is due to the fact that it is the first step tomany other applications in natural language understanding, such as question answering. Supervised learning using human-labelled data is currently the best performing method. Therefore, there is great demand for annotated data. However, human annotation is very expensive and always, the amount of annotated data is much less than is needed to train well-performing parsers. This is the motivation behind making the best use of data available. Turkish presents a challenge both because syntactically annotated Turkish data is relatively small and Turkish is highly agglutinative, hence unusually sparse at the whole word level. METU-Sabancı Treebank is a dependency treebank of 5620 sentences with surface dependency relations and morphological analyses for words. We show that including even the crudest forms of morphological information extracted from the data boosts the performance of both generative and discriminative parsers, contrary to received opinion concerning English. We induce word-based and morpheme-based CCG grammars from Turkish dependency treebank. We use these grammars to train a state-of-the-art CCG parser that predicts long-distance dependencies in addition to the ones that other parsers are capable of predicting. We also use the correct CCG categories as simple features in a graph-based dependency parser and show that this improves the parsing results. We show that a morpheme-based CCG lexicon for Turkish is able to solve many problems such as conflicts of semantic scope, recovering long-range dependencies, and obtaining smoother statistics from the models. CCG handles linguistic phenomena i.e. local and long-range dependencies more naturally and effectively than other linguistic theories while potentially supporting semantic interpretation in parallel. Using morphological information and a morpheme-cluster based lexicon improve the performance both quantitatively and qualitatively for Turkish. We also provide an improved version of the treebank which will be released by kind permission of METU and Sabancı.
Published: 2009

17. Prosody, syntax and the lexicon in parsing ambiguous sentences

Author: Mani, Nivedita and Coleman, John
Subjects: 425, Ambiguity, English language, Parsing, Prosodic analysis, Syntax, Lexicology
Abstract: This thesis tests the early incorporation of prosodic information during on-line processing of ambiguous word pairs such as Packing cases. The word pair is syntactically ambiguous between a noun or verb phrase interpretation. However, the two interpretations are prosodically distinct. An on-line, cross-modal, response-time task found that subjects disambiguated the word pairs using prosodic information. Experiment 2 swapped the timing,fo and amplitude of the noun phrase versions with the verb phrase versions. If prosodic information were guiding parsing, swapping the prosody of the alternatives should change subjects' parses of the word-pairs. Subjects interpreted the cross-synthesised noun phrases as verb phrases and the crosssynthesised verb phrases as noun phrases. This provides additional evidence in favour of early prosodic processing. Experiment 3 tested whether subjects' ability to differentiate the two forms would be affected by flattening the fo of the word pairs. Subjects' ability to disambiguate the word pairs was reduced by flattening the fo of the stimuli. Again, this provides evidence in favour of fo guiding parsing. Experiment 4 investigated the perceptual salience of prosodic information in the absence of lexical information, by testing parsing of delexicalised versions of the same wordpairs. Subjects continued to disambiguate the stimuli. This indicates that prosody can guide parsing even without lexical information. The results of the four experiments provide strong evidence in favour of the early incorporation of prosodic information in parsing: prosodic information can influence on-line parsing even in the presence of contradictory syntactic and spectral preferences; and in the absence of lexical information. This thesis concludes that the results of the experiments support strong interaction models of processing.
Published: 2006

18. Dependency as Modality, Parsing as Permutation: A Neurosymbolic Perspective on Categorial Grammars

Author: Kogkalidis, Konstantinos and Kogkalidis, Konstantinos
Abstract: Type systems (in the form of categorial grammars) are the front runners in the quest for a formally elegant, computationally attractive and flexible theory of linguistic form and meaning. Words enact typed constants, and interact with one another via means of grammatical rules enacted by type inferences, composing larger phrases in the process. The end result is at the same time a parse, a proof and a program, bridging the seemingly disparate fields of linguistics, formal logics and computer science. The transition from form to meaning is traditionally handled via a series of homomorphisms that simplify nuances of the syntactic calculus to move towards a uniform semantic calculus. Alluring as this might be, it poses pragmatic considerations. For the setup to work on the semantic level, one has no choice but to start from the hardest part, namely natural language syntax. Phenomena like movement, word-order variation, discontinuities, etc. require careful treatment that needs to be both general enough to encompass the full range of grammatical utterances, yet strict enough to ward off ungrammatical derivations. This thesis takes an operational shortcut in targeting a ``deeper'' calculus of grammatical composition, engaging only minimally with surface syntax. Where previously functional functional syntactic types would be position-conscious, requiring their arguments in predetermined positions upon a binary tree, they are now agnostic to both tree structure and sequential order, alleviating the need for syntactic refinements. This simplification comes at the cost of a misalignment between provability and grammaticality: the laxer semantic calculus permits more proofs than linguistically desired. To circumvent this underspecification, the thesis takes a step away from the established norm, proposing the incorporation of unary type operators extending the function-argument axis with grammatical role labels. The new calculus produces mixed unary/n-ary trees, each unary tr
Published: 2023

19. Parallel Enumeration of Parse Trees

Author: Margarita Mikhelson and Alexander Okhotin, Mikhelson, Margarita, Okhotin, Alexander, Margarita Mikhelson and Alexander Okhotin, Mikhelson, Margarita, and Okhotin, Alexander
Abstract: A parallel algorithm for enumerating parse trees of a given string according to a fixed context-free grammar is defined. The algorithm computes the number of parse trees of an input string; more generally, it applies to computing the weight of a string in a weighted grammar. The algorithm is first implemented on an arithmetic circuit of depth O((log n)²) with O(n⁶) elements. Then, it is improved using fast matrix multiplication to use only O(n^5.38) elements, while preserving depth O((log n)²).
Published: 2023
Full Text: View/download PDF

20. Dependency parsing with bottom-up Hierarchical Pointer Networks

Author: Fernández-González, Daniel, Gómez-Rodríguez, Carlos, Fernández-González, Daniel, and Gómez-Rodríguez, Carlos
Abstract: [Abstract] Dependency parsing is a crucial step towards deep language understanding and, therefore, widely demanded by numerous Natural Language Processing applications. In particular, left-to-right and top-down transition-based algorithms that rely on Pointer Networks are among the most accurate approaches for performing dependency parsing. Additionally, it has been observed for the top-down algorithm that Pointer Networks’ sequential decoding can be improved by implementing a hierarchical variant, more adequate to model dependency structures. Considering all this, we develop a bottom-up oriented Hierarchical Pointer Network for the left-to-right parser and propose two novel transition-based alternatives: an approach that parses a sentence in right-to-left order and a variant that does so from the outside in. We empirically test the proposed neural architecture with the different algorithms on a wide variety of languages, outperforming the original approach in practically all of them and setting new state-of-the-art results on the English and Chinese Penn Treebanks for non-contextualized and BERT-based embeddings.
Published: 2023

21. Discontinuous grammar as a foreign language

Author: Fernández-González, Daniel, Gómez-Rodríguez, Carlos, Fernández-González, Daniel, and Gómez-Rodríguez, Carlos
Abstract: [Abstract] In order to achieve deep natural language understanding, syntactic constituent parsing is a vital step, highly demanded by many artificial intelligence systems to process both text and speech. One of the most recent proposals is the use of standard sequence-to-sequence models to perform constituent parsing as a machine translation task, instead of applying task-specific parsers. While they show a competitive performance, these text-to-parse transducers are still lagging behind classic techniques in terms of accuracy, coverage and speed. To close the gap, we here extend the framework of sequence-to-sequence models for constituent parsing, not only by providing a more powerful neural architecture for improving their performance, but also by enlarging their coverage to handle the most complex syntactic phenomena: discontinuous structures. To that end, we design several novel linearizations that can fully produce discontinuities and, for the first time, we test a sequence-to-sequence model on the main discontinuous benchmarks, obtaining competitive results on par with task-specific discontinuous constituent parsers and achieving state-of-the-art scores on the (discontinuous) English Penn Treebank.
Published: 2023

22. Semantics engineering with concrete syntax

Author: Storm, T. (Tijs) van der and Storm, T. (Tijs) van der
Abstract: Semantics engineering tools like Redex can be used to define, explore, and debug formal definitions of programming language semantics. However, such tools are often based on abstract syntax, which makes the definition of rules and the exploration of execution traces rather unfriendly. In this paper we introduce Credex, a library in the Rascal meta-programming language for defining small-step evaluation-context semantics, where terms and matching patterns are what-you-see-is-what-you-get. Credex employs parsing for decomposing terms into context and redex. Since Rascal’s grammar formalism is based on general parsing, a non-unique decomposition of a term literally corresponds to an ambiguous parse. We demonstrate the use of Credex, detail some aspects of its implementation, and discuss three case-studies.
Published: 2023
Full Text: View/download PDF

23. On Solving Solved Problems

Author: Sebastian Erdweg, Erdweg, Sebastian, Sebastian Erdweg, and Erdweg, Sebastian
Abstract: Some problems are considered solved by the research community. But are they really and does that mean we should stop investigating them? In this essay, I argue that "solved" problems often only appear solved on the surface, while fundamental open research problems lurk below the surface. It requires dedicated researchers to discover those open problems by applying the existing solutions and putting them to the test.
Published: 2023
Full Text: View/download PDF

24. Semantics Engineering with Concrete Syntax

Author: Tijs van der Storm, van der Storm, Tijs, Tijs van der Storm, and van der Storm, Tijs
Abstract: Semantics engineering tools like Redex can be used to define, explore, and debug formal definitions of programming language semantics. However, such tools are often based on abstract syntax, which makes the definition of rules and the exploration of execution traces rather unfriendly. In this paper we introduce Credex, a library in the Rascal meta-programming language for defining small-step evaluation-context semantics, where terms and matching patterns are what-you-see-is-what-you-get. Credex employs parsing for decomposing terms into context and redex. Since Rascal’s grammar formalism is based on general parsing, a non-unique decomposition of a term literally corresponds to an ambiguous parse. We demonstrate the use of Credex, detail some aspects of its implementation, and discuss three case-studies.
Published: 2023
Full Text: View/download PDF

25. Context in Parsing: Techniques and Applications

Author: Eric Van Wyk, Van Wyk, Eric, Eric Van Wyk, and Van Wyk, Eric
Abstract: This paper discusses two approaches to parsing: Eelco Visser’s scannerless generalized LR parsing and our context-aware scanning paired with deterministic LR parsing. We compare the underlying techniques, specifically how parser context is used to disambiguate lexical syntax, and their use in the context of language evolution and composition applications. We also reflect on the many discussions shared with Eelco on these topics, and on our shared realization that our different assumptions about the contexts in which our approaches were used drove and justified the technical decisions made in each.
Published: 2023
Full Text: View/download PDF

26. Analysing the SML97 Definition: Lexicalisation

Author: Elizabeth Scott and Adrian Johnstone, Scott, Elizabeth, Johnstone, Adrian, Elizabeth Scott and Adrian Johnstone, Scott, Elizabeth, and Johnstone, Adrian
Abstract: The specification of the syntax and semantics for Standard ML have been designed to support the generation of a compiler front end, but actual implementations have required significant modification to the specification. Since the specification was written there have been major advances in the development of language analysis systems that can handle general syntax specifications. We are revisiting the SML specification to consider to what extent, using modern tooling, it can be implemented exactly as originally written. In this short paper we focus on the lexical specification.
Published: 2023
Full Text: View/download PDF

27. The role of argument structure requirements and recency constraints in human sentence processing

Author: Kamide, Yuki
Subjects: 150, Parsing, Psycholinguistics, Japanese
Published: 1998

28. A Case for Generative Linguistics in New Testament Exegesis : Surveying the Current Theoretical Landscape and Possible Applicability to Biblical Studies

Author: Kristiansson, Per and Kristiansson, Per
Abstract: This essay surveys the current theoretical landscape of modern linguistics, asking whethe generative and possibly transformational linguistics can be applied to syntactic analysis of New Testament texts written in Koine Greek to find lingual hallmarks in the form of personal usage of syntactic rules that uniquely identify the authors of the texts. The conclusion is that there seems to be evidence that an application of a minimalist approach could make the detection of such lingual hallmarks possible.
Published: 2022

29. Explaining and Applying Graph Neural Networks on Text

Author: Funke, Thorben, Avishek, Anand, Grünefeld, Nils, Funke, Thorben, Avishek, Anand, and Grünefeld, Nils
Abstract: Text classification is an essential task in natural language processing. While graph neural networks (GNNs) have successfully been applied to this problem both through graph classification and node classification approaches, their typical applications suffer from several issues. In the graph classification case, common graph construction techniques tend to leave out syntactic information. In the node classification case, most widespread datasets and applications tend to suffer from encoding relatively little information in the chosen node features. Finally, there are great benefits to be gained from combining the two GNN approaches. To tackle these concerns, we propose DepNet, a two-stage framework for text classification using GNN models. In the first stage we replace current graph construction methods by utilizing syntactic dependency parsing in order to include as much syntactic information in the GNN input as possible. In the second stage we combine both graph classification and node classification methods by utilizing the former to produce node embeddings for the latter, maximizing the potential of GNNs for text classification. We find that this technique significantly improves the performance of both graph classification and node classification approaches to text classification.
Published: 2022

30. Indentation-sensitive parsing for Parsec (Article)

Author: A?acan, O.S., Adams, M.D., A?acan, O.S., and Adams, M.D.
Abstract: Several popular languages including Haskell and Python use the indentation and layout of code as an essential part of their syntax. In the past, implementations of these languages used ad hoc techniques to implement layout. Recent work has shown that a simple extension to context-free grammars can replace these ad hoc techniques and provide both formal foundations and efficient parsing algorithms for indentation sensitivity. However, that previous work is limited to bottom-up, LR(k) parsing, and many combinator-based parsing frameworks including Parsec use top-down algorithms that are outside its scope. This paper remedies this by showing how to add indentation sensitivity to parsing frameworks like Parsec. It explores both the formal semantics of and efficient algorithms for indentation sensitivity. It derives a Parsec-based library for indentation-sensitive parsing and presents benchmarks on a real-world language that show its efficiency and practicality. © 2014 ACM.
Published: 2022

31. Indentation-sensitive parsing for Parsec (Article)

Author: Adams, M.D., A?acan, O.S., Adams, M.D., and A?acan, O.S.
Abstract: Several popular languages including Haskell and Python use the indentation and layout of code as an essential part of their syntax. In the past, implementations of these languages used ad hoc techniques to implement layout. Recent work has shown that a simple extension to context-free grammars can replace these ad hoc techniques and provide both formal foundations and efficient parsing algorithms for indentation sensitivity. However, that previous work is limited to bottom-up, LR(k) parsing, and many combinator-based parsing frameworks including Parsec use top-down algorithms that are outside its scope. This paper remedies this by showing how to add indentation sensitivity to parsing frameworks like Parsec. It explores both the formal semantics of and efficient algorithms for indentation sensitivity. It derives a Parsec-based library for indentation-sensitive parsing and presents benchmarks on a real-world language that show its efficiency and practicality. © 2014 ACM.
Published: 2022

32. The pilot implementation of ixml

Author: Pemberton, S. (Steven) and Pemberton, S. (Steven)
Published: 2022

33. Modernizing the WebDSL Front-End: A Case Study in SDF3 and Statix

Author: de Krieger, Max (author) and de Krieger, Max (author)
Abstract: The front-end of a compiler reads the source program and performs analyses such as type checking. The goal of the front-end is to check for the presence of syntactic and semantic errors before the program is passed to the back-end of the compiler for tasks such as optimization and code generation. WebDSL is a domain-specific language for web programming that is being used for over 15 years. With WebDSL, many applications have been developed which have thousands of daily users. While the language has evolved over the years, the core of its implementation remains unchanged and is starting to show signs of a legacy system. The current WebDSL syntax is defined in SDF2 and all other parts of the compiler are implemented in Stratego. This thesis presents a modernized front-end of the WebDSL compiler, utilizing the meta-languages of the Spoofax language workbench. Specifically, we introduce a syntax definition of WebDSL in SDF3 that is implemented without the use of post-parse filters, and an executable declarative specification of the WebDSL static semantics in Statix. We use the modernized front-end as the largest case study to date for the meta-languages SDF3 and Statix, in order to evaluate their expressiveness, performance, and elegance when they are used to implement a real world language., Computer Science
Published: 2022

34. Hardware-Accelerated PEG Parsing Machine

Author: Bakker, Roy (author) and Bakker, Roy (author)
Abstract: Information exchange through the countless webservices is central in the current age of technology, which increases the importance for security in these developments. One such security feature is the validation of data based on standardized data structures. The aim of this thesis is to develop a flexible hardware-accelerated text-based recognizer that provides this strict syntax validation. To this end, a parsing machine architecture was adopted in order to fulfill the flexibility and strict recognition requirements. The parsing machine architecture was developed by formalizing the fundamental PEG expressions and creating a micro-architecture based on these PEG expressions, which led to the specification of the PPEG instruction set architecture. This architecture was then mathematically formalized and a proof for its strict adherence to the formalized PEG behavior was provided. The parsing machine architecture was implemented on an FPGA, a virtualization of the parsing machine was implemented in Python for easy analysis of its behavior, and a PEG compiler and assembler were developed for the PEG-PPEG translation. Finally, a memoization unit was developed as an extension to the parsing machine for an improved parsing throughput. By running benchmarks for CSV, XML, JSON, and Java files on the PPEG parsing machine implementation, its parsing behavior was analyzed and compared to existing solutions. This showed that the minimum stack sizes depend solely on the size and complexity of the PEG; the percentage of clock cycles spent on jumps in instruction and data memory is substantial, ranging from 18\% and 40\%; the PPEG-compiled binary code size is relatively small compared to other solutions; and the throughput of the PPEG parsing machine is comparable if not better than other solutions running on faster hardware. Finally, the memoization unit was found to benefit large complex grammars more than small grammars., Computer Engineering
Published: 2022

35. Multitask Pointer Network for Multi-Representational Parsing

Author: Fernández-González, Daniel, Gómez-Rodríguez, Carlos, Fernández-González, Daniel, and Gómez-Rodríguez, Carlos
Abstract: [Abstract] Dependency and constituent trees are widely used by many artificial intelligence applications for representing the syntactic structure of human languages. Typically, these structures are separately produced by either dependency or constituent parsers. In this article, we propose a transition-based approach that, by training a single model, can efficiently parse any input sentence with both constituent and dependency trees, supporting both continuous/projective and discontinuous/non-projective syntactic structures. To that end, we develop a Pointer Network architecture with two separate task-specific decoders and a common encoder, and follow a multitask learning strategy to jointly train them. The resulting quadratic system, not only becomes the first parser that can jointly produce both unrestricted constituent and dependency trees from a single model, but also proves that both syntactic formalisms can benefit from each other during training, achieving state-of-the-art accuracies in several widely-used benchmarks such as the continuous English and Chinese Penn Treebanks, as well as the discontinuous German NEGRA and TIGER datasets.
Published: 2022

36. Uniform Parsing for Hyperedge Replacement Grammars

Author: Björklund, Henrik, Drewes, Frank, Ericson, Petter, Starke, Florian, Björklund, Henrik, Drewes, Frank, Ericson, Petter, and Starke, Florian
Abstract: It is well known that hyperedge-replacement grammars can generate NP-complete graph languages even under seemingly harsh restrictions. This means that the parsing problem is difficult even in the non-uniform setting, in which the grammar is considered to be fixed rather than being part of the input. Little is known about restrictions under which truly uniform polynomial parsing is possible. In this paper we propose a low-degree polynomial-time algorithm that solves the uniform parsing problem for a restricted type of hyperedge-replacement grammars which we expect to be of interest for practical applications.
Published: 2021
Full Text: View/download PDF

37. EXTRACTION AND CLASSIFICATION MODEL FOR BUSINESS EVENT FROM CONTRACTS AND TEMPORAL CONSTRAINTS IN SERVICE ENGAGEMENTS

Author: Sagar V. Chavan, Virat V Giri, Nilesh D Ghorpade, Shivraj A Patil, Sagar V. Chavan, Virat V Giri, Nilesh D Ghorpade, and Shivraj A Patil
Abstract: Contract is a self-agreed, enforceable by law and deliberate agreement between two or more competent authority and parties. Contracts are made in written but may be implied or spoken, and generally have to do with another organization, employment, sale or lease, or tenancy. We assume service engagement is a part of business events. Business events such as payments, purchase, sells, delivery etc. not only impotent processes but are also inherently temporally constrained. Analysis phase is carried out to find out business event and their temporal relationships which helps business partners to analyze what to supply and what to require from others as its participates in the service engagement specified by a contract. Contracts are always be in unstructured text and their details also described in form of unstructured text. Our proposed system through this paper is to introduce a novel approach for employing classification, parsing to extract business event and their temporal constraints from contract text. Also we organize the event terms into cluster automatically with the use of topic modeling.
Published: 2021

38. Tens of gigabytes per second JSON-to-Arrow conversion with FPGA accelerators

Author: Peltenburg, J.W. (author), Hadnagy, A. (author), Brobbel, M. (author), Morrow, Robert (author), Al-Ars, Z. (author), Peltenburg, J.W. (author), Hadnagy, A. (author), Brobbel, M. (author), Morrow, Robert (author), and Al-Ars, Z. (author)
Abstract: JSON is a popular data interchange format for many web, cloud, and IoT systems due to its simplicity, human readability, and widespread support. However, applications must first parse and convert the data to a native in-memory format before being able to perform useful computations. Many big data applications with high performance requirements convert JSON data to Apache Arrow RecordBatches, the latter being a widely-used columnar in-memory format for large tabular data sets used in data analytics. In this paper, we analyze the performance characteristics of such applications and show that JSON parsing represents a bottleneck in the system. Various strategies are explored to speed up JSON parsing on CPU and GPU as much as possible. Due to performance limitation of the CPU and GPU implementations, we furthermore present an FPGA accelerated implementation. We explain how hardware components that can parse variable-sized and nested structures can be combined to produce JSON parsers for any type of JSON document. Several fully integrated FPGA-accelerated JSON parser implementations are presented using the Intel Arria 10 GX and Xilinx VU37P devices, and compared to the performance of their respective host systems; an Intel Xeon and an IBM POWER9 system. Result show the accelerators achieve an end-to-end throughput close to 7 GB/s with the Arria 10 GX using PCIe, and close to 20 GB/s with the VU37P using OpenCAPI 3. Depending on the complexity of the JSON data to parse, the bandwidth is limited by the host-to-accelerator interface or available FPGA resources. Overall, this provides a throughput increase of up to 6x, compared to the baseline application. Also, we observe a full system energy efficiency improvement of up to 59x more JSON data parsed per joule., Computer Engineering
Published: 2021
Full Text: View/download PDF

39. Incremental Scannerless Generalized LR Parsing

Author: Sijm, Maarten (author) and Sijm, Maarten (author)
Abstract: The Scannerless Generalized LR (SGLR) parsing algorithm supports the development of composed languages seamlessly but does not support incremental parsing. The Incremental Generalized LR (IGLR) parsing algorithm, on the other hand, does not support the seamless composition of languages. This thesis presents the Incremental Scannerless Generalized LR (ISGLR) parsing algorithm and investigates the effects of combining the SGLR and IGLR parsing algorithms. While the algorithmic differences are orthogonal, the fact that scannerless parsing relies on non-deterministic parsing for disambiguation has a negative impact on incrementality. Nonetheless, we show that the ISGLR parsing algorithm performs better than the batch SGLR parsing algorithm in typical scenarios. On average, the ISGLR parser can reuse 99% of a previous parse result. When parsing from scratch, the ISGLR parser has a 24% run time overhead compared to the SGLR parser, but when parsing incrementally for changes that are smaller than 1% of the input size on average, it has a 9× speedup., Successor of https://doi.org/10.1145/3359061.3361085, Computer Science
Published: 2021

40. Performance impact of the modular architecture in the incremental SGLR parsing algorithm: Research Project TU Delft

Author: Coman, Mara (author) and Coman, Mara (author)
Abstract: JSGLR2 is a modular Java implementation of the SGLR parsing algorithm that supports systematic benchmarking and improvement of its several parsing variants. By splitting the code into several components, they can be tested in isolation and thus optimized more effortlessly. The modular architecture, although beneficial for efficiently identifying and implementing optimizations, negatively impacts the performance of the parsing algorithm. This paper aims to measure the overhead introduced by the code architecture for one of the variants, more specifically the incremental variant, which combines incremental parsing with SGLR parsing. It does so by comparing the original implementation with a version with the modularity removed. The evaluation is done on programming languages used in practice: Java, WebDSL and SDF3. The results show that the inlined parser outperforms the previous one, achieving speedups of up to 16% in batch parsing and up to 10% in incremental parsing., CSE3000 Research Project, Computer Science and Engineering
Published: 2021

41. Undoing Software Engineering: Demodularization of a SGLR Parser for Performance Gains

Author: Kapitonenko, Nik (author) and Kapitonenko, Nik (author)
Abstract: JSGLR2 is a java implementation of the Scannerless Generalized LR-parsing (SGLR) algorithm. It employs a modular architecture. This architecture comes with a performance overhead for letting multiple components interact with each other. This paper looks into the size of the performance overhead penalty for the recovery parser variant. It does so by creating an 'inlined' version of the recovery parser variant. The inlined recovery variant is a JSGLR2 implementation that ignores the modular architecture, and hard-codes the components. The performance of the inlined variant is measured with a pre-existing evaluation suite. The results show that there is a performance increase between the original, and the inlined variant., CSE3000 Research Project, Computer Science and Engineering
Published: 2021

42. Towards an Empirical Characterisation and a Corpus-Driven Taxonomy of Fragments in Written Contemporary English

Author: Fernández Pena, Yolanda and Fernández Pena, Yolanda
Abstract: This study investigates ‘fragments’ in contemporary English. Fragments are structurally noncanonical constituents that convey the propositional meaning of a full clause, such as Good Old Hendon next stop or What a weirdo. This investigation constitutes an innovative approach to the topic since it (i) explores fragments in exclusively written (i.e. planned/edited) discourse, and (ii) aims at providing a corpus-driven taxonomy and an empirical account of the constructions, strategies and phenomena that are classifiable as fragments based on linguistically objectifiable (formal/textual) criteria, two areas much neglected in prior literature. The results reveal that fragments are not uncommon in written registers, particularly in letters and novels/stories. The most frequent types identified are phrasal and verbless, followed by clausal, wh-fragments and Small Clauses. Most of them show a high rate of subject and/or verb omission whose recoverability in context is facilitated by means of functional elements or latent lexical items licensed by the construction itself., Este estudio investiga ‘fragmentos’ en inglés contemporáneo. Los fragmentos son constituyentes estructuralmente no canónicos que tienen el significado proposicional de una cláusula completa, como Good Old Hendon next stop o What a weirdo. Esta investigación constituye un enfoque innovador sobre el tema ya que (i) explora los fragmentos en el discurso escrito exclusivamente y (ii) tiene como objetivo elaborar una taxonomía basada en corpus y una descripción empírica de las construcciones, estrategias y fenómenos clasificables como fragmentos basada en criterios lingüísticos objetivables (formales/textuales), dos cuestiones poco exploradas en la literatura previa. Los resultados revelan que los fragmentos no son infrecuentes en los registros escritos, especialmente en las cartas y las novelas/historias. Los tipos más frecuentes identificados son frasales y sin verbo, seguidos de clausales, wh- y Small Clauses. La mayoría muestran una gran proporción de omisión de sujeto y/o verbo, recuperables en el contexto mediante elementos funcionales o elementos léxicos latentes justificados por la construcción en la que aparecen.
Published: 2021

43. Rule-Based Top-Down Parsing for Acyclic Contextual Hyperedge Replacement Grammars

Author: Drewes, Frank, Hoffmann, Berthold, Minas, Mark, Drewes, Frank, Hoffmann, Berthold, and Minas, Mark
Abstract: Contextual hyperedge replacement (CHR) strengthens the generative power of hyperedge replacement (HR) significantly, thus increasing its usefulness for practical modeling. We define top-down parsing for CHR grammars by graph transformation, and prove that it is correct as long as the generation and use of context nodes in productions does not create cyclic dependencies. An efficient predictive version of this algorithm can be obtained as in the case of HR grammars., Also part of the Theoretical Computer Science and General Issues book sub series (LNTCS, volume 12741)
Published: 2021
Full Text: View/download PDF

44. Indentation-Sensitive Parsing for Parsec (Conference Object)

Author: Agacan, Ömer S., Adams, Michael D., Agacan, Ömer S., and Adams, Michael D.
Abstract: Several popular languages including Haskell and Python use the indentation and layout of code as an essential part of their syntax. In the past, implementations of these languages used ad hoc techniques to implement layout. Recent work has shown that a simple extension to context-free grammars can replace these ad hoc techniques and provide both formal foundations and efficient parsing algorithms for indentation sensitivity. However, that previous work is limited to bottom-up, LR(k) parsing, and many combinator-based parsing frameworks including Parsec use top-down algorithms that are outside its scope. This paper remedies this by showing how to add indentation sensitivity to parsing frameworks like Parsec. It explores both the formal semantics of and efficient algorithms for indentation sensitivity. It derives a Parsec-based library for indentation-sensitive parsing and presents benchmarks on a real-world language that show its efficiency and practicality.
Published: 2021

45. Incremental Scannerless Generalized LR Parsing

Author: Sijm, Maarten P. (author) and Sijm, Maarten P. (author)
Abstract: The Scannerless Generalized LR (SGLR) parsing algorithm supports the development of composed languages seamlessly but does not support incremental parsing. The Incremental Generalized LR (IGLR) parsing algorithm, on the other hand, does not support the seamless composition of languages. This thesis presents the Incremental Scannerless Generalized LR (ISGLR) parsing algorithm and investigates the effects of combining the SGLR and IGLR parsing algorithms. While the algorithmic differences are orthogonal, the fact that scannerless parsing relies on non-deterministic parsing for disambiguation has a negative impact on incrementality. Nonetheless, we show that the ISGLR parsing algorithm performs better than the batch SGLR parsing algorithm in typical scenarios. On average, the ISGLR parser can reuse 99% of a previous parse result. When parsing from scratch, the ISGLR parser has a 24% run time overhead compared to the SGLR parser, but when parsing incrementally for changes that are smaller than 1% of the input size on average, it has a 9× speedup., Successor of https://doi.org/10.1145/3359061.3361085, Computer Science
Published: 2021

46. Avoiding Gaps in Romance: Evidence from Italian and French for a Structural Parsing Principle

Author: Konrad, I, Burattin, M, Cecchetto, C, Foppolo, F, Staub, A, Donati, C, Konrad, Ingrid, Burattin, Massimo, Cecchetto, Carlo, Foppolo, Francesca, Staub, Adrian, Donati, Caterina, Konrad, I, Burattin, M, Cecchetto, C, Foppolo, F, Staub, A, Donati, C, Konrad, Ingrid, Burattin, Massimo, Cecchetto, Carlo, Foppolo, Francesca, Staub, Adrian, and Donati, Caterina
Abstract: Existing evidence suggests that the parser avoids positing a movement dependency if the grammar does not require doing so. By investigating the processing of two syntactic ambiguities that have not been the subject of processing studies before, we provide more conclusive evidence for this parsing bias in two Romance languages: French and Italian. In two acceptability‐judgment experiments and two self‐paced‐reading studies, we found that sentences that involved a filler–gap dependency (indirect questions in Italian and free relatives in French) were dispreferred compared to sentences involving the same lexical material but no filler–gap dependency (declarative complement clauses in both languages). Crucially, the filler–gap dependency was not dispreferred when there was no available competitor. We conclude by discussing the relevance of these results for syntactic theory, in particular for the questionable status of Merge over Move as a grammatical principle.
Published: 2021

47. TimeSeq: sequences with normalized time information for Seq2Seq translation task

Author: Universitat Politècnica de Catalunya. Departament de Ciències de la Computació, Ruiz Costa-Jussà, Marta, González Bermúdez, Meritxell, Villasana Falcon, Aram Javier, Universitat Politècnica de Catalunya. Departament de Ciències de la Computació, Ruiz Costa-Jussà, Marta, González Bermúdez, Meritxell, and Villasana Falcon, Aram Javier
Abstract: During last years, researchers have focused on developing Deep Learning models to translate natural language questions to SQL structured queries, what is known as the "text-to-sql" translation task. By 2020, DL models have reached outstanding results in WikiSQL and Spider challenges. However, there are still some challenges they have not addressed. Time normalization and small size of training samples stand out among these challenges. Nowadays, state-of-the-art models for text-to-sql task are not able to deal with common questions like, "what were the average sales for last three Christmas?" or "how much did I sell in every weekend of the last quarter of previous year?". On the other hand, models are trained with relatively small number of samples compared to the number of parameters in the deep learning models and collecting more questions and annotate the dataset is not easy. For these reasons, we propose TimeSeq, a new standard for annotating temporal information as normalized token sequences (time sequences), which can take advantage of DL models that perform seq-to-seq translation for automatically normalizing relative time information contained in common questions about business databases. Currently this standard applies to the context of questions made about the content of a transactional database; but could grow to other domains. We demonstrated with our experiments that deep learning models can learn to summarize the temporal information contained in a natural language question into a time sequence. We developed a process for data augmentation to increase the examples of pairs of questions and time sequences, based in the substitution of paraphrases corresponding to 3 ontologies we integrated: (1) Ontology of Paraphrases for Temporal Expressions, (2) Ontology of Paraphrases for General Business Database Querying, and (3) Ontology of Paraphrases for Specific Industry Domain, and an algorithm to perform the data augmentation, by creating new pairs of question
Published: 2021

48. More Data and New Tools. Advances in Parsing the Index Thomisticus Treebank

Author: Ehrmann, Maud, Karsdorp, Folgert, Wevers, Melvin, Lee Andrews, Tara, Burghardt, Manuel, Kestemont, Mike, Manjavacas, Enrique, Piotrowski, Michael, van Zundert, Joris, Gamba, Federica, Passarotti, Marco Carlo, Ruffolo, Paolo, Marco Passarotti (ORCID:0000-0002-9806-7187), Ehrmann, Maud, Karsdorp, Folgert, Wevers, Melvin, Lee Andrews, Tara, Burghardt, Manuel, Kestemont, Mike, Manjavacas, Enrique, Piotrowski, Michael, van Zundert, Joris, Gamba, Federica, Passarotti, Marco Carlo, Ruffolo, Paolo, and Marco Passarotti (ORCID:0000-0002-9806-7187)
Abstract: This paper investigates the recent advances in parsing the Index Thomisticus Treebank, which encompasses Medieval Latin texts by Thomas Aquinas. The research focuses on two types of variables. On the one hand, it examines the impact that a larger dataset has on the results of parsing; on the other hand, performances of new parsers are analysed with respect to less recent tools. Term of comparison to determine the effective parsing advances are the results in parsing the Index Thomisticus Treebank described in a previous work. First, the best performing parser among those concerned in that study is tested on a larger dataset than the one originally used. Then, some parser combinations that were developed in the same study are evaluated as well, assessing that more training data result in more accurate performances. Finally, to examine the impact that newly available tools have on parsing results, we train, test, and evaluate two neural parsers chosen among those best performing in the CoNLL 2018 Shared Task. Our experiments reach the highest accuracy rates achieved so far in automatic syntactic parsing of the Index Thomisticus Treebank and of Latin overall.
Published: 2021

49. On the Use of Parsing for Named Entity Recognition

Author: CITIC. Grupo LyS. Departamento de Ciencias da Computación e Tecnoloxías da Información., Alonso, Miguel A., Gómez-Rodríguez, Carlos, Vilares, Jesús, CITIC. Grupo LyS. Departamento de Ciencias da Computación e Tecnoloxías da Información., Alonso, Miguel A., Gómez-Rodríguez, Carlos, and Vilares, Jesús
Abstract: [Abstract] Parsing is a core natural language processing technique that can be used to obtain the structure underlying sentences in human languages. Named entity recognition (NER) is the task of identifying the entities that appear in a text. NER is a challenging natural language processing task that is essential to extract knowledge from texts in multiple domains, ranging from financial to medical. It is intuitive that the structure of a text can be helpful to determine whether or not a certain portion of it is an entity and if so, to establish its concrete limits. However, parsing has been a relatively little-used technique in NER systems, since most of them have chosen to consider shallow approaches to deal with text. In this work, we study the characteristics of NER, a task that is far from being solved despite its long history; we analyze the latest advances in parsing that make its use advisable in NER settings; we review the different approaches to NER that make use of syntactic information; and we propose a new way of using parsing in NER based on casting parsing itself as a sequence labeling task.
Published: 2021

50. Graph Parsing as Graph Transformation : Correctness of Predictive Top-Down Parsers

Author: Drewes, Frank, Hoffmann, Berthold, Minas, Mark, Drewes, Frank, Hoffmann, Berthold, and Minas, Mark
Abstract: Hyperedge replacement (HR) allows to define context-free graph languages, but parsing is NP-hard in the general case. Predictive top-down (PTD) is an efficient, backtrack-free parsing algorithm for subclasses of HR and contextual HR grammars, which has been described and implemented in earlier work, based on a representation of graphs and grammar productions as strings. In this paper, we define PTD parsers for HR grammars by graph transformation rules and prove that they are correct.
Published: 2020
Full Text: View/download PDF

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Publication Type

Journal

Database

Publisher

616 results on '"Parsing"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources