850 results on '"CORPORA"'
Search Results
2. Žanrovska analiza pomorskopravnih tekstova i ostvarenje prijevodnih univerzalija u njihovim prijevodima s engleskoga jezika
- Author
-
Kegalj, Jana, Antunović, Goranka, and Tominac Coslovich, Sandra
- Subjects
eksplicitacija ,udc:81(043.3) ,žanrovska analiza ,translating ,HUMANISTIC SCIENCES. Philology. General Linguistics ,normalizacija ,pomorskopravni tekstovi ,korpusna analiza ,genre analysis ,HUMANISTIČKE ZNANOSTI. Filologija. Opće jezikoslovlje (lingvistika) ,triangulacija ,triangulation ,explicitation ,corpora ,corpus analysis ,prevođenje ,žanr ,normalization ,Linguistics and languages ,translation universals ,genre ,prijevodne univerzalije ,Lingvistika i jezici ,korpus ,maritime law
- Abstract
Genre features have proved to be a significant factor in decision-making during the translation process, and genre analysis has accordingly become an important methodological tool in translation studies. By approaching the text holistically and emphasizing the communicative function of the various features of the text under observation, genre analysis provides a comprehensive framework within which the specific features of translated texts can be examined, and, in that context, translation universals as universal features of translation. In this thesis, genre analysis was used to determine the lexicogrammatical features of the genre of maritime legal texts in Croatian and the features of maritime legal texts translated into Croatian, in order to establish the differences between original and translated texts and to detect possible manifestations of so-called translation universals. For this purpose, a unidirectional parallel corpus of maritime legal texts was compiled, consisting of English source texts and their Croatian translations, together with a comparable corpus of maritime legal texts originally written in Croatian. The analysis showed that the translations differ in their lexicogrammatical properties from texts originally written in Croatian and that certain translation universals are realized in the translations, primarily explicitation, but to a considerable extent also implicitation, which is not otherwise specific to translated texts but results from the features of the genre. In addition, thanks to genre analysis, the realization of normalization as a translation universal was also established. The results of the research made a scholarly contribution in the form of an exhaustive description, based on corpus analysis, of the genre of institutional maritime legal texts in Croatian and of their translations.
In addition, the research made a theoretical contribution to the analysis of translation universals based on the example of Croatian translations, as well as a practical one regarding the importance of recognizing genre features in everyday professional translation, which in turn has implications for translator education. Genre implies formal and stylistic conventions of a particular text type, which inevitably affects the translation process. This “force of genre bias” (Prieto Ramos, 2014) has been recognized by translation scholars as a relevant factor in the decision-making process, especially when it comes to the translation of legal texts. The common denominator of genre analysis and translation is the emphasis on communicative function, as the main focus of both is on the function of a particular linguistic unit in a (con)text. In this sense, one of the more important stages in translation is the process of adapting a target text to the genre conventions of the target culture, which means that a translator must be familiar with the genre conventions of the source and target cultures. In the Croatian linguistic tradition, genre has played a marginal role and has been associated with literary genres. In this research, it was assigned a central role as a yardstick against which maritime legal texts were analysed and their linguistic features defined. In this research, genre analysis was used as a methodological tool in the analysis of maritime legal texts originally written in Croatian and the translations of maritime legal texts into Croatian. In translation studies, analyses of translations into different languages have shown that translated texts differ from texts originally written in a particular language in various ways. Such an analysis of translated Croatian texts has not been done before. Therefore, the aim of the thesis is to analyse various features of this type of text, especially with regard to a possible realization of translation universals in Croatian translations. 
The topic of translation universals has attracted the interest of numerous scholars and has led to findings in various languages that shed light on both the translation process itself and translation strategies and procedures. However, the notion of “universal” features of translation inherent in the translation process itself has been controversial, and researchers have provided evidence for and against it, so that the term has been toned down over time to “tendencies” or “rules”. Regardless of the opposing views, the concept is still considered relevant because it provides insight into the translation process and decision-making during translation. The study in this thesis was conducted on two corpora of maritime legal texts. Specifically, one corpus was a unidirectional parallel corpus of maritime legal texts consisting of two subcorpora: maritime legal texts in English (MarLaw EN) and their translations into Croatian (MarLaw CRO). The other corpus was a comparable corpus of maritime legal texts in Croatian (MarLaw HR). The corpora were compiled using the online corpus analysis tool Sketch Engine, which immediately annotates and tags the corpora and also contains ready-made corpora. The main goal of the research was to conduct a genre analysis of original maritime legal texts in Croatian and of translations of maritime legal texts into Croatian in order to determine the linguistic features of the genre and the manifestation of translation universals. It was assumed that translations would differ in their linguistic characteristics from texts originally written in Croatian, as research in other languages has shown that translations differ from texts originally written in that language. It was also predicted that explicitation and normalization would be the most common translation universals in translations. 
The first phase of the research involved benchmarking the generic features of maritime legal texts by comparing the MarLaw HR corpus with the two general Croatian language corpora available in Sketch Engine, hrWaC and Hrvatska jezična riznica (Croatian Language Repository). The hrWaC corpus is a web corpus compiled from the .hr domain, while the Croatian Language Repository is a corpus of standard Croatian language compiled from selected texts from literature, scientific publications, textbooks, online journals and newspapers, and non-fiction books. The comparison was based on quantitative indicators, in particular statistical results on lexical density, lexical richness, readability, average sentence length, and corpus results such as keywords, frequencies and distribution of parts of speech, ngrams, and collocations. To gain a deeper insight into the specific features, a quantitative analysis was conducted, which was complemented by a qualitative analysis that provided further explanations. The results showed that the MarLaw HR corpus has a lower type-token ratio, which is probably due to genre and thematic constraints, but the value of lexical density, as defined by Halliday (1985), is greater for the MarLaw HR corpus than for the general language corpora, indicating that the sentences in the specialized corpus are very dense and informative. The specialized corpus also has much longer sentences, which is consistent with the register of legal texts in general, which has a tendency to be all-inclusive and redundant (Bhatia 1996). Another feature of the legal register in maritime legal texts is nominalization, which is evident in a high noun-verb ratio, but also in quantitative corpus data, such as a larger number of nouns and adjectives compared to general language corpora. The specialized corpus also shows a larger number of nouns among the most frequent words, further confirming the tendency towards nominalization. 
The larger proportion of nouns in the specialized corpus contributes to information density (Biber 1989), since the focus is on the subject rather than the action, as well as to the impersonal, abstract style. The most frequent keywords in the MarLaw HR corpus, listed according to their keyness score, were divided into three main domains: legal, maritime and general lexis with technical meaning, which maps the range and scope of the genre. This analysis also revealed a larger proportion of verbal nouns compared to the two general language corpora and a larger proportion of prepositions, which is consistent with previous results. For verbs, the MarLaw HR corpus has a limited number of tenses, which is also consistent with the trends in the legal register. A distinctive feature of the genre is that present and future tenses are used to express obligation, while modal verbs are used as explicit expressions of strong obligation or strong prohibition. Adjectives and adverbs expressing obligation are also very common in the specialized corpus. Verbs are often decomposed into constructions consisting of a periphrastic verb and a noun, which also contributes to nominalization. Another feature of the specialized corpus is a larger proportion of performative verbs used as secondary means of expressing the imperative. At the discourse level, the specialized corpus shows an increased proportion of relative and demonstrative pronouns and certain groups of discourse markers, such as conditional, causal, relative, and explanatory. Another distinctive feature of the genre is multinomial expressions, particularly those involving verbs. The following part of the research involved the comparison of the specialized corpus of Croatian maritime legal texts with the parallel corpus of English and translated Croatian texts in order to analyse the specific features of translated texts and potential manifestations of translation universals. 
The analysis, also conducted in Sketch Engine, included statistical and corpus results, similar to the previous phase. Each identified feature was then associated with a particular translation universal. The statistical analysis showed that translations have the lowest average sentence length, lexical richness, and readability score, indicating their simplification compared to source and original texts. The translations also showed a similar tendency to nominalization as the original texts and a similar distribution of tenses. However, the translations also showed some peculiarities, such as lower valency of periphrastic verbs, a lower use of performative verbs, an increased use of modal verbs, and, at the discourse level, an increased use of pronouns, prepositions, and connectives. Manifestations of translation universals were considered from two points of view, that of the source text and that of the original text, in order to gain a better insight into their realization. In this sense, explicitation was identified in the proportion and distribution of adjectives, pronouns, conjunctions and textual connectors, and in the expression of modality. The tendency to normalization was noted in the proportion and distribution of nouns and the reduction of English binomial expressions with synonymous components, the latter not being a feature of the Croatian legal register. Another interesting feature is the tendency towards implicitation in the translations in the category of discourse markers, which are usually an example of explicitation in translations in other languages. Genre has been shown to be a decisive factor in identifying translation universals, as the texts written in a particular genre can be compared with texts written in the same genre; otherwise the results would be biased by specific generic features. 
It should be remembered that the specialized corpus MarLaw is limited both thematically and genre-wise, so the results obtained are specific to this particular genre. Nevertheless, it has revealed the tendencies in the translation of maritime legal texts, which can be reproduced for comparative purposes with other legal genres, and other genres to identify the differences and similarities in the translation process. The research has also contributed to the ongoing discussion on the translation universals from the perspective of the Croatian language, which has not been studied before. The results may have implications for professional translation, as well as for translator training. The research results may stimulate further research in the area of genre and translation analysis using corpus tools, as well as the compilation of other useful corpora of Croatian language to further develop Croatian language resources.
- Published
- 2023
3. Treebanking and corpus annotation in LiLa
- Author
-
Flavio Massimiliano Cecchini
- Subjects
Latin ,Corpora ,Syntax ,LiLa ,Linguistic Linked Open Data ,Treebanks
- Abstract
Presentation given on May 25th, 2023 at the LiLa Closing Event describing the modelling of syntactic information in the LiLa knowledge base.
- Published
- 2023
- Full Text
- View/download PDF
4. Contrasting signed and spoken languages
- Author
-
Sílvia Gabarró-López and Laurence Meurant
- Subjects
Sign language linguistics ,Linguistics and Language ,Gesture studies ,Corpora ,gesture studies ,Contrastive studies ,contrastive studies ,corpora ,sign language linguistics ,signed/spoken languages ,multimodality ,Signed/spoken languages ,Multimodality
- Abstract
For years, the study of spoken languages, on the basis of written and then also oral productions, was the only way to investigate the human language capacity. As an introduction to this first volume of Languages in Contrast devoted to the comparison of spoken and signed languages, we propose to look at the reasons for the late emergence of the consideration of signed languages and multimodality in language studies. Next, the main stages of the history of sign language research are summarized. We highlight the benefits of studying cross-modal and multimodal data, as opposed to the isolated investigation of signed or spoken languages, and point out the remaining methodological obstacles to this approach. This contextualization prefaces the presentation of the outline of the volume.
- Published
- 2022
5. CLARIN Resource Families for Oral History
- Author
-
Lenardič, Jakob, Calamai, Silvia, Scagliola, Stefania, and van den Heuvel, Henk
- Subjects
oral history ,metadata ,research infrastructure ,corpora ,FAIR
- Abstract
The CLARIN Resource Families (CRF) initiative provides manually curated overviews of prominent language resources and technologies deposited across the distributed CLARIN infrastructure (Lenardič and Fišer 2022). The main aim of CRF is to support other core services of CLARIN from the perspective of the FAIR principles (Wilkinson et al. 2016). CRF enhances the findability and accessibility of CLARIN resources by collating them under their most common typological characteristic. The initiative facilitates re-use by providing comprehensive descriptions tailored to the unique technical features of each of the families, as well as their qualitative characteristics. Furthermore, CRF provides a funding instrument for external projects to contribute new overviews. Though originally focused on written corpora (e.g., corpora of parliamentary proceedings, corpora of academic texts), in 2022, CRF was expanded to include corpora of oral history. At present, one collection is featured – the Ravensbrück corpora (Calamai et al. 2022a) – whose creation was supported by the aforementioned CRF funding instrument. This corpus family contains 8 collections of recorded interviews with survivors of the female concentration camp Ravensbrück, conducted in different languages, such as English, German, Hebrew, and French. See https://www.clarin.eu/resource-families/oral-history-corpora. One collection is available for download (Collection Bruzzone; see Bruzzone and Beccaria Rolfi 1976) while the others can be streamed online. The inclusion of the Ravensbrück corpora in CRF represents an illustrative example of how the CLARIN infrastructure incorporates and provides documentation for complex objects like oral history sources whose provenance and metadata documentation widely differ from standard written corpora and even from contemporary interviews born digitally. 
The team working on the Ravensbrück resource family (see Calamai et al. 2022b) availed themselves of CLARIN’s Component Metadata Infrastructure (CMDI), which is a framework for metadata description that “supports flexible definitions of metadata structure and semantics” by allowing researchers to “create and use their own [metadata] schema tailored specifically towards the requirements of [their] project” (Windhouwer and Goosen 2022: 194 and 199). All 8 collections within the Ravensbrück family are accompanied by extensive CMDI metadata, prepared by Calamai et al. (2022a,b). The peculiarity of the interviews in the Ravensbrück family is that they were mostly recorded on an analogue carrier (i.e., audio cassettes), so a new CMDI metadata profile was created that is tailored to such legacy interviews not born digitally. This metadata profile has additional components describing “information about the context in which the interviews were conducted” as well as “information about the process of digitisation” (Calamai et al. 2022a: 3). Being thus digitised, comprehensively described, and carefully curated, the Ravensbrück corpora present a unique opportunity to study and compare these historical interviews. To facilitate their use in research, CLARIN offers through its Speech data and Technology network (Draxler et al. 2020) an open-source web application called TranscriptionPortal (https://speechandtech.eu/transcription-portal), where certain audio recordings (e.g., Collection Bruzzone, United States Holocaust Memorial Museum) can be uploaded and then orthographically transcribed on the fly, with manual phonetic and word alignment for a variety of languages. Funded in the context of the CLARIN Resource Families Project.
- Published
- 2023
- Full Text
- View/download PDF
6. Euskararen eredutik hizkuntza-ereduen euskarara
- Author
-
Alezabal Roteta, Izaskun, Aranzabe Urruzola, María Jesús, and Lindemann, David
- Subjects
Morphology ,Computational Linguistics ,Basque ,Corpora ,Corpus Linguistics ,Language Models
- Abstract
The aim of this work is to describe developments concerning Basque, taking Basque Philology studies as the starting point and continuing up to the point at which the huge text corpora and language models that computers need in order to learn Basque have been obtained. Along that path, Miren has accompanied us as a teacher, as a member of our thesis committees, and as a departmental colleague. KEYWORDS: Basque; morphology; computational linguistics; corpora; corpus linguistics; language models.
- Published
- 2023
- Full Text
- View/download PDF
7. Guidelines for the Annotation of Parameters of Narration
- Author
-
Lehmann, Nico, Serova, Dina, Lukassek, Julia, Döring, Sophia, Goymann, Frank, Lüdeling, Anke, and Akbari, Roodabeh
- Subjects
narration ,register ,annotation ,ddc:410 ,guidelines ,corpora ,410 Linguistik
- Abstract
The present guidelines describe the annotation of narrative phenomena on the clause level, using a combination of ideas and methods from linguistics and literary studies. The main categories marking the discourse strategy “narration” in stretches of text have been narrowed down to mediacy, i.e. involving a narrator, and sequentiality of events. This document specifies how to define mediacy, and in turn determine whether a narrator is present, as well as how to identify events and their sequential ordering. Lastly, a functional layer annotation is proposed which allows researchers to compare different types of narrative instances. This offers a basis for investigating a potential narrative register which is said to be important for many kinds of register studies.
- Published
- 2023
8. Quali prospettive per ItaDraCor? Risorse e strumenti per la codifica di testi teatrali in lingua italiana
- Author
-
Luca Giovannini, Ingo Börner, Frank Fischer, Carsten Milling, Daniil Skorinkin, and Peer Trilcke
- Subjects
letteratura italiana ,letteratura teatrale ,onboarding ,drama ,aiucd2023 ,corpora ,community building
- Abstract
Poster for the XII AIUCD annual conference (Siena, 5-7 June 2023, www.aiucd2023.unisi.it).
- Published
- 2023
- Full Text
- View/download PDF
9. Investigating the role of an indigenised variety of English in the acquisitional and sociolinguistic contexts of the Malaysian ecology
- Author
-
Sie, Samantha
- Subjects
generative grammar ,minimalist syntax ,multilingualism ,substrate transfer ,narrative task ,linguistics ,language contact ,crosslinguistic influence ,grammaticality judgement task ,British English ,corpora ,Malaysian English ,TalkBank ,language acquisition ,sociolinguistic survey ,finiteness ,language attitude ,sociolinguistics ,World Englishes
- Abstract
The realm of New Englishes offers enriching avenues to explore the interplay between language acquisition and sociolinguistic influences in linguistically diverse ecologies. Yet research into this interdisciplinary arena remains lacking. Accordingly, this thesis addresses this paradigm gap by focusing on the Malaysian ecology. One of the three empirical studies conducted as part of this project is i) a CASE STUDY which examines the morphosyntactic properties of an indigenised variety of English viz., Colloquial Malaysian English (CME). The data generated from naturalistic conversations came from two pairs of adult Malaysians with different L1 backgrounds (i.e., Malay and Chinese). While many of the non-standard features supplied could be explained by substrate influence, there were also features resembling general second language (L2) behaviours and creative innovation. The MAIN STUDY adopts a concurrent embedded design, which comprises ii) an ACQUISITIONAL STUDY and iii) a SOCIOLINGUISTIC STUDY. The ACQUISITIONAL STUDY investigates the roles of the first language (L1) and CME in the ultimate acquisition of finiteness in Standard English (StE). The adult participants recruited for this study were 145 Malaysians and 30 British (control). Malaysians who acquired English as (one of) their L1(s) (L1-MalE(+)) were predicted to have less difficulty than their L1-Malay and L1-Chinese peers and perform more similarly to the British English (BritE) monolinguals. This is because, despite the prevalence of CME in the local environment, L1-MalE(+) learners would merely have to reset the optional features of finiteness in CME to obligatory, as required in StE. Meanwhile, L1-Malay and L1-Chinese learners would be faced with an additional learnability burden of acquiring finiteness as a new functional feature, given its absence in their L1s. 
Findings from a grammaticality judgement task and narrative task revealed that although the Malaysian cohort behaved statistically differently from the L1-BritE control, the L1-MalE(+) groups outperformed the L1-Chinese and L1-Malay groups across the board. That said, the L1-Malay group fared considerably better than its L1-Chinese counterpart and was about on par with the L1-MalE(+) peers. These findings indicated clear L1 effects modulated by typological proximity. Meanwhile, the SOCIOLINGUISTIC STUDY explores Malaysians’ attitudinal behaviours towards CME and StE. The same participants from the acquisitional study undertook a sociolinguistic survey administered for this study. Findings revealed that the participants were non-discriminatory towards CME and StE, and that they were aware of when to use these varieties across different social settings. Altogether, this thesis demonstrates the facilitative role of CME in the acquisition of StE, and concurrently vindicates the functional importance of CME and StE as legitimate varieties in the Malaysian milieu.
- Published
- 2023
- Full Text
- View/download PDF
10. A Keywords Corpus Analysis of ‘Taking the Knee’ and its Representation by the UK Press
- Author
-
SPIVEY, Martin
- Subjects
media representation ,critical discourse analysis ,corpora ,keywords ,race
- Abstract
In recent years, the violent treatment of black citizens by police officers in the United States has received global attention. This has led to various forms of protest, one of which is the ‘taking the knee’ gesture performed by professional athletes prior to sports events. This paper presents the findings of a small-scale, exploratory keywords corpus analysis of 35 UK newspaper articles on ‘taking the knee’ in European football between June 2020 and November 2021. The aim is to discover how the issue has been covered by the UK press and there is some evidence to show that, while it creates some controversy, the gesture has been presented in a mostly positive (or at least neutral) light. However, more forensic and detailed analysis is required in order to gain a greater understanding of the issue at hand.
- Published
- 2022
11. TTS-Portuguese Corpus: a corpus for speech synthesis in Brazilian Portuguese
- Author
-
Edresson Casanova, Arnaldo Candido Junior, Christopher Shulby, Frederico Santos de Oliveira, João Paulo Teixeira, Moacir Antonelli Ponti, and Sandra Aluísio
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,Linguistics and Language ,Computer Science - Computation and Language ,Portuguese ,PORTUGUÊS DO BRASIL ,Library and Information Sciences ,TTS ,Language and Linguistics ,Speech synthesis ,Machine Learning (cs.LG) ,Education ,Corpora ,Audio and Speech Processing (eess.AS) ,FOS: Electrical engineering, electronic engineering, information engineering ,Computation and Language (cs.CL) ,Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
Speech provides a natural way for human–computer interaction. In particular, speech synthesis systems are popular in different applications, such as personal assistants, GPS applications, screen readers and accessibility tools. However, not all languages are on the same level in terms of resources and systems for speech synthesis. This work consists of creating publicly available resources for Brazilian Portuguese in the form of a novel dataset along with deep learning models for end-to-end speech synthesis. The dataset has 10.5 h from a single speaker, from which a Tacotron 2 model with the RTISI-LA vocoder presented the best performance, achieving a 4.03 MOS value. The obtained results are comparable to related works covering the English language and the state of the art in European Portuguese. This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior—Brasil (CAPES)—Finance Code 001, as well as CNPq (National Council of Technological and Scientific Development) Grant 304266/2020-5.
- Published
- 2022
12. Adverbial phrases a más, cuanto más and más y más: a historical approach from the 18th to the 20th century using two Spanish corpora
- Author
-
Puertas Ribés, Elia
- Subjects
Fraseología diacrónica ,Corpora ,Diachronic phraseology ,Tradiciones discursivas ,Lingüística de corpus ,Discursive traditions ,Adverbial phrases ,General Earth and Planetary Sciences ,Locución adverbial ,General Environmental Science
- Abstract
This article compares some adverbial phrases through the frequency of use of three syntactic schemes. The analysis is carried out in two different linguistic corpora: on the one hand, an epistolary corpus, compiled by the “Sociolingüística” research group of the Universitat Jaume I; on the other hand, the Diachronic Corpus of Spanish (CORDE). Then, the results obtained are compared with the purpose of analyzing what type of structures the authors use in social interaction. Finally, different lexicographical sources are consulted to know the treatment received by adverbial phrases from the eighteenth century to the twentieth century, trying to clarify whether the use in a specific discursive tradition can be a determining factor for the process of lemmatization.
- Published
- 2022
13. CLS INFRA D7.1 On Programmable Corpora
- Author
-
Börner, Ingo, Trilcke, Peer, Fischer, Frank, Milling, Carsten, Göbel, Mathias, Schwindt, Mark, Skorinkin, Daniil, and Sluyter-Gäthje, Henny
- Subjects
Corpora ,Literature ,API ,DraCor ,Programmable Corpora ,Computational Literary Studies
- Abstract
While the discipline of Computational Literary Studies (CLS) consolidates, infrastructural challenges arise that have to be addressed to ensure that good, sustainable and open scholarship can be carried out in this dynamic field of Digital Humanities research. In this situation, Work Package 7 of the CLS project, entitled “Building the Ecosystem of and for Programmable Corpora”, is developing a small-scale, but highly functional prototype for an infrastructural ecosystem for CLS research, following the concept of a network-based software architecture. The prototype, implemented as the multi-component system “DraCor” (Drama Corpora Platform), realizes the concept of “Programmable Corpora”, which is defined as corpora that expose an open, transparently documented and (at least partly) research-driven API to make texts machine-actionable. This report gives a detailed description of the DraCor system as a prototype for “Programmable Corpora”. It also shares two first experiments in adapting and transferring the approach of an API-based CLS research infrastructure to other systems and resources.
- Published
- 2023
- Full Text
- View/download PDF
14. Utilisation des corpus pour l’enseignement de l’interaction en formation professionnelle de métiers manuels : exemple d’un exercice numérique sur « genre »
- Author
-
Anita Thomas and France Rousset
- Subjects
FLE ,numérique ,marqueurs discursifs ,interaction ,General Earth and Planetary Sciences ,corpus ,corpora ,French language teaching ,discourse markers ,digital tools ,General Environmental Science
- Abstract
This article is in line with studies on the use of oral corpora as a didactic resource in French as a foreign language. It presents the challenges underlying the construction of an exercise in digital format using the example of the polysemous expression genre in spoken language. The exercise was developed within the applied research project DiCoi (Digitalisation - Corpus - Interaction), whose aim is to support the development of the competence of oral interaction of young people in vocational training for manual trades. The effect of the interventions is measured by means of feedback, but also by collecting a corpus of audio-recorded free oral interactions. The reception of the exercise is generally positive, but the analysis of the phenomenon in the interaction corpus reveals the difficulty of establishing a proper link between input and output. Nevertheless, careful preparation of the authentic material seems to make the target phenomena salient.
- Published
- 2023
15. Wissenschaftssprachliche Kompetenz beim Schreiben in Deutsch als fremde Wissenschaftssprache. Eine korpusbasierte Untersuchung
- Author
-
Antonella Nardi and Cristina Farroni
- Subjects
AWS ,GFL ,DaF ,terminology ,ordinary academic language ,Wissenschaftssprache ,Korpora ,scientific language ,corpora ,Terminologie - Abstract
Beim Schreiben einer Hausarbeit in fremder Wissenschaftssprache müssen Studierende einerseits fachbezogenes Wissen durch den Gebrauch von Fachtermini verbal umsetzen, andererseits müssen sie auch fähig sein, dem/den Lesenden dieses Wissen und ihr Untersuchungsvorhaben durch eine kompetente Auswahl spezifisch alltagswissenschaftlicher Ausdrücke (vgl. Ehlich 1999) zu vermitteln. Das Anliegen dieses Beitrags ist es, auf der Basis eines Korpus von in deutscher Sprache abgefassten Seminararbeiten italophoner Studierenden das Repertoire an Fachterminologie und an alltagssprachlichen Ausdrücken mit wissenschaftlicher Verwendung zu untersuchen, welches die Schreibenden zu diesen Zwecken einsetzen. Nach einem quantitativ-qualitativen korpusbasierten Ansatz werden die Daten nach Keywords, Lemmata und häufigen Kookkurrenzen extrahiert bzw. ausgewertet und anhand ausgewählter Beispiele auf der Suche nach spezifischen Schwierigkeiten bzw. guten Leistungen im Sprachgebrauch der Schreibenden interpretiert. When writing a seminar paper in a foreign language, students must be able to convey subject-related knowledge through the use of specialized terminology, while at the same time, they must make this knowledge, along with the aims of their work, available to the reader through the use of ordinary academic language (cf. Ehlich 1999). This article examines the repertoire of terminology, as well as everyday expressions with scientific usage, within a corpus of seminar papers written in German by Italian students. Taking a quantitative-qualitative corpus-based approach, the data are extracted and analyzed for keywords, lemmas and frequent co-occurrences, and interpreted on the basis of selected examples, with the aim of identifying difficulties and students’ good use of language. Korpora Deutsch als Fremdsprache, Volume 2, Issue 1, 2022
- Published
- 2023
- Full Text
- View/download PDF
16. Resources for Turkish natural language processing
- Author
-
Çöltekin, Çağrı, Doğruöz, A. Seza, and Çetinoğlu, Özlem
- Subjects
CORPORA ,TOOLS ,Technology and Engineering ,CONSTRUCTION ,AUTHOR ,Turkish ,lt3 ,RECOGNITION ,Linguistics ,LEXICON ,Lexical resources ,NLP ,TEXT ,MORPHOLOGY - Abstract
This paper presents a comprehensive survey of corpora and lexical resources available for Turkish. We review a broad range of resources, focusing on the ones that are publicly available. In addition to providing information about the available linguistic resources, we present a set of recommendations, and identify gaps in the data available for conducting research and building applications in Turkish Linguistics and Natural Language Processing.
- Published
- 2023
17. Exploring collocations in the Corpus of Contemporary American English
- Author
-
Sharon Hartle
- Subjects
Corpora ,Corpus of Contemporary American English (COCA) ,English Language Teaching ,collocations ,TESOL ,collocations, English Language Teaching, Corpora, wordandphrase, Corpus of Contemporary American English (COCA), TESOL ,wordandphrase - Published
- 2023
18. Akuzativne zamjeničke zanaglasnice ju i je u bosanskome, hrvatskome i srpskome: kontrastivno korpusno istraživanje standardnih i razgovornih idioma
- Author
-
Kolaković, Zrinka, Jurkiewicz-Rohrbacher, Edyta, and Gradischnig, Jasmin Denise
- Subjects
Bosnian ,Croatian ,Serbian ,standard and colloquial varieties ,variation ,corpora ,third-person singular feminine clitic in the accusative ,bosanski ,hrvatski ,srpski ,standardni i razgovorni idiom(i) ,varijacija ,korpusi ,zanaglasnica trećega lica jednine ženskoga roda u akuzativu - Abstract
Empirijska su istraživanja o varijaciji u uporabi zanaglasnica iznimno rijetka. Međutim, pažljiva i kontrastivna analiza bosanskih, hrvatskih i srpskih gramatika i ostalih jezičnih priručnika kao što su npr. jezični savjetnici, upućuje na postojanje određenih razlika u uporabi zanaglasnica ne samo među trima standardima nego i unutar njihovih razgovornih idioma. Ovaj se rad bavi razlikama u uporabi dviju varijanata akuzativne zamjeničke zanaglasnice trećega lica jednine ženskoga roda ju i je. Sudeći prema stanju opisanome u gramatikama, uporaba oblika ju znatno je ograničenija u bosanskome i srpskome nego u hrvatskome standardu (cf. Silić i Pranjković 2007; Mrazović i Vukadinović 2009; Ridjanović 2012; Piper i Klajn 2014). Kako bi se utvrdio stvarni raspon razlika u uporabi spomenutih zanaglasnica, primjeri za analizu crpljeni su ne samo iz tradicionalnih referentnih korpusa s jezičnom građom koja bi trebala odražavati stanje u trima standardima (Santos 1998; Utvić 2011; Čermák i Rosen 2012; Brozović Rončević i dr. 2018) nego i iz {bs, hr, sr}WaC-a, triju golemih mrežnih korpusa (Ljubešić i Klubička 2014) pretraživih s pomoću iznimno funkcionalnoga sučelja NoSketchEngine primjenom morfosintaktičkih oznaka, tj. tagova. Ručno je označeno 3916 primjera s akuzativnim zanaglasnim varijantama ju i je. Istraživanje potvrđuje kako jezični korpusi mogu poslužiti za dobivanje prijeko potrebnih uvida u raspon varijacije ne samo s obzirom na normu triju standarda nego i u stvarnoj svakodnevnoj uporabi, odnosno razgovornim idiomima. S pomoću frekvencije varijantnih oblika i konstrukcija u kojima se rabe utvrđuje se u kojoj se mjeri uzus razlikuje od propisane norme. Uopćeni linearni regresijski model otkriva statistički značajne razlike u uporabi akuzativne zanaglasne varijanate ju među standardnim idiomom hrvatskoga s jedne i standardnim idiomima bosanskoga i srpskoga jezika s druge strane. 
Osim toga, uočeno je da se razgovorni idiomi bosanskoga, hrvatskoga i srpskoga jezika značajno razlikuju u uporabi zanaglasne varijante ju. Nadalje, podaci pokazuju da je distribucija zanaglasnice ju značajno drukčija u razgovornim nego u standardnim idiomima analiziranih jezika. I na kraju, na temelju uopćenoga linearnoga regresijskoga modela, pokazano je da za razliku od -(j)e završetka riječi koja prethodi zanaglasnici trećega lica ženskoga roda u akuzativu, -(j)u završetak ima statistički značajan utjecaj na izbor među varijantama ju i je. Točnije govoreći, smanjuje vjerojatnost uporabe zanaglasnice ju. Empirical studies on variation in clitic use are almost entirely lacking. However, a careful contrastive analysis of both BCS grammar books and language advisor handbooks indicates differences in the usage of clitics not only between BCS standard varieties but also within their colloquial varieties. This paper deals specifically with variation in usage of the third-person feminine accusative CL ju and je ‘her’. According to the previous literature, the usage of the CL variant ju is more limited in Bosnian and Serbian than in Croatian standard (cf. Silić & Pranjković 2007; Mrazović & Vukadinović 2009; Ridjanović 2012; Piper & Klajn 2014). To test the range of this variation, we turn not only to traditionally compiled reference corpora with language material that should reflect standard BCS varieties (Santos 1998; Utvić 2011; Čermák & Rosen 2012; Brozović Rončević et al. 2018) but also to {bs,hr,sr}WaC, three massive web corpora (Ljubešić & Klubička 2014) available via a unified, functional interface NoSketchEngine, with morphosyntactic annotation based on the common tagset. We manually annotated 3,916 data points with CL variants ju and je.
This study shows how language corpora can give us valuable insights into the range of variation not only in the BCS standard varieties but also in real language use, i.e., their colloquial varieties, through the frequency of forms and patterns which are partially discarded by the BCS normativists. The generalized linear regression model revealed statistically significant differences in the usage of the CL ju between the standard Croatian variety on the one hand, and standard Bosnian and Serbian on the other. It also showed that BCS colloquial varieties differ significantly concerning the distribution of the CL variant ju. Furthermore, our data showed that the distribution of the CL variant ju is significantly different in standard and colloquial varieties of the analyzed languages. Finally, based on our generalized linear regression model, we showed that the -(j)u ending of the CL host has a significant impact on the choice between the CL variants ju and je: it decreases the probability of ju usage.
- Published
- 2023
19. Il corpus per imparare il serbo. Il futuro dell’apprendimento linguistico
- Author
-
Perisic, Olja
- Subjects
linguistica ,traduzione ,didattica ,lessico ,linguistica, corpora, didattica, serbo, lessico, morfo-sintassi, traduzione ,serbo ,morfo-sintassi ,corpora - Published
- 2023
20. Statistical machine translation proposal for Uzbek to English
- Author
-
Alisher Shakirovich Ismailov, Gulshoda Shamsiyeva, and Nilufar Abdurakhmonova
- Subjects
Q1-390 ,Science (General) ,Education (General) ,statistical machine translation ,corpora ,L7-991 ,machine translation ,natural language - Abstract
Machine translation is the automatic translation of one natural language into another [1]. It is one of the major and most active areas of natural language processing. The last decade has seen the rise of statistical approaches to machine translation, which learn translation parameters automatically from aligned text rather than relying on rule-based approaches. There has been quite extensive work on statistical machine translation for some language pairs. However, very limited research resources are available for the Uzbek to English language pair [2]. In this paper, we propose a statistical machine translation algorithm for Uzbek to English. Developing an English to Uzbek statistical machine translation algorithm is an interesting challenge from a number of perspectives. The most important challenge is that English and Uzbek are typologically distant languages: English has very limited morphology, whereas Uzbek is an agglutinative language with very rich and productive derivational and inflectional morphology. A single Uzbek word structure can correspond to a complete phrase of several words when translated into English. We propose an Uzbek to English statistical machine translation algorithm that uses a phrase-based model. Moreover, statistical machine translation requires English-Uzbek corpora, and we briefly present the development of such corpora.
- Published
- 2021
21. Morphological Tagging and Lemmatization in the Albanian Language
- Author
-
Mati Diellza Nagavci, Hamiti Mentor, and Mollakuqe Elissa
- Subjects
Political science (General) ,albanian language ,part of speech tagging ,natural language processing ,corpora ,Law ,JA1-92 ,lemmatization - Abstract
An important element of natural language processing is part-of-speech tagging. Fine-grained word-class annotations enrich the word forms in a text and can also be used in downstream processes, such as dependency parsing. The improved search options that tagged data offers also greatly benefit linguists and lexicographers. Natural language processing research is becoming increasingly popular and important as unsupervised learning methods are developed. Some aspects of the Albanian language make the creation of a part-of-speech tag set challenging. This research discusses those linguistic phenomena and presents a proposal for a part-of-speech tag set that can adequately represent them. The corpus contains more than 250,000 tokens, each annotated with a medium-sized tag set. The Albanian language’s syntagmatic aspects are adequately represented. Additionally, this paper presents morphologically and part-of-speech tagged corpora for the Albanian language, as well as a lemmatizer and a neural morphological tagger trained on these corpora. On the held-out evaluation set, the model achieves 93.65% accuracy on part-of-speech tagging, a morphological tagging rate of 85.31%, and a lemmatization rate of 88.95%. Furthermore, the TF-IDF technique is used to weight terms, and the resulting scores highlight words that carry additional information for the Albanian corpus.
- Published
- 2021
22. The use of online resources for the synchronic and diacronic study of political language. The case of 'representative democracy'
- Author
-
Zanettin, Federico and Proietti, Fausto
- Subjects
corpora, representative democracy, digital archives ,18th century ,AZ20-999 ,translation ,History of scholarship and learning. The humanities ,corpora ,quantitative/qualitative historical research ,digital archives ,General Works ,representative democracy - Abstract
In this article we wish to provide a survey of online documentary resources for the historical study of political thought and illustrate a series of “good practices” which could be adopted to optimize the use of publicly available digital collections in historiographical research. As a case study, we first report on the search, in very large digital collections of historical documents, for occurrences of the phrase “representative democracy” (and its dictionary equivalents in Italian and French) between 1778, the first occurrence of the term recorded in our data, and 1799. Most of the occurrences retrieved through a careful process of selection and scrutiny have not previously been discussed in the literature. In the last part of the article we discuss the contribution of text analysis tools to diachronic research, looking at frequency data from resources such as Google Books Ngram Viewer and HathiTrust + Bookworm, and comparing findings about the lexical profiles of “democracy” and “representative democracy” in historical and contemporary corpora.
- Published
- 2021
23. MIDEUKO - Mittelhochdeutsches Korpus für die NFDI
- Author
-
Arnold, Eckhart and Müller, Stefan
- Subjects
NFDI ,Sammlungen ,Corpora ,Textplus ,Text+ ,Text+ Plenary 22 ,Collections ,Lexical Resources - Abstract
Poster presented at the plenary meeting of the NFDI consortium Text+ on September 12, 2022 in Mannheim, Germany. Text+ receives a funding grant from the DFG under the grant agreement number 460033370. Text+ is contributing to the National Research Data Infrastructure (NFDI).
- Published
- 2022
- Full Text
- View/download PDF
24. A Gramateca e a Literateca como macroscópios linguísticos
- Author
-
Santos, Diana Maria de Sousa Marques Pinto dos
- Subjects
Português ,Corpus linguistics ,Corpora ,Literary studies ,Visualização ,Estudos literários ,Corpos ,Linguística com corpos ,Visualization - Abstract
This paper demonstrates several features of Gramateca and Literateca, which are environments for linguistic and literary research as well as large, richly annotated corpora of Portuguese. The paper presents them and highlights several new functionalities; it additionally provides ten examples of research questions that demonstrate the usefulness of this kind of Web-based service, which can be conceived of as a language macroscope. Examples concern semantics, morphosyntax, distant reading of literary texts, and information extraction. Neste artigo exploramos várias potencialidades que os ambientes da Gramateca e da Literateca permitem aos usuários interessados na pesquisa em língua portuguesa. Por um lado, apresentamos estes ambientes dando conta de novas funcionalidades acessíveis; por outro, trazemos dez exemplos de perguntas de pesquisa para demonstrar a utilidade da existência destes serviços, que pretendem ser uma espécie de macroscópio para observar a língua, nas vertentes semântica e morfossintática, assim como para a leitura distante de textos literários e a extração de informação em português.
- Published
- 2022
25. On the Use of Corpora in Second Language Acquisition – Chinese as an Example
- Author
-
Mária Ištvánová
- Subjects
teaching methodology ,Linguistics and Language ,Chinese ,Computer science ,Process (engineering) ,business.industry ,Teaching method ,P1-1091 ,corpora ,computer.software_genre ,Second-language acquisition ,Language and Linguistics ,Second language ,ComputingMilieux_COMPUTERSANDEDUCATION ,Chinese language ,second language acquisition ,Artificial intelligence ,linguistic research ,business ,Philology. Linguistics ,Curriculum ,Specialized translation ,computer ,Natural language processing - Abstract
This paper aims to introduce language corpora and the advantages of their use in the process of Chinese language acquisition. We provide practical examples of the direct and indirect use of corpora for teaching and learning Chinese as a second language. The exploratory approach to Chinese using various types of corpora is applicable to general language seminars as well as specialized translation seminars. The indirect use is mainly linked to the preparation of teaching materials and facilitates curriculum design.
- Published
- 2021
26. Nouns Formed with -ism Suffix as a Type of Nominalization in Modern Russian Language
- Author
-
Olga V. Kukushkina and Zhang Shuchun
- Subjects
Russian language ,History ,Noun ,naming ,P1-1091 ,General Medicine ,Suffix ,corpora ,Philology. Linguistics ,Linguistics ,Nominalization - Abstract
In the article, Russian nouns formed with the -ism suffix, often containing roots borrowed from other Indo-European languages, are analyzed from the perspective of their syntactic function as nominalizations. Nominalization is regarded in linguistics as a type of purely syntactic transformation (or “syntactic derivation” according to the definition of Jerzy Kuryłowicz), and the -ost’ suffix is well acknowledged and described as a word-formation formant for syntactic derivation. By contrast, the studied group of nouns with the -ism suffix has seldom been associated with or regarded as deadjectival nominals in previous works and dictionaries. Our analysis, based on explanatory and morpheme dictionaries, has shown that the motivational correlation between nouns with the -ism suffix and adjectives is described in multiple ways, which often contradict one another. For instance, -isms are described in the morpheme dictionary (Lopantin, Ulukhanov 2016) as deadjectival nouns, while in the Shvedova dictionary they are described as non-derivatives that motivate adjectives of quality. In addition, the seme ‘quality’ is also described variously in the two dictionaries: directly or using synonyms with different formants. The analysis of word usages of -isms was conducted with the corpus “Russian Newspapers of the End of the XX Century”, developed by the Laboratory for General and Computational Lexicology and Lexicography (Lomonosov Moscow State University). The analysis has shown that the diagnostic context that automatically differentiates the usage of -isms as nominalizations is the dependent name of the “feature carrier”, which as a result of nominalization has been moved from the position of subject to that of a dependent attribute.
- Published
- 2021
27. Argument structure taxonomy based 269 verb picture corpus in Kannada
- Author
-
Ahmed, Wasim and Krishnan, Gopee
- Subjects
Grammar ,Language Disorders ,Psycholinguistics ,Pragmatics ,Agrammatism ,Naming ,Argument ,Corpus ,Standardized Picture ,Semantics ,Kannada ,Verb ,Corpora ,Dravidian Language ,Paragrammatism ,Communication Disorders ,Aphasia ,Speech ,Dementia ,Syntax ,Kannadiga ,Neurolinguistics ,Language - Abstract
This is a first-of-its-kind standardized verb picture dataset that can be incorporated into experiments, assessment tools and therapeutic programs.
- Published
- 2022
- Full Text
- View/download PDF
28. Repository: 'It all begins today!' Archiving the Tweets Published by the realdonaldtrump Twitter Account 20 January 2017-08 January 2021
- Author
-
Priego, Ernesto and Barata, Carlota
- Subjects
digital humanities data ,Trump ,political history ,social media ,Twitter ,dataset ,digital history ,corpus ,text analysis ,corpora ,digital humanities ,political data science - Abstract
Repository of source data and supplemental materials for a project led by Dr Ernesto Priego, with data refining and analysis by Carlota Barata, documenting and analysing the text, source and timestamp metadata of the tweets published by the realdonaldtrump account between 20 January 2017 and 08 January 2021. Versioning is expected. Please see the README.txt file for more information.
- Published
- 2022
- Full Text
- View/download PDF
29. Anotação semântica (semi)automática de corpora: a frase nominal em alemão
- Author
-
Arias Arias, Iván, Iriarte Sanromán, Álvaro, Domínguez Vázquez, María José, and Universidade do Minho
- Subjects
Anotação semântica ,Corpora ,Semantische annotation ,Humanidades::Línguas e Literaturas ,Nominale valenz ,Korpora ,Lexikalisches paket ,Valência nominal ,Pacote lexical ,PLN ,NLP - Abstract
Dissertação de mestrado Europeu em Lexicografia, Nos dias de hoje, no âmbito da investigação e da prática lexicográfica, a utilização de corpora tem-se revelado muito recorrente, principalmente pelo facto de ser considerada como a metodologia mais fiável para alcançarmos exemplos representativos das línguas naturais. Embora as ferramentas de Processamento de Língua Natural (PLN) tenham conseguido grandes avanços na anotação morfossintática de textos, continua a faltar uma anotação semântica exaustiva e sistematizada. Esta carência evidencia-se principalmente quando se fala em lexicografia e gramática de valências, pois na bibliografia teórica (cf. Domínguez, 2011) aponta-se para o facto de a valência semântica ser fulcral para a delimitação de argumentos que acompanham um lexema considerado como portador de valência. Daí surge, no contexto desta investigação, a necessidade de uma aproximação à anotação semântica de corpora, em que se preste atenção especial aos argumentos no nível da frase nominal e ao seu comportamento semântico, para além da etiquetagem morfossintática com a qual contamos normalmente. A gramática e lexicografia de valências, assim como a semântica léxica, constituem, portanto, o ponto de partida teórico da presente dissertação de mestrado. No que diz respeito à metodologia, o presente trabalho cingir-se-á à análise das estruturas argumentais de três nomes do campo semântico da comunicação em alemão (Bericht, Diskussion e Frage) e, através de metodologia de PLN, desenhar-se-á um API script que possibilite o cruzamento de dados de corpora com alguns pacotes lexicais delimitados e criados no âmbito dos projetos PORTLEX, MultiGenera e MultiComb. 
Esta metodologia permitir-nos-á analisar, a posteriori, a fiabilidade do script desenvolvido, e conduzirá para a extração de conclusões relativas ao valor que poderia trazer consigo a anotação semântica sistematizada de corpora., Heutzutage wird in der Wörterbuchforschung und in der Lexikographie immer häufiger auf Korpora zurückgegriffen, weil sie als zuverlässige Methode gelten, um repräsentative Beispiele der natürlichen Sprache zu finden. Obgleich die Entwicklung von Tools im Bereich der natürlichen Sprachverarbeitung (NLP) dazu führte, dass die Texte morphosyntaktisch annotiert sind, fehlt es immer noch an einer umfassenden und systematisierten semantischen Annotation. Dieser Mangel wird besonders deutlich, wenn man sich mit der Valenzlexikographie und der Valenzgrammatik befasst, da in der Literatur (vgl. Domínguez, 2011) darauf hingewiesen wird, dass die semantische Valenz wesentlich für die Abgrenzung von Ergänzungen ist, die neben einem als Valenzträger zu betrachtenden Lexem auftreten. Daraus ergibt sich, dass es einem Ansatz zur semantischen Annotation von Korpora bedarf, bei dem die nominalen Ergänzungen und ihr semantisches Verhalten im Vordergrund stehen und der sich zum Ziel setzt, die Grenzen der bereits existierenden morphosyntaktischen Annotation zu überschreiten. Die Valenzgrammatik und -lexikographie sowie die lexikalische Semantik stellen daher den theoretischen Ausgangspunkt der vorliegenden Masterarbeit dar. Die Vorgehensweise dieser Arbeit beschränkt sich auf die Analyse der Argumentstrukturen von drei Substantiven aus dem semantischen Feld der Kommunikation im Deutschen (Bericht, Diskussion und Frage). Mithilfe von Tools der NLP wird ein Skript entwickelt, das einen Abgleich zwischen den aus Korpora stammenden Daten und den lexikalischen Paketen entnommenen Daten ermöglicht. Die sog. lexikalischen Paketen wurden im Rahmen der Projekte PORTLEX, MultiComb und MultiGenera erstellt. 
Anschließend ist die Zuverlässigkeit des erstellten Skripts zu analysieren und es werden Schlussfolgerungen hinsichtlich des Wertes der systematisierten semantischen Annotation von Korpora gezogen., EMLEX - With the support of the ERASMUS+ Programme of the European
- Published
- 2022
30. TEORIA DIALÓGICA DO DISCURSO (TDD) E PESQUISA COM GRANDES CORPORA: PROCESSO DE COMPOSIÇÃO DE DISCURSOS SOBRE A DIVULGAÇÃO CIENTÍFICA
- Author
-
Fetter, Giselle Liana
- Subjects
linguística de corpus ,corpus linguistics ,teoria dialógica do discurso ,corpora ,dialogical discourse theory ,science communication ,divulgação científica - Abstract
Research with extensive corpora based on Dialogic Discourse Theory (DDT), which follows the precepts of the Bakhtin Circle, is not commonly conducted, but such studies are possible if the researcher has a specific object and focus. Therefore, this paper aims to present the composition process of an extensive corpus for analysis based on DDT. The object of study is the discourse of professors from Brazilian universities about science communication, a topic that is still scarcely studied. The corpus was built from 226 scientific papers about science communication, published between 2013 and 2018 and retrieved from Google Scholar by searching for the term divulgação científica. As for the delimitation criteria, the papers had to be authored by professors of postgraduate programs belonging to seven major areas of knowledge defined by CAPES, which resulted in 114 papers. It was observed that the application of methodological procedures based on Corpus Linguistics (CL) and the use of a corpus analysis program – Wordsmith Tools – helped in the delimitation of the object of study and in the organization of the discourses. As pesquisas com extensos corpora baseadas na Teoria Dialógica do Discurso (TDD), que segue os pressupostos do Círculo de Bakhtin, não são comumente realizadas, porém, tais estudos são possíveis se o pesquisador tiver um objeto e um foco específico. Assim, este artigo objetiva apresentar o processo de composição de um extenso corpus para a análise com base na TDD. O objeto de estudo é o discurso de professores-pesquisadores de universidades brasileiras sobre a divulgação científica, tema ainda escasso de estudos. Para a estruturação do corpus, foram coletados, a partir do termo divulgação científica 226 artigos científicos sobre a divulgação científica, publicados entre 2013 e 2018, na ferramenta Google Acadêmico. 
Como critérios de recorte, delimitou-se que a autoria fosse de professores-pesquisadores de pós-graduação stricto sensu de universidades brasileiras, vinculados a sete grandes áreas do conhecimento da CAPES, o que resultou em 114 artigos. Observou-se que a aplicação de procedimentos metodológicos fundamentados nos preceitos da Linguística de Corpus (LC) e o uso de uma ferramenta de análise de corpus – Wordsmith Tools – auxiliou na delimitação do objeto de estudo e na organização dos discursos.
- Published
- 2022
- Full Text
- View/download PDF
31. Epistemologies of evidence-based medicine: a plea for corpus-based conceptual research in the medical humanities
- Author
-
Jan Buts, Mona Baker, Eivind Engebretsen, and Saturnino Luz
- Subjects
Social epistemology ,Modern medicine ,Evidence-based medicine ,Health (social science) ,Education ,Basic concept ,03 medical and health sciences ,Humanities ,0302 clinical medicine ,Corpora ,Corpus linguistics ,Conceptual history ,Humans ,Medical humanities ,030212 general & internal medicine ,Sociology ,Evidence ,Reductionism ,030503 health policy & services ,Health Policy ,Scientific Contribution ,Epistemology ,Knowledge ,Philosophy of medicine ,0305 other medical science - Abstract
Evidence-based medicine has been the subject of much controversy within and outside the field of medicine, with its detractors characterizing it as reductionist and authoritarian, and its proponents rejecting such characterization as a caricature of the actual practice. At the heart of this controversy is a complex linguistic and social process that cannot be illuminated by appealing to the semantics of the modifier evidence-based. The complexity lies in the nature of evidence as a basic concept that circulates in both expert and non-expert spheres of communication, supports different interpretations in different contexts, and is inherently open to contestation. We outline a new methodology that combines a social epistemological perspective with advanced methods of corpus linguistics and elements of conceptual history to investigate this and other basic concepts that underpin the practice and ethos of modern medicine. The potential of this methodology to offer new insights into controversies such as those surrounding EBM is demonstrated through a case study of the various meanings supported by evidence and based, as attested in a large electronic corpus of online material written by non-experts as well as a variety of experts in different fields, including medicine.
- Published
- 2021
32. A cognitive approach to the English as-predicate construction
- Author
-
Yoichiro, Hasebe
- Subjects
constructions ,prototype-categories ,subjectivity ,corpora ,835.1 - Abstract
論文(Article)
- Published
- 2021
33. O carpinteiro e a madeira: a constituição de corpora jurídicos em perspectiva etnometodológica / The carpenter and the wood: the constitution of legal data from an ethnomethodological perspective
- Author
-
Rubens Damasceno-Morais
- Subjects
lcsh:Language and Literature ,Linguistics and Language ,Argumentative ,transcription of oral data ,Constitution ,Philosophy ,media_common.quotation_subject ,corpora ,ethnomethodology ,tribunal ,Language and Linguistics ,Doctoral research ,Education ,transcrição de dados orais ,lcsh:Philology. Linguistics ,argumentação ,Tribunal ,lcsh:P1-1091 ,etnometodologia ,argumentation ,court ,lcsh:P ,Eristic ,Humanities ,media_common - Abstract
Resumo: Este artigo propõe-se a relatar uma experiência de pesquisa com corpora complexos, a fim de compartilhar o processo e procedimentos de elaboração de um banco de dados instituído precipuamente para pesquisa doutoral, empreendida na Université Lumière Lyon II/França, no laboratório ICAR, cuja especialidade é, justamente, o trabalho com a análise de corpora em diversos níveis de extensão e complexidade. A partir de uma perspectiva etnometodológica (MONDADA, 2008; OCHS, SCHEGLOFF, THOMPSON, 1996; SCHEGLOFF, 1999; TRAVERS, 2001; TRAVERSO, 2007), numa imersão em território jurídico (CORNU, 2005; DUPRET, 2006; LATOUR, 2004), a pesquisa ora relatada buscou descrever e analisar como os magistrados realizam a gestão do desacordo, em situações, muitas vezes, acentuadamente erísticas. Sem nos distanciarmos dos estudos teóricos acerca dos preceitos de metodologia de trabalhos acadêmicos em geral (GIL, 2002; MOTTA-ROTH; HENDGES, 2010; SALOMON, 2014), constituímos um banco de dados balizados pela noção de situação argumentativa, uma noção da retórica antiga retomada por Plantin (1993, 1995, 1996, 2016), a qual põe em destaque situações de conflito de opiniões, em contextos argumentativos vários. A partir da exaustiva e intricada transcrição dupla dos dados (BAUDE, 2006; BLANCHE-BENVENISTE, 2008; KERBRAT-ORECCHIONI, 2006), a pesquisa culminou na confirmação de que o discurso jurídico está longe de ser frio e asséptico e que as interações argumentativas naquele contexto se analisadas no calor das deliberações têm muito a nos ensinar sobre o argumentar em contexto institucional. Isso pode ser conferido em quatro capítulos analíticos cujo planejamento e execução ora trazemos a lume, a partir do estudo do direito em ação, isto é, em situação de interação, por meio de deliberações de magistrados em processos de danos morais, num tribunal brasileiro de Segunda Instância. Palavras-chave: etnometodologia; corpora; argumentação; tribunal; transcrição de dados orais.
Abstract: This article reports a research experience with complex corpora, with the aim of sharing the backstage process of building a database set up mainly for doctoral research undertaken at the Université Lumière Lyon II (France), in the ICAR laboratory, whose specialty is precisely the analysis of corpora at different levels of extension and complexity. From an ethnomethodological perspective (MONDADA, 2008; OCHS, SCHEGLOFF, THOMPSON, 1996; SCHEGLOFF, 1999; TRAVERS, 2001; TRAVERSO, 2007), in an immersion in legal territory (CORNU, 2005; DUPRET, 2006; LATOUR, 2004), the research reported here sought to describe and analyze how magistrates manage disagreement in situations that are often markedly eristic. Without distancing ourselves from theoretical studies on the precepts of methodology for academic work in general (GIL, 2002; MOTTA-ROTH; HENDGES, 2010; SALOMON, 2014), we built a database guided by the notion of argumentative situation, a notion from ancient rhetoric taken up again by Plantin (1993, 1995, 1996, 2016), which highlights situations of conflict of opinion in various argumentative contexts. Based on the exhaustive and intricate double transcription of the data (BAUDE, 2006; BLANCHE-BENVENISTE, 2008; KERBRAT-ORECCHIONI, 2006), the research culminated in the confirmation that legal discourse is far from being cold and aseptic and that argumentative interactions in that context, if analyzed in the heat of deliberations, have much to teach us about arguing in an institutional context. This can be seen in four analytical chapters whose planning and execution we now bring to light, from the study of law in action, that is, in a concrete situation, through the deliberations of magistrates in moral damages cases in a Brazilian court of Second Instance. Keywords: ethnomethodology; corpora; argumentation; court; transcription of oral data.
- Published
- 2021
34. Back to the Drawing Board: A Longitudinal Study of Fossilized Errors
- Author
-
Hiroaki Watanabe and Robert Long
- Subjects
Longitudinal study ,Fossilized Errors ,Corpora ,Drawing board ,Computer science ,Error Correction ,Visual arts - Abstract
Fossilized errors have been a persistent issue for EFL researchers because they suggest that traditional methods of instruction are not effective. Fossilized errors were thus examined with university-level first-year Japanese EFL students to better understand the context in which they occur and their frequency over the course of an academic year. Data were collected from two corpora: the Monologic and Dialogic Corpus (MDC) 2019, which has 20,368 words and 42 subjects, and MDC2020, which has 16,997 words and 29 participants. Errors in the 2019/2020 corpora were identified and then coded for frequency; results showed the following fossilized errors: article deletions (92/94), prepositions (39/43), plurals (54/55), subject-verb agreement (85/46), and general wording (60/69). However, in terms of clauses with errors per 100 words, there were 5.29 errors in the 2019 corpus, whereas the 2020 corpus showed a slight improvement to 3.35 errors/100 words, indicating that marginal progress was made. These results show that many of these errors are interlingual and that students are unaware of the errors they make in their spontaneous speech. Alternative methods of instruction are thus needed in EFL education to promote awareness and self-editing skills. The IAFOR International Conference on Education (IICE2021), January 6-9, 2021, Hawaii, USA (changed to an online format due to the spread of COVID-19)
- Published
- 2021
35. CORPORA AND CORPUS LINGUISTIC APPROACHES TO STUDY BUSINESS LANGUAGE
- Author
-
Nafruza Esanboyevna Azizova
- Subjects
sketch engine ,english for specific purposes (esp) ,english for academic purposes (eap) ,InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL ,corpus linguistics ,corpora ,lcsh:Education (General) ,ComputingMethodologies_PATTERNRECOGNITION ,british national corpus (bnc) ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,lcsh:Science (General) ,lcsh:L7-991 ,corporate social responsibility(csr) ,lcsh:Q1-390 - Abstract
This article deals with the corpora and corpus linguistic approaches to study business language. The term corpus is understood more specifically as a compilation of naturally-occurring texts stored electronically and available for quantitative and qualitative analysis.
- Published
- 2021
36. Penile Fracture - Our Initial Experience and Outcome
- Author
-
Tajamul Rashid, Fayaz Ahmad Najar, and Peer Hilal Ahmad Makhdoomi
- Subjects
medicine.medical_specialty ,business.industry ,lcsh:R5-130.5 ,Penile fracture ,030232 urology & nephrology ,corpora ,medicine.disease ,Outcome (game theory) ,Surgery ,03 medical and health sciences ,0302 clinical medicine ,fracture ,030220 oncology & carcinogenesis ,medicine ,business ,lcsh:General works - Abstract
BACKGROUND Penile fractures occur when the engorged penile corpora are forced to buckle and “pop” under the pressure of blunt sexual trauma, due to slippage of the penis out of the vagina during intercourse. Patients typically describe a “plop” sound followed by immediate detumescence, severe pain, and swelling, known as the “egg-plant” deformity, as a result of the injury. Immediate surgical exploration with evacuation of the haematoma and repair of the tunica albuginea defect is the ideal treatment. METHODS Over a period of more than 3 years, between May 2015 and January 2019, we treated 26 patients with penile fractures. All of them presented within 24 hours after sustaining the injury. None had an associated urethral injury. Apart from clinical examination and history, the investigation we most commonly used to aid diagnosis was ultrasound (USG) with colour Doppler, which helped in identifying the site and size of the defect as well as the blood collections. All were treated by surgical exploration. RESULTS Patients were discharged on either the 2nd or 3rd post-operative day. None of our patients developed a post-operative wound infection. Post-operative haematoma developed in 1 patient. 1 patient complained of a slight bend of the penis to the affected side but had no sexual problem. There was no history of erectile dysfunction in any of these patients. CONCLUSIONS To diagnose penile fracture, our study relied mainly on history and physical examination and did not recommend imaging, except in patients with possible urethral injuries. Immediate surgical intervention can yield good functional results, and surgical exploration can be considered in all cases of penile fracture. The procedure is simple, with minimal morbidity and a short hospital stay. KEYWORDS Fracture, Corpora, Tear
- Published
- 2020
37. Text+: Language- and text-based Research Data Infrastructure
- Author
-
Hinrichs, Erhard, Geyken, Alexander, Leinen, Peter, Speer, Andreas, Stein, Regine, Blumtritt, Jonathan, Borek, Luise, Eckart, Thomas, Engelberg, Stefan, Grötschel, Martin, Henrich, Andreas, Heyer, Gerhard, Horstmann, Wolfram, Jefferies, Neil, Kudella, Christoph, Lobin, Henning, Müller-Spitzer, Carolin, Neuber, Frederike, Neuefeind, Claes, Rapp, Andrea, Rißler-Pipka, Nanette, Teich, Elke, Thomas, Christian, Trippel, Thorsten, Wieder, Philipp, Witt, Andreas, Arnold, Denis, Bopp, Jutta, Buddenbohm, Stefan, Calvo Tello, José, Fisseni, Bernhard, Gradl, Tobias, Grumt-Suarez, Melanie, Jahnke, Alexander, Lemnitzer, Lothar, Scholze, Frank, Schöch, Christof, and Walker, Nathalie
- Subjects
language ,Infrastructure/Operations ,Text+ ,philologies ,corpora ,Collections ,Editions ,Nationale Forschungdateninfrastruktur ,humanities ,spoken language ,NFDI ,written language ,test ,Textplus ,language documentation ,research data management ,research data infrastructure ,digital humanities ,Lexical Resources ,multimodality - Abstract
All persons listed have actively contributed to the Text+ proposal. In addition, thanks go to Alexander Czmiel, Sonja Friedrichs, Anne Klammt, Wolfgang Klein, Frank Michaelis, Stefan Schmunk, Alexander Steckel and Roberta Toscano. Text+ aims to develop a research data infrastructure for Humanities disciplines and beyond whose primary research focus is on language and text. Text+ will be flexible, scalable, and thus open for different discipline-specific requirements. By offering easy access to high quality research data, Text+ will support a maximum of methodological diversity, which in turn is a prerequisite for innovative and transdisciplinary research. Text+ focuses on Collections, Lexical Resources and Editions. These data domains have a long tradition of research and are linked to mature methodological paradigms that require distinctive but also cross-disciplinary practices of data generation, curation and management. The three types of research data are indispensable for a wide range of Humanities disciplines, including, but not limited to, Classical Philology, Linguistics, Literary Studies, Social and Cultural Anthropology, Non-European Cultures, Jewish Studies and Religious Studies, Philosophy, and language- and text-based research in the Social and Political Sciences. From the outset, 26 data centres will participate in Text+ that are technically sound and that are highly regarded in their fields of specialisation. They will provide data, tools, and services for the analysis and re-use of research data across a broad range of disciplines. By grouping data, tools, and services into thematic clusters, an optimal bundling is achieved. There are 34 institutions participating in Text+ that represent the communities addressed by Text+ as broadly as possible: research libraries, universities, Digital Humanities data centres as well as members of the Union of German Academies of Arts and Sciences and of the Leibniz Society. 
In addition, leading computing centres ensure robust and persistent operation of services for a distributed research data infrastructure. The high level of interest in Text+ is not only evidenced by the substantial in-kind contributions by the Text+ partner institutions, but is also documented by the more than 120 research-driven user stories and by the large number of letters of support from the communities of interest participating in Text+. At the heart of the governance structure are three scientific coordination committees for the data domains and one for the infrastructure. Their task is to continuously evaluate the portfolio of data, tools and services and to promote its further development according to the priorities of the participating disciplines in coordination with the infrastructure providers. The research data management strategy of Text+ is the core instrument for achieving the main objectives of Text+ in the NFDI context. It paves the way for the integration of data, tools and services into an infrastructure that meets relevant standards and implements the FAIR and CARE principles.
- Published
- 2022
- Full Text
- View/download PDF
38. Text+: Language- and text-based Research Data Infrastructure
- Author
-
Hinrichs, Erhard, Leinen, Peter, Geyken, Alexander, Speer, Andreas, and Stein, Regine
- Subjects
language ,Infrastructure/Operations ,Text+ ,philologies ,corpora ,Collections ,Editions ,Nationale Forschungdateninfrastruktur ,humanities ,spoken language ,NFDI ,written language ,test ,Textplus ,language documentation ,research data management ,research data infrastructure ,digital humanities ,Lexical Resources ,multimodality - Abstract
Text+ aims to develop a research data infrastructure for Humanities disciplines and beyond whose primary research focus is on language and text. Text+ will be flexible, scalable, and thus open for different discipline-specific requirements. By offering easy access to high quality research data, Text+ will support a maximum of methodological diversity, which in turn is a prerequisite for innovative and transdisciplinary research. Text+ focuses on Collections, Lexical Resources and Editions. These data domains have a long tradition of research and are linked to mature methodological paradigms that require distinctive but also cross-disciplinary practices of data generation, curation and management. The three types of research data are indispensable for a wide range of Humanities disciplines, including, but not limited to, Classical Philology, Linguistics, Literary Studies, Social and Cultural Anthropology, Non-European Cultures, Jewish Studies and Religious Studies, Philosophy, and language- and text-based research in the Social and Political Sciences. From the outset, 26 data centres will participate in Text+ that are technically sound and that are highly regarded in their fields of specialisation. They will provide data, tools, and services for the analysis and re-use of research data across a broad range of disciplines. By grouping data, tools, and services into thematic clusters, an optimal bundling is achieved. There are 34 institutions participating in Text+ that represent the communities addressed by Text+ as broadly as possible: research libraries, universities, Digital Humanities data centres as well as members of the Union of German Academies of Arts and Sciences and of the Leibniz Society. In addition, leading computing centres ensure robust and persistent operation of services for a distributed research data infrastructure. 
The high level of interest in Text+ is not only evidenced by the substantial in-kind contributions by the Text+ partner institutions, but is also documented by the more than 120 research-driven user stories and by the large number of letters of support from the communities of interest participating in Text+. At the heart of the governance structure are three scientific coordination committees for the data domains and one for the infrastructure. Their task is to continuously evaluate the portfolio of data, tools and services and to promote its further development according to the priorities of the participating disciplines in coordination with the infrastructure providers. The research data management strategy of Text+ is the core instrument for achieving the main objectives of Text+ in the NFDI context. It paves the way for the integration of data, tools and services into an infrastructure that meets relevant standards and implements the FAIR and CARE principles.
- Published
- 2022
- Full Text
- View/download PDF
39. Language and the pandemic: The construction of semantic frames in Greek-German comparison
- Author
-
Nikolaos Katsaounis
- Subjects
Cultural Studies ,Linguistics and Language ,Coronavirus disease 2019 (COVID-19) ,pandemic ,First language ,frame semantics ,Context (language use) ,corpora ,Structuring ,cross-linguistic analysis ,language.human_language ,Linguistics ,Education ,lcsh:Philology. Linguistics ,German ,covid-19 ,lcsh:P1-1091 ,Pandemic ,Frame semantics ,language ,Frame (artificial intelligence) ,Sociology ,lcsh:L ,lcsh:Education - Abstract
This paper aims to provide an insight into the way native speakers of different first languages (L1) who live in the same country and are therefore influenced to the same degree by the current Covid-19 pandemic (e.g., share the same everyday experiences and are confronted with the same linguistic input in the same context) frame Covid-19 related events. More specifically, it compares the framing of L1 speakers of German and L1 speakers of Greek, all residing in Greece during the pandemic. Our goal is to unveil commonalities and differences in the structuring of concepts using a frame-semantic approach. In order to investigate the frames that are indexed when talking about experiences, topics, and concepts newly introduced by the Covid-19 pandemic, we chose to build a small bilingual corpus based on participants’ answers in surveys in the respective languages. We use preliminary data to evaluate the feasibility of the theoretical as well as the methodological approach. This paper presents the pilot phase of a broader project whose final conclusions will be available in 2021. © Nikolaos Katsaounis 2020
- Published
- 2020
40. Representación de la fraseología del español en herramientas digitales
- Author
-
Matteo DE BENI
- Subjects
fraseografía ,Linguistics and Language ,fraseología ,idioms ,diccionarios digitales ,digital dictionaries ,corpus ,corpora ,locuciones ,locuciones, fraseografía, corpus, diccionarios digitales ,Language and Linguistics ,phraseography - Abstract
Español: La eclosión de la informática y su aplicación a las humanidades redundan en la realización de herramientas digitales dedicadas, entre otros niveles de análisis lingüístico, al léxico. Dentro de dicho ámbito, el interés está aquí puesto concretamente en la fraseología y la fraseografía del español y en su presencia y representación tanto en corpus como en diccionarios electrónicos. Precisamente la lexicografía práctica, aunque puede sin duda aprovechar los recursos del medio electrónico, ha quedado sacudida por los avances técnicos y se encuentra ahora en una fase de crisis, que se podrá convertir en una enorme oportunidad. Tras ofrecer una aproximación a los aspectos aludidos, en este artículo se presenta el volumen monográfico Representación de la fraseología en herramientas digitales: problemas, avances, propuestas y los trabajos que lo conforman. English: The emergence of computing and its application to the humanities have resulted in the realization of digital tools dedicated, among other levels of linguistic analysis, to the lexicon. Within this field, the interest is here specifically placed on Spanish phraseology and phraseography and on their presence and representation both in corpora and in electronic dictionaries. Although it can undoubtedly take advantage of the resources of the electronic medium, practical lexicography in particular has been shaken by technological advancements and is now in a phase of crisis, which may turn out to become an enormous opportunity. After offering an overview of the aforementioned aspects, this article presents the monographic volume Representación de la fraseología en herramientas digitales: problemas, avances, propuestas and the works included.
- Published
- 2020
41. Labels and Usages in English Dictionaries : Corpus Evidence
- Subjects
linguistic labels ,English dictionaries ,corpora - Published
- 2020
42. From the world to word order: Deriving biases in noun phrase order from statistical properties of the world
- Author
-
Jennifer Culbertson, Simon Kirby, and Marieke Schouwstra
- Subjects
Typology ,Linguistics and Language ,Computer science ,02 engineering and technology ,corpora ,computer.software_genre ,Information theory ,050105 experimental psychology ,Language and Linguistics ,0202 electrical engineering, electronic engineering, information engineering ,0501 psychology and cognitive sciences ,silent gesture ,information theory ,060201 languages & linguistics ,business.industry ,05 social sciences ,06 humanities and the arts ,word order ,Noun phrase ,Order (business) ,0602 languages and literature ,020201 artificial intelligence & image processing ,Artificial intelligence ,typology ,business ,computer ,Natural language processing ,Word order - Abstract
The world’s languages exhibit striking diversity. At the same time, recurring linguistic patterns suggest the possibility that this diversity is shaped by features of human cognition. One well-studied example is word order in complex noun phrases (like these two red vases). While many orders of these elements are possible, a subset appear to be preferred. It has been argued that this order reflects a single underlying representation of noun phrase structure, from which preferred orders are straightforwardly derived (e.g. Cinque 2005). Building on previous experimental evidence using artificial language learning by Culbertson & Adger (2014), we show that these preferred orders arise not only in existing languages, but also in improvised sequences of gestures produced by English speakers. We then use corpus data from a wide range of languages to argue that the hypothesized underlying structure of the noun phrase might be learnable from statistical features relating objects and their properties conceptually. Using an information-theoretic measure of strength of association, we find that adjectival properties (e.g. red) are on average more closely related to the objects they modify (e.g. wine) than numerosities (e.g. two), which are in turn more closely related to the objects they modify than demonstratives (e.g. this). It is exactly those orders which transparently reflect this, by placing adjectives closest to the noun and demonstratives farthest away, that are more common across languages and preferred in our silent gesture experiments. These results suggest that our experience with objects in the world, combined with a preference for transparent mappings from conceptual structure to linear order, can explain constraints on noun phrase order.
- Published
- 2020
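Note on record 42: the abstract refers to "an information-theoretic measure of strength of association" without naming it. As a rough illustration only, the sketch below assumes pointwise mutual information (PMI), which may not be the measure the authors used. The modifier-noun pairs, their counts, and the function name `pmi` are all invented for this example and are not data from the study.

```python
import math
from collections import Counter


def pmi(pairs, modifier, noun):
    """Pointwise mutual information (in bits) between a modifier and a noun.

    `pairs` is a list of (modifier, noun) co-occurrences. PMI is an assumed
    stand-in for the unnamed association measure in the abstract.
    """
    total = len(pairs)
    joint = Counter(pairs)[(modifier, noun)]
    modifier_count = sum(1 for m, _ in pairs if m == modifier)
    noun_count = sum(1 for _, n in pairs if n == noun)
    if joint == 0:
        return float("-inf")  # the pair never co-occurs
    # log2( p(m, n) / (p(m) * p(n)) ), written with raw counts
    return math.log2(joint * total / (modifier_count * noun_count))


# Hypothetical toy counts: "red" is choosy about its noun, "two" is not.
toy_pairs = (
    [("red", "wine")] * 6 + [("red", "car")] * 2
    + [("two", "wine")] * 2 + [("two", "car")] * 3 + [("two", "book")] * 3
)

adjective_score = pmi(toy_pairs, "red", "wine")
numeral_score = pmi(toy_pairs, "two", "wine")
print(f"PMI(red, wine) = {adjective_score:.2f} bits")
print(f"PMI(two, wine) = {numeral_score:.2f} bits")
```

In this toy setting the adjective scores higher than the numeral, which mirrors the direction of the finding reported in the abstract. It says nothing about the real corpus data.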
43. Predicative Adverbs and Adjectives with Infinitival Subjects. A Corpus Investigation
- Author
-
Agnieszka Patejuk and Adam Przepiórkowski
- Subjects
korpusy ,Linguistics and Language ,język polski ,Computer science ,modalność nieepistemiczna ,podmioty bezokolicznikowe ,predicative adjectives ,corpora ,Language and Linguistics ,Linguistics ,przymiotniki predykatywne ,non-epistemic modality ,Polish ,predicative adverbs ,przysłówki predykatywne ,Predicative expression ,infinitival subjects - Abstract
Celem artykułu jest porównanie ze sobą dwóch konstrukcji predykatywnych w języku polskim, w których podmiotem jest fraza bezokolicznikowa: konstrukcji z przysłówkami predykatywnymi i konstrukcji z przymiotnikami predykatywnymi. Ta ostatnia konstrukcja, o postaci "przymiotnik predykatywny + łącznik + podmiot bezokolicznikowy", nie została wcześniej opisana w polskiej literaturze dotyczącej predykacji, łączników czy podmiotów bezokolicznikowych. Na podstawie danych korpusowych, przede wszystkim z Narodowego Korpusu Języka Polskiego, pokazujemy, że konstrukcja ta jest znacznie rzadsza niż analogiczna konstrukcja z przysłówkami predykatywnymi. Twierdzimy także, że w zasadzie te same predykaty mogą zostać zrealizowane albo jako przysłówki, albo jako przymiotniki, gdy podmiotem jest fraza bezokolicznikowa - obserwowane różnice nie mają charakteru systemowego, a wynikają jedynie z braków w leksykonie lub z tego, że nie zawsze przymiotniki i odpowiadające im przysłówki mają te same zestawy znaczeń. W szczególności pewne predykaty wyrażające modalność nieepistemiczną mogą być wyrażone tylko za pomocą przymiotników, gdyż odpowiadające im przysłówki nie wyrażają takiej modalności. Artykuł omawia także nowe dane korpusowe stanowiące dodatkowy argument za hipotezą, że przyczyną znacznie niższej frekwencji przymiotników predykatywnych niż przysłówków w omawianych konstrukcjach jest to, że - aby możliwe było połączenie przymiotnika predykatywnego z podmiotem bezokolicznikowym - podmiot ten musi ulec składniowej nominalizacji, podczas gdy przysłówki mogą łączyć się z podmiotami bezokolicznikowymi bezpośrednio; stąd preferencja dla składniowo prostszych konstrukcji z przysłówkami. The aim of this paper is to compare two Polish predicative constructions with infinitival subjects, namely those with predicative adverbs and those with predicative adjectives. 
The latter construction, of the form "predicative adjective + copula + infinitival subject" has hardly been noticed in Polish literature on predication, copulas, or infinitival subjects. On the basis of corpus data, mainly from the National Corpus of Polish, we demonstrate that this construction is much rarer than the analogous construction with predicative adverbs. We also show that roughly the same predicates may be expressed as either adverbs or as adjectives when the subject is an infinitival phrase - any observed differences are not systematic but rather stem from lexical gaps and differences in the meanings of particular adverbs and adjectives. In particular, certain modal predicates may only be expressed as adjectives because the corresponding adverbs do not express the same non-epistemic modal meanings. Finally, we provide new corpus evidence for an earlier claim that predicative adjectives are much rarer than adverbs when the subject is infinitival because they require this subject to undergo covert nominalisation; as adverbs combine with infinitival subjects directly, they are usually preferred.
- Published
- 2020
44. Application of Data Mining Methods in Internet of Things Technology for the Translation Systems in Traditional Ethnic Books
- Author
-
Yujing Luo and Yueting Xiang
- Subjects
Phrase ,General Computer Science ,Scale (ratio) ,Machine translation ,business.industry ,Computer science ,Internet of Things ,General Engineering ,Ethnic group ,translation ,data mining ,computer.software_genre ,Translation (geometry) ,Fluency ,Corpora ,traditional national classic ,General Materials Science ,lcsh:Electrical engineering. Electronics. Nuclear engineering ,Artificial intelligence ,business ,lcsh:TK1-9971 ,computer ,Natural language processing ,Decoding methods - Abstract
In order to translate the ethnic classics, based on the research on the Internet of things, machine learning, and translation technology of ethnic classics, the log-linear model is combined with the national corpus scale and the grammatical structure characteristics, and the phrase statistical machine translation is used to establish a discontinuous phrase extraction model. Then, the translation technology is studied from the three aspects of model definition, training, and decoding. Finally, the algorithm is compared with the traditional phrase extraction algorithm to verify its effectiveness. The results show that the extraction number of discontinuous phrase extraction model is significantly higher than that of traditional phrase extraction model, and the model can extract more phrases, handle larger and more complex text, and score higher in translation fluency. From the evaluation indexes scores of Bilingual Evaluation Understudy (B.L.E.U.) and National Institute of Standards and Technology (N.I.S.T.), it can be found that the B.L.E.U. and N.I.S.T. values of the traditional phrase extraction algorithm are lower than those of the discontinuous phrase extraction model algorithm. The discontinuous phrase extraction algorithm can not only extract the regular continuous phrase, but also obtain the discontinuous text, and the translation effect is better. In conclusion, the combination of Internet of things and machine learning can be used in the translation of ethnic classics to achieve high-quality translation of discontinuous phrases, which is of guiding significance for the study of machine translation.
- Published
- 2020
45. Frequency, Informativity and Word Length: Insights from Typologically Diverse Corpora
- Author
-
Natalia Levshina
- Subjects
Zipf’s law of abbreviation ,frequency ,informativity ,n-grams ,corpora ,linguistic typology ,General Physics and Astronomy - Abstract
Zipf’s law of abbreviation, which posits a negative correlation between word frequency and length, is one of the most famous and robust cross-linguistic generalizations. At the same time, it has been shown that contextual informativity (average surprisal given previous context) can be more strongly correlated with word length, although this tendency is not observed consistently, depending on several methodological choices. The present study, which examines a more diverse sample of languages than in the previous studies (Arabic, Finnish, Hungarian, Indonesian, Russian, Spanish and Turkish), reveals intriguing cross-linguistic differences, which can be explained by typological properties of the languages. I use large web-based corpora from the Leipzig Corpora Collection to estimate word lengths in UTF-8 characters, as well as word frequency, informativity given previous word and informativity given next word, applying different methods of bigrams processing. The results show consistent cross-linguistic differences in the size of correlations between word length and the corpus-based measures. I argue that these differences can be explained by the properties of noun phrases in a language, most importantly, the order of heads and modifiers and their relative morphological complexity, as well as by orthographic conventions.
- Published
- 2022
- Full Text
- View/download PDF
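Note on record 45: Zipf's law of abbreviation, as described in the abstract, predicts a negative correlation between word frequency and word length. The sketch below shows one simple way to check this on a token list. The toy sentence, the choice of Pearson correlation on log frequency, and the function names are assumptions made for illustration, not the procedure used in the study.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def length_frequency_correlation(tokens):
    """Correlate log frequency with length (in characters) over word types."""
    counts = Counter(tokens)
    words = list(counts)
    log_freq = [math.log(counts[w]) for w in words]
    lengths = [len(w) for w in words]  # len() counts Unicode code points
    return pearson(log_freq, lengths)


# Hypothetical toy text; a real study would use a large corpus.
toy_tokens = (
    "the cat sat on the mat and the dog sat on the extraordinarily "
    "comfortable cushion"
).split()

r = length_frequency_correlation(toy_tokens)
print(f"correlation(log frequency, length) = {r:.2f}")
```

A negative value is the pattern the law predicts. The informativity measures mentioned in the abstract (average surprisal given the previous or next word) would additionally require bigram counts and are not covered by this sketch.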
46. Editorial: Emergentist Approaches to Language
- Author
-
Brian MacWhinney, Vera Kempe, Patricia J. Brooks, and Ping Li
- Subjects
crosslinguistic ,phonology ,constructions ,Psychology ,emergentism ,corpora ,General Psychology ,child language ,BF1-990 - Published
- 2022
47. Іноземна мова професійного спрямування: методичні аспекти викладання
- Author
-
Tetiana Poliakova and Viktoriia Samarina
- Subjects
мова для спеціальних цілей ,languages for specific purposes ,technical language ,LSP teaching ,корпуси ,LSP research ,corpora ,дослідження мови для спеціальних цілей ,технічна мова ,викладання мови для спеціальних цілей - Abstract
Teaching a technical language is becoming increasingly important in modern European society, which is characterized by movements of various kinds. In addition, groups of students are becoming more differentiated, and teachers, who are not usually experts in a particular field, face difficulties when creating courses for students because opportunities for training and advanced training for teachers are limited. Many problems remain open or only partially analyzed, and there is not just one way to solve them. Nevertheless, we would like to present some solutions to these problematic issues and show how, and with which linguistic means and instruments, specialized languages can be described and what impact this can have on teaching. Викладання технічної мови стає дедалі актуальнішим у сучасному європейському суспільстві, що характеризується різними «рухами», проте групи студентів стають більш диференційованими, а викладачі, які зазвичай не є експертами в певній галузі, зазнають труднощів зі створенням курсів для студентів, оскільки можливості для навчання чи підвищення кваліфікації трапляються не часто. Є багато питань, які залишаються відкритими або частково розкритими, і єдина відповідь на них не завжди можлива, однак все одно варто спробувати представити вирішення таких проблемних питань. Необхідно показати, як саме та за допомогою яких засобів та інструментів можна описати спеціалізовані мови, а також те, який вплив це може чинити на викладання.
- Published
- 2022
48. Discontinuous reduplication: a typological sketch
- Author
-
Simone Mattiola, Francesca Masini, Mattiola, Simone, and Masini, Francesca
- Subjects
repetition ,Italian ,General Medicine ,discourse(-sensitive) typology ,corpora ,discontinuity ,typology ,reduplication - Abstract
The paper investigates discontinuous reduplication (DR), a pattern where reduplicant and base are separated by other material, by annotating a 214-example dataset collected from a 99-language sample. Several items turned out to serve as interposing elements, although their nature does not seem to correlate with function, unlike the category of the base. DR’s functions are a subset of those associated with reduplication cross-linguistically. All languages displaying DR also present contiguous reduplication, suggesting a contiguous reduplication > discontinuous reduplication hierarchy. Finally, a corpus-based analysis of Italian (lacking DR according to grammars) unveiled a wealth of DR patterns, suggesting that corpora are essential for the typological enterprise.
- Published
- 2022
49. Research in Audiovisual Translation: files, methodology and tools
- Author
-
Briales Bellón, Isabel
- Subjects
Tools ,Traducción audiovisual ,Corpora ,Unidad de traducción ,Translation unit ,8- Lingüística y literatura::80 - Cuestiones generales relativas a la lingüística y literatura. Filología [CDU] ,Corpus ,Herramientas ,Metodología ,Audiovisual translation ,Metodology - Abstract
Corpas Pastor (2008: 89) notes that corpus linguistics is considered a valid methodological approach to language studies, relevant to both translation and interpreting insofar as it can be used to analyse and describe a wide range of discourse types, reflect on translation units and re-examine the concept of equivalence. Taking as its starting point the fact that it is possible to work with corpora in undertakings as varied as researching language variation, performing professional translation tasks and developing language teaching methods, this paper describes the stages involved in comparing the original script of the film Bienvenue chez les Ch'tis with its dubbed and subtitled versions in Spanish. We highlight the importance of properly preparing the working material in order to make the most of it. In this case, we used the programs Final Draft, AntConc and WinAlign to analyse the texts in depth, easily locate segments of interest and create translation units. The result is a systematic methodology that allows research in Audiovisual Translation to be carried out rigorously.
- Published
- 2022
50. Corpora.unito.it
- Author
-
Barbera, Manuel, Corino, Elisa, Marello, Carla, and Onesti, Cristina
- Subjects
Corpus linguistics ,corpora ,newsgroup ,language variety ,CQP ,written language
- Published
- 2022