111 results for "corpus construction"
Search Results
2. Research perspectives and trends in Artificial Intelligence-enhanced language education: A review
- Author
- Zheng, Lu and Yang, Yong
- Published
- 2024
- Full Text
- View/download PDF
3. Bovine Viral Diarrhea Virus Named Entity Recognition Based on BioBERT and MRC.
- Author
- Li, YinFei, Bai, YunLi, Wang, RuLin, and Zhou, WeiGuang
- Subjects
- BOVINE viral diarrhea virus, READING comprehension, DATA mining, FEATURE extraction, CATTLE industry
- Abstract
Utilizing deep learning for data mining in biomedicine often falls short in leveraging prior knowledge and adapting to the complexities of biomedical literature mining. Entity recognition, a fundamental task in information extraction, also provides data support for downstream Natural Language Processing (NLP) tasks. Bovine Viral Diarrhea Virus (BVDV) results in considerable economic losses in the cattle industry due to calf diarrhea, bovine respiratory syndrome, and cow abortion. This study aims to extract information on BVDV from relevant literature and build a knowledge base. It enhances feature extraction in the BioBERT pre-trained model using the Machine Reading Comprehension (MRC) framework for information fusion and bi-directionally extracts corpus information through the Bi-LSTM network, followed by a CRF layer for decoding and prediction. The results show the construction of a BVDV Corpus with 22 biomedical entities and introduce the BioBERT-Bi-LSTM-CRF Integrated with MRC (BBCM) model for Named Entity Recognition (NER), combining prior knowledge and the MRC framework. The BBCM model achieves F1-scores of 78.79% and 76.3% on the public datasets JNLPBA and GENIA, respectively, and 67.52% on the BVDV Corpus, outperforming other models. This research presents a targeted NER method for BVDV, effectively identifying related entities and exploring their relationships, thus providing valuable data support for downstream NLP tasks. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
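The tagging stack in result 3, BioBERT features fed through a Bi-LSTM with CRF decoding, can be sketched in a few lines of Python. This is a minimal illustration, not the authors' BBCM code: the checkpoint name and tag count are assumptions, and the CRF layer is reduced to a comment.

    import torch.nn as nn
    from transformers import AutoModel, AutoTokenizer

    class BiLSTMTagger(nn.Module):
        """BioBERT embeddings -> Bi-LSTM -> per-token tag emissions."""
        def __init__(self, name="dmis-lab/biobert-base-cased-v1.1",
                     hidden=256, num_tags=45):  # 22 entity types in BIO plus O (assumed)
            super().__init__()
            self.encoder = AutoModel.from_pretrained(name)
            self.lstm = nn.LSTM(self.encoder.config.hidden_size, hidden,
                                batch_first=True, bidirectional=True)
            self.emit = nn.Linear(2 * hidden, num_tags)

        def forward(self, input_ids, attention_mask):
            h = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
            h, _ = self.lstm(h)
            return self.emit(h)  # a CRF layer would decode these emissions into tag paths

    tok = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
    batch = tok(["BVDV causes calf diarrhea ."], return_tensors="pt")
    emissions = BiLSTMTagger()(batch["input_ids"], batch["attention_mask"])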
4. Semantic Augmentation in Chinese Adversarial Corpus for Discourse Relation Recognition Based on Internal Semantic Elements.
- Author
- Hua, Zheng, Yang, Ruixia, Feng, Yanbin, and Yin, Xiaojun
- Subjects
- CHINESE language, DEEP learning, CORPORA, DISCOURSE
- Abstract
This paper proposes incorporating linguistic semantic information into discourse relation recognition and constructing a Semantic Augmented Chinese Discourse Corpus (SACA) comprising 9546 adversative complex sentences. In adversative complex sentences, we suggest a quadruple (P, Q, R, Qβ) representing internal semantic elements, where the semantic opposition between Q and Qβ forms the basis of the adversative relationship. P denotes the premise, and R represents the adversative reason. The overall annotation approach of this corpus follows the Penn Discourse Treebank (PDTB), except for the classification of senses. We combined insights from the Chinese Discourse Treebank (CDTB) and obtained eight sense categories for Chinese adversative complex sentences. Based on this corpus, we explore the relationship between sense classification and internal semantic elements within our newly proposed Chinese Adversative Discourse Relation Recognition (CADRR) task. Leveraging deep learning techniques, we constructed various classification models, including one that utilizes internal semantic element features, demonstrating their effectiveness and the applicability of our SACA corpus. Compared with pre-trained models, our model incorporates internal semantic element information to achieve state-of-the-art performance. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
5. The Victorian anti-vaccination discourse corpus (VicVaDis): construction and exploration.
- Author
- Hardaker, Claire, Deignan, Alice, Semino, Elena, Coltman-Patel, Tara, Dance, William, Demjén, Zsófia, Sanderson, Chris, and Gatherer, Derek
- Subjects
- ANTI-vaccination movement, CORPORA, VACCINATION mandates, SMALLPOX vaccines, VICTORIAN Period, Great Britain, 1837-1901
- Abstract
This article introduces and explores the 3.5-million-word Victorian Anti-Vaccination Discourse Corpus (VicVaDis). The corpus is intended to provide a (freely accessible) historical resource for the investigation of the earliest public concerns and arguments against vaccination in England, which revolved around compulsory vaccination against smallpox in the second half of the 19th century. It consists of 133 anti-vaccination pamphlets and publications gathered from 1854 to 1906, a span of 53 years that loosely coincides with the Victorian era (1837–1901). This timeframe was chosen to capture the period between the 1853 Vaccination Act, which made smallpox vaccination for babies compulsory, and the 1907 Act that effectively ended the mandatory nature of vaccination. After an overview of the historical background, this article describes the rationale, design and construction of the corpus, and then demonstrates how it can be exploited to investigate the main arguments against compulsory vaccination by means of widely accessible corpus linguistic tools. Where appropriate, parallels are drawn between Victorian and 21st-century vaccine-hesitant attitudes and arguments. Overall, this article demonstrates the potential of corpus analysis to add to our understanding of historical concerns about vaccination. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
6. Developing a multimodal corpus of L2 academic English from an English medium of instruction university in China.
- Author
- Chen, Yu-Hua, Harrison, Simon, Stevens, Michael Paul, and Zhou, Qianqian
- Subjects
- ENGLISH language, CHINESE-speaking students, ENGLISH as a foreign language, CORPORA, VIDEO recording
- Abstract
This paper describes the rationale for and design of a new multimodal corpus of L2 academic English from a Sino-British university in China: the Corpus of Chinese Academic Written and Spoken English (CAWSE). The unique context for this corpus provides language samples from Chinese students who use English as a second language (L2) in a preliminary-year programme, which prepares students for academic studies at university level, at a campus where English is used as the medium of instruction (EMI). Data were collected from a variety of settings, including written (i.e., exam scripts and essays) and spoken assessments (i.e., interviews and presentations), covering the full range of grades awarded to those language samples, as well as from student group interactions during teaching and learning activities. The multimodal nature of the corpus is realised through the availability of selected audio/video recordings accompanied by the orthographically transcribed text. This open-access corpus is designed to help shed light on Chinese students' academic L2 English language use in a variety of written, spoken and multimodal discourses. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
7. WCC-EC 2.0: Enhancing Neural Machine Translation with a 1.6M+ Web-Crawled English-Chinese Parallel Corpus.
- Author
- Zhang, Jinyi, Su, Ke, Tian, Ye, and Matsumoto, Tadahiro
- Subjects
- MACHINE translating, CORPORA, CHINESE language, ENGLISH language, RESEARCH personnel
- Abstract
This research introduces WCC-EC 2.0 (Web-Crawled Corpus—English and Chinese), a comprehensive parallel corpus designed for enhancing Neural Machine Translation (NMT), featuring over 1.6 million English-Chinese sentence pairs meticulously gathered via web crawling. This corpus, extracted through an advanced web crawler, showcases the vast linguistic diversity and richness of English and Chinese, uniquely spanning the rarely covered news and music domains. Our methodical approach in web crawling and corpus assembly, coupled with rigorous experiments and manual evaluations, demonstrated its superiority by achieving high BLEU scores, marking significant strides in translation accuracy and model resilience. Its inclusion of these specific areas adds significant value, providing a unique dataset that enriches the scope for NMT research and development. With the rise of NMT technology, WCC-EC 2.0 emerges not only as an invaluable resource for researchers and developers, but also as a pivotal tool for improving translation accuracy, training more resilient models, and promoting interlingual communication. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
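Corpus-level BLEU of the kind reported for WCC-EC 2.0 in result 7 can be reproduced with standard tooling. A generic sketch with toy data, not the paper's evaluation setup:

    import sacrebleu  # pip install sacrebleu

    hypotheses = ["the cat sits on the mat", "the news spread quickly"]
    # One reference stream, aligned sentence by sentence with the hypotheses.
    references = [["the cat sat on the mat", "the news spreads quickly"]]

    bleu = sacrebleu.corpus_bleu(hypotheses, references)
    print(f"BLEU = {bleu.score:.2f}")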
8. Toward establishing a knowledge graph for drought disaster based on ontology design and named entity recognition
- Author
- Yihui Fang, Dejian Zhang, and Guoxiang Wu
- Subjects
- corpus construction, deep learning, drought disaster, knowledge graph, named entity recognition, ontology design, Information technology, T58.5-58.64, Environmental technology. Sanitary engineering, TD1-1066
- Abstract
Drought disasters have caused serious impacts on the social economy and ecological environment, and these impacts are continuously exacerbated by climate warming and other factors. Drought disaster management usually involves processing a mass of isolated data from many fields expressed in different terminologies and formats. These heterogeneous data, so-called data silos, have greatly hindered information-rich drought disaster management. Establishing a drought disaster knowledge graph can facilitate the reuse of these heterogeneous data and provide references for drought disaster management; ontology design and named entity recognition are the two major challenges. Therefore, in this study, we first designed a drought disaster ontology by recognizing the major concepts in the drought disaster field and their relationships, which was implemented with an ontology modeling language. We next constructed a drought disaster corpus and an integrated entity recognition model built by combining multiple deep learning methods. Finally, we applied the integrated entity recognition model to extract information from the CNKI literature database. The integrated model shows satisfactory results in drought disaster named entity recognition. We thus conclude that combining ontology and deep learning technology to establish a knowledge graph for drought disasters is promising. HIGHLIGHTS: Ontology was used to construct the schema for drought disaster knowledge graphs; a corpus of drought disasters was constructed from unstructured documents; automatic drought disaster named entity recognition was achieved by deep learning methods.
- Published
- 2023
- Full Text
- View/download PDF
9. Multimodal Corpus Construction and Research of Intercultural Communication Based on Multiple Regression Equation
- Author
- Yao Wenjing
- Subjects
- multiple regression equation, intercultural communication, multimode, corpus construction, 62G08, Mathematics, QA1-939
- Abstract
This study aims to promote the transformation of education and teaching concepts, guide the reform of teaching content and methods, foster the joint construction and sharing of high-quality curriculum resources in colleges and universities through modern information technology, improve the quality of talent training, and serve the construction of a learning society. Using video analysis and website research methods, the author takes the intercultural communication courses offered by 22 universities among the national excellent resource sharing courses published in the multimodal online corpus of the “Love Course Network” as the research sample. Taking the playability of teaching videos, noise level, picture definition, multimedia courseware playback, utilization of curriculum resources and network interaction as the research contents, the construction and application of teaching videos are analyzed in depth. The results show that, in the construction of intercultural communication sharing courses, problems remain, such as poor playability of teaching videos, noise and poor picture clarity.
- Published
- 2023
- Full Text
- View/download PDF
10. Construction of cardiovascular information extraction corpus based on electronic medical records
- Author
- Hongyang Chang, Hongying Zan, Shuai Zhang, Bingfei Zhao, and Kunli Zhang
- Subjects
- cardiovascular disease, corpus construction, electronic medical record, Biotechnology, TP248.13-248.65, Mathematics, QA1-939
- Abstract
Cardiovascular disease has a significant impact on both society and patients, making it necessary to conduct knowledge-based research such as research that utilizes knowledge graphs and automated question answering. However, the existing research on corpus construction for cardiovascular disease is relatively limited, which has hindered further knowledge-based research on this disease. Electronic medical records contain patient data that span the entire diagnosis and treatment process and include a large amount of reliable medical information. Therefore, we collected electronic medical record data related to cardiovascular disease, combined the data with relevant work experience and developed a standard for labeling cardiovascular electronic medical record entities and entity relations. By building a sentence-level labeling result dictionary through the use of a rule-based semi-automatic method, a cardiovascular electronic medical record entity and entity relationship labeling corpus (CVDEMRC) was constructed. The CVDEMRC contains 7691 entities and 11,185 entity relation triples, and the results of consistency examination were 93.51% and 84.02% for entities and entity-relationship annotations, respectively, demonstrating good consistency results. The CVDEMRC constructed in this study is expected to provide a database for information extraction research related to cardiovascular diseases.
- Published
- 2023
- Full Text
- View/download PDF
11. Construction of English Numerical Intelligence Text Translation Data Corpus in Colleges and Universities
- Author
- Zhai Xiang
- Subjects
- recurrent neural network, BiGRU model, attention mechanism, seq2seq model, corpus construction, 97P10, Mathematics, QA1-939
- Abstract
Given the specialized nature of English text translation in academic settings and the frequent absence of reliable reference materials, translation processes often lack verifiable evidence, impacting both efficiency and quality. This paper addresses these challenges by first developing a basic syntactic error correction model that leverages the structural features of recurrent neural networks (RNNs) and gated recurrent unit (GRU) networks to establish a Seq2Seq syntactic error correction framework. To enhance this model, we incorporate an Attention mechanism into the Seq2Seq-based English grammar error correction model. This innovation enables the model to swiftly focus on segments most pertinent to the current context, thereby boosting operational efficiency. Subsequently, we create a college English text translation data corpus using Numerical Intelligence techniques to maintain grammatical accuracy within the corpus. Comparative analysis of the model training reveals that the Seq2Seq model with the Attention mechanism achieves an accuracy rate of 41.7%, which represents a 9.19% improvement over the basic model, underscoring its significant advantage. Furthermore, the average accuracy rate for grammatical error correction stands at 72.87%. A practical application analysis shows a minimal difference of only 0.05 points between the model’s grammar correction scores and those of human teachers. The corpus developed using this enhanced grammar error correction model scored 86 overall, outperforming other corpora. Therefore, the augmented Seq2Seq model with the Attention mechanism proves highly effective for developing English text translation corpora in collegiate environments.
- Published
- 2024
- Full Text
- View/download PDF
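The attention mechanism that result 11 adds to its Seq2Seq corrector reduces to scoring each encoder state against the current decoder state and taking a weighted sum of encoder states. A minimal dot-product version in PyTorch, illustrative only and not the paper's model:

    import torch
    import torch.nn.functional as F

    def dot_product_attention(dec_state, enc_states):
        """dec_state: (batch, hidden); enc_states: (batch, src_len, hidden)."""
        scores = torch.bmm(enc_states, dec_state.unsqueeze(-1)).squeeze(-1)
        weights = F.softmax(scores, dim=-1)   # focus on the most relevant source tokens
        context = torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)
        return context, weights

    ctx, w = dot_product_attention(torch.randn(2, 128), torch.randn(2, 7, 128))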
12. Games with a purpose for annotating the Corpus Oral Sonoro del Español Rural ('Audible Corpus of Spoken Rural Spanish').
- Author
- Díaz, Rosa Lilia Segundo, Bonilla, Johnatan E., Bouzouita, Miriam, and Ruiz, Gustavo Rovelo
- Subjects
- CORPORA, SPANISH language, SPANISH literature
- Abstract
The study of dialectal microvariation in spoken Spanish faces challenges due to the absence of an adequate morpho-syntactically annotated and parsed corpus. Therefore, this article introduces a novel, game-based technique for creating resources for non-standard Spanish language varieties. The article provides an overview of the progress in designing three Games With A Purpose (GWAPs) prototypes, to wit, Agentes, Tesoros, and Anotatlón. These games aim to facilitate the confirmation and correction of the morpho-syntactic tagging of the COSER-AP (Corpus Oral y Sonoro del Español Rural-Anotado y Parseado, 'Annotated and Parsed Audible Corpus of Spoken Rural Spanish'). First, the article presents the methodology used to build the games. Second, it offers a detailed description of the implemented Game Design Elements (GDEs). Finally, the article discusses the results of a pilot evaluation that assesses player enjoyment and linguistic accuracy. Findings are promising, with Tesoros and Anotatlón demonstrating high levels of enjoyment, and Agentes proving effective in collecting a large number of annotations. The linguistic accuracy results also show the potential benefits of gamified approaches to linguistic annotation tasks. However, the study also emphasizes the importance of considering regional variation when assessing players and of training them in multidialectal contexts. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
13. Annotated Open Corpus Construction and BERT-Based Approach for Automatic Metadata Extraction From Korean Academic Papers
- Author
- Hyesoo Kong, Hwamook Yoon, Jaewook Seol, Mihwan Hyun, Hyejin Lee, Soonyoung Kim, and Wonjun Choi
- Subjects
- BERT, corpus construction, metadata extraction, transfer learning, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
- Abstract
With the accelerating development of science and technology, the number of academic papers published in various fields is increasing rapidly. Academic papers, especially in science and technology fields, are a crucial medium for researchers, who develop new technologies by identifying knowledge regarding the latest technological trends and conduct derivative studies in science and technology. Therefore, the continual collection of extensive academic papers, structuring of metadata, and construction of databases are significant tasks. However, research on automatic metadata extraction from Korean papers is currently not being actively conducted owing to insufficient Korean training data. We automatically constructed the largest labeled corpus in South Korea to date from 315,320 PDF papers belonging to 503 Korean academic journals; this labeled corpus can be used for training models to automatically extract 12 metadata types from PDF papers. The corpus is available at https://doi.org/10.23057/48. Moreover, we developed an inspection process and guidelines for the automatically constructed data and performed a full inspection of the validation and testing data. The reliability of the inspected data was verified through inter-annotator agreement measurement. Using our corpus, we trained and evaluated a BERT-based transfer learning model to verify its reliability. Furthermore, we proposed new training methods that can improve the metadata extraction performance for Korean papers, and through these methods we developed the KorSciBERT-ME-J and KorSciBERT-ME-J+C models. KorSciBERT-ME-J showed the highest performance, with an F1 score of 99.36%, as well as robust performance in automatic metadata extraction from Korean academic papers in various formats.
- Published
- 2023
- Full Text
- View/download PDF
14. Reaching beneath the tip of the iceberg: A guide to the Freiburg Multimodal Interaction Corpus
- Author
- Rühlemann, Christoph and Ptak, Alexander
- Subjects
- talk-in-interaction, multimodality, gesture, eye tracking, corpus construction, Philology. Linguistics, P1-1091
- Abstract
Most corpora tacitly subscribe to a speech-only view filtering out anything that is not a ‘word’ and transcribing the spoken language merely orthographically despite the fact that the “speech-only view on language is fundamentally incomplete” (Kok 2017, 2) due to the deep intertwining of the verbal, vocal, and kinesic modalities (Levinson and Holler 2014). This article introduces the Freiburg Multimodal Interaction Corpus (FreMIC), a multimodal and interactional corpus of unscripted conversation in English currently under construction. At the time of writing, FreMIC comprises (i) c. 29 h of video-recordings transcribed and annotated in detail and (ii) automatically (and manually) generated multimodal data. All conversations are transcribed in ELAN both orthographically and using Jeffersonian conventions to render verbal content and interactionally relevant details of sequencing (e.g. overlap, latching), temporal aspects (pauses, acceleration/deceleration), phonological aspects (e.g. intensity, pitch, stretching, truncation, voice quality), and laughter. Moreover, the orthographic transcriptions are exhaustively PoS-tagged using the CLAWS web tagger (Garside and Smith 1997). ELAN-based transcriptions also provide exhaustive annotations of re-enactments (also referred to as (free) direct speech, constructed dialogue, etc.) as well as silent gestures (meaningful gestures that occur without accompanying speech). The multimodal data are derived from psychophysiological measurements and eye tracking. The psychophysiological measurements include, inter alia, electrodermal activity or GSR, which is indicative of emotional arousal (e.g. Peräkylä et al. 2015). Eye tracking produces data of two kinds: gaze direction and pupil size. In FreMIC, gazes are automatically recorded using the area-of-interest technology. Gaze direction is interactionally key, for example, in turn-taking (e.g. Auer 2021) and re-enactments (e.g. Pfeiffer and Weiss 2022), while changes in pupil size provide a window onto cognitive intensity (e.g. Barthel and Sauppe 2019). To demonstrate what opportunities FreMIC’s (combination of) transcriptions, annotations, and multimodal data open up for research in Interactional (Corpus) Linguistics, this article reports on interim results derived from work-in-progress.
- Published
- 2023
- Full Text
- View/download PDF
15. BertSRC: transformer-based semantic relation classification
- Author
- Yeawon Lee, Jinseok Son, and Min Song
- Subjects
- Relation extraction, Semantic relation classification, Corpus construction, Annotation method, Deep learning, BERT, Computer applications to medicine. Medical informatics, R858-859.7
- Abstract
The relationship between biomedical entities is complex, and many of them have not yet been identified. For many biomedical research areas including drug discovery, it is of paramount importance to identify the relationships that have already been established through a comprehensive literature survey. However, manually searching through literature is difficult as the amount of biomedical publications continues to increase. Therefore, the relation classification task, which automatically mines meaningful relations from the literature, is spotlighted in the field of biomedical text mining. By applying relation classification techniques to the accumulated biomedical literature, existing semantic relations between biomedical entities that can help to infer previously unknown relationships are efficiently grasped. To develop semantic relation classification models, which is a type of supervised machine learning, it is essential to construct a training dataset that is manually annotated by biomedical experts with semantic relations among biomedical entities. Any advanced model must be trained on a dataset with reliable quality and meaningful scale to be deployed in the real world and can assist biologists in their research. In addition, as the number of such public datasets increases, the performance of machine learning algorithms can be accurately revealed and compared by using those datasets as a benchmark for model development and improvement. In this paper, we aim to build such a dataset. Along with that, to validate the usability of the dataset as training data for relation classification models and to improve the performance of the relation extraction task, we built a relation classification model based on Bidirectional Encoder Representations from Transformers (BERT) trained on our dataset, applying our newly proposed fine-tuning methodology. In experiments comparing performance among several models based on different deep learning algorithms, our model with the proposed fine-tuning methodology showed the best performance. The experimental results show that the constructed training dataset is an important information resource for the development and evaluation of semantic relation extraction models. Furthermore, relation extraction performance can be improved by integrating our proposed fine-tuning methodology. Therefore, this can lead to the promotion of future text mining research in the biomedical field.
- Published
- 2022
- Full Text
- View/download PDF
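The fine-tuning recipe in result 15 is the standard BERT sequence-classification setup: mark the two entities in the input, add a classification head, and train on the annotated corpus. A generic sketch, where the marker tokens and label count are assumptions rather than BertSRC's exact scheme:

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tok = AutoTokenizer.from_pretrained("bert-base-cased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-cased", num_labels=3)  # e.g. activates / inhibits / no-relation

    # Wrap entity spans in marker tokens so the model can locate them.
    tok.add_tokens(["[E1]", "[/E1]", "[E2]", "[/E2]"])
    model.resize_token_embeddings(len(tok))

    text = "[E1] aspirin [/E1] inhibits [E2] COX-2 [/E2] in platelets"
    with torch.no_grad():
        logits = model(**tok(text, return_tensors="pt")).logits
    # Fine-tuning on expert-annotated sentence pairs would precede any real prediction.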
16. Analyzing Chinese text with clause relevance structure.
- Author
- Lyu, Chen and Feng, Wenhe
- Subjects
- CHINESE language, RECOGNITION (Psychology), DISCOURSE analysis, STRUCTURAL analysis (Engineering), RHETORICAL theory
- Abstract
Discourse structure is generally represented as a hierarchical structure; the two best-known representations are Rhetorical Structure Theory (RST) and the Penn Discourse Treebank (PDTB). The main problem with hierarchical structure is that it cannot describe the direct semantic relevance between elementary discourse units (EDUs), especially non-adjacent and cross-level EDUs. Discourse dependency structure (DDS) has been put forward in recent years to describe the head-dependent relation between EDUs. However, the process of judging the head cannot be grounded theoretically. This problem is particularly serious in Chinese discourse analysis, because Chinese lacks formal differences between main clauses and subordinate clauses. In this paper, we propose clause relevance structure to represent discourse structure. Compared with hierarchical discourse structure and DDS, clause relevance structure can effectively describe the direct semantic association between discontinuous and cross-level clauses in a text, and its construction does not presuppose head recognition. We propose judgment criteria and formal constraints for clause relevance structure and built a human-annotated corpus of Chinese text. Based on this corpus, we explore the automatic recognition of clause relevance structure. The clause relevance recognition task is formalized as a classification problem and performed by a BERT-based model. A bidirectional LSTM layer is added on top of BERT to improve performance, and a recognition accuracy of 90.77% is achieved by the BERT-LSTM model. Experimental results show that long-distance clause pairs are the main difficulty in clause relevance recognition, with errors concentrated on positive examples, while short-distance clause pairs are especially difficult to recognize correctly as negative examples. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
17. An Efficient Framework for Constructing Speech Emotion Corpus Based on Integrated Active Learning Strategies.
- Author
- Ren, Fuji, Liu, Zheng, and Kang, Xin
- Abstract
Speech emotion recognition has developed rapidly in recent decades owing to the rise of machine learning. Nevertheless, the lack of corpora remains a significant issue. For actual speech emotion corpus construction, many professional actors are required to perform voices with various emotions in specific scenes. In the process of data labelling, since the number of samples in different emotion categories is extremely imbalanced, it is difficult to label the samples efficiently. Hence, we propose an integrated active learning sampling strategy and design an efficient framework for constructing speech emotion corpora to address the problems presented above. In comparison experiments with other active learning algorithms on 13 datasets, our method was shown to improve sampling efficiency. In addition, it preferentially selects samples from small categories for labelling in imbalanced datasets. In the actual corpus construction experiments, our method prioritized small-category emotion samples: even when the amount of labelled data was less than 50%, the accuracy still reached 90%. This greatly enhances the efficiency of constructing the speech emotion corpus and fills a gap in the field. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
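Result 17 does not spell out its integrated sampling strategy here, but its two stated goals, picking informative samples and preferring rare emotion classes, can be combined in a single acquisition score. A sketch under those assumptions (prediction entropy weighted by inverse class frequency), not the authors' algorithm:

    import numpy as np

    def select_batch(probs, labelled_counts, k=8):
        """probs: (n, classes) model probabilities over the unlabelled pool."""
        entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
        # Weight each sample by the rarity of its predicted class so far.
        rarity = 1.0 / (labelled_counts[probs.argmax(axis=1)] + 1.0)
        return np.argsort(entropy * rarity)[-k:]   # indices to send to annotators

    pool = np.random.dirichlet(np.ones(5), size=100)
    picked = select_batch(pool, labelled_counts=np.array([200, 150, 90, 12, 5]))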
18. BertSRC: transformer-based semantic relation classification.
- Author
- Lee, Yeawon, Son, Jinseok, and Song, Min
- Abstract
The relationship between biomedical entities is complex, and many of them have not yet been identified. For many biomedical research areas including drug discovery, it is of paramount importance to identify the relationships that have already been established through a comprehensive literature survey. However, manually searching through literature is difficult as the amount of biomedical publications continues to increase. Therefore, the relation classification task, which automatically mines meaningful relations from the literature, is spotlighted in the field of biomedical text mining. By applying relation classification techniques to the accumulated biomedical literature, existing semantic relations between biomedical entities that can help to infer previously unknown relationships are efficiently grasped. To develop semantic relation classification models, which is a type of supervised machine learning, it is essential to construct a training dataset that is manually annotated by biomedical experts with semantic relations among biomedical entities. Any advanced model must be trained on a dataset with reliable quality and meaningful scale to be deployed in the real world and can assist biologists in their research. In addition, as the number of such public datasets increases, the performance of machine learning algorithms can be accurately revealed and compared by using those datasets as a benchmark for model development and improvement. In this paper, we aim to build such a dataset. Along with that, to validate the usability of the dataset as training data for relation classification models and to improve the performance of the relation extraction task, we built a relation classification model based on Bidirectional Encoder Representations from Transformers (BERT) trained on our dataset, applying our newly proposed fine-tuning methodology. In experiments comparing performance among several models based on different deep learning algorithms, our model with the proposed fine-tuning methodology showed the best performance. The experimental results show that the constructed training dataset is an important information resource for the development and evaluation of semantic relation extraction models. Furthermore, relation extraction performance can be improved by integrating our proposed fine-tuning methodology. Therefore, this can lead to the promotion of future text mining research in the biomedical field. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
19. End-to-end framework for agricultural entity extraction – A hybrid model with transformer.
- Author
- Nismi Mol, E.A. and Santosh Kumar, M.B.
- Subjects
- NATURAL language processing, TRANSFORMER models, AGRICULTURE, KNOWLEDGE graphs, RANDOM fields
- Abstract
Highlights: agricultural entity extraction from raw textual documents; a hybrid approach for extraction without a labeled dataset; combines dictionaries, regular-expression-based rules and deep learning techniques; fine-tunes a transformer-based model to obtain improved performance. Entity extraction is one prerequisite for various Natural Language Processing (NLP) applications like relation extraction, question answering and knowledge graph construction. This paper proposes a complete framework for extracting agricultural entities from unstructured textual documents by combining dictionary-based, rule-based and transformer-based deep learning techniques. Initially, an entity extraction method based on dictionaries is employed to obtain patterns for creating rules that can identify more similar entities. The dictionary is updated with the newly obtained entities, and this combined dictionary-based and rule-based technique is applied repeatedly in multiple steps to obtain enough sample data. A labeled dataset is constructed from these extracted entities to train the proposed BERT-based transformer model with a Conditional Random Field (CRF) layer for agricultural entity extraction, and the performance is evaluated. The results show the enhanced performance of the proposed model compared to other state-of-the-art models, with a Precision of 88.84%, Recall of 88.9% and F1-Score of 88.87%. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
20. WCC-JC: A Web-Crawled Corpus for Japanese-Chinese Neural Machine Translation.
- Author
- Zhang, Jinyi, Tian, Ye, Mao, Jiannan, Han, Mei, and Matsumoto, Tadahiro
- Subjects
- MACHINE translating, NATURAL language processing, COMPARATIVE grammar, JAPANESE language, CHINESE language, LANGUAGE research
- Abstract
Featured Application: This research crawled a bilingual Japanese-Chinese corpus of a certain size through websites. As a necessary resource for Japanese-Chinese neural machine translation (NMT), it is beneficial for researchers to promote the progress of Japanese-Chinese language-related natural language processing research. Specifically, topics include comparative analysis of grammar, comparative studies of Chinese and Japanese languages, compilation of dictionaries, etc. This will have great significance and contribution to the cultural exchange and industrial cooperation between China and Japan. It also has important theoretical significance and application value to the industrialization of Japanese-Chinese machine translation. In addition, the application of this research will be of great significance in strengthening civil communication and enhancing mutual understanding between China and Japan, as the current Chinese and Japanese relations are not well perceived by the citizens of both countries. We hope that the construction and pathways of the Japanese-Chinese bilingual corpus in this research will help to solve the problem of language barriers in Japanese-Chinese people-to-people communication and mutual understanding. We offer the WCC-JC as a free download under the premise that it is intended for research purposes only. Currently, there are only a limited number of Japanese-Chinese bilingual corpora of a sufficient amount that can be used as training data for neural machine translation (NMT). In particular, there are few corpora that include spoken language such as daily conversation. In this research, we attempt to construct a Japanese-Chinese bilingual corpus of a certain scale by crawling the subtitle data of movies and TV series from the websites. We calculated the BLEU scores of the constructed WCC-JC (Web Crawled Corpus—Japanese and Chinese) and the other compared corpora. We also manually evaluated the translation results using the translation model trained on the WCC-JC to confirm the quality and effectiveness. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
21. Using Twitter to Detect Hate Crimes and Their Motivations: The HateMotiv Corpus.
- Author
- Alnazzawi, Noha
- Subjects
- HATE crimes, SOCIAL media, CYBERBULLYING, HARASSMENT
- Abstract
With the rapidly increasing use of social media platforms, much of our lives is spent online. Despite the great advantages of using social media, unfortunately, the spread of hate, cyberbullying, harassment, and trolling can be very common online. Many extremists use social media platforms to communicate their messages of hatred and spread violence, which may result in serious psychological consequences and even contribute to real-world violence. Thus, the aim of this research was to build the HateMotiv corpus, a freely available dataset that is annotated for types of hate crimes and the motivation behind committing them. The dataset was developed using Twitter as an example of social media platforms and could provide the research community with a very unique, novel, and reliable dataset. The dataset is unique as a consequence of its topic-specific nature and its detailed annotation. The corpus was annotated by two annotators who are experts in annotation based on unified guidelines, so they were able to produce an annotation of a high standard with F-scores for the agreement rate as high as 0.66 and 0.71 for type and motivation labels of hate crimes, respectively. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
22. Language production experiments as tools for corpus construction: A contrastive study of complementizer agreement.
- Author
- Fingerhuth, Matthias and Breuer, Ludwig Maximilian
- Subjects
- SYNTAX (Grammar), LINGUISTIC context, HISTORICAL linguistics
- Abstract
Furthermore, it is only by extending the results from the 2PL completion task, which shows parallel use of the CA-morpheme and full pronouns, that we could assume that the -s-morphemes in the translation task are also CA-morphemes and not clitic pronouns. While we generally follow this interpretation as a CA-morpheme, in our subsequent analysis we will distinguish instances where only a -st-morpheme appears from those where both a CA-morpheme and a clitic pronoun (i.e., -stu) occur. With few exceptions, the presence of both the -s-morpheme and a full pronoun allows an interpretation that the -s-morpheme is not a mere clitic pronoun but indeed an indication of CA. Instead, the majority of respondents (ob 9; 60%, wann 6; 67%) show CA and a full pronoun in the completion task, but only a CA-morpheme/clitic in the translation task. Keywords: syntax; language production experiments; complementizer agreement; corpus construction. From the introduction: Investigation of specific phenomena in corpora of spontaneous speech often meets hurdles, as they may be limited to specific interactive or linguistic contexts or of a generally low frequency (Lenz 2008: 163; [28]: 513). [Extracted from the article]
- Published
- 2022
- Full Text
- View/download PDF
23. Linguistically-Based Comparison of Different Approaches to Building Corpora for Text Simplification: A Case Study on Italian.
- Author
- Brunato, Dominique, Dell'Orletta, Felice, and Venturi, Giulia
- Subjects
- NATURAL language processing, CORPORA, LINGUISTIC analysis
- Abstract
In this paper, we present an overview of existing parallel corpora for Automatic Text Simplification (ATS) in different languages focusing on the approach adopted for their construction. We make the main distinction between manual and (semi)–automatic approaches in order to investigate in which respect complex and simple texts vary and whether and how the observed modifications may depend on the underlying approach. To this end, we perform a two-level comparison on Italian corpora, since this is the only language, with the exception of English, for which there are large parallel resources derived through the two approaches considered. The first level of comparison accounts for the main types of sentence transformations occurring in the simplification process, the second one examines the results of a linguistic profiling analysis based on Natural Language Processing techniques and carried out on the original and the simple version of the same texts. For both levels of analysis, we chose to focus our discussion mostly on sentence transformations and linguistic characteristics that pertain to the morpho-syntactic and syntactic structure of the sentence. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
24. The Electronic Corpus of 17th- and 18th-century Polish Texts.
- Author
- Gruszczyński, Włodzimierz, Adamiec, Dorota, Bronikowska, Renata, Kieraś, Witold, Modrzejewski, Emanuel, Wieczorek, Aleksandra, and Woliński, Marcin
- Subjects
- CORPORA, POLISH language, SLAVIC languages
- Abstract
The paper describes the process of building the electronic corpus of 17th- and 18th-century Polish texts, a relatively large, balanced, structurally and morphologically annotated resource of the Middle Polish language, available for searching at https://www.korba.edu.pl. The corpus consists of samples extracted from over seven hundred texts written and published between 1601 and 1772, summing up to a total size of 13.5 million tokens which makes it one of the largest historical corpora for a Slavic language. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
25. Constructing fine-grained entity recognition corpora based on clinical records of traditional Chinese medicine
- Author
- Tingting Zhang, Yaqiang Wang, Xiaofeng Wang, Yafei Yang, and Ying Ye
- Subjects
- TCM clinical records, Fine-grained annotation, Named entity recognition, Corpus construction, Guideline development, Computer applications to medicine. Medical informatics, R858-859.7
- Abstract
Background: In this study, we focus on building a fine-grained entity annotation corpus with the corresponding annotation guideline of traditional Chinese medicine (TCM) clinical records. Our aim is to provide a basis for the fine-grained corpus construction of TCM clinical records in future. Methods: We developed a four-step approach that is suitable for the construction of TCM medical records in our corpus. First, we determined the entity types included in this study through sample annotation. Then, we drafted a fine-grained annotation guideline by summarizing the characteristics of the dataset and referring to some existing guidelines. We iteratively updated the guidelines until the inter-annotator agreement (IAA) exceeded a Cohen's kappa value of 0.9. Comprehensive annotations were performed while keeping the IAA value above 0.9. Results: We annotated the 10,197 clinical records in five rounds. Four entity categories involving 13 entity types were employed. The final fine-grained annotated entity corpus consists of 1104 entities and 67,799 tokens. The final IAAs are 0.936 on average (for three annotators), indicating that the fine-grained entity recognition corpus is of high quality. Conclusions: These results will provide a foundation for future research on corpus construction and named entity recognition tasks in the TCM clinical domain.
- Published
- 2020
- Full Text
- View/download PDF
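The agreement threshold used in result 25 (Cohen's kappa above 0.9 before full annotation) is simple to monitor once two annotators' labels are aligned token by token. A minimal check with scikit-learn and toy labels:

    from sklearn.metrics import cohen_kappa_score

    annotator_a = ["SYMPTOM", "O", "HERB", "HERB", "O", "DISEASE"]
    annotator_b = ["SYMPTOM", "O", "HERB", "O",    "O", "DISEASE"]

    kappa = cohen_kappa_score(annotator_a, annotator_b)
    print(f"Cohen's kappa = {kappa:.3f}")  # iterate on the guideline until it exceeds 0.9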
26. Linguistically-Based Comparison of Different Approaches to Building Corpora for Text Simplification: A Case Study on Italian
- Author
- Dominique Brunato, Felice Dell'Orletta, and Giulia Venturi
- Subjects
- text simplification, aligned corpora, linguistic complexity, Italian language, corpus construction, Psychology, BF1-990
- Abstract
In this paper, we present an overview of existing parallel corpora for Automatic Text Simplification (ATS) in different languages focusing on the approach adopted for their construction. We make the main distinction between manual and (semi)–automatic approaches in order to investigate in which respect complex and simple texts vary and whether and how the observed modifications may depend on the underlying approach. To this end, we perform a two-level comparison on Italian corpora, since this is the only language, with the exception of English, for which there are large parallel resources derived through the two approaches considered. The first level of comparison accounts for the main types of sentence transformations occurring in the simplification process, the second one examines the results of a linguistic profiling analysis based on Natural Language Processing techniques and carried out on the original and the simple version of the same texts. For both levels of analysis, we chose to focus our discussion mostly on sentence transformations and linguistic characteristics that pertain to the morpho-syntactic and syntactic structure of the sentence.
- Published
- 2022
- Full Text
- View/download PDF
27. Keys to analyzing Twitter data: corpus collection and processing.
- Author
- Bonilla Neira, Laura Cristina
- Subjects
- ACQUISITION of data, DISCOURSE analysis, DATA analysis, INFORMATION processing, VERBS, VISUALIZATION, ETHNOLOGY
- Abstract
This paper presents a methodological proposal for the analysis of Twitter data from a mixed approach. Specifically, the procedure for collecting and processing information is characterized by the use of qualitative and quantitative resources, as well as the construction of a manageable corpus for subsequent qualitative analysis. The procedure for approaching Twitter digital discourses consists of the following route: 1) virtual ethnography registration; 2) data collection through the Twitter API using Python; 3) visualisation and filtering of the data with OpenRefine; 4) corpus construction; 5) categorisation and labelling of the iconic verb utterances with Atlas.ti. The paper reconstructs the methodological path carried out in ongoing doctoral research with a qualitative approach, from which the examples are extracted, to offer an accessible route that can be replicated in research with this type of data. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
28. A Method for Constructing a Sentence-Filtering Dataset for Web Corpora.
- Author
- 남충현 and 장경식
- Subjects
- NATURAL language processing, ARTIFICIAL neural networks, TASK performance, DEEP learning, CORPORA
- Abstract
Pretrained models that perform well on various natural language processing tasks have the advantage of learning the linguistic patterns of sentences from a large corpus during training, allowing each token in an input sentence to be represented with appropriate feature vectors. One method of constructing the corpus required for pre-trained model training is collection via web crawler. However, sentences collected from the web have varied patterns and may contain unnecessary words in part or all of a sentence. In this paper, we propose a method for constructing a dataset for filtering sentences containing unnecessary words, using neural network models, for corpora collected from the web. As a result, we construct a dataset containing a total of 2,330 sentences. We also evaluated the performance of neural network models on the constructed dataset; the BERT model showed the highest performance, with an accuracy of 93.75%. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
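Result 28's filter is a learned classifier trained on its 2,330-sentence dataset; what such a dataset typically complements is a cheap rule-based pre-filter over crawled text. A sketch of that pre-filtering step, with assumed thresholds and patterns rather than anything from the paper:

    import re

    BOILERPLATE = re.compile(r"cookie|subscribe|click here|all rights reserved", re.I)

    def keep_sentence(s, min_tokens=4, max_tokens=120):
        toks = s.split()
        if not (min_tokens <= len(toks) <= max_tokens) or BOILERPLATE.search(s):
            return False
        letters = sum(ch.isalpha() for ch in s)
        return letters / max(len(s), 1) > 0.6   # drop menus, timestamps, symbol runs

    corpus = ["Click here to subscribe!", "Pretrained models learn linguistic patterns."]
    print([s for s in corpus if keep_sentence(s)])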
29. WCC-JC: A Web-Crawled Corpus for Japanese-Chinese Neural Machine Translation
- Author
- Jinyi Zhang, Ye Tian, Jiannan Mao, Mei Han, and Tadahiro Matsumoto
- Subjects
- Japanese-Chinese bilingual corpus, neural machine translation, corpus construction, manual evaluation, Technology, Engineering (General). Civil engineering (General), TA1-2040, Biology (General), QH301-705.5, Physics, QC1-999, Chemistry, QD1-999
- Abstract
Currently, there are only a limited number of Japanese-Chinese bilingual corpora of a sufficient amount that can be used as training data for neural machine translation (NMT). In particular, there are few corpora that include spoken language such as daily conversation. In this research, we attempt to construct a Japanese-Chinese bilingual corpus of a certain scale by crawling the subtitle data of movies and TV series from the websites. We calculated the BLEU scores of the constructed WCC-JC (Web Crawled Corpus—Japanese and Chinese) and the other compared corpora. We also manually evaluated the translation results using the translation model trained on the WCC-JC to confirm the quality and effectiveness.
- Published
- 2022
- Full Text
- View/download PDF
30. Live blog summarization.
- Author
- Avinesh, P. V. S., Peyrard, Maxime, and Meyer, Christian M.
- Subjects
- ONLINE journalism, NEWS websites, SCIENTIFIC community, BLOGS, CORPORA
- Abstract
Live blogs are an increasingly popular news format to cover breaking news and live events in online journalism. Online news websites around the world are using this medium to give their readers a minute by minute update on an event. Good summaries enhance the value of the live blogs for a reader, but are often not available. In this article, (a) we first define the task of summarizing a live blog, (b) study ways of automatically collecting corpora for live blog summarization, and (c) understand the complexity of the task by empirically evaluating well-known state-of-the-art unsupervised and supervised summarization systems on our new corpus. We show that live blog summarization poses new challenges in the field of news summarization, since frequency and positional signals cannot be used. We make our tools publicly available to reconstruct the corpus and to conduct our empirical experiments. This encourages the research community to build upon and replicate our results. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
31. Cohort selection for construction of a clinical natural language processing corpus
- Author
- Naga Lalitha Valli Alla, Aipeng Chen, Sean Batongbacal, Chandini Nekkantti, Hong-Jie Dai, and Jitendra Jonnagaddala
- Subjects
- Electronic health records, Clinical natural language processing, Surgical pathology reports, HL7 messages, Corpus construction, Computer applications to medicine. Medical informatics, R858-859.7
- Abstract
In Electronic Health Record (EHR) systems, key patient information is often captured in the form of unstructured clinical notes. The information in these notes can be extracted using clinical Natural Language Processing (NLP). A training corpus is a key factor in the development of efficient clinical NLP models. Clinical NLP corpus construction is complex and multifaceted, and one challenge that is rarely researched is the cohort selection aspect. In this study, we present the methods employed and challenges encountered in cohort selection for the construction of a clinical NLP corpus; specifically, our methods for selecting cancer pathology reports to construct a corpus for automatic de-identification. 2100 pathology reports were extracted from 1833 (518 male and 1313 female) cancer patients using the Health Level-7 (HL7) message standard. In terms of age distribution, the 60–70 years age group was largest, with 872 patients. Our findings suggest that deciphering segment information from HL7 messages collected from different hospitals is a challenging task. The quality of HL7 messages also varied significantly, with inconsistent tags making it difficult to identify reports that met criteria set a priori. One of the key lessons learned is that linking the HL7 report data with additional EMR data, such as admissions, would help in identifying high-quality reports and resolving duplicates. Our findings also suggest that, in general, EHR data quality is poor, with varying clinical coding and metadata standards between hospitals. It is vital to identify and address these challenges to develop a high-quality corpus.
- Published
- 2021
- Full Text
- View/download PDF
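Result 31 extracts pathology reports from HL7 messages and names segment inconsistency as a key obstacle. Pulling fields out of a toy HL7 v2 message with the python-hl7 library looks roughly like this; real feeds vary in segment layout, which is exactly the quality problem the study describes:

    import hl7  # pip install hl7

    # HL7 v2 segments are separated by carriage returns.
    raw = "\r".join([
        r"MSH|^~\&|LAB|HOSP|||202101011200||ORU^R01|123|P|2.3",
        "PID|1||999999^^^HOSP||DOE^JANE||19600101|F",
        "OBX|1|TX|PATH^Pathology Report||Invasive ductal carcinoma ...",
    ])

    msg = hl7.parse(raw)
    print(msg.segment("PID")[5])   # patient name field
    print(msg.segment("OBX")[5])   # report text field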
32. The Lexicon of Argumentation for Argument Mining: methodological considerations
- Author
- Patrick Saint-Dizier
- Subjects
- argumentation, computational linguistics, corpus construction, Language and Literature
- Abstract
In this article, in contrast with numerous statistical approaches that show little interest in linguistic analysis, we propose methodological considerations aimed at identifying several categories of linguistic cues that are typical of argumentation and can be used for automatic argument mining within the framework of computational linguistics. These are elaborated from a corpus of arguments compiled for this study. An argument being a relation between a controversial statement (claim) and a support or an attack, we deal with the marks specific to each class and those related to the expression of the relations that hold between them. To conclude, we propose an experimental model to evaluate the strength of each of these components. Some elements of an implementation and an evaluation are provided.
- Published
- 2020
- Full Text
- View/download PDF
33. A Guide to Using Corpora for English Language Learners. Robert Poole.
- Author
- Sharon Hartle
- Subjects
- freely accessible digital corpora, English language learning, classroom-based activities, concordance searches, grammatical and lexical pattern analysis, socio-cultural language analysis, corpus construction, American literature, PS1-3576, English literature, PR1-9680
- Abstract
Review of A Guide to Using Corpora for English Language Learners, by Robert Poole.
- Published
- 2020
- Full Text
- View/download PDF
34. Using Twitter to Detect Hate Crimes and Their Motivations: The HateMotiv Corpus
- Author
- Noha Alnazzawi
- Subjects
- text mining, corpus construction, annotation guidelines, hate crime motivation, Bibliography. Library science. Information resources
- Abstract
With the rapidly increasing use of social media platforms, much of our lives is spent online. Despite the great advantages of using social media, unfortunately, the spread of hate, cyberbullying, harassment, and trolling can be very common online. Many extremists use social media platforms to communicate their messages of hatred and spread violence, which may result in serious psychological consequences and even contribute to real-world violence. Thus, the aim of this research was to build the HateMotiv corpus, a freely available dataset that is annotated for types of hate crimes and the motivation behind committing them. The dataset was developed using Twitter as an example of social media platforms and could provide the research community with a very unique, novel, and reliable dataset. The dataset is unique as a consequence of its topic-specific nature and its detailed annotation. The corpus was annotated by two annotators who are experts in annotation based on unified guidelines, so they were able to produce an annotation of a high standard with F-scores for the agreement rate as high as 0.66 and 0.71 for type and motivation labels of hate crimes, respectively.
- Published
- 2022
- Full Text
- View/download PDF
35. Using Social Media to Detect Fake News Information Related to Product Marketing: The FakeAds Corpus
- Author
- Noha Alnazzawi, Najlaa Alsaedi, Fahad Alharbi, and Najla Alaswad
- Subjects
- social media, fake news, corpus construction, text mining, Bibliography. Library science. Information resources
- Abstract
Nowadays, an increasing portion of our lives is spent interacting online through social media platforms, thanks to the widespread adoption of the latest technology and the proliferation of smartphones. Obtaining news from social media platforms is fast, easy, and less expensive compared with other traditional media platforms, e.g., television and newspapers. Therefore, social media is now being exploited to disseminate fake news and false information. This research aims to build the FakeAds corpus, which consists of tweets for product advertisements. The aim of the FakeAds corpus is to study the impact of fake news and false information in advertising and marketing materials for specific products and which types of products (i.e., cosmetics, health, fashion, or electronics) are targeted most on Twitter to draw the attention of consumers. The corpus is unique and novel, in terms of both its very specific topic (i.e., the role of Twitter in disseminating fake news related to product promotion and advertisement) and its fine-grained annotations. The annotation guidelines were designed with guidance from a domain expert, and the annotation was performed by two domain experts, resulting in high-quality annotation, with agreement rate F-scores as high as 0.815.
- Published
- 2022
- Full Text
- View/download PDF
36. An extensive review of tools for manual annotation of documents.
- Author
- Neves, Mariana and Ševa, Jurica
- Subjects
- *NATURAL language processing, *ANNOTATIONS, *SCIENTIFIC literature - Abstract
Motivation: Annotation tools are applied to build training and test corpora, which are essential for the development and evaluation of new natural language processing algorithms; they are also used to extract new information for particular use cases. However, owing to the high number of existing annotation tools, finding the one that best fits particular needs is a demanding task that requires searching the scientific literature, followed by installing and trying various tools. Methods: We searched for annotation tools and selected a subset of them according to five requirements with which they should comply, such as being Web-based or supporting the definition of a schema. We installed the selected tools (when necessary), carried out hands-on experiments, and evaluated them using 26 criteria covering functional and technical aspects. Each criterion was rated on three levels of match, and these ratings were combined into a final score for each tool. Results: We evaluated 78 tools and selected the following 15 for a detailed evaluation: BioQRator, brat, Catma, Djangology, ezTag, FLAT, LightTag, MAT, MyMiner, PDFAnno, prodigy, tagtog, TextAE, WAT-SL and WebAnno. Full compliance ranged from only 9 up to 20 of the 26 criteria, demonstrating that some tools are comprehensive and mature enough to be used in most annotation projects. The highest score, 0.81 out of a maximum of 1.0, was obtained by WebAnno. [ABSTRACT FROM AUTHOR]
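The abstract does not spell out how the three match levels map to points, so the following sketch is only a plausible reconstruction of such a scoring scheme (the 1 / 0.5 / 0 weighting is an assumption, not the review's stated method):

```python
# Hypothetical reconstruction of the tool score: 26 criteria, each rated
# on one of three levels. The exact weighting used in the review is not
# stated in this abstract, so the mapping below is an assumption.
LEVEL_POINTS = {"full": 1.0, "partial": 0.5, "none": 0.0}

def tool_score(ratings):
    """Normalized score in [0, 1] averaged over all rated criteria."""
    return sum(LEVEL_POINTS[r] for r in ratings) / len(ratings)

# 26 ratings for an imaginary tool; a score of 0.81 (as reported for
# WebAnno) corresponds to roughly 21 of 26 possible points.
ratings = ["full"] * 18 + ["partial"] * 6 + ["none"] * 2
print(round(tool_score(ratings), 2))  # 0.81
```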
- Published
- 2021
- Full Text
- View/download PDF
37. The ReCAP Project: Similarity Methods for Finding Arguments and Argument Graphs.
- Author
- Bergmann, Ralph, Biertz, Manuel, Dumani, Lorik, Lenz, Mirko, Ludwig, Anna-Katharina, Neumann, Patrick J., Ollinger, Stefan, Sahitaj, Premtim, Schenkel, Ralf, and Witry, Alex
- Abstract
Argumentation machines search for arguments in natural language from information sources on the Web and reason with them on the knowledge level to actively support the deliberation and synthesis of arguments for a particular user query. The ReCAP project is part of the Priority Program RATIO and aims at novel contributions to, and a confluence of, methods from information retrieval, knowledge representation, and case-based reasoning for the development of future argumentation machines. In this paper we summarise recent research results from the project. In particular, a new German corpus of 100 semantically annotated argument graphs from the domain of education politics has been created and is made available to the argumentation research community. Further, we discuss a comprehensive investigation into finding arguments and argument graphs. We introduce a probabilistic ranking framework for argument retrieval, i.e. for finding good premises for a designated claim. For finding argument graphs, we developed methods for case-based argument retrieval that consider the graph structure of an argument together with textual and ontology-based similarity measures applied to claims, premises, and argument schemes. [ABSTRACT FROM AUTHOR]
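To illustrate the shape of the premise-retrieval task, here is a toy ranking of candidate premises for a claim by TF-IDF cosine similarity. This is only an editorial stand-in; the ReCAP framework itself is probabilistic and also exploits argument-graph structure:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy premise ranking for a claim by textual similarity only.
claim = "Schools should teach programming from an early age."
premises = [
    "Early exposure to programming improves problem-solving skills.",
    "School budgets are already stretched thin.",
    "Programming jobs are projected to grow over the next decade.",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform([claim] + premises)
scores = cosine_similarity(matrix[0:1], matrix[1:]).ravel()

# Print premises from most to least similar to the claim.
for premise, score in sorted(zip(premises, scores), key=lambda p: -p[1]):
    print(f"{score:.3f}  {premise}")
```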
- Published
- 2020
- Full Text
- View/download PDF
38. Constructing fine-grained entity recognition corpora based on clinical records of traditional Chinese medicine.
- Author
- Zhang, Tingting, Wang, Yaqiang, Wang, Xiaofeng, Yang, Yafei, and Ye, Ying
- Subjects
CHINESE medicine, MEDICAL records, CORPORA, NAMED-entity recognition - Abstract
Background: In this study, we focus on building a fine-grained entity annotation corpus, with a corresponding annotation guideline, for traditional Chinese medicine (TCM) clinical records. Our aim is to provide a basis for the fine-grained corpus construction of TCM clinical records in the future. Methods: We developed a four-step approach suited to the construction of a corpus of TCM medical records. First, we determined the entity types included in this study through sample annotation. Then, we drafted a fine-grained annotation guideline by summarizing the characteristics of the dataset and referring to existing guidelines. We iteratively updated the guideline until the inter-annotator agreement (IAA) exceeded a Cohen's kappa value of 0.9, and performed the comprehensive annotation while keeping the IAA above 0.9. Results: We annotated the 10,197 clinical records in five rounds, employing four entity categories covering 13 entity types. The final fine-grained annotated entity corpus consists of 1104 entities and 67,799 tokens. The final IAAs average 0.936 across three annotators, indicating that the fine-grained entity recognition corpus is of high quality. Conclusions: These results provide a foundation for future research on corpus construction and named entity recognition tasks in the TCM clinical domain. [ABSTRACT FROM AUTHOR]
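Cohen's kappa, the agreement statistic used above, corrects raw agreement for chance. A minimal sketch with scikit-learn (the token-level entity labels are invented; the study's actual tag set is not shown in this abstract):

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical token-level entity labels from two annotators; the study
# iterated on its guideline until kappa exceeded 0.9 before annotating fully.
annotator_1 = ["SYMPTOM", "O", "HERB", "HERB", "O", "SYMPTOM"]
annotator_2 = ["SYMPTOM", "O", "HERB", "O", "O", "SYMPTOM"]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.3f}")
```
- Published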
- 2020
- Full Text
- View/download PDF
39. Automatic Building of a Large Arabic Spelling Error Corpus
- Author
- Aichaoui, Shaimaa Ben, Hiri, Nawel, Dahou, Abdelhalim Hafedh, and Cheragui, Mohamed Amine
- Published
- 2023
- Full Text
- View/download PDF
40. Two evaluations on Ontology-style relation annotations.
- Author
- Bou, Savong, Miwa, Makoto, and Sasaki, Yutaka
- Subjects
- *ONTOLOGY, *RDF (Document markup language), *DOCUMENT markup languages - Abstract
In this paper, we propose an Ontology-Style Relation (OSR) annotation approach. In conventional Relation Extraction (RE) datasets, relations are annotated as links between two entity mentions. In contrast, in our OSR annotation, a relation is annotated as a relation mention (i.e., a node rather than a link), and rdfs:domain and rdfs:range links are annotated from the relation mention to its argument entity mentions. This approach has the following benefits: (1) the relation annotations can easily be converted to Resource Description Framework (RDF) triples to populate an Ontology; (2) part of conventional RE tasks can be tackled as Named Entity Recognition (NER) tasks, with the relation classes limited to several RDF properties; and (3) OSR annotations can be used for clear documentation of Ontology contents. We conducted two kinds of evaluation to investigate the effects of OSR annotation. We converted (1) an in-house corpus of the Japanese Rules of the Road (RoR) from conventional annotations into OSR annotations, building a novel OSR-RoR corpus, and (2) the SemEval-2010 Task 8 dataset into OSR annotations (the OSR-SemEval corpus). We compared NER and RE performance using the neural NER/RE tool DyGIE++ on the conventional and OSR annotations. The experimental results show that the OSR annotations make the RE task easier while introducing slight complexity into the NER task. [ABSTRACT FROM AUTHOR]
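Benefit (1) is easy to picture: an OSR relation mention becomes an RDF node whose arguments are attached via rdfs:domain and rdfs:range links. A small sketch with rdflib (the namespace and mention identifiers are invented for illustration; the paper's actual conversion pipeline is not shown in this abstract):

```python
from rdflib import Graph, Namespace, RDFS

# Invented namespace and mention IDs for a Rules-of-the-Road-style example.
EX = Namespace("http://example.org/osr/")

g = Graph()
# A relation mention is a node, not an edge; its argument entity mentions
# hang off it via rdfs:domain and rdfs:range, as described in the abstract.
g.add((EX.rel_overtake_01, RDFS.domain, EX.ent_vehicle_01))
g.add((EX.rel_overtake_01, RDFS.range, EX.ent_vehicle_02))

print(g.serialize(format="turtle"))
```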
- Published
- 2024
- Full Text
- View/download PDF
41. Developing a cardiovascular disease risk factor annotated corpus of Chinese electronic medical records
- Author
- Jia Su, Bin He, Yi Guan, Jingchi Jiang, and Jinfeng Yang
- Subjects
Cardiovascular disease risk factors, Chinese electronic medical records, Annotation, Corpus construction, Information extraction, Computer applications to medicine. Medical informatics, R858-859.7 - Abstract
Background: Cardiovascular disease (CVD) has become the leading cause of death in China, and most cases can be prevented by controlling risk factors. The goal of this study was to build a corpus of CVD risk factor annotations based on Chinese electronic medical records (CEMRs). This corpus is intended for developing a risk factor information extraction system that, in turn, can serve as a foundation for further study of the progression of risk factors and CVD. Results: We designed a light annotation task to capture CVD risk factors, with indicators, temporal attributes, and assertions that were explicitly or implicitly displayed in the records. The task included: 1) preparing the data; 2) creating guidelines for capturing annotations, with the help of clinicians; and 3) proposing an annotation method that covered drafting the guidelines, training the annotators, updating the guidelines, and constructing the corpus. We also introduced several distinctive annotation guidelines: (1) medical examination values below the risk thresholds were annotated, to support studying the progression of risk factors and CVD; (2) possible and negative risk factors were included for the same reason, with assertions created for the annotations; and (3) four temporal attributes were added to CVD risk factors in CEMRs to capture long-term variation. A risk factor annotated corpus based on de-identified discharge summaries and progress notes from 600 patients was then developed. Built with the help of clinicians, this corpus has an inter-annotator agreement (IAA) F1-measure of 0.968, indicating high reliability. Conclusion: To the best of our knowledge, this is the first annotated corpus concerning CVD risk factors in CEMRs, together with proposed guidelines for capturing CVD risk factor annotations from CEMRs. The resulting document-level annotations can be applied in future studies to monitor risk factors and CVD over the long term.
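The annotation elements named above (risk factor, indicator, temporal attribute, assertion) suggest a simple record shape. A hypothetical sketch of one such annotation; the field names and example values are invented, not taken from the corpus schema:

```python
from dataclasses import dataclass

# Hypothetical shape of one document-level annotation, mirroring the
# elements named in the abstract. All field names and values are invented.
@dataclass
class RiskFactorAnnotation:
    factor: str      # e.g. "hypertension" or "smoking"
    indicator: str   # supporting evidence, e.g. a blood-pressure reading
    temporal: str    # one of the four temporal attributes, e.g. "during"
    assertion: str   # "present", "possible", or "negative"

ann = RiskFactorAnnotation(
    factor="hypertension",
    indicator="BP 150/95 mmHg",
    temporal="during",
    assertion="present",
)
print(ann)
```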
- Published
- 2017
- Full Text
- View/download PDF
42. Grammar Errors by Slovenian Learners of Japanese: Corpus Analysis of Writings on Beginner and Intermediate Levels
- Author
- Miha PAVLOVIČ
- Subjects
Learner's Corpus, corpus construction, error analysis, grammar error, second language acquisition, Philology. Linguistics, P1-1091 - Abstract
This paper presents the construction of a corpus of writings by Slovene learners of Japanese as a foreign language at the beginner and intermediate levels, together with an analysis of the grammar errors it contains, with the purpose of providing a simple and effective means of acquiring data on errors made by students of Japanese as a second language. Compiling and analyzing a collection of 182 texts written by the learners yielded 492 cases of grammar misuse at the beginner level and 564 at the intermediate level. A comparative analysis of the most common types of grammar misuse at each level highlights the error types that carry over from the beginner to the intermediate level and thus negatively affect the learning process. The findings can be useful to learners of Japanese as well as teachers. Furthermore, the learner corpus created in the process marks the first step towards a larger, annotated, and publicly accessible learner corpus of writings by Slovenian learners of Japanese, to be used for further research in the field of second language acquisition.
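Comparing error distributions across levels of different sizes requires normalizing the raw counts. A small Python sketch of that comparison (the error categories and counts are invented, not the paper's figures):

```python
from collections import Counter

# Invented error-frequency data for two proficiency levels.
beginner = Counter({"particle": 160, "conjugation": 120, "tense": 80})
intermediate = Counter({"particle": 190, "conjugation": 90, "tense": 110})

# Normalize by the total number of errors per level (492 vs. 564 in the
# study) so levels of different sizes can be compared directly.
for level, counts in [("beginner", beginner), ("intermediate", intermediate)]:
    total = sum(counts.values())
    rates = {err: round(n / total, 2) for err, n in counts.items()}
    print(level, rates)
```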
- Published
- 2020
- Full Text
- View/download PDF
43. Constructing a longitudinal learner corpus to track L2 spoken English
- Author
- Abe Mariko and Yusuke Kondo
- Subjects
longitudinal data, learner corpus, L2 spoken English, corpus construction, automated speech recognition technology, Philology. Linguistics, P1-1091 - Abstract
The main purposes of this article are to provide an overview of a research project on a longitudinal learner spoken corpus and to share procedures for transcribing learners' utterances from audio files using automated speech recognition (ASR) technology (IBM Watson Speech to Text). The corpus data were collected two or three times a year over three consecutive years starting in 2016, creating eight data collection points altogether. They were gathered from 120 secondary school students who had been learning English in an English as a Foreign Language context for three years. The students were asked to take a monologue speaking test, the Telephone Standard Speaking Test, consisting of various tasks. The overall discussion of the article focuses on the details of this project and highlights how a methodological approach combining electronic learner language data with ASR technology is useful in constructing learner spoken corpora.
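For readers unfamiliar with the service, a minimal transcription call with the IBM Watson Speech to Text Python SDK looks roughly as follows. The API key, service URL, and audio file name are placeholders, and the project's actual recognition settings are not given in the abstract:

```python
from ibm_watson import SpeechToTextV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

# Placeholder credentials and service URL; substitute your own instance.
authenticator = IAMAuthenticator("YOUR_API_KEY")
stt = SpeechToTextV1(authenticator=authenticator)
stt.set_service_url("https://api.us-south.speech-to-text.watson.cloud.ibm.com")

# Transcribe one learner response from a WAV file.
with open("learner_response.wav", "rb") as audio:
    response = stt.recognize(audio=audio, content_type="audio/wav").get_result()

# Join the top transcription alternative of each result chunk.
transcript = " ".join(
    chunk["alternatives"][0]["transcript"] for chunk in response["results"]
)
print(transcript)
```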
- Published
- 2019
44. GRAMMAR ERRORS BY SLOVENIAN LEARNERS OF JAPANESE: CORPUS ANALYSIS OF WRITINGS ON BEGINNER AND INTERMEDIATE LEVELS.
- Author
- PAVLOVIČ, Miha
- Subjects
SECOND language acquisition, JAPANESE language, GRAMMAR, ERROR analysis in mathematics, CORPORA, LANGUAGE & languages
- Published
- 2020
- Full Text
- View/download PDF
45. An Efficient Framework for Constructing Speech Emotion Corpus Based on Integrated Active Learning Strategies
- Author
- Fuji Ren, Zheng Liu, and Xin Kang
- Subjects
Human-Computer Interaction, Corpus construction, Active learning, Imbalanced dataset, Speech emotion recognition, Affective computing, Software - Abstract
Speech emotion recognition has developed rapidly in recent decades with the rise of machine learning. Nevertheless, the lack of corpora remains a significant issue. For real-world speech emotion corpus construction, many professional actors are required to perform voices with various emotions in specific scenes. During data labelling, because the number of samples in different emotion categories is extremely imbalanced, it is difficult to label the samples efficiently. We therefore propose an integrated active learning sampling strategy and design an efficient framework for constructing speech emotion corpora to address these problems. In comparison experiments with other active learning algorithms on 13 datasets, our method improved sampling efficiency, and it preferentially selects minority-category samples for labelling in imbalanced datasets. In actual corpus construction experiments, our method prioritizes small-class emotion samples: even when less than 50% of the data is labelled, accuracy can still reach 90%. This greatly enhances the efficiency of constructing speech emotion corpora and helps fill the corpus gap.
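A common building block of such strategies is to score unlabelled candidates by prediction uncertainty while boosting classes that are still rare in the labelled pool. The following is a simplified sketch of that idea, not the paper's exact integrated strategy:

```python
import numpy as np

def sample_scores(probs, labelled_class_counts):
    """Score candidates: entropy uncertainty times a minority-class bonus.

    A simplified illustration of active learning on imbalanced data,
    not the paper's integrated sampling strategy.
    """
    # Predictive entropy: higher means the model is less certain.
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    # Weight each candidate by the inverse frequency of its most likely
    # class in the labelled pool, so minority-class candidates are preferred.
    predicted = probs.argmax(axis=1)
    counts = np.asarray(labelled_class_counts, dtype=float)
    rarity = 1.0 / (counts[predicted] + 1.0)
    return entropy * rarity

# Three candidates over three emotion classes; the third candidate's
# likely class has only 10 labelled samples, so it scores highest.
probs = np.array([[0.6, 0.3, 0.1], [0.34, 0.33, 0.33], [0.1, 0.1, 0.8]])
print(sample_scores(probs, labelled_class_counts=[500, 40, 10]))
```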
- Published
- 2022
46. WellXplain: Wellness concept extraction and classification in Reddit posts for mental health analysis.
- Author
- Garg, Muskan
- Subjects
- *LANGUAGE models, *MENTAL health personnel, *MENTAL health screening, *LANGUAGE acquisition, *MENTAL health - Abstract
Amid the ongoing mental health crisis, there is an increasing need to discern possible signs of mental disturbance manifested in social media text. During in-person therapy sessions, mental health professionals employ manual methods to identify the root causes and outcomes of latent factors contributing to mental disturbances, a painstaking and time-intensive process. Neglecting multi-dimensional aspects of well-being (i.e., wellness dimensions) over time can adversely affect an individual's mind. To enable such fine-grained mental health screening, we define the task of identifying wellness concepts in Reddit posts and classifying them into pre-defined dimensions. We construct a novel dataset called WellXplain, which consists of 3,092 instances and a total of 72,813 words. Our experts developed an annotation scheme based on an adaptation of Halbert L. Dunn's theory of wellness dimensions. The data also encompasses human-annotated text spans that serve as pertinent explanations for decision-making during wellness concept classification. We anticipate that releasing the dataset and evaluating the baselines will facilitate the development of new language models for concept extraction and classification in the healthcare domain. • Introducing the dataset's importance in mental healthcare simulations. • Creating a corpus for well-being concept analysis. • Exploring domain-specific transformers and language models. • Assessing a traditional multi-class classifier and its reliability. [ABSTRACT FROM AUTHOR]
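As a picture of what the "traditional multi-class classifier" baseline in the highlights might look like, here is a minimal TF-IDF plus logistic regression sketch. The example posts and dimension labels are invented stand-ins for WellXplain rows:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented stand-ins for WellXplain instances; the real dataset has
# 3,092 annotated posts and its own wellness-dimension inventory.
texts = [
    "I went for a run today and felt much better afterwards",
    "I can't stop worrying about my exams",
    "Had a great dinner with friends tonight",
]
labels = ["physical", "emotional", "social"]

# TF-IDF features feeding a logistic regression classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)
print(model.predict(["feeling anxious about tomorrow's meeting"]))
```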
- Published
- 2024
- Full Text
- View/download PDF
47. Corpus Construction for Historical Newspapers: A Case Study on Public Meeting Corpus Construction Using OCR Error Correction
- Author
- Tanaka, Koji, Chu, Chenhui, Kajiwara, Tomoyuki, Nakashima, Yuta, Takemura, Noriko, Nagahara, Hajime, and Fujikawa, Takao
- Published
- 2022
- Full Text
- View/download PDF
48. Sampling and Features: A Commentary on Condit-Schultz (2016)
- Author
- Mitchell Ohriner
- Subjects
rap music, corpus construction, symbolic music representation, Music, M1-5000 - Abstract
In this commentary, I highlight some of the novel contributions of Nathaniel Condit-Schultz's "MCFlow: A Digital Corpus of Rap Transcriptions" and discuss issues of rhyme definition, sampling and corpus construction, feature representation, and historical narratives.
- Published
- 2017
- Full Text
- View/download PDF
49. Resource creation for opinion mining: a case study with Marathi movie reviews
- Author
- Mhaske, N. T. and Patil, A. S.
- Published
- 2021
- Full Text
- View/download PDF
50. Information Extraction from Public Meeting Articles
- Author
- Virgo, Felix Giovanni, Chu, Chenhui, Ogawa, Takaya, Tanaka, Koji, Ashihara, Kazuki, Nakashima, Yuta, Takemura, Noriko, Nagahara, Hajime, and Fujikawa, Takao
- Published
- 2022
- Full Text
- View/download PDF