Descriptor: "Doc2Vec" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Doc2Vec"' showing total 353 results

1. A text-based recommender system for recommending relevant news articles

Author: Walek, Bogdan and Müller, Patrik
Published: 2025
Full Text: View/download PDF

2. Deep learning based online fake review detection technique.

Author: Singh, Uday Pratap and Kaur, Nirmal
Abstract: The growth of e-commerce has made fraudulent practices one of the major risks to the industry. Fraudulent activities substantially harm the ranking systems of e-commerce platforms and have a negative impact on users' buying experiences. Today's consumers strive to learn about all the benefits and drawbacks of a product or service before making a purchase; therefore, online polling websites are crucial to increasing product sales. This paper focuses on detecting false reviews posted by users on e-commerce platforms using a deep learning technique. To classify fake reviews, a deep learning model called Bi-LSTM CNN is proposed and developed, using Doc2vec with term frequency-inverse document frequency (TF-IDF) as the feature extraction method. The proposed model achieves an accuracy of 97.3% on the Ott dataset compared with other deep learning based models. [ABSTRACT FROM AUTHOR]
Published: 2025
Full Text: View/download PDF

3. How does air pollution affect floating population in metropolitan city: embedding-based approach.

Author: Lee, Won Sang
Subjects: ASSOCIATION rule mining, AIR pollution, URBAN pollution, AGE groups, MACHINE learning
Abstract: Air pollution has recently become a serious concern in urban environments. In particular, the relationship between air pollution and the daily movements of citizens has not yet been sufficiently investigated. This study empirically identifies patterns of air pollution in Seoul, a metropolitan city in South Korea, using association rule mining. As a result, 214 air pollution patterns are discovered and embedded into vectors using the Doc2Vec technique. The paper then examines how the movement of citizens reacts to the discovered patterns by applying linear regression to the floating population, with emphasis on walk traffic. Specifically, walk traffic is divided into 14 categories by gender and age group, and the effects of air pollution patterns on each subgroup's walk traffic are analyzed. The findings provide researchers and practitioners with empirical evidence on the estimated sensitivity to air pollution by generation and gender. The paper contributes a new methodological framework for managing air pollution in the urban environment. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

4. LncRNA–miRNA interactions prediction based on meta‐path similarity and Gaussian kernel similarity.

Author: Xie, Jingxuan, Xu, Peng, Lin, Ye, Zheng, Manyu, Jia, Jixuan, Tan, Xinru, Sun, Jianqiang, and Zhao, Qi
Subjects: LINCRNA, MICRORNA, RNA, NEIGHBORHOODS, ALGORITHMS
Abstract: Long non‐coding RNAs (lncRNAs) and microRNAs (miRNAs) are two typical types of non‐coding RNAs that interact and play important regulatory roles in many animal organisms. Exploring the unknown interactions between lncRNAs and miRNAs contributes to a better understanding of their functional involvement. Currently, studying the interactions between lncRNAs and miRNAs heavily relies on laborious biological experiments. Therefore, it is necessary to design a computational method for predicting lncRNA–miRNA interactions. In this work, we propose a method called MPGK‐LMI, which utilizes a graph attention network (GAT) to predict lncRNA–miRNA interactions in animals. First, we construct a meta‐path similarity matrix based on known lncRNA–miRNA interaction information. Then, we use GAT to aggregate the constructed meta‐path similarity matrix and the computed Gaussian kernel similarity matrix to update the feature matrix with neighbourhood information. Finally, a scoring module is used for prediction. By comparing with three state‐of‐the‐art algorithms, MPGK‐LMI achieves the best results in terms of performance, with AUC value of 0.9077, AUPR of 0.9327, ACC of 0.9080, F1‐score of 0.9143 and precision of 0.8739. These results validate the effectiveness and reliability of MPGK‐LMI. Additionally, we conduct detailed case studies to demonstrate the effectiveness and feasibility of our approach in practical applications. Through these empirical results, we gain deeper insights into the functional roles and mechanisms of lncRNA–miRNA interactions, providing significant breakthroughs and advancements in this field of research. In summary, our method not only outperforms others in terms of performance but also establishes its practicality and reliability in biological research through real‐case analysis, offering strong support and guidance for future studies and applications. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

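5. Analisis Perbandingan Teknik Word2vec dan Doc2vec dalam Mengukur Kemiripan Dokumen Menggunakan Cosine Similarity [Comparative Analysis of Word2vec and Doc2vec Techniques in Measuring Document Similarity Using Cosine Similarity]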

Author: Dede Iskandar and Ana Kurniawati
Subjects: Word2vec, Doc2vec, Cosine Similarity, Document Similarity, Technology, Information technology, T58.5-58.64
Abstract: The digital era has made access to many online documents easier and faster, but it has also created complex challenges in information management and analysis. One of the main challenges is measuring the similarity between documents, which is crucial for various applications such as plagiarism detection. In response to this challenge, many techniques can be used to represent documents as vectors to measure document similarity. In this research, Word2vec and Doc2vec techniques are used to represent documents as vectors, and Cosine Similarity is used to measure document similarity. The research objects are abstract paragraphs from 20 scientific journals on data mining published between 2020 and 2024 from Gunadarma University's E-Journal. The research methodology includes data collection, text mining, text pre-processing, implementation of the Word2vec and Doc2vec techniques, and Cosine Similarity measurement. The results show that the Word2vec technique produces higher Cosine Similarity values than Doc2vec for the same journal pairs: for example, the journal pair J02 and J14 has a Cosine Similarity value of 0.892 with Word2vec but 0.434 with Doc2vec. This indicates that the Word2vec technique is more effective than the Doc2vec technique in capturing semantic similarities between journals.
Published: 2025
Full Text: View/download PDF
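
The comparison in record 5 rests on cosine similarity between document vectors. The following is a minimal sketch of that measurement, not the authors' implementation: the toy word vectors are hypothetical, and building a document vector by averaging word vectors is an assumption (the abstract does not say how Word2vec vectors were combined per document).

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def mean_vector(word_vectors):
    """Average word vectors into one document vector (assumed baseline)."""
    n = len(word_vectors)
    return [sum(column) / n for column in zip(*word_vectors)]


# Hypothetical 3-dimensional word vectors for two short documents.
doc_a_words = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
doc_b_words = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]]

doc_a = mean_vector(doc_a_words)  # [0.5, 0.5, 1.0]
doc_b = mean_vector(doc_b_words)  # [1.0, 0.5, 0.5]
similarity = cosine_similarity(doc_a, doc_b)
print(round(similarity, 4))
```

With a Doc2vec model, the averaging step would be replaced by the model's own document vector; the cosine computation stays the same.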

6. Exploring technology fusion by combining latent Dirichlet allocation with Doc2vec: a case of digital medicine and machine learning.

Author: Gao, Qiang and Jiang, Man
Abstract: As a driving force behind innovation, technological fusion has emerged as a prevailing trend in knowledge innovation. However, current research lacks the semantic analysis and identification of knowledge fusion across technological domains. To bridge this gap, we propose a strategy that combines the latent Dirichlet allocation (LDA) topic model and the Doc2vec neural network semantic model to identify fusion topics across various technology domains. Then, we fuse the semantic information of patents to measure the characteristics of fusion topics in terms of knowledge diversity, homogeneity and cohesion. Applying this method to a case study in the fields of digital medicine and machine learning, we identify six fusion topics from two technology domains, revealing two distinct trends: diffusion from the center to the periphery and clustering from the periphery to the center. The study shows that the fusion measure of topic-semantic granularity can reveal the variability of technology fusion processes at a profound level. The proposed research method will benefit scholars in conducting multi-domain technology fusion research and gaining a deeper understanding of the knowledge fusion process across technology domains from a semantic perspective. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

7. Offense Feature Extraction and Comparative Analysis of Jaccard Similarity and Word Embedding Techniques for IPC Section Recommendations in First Information Report

Author: Srivastav, Ambrish and Prajapat, Shaligram
Editor: Naik, Nitin, Jenkins, Paul, Prajapat, Shaligram, and Grace, Paul
Series Editor: Kacprzyk, Janusz
Advisory Editor: Gomide, Fernando, Kaynak, Okyay, Liu, Derong, Pedrycz, Witold, Polycarpou, Marios M., Rudas, Imre J., and Wang, Jun
Published: 2024
Full Text: View/download PDF

8. A Similarity Approach for the Classification of Mitigations in Public Cybersecurity Repositories into NIST-SP 800-53 Catalog

Author: Elmarkez, Ahmed, Mesli-Kesraoui, Soraya, Oquendo, Flavio, Berruet, Pascal, and Kesraoui, Djamal
Editor: Bouzefrane, Samia and Sauveron, Damien
Series Editor: Goos, Gerhard, van Leeuwen, Jan, Kobsa, Alfred, Nierstrasz, Oscar, Sudan, Madhu, Weikum, Gerhard, and Vardi, Moshe Y
Founding Editor: Hartmanis, Juris
Editorial Board Member: Hutchison, David, Kanade, Takeo, Kittler, Josef, Kleinberg, Jon M., Mattern, Friedemann, Mitchell, John C., Naor, Moni, Pandu Rangan, C., Terzopoulos, Demetri, Tygar, Doug, Bertino, Elisa, Gao, Wen, Steffen, Bernhard, Yung, Moti, and Woeginger, Gerhard
Published: 2024
Full Text: View/download PDF

9. On the Relevance of Graph2Vec Source Code Embeddings for Software Defect Prediction

Author: Miholca, Diana-Lucia and Oneţ-Marian, Zsuzsanna
Editor: Fill, Hans-Georg, Domínguez Mayo, Francisco José, van Sinderen, Marten, and Maciaszek, Leszek A.
Editorial Board Member: Filipe, Joaquim, Ghosh, Ashish, and Zhou, Lizhu
Published: 2024
Full Text: View/download PDF

10. LogiTriBlend: A Novel Hybrid Stacking Approach for Enhanced Phishing Email Detection Using ML Models and Vectorization Approach

Author: Aqsa Khalid, Maria Hanif, Abdul Hameed, Zeeshan Ashraf, Mrim M. Alnfiai, and Salma M. Mohsen Alnefaie
Subjects: TF-IDF, Word2Vec, Doc2Vec, LogiTriBlend, SVM, XGBoost, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
Abstract: Email phishing remains a prevalent and sophisticated cyber threat, targeting individuals and organizations by disguising malicious intent in seemingly legitimate communications. Effective classification of phishing and legitimate emails is crucial for cybersecurity. In this study, we investigated various text vectorization techniques and machine learning models to address the challenge of email classification. We utilized three vectorization techniques: TF-IDF, Word2Vec, and Doc2Vec. These techniques were applied to traditional machine learning algorithms, and their performance was evaluated against a proposed stacking model, LogiTriBlend. The dataset comprised 501 phishing and 4090 legitimate emails, undergoing preprocessing steps like stemming, lemmatization, and noise removal. To handle the dataset’s imbalance, Synthetic Minority Over-sampling Technique (SMOTE) was employed. The model combines multiple base learners, including Support Vector Machine (SVM), Logistic Regression, Random Forest, and XGBoost, with a Logistic Regression meta-learner. The experimental results indicated that the LogiTriBlend model achieved an accuracy of 99.34% using Doc2Vec, outperforming Word2Vec and TF-IDF feature extraction methods, which obtained accuracies of 99.12% and 98.80%, respectively. The Doc2Vec method thus resulted in superior email classification performance. Among the models tested, the proposed stacking model, LogiTriBlend, demonstrated robust results; however, the highest accuracy was consistently achieved using Doc2Vec.
Published: 2024
Full Text: View/download PDF

11. Deceptive opinion spam detection using feature reduction techniques.

Author: Maurya, Sushil Kumar, Singh, Dinesh, and Maurya, Ashish Kumar
Abstract: People usually prepare themselves by reading online reviews before purchasing a product. Sellers sometimes try to imitate user experience as a deceptive review to increase profits. Deceptive opinion spam detection has emerged as a challenging task in the field of opinion mining. Feature reduction techniques play the most important role in data mining, finding the essential features and removing the unnecessary dimensions that only contribute to the noise. This article extracts various textual features of gold-standard deceptive hotel reviews using different representation techniques like Part of Speech tag (POS tag), Bag of Words (BoW), and Doc2Vec. Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) are applied to reduce the features' dimensions. Various supervised classifiers like Decision Tree (DT), Naïve Bayes (NB), Logistic Regression (LR), and Support Vector Machine (SVM) are used to classify deceptive opinions and truthful opinions. The features used by these supervised classifiers cannot retain sequential information from reviews. To overcome this problem, we used the Words Attention-based Bidirectional Long Short-Term Memory (WABiLSTM) network model that trains to learn the patterns of words. The article examines machine and deep learning-based spam detection models and provides their outline and results. The metrics like accuracy, precision, recall, and F-Measure are used to analyze the performance of these classification models. The experimental results showed the model's performance improved after reducing the features. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

12. DeepPPThermo: A Deep Learning Framework for Predicting Protein Thermostability Combining Protein-Level and Amino Acid-Level Features.

Author: Xiang, Xiaoyang, Gao, Jiaxuan, and Ding, Yanrui
Subjects: *DEEP learning, *MACHINE learning, *SHORT-term memory, *PROTEINS
Abstract: Using wet experimental methods to discover new thermophilic proteins or improve protein thermostability is time-consuming and expensive. Machine learning methods have shown powerful performance in the study of protein thermostability in recent years. However, how to make full use of multiview sequence information to predict thermostability effectively is still a challenge. In this study, we proposed a deep learning-based classifier named DeepPPThermo that fuses features of classical sequence features and deep learning representation features for classifying thermophilic and mesophilic proteins. In this model, deep neural network (DNN) and bi-long short-term memory (Bi-LSTM) are used to mine hidden features. Furthermore, local attention and global attention mechanisms give different importance to multiview features. The fused features are fed to a fully connected network classifier to distinguish thermophilic and mesophilic proteins. Our model is comprehensively compared with advanced machine learning algorithms and deep learning algorithms, proving that our model performs better. We further compare the effects of removing different features on the classification results, demonstrating the importance of each feature and the robustness of the model. Our DeepPPThermo model can be further used to explore protein diversity, identify new thermophilic proteins, and guide directed mutations of mesophilic proteins. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

13. 지방자치단체 비상벨에 AI Whisper와 ChatGPT 기술연계 [Linking AI Whisper and ChatGPT Technology to Local Government Emergency Bells].

Author: 이우익 and 박대우
Subjects: ARTIFICIAL intelligence
Abstract: Local governments in Korea are installing and operating emergency bells. The emergency bell must transmit the user's emergency rescue call or crisis situation to the control center or police station. However, in the case of local government public emergency bells, in addition to situation-specific delivery, two-way emergency response is required to respond to emergency situations, and emergency support must be available even when personal information protection is necessary depending on the situation. Additionally, it must be possible to provide accurate answers to citizens' questions or requests. In this paper, we connect Whisper and ChatGPT (Generative Pre-trained Transformer) systems to local government emergency bells and study a warning system that analyzes emergency communications through artificial intelligence video analysis and voice and text recognition and learning, in order to provide more accurate emergency responses. This study is expected to provide a solution that improves people's quality of life. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

14. Smarter people analytics with organizational text data: Demonstrations using classic and advanced NLP models.

Author: Guo, Feng, Gallagher, Christopher M., Sun, Tianjun, Tavoosi, Saba, and Min, Hanyi
Subjects: NATURAL language processing, PERSONNEL management, TEXT mining, DATA analytics, SYNTAX (Grammar)
Abstract: Recent developments in text mining and natural language processing (NLP) have paved a new way for analysing text data. These techniques are particularly useful for human resource management (HRM) due to the large amount of text information in the field. This paper adds to the literature by introducing and demonstrating steps of using NLP. Two demonstrations are presented: Demonstration One illustrates how simple and straightforward Bag‐of‐Word models applied on textual comments can be used to predict numerical ratings of companies; Demonstration Two shows how personality (self‐reported scores on the Big Five) can be predicted from situational interview questions through more complex Doc2Vec algorithms. Together, these demonstrations show that both simple and complex techniques are effective tools in predicting organizational outcomes. Accessible syntax and guides for beginners are also provided. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

15. Comparison between document vectorization methods: a case study for textual data.

Author: Kubrusly, Jéssica and Valenotti, Gabriel G. L.
Subjects: TEXT mining, TEXTUAL criticism, DATA mining, WOMEN'S clothing, DATABASES, RANDOM forest algorithms
Source: Sigmae (Universidade Federal de Alfenas, UNIFAL-MG)
Published: 2024

16. An Optimized Chinese Filtering Model Using Value Scale Extended Text Vector.

Author: Siyu Lu, Ligao Cai, Zhixin Liu, Shan Liu, Bo Yang, Lirong Yin, Mingzhe Liu, and Wenfeng Zheng
Subjects: ARTIFICIAL neural networks, WORD frequency, INFORMATION retrieval, WEBSITES, DATA analysis
Abstract: With the development of Internet technology, the explosive growth of information on the Internet has made it difficult to filter useful information. Finding a highly accurate text classification model has become a critical problem for text filtering, especially for Chinese texts. This paper uses manually calibrated comment data from the Douban movie website. First, a text filtering model based on the BP neural network is built. Second, a text word frequency vector and a text semantic vector are obtained from the Term Frequency-Inverse Document Frequency (TF-IDF) vector space model and the doc2vec method, respectively, and the dimensionality of the word frequency vector is linearly reduced by Principal Component Analysis (PCA). Third, the reduced word frequency vector and the text semantic vector are combined with the text value degree to construct a text synthesis vector. Experiments show that the model combining the dimensionality-reduced text word frequency vector, the text semantic vector, and the text value degree reaches the highest accuracy, 84.67%. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

17. An exploratory study of net zero discourse based on South Korean newspapers: a topic modeling and sentiment analysis approach.

Author: Yun, Bitnari, Lim, JongYeon, and Yun, Minyoung
Abstract: Public support for net zero is an important determinant of the solution for climate change. Newspapers can be used as a data source for observing public discourse on a variety of subjects including net zero. Public discourse produced through newspapers is not value-neutral; in particular, media inevitably has political orientation. This study examines the major topics of net zero articles, and whether and how articles from different political orientations differ in the tone of their reporting on net zero. Latent Dirichlet allocation (LDA)-based topic modeling was applied to infer the major topics, and sentiment analysis was applied to understand the sentiments and tones of the articles. We found that the 16 major topics fell into 5 groups covering aspects such as environment, socio-politics, and the economy. We also found that although the majority of articles are neutral, differences in the tone of South Korean media were particularly noticeable in negatively classified articles. These findings suggest that newspapers, particularly politically oriented newspapers, can influence public acceptance, support for net zero, and therefore the implementation of government policies in certain ways. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

18. Emvirus: An embedding-based neural framework for human-virus protein-protein interactions prediction

Author: Pengfei Xie, Jujuan Zhuang, Geng Tian, and Jialiang Yang
Subjects: SARS-CoV-2, human-virus PPI, Word embedding, Doc2vec, Neural networks, Infectious and parasitic diseases, RC109-216, Public aspects of medicine, RA1-1270
Abstract: Human-virus protein-protein interactions (PPIs) play critical roles in viral infection. For example, the spike protein of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) binds primarily to human angiotensin-converting enzyme 2 (ACE2) protein to infect human cells. Thus, identifying and blocking these PPIs contribute to controlling and preventing viruses. However, wet-lab experiment-based identification of human-virus PPIs is usually expensive, labor-intensive, and time-consuming, which presents the need for computational methods. Many machine-learning methods have been proposed recently and achieved good results in predicting human-virus PPIs. However, most methods are based on protein sequence features and apply manually extracted features, such as statistical characteristics, phylogenetic profiles, and physicochemical properties. In this work, we present an embedding-based neural framework with convolutional neural network (CNN) and bi-directional long short-term memory unit (Bi-LSTM) architecture, named Emvirus, to predict human-virus PPIs (including human–SARS-CoV-2 PPIs). In addition, we conduct cross-viral experiments to explore the generalization ability of Emvirus. Compared to other feature extraction methods, Emvirus achieves better prediction accuracy.
Published: 2023
Full Text: View/download PDF

19. Learning Software Project Management From Analyzing Q&A’s in the Stack Exchange

Author: Alireza Ahmadi, Fatemeh Delkhosh, Gouri Deshpande, Raymond A. Patterson, and Guenther Ruhe
Subjects: Software project management, PMBOK, stack exchange, BERT, Doc2Vec, learning, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
Abstract: Software Project Management (SPM) is considered the key driver for the success or failure of software projects. Project failure is caused by various factors, the most important of which is poor SPM. Thus, we investigated the needs of practitioners by focusing on Project Management Q&A communities. More precisely, we targeted Stack Exchange to identify the primary needs of software project managers. More than 5000 SPM questions were analyzed from the conceptual model given by the Project Management Body of Knowledge PMBOK. For pre-training of the Machine Learning classifiers, we implemented Bidirectional Encoder Representations from Transformers (BERT) and Doc2Vec text embedding and compared their performance. Our results showed that BERT outperforms Doc2Vec for pre-training in almost all scenarios. Schedule management, followed by resource management, are the main PMBOK knowledge areas of concern for project managers. Among the process groups, the emphasis of the questions is on planning. We compared the findings with the learning and training status quo in 11 top Canadian universities. We analyzed 46 SPM-related courses and found that the rank correlation of PMBOK knowledge areas is 0.23 between the key content of the analyzed courses and the focus of Q&A’s knowledge areas analyzed from Stack Exchange.
Published: 2023
Full Text: View/download PDF

20. Distributional Representation of Cyclic Alternating Patterns for A-Phase Classification in Sleep EEG.

Author: Vergara-Sánchez, Diana Laura, Calvo, Hiram, and Moreno-Armendáriz, Marco A.
Subjects: ELECTROENCEPHALOGRAPHY, MACHINE learning, SLEEP
Abstract: This article describes a detailed methodology for the A-phase classification of the cyclic alternating patterns (CAPs) present in sleep electroencephalography (EEG). CAPs are a valuable EEG marker of sleep instability and represent an important pattern with which to analyze additional characteristics of sleep processes, and A-phase manifestations have been linked to some specific conditions. CAP phase detection and classification are not commonly carried out routinely due to the time and attention this problem requires (and if present, CAP labels are user-dependent, visually evaluated, and hand-made); thus, an automatic tool to solve the CAP classification problem is presented. The classification experiments were carried out using a distributional representation of the EEG data obtained from the CAP Sleep Database. For this purpose, data symbolization was performed using the one-dimensional symbolic aggregate approximation (1d-SAX), followed by the vectorization of symbolic data with a trained Doc2Vec model and a final classification with ten classic machine learning models for two separate validation strategies. The best results were obtained using a support vector classifier with a radial basis kernel. For hold-out validation, the best F1 Score was 0.7651; for stratified 10-fold cross-validation, the best F1 Score was 0.7611 ± 0.0133. This illustrates that the proposed methodology is suitable for CAP classification. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

21. Processing-in-Memory Development Strategy for AI Computing Using Main-Path and Doc2Vec Analyses.

Author: Chung, Euiyoung and Sohn, So Young
Abstract: Processing-in-Memory (PiM), which combines a memory device with a Processing Unit (PU) into an integrated chip, has drawn special attention in the field of Artificial Intelligence semiconductors. Currently, in the development and commercialization of PiM's technology, there are challenges in the hegemony competition between the PU and memory device industries. In addition, there are challenges in finding strategic partnerships rather than independent development due to the complexity of technological development caused by heterogeneous chips. In this study, patent Main Path Analysis (MPA) is used to identify the majority and complementary groups between PU and memory devices for PiM. Subsequently, Document-to-Vector (Doc2Vec) and similarity-scoring analyses are used to determine the potential partners for technical cooperation required for PiM technology development for the majority group identified. According to the empirical results, PiM core technology is evolving from PU to memory device with an 'architecture-operation-architecture' design pattern. The ten ASIC candidates are identified for strategic partnerships with memory device suppliers. Those partnership candidates include several mobile AP firms, implying PiM's opportunities in the field of mobile applications. It suggests that memory device suppliers should prepare for different technology strategies for PiM technology development. This study contributes to the literature and high-tech industry via the proposed quantitative technology partnership model. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

22. Automatically Assembling a Custom-Built Training Corpus for Improving the Learning of In-Domain Word/Document Embeddings.

Author: Blanco-Fernández, Yolanda, Gil-Solla, Alberto, Pazos-Arias, José J., and Quisi-Peralta, Diego
Subjects: *CORPORA, *VOCABULARY, *OCCUPATIONAL retraining
Abstract: Embedding models turn words/documents into real-number vectors via co-occurrence data from unrelated texts. Crafting domain-specific embeddings from general corpora with limited domain vocabulary is challenging. Existing solutions retrain models on small domain datasets, overlooking potential of gathering rich in-domain texts. We exploit Named Entity Recognition and Doc2Vec for autonomous in-domain corpus creation. Our experiments compare models from general and in-domain corpora, highlighting that domain-specific training attains the best outcome. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

23. EVALUATING OF EFFICACY SEMANTIC SIMILARITY METHODS FOR COMPARISON OF ACADEMIC THESIS AND DISSERTATION TEXTS.

Author: Hassan, Ramadan Thkri and Ahmed, Nawzat Sadiq
Subjects: SEMANTICS, NATURAL language processing, ACCURACY, SIMILARITY (Language learning), COMPARATIVE studies
Abstract: Detecting semantic similarity between documents is vital in natural language processing applications. One widely used method for measuring the semantic similarity of text documents is embedding, which involves converting texts into numerical vectors using various NLP methods. This paper presents a comparative analysis of four embedding methods for detecting semantic similarity in theses and dissertations, namely Term Frequency-Inverse Document Frequency, Document to Vector, Sentence Bidirectional Encoder Representations from Transformers, and Bidirectional Encoder Representations from Transformers with cosine similarity. The study used two datasets consisting of 27 documents from Duhok Polytechnic University and 100 documents from ProQuest.com. The texts from these documents were pre-processed to make them suitable for semantic similarity analysis. The evaluation of the methods was based on several metrics, including accuracy, precision, Recall, F1 score, and processing time. The results showed that the traditional method, TF-IDF, outperformed modern methods in embedding and detecting actual semantic similarity between documents, with processing time not exceeding a few seconds. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

24. Evaluating the Possibility of Evasion Attacks to Machine Learning-Based Models for Malicious PowerShell Detection

Author: Mezawa, Yuki, Mimura, Mamoru, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Su, Chunhua, editor, Gritzalis, Dimitris, editor, and Piuri, Vincenzo, editor
Published: 2022
Full Text: View/download PDF

25. Hybridizing Sentence Transformer Model with Multi-KNN for Biomedical Documents

Author: Ahmad, Owais, Verma, Sadika, Azim, Shahid, Sharan, Aditi, Bansal, Jagdish Chand, Series Editor, Deep, Kusum, Series Editor, Nagar, Atulya K., Series Editor, Jacob, I. Jeena, editor, Kolandapalayam Shanmugam, Selvanayaki, editor, and Bestak, Robert, editor
Published: 2022
Full Text: View/download PDF

26. Clustering Text: A Comparison Between Available Text Vectorization Techniques

Author: Singh, Lovedeep, Kacprzyk, Janusz, Series Editor, Pal, Nikhil R., Advisory Editor, Bello Perez, Rafael, Advisory Editor, Corchado, Emilio S., Advisory Editor, Hagras, Hani, Advisory Editor, Kóczy, László T., Advisory Editor, Kreinovich, Vladik, Advisory Editor, Lin, Chin-Teng, Advisory Editor, Lu, Jie, Advisory Editor, Melin, Patricia, Advisory Editor, Nedjah, Nadia, Advisory Editor, Nguyen, Ngoc Thanh, Advisory Editor, Wang, Jun, Advisory Editor, Reddy, V. Sivakumar, editor, Prasad, V. Kamakshi, editor, Wang, Jiacun, editor, and Reddy, K. T. V., editor
Published: 2022
Full Text: View/download PDF

27. EVALUATING OF EFFICACY SEMANTIC SIMILARITY METHODS FOR COMPARISON OF ACADEMIC THESIS AND DISSERTATION TEXTS

Author: Ramadan T. Hassan and Nawzat S. Ahmed
Subjects: TF-IDF, BERT, SBERT, Doc2Vec, Semantic Similarity, Cosine Similarity, Science
Abstract: Detecting semantic similarity between documents is vital in natural language processing applications. One widely used method for measuring the semantic similarity of text documents is embedding, which involves converting texts into numerical vectors using various NLP methods. This paper presents a comparative analysis of four embedding methods for detecting semantic similarity in theses and dissertations, namely Term Frequency–Inverse Document Frequency, Document to Vector, Sentence Bidirectional Encoder Representations from Transformers, and Bidirectional Encoder Representations from Transformers with cosine similarity. The study used two datasets consisting of 27 documents from Duhok Polytechnic University and 100 documents from ProQuest.com. The texts from these documents were pre-processed to make them suitable for semantic similarity analysis. The evaluation of the methods was based on several metrics, including accuracy, precision, Recall, F1 score, and processing time. The results showed that the traditional method, TF-IDF, outperformed modern methods in embedding and detecting actual semantic similarity between documents, with processing time not exceeding a few seconds.
Published: 2023
Full Text: View/download PDF

28. Using Retrieved Sources for Semantic and Lexical Plagiarism Detection.

Author: Saeed, Ayoub Ali M. and Taqa, Alaa Yaseen
Subjects: *LANGUAGE models, *PDF (Computer file format), *PLAGIARISM, *SEARCH engines
Abstract: Plagiarism is described as using someone else's ideas or work without their permission. Using lexical and semantic text similarity notions, this paper presents a plagiarism detection system for examining suspicious texts against available sources on the Web. The user can upload suspicious files in pdf or docx formats. The system will search three popular search engines for the source text (Google, Bing, and Yahoo) and try to identify the top five results for each search engine on the first retrieved page. The corpus is made up of the downloaded files and scraped web page text of the search engines' results. The corpus text and suspicious documents will then be encoded as vectors. For lexical plagiarism detection, the system will leverage Jaccard similarity and Term Frequency-Inverse Document Frequency (TFIDF) techniques, while for semantic plagiarism detection, Doc2Vec and Sentence Bidirectional Encoder Representations from Transformers (SBERT) intelligent text representation models will be used. Following that, the system compares the suspicious text to the corpus text. Finally, a generated plagiarism report will show the total plagiarism ratio, the plagiarism ratio from each source, and other details. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

29. Word Embedding for High Performance Cross-Language Plagiarism Detection Techniques.

Author: Bouaine, Chaimaa, Benabbou, Faouzia, and Sadgali, Imane
Subjects: DEEP learning, PLAGIARISM, MACHINE learning, NATURAL languages, INTELLECTUAL property
Abstract: Academic plagiarism has become a serious concern as it leads to the retardation of scientific progress and violation of intellectual property. In this context, we make a study aiming at the detection of cross-linguistic plagiarism based on Natural language Preprocessing (NLP), Embedding Techniques, and Deep Learning. Many systems have been developed to tackle this problem, and many rely on machine learning and deep learning methods. In this paper, we propose Cross-language Plagiarism Detection (CL-PD) method based on Doc2Vec embedding techniques and a Siamese Long Short-Term Memory (SLSTM) model. Embedding techniques help capture the text's contextual meaning and improve the CL-PD system's performance. To show the effectiveness of our method, we conducted a comparative study with other techniques such as GloVe, FastText, BERT, and Sen2Vec on a dataset combining PAN11, JRC-Acquis, Europarl, and Wikipedia. The experiments for the Spanish-English language pair show that Doc2Vec+SLSTM achieve the best results compared to other relevant models, with an accuracy of 99.81%, a precision of 99.75%, a recall of 99.88%, an f-score of 99.70%, and a very small loss in the test phase. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

30. Political Profiling of Nepali Twitter Users using Vector Model.

Author: Timalsina, Arun K. and Kharbuja, Ramesh
Subjects: MICROBLOGS, SOCIAL media, FEATURE extraction, VECTOR spaces, POLITICAL parties
Abstract: Every day, people on social networks create a huge amount of data as posts, blogs, tweets, articles, comments, etc. in the form of text, images, audio and video. The number of social media users and the amount of data they add to the cloud are increasing drastically day by day. People from all over the globe, with different regions, cultures, languages and education, as well as public figures, post or blog to reflect their vision and opinion. These micro-blogs are now being used by researchers and business houses to assess customer opinion along with implicit intention and behavior. Using tweet contents, this research classifies a Nepali Twitter user into one of the pre-defined classes of political parties in Nepal using a vector space model. In this approach, a set of words is defined as a document class that represents a political party. A number of text-preprocessing steps are carried out based on the morphological structure of the Nepali language for better results. TF-IDF and Doc2Vec methods are used to extract features of the terms used in tweets. A similarity measure is used to match the tweeter's profile with a political party's class through a similarity matching score. Vector model-based TF-IDF and Doc2Vec methods are compared for their effectiveness in the domain of tweets in the Nepali language. [ABSTRACT FROM AUTHOR]
Published: 2023

31. Transforming Data with Ontology and Word Embedding for an Efficient Classification Framework

Author: Thi Thanh Sang Nguyen, Pham Minh Thu Do, Thanh Tuan Nguyen, and Thanh Tho Quan
Subjects: ontology, Onto2Vec, Doc2Vec, Classification, Computer engineering. Computer hardware, TK7885-7895, Systems engineering, TA168
Abstract: Transforming data into appropriate formats is crucial because it can speed up the training process and enhance the performance of classification algorithms. It is, however, challenging due to the complicated process, resource-intensive and preserved meaning of the data. This study proposes new approaches to building knowledge representation models using word-embedding and ontology techniques, which can transform text data into digital data and still keep semantic/context information of themselves in order to enhance modeling data later. To evaluate the effectiveness of the built models, a classification framework is proposed and performed on a public real dataset. Experimental results show that the constructed knowledge representation models contribute significantly to the performance of classification methods.
Published: 2023
Full Text: View/download PDF

32. Discursive construction of migrant otherness on Facebook: A distributional semantics approach.

Author: Yantseva, Victoria
Subjects: *IMMIGRANTS, *EMIGRATION & immigration, *COMPUTATIONAL linguistics, *SEMANTICS
Abstract: This work aims to study the construction of migrant categories and immigration discourse on Swedish-speaking Facebook pages in the last decade. It combines the insights from computational linguistics and distributional semantics approach with those from discursive psychology to explore a corpus of more than 1 M Facebook posts. This allows one to compare the meanings of labels denoting various categories of migrants and identify the key interpretative repertoires used to discuss the immigration topic. The study finds that the 'immigrant' category has stronger association with potential costs, benefits and threat to the host society, while the 'refugee' category is presented as in need of support and solidarity. Nevertheless, objectification and exclusionary rhetoric are used in relation to both categories, although in different ways, while the immigration issue is often interpreted as a matter of Sweden's national concern rather than as a part of people's actual experiences and life paths. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

33. Searching for associations between social media trending topics and organizations.

Author: Henriques, João and Ferreira, João
Subjects: SOCIAL media, CONVOLUTIONAL neural networks, UNIVERSITY hospitals, TEXT mining, MARKETING
Abstract: Trending topics are the most discussed topics at the moment on social media platforms, particularly on Twitter and Facebook. While access to trending topics is free and available to everyone, marketing specialists and specific software are more expensive, so some companies do not have the budget to cover those costs. The main goal of this work is to search for associations between trending topics and companies on social media platforms, and the HotRivers prototype was developed to fill this gap. The approach was applied to Twitter and used text mining techniques to process tweets, train personalized models of companies and deliver a list of the trending topics matched to the target company. Different text pre-processing techniques were tested, along with a tweet selection method called Centroid Strategy, which is applied to trending topics to avoid unwanted tweets. Three models were also tested: an embedding vectors approach with a Doc2Vec model, a probabilistic model with Latent Dirichlet Allocation, and a classification task approach with a Convolutional Neural Network, which was used in the final architecture. The approach was validated with real cases such as Adidas, Nike and Portsmouth Hospitals University. The results show that the trending topic Nike is associated with the company Nike and #WorldPatientSafetyDay is associated with Portsmouth Hospitals University. The HotRivers prototype can be a new marketing tool that points the direction to the next campaign. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

34. Supporting crime script analyses of scams with natural language processing.

Author: Lwin Tun, Zeya and Birks, Daniel
Subjects: NATURAL language processing, CRIME analysis, SWINDLERS & swindling, INTERNET fraud, DIGITAL technology
Abstract: In recent years, internet connectivity and the ubiquitous use of digital devices have afforded a landscape of expanding opportunity for the proliferation of scams involving attempts to deceive individuals into giving away money or personal information. The impacts of these schemes on victims have shown to encompass social, psychological, emotional and economic harms. Consequently, there is a strong rationale to enhance our understanding of scams in order to devise ways in which they can be disrupted. One way to do so is through crime scripting, an analytical approach which seeks to characterise processes underpinning crime events. In this paper, we explore how Natural Language Processing (NLP) methods might be applied to support crime script analyses, in particular to extract insights into crime event sequences from large quantities of unstructured textual data in a scalable and efficient manner. To illustrate this, we apply NLP methods to a public dataset of victims' stories of scams perpetrated in Singapore. We first explore approaches to automatically isolate scams with similar modus operandi using two distinct similarity measures. Subsequently, we use Term Frequency-Inverse Document Frequency (TF-IDF) to extract key terms in scam stories, which are then used to identify a temporal ordering of actions in ways that seek to characterise how a particular scam operates. Finally, by means of a case study, we demonstrate how the proposed methods are capable of leveraging the collective wisdom of multiple similar reports to identify a consensus in terms of likely crime event sequences, illustrating how NLP may in the future enable crime preventers to better harness unstructured free text data to better understand crime problems. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

35. Identifying Fake Reviews in Relation with Property and Political Data Using Deep Learning.

Author: S, Ashwin Valliappan and Ramya, G R
Subjects: DEEP learning, LANGUAGE models, MACHINE learning
Abstract: Modern deep learning algorithms have achieved tremendous success in many visual applications by training a model with all relevant task-specific data. This module discusses fake reviews emerging in online political and property data. Political information differs in many ways when it moves from offline to online mode. The views and comments of politicians on social media raise the question of whether they are true or not: in some cases a statement appears to be real, and in others it may be false or fake. Because online information is often considered the most important source of information for users, these kinds of reviews and comments affect the trust factors of social media. Nowadays, social media contains more toxic content and unwanted information than useful information. These toxic comments act as a threat to users of the application, so users do not come forward to post their thoughts and actions or to share information. This module is used to distinguish reviews and to identify the factors in online political and property comments on social media. The LSTM and BERT (Bidirectional Encoder Representations from Transformers) algorithms are used in the first module because they offer a wide variety of parameters such as learning rates, input and output biases, and text categorization. Additionally, GPT2 (Generative Pre-Trained Transformer 2) is implemented, which helps in text generation, increasing the size of the dataset for training different classification models. Ultimately, the model is trained so that it gives accurate results both when classifying text and when identifying fake or authentic comments in the source data. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

36. Assessing receptive vocabulary using state‑of‑the‑art natural language processing techniques.

Author: Crossley, Scott and Holmes, Langdon
Subjects: NATURAL language processing, TRANSFORMER models, LINGUISTIC models, VOCABULARY, LINGUISTIC analysis
Abstract: Semantic embedding approaches commonly used in natural language processing such as transformer models have rarely been used to examine L2 lexical knowledge. Importantly, their performance has not been contrasted with more traditional annotation approaches to lexical knowledge. This study used NLP techniques related to lexical annotations and semantic embedding approaches to model the receptive vocabulary of L2 learners based on their lexical production during a writing task. The goal of the study is to examine the strengths and weaknesses of both approaches in understanding L2 lexical knowledge. Findings indicate that transformer approaches based on semantic embeddings outperform linguistic annotations and Word2vec models in predicting L2 learners' vocabulary scores. The findings help to support the strength and accuracy of semantic-embedding approaches as well as their generalizability across tasks when compared to linguistic feature models. Limitations to semantic-embedding approaches, especially interpretability, are discussed. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

37. Diagnostic Surveillance of High-Grade Gliomas: Towards Automated Change Detection Using Radiology Report Classification

Author: Di Noto, Tommaso, Atat, Chirine, Teiga, Eduardo Gamito, Hegi, Monika, Hottinger, Andreas, Cuadra, Meritxell Bach, Hagmann, Patric, Richiardi, Jonas, Filipe, Joaquim, Editorial Board Member, Ghosh, Ashish, Editorial Board Member, Prates, Raquel Oliveira, Editorial Board Member, Zhou, Lizhu, Editorial Board Member, Kamp, Michael, editor, Koprinska, Irena, editor, Bibal, Adrien, editor, Bouadi, Tassadit, editor, Frénay, Benoît, editor, Galárraga, Luis, editor, Oramas, José, editor, Adilova, Linara, editor, Krishnamurthy, Yamuna, editor, Kang, Bo, editor, Largeron, Christine, editor, Lijffijt, Jefrey, editor, Viard, Tiphaine, editor, Welke, Pascal, editor, Ruocco, Massimiliano, editor, Aune, Erlend, editor, Gallicchio, Claudio, editor, Schiele, Gregor, editor, Pernkopf, Franz, editor, Blott, Michaela, editor, Fröning, Holger, editor, Schindler, Günther, editor, Guidotti, Riccardo, editor, Monreale, Anna, editor, Rinzivillo, Salvatore, editor, Biecek, Przemyslaw, editor, Ntoutsi, Eirini, editor, Pechenizkiy, Mykola, editor, Rosenhahn, Bodo, editor, Buckley, Christopher, editor, Cialfi, Daniela, editor, Lanillos, Pablo, editor, Ramstead, Maxwell, editor, Verbelen, Tim, editor, Ferreira, Pedro M., editor, Andresini, Giuseppina, editor, Malerba, Donato, editor, Medeiros, Ibéria, editor, Fournier-Viger, Philippe, editor, Nawaz, M. Saqib, editor, Ventura, Sebastian, editor, Sun, Meng, editor, Zhou, Min, editor, Bitetta, Valerio, editor, Bordino, Ilaria, editor, Ferretti, Andrea, editor, Gullo, Francesco, editor, Ponti, Giovanni, editor, Severini, Lorenzo, editor, Ribeiro, Rita, editor, Gama, João, editor, Gavaldà, Ricard, editor, Cooper, Lee, editor, Ghazaleh, Naghmeh, editor, Richiardi, Jonas, editor, Roqueiro, Damian, editor, Saldana Miranda, Diego, editor, Sechidis, Konstantinos, editor, and Graça, Guilherme, editor
Published: 2021
Full Text: View/download PDF

38. Design of Book Recommendation System Using Sentiment Analysis

Author: Mounika, Addanki, Saraswathi, S., Xhafa, Fatos, Series Editor, Suma, V., editor, Bouhmala, Noureddine, editor, and Wang, Haoxiang, editor
Published: 2021
Full Text: View/download PDF

39. Improved Multi-label Medical Text Classification Using Features Cooperation

Author: Chaib, Rim, Azizi, Nabiha, Zemmal, Nawel, Schwab, Didier, Belhaouari, Samir Brahim, Xhafa, Fatos, Series Editor, Saeed, Faisal, editor, Mohammed, Fathey, editor, and Al-Nahari, Abdulaziz, editor
Published: 2021
Full Text: View/download PDF

40. 'One vs All' Classifier Analysis for Multi-label Movie Genre Classification Using Document Embedding

Author: Guehria, Sonia, Belleili, Habiba, Azizi, Nabiha, Belhaouari, Samir Brahim, Kacprzyk, Janusz, Series Editor, Pal, Nikhil R., Advisory Editor, Bello Perez, Rafael, Advisory Editor, Corchado, Emilio S., Advisory Editor, Hagras, Hani, Advisory Editor, Kóczy, László T., Advisory Editor, Kreinovich, Vladik, Advisory Editor, Lin, Chin-Teng, Advisory Editor, Lu, Jie, Advisory Editor, Melin, Patricia, Advisory Editor, Nedjah, Nadia, Advisory Editor, Nguyen, Ngoc Thanh, Advisory Editor, Wang, Jun, Advisory Editor, Abraham, Ajith, editor, Piuri, Vincenzo, editor, Gandhi, Niketa, editor, Siarry, Patrick, editor, Kaklauskas, Arturas, editor, and Madureira, Ana, editor
Published: 2021
Full Text: View/download PDF

41. Extrinsic Plagiarism Detection for French Language with Word Embeddings

Author: Elamine, Maryam, Bougares, Fethi, Mechti, Seifeddine, Hadrich Belguith, Lamia, Kacprzyk, Janusz, Series Editor, Pal, Nikhil R., Advisory Editor, Bello Perez, Rafael, Advisory Editor, Corchado, Emilio S., Advisory Editor, Hagras, Hani, Advisory Editor, Kóczy, László T., Advisory Editor, Kreinovich, Vladik, Advisory Editor, Lin, Chin-Teng, Advisory Editor, Lu, Jie, Advisory Editor, Melin, Patricia, Advisory Editor, Nedjah, Nadia, Advisory Editor, Nguyen, Ngoc Thanh, Advisory Editor, Wang, Jun, Advisory Editor, Abraham, Ajith, editor, Siarry, Patrick, editor, Ma, Kun, editor, and Kaklauskas, Arturas, editor
Published: 2021
Full Text: View/download PDF

42. Bug Prediction Using Source Code Embedding Based on Doc2Vec

Author: Aladics, Tamás, Jász, Judit, Ferenc, Rudolf, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Gervasi, Osvaldo, editor, Murgante, Beniamino, editor, Misra, Sanjay, editor, Garau, Chiara, editor, Blečić, Ivan, editor, Taniar, David, editor, Apduhan, Bernady O., editor, Rocha, Ana Maria A. C., editor, Tarantino, Eufemia, editor, and Torre, Carmelo Maria, editor
Published: 2021
Full Text: View/download PDF

43. An in-Depth Analysis of the Software Features’ Impact on the Performance of Deep Learning-Based Software Defect Predictors

Author: Diana-Lucia Miholca, Vlad-Ioan Tomescu, and Gabriela Czibula
Subjects: Deep learning, Doc2vec, latent semantic indexing, software defect prediction, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
Abstract: Software Defects Prediction represents an essential activity during software development that contributes to continuously improving software quality and software maintenance and evolution by detecting defect-prone modules in new versions of a software system. In this paper, we are conducting an in-depth analysis on the software features’ impact on the performance of deep learning-based software defect predictors. We further extend a large-scale feature set proposed in the literature for detecting defect-proneness, by adding conceptual software features that capture the semantics of the source code, including comments. The conceptual features are automatically engineered using Doc2Vec, an artificial neural network based prediction model. A broad evaluation performed on the Calcite software system highlights a statistically significant improvement obtained by applying deep learning-based classifiers for detecting software defects when using conceptual features extracted from the source code for characterizing the software entities.
Published: 2022
Full Text: View/download PDF

44. A WeChat Official Account Reading Quantity Prediction Model Based on Text and Image Feature Extraction

Author: Zijian Bai, Shuangyi Ma, and Geng Li
Subjects: Feature extraction, neural network, WeChat official accounts, Doc2Vec, user engagement, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
Abstract: This paper describes a study that built a neural network prediction model based on feature extraction, focusing on text analysis and image analysis of WeChat official accounts reading quantity. Based on the embedding method of the deep learning model, we extracted the text features in the title and the image features in the cover picture, explored the relationship between these features and the reading quantity, and built a neural network model based on these features to predict the reading quantity. The results show that there is a phenomenon of sentiment fusion in the text, and a sentence vector model based on Doc2Vec and a neural network model both had a good performance. This paper proposes a tool that can predict the reading quantity in advance and help administrators adjust the titles and images according to the predicted results.
Published: 2022
Full Text: View/download PDF

45. Duplicate Bug Report Detection Incorporating the D_BBAS Method.

Author: 曾　方, 谢　琪, and 崔梦天
Subjects: *ARTIFICIAL neural networks, *COMPUTER software development, *MAINTENANCE costs, *OFFICES, *FEATURE extraction, *INSTITUTIONAL repositories
Abstract: Developers in large software development environments rely on bug reports to complete fixes. Since reporters may use different representations to describe the same bug due to different expression habits, automated detection of duplicate bug reports can reduce development redundancy as well as maintenance costs. Recent detection of repetitive bug reports tends to use deep neural networks and considers both structured and unstructured information to generate hybrid representation features. In order to obtain the features of unstructured information of bug reports more effectively, this paper proposes a D_BBAS (Doc2Vec and BERT BiLSTM-Attention Similarity) method, which trains a feature extraction model based on a large-scale bug report library to generate a bug summary text representation set and a bug description text representation set that can reflect deep semantic information. Then, these two distributed representation sets compute the similarity of bug report pairs, resulting in two new similarity features. These two new features will participate in the detection of duplicate bug reports when combined with the traditional features generated based on structured information. In this paper, the effectiveness of the proposed approach is verified on the bug report repositories of well-known open-source projects Eclipse, NetBeans and Open Office, which contain more than 500,000 bug reports. The experimental results show that compared with the representative methods, the method in this paper improves the F1 value by 1.7% on average, which proves the effectiveness of the method in this paper. [ABSTRACT FROM AUTHOR]
Published: 2022

46. Membrane Clustering of Coronavirus Variants Using Document Similarity.

Author: Lehotay-Kéry, Péter and Kiss, Attila
Subjects: *SARS-CoV-2, *DOCUMENT clustering, *VIRAL genomes, *COVID-19 pandemic, *CORONAVIRUSES, *BASE pairs, *COMPUTATIONAL neuroscience
Abstract: Currently, as an effect of the COVID-19 pandemic, bioinformatics, genomics, and biological computations are gaining increased attention. Genomes of viruses can be represented by character strings based on their nucleobases. Document similarity metrics can be applied to these strings to measure their similarities. Clustering algorithms can be applied to the results of their document similarities to cluster them. P systems or membrane systems are computation models inspired by the flow of information in the membrane cells. These can be used for various purposes, one of them being data clustering. This paper studies a novel and versatile clustering method for genomes and the utilization of such membrane clustering models using document similarity metrics, which is not yet a well-studied use of membrane clustering models. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

47. Network dynamics in university-industry collaboration: a collaboration-knowledge dual-layer network perspective.

Author: Chen, Hongshu, Song, Xinna, Jin, Qianqian, and Wang, Ximeng
Abstract: Collaborations between universities and firms provide a key pathway for innovation. In the big data era, however, the interactions between these two communities are being reshaped by information of much higher complexity and knowledge exchanges with more volume and pace. With this research, we put forward a methodology for comprehensively measuring both actor collaboration and produced knowledge in shaping network dynamics of university-industry collaboration. Using dual-layer networks consisting of organizations and topics, we mapped the longitudinal correlations between partnerships and knowledge in terms of both co-applications of patents and semantics. Network structures, individual characteristics, and knowledge proximity indicators were used to depict the longitudinal networks and then model the network dynamics. Further, a stochastic actor-oriented model was used to provide insights into the factors contributing to the network's evolution. A case study on university-industry collaborations in the information and communications technology sector demonstrates the feasibility of the methodology. The result of this study can be used for future research into the mechanisms that underpin university-industry collaborations and opportunity discovery. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

48. A Novel Customer-Oriented Recommendation System for Paid Knowledge Products.

Author: Yang, Ting, Zhang, Jilong, Wang, Liye, and Zhang, Jin
Abstract: With the rapid development of knowledge payment, customers are faced with a large number of knowledge products when purchasing, leading to the need for an effective recommendation system. However, existing recommendation systems cannot accurately and adequately represent paid knowledge products with implicit but specialized features and sparse interactive histories, and thus are deemed not suitable for such products. In this paper, we propose a novel recommendation system for knowledge products, the core of which is the designed customer-oriented representation of knowledge products. Specifically, we utilize customer activity information on the free knowledge sharing platform as the knowledge document for each customer of paid knowledge products, to extract customer knowledge background and preference. Then, a deep learning-based model Doc2vec is adopted to transfer knowledge documents to customer knowledge background vectors. Such vectors of a particular paid knowledge product are further aggregated to a product-level vector for customer-oriented product representation, based on which two recommendation results are generated with product ratings and similarities of paid knowledge products, respectively. Extensive comparative experiments are conducted to demonstrate the effectiveness of the proposed system for the representation and recommendation of paid knowledge products. This paper will contribute to the literature of knowledge payment and recommendation systems, as well as provide practical implications for the information service and the operation of knowledge products on knowledge payment platforms. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

49. Semantic Detection of Targeted Attacks Using DOC2VEC Embedding

Author: Mariam S. El-Rahmany, Ensaf Hussein Mohamed, and Mohamed H. Haggag
Subjects: doc2vec, multi-class text classification, pretexting, sentence embedding, targeted attacks detection, Computer software, QA76.75-76.765
Abstract: The targeted attack is one of the social engineering attacks. The detection of this type of attack is considered a challenge as it depends on semantic extraction of the intent of the attacker. However, previous research has primarily relied on Natural Language Processing or Word Embedding techniques that lack the context of the attacker's text message. Based on Sentence Embedding and machine learning approaches, this paper introduces a model for semantic detection of targeted attacks. This model has the advantage of encoding relevant information, which helps to improve the performance of the multi-class classification process. Messages are categorized based on the type of security rule that the attacker has violated. The suggested model was tested using a dialogue dataset taken from phone calls, which was manually categorized into four categories. The text is pre-processed using natural language processing techniques, and the semantic features are extracted as Sentence Embedding vectors that are augmented with security policy sentences. Machine Learning algorithms are applied to classify text messages. The experimental results show that sentence embeddings with doc2vec achieved a high prediction accuracy of 96.8%, outperforming the method applied to the same dialogue dataset.
Published: 2021
Full Text: View/download PDF

50. Distributional Representation of Cyclic Alternating Patterns for A-Phase Classification in Sleep EEG

Author: Diana Laura Vergara-Sánchez, Hiram Calvo, and Marco A. Moreno-Armendáriz
Subjects: EEG, classification, cyclic alternating pattern, Doc2Vec, distributional representation, Technology, Engineering (General). Civil engineering (General), TA1-2040, Biology (General), QH301-705.5, Physics, QC1-999, Chemistry, QD1-999
Abstract: This article describes a detailed methodology for the A-phase classification of the cyclic alternating patterns (CAPs) present in sleep electroencephalography (EEG). CAPs are a valuable EEG marker of sleep instability and represent an important pattern with which to analyze additional characteristics of sleep processes, and A-phase manifestations have been linked to some specific conditions. CAP phase detection and classification are not routinely carried out due to the time and attention this problem requires (and if present, CAP labels are user-dependent, visually evaluated, and hand-made); thus, an automatic tool to solve the CAP classification problem is presented. The classification experiments were carried out using a distributional representation of the EEG data obtained from the CAP Sleep Database. For this purpose, data symbolization was performed using the one-dimensional symbolic aggregate approximation (1d-SAX), followed by the vectorization of symbolic data with a trained Doc2Vec model and a final classification with ten classic machine learning models for two separate validation strategies. The best results were obtained using a support vector classifier with a radial basis kernel. For hold-out validation, the best F1 Score was 0.7651; for stratified 10-fold cross-validation, the best F1 Score was 0.7611 ± 0.0133. This illustrates that the proposed methodology is suitable for CAP classification.
Published: 2023
Full Text: View/download PDF
