28,760 results on '"Data mining"'
Search Results
2. Exploring the Opinion Shifting of Social Media Users Based on Stance Detection.
- Author
Sun, Ran, An, Lu, and Li, Gang
- Subjects
SOCIAL media, DATA mining, PUBLIC opinion, COVID-19 vaccines
- Abstract
Fine‐grained mining of stance change among social media users during crises contributes significantly to a comprehensive understanding of the development of public opinion online. This study collected detailed information on 227,281 Twitter users who continuously participated in the discussion about the COVID‐19 vaccine to analyze patterns of opinion shifting. Utilizing the COVID‐Twitter‐BERT pre‐trained model, we detect each user's stance on the COVID‐19 vaccine and construct a time‐series dataset reflecting their stances over time. A prediction model for users' opinion shifting is established based on LightGBM, and the SHAP explanation method is used to rank feature importance and identify the key features influencing the opinion-shifting prediction. Our findings indicate that nearly half of the users maintained a consistent stance on vaccines, with a relatively low proportion of shifts between pro‐vaccine and anti‐vaccine stances. The F1 score of the LightGBM-based prediction model is 0.92. These findings can assist in monitoring social media users' opinions online. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
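A toy sketch of the stance-shift bookkeeping this abstract describes, in plain Python (not the authors' code; the users and stance labels below are invented, and the upstream stance detector, e.g. COVID-Twitter-BERT, is assumed to have already produced each user's label sequence):

```python
from collections import Counter

def shift_profile(stance_series):
    """Summarize opinion-shift patterns from per-user stance sequences.

    stance_series: dict mapping user id -> chronological stance labels
    ("pro", "anti", "neutral" -- illustrative label set)."""
    patterns = Counter()
    for user, stances in stance_series.items():
        if len(set(stances)) == 1:
            patterns["consistent"] += 1          # never changed stance
        elif "pro" in stances and "anti" in stances:
            patterns["pro<->anti"] += 1          # crossed between extremes
        else:
            patterns["other shift"] += 1         # e.g. pro <-> neutral
    total = len(stance_series)
    return {k: v / total for k, v in patterns.items()}

users = {
    "u1": ["pro", "pro", "pro"],
    "u2": ["anti", "anti"],
    "u3": ["pro", "neutral", "pro"],
    "u4": ["anti", "pro"],
}
print(shift_profile(users))
```

On this toy data, half the users are "consistent", mirroring the paper's headline finding that pro/anti crossings are comparatively rare.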
3. Building a Multimodal Dataset of Academic Paper for Keyword Extraction.
- Author
Zhang, Jingyu, Yan, Xinyi, Xiang, Yi, Zhang, Yingyi, and Zhang, Chengzhi
- Subjects
DATA mining, TEXT mining, KEYWORDS, PREDICTION models, DATA fusion (Statistics)
- Abstract
To date, the keyword extraction task has typically relied solely on textual data. Neglecting visual details and audio features from the image and audio modalities leads to deficiencies in information richness and overlooks potential correlations, thereby constraining the model's ability to learn data representations and the accuracy of its predictions. Furthermore, multimodal datasets for the keyword extraction task are particularly scarce, further hindering progress in research on multimodal keyword extraction. Therefore, this study constructs a multimodal dataset of academic papers consisting of 1,000 samples, with each sample containing paper text, images, audio and keywords. Based on unsupervised and supervised keyword extraction methods, experiments are conducted using textual data from the papers, as well as text extracted from the images and audio. The aim is to investigate how keyword extraction performance differs across modalities and with the fusion of multimodal information. The experimental results indicate that text from different modalities exhibits distinct characteristics in the model. Concatenating paper text, image text and audio text can effectively enhance the keyword extraction performance for academic papers. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
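The fusion step this abstract reports (concatenating paper text with text extracted from images and audio before extraction) can be illustrated with a minimal frequency-based extractor; the sample texts, stopword list, and function names below are invented:

```python
import re
from collections import Counter

STOPWORDS = {"the", "of", "and", "a", "to", "in", "is", "for", "on", "with", "we"}

def top_keywords(text, k=3):
    """Rank candidate keywords by raw term frequency (a simple unsupervised baseline)."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]
    return [w for w, _ in Counter(tokens).most_common(k)]

# Toy texts from the three modalities of one paper sample.
paper_text = "graph neural networks for molecule property prediction"
image_text = "figure: graph neural network architecture"           # OCR of figures
audio_text = "we present graph neural networks for molecules"      # ASR of the talk

fused = " ".join([paper_text, image_text, audio_text])
print(top_keywords(fused))  # "graph" and "neural" dominate after fusion
```

The point of the sketch: terms supported by several modalities accumulate evidence, which is the intuition behind the paper's reported gain from concatenation.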
4. An Overview of TikTok's Technology and Issues: Data collection and security concerns pose problems for Congress.
- Subjects
SMARTPHONES, ARTIFICIAL intelligence, DATA mining, ADMINISTRATIVE law
- Abstract
The article discusses TikTok, a globally popular video-sharing smartphone application. Topics include how the app builds its feed through a "recommendation engine" that uses artificial intelligence (AI) technologies and data mining practices, and how the administration is still considering options to curtail TikTok's ability to operate in the US.
- Published
- 2024
5. Golden lichtenberg algorithm: a fibonacci sequence approach applied to feature selection.
- Author
Pereira, João Luiz Junho, Francisco, Matheus Brendon, Ma, Benedict Jun, Gomes, Guilherme Ferreira, and Lorena, Ana Carolina
- Subjects
MACHINE learning, GOLDEN ratio, FIBONACCI sequence, FEATURE selection, COMBINATORIAL optimization, METAHEURISTIC algorithms
- Abstract
Computational and technological advancements have led to an increase in data generation and storage capacity. Many annotated datasets have been used to train machine learning models for predictive tasks. Feature selection (FS) is a combinatorial binary optimization problem that arises from the need to reduce dataset dimensionality by finding the subset of features with maximum predictive accuracy. While different methodologies have been proposed, metaheuristics adapted to binary optimization have proven to be reliable and efficient techniques for FS. This paper takes the first and, to date, only population-trajectory metaheuristic, the Lichtenberg algorithm (LA), and enhances it with a Fibonacci sequence to improve its exploration capabilities in FS. By substituting the random scale that controls the Lichtenberg figures' size and the population distribution in the original version with a sequence based on the golden ratio, a new exploration–exploitation decay schedule for the figure size is obtained. The resulting golden Lichtenberg algorithm (GLA), which has few hyperparameters, the original LA, and eight other popular metaheuristics are then equipped with the v-shaped transfer function and combined with the K-nearest neighbor classifier to search for optimized feature subsets through a double cross-validation experiment on 15 UCI machine learning repository datasets. The binary GLA selected reduced subsets of features, leading to the best predictive accuracy and fitness values at the lowest computational cost. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
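The golden-ratio decay schedule that replaces the original random scales can be sketched deterministically; the schedule shape and the |tanh| v-shaped transfer function below are common textbook choices, not necessarily the paper's exact definitions:

```python
import math

def fibonacci_ratios(n):
    """Successive Fibonacci ratios F(i+1)/F(i), which converge to the golden ratio (~1.618)."""
    a, b, out = 1, 1, []
    for _ in range(n):
        a, b = b, a + b
        out.append(b / a)
    return out

def golden_scale_decay(n_iters):
    """Deterministic size-decay schedule in (0, 1]: each step divides the
    current Lichtenberg-figure scale by the next Fibonacci ratio."""
    scales, s = [], 1.0
    for r in fibonacci_ratios(n_iters):
        s /= r
        scales.append(s)
    return scales

def v_transfer(x):
    # One common v-shaped transfer function for binarizing continuous positions.
    return abs(math.tanh(x))

print(golden_scale_decay(5))  # strictly decreasing, starting at 0.5
```

Replacing a random scale with this fixed golden-ratio decay is what removes one hyperparameter while keeping large early steps (exploration) and small late steps (exploitation).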
6. Majority re-sampling via sub-class clustering for imbalanced datasets.
- Author
Ke, Shih-Wen, Tsai, Chih-Fong, Pan, Yi-Ying, and Lin, Wei-Chao
- Subjects
DATA mining, SOCIAL problems, MACHINE learning, ALGORITHMS
- Abstract
Many real-world domain datasets are class imbalanced, where the number of data samples in a given class is much smaller than in the other classes. In the related literature, under- and over-sampling techniques are widely used to re-balance class-imbalanced datasets. However, their limitations include the risk of removing representative majority-class samples and the overfitting caused by generating a large number of synthetic minority-class samples. Therefore, a novel approach, namely Majority Re-sampling via Sub-class Clustering (MRSC), is introduced. It uses a clustering algorithm to group the majority-class data into several clusters, i.e. sub-classes. A new training set containing multiple sub-classes and the minority class is then produced, after which the classifier is trained on this new multi-class dataset, which has a lower imbalance ratio than the original. The experimental results obtained on 44 two-class imbalanced datasets show that MRSC combined with k-NN classifiers, both single and ensemble, significantly outperforms the other classifiers as well as seven state-of-the-art re-sampling approaches. Moreover, the clustering algorithms based on affinity propagation and k-means produce very similar results, with no significant differences in performance, which indicates the stability of MRSC. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
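A minimal sketch of MRSC's core idea, splitting the majority class into clustered sub-classes before training; the 1-D k-means, label names, and toy data are illustrative only:

```python
import random

def kmeans_1d(values, k, iters=20, seed=0):
    """Tiny 1-D k-means: returns a cluster index for each value."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            clusters[min(range(k), key=lambda i: abs(v - centers[i]))].append(v)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return [min(range(k), key=lambda i: abs(v - centers[i])) for v in values]

def mrsc_relabel(X, y, majority=0, k=2):
    """Relabel majority-class samples as k sub-classes; keep the minority label.
    The classifier is then trained on this less-imbalanced multi-class set."""
    maj_vals = [x for x, lab in zip(X, y) if lab == majority]
    sub = iter(kmeans_1d(maj_vals, k))
    return [f"maj_{next(sub)}" if lab == majority else "minority"
            for x, lab in zip(X, y)]

X = [0.1, 0.2, 0.15, 5.0, 5.1, 4.9, 9.0]   # 6 majority samples in 2 blobs, 1 minority
y = [0, 0, 0, 0, 0, 0, 1]
print(mrsc_relabel(X, y))
```

Each majority blob becomes its own sub-class, so the largest class in the new training set is only three samples instead of six: the imbalance ratio drops without deleting or synthesizing any data.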
7. Effective algorithms for mining frequent-utility itemsets.
- Author
Liu, Xuan, Chen, Genlang, Wen, Shiting, and Huang, Jingfang
- Subjects
DATA mining, PROBLEM solving, ALGORITHMS, SCALABILITY, SPEED
- Abstract
Current pattern mining algorithms focus on discovering either frequent itemsets or high-utility itemsets. The goal of this research is to study the problem of mining frequent-utility itemsets. To solve this problem, two novel algorithms named FUIMTWU-Tree (Frequent-utility Itemset Mining based on TWU-Tree) and FUIMTF-Tree (Frequent-utility Itemset Mining based on TF-Tree) are presented, based on the integration of IHUP and HUI-Miner. The TWU-Tree and TF-Tree structures are utilised to avoid the unnecessary construction of utility-lists for itemsets that do not appear in a transaction dataset. The performance of the proposed algorithms is evaluated on various datasets. The experimental results demonstrate that FUIMTWU-Tree and FUIMTF-Tree perform efficiently in terms of speed, pruning performance and scalability. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
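The underlying problem definition, itemsets that satisfy both a support threshold and a utility threshold, can be shown by brute-force enumeration (the tree-based pruning of FUIMTWU-Tree/FUIMTF-Tree is the paper's contribution and is not reproduced here; the toy transaction database is invented):

```python
from itertools import combinations

# Each transaction maps an item to its utility (e.g., quantity x unit profit).
transactions = [
    {"a": 5, "b": 2},
    {"a": 4, "c": 3},
    {"a": 6, "b": 1, "c": 2},
    {"b": 3, "c": 4},
]

def frequent_utility_itemsets(db, min_sup, min_util):
    """Itemsets whose occurrence count >= min_sup AND total utility >= min_util."""
    items = sorted({i for t in db for i in t})
    result = {}
    for r in range(1, len(items) + 1):
        for iset in combinations(items, r):
            rows = [t for t in db if all(i in t for i in iset)]
            util = sum(t[i] for t in rows for i in iset)
            if len(rows) >= min_sup and util >= min_util:
                result[iset] = (len(rows), util)
    return result

print(frequent_utility_itemsets(transactions, min_sup=2, min_util=8))
```

Note that the two thresholds prune differently: item "b" is frequent (3 transactions) but its utility is only 6, so it fails the utility bound, which is exactly why combined frequent-utility mining is a distinct problem.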
8. Incorporating Domain-Specific Traits into Personality-Aware Recommendations for Financial Applications.
- Author
Takayanagi, Takehiro and Izumi, Kiyoshi
- Subjects
BEHAVIORAL economics, EDUCATIONAL finance, PERSONALITY, FINANCE education, DATA mining, RECOMMENDER systems
- Abstract
General personality traits, notably the Big-Five personality traits, have been increasingly integrated into recommendation systems. Personality-aware recommendations, which incorporate human personality into recommendation systems, have shown promising results in general recommendation areas including music, movie, and e-commerce recommendations. On the other hand, research delving into the applicability of personality-aware recommendations in specialized domains such as finance and education remains limited. In addition, these domains pose unique challenges for incorporating personality-aware recommendations, as domain-specific psychological traits such as risk tolerance and behavioral biases play a crucial role in explaining user behavior. Addressing these challenges, this study presents an in-depth exploration of personality-aware recommendations in the financial domain, specifically within the context of stock recommendations. First, this study investigates the benefits of deploying general personality traits in stock recommendations by integrating personality-aware recommendations with user-based collaborative filtering approaches. Second, it verifies whether incorporating domain-specific psychological traits along with general personality traits enhances the performance of stock recommender systems. Third, this paper introduces a personalized stock recommendation model that incorporates both general personality traits and domain-specific psychological traits as well as transaction data. The experimental results show that the proposed model outperformed baseline models in financial stock recommendations. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
9. Know Thy Enemy and Know Yourself – The Role of Operational Data in Managing the Mines and Booby Trap Threat in Vietnam, 1965–73.
- Author
Evans, Roland, Temple, Tracey, and Nelson, Liz
- Subjects
DATA mining, LAND mines, PROBLEM solving, DATA analysis, VICTIMS
- Abstract
Victim-operated explosive devices (VOEDs), such as mines and booby traps, have been an enduring problem since their large-scale use began in the 1940s. While the overall problem is often known in general terms, its real complexion was not necessarily fully appreciated. Eventually the need to understand the problem, and the response to it, was partially identified and acted upon in Vietnam through the collection and analysis of operational data. This did not solve the problem of mines and booby traps, but it did offer a means to better manage the threat. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
10. TriMLP: A Foundational MLP-Like Architecture for Sequential Recommendation.
- Author
Jiang, Yiheng, Xu, Yuanbo, Yang, Yongjian, Yang, Funing, Wang, Pengyang, Li, Chaozhuo, Zhuang, Fuzhen, and Xiong, Hui
- Abstract
The article focuses on the triangular multi-layer perceptron (TriMLP)-like artificial intelligence (AI) architecture model for sequential recommendation, balancing computational efficiency with improved performance. Topics include TriMLP's addressing of limitations of existing MLP models in sequential contexts, empirical studies highlighting the incompatibility of standard MLP structures with sequential data and TriMLP's competitive performance across various datasets.
- Published
- 2024
- Full Text
- View/download PDF
11. A new factor analysis model for factors obeying a Gamma distribution.
- Author
Zhou, Guoqiong, Jiang, Wenjiang, and Lin, Shixun
- Subjects
MAXIMUM likelihood statistics, MARKOV chain Monte Carlo, GAMMA distributions, DATA mining, GAUSSIAN distribution
- Abstract
The traditional factor analysis model assumes that the factors obey a normal distribution, which is not appropriate in fields whose data are nonnegative. For this kind of problem, we construct a more practical factor model, assuming that the factors obey a Gamma distribution. We develop a new factor analysis model and discuss its true loading matrix. We then study its parameter estimation with the maximum likelihood estimation (MLE) method based on an Expectation-Maximization (EM) algorithm, where the E-step is realized by the Metropolis-Hastings (M-H) algorithm in the Markov Chain Monte Carlo (MCMC) framework. We use the new model to empirically study real data and evaluate its information extraction ability, using the defined true loading matrix to calculate the true loadings of the factors. We compare the new model and traditional factor analysis models on simulated and real data; the results show that the new model has better information extraction ability for nonnegative data when the number of factors is the same. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
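The E-step sampler mentioned above, a Metropolis-Hastings random walk targeting a Gamma density, can be sketched with the standard library; the shape/scale values and the proposal step size are illustrative, not the paper's settings:

```python
import math
import random

def log_gamma_pdf(x, shape, scale):
    # Unnormalized log-density of Gamma(shape, scale); -inf outside the support.
    return (shape - 1) * math.log(x) - x / scale if x > 0 else float("-inf")

def metropolis_hastings(logpdf, n, x0=1.0, step=0.5, seed=0):
    """Random-walk M-H: propose x' = x + N(0, step); accept with prob min(1, p(x')/p(x))."""
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(n):
        prop = x + rng.gauss(0, step)            # symmetric proposal
        if math.log(rng.random()) < logpdf(prop) - logpdf(x):
            x = prop                              # accept; proposals with x' <= 0 are always rejected
        samples.append(x)
    return samples

draws = metropolis_hastings(lambda x: log_gamma_pdf(x, 2.0, 1.0), 20000)
print(sum(draws) / len(draws))  # should land near the Gamma mean, shape * scale = 2
```

Because the proposal is symmetric, the Hastings correction cancels and only the target density ratio matters, which is why the unnormalized log-density suffices.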
12. GOGCN: an attentional network with geometry and orientation-awareness for airborne LiDAR point cloud classification.
- Author
Chen, Yang, Li, Jianzhou, Xing, Yin, Li, Xiao, and Luo, Lili
- Subjects
DATA mining, POINT cloud, INFORMATION architecture, DEEP learning, LIDAR
- Abstract
The airborne LiDAR point cloud has its own characteristics; however, existing classification methods often fail to capture them. In this paper, a classification method named GOGCN is designed that adopts a U-Net network structure and uses a directionally constrained nearest-neighbourhood search during down-sampling to generate directionally aware features. The point cloud's geometric structure is obtained through geometry-aware information extraction, and a graph attention convolution is then utilised to learn the most representative features. Comparative experiments on the GML(B) dataset and one engineering dataset demonstrate that the GOGCN network performs well and can be widely applied to classification. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
13. Water Supply Pipeline Operation Anomaly Mining and Spatiotemporal Correlation Study.
- Author
Yang, Yanmei, Liu, Ao, Wang, Zegen, Yong, Zhiwei, Sun, Tao, Li, Jie, and Ma, Guoli
- Subjects
MUNICIPAL water supply, POLYWATER, WATER pressure, APRIORI algorithm, WATER supply
- Abstract
The recurrent manifestation of anomalies in water supply network systems exerts a profound influence on individuals' daily lives. Despite this impact, contemporary research on urban water supply networks reveals a conspicuous lack of thorough examination of the spatiotemporal patterns and relevance of these anomalies. This investigation meticulously scrutinizes anomalies within a specified segment of the water supply pipe network located in a county in southwest China. Clustering algorithms [K-means and density-based spatial clustering of applications with noise (DBSCAN)] and statistical methods (standard deviation) identify anomalous water pressure. Subsequently, the Apriori algorithm is utilized to extract association rules for different types of anomalies, and these rules are compared with user similarity, quantified through standard Euclidean distance. The key findings are as follows. First, anomalies in water pressure are predominantly concentrated in May, September, and November. On a 24-h scale, the highest incidence of anomalies occurs between 6:00 a.m. and 9:00 a.m. Areas with the highest anomaly occurrence are primarily situated near the city center and the railway station. Second, correlation rules exist among occurrences of anomalous values at various monitoring sites within the study area. In concrete terms, identical abnormal water pressure types frequently co-occur (confidence level > 50%, support level > 3%) at diverse monitoring sites, with this correlation linked to the types of users around the monitoring sites. Finally, categorizing the anomalies yields significantly more accurate correlation rules than analyzing all anomalies together. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
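The Apriori-style rule extraction over co-occurring anomaly types can be illustrated by brute force (the candidate pruning that real Apriori adds is omitted for brevity; the anomaly records and type names are invented):

```python
from itertools import combinations

def apriori_rules(db, min_sup, min_conf):
    """Mine X -> Y rules among co-occurring anomaly types (brute force for clarity)."""
    n = len(db)
    def support(iset):
        return sum(1 for t in db if iset <= t) / n
    items = sorted({i for t in db for i in t})
    rules = []
    for r in range(2, len(items) + 1):
        for iset in combinations(items, r):
            s = support(frozenset(iset))
            if s < min_sup:
                continue
            for k in range(1, r):
                for lhs in combinations(iset, k):
                    conf = s / support(frozenset(lhs))
                    if conf >= min_conf:
                        rules.append((set(lhs), set(iset) - set(lhs), s, conf))
    return rules

# Toy records: which pressure-anomaly types co-occurred at each site/time window.
records = [frozenset(t) for t in
           [{"low_pressure", "surge"}, {"low_pressure", "surge"},
            {"low_pressure"}, {"surge", "drop"}]]
for lhs, rhs, s, c in apriori_rules(records, min_sup=0.5, min_conf=0.5):
    print(lhs, "->", rhs, f"support={s:.2f} confidence={c:.2f}")
```

The support/confidence thresholds play the same role as the paper's "confidence level > 50%, support level > 3%" cut-offs, just on a toy scale.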
14. A taxonomy of unsupervised feature selection methods including their pros, cons, and challenges.
- Author
Dwivedi, Rajesh, Tiwari, Aruna, Bharill, Neha, Ratnaparkhe, Milind, and Tiwari, Alok Kumar
- Subjects
PATTERN recognition systems, FEATURE selection, SCIENTIFIC literature, MACHINE learning, DATA mining
- Abstract
In pattern recognition, statistics, machine learning, and data mining, feature or attribute selection is a standard dimensionality reduction method. The goal is to apply a set of rules to select essential and relevant features from the original dataset. In recent years, unsupervised feature selection approaches have garnered significant attention across various research fields. This study presents a well-organized summary of the latest and most effective unsupervised feature selection techniques in the scientific literature. We introduce a taxonomy of these strategies, elucidating their significant features and underlying principles. Additionally, we outline the pros, cons, challenges, and practical applications of the broad categories of unsupervised feature selection approaches reviewed in the literature. Furthermore, we conducted a comparison of several state-of-the-art unsupervised feature selection methods through experimental analysis. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
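As a concrete instance of the simplest filter-style family such taxonomies cover, a label-free variance ranking can be written in a few lines (the toy matrix is invented; the methods surveyed in the paper are far more elaborate):

```python
def variance_threshold(X, k):
    """Keep the k features with the largest variance -- no labels needed,
    so this is an unsupervised filter method."""
    n, d = len(X), len(X[0])
    variances = []
    for j in range(d):
        col = [row[j] for row in X]
        mu = sum(col) / n
        variances.append(sum((v - mu) ** 2 for v in col) / n)
    # Indices of the top-k features by variance, highest first.
    return sorted(range(d), key=lambda j: -variances[j])[:k]

X = [[1.0, 0.0, 5.0],
     [1.0, 0.1, 1.0],
     [1.0, 0.2, 9.0]]
print(variance_threshold(X, 2))  # feature 0 is constant, so it is dropped
```

A known weakness of this filter, the kind of "con" the survey catalogues, is that variance ignores feature redundancy: two highly correlated features can both survive.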
15. Accelerating adverse pregnancy outcomes research amidst rising medication use: parallel retrospective cohort analyses for signal prioritization.
- Author
Hwang, Yeon Mi, Piekos, Samantha N., Paquette, Alison G., Wei, Qi, Price, Nathan D., Hood, Leroy, and Hadlock, Jennifer J.
- Abstract
Background: Pregnant women are significantly underrepresented in clinical trials, yet most take medication during pregnancy despite the limited safety data. The objective of this study was to characterize medication use during pregnancy and apply a propensity score matching method at scale to patient records to accelerate and prioritize the detection of drug effect signals associated with the risk of preterm birth and other adverse pregnancy outcomes. Methods: This was a retrospective study of continuously enrolled women who delivered live births between 2013/01/01 and 2022/12/31 (n = 365,075) at Providence St. Joseph Health. Our exposures of interest were all outpatient medications prescribed during pregnancy. We limited our analyses to medications that met the minimal sample size (n = 600). The primary outcome of interest was preterm birth. Secondary outcomes of interest were small for gestational age and low birth weight. We used propensity score matching at scale to evaluate the risk of these adverse pregnancy outcomes associated with drug exposure after adjusting for demographics, pregnancy characteristics, and comorbidities. Results: The total medication prescription rate increased from 58.5 to 75.3% (P < 0.0001) from 2013 to 2022. The prevalence rate of preterm birth was 7.7%. One hundred seventy-five out of 1329 prenatally prescribed outpatient medications met the minimum sample size. We identified 58 medications statistically significantly associated with the risk of preterm birth (P ≤ 0.1; decreased: 12, increased: 46). Conclusions: Most pregnant women are prescribed medication during pregnancy. This highlights the need to utilize existing real-world data to enhance our knowledge of the safety of medications in pregnancy. We narrowed down from 1329 to 58 medications that showed a statistically significant association with the risk of preterm birth even after addressing numerous covariates through propensity score matching.
This data-driven approach demonstrated that multiple testable hypotheses in pregnancy pharmacology can be prioritized at scale and lays the foundation for application in other pregnancy outcomes. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
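The matching step can be illustrated with greedy 1:1 nearest-neighbour matching on precomputed propensity scores (a common variant, not necessarily the authors' exact procedure; the unit IDs, scores, and caliper below are invented):

```python
def greedy_match(treated, control, caliper=0.05):
    """1:1 nearest-neighbour matching on propensity scores, without replacement.

    treated/control: lists of (unit_id, propensity_score).
    A pair is kept only if the score gap is within the caliper."""
    pairs, available = [], list(control)
    # Process treated units in score order (one common convention).
    for t_id, t_ps in sorted(treated, key=lambda x: x[1]):
        if not available:
            break
        c_id, c_ps = min(available, key=lambda c: abs(c[1] - t_ps))
        if abs(c_ps - t_ps) <= caliper:
            pairs.append((t_id, c_id))
            available.remove((c_id, c_ps))   # without replacement
    return pairs

treated = [("t1", 0.30), ("t2", 0.70)]
control = [("c1", 0.32), ("c2", 0.69), ("c3", 0.10)]
print(greedy_match(treated, control))  # [('t1', 'c1'), ('t2', 'c2')]
```

After matching, outcome rates (e.g., preterm birth) are compared within the matched pairs, which is what lets one exposure-outcome hypothesis be screened per medication at scale.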
16. The effectiveness of social media communication of sustainable fashion brands: evidence from consumer engagement.
- Author
Lang, Chunmin, Su, Qiuli, Xia, Sibei, and Zboja, James
- Subjects
SUSTAINABLE fashion, TEXT mining, CLOTHING industry, CONSUMERS, SOCIAL media, DATA mining
- Abstract
The purpose of this study is to identify how sustainable fashion brands communicate with consumers on social media, specifically, Instagram. By applying a data mining technique to analyze brand posts and consumer comments, we thoroughly examine how consumers engage with sustainable messages from each brand and how those engagements are associated with brand posts. A sample of six sustainable fashion brands was selected, including Eileen Fisher, Everlane, Girlfriend Collective, Patagonia, Reformation, and Stella McCartney. Public posts of the six brands and corresponding consumer comments on Instagram between November 1, 2019 and November 1, 2022 were collected. LDA (Latent Dirichlet Allocation) text mining technique was applied to extract topics from each brand’s posts and follower comments. A fixed effects linear panel model was then employed to examine if consumer comments are significantly associated with sustainable messages from each brand. Ten major sustainability-related themes were identified from brand posts, and four major sustainability-related themes were discovered from consumer comments. The linear panel results provide preliminary evidence that the sustainability-related themes from consumer comments are positively influenced by the sustainability-related themes from brand posts, indicating that brand posts indeed affect consumers’ opinions about sustainability. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
17. Frequency‐Aware Facial Image Shadow Removal through Skin Color and Texture Learning.
- Author
Zhang, Ling, Xie, Wenyang, and Xiao, Chunxia
- Subjects
IMAGE fusion, DATA mining, TEXTURE mapping, FEATURE extraction, HUMAN skin color, LIGHTING
- Abstract
Existing facial image shadow removal methods predominantly rely on pre‐extracted facial features. However, these methods often fail to capitalize on the full potential of these features, resorting to simplified utilization. Furthermore, they tend to overlook the importance of low‐frequency information during the extraction of prior features, which can easily be compromised by noise. In our work, we propose a frequency‐aware shadow removal network (FSRNet) for facial images, which utilizes the skin color and texture information in the face to help recover illumination in shadow regions. FSRNet uses a frequency‐domain image decomposition network to extract the low‐frequency skin color map and the high‐frequency texture map from face images, and applies a color‐texture guided shadow removal network to produce the final result. Concretely, the designed Fourier sparse attention block (FSABlock) transforms images from the spatial domain to the frequency domain and helps the network focus on key information. We also introduce a skin color fusion module (CFModule) and a texture fusion module (TFModule) to enhance the understanding and utilization of color and texture features, promoting high‐quality results without color distortion or detail blurring. Extensive experiments demonstrate the superiority of the proposed method. The code is available at https://github.com/laoxie521/FSRNet. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
18. A data mining approach to identify key radioresponsive genes in mouse model of radiation induced intestinal injury.
- Author
Sharma, Suchitra, Rehan, Aliza, and Dutta, Ajaswrata
- Subjects
LIPID metabolism, GASTROINTESTINAL system, GENE expression, DATA mining, DATABASES
- Abstract
Radiation-mediated GI injury (RIGI) in humans, whether due to accidental or intentional exposure, can currently only be managed with supportive care, as no approved countermeasures are available. Early detection and monitoring of RIGI are important for effective medical management and improving survival chances in exposed individuals. The present study aims to identify new signatures of RIGI using a data mining approach, followed by validation of selected hub genes in a mouse model. Using microarray datasets from the Gene Expression Omnibus database, differentially expressed genes were identified. Pathway analysis suggested lipid metabolism as one of the predominant pathways altered in irradiated GI tissue. A protein-protein interaction network revealed the top 8 hub genes related to lipid metabolism, namely Fabp1, Fabp2, Fabp6, Npc1l1, Ppar-α, Abcg8, Hnf-4α, and Insig1. qRT-PCR analysis revealed significant up-regulation of Fabp6 and Hnf-4α and down-regulation of Fabp1, Fabp2 and Insig1 transcripts in the irradiated intestine. A radiation dose and time kinetics study revealed that the 5 selected genes were altered differentially in the irradiated intestine. Extensive alterations in lipid profiles and modifications were observed in the irradiated intestine. The findings suggest that lipid metabolism is a key target of radiation and that its mediators may act as biomarkers for detecting and monitoring the progression of RIGI. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
19. Evaluation of water richness in coal seam roof aquifer based on factor optimization and random forest method.
- Author
Gai, Guichao, Qiu, Mei, Zhang, Weiqiang, and Shi, Longqing
- Subjects
RANDOM forest algorithms, RANK correlation (Statistics), STATISTICAL correlation, DATA mining, WATER testing
- Abstract
Water richness evaluation of coal seam roofs is a crucial prerequisite for preventing and controlling water hazards in coal seam roofs. For this purpose, Spearman correlation and GeoDetector were employed for factor optimization to investigate the significance of lithological and structural factors and the impact of interactions among different factors on the water richness of coal seam roofs. Water richness evaluation models of coal seam roofs were separately established via the entropy weight method (EWM), coefficient of variation method (CVM), and random forest method (RFM), and the predictive accuracies of these models were compared. Eleven lithological and structural factors were collected. Through Spearman correlation analysis, 6 factors were identified to have significant influences on water richness. By utilizing the interaction detection of GeoDetector, the effects of interfactor interactions on water richness were explored, and 3 combination factors were selected. AICc was used to determine the model's superiority. The evaluation results of the study area based on factor optimization and three methods were further compared and validated via pumping tests, workface water inflow tests, and three-dimensional high-density electrical method. The results indicated that the RFM exhibited higher prediction accuracy than did the entropy weight and coefficient of variation methods. Additionally, within each method, factor optimization led to improved model accuracy. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
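The entropy weight method (EWM) used as one of the baseline weighting schemes has a short closed form; the indicator matrix below is invented and assumed already normalized to nonnegative values:

```python
import math

def entropy_weights(matrix):
    """Entropy weight method: indicators whose values differ more across
    alternatives carry more information and receive larger weights."""
    m, n = len(matrix), len(matrix[0])
    weights = []
    for j in range(n):
        col = [row[j] for row in matrix]
        total = sum(col)
        p = [v / total for v in col]
        # Shannon entropy of the column, normalized by log(m) so it lies in [0, 1].
        e = -sum(pi * math.log(pi) for pi in p if pi > 0) / math.log(m)
        weights.append(1 - e)                 # degree of divergence
    s = sum(weights)
    return [w / s for w in weights]           # normalize weights to sum to 1

# Rows: boreholes; columns: normalized lithological/structural indicators (toy values).
data = [[0.2, 0.9],
        [0.3, 0.1],
        [0.5, 0.0]]
w = entropy_weights(data)
print(w)  # the second indicator varies more, so it receives the larger weight
```

This is the deterministic benchmark the paper compares against; the random forest model replaces such fixed weights with learned splits, which is where its accuracy advantage comes from.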
20. Design of agricultural question answering information extraction method based on improved BILSTM algorithm.
- Author
Tang, Ruipeng, Yang, Jianbu, Tang, Jianxun, Aridas, Narendra Kumar, and Talip, Mohamad Sofian Abu
- Subjects
LANGUAGE models, NATURAL language processing, QUESTION answering systems, AGRICULTURAL pests, DATA mining
- Abstract
With the rapid growth of agricultural information and the need for data analysis, accurately extracting useful information from massive data has become an urgent first step in agricultural data mining and its applications. In this study, an agricultural question-answering information extraction method based on the BE-BILSTM (improved bidirectional long short-term memory) algorithm is designed. First, it uses Python's Scrapy crawler framework to obtain information on soil types, crop diseases and pests, and agricultural trade, and removes abnormal values. Second, the semi-structured data are converted using entity extraction methods. Third, the BERT (Bidirectional Encoder Representations from Transformers) algorithm is introduced to improve the performance of the BILSTM algorithm. Compared with the BERT-CRF (Conditional Random Field) and BILSTM algorithms, the results show that the BE-BILSTM algorithm has better information extraction performance. This study improves the accuracy of agricultural information recommendation systems from the perspective of information extraction, which is more innovative than work done from the perspective of recommendation algorithm optimization; it helps to capture the semantics and contextual relationships in agricultural question answering. By gaining a deeper understanding of farmers' needs and interests, the system can better recommend relevant and practical information. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
21. Fault diagnosis of reducers based on digital twins and deep learning.
- Author
Liu, Weimin, Han, Bin, Zheng, Aiyun, Zheng, Zhi, Chen, Shujun, and Jia, Shikui
- Subjects
FAULT diagnosis, DIGITAL twins, FEATURE extraction, DATA mining, DIGITAL learning
- Abstract
A new method is proposed for fault diagnosis that combines the high-fidelity behavior of digital twins (DT) with the data mining capabilities of deep learning (DL). The proposed fault distribution GAN (FDGAN) is built to map virtual and physical entities for data from the established test platform. Finally, MobileViG is employed to validate the model and diagnose faults. The accuracy of the proposed method with 600 and 800 training samples was 88.4% and 99.5%, respectively. These accuracies surpass those of other methods based on CycleGAN (98.86%), CACGAN (94.92%), ACGAN (86.45%), ML1D-GAN (82.33%), and transfer learning (99.38%). Therefore, with the integration of global connectivity, an innovative network structure, and training methods, FDGAN can effectively address challenges such as network degradation, limited feature extraction in small windows, and insufficient model robustness. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
22. Spatial data mining-focused research on image processing techniques for remotely sensed images.
- Author
Fan, Yongxia, Feng, Shizhou, and Du, Jing
- Subjects
ANIMAL herds, IMAGE recognition (Computer vision), IMAGE processing, PRINCIPAL components analysis, DATA mining
- Abstract
Spatial data mining is an important approach for extracting useful information from big datasets, especially remotely sensed images. This study tackles issues in environmental monitoring and management using sophisticated image processing. The Horse Herd Optimization-based VGG19 (HHO-VGG19) is proposed to improve land cover classification, object recognition, change detection, and anomaly detection. The study used the BCDD dataset, scaled to 512 × 512 pixels, then applied Z-score normalization and extracted features using Principal Component Analysis (PCA). The VGG19 architecture was enhanced with Horse Herd Optimization to improve image classification efficiency. The HHO-VGG19 model surpasses conventional techniques, with an F1-score of 92%, a recall of 94%, an accuracy of 98.5%, and a 30-second reduction in execution time. The findings indicate the efficiency of integrating sophisticated image processing with spatial data mining, giving an effective tool for remote sensing image processing in environmental uses, including tracking ecosystems and managing natural resources. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
23. A New Scene Sensing Model Based on Multi-Source Data from Smartphones.
- Author
-
Ding, Zhenke, Deng, Zhongliang, Hu, Enwen, Liu, Bingxun, Zhang, Zhichao, and Ma, Mingyang
- Abstract
Smartphones with integrated sensors play an important role in people's lives, and in advanced multi-sensor fusion navigation systems the use of individual sensor information is crucial. Because environments differ, the weights assigned to the sensors differ as well, which affects both the method and the results of multi-source fusion positioning. Based on multi-source data from smartphone sensors, this study explores five types of information: Global Navigation Satellite System (GNSS), Inertial Measurement Unit (IMU), cellular network, optical sensor, and Wi-Fi sensor data. It characterizes the temporal, spatial, and statistical features of these data and constructs a multi-scale, multi-window, context-connected scene sensing model that accurately detects indoor, semi-indoor, outdoor, and semi-outdoor environments, providing a good basis for multi-sensor positioning in a navigation system. Detecting the environmental scene supplies an environmental basis for multi-sensor fusion localization. The model comprises four main parts: multi-sensor data mining, a multi-scale convolutional neural network (CNN), a bidirectional long short-term memory (BiLSTM) network combined with contextual information, and a meta-heuristic optimization algorithm. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
24. An evolutionary algorithm‐based classification method for high‐dimensional imbalanced mixed data with missing information.
- Author
-
Liu, Yi, Li, Gengsong, Zheng, Qibin, Yang, Guoli, Liu, Kun, and Qin, Wei
- Subjects
- *
EVOLUTIONARY computation , *MISSING data (Statistics) , *FEATURE selection , *QUANTUM operators , *DATA mining - Abstract
Data volumes keep growing rapidly, and much of this data is high-dimensional and imbalanced, which makes it hard to classify. Missing values are also common in practice, further aggravating the difficulty. To resolve the problem of classifying high-dimensional, imbalanced, mixed-variable data with missing values, a novel method based on particle swarm optimization is developed. It has three original components: multiple feature selection, mixed attribute imputation, and quantum oversampling. Multiple feature selection uses a two-stage strategy to obtain stable relevant features. Mixed attribute imputation separates features into continuous and discrete groups and fills missing values with different models. Quantum oversampling chooses instances to balance the data based on a quantum operator. Furthermore, particle swarm optimization is employed to tune the parameters of these components for better classification results. Exhaustive experiments on six representative classification datasets, against six typical algorithms and using four measures, indicate that the proposed method is superior to the comparison algorithms. [ABSTRACT FROM AUTHOR]
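The oversampling component above balances classes by adding minority-class instances. The paper selects instances with a quantum operator; the sketch below swaps in plain random duplication purely to illustrate the balancing idea, with hypothetical toy data.

```python
import random

def oversample_minority(X, y, seed=0):
    """Randomly duplicate minority-class instances until every class
    reaches the majority-class count. (Illustrative stand-in for the
    paper's quantum-operator instance selection.)"""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = max(len(items) for items in by_class.values())
    X_out, y_out = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for xi in items + extra:
            X_out.append(xi)
            y_out.append(label)
    return X_out, y_out

# Hypothetical imbalanced dataset: three class-0 samples, one class-1
X = [[0.1], [0.2], [0.3], [0.9]]
y = [0, 0, 0, 1]
Xb, yb = oversample_minority(X, y)
```

After resampling, both classes contribute equally to the training set, which is the precondition the paper's classifier relies on.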
- Published
- 2024
- Full Text
- View/download PDF
25. Handwritten text recognition and information extraction from ancient manuscripts using deep convolutional and recurrent neural network.
- Author
-
El Bahi, Hassan
- Subjects
- *
CONVOLUTIONAL neural networks , *RECURRENT neural networks , *DATA mining , *DATA augmentation , *FEATURE extraction , *TEXT recognition , *HANDWRITING recognition (Computer science) - Abstract
Digitizing ancient manuscripts and making them accessible to a broader audience is a crucial step in unlocking the wealth of information they hold. However, automatic recognition of handwritten text and the extraction of relevant information such as named entities from these manuscripts are among the most difficult research topics, due to several factors such as the poor quality of manuscripts, complex backgrounds, ink stains, and cursive handwriting. To meet these challenges, we propose two systems. The first performs handwritten text recognition (HTR) in ancient manuscripts: it starts with a preprocessing operation, then a convolutional neural network (CNN) extracts the features of each input image, and finally a recurrent neural network (RNN) with Long Short-Term Memory (LSTM) blocks and a Connectionist Temporal Classification (CTC) layer predicts the text contained in the image. The second system recognizes named entities and deciphers the relationships among words directly from images of old manuscripts, bypassing the need for an intermediate text transcription step. Like the first system, it starts with a preprocessing step; data augmentation is then used to enlarge the training dataset, the most relevant features are extracted automatically with a CNN model, and finally named entities and the relationships between word images are recognized using a bidirectional LSTM. Extensive experiments on the ESPOSALLES dataset demonstrate that the proposed systems achieve state-of-the-art performance, exceeding existing systems. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
26. Geoinference of author affiliations using NLP-based text classification.
- Author
-
Lee, Brian, Brownstein, John S., and Kohane, Isaac S.
- Subjects
- *
MACHINE learning , *DATA mining , *URBAN renewal , *RESEARCH personnel , *BIBLIOMETRICS - Abstract
Author affiliations are essential in bibliometric studies, requiring relevant information extraction from free-text affiliations. Precisely determining an author's location from their affiliation is crucial for understanding research networks, collaborations, and geographic distribution. Existing geoparsing tools using regular expressions have limitations due to unstructured and ambiguous affiliations, resulting in erroneous location identification, especially for unconventional variations or misspellings. Moreover, their inefficient handling of big datasets hampers large-scale bibliometric studies. Though machine learning-based geoparsers exist, they depend on explicit location information, creating challenges when detailed geographic data is absent. To address these issues, we developed and evaluated a natural language processing model to predict the city, state, and country from an author's free-text affiliation. Our model automates location inference, overcoming drawbacks of existing methods. Trained and tested with MapAffil, a publicly available geoparsed dataset of PubMed affiliations up to 2018, our model accurately retrieves high-resolution locations, even without explicit mentions of a city, state, or even country. Leveraging NLP techniques and the LinearSVC algorithm, our machine learning model achieves superior accuracy based on several validation datasets. This research demonstrates a practical application of text classification for inferring specific geographical locations from free-text affiliations, benefiting researchers and institutions in analyzing research output distribution. [ABSTRACT FROM AUTHOR]
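The geoinference task above is text classification: map a free-text affiliation to a location label. The paper uses NLP features with a LinearSVC; the stdlib sketch below swaps in a naive-Bayes-style bag-of-words scorer only to illustrate the idea, and the training affiliations and labels are made up for the example.

```python
import math
from collections import Counter, defaultdict

def tokenize(affiliation):
    return affiliation.lower().replace(",", " ").split()

def train(samples):
    """samples: list of (affiliation string, country label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in samples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, label_counts

def predict(model, text):
    """Score each label by add-one-smoothed word log-likelihoods."""
    word_counts, label_counts = model
    vocab = {w for counter in word_counts.values() for w in counter}
    best, best_score = None, float("-inf")
    for label in label_counts:
        total = sum(word_counts[label].values()) + len(vocab)
        score = math.log(label_counts[label])
        for w in tokenize(text):
            score += math.log((word_counts[label][w] + 1) / total)
        if score > best_score:
            best, best_score = label, score
    return best

# Hypothetical labeled affiliations (not from MapAffil)
samples = [
    ("Department of Medicine, Harvard Medical School, Boston, MA", "USA"),
    ("Boston Children's Hospital, Boston, Massachusetts", "USA"),
    ("Institut Pasteur, Paris", "France"),
    ("Sorbonne Universite, Paris, France", "France"),
]
model = train(samples)
```

The point of a learned classifier over regex geoparsing is visible even here: an unseen affiliation is located from co-occurring tokens rather than an explicit country mention.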
- Published
- 2024
- Full Text
- View/download PDF
27. Short‐term power prediction of distributed PV based on multi‐scale feature fusion with TPE‐CBiGRU‐SCA.
- Author
-
Zou, Hongbo, Yang, Changhua, Ma, Henrui, Zhu, Suxun, Sun, Jialun, Yang, Jinlong, and Wang, Jiahao
- Subjects
- *
PREDICTION theory , *DATA mining , *FEATURE extraction , *DATA analysis , *MULTISENSOR data fusion - Abstract
To address the challenge of insufficient comprehensive extraction and fusion of meteorological conditions, temporal features, and power periodic features in short‐term power prediction for distributed photovoltaic (PV) farms, a TPE‐CBiGRU‐SCA model based on multiscale feature fusion is proposed. First, multiscale fusion of meteorological features, temporal features, and hidden periodic features of PV power is performed to construct the model input features. Second, the relationships between PV power and its influencing factors are modelled at spatial and temporal scales using a CNN and a Bi‐GRU, respectively. The spatiotemporal features are then weighted and fused using the SCA attention mechanism. Finally, TPE‐based hyperparameter optimization is used to refine network parameters, achieving PV power prediction for a single field station. Validation with data from a PV field station shows that this method significantly enhances the comprehensiveness of feature extraction through multiscale fusion at both the data and model layers. This improvement reduces MAE and RMSE by 26.03% and 38.15%, respectively, and increases R2 to 96.22%, a 3.26% improvement over other models. [ABSTRACT FROM AUTHOR]
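The abstract reports its gains in MAE, RMSE, and R2. For reference, these three regression metrics can be computed as below; the toy actual/predicted power values are hypothetical, not from the paper's station data.

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)) ** 0.5

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical PV power readings (kW) and model predictions
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.0, 7.5, 9.0]
```

RMSE penalizes large errors more than MAE, which is why papers typically report both alongside R2.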
- Published
- 2024
- Full Text
- View/download PDF
28. A data mining‐based interruptible load contract model for the modern power system.
- Author
-
Hui, Zou, Jun, Yang, and Qi, Meng
- Subjects
- *
ELECTRIC power systems , *DATA mining , *CONTRACTS - Abstract
To devise more scientifically rational interruptible load contracts, this paper introduces a novel model for interruptible load contracts within modern electric power systems, grounded in data mining techniques. Initially, user characteristics are clustered using data mining technology to determine the optimal number of clusters. Building on this, the potential for different users to participate in interruptible load programs is analysed based on daily load ratios, yielding various user‐type parameters. Furthermore, the paper develops an interruptible load contract model that incorporates load response capabilities, enhancing the traditional interruptible load contract model based on principal‐agent theory through considerations of user type parameters and maximum interruptible load limits. The objective function, aimed at maximizing the profits of the electric company, is solved, and lastly, through the use of real data, a case study analysis focusing on commercial users with the strongest load response capabilities is conducted. The results affirm the efficacy of the proposed model. [ABSTRACT FROM AUTHOR]
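The first step above clusters user characteristics such as daily load ratios. The abstract does not name the clustering algorithm, so a minimal one-dimensional k-means is assumed here for illustration, on hypothetical load-ratio values.

```python
def kmeans_1d(values, k, iters=100):
    """Minimal 1-D k-means: assign each value to its nearest center,
    then move each center to its group mean, until convergence."""
    centers = sorted(values)[:: max(1, len(values) // k)][:k]
    for _ in range(iters):
        groups = [[] for _ in centers]
        for v in values:
            idx = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[idx].append(v)
        new_centers = [sum(g) / len(g) if g else c
                       for g, c in zip(groups, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, groups

# Hypothetical daily load ratios for two user types
ratios = [0.12, 0.15, 0.14, 0.71, 0.75, 0.73]
centers, groups = kmeans_1d(ratios, 2)
```

The resulting cluster centers separate low-ratio from high-ratio users, which is the kind of user-type parameter the contract model then consumes.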
- Published
- 2024
- Full Text
- View/download PDF
29. Triple Ionosphere PhotoMeter onboard Fengyun-3E satellite: Data validation and ionosphere information extraction.
- Author
-
Jiang, Fang, Mao, Tian, Fu, LiPing, and Wang, Jin-Song
- Subjects
- *
AIRGLOW , *DATA mining , *IONOSPHERE , *RADIANCE , *PHOTOMETERS - Abstract
The Triple Ionosphere PhotoMeter (TRIPM) aboard the early-morning satellite Fengyun-3E (FY-3E), launched on July 5th, 2021, is designed to make disk observations of the Earth's airglow emissions at OI 135.6 nm and in the N2 Lyman-Birge-Hopfield (LBH) band. This is the first attempt to obtain global twilight airglow data in the far-ultraviolet band using an ionosphere photometer. The seasonal behavior of the twilight airglow at OI 135.6 nm and in the N2 LBH band is exhibited and interpreted in the paper. The validity of the TRIPM data is analyzed by comparison with simulations from the Global Airglow (GLOW) model, which show good consistency with the observations. Furthermore, based on the ionospheric contributions calculated by the GLOW model, we extract ionospheric information from the TRIPM 135.6 nm emissions; that is, the ionospheric signature in the 135.6 nm radiance at dawn and dusk can be provided. Comparison between the ionospheric 135.6 nm radiance, due to O+ and electron radiative recombination, and GPS TEC data shows a similar morphology of equatorial arcs and seasonal variation of the ionosphere. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
30. AIDM-CT: software for cone-beam X-ray tomography and deep-learning-based analysis.
- Author
-
Wang, Hongwei, Xu, Cong, Guan, Yu, Zhang, Shou, Chen, Xingbang, Wang, Fuli, and Liu, Huiqiang
- Subjects
- *
COMPUTED tomography , *GRAPHICAL user interfaces , *NONDESTRUCTIVE testing , *IMAGE segmentation , *DATA mining , *PYTHON programming language , *DEEP learning - Abstract
Non-destructive testing (NDT) with high-resolution cone-beam X-ray computed tomography (XCT) plays a crucial role in revealing the 3D distribution and morphology of defects within an object. It is necessary to generalise an intelligent, customised XCT-based NDT protocol that provides users with solutions for improving reconstruction quality and for quantitative analysis of complex defects. We analysed the key techniques of XCT and developed systematic software for data acquisition, reconstruction, intelligent super-resolution, and segmentation in our advanced imaging and data mining laboratory (AIDM-CT). The experimental results demonstrated that the software efficiently achieves control of XCT scanning; artefact-correction algorithms for cone-beam geometry, beam hardening, and ring and radial artefacts; deep-learning-based slice super-resolution and feature segmentation; and quantitative analysis. The software is programmed in Python with a friendly multi-function graphical user interface (GUI), in which the reconstruction and correction routines are combined with Compute Unified Device Architecture (CUDA) acceleration to guarantee the high efficiency of XCT tasks. In particular, a multi-functional super-resolution module (MF-GAN) is proposed to optimise CT image quality, and an improved U-Net segmentation module (CBAM U-Net) is developed to fulfil high-precision quantitative non-destructive testing of homogeneous and variform microdefects in our AIDM-CT software. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
31. Post-marketing safety concerns with rimegepant based on a pharmacovigilance study.
- Author
-
Hu, Jia-Ling, Wu, Jing-Ying, Xu, Shan, Qian, Shi-Yan, Jiang, Cheng, and Zheng, Guo-Qing
- Subjects
- *
PHARMACOLOGY , *RISK assessment , *CLINICAL drug trials , *STATISTICAL correlation , *DRUG side effects , *PATIENT safety , *DIGESTION , *T-test (Statistics) , *RESEARCH funding , *SEX distribution , *BODY weight , *FISHER exact test , *CALCITONIN , *RETROSPECTIVE studies , *AGE distribution , *MOTION sickness , *DESCRIPTIVE statistics , *COMMERCIAL product evaluation , *MEDICAL records , *ACQUISITION of data , *RESEARCH , *VOMITING , *DATA analysis software , *MIGRAINE , *CELL receptors , *TIME , *ALGORITHMS , *CHEMICAL inhibitors - Abstract
Purpose: This study aimed to comprehensively assess the safety of rimegepant administration in real-world clinical settings. Methods: Data from the Food and Drug Administration Adverse Event Reporting System (FAERS) spanning the second quarter of 2020 through the first quarter of 2023 were retrospectively analyzed in this pharmacovigilance investigation, with a focus on subgroup analysis for monitoring rimegepant drug safety. Descriptive analysis was employed to examine the clinical characteristics and concomitant medications in adverse event reports associated with rimegepant, including report season, reporter country, sex, age, weight, dose and frequency, and onset time. Correlation analysis, including techniques such as violin plots, was utilized to explore relationships between clinical characteristics in greater detail. Additionally, four disproportionality analysis methods were applied to assess adverse event signals associated with rimegepant. Results: Of the 5,416,969 adverse event reports extracted from the FAERS database, 10,194 identified rimegepant as the "primary suspect" (PS) drug. Rimegepant-associated adverse events involved 27 System Organ Classes (SOCs), and the significant SOC meeting all four detection criteria was "general disorders and administration site conditions" (SOC: 10018065). Additionally, new significant adverse events were discovered, including "vomiting projectile" (PT: 10047708), "eructation" (PT: 10015137), "motion sickness" (PT: 10027990), "feeling drunk" (PT: 10016330), and "reaction to food additive" (PT: 10037977). Descriptive analysis indicated that the majority of reporters were consumers (88.1%), with most reports involving female patients. Significant differences were observed between female and male patients across age categories, and the concomitant use of rimegepant with other medications was complex.
Conclusion: This study has preliminarily identified potential new adverse events associated with rimegepant, such as those involving the gastrointestinal system, nervous system, and immune system, which warrant further research to determine their exact mechanisms and risk factors. Additionally, significant differences in rimegepant-related adverse events were observed across different age groups and sexes, and the complexity of concomitant medication use should be given special attention in clinical practice. [ABSTRACT FROM AUTHOR]
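The abstract mentions four disproportionality analysis methods without naming them. One standard method in FAERS-style pharmacovigilance is the reporting odds ratio (ROR), sketched below; assuming the usual 2x2 contingency layout and using hypothetical counts, not figures from this study.

```python
import math

def reporting_odds_ratio(a, b, c, d):
    """ROR with a 95% CI from a 2x2 contingency table:
        a: target drug & target event    b: target drug & other events
        c: other drugs & target event    d: other drugs & other events
    A signal is conventionally flagged when the lower bound exceeds 1."""
    ror = (a / b) / (c / d)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(ror) - 1.96 * se)
    upper = math.exp(math.log(ror) + 1.96 * se)
    return ror, lower, upper

# Hypothetical report counts for one drug-event pair
ror, lo_ci, hi_ci = reporting_odds_ratio(40, 960, 200, 48800)
```

With these toy counts the lower confidence bound stays above 1, which is the disproportionality signal criterion.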
- Published
- 2024
- Full Text
- View/download PDF
32. Data mining of PubChem bioassay records reveals diverse OXPHOS inhibitory chemotypes as potential therapeutic agents against ovarian cancer.
- Author
-
Sharma, Sejal, Feng, Liping, Boonpattrawong, Nicha, Kapur, Arvinder, Barroilhet, Lisa, Patankar, Manish S., and Ericksen, Spencer S.
- Subjects
- *
HIGH throughput screening (Drug development) , *DATA libraries , *DATA mining , *REACTIVE oxygen species , *OXIDATIVE phosphorylation , *FUNCTIONAL groups - Abstract
Focused screening on target-prioritized compound sets can be an efficient alternative to high throughput screening (HTS). For most biomolecular targets, compound prioritization models depend on prior screening data or a target structure. For phenotypic or multi-protein pathway targets, it may not be clear which public assay records provide relevant data. The question also arises as to whether data collected from disparate assays might be usefully consolidated. Here, we report on the development and application of a data mining pipeline to examine these issues. To illustrate, we focus on identifying inhibitors of oxidative phosphorylation, a druggable metabolic process in epithelial ovarian tumors. The pipeline compiled 8415 available OXPHOS-related bioassays in the PubChem data repository involving 312,093 unique compound records. Application of PubChem assay activity annotations, PAINS (Pan Assay Interference Compounds) filters, and Lipinski-like bioavailability filters yields 1852 putative OXPHOS-active compounds that fall into 464 clusters. These chemotypes are diverse but have relatively high hydrophobicity and molecular weight and lower complexity and drug-likeness. They show a high abundance of bicyclic ring systems and oxygen-containing functional groups, including ketones, allylic oxides (alpha/beta-unsaturated carbonyls), hydroxyl groups, and ethers. In contrast, amide and primary amine functional groups have a notably lower-than-random prevalence. A UMAP representation of the chemical space shows strong divergence between the regions occupied by OXPHOS-inactive and -active compounds. Of the six compounds selected for biological testing, four showed statistically significant inhibition of electron transport in bioenergetics assays. Two of these four compounds, lacidipine and esbiothrin, increased intracellular oxygen radicals (a major hallmark of most OXPHOS inhibitors) and decreased the viability of two ovarian cancer cell lines, ID8 and OVCAR5.
Finally, data from the pipeline were used to train random forest and support vector classifiers that effectively prioritized OXPHOS inhibitory compounds within a held-out test set (ROCAUC 0.962 and 0.927, respectively) and on another set containing 44 documented OXPHOS inhibitors outside of the training set (ROCAUC 0.900 and 0.823). This prototype pipeline is extensible and could be adapted for focus screening on other phenotypic targets for which sufficient public data are available. Scientific contribution Here, we describe and apply an assay data mining pipeline to compile, process, filter, and mine public bioassay data. We believe the procedure may be more broadly applied to guide compound selection in early-stage hit finding on novel multi-protein mechanistic or phenotypic targets. To demonstrate the utility of our approach, we apply a data mining strategy on a large set of public assay data to find drug-like molecules that inhibit oxidative phosphorylation (OXPHOS) as candidates for ovarian cancer therapies. [ABSTRACT FROM AUTHOR]
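One filtering stage in the pipeline above is a Lipinski-like bioavailability screen. The abstract does not give the exact thresholds used, so the classic rule-of-five cutoffs are assumed in this sketch, and the descriptor values are hypothetical.

```python
def lipinski_like(mol):
    """Bioavailability screen in the spirit of Lipinski's rule of five.
    `mol` is a dict of precomputed descriptors; the classic thresholds
    are assumed here: MW <= 500, logP <= 5, H-bond donors <= 5,
    H-bond acceptors <= 10."""
    return (
        mol["mw"] <= 500
        and mol["logp"] <= 5
        and mol["hbd"] <= 5
        and mol["hba"] <= 10
    )

# Hypothetical candidate compounds with precomputed descriptors
candidates = [
    {"name": "drug_like", "mw": 320.4, "logp": 2.1, "hbd": 2, "hba": 5},
    {"name": "too_heavy", "mw": 742.0, "logp": 6.3, "hbd": 6, "hba": 12},
]
passing = [m["name"] for m in candidates if lipinski_like(m)]
```

In practice such a filter runs after PAINS removal and before clustering, trimming the compound set to orally plausible chemotypes.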
- Published
- 2024
- Full Text
- View/download PDF
33. An explainable language model for antibody specificity prediction using curated influenza hemagglutinin antibodies.
- Author
-
Wang, Yiquan, Lv, Huibin, Teo, Qi Wen, Lei, Ruipeng, Gopal, Akshita B., Ouyang, Wenhao O., Yeung, Yuen-Hei, Tan, Timothy J.C., Choi, Danbi, Shen, Ivana R., Chen, Xin, Graham, Claire S., and Wu, Nicholas C.
- Subjects
- *
LANGUAGE models , *IMMUNOLOGIC memory , *ANTIBODY specificity , *ANTIBODY formation , *DEEP learning - Abstract
Despite decades of antibody research, it remains challenging to predict the specificity of an antibody solely based on its sequence. Two major obstacles are the lack of appropriate models and the inaccessibility of datasets for model training. In this study, we curated >5,000 influenza hemagglutinin (HA) antibodies by mining research publications and patents, which revealed many distinct sequence features between antibodies to HA head and stem domains. We then leveraged this dataset to develop a lightweight memory B cell language model (mBLM) for sequence-based antibody specificity prediction. Model explainability analysis showed that mBLM could identify key sequence features of HA stem antibodies. Additionally, by applying mBLM to HA antibodies with unknown epitopes, we discovered and experimentally validated many HA stem antibodies. Overall, this study not only advances our molecular understanding of the antibody response to the influenza virus but also provides a valuable resource for applying deep learning to antibody research. • Assembled a dataset of 5,561 published antibodies to influenza HA from 132 donors • Antibodies to HA head and stem domains have distinct convergent sequence features • Developed a lightweight language model (mBLM) for antibody specificity prediction • Discovered HA stem antibodies and key somatic hypermutations using mBLM Predicting antibody specificity based solely on sequence has remained an obstacle in humoral research. Leveraging a curated dataset of >5,000 influenza hemagglutinin (HA) antibodies, Wang et al. develop a lightweight memory B cell language model for antibody specificity prediction. This model identifies unique HA stem antibodies and key antibody sequence features. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
34. Peptide hemolytic activity analysis using visual data mining of similarity-based complex networks.
- Author
-
Castillo-Mendieta, Kevin, Agüero-Chapin, Guillermin, Marquez, Edgar A., Perez-Castillo, Yunierkis, Barigye, Stephen J., Vispo, Nelson Santiago, García-Jacas, Cesar R., and Marrero-Ponce, Yovani
- Subjects
- *
PEPTIDES , *DATA mining , *DRUG development , *DATA science , *METADATA - Abstract
Peptides are promising drug development frameworks whose progress has been hindered by intrinsic undesired properties, including hemolytic activity. We aim to gain better insight into the chemical space of hemolytic peptides using a novel approach based on network science and data mining. Metadata networks (METNs) were useful for characterizing and finding general patterns associated with hemolytic peptides, whereas Half-Space Proximal Networks (HSPNs) represented the hemolytic peptide space. The best candidate HSPNs were used to extract various subsets of hemolytic peptides (scaffolds) considering network centrality and peptide similarity. These scaffolds proved useful in developing robust similarity-based classification models. Finally, using an alignment-free approach, we reported 47 putative hemolytic motifs, which can be used as toxic signatures when developing novel peptide-based drugs. We provided evidence that the number of hemolytic motifs in a sequence might be related to the likelihood of its being hemolytic. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
35. Archaeogenetic Data Mining Supports a Uralic–Minoan Homeland in the Danube Basin †.
- Author
-
Revesz, Peter Z.
- Abstract
Four types of archaeogenetic data mining are used to investigate the origin of the Minoans and the Uralic peoples: (1) six SNP mutations related to eye, hair, and skin phenotypes; (2) whole-genome admixture analysis using the G25 system; (3) an analysis of the history of the U5 mitochondrial DNA haplogroup; and (4) an analysis of the origin of each currently known Minoan mitochondrial and Y-DNA haplotype. The uniform result of these analyses is that the Minoans and the Uralic peoples had a common homeland in the lower and middle Danube Basin, as well as the Black Sea coastal regions. This new result helps to reconcile archaeogenetics with linguistics, which has shown that the Minoan language belongs to the Uralic language family. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
36. Semi-Supervised Learning with Close-Form Label Propagation Using a Bipartite Graph.
- Author
-
Peng, Zhongxing, Zheng, Gengzhong, and Huang, Wei
- Abstract
In this paper, we introduce an efficient and effective algorithm for Graph-based Semi-Supervised Learning (GSSL). Unlike other GSSL methods, the proposed algorithm achieves efficiency by constructing a bipartite graph that connects a small number of representative points to a large volume of raw data while capturing their underlying manifold structures. This bipartite graph, with a sparse, anti-diagonal, symmetric affinity matrix, serves as a low-rank approximation of the original graph. Consequently, our algorithm accelerates both the graph construction and label propagation steps. In particular, it computes the label propagation in closed form, reducing the computational complexity from cubic to approximately linear with respect to the number of data points, and it calculates the soft label matrix for unlabeled data using a closed-form solution, thereby gaining additional acceleration. Comprehensive experiments performed on six real-world datasets demonstrate the efficiency and effectiveness of our algorithm in comparison to five state-of-the-art algorithms. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
37. Insights into the RNA Virome of the Corn Leafhopper Dalbulus maidis , a Major Emergent Threat of Maize in Latin America.
- Author
-
Debat, Humberto, Farrher, Esteban Simon, and Bejerman, Nicolas
- Abstract
The maize leafhopper (Dalbulus maidis) is a significant threat to maize crops in tropical and subtropical regions, causing extensive economic losses. While its ecological interactions and control strategies are well studied, its associated viral diversity remains largely unexplored. Here, we employ high-throughput sequencing data mining to comprehensively characterize the D. maidis RNA virome, revealing novel and diverse RNA viruses. We characterized six new viral members belonging to distinct families, with evolutionary cues of beny-like viruses (Benyviridae), bunya-like viruses (Bunyaviridae), iflaviruses (Iflaviridae), orthomyxo-like viruses (Orthomyxoviridae), and rhabdoviruses (Rhabdoviridae). Phylogenetic analysis places the iflaviruses within the genus Iflavirus, in affinity with other leafhopper-associated iflaviruses. The five-segmented and highly divergent orthomyxo-like virus showed a relationship with other insect-associated orthomyxo-like viruses, and the rhabdovirus is related to a leafhopper-associated rhabdo-like virus. Furthermore, the beny-like virus belonged to a cluster of insect-associated beny-like viruses, while the bi-segmented bunya-like virus was related to other bi-segmented insect-associated bunya-like viruses. These results highlight the existence of a complex virome linked to D. maidis and pave the way for future studies investigating the ecological roles, evolutionary dynamics, and potential biocontrol applications of these viruses in the D. maidis–maize pathosystem. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
38. Identifying disputants' attitudinal variations in family mediations: A data mining approach.
- Author
-
Xu, Qingxin
- Subjects
- *
FAMILY mediation , *DATA mining , *EVALUATION , *DATA analysis , *CRITICAL discourse analysis - Abstract
This article combines linguistic analysis and data mining methods to explore variations in speakers' evaluative meaning-making in conflict talks. It focuses on conflict style construction through evaluative language, specifically how disputants advance attitudes. The corpus consists of 230 minutes of family mediation talks involving 12 divorcing spouses. The research draws from the Appraisal framework to analyse evaluative meaning-making at a discourse semantics level, capturing both explicit and implicit attitudes, as well as the scaling and dialogic framing of attitudes. Data exploration uses clustering algorithms via RStudio to identify variations in disputants' discursive behaviour. The findings uncover three conflict styles based on disputants' preference for attitude advancement formulations, with varying degrees of assertiveness and forcefulness. This study's contributions include a holistic treatment of evaluative meaning-making, the marriage of digital tools to nuanced linguistic annotation, and a novel interpretation for conflict style. The findings offer fresh insights into disputants' discursive self-presentation in confrontational exchanges. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
39. Hybrid model approach in data mining.
- Author
-
Bakirarar, Batuhan, Cosgun, Erdal, and Elhan, Atilla Halil
- Abstract
Studies on hybrid data mining approaches have been increasing in recent years. Hybrid data mining is defined as an effective combination of various data mining techniques that uses the power of each technique and compensates for the others' weaknesses. The purpose of this study is to present state-of-the-art data mining algorithms and applications and to propose a new hybrid data mining approach for classifying medical data. In addition, the study aimed to calculate performance metrics of data mining methods and to compare these metrics with those obtained from the hybrid model. The study utilized simulated datasets produced under various scenarios and a hepatitis dataset obtained from the UCI database. Supervised learning algorithms were used, and hybrid models were created by combining these algorithms. In the simulated datasets, MCC values increased with larger sample sizes and higher correlation between the independent variables. In addition, as the correlation between independent variables increased in imbalanced datasets, a noticeable increase was observed in the performance metrics of the group with the smaller sample size. A similar pattern was observed with the real datasets. [ABSTRACT FROM AUTHOR]
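The evaluation above is reported in Matthews correlation coefficient (MCC), a metric that stays informative on imbalanced data. It can be computed directly from confusion-matrix counts; the counts below are hypothetical, not the study's results.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.
    Ranges from -1 (total disagreement) through 0 (chance) to +1
    (perfect prediction); returns 0.0 when any marginal is empty."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

# Hypothetical classifier results on a 100-sample test set
score = mcc(tp=45, tn=40, fp=10, fn=5)
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which is why it is a common choice when class sizes are unequal, as in the imbalanced scenarios the study simulates.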
- Published
- 2024
- Full Text
- View/download PDF
40. Development of an early warning system for higher education institutions by predicting first‐year student academic performance.
- Author
-
Çırak, Cem Recai, Akıllı, Hakan, and Ekinci, Yeliz
- Subjects
- *
UNIVERSITIES & colleges , *ACADEMIC achievement , *DATA analysis , *RANDOM forest algorithms , *REPENTANCE - Abstract
In this study, an early warning system predicting first‐year undergraduate student academic performance is developed for higher education institutions. The significant factors that affect first‐year student success are derived and discussed so that they can inform policy development by related bodies. The dataset used in the experimental analyses includes data on 11,698 freshman students. The problem is constructed as a classification task predicting whether a student will be successful or unsuccessful at the end of the first year, using a total of 69 input variables. Naive Bayes, decision tree, and random forest algorithms are compared on prediction performance. Random forest models outperformed the others, reaching 90.2% accuracy. Findings show that models including the fall-semester CGPA variable performed dramatically better; the student's programme name and university placement exam score are identified as the next most significant variables. A critical discussion based on the findings is provided. The developed model may be used as an early warning system, so that after the second week of the spring semester necessary actions can be taken for students predicted to be unsuccessful, increasing their success and preventing attrition. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
41. What is the best predictor of word difficulty? A case of data mining using random forest.
- Author
-
Ha, Hung Tan, Nguyen, Duyen Thi Bich, and Stoeckel, Tim
- Subjects
- *
DATA mining , *RANDOM forest algorithms , *VOCABULARY , *EMPIRICAL research , *DATA analysis - Abstract
Word frequency has a long history of being considered the most important predictor of word difficulty and has served as a guideline for several aspects of second language vocabulary teaching, learning, and assessment. However, recent empirical research has challenged the supremacy of frequency as a predictor of word difficulty. Accordingly, applied linguists have questioned the use of frequency as the principal criterion in the development of wordlists and vocabulary tests. Despite being informative, previous studies on the topic have been limited in the way the researchers measured word difficulty and the statistical techniques they employed for exploratory data analysis. In the current study, meaning recall was used as a measure of word difficulty, and random forest was employed to examine the importance of various lexical sophistication metrics in predicting word difficulty. The results showed that frequency was not the most important predictor of word difficulty. Due to the limited scope, research findings are only generalizable to Vietnamese learners of English. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
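The study above ranks predictors of word difficulty by random forest feature importance. A generic sketch of that workflow follows; the feature names, weights, and data are invented for illustration and are not the study's actual lexical sophistication metrics:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
# Hypothetical lexical predictors (stand-ins, not the study's metrics)
freq = rng.normal(size=n)
length = rng.normal(size=n)
concreteness = rng.normal(size=n)
# Made-up ground truth: difficulty driven mostly by concreteness
difficulty = 0.2 * freq + 1.0 * concreteness + 0.05 * rng.normal(size=n)

X = np.column_stack([freq, length, concreteness])
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, difficulty)
# Rank features by impurity-based importance, highest first
ranking = sorted(zip(["frequency", "length", "concreteness"],
                     model.feature_importances_), key=lambda kv: -kv[1])
print(ranking[0][0])  # the most important predictor
```

In this synthetic setup the forest correctly demotes frequency, mirroring the kind of result the abstract reports; with real meaning-recall data the ranking is of course an empirical question.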
42. Investigating explainable transfer learning for battery lifetime prediction under state transitions.
- Author
-
Tianze Lin, Sihui Chen, Harris, Stephen J., Tianshou Zhao, Yang Liu, and Jiayu Wan
- Subjects
- *
PRODUCT quality , *TECHNOLOGICAL innovations , *MANUFACTURING processes , *DATA mining , *STANDARD deviations - Abstract
Battery lifetime prediction at early cycles is crucial for researchers and manufacturers to examine product quality and promote technology development. Machine learning has been widely utilized to construct data-driven solutions for high-accuracy predictions. However, the internal mechanisms of batteries are sensitive to many factors, such as charging/discharging protocols, manufacturing/storage conditions, and usage patterns. These factors induce state transitions, thereby decreasing the prediction accuracy of data-driven approaches. Transfer learning is a promising technique that overcomes this difficulty and achieves accurate predictions by jointly utilizing information from various sources. Hence, we develop two transfer learning methods, Bayesian Model Fusion and Weighted Orthogonal Matching Pursuit, to strategically combine prior knowledge with limited information from the target dataset to achieve superior prediction performance. Our results show that the transfer learning methods reduce root-mean-squared error by 41% through adapting to the target domain. Furthermore, the transfer learning strategies identify the variations of impactful features across different sets of batteries and therefore disentangle the battery degradation mechanisms and the root cause of state transitions from the perspective of data mining. These findings suggest that the transfer learning strategies proposed in our work are capable of acquiring knowledge across multiple data sources for solving specialized issues. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
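The paper's Bayesian Model Fusion and Weighted Orthogonal Matching Pursuit methods are not reproduced here, but the underlying idea of fusing source-domain model predictions with weights fit on a small amount of target data can be sketched as follows (all numbers are synthetic, and plain least squares stands in for the paper's weighting schemes):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical target-domain cycle lives and two biased source-model predictions
true_life = rng.uniform(300, 1500, size=20)
pred_a = true_life * 0.8 + rng.normal(0, 30, 20)   # source model A (underestimates)
pred_b = true_life * 1.1 + rng.normal(0, 30, 20)   # source model B (overestimates)

# Fit fusion weights on the limited target data by least squares
A = np.column_stack([pred_a, pred_b])
w, *_ = np.linalg.lstsq(A, true_life, rcond=None)
fused = A @ w

def rmse(y, yhat):
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

# The fused predictor cannot do worse than either source model alone,
# since each is a special case of the weighted combination
print(rmse(true_life, fused) < min(rmse(true_life, pred_a), rmse(true_life, pred_b)))
```

The design point this illustrates: because the single-model choices w = (1, 0) and w = (0, 1) lie inside the search space, fitted fusion weights can only reduce the training-set RMSE, which is the intuition behind combining prior knowledge with target-domain information.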
43. How Do We Select a Combined Algorithm to Determine High-Quality Aerospace Researchers by Utilizing Data Mining Techniques?
- Author
-
GhaviDel, Somayeh, Riahinia, Nosrat, Danesh, Farshid, and Chakoli, Abdolreza Noroozi
- Subjects
- *
RESEARCH personnel , *AEROSPACE technology , *DATA mining , *PROGRAMMING languages , *AEROSPACE industries - Abstract
The aerospace industry and its technologies are consistently regarded as among a country's most important and valuable assets. The research area of "Aerospace" is among the priorities of grand science and technology development strategies, and addressing it is strategically vital. The present research aims to identify the most appropriate algorithm for recognizing high-quality aerospace researchers based on Advanced Ensemble Classifier Techniques (AECT) in data mining, applied to the outputs of scientometric analyses, and to predict the most essential scientometric metrics for identifying high-quality researchers. The study was performed using the protocols of applied research and multiple methods. The studied population includes all aerospace researchers (1945–2021) indexed in the Web of Science Core Collection (WoSCC). DataLab software and multiple programming languages were applied in this research. All three algorithms have an accuracy of 0.96 and an F1-score of 0.97, which indicates that the models have high accuracy, validity, sensitivity, and predictive power. The "Blending" algorithm is considered the most suitable predictive model. According to the LightGBM algorithm's output, the most important and robust metric in the evaluation of prominent researchers is one from the researchers' effectiveness dimension, the Q parameter. Given the demonstrated predictive ability of AECT for identifying high-quality researchers, the metrics mentioned can be used in scientometric researcher evaluation for more accurate and comprehensive prediction. An algorithm that leads to an optimal and efficient classification of researchers enables in-depth analysis of the available data about researchers and sharpens the identification of the highest-quality ones.
The use of the algorithms proposed in this research, besides suggesting the most appropriate one, yielded reliable and valuable knowledge for classifying high-quality aerospace researchers. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
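"Blending," the ensemble algorithm the study above recommends, trains a meta-learner on the predictions base models make over a holdout split. A generic sketch on synthetic data follows (the features stand in for scientometric indicators; nothing here is the study's actual pipeline):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for researcher-evaluation features
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_hold, y_tr, y_hold = train_test_split(X, y, test_size=0.3, random_state=0)

# Base learners are fit only on the training split
base = [RandomForestClassifier(random_state=0),
        GradientBoostingClassifier(random_state=0)]
for m in base:
    m.fit(X_tr, y_tr)

# The meta-learner sees the base models' probabilities on the holdout split
meta_X = np.column_stack([m.predict_proba(X_hold)[:, 1] for m in base])
meta = LogisticRegression().fit(meta_X, y_hold)
print(round(meta.score(meta_X, y_hold), 2))
```

Holding out a separate split for the meta-learner (rather than reusing the base models' training data) is what distinguishes blending from full stacking and keeps the meta-features from being overfit.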
44. Centroid-based clustering validity: method and application to quantification of optimal cluster-data space.
- Author
-
Nguyen, Sy Dzung
- Subjects
- *
BURST noise , *DATA compression , *DATA distribution , *CLUSTER analysis (Statistics) , *DATA mining - Abstract
Evaluation of clustering validity to set up an optimal cluster-data space (CDS) is a vital task in many fields related to data mining. Almost all existing clustering validity indexes (CVIs) lack stability because they are too sensitive to noise, especially impulse noise. Here, we (1) propose a new CVI named DzI (Dzung Index), or fRisk2, using analysis of the fuzzy-set-based accumulated risk degree (FARD), and (2) present a new algorithm named fRisk2-bA for determining the optimal number of data clusters. It is a method for evaluating centroid-based fuzzy clustering validity. In essence, fRisk2 still focuses on enhancing the data compression within each cluster and expanding the separation between cluster centroids; however, these properties are exploited indirectly through FARD. As a result, the proposed method not only avoids the difficulties of traditional approaches that rely on the compression and separation properties directly, but also distills better local and global attributes of the data distribution to estimate the CDS more fully. Along with the proven theoretical basis, surveys, including ones based on noisy measurement datasets, showed the comparative advantages of fRisk2: (1) its accuracy, stability, and convergence are outstanding, and (2) its total computational cost is lower than that of the other surveyed CVIs. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
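The fRisk2 index itself is defined through fuzzy accumulated risk degrees and is not reproduced here, but the compactness-versus-separation idea that most centroid-based CVIs build on can be sketched with an invented toy index on synthetic data:

```python
import numpy as np

def compactness_separation_index(X, labels, centroids):
    """Toy centroid-based validity score: mean within-cluster distance to the
    centroid divided by the minimum between-centroid distance (lower is better).
    Illustrative only -- this is NOT the paper's fRisk2 index."""
    k = len(centroids)
    within = np.mean([np.linalg.norm(X[labels == i] - centroids[i], axis=1).mean()
                      for i in range(k)])
    sep = min(np.linalg.norm(centroids[i] - centroids[j])
              for i in range(k) for j in range(i + 1, k))
    return within / sep

# Two tight, well-separated synthetic clusters
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
labels = np.array([0] * 50 + [1] * 50)
centroids = np.vstack([X[:50].mean(axis=0), X[50:].mean(axis=0)])
print(compactness_separation_index(X, labels, centroids) < 0.5)
```

Indices of this direct compression/separation form are exactly the ones the abstract criticizes as noise-sensitive; a single impulse-noise outlier inflates `within` sharply, which motivates routing both properties through FARD instead.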
45. δ-granular reduction in formal fuzzy contexts: Boolean reasoning, graph representation and their algorithms.
- Author
-
Gong, Zengtai and Zhang, Jing
- Subjects
- *
GRAPH algorithms , *DATA mining , *HYPERGRAPHS , *TRANSVERSAL lines , *ALGORITHMS - Abstract
The fuzzy concept lattice is one of the effective tools for data mining, and granular reduction is one of its significant research contents. However, little research has been done on granular reduction at different granularities in formal fuzzy contexts (FFCs). Furthermore, the complexity of the composition of the fuzzy concept lattice limits the interest in its research. Therefore, how to simplify the concept lattice structure and how to construct granular reduction methods with granularity have become urgent issues that need to be investigated. To this end, firstly, the concept of an object granule with granularity is defined. Secondly, two reduction algorithms, one based on Boolean reasoning and the other on a graph-theoretic heuristic, are formulated while keeping the structure of this object granule unchanged. Further, to simplify the structure of the fuzzy concept lattice, a partial order relation with parameters is proposed. Finally, the feasibility and effectiveness of our proposed reduction approaches are verified by data experiments. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
46. Analysis of the characteristics and applications of cholelithiasis animal models based on data mining.
- Author
-
王琳琳, 朱正望, 赵静涵, 苗明三, and 朱平生
- Abstract
Objective To summarize the existing animal models of cholelithiasis, and to explore a pathological model that better reflects the characteristics of clinical syndromes in traditional Chinese and western medicine and meets the needs of clinical and basic research in traditional Chinese medicine. Methods The animal models used in experimental research on cholelithiasis in China and abroad were collected and sorted by searching CNKI, Wanfang, VIP, SinoMed, PubMed and other databases. The animal types, modeling methods, modeling cycles, detection indicators and positive-control drugs of the models were statistically analyzed. Results Among the 128 articles included, the animal types used in cholelithiasis models were mainly guinea pigs, rabbits and C57BL/6J mice. The most common modeling method is a high-fat diet, with a feeding cycle of eight weeks. High-frequency detection indicators were stone formation rate, total cholesterol, phospholipids, total bilirubin, total bile acid, etc. The commonly used interventions are traditional Chinese medicine compounds, western medicine, and single Chinese medicines or extracts. Ursodeoxycholic acid (UDCA) is mainly used for western medicine intervention, whereas traditional Chinese medicine interventions mainly include compound formulas, acupuncture, exercise, diet and other methods. The positive-control drug in the experiments was mostly UDCA. Conclusion With the continuous improvement and development of cholelithiasis animal models, many modeling methods can simulate the clinical manifestations of cholelithiasis in Chinese and western medicine, but they also have limitations. This paper aims to provide a reference for the selection, application and improvement of cholelithiasis models through data mining and characteristic analysis. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
47. Exploring the prescription rules and mechanism of action of traditional Chinese medicine in the treatment of diabetic periodontitis based on data mining and network pharmacology.
- Author
-
李惠菁, 郜然然, 刘敏, 魏静, 何翔, and 吴也可
- Abstract
Objective To explore the prescription rules of traditional Chinese medicine (TCM) in the treatment of diabetic periodontitis (DP) and the mechanisms of action of the core drug combination. Methods Based on the relevant literature retrieved from CNKI, Wanfang, VIP and SinoMed, a DP prescription database was established. Excel 2021, SPSS Modeler 18.0 and SPSS Statistics 26.0 were used to compute statistics on the frequency, efficacy classifications, properties, flavors, and meridian tropism of the included drugs. Association rule analysis and cluster analysis were performed to screen out the core drug combinations. The active components and action targets of the core drug combination were obtained through TCMSP and HERB. The DP-related disease targets were predicted using GeneCards. The Venny platform was used to obtain the intersection of disease targets and drug targets. Key components were screened by Cytoscape to establish an "active component-target" network. Based on STRING platform data, a PPI network was constructed with Cytoscape to screen core targets. GO functional annotation and KEGG signaling pathway enrichment analysis were carried out for the intersection targets with DAVID. AutoDock Vina was applied for molecular docking between core targets and key components. Results A total of 36 articles were included, and 50 prescriptions involving 100 Chinese herbal medicines were extracted. Alismatis Rhizoma, Rehmanniae Radix Praeparata and Astragali Radix were the most common drugs. The most used drug category was deficiency-nourishing drugs. The properties of the herbs were mainly cold and warm, the major flavors were sweet and bitter, and the main meridian tropisms were kidney and liver. Six categories were identified by cluster analysis. Moutan Cortex-Corni Fructus-Rehmanniae Radix Praeparata was screened out as the core drug combination, involving 18 active components, 164 drug action targets and 104 targets in the intersection of DP targets and drug combination targets.
Quercetin, stigmasterol, kaempferol, β-sitosterol, tetrahydroalstonine, and sitosterol were the key components, and AKT1, IL-6, TNF, IL-1B, PTGS2, JUN, TP53, ESR1, and MMP9 were the core targets. GO analysis revealed 3,724 biological processes, 228 cellular components and 404 molecular functions. KEGG analysis showed that the core drug combination treats DP by regulating 235 signaling pathways. Molecular docking results showed good affinity between the core targets and the key components. Conclusion Tonifying deficiency is the main TCM treatment method for DP, accompanied by clearing heat and removing dampness, activating blood circulation and removing blood stasis, and replenishing qi and nourishing yin. The core drug combination (Moutan Cortex-Corni Fructus-Rehmanniae Radix Praeparata) treats DP through multi-component, multi-target and multi-pathway actions, providing a reference for clinical diagnosis and treatment. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
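The association rule analysis used above to screen core drug combinations rests on the support and confidence measures. A minimal sketch on a made-up toy prescription set (not the study's 50 prescriptions):

```python
# Hypothetical prescriptions (sets of herbs), invented for illustration
prescriptions = [
    {"Moutan Cortex", "Corni Fructus", "Rehmanniae Radix"},
    {"Moutan Cortex", "Corni Fructus", "Astragali Radix"},
    {"Corni Fructus", "Rehmanniae Radix"},
    {"Moutan Cortex", "Corni Fructus"},
    {"Astragali Radix", "Rehmanniae Radix"},
]

def support(itemset):
    """Fraction of prescriptions containing every herb in the itemset."""
    return sum(itemset <= p for p in prescriptions) / len(prescriptions)

def confidence(antecedent, consequent):
    """P(consequent present | antecedent present)."""
    return support(antecedent | consequent) / support(antecedent)

print(round(support({"Moutan Cortex", "Corni Fructus"}), 2))      # → 0.6
print(round(confidence({"Moutan Cortex"}, {"Corni Fructus"}), 2))  # → 1.0
```

Rules whose support and confidence both clear chosen thresholds are kept; in this toy data, "Moutan Cortex ⇒ Corni Fructus" would survive any reasonable cutoff.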
48. Understanding the performance of geographic limits on Web of Science Core Collection databases, using the United Kingdom as an example.
- Author
-
Fulbright, Helen A. and Stansfield, Claire
- Subjects
- *
HEALTH information services , *INTERNET searching , *SERIAL publications , *DATA security , *DATABASES , *DATABASE searching , *ABSTRACTING , *DATA mining , *HEALTH , *CLINICAL medicine research , *CITATION analysis , *MEDLINE , *BIBLIOGRAPHICAL citations , *MEDICAL research , *INFORMATION retrieval , *BIBLIOGRAPHY , *BIBLIOMETRICS , *RESEARCH , *GEOGRAPHIC information systems , *MEDICINE information services - Abstract
Objective: To consider the approaches within Web of Science Core Collection (WoSCC) databases for limiting geographically. To compare the limits to an adaptation of NICE's UK MEDLINE filter for use on WoSCC databases. Methods: We tested and appraised the inbuilt functions and search field options that support identification by countries/regions and affiliations. We compared these with an adapted filter to identify healthcare research on or about the UK. We calculated the recall of the inbuilt limits and filter using 177 studies and investigated why records were missed. We also calculated the percentage reduction of the overall number-needed-to-screen (ONNS). Results: Inbuilt limits within WoSCC enable identification of research from specific countries/regions or affiliations if there is data in the address field. Refining by affiliations allows retrieval of research where affiliations are in the 200 or 500 most frequent for a set of results. An adaptation of the UK MEDLINE filter achieved an average of 97% recall. ONNS was significantly reduced using the filter. However, studies where the countries or regions are only mentioned within the full text or other non-searchable fields will be missed. Conclusion: Information specialists should consider how inbuilt geographic limits operate on WoSCC and whether these are suitable for their research. The adapted filter can sensitively limit to the UK and could be useful for systematic reviews due to its high recall and ability to significantly reduce ONNS. Geographic filters can be feasible to adapt for use on WoSCC databases (where similar search fields are used between platforms). [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
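Recall and the reduction in overall number-needed-to-screen (ONNS) reported above are simple ratios. A toy calculation follows; only the 177-study gold set comes from the abstract, while the retrieval counts are invented:

```python
# Gold-standard relevant studies (from the abstract) and hypothetical counts
total_relevant = 177            # known relevant studies
retrieved_relevant = 172        # hypothetical: studies the filter captured
results_without_filter = 50000  # hypothetical: raw search result count
results_with_filter = 12000     # hypothetical: result count after the filter

recall = retrieved_relevant / total_relevant
onns_reduction = 1 - results_with_filter / results_without_filter
print(f"recall={recall:.0%}, screening reduced by {onns_reduction:.0%}")
```

With these made-up counts the filter keeps about 97% of the relevant studies while cutting the screening load by 76%, which is the trade-off (high recall versus reduced ONNS) the abstract evaluates for the adapted UK filter.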
49. DDE KG Editor: A data service system for knowledge graph construction in geoscience.
- Author
-
Hou, Chengbin, Liu, Kaichuang, Wang, Tianheng, Shi, Shunzhong, Li, Yan, Zhu, Yunqiang, Hu, Xiumian, Wang, Chengshan, Zhou, Chenghu, and Lv, Hairong
- Subjects
- *
KNOWLEDGE graphs , *ELECTRONIC data processing , *DATA mining , *COMPUTER science , *EARTH scientists - Abstract
Deep‐time Digital Earth (DDE) is an innovative international big science program, focusing on scientific propositions of earth evolution, changing Earth Science by coordinating global geoscience data, and sharing global geoscience knowledge. To facilitate the DDE program with recent advances in computer science, the geoscience knowledge graph plays a key role in organizing the data and knowledge of multiple geoscience subjects into Knowledge Graphs (KGs), which enables the calculation and inference over geoscience KGs for data mining and knowledge discovery. However, the construction of geoscience KGs is challenging. Though there have been some construction tools, they commonly lack collaborative editing and peer review for building high‐quality large‐scale geoscience professional KGs. To this end, a data service system or tool, DDE KG Editor, is developed to construct geoscience KGs. Specifically, it comes with several distinctive features such as collaborative editing, peer review, contribution records, intelligent assistance, and discussion forums. Currently, global geoscientists have contributed over 60,000 ontologies for 22 subjects. The stability, scalability, and intelligence of the system are regularly improving as a public online platform to better serve the DDE program. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
50. ClimShift – A new toolbox for the detection of climate change.
- Author
-
Magyari‐Sáska, Zsolt, Croitoru, Adina‐Eliza, Horváth, Csaba, and Dombay, Ștefan
- Subjects
- *
CLIMATE change detection , *DATA mining - Abstract
Climate change no longer involves and affects just a few people or communities. Most communities, however, need climate change detection studies in order to adapt efficiently to current and future climate conditions. The present research aimed to detect climate change by considering the shift in climate conditions from one region to another over different periods, based on a similarity index, in the Carpathian Basin, using the new ClimShift toolbox created specifically for this purpose. Developed in R, based on the cosine similarity index and using a set of 32 climate indices (temperature and precipitation), ClimShift uses the NC raster format (NetCDF files) as input data. The application is compatible with Microsoft and Unix/Linux environments. The toolbox allows the detection of forward and backward climate shifts. The results can be employed as a Climate Service and are extremely helpful for an efficient process of adaptation to climate change at a local/regional scale. A user-friendly interface and a tutorial on how to use the toolbox are also available. The toolbox was tested for four locations in the Carpathian Basin (Vienna, Bekes, Cluj-Napoca and Kosice) using 1961–1990 as the base period and 1991–2021 as the analysis period for the forward climate shift analysis. For Cluj-Napoca, the application was also tested for the backward climate shift, using 1991–2021 as the base period and 1961–1990 as the analysis period, identifying the region whose climate during the older period matched present conditions. The scientific results indicated a significant shift towards the east and northeast from the older period to the most recent one and a low percentage (6%–10%) of overlapping area with highly similar conditions between the two periods. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
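ClimShift's similarity measure is the cosine index computed over a vector of climate indices. A minimal sketch, using standardized (z-score) index values invented for illustration rather than the toolbox's 32 real indices:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two climate-index vectors (1 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Hypothetical standardized index values for a base-period location
# and a candidate region in the analysis period
base_period = [0.5, -1.2, 0.8]
analysis_period = [0.6, -1.0, 1.1]
print(round(cosine_similarity(base_period, analysis_period), 4))
```

Standardizing the indices first matters: on raw scales a large-magnitude component such as annual precipitation would dominate the dot product and push every comparison toward a similarity of 1.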