Search Results
169 results
2. A Retrieval-augmented Generation application for Question-Answering in Nutrigenetics Domain
- Author
- Benfenati, Domenico, De Filippis, Giovanni Maria, Rinaldi, Antonio Maria, Russo, Cristiano, and Tommasino, Cristian
- Published
- 2024
3. Leveraging AI for Current Research Information Systems: Opportunities and Challenges.
- Author
- Hartmann, Simone and Niederlechner, Daniel
- Subjects
- DATA privacy, ALGORITHMIC bias, ARTIFICIAL intelligence, TREND analysis, INFORMATION retrieval
- Abstract
Integrating Artificial Intelligence (AI) into Current Research Information Systems (CRIS) offers significant opportunities to enhance research management. This paper explores AI's potential to automate data handling, improve analytical capabilities, and enhance user experiences within CRIS. Key areas of impact include data enrichment, advanced information retrieval, trend analysis, and predictive analytics. The paper also addresses the challenges and ethical considerations of AI integration, such as data privacy, security, and algorithmic bias. Insights from a Live Poll at the CRIS2024 conference reveal high familiarity with AI among participants, optimism about its potential, and recognition of implementation challenges. By overcoming these obstacles, AI can transform CRIS, making research management more efficient and effective. The paper concludes by advocating for collaboration and dialogue to guide the responsible integration of AI in CRIS, ensuring alignment with stakeholder interests. [ABSTRACT FROM AUTHOR]
- Published
- 2024
4. The Impact and Implementation of Distributed Ledger Technology (DLT) On Accounting Information Storage and Verification.
- Author
- Deng, Yuting, Chen, Junhao, Yang, Xinyu, Chen, Jiawen, and Yuan, Qing
- Subjects
- DATA warehousing, INFORMATION retrieval, DATA integrity, SCALABILITY, ALGORITHMS, BLOCKCHAINS, DATA encryption
- Abstract
In recent years, blockchain technology has received attention because of its decentralization, immutability and other characteristics, but it faces storage and retrieval challenges. To address these challenges, this paper introduces IOTA distributed ledger technology, which solves the scalability and cost problems of traditional blockchains. By analyzing and experimenting with the Tangle, the underlying consensus structure of IOTA, this paper reveals the main factors affecting its development, and proposes a segmented adaptive cutting-edge transaction selection algorithm to optimize system performance. At the same time, based on the IOTA distributed ledger, this paper proposes a data encryption storage and retrieval scheme, which speeds up data linking and retrieval and ensures the integrity and security of the data. Finally, this paper discusses the application of blockchain in accounting informatization, and puts forward a scheme for building a new generation of accounting informatization platform, which is of great value to the construction of accounting informatization. [ABSTRACT FROM AUTHOR]
- Published
- 2024
5. N-AMES: Named entity recognition using contextual attention on masked entities and sections.
- Author
- Landolsi, Mohamed Yassine and Ben Romdhane, Lotfi
- Subjects
- DATA mining, DEEP learning, INFORMATION retrieval, MACHINE learning, ACCESS to information
- Abstract
High-quality scientific research papers play a crucial role in acquiring knowledge across various fields. Extracting information from the full text of these papers presents a challenging task. Named Entity Recognition (NER) stands out as a key step in information extraction. This recognition is indispensable for many tasks and applications requiring direct access to relevant information, such as knowledge discovery and document retrieval. However, effectively processing important contextual information remains a challenge in the literature. In our work, we introduce a supervised NER method named N-AMES (Named entity recognition using contextual Attention on Masked Entities and Sections). Leveraging the attention mechanism of the BERT transformer, our method effectively processes both global and local entity context information. By incorporating section titles into a sequence of text chunks and training the model on masked entities, our approach achieves remarkable performance. Specifically, our model achieves an F1-measure of 74.10% and 72.30% in partial matching evaluation, outperforming state-of-the-art models on two distinct full-text research paper datasets: SciREX for machine learning and CRAFT for biomedical entities. [ABSTRACT FROM AUTHOR]
- Published
- 2024
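To make the masking-and-sections idea above concrete, here is a minimal sketch (our illustration, not the authors' code): a text chunk is prefixed with its section title and its entity spans are replaced with a mask token before being fed to a BERT-style model. The helper name, special tokens, and example spans are all assumptions.

```python
# Illustrative sketch (not the authors' code): prepend a section title to a
# text chunk and mask known entity spans, as the N-AMES abstract describes.
# The [CLS]/[SEP]/[MASK] convention follows BERT-style models.

def build_masked_chunk(section_title, tokens, entity_spans, mask_token="[MASK]"):
    """Prefix the chunk with its section title and mask entity tokens.

    entity_spans: list of (start, end) token indices (end exclusive) to mask,
    so the model must infer entity labels from the surrounding context.
    """
    masked = list(tokens)
    for start, end in entity_spans:
        for i in range(start, end):
            masked[i] = mask_token
    # The section title acts as global context, separated from the local chunk.
    return ["[CLS]", *section_title.split(), "[SEP]", *masked, "[SEP]"]

tokens = "the CRAFT corpus annotates protein names in full text".split()
print(build_masked_chunk("Materials and Methods", tokens, [(4, 6)]))
```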
6. Towards an LLM based approach for medical e-consent.
- Author
- Naji, Mouncef, Masmoudi, Maroua, and Zghal, Hajer Baazaoui
- Subjects
- LANGUAGE models, KNOWLEDGE graphs, INFORMATION storage & retrieval systems, INFORMATION retrieval
- Abstract
The question of informed and voluntary consent emerges as a matter of significance in healthcare. Obtaining informed consent encounters many obstacles tied to systemic, clinician-related, and patient-related factors, demanding interventions at different levels. This paper introduces a novel approach to presenting personalized consent based on Large Language Models (LLMs). The personalization of information is achieved through the combination of the LLM with a knowledge graph. Our approach focuses on how the knowledge graph enhances and personalizes content generation, thereby enabling the acquisition of informed consent. The paper also addresses the information retrieval hyper-parameters that help give better prompts to the LLM. Experiments have shown interesting results in terms of personalization and information retrieval, measured with the Rouge, Faithfulness and Relevance metrics. [ABSTRACT FROM AUTHOR]
- Published
- 2024
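The abstract above combines an LLM with a knowledge graph and tunes retrieval hyper-parameters to build better prompts. A hedged sketch of that pattern, with invented triples, a hypothetical top-k retrieval function, and a made-up prompt template:

```python
# A minimal sketch (our own, not the paper's implementation) of how a knowledge
# graph can personalize an LLM consent prompt: retrieve the top-k facts about
# the patient and the procedure, then splice them into the prompt template.

KG = {  # toy triples: (subject, predicate, object)
    ("colonoscopy", "risk", "bleeding"),
    ("colonoscopy", "risk", "perforation (rare)"),
    ("colonoscopy", "requires", "sedation"),
    ("patient_042", "allergy", "midazolam"),
}

def retrieve_facts(entity, k=3):
    """Top-k facts mentioning the entity; k stands in for the retrieval
    hyper-parameters the paper tunes to give the LLM a better prompt."""
    facts = [t for t in KG if entity in (t[0], t[2])]
    return facts[:k]

def build_consent_prompt(patient, procedure):
    facts = retrieve_facts(procedure) + retrieve_facts(patient)
    context = "\n".join(f"- {s} {p} {o}" for s, p, o in facts)
    return (f"Using only the facts below, explain the consent form for "
            f"{procedure} to {patient} in plain language.\n{context}")

print(build_consent_prompt("patient_042", "colonoscopy"))
```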
7. Determination of emissivity profiles using a Bayesian data-driven approach.
- Author
- Sgheri, Luca, Sgattoni, Cristina, and Zugarini, Chiara
- Subjects
- EMISSIVITY, DATABASES, INFORMATION retrieval, LAND cover, TEST methods
- Abstract
In this paper, we explore the determination of a spectral emissivity profile that closely matches real data, intended for use as an initial guess and/or a priori information in a retrieval code. Our approach employs a Bayesian method that integrates the CAMEL (Combined ASTER MODIS Emissivity over Land) emissivity database with the MODIS/Terra+Aqua Yearly Land Cover Type database. The solution is derived as a convex combination of high-resolution Huang profiles using the Bayesian framework. We test our method on IASI (Infrared Atmospheric Sounding Interferometer) data and find that it outperforms the linear spline interpolation of the CAMEL data and the Huang emissivity database itself. [ABSTRACT FROM AUTHOR]
- Published
- 2025
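As we read the abstract above, the core computation is a Bayesian update over candidate emissivity profiles, with the solution formed as a convex combination. A toy sketch under that assumption (all numbers invented; the CAMEL/MODIS machinery is omitted):

```python
# Hedged sketch of the core idea: express the emissivity profile as a convex
# combination of basis ("Huang"-like) profiles, with weights obtained by
# Bayesian updating against observed channels. Values are made up.

import numpy as np

basis = np.array([      # rows: candidate emissivity profiles on 4 channels
    [0.95, 0.96, 0.97, 0.96],   # e.g. vegetation-like
    [0.80, 0.85, 0.90, 0.88],   # e.g. bare-soil-like
    [0.99, 0.98, 0.99, 0.99],   # e.g. water-like
])
prior = np.array([0.4, 0.4, 0.2])          # e.g. from a land-cover database
obs, sigma = np.array([0.94, 0.95, 0.96, 0.95]), 0.02

# Gaussian likelihood of the observations under each basis profile.
loglik = -0.5 * np.sum(((basis - obs) / sigma) ** 2, axis=1)
post = prior * np.exp(loglik - loglik.max())
post /= post.sum()                          # posterior weights sum to one

profile = post @ basis                      # convex combination of profiles
print("weights:", post.round(3), "profile:", profile.round(3))
```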
8. Instability results for cosine-dissimilarity-based nearest neighbor search on high dimensional Gaussian data
- Author
- Giannella, Chris R.
- Published
- 2025
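Although no abstract is shown for this entry, the phenomenon in the title is easy to reproduce empirically. A small demonstration (ours): as dimension grows, the relative contrast between the nearest and farthest cosine dissimilarities from a query to i.i.d. Gaussian points collapses, which is what destabilizes nearest neighbor search.

```python
# Quick empirical illustration: for i.i.d. Gaussian data, cosine
# dissimilarities from a query to its nearest and farthest neighbors become
# nearly indistinguishable as the dimension grows.

import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 10_000):
    X = rng.standard_normal((1000, d))
    q = rng.standard_normal(d)
    cos = (X @ q) / (np.linalg.norm(X, axis=1) * np.linalg.norm(q))
    dis = 1.0 - cos                       # cosine dissimilarity
    contrast = (dis.max() - dis.min()) / dis.min()
    print(f"d={d:>6}: relative contrast = {contrast:.3f}")
```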
9. A dynamic snow depth retrieval model based on time-series clustering optimization for GPS-IR.
- Author
- Wang, Tianyu, Zhang, Rui, Yang, Yunjie, Liu, Anmengyun, Jiang, Yao, Lv, Jichao, Tu, Jinsheng, and Song, Yunfan
- Subjects
- SNOW accumulation, GLOBAL Positioning System, MACHINE learning, GPS receivers, INFORMATION retrieval
- Abstract
Due to the influence of environmental factors (e.g., terrain and surface coverage) around GPS receivers, the snow depth retrieval results obtained by the existing global positioning system interferometric reflectometry (GPS-IR) method show significant variability. The resulting loss of reliability and accuracy limits the broad application of this technology. Therefore, this paper proposes a dynamic snow depth retrieval model based on time-series clustering optimization for GPS-IR to fully leverage multi-source satellite observation data for automatic and high-precision snow depth retrieval. The model employs Dynamic Time Warping distance measurement combined with the K-Medoids clustering algorithm to categorize frequency sequences obtained from various satellite trajectories, facilitating effective integration of multi-constellation data and acquisition of optimal datasets. Additionally, Long Short-Term Memory networks are integrated to capture and process the long-term dependencies in snow depth data, enhancing the model's adaptability in handling time-series data. Validated against SNOTEL measured data and standard machine learning algorithms (such as BP Neural Networks, RBF, and SVM), the model's retrieval capability is confirmed. For the P351 and AB39 sites, the correlation coefficients for L1-band data retrieval were both 0.996, with RMSEs of 0.051 and 0.018 m, respectively. The experimental results show that the proposed model demonstrates superior precision and robustness in snow depth retrieval compared to previous methods. We then analyze the accuracy loss caused by sudden snowfall events. The proposed model and methodology offer new insights into the in-depth study of snow depth monitoring. [ABSTRACT FROM AUTHOR]
- Published
- 2024
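The clustering stage described above, Dynamic Time Warping distance plus K-Medoids over per-satellite frequency sequences, can be sketched compactly. This is our own minimal, unoptimized code with synthetic series, not the paper's model (the LSTM stage is omitted):

```python
# Sketch of the clustering stage: Dynamic Time Warping as the distance between
# per-satellite frequency series, then K-Medoids to group trajectories and
# pick representative ones.

import numpy as np

def dtw(a, b):
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf); D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def k_medoids(D, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), k, replace=False)
    for _ in range(iters):
        labels = np.argmin(D[:, medoids], axis=1)   # nearest-medoid assignment
        # New medoid per cluster: the member minimizing intra-cluster distance.
        new = [grp[np.argmin(D[np.ix_(grp, grp)].sum(0))]
               for c in range(k) for grp in [np.where(labels == c)[0]]]
        if np.array_equal(new, medoids):
            break
        medoids = np.array(new)
    return medoids, labels

series = [np.sin(np.linspace(0, 6, 50) * f) for f in (1.0, 1.1, 2.0, 2.1)]
D = np.array([[dtw(a, b) for b in series] for a in series])
print(k_medoids(D, k=2))
```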
10. The Process and Algorithm Analysis of Text Mining System Based on Artificial Intelligence.
- Author
- Chai, Xiaoliang, Xu, Songxiao, Li, Shilin, and Zhao, Junyu
- Subjects
- TEXT mining, ARTIFICIAL intelligence, GENETIC algorithms, ALGORITHMS, INFORMATION retrieval, INFORMATION networks
- Abstract
The rapid development of the Internet has led to rapid growth in online information, a phenomenon often called the information explosion. The Internet is full of information, and it is difficult for users to find relevant information and useful knowledge in this ocean of data. The Web has become the world's largest information repository, and there is an urgent need for efficient access to the valuable knowledge contained in vast amounts of web information. The purpose of this paper is to study the process and algorithm analysis of a text mining system based on artificial intelligence. This paper presents an algorithm for document feature acquisition based on a genetic algorithm. Selecting suitable features is an important task in text classification and information retrieval. Finding appropriate feature vectors to represent the text will undoubtedly help with subsequent sorting and grouping. Based on a genetic algorithm with variable-length chromosomes, this paper improves the crossover, mutation and selection operations, and proposes an algorithm for obtaining text feature vectors. This method has a wide range of applications and good results. [ABSTRACT FROM AUTHOR]
- Published
- 2023
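A minimal sketch of the abstract's variable-length-chromosome idea (ours; the fitness function is a stand-in, since the paper's classification-based fitness is not spelled out here): chromosomes are sets of term indices, and crossover and mutation may change their length.

```python
# Illustrative GA over variable-length chromosomes: a chromosome is the set of
# term indices kept as the document feature vector. Fitness is a toy proxy.

import random
random.seed(1)

N_TERMS = 50
USEFUL = set(range(8))          # pretend terms 0-7 actually separate classes

def fitness(chrom):             # reward useful terms, penalize vector length
    return len(set(chrom) & USEFUL) - 0.05 * len(chrom)

def crossover(a, b):            # child length may differ from both parents
    cut_a, cut_b = random.randrange(len(a)), random.randrange(len(b))
    child = list(dict.fromkeys(a[:cut_a] + b[cut_b:]))
    return child or [random.randrange(N_TERMS)]

def mutate(chrom):              # grow or shrink the chromosome
    if random.random() < 0.5 and len(chrom) > 1:
        chrom.remove(random.choice(chrom))
    else:
        chrom.append(random.randrange(N_TERMS))
    return chrom

pop = [[random.randrange(N_TERMS) for _ in range(random.randint(3, 12))]
       for _ in range(30)]
for _ in range(60):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:10]          # truncation selection
    pop = parents + [mutate(crossover(*random.sample(parents, 2)))
                     for _ in range(20)]
print(sorted(set(max(pop, key=fitness))))   # selected feature indices
```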
11. A hierarchical shape description approach and its application in similarity measurement of polygon entities.
- Author
- Ma, Jingzhen, Sun, Qun, Ma, Chao, Lyu, Zheng, Sun, Shijie, and Wen, Bowei
- Subjects
- SHAPE measurement, MULTISENSOR data fusion, INFORMATION retrieval, POLYGONS, INFORMATION processing, MEASUREMENT
- Abstract
Spatial similarity provides an important basis for geographic information processing and is widely applied in multi-source data fusion and updating, data retrieval and query, and cartographic generalization. To address the shape description and similarity measurement of polygon entities, this study presents a new hierarchical shape description approach and examines its application in similarity measurement of polygon entities. Using rotation and segmentation methods, we first constructed a hierarchical shape description model for target polygon entities, followed by measurement of the global and hierarchical shape descriptions of polygon entities using, respectively, the farthest-point-distance and geometric feature description methods. Finally, we constructed a comprehensive similarity measurement model through a weighted integration of position, size, direction, and shape. The hierarchical shape description approach proposed in this paper can be applied to the shape similarity measurement of polygon elements, similarity measurement after spatial object simplification, and multi-scale polygon entity matching. The experimental results showed that the hierarchical shape description approach and similarity measurement model are able to effectively measure spatial similarity between different polygon entities, with good application results. [ABSTRACT FROM AUTHOR]
- Published
- 2024
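The farthest-point-distance description named above can be illustrated in a few lines (our sketch, not the authors' model; rotation alignment and the hierarchical levels are omitted):

```python
# Minimal farthest-point-distance shape signature: per vertex, the distance to
# the farthest vertex, normalized so the descriptor is scale-invariant.

import numpy as np

def fpd_signature(poly):
    P = np.asarray(poly, float)
    d = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)  # pairwise
    sig = d.max(axis=1)
    return sig / sig.max()                                     # scale-invariant

def shape_distance(a, b, n=64):
    # Resample both signatures to a common length before the L2 comparison.
    sa, sb = (np.interp(np.linspace(0, 1, n), np.linspace(0, 1, len(s)), s)
              for s in (fpd_signature(a), fpd_signature(b)))
    return np.linalg.norm(sa - sb) / np.sqrt(n)

square = [(0, 0), (1, 0), (1, 1), (0, 1)]
lshape = [(0, 0), (2, 0), (2, 1), (1, 1), (1, 2), (0, 2)]
print(shape_distance(square, square), shape_distance(square, lshape))
```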
12. Microprism-based layered BIM modeling for railway station subgrade.
- Author
- Fan, Xiaomeng, Pu, Hao, Schonfeld, Paul, Zhang, ShiHong, Li, Wei, Ran, Yang, and Wang, Jia
- Subjects
- BUILDING information modeling, RAILROAD stations, PARALLEL processing, INFORMATION retrieval, GEOLOGY
- Abstract
Volumetric modeling for Railway Station Subgrade (RSS) is a complex task in applying BIM to railway stations. However, an effective method for handling RSS volumetric modeling, which is further complicated by heterogeneous data sources and geometric features, has been unavailable. To address this issue, this paper presents a BIM modeling method that generates layered microprisms to represent the volumetric information of an RSS. The proposed method is demonstrated with a real-world case, through which the modeling accuracy and the efficiency of spatial data retrieval are verified to be satisfactory. This paper contributes to the existing body of knowledge by proposing a unified and accurate volumetric modeling method for multi-layer structures. In the future, modeling efficiency can be further improved by introducing GPU-based parallel processing. • Comprehensive volumetric information of railway station subgrade (RSS) is expressed with a multi-layer semantic model. • The involved layers of filler, geology and terrain are depicted by the proposed unified modeling method. • Multi-layer 3D microprisms are generated to express the irregular spaces among the involved layers. • A grid-based spatial retrieval method is developed to achieve highly efficient spatial queries. [ABSTRACT FROM AUTHOR]
- Published
- 2024
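The highlights mention a grid-based spatial retrieval method for efficient spatial queries. A generic sketch of that pattern (ours; cell size, names, and data are invented):

```python
# Generic grid index: hash each microprism's bounding box into uniform grid
# cells so a spatial query only inspects candidates in the cells it overlaps.

from collections import defaultdict

CELL = 10.0                       # grid cell size in model units (assumed)

def cells(xmin, ymin, xmax, ymax):
    for i in range(int(xmin // CELL), int(xmax // CELL) + 1):
        for j in range(int(ymin // CELL), int(ymax // CELL) + 1):
            yield (i, j)

index = defaultdict(set)

def insert(obj_id, bbox):
    for c in cells(*bbox):
        index[c].add(obj_id)

def query(bbox):
    hits = set()
    for c in cells(*bbox):
        hits |= index[c]
    return hits                   # candidates; refine with exact geometry

insert("prism_1", (0, 0, 12, 4))
insert("prism_2", (30, 30, 35, 38))
print(query((8, 0, 15, 5)))       # -> {'prism_1'}
```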
13. Learning from construction accidents in virtual reality with an ontology-enabled framework.
- Author
- Pedro, Akeem, Bao, Quy Lan, Hussain, Rahat, Soltani, Mehrtash, Pham, Hai Chien, and Park, Chansik
- Subjects
- INTERACTIVE learning, INFORMATION retrieval, ACCESS to information, EDUCATIONAL games, LEARNING modules
- Abstract
Learning from accidents is essential in preventing their recurrence; however, the unstructured nature of construction accident data poses a significant challenge to the retrieval of insightful information from past incidents. The absence of engaging training tools that facilitate access to such information also impedes learning. This paper aims to develop an ontology-enabled Virtual Reality (VR) framework to provide access to incident data in immersive educational settings. The framework comprises three modules: 1) an Ontology module for structuring information from accidents; 2) a VR module for interactive learning based on accident cases; and 3) a Semantic enrichment module for embedding accident information in VR scenarios. A prototype was developed to verify the framework's technical feasibility, usability, and educational potential. User trials confirm that the solution offers a promising medium for learning from accidents. It is anticipated that the framework would enhance practices for learning from accidents and contribute to improved safety outcomes in construction. • Learning from accidents is crucial for enhancing construction safety. • This paper proposes the CASE-VR framework to improve learning from accidents. • CASE-VR provides an ontology-enabled VR platform with structured accident data. • A prototype demonstrates the feasibility and usability of the proposed framework. [ABSTRACT FROM AUTHOR]
- Published
- 2024
14. Hybrid Approach To Unsupervised Keyphrase Extraction.
- Author
- Singh, Vijender and Bolla, Bharat Kumar
- Subjects
- INFORMATION retrieval
- Abstract
The exponential growth of textual data poses a monumental challenge for extracting meaningful knowledge. Manually identifying descriptive keywords or keyphrases for each document is infeasible given the massive volume of text generated daily. Automatic keyphrase extraction is, therefore, essential. However, current techniques struggle to learn the most salient semantic features from lengthy documents. This hybrid keyphrase extraction framework uniquely combines the complementary strengths of graph-based and textual feature methods, and demonstrates improved performance over relying solely on statistical or graphical methods. Graph-based systems leverage word co-occurrence networks to score importance; textual methods extract keyphrases using linguistic properties. Together, these complementary techniques overcome the limitations of relying on any single strategy. The hybrid approach is evaluated on the standard SemEval 2017 Task 10 and SemEval 2010 Task 5 benchmark datasets for scientific paper keyphrase extraction. Performance is quantified using the F1 score relative to human-annotated ground-truth keyphrases, on long documents with thousands of terms where only a few keywords represent the salient concepts. Results show our technique effectively identifies the most salient semantic keywords, overcoming the limitations of current techniques that struggle to mix graphical and statistical features. Our experiments demonstrate that the proposed hybrid approach achieves superior F1 scores compared to current state-of-the-art methods on benchmark datasets. These results validate that synergistically combining graph and textual features enables more accurate keyphrase extraction, especially for long documents laden with extraneous terms. [ABSTRACT FROM AUTHOR]
- Published
- 2024
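A toy sketch of the hybrid scoring idea above (our own; the paper's actual graph and textual features are richer): one signal from a co-occurrence graph, one statistical signal, combined after normalization.

```python
# Hybrid keyword scoring: a graph signal (co-occurrence centrality) and a
# textual/statistical signal (frequency), ranked by a normalized combination.

from collections import Counter, defaultdict

text = ("graph based keyphrase extraction combines graph centrality with "
        "textual features for keyphrase ranking in long documents").split()

freq = Counter(text)                              # statistical signal
cooc = defaultdict(set)                           # graph signal: window = 2
for i, w in enumerate(text):
    for v in text[max(0, i - 2): i + 3]:
        if v != w:
            cooc[w].add(v)

def norm(d):
    m = max(d.values())
    return {k: v / m for k, v in d.items()}

f, g = norm(freq), norm({w: len(nbrs) for w, nbrs in cooc.items()})
hybrid = {w: 0.5 * f[w] + 0.5 * g[w] for w in f}
print(sorted(hybrid, key=hybrid.get, reverse=True)[:5])
```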
15. Listwise learning to rank method combining approximate NDCG ranking indicator with Conditional Generative Adversarial Networks.
- Author
- Li, Jinzhong, Zeng, Huan, Xiao, Cunwei, Ouyang, Chunjuan, and Liu, Hua
- Subjects
- GENERATIVE adversarial networks, INFORMATION retrieval
- Abstract
Previous empirical studies have shown that listwise learning to rank approaches generally perform better than pointwise or pairwise learning to rank techniques. Listwise methods that directly optimize information retrieval indicators are an essential and popular type of learning to rank method. However, the existing learning to rank approaches based on Generative Adversarial Networks (GAN) do not utilize a loss function based on information retrieval indicators to optimize the generator and/or discriminator. Thus, an approach to learning to rank that combines an approximate Normalized Discounted Cumulative Gain (NDCG) ranking indicator with Conditional Generative Adversarial Networks (CGAN) is proposed in this paper, named NCGAN-LTR. The NCGAN-LTR approach constructs loss functions for the generator and discriminator based on the Plackett-Luce model and an approximate version of the NDCG ranking indicator, which are utilized to train the network parameters of the CGAN. The experimental results on four benchmark learning to rank datasets, i.e., TREC TD2004, OHSUMED, MQ2008, and MSLR-WEB10K, demonstrate that our proposed NCGAN-LTR approach has superior performance across almost all ranking indicators of information retrieval compared with the IRGAN-List approach. [ABSTRACT FROM AUTHOR]
- Published
- 2024
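The "approximate NDCG" ingredient above is a standard trick: replace the hard rank with a sigmoid-smoothed rank so the indicator becomes differentiable and can drive generator/discriminator losses. A compact sketch (ours, with invented scores and gains):

```python
# Approximate NDCG: the rank of item i is smoothed as
#   rank_i ~= 1 + sum_{j != i} sigmoid((s_j - s_i) / temp),
# which makes the NDCG indicator differentiable in the scores s.

import numpy as np

def approx_ndcg(scores, gains, temp=0.1):
    s = np.asarray(scores, float)
    diff = (s[None, :] - s[:, None]) / temp        # pairwise score gaps
    # Subtract 0.5 to drop the j == i term, since sigmoid(0) = 0.5.
    soft_rank = 1.0 + np.sum(1.0 / (1.0 + np.exp(-diff)), axis=1) - 0.5
    dcg = np.sum((2.0 ** gains - 1) / np.log2(1.0 + soft_rank))
    ideal = np.sort(gains)[::-1]
    idcg = np.sum((2.0 ** ideal - 1) / np.log2(2.0 + np.arange(len(gains))))
    return dcg / idcg

gains = np.array([3, 2, 0, 1])                     # graded relevance labels
print(approx_ndcg([2.5, 1.0, 0.2, 0.7], gains))    # near-ideal ordering
print(approx_ndcg([0.1, 0.2, 2.0, 1.5], gains))    # poor ordering scores lower
```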
16. Vietnamese Legal Text Retrieval based on Sparse and Dense Retrieval approaches.
- Author
- Khang, Nguyen Hoang Gia, Nhat, Nguyen Minh, Quoc, Trung Nguyen, and Hoang, Vinh Truong
- Subjects
- LANGUAGE models, VIETNAMESE language, DATA augmentation, LEGAL documents, INFORMATION retrieval
- Abstract
We introduce a combination of two techniques, Sparse Retrieval and Dense Retrieval, while experimenting with different training approaches to find the optimal method for the Vietnamese Legal Text Retrieval task. Moreover, although the Question Answering component was built only on the open domain of UIT-ViQuAD, it showed promising results on the in-domain legal dataset. Finally, we also describe the augmentation of legal documents, up to 3 GB, used to train the PhoBERT language model, and improve this backbone with Condenser and CoCondenser. These techniques can also be utilized for other information retrieval tasks in languages with limited resources. [ABSTRACT FROM AUTHOR]
- Published
- 2024
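A minimal sketch of the sparse-plus-dense combination described above (ours; scores, document ids, and the blending weight are invented and would be tuned on a validation set):

```python
# Hybrid retrieval fusion: min-max normalize each retriever's scores per
# query, then blend them with a weight alpha.

def minmax(scores):
    lo, hi = min(scores.values()), max(scores.values())
    return {d: (s - lo) / (hi - lo or 1.0) for d, s in scores.items()}

def hybrid(sparse, dense, alpha=0.5):
    s, d = minmax(sparse), minmax(dense)
    docs = set(s) | set(d)
    return sorted(((alpha * s.get(doc, 0.0) + (1 - alpha) * d.get(doc, 0.0), doc)
                   for doc in docs), reverse=True)

bm25  = {"art_12": 14.2, "art_97": 11.8, "art_05": 3.1}   # lexical match
dense = {"art_97": 0.83, "art_44": 0.80, "art_12": 0.41}  # embedding match
print(hybrid(bm25, dense, alpha=0.4))
```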
17. A Review on recent research in information retrieval.
- Author
- Ibrihich, S., Oussous, A., Ibrihich, O., and Esghir, M.
- Subjects
- INFORMATION retrieval, LITERATURE reviews, NATURAL language processing
- Abstract
In this paper, we present a survey of modeling and simulation approaches to describe information retrieval basics. We investigate its methods, its challenges, its models, its components and its applications. Our contribution is twofold: on the one hand, reviewing the literature to discover search techniques that help obtain pertinent results and achieve an effective search, and on the other hand, discussing the different research perspectives for studying and comparing techniques used in information retrieval. This paper also sheds light on some well-known AI applications in the legal field. [ABSTRACT FROM AUTHOR]
- Published
- 2022
18. Retrieval of behavior trees using map-and-reduce technique.
- Author
- Abbas, Safia, Hodhod, Rania, and El-Sheikh, Mohamed
- Subjects
- TREES, TIME management, COGNITIVE structures, SOCIAL interaction
- Abstract
There has been increased interest in the creation of AI social agents who possess complex behaviors that allow them to perform social interactions. Behavior trees provide a plan-execution model that has been widely used to build complex behaviors for AI social agents. Behavior trees can be represented in the form of a memory structure known as cognitive scripts, which allows them to evolve through further development over multiple exposures to repeated enactment of a particular behavior or similar ones. Behavior trees that share the same context will then be able to learn from each other, resulting in new behavior trees with richer experience. The main challenge lies in the expensive cost of retrieving contextually similar behavior trees (scripts) from a repertoire of scripts to allow that learning process to occur. This paper introduces a novel application of the map-and-reduce technique to retrieve cognitive scripts with low computation time and memory allocation. The paper focuses on the design of a corpus of cognitive scripts, as a key knowledge engineering challenge, and the application of map-and-reduce with semantic information to retrieve contextually similar cognitive scripts. The results are compared to other techniques used to retrieve cognitive scripts in the literature, such as Pharaoh, which uses the least common parent (LCP) technique at its core. The results show that the map-and-reduce technique can successfully retrieve cognitive scripts with a high retrieval accuracy of 92.6%, in addition to being cost-effective. [ABSTRACT FROM AUTHOR]
- Published
- 2022
19. A new weakly supervised discrete discriminant hashing for robust data representation.
- Author
- Wan, Minghua, Chen, Xueyu, Zhao, Cairong, Zhan, Tianming, and Yang, Guowei
- Subjects
- INFORMATION retrieval, COMPUTER programming education, MACHINE learning, SUPERVISED learning, INFORMATION processing
- Abstract
In real applications, the label information for much data is inaccurate, or completely reliable labels can be obtained only at high cost. Previous supervised hashing algorithms consider only the label information in the mapping process from Euclidean space to Hamming space when learning hash codes. However, there is no doubt that these algorithms are suboptimal in maintaining the relationships between high-dimensional data spaces. To overcome this problem, this paper advances a new weakly supervised discrete discriminant hashing (WDDH) method to ensure a more effective representation of data and better retrieval of information. First, we consider the nearest-neighbour relationship between samples, and new neighbourhood graphs are constructed to describe the geometric relationship between samples. Second, the algorithm embeds the learning of the hash function into the model and optimises the hash codes by a one-step iterative updating algorithm. Finally, it is compared with existing classical unsupervised and supervised hashing algorithms on different databases. The results and discussion of the experiments clearly show that the proposed WDDH algorithm is more robust for data representation when learning from low-quality label data, coarse-grained label data and noisy data. [ABSTRACT FROM AUTHOR]
- Published
- 2022
20. On the trade-off between ranking effectiveness and fairness.
- Author
- Melucci, Massimo
- Subjects
- INFORMATION storage & retrieval systems, FAIRNESS, SYMMETRIC matrices, AUTHOR-editor relationships, ACCESS to information
- Abstract
This paper addresses the problem of maximizing the effectiveness of the ranking produced by information retrieval or recommender systems while at the same time maximizing two fairnesses, that of the group and that of the individual. The context of this paper is therefore that of access to information carried out by users, who aim to satisfy their own information needs, to documents produced by authors and curators, who aim to be exposed in a fair manner, i.e. without discrimination between groups or individuals. The paper describes a general method based on the spectral decomposition of mixtures of symmetric matrices, each of which represents a variable to be maximized, and experiments conducted with the use of a test collection. The method explains if and how the trade-offs between effectiveness, group fairness and individual fairness manifest themselves. The experimental results show that (a) maintaining an acceptable level of effectiveness and fairness at the same time is feasible and (b) the trade-offs exist, but the order of magnitude of the variations depends on the measure of effectiveness used, and therefore on the user's model of access to information, as well as on the fairness measures, and therefore on how authors or editors should be exposed. • Modern information access systems should balance fairness and effectiveness. • A single eigensystem achieves simultaneous maximization. • Fairness, effectiveness, and access measures are crucial in trade-offs. [ABSTRACT FROM AUTHOR]
- Published
- 2024
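Our reading of the method above in miniature: represent effectiveness, group fairness, and individual fairness as symmetric matrices, mix them with trade-off weights, and take the top eigenvector of the mixture as the joint maximizer. The matrices here are random stand-ins:

```python
# Spectral decomposition of a mixture of symmetric matrices: the top
# eigenvector of the weighted mixture maximizes the weighted sum of the three
# quadratic forms over unit vectors.

import numpy as np

rng = np.random.default_rng(3)
def sym(n):                        # random symmetric stand-in matrix
    M = rng.standard_normal((n, n))
    return (M + M.T) / 2

A_eff, B_group, C_indiv = sym(5), sym(5), sym(5)

def joint_maximizer(alpha, beta, gamma):
    M = alpha * A_eff + beta * B_group + gamma * C_indiv
    w, V = np.linalg.eigh(M)       # eigh: guaranteed real spectrum
    x = V[:, -1]                   # eigenvector of the largest eigenvalue
    return x, (x @ A_eff @ x, x @ B_group @ x, x @ C_indiv @ x)

for weights in [(1, 0, 0), (0.6, 0.2, 0.2)]:
    _, vals = joint_maximizer(*weights)
    print(weights, "-> (effectiveness, group, individual) =", np.round(vals, 3))
```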
21. Secure multi-dimensional data retrieval with access control and range query in the cloud.
- Author
- Mei, Zhuolin, Yu, Jin, Zhang, Caicai, Wu, Bin, Yao, Shimao, Shi, Jiaoli, and Wu, Zongda
- Subjects
- ACCESS control, INFORMATION retrieval, DATA encryption, DATA security
- Abstract
Outsourcing data to the cloud offers various advantages, such as improved reliability, enhanced flexibility, and accelerated deployment. However, data security concerns arise due to potential threats such as malicious attacks and internal misuse of privileges, resulting in data leakage. Data encryption is a recognized solution to address these issues and ensure data confidentiality even in the event of a breach. However, encrypted data presents challenges for common operations like access control and range queries. To address these challenges, this paper proposes Secure Multi-dimensional Data Retrieval with Access Control and Range Search in the Cloud (SMDR). We propose the SMDR policy, which supports both access control and range queries. The design of the SMDR policy cleverly utilizes the minimum and maximum points of buckets, making the policy highly appropriate for supporting range queries on multi-dimensional data. Additionally, we have modified Ciphertext Policy-Attribute Based Encryption (CP-ABE) to enable effective integration with the SMDR policy, and then constructed a secure index using the SMDR policy and CP-ABE. By utilizing the secure index, access control and range queries can be effectively supported over encrypted multi-dimensional data. To evaluate the efficiency of SMDR, extensive experiments have been conducted. The experimental results demonstrate the effectiveness and suitability of SMDR in handling encrypted multi-dimensional data. Additionally, we provide a detailed security analysis of SMDR. [ABSTRACT FROM AUTHOR]
- Published
- 2024
22. Textual tag recommendation with multi-tag topical attention.
- Author
- Xu, Pengyu, Xia, Mingxuan, Xiao, Lin, Liu, Huafeng, Liu, Bing, Jing, Liping, and Yu, Jian
- Subjects
- TAGS (Metadata), INFORMATION retrieval, INFORMATION services, RECOMMENDER systems, USER-generated content, NEUROPROSTHESES
- Abstract
Tagging can be regarded as the action of connecting relevant user-defined keywords to an item, indirectly improving the quality of the information retrieval services that rely on tags as data sources. Tag recommendation dramatically enhances the quality of tags by assisting users in tagging. Although there exist many studies on tag recommendation for textual content, few of them consider two characteristics of real applications, i.e., the long-tail distribution of tags and the topic-tag correlation. In this paper, we propose a Topic-Guided Tag Recommendation (TGTR) model to recommend tags by jointly incorporating dynamic neural topics. Specifically, TGTR first generates dynamic neural topics that indicate the tags via a neural topic generator. Then, a sequence encoder is used to distill indicative features from the post. To effectively leverage the topics and alleviate the data imbalance, we design a multi-tag topical attention mechanism to get a tag-specific post representation for each tag with the help of the dynamic neural topics. These three modules are seamlessly joined together via an end-to-end multi-task learning model, which helps the three parts enhance each other and balances the effects of topics and tags. Extensive experiments have been conducted on four real-world datasets and demonstrate that our model outperforms the state-of-the-art approaches by a large margin, especially on tail-tags. The code, data and hyper-parameter settings are publicly released for reproducibility. [ABSTRACT FROM AUTHOR]
- Published
- 2023
23. Knowledge enhanced edge-driven graph neural ranking for biomedical information retrieval.
- Author
- Liu, Xiaofeng, Tan, Jiajie, and Dong, Shoubin
- Subjects
- INFORMATION retrieval, MEDICAL databases, INFORMATION networks
- Abstract
Neural networks used for information retrieval tend to capture textual matching signals between a query and a document. However, neural ranking models for biomedical information retrieval often struggle to semantically match the query to the documents. The main reasons are that biomedical terms have many different representations and that the factual descriptions related to the query are non-consecutive and non-local in the documents. In this paper, we present an edge-driven graph neural ranking method for biomedical information retrieval that incorporates knowledge from medical databases. First, we propose to form an edge-driven graph by connecting biomedical terms in the query and the document through different types of edges. Then, we design a novel way of knowledge integration to introduce knowledge related to biomedical terms into the graph and construct a knowledge-query-doc graph. Finally, a graph neural ranking model is applied to capture non-local and non-contiguous match signals between the query and the document. Experimental results on biomedical datasets show that our method outperforms advanced neural models. Further analysis shows that the knowledge integration method can substantially reduce the semantic gap between the query and the document, and that our graph model can provide interpretation for the matching between the query and the document. [ABSTRACT FROM AUTHOR]
- Published
- 2025
24. Spectral clustering and query expansion using embeddings on the graph-based extension of the set-based information retrieval model.
- Author
- Kalogeropoulos, Nikitas-Rigas, Kontogiannis, George, and Makris, Christos
- Subjects
- K-nearest neighbor classification, SPARSE graphs, INFORMATION retrieval, CENTROID
- Abstract
This paper presents a straightforward yet novel approach to enhance graph-based information retrieval models, by calibrating the relationships between node terms, leading to better evaluation metrics at the retrieval phase, and by reducing the total size of the graph. This is achieved by integrating spectral clustering, embedding-based graph pruning and term re-weighting. Spectral clustering assigns each term to a specific cluster, allowing us to propose two pruning methods: out-cluster and in-cluster pruning based on node similarities. In-cluster pruning refers to pruning edges between terms within the same cluster, while out-cluster pruning refers to edges that connect different clusters. Both methods utilize spectral embeddings to assess node similarities, resulting in more manageable clusters termed concepts. These concepts are likely to contain semantically similar terms, with each term's concept defined as the centroid of its cluster. We show that this graph pruning strategy significantly enhances the performance and effectiveness of the overall model while reducing the total size of the graph. Moreover, during the retrieval phase, the conceptually calibrated centroids are used to re-weight terms generated by user queries, and the precomputed embeddings enable efficient query expansion through a k-Nearest Neighbors (k-NN) approach, offering substantial enhancement with minimal additional time cost. To the best of our knowledge, this is the first application of spectral clustering and embedding-based conceptualization to prune graph-based IR models. Our approach enhances both retrieval efficiency and performance while enabling effective query expansion with minimal additional computational overhead. The proposed technique is applied across various graph-based information retrieval models, improving evaluation metrics and producing sparser graphs. • Integrates spectral clustering for effective graph pruning in information retrieval. • Proposes efficient query expansion using spectral embedding-based conceptualization. • Enhances retrieval model performance and reduces graph complexity and edge count. • Performs comparisons with prior implementations and state-of-the-art models. [ABSTRACT FROM AUTHOR]
- Published
- 2025
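A hedged sketch of the pruning step described above (ours): spectral clustering assigns term nodes to clusters, then cross-cluster ("out-cluster") edges and weak within-cluster ("in-cluster") edges are removed. The term graph, weights, and thresholds are invented.

```python
# Spectral clustering over a term co-occurrence graph, followed by out-cluster
# and in-cluster pruning.

import numpy as np
from sklearn.cluster import SpectralClustering

terms = ["court", "judge", "verdict", "goal", "match", "player"]
A = np.array([                      # symmetric term co-occurrence weights
    [0, .9, .8, .1, 0, 0],
    [.9, 0, .7, 0, .1, 0],
    [.8, .7, 0, 0, 0, .1],
    [.1, 0, 0, 0, .9, .8],
    [0, .1, 0, .9, 0, .7],
    [0, 0, .1, .8, .7, 0]], float)

labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(A)

pruned = A.copy()
for i in range(len(terms)):
    for j in range(len(terms)):
        if labels[i] != labels[j]:          # out-cluster pruning
            pruned[i, j] = 0.0
        elif 0 < pruned[i, j] < 0.75:       # weak in-cluster edges
            pruned[i, j] = 0.0

print(dict(zip(terms, labels)))
print("edges kept:", int((pruned > 0).sum() // 2), "of", int((A > 0).sum() // 2))
```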
25. Self-supervised incomplete cross-modal hashing retrieval.
- Author
- Peng, Shouyong, Yao, Tao, Li, Ying, Wang, Gang, Wang, Lili, and Yan, Zhiming
- Subjects
- DATA recovery, INFORMATION retrieval, ENCYCLOPEDIAS & dictionaries, ACQUISITION of data, ENTROPY
- Abstract
Benefiting from fast retrieval speed and low storage costs, cross-modal hashing retrieval has become a widely used approximate nearest-neighbor technique in large-scale data retrieval. Most existing cross-modal hashing methods assume that the cross-modal data points are complete. However, cross-modal data completeness is difficult to guarantee in the real world because of unpredictable factors in data collection. Moreover, due to the expensive cost of annotating all data points in large-scale applications, there is growing interest in unsupervised hashing retrieval that can learn the correlations of cross-modal data without ground truth. Therefore, how to perform unsupervised hashing retrieval on incomplete cross-modal data becomes a problem worthy of study. In this paper, we propose a Self-supervised Incomplete Cross-modal Hashing retrieval (SICH) method, which integrates data recovery and hash encoding into a unified framework. Specifically, we first design a self-supervised semantic module to effectively mine the semantic information among pseudo-labels, and then a hash code dictionary is constructed to guide the hash function learning with an asymmetric guidance mechanism. Besides, to take full advantage of the incomplete data points in cross-modal learning, we introduce a data recovery network aimed at recovering missing data by minimizing conditional entropy and maximizing mutual information between different modalities. Extensive experiments on two benchmark datasets verify that our method consistently outperforms state-of-the-art cross-modal hashing methods. • Proposes a novel unsupervised cross-modal hashing retrieval method for incomplete cross-modal data. • Self-supervised semantic mining for refining pseudo-label semantic information. • A data recovery network for recovering missing data. • An asymmetric guidance mechanism transfers self-supervised global knowledge to hash learning. [ABSTRACT FROM AUTHOR]
- Published
- 2025
26. An Overview of Real-World Data Infrastructure for Cancer Research.
- Author
- Price, G., Peek, N., Eleftheriou, I., Spencer, K., Paley, L., Hogenboom, J., van Soest, J., Dekker, A., van Herk, M., and Faivre-Finn, C.
- Subjects
- TUMOR treatment, DATABASES, REGIONAL medical programs, MEDICAL information storage & retrieval systems, FEDERATED learning, DATABASE management, INTERPROFESSIONAL relations, HEALTH, CLINICAL governance, ONCOLOGY, INTERNATIONAL relations, MEDICAL research, INFORMATION retrieval, ACQUISITION of data, ELECTRONIC health records, RESEARCH, HEALTH facilities, MACHINE learning, CASE studies
- Abstract
There is increasing interest in the opportunities offered by Real World Data (RWD) to provide evidence where clinical trial data does not exist, but access to appropriate data sources is frequently cited as a barrier to RWD research. This paper discusses current RWD resources and how they can be accessed for cancer research. There has been significant progress on facilitating RWD access in the last few years across a range of scales, from local hospital research databases, through regional care records and national repositories, to the impact of federated learning approaches on internationally collaborative studies. We use a series of case studies, principally from the UK, to illustrate how RWD can be accessed for research and healthcare improvement at each of these scales. For each example we discuss infrastructure and governance requirements with the aim of encouraging further work in this space that will help to fill evidence gaps in oncology. There are challenges, but real-world data research across a range of scales is already a reality. Taking advantage of the current generation of data sources requires researchers to carefully define their research question and the scale at which it would be best addressed. • There is increasing interest in the opportunities offered by Real World Data (RWD) to provide evidence where clinical trial data does not exist. • Advances in healthcare data infrastructure have improved and real-world data access for research is available across a range of scales. • Local, regional, national, and federated RWD resources have different properties and provide access to cohorts of differing size and detail. • Choice of real-world data research infrastructure should be driven by research question and hypothesis. [ABSTRACT FROM AUTHOR]
- Published
- 2025
27. Intelligible graph contrastive learning with attention-aware for recommendation.
- Author
- Mo, Xian, Zhao, Zihang, He, Xiaoru, Qi, Hang, and Liu, Hao
- Subjects
- DATA augmentation, INFORMATION overload, RANDOM walks, SOURCE code, INFORMATION retrieval
- Abstract
Recommender systems are an important tool for information retrieval and can aid in solving the problem of information overload. Recently, contrastive learning has shown remarkable performance in recommendation by using data augmentation processes to address highly sparse data. Our paper proposes an Intelligible Graph Contrastive Learning with attention-aware (IntGCL) model for recommendation. In particular, IntGCL first introduces a novel attention-aware matrix into graph convolutional networks (GCN) to identify the importance between users and items; the matrix is constructed to preserve this importance via a random-walk-with-restart strategy and enhances the intelligibility of our model. Then, the attention-aware matrix is further utilised to guide the generation of an attention-aware graph-generative model and a graph-denoising model that automatically generate two trainable contrastive views for data augmentation, which can de-noise the data and further enhance intelligibility. Comprehensive experiments on four real-world datasets indicate the superiority of our IntGCL approach over multiple state-of-the-art methods. Our datasets and source code are available at https://github.com/restarthxr/InpGCL. [ABSTRACT FROM AUTHOR]
- Published
- 2025
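The abstract credits a random walk with restart (RWR) strategy for the attention-aware matrix. A small sketch of computing RWR importance scores on a toy user-item graph (ours, not the IntGCL code):

```python
# Random walk with restart: entry (u, i) of R is the stationary probability of
# reaching node i from node u when the walk restarts at u with probability c.
# Such scores can weight user-item edges by importance.

import numpy as np

A = np.array([          # toy user-item bipartite adjacency (3 users, 2 items)
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0]], float)

P = A / A.sum(axis=1, keepdims=True)        # row-stochastic transition matrix
c = 0.15                                     # restart probability

# Closed form: row u of R solves r = (1 - c) * r @ P + c * e_u.
R = c * np.linalg.inv(np.eye(len(A)) - (1 - c) * P)
attention = R[:3, 3:]                        # user-to-item importance block
print(attention.round(3))
```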
28. A goal-oriented document-grounded dialogue based on evidence generation.
- Author
- Song, Yong, Fan, Hongjie, Liu, Junfei, Liu, Yunxin, Ye, Xiaozhou, and Ouyang, Ye
- Subjects
- LANGUAGE models, INFORMATION retrieval, HALLUCINATIONS (Artificial intelligence), GOAL (Psychology), RECORDS management
- Abstract
Goal-oriented Document-grounded Dialogue (DGD) is used for retrieving specific domain documents, assisting users in document content retrieval, question answering, and document management. Existing methods typically employ keyword extraction and vector space models to understand the content of documents, identify the intent of questions, and generate answers based on the capabilities of generation models. However, challenges remain in semantic understanding, long-text processing, and context understanding. The emergence of Large Language Models (LLMs) has brought new capabilities in context learning and step-by-step reasoning. These models, combined with Retrieval Augmented Generation (RAG) methods, have made significant breakthroughs in text comprehension, intent detection, and language organization, offering exciting prospects for DGD research. However, the "hallucination" issue arising from LLMs requires complementary methods to ensure the credibility of their outputs. In this paper, we propose a goal-oriented document-grounded dialogue approach based on evidence generation using LLMs. It designs and implements methods for document content retrieval & reranking, fine-tuning and inference, and evidence generation. In experiments comparing against methods that combine LLMs with a vector space model or with a key-information matching technique, the accuracy of the proposed method improves by 21.91% and 12.81%, comprehensiveness increases by 10.89% and 69.83%, coherence is enhanced by 38.98% and 53.27%, and completeness is boosted by 16.13% and 36.97%, respectively, on average. Additionally, an ablation analysis reveals that the evidence generation method also contributes significantly to comprehensiveness and completeness. [ABSTRACT FROM AUTHOR]
- Published
- 2025
29. Multi-relational graph contrastive learning with learnable graph augmentation.
- Author
- Mo, Xian, Pang, Jun, Wan, Binyuan, Tang, Rui, Liu, Hao, and Jiang, Shuyu
- Subjects
- REPRESENTATIONS of graphs, KNOWLEDGE graphs, DATA augmentation, SOURCE code, INFORMATION retrieval
- Abstract
Multi-relational graph learning aims to embed the entities and relations of knowledge graphs into low-dimensional representations and has been successfully applied to various multi-relationship prediction tasks, such as information retrieval and question answering. Recently, contrastive learning has shown remarkable performance in multi-relational graph learning by using data augmentation mechanisms to deal with highly sparse data. In this paper, we present a Multi-Relational Graph Contrastive Learning architecture (MRGCL) for multi-relational graph learning. More specifically, MRGCL first proposes Multi-relational Graph Hierarchical Attention Networks (MGHAN) to identify the importance between entities, which can learn the importance at different levels between entities for extracting the local graph dependency. Then, two graph augmented views with adaptive topology are automatically learned by the variant MGHAN, which can automatically adapt to different multi-relational graph datasets from diverse domains. Moreover, a subgraph contrastive loss is designed, which generates positives per anchor by calculating strongly connected subgraph embeddings of the anchor as the supervised signals. Comprehensive experiments on multi-relational datasets from three application domains indicate the superiority of our MRGCL over various state-of-the-art methods. Our datasets and source code are published at https://github.com/Legendary-L/MRGCL. [ABSTRACT FROM AUTHOR]
- Published
- 2025
30. Enabling coastal analytics at planetary scale.
- Author
- Calkoen, Floris Reinier, Luijendijk, Arjen Pieter, Vos, Kilian, Kras, Etiënne, and Baart, Fedor
- Subjects
- COASTAL changes, COASTAL mapping, CLOUD computing, COASTS, INFORMATION retrieval
- Abstract
Coastal science has entered a new era of data-driven research, facilitated by satellite data and cloud computing. Despite its potential, the coastal community has yet to fully capitalize on these advancements due to a lack of tailored data, tools, and models. This paper demonstrates how cloud technology can advance coastal analytics at scale. We introduce GCTS, a novel foundational dataset comprising over 11 million coastal transects at 100-m resolution. Our experiments highlight the importance of cloud-optimized data formats, geospatial sorting, and metadata-driven data retrieval. By leveraging cloud technology, we achieve up to 700 times faster performance for tasks like coastal waterline mapping. A case study reveals that 33% of the world's first kilometer of coast is below 5 m, with the entire analysis completed in a few hours. Our findings make a compelling case for the coastal community to start producing data, tools, and models suitable for scalable coastal analytics. • Open, scalable workflows can be up to 700 times faster than traditional approaches. • A novel 100 m Global Coastal Transect System for robust coastal analytics. • Analysis shows that 33% of the world's first kilometer of coast lies below 5 m. • High-speed cloud-native coastal waterline mapping at 50 kilometers per second. • Novel methods for data management that enable coastal analytics at scale. • PoCs demonstrate how cloud technology can advance coastal science at scale. [ABSTRACT FROM AUTHOR]
- Published
- 2025
31. Danish and Swedish National Data Collections for Cancer – Solutions for Radiotherapy.
- Author
- Olsson, C.E., Krogh, S.L., Karlsson, M., Eriksen, J.G., Björk-Eriksson, T., Grau, C., Norman, D., Offersen, B.V., Nyholm, T., Overgaard, J., Zackrisson, B., and Hansen, C.R.
- Subjects
- TUMOR diagnosis, REPORTING of diseases, DESCRIPTIVE statistics, ACQUISITION of data, INFORMATION retrieval, TUMORS, EVALUATION
- Abstract
Collecting large amounts of radiotherapy (RT) data from clinical systems is known to be a challenging task. Still, data collections outside the original RT systems are needed to follow up on the quality of cancer care and to improve RT. This paper aims to describe how RT data is collected nationally in Denmark and Sweden for this purpose and gives an overview of the stored information in both countries' national data sources. Although both countries have clinical national quality registries with broad coverage and completeness for many cancer diagnoses, some were initiated as early as the 1970s, and less than one in ten includes quantitative information on RT to a level of detail useful for more than basic descriptive statistics. Detailed RT data can, however, be found in Denmark's DICOM Collaboration (DcmCollab) database, initiated in 2009, and in Sweden's quality registry for RT (SKvaRT), launched in 2023. Denmark has collected raw DICOM data for all patients enrolled in clinical trials, with files being directly and automatically transferred to DcmCollab from the original data sources at each RT centre. Sweden collects aggregated RT data into SKvaRT for all patients undergoing RT in Sweden, with DICOM files being transferred and selected alpha-numeric variables forwarded via a local intermediate storage database (MIQA) at each hospital. In designing their respective solutions, both countries have faced similar challenges regarding which RT variables to collect and how to technically link clinical systems to their data repositories. General lessons about how flexibility is currently balanced with storage requirements and data standards are presented here, together with future plans to harvest real-world RT data. • DcmCollab in Denmark and SKvaRT in Sweden are national radiotherapy data collections. • The Danish system favours complete and flexible datasets for research purposes. • The Swedish system favours streamlined datasets focusing on quality improvements. • Each solution stores detailed and standardised radiotherapy data outside the clinical systems. • General lessons on proactive data management in radiotherapy are given by both solutions. [ABSTRACT FROM AUTHOR]
- Published
- 2025
32. AI in the spotlight: The impact of artificial intelligence disclosure on user engagement in short-form videos.
- Author
- Chen, Hao, Wang, Pingping, and Hao, Shuaikang
- Subjects
- SOCIAL media, CONCEPTUAL models, ARTIFICIAL intelligence, STRUCTURAL equation modeling, PROBLEM solving, DESCRIPTIVE statistics, EXPERIMENTAL design, PSYCHOLOGY, INTENTION, WEB development, INFORMATION retrieval, QUALITY assurance, PATIENT participation, VIDEO recording, USER interfaces
- Abstract
This paper aims to investigate how AI disclosure affects user engagement intention. We comprehensively explore the fundamental mechanisms behind AI disclosure and user engagement, along with the boundary conditions that affect the relationship. Based on the heuristic-systematic model, this study built a moderated mediation model and conducted an online experiment on the "Credamo" platform. The 479 valid experimental responses were analyzed with partial least squares structural equation modeling (PLS-SEM). The results indicate that AI disclosure has a direct, positive impact on user engagement intention, while it can also diminish user engagement intention by lowering users' perceived content quality. However, the negative impact on perceived content quality can be mitigated by improving users' perceived AI capabilities. This study expands the research focus of AI disclosure and the practical application of the heuristic-systematic model, providing theoretical insights for the artificial intelligence literature. In addition, we put forward informed practical recommendations for video content creators and publishers on creating content and promoting user interaction. • This study applies the heuristic-systematic model to investigate how AI disclosure affects user engagement intention. • The mediation effect of perceived content quality and the moderation effect of perceived AI capability are examined. • Provides empirical evidence for video content creators and platforms to create content and promote user interaction. [ABSTRACT FROM AUTHOR]
- Published
- 2025
33. A secure and lightweight data management scheme based on redactable blockchain for Digital Copyright.
- Author
- Zhuang, Chuxin, Dai, Qingyun, and Zhang, Yue
- Subjects
- DATABASE management, SYSTEM analysis, INFORMATION retrieval, DATA management, ALGORITHMS
- Abstract
• We propose a secure and lightweight blockchain-based data management scheme for Digital Copyright. • We propose a transaction control mechanism based on ECDSA, ensuring the authenticity of the information stored on the blockchain through digital signatures. • We design a chameleon hash algorithm-based lightweight data management scheme. • System analysis and experimental results confirm that our proposed scheme addresses a single point of failure and reduces full-node storage overhead and the computation overhead of information retrieval and traceability. Traditional Digital Copyright (DC) management systems face a single point of failure and lack strict traceability. Meanwhile, current blockchain-based DC schemes give little consideration to the authenticity of DC information stored on the blockchain. Additionally, the full-node storage overhead and the computation overhead of information retrieval and traceability increase significantly with the number of blocks. Therefore, in this paper, we propose a secure and lightweight data management scheme based on a redactable blockchain for DC. Users generate their public and private keys, which are used to produce legitimate signatures. We then propose a transaction control mechanism based on ECDSA, meaning that DC information, including registration and transaction information, can be stored only by providing a legitimate and verifiable signature. Furthermore, we adopt the blockchain to record DC information and the chameleon hash algorithm to modify DC information stored on the blockchain when making DC transactions, while keeping the block headers unchanged. System analysis and experimental results confirm that our scheme can address a single point of failure and ensure the authenticity of the information. Meanwhile, our scheme effectively reduces full-node storage overhead and the computation overhead of information retrieval and traceability. [ABSTRACT FROM AUTHOR]
- Published
- 2025
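The chameleon hash at the heart of the scheme above can be illustrated with toy parameters (ours; real deployments use cryptographic group sizes, not p = 23). The trapdoor holder can rewrite a message while preserving the hash, which is what keeps block headers unchanged:

```python
# Krawczyk-Rabin-style chameleon hash with toy parameters: anyone can verify
# H(m, r) = g^m * h^r mod p, but the trapdoor holder can rewrite m to m_new
# while keeping the hash value, and hence the block header, unchanged.

p, q, g = 23, 11, 4          # g generates the order-q subgroup of Z_p*
x = 7                        # trapdoor key
h = pow(g, x, p)             # public key

def chash(m, r):
    return (pow(g, m % q, p) * pow(h, r % q, p)) % p

def collide(m, r, m_new):
    """Trapdoor holder finds r_new with chash(m_new, r_new) == chash(m, r),
    since m + x*r must stay constant modulo q."""
    return (r + (m - m_new) * pow(x, -1, q)) % q

m, r = 9, 5                  # original transaction digest and randomness
m_new = 3                    # redacted transaction digest
r_new = collide(m, r, m_new)
assert chash(m, r) == chash(m_new, r_new)
print("hash preserved:", chash(m, r), "r' =", r_new)
```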
34. A neural harmonic-aware network with gated attentive fusion for singing melody extraction.
- Author
- Yu, Shuai, Yu, Yi, Sun, Xiaoheng, and Li, Wei
- Subjects
- MELODY, CONVOLUTIONAL neural networks, SINGING, INFORMATION retrieval
- Abstract
Singing melody extraction from polyphonic musical audio is one of the most challenging tasks in music information retrieval (MIR). Recently, data-driven methods based on convolutional neural networks (CNNs) have achieved great success for this task. In the literature, the harmonic relationship has been proven crucial for this task. However, few existing CNN-based singing melody extraction methods consider the harmonic relationship in the training stage. The state-of-the-art CNN-based methods are not capable of capturing such long-range harmonic dependencies due to limited receptive fields and unacceptable computation costs. In this paper, we introduce a neural harmonic-aware network with gated attentive fusion (NHAN-GAF) for singing melody extraction. Specifically, in the 2-D spectrogram modeling branch, we propose to employ multiple parallel 1-D CNN kernels to capture the harmonic relations between 1–2 octaves along the frequency axis of the spectrogram. Considering the advantage of jointly using time-frequency (T-F) domain and time domain information, we use two-branch neural nets to learn discriminative representations for this task. A novel gated attentive fusion (GAF) network is suggested to encode potential correlations between the two branches and fuse the descriptors learned from the raw waveform and T-F spectrograms. Moreover, the idea of GAF can be extended to multimedia applications with multimodal analysis. With the two proposed components, our model is capable of learning the harmonic relationship in the spectrogram and better capturing contextual but discriminative features for singing melody extraction. We use part of the vocal tracks of the RWC and MIR-1K datasets to train the model and evaluate its performance on the ADC 2004, MIREX 05 and MedleyDB datasets. The experimental results show that the proposed method outperforms the state-of-the-art ones. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
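The harmonic-aware branch above relies on relating spectrogram bins that sit octaves apart. A minimal PyTorch sketch of parallel 1-D kernels along the frequency axis; the dilation choices and bins-per-octave value are assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class HarmonicConv(nn.Module):
    """Parallel 1-D convolutions along the frequency axis with different
    dilations, so each branch can relate bins roughly 1-2 octaves apart
    (assuming a log-frequency spectrogram, e.g. 60 bins per octave)."""
    def __init__(self, in_ch=1, out_ch=8, bins_per_octave=60):
        super().__init__()
        # Hypothetical dilation choices; the paper's exact kernels may differ.
        dilations = [bins_per_octave // 2, bins_per_octave, 2 * bins_per_octave]
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=(3, 1),
                      dilation=(d, 1), padding=(d, 0))
            for d in dilations
        ])

    def forward(self, spec):                   # spec: (B, 1, F, T)
        return torch.cat([b(spec) for b in self.branches], dim=1)

x = torch.randn(2, 1, 360, 128)                # 6 octaves x 60 bins, 128 frames
print(HarmonicConv()(x).shape)                 # -> torch.Size([2, 24, 360, 128])
```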
35. A semi-automatic data integration process of heterogeneous databases.
- Author
-
Barbella, Marcello and Tortora, Genoveffa
- Subjects
- *
DATA integration , *ELECTRONIC data processing , *INFORMATION retrieval , *DATABASES , *CONTENT analysis - Abstract
• Data Integration of two or more heterogeneous databases. • Syntactic and semantic analysis of textual data. • Semi-automatic process. [Display omitted] One of the most difficult issues today is the integration of data from various sources, which gives rise to the need for automatic Data Integration (DI) methods. The literature offers fully automatic and semi-automatic DI techniques, but they require the involvement of IT experts with specific domain skills. In this paper we present a novel DI methodology that does not require IT experts: syntactically and semantically similar entities in the sources are merged by exploiting an information retrieval technique, a clustering method, and a trained neural network. Although the suggested process is fully automated, we planned some interactions with the Company Manager, a figure who is not required to have IT skills and whose only contribution is to define limits and tolerance thresholds during the DI process, based on the interests of the company. The proposed approach achieved an integration accuracy between 99% and 100%. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
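One plausible reading of the "information retrieval technique" for matching syntactically similar entities is character n-gram TF-IDF with a manager-set tolerance threshold. A minimal sketch with hypothetical attribute labels (the paper's pipeline additionally involves clustering and a trained neural network):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical attribute labels from two heterogeneous databases.
db_a = ["customer name", "customer address", "invoice total"]
db_b = ["client full name", "billing address", "order amount", "vat code"]

# Character n-grams tolerate spelling variants across schemas.
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
m = vec.fit_transform(db_a + db_b)
sim = cosine_similarity(m[:len(db_a)], m[len(db_a):])

threshold = 0.3   # the "tolerance threshold" a Company Manager might set
for i, row in enumerate(sim):
    j = row.argmax()
    if row[j] >= threshold:
        print(f"merge candidate: {db_a[i]!r} <-> {db_b[j]!r} ({row[j]:.2f})")
```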
36. An efficacy analysis of data encryption architecture for cloud platform.
- Author
-
Malhotra, Sheenam and Singh, Williamjeet
- Subjects
DATA encryption ,INFORMATION storage & retrieval systems ,CLOUD computing ,CLOUD storage ,DATA analysis ,INFORMATION retrieval - Abstract
In recent times, cloud computing has been widely utilized for storage and information-sharing purposes in several established commercial segments, particularly those where online businesses are prevalent, such as Google, Amazon, etc. Cloud systems offer users easy operation and low implementation and maintenance expenses. However, significant risks are encountered in the data security procedures of cloud systems. Although the area is frequently analyzed and reformed, cloud data protection and user reliability remain uncertain due to growing cyber-attack schemes as well as cloud storage system errors. To deal with this risk and contribute to the endeavor of providing optimal data security solutions for cloud data storage and retrieval, this paper proposes a Symmetric Searchable Encryption-influenced, Machine Learning-based cloud data encryption and retrieval model. The proposed model enhances data security and employs an effective keyword ranking approach using an Artificial Neural Network. A comparative assessment against multiclass SVM and Naïve Bayes establishes the better operational potential of the model. The effectiveness of the proposed work is justified by the association between a high TPR and a low FPR. Further, a low CCR of 0.6973 adds to the success of the proposed work. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
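A minimal sketch of the symmetric searchable encryption idea the model builds on: the client indexes documents under keyed HMAC "trapdoors", so the server can match keywords without ever seeing them in plaintext. The ANN-based keyword ranking from the abstract is not shown, and all names here are illustrative:

```python
import hmac
import hashlib
import secrets
from collections import defaultdict

k = secrets.token_bytes(32)               # shared symmetric key

def trapdoor(keyword: str) -> bytes:
    # Deterministic keyed tag: the server never sees the plaintext keyword.
    return hmac.new(k, keyword.lower().encode(), hashlib.sha256).digest()

# Build the encrypted index client-side, then hand it to the cloud server.
index = defaultdict(list)
docs = {1: "cloud storage audit", 2: "storage encryption keys", 3: "neural ranking"}
for doc_id, text in docs.items():
    for w in set(text.split()):
        index[trapdoor(w)].append(doc_id)

# Search: the client sends only the trapdoor; the server returns matching ids.
print(sorted(index[trapdoor("storage")]))   # -> [1, 2]
```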
37. A fast local citation recommendation algorithm scalable to multi-topics.
- Author
-
Yin, Maxwell J., Wang, Boyu, and Ling, Charles
- Subjects
- *
NATURAL language processing , *VECTOR spaces , *ALGORITHMS , *RESEARCH personnel - Abstract
In the era of rapid paper publication across various venues, automatic citation recommendation is highly useful to researchers writing papers. Local citation recommendation aims to recommend papers to cite given local citation contexts. Previous work mainly computes the similarity score between citation contexts and cited papers on a one-to-one basis, which is quite time-consuming. We train a pair of neural network encoders that map citation contexts and all possible cited papers to the same vector space, and then index the positions of all cited papers in that space. This makes searching for recommended papers considerably faster. On the other hand, existing methods tend to recommend papers that are highly similar to each other, so recommendations lack diversity. We therefore extend our algorithm to perform multi-topic recommendations, generating multi-topic training examples based on the aforementioned index. Furthermore, we design a multi-group contrastive learning method to train our model so that it can distinguish different topics. Empirical experiments show that our model outperforms previous methods by a wide margin. Our model is also lightweight and has been deployed online, so researchers can obtain recommended citations for their own papers in real time. • Proposed FLCR algorithm with sentence-transformer & k-d tree. • Introduced multi-topic citation recommendation for diverse contexts. • Developed large-scale dataset with 1.7 million citation contexts for evaluation. • Demonstrated significant performance improvement over previous methods. • Deployed demo for real-time citation recommendations. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
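A minimal sketch of the retrieve-by-index idea above: once citation contexts and papers are mapped into one vector space, a k-d tree answers nearest-neighbor queries without one-to-one scoring. The encoder below is a deterministic random stand-in for the trained neural encoders:

```python
import numpy as np
from scipy.spatial import cKDTree

# Stand-in for a trained encoder: maps text to a shared unit-norm vector space.
def encode(text: str) -> np.ndarray:       # hypothetical encoder
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

papers = [f"paper-{i}" for i in range(1000)]
paper_vecs = np.stack([encode(t) for t in papers])

tree = cKDTree(paper_vecs)                  # index once, query many times

def recommend(context: str, k: int = 5):
    _, idx = tree.query(encode(context), k=k)
    return [papers[i] for i in idx]

print(recommend("graph neural networks for citation contexts"))
```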
38. Multi-granularity retrieval of mineral resource geological reports based on multi-feature association.
- Author
-
Ma, Kai, Deng, Junyuan, Tian, Miao, Tao, Liufeng, Liu, Junjie, Xie, Zhong, Huang, Hua, and Qiu, Qinjun
- Subjects
- *
MINES & mineral resources , *APRIORI algorithm , *GEOLOGICAL research , *INFORMATION retrieval , *GEOLOGICAL maps , *NATURAL language processing - Abstract
[Display omitted] • Multi-granularity information association approaches are used to uncover multi-level geological knowledge and support accurate retrieval of mineral resource geological reports. • Extracting multiple geological features (topic, time, space, figure, and table) based on the multi-granularity of geological reports. • Mining potential associations among multiple features with an improved Apriori algorithm. Massive geologic reports contain all kinds of multimodal geologic data (geologic text, geologic maps, geologic tables, etc.), which embody rich basic geological knowledge and expert experience concerning rocks and minerals, stratigraphic structure, geologic age, geographic location, and so on. Accurate retrieval of specific information from massive geologic data has become an important need for geologic information retrieval. However, the majority of existing research revolves around extracting and associating information at a single granularity to facilitate geological semantic retrieval, which ignores many potential semantic associations and leads to ambiguity and fuzziness in semantic retrieval. To solve these problems, this paper proposes a multi-granularity (document-chapter-paragraph) geological information retrieval framework for accurate semantic retrieval. The framework first extracts topic features, spatiotemporal features, and figure and table features based on the multi-granularity of geological reports. Then, an improved Apriori algorithm is used to mine and visualize the associations among the features, discovering the semantic associations of the geological reports at multiple levels of granularity. Finally, experiments are designed to validate the application of the proposed framework to the accurate retrieval of geological reports. The experimental results show that the proposed multi-granularity information retrieval framework can dig deeper into underlying geo-semantic information and realize accurate retrieval. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
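A minimal Apriori sketch over hypothetical per-paragraph feature "transactions" (topic, time, space, figure tags), showing how frequent co-occurrences among multi-granularity features could be mined; the paper's improved variant is not reproduced:

```python
from itertools import combinations

# Hypothetical feature tags attached to paragraphs of geological reports.
transactions = [
    {"topic:gold", "space:Tianshan", "figure:geo-map"},
    {"topic:gold", "space:Tianshan", "time:Jurassic"},
    {"topic:gold", "figure:geo-map", "time:Jurassic"},
    {"topic:copper", "space:Tianshan"},
]
min_support = 0.5

def frequent_itemsets(transactions, min_support, max_size=3):
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    frequent, current = {}, [frozenset([i]) for i in items]
    for size in range(1, max_size + 1):
        counts = {c: sum(c <= t for t in transactions) for c in current}
        level = {c: k / n for c, k in counts.items() if k / n >= min_support}
        frequent.update(level)
        # Candidate generation: join frequent k-sets into (k+1)-sets.
        keys = list(level)
        current = list({a | b for a, b in combinations(keys, 2)
                        if len(a | b) == size + 1})
        if not current:
            break
    return frequent

for itemset, sup in frequent_itemsets(transactions, min_support).items():
    print(sorted(itemset), sup)
```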
39. Evaluating semantic similarity and relatedness between concepts by combining taxonomic and non-taxonomic semantic features of WordNet and Wikipedia.
- Author
-
Hussain, Muhammad Jawad, Bai, Heming, Wasti, Shahbaz Hassan, Huang, Guangjian, and Jiang, Yuncheng
- Subjects
- *
HYPERLINKS , *ARTIFICIAL intelligence , *INFORMATION retrieval , *COGNITIVE science - Abstract
Many applications in cognitive science and artificial intelligence utilize semantic similarity and relatedness to solve difficult tasks such as information retrieval, word sense disambiguation, and text classification. Several approaches for evaluating concept similarity and relatedness based on WordNet or Wikipedia have been proposed. WordNet-based methods rely on highly precise knowledge but have limited lexical coverage, whereas Wikipedia-based models achieve more coverage but sacrifice knowledge quality. Therefore, in this paper, we focus on developing a comprehensive semantic similarity and relatedness method based on both WordNet and Wikipedia. To improve the accuracy of existing measures, we combine various taxonomic and non-taxonomic features of WordNet, including gloss, lemmas, examples, sister terms, derivations, holonyms/meronyms, and hypernyms/hyponyms, with Wikipedia gloss and hyperlinks, to describe concepts. We present a novel technique for extracting 'is-a' and 'part-whole' relationships between concepts using the Wikipedia link structure. The suggested technique identifies taxonomic and non-taxonomic relationships between concepts and offers dense vector representations of concepts. To fully exploit the semantic attributes of WordNet and Wikipedia, the proposed method integrates their semantic knowledge at the feature level, combining semantic similarity and relatedness into a single comprehensive measure. The experimental results demonstrate the effectiveness of the proposed method over state-of-the-art measures on various gold-standard benchmarks. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
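A minimal sketch of feature-level integration in the spirit of the abstract: pool WordNet glosses, lemmas, and taxonomic neighbors into a feature bag, union it with a Wikipedia gloss, and score overlap. The paper builds dense vector representations; plain Jaccard overlap is used here only to keep the sketch short, and it assumes the WordNet corpus has been fetched once via nltk:

```python
from nltk.corpus import wordnet as wn
# One-time setup: import nltk; nltk.download("wordnet")

def wordnet_features(word):
    """Bag of taxonomic and non-taxonomic WordNet features for a word."""
    feats = set()
    for s in wn.synsets(word):
        feats.update(s.definition().lower().split())          # gloss
        feats.update(l.lower() for l in s.lemma_names())      # lemmas
        for rel in (s.hypernyms() + s.hyponyms() + s.part_meronyms()):
            feats.update(l.lower() for l in rel.lemma_names())
    return feats

def relatedness(w1, w2, wiki_gloss1="", wiki_gloss2=""):
    # Feature-level integration: union WordNet features with a Wikipedia
    # gloss (supplied here as plain strings, a stand-in for the real pipeline).
    f1 = wordnet_features(w1) | set(wiki_gloss1.lower().split())
    f2 = wordnet_features(w2) | set(wiki_gloss2.lower().split())
    return len(f1 & f2) / len(f1 | f2)          # Jaccard overlap

print(relatedness("car", "automobile"), relatedness("car", "banana"))
```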
40. Perf-Use-Sport study: Consummation of performance enhancing substances among athletes consulting in primary cares centers of Herault.
- Author
-
Jeannou, B., Feuvrier, F., Peyre-Costa, D., and Griffiths, K.
- Subjects
- *
PRIMARY care , *GENERAL practitioners , *MEDICAL practice , *MEDICAL centers , *INFORMATION retrieval - Abstract
To our knowledge, data about supplementation among recreational athletes are very limited in France. The health of these athletes is mostly under the supervision of their general practitioner; if these athletes are shown to use substances, their general practitioners will need effective ways to protect their health. The main objective of this study is to estimate the prevalence of performance-enhancing substance use among athletes consulting in primary care centers. Further goals are to collect information about users' motivations, the persons advising consumers, and the places where substances are purchased. All adult athletes consulting in general medical practice between the 24th of August 2020 and the 6th of November 2020 were invited to answer an anonymous questionnaire. A total of 40 randomized physicians of Herault participated in the study. We placed posters and flyers in waiting rooms to provide information and allow athletes to complete the online or paper version of the questionnaire. A total of 281 athletes answered the questionnaire, 54.5% male (n = 153) and 45.5% female (n = 128), with an average age of 39.7 years (± 14.9). Over 96% were recreational athletes (n = 272), and 59.9% of them reported using at least one substance. About 66.9% of consumers reported using dietary supplements, 67.5% medications, and 13.6% illicit drugs or anabolic agents. Motivations and places of purchase depend on the substance. Nearly 60% of athletes consulting in primary care centers report using performance-enhancing substances. It seems important that physicians be aware of this in order to help athletes protect their health. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
41. Improving software maintenance with improved bug triaging.
- Author
-
Gupta, Chetna, Inácio, Pedro R.M., and Freire, Mário M
- Subjects
FUZZY decision making ,SOFTWARE maintenance ,MATHEMATICAL optimization ,JUDGMENT (Psychology) ,TOPSIS method ,LEGAL judgments - Abstract
Bug triaging is a critical and time-consuming activity of software maintenance. This paper presents an automated heuristic approach combined with fuzzy multi-criteria decision-making for bug triaging. To date, studies lack consideration of multi-criteria inputs to gather decisive and explicit knowledge of bug reports. The proposed approach builds a bug priority queue using the multi-criteria fuzzy Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) and combines it with the Bacterial Foraging Optimization Algorithm (BFOA) and Bar Systems (BAR) optimization to select developers. A relative threshold value is computed, and developers are categorized using hybrid optimization techniques to distinguish between active, inactive, and new developers for bug allocation. The precision, recall, F-measure, and accuracy obtained are 92.05%, 89.21%, 85.09%, and 93.11%, respectively, indicating an increased overall accuracy of 90%±2% compared with existing approaches. Overall, it is a novel solution for improving the bug assignment process that utilizes the intuitive judgment of triagers through fuzzy multi-criteria decision-making and distinguishes between active, inactive, and new developers based on their relative workload categorization. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
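A minimal crisp TOPSIS sketch showing how a bug priority queue falls out of a multi-criteria decision matrix; the criteria, weights, and scores are hypothetical, and the paper's fuzzy variant plus the BFOA/BAR developer-selection steps are not shown:

```python
import numpy as np

# Hypothetical bug reports scored on three benefit-type criteria
# (severity, user impact, report age).
X = np.array([[7, 9, 3],
              [4, 6, 8],
              [9, 5, 5]], dtype=float)
w = np.array([0.5, 0.3, 0.2])          # criteria weights

V = w * X / np.linalg.norm(X, axis=0)  # normalized, weighted decision matrix
ideal, anti = V.max(axis=0), V.min(axis=0)
d_best = np.linalg.norm(V - ideal, axis=1)    # distance to ideal solution
d_worst = np.linalg.norm(V - anti, axis=1)    # distance to anti-ideal
closeness = d_worst / (d_best + d_worst)

priority_queue = np.argsort(-closeness)       # bug indices, highest first
print(priority_queue, closeness.round(3))
```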
42. Codebook-softened product quantization for high accuracy approximate nearest neighbor search.
- Author
-
Fan, Jingya, Pan, Zhibin, Wang, Liangzhuang, and Wang, Yang
- Subjects
- *
PATTERN recognition systems , *INFORMATION retrieval - Abstract
Product quantization (PQ) is a fundamental technique for approximate nearest neighbor (ANN) search in applications such as information retrieval, computer vision and pattern recognition. In existing PQ-based methods for ANN search, the best reachable search accuracy is achieved with fixed codebooks, so search performance is limited by the quality of these hard codebooks. Unlike existing methods, in this paper we present a novel codebook-softened product quantization (CSPQ) method that achieves more quantization levels by softening the codebooks. We first analyze how well the database vectors match the trained codebooks by examining the quantization error of each database vector, and select the bad-matching database vectors. Then, we give the trained codebooks b-bit freedom to soften them. Finally, by minimizing quantization errors, the bad-matching vectors are encoded with softened codebooks and the labels of the best-matching codebooks are recorded. Experimental results on SIFT1M, GIST1M and SIFT10M show that, despite its simplicity, our proposed method achieves higher accuracy than PQ and can be combined with other non-exhaustive frameworks to achieve fast search. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
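A minimal product-quantization sketch of the step the abstract builds on: train per-subspace codebooks, measure each vector's quantization error, and flag the bad-matching vectors that CSPQ would re-encode with softened codebooks (the softening itself is not shown; sizes are toy values):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 32)).astype(np.float32)
M, K, D = 4, 16, 8                             # 4 subspaces of 8 dims, 16 centroids
sub = np.split(X, M, axis=1)

# One codebook per subspace, as in plain PQ.
codebooks = [KMeans(n_clusters=K, n_init=4, random_state=0).fit(s) for s in sub]

def quantization_error(x):
    """Sum over subspaces of the distance to the nearest centroid."""
    err = 0.0
    for m in range(M):
        part = x[m * D:(m + 1) * D].reshape(1, -1)
        err += codebooks[m].transform(part).min()
    return err

errors = np.array([quantization_error(x) for x in X[:200]])
bad = np.argsort(-errors)[:20]     # worst-matching vectors: candidates for
print("bad-matching vector ids:", bad[:5])     # re-encoding with softened codebooks
```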
43. Short text classification for Arabic social media tweets.
- Author
-
Alzanin, Samah M., Azmi, Aqil M., and Aboalsamh, Hatim A.
- Subjects
MICROBLOGS ,DEEP learning ,FOLKSONOMIES ,SOCIAL media ,USER-generated content ,RADIAL basis functions ,INFORMATION retrieval ,SUPPORT vector machines - Abstract
With the rapid growth in the number of tweets published daily on Twitter, automated classification of tweets becomes necessary for a broad range of applications (e.g., information retrieval, topic labeling, sentiment analysis, rumor detection) to better understand what these tweets are and what users are expressing on this social platform. Text classification is the process of assigning one or more pre-defined categories to text according to its content. Tweets are short, and short text does not carry enough contextual information, which is part of the challenge in classifying them. Adding to the challenge is the increased ambiguity, since diacritical marking is not explicitly specified in most Modern Standard Arabic (MSA) texts, and Arabic tweets are known to mix MSA with dialectal Arabic. In this paper, we propose a scheme to classify textual tweets in the Arabic language into five categories based on their linguistic characteristics and content. We explore two textual representations: word embedding using Word2vec, and stemmed text with term frequency-inverse document frequency (tf-idf). We tested three classifiers: Support Vector Machine (SVM), Gaussian Naive Bayes (GNB), and Random Forest (RF), all with tuned hyperparameters. We collected and manually annotated a dataset of approximately 35,600 Arabic tweets for the experiments. Statistically, the RF and the SVM with radial basis function (RBF) kernel performed equally well when used with stemming and tf-idf, achieving macro-F1 scores between 98.09% and 98.14%. The GNB with word embedding was a disappointingly low performer. Our result tops the current state-of-the-art score of 92.95% obtained with a deep learning approach, RNN-GRU (recurrent neural network-gated recurrent unit). [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
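A minimal sketch of the tf-idf plus RBF-kernel SVM configuration the abstract reports as a top performer, including hyperparameter tuning; the corpus here is a tiny English stand-in for the 35,600 annotated Arabic tweets, and stemming is omitted:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Hypothetical tiny corpus; real inputs would be stemmed Arabic tweets.
tweets = ["match tonight", "new phone review", "election results",
          "goal in last minute", "camera quality is great", "vote count update"]
labels = ["sports", "tech", "politics", "sports", "tech", "politics"]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),   # stemming would precede this
    ("svm", SVC(kernel="rbf")),
])
# Hyperparameter tuning, as the abstract notes all classifiers received.
grid = GridSearchCV(pipe, {"svm__C": [1, 10], "svm__gamma": ["scale", 0.1]}, cv=2)
grid.fit(tweets, labels)
print(grid.best_params_, grid.predict(["penalty shootout drama"]))
```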
44. Embedding and generalization of formula with context in the retrieval of mathematical information.
- Author
-
Dadure, Pankaj, Pakray, Partha, and Bandyopadhyay, Sivaji
- Subjects
INFORMATION retrieval ,GENERALIZATION ,DOCUMENT imaging systems ,TECHNOLOGICAL innovations ,MATTRESSES - Abstract
Retrieval of mathematical information from scientific documents is a crucial task. Numerous Mathematical Information Retrieval (MIR) systems have been developed that mainly focus on improving the indexing and searching mechanisms, yet the poor results obtained on evaluation measures reveal major limitations of such systems. This leaves scope for improvement and innovation through the inclusion of functionalities that can resolve the challenges of MIR systems. To improve the performance of MIR systems, this paper proposes a formula embedding and generalization approach with context, in addition to an innovative relevance measurement technique. In this approach, documents are preprocessed by the document preprocessor module, which extracts the formulas in Presentation MathML format along with their context. The formula embedding and generalization modules form binary vectors in which '1' represents the presence and '0' the absence of a particular entity in a formula; the vectors of formulas with context are then indexed by the indexer. The innovative relevance measurement technique ranks first those documents that are retrieved by both the formula embedding and generalization modules, ahead of those retrieved by only one. The proposed approach has been tested on the MathTagArticles of Wikipedia from NTCIR-12, and the obtained results verify the significance of the context of the formula and the dissimilarity factor in the retrieval of mathematical information. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
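A minimal sketch of the binary formula embedding described above: a vector with '1' where an entity occurs in the formula and '0' otherwise. The entity vocabulary and the overlap score are illustrative assumptions, not the paper's exact design:

```python
# Hypothetical entity vocabulary extracted from Presentation MathML tags.
vocabulary = ["mi:x", "mi:y", "mn:2", "mo:+", "mo:=", "msup", "mfrac"]

def embed(formula_entities):
    """Binary vector: 1 if the vocabulary entity occurs in the formula, else 0."""
    present = set(formula_entities)
    return [1 if e in present else 0 for e in vocabulary]

query = embed(["mi:x", "msup", "mn:2"])                     # e.g. x^2
doc = embed(["mi:x", "mi:y", "mo:+", "msup", "mn:2"])       # e.g. x^2 + y

# A simple overlap score; the paper's relevance measure additionally favours
# documents matched by both the embedding and generalization modules.
overlap = sum(q & d for q, d in zip(query, doc))
print(query, doc, overlap)
```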
45. Metadata implementation and data discoverability: A survey on university libraries' Dataverse portals.
- Author
-
Chiu, Tzu-Heng, Chen, Hsin-liang, and Cline, Ellen
- Subjects
- *
ACADEMIC libraries , *METADATA , *DATA management , *INFORMATION retrieval , *INSTITUTIONAL repositories - Abstract
The purpose of this practical case study is to examine the development of Dataverse, a global research data management consortium. This paper is the second in a project focusing on data discoverability and current metadata implementation on the Dataverse portals established by 27 university libraries worldwide. Five research questions were proposed to identify the most popular metadata standards and elements, search interface options, and result display formats on those portals. The data were collected from 27 university libraries worldwide between December 1, 2020 and January 31, 2021. According to the results of the descriptive analyses, the most popular metadata elements for the dataset overview were Subject and Description, while Dataset persistent ID, Publication Date, Title, Author, Contact, Deposit Date, Depositor, Description, and Subject were the most popular elements for the metadata record of each dataset. Publication Year, Author Names, and Subject were found to be the most common search facets used by the portals. English was the most common language used for the search interfaces and metadata descriptions. Based on the findings of this evidence-based study, the authors recommend future research on the development of institutional data portal infrastructure, on stakeholder outreach and training, and on user studies of dataset retrieval. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
46. Bottom-up color-independent alignment learning for text–image person re-identification.
- Author
-
Du, Guodong, Zhu, Hanyue, and Zhang, Liyan
- Subjects
- *
IMAGE databases , *NATURAL languages , *INFORMATION retrieval , *MODALITY (Linguistics) , *COLOR - Abstract
Text-to-image person re-identification (TIReID) refers to identifying images of a person of interest in a large-scale person image database based on natural language descriptions. Most existing methods rely heavily on color information when matching cross-modal data, a kind of overfitting that can be termed the color over-reliance problem. This problem distracts the model from other small but discriminative clues (e.g., clothing details, structural information), which are essential for both semantic alignment and fine-grained matching, and thus leads to sub-optimal retrieval performance. To this end, we propose a novel Bottom-up Color-independent Alignment Learning Framework (BCALF) for text-based person retrieval that tackles this problem in two ways: decoupling color-independent discrete local features, and aggregating multiple key discrete features. We employ color-confused images as an auxiliary modality and perform discrete fine-grained semantic alignment in which the minimal semantic units interact within the joint feature space to focus solely on content information. Furthermore, the multiple discrete local features are aggregated into more discriminative non-local decisive features. BCALF achieves semantic alignment from minimal semantic units to non-local aggregation units, which can be understood as a bottom-up process. Experimental results demonstrate that BCALF consistently outperforms previous methods and achieves state-of-the-art performance on the CUHK-PEDES, ICFG-PEDES and RSTPReid datasets. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
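One plausible way to build the "color-confused" auxiliary modality is to scramble RGB channels so color cues break while structure and texture survive; this is an assumption about the recipe, not the paper's exact augmentation:

```python
import torch

def color_confuse(images: torch.Tensor) -> torch.Tensor:
    """Randomly permute the RGB channels of each image in a batch,
    destroying color identity while keeping structural content intact."""
    out = torch.empty_like(images)
    for i in range(images.size(0)):
        out[i] = images[i, torch.randperm(3)]   # per-image channel shuffle
    return out

batch = torch.rand(4, 3, 384, 128)        # typical person-ReID crop size
confused = color_confuse(batch)
print(confused.shape)                     # same shape, scrambled colors
```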
47. Question-answering framework for building codes using fine-tuned and distilled pre-trained transformer models.
- Author
-
Xue, Xiaorui, Zhang, Jiansong, and Chen, Yunfeng
- Subjects
- *
NATURAL language processing , *ARTIFICIAL intelligence , *EVIDENCE gaps , *CONSTRUCTION projects , *INFORMATION retrieval , *DEEP learning - Abstract
Building code compliance checking is considered a bottleneck in construction projects, which calls for a novel approach to building code query and information retrieval. To address this research gap, the paper presents a question-answering framework comprising: (1) a 'retriever' for efficient context retrieval from building codes in response to an inquiry, and (2) a 'reader' for precise context interpretation and answer generation. The 'retriever', based on the BM25 algorithm, achieved a top-1 precision, recall, and F1-score of 0.95, 0.95, and 0.95, and a top-5 precision, recall, and F1-score of 0.97, 1.00, and 0.99, respectively. The 'reader', utilizing the transformer-based "xlm-roberta-base-squad2-distilled" model, achieved a top-4 accuracy of 0.95 and a top-1 F1-score of 0.84. A fine-tuning and model distillation process was used and shown to provide high performance on a limited amount of training data, overcoming a common barrier in the development of domain-specific (e.g., construction) deep learning models. • State-of-the-art building code information retrieval relies on keyword matching and requires deep domain knowledge to use. • A question-answering framework for building codes is proposed based on deep learning and natural language processing. • The framework overcomes a common barrier in the development of deep learning models in the AEC domain. • A top-5 precision, recall, and F1-score of 0.97, 1.00, and 0.99, respectively, have been achieved. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
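A minimal retriever-reader sketch mirroring the framework above: BM25 selects a candidate clause, and an extractive-QA transformer reads out the answer span. The clauses are invented, and the checkpoint id echoes the abstract but should be verified before use:

```python
from rank_bm25 import BM25Okapi          # pip install rank-bm25
from transformers import pipeline        # pip install transformers

# Hypothetical building-code clauses standing in for the real corpus.
clauses = [
    "Exit doors shall be not less than 813 mm wide.",
    "Handrails shall be between 865 mm and 965 mm in height.",
    "Corridors shall have a minimum clear width of 1100 mm.",
]
bm25 = BM25Okapi([c.lower().split() for c in clauses])

question = "How wide must an exit door be?"
top = bm25.get_top_n(question.lower().split(), clauses, n=1)[0]   # 'retriever'

# 'Reader': any extractive-QA checkpoint works here; this id mirrors the
# abstract's model name but its availability is an assumption.
reader = pipeline("question-answering",
                  model="deepset/xlm-roberta-base-squad2-distilled")
print(reader(question=question, context=top))     # answer span + score
```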
48. Detect-order-construct: A tree construction based approach for hierarchical document structure analysis.
- Author
-
Wang, Jiawei, Hu, Kai, Zhong, Zhuoyao, Sun, Lei, and Huo, Qiang
- Subjects
- *
TEXT summarization , *INFORMATION retrieval , *LATEX , *COMPUTER software , *READING - Abstract
Document structure analysis (aka document layout analysis) is crucial for understanding the physical layout and logical structure of documents, with applications in information retrieval, document summarization, knowledge extraction, etc. In this paper, we concentrate on Hierarchical Document Structure Analysis (HDSA) to explore hierarchical relationships within structured documents created using authoring software employing hierarchical schemas, such as LaTeX, Microsoft Word, and HTML. To comprehensively analyze hierarchical document structures, we propose a tree construction based approach that addresses multiple subtasks concurrently, including page object detection (Detect), reading order prediction of identified objects (Order), and the construction of intended hierarchical structure (Construct). We present an effective end-to-end solution based on this framework to demonstrate its performance. To assess our approach, we develop a comprehensive benchmark called Comp-HRDoc, which evaluates the above subtasks simultaneously. Our end-to-end system achieves state-of-the-art performance on two large-scale document layout analysis datasets (PubLayNet and DocLayNet), a high-quality hierarchical document structure reconstruction dataset (HRDoc), and our Comp-HRDoc benchmark. The Comp-HRDoc benchmark is publicly available at. • Proposed Detect-Order-Construct for hierarchical document structure analysis. • Devised end-to-end solution by casting tasks as relation prediction problems. • Designed multi-modal relation models with structure-aware improvements. • Established Comp-HRDoc for hierarchical document structure analysis evaluation. • Achieved state-of-the-art on PubLayNet, DocLayNet, HRDoc, and Comp-HRDoc. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
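The Construct step above can be pictured as stack-based tree building over objects already in reading order, attaching each object under the most recent object with a smaller hierarchy level. A minimal sketch with hypothetical detections (the paper's relation-prediction models are not reproduced):

```python
# Hypothetical page objects already detected (Detect) and put in reading
# order (Order), each with a predicted hierarchy level (0 = top heading).
objects = [("1 Introduction", 0), ("1.1 Motivation", 1), ("paragraph A", 2),
           ("1.2 Scope", 1), ("2 Method", 0), ("paragraph B", 1)]

def construct(objects):
    """Construct step: attach each object under the most recent
    object with a strictly smaller level."""
    root = {"text": "<doc>", "level": -1, "children": []}
    stack = [root]
    for text, level in objects:
        while stack[-1]["level"] >= level:
            stack.pop()
        node = {"text": text, "level": level, "children": []}
        stack[-1]["children"].append(node)
        stack.append(node)
    return root

def show(node, depth=0):
    for child in node["children"]:
        print("  " * depth + child["text"])
        show(child, depth + 1)

show(construct(objects))
```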
49. A knowledge graph completion model based on triple level interaction and contrastive learning.
- Author
-
Hu, Jie, Yang, Hongqun, Teng, Fei, Du, Shengdong, and Li, Tianrui
- Subjects
- *
KNOWLEDGE graphs , *INFORMATION retrieval , *RESEARCH personnel , *SEMANTICS , *INFERENCE (Logic) - Abstract
Knowledge graphs provide credible and structured knowledge for downstream tasks such as information retrieval. Nevertheless, their ubiquitous incompleteness often limits the performance of applications. To address this, the knowledge graph completion task has been proposed to supplement the facts of incomplete triples. Recently, researchers have introduced text descriptions to enrich entity representations. Existing methods based on triple decoupling with text descriptions solve the combinatorial explosion problem well, but they still lack the global characteristics of factual triples. In addition, the success of contrastive learning research has improved such methods, but they remain limited by existing negative sampling, which is usually more costly than in embedding-based methods. To overcome these limitations, this paper proposes an innovative triple-level interaction model for knowledge graph completion named InCL-KGC. Concretely, the proposed model employs an on-verge interaction method to reduce redundant text in entity representations and capture the global semantics of factual triples. Furthermore, we design an effective hard negative sampling strategy to improve contrastive learning. Additionally, we apply an improved Harbsort algorithm to reduce the adverse impact of candidate entity sparsity on inference. Extensive experimental results show that our model surpasses recent baselines, with MRR, Hit@3, and Hits@10 increased by 1.2%, 3.2%, and 6.8% on WN18RR, while MRR, Hit@1, Hit@3, and Hits@10 are enhanced by 2.8%, 1%, 3.3%, and 4.3% on FB15K-237. • Knowledge graph completion with triple-level interaction promotes capturing factual global semantics. • Hard negative sampling reduces computational requirements for contrastive learning. • Inference with fused degree information alleviates the impact of knowledge graph sparsity. • The proposed model can fully utilize external knowledge while ensuring efficiency. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
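A minimal sketch of contrastive learning with mined hard negatives, the ingredient InCL-KGC's sampling strategy improves: an InfoNCE-style loss in which each (head, relation) anchor must rank its true tail above hard-negative tails. Shapes and the temperature are assumptions:

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, hard_negatives, tau=0.05):
    """Contrastive loss per (head+relation, tail) pair: the positive tail
    competes against mined hard-negative tails."""
    a = F.normalize(anchor, dim=-1)                       # (B, D)
    pos = F.normalize(positive, dim=-1)                   # (B, D)
    neg = F.normalize(hard_negatives, dim=-1)             # (B, N, D)
    pos_logit = (a * pos).sum(-1, keepdim=True) / tau     # (B, 1)
    neg_logit = torch.einsum("bd,bnd->bn", a, neg) / tau  # (B, N)
    logits = torch.cat([pos_logit, neg_logit], dim=1)
    target = torch.zeros(len(a), dtype=torch.long)        # positive at index 0
    return F.cross_entropy(logits, target)

B, N, D = 8, 4, 128
loss = info_nce(torch.randn(B, D), torch.randn(B, D), torch.randn(B, N, D))
print(loss.item())
```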
50. D2GL: Dual-level dual-scale graph learning for sketch-based 3D shape retrieval.
- Author
-
Li, Wenjing, Bai, Jing, and Zheng, Hu
- Subjects
- *
FEATURE extraction , *COMPUTER vision , *INFORMATION retrieval , *ARCHITECTURAL design , *GENERALIZATION - Abstract
Sketch-based 3D shape retrieval (SBSR) is an active research area in the computer vision community, but it remains very challenging. One main reason is that existing deep learning-based methods usually treat sketches as 2D images, neglecting their sparsity and diversity. In this paper, we propose a novel Dual-level Dual-scale Graph Learning (D2GL) method to effectively enhance structural information and produce robust representations for sparse and diverse hand-drawn sketches. Specifically, in addition to the traditional branches for SBSR, we introduce a Dual-level Dual-scale Graph Self-attention (DLDS-GSA) module as an auxiliary branch. DLDS-GSA consists of two levels of encoders, i.e., a local structural encoder and a dual-scale global structural encoder, to capture both local discriminative and multi-scale global structures while minimizing the impact of varying sketch drawing details. Comprehensive experiments on the SHREC'13 and SHREC'14 datasets demonstrate the superiority of D2GL for SBSR, with extended experiments on PART-SHREC'14 confirming its generalization to unseen classes. • Designing a network architecture that incorporates DLDS-GSA as an auxiliary network. • Designing a dual-level module, DLDS-GSA, for sketch representation. • The results on different datasets demonstrate the superiority of DLDS-GSA. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
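A minimal sketch of one self-attention level over sketch-point "graph nodes", the kind of building block a dual-level, dual-scale design like DLDS-GSA could stack at local and global scales; the dimensions and residual layout are assumptions:

```python
import torch
import torch.nn as nn

class SketchSelfAttention(nn.Module):
    """Treat sampled sketch points as graph nodes and let self-attention
    learn structural relations among them (one level of a stacked design)."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, nodes):                  # nodes: (B, N, dim)
        out, _ = self.attn(nodes, nodes, nodes)
        return self.norm(nodes + out)          # residual + norm

points = torch.randn(2, 256, 64)               # 256 point features per sketch
print(SketchSelfAttention()(points).shape)     # -> torch.Size([2, 256, 64])
```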