1. Digital Humanities & L Large Language Models arge Models: Practice and Research in Semantic Retrieval of Ancient Do Documents.
- Author
-
WANG Haoxian, ZHOU Ziming, DING Feifei, and WEI Chengfu
- Abstract
Against the backdrop of the increasing popularity of artificial intelligence technology, particularly large language models, this paper aims to explore their applications in the field of digital humanities, with a particular focus on the retrieval of ancient documents. Through the practice and exploration of the ancient document retrieval platform at Peking University Library, this study not only introduces new perspectives and methods to the field of digital humanities, but also promotes academic research and cultural heritage. It also provides practical references for other university libraries, which is an important guide.[Method/ Process] The article begins with an overview of the origins and development of the digital humanities, emphasizing its central role in humanities research. The paper then examines the current state of the arts in large language models and analyzes their potential and advantages for identifying and classifying ancient documents, semantic understanding and parsing, and information extraction and association. Through the analysis of practical case studies, this paper constructs a fundamental semantic retrieval model, the core architecture of which consists of two critical components. First, the construction phase of the retrieval engine involves meticulous preprocessing of the ancient document information to generate basic metadata. Using large-scale models, these metadata are subjected to in-depth processing and enhancement to create auxiliary search fields and enriched text. In addition, the text processed by the model and the original text are transformed into semantic vectors, which are then stored in an efficient vector engine for rapid retrieval. Second, the search and sort component is another core part of the model. This part accurately processes the user's search terms through large models to generate extended content and, in conjunction with the search terms, creates accurate semantic vectors. Utilizing the previously constructed vector engine, the model can efficiently retrieve relevant documents and intelligently sort the search results based on specific algorithms, ensuring that users can quickly obtain the most relevant and valuable information. Taking the ancient document system collection data of Peking University Library as the research object, the paper processes over 250, 000 records, primarily consisting of ancient books and rubbings, as well as over 10 million metadata items. Using the gradio framework on a server equipped with two NVIDIA RTX 4090 24G graphics cards, a semantic retrieval platform was created to effectively manage and retrieve these vast amounts of data. [Results/Conclusions] The main strengths and contributions of the study lie in the standardized metadata organization, the metadata extension supported by large models, the support for natural language search terms, the faulttolerant search mechanisms, and the efficient retrieval capabilities of the vector engine. However, there are shortcomings, such as the accuracy of results generated by large models and insufficient comprehensive analysis of user search data. Future efforts will be devoted to improving these issues to increase the effectiveness of the research. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF