
Federated Content Search for Lexical Resources (LexFCS): Specification

Authors :: Körner, Erik
Eckart, Thomas
Herold, Axel
Wiegand, Frank
Michaelis, Frank
Bremm, Matthias
Cotgrove, Louis
Trippel, Thorsten
Rau, Felix
Publication Year :: 2023
Publisher :: Zenodo, 2023.
Abstract: The landscape of digital lexical resources is often characterized by dedicated local portals and proprietary interfaces as the primary access points for scholars and the interested public. In addition, legal and technical restrictions can make it difficult to query and use these valuable resources efficiently. As part of the research data consortium Text+, solutions for the storage and provision of digital language resources are being developed and provided in the context of the unified cross-domain German research data infrastructure NFDI. The specific challenge of accessing lexical resources in a diverse and heterogeneous landscape, with a variety of participating institutions and established technical solutions, is addressed by the development of the federated search and query framework LexFCS. The LexFCS extends the established CLARIN Federated Content Search (FCS), which already allows access to spatially distributed text corpora using a common specification of technical interfaces, data formats, and query languages. This paper describes the current state of development of the LexFCS, gives an insight into its technical details, and provides an outlook on its future development.

The FCS specification (Schonefeld et al. 2014) will be extended with regard to announcing, querying and retrieving lexical resources. Specifically, this entails:

(1) Specifying the query language, which is a “CQL Context Set” of the Contextual Query Language (standardized by the US Library of Congress) dedicated to querying lexical entries. Its specification includes agreements on the accessible fields of information for a lexeme (such as part of speech, definitions, and (semantically) related entries) and on how to combine them into complex queries. This is especially challenging due to the inherently hierarchical structure of lexical data.

(2) Specifying common data formats for a unified result presentation. At the basic level, this is achieved by a mandatory KWIC representation that allows information types to be annotated inline, and by an advanced tabular representation of all fields in key-value style. In most cases these representations can only provide a simplified view of the data. Providing records in their complex native representation as well is therefore encouraged; examples include different TEI dialects (including TEI Lex-0), OntoLex/Lemon, and other formats.

(3) Extending the core FCS specification while remaining compatible with the overall architecture, to enable the reuse of features such as access control for restricted resources or automatic registration of endpoints within the FCS system.

This publication was created in the context of the work of the association German National Research Data Infrastructure (NFDI) e.V. NFDI is financed by the Federal Republic of Germany and the 16 federal states, and the consortium Text+ is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), project number 460033370. The authors would like to thank them for the funding and support. Furthermore, the authors would like to thank all members of the Text+ data domain Lexical resources for their continuous work.