28 results on '"metadata extraction"'
Search Results
2. Automatic Extraction and Cluster Analysis of Natural Disaster Metadata Based on the Unified Metadata Framework.
- Author
-
Wang, Zongmin, Shi, Xujie, Yang, Haibo, Yu, Bo, and Cai, Yingchun
- Subjects
NATURAL disasters, CLUSTER analysis (Statistics), INFORMATION technology, METADATA, DATA mining, NATURAL resources
- Abstract
The development of information technology has led to massive, multidimensional, and heterogeneously sourced disaster data. However, there is currently no universal metadata standard for managing natural disasters, and common pre-trained models for information extraction require extensive training data yet show limited effectiveness when annotated resources are scarce. This study establishes a unified natural disaster metadata standard, utilizes self-trained universal information extraction (UIE) models and Python libraries to extract metadata stored in both structured and unstructured forms, and analyzes the results using the Word2vec-Kmeans clustering algorithm. The results show that (1) the self-trained UIE model, with a learning rate of 3 × 10⁻⁴ and a batch_size of 32, significantly improves extraction results for various natural disasters by over 50%, and the optimized UIE model outperforms many other extraction methods in terms of precision, recall, and F1 scores. (2) The quality assessments of consistency, completeness, and accuracy for ten tables all exceed 0.80, with variances across the three dimensions of 0.04, 0.03, and 0.05; the overall evaluation of the tables' data items also exceeds 0.80, consistent with the table-level results, and the metadata model framework constructed in this study demonstrates high and stable quality. (3) Taking the flood dataset as an example, clustering reveals five main themes with high within-cluster similarity, and between-cluster differences are significant relative to within-cluster differences at a significance level of 0.01. Overall, this experiment supports effective sharing of disaster data resources and enhances natural disaster emergency response efficiency. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
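Note: the Word2vec-Kmeans clustering step described in entry 2 (and its duplicate record, entry 9) can be illustrated with a minimal sketch, assuming gensim and scikit-learn; the toy metadata phrases, vector size, and cluster count below are illustrative and are not the paper's settings.

```python
# Sketch: cluster short metadata phrases by averaging Word2vec vectors and running KMeans.
# Assumptions: gensim and scikit-learn are installed; the phrases below are illustrative only.
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

phrases = [
    "flood water level warning", "flood rainfall intensity", "earthquake magnitude depth",
    "earthquake epicenter location", "typhoon wind speed track", "typhoon landfall time",
]
tokenized = [p.split() for p in phrases]

# Train a small Word2vec model on the tokenized phrases (toy-sized parameters).
w2v = Word2Vec(sentences=tokenized, vector_size=50, window=3, min_count=1, epochs=50, seed=1)

# Represent each phrase as the mean of its word vectors.
X = np.array([np.mean([w2v.wv[t] for t in toks], axis=0) for toks in tokenized])

# Cluster the phrase vectors; k=3 is arbitrary here.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)
for phrase, label in zip(phrases, kmeans.labels_):
    print(label, phrase)
```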
3. Real-Time Security Risk Assessment From CCTV Using Hand Gesture Recognition
- Author
-
Murat Koca
- Subjects
CCTV footage, deep learning, cyber security, hand gesture recognition, media-pipe, metadata extraction, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
- Abstract
Closed-Circuit Television (CCTV) surveillance systems, long associated with physical security, are becoming more crucial when combined with cybersecurity measures. Combining traditional surveillance with cyber defenses is a flexible method for protecting against both physical and digital dangers. This study introduces the use of convolutional neural networks (CNNs) and hand gesture detection using CCTV data to perform real-time security risk assessments. The suggested method’s emphasis on automated extraction of key information, such as identity and behavior, illustrates its special use in silent or acoustically challenging settings. This study uses deep learning techniques to develop a novel approach for detecting hand gestures in CCTV images by automatically extracting relevant features using a media-pipe architecture. For instance, it facilitates risk assessment through the use of hand gestures in noisy environments or muted audio streams. Given this method’s uniqueness and efficiency, the suggested solution will be able to alert appropriate authorities in the event of a security breach. There seems to be considerable opportunity for the development of applications in several domains of security, law enforcement, and public safety, including but not limited to shopping malls, educational institutions, transportation, the armed forces, theft, abduction, etc.
- Published
- 2024
- Full Text
- View/download PDF
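Note: the hand-landmark detection step in entry 3 can be sketched with OpenCV and MediaPipe's Hands solution; the video path is a placeholder, and the paper's CNN-based risk scoring is not reproduced here.

```python
# Sketch: detect hand landmarks frame-by-frame from a video file with MediaPipe Hands.
# "cctv_sample.mp4" is a placeholder path; the risk-assessment model from the paper is not included.
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands

cap = cv2.VideoCapture("cctv_sample.mp4")
with mp_hands.Hands(static_image_mode=False, max_num_hands=2,
                    min_detection_confidence=0.5) as hands:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB input; OpenCV reads BGR.
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_hand_landmarks:
            for hand in results.multi_hand_landmarks:
                # 21 landmarks per hand; these could feed a gesture classifier.
                wrist = hand.landmark[0]
                print(f"hand detected, wrist at ({wrist.x:.2f}, {wrist.y:.2f})")
cap.release()
```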
4. NAA-LIMS: a laboratory information management system for neutron activation analysis at the Peruvian institute of nuclear energy
- Author
-
Rivas, Jherson and Bedregal, Patricia
- Published
- 2024
- Full Text
- View/download PDF
5. Generic features selection for structure classification of diverse styled scholarly articles.
- Author
-
Waqas, Muhammad and Anjum, Nadeem
- Abstract
The enormous growth in online research publications in diversified domains has attracted the research community to extract these valuable scientific resources by searching online digital libraries and publishers' websites. A precise search is desired to list the most related articles by applying semantic queries to the document's metadata and structural elements. The online search engines and digital libraries offer only keyword-based search on full-body text, which creates excessive results. Therefore, the research article's structural and metadata information has to be stored in machine-comprehensible form by the online research publishers. The research community in recent years has adopted different approaches to extract structural information from research documents, such as rule-based heuristics and machine-learning-based approaches. Studies suggest that machine-learning-based techniques have produced optimum results for document structure extraction from publishers having diversified publication layouts. In this paper, we have proposed thirteen different logical layout structural (LLS) components. We have identified a two-staged innovative set of generic features that are associated with the LLS. This approach has given our technique an advantage against the state-of-the-art for structural classification of digital scientific articles with diversified publication styles. We have applied chi-square (χ²) for feature selection, and the final result has revealed that a kernel-function SVM produced an optimum result with an overall F-measure of 0.95. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
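Note: the chi-square feature selection plus SVM combination reported in entry 5 can be sketched as a scikit-learn pipeline; the tiny text snippets and labels below stand in for the paper's thirteen LLS classes and are purely illustrative.

```python
# Sketch: chi-square feature selection followed by an RBF-kernel SVM, as a scikit-learn pipeline.
# The tiny text snippets and four labels stand in for the paper's thirteen LLS classes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

texts = ["Abstract This paper presents", "1 Introduction Recent work",
         "References [1] J. Smith", "Table 2 Results of the experiment"]
labels = ["abstract", "section_heading", "references", "table_caption"]

clf = Pipeline([
    ("vect", CountVectorizer()),             # bag-of-words features
    ("chi2", SelectKBest(chi2, k=10)),       # keep the 10 highest-scoring features
    ("svm", SVC(kernel="rbf", C=1.0)),       # kernel SVM classifier
])
clf.fit(texts, labels)
print(clf.predict(["2 Related Work Prior studies"]))
```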
6. Event and Entity Extraction from Generated Video Captions
- Author
-
Scherer, Johannes, Bhowmik, Deepayan, and Scherp, Ansgar; in: Holzinger, Andreas, Kieseberg, Peter, Cabitza, Federico, Campagner, Andrea, Tjoa, A Min, and Weippl, Edgar, editors
- Published
- 2023
- Full Text
- View/download PDF
7. Deep Learning Approaches for Big Data-Driven Metadata Extraction in Online Job Postings.
- Author
-
Skondras, Panagiotis, Zotos, Nikos, Lagios, Dimitris, Zervas, Panagiotis, Giotopoulos, Konstantinos C., and Tzimas, Giannis
- Subjects
JOB postings, DEEP learning, MACHINE learning, FEEDFORWARD neural networks, LANGUAGE models, METADATA, JOB classification
- Abstract
This article presents a study on the multi-class classification of job postings using machine learning algorithms. With the growth of online job platforms, there has been an influx of labor market data. Machine learning, particularly NLP, is increasingly used to analyze and classify job postings. However, the effectiveness of these algorithms largely hinges on the quality and volume of the training data. In our study, we propose a multi-class classification methodology for job postings, drawing on AI models such as text-davinci-003 and the quantized versions of Falcon 7b (Falcon), Wizardlm 7B (Wizardlm), and Vicuna 7B (Vicuna) to generate synthetic datasets. These synthetic data are employed in two use-case scenarios: (a) exclusively as training datasets composed of synthetic job postings (situations where no real data is available) and (b) as an augmentation method to bolster underrepresented job title categories. To evaluate our proposed method, we relied on two well-established approaches: the feedforward neural network (FFNN) and the BERT model. Both the use cases and training methods were assessed against a genuine job posting dataset to gauge classification accuracy. Our experiments substantiated the benefits of using synthetic data to enhance job posting classification. In the first scenario, the models' performance matched, and occasionally exceeded, that of the real data. In the second scenario, the augmented classes outperformed in most instances. This research confirms that AI-generated datasets can enhance the efficacy of NLP algorithms, especially in the domain of multi-class classification of job postings. While data augmentation can boost model generalization, its impact varies. It is especially beneficial for simpler models like the FFNN. BERT, due to its context-aware architecture, also benefits from augmentation but sees limited improvement. Selecting the right type and amount of augmentation is essential. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
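Note: the BERT-based classification of (possibly synthetic) job postings in entry 7 can be sketched with Hugging Face transformers; the postings, label set, and training hyperparameters below are illustrative, and bert-base-uncased stands in for whatever checkpoint the authors actually used.

```python
# Sketch: fine-tune a BERT sequence classifier on a handful of job postings.
# Dataset contents, label meanings, and hyperparameters are illustrative only.
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

postings = ["Develop REST APIs in Java and Spring", "Prepare monthly financial statements",
            "Design marketing campaigns for social media"]
labels = [0, 1, 2]  # e.g. 0=software_engineer, 1=accountant, 2=marketing_specialist (made up)

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

enc = tok(postings, padding=True, truncation=True, return_tensors="pt")
dataset = TensorDataset(enc["input_ids"], enc["attention_mask"], torch.tensor(labels))
loader = DataLoader(dataset, batch_size=2, shuffle=True)

optim = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(2):                      # a couple of epochs for the toy example
    for input_ids, attention_mask, y in loader:
        out = model(input_ids=input_ids, attention_mask=attention_mask, labels=y)
        out.loss.backward()
        optim.step()
        optim.zero_grad()
print("training loss:", out.loss.item())
```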
8. Generating Synthetic Resume Data with Large Language Models for Enhanced Job Description Classification †.
- Author
-
Skondras, Panagiotis, Zervas, Panagiotis, and Tzimas, Giannis
- Subjects
LANGUAGE models, MACHINE learning, JOB classification, TRANSFORMER models, JOB descriptions, NATURAL language processing, FEEDFORWARD neural networks
- Abstract
In this article, we investigate the potential of synthetic resumes as a means for the rapid generation of training data and their effectiveness in data augmentation, especially in categories marked by sparse samples. The widespread implementation of machine learning algorithms in natural language processing (NLP) has notably streamlined the resume classification process, delivering time and cost efficiencies for hiring organizations. However, the performance of these algorithms depends on the abundance of training data. While selecting the right model architecture is essential, it is also crucial to ensure the availability of a robust, well-curated dataset. For many categories in the job market, data sparsity remains a challenge. To deal with this challenge, we employed the OpenAI API to generate both structured and unstructured resumes tailored to specific criteria. These synthetically generated resumes were cleaned, preprocessed, and then utilized to train two distinct models: a transformer model (BERT) and a feedforward neural network (FFNN) that incorporated Universal Sentence Encoder 4 (USE4) embeddings. Both models were evaluated on the multiclass classification task of resumes; when trained on an augmented dataset containing 60 percent real data (from the Indeed website) and 40 percent synthetic data from ChatGPT, the transformer model presented exceptional accuracy, while the FFNN, predictably, achieved lower accuracy. These findings highlight the value of augmenting real-world data with ChatGPT-generated synthetic resumes, especially in the context of limited training data. The suitability of the BERT model for such classification tasks further reinforces this narrative. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
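Note: the USE4-plus-feedforward-network setup in entry 8 can be sketched with TensorFlow Hub and Keras; the resume snippets and three categories are placeholders, and only the public universal-sentence-encoder/4 handle is assumed.

```python
# Sketch: embed resume texts with Universal Sentence Encoder 4 and train a small feedforward classifier.
# Resume snippets and the three job categories are illustrative placeholders.
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

use4 = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

resumes = ["5 years of Python backend development", "Certified public accountant, audit experience",
           "Managed paid social campaigns and SEO"]
labels = np.array([0, 1, 2])

X = use4(resumes).numpy()                    # 512-dimensional sentence embeddings

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(512,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(X, labels, epochs=10, verbose=0)
print(model.predict(use4(["Built microservices in Go"]).numpy()))
```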
9. Automatic Extraction and Cluster Analysis of Natural Disaster Metadata Based on the Unified Metadata Framework
- Author
-
Zongmin Wang, Xujie Shi, Haibo Yang, Bo Yu, and Yingchun Cai
- Subjects
metadata extraction, UIE, natural disaster, Word2vec-Kmeans clustering, Geography (General), G1-922
- Abstract
The development of information technology has led to massive, multidimensional, and heterogeneously sourced disaster data. However, there is currently no universal metadata standard for managing natural disasters, and common pre-trained models for information extraction require extensive training data yet show limited effectiveness when annotated resources are scarce. This study establishes a unified natural disaster metadata standard, utilizes self-trained universal information extraction (UIE) models and Python libraries to extract metadata stored in both structured and unstructured forms, and analyzes the results using the Word2vec-Kmeans clustering algorithm. The results show that (1) the self-trained UIE model, with a learning rate of 3 × 10⁻⁴ and a batch_size of 32, significantly improves extraction results for various natural disasters by over 50%, and the optimized UIE model outperforms many other extraction methods in terms of precision, recall, and F1 scores. (2) The quality assessments of consistency, completeness, and accuracy for ten tables all exceed 0.80, with variances across the three dimensions of 0.04, 0.03, and 0.05; the overall evaluation of the tables' data items also exceeds 0.80, consistent with the table-level results, and the metadata model framework constructed in this study demonstrates high and stable quality. (3) Taking the flood dataset as an example, clustering reveals five main themes with high within-cluster similarity, and between-cluster differences are significant relative to within-cluster differences at a significance level of 0.01. Overall, this experiment supports effective sharing of disaster data resources and enhances natural disaster emergency response efficiency.
- Published
- 2024
- Full Text
- View/download PDF
10. A hybrid strategy to extract metadata from scholarly articles by utilizing support vector machine and heuristics.
- Author
-
Waqas, Muhammad, Anjum, Nadeem, and Afzal, Muhammad Tanvir
- Abstract
The immense growth in online research publications has attracted the research community to extract valuable information from scientific resources by exploring online digital libraries and publishers' websites. Metadata stored in a machine-comprehensible form can facilitate a precise search that lists the most related articles by applying semantic queries to the document's metadata and structural elements. The online search engines and digital libraries offer only keyword-based search on full-body text, which creates excessive results. The research community in recent years has adopted different approaches to extract structural information from research documents. We have distributed the content of an article into two logical layouts and metadata levels. This strategy gives our technique an advantage over the state-of-the-art (SOTA) in extracting metadata from articles with diversified publication styles. The experimental results have revealed that the proposed approach achieves a significant performance gain of 20.26% to 27.14%. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
11. Metadata Extraction Using Machine Learning (Ekstrakcija metapodatkov s pomočjo strojnega učenja).
- Author
-
SABADIN, Ivančica
- Published
- 2023
- Full Text
- View/download PDF
12. Annotated Open Corpus Construction and BERT-Based Approach for Automatic Metadata Extraction From Korean Academic Papers
- Author
-
Hyesoo Kong, Hwamook Yoon, Jaewook Seol, Mihwan Hyun, Hyejin Lee, Soonyoung Kim, and Wonjun Choi
- Subjects
BERT, corpus construction, metadata extraction, transfer learning, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
- Abstract
With the accelerating development of science and technology, the academic papers being published in various fields are increasing rapidly. Academic papers, especially in science and technology fields, are a crucial medium for researchers who identify the latest technological trends, develop new technologies, and conduct derivative studies. Therefore, the continual collection of extensive academic papers, structuring of metadata, and construction of databases are significant tasks. However, research on automatic metadata extraction from Korean papers is currently not being actively conducted owing to insufficient Korean training data. We automatically constructed the largest labeled corpus in South Korea to date from 315,320 PDF papers belonging to 503 Korean academic journals; this labeled corpus can be used to train models that automatically extract 12 metadata types from PDF papers. This labeled corpus is available at https://doi.org/10.23057/48. Moreover, we developed an inspection process and guidelines for the automatically constructed data and performed a full inspection of the validation and testing data. The reliability of the inspected data was verified through inter-annotator agreement measurement. Using our corpus, we trained and evaluated a BERT-based transfer learning model to verify its reliability. Furthermore, we proposed new training methods that can improve the metadata extraction performance on Korean papers, and through these methods, we developed the KorSciBERT-ME-J and KorSciBERT-ME-J+C models. KorSciBERT-ME-J showed the highest performance, with an F1 score of 99.36%, as well as robust performance in automatic metadata extraction from Korean academic papers in various formats.
- Published
- 2023
- Full Text
- View/download PDF
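Note: metadata extraction of the kind described in entry 12 is commonly framed as token classification; a minimal Hugging Face sketch follows, with bert-base-multilingual-cased standing in for the paper's KorSciBERT models and an illustrative BIO tag set (the model here is untrained, so the predicted tags are meaningless until fine-tuning).

```python
# Sketch: metadata extraction as token classification (BIO tags for title/author, etc.).
# "bert-base-multilingual-cased" stands in for the paper's KorSciBERT models; tags are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tags = ["O", "B-TITLE", "I-TITLE", "B-AUTHOR", "I-AUTHOR"]
tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(tags))

text = "Deep Learning for Document Analysis Hyesoo Kong"
enc = tok(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits             # shape (1, seq_len, num_labels); untrained here

pred = logits.argmax(dim=-1)[0].tolist()
tokens = tok.convert_ids_to_tokens(enc["input_ids"][0])
for t, p in zip(tokens, pred):
    print(t, tags[p])                        # after fine-tuning, these become metadata labels
```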
13. Using Provenance in Data Analytics for Seismology: Challenges and Directions
- Author
-
da Costa, Umberto Souza, Espinosa-Oviedo, Javier Alfonso, Musicante, Martin A., Vargas-Solar, Genoveva, and Zechinelli-Martini, José-Luis; in: Chiusano, Silvia, Cerquitelli, Tania, Wrembel, Robert, Nørvåg, Kjetil, Catania, Barbara, Vargas-Solar, Genoveva, and Zumpano, Ester, editors
- Published
- 2022
- Full Text
- View/download PDF
14. Multi-perspective Approach for Curating and Exploring the History of Climate Change in Latin America within Digital Newspapers.
- Author
-
Vargas-Solar, Genoveva, Zechinelli-Martini, José-Luis, A. Espinosa-Oviedo, Javier, and M. Vilches-Blázquez, Luis
- Abstract
This paper introduces a multi-perspective approach to deal with curation and exploration issues in historical newspapers. It has been implemented in the LACLICHEV platform (Latin American Climate Change Evolution platform). Exploring the history of climate change through digitized newspapers published around two centuries ago introduces four challenges: (1) curating content for tracking entries describing meteorological events; (2) processing (digging into) colloquial language (and its geographic variations) for extracting meteorological events; (3) analyzing newspapers to discover meteorological patterns possibly associated with climate change; (4) designing tools for exploring the extracted content. LACLICHEV provides tools for curating, exploring, and analyzing historical newspaper articles, their description and location, and the vocabularies used for referring to meteorological events. This platform makes it possible to understand and identify possible patterns and models that can build an empirical and social view of the history of climate change in the Latin American region. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
15. Re-purposing Excavation Database Content as Paradata: An Explorative Analysis of Paradata Identification Challenges and Opportunities
- Author
-
Lisa Börjesson, Olle Sköld, Zanna Friberg, Daniel Löwenborg, Gísli Pálsson, and Isto Huvila
- Subjects
metadata, paradata, metadata extraction, data reuse, research data, unstructured data, archaeological data, Bibliography. Library science. Information resources
- Abstract
Although data reusers request information about how research data was created and curated, this information is often non-existent or only briefly covered in data descriptions. The need for such contextual information is particularly critical in fields like archaeology, where old legacy data created during different time periods and through varying methodological framings and fieldwork documentation practices retains its value as an important information source. This article explores the presence of contextual information in archaeological data with a specific focus on data provenance and processing information, i.e., paradata. The purpose of the article is to identify and explicate types of paradata in field observation documentation. The method used is an explorative close reading of field data from an archaeological excavation enriched with geographical metadata. The analysis covers technical and epistemological challenges and opportunities in paradata identification, and discusses the possibility of using identified paradata in data descriptions and for data reliability assessments. Results show that it is possible to identify both knowledge organisation paradata (KOP) relating to data structuring and knowledge-making paradata (KMP) relating to fieldwork methods and interpretative processes. However, while the data contains many traces of the research process, there is an uneven and, in some categories, low level of structure and systematicity that complicates automated metadata and paradata identification and extraction. The results show a need to broaden the understanding of how structure and systematicity are used and how they impact research data in archaeology and comparable field sciences. The insight into how a dataset's KOP and KMP can be read is also a methodological contribution to data literacy research and practice development. On a repository level, the results underline the need to include paradata about dataset creation, purpose, terminology, dataset internal and external relations, and eventual data colloquialisms that require explanation to reusers.
- Published
- 2022
- Full Text
- View/download PDF
16. Generating Synthetic Resume Data with Large Language Models for Enhanced Job Description Classification
- Author
-
Panagiotis Skondras, Panagiotis Zervas, and Giannis Tzimas
- Subjects
metadata extraction, resumes, CV, big data, multiclass classification, ChatGPT, Information technology, T58.5-58.64
- Abstract
In this article, we investigate the potential of synthetic resumes as a means for the rapid generation of training data and their effectiveness in data augmentation, especially in categories marked by sparse samples. The widespread implementation of machine learning algorithms in natural language processing (NLP) has notably streamlined the resume classification process, delivering time and cost efficiencies for hiring organizations. However, the performance of these algorithms depends on the abundance of training data. While selecting the right model architecture is essential, it is also crucial to ensure the availability of a robust, well-curated dataset. For many categories in the job market, data sparsity remains a challenge. To deal with this challenge, we employed the OpenAI API to generate both structured and unstructured resumes tailored to specific criteria. These synthetically generated resumes were cleaned, preprocessed, and then utilized to train two distinct models: a transformer model (BERT) and a feedforward neural network (FFNN) that incorporated Universal Sentence Encoder 4 (USE4) embeddings. Both models were evaluated on the multiclass classification task of resumes; when trained on an augmented dataset containing 60 percent real data (from the Indeed website) and 40 percent synthetic data from ChatGPT, the transformer model presented exceptional accuracy, while the FFNN, predictably, achieved lower accuracy. These findings highlight the value of augmenting real-world data with ChatGPT-generated synthetic resumes, especially in the context of limited training data. The suitability of the BERT model for such classification tasks further reinforces this narrative.
- Published
- 2023
- Full Text
- View/download PDF
17. Deep Learning Approaches for Big Data-Driven Metadata Extraction in Online Job Postings
- Author
-
Panagiotis Skondras, Nikos Zotos, Dimitris Lagios, Panagiotis Zervas, Konstantinos C. Giotopoulos, and Giannis Tzimas
- Subjects
metadata extraction, online job postings, big data, web crawling, data preprocessing, ChatGPT, Information technology, T58.5-58.64
- Abstract
This article presents a study on the multi-class classification of job postings using machine learning algorithms. With the growth of online job platforms, there has been an influx of labor market data. Machine learning, particularly NLP, is increasingly used to analyze and classify job postings. However, the effectiveness of these algorithms largely hinges on the quality and volume of the training data. In our study, we propose a multi-class classification methodology for job postings, drawing on AI models such as text-davinci-003 and the quantized versions of Falcon 7b (Falcon), Wizardlm 7B (Wizardlm), and Vicuna 7B (Vicuna) to generate synthetic datasets. These synthetic data are employed in two use-case scenarios: (a) exclusively as training datasets composed of synthetic job postings (situations where no real data is available) and (b) as an augmentation method to bolster underrepresented job title categories. To evaluate our proposed method, we relied on two well-established approaches: the feedforward neural network (FFNN) and the BERT model. Both the use cases and training methods were assessed against a genuine job posting dataset to gauge classification accuracy. Our experiments substantiated the benefits of using synthetic data to enhance job posting classification. In the first scenario, the models' performance matched, and occasionally exceeded, that of the real data. In the second scenario, the augmented classes outperformed in most instances. This research confirms that AI-generated datasets can enhance the efficacy of NLP algorithms, especially in the domain of multi-class classification of job postings. While data augmentation can boost model generalization, its impact varies. It is especially beneficial for simpler models like the FFNN. BERT, due to its context-aware architecture, also benefits from augmentation but sees limited improvement. Selecting the right type and amount of augmentation is essential.
- Published
- 2023
- Full Text
- View/download PDF
18. Automatic Annotation of Images in Persian Scientific Documents Based on Text Analysis Methods
- Author
-
Azadeh fakhrzadeh, Mohadeseh Rahnama, and Jalal A Nasiri
- Subjects
image tagging, text analysis, image annotation, image retrieval, metadata extraction, information technology, Bibliography. Library science. Information resources
- Abstract
In this paper a new method for annotating images in Persian scientific documents is suggested. Images in scientific documents contain valuable information; in many cases, by analyzing images one can understand the main idea and important results of a document. Due to the explosive growth of image data, automatic image annotation has attracted extensive attention and become one of the growing subjects in the literature. Image annotation is the first step in image retrieval methods, in which descriptive tags are assigned to each image. Here, the text associated with each image is used for annotation: the caption and the part of the document that refers to the image are considered. Noun phrases in the associated text are ranked by five different methods: term frequency, inverse document frequency, term frequency–inverse document frequency, cosine similarity between word embeddings of noun phrases in the text and the caption, and a combination of the term frequency–inverse document frequency and cosine similarity methods. In every method, the image tags are the highest-ranked noun phrases. The suggested methods are evaluated on test data from the Iran scientific information database (Ganj), the main database of Persian scientific documents. The term frequency–inverse document frequency method gives the best results.
- Published
- 2022
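Note: the TF-IDF ranking of candidate noun phrases described in entry 18 can be sketched with scikit-learn; the documents and candidate phrases below are illustrative English stand-ins for the Persian text handled in the paper.

```python
# Sketch: rank candidate noun phrases for an image by TF-IDF against a small document collection.
# The documents and candidate phrases are illustrative; the paper works on Persian text from Ganj.
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "the confusion matrix of the proposed classifier is shown in figure three",
    "figure three compares precision and recall of the proposed classifier",
    "related work on image retrieval and metadata extraction",
]
candidates = ["confusion matrix", "proposed classifier", "related work", "figure"]

# Fit TF-IDF on the documents, allowing unigrams and bigrams so multi-word phrases get scores.
vec = TfidfVectorizer(ngram_range=(1, 2))
tfidf = vec.fit_transform(documents)
vocab = vec.vocabulary_

# Score each candidate by its TF-IDF weight in the caption/reference text (document 0 here).
scores = {}
for phrase in candidates:
    idx = vocab.get(phrase)
    scores[phrase] = tfidf[0, idx] if idx is not None else 0.0
for phrase, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{phrase}: {score:.3f}")
```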
19. Automatic Extraction and Classification of Metadata Using Deep Learning Algorithms (Üstverilerin Derin Öğrenme Algoritmaları Kullanılarak Otomatik Olarak Çıkartılması ve Sınıflanması)
- Author
-
Murat İnce
- Subjects
metadata extraction, convolutional neural networks, recurrent neural networks, Technology, Engineering (General). Civil engineering (General), TA1-2040, Science, Science (General), Q1-390
- Abstract
The spread of information technologies has increased the need for digital content, and creating such content is a time-consuming and costly process. Learning objects are used when building this content, and it is important for reusability and shareability that these objects be discoverable and readable by computers. For this reason, learning objects are used together with metadata that carries descriptive identifying information: the more accurately this metadata is created and classified, the more usable the objects become. Many methods have therefore been developed to extract metadata from objects automatically. In this study, metadata was automatically extracted from the content of learning objects and classified using deep learning methods such as Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) together with Natural Language Processing (NLP) techniques. The success and accuracy of the system were tested with sample learning objects, and the results showed that the system can be used successfully.
- Published
- 2021
- Full Text
- View/download PDF
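Note: entry 19 describes extracting and classifying metadata from learning-object content with CNN/RNN models; a toy Keras sketch of a recurrent text classifier follows, with illustrative sentences, vocabulary size, and two made-up classes.

```python
# Sketch: a tiny recurrent text classifier of the kind entry 19 describes (embedding + LSTM).
# The example sentences, vocabulary size, and two classes are illustrative only.
import numpy as np
import tensorflow as tf

texts = ["introduction to photosynthesis in plants", "quiz on quadratic equations",
         "lecture notes on cell biology", "exercise sheet for linear algebra"]
labels = np.array([0, 1, 0, 1])              # 0=biology content, 1=mathematics content

# Turn raw strings into fixed-length integer sequences.
vectorizer = tf.keras.layers.TextVectorization(max_tokens=1000, output_sequence_length=8)
vectorizer.adapt(texts)
X = vectorizer(np.array(texts))

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=1000, output_dim=32),
    tf.keras.layers.LSTM(16),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(X, labels, epochs=10, verbose=0)
print(model.predict(vectorizer(np.array(["worksheet on matrix multiplication"]))))
```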
20. Re-purposing Excavation Database Content as Paradata: An Explorative Analysis of Paradata Identification Challenges and Opportunities.
- Author
-
Börjesson, Lisa, Sköld, Olle, Friberg, Zanna, Löwenborg, Daniel, Palsson, Gisli, and Huvila, Isto
- Subjects
ARCHAEOLOGICAL databases, ARCHAEOLOGICAL excavations, INFORMATION retrieval, INFORMATION resources, DOCUMENTATION
- Abstract
Although data reusers request information about how research data was created and curated, this information is often non-existent or only briefly covered in data descriptions. The need for such contextual information is particularly critical in fields like archaeology, where old legacy data created during different time periods and through varying methodological framings and fieldwork documentation practices retains its value as an important information source. This article explores the presence of contextual information in archaeological data with a specific focus on data provenance and processing information, i.e., paradata. The purpose of the article is to identify and explicate types of paradata in field observation documentation. The method used is an explorative close reading of field data from an archaeological excavation enriched with geographical metadata. The analysis covers technical and epistemological challenges and opportunities in paradata identification, and discusses the possibility of using identified paradata in data descriptions and for data reliability assessments. Results show that it is possible to identify both knowledge organisation paradata (KOP) relating to data structuring and knowledge-making paradata (KMP) relating to fieldwork methods and interpretative processes. However, while the data contains many traces of the research process, there is an uneven and, in some categories, low level of structure and systematicity that complicates automated metadata and paradata identification and extraction. The results show a need to broaden the understanding of how structure and systematicity are used and how they impact research data in archaeology and comparable field sciences. The insight into how a dataset's KOP and KMP can be read is also a methodological contribution to data literacy research and practice development. On a repository level, the results underline the need to include paradata about dataset creation, purpose, terminology, dataset internal and external relations, and eventual data colloquialisms that require explanation to reusers. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
21. LAME: Layout-Aware Metadata Extraction Approach for Research Articles.
- Author
-
Jongyun Choi, Hyesoo Kong, Hwamook Yoon, Heungseon Oh, and Yuchul Jung
- Subjects
METADATA, SCHOLARLY periodicals, CONFERENCE papers, ACADEMIC conferences
- Abstract
The volume of academic literature, such as academic conference papers and journals, has increased rapidly worldwide, and research on metadata extraction is ongoing. However, high-performing metadata extraction is still challenging due to the diverse layout formats used by journal publishers. To accommodate the diversity of academic journal layouts, we propose a novel LAyout-aware Metadata Extraction (LAME) framework with three components: automatic layout analysis, construction of a large metadata training set, and a metadata extractor. In the framework, we designed an automatic layout analysis using PDFMiner. Based on the layout analysis, a large volume of metadata-separated training data, including the title, abstract, author names, author-affiliated organizations, and keywords, was automatically extracted. Moreover, we constructed a pre-trained model, Layout-MetaBERT, to extract metadata from academic journals with varying layout formats. The experimental results with our metadata extractor exhibited robust performance (Macro-F1, 93.27%) in metadata extraction for unseen journals with different layout formats. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
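Note: the PDFMiner-based layout analysis that LAME (entry 21) starts from can be sketched with pdfminer.six; "paper.pdf" is a placeholder path, and the Layout-MetaBERT labeling step is not reproduced here.

```python
# Sketch: walk a PDF's layout with pdfminer.six and print each text box with its bounding box.
# "paper.pdf" is a placeholder path; LAME's Layout-MetaBERT labeling step is not included.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer

for page_layout in extract_pages("paper.pdf", maxpages=1):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            x0, y0, x1, y1 = element.bbox        # coordinates in PDF points
            text = element.get_text().strip()
            # Position cues like these feed metadata classifiers (title, authors, abstract, ...).
            print(f"({x0:.0f}, {y0:.0f}, {x1:.0f}, {y1:.0f}) {text[:60]!r}")
```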
22. Text and metadata extraction from scanned Arabic documents using support vector machines.
- Author
-
Qin, Wenda, Elanwar, Randa, and Betke, Margrit
- Subjects
SUPPORT vector machines, SUPERVISED learning, DOCUMENT imaging systems, METADATA, IMAGE analysis
- Abstract
Text information in scanned documents becomes accessible only when extracted and interpreted by a text recognizer. For a recognizer to work successfully, it must have detailed location information about the regions of the document images that it is asked to analyse. It needs to focus on page regions that contain text, skipping non-text regions such as illustrations or photographs. However, text recognizers do not work as logical analyzers. Logical layout analysis automatically determines the function of a document text region, that is, it labels each region as a title, paragraph, caption, and so on, and thus is an essential part of a document understanding system. In the past, rule-based algorithms have been used to conduct logical layout analysis, using limited-size data sets. Here, we instead focus on supervised learning methods for logical layout analysis. We describe LABA, a system based on multiple support vector machines that performs logical Layout Analysis of scanned Book pages in Arabic. The system detects the function of a text region based on the analysis of various image features and a voting mechanism. For a baseline comparison, we implemented an older but state-of-the-art neural network method. We evaluated LABA using a data set of scanned pages from illustrated Arabic books and obtained high recall and precision values. We also found that the F-measure of LABA is higher for five of the six tested classes compared to the state-of-the-art method. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
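Note: the multiple-SVM region labeling idea behind LABA (entry 22) can be sketched at small scale with scikit-learn's one-vs-rest SVC; the geometric feature vectors and region labels below are illustrative, not the paper's actual features.

```python
# Sketch: label page regions from simple geometric features with linear SVMs (one-vs-rest).
# Feature vectors are [x_center, y_center, width, height, text_density], all illustrative.
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X = np.array([
    [0.5, 0.95, 0.8, 0.05, 0.3],   # wide box near the top of the page
    [0.5, 0.50, 0.9, 0.60, 0.9],   # large dense block in the middle
    [0.5, 0.08, 0.6, 0.04, 0.5],   # short line near the bottom
])
y = np.array(["title", "paragraph", "caption"])

# One binary SVM per class, mirroring the multiple-SVM-with-voting idea at toy scale.
clf = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
print(clf.predict([[0.5, 0.93, 0.7, 0.05, 0.4]]))
```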
23. A Method for Labeling Images in... (ارائة روشی برای برچسب زدن تصاویر موجود در...)
- Author
-
Azadeh Fakhrzadeh, Mohadeseh Rahnama, and Jalal A. Nasiri
- Subjects
SCIENCE databases, NOUN phrases (Grammar), IMAGE retrieval, ANNOTATIONS, INFORMATION technology
- Abstract
In this paper a new method for annotating images in Persian scientific documents is suggested. Images in scientific documents contain valuable information; in many cases, by analyzing images one can understand the main idea and important results of a document. Due to the explosive growth of image data, automatic image annotation has attracted extensive attention and become one of the growing subjects in the literature. Image annotation is the first step in image retrieval methods, in which descriptive tags are assigned to each image. Here, the text associated with each image is used for annotation: the caption and the part of the document that refers to the image are considered. Noun phrases in the associated text are ranked by five different methods: term frequency, inverse document frequency, term frequency–inverse document frequency, cosine similarity between word embeddings of noun phrases in the text and the caption, and a combination of the term frequency–inverse document frequency and cosine similarity methods. In every method, the image tags are the highest-ranked noun phrases. The suggested methods are evaluated on test data from the Iran scientific information database (Ganj), the main database of Persian scientific documents. The term frequency–inverse document frequency method gives the best results. [ABSTRACT FROM AUTHOR]
- Published
- 2022
24. Automatic Extraction and Classification of Metadata Using Deep Learning Algorithms (Üstverilerin Derin Öğrenme Algoritmaları Kullanılarak Otomatik Olarak Çıkartılması ve Sınıflanması).
- Author
-
İNCE, Murat
- Published
- 2021
- Full Text
- View/download PDF
25. An Efficient Framework for Algorithmic Metadata Extraction over Scholarly Documents Using Deep Neural Networks
- Author
-
Raghavendra Nayaka, P. and Ranjan, Rajeev
- Published
- 2023
- Full Text
- View/download PDF
26. Extracting enhanced artificial intelligence model metadata from software repositories
- Author
-
Tsay, Jason, Braz, Alan, Hirzel, Martin, Shinnar, Avraham, and Mummert, Todd
- Published
- 2022
- Full Text
- View/download PDF
27. IndeGx: A Model and a Framework for Indexing RDF Knowledge Graphs with SPARQL-based Test Suits
- Author
-
Pierre Maillot, Olivier Corby, Catherine Faron, Fabien Gandon, and Franck Michel (Université Côte d'Azur, Inria, CNRS, I3S, WIMMICS team)
- Subjects
Human-Computer Interaction, knowledge graph, Computer Networks and Communications, [INFO.INFO-IR] Computer Science [cs]/Information Retrieval [cs.IR], endpoint description, [INFO.INFO-WB] Computer Science [cs]/Web, dataset description, metadata extraction, semantic index, Software, [INFO.INFO-SI] Computer Science [cs]/Social and Information Networks [cs.SI]
- Abstract
In recent years, a large number of RDF datasets have been built and published on the Web in fields as diverse as linguistics or life sciences, as well as general datasets such as DBpedia or Wikidata. The joint exploitation of these datasets requires specific knowledge about their content, access points, and commonalities. However, not all datasets contain a self-description, and not all access points can handle the complex queries used to generate such a description. In this article, we provide a standard-based approach to generate the description of a dataset. The generated descriptions as well as the process of their computation are expressed using standard vocabularies and languages. We implemented our approach into a framework, called IndeGx, where each indexing feature and its computation is collaboratively and declaratively defined in a GitHub repository. We have experimented IndeGx on a set of 339 RDF datasets with endpoints listed in public catalogs, over 8 months. The results show that we can collect, as much as possible, important characteristics of the datasets depending on their availability and capacities. The resulting index captures the commonalities, variety and disparity in the offered content and services and it provides an important support to any application designed to query RDF datasets.
- Published
- 2023
- Full Text
- View/download PDF
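Note: IndeGx (entries 27 and 28) computes dataset descriptions by running SPARQL queries against endpoints; a minimal sketch with SPARQLWrapper follows, using the public DBpedia endpoint purely as an example target (a heavy aggregation like this may time out on large endpoints), not the IndeGx framework itself.

```python
# Sketch: compute one simple dataset characteristic (most-used classes) over a SPARQL endpoint,
# in the spirit of IndeGx's SPARQL-based indexing. The DBpedia endpoint is just an example target.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://dbpedia.org/sparql")
endpoint.setQuery("""
    SELECT ?class (COUNT(?s) AS ?count)
    WHERE { ?s a ?class }
    GROUP BY ?class ORDER BY DESC(?count) LIMIT 5
""")
endpoint.setReturnFormat(JSON)

results = endpoint.query().convert()
for row in results["results"]["bindings"]:
    print(row["class"]["value"], row["count"]["value"])
```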
28. IndeGx: A model and a framework for indexing RDF knowledge graphs with SPARQL-based test suits.
- Author
-
Maillot, Pierre, Corby, Olivier, Faron, Catherine, Gandon, Fabien, and Michel, Franck
- Abstract
In recent years, a large number of RDF datasets have been built and published on the Web in fields as diverse as linguistics or life sciences, as well as general datasets such as DBpedia or Wikidata. The joint exploitation of these datasets requires specific knowledge about their content, access points, and commonalities. However, not all datasets contain a self-description, and not all access points can handle the complex queries used to generate such a description. In this article, we provide a standard-based approach to generate the description of a dataset. The generated descriptions as well as the process of their computation are expressed using standard vocabularies and languages. We implemented our approach into a framework, called IndeGx, where each indexing feature and its computation is collaboratively and declaratively defined in a GitHub repository. We have experimented IndeGx on a set of 339 RDF datasets with endpoints listed in public catalogs, over 8 months. The results show that we can collect, as much as possible, important characteristics of the datasets depending on their availability and capacities. The resulting index captures the commonalities, variety and disparity in the offered content and services and it provides an important support to any application designed to query RDF datasets. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF