28 results on '"metadata extraction"'
Search Results
2. Automatic Extraction and Cluster Analysis of Natural Disaster Metadata Based on the Unified Metadata Framework.
- Author
-
Wang, Zongmin, Shi, Xujie, Yang, Haibo, Yu, Bo, and Cai, Yingchun
- Subjects
NATURAL disasters, CLUSTER analysis (Statistics), INFORMATION technology, METADATA, DATA mining, NATURAL resources
- Abstract
The development of information technology has led to massive, multidimensional, and heterogeneously sourced disaster data. However, there is currently no universal metadata standard for managing natural disasters, and common pre-trained models for information extraction require extensive training data yet show limited effectiveness when annotated resources are scarce. This study establishes a unified natural disaster metadata standard, utilizes self-trained universal information extraction (UIE) models and Python libraries to extract metadata stored in both structured and unstructured forms, and analyzes the results using the Word2vec-Kmeans clustering algorithm. The results show that (1) the self-trained UIE model, with a learning rate of 3 × 10⁻⁴ and a batch_size of 32, significantly improves extraction results for various natural disasters by over 50%, and the optimized UIE model outperforms many other extraction methods in terms of precision, recall, and F1 scores. (2) The quality assessments of consistency, completeness, and accuracy for ten tables all exceed 0.80, with variances across the three dimensions of 0.04, 0.03, and 0.05; the overall evaluation of the tables' data items also exceeds 0.80, consistent with the table-level results, and the metadata model framework constructed in this study demonstrates high and stable quality. (3) Taking the flood dataset as an example, clustering reveals five main themes with high within-cluster similarity, and between-cluster differences are significant relative to within-cluster differences at a significance level of 0.01. Overall, this experiment supports effective sharing of disaster data resources and enhances natural disaster emergency response efficiency. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
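Note: the Word2vec-Kmeans clustering step described in entry 2 (and its duplicate record, entry 9) can be illustrated with a minimal sketch, assuming gensim and scikit-learn; the toy metadata phrases, vector size, and cluster count below are illustrative and are not the paper's settings.

```python
# Sketch: cluster short metadata phrases by averaging Word2vec vectors and running KMeans.
# Assumptions: gensim and scikit-learn are installed; the phrases below are illustrative only.
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

phrases = [
    "flood water level warning", "flood rainfall intensity", "earthquake magnitude depth",
    "earthquake epicenter location", "typhoon wind speed track", "typhoon landfall time",
]
tokenized = [p.split() for p in phrases]

# Train a small Word2vec model on the tokenized phrases (toy-sized parameters).
w2v = Word2Vec(sentences=tokenized, vector_size=50, window=3, min_count=1, epochs=50, seed=1)

# Represent each phrase as the mean of its word vectors.
X = np.array([np.mean([w2v.wv[t] for t in toks], axis=0) for toks in tokenized])

# Cluster the phrase vectors; k=3 is arbitrary here.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)
for phrase, label in zip(phrases, kmeans.labels_):
    print(label, phrase)
```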
3. Real-Time Security Risk Assessment From CCTV Using Hand Gesture Recognition
- Author
-
Murat Koca
- Subjects
CCTV footage, deep learning, cyber security, hand gesture recognition, media-pipe, metadata extraction, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
- Abstract
Closed-Circuit Television (CCTV) surveillance systems, long associated with physical security, are becoming more crucial when combined with cybersecurity measures. Combining traditional surveillance with cyber defenses is a flexible method for protecting against both physical and digital dangers. This study introduces the use of convolutional neural networks (CNNs) and hand gesture detection using CCTV data to perform real-time security risk assessments. The suggested method’s emphasis on automated extraction of key information, such as identity and behavior, illustrates its special use in silent or acoustically challenging settings. This study uses deep learning techniques to develop a novel approach for detecting hand gestures in CCTV images by automatically extracting relevant features using a media-pipe architecture. For instance, it facilitates risk assessment through the use of hand gestures in noisy environments or muted audio streams. Given this method’s uniqueness and efficiency, the suggested solution will be able to alert appropriate authorities in the event of a security breach. There seems to be considerable opportunity for the development of applications in several domains of security, law enforcement, and public safety, including but not limited to shopping malls, educational institutions, transportation, the armed forces, theft, abduction, etc.
- Published
- 2024
- Full Text
- View/download PDF
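Note: the hand-landmark detection step in entry 3 can be sketched with OpenCV and MediaPipe's Hands solution; the video path is a placeholder, and the paper's CNN-based risk scoring is not reproduced here.

```python
# Sketch: detect hand landmarks frame-by-frame from a video file with MediaPipe Hands.
# "cctv_sample.mp4" is a placeholder path; the risk-assessment model from the paper is not included.
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands

cap = cv2.VideoCapture("cctv_sample.mp4")
with mp_hands.Hands(static_image_mode=False, max_num_hands=2,
                    min_detection_confidence=0.5) as hands:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB input; OpenCV reads BGR.
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_hand_landmarks:
            for hand in results.multi_hand_landmarks:
                # 21 landmarks per hand; these could feed a gesture classifier.
                wrist = hand.landmark[0]
                print(f"hand detected, wrist at ({wrist.x:.2f}, {wrist.y:.2f})")
cap.release()
```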
4. NAA-LIMS: a laboratory information management system for neutron activation analysis at the Peruvian institute of nuclear energy
- Author
-
Rivas, Jherson and Bedregal, Patricia
- Published
- 2024
- Full Text
- View/download PDF
5. Generic features selection for structure classification of diverse styled scholarly articles.
- Author
-
Waqas, Muhammad and Anjum, Nadeem
- Abstract
The enormous growth in online research publications in diversified domains has attracted the research community to extract these valuable scientific resources by searching online digital libraries and publishers' websites. A precise search is desired to list the most related articles by applying semantic queries to the document's metadata and structural elements. The online search engines and digital libraries offer only keyword-based search on full-body text, which creates excessive results. Therefore, the research article's structural and metadata information has to be stored in machine-comprehensible form by the online research publishers. The research community in recent years has adopted different approaches to extract structural information from research documents, such as rule-based heuristics and machine-learning-based approaches. Studies suggest that machine-learning-based techniques have produced optimum results for document structure extraction from publishers having diversified publication layouts. In this paper, we have proposed thirteen different logical layout structural (LLS) components. We have identified a two-staged innovative set of generic features that are associated with the LLS. This approach has given our technique an advantage against the state-of-the-art for structural classification of digital scientific articles with diversified publication styles. We have applied chi-square (χ²) for feature selection, and the final result has revealed that a kernel-function SVM produced an optimum result with an overall F-measure of 0.95. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
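Note: the chi-square feature selection plus SVM combination reported in entry 5 can be sketched as a scikit-learn pipeline; the tiny text snippets and labels below stand in for the paper's thirteen LLS classes and are purely illustrative.

```python
# Sketch: chi-square feature selection followed by an RBF-kernel SVM, as a scikit-learn pipeline.
# The tiny text snippets and four labels stand in for the paper's thirteen LLS classes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

texts = ["Abstract This paper presents", "1 Introduction Recent work",
         "References [1] J. Smith", "Table 2 Results of the experiment"]
labels = ["abstract", "section_heading", "references", "table_caption"]

clf = Pipeline([
    ("vect", CountVectorizer()),             # bag-of-words features
    ("chi2", SelectKBest(chi2, k=10)),       # keep the 10 highest-scoring features
    ("svm", SVC(kernel="rbf", C=1.0)),       # kernel SVM classifier
])
clf.fit(texts, labels)
print(clf.predict(["2 Related Work Prior studies"]))
```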
6. Event and Entity Extraction from Generated Video Captions
- Author
-
Scherer, Johannes, Bhowmik, Deepayan, and Scherp, Ansgar; in: Holzinger, Andreas, Kieseberg, Peter, Cabitza, Federico, Campagner, Andrea, Tjoa, A Min, and Weippl, Edgar, editors
- Published
- 2023
- Full Text
- View/download PDF
7. Deep Learning Approaches for Big Data-Driven Metadata Extraction in Online Job Postings.
- Author
-
Skondras, Panagiotis, Zotos, Nikos, Lagios, Dimitris, Zervas, Panagiotis, Giotopoulos, Konstantinos C., and Tzimas, Giannis
- Subjects
JOB postings, DEEP learning, MACHINE learning, FEEDFORWARD neural networks, LANGUAGE models, METADATA, JOB classification
- Abstract
This article presents a study on the multi-class classification of job postings using machine learning algorithms. With the growth of online job platforms, there has been an influx of labor market data. Machine learning, particularly NLP, is increasingly used to analyze and classify job postings. However, the effectiveness of these algorithms largely hinges on the quality and volume of the training data. In our study, we propose a multi-class classification methodology for job postings, drawing on AI models such as text-davinci-003 and the quantized versions of Falcon 7b (Falcon), Wizardlm 7B (Wizardlm), and Vicuna 7B (Vicuna) to generate synthetic datasets. These synthetic data are employed in two use-case scenarios: (a) exclusively as training datasets composed of synthetic job postings (situations where no real data is available) and (b) as an augmentation method to bolster underrepresented job title categories. To evaluate our proposed method, we relied on two well-established approaches: the feedforward neural network (FFNN) and the BERT model. Both the use cases and training methods were assessed against a genuine job posting dataset to gauge classification accuracy. Our experiments substantiated the benefits of using synthetic data to enhance job posting classification. In the first scenario, the models' performance matched, and occasionally exceeded, that of the real data. In the second scenario, the augmented classes outperformed in most instances. This research confirms that AI-generated datasets can enhance the efficacy of NLP algorithms, especially in the domain of multi-class classification of job postings. While data augmentation can boost model generalization, its impact varies. It is especially beneficial for simpler models like the FFNN. BERT, due to its context-aware architecture, also benefits from augmentation but sees limited improvement. Selecting the right type and amount of augmentation is essential. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
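Note: the BERT-based classification of (possibly synthetic) job postings in entry 7 can be sketched with Hugging Face transformers; the postings, label set, and training hyperparameters below are illustrative, and bert-base-uncased stands in for whatever checkpoint the authors actually used.

```python
# Sketch: fine-tune a BERT sequence classifier on a handful of job postings.
# Dataset contents, label meanings, and hyperparameters are illustrative only.
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

postings = ["Develop REST APIs in Java and Spring", "Prepare monthly financial statements",
            "Design marketing campaigns for social media"]
labels = [0, 1, 2]  # e.g. 0=software_engineer, 1=accountant, 2=marketing_specialist (made up)

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

enc = tok(postings, padding=True, truncation=True, return_tensors="pt")
dataset = TensorDataset(enc["input_ids"], enc["attention_mask"], torch.tensor(labels))
loader = DataLoader(dataset, batch_size=2, shuffle=True)

optim = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(2):                      # a couple of epochs for the toy example
    for input_ids, attention_mask, y in loader:
        out = model(input_ids=input_ids, attention_mask=attention_mask, labels=y)
        out.loss.backward()
        optim.step()
        optim.zero_grad()
print("training loss:", out.loss.item())
```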
8. Generating Synthetic Resume Data with Large Language Models for Enhanced Job Description Classification †.
- Author
-
Skondras, Panagiotis, Zervas, Panagiotis, and Tzimas, Giannis
- Subjects
LANGUAGE models, MACHINE learning, JOB classification, TRANSFORMER models, JOB descriptions, NATURAL language processing, FEEDFORWARD neural networks
- Abstract
In this article, we investigate the potential of synthetic resumes as a means for the rapid generation of training data and their effectiveness in data augmentation, especially in categories marked by sparse samples. The widespread implementation of machine learning algorithms in natural language processing (NLP) has notably streamlined the resume classification process, delivering time and cost efficiencies for hiring organizations. However, the performance of these algorithms depends on the abundance of training data. While selecting the right model architecture is essential, it is also crucial to ensure the availability of a robust, well-curated dataset. For many categories in the job market, data sparsity remains a challenge. To deal with this challenge, we employed the OpenAI API to generate both structured and unstructured resumes tailored to specific criteria. These synthetically generated resumes were cleaned, preprocessed, and then utilized to train two distinct models: a transformer model (BERT) and a feedforward neural network (FFNN) that incorporated Universal Sentence Encoder 4 (USE4) embeddings. Both models were evaluated on the multiclass classification task of resumes; when trained on an augmented dataset containing 60 percent real data (from the Indeed website) and 40 percent synthetic data from ChatGPT, the transformer model presented exceptional accuracy, while the FFNN, predictably, achieved lower accuracy. These findings highlight the value of augmenting real-world data with ChatGPT-generated synthetic resumes, especially in the context of limited training data. The suitability of the BERT model for such classification tasks further reinforces this narrative. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
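Note: the USE4-plus-feedforward-network setup in entry 8 can be sketched with TensorFlow Hub and Keras; the resume snippets and three categories are placeholders, and only the public universal-sentence-encoder/4 handle is assumed.

```python
# Sketch: embed resume texts with Universal Sentence Encoder 4 and train a small feedforward classifier.
# Resume snippets and the three job categories are illustrative placeholders.
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

use4 = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

resumes = ["5 years of Python backend development", "Certified public accountant, audit experience",
           "Managed paid social campaigns and SEO"]
labels = np.array([0, 1, 2])

X = use4(resumes).numpy()                    # 512-dimensional sentence embeddings

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(512,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(X, labels, epochs=10, verbose=0)
print(model.predict(use4(["Built microservices in Go"]).numpy()))
```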
9. Automatic Extraction and Cluster Analysis of Natural Disaster Metadata Based on the Unified Metadata Framework
- Author
-
Zongmin Wang, Xujie Shi, Haibo Yang, Bo Yu, and Yingchun Cai
- Subjects
metadata extraction, UIE, natural disaster, Word2vec-Kmeans clustering, Geography (General), G1-922
- Abstract
The development of information technology has led to massive, multidimensional, and heterogeneously sourced disaster data. However, there is currently no universal metadata standard for managing natural disasters, and common pre-trained models for information extraction require extensive training data yet show limited effectiveness when annotated resources are scarce. This study establishes a unified natural disaster metadata standard, utilizes self-trained universal information extraction (UIE) models and Python libraries to extract metadata stored in both structured and unstructured forms, and analyzes the results using the Word2vec-Kmeans clustering algorithm. The results show that (1) the self-trained UIE model, with a learning rate of 3 × 10⁻⁴ and a batch_size of 32, significantly improves extraction results for various natural disasters by over 50%, and the optimized UIE model outperforms many other extraction methods in terms of precision, recall, and F1 scores. (2) The quality assessments of consistency, completeness, and accuracy for ten tables all exceed 0.80, with variances across the three dimensions of 0.04, 0.03, and 0.05; the overall evaluation of the tables' data items also exceeds 0.80, consistent with the table-level results, and the metadata model framework constructed in this study demonstrates high and stable quality. (3) Taking the flood dataset as an example, clustering reveals five main themes with high within-cluster similarity, and between-cluster differences are significant relative to within-cluster differences at a significance level of 0.01. Overall, this experiment supports effective sharing of disaster data resources and enhances natural disaster emergency response efficiency.
- Published
- 2024
- Full Text
- View/download PDF
10. A hybrid strategy to extract metadata from scholarly articles by utilizing support vector machine and heuristics.
- Author
-
Waqas, Muhammad, Anjum, Nadeem, and Afzal, Muhammad Tanvir
- Abstract
The immense growth in online research publications has attracted the research community to extract valuable information from scientific resources by exploring online digital libraries and publishers' websites. Metadata stored in a machine-comprehensible form can facilitate a precise search that lists the most related articles by applying semantic queries to the document's metadata and structural elements. The online search engines and digital libraries offer only keyword-based search on full-body text, which creates excessive results. The research community in recent years has adopted different approaches to extract structural information from research documents. We have distributed the content of an article into two logical layouts and metadata levels. This strategy gives our technique an advantage over the state-of-the-art (SOTA) in extracting metadata from articles with diversified publication styles. The experimental results have revealed that the proposed approach achieves a significant performance gain of 20.26% to 27.14%. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
11. Metadata Extraction Using Machine Learning (Ekstrakcija metapodatkov s pomočjo strojnega učenja).
- Author
-
SABADIN, Ivančica
- Published
- 2023
- Full Text
- View/download PDF
12. Annotated Open Corpus Construction and BERT-Based Approach for Automatic Metadata Extraction From Korean Academic Papers
- Author
-
Hyesoo Kong, Hwamook Yoon, Jaewook Seol, Mihwan Hyun, Hyejin Lee, Soonyoung Kim, and Wonjun Choi
- Subjects
BERT, corpus construction, metadata extraction, transfer learning, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
- Abstract
With the accelerating development of science and technology, the academic papers being published in various fields are increasing rapidly. Academic papers, especially in science and technology fields, are a crucial medium for researchers who identify the latest technological trends, develop new technologies, and conduct derivative studies. Therefore, the continual collection of extensive academic papers, structuring of metadata, and construction of databases are significant tasks. However, research on automatic metadata extraction from Korean papers is currently not being actively conducted owing to insufficient Korean training data. We automatically constructed the largest labeled corpus in South Korea to date from 315,320 PDF papers belonging to 503 Korean academic journals; this labeled corpus can be used to train models that automatically extract 12 metadata types from PDF papers. This labeled corpus is available at https://doi.org/10.23057/48. Moreover, we developed an inspection process and guidelines for the automatically constructed data and performed a full inspection of the validation and testing data. The reliability of the inspected data was verified through inter-annotator agreement measurement. Using our corpus, we trained and evaluated a BERT-based transfer learning model to verify its reliability. Furthermore, we proposed new training methods that can improve the metadata extraction performance on Korean papers, and through these methods, we developed the KorSciBERT-ME-J and KorSciBERT-ME-J+C models. KorSciBERT-ME-J showed the highest performance, with an F1 score of 99.36%, as well as robust performance in automatic metadata extraction from Korean academic papers in various formats.
- Published
- 2023
- Full Text
- View/download PDF
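Note: metadata extraction of the kind described in entry 12 is commonly framed as token classification; a minimal Hugging Face sketch follows, with bert-base-multilingual-cased standing in for the paper's KorSciBERT models and an illustrative BIO tag set (the model here is untrained, so the predicted tags are meaningless until fine-tuning).

```python
# Sketch: metadata extraction as token classification (BIO tags for title/author, etc.).
# "bert-base-multilingual-cased" stands in for the paper's KorSciBERT models; tags are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tags = ["O", "B-TITLE", "I-TITLE", "B-AUTHOR", "I-AUTHOR"]
tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(tags))

text = "Deep Learning for Document Analysis Hyesoo Kong"
enc = tok(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits             # shape (1, seq_len, num_labels); untrained here

pred = logits.argmax(dim=-1)[0].tolist()
tokens = tok.convert_ids_to_tokens(enc["input_ids"][0])
for t, p in zip(tokens, pred):
    print(t, tags[p])                        # after fine-tuning, these become metadata labels
```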
13. Using Provenance in Data Analytics for Seismology: Challenges and Directions
- Author
-
da Costa, Umberto Souza, Espinosa-Oviedo, Javier Alfonso, Musicante, Martin A., Vargas-Solar, Genoveva, and Zechinelli-Martini, José-Luis; in: Chiusano, Silvia, Cerquitelli, Tania, Wrembel, Robert, Nørvåg, Kjetil, Catania, Barbara, Vargas-Solar, Genoveva, and Zumpano, Ester, editors
- Published
- 2022
- Full Text
- View/download PDF
14. Multi-perspective Approach for Curating and Exploring the History of Climate Change in Latin America within Digital Newspapers.
- Author
-
Vargas-Solar, Genoveva, Zechinelli-Martini, José-Luis, A. Espinosa-Oviedo, Javier, and M. Vilches-Blázquez, Luis
- Abstract
This paper introduces a multi-perspective approach to deal with curation and exploration issues in historical newspapers. It has been implemented in the LACLICHEV platform (Latin American Climate Change Evolution platform). Exploring the history of climate change through digitized newspapers published around two centuries ago introduces four challenges: (1) curating content for tracking entries describing meteorological events; (2) processing (digging into) colloquial language (and its geographic variations) for extracting meteorological events; (3) analyzing newspapers to discover meteorological patterns possibly associated with climate change; (4) designing tools for exploring the extracted content. LACLICHEV provides tools for curating, exploring, and analyzing historical newspaper articles, their description and location, and the vocabularies used for referring to meteorological events. This platform makes it possible to understand and identify possible patterns and models that can build an empirical and social view of the history of climate change in the Latin American region. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
15. Re-purposing Excavation Database Content as Paradata: An Explorative Analysis of Paradata Identification Challenges and Opportunities
- Author
-
Lisa Börjesson, Olle Sköld, Zanna Friberg, Daniel Löwenborg, Gísli Pálsson, and Isto Huvila
- Subjects
metadata, paradata, metadata extraction, data reuse, research data, unstructured data, archaeological data, Bibliography. Library science. Information resources
- Abstract
Although data reusers request information about how research data was created and curated, this information is often non-existent or only briefly covered in data descriptions. The need for such contextual information is particularly critical in fields like archaeology, where old legacy data created during different time periods and through varying methodological framings and fieldwork documentation practices retains its value as an important information source. This article explores the presence of contextual information in archaeological data with a specific focus on data provenance and processing information, i.e., paradata. The purpose of the article is to identify and explicate types of paradata in field observation documentation. The method used is an explorative close reading of field data from an archaeological excavation enriched with geographical metadata. The analysis covers technical and epistemological challenges and opportunities in paradata identification, and discusses the possibility of using identified paradata in data descriptions and for data reliability assessments. Results show that it is possible to identify both knowledge organisation paradata (KOP) relating to data structuring and knowledge-making paradata (KMP) relating to fieldwork methods and interpretative processes. However, while the data contains many traces of the research process, there is an uneven and, in some categories, low level of structure and systematicity that complicates automated metadata and paradata identification and extraction. The results show a need to broaden the understanding of how structure and systematicity are used and how they impact research data in archaeology and comparable field sciences. The insight into how a dataset's KOP and KMP can be read is also a methodological contribution to data literacy research and practice development. On a repository level, the results underline the need to include paradata about dataset creation, purpose, terminology, dataset internal and external relations, and eventual data colloquialisms that require explanation to reusers.
- Published
- 2022
- Full Text
- View/download PDF
16. Generating Synthetic Resume Data with Large Language Models for Enhanced Job Description Classification
- Author
-
Panagiotis Skondras, Panagiotis Zervas, and Giannis Tzimas
- Subjects
metadata extraction, resumes, CV, big data, multiclass classification, ChatGPT, Information technology, T58.5-58.64
- Abstract
In this article, we investigate the potential of synthetic resumes as a means for the rapid generation of training data and their effectiveness in data augmentation, especially in categories marked by sparse samples. The widespread implementation of machine learning algorithms in natural language processing (NLP) has notably streamlined the resume classification process, delivering time and cost efficiencies for hiring organizations. However, the performance of these algorithms depends on the abundance of training data. While selecting the right model architecture is essential, it is also crucial to ensure the availability of a robust, well-curated dataset. For many categories in the job market, data sparsity remains a challenge. To deal with this challenge, we employed the OpenAI API to generate both structured and unstructured resumes tailored to specific criteria. These synthetically generated resumes were cleaned, preprocessed, and then utilized to train two distinct models: a transformer model (BERT) and a feedforward neural network (FFNN) that incorporated Universal Sentence Encoder 4 (USE4) embeddings. Both models were evaluated on the multiclass classification task of resumes; when trained on an augmented dataset containing 60 percent real data (from the Indeed website) and 40 percent synthetic data from ChatGPT, the transformer model presented exceptional accuracy, while the FFNN, predictably, achieved lower accuracy. These findings highlight the value of augmenting real-world data with ChatGPT-generated synthetic resumes, especially in the context of limited training data. The suitability of the BERT model for such classification tasks further reinforces this narrative.
- Published
- 2023
- Full Text
- View/download PDF
17. Deep Learning Approaches for Big Data-Driven Metadata Extraction in Online Job Postings
- Author
-
Panagiotis Skondras, Nikos Zotos, Dimitris Lagios, Panagiotis Zervas, Konstantinos C. Giotopoulos, and Giannis Tzimas
- Subjects
metadata extraction, online job postings, big data, web crawling, data preprocessing, ChatGPT, Information technology, T58.5-58.64
- Abstract
This article presents a study on the multi-class classification of job postings using machine learning algorithms. With the growth of online job platforms, there has been an influx of labor market data. Machine learning, particularly NLP, is increasingly used to analyze and classify job postings. However, the effectiveness of these algorithms largely hinges on the quality and volume of the training data. In our study, we propose a multi-class classification methodology for job postings, drawing on AI models such as text-davinci-003 and the quantized versions of Falcon 7b (Falcon), Wizardlm 7B (Wizardlm), and Vicuna 7B (Vicuna) to generate synthetic datasets. These synthetic data are employed in two use-case scenarios: (a) exclusively as training datasets composed of synthetic job postings (situations where no real data is available) and (b) as an augmentation method to bolster underrepresented job title categories. To evaluate our proposed method, we relied on two well-established approaches: the feedforward neural network (FFNN) and the BERT model. Both the use cases and training methods were assessed against a genuine job posting dataset to gauge classification accuracy. Our experiments substantiated the benefits of using synthetic data to enhance job posting classification. In the first scenario, the models' performance matched, and occasionally exceeded, that of the real data. In the second scenario, the augmented classes outperformed in most instances. This research confirms that AI-generated datasets can enhance the efficacy of NLP algorithms, especially in the domain of multi-class classification of job postings. While data augmentation can boost model generalization, its impact varies. It is especially beneficial for simpler models like the FFNN. BERT, due to its context-aware architecture, also benefits from augmentation but sees limited improvement. Selecting the right type and amount of augmentation is essential.
- Published
- 2023
- Full Text
- View/download PDF
18. Automatic Annotation of Images in Persian Scientific Documents Based on Text Analysis Methods
- Author
-
Azadeh fakhrzadeh, Mohadeseh Rahnama, and Jalal A Nasiri
- Subjects
image tagging, text analysis, image annotation, image retrieval, metadata extraction, information technology, Bibliography. Library science. Information resources
- Abstract
In this paper a new method for annotating images in Persian scientific documents is suggested. Images in scientific documents contain valuable information; in many cases, by analyzing images one can understand the main idea and important results of a document. Due to the explosive growth of image data, automatic image annotation has attracted extensive attention and become one of the growing subjects in the literature. Image annotation is the first step in image retrieval methods, in which descriptive tags are assigned to each image. Here, the text associated with each image is used for annotation: the caption and the part of the document that refers to the image are considered. Noun phrases in the associated text are ranked by five different methods: term frequency, inverse document frequency, term frequency–inverse document frequency, cosine similarity between word embeddings of noun phrases in the text and the caption, and a combination of the term frequency–inverse document frequency and cosine similarity methods. In every method, the image tags are the highest-ranked noun phrases. The suggested methods are evaluated on test data from the Iran scientific information database (Ganj), the main database of Persian scientific documents. The term frequency–inverse document frequency method gives the best results.
- Published
- 2022
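Note: the TF-IDF ranking of candidate noun phrases described in entry 18 can be sketched with scikit-learn; the documents and candidate phrases below are illustrative English stand-ins for the Persian text handled in the paper.

```python
# Sketch: rank candidate noun phrases for an image by TF-IDF against a small document collection.
# The documents and candidate phrases are illustrative; the paper works on Persian text from Ganj.
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "the confusion matrix of the proposed classifier is shown in figure three",
    "figure three compares precision and recall of the proposed classifier",
    "related work on image retrieval and metadata extraction",
]
candidates = ["confusion matrix", "proposed classifier", "related work", "figure"]

# Fit TF-IDF on the documents, allowing unigrams and bigrams so multi-word phrases get scores.
vec = TfidfVectorizer(ngram_range=(1, 2))
tfidf = vec.fit_transform(documents)
vocab = vec.vocabulary_

# Score each candidate by its TF-IDF weight in the caption/reference text (document 0 here).
scores = {}
for phrase in candidates:
    idx = vocab.get(phrase)
    scores[phrase] = tfidf[0, idx] if idx is not None else 0.0
for phrase, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{phrase}: {score:.3f}")
```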
19. Automatic Extraction and Classification of Metadata Using Deep Learning Algorithms (Üstverilerin Derin Öğrenme Algoritmaları Kullanılarak Otomatik Olarak Çıkartılması ve Sınıflanması)
- Author
-
Murat İnce
- Subjects
metadata extraction, convolutional neural networks, recurrent neural networks, Technology, Engineering (General). Civil engineering (General), TA1-2040, Science, Science (General), Q1-390
- Abstract
The spread of information technologies has increased the need for digital content, and creating such content is a time-consuming and costly process. Learning objects are used when building this content, and it is important for reusability and shareability that these objects be discoverable and readable by computers. For this reason, learning objects are used together with metadata that carries descriptive identifying information: the more accurately this metadata is created and classified, the more usable the objects become. Many methods have therefore been developed to extract metadata from objects automatically. In this study, metadata was automatically extracted from the content of learning objects and classified using deep learning methods such as Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) together with Natural Language Processing (NLP) techniques. The success and accuracy of the system were tested with sample learning objects, and the results showed that the system can be used successfully.
- Published
- 2021
- Full Text
- View/download PDF
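Note: entry 19 describes extracting and classifying metadata from learning-object content with CNN/RNN models; a toy Keras sketch of a recurrent text classifier follows, with illustrative sentences, vocabulary size, and two made-up classes.

```python
# Sketch: a tiny recurrent text classifier of the kind entry 19 describes (embedding + LSTM).
# The example sentences, vocabulary size, and two classes are illustrative only.
import numpy as np
import tensorflow as tf

texts = ["introduction to photosynthesis in plants", "quiz on quadratic equations",
         "lecture notes on cell biology", "exercise sheet for linear algebra"]
labels = np.array([0, 1, 0, 1])              # 0=biology content, 1=mathematics content

# Turn raw strings into fixed-length integer sequences.
vectorizer = tf.keras.layers.TextVectorization(max_tokens=1000, output_sequence_length=8)
vectorizer.adapt(texts)
X = vectorizer(np.array(texts))

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=1000, output_dim=32),
    tf.keras.layers.LSTM(16),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(X, labels, epochs=10, verbose=0)
print(model.predict(vectorizer(np.array(["worksheet on matrix multiplication"]))))
```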
20. Re-purposing Excavation Database Content as Paradata: An Explorative Analysis of Paradata Identification Challenges and Opportunities.
- Author
-
Börjesson, Lisa, Sköld, Olle, Friberg, Zanna, Löwenborg, Daniel, Palsson, Gisli, and Huvila, Isto
- Subjects
ARCHAEOLOGICAL databases, ARCHAEOLOGICAL excavations, INFORMATION retrieval, INFORMATION resources, DOCUMENTATION
- Abstract
Although data reusers request information about how research data was created and curated, this information is often non-existent or only briefly covered in data descriptions. The need for such contextual information is particularly critical in fields like archaeology, where old legacy data created during different time periods and through varying methodological framings and fieldwork documentation practices retains its value as an important information source. This article explores the presence of contextual information in archaeological data with a specific focus on data provenance and processing information, i.e., paradata. The purpose of the article is to identify and explicate types of paradata in field observation documentation. The method used is an explorative close reading of field data from an archaeological excavation enriched with geographical metadata. The analysis covers technical and epistemological challenges and opportunities in paradata identification, and discusses the possibility of using identified paradata in data descriptions and for data reliability assessments. Results show that it is possible to identify both knowledge organisation paradata (KOP) relating to data structuring and knowledge-making paradata (KMP) relating to fieldwork methods and interpretative processes. However, while the data contains many traces of the research process, there is an uneven and, in some categories, low level of structure and systematicity that complicates automated metadata and paradata identification and extraction. The results show a need to broaden the understanding of how structure and systematicity are used and how they impact research data in archaeology and comparable field sciences. The insight into how a dataset's KOP and KMP can be read is also a methodological contribution to data literacy research and practice development. On a repository level, the results underline the need to include paradata about dataset creation, purpose, terminology, dataset internal and external relations, and eventual data colloquialisms that require explanation to reusers. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
21. LAME: Layout-Aware Metadata Extraction Approach for Research Articles.
- Author
-
Jongyun Choi, Hyesoo Kong, Hwamook Yoon, Heungseon Oh, and Yuchul Jung
- Subjects
METADATA, SCHOLARLY periodicals, CONFERENCE papers, ACADEMIC conferences
- Abstract
The volume of academic literature, such as academic conference papers and journals, has increased rapidly worldwide, and research on metadata extraction is ongoing. However, high-performing metadata extraction is still challenging due to the diverse layout formats used by journal publishers. To accommodate the diversity of academic journal layouts, we propose a novel LAyout-aware Metadata Extraction (LAME) framework with three components: automatic layout analysis, construction of a large metadata training set, and a metadata extractor. In the framework, we designed an automatic layout analysis using PDFMiner. Based on the layout analysis, a large volume of metadata-separated training data, including the title, abstract, author names, author-affiliated organizations, and keywords, was automatically extracted. Moreover, we constructed a pre-trained model, Layout-MetaBERT, to extract metadata from academic journals with varying layout formats. The experimental results with our metadata extractor exhibited robust performance (Macro-F1, 93.27%) in metadata extraction for unseen journals with different layout formats. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
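Note: the PDFMiner-based layout analysis that LAME (entry 21) starts from can be sketched with pdfminer.six; "paper.pdf" is a placeholder path, and the Layout-MetaBERT labeling step is not reproduced here.

```python
# Sketch: walk a PDF's layout with pdfminer.six and print each text box with its bounding box.
# "paper.pdf" is a placeholder path; LAME's Layout-MetaBERT labeling step is not included.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer

for page_layout in extract_pages("paper.pdf", maxpages=1):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            x0, y0, x1, y1 = element.bbox        # coordinates in PDF points
            text = element.get_text().strip()
            # Position cues like these feed metadata classifiers (title, authors, abstract, ...).
            print(f"({x0:.0f}, {y0:.0f}, {x1:.0f}, {y1:.0f}) {text[:60]!r}")
```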
22. Text and metadata extraction from scanned Arabic documents using support vector machines.
- Author
-
Qin, Wenda, Elanwar, Randa, and Betke, Margrit
- Subjects
SUPPORT vector machines, SUPERVISED learning, DOCUMENT imaging systems, METADATA, IMAGE analysis
- Abstract
Text information in scanned documents becomes accessible only when extracted and interpreted by a text recognizer. For a recognizer to work successfully, it must have detailed location information about the regions of the document images that it is asked to analyse. It needs to focus on page regions that contain text, skipping non-text regions such as illustrations or photographs. However, text recognizers do not work as logical analyzers. Logical layout analysis automatically determines the function of a document text region, that is, it labels each region as a title, paragraph, caption, and so on, and thus is an essential part of a document understanding system. In the past, rule-based algorithms have been used to conduct logical layout analysis, using limited-size data sets. Here, we instead focus on supervised learning methods for logical layout analysis. We describe LABA, a system based on multiple support vector machines that performs logical Layout Analysis of scanned Book pages in Arabic. The system detects the function of a text region based on the analysis of various image features and a voting mechanism. For a baseline comparison, we implemented an older but state-of-the-art neural network method. We evaluated LABA using a data set of scanned pages from illustrated Arabic books and obtained high recall and precision values. We also found that the F-measure of LABA is higher for five of the six tested classes compared to the state-of-the-art method. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
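Note: the multiple-SVM region labeling idea behind LABA (entry 22) can be sketched at small scale with scikit-learn's one-vs-rest SVC; the geometric feature vectors and region labels below are illustrative, not the paper's actual features.

```python
# Sketch: label page regions from simple geometric features with linear SVMs (one-vs-rest).
# Feature vectors are [x_center, y_center, width, height, text_density], all illustrative.
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X = np.array([
    [0.5, 0.95, 0.8, 0.05, 0.3],   # wide box near the top of the page
    [0.5, 0.50, 0.9, 0.60, 0.9],   # large dense block in the middle
    [0.5, 0.08, 0.6, 0.04, 0.5],   # short line near the bottom
])
y = np.array(["title", "paragraph", "caption"])

# One binary SVM per class, mirroring the multiple-SVM-with-voting idea at toy scale.
clf = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
print(clf.predict([[0.5, 0.93, 0.7, 0.05, 0.4]]))
```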
23. A Method for Labeling Images in... (ارائة روشی برای برچسب زدن تصاویر موجود در...)
- Author
-
Azadeh Fakhrzadeh, Mohadeseh Rahnama, and Jalal A. Nasiri
- Subjects
SCIENCE databases, NOUN phrases (Grammar), IMAGE retrieval, ANNOTATIONS, INFORMATION technology
- Abstract
In this paper a new method for annotating images in Persian scientific documents is suggested. Images in scientific documents contain valuable information; in many cases, by analyzing images one can understand the main idea and important results of a document. Due to the explosive growth of image data, automatic image annotation has attracted extensive attention and become one of the growing subjects in the literature. Image annotation is the first step in image retrieval methods, in which descriptive tags are assigned to each image. Here, the text associated with each image is used for annotation: the caption and the part of the document that refers to the image are considered. Noun phrases in the associated text are ranked by five different methods: term frequency, inverse document frequency, term frequency–inverse document frequency, cosine similarity between word embeddings of noun phrases in the text and the caption, and a combination of the term frequency–inverse document frequency and cosine similarity methods. In every method, the image tags are the highest-ranked noun phrases. The suggested methods are evaluated on test data from the Iran scientific information database (Ganj), the main database of Persian scientific documents. The term frequency–inverse document frequency method gives the best results. [ABSTRACT FROM AUTHOR]
- Published
- 2022
24. Automatic Extraction and Classification of Metadata Using Deep Learning Algorithms (Üstverilerin Derin Öğrenme Algoritmaları Kullanılarak Otomatik Olarak Çıkartılması ve Sınıflanması).
- Author
-
İNCE, Murat
- Published
- 2021
- Full Text
- View/download PDF
25. An Efficient Framework for Algorithmic Metadata Extraction over Scholarly Documents Using Deep Neural Networks
- Author
-
Raghavendra Nayaka, P. and Ranjan, Rajeev
- Published
- 2023
- Full Text
- View/download PDF
26. Extracting enhanced artificial intelligence model metadata from software repositories
- Author
-
Tsay, Jason, Braz, Alan, Hirzel, Martin, Shinnar, Avraham, and Mummert, Todd
- Published
- 2022
- Full Text
- View/download PDF
27. IndeGx: A Model and a Framework for Indexing RDF Knowledge Graphs with SPARQL-based Test Suits
- Author
-
Pierre Maillot, Olivier Corby, Catherine Faron, Fabien Gandon, and Franck Michel (Université Côte d'Azur, Inria, CNRS, I3S, WIMMICS team)
- Subjects
Human-Computer Interaction, knowledge graph, Computer Networks and Communications, [INFO.INFO-IR] Computer Science [cs]/Information Retrieval [cs.IR], endpoint description, [INFO.INFO-WB] Computer Science [cs]/Web, dataset description, metadata extraction, semantic index, Software, [INFO.INFO-SI] Computer Science [cs]/Social and Information Networks [cs.SI]
- Abstract
In recent years, a large number of RDF datasets have been built and published on the Web in fields as diverse as linguistics or life sciences, as well as general datasets such as DBpedia or Wikidata. The joint exploitation of these datasets requires specific knowledge about their content, access points, and commonalities. However, not all datasets contain a self-description, and not all access points can handle the complex queries used to generate such a description. In this article, we provide a standard-based approach to generate the description of a dataset. The generated descriptions as well as the process of their computation are expressed using standard vocabularies and languages. We implemented our approach into a framework, called IndeGx, where each indexing feature and its computation is collaboratively and declaratively defined in a GitHub repository. We have experimented IndeGx on a set of 339 RDF datasets with endpoints listed in public catalogs, over 8 months. The results show that we can collect, as much as possible, important characteristics of the datasets depending on their availability and capacities. The resulting index captures the commonalities, variety and disparity in the offered content and services and it provides an important support to any application designed to query RDF datasets.
- Published
- 2023
- Full Text
- View/download PDF
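Note: IndeGx (entries 27 and 28) computes dataset descriptions by running SPARQL queries against endpoints; a minimal sketch with SPARQLWrapper follows, using the public DBpedia endpoint purely as an example target (a heavy aggregation like this may time out on large endpoints), not the IndeGx framework itself.

```python
# Sketch: compute one simple dataset characteristic (most-used classes) over a SPARQL endpoint,
# in the spirit of IndeGx's SPARQL-based indexing. The DBpedia endpoint is just an example target.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://dbpedia.org/sparql")
endpoint.setQuery("""
    SELECT ?class (COUNT(?s) AS ?count)
    WHERE { ?s a ?class }
    GROUP BY ?class ORDER BY DESC(?count) LIMIT 5
""")
endpoint.setReturnFormat(JSON)

results = endpoint.query().convert()
for row in results["results"]["bindings"]:
    print(row["class"]["value"], row["count"]["value"])
```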
28. IndeGx: A model and a framework for indexing RDF knowledge graphs with SPARQL-based test suits.
- Author
-
Maillot, Pierre, Corby, Olivier, Faron, Catherine, Gandon, Fabien, and Michel, Franck
- Abstract
In recent years, a large number of RDF datasets have been built and published on the Web in fields as diverse as linguistics or life sciences, as well as general datasets such as DBpedia or Wikidata. The joint exploitation of these datasets requires specific knowledge about their content, access points, and commonalities. However, not all datasets contain a self-description, and not all access points can handle the complex queries used to generate such a description. In this article, we provide a standard-based approach to generate the description of a dataset. The generated descriptions as well as the process of their computation are expressed using standard vocabularies and languages. We implemented our approach into a framework, called IndeGx, where each indexing feature and its computation is collaboratively and declaratively defined in a GitHub repository. We have experimented IndeGx on a set of 339 RDF datasets with endpoints listed in public catalogs, over 8 months. The results show that we can collect, as much as possible, important characteristics of the datasets depending on their availability and capacities. The resulting index captures the commonalities, variety and disparity in the offered content and services and it provides an important support to any application designed to query RDF datasets. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF