Using a secure, continually updating web-source processing pipeline to support the real-time data synthesis and analysis of scientific literature: development and validation study
- Source :
- Journal of Medical Internet Research, Vol 23, Iss 5, p e25714 (2021)
- Publication Year :
- 2021
- Publisher :
- JMIR Publications, 2021.
Abstract
- Background: The scale and quality of the global scientific response to the COVID-19 pandemic have unquestionably saved lives. However, the COVID-19 pandemic has also triggered an unprecedented “infodemic”; the velocity and volume of data production have overwhelmed many key stakeholders such as clinicians and policy makers, as they have been unable to process structured and unstructured data for evidence-based decision making. Solutions that aim to alleviate this data synthesis–related challenge are unable to capture heterogeneous web data in real time for the production of concomitant answers and are not based on the high-quality information in responses to a free-text query.
- Objective: The main objective of this project is to build a generic, real-time, continuously updating curation platform that can support the data synthesis and analysis of a scientific literature framework. Our secondary objective is to validate this platform and the curation methodology for COVID-19–related medical literature by expanding the COVID-19 Open Research Dataset via the addition of new, unstructured data.
- Methods: To create an infrastructure that addresses our objectives, the PanSurg Collaborative at Imperial College London has developed a unique data pipeline based on a web crawler extraction methodology. This data pipeline uses a novel curation methodology that adopts a human-in-the-loop approach for the characterization of quality, relevance, and key evidence across a range of scientific literature sources.
- Results: REDASA (Realtime Data Synthesis and Analysis) is now one of the world’s largest and most up-to-date sources of COVID-19–related evidence; it consists of 104,000 documents. By capturing curators’ critical appraisal methodologies through the discrete labeling and rating of information, REDASA rapidly developed a foundational, pooled, data science data set of over 1400 articles in under 2 weeks. These articles provide COVID-19–related information and represent around 10% of all papers about COVID-19.
- Conclusions: This data set can act as ground truth for the future implementation of a live, automated systematic review. The three benefits of REDASA’s design are as follows: (1) it adopts a user-friendly, human-in-the-loop methodology by embedding an efficient, user-friendly curation platform into a natural language processing search engine; (2) it provides a curated data set in the JavaScript Object Notation format for experienced academic reviewers’ critical appraisal choices and decision-making methodologies; and (3) due to the wide scope and depth of its web crawling method, REDASA has already captured one of the world’s largest COVID-19–related data corpora for searches and curation.
- Subjects :
- 020205 medical informatics
Computer science
Datasets as Topic
02 engineering and technology
Scientific literature
infrastructure
0302 clinical medicine
Open research
infodemic
0202 electrical engineering, electronic engineering, information engineering
030212 general & internal medicine
Longitudinal Studies
database
11 Medical and Health Sciences
pipeline
Unstructured data
methodology
17 Psychology and Cognitive Sciences
web crawl data
critical analysis
PanSurg REDASA Curators
data
Data Interpretation, Statistical
Public aspects of medicine
RA1-1270
Life Sciences & Biomedicine
data synthesis
Computer applications to medicine. Medical informatics
R858-859.7
Health Informatics
structured data synthesis
JavaScript
decision making
03 medical and health sciences
Humans
Relevance (information retrieval)
Real-time data
misinformation
Natural Language Processing
Original Paper
Internet
Science & Technology
research
SARS-CoV-2
literature
COVID-19
Data science
Search Engine
Health Care Sciences & Services
08 Information and Computing Sciences
Web crawler
Medical Informatics
Details
- Database :
- OpenAIRE
- Journal :
- Journal of Medical Internet Research, Vol 23, Iss 5, p e25714 (2021)
- Accession number :
- edsair.doi.dedup.....9b88b004954512e3575a2e44f4cc23fe