1. Vigi4Med Scraper: A Framework for Web Forum Structured Data Extraction and Semantic Representation
- Author
- Audeh, Bissan, Beigbeder, Michel, Zimmermann, Antoine, Jaillon, Philippe, Bousquet, Cédric, Choo, Kim-Kwang Raymond, Département Informatique et systèmes intelligents (FAYOL-ENSMSE), Ecole Nationale Supérieure des Mines de St Etienne, Laboratoire Hubert Curien [Saint Etienne] (LHC), Institut d'Optique Graduate School (IOGS)-Université Jean Monnet [Saint-Étienne] (UJM)-Centre National de la Recherche Scientifique (CNRS), Centre de Recherches en Cancérologie de Toulouse (CRCT), Université Toulouse III - Paul Sabatier (UT3), Université Fédérale Toulouse Midi-Pyrénées-Université Fédérale Toulouse Midi-Pyrénées-Institut National de la Santé et de la Recherche Médicale (INSERM)-Centre National de la Recherche Scientifique (CNRS), University of Glasgow, Ecole Nationale Supérieure des Mines de St Etienne (ENSM ST-ETIENNE), Laboratoire Hubert Curien (LHC), Institut d'Optique Graduate School (IOGS)-Université Jean Monnet - Saint-Étienne (UJM)-Centre National de la Recherche Scientifique (CNRS), Laboratoire d'Informatique Médicale et Ingénierie des Connaissances en e-Santé (LIMICS), and Université Paris 13 (UP13)-Université Pierre et Marie Curie - Paris 6 (UPMC)-Institut National de la Santé et de la Recherche Médicale (INSERM)
- Subjects
- Computer science; Download; Physiology; Social Sciences; lcsh:Medicine; 02 engineering and technology; Representation (arts); Infographics; Scraper site; Sociology; 0202 electrical engineering, electronic engineering, information engineering; Medicine and Health Sciences; Biomechanics; Computer Networks; lcsh:Science; Multidisciplinary; Data Collection; Applied Mathematics; Simulation and Modeling; computer.file_format; Semantics; Data model; Social Networks; Physical Sciences; 020201 artificial intelligence & image processing; Graphs; Algorithms; Network Analysis; Research Article; Computer and Information Sciences; Drug Research and Development; Research and Analysis Methods; Drug Safety; 020204 information systems; Humans; Social media; RDF; Structure (mathematical logic); Pharmacology; Internet; Information retrieval; Biological Locomotion; Data Visualization; lcsh:R; Biology and Life Sciences; Linguistics; [INFO.INFO-MO]Computer Science [cs]/Modeling and Simulation; Crawling; lcsh:Q; computer; Social Media; Mathematics
- Abstract
PLoS ONE, Volume 12, Issue 1, January 2017, Article number e0169658 (Article, Open Access)
Audeh, B. (a), Beigbeder, M. (a), Zimmermann, A. (a), Jaillon, P. (b), Bousquet, C. (c)
(a) University of Lyon, MINES Saint-Etienne, CNRS, Hubert Curien Laboratory, UMR 5516, Saint-Etienne, France
(b) Ecole Nationale Superieure des Mines de Saint-Etienne, Saint-Etienne, France
(c) INSERM, U1142, LIMICS, Paris, France
The extraction of information from social media is an essential yet complicated step for data analysis in multiple domains. In this paper, we present Vigi4Med Scraper, a generic open source framework for extracting structured data from web forums. Our framework is highly configurable; using a configuration file, the user can freely choose the data to extract from any web forum. The extracted data are anonymized and represented in a semantic structure using Resource Description Framework (RDF) graphs. This representation enables efficient manipulation by data analysis algorithms and allows the collected data to be directly linked to any existing semantic resource. To avoid server overload, an integrated proxy with caching functionality imposes a minimal delay between sequential requests. Vigi4Med Scraper represents the first step of Vigi4Med, a project to detect adverse drug reactions (ADRs) from social networks, funded by the French drug safety agency Agence Nationale de Securite du Medicament (ANSM). Vigi4Med Scraper has successfully extracted more than 200 gigabytes of data from the web forums of over 20 different websites.
© 2017 Audeh et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
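The abstract mentions a proxy with caching that imposes a minimal delay between sequential requests. The following Python sketch illustrates that general idea only; it is not the Vigi4Med Scraper implementation, and the class name `PoliteCache`, its parameters, and the injected `fetch` function are hypothetical names chosen for illustration.

```python
import time
from typing import Callable, Dict, Optional


class PoliteCache:
    """Illustrative cache that spaces out uncached requests (hypothetical, not the paper's code)."""

    def __init__(
        self,
        fetch: Callable[[str], str],          # assumed: any function mapping a URL to page content
        min_delay: float = 1.0,               # assumed: minimum seconds between two real requests
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._fetch = fetch
        self._min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._cache: Dict[str, str] = {}
        self._last_request: Optional[float] = None

    def get(self, url: str) -> str:
        # Cached pages are returned immediately and never reach the server.
        if url in self._cache:
            return self._cache[url]
        # For a real request, wait until min_delay has elapsed since the previous one.
        if self._last_request is not None:
            wait = self._min_delay - (self._clock() - self._last_request)
            if wait > 0:
                self._sleep(wait)
        body = self._fetch(url)
        self._last_request = self._clock()
        self._cache[url] = body
        return body
```

In this sketch the clock and sleep functions are injectable so the delay logic can be exercised without real waiting or network access; a real deployment would pass an HTTP client call as `fetch`.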
- Published
- 2017