1. Word embedding for French natural language in healthcare: a comparative study (Preprint)
- Author
Emeric Dynomant, Romain Lelong, Badisse Dahamna, Clément Massonaud, Gaétan Kerdelhué, Julien Grosjean, Stéphane Canu, Stefan J Darmoni, OmicX, Laboratoire d'Informatique, de Traitement de l'Information et des Systèmes (LITIS), Institut national des sciences appliquées Rouen Normandie (INSA Rouen Normandie), Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Université de Rouen Normandie (UNIROUEN), Normandie Université (NU)-Université Le Havre Normandie (ULH), Normandie Université (NU), Service d'informatique biomédicale [Rouen], CHU Rouen, Normandie Université (NU)-Normandie Université (NU)-Université de Rouen Normandie (UNIROUEN), Laboratoire d'Informatique Médicale et Ingénierie des Connaissances en e-Santé (LIMICS), Université Paris 13 (UP13)-Institut National de la Santé et de la Recherche Médicale (INSERM)-Sorbonne Université (SU), Equipe Apprentissage (DocApp - LITIS), and Normandie Université (NU)-Institut national des sciences appliquées Rouen Normandie (INSA Rouen Normandie)
- Subjects
data curation, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, [SDV.IB]Life Sciences [q-bio]/Bioengineering, 020201 artificial intelligence & image processing, data mining, 02 engineering and technology, natural language processing
- Abstract
BACKGROUND Word embedding technologies are now used in a wide range of applications. However, no formal evaluation or comparison has been made of the models produced by the three best-known implementations (Word2Vec, GloVe and FastText). OBJECTIVE The goal of this study is to compare embedding implementations on a corpus of documents produced in a working context by health professionals. METHODS Models were trained on documents from the Rouen university hospital. These data are unstructured and cover a wide range of documents produced in a clinical setting (discharge summaries, prescriptions, etc.). Four evaluation tasks were defined (cosine similarity, odd one, mathematical operations and formal human evaluation) and applied to each model. RESULTS Word2Vec had the highest score for three of the four tasks (mathematical operations, odd one and human validation), particularly with the Skip-Gram architecture. CONCLUSIONS Although this implementation achieved the best scores, each model has its own strengths and weaknesses, such as the very short training time of GloVe or the preservation of morphosyntactic similarity observed with FastText. The models and test sets produced by this study will be the first to be made publicly available through a graphical interface, to help advance French biomedical research.
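The three automatic tasks named in the abstract (cosine similarity, odd one, mathematical operations) can be illustrated with a minimal sketch. The abstract does not describe the study's exact protocol, so everything below is an assumption made for illustration: the toy three-dimensional vectors in `TOY`, the English words, and the helper names `cosine`, `odd_one_out` and `analogy` are hypothetical, and the odd-one rule used here (least similar to the group mean) is only one common way to define that task.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def odd_one_out(words, vectors):
    """Return the word least similar to the mean vector of the group."""
    dim = len(vectors[words[0]])
    mean = [sum(vectors[w][i] for w in words) / len(words) for i in range(dim)]
    return min(words, key=lambda w: cosine(vectors[w], mean))

def analogy(a, b, c, vectors):
    """Solve 'a is to b as c is to ?' as the nearest word to b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the query words so they cannot be returned as the answer.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hypothetical toy embeddings, NOT taken from the study's models.
TOY = {
    "heart":      [0.9, 0.1, 0.0],
    "cardiology": [0.9, 0.1, 1.0],
    "lung":       [0.1, 0.9, 0.0],
    "pneumology": [0.1, 0.9, 1.0],
    "invoice":    [-0.5, -0.5, 0.1],
}

if __name__ == "__main__":
    print(cosine(TOY["heart"], TOY["cardiology"]))
    print(odd_one_out(["heart", "lung", "invoice"], TOY))
    print(analogy("heart", "cardiology", "lung", TOY))
```

In practice, the same three operations would be run on vectors loaded from a trained Word2Vec, GloVe or FastText model rather than on a hand-written dictionary.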
- Published
- 2018