Automatic identification of variables in epidemiological datasets using logic regression

Authors :
Lorenz, Matthias W.
Abdi, Negin Ashtiani
Scheckenbach, Frank
Pflug, Anja
Bülbül, Alpaslan
Catapano, Alberico L.
Agewall, Stefan
Ezhov, Marat
Bots, Michiel L.
Kiechl, Stefan
Orth, Andreas
Norata, Giuseppe D.
Empana, Jean Philippe
Lin, Hung Ju
McLachlan, Stela
Bokemark, Lena
Ronkainen, Kimmo
Amato, Mauro
Schminke, Ulf
Srinivasan, Sathanur R.
Lind, Lars
Kato, Akihiko
Dimitriadis, Chrystosomos
Przewlocki, Tadeusz
Okazaki, Shuhei
Stehouwer, C. D.A.
Lazarevic, Tatjana
Willeit, Peter
Yanez, David N.
Steinmetz, Helmuth
Sander, Dirk
Poppert, Holger
Desvarieux, Moise
Ikram, M. Arfan
Bevc, Sebastjan
Staub, Daniel
Sirtori, Cesare R.
Iglseder, Bernhard
Engström, Gunnar
Tripepi, Giovanni
Beloqui, Oscar
Lee, Moo Sik
Friera, Alfonsa
Xie, Wuxiang
Grigore, Liliana
Plichart, Matthieu
Su, Ta Chen
Robertson, Christine
Schmidt, Caroline
Tuomainen, Tomi Pekka
Veglia, Fabrizio
Völzke, Henry
Nijpels, Giel
Jovanovic, Aleksandar
Willeit, Johann
Sacco, Ralph L.
Franco, Oscar H.
Hojs, Radovan
Uthoff, Heiko
Hedblad, Bo
Park, Hyun Woong
Suarez, Carmen
Zhao, Dong
Catapano, Alberico
Ducimetiere, Pierre
Chien, Kuo Liong
Price, Jackie F.
Bergström, Göran
Kauhanen, Jussi
Tremoli, Elena
Dörr, Marcus
Berenson, Gerald
Papagianni, Aikaterini
Kablak-Ziembicka, Anna
Kitagawa, Kazuo
Dekker, Jaqueline M.
Stolic, Radojica
Polak, Joseph F.
Sitzer, Matthias
Bickel, Horst
Rundek, Tatjana
Hofman, Albert
Ekart, Robert
Frauchiger, Beat
Castelnuovo, Samuela
Rosvall, Maria
Zoccali, Carmine
Landecho, Manuel F.
Bae, Jang Ho
Gabriel, Rafael
Liu, Jing
Baldassarre, Damiano
Kavousi, Maryam
APH - Health Behaviors & Chronic Diseases
AII - Infectious diseases
APH - Aging & Later Life
Epidemiology
Source :
Lorenz, M W, Abdi, N A, Scheckenbach, F, Pflug, A, Bülbül, A, Catapano, A, Agewall, S, Ezhov, M, Bots, M L, Kiechl, S, Orth, A & PROG-IMT study group 2017, 'Automatic identification of variables in epidemiological datasets using logic regression', BMC Medical Informatics and Decision Making, vol. 17, no. 1, 40, pp. 1-11. BioMed Central. https://doi.org/10.1186/s12911-017-0429-1
Publication Year :
2017

Abstract

Background: For an individual participant data (IPD) meta-analysis, multiple datasets must be transformed into a consistent format, e.g. using uniform variable names. When large numbers of datasets have to be processed, this can be a time-consuming and error-prone task. Automated or semi-automated identification of variables can help to reduce the workload and improve data quality. For semi-automation, high sensitivity in recognizing matching variables is particularly important: it allows software to present, for each target variable, a list of candidate source variables from which a user picks the matching one, with only a low risk that the correct source variable is missing from the list.

Methods: For each variable in a set of target variables, a number of simple rules were created manually. Logic regression was then used to search for an optimal Boolean combination of these rules for every target variable, using a random subset of a large database of epidemiological and clinical cohort data (construction subset). These optimal combinations of rules were validated in a second subset of the same database (validation subset).

Results: In the construction sample, the 41 target variables were allocated with an average positive predictive value (PPV) of 34% and an average negative predictive value (NPV) of 95%. In the validation sample, PPV was 33% and NPV was 94%. PPV was 50% or less for 63% of all variables in the construction sample and for 71% of all variables in the validation sample.

Conclusions: We demonstrated that applying logic regression to a complex data management task in large epidemiological IPD meta-analyses is feasible. However, the performance of the algorithm is poor, which may require backup strategies.
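To make the idea concrete, the following is a minimal sketch in Python of how simple name-based rules might be combined with Boolean operators and scored by PPV and NPV. It is not the authors' implementation: the rule names, the target variable "age", the variable names, and the hand-picked combination are all hypothetical, and no logic regression search is performed here (in the study, the combination was found by logic regression rather than chosen by hand).

```python
from typing import Callable, Dict, List, Tuple

Rule = Callable[[str], bool]

# Hypothetical simple rules: each maps a source variable name to True/False.
RULES: Dict[str, Rule] = {
    "has_age": lambda name: "age" in name.lower(),
    "has_date": lambda name: "date" in name.lower(),
    "has_onset": lambda name: "onset" in name.lower(),
}


def matches_age(name: str) -> bool:
    """Hand-picked Boolean combination for the hypothetical target 'age':
    has_age AND NOT (has_date OR has_onset)."""
    return RULES["has_age"](name) and not (
        RULES["has_date"](name) or RULES["has_onset"](name)
    )


def ppv_npv(labelled: List[Tuple[str, bool]], predict: Rule) -> Tuple[float, float]:
    """Return (PPV, NPV) of `predict` against manually labelled variable names."""
    tp = fp = tn = fn = 0
    for name, is_match in labelled:
        predicted = predict(name)
        if predicted and is_match:
            tp += 1
        elif predicted and not is_match:
            fp += 1
        elif not predicted and not is_match:
            tn += 1
        else:
            fn += 1
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    npv = tn / (tn + fn) if (tn + fn) else float("nan")
    return ppv, npv


# Made-up source variable names with a manual label: does it match 'age'?
LABELLED: List[Tuple[str, bool]] = [
    ("AGE", True),
    ("age_baseline", True),
    ("age_onset_dm", False),
    ("stage", False),      # contains "age", so the rule yields a false positive
    ("sex", False),
    ("visit_date", False),
    ("yrs_old", True),     # a true match the rules miss (false negative)
]

if __name__ == "__main__":
    ppv, npv = ppv_npv(LABELLED, matches_age)
    print(f"PPV={ppv:.2f} NPV={npv:.2f}")
```

The false positive ("stage") and the false negative ("yrs_old") illustrate why name-based rules can yield a low PPV while NPV stays high when most source variables are true non-matches.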

Details

Language :
English
ISSN :
14726947
Database :
OpenAIRE
Journal :
BMC Medical Informatics and Decision Making
Accession number :
edsair.doi.dedup.....c61a4c138a355655c1d06410dcbda75b