Automatic identification of variables in epidemiological datasets using logic regression
- Source :
- Lorenz, M W, Abdi, N A, Scheckenbach, F, Pflug, A, Bülbül, A, Catapano, A, Agewall, S, Ezhov, M, Bots, M L, Kiechl, S, Orth, A & PROG-IMT study group 2017, 'Automatic identification of variables in epidemiological datasets using logic regression', BMC Medical Informatics and Decision Making, vol. 17, no. 1, 40. BioMed Central. https://doi.org/10.1186/s12911-017-0429-1
- Publication Year :
- 2017
Abstract
- Background: For an individual participant data (IPD) meta-analysis, multiple datasets must be transformed into a consistent format, e.g. using uniform variable names. When large numbers of datasets have to be processed, this can be a time-consuming and error-prone task. Automated or semi-automated identification of variables can help to reduce the workload and improve data quality. For semi-automation, high sensitivity in recognizing matching variables is particularly important: it makes it possible to build software that, for each target variable, presents a shortlist of source variables from which a user picks the matching one, with only a low risk that the correct source variable is missing from the list. Methods: For each variable in a set of target variables, a number of simple rules were created manually. Using logic regression, an optimal Boolean combination of these rules was searched for each target variable in a random subset of a large database of epidemiological and clinical cohort data (construction subset). These optimal rule combinations were then validated in a second subset of the database (validation subset). Results: In the construction sample, 41 target variables were allocated with an average positive predictive value (PPV) of 34% and an average negative predictive value (NPV) of 95%. In the validation sample, PPV was 33%, whereas NPV remained at 94%. PPV was 50% or less for 63% of all variables in the construction sample and for 71% of all variables in the validation sample. Conclusions: We demonstrated that applying logic regression to a complex data management task in large epidemiological IPD meta-analyses is feasible. However, the performance of the algorithm is poor, which may require backup strategies.
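The workflow described in the Methods (simple hand-written rules, a search for the best Boolean combination of them on a construction subset, then PPV and NPV on a validation subset) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the variable names, the four rules, and the target variable (systolic blood pressure) are all hypothetical, and a brute-force search over single literals and AND/OR pairs stands in for the search procedure of logic regression proper.

```python
from itertools import combinations

# Hypothetical source variable names, labelled by whether they denote the
# target variable "systolic blood pressure". All data here is made up.
CONSTRUCTION = [
    ("sbp", True), ("sys_bp", True), ("systolic", True),
    ("dbp", False), ("dia_bp", False), ("bmi", False),
    ("sysdate", False), ("age", False),
]
VALIDATION = [
    ("sbp_mean", True), ("syst_bp", True),
    ("sex", False), ("dbp_mean", False),
]

# Simple hand-written rules on the variable name (illustrative assumptions).
RULES = {
    "has_sys": lambda n: "sys" in n,
    "has_bp": lambda n: "bp" in n,
    "starts_s": lambda n: n.startswith("s"),
    "has_date": lambda n: "date" in n,
}


def negate(f):
    return lambda n: not f(n)


def both(f, g):
    return lambda n: f(n) and g(n)


def either(f, g):
    return lambda n: f(n) or g(n)


def candidates(rules):
    """Yield (label, predicate) for every literal and every AND/OR pair of literals."""
    literals = []
    for name, rule in rules.items():
        literals.append((name, rule))
        literals.append((f"NOT {name}", negate(rule)))
    yield from literals
    for (la, fa), (lb, fb) in combinations(literals, 2):
        yield f"({la} AND {lb})", both(fa, fb)
        yield f"({la} OR {lb})", either(fa, fb)


def n_correct(predict, data):
    """Number of variable names classified correctly by the predicate."""
    return sum(predict(name) == truth for name, truth in data)


def best_combination(rules, data):
    """Exhaustively pick the candidate with the most correct classifications."""
    return max(candidates(rules), key=lambda c: n_correct(c[1], data))


def ppv_npv(predict, data):
    """Positive and negative predictive value of a predicate on labelled data."""
    tp = sum(1 for n, t in data if predict(n) and t)
    fp = sum(1 for n, t in data if predict(n) and not t)
    tn = sum(1 for n, t in data if not predict(n) and not t)
    fn = sum(1 for n, t in data if not predict(n) and t)
    ppv = tp / (tp + fp) if tp + fp else 0.0
    npv = tn / (tn + fn) if tn + fn else 0.0
    return ppv, npv


label, rule = best_combination(RULES, CONSTRUCTION)
print("selected combination:", label)
print("construction PPV/NPV:", ppv_npv(rule, CONSTRUCTION))
print("validation PPV/NPV:", ppv_npv(rule, VALIDATION))
```

On this toy data the selected combination fits the construction subset perfectly but produces a false positive (`sex`) in the validation subset, so PPV drops while NPV stays high. The real study reports the same kind of asymmetry, with much lower PPV than NPV, but the numbers here carry no empirical meaning.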
- Subjects :
- Carotid Artery Diseases
Matching (statistics)
Databases, Factual
Epidemiologic Factors
Computer science
Epidemiology
Sample (statistics)
Health Informatics
Logistic regression
lcsh:Computer applications to medicine. Medical informatics
Data management
Carotid Intima-Media Thickness
Set (abstract data type)
03 medical and health sciences
0302 clinical medicine
Meta-Analysis as Topic
Backup
Predictive Value of Tests
Statistics
Journal Article
Data Mining
Humans
030212 general & internal medicine
Medical Informatics Applications
Health Policy
Prognosis
Logic regression
Computer Science Applications
Identification (information)
Variable (computer science)
Meta-analysis
Logistic Models
030220 oncology & carcinogenesis
Data quality
lcsh:R858-859.7
Data mining
Algorithms
Research Article
Details
- Language :
- English
- ISSN :
- 1472-6947
- Database :
- OpenAIRE
- Journal :
- BMC Medical Informatics and Decision Making
- Accession number :
- edsair.doi.dedup.....c61a4c138a355655c1d06410dcbda75b