A new hybrid record linkage process to make epidemiological databases interoperable: application to the GEMO and GENEPSO studies involving BRCA1 and BRCA2 mutation carriers
- Source :
- BMC Medical Research Methodology, BioMed Central, 2021, 21 (1), article 155 (pp. 1-11). ⟨10.1186/s12874-021-01299-6⟩
- Publication Year :
- 2021
- Publisher :
- HAL CCSD, 2021.
Abstract
- Background: Linking independent sources of data describing the same individuals enables innovative epidemiological and health studies but requires a robust record linkage approach. We describe a hybrid record linkage process to link databases from two independent ongoing French national studies: GEMO (Genetic Modifiers of BRCA1 and BRCA2), which focuses on identifying genetic factors that modify cancer risk in BRCA1 and BRCA2 mutation carriers, and GENEPSO (prospective cohort of BRCAx mutation carriers), which focuses on environmental and lifestyle risk factors. Methods: To identify as many as possible of the individuals who participate in both studies but are not registered under a shared identifier, we combined probabilistic record linkage (PRL) and supervised machine learning (ML). This approach (named “PRL + ML”) pooled the candidate matches identified by each method. We built the ML model using, as a training dataset, a gold standard established on a first version of the two databases. This gold standard was obtained from PRL-derived matches verified by an exhaustive manual review. Results: The Random Forest (RF) algorithm showed the highest recall (0.985) among six widely used ML algorithms: RF, Bagged trees, AdaBoost, Support Vector Machine, Neural Network. RF was therefore selected to build the ML model, since our goal was to identify the maximum number of true matches. Our combined PRL + ML linkage showed a higher recall (range 0.988–0.992) than either PRL (range 0.916–0.991) or ML (0.981) alone. It identified 1995 individuals participating in both GEMO (6375 participants) and GENEPSO (4925 participants). Conclusions: Our hybrid linkage process is an efficient tool for linking GEMO and GENEPSO. It may be generalizable to other epidemiological studies involving other databases and registries.
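The pooling step described in the abstract can be illustrated with a minimal sketch. It assumes each method outputs a set of candidate record pairs and that the combined result is their union. The identifiers and pairs below are hypothetical and are not data from the study; the function name `recall` is likewise chosen only for illustration.

```python
def recall(predicted, gold):
    """Fraction of gold-standard matches recovered by a set of predicted matches."""
    if not gold:
        raise ValueError("gold standard must not be empty")
    return len(predicted & gold) / len(gold)


# Hypothetical (GEMO id, GENEPSO id) pairs, invented for this example.
gold = {("g1", "p1"), ("g2", "p2"), ("g3", "p3"), ("g4", "p4")}
prl_matches = {("g1", "p1"), ("g2", "p2"), ("g9", "p9")}  # includes one false match
ml_matches = {("g2", "p2"), ("g3", "p3")}

# Assumed pooling rule: keep every pair proposed by either method.
combined = prl_matches | ml_matches

recall_prl = recall(prl_matches, gold)
recall_ml = recall(ml_matches, gold)
recall_combined = recall(combined, gold)

print(recall_prl, recall_ml, recall_combined)
```

Under this assumption, the pooled set can never have a lower recall than either method alone, which is consistent with the goal of finding as many true matches as possible. The union may also carry over false matches from either method, so precision has to be checked separately.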
- Subjects :
- Risk
Medicine (General)
Databases, Factual
Epidemiology
Computer science
[SDV]Life Sciences [q-bio]
Breast Neoplasms
Health Informatics
Cohort Studies
03 medical and health sciences
Record linkage
R5-920
0302 clinical medicine
Humans
Genetic Predisposition to Disease
Prospective Studies
030212 general & internal medicine
AdaBoost
Supervised machine learning
BRCA2 Protein
Linkage (software)
[SDV.MHEP] Life Sciences [q-bio]/Human health and pathology
Database
BRCA1 Protein
Random forest
Support vector machine
Identifier
Identification (information)
Probabilistic linkage
Hybrid process
030220 oncology & carcinogenesis
Mutation
Mutation (genetic algorithm)
Female
Research Article
Details
- Language :
- English
- ISSN :
- 1471-2288
- Database :
- OpenAIRE
- Journal :
- BMC Medical Research Methodology
- Accession number :
- edsair.doi.dedup.....be32ee1e1fdd7a8c91374aff36cb5f65
- Full Text :
- https://doi.org/10.1186/s12874-021-01299-6