Author: "Malin, Bradley" / Topic: electronic health records - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Malin, Bradley"' showing total 87 results



1. Large language models facilitate the generation of electronic health record phenotyping algorithms.

Author: Yan C, Ong HH, Grabowska ME, Krantz MS, Su WC, Dickson AL, Peterson JF, Feng Q, Roden DM, Stein CM, Kerchberger VE, Malin BA, and Wei WQ
Subjects: Humans, Diabetes Mellitus, Type 2, Dementia, Hypothyroidism, Natural Language Processing, Electronic Health Records, Algorithms, Phenotype
Abstract: Objectives: Phenotyping is a core task in observational health research utilizing electronic health records (EHRs). Developing an accurate algorithm demands substantial input from domain experts, involving extensive literature review and evidence synthesis. This burdensome process limits scalability and delays knowledge discovery. We investigate the potential for leveraging large language models (LLMs) to enhance the efficiency of EHR phenotyping by generating high-quality algorithm drafts. Materials and Methods: We prompted four LLMs (GPT-4 and GPT-3.5 of ChatGPT, Claude 2, and Bard) in October 2023, asking them to generate executable phenotyping algorithms in the form of SQL queries adhering to a common data model (CDM) for three phenotypes (ie, type 2 diabetes mellitus, dementia, and hypothyroidism). Three phenotyping experts evaluated the returned algorithms across several critical metrics. We further implemented the top-rated algorithms and compared them against clinician-validated phenotyping algorithms from the Electronic Medical Records and Genomics (eMERGE) network. Results: GPT-4 and GPT-3.5 exhibited significantly higher overall expert evaluation scores in instruction following, algorithmic logic, and SQL executability, when compared to Claude 2 and Bard. Although GPT-4 and GPT-3.5 effectively identified relevant clinical concepts, they exhibited immature capability in organizing phenotyping criteria with the proper logic, leading to phenotyping algorithms that were either excessively restrictive (with low recall) or overly broad (with low positive predictive values). Conclusion: GPT versions 3.5 and 4 are capable of drafting phenotyping algorithms by identifying relevant clinical criteria aligned with a CDM. However, expertise in informatics and clinical experience is still required to assess and further refine generated algorithms. (© The Author(s) 2024. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For permissions, please email: journals.permissions@oup.com.)
Published: 2024

2. Differences in Health Professionals' Engagement With Electronic Health Records Based on Inpatient Race and Ethnicity.

Author: Yan C, Zhang X, Yang Y, Kang K, Were MC, Embí P, Patel MB, Malin BA, Kho AN, and Chen Y
Subjects: Adult, Female, Humans, Male, Middle Aged, Black or African American, Cross-Sectional Studies, White, Hospitalization statistics & numerical data, Attitude of Health Personnel, Aged, Time Factors, Electronic Health Records statistics & numerical data, Ethnicity, Healthcare Disparities ethnology, Healthcare Disparities statistics & numerical data
Abstract: Importance: US health professionals devote a large amount of effort to engaging with patients' electronic health records (EHRs) to deliver care. It is unknown whether patients with different racial and ethnic backgrounds receive equal EHR engagement. Objective: To investigate whether there are differences in the level of health professionals' EHR engagement for hospitalized patients according to race or ethnicity during inpatient care. Design, Setting, and Participants: This cross-sectional study analyzed EHR access log data from 2 major medical institutions, Vanderbilt University Medical Center (VUMC) and Northwestern Medicine (NW Medicine), over a 3-year period from January 1, 2018, to December 31, 2020. The study included all adult patients (aged ≥18 years) who were discharged alive after hospitalization for at least 24 hours. The data were analyzed between August 15, 2022, and March 15, 2023. Exposures: The actions of health professionals in each patient's EHR were based on EHR access log data. Covariates included patients' demographic information, socioeconomic characteristics, and comorbidities. Main Outcomes and Measures: The primary outcome was the quantity of EHR engagement, as defined by the average number of EHR actions performed by health professionals within a patient's EHR per hour during the patient's hospital stay. Proportional odds logistic regression was applied based on outcome quartiles. Results: A total of 243 416 adult patients were included from VUMC (mean [SD] age, 51.7 [19.2] years; 54.9% female and 45.1% male; 14.8% Black, 4.9% Hispanic, 77.7% White, and 2.6% other races and ethnicities) and NW Medicine (mean [SD] age, 52.8 [20.6] years; 65.2% female and 34.8% male; 11.7% Black, 12.1% Hispanic, 69.2% White, and 7.0% other races and ethnicities). When combining Black, Hispanic, or other race and ethnicity patients into 1 group, these patients were significantly less likely to receive a higher amount of EHR engagement compared with White patients (adjusted odds ratios, 0.86 [95% CI, 0.83-0.88; P < .001] for VUMC and 0.90 [95% CI, 0.88-0.92; P < .001] for NW Medicine). However, a reduction in this difference was observed from 2018 to 2020. Conclusions and Relevance: In this cross-sectional study of inpatient EHR engagement, the findings highlight differences in how health professionals distribute their efforts to patients' EHRs, as well as a method to measure these differences. Further investigations are needed to determine whether and how EHR engagement differences are correlated with health care outcomes.
Published: 2023

3. A Representativeness-informed Model for Research Record Selection from Electronic Medical Record Systems.

Author: Borza VA, Clayton EW, Kantarcioglu M, Vorobeychik Y, and Malin BA
Subjects: Humans, Software, Databases, Factual, Electronic Health Records, Data Management
Abstract: Scientific and clinical studies have a long history of bias in recruitment of underprivileged and minority populations. This underrepresentation leads to inaccurate, inapplicable, and non-generalizable results. Electronic medical record (EMR) systems, which now drive much research, often poorly represent these groups. We introduce a method for quantifying representativeness using information theoretic measures and an algorithmic approach to select a more representative record cohort than random selection when resource limitations preclude researchers from reviewing every record in the database. We apply this method to select cohorts of 2,000-20,000 records from a large (2M+ records) EMR database at the Vanderbilt University Medical Center and assess representativeness based on age, ethnicity, race, and gender. Compared to random selection, which will on average mirror the EMR database demographics, we find that a representativeness-informed approach can compose a cohort of records that is approximately 5.8 times more representative. (©2022 AMIA - All rights reserved.)
Published: 2023

4. A Multifaceted benchmarking of synthetic electronic health record generation models.

Author: Yan C, Yan Y, Wan Z, Zhang Z, Omberg L, Guinney J, Mooney SD, and Malin BA
Subjects: Privacy, Benchmarking, Electronic Health Records, Biomedical Research
Abstract: Synthetic health data have the potential to mitigate privacy concerns in supporting biomedical research and healthcare applications. Modern approaches for data generation continue to evolve and demonstrate remarkable potential. Yet there is a lack of a systematic assessment framework to benchmark methods as they emerge and determine which methods are most appropriate for which use cases. In this work, we introduce a systematic benchmarking framework to appraise key characteristics with respect to utility and privacy metrics. We apply the framework to evaluate synthetic data generation methods for electronic health records data from two large academic medical centers with respect to several use cases. The results illustrate that there is a utility-privacy tradeoff for sharing synthetic health data and further indicate that no method is unequivocally the best on all criteria in each use case, which makes it evident why synthetic data generation methods need to be assessed in context. (© 2022. The Author(s).)
Published: 2022

5. Predicting next-day discharge via electronic health record access logs.

Author: Zhang X, Yan C, Malin BA, Patel MB, and Chen Y
Subjects: Adult, Hospitalization, Humans, Machine Learning, ROC Curve, Electronic Health Records, Patient Discharge
Abstract: Objective: Hospital capacity management depends on accurate real-time estimates of hospital-wide discharges. Estimation by a clinician requires an excessively large amount of effort and, even when attempted, accuracy in forecasting next-day patient-level discharge is poor. This study aims to support next-day discharge predictions with machine learning by incorporating electronic health record (EHR) audit log data, a resource that captures EHR users' granular interactions with patients' records by communicating various semantics and has been neglected in outcome predictions. Materials and Methods: This study focused on the EHR data for all adults admitted to Vanderbilt University Medical Center in 2019. We learned multiple advanced models to assess the value that EHR audit log data adds to the daily prediction of discharge likelihood within 24 h and to compare different representation strategies. We applied Shapley additive explanations to identify the most influential types of user-EHR interactions for discharge prediction. Results: The data include 26 283 inpatient stays, 133 398 patient-day observations, and 819 types of user-EHR interactions. The model using the count of each type of interaction in the recent 24 h and other commonly used features, including demographics and admission diagnoses, achieved the highest area under the receiver operating characteristics (AUROC) curve of 0.921 (95% CI: 0.919-0.923). By contrast, the model lacking user-EHR interactions achieved a worse AUROC of 0.862 (0.860-0.865). In addition, 10 of the 20 (50%) most influential factors were user-EHR interaction features. Conclusion: EHR audit log data contain rich information such that it can improve hospital-wide discharge predictions. (© The Author(s) 2021. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For permissions, please email: journals.permissions@oup.com.)
Published: 2021

6. Mining tasks and task characteristics from electronic health record audit logs with unsupervised machine learning.

Author: Chen B, Alrifai W, Gao C, Jones B, Novak L, Lorenzi N, France D, Malin B, and Chen Y
Subjects: Humans, Infant, Newborn, Intensive Care Units, Neonatal, Workflow, Workload, Electronic Health Records, Unsupervised Machine Learning
Abstract: Objective: The characteristics of clinician activities while interacting with electronic health record (EHR) systems can influence the time spent in EHRs and workload. This study aims to characterize EHR activities as tasks and define novel, data-driven metrics. Materials and Methods: We leveraged unsupervised learning approaches to learn tasks from sequences of events in EHR audit logs. We developed metrics characterizing the prevalence of unique events and event repetition and applied them to categorize tasks into 4 complexity profiles. Between these profiles, Mann-Whitney U tests were applied to measure the differences in performance time, event type, and clinician prevalence, or the number of unique clinicians who were observed performing these tasks. In addition, we apply process mining frameworks paired with clinical annotations to support the validity of a sample of our identified tasks. We apply our approaches to learn tasks performed by nurses in the Vanderbilt University Medical Center neonatal intensive care unit. Results: We examined EHR audit logs generated by 33 neonatal intensive care unit nurses resulting in 57 234 sessions and 81 tasks. Our results indicated significant differences in performance time for each observed task complexity profile. There were no significant differences in clinician prevalence or in the frequency of viewing and modifying event types between tasks of different complexities. We presented a sample of expert-reviewed, annotated task workflows supporting the interpretation of their clinical meaningfulness. Conclusions: The use of the audit log provides an opportunity to assist hospitals in further investigating clinician activities to optimize EHR workflows. (© The Author(s) 2021. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For permissions, please email: journals.permissions@oup.com.)
Published: 2021

7. Contribution of Free-Text Comments to the Burden of Documentation: Assessment and Analysis of Vital Sign Comments in Flowsheets.

Author: Yin Z, Liu Y, McCoy AB, Malin BA, and Sengstack PR
Subjects: Academic Medical Centers, Humans, Natural Language Processing, Vital Signs, Documentation, Electronic Health Records
Abstract: Background: Documentation burden is a common problem with modern electronic health record (EHR) systems. To reduce this burden, various recording methods (eg, voice recorders or motion sensors) have been proposed. However, these solutions are in an early prototype phase and are unlikely to transition into practice in the near future. A more pragmatic alternative is to directly modify the implementation of the existing functionalities of an EHR system. Objective: This study aims to assess the nature of free-text comments entered into EHR flowsheets that supplement quantitative vital sign values and examine opportunities to simplify functionality and reduce documentation burden. Methods: We evaluated 209,055 vital sign comments in flowsheets that were generated in the Epic EHR system at the Vanderbilt University Medical Center in 2018. We applied topic modeling, as well as the natural language processing Clinical Language Annotation, Modeling, and Processing software system, to extract generally discussed topics and detailed medical terms (expressed as probability distribution) to investigate the stories communicated in these comments. Results: Our analysis showed that 63.33% (6053/9557) of the users who entered vital signs made at least one free-text comment in vital sign flowsheet entries. The user roles that were most likely to compose comments were registered nurse, technician, and licensed nurse. The most frequently identified topics were the notification of a result to health care providers (0.347), the context of a measurement (0.307), and an inability to obtain a vital sign (0.224). There were 4187 unique medical terms that were extracted from 46,029 (0.220) comments, including many symptom-related terms such as "pain," "upset," "dizziness," "coughing," "anxiety," "distress," and "fever" and drug-related terms such as "tylenol," "anesthesia," "cannula," "oxygen," "motrin," "rituxan," and "labetalol." Conclusions: Considering that flowsheet comments are generally not displayed or automatically pulled into any clinical notes, our findings suggest that the flowsheet comment functionality can be simplified (eg, via structured response fields instead of a text input dialog) to reduce health care provider effort. Moreover, rich and clinically important medical terms such as medications and symptoms should be explicitly recorded in clinical notes for better visibility. (©Zhijun Yin, Yongtai Liu, Allison B McCoy, Bradley A Malin, Patricia R Sengstack. Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 04.03.2021.)
Published: 2021

8. SynTEG: a framework for temporal structured electronic health data simulation.

Author: Zhang Z, Yan C, Lasko TA, Sun J, and Malin BA
Subjects: Academic Medical Centers, Current Procedural Terminology, Diagnosis, Disease classification, Hospital Charges classification, Humans, Information Dissemination, Tennessee, Time Factors, Computer Simulation, Confidentiality, Electronic Health Records, Models, Statistical
Abstract: Objective: Simulating electronic health record data offers an opportunity to resolve the tension between data sharing and patient privacy. Recent techniques based on generative adversarial networks have shown promise but neglect the temporal aspect of healthcare. We introduce a generative framework for simulating the trajectory of patients' diagnoses and measures to evaluate utility and privacy. Materials and Methods: The framework simulates date-stamped diagnosis sequences based on a 2-stage process that 1) sequentially extracts temporal patterns from clinical visits and 2) generates synthetic data conditioned on the learned patterns. We designed 3 utility measures to characterize the extent to which the framework maintains feature correlations and temporal patterns in clinical events. We evaluated the framework with billing codes, represented as phenome-wide association study codes (phecodes), from over 500 000 Vanderbilt University Medical Center electronic health records. We further assessed the privacy risks based on membership inference and attribute disclosure attacks. Results: The simulated temporal sequences exhibited similar characteristics to real sequences on the utility measures. Notably, diagnosis prediction models based on real versus synthetic temporal data exhibited an average relative difference in area under the ROC curve of 1.6% with standard deviation of 3.8% for 1276 phecodes. Additionally, the relative difference in the mean occurrence age and time between visits were 4.9% and 4.2%, respectively. The privacy risks in synthetic data, with respect to the membership and attribute inference, were negligible. Conclusion: This investigation indicates that temporal diagnosis code sequences can be simulated in a manner that provides utility and respects privacy. (© The Author(s) 2020. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For permissions, please email: journals.permissions@oup.com.)
Published: 2021

9. Generating Electronic Health Records with Multiple Data Types and Constraints.

Author: Yan C, Zhang Z, Nyemba S, and Malin BA
Subjects: Female, Humans, Male, Privacy, Research Design, Vital Signs, Electronic Health Records
Abstract: Sharing electronic health records (EHRs) on a large scale may lead to privacy intrusions. Recent research has shown that risks may be mitigated by simulating EHRs through generative adversarial network (GAN) frameworks. Yet the methods developed to date are limited because they 1) focus on generating data of a single type (e.g., diagnosis codes), neglecting other data types (e.g., demographics, procedures or vital signs), and 2) do not represent constraints between features. In this paper, we introduce a method to simulate EHRs composed of multiple data types by 1) refining the GAN model, 2) accounting for feature constraints, and 3) incorporating key utility measures for such generation tasks. Our analysis with over 770,000 EHRs from Vanderbilt University Medical Center demonstrates that the new model achieves higher performance in terms of retaining basic statistics, cross-feature correlations, latent structural properties, feature constraints and associated patterns from real data, without sacrificing privacy. (©2020 AMIA - All rights reserved.)
Published: 2021

10. Learning Tasks of Pediatric Providers from Electronic Health Record Audit Logs.

Author: Jones B, Zhang X, Malin BA, and Chen Y
Subjects: Child, Communication, Health Personnel, Humans, Infant, Newborn, Learning, Pediatrics, Workflow, Electronic Health Records
Abstract: The amount of time spent working in the Electronic Health Record (EHR) has become a burden for many providers. We propose computational methods to learn EHR tasks of Pediatrics residents and attending physicians in the treatment of healthy newborns by analyzing EHR audit log data. We perform statistical analyses of the association between EHR events and provider role, leverage word embedding, k-means, and ProM process mining software on audit log data to learn EHR tasks and visualize them. Residents more commonly perform note preparation and result viewing relative to attendings. Attendings perform more communication and chart review. Task workflows analysis resulted in 2 tasks for attendings and 3 tasks for residents. The attending tasks focus on chart review patient report and history, and inbox service. Primary themes for residents are admit/discharge with order creation, note review, and result review. (©2020 AMIA - All rights reserved.)
Published: 2021

11. Resilience of clinical text de-identified with "hiding in plain sight" to hostile reidentification attacks by human readers.

Author: Carrell DS, Malin BA, Cronkite DJ, Aberdeen JS, Clark C, Li MR, Bastakoty D, Nyemba S, and Hirschman L
Subjects: Computer Security, Humans, Natural Language Processing, Confidentiality, Data Anonymization, Electronic Health Records
Abstract: Objective: Effective, scalable de-identification of personally identifying information (PII) for information-rich clinical text is critical to support secondary use, but no method is 100% effective. The hiding-in-plain-sight (HIPS) approach attempts to solve this "residual PII problem." HIPS replaces PII tagged by a de-identification system with realistic but fictitious (resynthesized) content, making it harder to detect remaining unredacted PII. Materials and Methods: Using 2000 representative clinical documents from 2 healthcare settings (4000 total), we used a novel method to generate 2 de-identified 100-document corpora (200 documents total) in which PII tagged by a typical automated machine-learned tagger was replaced by HIPS-resynthesized content. Four readers conducted aggressive reidentification attacks to isolate leaked PII: 2 readers from within the originating institution and 2 external readers. Results: Overall, mean recall of leaked PII was 26.8% and mean precision was 37.2%. Mean recall was 9% (mean precision = 37%) for patient ages, 32% (mean precision = 26%) for dates, 25% (mean precision = 37%) for doctor names, 45% (mean precision = 55%) for organization names, and 23% (mean precision = 57%) for patient names. Recall was 32% (precision = 40%) for internal and 22% (precision = 33%) for external readers. Discussion and Conclusions: Approximately 70% of leaked PII "hiding" in a corpus de-identified with HIPS resynthesis is resilient to detection by human readers in a realistic, aggressive reidentification attack scenario, more than double the rate reported in previous studies but less than the rate reported for an attack assisted by machine learning methods. (© The Author(s) 2020. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For permissions, please email: journals.permissions@oup.com.)
Published: 2020

12. Ensuring electronic medical record simulation through better training, modeling, and evaluation.

Author: Zhang Z, Yan C, Mesa DA, Sun J, and Malin BA
Subjects: Adult, Age Distribution, Child, Data Anonymization, Datasets as Topic, Female, Humans, Male, Models, Theoretical, Privacy, Computer Simulation, Electronic Health Records, Neural Networks, Computer
Abstract: Objective: Electronic medical records (EMRs) can support medical research and discovery, but privacy risks limit the sharing of such data on a wide scale. Various approaches have been developed to mitigate risk, including record simulation via generative adversarial networks (GANs). While showing promise in certain application domains, GANs lack a principled approach for EMR data that induces subpar simulation. In this article, we improve EMR simulation through a novel pipeline that (1) enhances the learning model, (2) incorporates evaluation criteria for data utility that informs learning, and (3) refines the training process. Materials and Methods: We propose a new electronic health record generator using a GAN with a Wasserstein divergence and layer normalization techniques. We designed 2 utility measures to characterize similarity in the structural properties of real and simulated EMRs in the original and latent space, respectively. We applied a filtering strategy to enhance GAN training for low-prevalence clinical concepts. We evaluated the new and existing GANs with utility and privacy measures (membership and disclosure attacks) using billing codes from over 1 million EMRs at Vanderbilt University Medical Center. Results: The proposed model outperformed the state-of-the-art approaches with significant improvement in retaining the nature of real records, including prediction performance and structural properties, without sacrificing privacy. Additionally, the filtering strategy achieved higher utility when the EMR training dataset was small. Conclusions: These findings illustrate that EMR simulation through GANs can be substantially improved through more appropriate training, modeling, and evaluation criteria. (© The Author(s) 2019. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For permissions, please email: journals.permissions@oup.com.)
Published: 2020

13. The machine giveth and the machine taketh away: a parrot attack on clinical text deidentified with hiding in plain sight.

Author: Carrell DS, Cronkite DJ, Li MR, Nyemba S, Malin BA, Aberdeen JS, and Hirschman L
Subjects: Ambulatory Care Facilities, Delivery of Health Care, Humans, Washington, Computer Security, Confidentiality, Data Anonymization, Electronic Health Records, Machine Learning, Personally Identifiable Information
Abstract: Objective: Clinical corpora can be deidentified using a combination of machine-learned automated taggers and hiding in plain sight (HIPS) resynthesis. The latter replaces detected personally identifiable information (PII) with random surrogates, allowing leaked PII to blend in or "hide in plain sight." We evaluated the extent to which a malicious attacker could expose leaked PII in such a corpus. Materials and Methods: We modeled a scenario where an institution (the defender) externally shared an 800-note corpus of actual outpatient clinical encounter notes from a large, integrated health care delivery system in Washington State. These notes were deidentified by a machine-learned PII tagger and HIPS resynthesis. A malicious attacker obtained and performed a parrot attack intending to expose leaked PII in this corpus. Specifically, the attacker mimicked the defender's process by manually annotating all PII-like content in half of the released corpus, training a PII tagger on these data, and using the trained model to tag the remaining encounter notes. The attacker hypothesized that untagged identifiers would be leaked PII, discoverable by manual review. We evaluated the attacker's success using measures of leak-detection rate and accuracy. Results: The attacker correctly hypothesized that 211 (68%) of 310 actual PII leaks in the corpus were leaks, and wrongly hypothesized that 191 resynthesized PII instances were also leaks. One-third of actual leaks remained undetected. Discussion and Conclusion: A malicious parrot attack to reveal leaked PII in clinical text deidentified by machine-learned HIPS resynthesis can attenuate but not eliminate the protective effect of HIPS deidentification. (© The Author(s) 2019. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For permissions, please email: journals.permissions@oup.com.)
Published: 2019

14. Deep learning predicts extreme preterm birth from electronic health records.

Author: Gao C, Osmundson S, Velez Edwards DR, Jackson GP, Malin BA, and Chen Y
Subjects: Algorithms, Datasets as Topic, Humans, Infant, Newborn, International Classification of Diseases, Deep Learning, Electronic Health Records, Infant, Extremely Premature
Abstract: Objective: Models for predicting preterm birth generally have focused on very preterm (28-32 weeks) and moderate to late preterm (32-37 weeks) settings. However, extreme preterm birth (EPB), before the 28th week of gestational age, accounts for the majority of newborn deaths. We investigated the extent to which deep learning models that consider temporal relations documented in electronic health records (EHRs) can predict EPB. Study Design: EHR data were subject to word embedding and a temporal deep learning model, in the form of recurrent neural networks (RNNs), to predict EPB. Due to the low prevalence of EPB, the models were trained on datasets where controls were undersampled to balance the case-control ratio. We then applied an ensemble approach to group the trained models to predict EPB in an evaluation setting with a natural EPB ratio. We evaluated the RNN ensemble models with 10 years of EHR data from 25,689 deliveries at Vanderbilt University Medical Center. We compared their performance with traditional machine learning models (logistic regression, support vector machine, gradient boosting) trained on the datasets with balanced and natural EPB ratio. Risk factors associated with EPB were identified using an adjusted odds ratio. Results: The RNN ensemble models trained on artificially balanced data achieved a higher AUC (0.827 vs. 0.744) and sensitivity (0.965 vs. 0.682) than those RNN models trained on the datasets with naturally imbalanced EPB ratio. In addition, the AUC (0.827) and sensitivity (0.965) of the RNN ensemble models were better than the AUC (0.777) and sensitivity (0.819) of the best baseline models trained on balanced data. Also, risk factors, including twin pregnancy, short cervical length, hypertensive disorder, systemic lupus erythematosus, and hydroxychloroquine sulfate, were found to be associated with EPB at a significant level. Conclusion: Temporal deep learning can predict EPB up to 8 weeks earlier than its occurrence. Accurate prediction of EPB may allow healthcare organizations to allocate resources effectively and ensure patients receive appropriate care. (Copyright © 2019 Elsevier Inc. All rights reserved.)
Published: 2019

15. Modeling Care Team Structures in the Neonatal Intensive Care Unit through Network Analysis of EHR Audit Logs.

Author: Chen Y, Lehmann CU, Hatch LD, Schremp E, Malin BA, and France DJ
Subjects: Gastrostomy, Health Personnel, Humans, Infant, Newborn, Length of Stay, Clinical Audit, Electronic Health Records, Intensive Care Units, Neonatal, Models, Theoretical, Patient Care
Abstract: Background: In the neonatal intensive care unit (NICU), predefined acuity-based team care models are restricted to core roles and neglect interactions with providers outside of the team, such as interactions that transpire via electronic health record (EHR) systems. These unaccounted interactions may be related to the efficiency of resource allocation, information flow, communication, and thus impact patient outcomes. This study applied network analysis methods to EHR audit logs to model the interactions of providers beyond their core roles to better understand the interaction network patterns of acuity-based teams and relationships of the network structures with postsurgical length of stay (PSLOS)., Methods: The study used the EHR log data of surgical neonates from a large academic medical center. The study included 104 surgical neonates, for whom 9,206 unique actions were performed by 457 providers in their EHRs. We applied network analysis methods to model EHR provider interaction networks of acuity-based teams in NICU postoperative care. We partitioned each EHR network into three subnetworks based on interaction types: (1) interactions between known core providers who were documented in scheduling records (core subnetwork); (2) interactions between core and noncore providers (extended subnetwork); and (3) interactions between noncore providers (extended subnetwork). For each core subnetwork, we assessed its capability to replicate predefined core-provider relations as documented in scheduling records. We further compared each EHR network, as well as its subnetworks, using standard network measures to determine its differences in network topologies. 
We conducted a case study to learn provider interaction networks taking care of 15 neonates who underwent gastrostomy tube placement surgery from EHR log data and measure the effectiveness of the interaction networks on PSLOS by the proportional-odds model., Results: The provider networks of four acuity-based teams (two high and two low acuity), along with their subnetworks, were discovered. We found that beyond capturing the predefined core-provider relations, EHR audit logs can also learn a large number of relations between core and noncore providers or among noncore providers. Providers in the core subnetwork exhibited a greater number of connections with each other than with providers in the extended subnetworks. Many more providers in the core subnetwork serve as a hub than those in the other types of subnetworks. We also found that high-acuity teams exhibited more complex network structures than low-acuity teams, with high-acuity teams generating 6,416 interactions between 407 providers, compared with 931 interactions between 124 providers for low-acuity teams. In addition, we discovered that high-acuity and low-acuity teams shared more than 33 and 25% of providers with each other, respectively, but exhibited different collaborative structures, demonstrating that NICU providers shift across different acuity teams and exhibit different network characteristics. Results of the case study show that providers whose patients had lower PSLOS tended to disperse patient-related information to more colleagues within their network than those who treated higher PSLOS patients (p = 0.03)., Conclusion: Network analysis can be applied to EHR log data to model acuity-based NICU teams capturing interactions between providers within the predesigned core team as well as those outside of the core team. In the NICU, dissemination of information may be linked to reduced PSLOS. EHR log data provide an efficient, accessible, and research-friendly way to study provider interaction networks. 
Findings should guide improvements in the EHR system design to facilitate effective interactions between providers., Competing Interests: None declared., (Georg Thieme Verlag KG Stuttgart · New York.)
Published: 2019
Full Text: View/download PDF
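
The subnetwork partition described in entry 15 can be illustrated with a minimal sketch. The abstract does not say how an interaction is derived from the audit log, so this example assumes (hypothetically) that two providers interact when both act on the same patient's record. The function names and the `(patient, provider)` log format are illustrative only and are not taken from the study.

```python
from collections import Counter
from itertools import combinations


def build_interactions(log):
    """Count provider pairs that touched the same patient's record.

    `log` is an iterable of (patient_id, provider_id) audit entries.
    Assumption: co-access to one patient's chart counts as one interaction.
    """
    providers_by_patient = {}
    for patient, provider in log:
        providers_by_patient.setdefault(patient, set()).add(provider)

    edges = Counter()
    for providers in providers_by_patient.values():
        # Sort so that each undirected pair has a single canonical key.
        for a, b in combinations(sorted(providers), 2):
            edges[(a, b)] += 1
    return edges


def partition_edges(edges, core_providers):
    """Split edges into core, core-noncore, and noncore groups."""
    parts = {"core": [], "core_noncore": [], "noncore": []}
    labels = ("noncore", "core_noncore", "core")
    for a, b in edges:
        # Count how many endpoints are core providers (0, 1, or 2).
        n_core = (a in core_providers) + (b in core_providers)
        parts[labels[n_core]].append((a, b))
    return parts
```

Calling `partition_edges(build_interactions(log), core_providers)` with a set of provider IDs taken from scheduling records would yield the three edge groups; standard network measures could then be computed on each group separately.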

16. Learning to Identify Severe Maternal Morbidity from Electronic Health Records.

Author: Gao C, Osmundson S, Yan X, Edwards DV, Malin BA, and Chen Y
Subjects: Delivery of Health Care, Female, Humans, Machine Learning, Pregnancy, Risk Factors, Delivery, Obstetric, Electronic Health Records
Abstract: Severe maternal morbidity (SMM) is broadly defined as significant complications in pregnancy that have an adverse effect on women's health. Identifying women who experience SMM and reviewing their obstetric care can assist healthcare organizations in recognizing risk factors and best practices for management. Various definitions of SMM have been posited, but there is no consensus. Existing definitions are further limited in that they 1) are often rooted in existing clinical knowledge (which is problematic as many risk factors remain unknown), leading to poor positive predictive performance (PPV), and 2) have limited scalability as they often require substantial chart review. Thus, in this paper, a machine learning framework was introduced to automatically identify SMM and relevant risk factors from electronic health records (EHRs). We evaluated this framework with EHR data from 45,858 deliveries at a large academic medical center. The framework outperformed a state-of-the-art model from the U.S. Centers for Disease Control and Prevention (AUC of 0.94 vs. 0.80). Specifically, it improved upon PPV by 59% (CDC: 0.22 vs. our model: 0.35). In the process, we revealed several novel SMM indicators, including disorders of fluid or electrolytes, systemic inflammatory response syndrome, and acidosis.
Published: 2019
Full Text: View/download PDF

17. Leveraging Electronic Health Records to Learn Progression Path for Severe Maternal Morbidity.

Author: Gao C, Osmundson S, Yan X, Edwards DV, Malin BA, and Chen Y
Subjects: Demography, Female, Humans, Pregnancy, Risk Factors, Electronic Health Records
Abstract: Severe maternal morbidity (SMM) encompasses a wide range of serious health complications that would likely result in death without in-time medical attention. It has been recognized that various demographic factors (e.g., age and race) and medical conditions (e.g., preeclampsia and organ failure) are associated with SMM. However, how medical conditions develop into SMM is seldom investigated. We hypothesize that SMM has a progression path, which is associated with a sequence of risk factors rather than a set of independent individual factors. We implemented a data-driven framework that leverages electronic health records (EHRs) in the antepartum period to learn the temporal patterns and measure their relationships with SMM during the delivery hospitalization. We evaluate the framework with two years of data from 6,184 women who had delivery hospitalizations at Vanderbilt University Medical Center. We discovered 69 temporal patterns, 12 of which were confirmed to be significantly associated with SMM.
Published: 2019
Full Text: View/download PDF

18. Learning bundled care opportunities from electronic medical records.

Author: Chen Y, Kho AN, Liebovitz D, Ivory C, Osmundson S, Bian J, and Malin BA
Subjects: Comorbidity, Humans, Machine Learning, Medical Informatics, Patient Care Management, Phenotype, Workflow, Data Mining methods, Delivery of Health Care organization & administration, Electronic Health Records, Patient Care Bundles
Abstract: Objective: The traditional fee-for-service approach to healthcare can lead to the management of a patient's conditions in a siloed manner, inducing various negative consequences. It has been recognized that a bundled approach to healthcare - one that manages a collection of health conditions together - may enable greater efficacy and cost savings. However, it is not always evident which sets of conditions should be managed in a bundled manner. In this study, we investigate if a data-driven approach can automatically learn potential bundles., Methods: We designed a framework to infer health condition collections (HCCs) based on the similarity of their clinical workflows, according to electronic medical record (EMR) utilization. We evaluated the framework with data from over 16,500 inpatient stays from Northwestern Memorial Hospital in Chicago, Illinois. The plausibility of the inferred HCCs for bundled care was assessed through an online survey of a panel of five experts, whose responses were analyzed via an analysis of variance (ANOVA) at a 95% confidence level. We further assessed the face validity of the HCCs using evidence in the published literature., Results: The framework inferred four HCCs, indicative of (1) fetal abnormalities, (2) late pregnancies, (3) prostate problems, and (4) chronic diseases, with congestive heart failure featuring prominently. Each HCC was substantiated with evidence in the literature and was deemed plausible for bundled care by the experts at a statistically significant level., Conclusions: The findings suggest that an automated, EMR data-driven framework can provide a basis for discovering bundled care opportunities. Still, translating such findings into actual care management will require further refinement, implementation, and evaluation., (Copyright © 2017 Elsevier Inc. All rights reserved.)
Published: 2018
Full Text: View/download PDF

19. Identifying collaborative care teams through electronic medical record utilization patterns.

Author: Chen Y, Lorenzi NM, Sandberg WS, Wolgast K, and Malin BA
Subjects: Cooperative Behavior, Humans, Interprofessional Relations, Patient-Centered Care, Data Mining, Electronic Health Records statistics & numerical data, Patient Care Team
Abstract: Objective: The goal of this investigation was to determine whether automated approaches can learn patient-oriented care teams via utilization of an electronic medical record (EMR) system., Materials and Methods: To perform this investigation, we designed a data-mining framework that relies on a combination of latent topic modeling and network analysis to infer patterns of collaborative teams. We applied the framework to the EMR utilization records of over 10 000 employees and 17 000 inpatients at a large academic medical center during a 4-month window in 2010. Next, we conducted an extrinsic evaluation of the patterns to determine the plausibility of the inferred care teams via surveys with knowledgeable experts. Finally, we conducted an intrinsic evaluation to contextualize each team in terms of collaboration strength (via a cluster coefficient) and clinical credibility (via associations between teams and patient comorbidities)., Results: The framework discovered 34 collaborative care teams, 27 (79.4%) of which were confirmed as administratively plausible. Of those, 26 teams depicted strong collaborations, with a cluster coefficient > 0.5. There were 119 diagnostic conditions associated with 34 care teams. Additionally, to provide clarity on how the survey respondents arrived at their determinations, we worked with several oncologists to develop an illustrative example of how a certain team functions in cancer care., Discussion: Inferred collaborative teams are plausible; translating such patterns into optimized collaborative care will require administrative review and integration with management practices., Conclusions: EMR utilization records can be mined for collaborative care patterns in large complex medical centers., (© The Author 2016. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For Permissions, please email: journals.permissions@oup.com)
Published: 2017
Full Text: View/download PDF

20. Towards a privacy preserving cohort discovery framework for clinical research networks.

Author: Yuan J, Malin B, Modave F, Guo Y, Hogan WR, Shenkman E, and Bian J
Subjects: Confidentiality, Female, Humans, United States, Computer Security, Electronic Health Records, Health Insurance Portability and Accountability Act
Abstract: Background: The last few years have witnessed an increasing number of clinical research networks (CRNs) focused on building large collections of data from electronic health records (EHRs), claims, and patient-reported outcomes (PROs). Many of these CRNs provide a service for the discovery of research cohorts with various health conditions, which is especially useful for rare diseases. Supporting patient privacy can enhance the scalability and efficiency of such processes; however, current practice mainly relies on policy, such as guidelines defined in the Health Insurance Portability and Accountability Act (HIPAA), which are insufficient for CRNs (e.g., HIPAA does not require encryption of data - which can mitigate insider threats). By combining policy with privacy enhancing technologies we can enhance the trustworthiness of CRNs. The goal of this research is to determine if searchable encryption can instill privacy in CRNs without sacrificing their usability., Methods: We developed a technique, implemented in working software, to enable privacy-preserving cohort discovery (PPCD) services in large distributed CRNs based on elliptic curve cryptography (ECC). This technique also incorporates a block indexing strategy to improve the performance (in terms of computational running time) of PPCD. We evaluated the PPCD service with three real cohort definitions of varied query complexity: (1) elderly cervical cancer patients who underwent radical hysterectomy, (2) oropharyngeal and tongue cancer patients who underwent robotic transoral surgery, and (3) female breast cancer patients who underwent mastectomy. These definitions were tested in an encrypted database of 7.1 million records derived from the publicly available Healthcare Cost and Utilization Project (HCUP) Nationwide Inpatient Sample (NIS). 
We assessed the performance of the PPCD service in terms of (1) accuracy in cohort discovery, (2) computational running time, and (3) privacy afforded to the underlying records during PPCD., Results: The empirical results indicate that the proposed PPCD can execute cohort discovery queries in a reasonable amount of time, with query runtime in the range of 165-262 s for the 3 use cases, with zero compromise in accuracy. We further show that the search performance is practical because it supports a highly parallelized design for secure evaluation over encrypted records. Additionally, our security analysis shows that the proposed construction is resilient to standard adversaries., Conclusions: PPCD services can be designed for clinical research networks. The security construction presented in this work specifically achieves high privacy guarantees by preventing both threats originating from within and beyond the network., (Copyright © 2016 Elsevier Inc. All rights reserved.)
Published: 2017
Full Text: View/download PDF

21. Predicting Length of Stay for Obstetric Patients via Electronic Medical Records.

Author: Gao C, Kho AN, Ivory C, Osmundson S, Malin BA, and Chen Y
Subjects: Clinical Coding, Female, Health Resources, Hospitalization, Humans, Quality of Health Care, Electronic Health Records, Length of Stay, Obstetrics
Abstract: Obstetric care refers to the care provided to patients during ante-, intra-, and postpartum periods. Predicting length of stay (LOS) for these patients during their hospitalizations can assist healthcare organizations in allocating hospital resources more effectively and efficiently, ultimately improving maternal care quality and reducing costs to patients. In this paper, we investigate the extent to which LOS can be forecast from a patient's medical history. We introduce a machine learning framework to incorporate a patient's prior conditions (e.g., diagnostic codes) as features in a predictive model for LOS. We evaluate the framework with three years of historical billing data from the electronic medical records of 9188 obstetric patients in a large academic medical center. The results indicate that our framework achieved an average accuracy of 49.3%, which is higher than the baseline accuracy 37.7% (that relies solely on a patient's age). The most predictive features were found to have statistically significant discriminative ability. These features included billing codes for normal delivery (indicative of shorter stay) and antepartum hypertension (indicative of longer stay).
Published: 2017

22. Preserving temporal relations in clinical data while maintaining privacy.

Author: Hripcsak G, Mirhaji P, Low AF, and Malin BA
Subjects: Health Insurance Portability and Accountability Act, Humans, Methods, Observational Studies as Topic, Time, United States, Confidentiality, Data Anonymization, Electronic Health Records
Abstract: Objective: Maintaining patient privacy is a challenge in large-scale observational research. To assist in reducing the risk of identifying study subjects through publicly available data, we introduce a method for obscuring date information for clinical events and patient characteristics., Methods: The method, which we call Shift and Truncate (SANT), obscures date information to any desired granularity. Shift and Truncate first assigns each patient a random shift value, such that all dates in that patient's record are shifted by that amount. Data are then truncated from the beginning and end of the data set., Results: The data set can be proven to not disclose temporal information finer than the chosen granularity. Unlike previous strategies such as a simple shift, it remains robust to frequent - even daily - updates and robust to inferring dates at the beginning and end of date-shifted data sets. Time-of-day may be retained or obscured, depending on the goal and anticipated knowledge of the data recipient., Conclusions: The method can be useful as a scientific approach for reducing re-identification risk under the Privacy Rule of the Health Insurance Portability and Accountability Act and may contribute to qualification for the Safe Harbor implementation., (© The Author 2016. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For Permissions, please email: journals.permissions@oup.com.)
Published: 2016
Full Text: View/download PDF
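
The two steps of Shift and Truncate named in entry 22 (a per-patient random shift, then truncation at both ends of the data set) can be sketched as follows. This is a simplified illustration, not the authors' implementation: the shift range `max_shift_days`, the truncation window, and the function name are all assumptions introduced here, and the abstract's granularity guarantee is not reproduced.

```python
import random
from datetime import date, timedelta


def shift_and_truncate(events, max_shift_days, window_start, window_end, seed=None):
    """Shift each patient's dates by one random offset, then truncate.

    `events` is a list of (patient_id, date) pairs. Every date belonging to
    the same patient moves by the same number of days, so the intervals
    within a patient's record are preserved. Events whose shifted date
    falls outside [window_start, window_end] are dropped (assumed window).
    """
    rng = random.Random(seed)
    shift_for = {}
    kept = []
    for patient, event_date in events:
        if patient not in shift_for:
            # One offset per patient, drawn from 0..max_shift_days-1.
            shift_for[patient] = rng.randrange(max_shift_days)
        shifted = event_date + timedelta(days=shift_for[patient])
        if window_start <= shifted <= window_end:
            kept.append((patient, shifted))
    return kept
```

Because the offset is constant within a patient, analyses that depend only on elapsed time between a patient's events are unaffected by the shift.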

23. Optimizing annotation resources for natural language de-identification via a game theoretic framework.

Author: Li M, Carrell D, Aberdeen J, Hirschman L, Kirby J, Li B, Vorobeychik Y, and Malin BA
Subjects: Humans, Language, Risk, Confidentiality, Electronic Health Records, Natural Language Processing
Abstract: Objective: Electronic medical records (EMRs) are increasingly repurposed for activities beyond clinical care, such as to support translational research and public policy analysis. To mitigate privacy risks, healthcare organizations (HCOs) aim to remove potentially identifying patient information. A substantial quantity of EMR data is in natural language form and there are concerns that automated tools for detecting identifiers are imperfect and leak information that can be exploited by ill-intentioned data recipients. Thus, HCOs have been encouraged to invest as much effort as possible to find and detect potential identifiers, but such a strategy assumes the recipients are sufficiently incentivized and capable of exploiting leaked identifiers. In practice, such an assumption may not hold true and HCOs may overinvest in de-identification technology. The goal of this study is to design a natural language de-identification framework, rooted in game theory, which enables an HCO to optimize their investments given the expected capabilities of an adversarial recipient., Methods: We introduce a Stackelberg game to balance risk and utility in natural language de-identification. This game represents a cost-benefit model that enables an HCO with a fixed budget to minimize their investment in the de-identification process. We evaluate this model by assessing the overall payoff to the HCO and the adversary using 2100 clinical notes from Vanderbilt University Medical Center. We simulate several policy alternatives using a range of parameters, including the cost of training a de-identification model and the loss in data utility due to the removal of terms that are not identifiers. 
In addition, we compare policy options where, when an attacker is fined for misuse, a monetary penalty is paid to the publishing HCO as opposed to a third party (e.g., a federal regulator)., Results: Our results show that when an HCO is forced to exhaust a limited budget (set to $2000 in the study), the precision and recall of the de-identification of the HCO are 0.86 and 0.8, respectively. A game-based approach enables a more refined cost-benefit tradeoff, improving both privacy and utility for the HCO. For example, our investigation shows that it is possible for an HCO to release the data without spending all their budget on de-identification and still deter the attacker, with a precision of 0.77 and a recall of 0.61 for the de-identification. There also exist scenarios in which the model indicates an HCO should not release any data because the risk is too great. In addition, we find that the practice of paying fines back to an HCO (an artifact of suing for breach of contract), as opposed to a third party such as a federal regulator, can induce an elevated level of data sharing risk, where the HCO is incentivized to bait the attacker to elicit compensation., Conclusions: A game theoretic framework can guide HCOs toward optimized decision making in natural language de-identification investments before sharing EMR data., (Copyright © 2016 Elsevier Inc. All rights reserved.)
Published: 2016
Full Text: View/download PDF

24. A multi-institution evaluation of clinical profile anonymization.

Author: Heatherly R, Rasmussen LV, Peissig PL, Pacheco JA, Harris P, Denny JC, and Malin BA
Subjects: Confidentiality, Humans, Hypothyroidism, International Classification of Diseases, Organizational Case Studies, Data Anonymization, Electronic Health Records, Information Dissemination
Abstract: Background and Objective: There is an increasing desire to share de-identified electronic health records (EHRs) for secondary uses, but there are concerns that clinical terms can be exploited to compromise patient identities. Anonymization algorithms mitigate such threats while enabling novel discoveries, but their evaluation has been limited to single institutions. Here, we study how an existing clinical profile anonymization fares at multiple medical centers., Methods: We apply a state-of-the-art k-anonymization algorithm, with k set to the standard value 5, to the International Classification of Disease, ninth edition codes for patients in a hypothyroidism association study at three medical centers: Marshfield Clinic, Northwestern University, and Vanderbilt University. We assess utility when anonymizing at three population levels: all patients in 1) the EHR system; 2) the biorepository; and 3) a hypothyroidism study. We evaluate utility using 1) changes to the number included in the dataset, 2) number of codes included, and 3) regions where generalization and suppression were required., Results: Our findings yield several notable results. First, we show that anonymizing in the context of the entire EHR yields a significantly greater quantity of data by reducing the amount of generalized regions from ∼15% to ∼0.5%. Second, ∼70% of codes that needed generalization only generalized two or three codes in the largest anonymization., Conclusions: Sharing large volumes of clinical data in support of phenome-wide association studies is possible while safeguarding privacy to the underlying individuals., (© The Author 2015. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For Permissions, please email: journals.permissions@oup.com.)
Published: 2016
Full Text: View/download PDF
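
To make the k = 5 setting in entry 24 concrete, the sketch below checks whether a collection of clinical profiles already satisfies k-anonymity, meaning every distinct set of codes is shared by at least k patients. It only tests the property; it does not perform the generalization or suppression that an anonymization algorithm would apply, and the function names and sample codes are illustrative assumptions.

```python
from collections import Counter


def violating_profiles(profiles, k=5):
    """Return each distinct code set held by fewer than k patients.

    `profiles` is an iterable of code collections, one per patient.
    The result maps a frozenset of codes to its patient count.
    """
    counts = Counter(frozenset(profile) for profile in profiles)
    return {codes: n for codes, n in counts.items() if n < k}


def is_k_anonymous(profiles, k=5):
    """True when no code set is shared by fewer than k patients."""
    return not violating_profiles(profiles, k)
```

Profiles returned by `violating_profiles` are the ones an anonymizer would need to generalize or suppress before release.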

25. Inferring Clinical Workflow Efficiency via Electronic Medical Record Utilization.

Author: Chen Y, Xie W, Gunter CA, Liebovitz D, Mehrotra S, Zhang H, and Malin B
Subjects: Algorithms, Humans, Data Mining, Efficiency, Organizational, Electronic Health Records, Hospitals, Workflow
Abstract: Complexity in clinical workflows can lead to inefficiency in making diagnoses, ineffectiveness of treatment plans and uninformed management of healthcare organizations (HCOs). Traditional strategies to manage workflow complexity are based on measuring the gaps between workflows defined by HCO administrators and the actual processes followed by staff in the clinic. However, existing methods tend to neglect the influences of EMR systems on the utilization of workflows, which could be leveraged to optimize workflows facilitated through the EMR. In this paper, we introduce a framework to infer clinical workflows through the utilization of an EMR and show how such workflows roughly partition into four types according to their efficiency. Our framework infers workflows at several levels of granularity through data mining technologies. We study four months of EMR event logs from a large medical center, including 16,569 inpatient stays, and illustrate that approximately 95% of workflows are efficient and that 80% of patients are on such workflows. At the same time, we show that the remaining 5% of workflows may be inefficient due to a variety of factors, such as complex patients.
Published: 2015

26. Design and implementation of a privacy preserving electronic health record linkage tool in Chicago.

Author: Kho AN, Cashy JP, Jackson KL, Pah AR, Goel S, Boehnke J, Humphries JE, Kominers SD, Hota BN, Sims SA, Malin BA, French DD, Walunas TL, Meltzer DO, Kaleba EO, Jones RC, and Galanter WL
Subjects: Chicago, Computer Security, Health Insurance Portability and Accountability Act, Humans, United States, Confidentiality, Electronic Health Records standards, Health Information Exchange standards, Medical Record Linkage methods, Software
Abstract: Objective: To design and implement a tool that creates a secure, privacy preserving linkage of electronic health record (EHR) data across multiple sites in a large metropolitan area in the United States (Chicago, IL), for use in clinical research., Methods: The authors developed and distributed a software application that performs standardized data cleaning, preprocessing, and hashing of patient identifiers to remove all protected health information. The application creates seeded hash code combinations of patient identifiers using a Health Insurance Portability and Accountability Act compliant SHA-512 algorithm that minimizes re-identification risk. The authors subsequently linked individual records using a central honest broker with an algorithm that assigns weights to hash combinations in order to generate high specificity matches., Results: The software application successfully linked and de-duplicated 7 million records across 6 institutions, resulting in a cohort of 5 million unique records. Using a manually reconciled set of 11 292 patients as a gold standard, the software achieved a sensitivity of 96% and a specificity of 100%, with a majority of the missed matches accounted for by patients with both a missing social security number and last name change. Using 3 disease examples, it is demonstrated that the software can reduce duplication of patient records across sites by as much as 28%., Conclusions: Software that standardizes the assignment of a unique seeded hash identifier merged through an agreed upon third-party honest broker can enable large-scale secure linkage of EHR data for epidemiologic and public health research. The software algorithm can improve future epidemiologic research by providing more comprehensive data given that patients may make use of multiple healthcare systems., (© The Author 2015. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. 
For Permissions, please email: journals.permissions@oup.com.)
Published: 2015
Full Text: View/download PDF
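
The seeded hashing step described in entry 26 can be illustrated with Python's standard `hashlib` module. This is a minimal sketch, not the distributed application: the normalization rule, the field separator, and the way the seed is combined with the identifiers are assumptions introduced here for illustration.

```python
import hashlib
import re


def normalize(value: str) -> str:
    """Lowercase a field and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", value.lower())


def seeded_hash(seed: str, *fields: str) -> str:
    """Return a SHA-512 hex digest of the seed plus normalized fields.

    Assumption: the shared seed is simply prefixed to the identifiers;
    a production tool might use a keyed construction instead.
    """
    joined = "|".join(normalize(field) for field in fields)
    return hashlib.sha512(f"{seed}|{joined}".encode("utf-8")).hexdigest()
```

Sites that share the same seed produce identical digests for the same cleaned identifiers, which is what lets a third-party honest broker match records without seeing the underlying names or dates.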

27. Building bridges across electronic health record systems through inferred phenotypic topics.

Author: Chen Y, Ghosh J, Bejan CA, Gunter CA, Gupta S, Kho A, Liebovitz D, Sun J, Denny J, and Malin B
Subjects: Electronic Health Records classification, Natural Language Processing, Phenotype, United States, Electronic Health Records organization & administration, Information Storage and Retrieval methods, Machine Learning, Medical Record Linkage methods, Vocabulary, Controlled
Abstract: Objective: Data in electronic health records (EHRs) is being increasingly leveraged for secondary uses, ranging from biomedical association studies to comparative effectiveness. To perform studies at scale and transfer knowledge from one institution to another in a meaningful way, we need to harmonize the phenotypes in such systems. Traditionally, this has been accomplished through expert specification of phenotypes via standardized terminologies, such as billing codes. However, this approach may be biased by the experience and expectations of the experts, as well as the vocabulary used to describe such patients. The goal of this work is to develop a data-driven strategy to (1) infer phenotypic topics within patient populations and (2) assess the degree to which such topics facilitate a mapping across populations in disparate healthcare systems., Methods: We adapt a generative topic modeling strategy, based on latent Dirichlet allocation, to infer phenotypic topics. We utilize a variance analysis to assess the projection of a patient population from one healthcare system onto the topics learned from another system. The consistency of learned phenotypic topics was evaluated using (1) the similarity of topics, (2) the stability of a patient population across topics, and (3) the transferability of a topic across sites. We evaluated our approaches using four months of inpatient data from two geographically distinct healthcare systems: (1) Northwestern Memorial Hospital (NMH) and (2) Vanderbilt University Medical Center (VUMC)., Results: The method learned 25 phenotypic topics from each healthcare system. The average cosine similarity between matched topics across the two sites was 0.39, a remarkably high value given the very high dimensionality of the feature space. The average stability of VUMC and NMH patients across the topics of two sites was 0.988 and 0.812, respectively, as measured by the Pearson correlation coefficient. 
Also the VUMC and NMH topics have smaller variance of characterizing patient population of two sites than standard clinical terminologies (e.g., ICD9), suggesting they may be more reliably transferred across hospital systems., Conclusions: Phenotypic topics learned from EHR data can be more stable and transferable than billing codes for characterizing the general status of a patient population. This suggests that EHR-based research may be able to leverage such phenotypic topics as variables when pooling patient populations in predictive models., (Copyright © 2015 Elsevier Inc. All rights reserved.)
Published: 2015
Full Text: View/download PDF
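
Entry 27 reports an average cosine similarity of 0.39 between matched topics. For reference, a minimal sketch of the measure follows, assuming each topic is represented as a list of weights over a shared feature vocabulary (that representation is an assumption made here, not a detail from the abstract).

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # An all-zero vector has no direction; report no similarity.
        return 0.0
    return dot / (norm_u * norm_v)
```

Identical directions score 1 and vectors with no shared nonzero features score 0, which is why a mid-range value can still indicate substantial overlap when the feature space is very large.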

28. Limestone: high-throughput candidate phenotype generation via tensor factorization.

Author: Ho JC, Ghosh J, Steinhubl SR, Stewart WF, Denny JC, Malin BA, and Sun J
Subjects: Algorithms, Databases, Factual classification, Humans, Phenotype, Data Mining methods, Electronic Health Records classification
Abstract: The rapidly increasing availability of electronic health records (EHRs) from multiple heterogeneous sources has spearheaded the adoption of data-driven approaches for improved clinical research, decision making, prognosis, and patient management. Unfortunately, EHR data do not always directly and reliably map to medical concepts that clinical researchers need or use. Some recent studies have focused on EHR-derived phenotyping, which aims at mapping the EHR data to specific medical concepts; however, most of these approaches require labor intensive supervision from experienced clinical professionals. Furthermore, existing approaches are often disease-centric and specialized to the idiosyncrasies of the information technology and/or business practices of a single healthcare organization. In this paper, we propose Limestone, a nonnegative tensor factorization method to derive phenotype candidates with virtually no human supervision. Limestone represents the data source interactions naturally using tensors (a generalization of matrices). In particular, we investigate the interaction of diagnoses and medications among patients. The resulting tensor factors are reported as phenotype candidates that automatically reveal patient clusters on specific diagnoses and medications. Using the proposed method, multiple phenotypes can be identified simultaneously from data. We demonstrate the capability of Limestone on a cohort of 31,815 patient records from the Geisinger Health System. The dataset spans 7 years of longitudinal patient records and was initially constructed for a heart failure onset prediction study. Our experiments demonstrate the robustness, stability, and the conciseness of Limestone-derived phenotypes. Our results show that using only 40 phenotypes, we can outperform the original 640 features (169 diagnosis categories and 471 medication types) to achieve an area under the receiver operator characteristic curve (AUC) of 0.720 (95% CI 0.715 to 0.725). 
Moreover, in consultation with a medical expert, we confirmed 82% of the top 50 candidates automatically extracted by Limestone are clinically meaningful., (Copyright © 2014 Elsevier Inc. All rights reserved.)
Published: 2014
Full Text: View/download PDF

29. Size matters: how population size influences genotype-phenotype association studies in anonymized data.

Author: Heatherly R, Denny JC, Haines JL, Roden DM, and Malin BA
Subjects: Algorithms, Computer Simulation, Genotype, Humans, Phenotype, Polymorphism, Single Nucleotide, Biomedical Research methods, Confidentiality, Databases, Genetic, Electronic Health Records, Genetic Association Studies statistics & numerical data, Sample Size
Abstract: Objective: Electronic medical records (EMRs) data is increasingly incorporated into genome-phenome association studies. Investigators hope to share data, but there are concerns it may be "re-identified" through the exploitation of various features, such as combinations of standardized clinical codes. Formal anonymization algorithms (e.g., k-anonymization) can prevent such violations, but prior studies suggest that the size of the population available for anonymization may influence the utility of the resulting data. We systematically investigate this issue using a large-scale biorepository and EMR system through which we evaluate the ability of researchers to learn from anonymized data for genome-phenome association studies under various conditions., Methods: We use a k-anonymization strategy to simulate a data protection process (on data sets containing clinical codes) for resources of similar size to those found at nine academic medical institutions within the United States. Following the protection process, we replicate an existing genome-phenome association study and compare the discoveries using the protected data and the original data through the correlation (r²) of the p-values of association significance., Results: Our investigation shows that anonymizing an entire dataset with respect to the population from which it is derived yields significantly more utility than small study-specific datasets anonymized unto themselves. When evaluated using the correlation of genome-phenome association strengths on anonymized data versus original data across all nine simulated sites, results from the largest-scale anonymizations (population ∼100,000) retained better utility than those from smaller sizes (population ∼6000-75,000).
We observed a general trend of increasing r² for larger data set sizes: r²=0.9481 for small-sized datasets, r²=0.9493 for moderately-sized datasets, r²=0.9934 for large-sized datasets., Conclusions: This research implies that regardless of the overall size of an institution's data, there may be significant benefits to anonymization of the entire EMR, even if the institution is planning on releasing only data about a specific cohort of patients., (Copyright © 2014 Elsevier Inc. All rights reserved.)
Published: 2014
Full Text: View/download PDF
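
The k-anonymization idea mentioned in entry 29 can be illustrated with a minimal sketch. It assumes each record is simply the set of clinical codes for one patient and uses naive suppression of small groups; this is not the algorithm used in the study, and the function names and toy codes are hypothetical.

```python
from collections import Counter

def is_k_anonymous(records, k):
    """Return True if every distinct code combination is shared by at least k records."""
    counts = Counter(frozenset(r) for r in records)
    return all(c >= k for c in counts.values())

def suppress_small_groups(records, k):
    """Drop records whose code combination occurs fewer than k times (naive suppression)."""
    counts = Counter(frozenset(r) for r in records)
    return [r for r in records if counts[frozenset(r)] >= k]

# Toy data (hypothetical): each record is the set of clinical codes for one patient.
records = [
    {"250.00", "401.9"},
    {"250.00", "401.9"},
    {"250.00", "401.9"},
    {"428.0"},
]

protected = suppress_small_groups(records, k=2)
```

In this toy setting, a larger population makes it more likely that a given code combination already has k or more carriers, so fewer records would need to be altered. That is one plausible intuition for the population-size effect the abstract reports.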

30. De-identification of clinical narratives through writing complexity measures.

Author: Li M, Carrell D, Aberdeen J, Hirschman L, and Malin BA
Subjects: Cluster Analysis, Electronic Health Records, Narration, Writing
Abstract: Purpose: Electronic health records contain a substantial quantity of clinical narrative, which is increasingly reused for research purposes. To share data on a large scale and respect privacy, it is critical to remove patient identifiers. De-identification tools based on machine learning have been proposed; however, model training is usually based on either a random group of documents or a pre-existing document type designation (e.g., discharge summary). This work investigates if inherent features, such as the writing complexity, can identify document subsets to enhance de-identification performance., Methods: We applied an unsupervised clustering method to group two corpora based on writing complexity measures: a collection of over 4500 documents of varying document types (e.g., discharge summaries, history and physical reports, and radiology reports) from Vanderbilt University Medical Center (VUMC) and the publicly available i2b2 corpus of 889 discharge summaries. We compare the performance (via recall, precision, and F-measure) of de-identification models trained on such clusters with models trained on documents grouped randomly or by VUMC document type., Results: For the Vanderbilt dataset, it was observed that training and testing de-identification models on the same stylometric cluster (with the average F-measure of 0.917) tended to outperform models based on clusters of random documents (with an average F-measure of 0.881). It was further observed that increasing the size of a training subset sampled from a specific cluster could yield improved results (e.g., for subsets from a certain stylometric cluster, the F-measure rose from 0.743 to 0.841 when training size increased from 10 to 50 documents, and the F-measure reached 0.901 when the size of the training subset reached 200 documents).
For the i2b2 dataset, training and testing on the same clusters based on complexity measures (average F-score 0.966) did not significantly surpass randomly selected clusters (average F-score 0.965)., Conclusions: Our findings illustrate that, in environments consisting of a variety of clinical documentation, de-identification models trained on writing complexity measures are better than models trained on random groups and, in many instances, document types., (Copyright © 2014 Elsevier Ireland Ltd. All rights reserved.)
Published: 2014
Full Text: View/download PDF

31. We work with them? Healthcare workers interpretation of organizational relations mined from electronic health records.

Author: Chen Y, Lorenzi N, Nyemba S, Schildcrout JS, and Malin B
Subjects: Humans, Organizational Culture, Delivery of Health Care, Integrated organization & administration, Electronic Health Records statistics & numerical data, Health Personnel, Patient Care Team organization & administration, Systems Integration
Abstract: Objective: Models of healthcare organizations (HCOs) are often defined up front by a select few administrative officials and managers. However, given the size and complexity of modern healthcare systems, this practice does not scale easily. The goal of this work is to investigate the extent to which organizational relationships can be automatically learned from utilization patterns of electronic health record (EHR) systems., Method: We designed an online survey to solicit the perspectives of employees of a large academic medical center. We surveyed employees from two administrative areas: (1) Coding & Charge Entry and (2) Medical Information Services and two clinical areas: (3) Anesthesiology and (4) Psychiatry. To test our hypotheses we selected two administrative units that have work-related responsibilities with electronic records; however, for the clinical areas we selected two disciplines with very different patient responsibilities and whose accesses and people who accessed were similar. We provided each group of employees with questions regarding the chance of interaction between areas in the medical center in the form of association rules (e.g., Given someone from Coding & Charge Entry accessed a patient's record, what is the chance that someone from Medical Information Services accesses the same record?). We compared the respondent predictions with the rules learned from actual EHR utilization using linear mixed-effects regression models., Results: The findings from our survey confirm that medical center employees can distinguish between association rules of high and non-high likelihood when their own area is involved. Moreover, they can make such distinctions for any HCO area in this survey.
It was further observed that, with respect to highly likely interactions, respondents from certain areas were significantly better than other respondents at making such distinctions and certain areas' associations were more distinguishable than others., Conclusions: These results illustrate that EHR utilization patterns may be consistent with the expectations of HCO employees. Our findings show that certain areas in the HCO are easier than others for employees to assess, which suggests that automated learning strategies may yield more accurate models of healthcare organizations than those based on the perspectives of a select few individuals., (Copyright © 2014 Elsevier Ireland Ltd. All rights reserved.)
Published: 2014
Full Text: View/download PDF

32. PARAMO: a PARAllel predictive MOdeling platform for healthcare analytic research using electronic health records.

Author: Ng K, Ghoting A, Steinhubl SR, Stewart WF, Malin B, and Sun J
Subjects: Algorithms, Area Under Curve, Computer Systems, Decision Support Systems, Clinical, Health Services Research, Humans, Models, Theoretical, Reproducibility of Results, Software, Tennessee, Time Factors, Electronic Health Records, Medical Informatics methods
Abstract: Objective: Healthcare analytics research increasingly involves the construction of predictive models for disease targets across varying patient cohorts using electronic health records (EHRs). To facilitate this process, it is critical to support a pipeline of tasks: (1) cohort construction, (2) feature construction, (3) cross-validation, (4) feature selection, and (5) classification. To develop an appropriate model, it is necessary to compare and refine models derived from a diversity of cohorts, patient-specific features, and statistical frameworks. The goal of this work is to develop and evaluate a predictive modeling platform that can be used to simplify and expedite this process for health data., Methods: To support this goal, we developed a PARAllel predictive MOdeling (PARAMO) platform which (1) constructs a dependency graph of tasks from specifications of predictive modeling pipelines, (2) schedules the tasks in a topological ordering of the graph, and (3) executes those tasks in parallel. We implemented this platform using Map-Reduce to enable independent tasks to run in parallel in a cluster computing environment. Different task scheduling preferences are also supported., Results: We assess the performance of PARAMO on various workloads using three datasets derived from the EHR systems in place at Geisinger Health System and Vanderbilt University Medical Center and an anonymous longitudinal claims database. We demonstrate significant gains in computational efficiency against a standard approach. In particular, PARAMO can build 800 different models on a 300,000 patient data set in 3 h in parallel compared to 9 days if running sequentially., Conclusion: This work demonstrates that an efficient parallel predictive modeling platform can be developed for EHR data. This platform can facilitate large-scale modeling endeavors and speed up the research workflow and reuse of health information.
This platform is only a first step and provides the foundation for our ultimate goal of building analytic pipelines that are specialized for health data researchers., (Copyright © 2013 Elsevier Inc. All rights reserved.)
Published: 2014
Full Text: View/download PDF
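
The three steps described in entry 32 (build a dependency graph, order it topologically, run independent tasks in parallel) can be sketched on a single machine. The sketch assumes a thread pool in place of the Map-Reduce cluster used by PARAMO. The function `run_pipeline`, the task names, and the placeholder task body are hypothetical and are not part of the platform.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED
from graphlib import TopologicalSorter

def run_pipeline(dependencies, task_fn, max_workers=4):
    """Run tasks in dependency order, executing independent tasks concurrently.

    dependencies maps each task name to the set of tasks it depends on.
    Returns the task names in the order they finished.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    finished = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        running = {}
        while sorter.is_active():
            # Submit every task whose dependencies are all complete.
            for task in sorter.get_ready():
                running[pool.submit(task_fn, task)] = task
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                future.result()  # re-raise any task failure
                task = running.pop(future)
                finished.append(task)
                sorter.done(task)  # unlocks tasks that depend on this one
    return finished

# Hypothetical pipeline with two cross-validation folds that can run side by side.
pipeline = {
    "cohort_construction": set(),
    "feature_construction": {"cohort_construction"},
    "cross_validation": {"feature_construction"},
    "feature_selection_fold1": {"cross_validation"},
    "feature_selection_fold2": {"cross_validation"},
    "classification_fold1": {"feature_selection_fold1"},
    "classification_fold2": {"feature_selection_fold2"},
}

order = run_pipeline(pipeline, task_fn=lambda name: name)  # placeholder task body
```

Here the two fold-specific branches become ready at the same time once `cross_validation` finishes, which is the kind of independence a parallel scheduler can exploit.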

33. Predicting changes in hypertension control using electronic health records from a chronic disease management program.

Author: Sun J, McNaughton CD, Zhang P, Perer A, Gkoulalas-Divanis A, Denny JC, Kirby J, Lasko T, Saip A, and Malin BA
Subjects: Antihypertensive Agents therapeutic use, Chronic Disease, Humans, Models, Theoretical, Prognosis, Disease Management, Electronic Health Records, Hypertension therapy
Abstract: Objective: Common chronic diseases such as hypertension are costly and difficult to manage. Our ultimate goal is to use data from electronic health records to predict the risk and timing of deterioration in hypertension control. Towards this goal, this work predicts the transition points at which hypertension is brought into, as well as pushed out of, control., Method: In a cohort of 1294 patients with hypertension enrolled in a chronic disease management program at the Vanderbilt University Medical Center, patients are modeled as an array of features derived from the clinical domain over time, which are distilled into a core set using an information gain criterion regarding their predictive performance. A model for transition point prediction was then computed using a random forest classifier., Results: The most predictive features for transitions in hypertension control status included hypertension assessment patterns, comorbid diagnoses, procedures and medication history. The final random forest model achieved a c-statistic of 0.836 (95% CI 0.830 to 0.842) and an accuracy of 0.773 (95% CI 0.766 to 0.780)., Conclusions: This study achieved accurate prediction of transition points of hypertension control status, an important first step in the long-term goal of developing personalized hypertension management plans.
Published: 2014
Full Text: View/download PDF

34. Ethical and practical challenges to studying patients who opt out of large-scale biorepository research.

Author: Rosenbloom ST, Madison JL, Brothers KB, Bowton EA, Clayton EW, Malin BA, Roden DM, and Pulley J
Subjects: Biomedical Research methods, Humans, Informed Consent, Biological Specimen Banks ethics, Biomedical Research ethics, Electronic Health Records ethics, Patient Acceptance of Health Care
Abstract: Large-scale biorepositories that couple biologic specimens with electronic health records containing documentation of phenotypic expression can accelerate scientific research and discovery. However, differences between those subjects who participate in biorepository-based research and the population from which they are drawn may influence research validity. While an opt-out approach to biorepository-based research enhances inclusiveness, empirical research evaluating voluntariness, risk, and the feasibility of an opt-out approach is sparse, and factors influencing patients' decisions to opt out are understudied. Determining why patients choose to opt out may help to improve voluntariness; however, there may be ethical and logistical challenges to studying those who opt out. In this perspective paper, the authors explore what is known about research based on the opt-out model, describe a large-scale biorepository that leverages the opt-out model, and review specific ethical and logistical challenges to bridging the research gaps that remain.
Published: 2013
Full Text: View/download PDF

35. Location bias of identifiers in clinical narratives.

Author: Hanauer DA, Mei Q, Malin B, and Zheng K
Subjects: Health Insurance Portability and Accountability Act, Humans, Medical Records Systems, Computerized, Narration, United States, Computer Security, Confidentiality, Electronic Health Records
Abstract: Scrubbing identifying information from narrative clinical documents is a critical first step in preparing the data for secondary use purposes, such as translational research. Evidence suggests that the differential distribution of protected health information (PHI) in clinical documents could be used as additional features to improve the performance of automated de-identification algorithms or toolkits. However, there has been little investigation into the extent to which such phenomena transpire in practice. To empirically assess this issue, we identified the location of PHI in 140,000 clinical notes from an electronic health record system and characterized the distribution as a function of location in a document. In addition, we calculated the 'word proximity' of nearby PHI elements to determine their co-occurrence rates. The PHI elements were found to have non-random distribution patterns. Location within a document and proximity between PHI elements might therefore be used to help de-identification systems better label PHI.
Published: 2013

36. Ethical, legal, and social implications of incorporating genomic information into electronic health records.

Author: Hazin R, Brothers KB, Malin BA, Koenig BA, Sanderson SC, Rothstein MA, Williams MS, Clayton EW, and Kullo IJ
Subjects: Computer Security, Confidentiality, Decision Support Systems, Clinical, Genetic Privacy, Health Literacy, Health Records, Personal, Humans, Incidental Findings, Patient Access to Records, Precision Medicine, Electronic Health Records ethics, Electronic Health Records legislation & jurisprudence, Genomics ethics, Genomics legislation & jurisprudence
Abstract: The inclusion of genomic data in the electronic health record raises important ethical, legal, and social issues. In this article, we highlight these challenges and discuss potential solutions. We provide a brief background on the current state of electronic health records in the context of genomic medicine, discuss the importance of equitable access to genome-enabled electronic health records, and consider the potential use of electronic health records for improving genomic literacy in patients and providers. We highlight the importance of privacy, access, and security, and of determining which genomic information is included in the electronic health record. Finally, we discuss the challenges of reporting incidental findings, storing and reinterpreting genomic data, and nondocumentation and duty to warn family members at potential genetic risk.
Published: 2013
Full Text: View/download PDF

37. A practical approach to achieve private medical record linkage in light of public resources.

Author: Kuzu M, Kantarcioglu M, Durham EA, Toth C, and Malin B
Subjects: Humans, United States, Computer Security, Confidentiality, Electronic Health Records organization & administration, Medical Record Linkage
Abstract: Objective: Integration of patients' records across resources enhances analytics. To address privacy concerns, emerging strategies, such as Bloom filter encodings (BFEs), enable integration while obscuring identifiers. However, recent investigations demonstrate BFEs are, in theory, vulnerable to cryptanalysis when encoded identifiers are randomly selected from a public resource. This study investigates the extent to which cryptanalysis conditions hold for (1) real patient records and (2) a countermeasure that obscures the frequencies of the identifying values in encoded datasets., Design: First, to investigate the strength of cryptanalysis for real patient records, we build BFEs from identifiers in an electronic medical record system and apply cryptanalysis using identifiers in a publicly available voter registry. Second, to investigate the countermeasure under ideal cryptanalysis conditions, we compose BFEs from the identifiers that are randomly selected from a public voter registry., Measurement: We utilize precision (ie, rate of correctly re-identified encodings) and computation efficiency (ie, time to complete cryptanalysis) to assess the performance of cryptanalysis in BFEs before and after application of the countermeasure., Results: Cryptanalysis can achieve high precision when the encoded identifiers are composed of a random sample of a public resource (ie, a voter registry). However, we also find that the attack is less efficient and may not be practical for more realistic scenarios. By contrast, the proposed countermeasure made cryptanalysis impractical in terms of precision and efficiency., Conclusions: Performance of cryptanalysis against BFEs based on patient data is significantly lower than theoretical estimates. The proposed countermeasure makes BFEs resistant to known practical attacks.
Published: 2013
Full Text: View/download PDF
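
For readers unfamiliar with the Bloom filter encodings discussed in entry 37, the following is a minimal sketch of one common construction: character bigrams hashed into a fixed-length bit array and compared with the Dice coefficient. The filter length, number of hash functions, choice of hash algorithms, and all function names are assumptions made for illustration. The sketch does not reproduce the encodings, the attack, or the countermeasure evaluated in the study.

```python
import hashlib

def bigrams(value):
    """Split a padded, lower-cased string into overlapping character bigrams."""
    padded = f"_{value.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def bloom_encode(value, num_bits=1000, num_hashes=4):
    """Return the set of bit positions set when encoding the bigrams of value."""
    positions = set()
    for gram in bigrams(value):
        h1 = int(hashlib.sha256(gram.encode()).hexdigest(), 16)
        h2 = int(hashlib.sha512(gram.encode()).hexdigest(), 16)
        for i in range(num_hashes):
            # Double hashing: derive several positions from two base hashes.
            positions.add((h1 + i * h2) % num_bits)
    return positions

def dice_similarity(a, b):
    """Dice coefficient between two encodings given as sets of bit positions."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

same = dice_similarity(bloom_encode("smith"), bloom_encode("smith"))
close = dice_similarity(bloom_encode("smith"), bloom_encode("smyth"))
far = dice_similarity(bloom_encode("smith"), bloom_encode("jones"))
```

Because this unkeyed encoding is deterministic, identical identifiers always yield identical bit patterns, so frequent names produce frequent encodings. That frequency signal is what a countermeasure that obscures value frequencies would aim to hide.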

38. Hiding in plain sight: use of realistic surrogates to reduce exposure of protected health information in clinical text.

Author: Carrell D, Malin B, Aberdeen J, Bayer S, Clark C, Wellner B, and Hirschman L
Subjects: Biomedical Research statistics & numerical data, Data Collection, Humans, Pilot Projects, United States, Computer Security, Confidentiality, Electronic Health Records, Information Dissemination, Natural Language Processing
Abstract: Objective: Secondary use of clinical text is impeded by a lack of highly effective, low-cost de-identification methods. Both manual and automated methods for removing protected health information are known to leave behind residual identifiers. The authors propose a novel approach for addressing the residual identifier problem based on the theory of Hiding In Plain Sight (HIPS)., Materials and Methods: HIPS relies on obfuscation to conceal residual identifiers. According to this theory, replacing the detected identifiers with realistic but synthetic surrogates should collectively render the few 'leaked' identifiers difficult to distinguish from the synthetic surrogates. The authors conducted a pilot study to test this theory on clinical narrative, de-identified by an automated system. Test corpora included 31 oncology and 50 family practice progress notes read by two trained chart abstractors and an informaticist., Results: Experimental results suggest approximately 90% of residual identifiers can be effectively concealed by the HIPS approach in text containing average and high densities of personal identifying information., Discussion: This pilot test suggests HIPS is feasible, but requires further evaluation. The results need to be replicated on larger corpora of diverse origin under a range of detection scenarios. Error analyses also suggest areas where surrogate generation techniques can be refined to improve efficacy., Conclusions: If these results generalize to existing high-performing de-identification systems with recall rates of 94-98%, HIPS could increase the effective de-identification rates of these systems to levels above 99% without further advancements in system recall. Additional and more rigorous assessment of the HIPS approach is warranted.
Published: 2013
Full Text: View/download PDF
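
The arithmetic behind the "above 99%" figure in entry 38 can be checked with a short calculation. It rests on one assumed reading of the abstract: identifiers detected by the system count as protected, and roughly 90% of the residual (missed) identifiers are additionally concealed by the surrogates. The function name is hypothetical.

```python
def effective_deidentification_rate(recall, concealment):
    """Fraction of identifiers either detected or concealed among the leaked ones.

    recall: fraction of identifiers the de-identification system detects.
    concealment: fraction of residual identifiers hidden by realistic surrogates.
    """
    return recall + (1.0 - recall) * concealment

# Recall range and concealment rate as quoted in the abstract.
low = effective_deidentification_rate(0.94, 0.90)
high = effective_deidentification_rate(0.98, 0.90)
```

Under this reading, a recall of 94% gives an effective rate of about 99.4% and a recall of 98% gives about 99.8%, which is consistent with the abstract's claim.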

39. Reducing patient re-identification risk for laboratory results within research datasets.

Author: Atreya RV, Smith JC, McCoy AB, Malin B, and Miller RA
Subjects: Algorithms, Biomedical Research, Feasibility Studies, Humans, Information Dissemination, United States, Clinical Laboratory Information Systems, Computer Security, Confidentiality, Electronic Health Records, Medical Record Linkage
Abstract: Objective: To try to lower patient re-identification risks for biomedical research databases containing laboratory test results while also minimizing changes in clinical data interpretation., Materials and Methods: In our threat model, an attacker obtains 5-7 laboratory results from one patient and uses them as a search key to discover the corresponding record in a de-identified biomedical research database. To test our models, the existing Vanderbilt TIME database of 8.5 million Safe Harbor de-identified laboratory results from 61,280 patients was used. The uniqueness of unaltered laboratory results in the dataset was examined, and then two data perturbation models were applied: simple random offsets and an expert-derived clinical meaning-preserving model. A rank-based re-identification algorithm to mimic an attack was used. The re-identification risk and the retention of clinical meaning for each model's perturbed laboratory results were assessed., Results: Differences in re-identification rates between the algorithms were small despite substantial divergence in altered clinical meaning. The expert algorithm maintained the clinical meaning of laboratory results better (affecting up to 4% of test results) than simple perturbation (affecting up to 26%)., Discussion and Conclusion: With growing impetus for sharing clinical data for research, and in view of healthcare-related federal privacy regulation, methods to mitigate risks of re-identification are important. A practical, expert-derived perturbation algorithm that demonstrated potential utility was developed. Similar approaches might enable administrators to select data protection scheme parameters that meet their preferences in the trade-off between the protection of privacy and the retention of clinical meaning of shared data.
Published: 2013
Full Text: View/download PDF

40. Biomedical data privacy: problems, perspectives, and recent advances.

Author: Malin BA, Emam KE, and O'Keefe CM
Subjects: Data Collection methods, Health Insurance Portability and Accountability Act, Humans, Information Storage and Retrieval, United States, Confidentiality ethics, Confidentiality legislation & jurisprudence, Electronic Health Records, Information Dissemination
Published: 2013
Full Text: View/download PDF

41. Anonymization of longitudinal electronic medical records.

Author: Tamersoy A, Loukides G, Nergiz ME, Saygin Y, and Malin B
Subjects: Algorithms, Cluster Analysis, Cohort Studies, Database Management Systems, Humans, Confidentiality, Electronic Health Records
Abstract: Electronic medical record (EMR) systems have enabled healthcare providers to collect detailed patient information from the primary care domain. At the same time, longitudinal data from EMRs are increasingly combined with biorepositories to generate personalized clinical decision support protocols. Emerging policies encourage investigators to disseminate such data in a deidentified form for reuse and collaboration, but organizations are hesitant to do so because they fear such actions will jeopardize patient privacy. In particular, there are concerns that residual demographic and clinical features could be exploited for reidentification purposes. Various approaches have been developed to anonymize clinical data, but they neglect temporal information and are, thus, insufficient for emerging biomedical research paradigms. This paper proposes a novel approach to share patient-specific longitudinal data that offers robust privacy guarantees, while preserving data utility for many biomedical investigations. Our approach aggregates temporal and diagnostic information using heuristics inspired from sequence alignment and clustering methods. We demonstrate that the proposed approach can generate anonymized data that permit effective biomedical analysis using several patient cohorts derived from the EMR system of the Vanderbilt University Medical Center.
Published: 2012
Full Text: View/download PDF

42. Learning relational policies from electronic health record access logs.

Author: Malin B, Nyemba S, and Paulett J
Subjects: Computer Security, Confidentiality, Data Mining, Humans, Policy, Electronic Health Records, Health Policy
Abstract: Modern healthcare organizations (HCOs) are composed of complex dynamic teams to ensure clinical operations are executed in a quick and competent manner. At the same time, the fluid nature of such environments hinders administrators' efforts to define access control policies that appropriately balance patient privacy and healthcare functions. Manual efforts to define these policies are labor-intensive and error-prone, often resulting in systems that endow certain care providers with overly broad access to patients' medical records while restricting other providers from legitimate and timely use. In this work, we propose an alternative method to generate these policies by automatically mining usage patterns from electronic health record (EHR) systems. EHR systems are increasingly being integrated into clinical environments and our approach is designed to be generalizable across HCOs, thus assisting in the design and evaluation of local access control policies. Our technique, which is grounded in data mining and social network analysis theory, extracts a statistical model of the organization from the access logs of its EHRs. In doing so, our approach enables the review of predefined policies, as well as the discovery of unknown behaviors. We evaluate our approach with 5 months of access logs from the Vanderbilt University Medical Center and confirm the existence of stable social structures and intuitive business operations. Additionally, we demonstrate that there is significant turnover in the interactions between users in the HCO and that policies learned at the department-level afford greater stability over time., (Copyright © 2011 Elsevier Inc. All rights reserved.)
Published: 2011
Full Text: View/download PDF

43. Never too old for anonymity: a statistical standard for demographic data sharing via the HIPAA Privacy Rule.

Author: Malin B, Benitez K, and Masys D
Subjects: Biomedical Research, Databases, Factual, Electronic Health Records legislation & jurisprudence, Female, Genomics, Health Insurance Portability and Accountability Act, Humans, Male, Risk, United States, Aged, 80 and over, Confidentiality legislation & jurisprudence, Demography, Electronic Health Records statistics & numerical data, Information Dissemination legislation & jurisprudence
Abstract: Objective: Healthcare organizations must de-identify patient records before sharing data. Many organizations rely on the Safe Harbor Standard of the HIPAA Privacy Rule, which enumerates 18 identifiers that must be suppressed (eg, ages over 89). An alternative model in the Privacy Rule, known as the Statistical Standard, can facilitate the sharing of more detailed data, but is rarely applied because of a lack of published methodologies. The authors propose an intuitive approach to de-identifying patient demographics in accordance with the Statistical Standard., Design: The authors conduct an analysis of the demographics of patient cohorts in five medical centers developed for the NIH-sponsored Electronic Medical Records and Genomics network, with respect to the US census. They report the re-identification risk of patient demographics disclosed according to the Safe Harbor policy and the relative risk rate for sharing such information via alternative policies., Measurements: The re-identification risk of Safe Harbor demographics ranged from 0.01% to 0.19%. The findings show alternative de-identification models can be created with risks no greater than Safe Harbor. The authors illustrate that the disclosure of patient ages over the age of 89 is possible when other features are reduced in granularity., Limitations: The de-identification approach described in this paper was evaluated with demographic data only and should be evaluated with other potential identifiers., Conclusion: Alternative de-identification policies to the Safe Harbor model can be derived for patient demographics to enable the disclosure of values that were previously suppressed. The method is generalizable to any environment in which population statistics are available.
Published: 2011
Full Text: View/download PDF

44. Role prediction using Electronic Medical Record system audits.

Author: Zhang W, Gunter CA, Liebovitz D, Tian J, and Malin B
Subjects: Humans, Medical Audit, Medical Records Systems, Computerized, Algorithms, Computer Security, Electronic Health Records
Abstract: Electronic Medical Records (EMRs) provide convenient access to patient data for parties who should have it, but, unless managed properly, may also provide it to those who should not. Distinguishing the two is a core security challenge for EMRs. Strategies proposed to address these problems include Role Based Access Control (RBAC), which assigns collections of privileges called roles to users, and Experience Based Access Management (EBAM), which analyzes audit logs to determine access rights. In this paper, we integrate RBAC and EBAM through an algorithm, called Roll-Up, to manage roles effectively. In doing so, we introduce the concept of "role prediction" to identify roles from audit data. We apply the algorithm to three months of logs from Northwestern Memorial Hospital's Cerner system with approximately 8000 users and 140 roles. We demonstrate that existing roles can be predicted with 50% accuracy and intelligent grouping of roles through Roll-Up can facilitate 65% accuracy.
Published: 2011

45. The MITRE Identification Scrubber Toolkit: design, training, and assessment.

Author: Aberdeen J, Bayer S, Yeniterzi R, Wellner B, Clark C, Hanauer D, Malin B, and Hirschman L
Subjects: Algorithms, Confidentiality, Data Collection, Humans, Medical Record Linkage methods, Electronic Health Records, Medical Record Linkage standards, Patient Identification Systems, Software
Abstract: Purpose: Medical records must often be stripped of patient identifiers, or de-identified, before being shared. De-identification by humans is time-consuming, and existing software is limited in its generality. The open source MITRE Identification Scrubber Toolkit (MIST) provides an environment to support rapid tailoring of automated de-identification to different document types, using automatically learned classifiers to de-identify and protect sensitive information., Methods: MIST was evaluated with four classes of patient records from the Vanderbilt University Medical Center: discharge summaries, laboratory reports, letters, and order summaries. We trained and tested MIST on each class of record separately, as well as on pooled sets of records. We measured precision, recall, F-measure and accuracy at the word level for the detection of patient identifiers as designated by the HIPAA Safe Harbor Rule., Results: MIST was applied to medical records that differed in the amounts and types of protected health information (PHI): lab reports contained only two types of PHI (dates, names) compared to discharge summaries, which were much richer. Performance of the de-identification tool depended on record class; F-measure results were 0.996 for order summaries, 0.996 for discharge summaries, 0.943 for letters and 0.934 for laboratory reports. Experiments suggest the tool requires several hundred training exemplars to reach an F-measure of at least 0.9., Conclusions: The MIST toolkit makes possible the rapid tailoring of automated de-identification to particular document types and supports the transition of the de-identification software to medical end users, avoiding the need for developers to have access to original medical records. We are making the MIST toolkit available under an open source license to encourage its application to diverse data sets at multiple institutions., (Copyright © 2010 Elsevier Ireland Ltd. All rights reserved.)
Published: 2010
Full Text: View/download PDF

46. Anonymization of administrative billing codes with repeated diagnoses through censoring.

Author: Tamersoy A, Loukides G, Denny JC, and Malin B
Subjects: Clinical Coding, Confidentiality, Humans, Electronic Health Records, Privacy
Abstract: Patient-specific data from electronic medical records (EMRs) is increasingly shared in a de-identified form to support research. However, EMRs are susceptible to noise, error, and variation, which can limit their utility for reuse. One way to enhance the utility of EMRs is to record the number of times diagnosis codes are assigned to a patient when this data is shared. This is, however, challenging because releasing such data may be leveraged to compromise patients' identity. In this paper, we present an approach that, to the best of our knowledge, is the first that can prevent re-identification through repeated diagnosis codes. Our method transforms records to preserve privacy while retaining much of their utility. Experiments conducted using 2676 patients from the EMR system of the Vanderbilt University Medical Center verify that our method is able to retain an average of 95.4% of the diagnosis codes in a common data sharing scenario.
Published: 2010

47. The disclosure of diagnosis codes can breach research participants' privacy.

Author: Loukides G, Denny JC, and Malin B
Subjects: Aged, Female, Humans, Male, Middle Aged, Tennessee, United States, Confidentiality, Electronic Health Records, International Classification of Diseases, Medical Record Linkage, Research Subjects
Abstract: Objective: De-identified clinical data in standardized form (eg, diagnosis codes), derived from electronic medical records, are increasingly combined with research data (eg, DNA sequences) and disseminated to enable scientific investigations. This study examines whether released data can be linked with identified clinical records that are accessible via various resources to jeopardize patients' anonymity, and the ability of popular privacy protection methodologies to prevent such an attack. Design: The study experimentally evaluates the re-identification risk of a de-identified sample of Vanderbilt's patient records involved in a genome-wide association study. It also measures the level of protection from re-identification, and data utility, provided by suppression and generalization. Measurement: Privacy protection is quantified using the probability of re-identifying a patient in a larger population through diagnosis codes. Data utility is measured at a dataset level, using the percentage of retained information, as well as its description, and at a patient level, using two metrics based on the difference between the distribution of International Classification of Diseases (ICD) version 9 codes before and after applying privacy protection. Results: More than 96% of 2800 patients' records are shown to be uniquely identified by their diagnosis codes with respect to a population of 1.2 million patients. Generalization is shown to reduce further the percentage of de-identified records by less than 2%, and over 99% of the three-digit ICD-9 codes need to be suppressed to prevent re-identification. Conclusions: Popular privacy protection methods are inadequate to deliver a sufficiently protected and useful result when sharing data derived from complex clinical systems. The development of alternative privacy protection models is thus required.
Published: 2010
Full Text: View/download PDF

48. Anonymization of electronic medical records for validating genome-wide association studies.

Author: Loukides G, Gkoulalas-Divanis A, and Malin B
Subjects: Algorithms, Anonymous Testing methods, Data Collection methods, Electronic Health Records, Genetic Privacy, Genome-Wide Association Study methods, Precision Medicine methods
Abstract: Genome-wide association studies (GWAS) facilitate the discovery of genotype-phenotype relations from population-based sequence databases, which is an integral facet of personalized medicine. The increasing adoption of electronic medical records allows large amounts of patients' standardized clinical features to be combined with the genomic sequences of these patients and shared to support validation of GWAS findings and to enable novel discoveries. However, disseminating these data "as is" may lead to patient reidentification when genomic sequences are linked to resources that contain the corresponding patients' identity information based on standardized clinical features. This work proposes an approach that provably prevents this type of data linkage and furnishes a result that helps support GWAS. Our approach automatically extracts potentially linkable clinical features and modifies them in a way that they can no longer be used to link a genomic sequence to a small number of patients, while preserving the associations between genomic sequences and specific sets of clinical features corresponding to GWAS-related diseases. Extensive experiments with real patient data derived from the Vanderbilt University Medical Center verify that our approach generates data that eliminate the threat of individual reidentification, while supporting GWAS validation and clinical case analysis tasks.
Published: 2010
Full Text: View/download PDF

49. Effects of personal identifier resynthesis on clinical text de-identification.

Author: Yeniterzi R, Aberdeen J, Bayer S, Wellner B, Hirschman L, and Malin B
Subjects: Humans, Information Storage and Retrieval, United States, Artificial Intelligence, Computer Security, Confidentiality, Electronic Health Records, Software
Abstract: Objective: De-identified medical records are critical to biomedical research. Text de-identification software exists, including "resynthesis" components that replace real identifiers with synthetic identifiers. The goal of this research is to evaluate the effectiveness and examine possible bias introduced by resynthesis on de-identification software. Design: We evaluated the open-source MITRE Identification Scrubber Toolkit, which includes a resynthesis capability, with clinical text from Vanderbilt University Medical Center patient records. We investigated four record classes from over 500 patients' files, including laboratory reports, medication orders, discharge summaries and clinical notes. We trained and tested the de-identification tool on real and resynthesized records. Measurements: We measured performance in terms of precision, recall, F-measure and accuracy for the detection of protected health identifiers as designated by the HIPAA Safe Harbor Rule. Results: The de-identification tool was trained and tested on a collection of real and resynthesized Vanderbilt records. Results for training and testing on the real records were 0.990 accuracy and 0.960 F-measure. The results improved when trained and tested on resynthesized records with 0.998 accuracy and 0.980 F-measure but deteriorated moderately when trained on real records and tested on resynthesized records with 0.989 accuracy and 0.862 F-measure. Moreover, the results declined significantly when trained on resynthesized records and tested on real records with 0.942 accuracy and 0.728 F-measure. Conclusion: The de-identification tool achieves high accuracy when training and test sets are homogeneous (ie, both real or resynthesized records). The resynthesis component regularizes the data to make them less "realistic," resulting in loss of performance particularly when training on resynthesized data and testing on real data.
Published: 2010
Full Text: View/download PDF

50. Evaluating re-identification risks with respect to the HIPAA privacy rule.

Author: Benitez K and Malin B
Subjects: Health Insurance Portability and Accountability Act, Humans, Models, Statistical, Registries standards, Risk Assessment, United States, Access to Information legislation & jurisprudence, Computer Security, Confidentiality legislation & jurisprudence, Electronic Health Records, Guideline Adherence
Abstract: Objective: Many healthcare organizations follow data protection policies that specify which patient identifiers must be suppressed to share "de-identified" records. Such policies, however, are often applied without knowledge of the risk of "re-identification". The goals of this work are: (1) to estimate re-identification risk for data sharing policies of the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule; and (2) to evaluate the risk of a specific re-identification attack using voter registration lists. Measurements: We define several risk metrics: (1) expected number of re-identifications; (2) estimated proportion of a population in a group of size g or less, and (3) monetary cost per re-identification. For each US state, we estimate the risk posed to hypothetical datasets, protected by the HIPAA Safe Harbor and Limited Dataset policies by an attacker with full knowledge of patient identifiers and with limited knowledge in the form of voter registries. Results: The percentage of a state's population estimated to be vulnerable to unique re-identification (ie, g=1) when protected via Safe Harbor and Limited Datasets ranges from 0.01% to 0.25% and 10% to 60%, respectively. In the voter attack, this number drops for many states, and for some states is 0%, due to the variable availability of voter registries in the real world. We also find that re-identification cost ranges from $0 to $17,000, further confirming risk variability. Conclusions: This work illustrates that blanket protection policies, such as Safe Harbor, leave different organizations vulnerable to re-identification at different rates. It provides justification for locally performed re-identification risk estimates prior to sharing data.
Published: 2010
Full Text: View/download PDF
