1. Additional file 1 of Establishing a framework for privacy-preserving record linkage among electronic health record and administrative claims databases within PCORnet®, the National Patient-Centered Clinical Research Network
- Author
Kiernan, Daniel; Carton, Thomas; Toh, Sengwee; Phua, Jasmin; Zirkle, Maryan; Louzao, Darcy; Haynes, Kevin; Weiner, Mark; Angulo, Francisco; Bailey, Charles; Bian, Jiang; Fort, Daniel; Grannis, Shaun; Krishnamurthy, Ashok Kumar; Nair, Vinit; Rivera, Pedro; Silverstein, Jonathan; and Marsolo, Keith
- Abstract
Additional file 1: Figure S1. Overview of the tokenization and linkage flow within PCORnet. The following steps are used to complete the linkage tasks: 1) Run Datavant tokenize. Each site runs the Datavant application in tokenize mode on-premises to generate tokens in its own site-specific token encryption scheme. Every site’s tokens are unique, so a security breach at one site would not propagate to other sites in the ecosystem. No linking can happen without a site’s permission. 2) Run Datavant transform-tokens. Each site runs the Datavant application in "transform-tokens to" mode to prepare tokens for sending to the Coordinating Center (CC); tokens are uniquely encrypted in transit. 3) Run Datavant transform-tokens. The CC runs the Datavant application in "transform-tokens from" mode to transform tokens into a common CC encryption scheme. Thus, tokens can only be linked at the CC. 4) CC Performs Overlaps. The CC runs Datavant Match to determine overlap among records.
Figure S2. Overall data flow for the linkage query. Partners generate de-identified tokens from PII held within their source systems (a). The Token Team of the Coordinating Center distributes a SAS query to extract tokens from the HASH_TOKEN table of the CDM (b). Partners execute the query against their CDM and upload the results to a Secure File Transfer location. These tokens are processed by the Datavant software solution (c), and a Match Index is then generated by executing the Datavant Match software (d). A version of this Index that does not include the underlying tokens is passed to the Query Team of the Coordinating Center (e). The Query Team distributes a SAS query to extract demographic data. Partners execute it against the CDM, and the results are returned to a second Secure File Transfer location (f). The Query Team pulls down the results and combines them with the token-less Match Index to complete the overlap analysis and to generate the summary demographic table (g). For this study, DCRI acted as the Token Team of the Coordinating Center and HPHCI as the Query Team. Four Partners participated in this initial pilot.
Table S1. Illustration of the content of the HASH_TOKEN extract. There is one row per patient. DMID_PATID corresponds to the ID of the contributing DataMart and PATID corresponds to a pseudoidentifier used to link across all information belonging to a patient within the DataMart’s CDM.
Table S2. Example of matching output. MATCH_ID is used to denote patients that match across DataMarts (DMs). Each MATCH_ID corresponds to a unique patient. DMID_PATID is the internal DM identifier (but does not contain patient identifiers). The TOKEN columns are populated with encrypted hash tokens. MATCH_ID and DMID_PATID are needed to perform the overlap analysis and generate the de-duplicated patient demographic characteristics table.
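The role of MATCH_ID and DMID_PATID in the overlap analysis can be illustrated with a minimal Python sketch. It is not the study's SAS code or the Datavant Match software. The rows, the single TOKEN_1 column, the demographic fields, the `DMxx_nnnn` layout of DMID_PATID, and the "first record seen wins" de-duplication rule are all assumptions made for illustration, as are the function names.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical Match Index rows: (MATCH_ID, DMID_PATID, TOKEN_1).
# Every value here is invented; real token columns hold encrypted hashes.
match_index = [
    ("M1", "DM01_0001", "tokA"),
    ("M1", "DM02_0417", "tokA"),
    ("M2", "DM01_0002", "tokB"),
    ("M3", "DM02_0099", "tokC"),
    ("M3", "DM03_0005", "tokC"),
    ("M3", "DM01_0003", "tokC"),
]

# Hypothetical demographic extract keyed by DMID_PATID.
demographics = {
    "DM01_0001": {"SEX": "F"},
    "DM02_0417": {"SEX": "F"},
    "DM01_0002": {"SEX": "M"},
    "DM02_0099": {"SEX": "M"},
    "DM03_0005": {"SEX": "M"},
    "DM01_0003": {"SEX": "M"},
}


def strip_tokens(rows):
    """Build the token-less index: keep only MATCH_ID and DMID_PATID."""
    return [(match_id, dmid_patid) for match_id, dmid_patid, *_ in rows]


def datamart_of(dmid_patid):
    """Assumed layout: DataMart ID, an underscore, then the PATID."""
    return dmid_patid.split("_", 1)[0]


def pairwise_overlap(tokenless):
    """Count distinct MATCH_IDs shared by each pair of DataMarts."""
    marts_by_match = defaultdict(set)
    for match_id, dmid_patid in tokenless:
        marts_by_match[match_id].add(datamart_of(dmid_patid))
    overlap = defaultdict(int)
    for marts in marts_by_match.values():
        for pair in combinations(sorted(marts), 2):
            overlap[pair] += 1
    return dict(overlap)


def deduplicate(tokenless, demo):
    """Keep one demographic record per MATCH_ID (assumption: first seen wins)."""
    unique = {}
    for match_id, dmid_patid in tokenless:
        if match_id not in unique and dmid_patid in demo:
            unique[match_id] = demo[dmid_patid]
    return unique


tokenless = strip_tokens(match_index)
overlap = pairwise_overlap(tokenless)
unique_patients = deduplicate(tokenless, demographics)
```

In this sketch the token columns are dropped before any demographic data are joined, mirroring the separation between the Token Team and the Query Team described above.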
- Published
- 2022