Vaccination against communicable diseases is one of the pillars of a modern healthcare system and has contributed to longer, healthier lives for people around the world. Given the widespread use and broad efficacy of vaccines, there is significant interest in conducting vaccine-related health outcomes research. Moreover, vaccine development poses unique challenges compared to other therapeutic modalities, and some of these challenges can be addressed using observational studies.
In most electronic health records (EHR) and claims databases in the United States, vaccine administration is recorded as a “procedure” but is treated as a “drug” in the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM). Given this difference in coding, studying vaccines using data converted to the OMOP CDM can increase the chance of errors during analysis. In order to ensure reliable and reproducible analyses from observational clinical data, we performed a careful examination and evaluation of vaccine-related concepts in the OMOP vocabularies and their associated mappings from source concepts to standard concepts using three different vaccines.
We identified several issues that could impact the quality of vaccine-related health outcomes studies using the OMOP CDM. These include (1) the assignment of vaccine codes to the procedure, drug, or observation domain, (2) changes in the “standard concept” status of some concepts over time, and (3) issues common to many medical ontologies, such as the lack of a complete hierarchy, of one-to-one exact mappings, and of clear naming conventions. We believe it is valuable to document and communicate these findings to the OHDSI community in the hope of identifying opportunities for future improvement.
Vaccination plays a major part in controlling and eliminating infectious diseases and is considered to be the most cost-effective medical intervention against them [1]. Although vaccine efficacy and safety are typically evaluated in randomized controlled trials (RCTs), observational studies are also critical to confirm RCT results, determine vaccine effectiveness, and identify adverse drug events (ADEs). Multiple EHR and claims databases can be used for an observational study, but different datasets often use different coding schemes and data formats. Using the OMOP CDM, researchers are able to transform multiple datasets into a common format with a common coding scheme. This enables the efficient use of diverse data sources, even across multiple organizations, to identify robust effectiveness and safety signals from observational studies, including studies of vaccines [2]. For example, Boland et al. studied vaccine-related ADEs using a set of OMOP CDM data and identified a statistically significant association between swine flu vaccination and certain ADEs [3]. In this study, we set out to review, thoroughly and systematically, the vaccine-related concepts in the OMOP vocabulary and the mapping from source concepts to standard concepts. We chose to study vaccinations against three diseases: influenza, pneumococcal disease, and shingles (herpes zoster, caused by reactivation of the varicella-zoster virus). The influenza vaccine is updated annually, while vaccines against pneumococcal disease and shingles both have long protection periods. Our goal is to evaluate the use of the OMOP CDM for vaccine-related longitudinal outcome studies.
We identified vaccination concepts from commonly used vocabularies (e.g. NDC, CPT4, HCPCS, RxNorm, ICD9Proc, and ICD10PCS) through multiple rounds of keyword searching and manual inspection. The OMOP vocabulary tables were downloaded from Athena on September 10th, 2020. We found a total of 2092 concepts, including 1825 influenza, 221 pneumococcal, and 46 shingles vaccine concepts (Supplemental Table S1). We then traced how these source concepts were mapped to standard concepts and organized the concepts by domain and hierarchy. To measure the impact of the issues, we also examined the occurrences of these concepts in the Truven MarketScan® Commercial Claims and Encounters (CCAE) and Medicare Supplemental (MDCR) Databases (1/1/2011 to 9/30/2019).
When building a cohort, a user usually starts by filtering for an event in one of the domains: drug, procedure, measurement, or observation. In raw data, vaccination is often recorded using procedure codes, e.g. CPT4 and HCPCS. Therefore, the user may use only procedure domain concepts to define a vaccination event. However, in the OMOP vocabularies, those vaccine procedure codes can be assigned to the procedure, drug, or observation domain (Table 1). This may confuse less-experienced users. Using only the procedure domain will miss most of the vaccination records. Users who noticed "Drugs administered as part of a Procedure, such as chemotherapy or vaccines" in the documentation of the DRUG_EXPOSURE table and use only the drug domain to search for vaccine concepts will still miss a significant number of records. The issue can be avoided by using all three domains to search for concepts and define events. Proper documentation of this nuance, and training on it, would help researchers search for and select the most comprehensive and meaningful concepts for their study.
Vocabulary | Domain | Source Codes | Occurrences | Occurrences (%) |
---|---|---|---|---|
Influenza vaccine | ||||
ATC | Drug | 2 | 0 | 0.0% |
CPT4 | Drug | 29 | 49,295,355 | 69.8% |
CPT4 | Observation | 2 | 66,843 | 0.1% |
CVX | Drug | 19 | 0 | 0.0% |
HCPCS | Drug | 11 | 11,104,383 | 15.7% |
HCPCS | Observation | 3 | 1,476,645 | 2.1% |
ICD10PCS | Procedure | 2 | 1,786 | 0.0% |
ICD9Proc | Procedure | 1 | 0 | 0.0% |
NDC | Drug | 624 | 8,664,479 | 12.3% |
RxNorm | Drug | 1,132 | 0 | 0.0% |
Pneumococcal vaccine | ||||
ATC | Drug | 1 | 0 | 0.0% |
CPT4 | Drug | 3 | 13,894,808 | 76.6% |
CPT4 | Observation | 1 | 2,548,639 | 14.1% |
CVX | Drug | 5 | 0 | 0.0% |
HCPCS | Drug | 2 | 1,420,506 | 7.8% |
HCPCS | Observation | 1 | 3,911 | 0.0% |
NDC | Drug | 43 | 269,080 | 1.5% |
RxNorm | Drug | 165 | 0 | 0.0% |
Shingles vaccine | ||||
CPT4 | Drug | 2 | 1,543,985 | 51.7% |
CVX | Drug | 2 | 0 | 0.0% |
NDC | Drug | 19 | 1,444,024 | 48.3% |
RxNorm | Drug | 23 | 0 | 0.0% |
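
The domain split reported in Table 1 can be tallied with a few lines of Python. The following is a minimal sketch, not the analysis code used in this study: the rows are copied from Table 1 (rows with zero occurrences are left out), and the names `TABLE_1` and `domain_shares` are illustrative.

```python
from collections import defaultdict

# (vaccine, vocabulary, domain, occurrences) copied from Table 1.
# Rows with zero occurrences are omitted because they do not affect the shares.
TABLE_1 = [
    ("influenza", "CPT4", "Drug", 49_295_355),
    ("influenza", "CPT4", "Observation", 66_843),
    ("influenza", "HCPCS", "Drug", 11_104_383),
    ("influenza", "HCPCS", "Observation", 1_476_645),
    ("influenza", "ICD10PCS", "Procedure", 1_786),
    ("influenza", "NDC", "Drug", 8_664_479),
    ("pneumococcal", "CPT4", "Drug", 13_894_808),
    ("pneumococcal", "CPT4", "Observation", 2_548_639),
    ("pneumococcal", "HCPCS", "Drug", 1_420_506),
    ("pneumococcal", "HCPCS", "Observation", 3_911),
    ("pneumococcal", "NDC", "Drug", 269_080),
    ("shingles", "CPT4", "Drug", 1_543_985),
    ("shingles", "NDC", "Drug", 1_444_024),
]


def domain_shares(rows, vaccine):
    """Return the fraction of a vaccine's records that falls in each domain."""
    totals = defaultdict(int)
    for vac, _vocabulary, domain, occurrences in rows:
        if vac == vaccine:
            totals[domain] += occurrences
    grand_total = sum(totals.values())
    return {domain: n / grand_total for domain, n in totals.items()}


if __name__ == "__main__":
    for vaccine in ("influenza", "pneumococcal", "shingles"):
        shares = domain_shares(TABLE_1, vaccine)
        print(vaccine, {d: f"{s:.1%}" for d, s in sorted(shares.items())})
```

Under these numbers, a cohort built only from drug domain concepts would capture about 97.8% of influenza and 85.9% of pneumococcal vaccine records, while a cohort built only from procedure domain concepts would capture almost none.
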
Some concepts in OMOP vocabularies were selected or created as the 'standard' representation of clinical events. For example, MeSH code D001281, CIEL code 148203, SNOMED code 49436004, ICD9CM code 427.31, and Read code G573000 all define “Atrial fibrillation” in the condition domain, but only the SNOMED concept is standard [4]. Standard concepts serve as the primary basis for all standardized analytics, and users should use them to define events and query the OMOP data. If a user has a specific code in mind (e.g. ICD9CM code 427.31), they need to find the corresponding standard concept (e.g. SNOMED code 49436004) using Athena or Atlas. Also, when converting data to the OMOP CDM format, all source concept codes must be mapped to standard concepts. Ideally, the standard concepts contain the same information as the source codes, so that no information is lost during the OMOP conversion. However, we found several instances of inappropriate mappings from non-standard concepts to standard concepts that may cause information loss or a change in meaning (Supplemental Table S2) and summarized them in Table 2. Fifteen of the 31 influenza vaccine CPT4 codes were mapped to less granular standard concepts, and the majority of the influenza vaccine records in our Truven data were impacted.
Vocabulary | Mapping | Source Codes | Occurrences | Occurrences (%) |
---|---|---|---|---|
Influenza vaccine | ||||
CPT4 | map to a less granular concept | 15 | 46,448,324 | 65.8% |
HCPCS | map to a less granular concept | 2 | 2,332 | 0.0% |
NDC | different dosage or volume | 1 | 0 | 0.0% |
NDC | map to a less granular concept | 24 | 69 | 0.0% |
NDC | wrong mapping | 3 | 0 | 0.0% |
RxNorm | map to a less granular concept | 38 | 0 | 0.0% |
RxNorm | wrong mapping | 2 | 0 | 0.0% |
Pneumococcal vaccine | ||||
NDC | different dosage or volume | 1 | 0 | 0.0% |
RxNorm | different dosage or volume | 6 | 0 | 0.0% |
RxNorm | map to a more granular concept | 1 | 0 | 0.0% |
Shingles vaccine | ||||
CPT4 | map to a less granular concept | 1 | 318,001 | 10.6% |
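
The lookup from a source code to its standard concept, described above for atrial fibrillation, follows the `Maps to` relationship in the OMOP `concept_relationship` table. The sketch below illustrates that lookup with an in-memory SQLite database. It is an assumption-laden toy: the tables are reduced to the columns needed here, the `concept_id` values 1 and 2 are placeholders rather than real Athena identifiers, and `standard_concepts` is a hypothetical helper name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Reduced versions of the OMOP vocabulary tables (subset of columns).
    CREATE TABLE concept (
        concept_id       INTEGER PRIMARY KEY,
        concept_name     TEXT,
        domain_id        TEXT,
        vocabulary_id    TEXT,
        concept_code     TEXT,
        standard_concept TEXT      -- 'S' marks a standard concept
    );
    CREATE TABLE concept_relationship (
        concept_id_1    INTEGER,
        concept_id_2    INTEGER,
        relationship_id TEXT
    );
    -- concept_id values are placeholders, not real Athena identifiers.
    INSERT INTO concept VALUES
        (1, 'Atrial fibrillation', 'Condition', 'ICD9CM', '427.31',   NULL),
        (2, 'Atrial fibrillation', 'Condition', 'SNOMED', '49436004', 'S');
    INSERT INTO concept_relationship VALUES (1, 2, 'Maps to');
    """
)


def standard_concepts(conn, vocabulary_id, concept_code):
    """Return the standard concepts that a source code maps to."""
    sql = """
        SELECT std.vocabulary_id, std.concept_code,
               std.concept_name, std.domain_id
        FROM concept AS src
        JOIN concept_relationship AS cr
          ON cr.concept_id_1 = src.concept_id
         AND cr.relationship_id = 'Maps to'
        JOIN concept AS std
          ON std.concept_id = cr.concept_id_2
         AND std.standard_concept = 'S'
        WHERE src.vocabulary_id = ? AND src.concept_code = ?
    """
    return conn.execute(sql, (vocabulary_id, concept_code)).fetchall()


if __name__ == "__main__":
    print(standard_concepts(conn, "ICD9CM", "427.31"))
```

Returning the `domain_id` of the standard concept alongside its code makes the domain issue visible at lookup time: a CPT4 vaccine code may resolve to a concept in the drug or observation domain rather than the procedure domain.
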
The hierarchical structure of standard and classification concepts helps users navigate the concept relationships and allows them to query and retrieve concepts more easily. Atlas users can define a concept set using a single high-level concept and then include all of its descendant concepts with a single click, so that they do not have to manually search and filter hundreds or even thousands of concepts. For example, by searching for "influenza vaccine" in Atlas with Truven data, we found that ATC code J07BB is the top concept when sorted by descendant record count (DRC). A user may define a concept set using this concept and expect its descendants to include all concepts related to influenza vaccine. However, the hierarchy is not as complete as a user might expect, so simply using the high-level concept could miss many concepts that should be, but are not, descendants of that high-level concept.
First, we took all the standard concepts in Table S1 for each vaccine type and checked whether they shared a common ancestor that could be used as a high-level concept to query all concepts for a specific vaccine type. The hierarchy cannot cross domains, so we focused only on the drug domain, both for simplicity and because it covers most vaccine-related records. Unfortunately, we did not identify any high-level concept whose descendants cover all relevant standard concepts (Table 3). For example, among influenza vaccine-related ancestors, the ATC classification code J07BB, "Influenza vaccines", has the highest number of descendants (825), which is still fewer than the 859 standard concepts for influenza vaccine that we identified in total. If the ATC code were used as a high-level concept to define influenza vaccine events, 34 concepts would be missed.
Ancestor Concept Name | No. of Standard Concepts Identified¹ | No. of Identified Concepts Sharing the Ancestor |
---|---|---|
Influenza vaccine | ||
VACCINES | 859 | 825 |
VIRAL VACCINES | 859 | 825 |
Influenza vaccines | 859 | 825 |
ANTIINFECTIVES FOR SYSTEMIC USE | 859 | 825 |
influenza, inactivated, split virus or surface antigen; systemic | 859 | 761 |
Pneumococcal vaccine | ||
VACCINES | 119 | 114 |
BACTERIAL VACCINES | 119 | 114 |
Pneumococcal vaccines | 119 | 114 |
ANTIINFECTIVES FOR SYSTEMIC USE | 119 | 114 |
pneumococcus, purified polysaccharides antigen; systemic | 119 | 63 |
Shingles vaccine | ||
VACCINES | 15 | 13 |
VIRAL VACCINES | 15 | 13 |
Varicella zoster vaccines | 15 | 13 |
ANTIINFECTIVES FOR SYSTEMIC USE | 15 | 13 |
zoster vaccine recombinant | 15 | 10 |

¹ Only including the standard concepts in the drug domain among those we identified in Table S1. Ideally, each vaccine type would have an ancestor that has all of its concepts as descendants.
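
The coverage check behind Table 3 amounts to a set difference between a manually curated concept set and the descendants of a candidate ancestor in the OMOP `concept_ancestor` table. The sketch below shows the idea on toy data. All concept IDs in it are made up for illustration, and `missed_by_ancestor` is a hypothetical helper, not part of any OHDSI tool.

```python
def missed_by_ancestor(concept_ancestor, ancestor_id, concept_set):
    """Return the concepts in concept_set that are not descendants of ancestor_id.

    concept_ancestor is an iterable of (ancestor_concept_id,
    descendant_concept_id) pairs, as in the OMOP concept_ancestor table.
    """
    descendants = {
        descendant
        for ancestor, descendant in concept_ancestor
        if ancestor == ancestor_id
    }
    return set(concept_set) - descendants


# Toy hierarchy with made-up concept IDs: ancestor 1 has descendants
# 101, 102, and 103, while concept 104 hangs under a different ancestor (2).
toy_concept_ancestor = [
    (1, 1), (1, 101), (1, 102), (1, 103),
    (2, 2), (2, 104),
]
curated_concept_set = [101, 102, 103, 104]

if __name__ == "__main__":
    missed = missed_by_ancestor(toy_concept_ancestor, 1, curated_concept_set)
    print(sorted(missed))
```

Applied to the influenza concepts in Table 3, the same difference has 859 − 825 = 34 elements for ATC code J07BB, which are the concepts a user would lose by relying on that ancestor alone.
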
Furthermore, if the hierarchy accurately represents the relationships among concepts, it is expected that a less granular concept should have relatively more descendants. To check this, we counted the number of descendants of the standard and classification concepts in Table S1. We found some cases where a less granular concept has fewer descendants than a more granular concept (Table 4). For instance, “Influenza, seasonal, injectable, preservative free (40213154)” has 142 descendants, while “Influenza, seasonal, injectable (40213153)”, a broader concept, has fewer descendants (92).
Concept Name | No. of Descendants |
---|---|
Influenza vaccine | |
Influenza, seasonal, injectable, preservative free | 142 |
Influenza, seasonal, injectable | 92 |
influenza virus vaccine, unspecified formulation | 1 |
influenza virus vaccine, whole virus | 1 |
Admin of influenza vaccine | 1 |
Pneumococcal vaccine | |
pneumococcal polysaccharide vaccine, 23 valent | 17 |
pneumococcal vaccine, unspecified formulation | 1 |
Shingles vaccine | |
varicella-zoster virus vaccine live (Oka-Merck) strain 29800 UNT/ML [Zostavax] | 9 |
Zostavax Injectable Product | 5 |
As a result of the incomplete hierarchy, users cannot rely on it to easily build high-quality vaccine cohorts.
Many source vocabularies do not have a clear or common naming convention for vaccine concepts. OMOP concept names may contain the vaccine type, the drug brand name, the virus ingredient, or any mixture of these. This makes it difficult to include all relevant concepts by keyword searching and filtering, and a few examples are listed below. Furthermore, the unclear hierarchy makes it even harder to properly define a vaccine concept set. Users have to manually verify thousands of concepts, which is error-prone and time-consuming. If the OHDSI community used a single, clear naming convention when selecting and creating standard concepts, defining vaccine-related concept sets and cohorts could become considerably more efficient and reliable.
Understandably, the concept and concept relationship tables need frequent updates to include new concepts and improve mapping quality. However, as far as we know, there is no online archive that stores historical versions to enable reproducibility and traceability. This could be a potential hazard for reproducible longitudinal studies across multiple years and network studies when the participants use different vocabulary versions.
Both domain assignment and source-to-standard concept mapping can change over time, leading to differences in cohort definitions and analysis results depending on when the vocabulary tables were downloaded from Athena. Comparing two versions of the concept tables that we downloaded from Athena on different dates (04/30/2020 versus 09/10/2020), we found that the domain of five influenza vaccine concepts changed from Procedure or Observation to Drug (Table 5). We also found that some non-standard concepts were mapped to different standard concepts in different versions of the concept_relationship table, and that some standard concepts in the older version became non-standard in the later version (Supplemental Table S4). Vaccine-related cohort definitions created using one version of the concept tables may be inadequate if used on an OMOP CDM created using a different version of the OMOP vocabulary tables. This could impact network studies in which participants use the same cohort definition but their OMOP data were created using different versions of the OMOP vocabularies.
Domain (09/10/2020) | Domain (04/30/2020) | Concept Name | Vocabulary | Concept Code | Occurrences |
---|---|---|---|---|---|
Drug | Procedure | Influenza virus vaccine, pandemic formulation, H1N1 | CPT4 | 90663 | 360 | |
Drug | Procedure | Immunization, active; influenza virus vaccine. (Deprecated) | CPT4 | 90724 | 357 | |
Drug | Observation | Admin of influenza vaccine | HCPCS | Q0034 | 0 | |
Drug | Procedure | H1N1 immunization administration (intramuscular, intranasal), including counseling when performed | CPT4 | 90470 | 1139 | |
Drug | Procedure | Influenza virus vaccine, whole virus, for intramuscular or jet injection use (Deprecated) | CPT4 | 90659 | 301 |
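
A comparison of two vocabulary downloads, like the one summarized in Table 5, can be sketched as a diff over two snapshots of the concept table. The example below is a minimal illustration under stated assumptions. The two CPT4/HCPCS entries take their domain change from Table 5, but their standard-concept flags are assumed, and the `EXAMPLE` entry is entirely made up to show a change in standard status. `diff_concepts` is a hypothetical helper name.

```python
def diff_concepts(old, new):
    """List domain and standard-status changes between two concept snapshots.

    old and new map (vocabulary_id, concept_code) to
    (domain_id, standard_concept). Only concepts present in both
    snapshots are compared.
    """
    changes = []
    for key in sorted(old.keys() & new.keys()):
        old_domain, old_standard = old[key]
        new_domain, new_standard = new[key]
        if old_domain != new_domain:
            changes.append((key, "domain_id", old_domain, new_domain))
        if old_standard != new_standard:
            changes.append((key, "standard_concept", old_standard, new_standard))
    return changes


# Domain values for the first two entries follow Table 5. The standard flags
# are assumptions, and the ("EXAMPLE", "X1") entry is invented for illustration.
snapshot_2020_04_30 = {
    ("CPT4", "90663"): ("Procedure", None),
    ("HCPCS", "Q0034"): ("Observation", None),
    ("EXAMPLE", "X1"): ("Drug", "S"),
}
snapshot_2020_09_10 = {
    ("CPT4", "90663"): ("Drug", None),
    ("HCPCS", "Q0034"): ("Drug", None),
    ("EXAMPLE", "X1"): ("Drug", None),
}

if __name__ == "__main__":
    for change in diff_concepts(snapshot_2020_04_30, snapshot_2020_09_10):
        print(change)
```

Running such a diff before reusing a cohort definition on data converted with a different vocabulary version would flag the concepts whose domain or standard status no longer matches.
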
As changes accumulate over time, the reproducibility and compatibility of analyses can be significantly impacted. One of our internal Truven data sets (01/01/1996-09/30/2018) was converted to OMOP CDM using a copy of the OMOP vocabulary previously downloaded from Athena in 2019. We calculated the number and percentage of vaccine records in this dataset affected by changes in source to standard concept mapping between the 2019 and 2020 vocabulary versions (Table 6). We found that 89.2% of influenza and 89.5% of pneumococcal vaccine records no longer contained standard concepts, which means if a user defines the cohort using the most up-to-date vocabulary, the majority of influenza and pneumococcal vaccine records will be excluded.
Vocabulary | Changed? | Source Codes | Occurrences | Occurrences (%) |
---|---|---|---|---|
Influenza vaccine | ||||
CPT4 | Yes | 24 | 73,492,151 | 88.0% |
CPT4 | No | 2 | 61,254 | 0.1% |
HCPCS | Yes | 1 | 675,791 | 0.8% |
HCPCS | No | 1 | 1,389,722 | 1.7% |
NDC | Yes | 29 | 329,507 | 0.4% |
NDC | No | 265 | 7,554,915 | 9.0% |
Pneumococcal vaccine | ||||
CPT4 | Yes | 3 | 24,018,150 | 89.3% |
CPT4 | No | 1 | 2,626,536 | 9.8% |
HCPCS | Yes | 1 | 3,496 | 0.0% |
NDC | Yes | 4 | 38,825 | 0.1% |
NDC | No | 11 | 206,652 | 0.8% |
Shingles vaccine | ||||
NDC | No | 7 | 1,147,974 | 100.0% |
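
The affected shares quoted above (89.2% and 89.5%) can be recomputed from the occurrence counts in Table 6. The following sketch does so. The rows are copied from Table 6, and `TABLE_6` and `changed_share` are illustrative names rather than part of the original analysis.

```python
# (vaccine, vocabulary, mapping_changed, occurrences) copied from Table 6.
TABLE_6 = [
    ("influenza", "CPT4", True, 73_492_151),
    ("influenza", "CPT4", False, 61_254),
    ("influenza", "HCPCS", True, 675_791),
    ("influenza", "HCPCS", False, 1_389_722),
    ("influenza", "NDC", True, 329_507),
    ("influenza", "NDC", False, 7_554_915),
    ("pneumococcal", "CPT4", True, 24_018_150),
    ("pneumococcal", "CPT4", False, 2_626_536),
    ("pneumococcal", "HCPCS", True, 3_496),
    ("pneumococcal", "NDC", True, 38_825),
    ("pneumococcal", "NDC", False, 206_652),
    ("shingles", "NDC", False, 1_147_974),
]


def changed_share(rows, vaccine):
    """Return the fraction of a vaccine's records whose mapping changed."""
    total = sum(n for vac, _vocab, _changed, n in rows if vac == vaccine)
    changed = sum(
        n for vac, _vocab, changed, n in rows if vac == vaccine and changed
    )
    return changed / total


if __name__ == "__main__":
    for vaccine in ("influenza", "pneumococcal", "shingles"):
        print(vaccine, f"{changed_share(TABLE_6, vaccine):.1%}")
```

Summing the rounded percentages in Table 6 gives 89.4% for pneumococcal vaccine, whereas the raw counts give 89.5%, so the figure in the text is consistent with the counts.
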
The vocabulary changes significantly increase the work required for quality control and for building analytic workflows with backward and forward compatibility. If an institute wants to update the vocabularies, it must not only update the vocabulary tables but also redo the OMOP conversion for all of its data sets to keep everything consistent. Additionally, cohort and concept set definitions may need to be changed as a result of changes in source-to-standard concept mappings, as we observed with influenza and pneumococcal vaccinations. This burden may discourage many institutes from adopting the OMOP CDM, keeping the data and vocabulary up-to-date, or relying on standard concepts in analyses.
The complexity of vaccine concepts, especially the seasonal influenza virus vaccine, requires a revisit of the concept mappings. Epidemiologists in the OHDSI community could work together to further discuss and define some recommended concept sets for each vaccine, improve concept relationship mapping, monitor the impact of concept status and relationship changes, provide additional training to users, and document the limitations of the current standard concepts for reliable and reproducible observational studies.